Getting Started With Databricks

This document is a guide to getting started with Databricks, covering key concepts such as creating clusters, data exploration, and data engineering foundations. It discusses the data sources, data transformations, and orchestration capabilities within Databricks, and highlights the use of Delta tables and the importance of automation in data workflows.

Getting started with

Databricks
DATABRICKS CONCEPTS

Kevin Barlow
Data Practitioner
Compute cluster refresh

Create your first cluster
The first step is to create a cluster for your data processing!

Configuration options:

Cluster policies and access

Databricks Runtime

Photon Acceleration

Node instance types and number

Auto-scaling / Auto-termination
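The configuration options above correspond to fields in a cluster specification. A minimal sketch, assuming the field names of the Databricks Clusters API (all values are illustrative, and node types are cloud-specific):

```python
# Hypothetical cluster specification, shaped like a Clusters API request
# body. Field names follow the Databricks Clusters API; values are examples.
cluster_spec = {
    "cluster_name": "my-first-cluster",
    "spark_version": "13.3.x-scala2.12",          # Databricks Runtime version
    "node_type_id": "i3.xlarge",                  # node instance type (cloud-specific)
    "autoscale": {"min_workers": 1, "max_workers": 4},  # auto-scaling bounds
    "autotermination_minutes": 30,                # auto-terminate when idle
    "runtime_engine": "PHOTON",                   # enable Photon Acceleration
}
```

Cluster policies can restrict which of these values users may set, which is why the same form can look different across workspaces.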
Data Explorer
Get familiar with the Data Explorer! In this UI, you can:

1. Browse available catalogs/schemas/tables

2. Look at sample data and summary statistics

3. View data lineage and history

You can also upload new data by clicking the "plus" icon!


Create a notebook
Databricks notebooks:

Standard interface for Databricks

Improvements on open-source Jupyter notebooks

Support for many languages: Python, R, Scala, SQL

Magic commands (%sql)

Built-in visualizations

Real-time commenting and collaboration

Let's practice!
Data Engineering
foundations in
Databricks

Kevin Barlow
Data Practitioner
Medallion architecture

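The medallion architecture refines data through bronze (raw), silver (cleaned/validated), and gold (aggregated, business-ready) layers. A minimal plain-Python sketch of the idea, standing in for Spark, with invented sample records:

```python
# Bronze: raw ingested records, exactly as they landed (strings, bad rows).
bronze = [
    {"region": "east", "sales": "100"},
    {"region": "east", "sales": "50"},
    {"region": None,   "sales": "30"},   # invalid record, dropped at silver
    {"region": "west", "sales": "70"},
]

# Silver: drop invalid rows and cast types into a clean, conformed shape.
silver = [
    {"region": r["region"], "sales": int(r["sales"])}
    for r in bronze
    if r["region"] is not None
]

# Gold: aggregate to a business-level summary (total sales per region).
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0) + row["sales"]

print(gold)  # {'east': 150, 'west': 70}
```

In Databricks each layer would typically be its own Delta table, with jobs promoting data from one layer to the next.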
Reading data
Spark is a highly flexible framework and can read from various data sources/types.

Common data sources and types:

Delta tables

File formats (CSV, JSON, Parquet, XML)

Databases (MySQL, Postgres, EDW)

Streaming data

Images / Videos

# Delta table
spark.read.table(table_name)

# CSV files
spark.read.format('csv').load('*.csv')

# Postgres table
(spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .load())
Structure of a Delta table
A Delta table provides table-like qualities to an open file format.

Feels like a table when reading

Access to underlying files (Parquet and JSON)

Explaining the Delta Lake structure
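The table-like behavior comes from the transaction log: each commit in the table's _delta_log directory is a JSON file of add/remove actions over the underlying Parquet data files. A rough stdlib sketch of replaying such a log (commit contents and file names are invented for illustration):

```python
import json

# Illustrative _delta_log commit entries, one JSON action per line, as a
# Delta reader might see them after a few writes and an update.
commits = [
    '{"add": {"path": "part-0000.parquet"}}',
    '{"add": {"path": "part-0001.parquet"}}',
    '{"remove": {"path": "part-0000.parquet"}}',  # file rewritten by an UPDATE
    '{"add": {"path": "part-0002.parquet"}}',
]

# Replay the log in order: the table's current state is the set of files
# that were added and not later removed.
active_files = set()
for line in commits:
    action = json.loads(line)
    if "add" in action:
        active_files.add(action["add"]["path"])
    if "remove" in action:
        active_files.discard(action["remove"]["path"])

print(sorted(active_files))  # ['part-0001.parquet', 'part-0002.parquet']
```

Replaying the log only up to an earlier commit is what makes features like time travel and table history possible.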
DataFrames
DataFrames are two-dimensional representations of data.

Look and feel similar to tables

Similar concept for many different data tools: Spark (default), pandas, dplyr, SQL queries

Underlying construct for most data processes

id  customerName  bookTitle
1   John Data     Guide to Spark
2   Sally Bricks  SQL for Data Engineering
3   Adam Delta    Keeping Data Clean

df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/data.csv"))
Writing data
Kinds of tables in Databricks:

1. Managed tables

Default type
Stored with Unity Catalog
Databricks managed

df.write.saveAsTable(table_name)

CREATE TABLE table_name
USING delta
AS ...

2. External tables

Stored in another location
Set LOCATION
Customer managed

(df.write
    .option("path", "<path>")
    .saveAsTable(table_name))

CREATE TABLE table_name
USING delta
LOCATION "<path>"
AS ...
Let's practice!
Data
transformations in
Databricks

Kevin Barlow
Data Practitioner
SQL for data engineering
SQL

Familiar for Database Administrators (DBAs)
Great for standard manipulations
Execute pre-defined UDFs

-- Creating a new table in SQL
CREATE TABLE table_name
USING delta
AS (
  SELECT *
  FROM source_table
  WHERE date >= '2023-01-01'
)
Other languages for data engineering
Python, R, Scala

Familiar for software engineers
Standard and complex transformations
Use and define custom functions

# Creating a new table in PySpark
(spark
    .read
    .table('source_table')
    .filter(col('date') >= '2023-01-01')
    .write
    .saveAsTable('table_name'))
Common transformations
Schema manipulation

Add and remove columns
Redefine columns

# PySpark
(df
    .withColumn('newCol', ...)
    .drop('oldCol'))

Filtering

Reduce DataFrame to subset of data
Pass multiple criteria

# PySpark
(df
    .filter(col('date') >= target_date)
    .filter(col('id').isNotNull()))
Common transformations (continued)
Nested data

Arrays or Struct data
Expand or contract

# PySpark
df.select(explode(col('arrayCol')))  # wide to long
df.select(flatten(col('items')))     # long to wide

Aggregation

Group data based on columns
Calculate data summarizations

# PySpark
(df
    .groupBy(col('region'))
    .agg(sum(col('sales'))))
Auto Loader
Auto Loader processes new data files as they
land in a data lake.

Incremental processing

Efficient processing

Automatic

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(file_path))

1 https://www.databricks.com/blog/2020/02/24/introducing-databricks-ingest-easy-data-ingestion-into-delta-lake.html

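Auto Loader's incremental behavior can be sketched in plain Python (hypothetical file names; the real implementation tracks its state in a checkpoint, not an in-memory set):

```python
# Plain-Python sketch of incremental file ingestion, the idea behind
# Auto Loader: remember which files were processed, pick up only new ones.
processed = set()   # in Auto Loader this state lives in a checkpoint

def ingest(available_files):
    """Process only files not seen in earlier runs; return what was processed."""
    new_files = [f for f in available_files if f not in processed]
    for f in new_files:
        # ... parse the file and append it to the target table here ...
        processed.add(f)
    return new_files

# First run: two files have landed in the data lake.
first = ingest(["2023-01-01.json", "2023-01-02.json"])
# Second run: one more file arrived; only it gets processed.
second = ingest(["2023-01-01.json", "2023-01-02.json", "2023-01-03.json"])
print(second)  # ['2023-01-03.json']
```

This is what makes repeated runs both efficient and safe: already-loaded files are never reprocessed.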
Structured Streaming
(spark.readStream
    .format("kafka")
    .option("subscribe", "<topic>")
    .load()
    .join(table_df, on="<id>", how="left")
    .writeStream
    .format("kafka")
    .option("topic", "<topic>")
    .start())

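The pattern above, enriching a live stream by joining it against a static table, can be sketched with a plain-Python generator (invented customer data; not the real Kafka/Spark API):

```python
# Static dimension table the stream is left-joined against.
customers = {"c1": "John Data", "c2": "Sally Bricks"}

def enrich(events):
    """Left-join each incoming event with the lookup table as it arrives."""
    for event in events:
        yield {**event, "customerName": customers.get(event["customer_id"])}

# A small "stream" of events; c9 has no match, so the join yields None.
stream = [{"customer_id": "c1", "amount": 10},
          {"customer_id": "c9", "amount": 5}]

out = list(enrich(stream))
print(out[0]["customerName"])  # John Data
```

Because it is a generator, each event is enriched as it arrives rather than after the whole batch is collected, which is the essence of the streaming model.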
Let's practice!
Orchestration in
Databricks

Kevin Barlow
Data Analytics Practitioner
What is data orchestration?
Data orchestration is a form of automation!

Enables data engineers to automate the end-to-end data life cycle

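The end-to-end automation idea can be sketched as a toy task runner in plain Python (hypothetical task names), resolving a small dependency graph the way a multi-task job does:

```python
# Toy orchestrator: each task declares its dependencies and runs only
# after they complete (ingest -> transform -> report).
tasks = {
    "ingest":    {"depends_on": [],            "run": lambda: "raw data loaded"},
    "transform": {"depends_on": ["ingest"],    "run": lambda: "data cleaned"},
    "report":    {"depends_on": ["transform"], "run": lambda: "dashboard refreshed"},
}

def run_workflow(tasks):
    """Execute each task once, only after all of its dependencies have run."""
    done, order = set(), []
    while len(done) < len(tasks):
        for name, task in tasks.items():
            if name not in done and all(d in done for d in task["depends_on"]):
                task["run"]()
                done.add(name)
                order.append(name)
    return order

print(run_workflow(tasks))  # ['ingest', 'transform', 'report']
```

A real orchestrator adds the parts deliberately left out here: scheduling, retries, parallel branches, and cycle detection.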
Databricks Workflows
Databricks Workflows is a collection of built-in capabilities to orchestrate all your data
processes, at no additional cost!

Example Databricks Workflow

1 https://docs.databricks.com/workflows
What can we orchestrate?
Data engineers/data scientists

Data analysts
Databricks Jobs
Workflows UI
Users can create jobs directly from the
Databricks UI:

Directly from a notebook

In the Workflows section

1 https://docs.databricks.com/workflows/jobs
Databricks Jobs
Programmatic
Users can also programmatically create jobs using the Jobs CLI or Jobs API with the Databricks platform.

{
  "name": "A multitask job",
  "tags": {},
  "tasks": [],
  "job_clusters": [],
  "format": "MULTI_TASK"
}
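As a sketch, such a spec can be posted to the Jobs API create endpoint. The snippet below builds (but deliberately does not send) the request with the stdlib; the workspace URL and token are placeholders, not real values:

```python
import json
import urllib.request

# Placeholder workspace host and personal access token -- not real values.
host = "https://<workspace>.cloud.databricks.com"
token = "<personal-access-token>"

job_spec = {
    "name": "A multitask job",
    "tags": {},
    "tasks": [],
    "job_clusters": [],
    "format": "MULTI_TASK",
}

# Build the POST request against the Jobs API 2.1 create endpoint.
req = urllib.request.Request(
    url=f"{host}/api/2.1/jobs/create",
    data=json.dumps(job_spec).encode("utf-8"),
    headers={"Authorization": f"Bearer {token}",
             "Content-Type": "application/json"},
    method="POST",
)
print(req.full_url)
```

Sending it with `urllib.request.urlopen(req)` (against a real workspace and token) would return the new job's ID.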
Delta Live Tables
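Delta Live Tables takes a declarative approach: you define each table as a function, and the framework infers the pipeline graph and runs it for you. A rough plain-Python imitation of that style (this is not the real dlt API, just a sketch of the idea):

```python
# Imitation of declarative pipelines: register table-producing functions,
# then materialize them in dependency order. Not the real `dlt` library.
registry = {}

def live_table(name, depends_on=()):
    """Decorator that registers a table definition and its upstream tables."""
    def wrap(fn):
        registry[name] = (fn, tuple(depends_on))
        return fn
    return wrap

@live_table("bronze_sales")
def bronze_sales():
    return [{"region": "east", "sales": 100}, {"region": "east", "sales": 50}]

@live_table("gold_sales", depends_on=["bronze_sales"])
def gold_sales(bronze):
    total = {}
    for row in bronze:
        total[row["region"]] = total.get(row["region"], 0) + row["sales"]
    return total

def materialize(name, cache=None):
    """Build a table, building each of its dependencies first."""
    if cache is None:
        cache = {}
    if name not in cache:
        fn, deps = registry[name]
        cache[name] = fn(*[materialize(d, cache) for d in deps])
    return cache[name]

print(materialize("gold_sales"))  # {'east': 150}
```

The point of the declarative style is that you never write the orchestration loop yourself: declaring the tables is enough for the framework to derive the execution order.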
Let's practice!
End-to-end data
pipeline example in
Databricks

Kevin Barlow
Data Practitioner
Let's practice!