Getting Started With Databricks

Kevin Barlow
Data Practitioner
Compute cluster refresh
Create your first cluster

The first step is to create a cluster for your data processing!

Configuration options:
- Cluster access
- Databricks Runtime
- Photon Acceleration
- Auto-scaling / Auto-termination
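These configuration options map onto fields of the Databricks Clusters REST API. A minimal sketch of creating a cluster programmatically, assuming a hypothetical workspace URL, access token, and node type:

import requests

host = "https://<workspace-url>"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "my-first-cluster",
    "spark_version": "13.3.x-scala2.12",  #Databricks Runtime version
    "node_type_id": "<node-type>",
    "runtime_engine": "PHOTON",  #Photon Acceleration
    "autoscale": {"min_workers": 1, "max_workers": 4},
    "autotermination_minutes": 60,  #auto-terminate idle clusters
}

resp = requests.post(f"{host}/api/2.0/clusters/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=cluster_spec)
print(resp.json())  #returns the new cluster_id on success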
Data Explorer

Get familiar with the Data Explorer! In this UI, you can:
- Browse catalogs, schemas, and tables
- Preview sample data and inspect table schemas
- View table details, history, and permissions
Create a notebook

Databricks notebooks:
- Support multiple languages (Python, SQL, Scala, R)
- Include built-in visualizations
- Attach to a cluster to run code interactively
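For example, the built-in display() function renders a DataFrame as an interactive table or chart inside a notebook (the table name is a placeholder):

#Render a DataFrame with the built-in visualizations
df = spark.read.table("<table_name>")
display(df)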
Let's practice!
Data Engineering foundations in Databricks

Kevin Barlow
Data Practitioner
Medallion architecture

The medallion architecture organizes data in the lakehouse into progressively refined layers: bronze (raw ingested data), silver (cleaned and validated data), and gold (business-level aggregates).
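A minimal PySpark sketch of the three layers, with hypothetical paths and table names:

from pyspark.sql.functions import col, count

#Bronze: ingest raw files as-is
bronze = spark.read.format("json").load("/landing/events/")
bronze.write.format("delta").saveAsTable("bronze_events")

#Silver: clean and validate
silver = (spark.read.table("bronze_events")
    .dropDuplicates(["event_id"])
    .filter(col("event_ts").isNotNull()))
silver.write.format("delta").saveAsTable("silver_events")

#Gold: business-level aggregates
gold = (spark.read.table("silver_events")
    .groupBy("event_date")
    .agg(count("*").alias("daily_events")))
gold.write.format("delta").saveAsTable("gold_daily_events")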
Reading data

Spark is a highly flexible framework and can read from various data sources/types.

Common data sources and types:
- Delta tables
- File formats (CSV, JSON, Parquet, XML)
- Databases (MySQL, Postgres, EDW)
- Streaming data
- Images / videos

#Delta table
spark.read.table("table_name")

#CSV files
spark.read.format('csv').load('*.csv')

#Postgres table
(spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .load())
Structure of a Delta table
A Delta table provides table-like qualities to an open file format.
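On disk, this means a directory of Parquet data files plus a _delta_log transaction log. A quick way to see this from a notebook (the path is hypothetical):

#List the files behind a Delta table
display(dbutils.fs.ls("/delta/customers/"))
#Typical contents:
# part-00000-....snappy.parquet <- data files (open Parquet format)
# _delta_log/ <- commit log that provides ACID guarantees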
Explaining the Delta Lake structure
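The transaction log is what enables table history and time travel. A short sketch, reusing the hypothetical customers table above:

#Each commit in _delta_log becomes a table version
spark.sql("DESCRIBE HISTORY customers").show()

#Time travel: read the table as of an earlier version
old_df = spark.read.option("versionAsOf", 0).table("customers")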
DataFrames

DataFrames are two-dimensional representations of data.
- Look and feel similar to tables
- Similar concept across many different data tools: Spark (default), pandas, dplyr, SQL queries
- Underlying construct for most data processes

id | customerName | bookTitle
1  | John Data    | Guide to Spark
2  | Sally Bricks | SQL for Data Engineering
3  | Adam Delta   | Keeping Data Clean

df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/data.csv"))
Writing data

Kinds of tables in Databricks:

1. Managed tables
   - Default type
   - Stored with Unity Catalog
   - Databricks managed

   df.write.saveAsTable(table_name)

   CREATE TABLE table_name
   USING delta
   AS ...

2. External tables
   - Set LOCATION
   - Customer managed

   CREATE TABLE table_name
   USING delta
   LOCATION "<path>"
   AS ...
Let's practice!
Data transformations in Databricks

Kevin Barlow
Data Practitioner
SQL for data engineering

SQL is a first-class language for data engineering in Databricks notebooks and SQL warehouses; a sketch of creating a new table in SQL follows.
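A minimal example, reusing the CREATE TABLE ... USING delta AS pattern from the previous lesson (table and column names are hypothetical):

-- Creating a new table in SQL
CREATE TABLE customer_orders
USING delta
AS
SELECT order_id, customer_name, order_total
FROM raw_orders
WHERE order_id IS NOT NULL;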
Other languages for data engineering

Data engineering in Databricks can also be written in Python, R, or Scala.
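The same transformation sketched in PySpark, with the same hypothetical names:

#Creating a new table in Pyspark
from pyspark.sql.functions import col

(spark.read.table("raw_orders")
    .select("order_id", "customer_name", "order_total")
    .filter(col("order_id").isNotNull())
    .write.saveAsTable("customer_orders"))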
Common transformations

- Schema manipulation: rename, cast, add, or drop columns
- Filtering: keep only rows that match a condition

Both are sketched in PySpark below.
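A short sketch of both, with hypothetical column names:

#Pyspark
from pyspark.sql.functions import col

#Schema manipulation: rename, cast, and drop columns
df2 = (df
    .withColumnRenamed("cust_nm", "customerName")
    .withColumn("sales", col("sales").cast("double"))
    .drop("unused_col"))

#Filtering: keep only rows matching a condition
df3 = df2.filter(col("sales") > 100)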
Common transformations (continued)

Nested data (arrays or struct data): expand or contract

#Pyspark
from pyspark.sql.functions import col, explode, flatten, sum
df.select(explode(col('arrayCol')))  #expand: one row per array element (wide to long)
df.select(flatten(col('items')))     #contract: merge nested arrays into a single array

Aggregation: group data based on columns and calculate data summarizations

#Pyspark
(df
    .groupBy(col('region'))
    .agg(sum(col('sales'))))
Auto Loader

Auto Loader processes new data files as they land in a data lake.
- Incremental processing
- Efficient processing
- Automatic

spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(file_path)

1 https://www.databricks.com/blog/2020/02/24/introducing-databricks-ingest-easy-data-ingestion-into-delta-lake.html
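A slightly fuller sketch that also writes the stream out incrementally (paths and table name are hypothetical):

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/events/schema")
    .load("/landing/events/")
    .writeStream
    .option("checkpointLocation", "/checkpoints/events/")
    .trigger(availableNow=True)  #process all new files, then stop
    .toTable("bronze_events"))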
Structured Streaming

Structured Streaming can read a live stream (here from Kafka), join it against a static table, and write the results back out as another stream:

(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<host:port>")
    .option("subscribe", "<topic>")
    .load()
    .join(table_df, on="<id>", how="left")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<host:port>")
    .option("topic", "<topic>")
    .option("checkpointLocation", "<path>")
    .start())
Let's practice!
Orchestration in Databricks

Kevin Barlow
Data Practitioner
What is data orchestration?

Data orchestration is a form of automation: it schedules, sequences, and monitors data processes so they run reliably without manual intervention.
Databricks Workflows
Databricks Workflows is a collection of built-in capabilities to orchestrate all your data processes, at no additional cost!

1 https://docs.databricks.com/workflows
What can we orchestrate?

- Data engineers / data scientists
- Data analysts
Databricks Jobs

Workflows UI

Users can create jobs directly from the Databricks UI.

1 https://docs.databricks.com/workflows/jobs
Databricks Jobs

Programmatic

Users can also programmatically create jobs using the Jobs CLI or Jobs API with the Databricks platform.

{
  "name": "A multitask job",
  "tags": {},
  "tasks": [],
  "job_clusters": [],
  "format": "MULTI_TASK"
}
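As a sketch, that JSON document could be submitted to the Jobs API, reusing the hypothetical host and token from the cluster example:

import requests

job_spec = {
    "name": "A multitask job",
    "tags": {},
    "tasks": [],  #task definitions go here
    "job_clusters": [],
    "format": "MULTI_TASK",
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
print(resp.json())  #returns the new job_id on success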
Delta Live Tables

Delta Live Tables (DLT) is a declarative framework for building reliable batch and streaming pipelines: you define the tables and their data quality expectations, and Databricks manages the orchestration, infrastructure, and monitoring.
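A minimal DLT pipeline sketch in Python, reusing the hypothetical landing path from the Auto Loader example; each decorated function becomes a managed table:

import dlt

@dlt.table(comment="Raw events ingested with Auto Loader")
def bronze_events():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/events/"))

@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")  #data quality expectation
def silver_events():
    return dlt.read_stream("bronze_events").dropDuplicates(["event_id"])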
Let's practice!
End-to-end data pipeline example in Databricks

Kevin Barlow
Data Practitioner
Let's practice!