Databricks
The Azure Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and
maintaining enterprise-grade data solutions at scale.
Azure Databricks is used to process, store, clean, share, analyze, model, and monetize datasets with
solutions ranging from BI to machine learning.
You can use the Azure Databricks platform to build many different applications spanning data personas.
Databricks Data Science & Engineering (sometimes called simply "Workspace") is an analytics platform based on
Apache Spark. It is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive
workspace that enables collaboration between data engineers, data scientists, and machine learning engineers.
For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in
batches, or streamed in near real time using Apache Kafka, Event Hubs, or IoT Hub.
This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage.
As part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure
Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and turn it into
breakthrough insights using Spark.
Streaming: Real-time data processing and analysis for analytical and interactive applications. Integrates with
HDFS, Flume, and Kafka.
MLlib: Machine Learning library consisting of common learning algorithms and utilities, including classification,
regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization
primitives.
GraphX: Graphs and graph computation for a broad scope of use cases from cognitive analytics to data
exploration.
Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
What is Databricks Machine Learning?
Databricks Machine Learning is an integrated end-to-end machine learning platform incorporating managed services for
experiment tracking, model training, feature development and management, and feature and model serving. The
diagram shows how the capabilities of Databricks map to the steps of the model development and deployment process.
Apache Spark Architecture
Apache Spark works in a master–worker architecture where the master is called the "Driver" and the workers are called
"Workers". When you run a Spark application, the Spark Driver creates a context that is the entry point to your
application, all operations (transformations and actions) are executed on worker nodes, and the
resources are managed by the Cluster Manager.
The main components are:
Spark Driver
Cluster Manager
Executors
How does Spark Work?
When a client submits a spark user application code, the driver implicitly converts the code containing
transformations and actions into a logical directed acyclic graph (DAG).
At this stage, the driver program also performs certain optimizations like pipelining transformations, and then it
converts the logical DAG into a physical execution plan with a set of stages.
After creating the physical execution plan, it creates small physical execution units referred to as tasks under each
stage. Then tasks are bundled to be sent to the Spark Cluster.
The driver program then talks to the cluster manager and negotiates for resources. The cluster manager then
launches executors on the worker nodes on behalf of the driver.
At this point the driver sends tasks to the executors based on data placement. Before executors begin
execution, they register themselves with the driver program so that the driver has a holistic view of all the executors.
Now executors start executing the various tasks assigned by the driver program.
At any point while the Spark application is running, the driver program monitors the set of executors that are
running.
The driver program in the Spark architecture also schedules future tasks based on data placement by tracking the location
of cached data.
When the driver program's main() method exits, or when it calls the stop() method of the SparkContext, it terminates all
the executors and releases the resources from the cluster manager.
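As an illustration of this flow, here is a minimal PySpark sketch (generic example code, not taken from the course material): the transformations build up the DAG lazily, and the action triggers the driver to schedule tasks on the executors.
from pyspark.sql import SparkSession

# The driver process starts here: creating the SparkSession (and its SparkContext)
spark = SparkSession.builder.appName("word-count-example").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: nothing runs on the executors yet
words = sc.parallelize(["spark", "databricks", "spark", "delta"])
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# The action below makes the driver build the DAG, split it into stages and tasks,
# and hand the tasks to the executors
print(counts.collect())

# Exiting main() or calling stop() releases the executors via the cluster manager
spark.stop()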
Day-2
Clusters
An Azure Databricks cluster is a set of computation resources and configurations on which
you run data engineering, data science, and data analytics workloads, such as production ETL
pipelines, streaming analytics, ad-hoc analytics, and machine learning.
Types of Clusters
1. all-purpose clusters
2. job clusters.
You can manually terminate and restart an all-purpose cluster. Multiple users can share such
clusters to do collaborative interactive analysis.
The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster
and terminates the cluster when the job is complete. You cannot restart a job cluster.
Cluster mode
Azure Databricks supports three cluster modes: Standard, High Concurrency, and Single Node.
The default cluster mode is Standard.
Standard Vs High Concurrency Vs Single Node Clusters
Standard clusters: Standard mode clusters (sometimes called No Isolation Shared clusters) can be shared by multiple
users, with no isolation between users [follows the driver and worker node approach]. Standard clusters terminate
automatically after 120 minutes by default, and can run workloads developed in Python, SQL, R, and Scala.
High Concurrency clusters: A High Concurrency cluster is a managed cloud resource. The key benefits of High
Concurrency clusters are that they provide fine-grained sharing for maximum resource utilization and minimum query
latencies. High Concurrency clusters do not terminate automatically by default, and can run workloads developed in
SQL, Python, and R.
Single Node clusters: A Single Node cluster has no workers and runs Spark jobs on the driver node. Single Node
clusters terminate automatically after 120 minutes by default, and can run workloads developed in Python, SQL, R,
and Scala.
Pools
To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances, for the driver
and worker nodes.
The cluster is created using instances in the pools. If a pool does not have sufficient idle resources to
create the requested driver or worker nodes, the pool expands by allocating new instances from the
instance provider.
When an attached cluster is terminated, the instances it used are returned to the pools and can be reused
by a different cluster.
Databricks Runtime
Databricks runtimes are the set of core components that run on your clusters. All Databricks runtimes
include Apache Spark and add components and updates that improve usability, performance, and security.
For details, see Databricks runtimes.
Azure Databricks offers several types of runtimes and several versions of those runtime types in the
Databricks Runtime Version drop-down when you create or edit a cluster.
DataBricks Runtime Types
Databricks Runtime
Databricks Runtime includes Apache Spark but also adds a number of components
and updates that substantially improve the usability, performance, and security of big
data analytics.
Photon runtime
Photon is the Azure Databricks native vectorized query engine that runs SQL
workloads faster and reduces your total cost per workload.
Databricks Light
Databricks Light provides a runtime option for jobs that don’t need the advanced
performance, reliability, or autoscaling benefits provided by Databricks Runtime.
Cluster node type
A cluster consists of one driver node and zero or more worker nodes.
You can pick separate cloud provider instance types for the driver and worker nodes, although by default the
driver node uses the same instance type as the worker node.
Different families of instance types fit different use cases, such as memory-intensive or compute-intensive
workloads.
Driver node
The driver node maintains state information of all notebooks attached to the cluster.
The driver node also maintains the SparkContext and interprets all the commands you run from a notebook or a
library on the cluster, and runs the Apache Spark master that coordinates with the Spark executors.
Worker node
Azure Databricks worker nodes run the Spark executors and other services required for the proper functioning
of the clusters. When you distribute your workload with Spark, all of the distributed processing happens on
worker nodes
Cluster size and autoscaling
When you create an Azure Databricks cluster, you can either provide a fixed number of workers for the
cluster or provide a minimum and maximum number of workers for the cluster.
When you provide a fixed size cluster, Azure Databricks ensures that your cluster has the specified number
of workers. When you provide a range for the number of workers, Databricks chooses the appropriate
number of workers required to run your job. This is referred to as autoscaling.
With autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of
your job. Certain parts of your pipeline may be more computationally demanding than others, and
Databricks automatically adds additional workers during these phases of your job (and removes them when
they’re no longer needed).
Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the
cluster to match a workload. This applies especially to workloads whose requirements change over time
(like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload
whose provisioning requirements are unknown. Autoscaling thus offers two advantages: workloads can run faster
than on a constant-sized, under-provisioned cluster, and autoscaling can reduce overall costs compared to a
statically sized cluster.
Init scripts
An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver
or worker JVM starts.
Install packages and libraries not included in Databricks Runtime. To install Python packages, use the Azure
Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into
the Azure Databricks Python virtual environment rather than the system Python environment.
For example, /databricks/python/bin/pip install <package-name> (see the sketch after this list).
Modify the JVM system classpath in special cases.
Set system properties and environment variables used by the JVM.
Modify Spark configuration parameters.
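As a hedged sketch of how a cluster-scoped init script like the pip example above could be created from a notebook (the script path and package name below are hypothetical), you can write the script to DBFS with dbutils.fs.put and then reference that path in the cluster's init script settings:
dbutils.fs.put(
    "/databricks/scripts/install-extra-packages.sh",  # hypothetical DBFS path for the script
    """#!/bin/bash
# install a Python package into the Databricks Python virtual environment
/databricks/python/bin/pip install some-package
""",
    True)  # overwrite the file if it already exists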
Init script types
Azure Databricks supports two kinds of init scripts: cluster-scoped and global.
Cluster-scoped: run on every cluster configured with the script. This is the recommended way to run an init
script.
Global: run on every cluster in the workspace. They can help you to enforce consistent cluster configurations
across your workspace. Use them carefully because they can cause unanticipated impacts, like library conflicts.
Only admin users can create global init scripts. Global init scripts are not run on model serving clusters.
Environment variables
Cluster-scoped and global init scripts support the following environment variables:
DB_CLUSTER_ID: the ID of the cluster on which the script is running. See Clusters API 2.0.
DB_CONTAINER_IP: the private IP address of the container in which Spark runs. The init script is run inside
this container. See SparkNode.
DB_IS_DRIVER: whether the script is running on a driver node.
DB_DRIVER_IP: the IP address of the driver node.
DB_INSTANCE_TYPE: the instance type of the host VM.
DB_CLUSTER_NAME: the name of the cluster the script is executing on.
DB_IS_JOB_CLUSTER: whether the cluster was created to run a job. See Create a job.
echo $DB_IS_DRIVER
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  # <run this part only on the driver>
else
  # <run this part only on the workers>
fi
# <run this part on both the driver and the workers>
Manage clusters
Manage Azure Databricks clusters, including displaying, editing,
starting, terminating, deleting, controlling access, and
monitoring performance and logs.
What is the Databricks File System (DBFS)?
The Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters.
DBFS is an abstraction on top of scalable object storage that maps Unix-like filesystem calls to native cloud storage
API calls
DBFS provides convenience by mapping cloud object storage URIs to relative paths.
Allows you to interact with object storage using directory and file semantics instead of cloud-specific API
commands.
Allows you to mount cloud object storage locations so that you can map storage credentials to paths in the
Azure Databricks workspace.
Simplifies the process of persisting files to object storage, allowing virtual machines and attached volume
storage to be safely deleted on cluster termination.
Mounting cloud object storage on Azure Databricks
Azure Databricks mounts create a link between a workspace and cloud object storage, which enables you to
interact with cloud object storage using familiar file paths relative to the Databricks file system. Mounts work by
creating a local alias under the /mnt directory that stores the following information:
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
In this example, replace:
<application-id> with the Application (client) ID for the Azure Active Directory application.
<scope-name> with the Databricks secret scope name.
<service-credential-key-name> with the name of the key containing the client secret.
<directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
<container-name> with the name of a container in the ADLS Gen2 storage account.
<storage-account-name> with the ADLS Gen2 storage account name.
<mount-name> with the name of the intended mount point in DBFS.
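The configs variable referenced in the mount call above is not shown in these notes. A minimal sketch, assuming the standard OAuth 2.0 settings for ADLS Gen2 with a service principal, would look something like this (using the same placeholders listed above):
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    # read the client secret from a Databricks secret scope instead of hard-coding it
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}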
Day-3
Databricks Utilities
Databricks Utilities (dbutils) make it easy to perform powerful combinations of tasks. You can use the utilities
to work with object storage efficiently, to chain and parameterize notebooks, and to work with secrets. dbutils
are not supported outside of notebooks.
This module provides various utilities for users to interact with the rest of Databricks.
fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks
File system utility (dbutils.fs)
The file system utility allows you to access the Databricks File System (DBFS), making it easier to use Azure
Databricks as a file system. To list the available commands, run dbutils.fs.help().
ls command (dbutils.fs.ls)
Lists the contents of a directory.
dbutils.fs.ls("/tmp")
# Out[13]: [FileInfo(path='dbfs:/tmp/my_file.txt', name='my_file.txt', size=40, modificationTime=1622054945000)]
Jobs utility (dbutils.jobs)
The jobs utility allows you to leverage jobs features. To display help for this
utility, run dbutils.jobs.help().
dbutils.jobs.taskValues.get(taskKey = "my-task", \
key = "my-key", \
default = 7, \
debugValue = 42)
dbutils.library.install("abfss:/path/to/your/library.whl")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this command.
Notebook utility (dbutils.notebook)
exit(value: String): void -> This method lets you exit a notebook with a value
run(path: String, timeoutSeconds: int, arguments: Map): String -> This method runs a notebook and returns its exit value.
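A short sketch of the two commands (the child notebook path and argument name are hypothetical):
# In the parent notebook: run a child notebook with a 60-second timeout and one argument,
# capturing the value the child passes to dbutils.notebook.exit()
result = dbutils.notebook.run("/Shared/child_notebook", 60, {"input_date": "2023-01-01"})
print(result)

# In the child notebook: return a value to the caller and stop execution
dbutils.notebook.exit("success")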
Secrets utility (dbutils.secrets)
The secrets utility allows you to store and access sensitive credential information without making it visible in
notebooks. See Secret management and Use the secrets in a notebook. To list the available commands, run
dbutils.secrets.help().
get(scope: String, key: String): String -> Gets the string representation of a secret value with scope and key
getBytes(scope: String, key: String): byte[] -> Gets the bytes representation of a secret value with scope and key
list(scope: String): Seq -> Lists secret metadata for secrets within a scope
listScopes: Seq -> Lists secret scopes
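For example (the scope and key names here are hypothetical):
# Read a secret value; the value is redacted if you try to print it in a notebook
jdbc_password = dbutils.secrets.get(scope="my-scope", key="jdbc-password")

# Inspect what is available
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("my-scope"))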
Widgets utility (dbutils.widgets)
This example creates and displays a combobox widget with the programmatic name fruits_combobox. It
offers the choices apple, banana, coconut, and dragon fruit and is set to the initial value of banana. This
combobox widget has an accompanying label Fruits. This example ends by printing the initial value of the
combobox widget, banana.
dbutils.widgets.combobox(
name='fruits_combobox',
defaultValue='banana',
choices=['apple', 'banana', 'coconut', 'dragon fruit'],
label='Fruits'
)
print(dbutils.widgets.get("fruits_combobox"))
# banana
Browse files in DBFS
Databricks notebook interface and controls
Cell actions menu
Create cells
Notebooks use two types of cells: code cells and markdown cells. Code cells contain runnable code. Markdown
cells contain markdown code that renders into text and graphics when the cell is executed and can be used to
document or illustrate your code.
rdd = sc.parallelize(range(1000))
transformation_1 = rdd.map(lambda x: x + 2)
In this example, we're just creating a list of integers and applying a transformation that adds 2 to each element in the
list, so our logical execution plan would be something like the following:
Output> [DAG visualization of the logical plan]
rdd = sc.parallelize(range(1000), 8)
transformation_1 = rdd.map(lambda x: x + 2)
transformation_2 = transformation_1.filter(lambda x: x % 2 != 0)
Output> [DAG visualization of the logical plan with the map and filter transformations]
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think
of a DataFrame like a spreadsheet, a SQL table, or a dictionary of series objects.
Apache Spark DataFrames are an abstraction built on top of Resilient Distributed Datasets (RDDs).
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational
database or a data frame in R/Python, but with richer optimizations under the hood.
Create a DataFrame
data = [[1, "Elia"], [2, "Teo"], [3, "Fang"]]
df = spark.createDataFrame(data, schema="id LONG, name STRING")
Load data into a DataFrame from files
df = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
)
Filter rows in a DataFrame
filtered_df = df.filter("id > 1")
filtered_df = df.where("id > 1")
Select columns from a DataFrame
subset_df = df.filter("id > 1").select("name")
Join and combine DataFrames
joined_df = df1.join(df2, how="inner", on="id")
unioned_df = df1.union(df2)
Save a DataFrame to a table
df.write.saveAsTable("<table_name>")
df.write.format("json").save("/tmp/json_data")
Dataset
A Spark Dataset is a distributed collection of typed objects, which are partitioned across
multiple nodes in a cluster and can be operated on in parallel.
Datasets are composed of typed objects, which means that transformation syntax errors (like a typo in the
method name) and analysis errors (like an incorrect input variable type) can be caught at compile time.
Here is a list of some commonly used typed transformations, which can be used on Datasets of typed objects
(Dataset[T]).
map
Returns new Dataset with result of applying input function to each element
filter
Returns new Dataset containing elements where input function is true
groupByKey
Returns a KeyValueGroupedDataset where the data is grouped by the given key function
The entry point to programming in Spark is the org.apache.spark.sql.SparkSession class, which
you use to create a SparkSession object as shown below:
val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
Shuffle operations
Certain operations within Spark trigger an event known as the shuffle.
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions.
This typically involves copying data across executors and machines, making the shuffle a complex and costly
operation.
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific
operation. During computations, a single task will operate on a single partition - thus, to organize all the data
for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read
from all partitions to find all the values for all keys, and then bring together values across partitions to
compute the final result for each key - this is called the shuffle.
Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations
(except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
Performance Impact
The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O.
Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data
structures to organize records before or after transferring them.
When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O
and increased garbage collection.
Shuffle also generates a large number of intermediate files on disk.
Spark Default Shuffle Partition
When a Spark operation performs data shuffling (join(), aggregation functions), the DataFrame partition count
automatically increases to 200. This default shuffle partition number comes from the Spark SQL configuration
spark.sql.shuffle.partitions, which is set to 200 by default.
You can change this default shuffle partition value using conf method of the SparkSession object
spark.conf.set("spark.sql.shuffle.partitions",100)
On the other hand, when you have a large amount of data and too few partitions, you end up with fewer, longer-running
tasks, and you may also get out-of-memory errors.
Getting the right size of the shuffle partition is always tricky and takes many runs with different values to achieve the
optimized number. This is one of the key properties to look for when you have performance issues on Spark jobs.
Spark Partitioning & Partition Understanding
Spark/PySpark partitioning is a way to split the data into multiple partitions so that you can execute
transformations on multiple partitions in parallel which allows completing the job faster.
You can also write partitioned data into a file system (multiple sub-directories) for faster reads by downstream
systems
Repartitioning with coalesce function
There are two functions you can use in Spark to repartition data and coalesce is one of them.
def coalesce(numPartitions)
Returns a new :class:DataFrame that has exactly numPartitions partitions.
Similar to coalesce defined on an :class:RDD, this operation results in a narrow dependency, e.g. if you go from 1000
partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current
partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.
For the other function, repartition(numPartitions, *cols), numPartitions can be an int to specify the target number of
partitions or a Column; if it is a Column, it will be used as the first partitioning column. If not specified, the default number
of partitions is used. Optional arguments specify the partitioning columns, and numPartitions is optional when
partitioning columns are specified.
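A small sketch of coalesce behavior, assuming an existing DataFrame df with 8 partitions:
print(df.rdd.getNumPartitions())                  # e.g. 8

df_small = df.coalesce(2)                         # narrows to 2 partitions without a full shuffle
print(df_small.rdd.getNumPartitions())            # 2

print(df.coalesce(100).rdd.getNumPartitions())    # still 8: coalesce cannot increase the partition count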
Repartition by number
Use the following code to repartition the data to 10 partitions.
df = df.repartition(10)
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)
Spark will try to evenly distribute the data to each partition. If the total partition number is greater than the actual
record count (or RDD size), some partitions will be empty.
After we run the above code, the data will be reshuffled into 10 partitions, with 10 sharded files generated.
If we repartition the data frame to 1000 partitions, how many sharded files will be generated?
The answer is 100 (assuming the DataFrame holds 100 records), because the other 900 partitions are empty and each
generated file holds one record.
Repartition by column
We can also repartition by columns.
For example, let’s run the following code to repartition the data by column Country.
df = df.repartition("Country")
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)
groupByKey Vs reduceByKey
reduceByKey syntax:
sparkContext.textFile("hdfs://").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((x, y) => x + y)
Data are combined at each partition, with only one output for one key at each partition to send over the network.
reduceByKey requires combining all your values into another value with the exact same type.
map() – Spark map() transformation applies a function to each row in a DataFrame/Dataset and returns the
new transformed Dataset.
flatMap() – Spark flatMap() transformation flattens the DataFrame/Dataset after applying the function on
every element and returns a new transformed Dataset. The returned Dataset may contain more rows than the
current DataFrame.
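A quick sketch of the difference (in PySpark these transformations are applied through the underlying RDD):
rdd = sc.parallelize(["hello world", "spark flatmap"])

mapped = rdd.map(lambda line: line.split(" "))        # one output element per input element
flattened = rdd.flatMap(lambda line: line.split(" ")) # each input element can produce many output elements

print(mapped.collect())     # [['hello', 'world'], ['spark', 'flatmap']]
print(flattened.collect())  # ['hello', 'world', 'spark', 'flatmap']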
While both of these functions will produce the correct answer, the reduceByKey example works much better on a
large dataset.
That's because Spark knows it can combine output with a common key on each partition before shuffling the
data.
Look at the diagram below to understand what happens with reduceByKey.
Notice how pairs on the same machine with the same key are combined (by using the lambda function passed
into reduceByKey) before the data is shuffled.
Then the lambda function is called again to reduce all the values from each partition to produce one final result.
On the other hand, when calling groupByKey, all the key-value pairs are shuffled around.
This is a lot of unnecessary data being transferred over the network.
To determine which machine to shuffle a pair to, Spark calls a partitioning function on the key of the pair.
Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory.
However, it flushes out the data to disk one key at a time - so if a single key has more key-value pairs than can fit in memory,
an out of memory exception occurs. This will be more gracefully handled in a later release of Spark so the job can still proceed,
but should still be avoided - when Spark needs to spill to disk, performance is severely impacted.
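A compact sketch contrasting the two on a small RDD of word pairs (both produce the same counts, but reduceByKey combines values on each partition before the shuffle):
pairs = sc.parallelize(["a", "b", "a", "c", "b", "a"]).map(lambda w: (w, 1))

counts_reduce = pairs.reduceByKey(lambda x, y: x + y)   # map-side combine, then shuffle
counts_group = pairs.groupByKey().mapValues(sum)        # shuffles every (key, value) pair

print(sorted(counts_reduce.collect()))  # [('a', 3), ('b', 2), ('c', 1)]
print(sorted(counts_group.collect()))   # [('a', 3), ('b', 2), ('c', 1)]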
Day-6
Frequently Used Data Frame API Methods
Spark DataFrame vs. pandas DataFrame
Spark DataFrame supports parallelization; pandas DataFrame does not support parallelization.
Spark DataFrame spans multiple nodes; pandas DataFrame lives on a single node.
Spark DataFrame follows lazy execution, which means a task is not executed until an action is performed; pandas
DataFrame follows eager execution, which means tasks are executed immediately.
Complex operations are more difficult to perform on a Spark DataFrame; they are easier to perform on a pandas
DataFrame.
Spark DataFrame is distributed, so processing is faster for large amounts of data; pandas DataFrame is not
distributed, so processing is slower for large amounts of data.
Spark DataFrames are excellent for building scalable applications; pandas DataFrames can't be used to build
scalable applications.
You can convert a PySpark DataFrame to a pandas DataFrame with toPandas(), and you can create a PySpark
DataFrame from a pandas DataFrame with createDataFrame(pandas_df).
import numpy as np
import pandas as pd
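These imports are the start of a typical round-trip example; a sketch of the rest might look like this:
# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from the pandas DataFrame
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a pandas DataFrame
result_pdf = df.select("*").toPandas()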
UDFs are used to extend the functions of the framework and to re-use these functions across multiple DataFrames.
For example, suppose you want to convert the first letter of every word in a name string to upper case; PySpark's built-in
features don't have this function, so you can create it as a UDF and reuse it as needed on many DataFrames.
1. Create a dataframe
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

columns = ["Seqno", "Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders")]

df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)
2. Create a Python Function
This creates a function convertCase() which takes a string parameter and converts the first letter of every word to a
capital letter. UDFs take parameters of your choice and return a value.

def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr
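The step that wraps convertCase() in a UDF is missing from these notes; based on the note below and the usage in step 4, it would look something like this (a reconstruction, not the original code):
3. Convert the Python Function to a PySpark UDF

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Wrap the Python function as a UDF that returns a string
convertUDF = udf(lambda z: convertCase(z), StringType())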
Note: The default type of the udf() is StringType hence, you can also write the above statement without
return type.
4. Using UDF with DataFrame
df.select(col("Seqno"), \
    convertUDF(col("Name")).alias("Name")) \
    .show(truncate=False)
def upperCase(str):
    return str.upper()
In order to use the convertCase() function in PySpark SQL, you need to register the function
with PySpark by using spark.udf.register().
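A sketch of that registration and its use from SQL, assuming the df and convertCase() defined above (the temporary view name is hypothetical):
from pyspark.sql.types import StringType

spark.udf.register("convertUDF", convertCase, StringType())

df.createOrReplaceTempView("NAME_TABLE")
spark.sql("SELECT Seqno, convertUDF(Name) AS Name FROM NAME_TABLE").show(truncate=False)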
@udf(returnType=StringType())
def upperCase(str):
    return str.upper()
Parquet is column oriented and CSV is row oriented. Row-oriented formats are optimized for OLTP workloads while column-
oriented formats are better suited for analytical workloads.
Column-oriented databases such as AWS Redshift Spectrum bill by the amount of data scanned per query.
Therefore, converting CSV to Parquet with partitioning and compression lowers overall costs and improves performance
Parquet has helped its users reduce storage requirements by at least one-third on large datasets, in addition, it greatly
improves scan and deserialization time, hence the overall costs.
Parquet file format consists of 2 parts –
1. Data
2. Metadata.
Data is written first in the file and the metadata is written at the end to allow for single-pass writing. Let's look at
the Parquet file format first and then have a look at the metadata.
File Format -
For example, if there is a record comprising ID, employee Name, and Department, then all
the values for the ID column will be stored together, values for the Name column together, and so on. If
we take the same record schema as mentioned above, having three fields ID (int), NAME (varchar), and
Department (varchar), the table will look something like this:
For this table, the data in a row-wise storage format will be stored row by row, with all the fields of each record adjacent.
Whereas, the same data in a column-oriented storage format will be stored column by column, with all the values of
each column adjacent.
The columnar storage format is more efficient when you need to query a few columns from a table. It will read
only the required columns since they are adjacent, thus minimizing IO.
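A hedged sketch of converting a CSV dataset to partitioned, compressed Parquet in PySpark (the input path, output path, and Department column are hypothetical):
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/input/employees.csv"))

(df.write
   .mode("overwrite")
   .partitionBy("Department")              # one sub-directory per department
   .option("compression", "snappy")
   .parquet("/tmp/output/employees_parquet"))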
Databricks uses Delta Lake by default for all reads and writes and builds upon the ACID guarantees provided
by the open source Delta Lake protocol.
ACID stands for atomicity, consistency, isolation, and durability.
Consistency guarantees relate to how a given state of the data is observed by simultaneous operations.
Isolation refers to how simultaneous operations potentially conflict with one another.
display(loan_risks_upload_data)
'''
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
|0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
|1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
|2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
'''
What is Auto Loader?
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional
setup.
How does Auto Loader work?
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage.
Auto Loader can load data files from AWS S3 (s3://), Azure Data Lake Storage Gen2 (ADLS Gen2, abfss://), Google
Cloud Storage (GCS, gs://), Azure Blob Storage (wasbs://), ADLS Gen1 (adl://), and Databricks File System (DBFS,
dbfs:/). Auto Loader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
Databricks recommends Auto Loader in Delta Live Tables for incremental data ingestion.
Databricks recommends Auto Loader whenever you use Apache Spark Structured Streaming to ingest data from
cloud object storage. APIs are available in Python and Scala.
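A minimal Python sketch of an Auto Loader stream (the storage path, schema location, checkpoint location, and target table name are hypothetical):
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/tmp/autoloader/schema")
      .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/"))

(df.writeStream
   .option("checkpointLocation", "/tmp/autoloader/checkpoint")
   .trigger(availableNow=True)              # process all available files, then stop
   .toTable("bronze_events"))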
What is Auto Loader directory listing mode?
Auto Loader uses directory listing mode by default. In directory listing mode, Auto Loader identifies new files by
listing the input directory. Directory listing mode allows you to quickly start Auto Loader streams without any
permission configurations other than access to your data on cloud storage.
How does directory listing mode work?
For example, if you have files being uploaded every 5 minutes as /some/path/YYYY/MM/DD/HH/fileName, to find all the
files in these directories, the Apache Spark file source lists all subdirectories in parallel. The following algorithm estimates
the total number of API LIST directory calls to object storage:
What is Delta Lake?
Delta Lake is an open-format storage layer that delivers reliability, security, and performance on your data
lake for both streaming and batch operations.
It not only houses structured, semi-structured, and unstructured data but also provides Low-cost
Data Management solutions.
Databricks Delta Lake also handles ACID (Atomicity, Consistency, Isolation, and Durability)
transactions, scalable metadata handling, and data processing on existing data lakes.
Query Performance
As data grows exponentially over time, query performance becomes a crucial factor. Delta improves query
performance by 10x to 100x compared to Apache Spark on the Parquet (human-unreadable) file format. Below are
some techniques that assist in improving the performance:
Indexing: Databricks Delta creates and maintains Indexes on the tables to arrange queried data.
Skipping: Databricks Delta helps maintain file statistics so that only relevant portions of the data are read.
Compression: Databricks Delta consumes less memory space by efficiently managing Parquet files to optimize
queries.
Caching: Databricks Delta automatically caches highly accessed data to improve run times for commonly run
queries.
Optimize Layout
Delta optimizes table size with a built-in “optimize” command. End users can optimize certain portions of
the Databricks Delta Table that are most relevant instead of querying an entire table. It saves the overhead
cost of storing metadata and can help speed up queries
System Complexity
System complexity increases the effort required to complete data-related tasks and makes it difficult to respond
to changes. With Delta, organizations solve system complexities by:
Time Travel
Time travel allows users to roll back in case of bad writes. Some Data Scientists run models on datasets for a
specific time, and this ability to reference previous versions becomes useful for Temporal Data Management.
A user can query Delta Tables for a specific timestamp because any change in Databricks Delta Table creates
new table versions. These tasks help data pipelines to audit, roll back accidental deletes, or reproduce
experiments and reports.
What is Databricks Delta Table?
A Databricks Delta Table records version changes or modifications in a feature class of table in Delta Lake. Unlike
traditional tables that store data in a row and column format, the Databricks Delta Table facilitates ACID
transactions and time travel features to store metadata information for quicker Data Ingestion. Data stored in a
Databricks Delta Table is a secure Parquet file format that is an encoded layer over data.
These stale data files and logs of transactions are converted from 'Parquet' to 'Delta' format to reduce custom coding
in the Databricks Delta Table. It also facilitates some advanced features that provide a history of events, and more
flexibility in changing content (update, delete, and merge operations) to avoid data duplication.
Every transaction performed on a Delta Lake table is recorded in an ordered transaction log called the DeltaLog.
Delta Lake breaks the process into discrete steps of one or more actions whenever a user performs modification
operations in a table. It facilitates multiple readers and writers working on a given Databricks Delta Table at the same
time. These actions are recorded in the ordered transaction log as commits. For instance, if a user creates a
transaction to add a new column to a Databricks Delta Table while adding some more data, Delta Lake would break
that transaction into its component parts.
Once the transaction is completed in the Databricks Delta Table, the files are added to the transaction log like the
following commits:
Update Metadata: To change the Schema while including the new column to the Databricks Delta Table.
Add File: To add new files to the Databricks Delta Table
Delta is made of many components:
Parquet data files, organized or not as partitions
JSON files as the transaction log
Checkpoint files
How to Rollback a Delta Lake Table to a Previous Version with Restore
You can roll back a Delta Lake table to any previous version with the restoreToVersion command in PySpark:
from delta.tables import *
deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")
deltaTable.restoreToVersion(1)
You can also restore to an earlier timestamp:
io.delta.tables.DeltaTable.forPath(spark, "/tmp/delta-table").restoreToTimestamp('2021-01-01')
io.delta.tables.DeltaTable.forPath(spark, "/tmp/delta-table").restoreToTimestamp('2021-01-01 01:01:01')
Delta Lake makes it easy to access different versions of your data. For example, you can time travel back to version 0
of your Delta Lake table to see the original data that was stored when you created it. During time travel we are
loading the table up to some version - in this case, we're loading up to the initial version.
spark.read.format("delta").option("versionAsOf", "0").load("/tmp/delta-table").show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+
spark.read.format("delta").option("versionAsOf", "1").load("/tmp/delta-table").show()
+---+
| id|
+---+
| 4|
| 5|
+---+
The latest version of the table contains:
+---+
| id|
+---+
| 7|
| 8|
| 9|
+---+
Using the restore command resets the table’s content to an earlier version, but doesn’t remove any data. It
simply updates the transaction log to indicate that certain files should not be read.
Delta Lake restore after vacuum
To completely remove a later version of the data after restoring to a previous version, you need
to run the Delta Lake vacuum command.
vacuum is a widely used command that removes files that are not needed by the latest version
of the table. Running vacuum doesn’t make your Delta Lake operations any faster, but it
removes files on disk, which reduces storage costs.
deltaTable.vacuum(retentionHours=0)
deltaTable.vacuum() # vacuum files not required by versions more than 7 days old
deltaTable.vacuum(100) # vacuum files not required by versions more than 100 hours old
As expected, reading the contents still returns the table’s data as of version 1:
spark.read.format("delta").load("/tmp/delta-table").show()
+---+
| id|
+---+
| 4|
| 5|
+---+
Optimize performance with file management
To improve query speed, Delta Lake supports the ability to optimize the layout of data in storage. There are various
ways to optimize the layout.
Compaction (bin-packing)
Delta Lake can improve the speed of read queries from a table by coalescing small files into larger ones.
deltaTable.optimize().executeCompaction()
# If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate using `where`
deltaTable.optimize().where("date='2021-11-18'").executeCompaction()
Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no
effect.
Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number
of tuples per file. However, the two measures are most often correlated.
OPTIMIZE makes no data related changes to the table, so a read before and after an OPTIMIZE has the same results.
Performing OPTIMIZE on a table that is a streaming source does not affect any current or future streams that treat this
table as a source.
OPTIMIZE returns the file statistics (min, max, total, and so on) for the files removed and the files added by the operation
Z-Ordering (multi-dimensional clustering)
deltaTable.optimize().executeZOrderBy(eventType)
# If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate using `where`
deltaTable.optimize().where("date='2021-11-18'").executeZOrderBy(eventType)
Z-Ordering is not idempotent. Every time Z-Ordering is executed, it will try to create a new clustering of data
in all files (new and existing files that were part of previous Z-Ordering) in a partition.
Z-Ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not
necessarily data size on disk. The two measures are most often correlated, but there can be situations when
that is not the case, leading to skew in optimize task times.
Multi-part checkpointing
Delta Lake periodically and automatically compacts all the incremental updates to the Delta log into a
Parquet file.
This "checkpointing" allows read queries to quickly reconstruct the current state of the table (that is, which
files to process, what the current schema is) without reading too many files containing incremental updates.
The Delta Lake protocol allows splitting the checkpoint into multiple Parquet files. This parallelizes and speeds up
writing the checkpoint.
In Delta Lake, by default each checkpoint is written as a single Parquet file.
To use this feature, set the SQL configuration spark.databricks.delta.checkpoint.partSize=<n>, where n is the
limit of the number of actions (such as AddFile) at which Delta Lake on Apache Spark will start parallelizing the
checkpoint and attempt to write a maximum of this many actions per checkpoint file.
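For example, one way to set this from a notebook (the value 1000000 is just an illustrative limit):
# each checkpoint part file will contain at most this many actions (such as AddFile)
spark.conf.set("spark.databricks.delta.checkpoint.partSize", "1000000")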
Azure data factory - Dynamically add timestamp in copied
filename
Problem Statement:
Azure Data Factory is copying files to the target folder, and I need the copied files to have the current timestamp in their names.
Example:
SourceFolder has files --> File1.txt, File2.txt and so on
TargetFolder should have copied files with the names --> File1_2019-11-01.txt, File2_2019-11-01.txt and so on.
1. Create a Source dataset that points to the Source folder which has the files to be copied.
3. Drag a Get Metadata activity onto the pipeline. This will give us the file names having .txt extensions in the source folder.
Add an argument - Child Items - to retrieve file details under the Source folder.
Updating and modifying Delta Lake tables
You can upsert data from a source table, view, or DataFrame into a target Delta table by using the MERGE
SQL operation. Delta Lake supports inserts, updates and deletes in MERGE, and it supports extended
syntax beyond the SQL standards to facilitate advanced use cases.
To merge the new data, you want to update rows where the person’s id is already present and insert the
new rows where no matching id is present.
PySpark distinct() function is used to drop/remove the duplicate rows (all columns) from DataFrame and
dropDuplicates() is used to drop rows based on selected (one or multiple) columns.
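A brief sketch of the two, assuming an existing DataFrame df with an id column:
dedup_all_columns = df.distinct()            # drops rows that are duplicated across all columns
dedup_by_id = df.dropDuplicates(["id"])      # drops rows that share the same id, keeping the first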
MERGE INTO people10m
USING people10mupdates
ON people10m.id = people10mupdates.id
WHEN MATCHED THEN
UPDATE SET
id = people10mupdates.id,
firstName = people10mupdates.firstName,
middleName = people10mupdates.middleName,
lastName = people10mupdates.lastName,
gender = people10mupdates.gender,
birthDate = people10mupdates.birthDate,
ssn = people10mupdates.ssn,
salary = people10mupdates.salary
WHEN NOT MATCHED
THEN INSERT (
id,
firstName,
middleName,
lastName,
gender,
birthDate,
ssn,
salary
)
VALUES (
people10mupdates.id,
people10mupdates.firstName,
people10mupdates.middleName,
people10mupdates.lastName,
people10mupdates.gender,
people10mupdates.birthDate,
people10mupdates.ssn,
people10mupdates.salary
)
What is a Data Warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It
usually contains historical data derived from transaction data, but it can include data from other sources.
What is Data Warehousing?
Data Warehousing (DW) is the process of collecting and managing data from varied sources to provide meaningful business
insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data
warehouse is the core of the BI system, which is built for data analysis and reporting.
Types of Data Warehouse
Three main types of Data Warehouses (DWH) are:
1. Enterprise Data Warehouse (EDW):
An Enterprise Data Warehouse (EDW) is a centralized warehouse. It provides decision-support services across the enterprise. It
offers a unified approach to organizing and representing data. It also provides the ability to classify data according to
subject and give access according to those divisions.
2. Operational Data Store (ODS):
An Operational Data Store, also called an ODS, is a data store used when neither a data warehouse nor
OLTP systems support an organization's reporting needs. In an ODS, the data warehouse is refreshed in real time. Hence, it is widely
preferred for routine activities like storing records of employees.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales
or finance. In an independent data mart, data can be collected directly from sources.
Who needs Data warehouse?
DWH (Data warehouse) is needed for all types of users like:
Airline:
In the airline system, it is used for operational purposes such as crew assignment, analysis of route profitability, frequent flyer program promotions, etc.
Banking:
It is widely used in the banking sector to manage the resources available on desk effectively. Some banks also use it for market research and for performance analysis
of products and operations.
Healthcare:
The healthcare sector also uses data warehouses to strategize and predict outcomes, generate patient treatment reports, and share data with tie-in insurance
companies, medical aid services, etc.
Public sector:
In the public sector, the data warehouse is used for intelligence gathering. It helps government agencies maintain and analyze tax records and health policy records
for every individual.
Retail chain:
In retail chains, the data warehouse is widely used for distribution and marketing. It also helps to track items, analyze customer buying patterns, and manage
promotions, and it is used for determining pricing policy.
Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions and to make distribution decisions.
Hospitality industry:
This industry utilizes warehouse services to design and estimate advertising and promotion campaigns, targeting clients based on
their feedback and travel patterns.
What is OLAP?
Online Analytical Processing (OLAP) is a category of software tools that provide analysis of data for business
decisions. OLAP systems allow users to analyze database information from multiple database systems at
one time.
What is OLTP?
Online Transaction Processing (OLTP) supports transaction-oriented applications in a 3-tier
architecture. OLTP administers the day-to-day transactions of an organization.
Difference between OLTP and OLAP
Facts and dimensions are the fundamental elements that define a data warehouse. They record relevant
events of a subject or functional area (facts) and the characteristics that define them (dimensions).
The fact table contains measurements, metrics, and facts about a business process, while the dimension table is a
companion to the fact table containing descriptive attributes to be used for query constraining.
The fact table is located at the center of a star or snowflake schema, whereas the Dimension table is located at
the edges of the star or snowflake schema.
A fact table is defined by its grain or most atomic level, whereas a Dimension table should be wordy, descriptive,
complete, and of assured quality.
The fact table helps to store report labels, whereas Dimension table contains detailed data.
The fact table does not contain a hierarchy, whereas the Dimension table contains hierarchies.
What is Fact Table?
A fact table is a primary table in a dimensional model.
Measurements/facts
Foreign key to dimension table
What is a Dimension Table?
A dimension table contains dimensions of a fact.
They are joined to fact table via a foreign key.
Dimension tables are de-normalized tables.
The Dimension Attributes are the various columns in a dimension table.
Dimensions offer descriptive characteristics of the facts with the help of their attributes.
There is no set limit on the number of dimensions.
The dimension can also contain one or more hierarchical relationships
1. Hash distributed
A hash-distributed table distributes table rows across the Compute nodes by using a
deterministic hash function to assign each row to one distribution.
Hash-distributed tables work well for large fact tables in a star schema.
They can have very large numbers of rows and still achieve high performance.
Consider using a hash-distributed table when the table size on disk is more than 2 GB and the table has frequent
insert, update, and delete operations.
2. Round-robin
A round-robin table distributes table rows evenly across all distributions at random. Loading data into a round-robin
table is fast, but queries on round-robin tables can require more data movement than the other distribution methods.
3. Replicated table
A replicated table has a full copy of the table accessible on each Compute node.
Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since
the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed.
2 GB is not a hard limit. If the data is static and does not change, you can replicate larger tables.
Replicated tables work well for dimension tables in a star schema. Dimension tables are typically joined to fact tables,
which are distributed differently than the dimension table. Dimensions are usually of a size that makes it feasible to store
and maintain multiple copies.
Consider using a replicated table when:
The table size on disk is less than 2 GB, regardless of the number of rows. To find the size of a table, you can use the
DBCC PDW_SHOWSPACEUSED command: DBCC PDW_SHOWSPACEUSED('ReplTableCandidate').
The table is used in joins that would otherwise require data movement. When joining tables that are not distributed on
the same column, such as a hash-distributed table to a round-robin table, data movement is required to complete the
query. If one of the tables is small, consider a replicated table. We recommend using replicated tables instead of round-
robin tables in most cases. To view data movement operations in query plans, use sys.dm_pdw_request_steps. The
BroadcastMoveOperation is the typical data movement operation that can be eliminated by using a replicated table.