
Databricks

What is Azure Databricks?

The Azure Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and
maintaining enterprise-grade data solutions at scale

What is Azure Databricks used for?

Azure Databricks is used to process, store, clean, share, analyze, model, and monetize datasets, with solutions
ranging from BI to machine learning.

You can use the Azure Databricks platform to build many different applications spanning data personas.

What is Databricks Data Science & Engineering?

Databricks Data Science & Engineering (sometimes called simply "Workspace") is an analytics platform based on
Apache Spark. It is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive
workspace that enables collaboration between data engineers, data scientists, and machine learning engineers.
For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in
batches, or streamed in near real time using Apache Kafka, Event Hubs, or IoT Hub.

This data lands in a data lake for long term persisted storage, in Azure Blob Storage or Azure Data Lake Storage.
As part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure
Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and turn it into
breakthrough insights using Spark.

Apache Spark analytics platform


Spark SQL and DataFrames: Spark SQL is the Spark module for working with structured data. A DataFrame is a
distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational
database or a data frame in R/Python.

Streaming: Real-time data processing and analysis for analytical and interactive applications. Integrates with
HDFS, Flume, and Kafka.

MLlib: Machine Learning library consisting of common learning algorithms and utilities, including classification,
regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization
primitives.

GraphX: Graphs and graph computation for a broad scope of use cases from cognitive analytics to data
exploration.

Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
What is Databricks Machine Learning?
Databricks Machine Learning is an integrated end-to-end machine learning platform incorporating managed services for
experiment tracking, model training, feature development and management, and feature and model serving. The
diagram shows how the capabilities of Databricks map to the steps of the model development and deployment process.
Apache Spark Architecture
Apache Spark works in a master-slave architecture where the master is called “Driver” and slaves are called
“Workers”. When you run a Spark application, the Spark Driver creates a context that is the entry point to your
application; all operations (transformations and actions) are executed on worker nodes, and the
resources are managed by the Cluster Manager.
Menu Items

• Create → To create a notebook, cluster, or table


• Workspace → To check how many folders or users there are
• Recents → All recent things, whatever you have done there
• Search → For searching in the workspace
• Compute → Here you can see your cluster or create the cluster
• WorkFlow → Here you can find your jobs/Workflow
Components of Apache Spark Run-Time Architecture
The three high-level components of the architecture of a Spark application are:

Spark Driver
Cluster Manager
Executors
How does Spark Work?

Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:

Master Daemon – (Master/Driver Process)
Worker Daemon – (Slave Process)

A Spark cluster has a single Master and any number of Slaves/Workers. The driver and the executors run their
individual Java processes, and users can run them on the same horizontal Spark cluster, on separate machines
(a vertical Spark cluster), or in a mixed machine configuration.
Understanding the Spark Application Architecture

What happens when a Spark Job is submitted?

 When a client submits a spark user application code, the driver implicitly converts the code containing
transformations and actions into a logical directed acyclic graph (DAG).
 At this stage, the driver program also performs certain optimizations like pipelining transformations, and then it
converts the logical DAG into a physical execution plan with a set of stages.
 After creating the physical execution plan, it creates small physical execution units referred to as tasks under each
stage. Then tasks are bundled to be sent to the Spark Cluster.
 The driver program then talks to the cluster manager and negotiates for resources. The cluster manager then
launches executors on the worker nodes on behalf of the driver.
 At this point the driver sends tasks to the executors based on data placement. Before executors begin
execution, they register themselves with the driver program so that the driver has a holistic view of all the executors.
 Now executors start executing the various tasks assigned by the driver program.
 At any point of time when the Spark application is running, the driver program will monitor the set of executors that
run.
 The driver program in the Spark architecture also schedules future tasks based on data placement by tracking the location
of cached data.
 When the driver program's main() method exits or when it calls the stop() method of the SparkContext, it will terminate all
the executors and release the resources from the cluster manager.
Day-2
Clusters
An Azure Databricks cluster is a set of computation resources and configurations on which
you run data engineering, data science, and data analytics workloads, such as production ETL
pipelines, streaming analytics, ad-hoc analytics, and machine learning.
Types of Clusters
1. all-purpose clusters
2. job clusters.

You can manually terminate and restart an all-purpose cluster. Multiple users can share such
clusters to do collaborative interactive analysis.
The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster
and terminates the cluster when the job is complete. You cannot restart a job cluster.

Cluster mode
Azure Databricks supports three cluster modes: Standard, High Concurrency, and Single Node.
The default cluster mode is Standard.
Standard vs High Concurrency vs Single Node clusters

Sharing and isolation: Standard mode clusters (sometimes called No Isolation Shared clusters) can be shared by
multiple users, with no isolation between users (they follow the driver-and-worker-node approach). A High
Concurrency cluster is a managed cloud resource; its key benefit is that it provides fine-grained sharing for
maximum resource utilization and minimum query latencies. A Single Node cluster has no workers and runs
Spark jobs on the driver node.

Auto-termination: Standard and Single Node clusters terminate automatically after 120 minutes by default.
High Concurrency clusters do not terminate automatically by default.

Supported languages: Standard clusters can run workloads developed in Python, SQL, R, and Scala. High
Concurrency clusters can run workloads developed in SQL, Python, and R. Single Node clusters can run
workloads developed in Python, SQL, R, and Scala.
Pools

 To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances, for the driver
and worker nodes.
 The cluster is created using instances in the pools. If a pool does not have sufficient idle resources to
create the requested driver or worker nodes, the pool expands by allocating new instances from the
instance provider.
 When an attached cluster is terminated, the instances it used are returned to the pools and can be reused
by a different cluster.

Databricks Runtime

Databricks runtimes are the set of core components that run on your clusters. All Databricks runtimes
include Apache Spark and add components and updates that improve usability, performance, and security.
For details, see Databricks runtimes.

Azure Databricks offers several types of runtimes and several versions of those runtime types in the
Databricks Runtime Version drop-down when you create or edit a cluster.
Databricks Runtime Types
Databricks Runtime

Databricks Runtime includes Apache Spark but also adds a number of components
and updates that substantially improve the usability, performance, and security of big
data analytics.

Databricks Runtime for Machine Learning

Databricks Runtime ML is a variant of Databricks Runtime that adds multiple popular


machine learning libraries, including TensorFlow, Keras, PyTorch, and XGBoost.

Photon runtime

Photon is the Azure Databricks native vectorized query engine that runs SQL
workloads faster and reduces your total cost per workload.

Databricks Light

Databricks Light provides a runtime option for jobs that don’t need the advanced
performance, reliability, or autoscaling benefits provided by Databricks Runtime.
Cluster node type

A cluster consists of one driver node and zero or more worker nodes.

You can pick separate cloud provider instance types for the driver and worker nodes, although by default the
driver node uses the same instance type as the worker node.
Different families of instance types fit different use cases, such as memory-intensive or compute-intensive
workloads.

Driver node

The driver node maintains state information of all notebooks attached to the cluster.
The driver node also maintains the SparkContext and interprets all the commands you run from a notebook or a
library on the cluster, and runs the Apache Spark master that coordinates with the Spark executors.

Worker node

Azure Databricks worker nodes run the Spark executors and other services required for the proper functioning
of the clusters. When you distribute your workload with Spark, all of the distributed processing happens on
worker nodes
Cluster size and autoscaling

When you create an Azure Databricks cluster, you can either provide a fixed number of workers for the
cluster or provide a minimum and maximum number of workers for the cluster.

When you provide a fixed size cluster, Azure Databricks ensures that your cluster has the specified number
of workers. When you provide a range for the number of workers, Databricks chooses the appropriate
number of workers required to run your job. This is referred to as autoscaling.

With autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of
your job. Certain parts of your pipeline may be more computationally demanding than others, and
Databricks automatically adds additional workers during these phases of your job (and removes them when
they’re no longer needed).

Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the
cluster to match a workload. This applies especially to workloads whose requirements change over time
(like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload
whose provisioning requirements are unknown. Autoscaling thus offers two advantages:

Workloads can run faster compared to a constant-sized under-provisioned cluster.


Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.
Enable and configure autoscaling
To allow Azure Databricks to resize your cluster automatically, you enable autoscaling for the cluster
and provide the min and max range of workers.

1. Enable autoscaling.

All-Purpose cluster – On the Create Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box.

Job cluster – On the Configure Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box.

2. Configure the min and max workers.
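As an illustration of what this configuration looks like outside the UI, the sketch below creates an autoscaling cluster through the Clusters API. It is a minimal, hedged example: the workspace URL, token, runtime version, and node type are placeholders you would replace with values valid for your workspace.

import requests

# Placeholder workspace URL and personal access token
host = "https://<databricks-instance>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "<runtime-version>",   # a Databricks Runtime version string
    "node_type_id": "<node-type>",          # an Azure VM size supported in your region
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 120,
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(resp.json())  # returns the new cluster_id on success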


cluster logs

Databricks provides three kinds of logging of cluster-related activity:


• Cluster event logs, which capture cluster lifecycle events, like creation, termination, configuration edits, and so on.
• Apache Spark driver and worker logs, which you can use for debugging.
• Cluster init-script logs, valuable for debugging init scripts.
Init scripts

An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver
or worker JVM starts.

Some examples of tasks performed by init scripts include:

Install packages and libraries not included in Databricks Runtime. To install Python packages, use the Azure
Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into
the Azure Databricks Python virtual environment rather than the system Python environment.
For example, /databricks/python/bin/pip install <package-name>.
Modify the JVM system classpath in special cases.
Set system properties and environment variables used by the JVM.
Modify Spark configuration parameters.
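One common way to create a cluster-scoped init script is to write the shell script to DBFS from a notebook and then reference its path in the cluster's Advanced Options. The sketch below is a minimal example; the script path and package name are placeholders.

# Write a simple init script to DBFS (path and package name are placeholders)
dbutils.fs.put(
    "dbfs:/databricks/scripts/install-extra-packages.sh",
    """#!/bin/bash
# Install a Python package into the Databricks Python environment on every node
/databricks/python/bin/pip install <package-name>
""",
    True,  # overwrite if the file already exists
)

# Verify the script exists, then reference dbfs:/databricks/scripts/install-extra-packages.sh
# in the cluster configuration (Advanced Options > Init Scripts).
display(dbutils.fs.ls("dbfs:/databricks/scripts/"))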
Init script types

Azure Databricks supports two kinds of init scripts: cluster-scoped and global.

Cluster-scoped: run on every cluster configured with the script. This is the recommended way to run an init
script.
Global: run on every cluster in the workspace. They can help you to enforce consistent cluster configurations
across your workspace. Use them carefully because they can cause unanticipated impacts, like library conflicts.
Only admin users can create global init scripts. Global init scripts are not run on model serving clusters.
Environment variables
Cluster-scoped and global init scripts support the following environment variables:

DB_CLUSTER_ID: the ID of the cluster on which the script is running. See Clusters API 2.0.
DB_CONTAINER_IP: the private IP address of the container in which Spark runs. The init script is run inside
this container. See SparkNode.
DB_IS_DRIVER: whether the script is running on a driver node.
DB_DRIVER_IP: the IP address of the driver node.
DB_INSTANCE_TYPE: the instance type of the host VM.
DB_CLUSTER_NAME: the name of the cluster the script is executing on.
DB_IS_JOB_CLUSTER: whether the cluster was created to run a job. See Create a job.

echo $DB_IS_DRIVER
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
<run this part only on driver>
else
<run this part only on workers>
fi
<run this part on both driver and workers>
Manage clusters
Manage Azure Databricks clusters, including displaying, editing,
starting, terminating, deleting, controlling access, and
monitoring performance and logs.
What is the Databricks File System (DBFS)?

The Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters.
DBFS is an abstraction on top of scalable object storage that maps Unix-like filesystem calls to native cloud storage
API calls

What can you do with DBFS?

DBFS provides convenience by mapping cloud object storage URIs to relative paths.

Allows you to interact with object storage using directory and file semantics instead of cloud-specific API
commands.
Allows you to mount cloud object storage locations so that you can map storage credentials to paths in the
Azure Databricks workspace.
Simplifies the process of persisting files to object storage, allowing virtual machines and attached volume
storage to be safely deleted on cluster termination.
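For illustration, the snippet below reaches the same storage through DBFS paths in three ways. The README path refers to the databricks-datasets sample data that ships with most workspaces; treat it as an assumption if your workspace does not include it.

# List a DBFS directory with Databricks Utilities
display(dbutils.fs.ls("dbfs:/databricks-datasets"))

# Read the same storage through Spark APIs using a dbfs:/ URI
df = spark.read.text("dbfs:/databricks-datasets/README.md")
df.show(3, truncate=False)

# Read it through local file APIs on the driver via the /dbfs mount
with open("/dbfs/databricks-datasets/README.md") as f:
    print(f.readline())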
Mounting cloud object storage on Azure Databricks
Azure Databricks mounts create a link between a workspace and cloud object storage, which enables you to
interact with cloud object storage using familiar file paths relative to the Databricks file system. Mounts work by
creating a local alias under the /mnt directory that stores the following information:

 Location of the cloud object storage.


 Driver specifications to connect to the storage account or container.
 Security credentials required to access the data.
syntax
mount(
source: str,
mountPoint: str,
encryptionType: Optional[str] = "",
extraConfigs: Optional[dict[str:str]] = None
)
The source specifies the URI of the object storage (and can optionally encode security credentials). The
mountPoint specifies the local path in the /mnt directory. Some object storage sources support an
optional encryptionType argument. For some access patterns you can pass additional configuration
specifications as a dictionary to extraConfigs.
Mount ADLS Gen2 or Blob Storage with ABFS
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-
name>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)

<application-id> with the Application (client) ID for the Azure Active Directory application.
<scope-name> with the Databricks secret scope name.
<service-credential-key-name> with the name of the key containing the client secret.
<directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
<container-name> with the name of a container in the ADLS Gen2 storage account.
<storage-account-name> with the ADLS Gen2 storage account name.
<mount-name> with the name of the intended mount point in DBFS.
Day-3
Databricks Utilities

Databricks Utilities (dbutils) make it easy to perform powerful combinations of tasks. You can use the utilities
to work with object storage efficiently, to chain and parameterize notebooks, and to work with secrets. dbutils
are not supported outside of notebooks.

This module provides various utilities for users to interact with the rest of Databricks.

fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks
File system utility (dbutils.fs)
The file system utility allows you to access the Databricks File System (DBFS), making it easier to use Azure
Databricks as a file system. To list the available commands, run dbutils.fs.help().

ls command (dbutils.fs.ls)
Lists the contents of a directory.

To display help for this command, run dbutils.fs.help("ls").

dbutils.fs.ls("/tmp")

# Out[13]: [FileInfo(path='dbfs:/tmp/my_file.txt',
name='my_file.txt', size=40,
modificationTime=1622054945000)]
Jobs utility (dbutils.jobs)

The jobs utility allows you to leverage jobs features. To display help for this
utility, run dbutils.jobs.help().

dbutils.jobs.taskValues.get(taskKey = "my-task", \
key = "my-key", \
default = 7, \
debugValue = 42)

Library utility (dbutils.library)


The library utility allows you to install Python libraries and create an environment scoped to a notebook session. The
libraries are available both on the driver and on the executors, so you can reference them in user defined functions.
This enables:

Library dependencies of a notebook to be organized within the notebook itself.


Notebook users with different library dependencies to share a cluster without interference.
To display help for this command, run
dbutils.library.help("install").

dbutils.library.install("abfss:/path/to/your/library.whl")
dbutils.library.restartPython() # Removes Python state, but some
libraries might not work without calling this command.
Notebook utility (dbutils.notebook)

Commands: exit, run

The notebook utility allows you to chain together notebooks


and act on their results.

To list the available commands, run dbutils.notebook.help().

exit(value: String): void -> This method lets you exit a notebook
with a value
run(path: String, timeoutSeconds: int, arguments: Map): String
-> This method runs a notebook and returns its exit value.
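A short, hedged example of chaining notebooks; the child notebook path, argument name, and exit value are placeholders.

# In the parent notebook: run a child notebook for up to 60 seconds and capture its exit value
result = dbutils.notebook.run("/Shared/child_notebook", 60, {"input_date": "2023-01-01"})
print(result)

# In the child notebook: return a value to the caller
dbutils.notebook.exit("done")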

Secrets utility (dbutils.secrets)


Commands: get, getBytes, list, listScopes

The secrets utility allows you to store and access sensitive credential information without making them visible in
notebooks. See Secret management and Use the secrets in a notebook. To list the available commands, run
dbutils.secrets.help().
get(scope: String, key: String): String -> Gets the string representation of a secret value with scope and key
getBytes(scope: String, key: String): byte[] -> Gets the bytes representation of a secret value with scope and key
list(scope: String): Seq -> Lists secret metadata for secrets within a scope
listScopes: Seq -> Lists secret scopes
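A minimal sketch, assuming a secret scope named my-scope and a key named jdbc-password already exist:

dbutils.secrets.listScopes()                      # list all secret scopes
dbutils.secrets.list("my-scope")                  # list secret metadata within a scope
password = dbutils.secrets.get(scope="my-scope", key="jdbc-password")  # value is redacted if printed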
Widgets utility (dbutils.widgets)

Commands: combobox, dropdown, get, getArgument, multiselect, remove, removeAll,


text

The widgets utility allows you to parameterize notebooks.


To list the available commands, run dbutils.widgets.help().
combobox(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a combobox input
widget with a given name, default value and choices
dropdown(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a dropdown input
widget with a given name, default value and choices
get(name: String): String -> Retrieves current value of an input widget
getArgument(name: String, optional: String): String -> (DEPRECATED) Equivalent to get
multiselect(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a multiselect input
widget with a given name, default value and choices
remove(name: String): void -> Removes an input widget from the notebook
removeAll: void -> Removes all widgets in the notebook
text(name: String, defaultValue: String, label: String): void -> Creates a text input widget with a given name and
default value
combobox command (dbutils.widgets.combobox)
Creates and displays a combobox widget with the specified programmatic name, default value, choices, and
optional label.

To display help for this command, run dbutils.widgets.help("combobox").

This example creates and displays a combobox widget with the programmatic name fruits_combobox. It
offers the choices apple, banana, coconut, and dragon fruit and is set to the initial value of banana. This
combobox widget has an accompanying label Fruits. This example ends by printing the initial value of the
combobox widget, banana.

dbutils.widgets.combobox(
name='fruits_combobox',
defaultValue='banana',
choices=['apple', 'banana', 'coconut', 'dragon fruit'],
label='Fruits'
)

print(dbutils.widgets.get("fruits_combobox"))

# banana
Browse files in DBFS
Databricks notebook interface and controls
Cell actions menu

Create cells
Notebooks use two types of cells: code cells and markdown cells. Code cells contain runnable code. Markdown
cells contain markdown code that renders into text and graphics when the cell is executed and can be used to
document or illustrate your code.

Attach a notebook to a cluster


Create a notebook in any folder

View all notebooks attached to a cluster


Schedule a notebook job
RDD (Resilient Distributed Dataset) is a fundamental building block of Spark: a fault-tolerant, immutable, distributed
collection of objects. Immutable means that once you create an RDD you cannot change it. Each record
in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.
RDD Benefits:
In-Memory Processing
Immutability
Fault Tolerance
Lazy Evaluation
Partitioning
RDD vs. DataFrame vs. Dataset

Data Representation:
RDD: Distributed collection of elements.
DataFrame: Distributed collection of data organized into columns.
Dataset: Combination of RDD and DataFrame.

Data Formats:
RDD: Structured and unstructured are accepted.
DataFrame: Structured and semi-structured are accepted.
Dataset: Structured and unstructured are accepted.

Data Sources:
All three accept various data sources.

Immutability and Interoperability:
RDD: Immutable partitions that easily transform into DataFrames.
DataFrame: Transforming into a DataFrame loses the original RDD.
Dataset: The original RDD regenerates after transformation.

Compile-time type safety:
RDD: Compile-time type safety is available.
DataFrame: No compile-time type safety; errors are detected at runtime.
Dataset: Compile-time type safety is available.

Optimization:
RDD: No built-in optimization engine; each RDD is optimized individually.
DataFrame: Query optimization through the Catalyst optimizer.
Dataset: Query optimization through the Catalyst optimizer, like DataFrames.
Being Lazy is Useful — Lazy Evaluation in Spark
Spark is based on transformations and actions. A transformation is a set of operations that manipulate the data, while actions are those that display a result.
Data transformations in Spark are performed using the lazy evaluation technique. Thus, they are delayed until a result is needed. When Spark
detects that an action is going to be executed, it creates a DAG where it registers all the transformations in an orderly fashion. In this way, when
needed, the transformations will be performed, optimised, and the expected result will be obtained.
Let’s take another simpler example to show it a bit better:

rdd= sc.parallelize(range(1000))
transformation_1 = rdd.map(lambda x: x+2)
In this example, we're just creating a list of integers and applying a transformation that adds 2 to each object in the list, so our logical execution
plan would be something like the following:

Output>

(8) PythonRDD[1] at RDD at PythonRDD.scala:53 []


| ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274 []
But if we again apply a transformation (filter):

rdd= sc.parallelize(range(1000),8)
transformation_1 = rdd.map(lambda x: x+2)
transformation_2 = transformation_1.filter(lambda x: x%2 != 0)
Output>

(8) PythonRDD[1] at RDD at PythonRDD.scala:53 []\n | ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274 []


To display the execution plan we use the toDebugString() function.

Some RDDs that we can see in this execution plan are:

PairwiseRDD: an RDD that contains objects of type Key/Value.
ShuffledRDD: created by calling the reduceByKey transformation to make sure that each item with the same key
belongs to the same partition. Recall that RDDs are distributed, so to ensure that the data is grouped by key,
Spark generates a shuffle operation to exchange information between the different nodes.
MapPartitionsRDD: an RDD created from the map and filter transformations.

Immutability
PySpark RDDs are immutable in nature, meaning that once an RDD is created you cannot modify it. When we apply
transformations on an RDD, PySpark creates a new RDD and maintains the RDD lineage.

Fault Tolerance
PySpark operates on fault-tolerant data stores such as HDFS and S3, so if any RDD operation fails, it automatically
reloads the data from other partitions. Also, when PySpark applications run on a cluster, PySpark task failures are
automatically recovered a certain number of times (as per the configuration) so the application finishes seamlessly.
#Create RDD from parallelize
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd = spark.sparkContext.parallelize(data)

#Reads entire file into a RDD as single record.
rdd3 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")

# Creates empty RDD with no partition
rdd = spark.sparkContext.emptyRDD()
# rddString = spark.sparkContext.emptyRDD[String]  (Scala equivalent)

RDD Cache
The PySpark RDD cache() method by default saves the RDD computation at storage level MEMORY_ONLY, meaning it will
store the data in the JVM heap as unserialized objects.

RDD Persist
The PySpark persist() method is used to store the RDD at one of the storage levels MEMORY_ONLY, MEMORY_AND_DISK,
MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2.

Shared Variables
Broadcast variables (read-only shared variables)
Accumulator variables (updatable shared variables)

Broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster in
order to be accessed or used by the tasks. Instead of sending this data along with every task, PySpark distributes
broadcast variables to the machines using efficient broadcast algorithms to reduce communication costs.

Accumulators
PySpark Accumulators are another type of shared variable that are only "added" through an associative and
commutative operation and are used to perform counters (similar to MapReduce counters) or sum operations.
PySpark by default supports creating an accumulator of any numeric type and provides the capability to add
custom accumulator types.
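A small sketch of both shared-variable types; the lookup data and values are made up for illustration.

# Broadcast variable: a small lookup table cached on every node
states = {"NY": "New York", "CA": "California"}
broadcast_states = spark.sparkContext.broadcast(states)

rdd = spark.sparkContext.parallelize([("James", "NY"), ("Anna", "CA")])
full_names = rdd.map(lambda x: (x[0], broadcast_states.value[x[1]])).collect()
print(full_names)   # [('James', 'New York'), ('Anna', 'California')]

# Accumulator: a counter that tasks add to and the driver reads
acc = spark.sparkContext.accumulator(0)
spark.sparkContext.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))
print(acc.value)    # 10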
What is a DataFrame?

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can
think of a DataFrame like a spreadsheet, a SQL table, or a dictionary of series objects.
Apache Spark DataFrames are an abstraction built on top of Resilient Distributed Datasets (RDDs).
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational
database or a data frame in R/Python, but with richer optimizations under the hood.

Create a DataFrame

data = [[1, "Elia"], [2, "Teo"], [3, "Fang"]]
df = spark.createDataFrame(data, schema="id LONG, name STRING")

Load data into a DataFrame from files

df = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
)

Combine DataFrames with join and union

joined_df = df1.join(df2, how="inner", on="id")
unioned_df = df1.union(df2)

Filter rows in a DataFrame

filtered_df = df.filter("id > 1")
filtered_df = df.where("id > 1")

Select columns from a DataFrame

subset_df = df.filter("id > 1").select("name")

Save a DataFrame to a table

df.write.saveAsTable("<table_name>")
df.write.format("json").save("/tmp/json_data")
Dataset
A Spark Dataset is a distributed collection of typed objects, which are partitioned across
multiple nodes in a cluster and can be operated on in parallel.

There are two types of operations you can perform on a Dataset:

transformations: create a new Dataset from the current Dataset


actions: trigger computation and return a result to the driver program

Datasets and Type Safety

Datasets are composed of typed objects, which means that transformation syntax errors (like a typo in the
method name) and analysis errors (like an incorrect input variable type) can be caught at compile time.
Here is a list of some commonly used typed transformations, which can be used on Datasets of typed objects
(Dataset[T]).
map
Returns new Dataset with result of applying input function to each element
filter
Returns new Dataset containing elements where input function is true
groupByKey
Returns a KeyValueGroupedDataset where the data is grouped by the given key function
The entry point to programming in Spark is the org.apache.spark.sql.SparkSession class, which you use to create a
SparkSession object as shown below:

val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
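In PySpark notebooks on Databricks a SparkSession named spark is already provided, but the equivalent construction in Python looks like the following sketch:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("example")
         .master("local[*]")
         .getOrCreate())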
Shuffle operations
 Certain operations within Spark trigger an event known as the shuffle.
 The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions.
 This typically involves copying data across executors and machines, making the shuffle a complex and costly
operation.
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific
operation. During computations, a single task will operate on a single partition - thus, to organize all the data
for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read
from all partitions to find all the values for all keys, and then bring together values across partitions to
compute the final result for each key - this is called the shuffle.
Operations which can cause a shuffle include repartition operations like repartition and coalesce, ByKey operations
(except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.

Performance Impact
The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O.
Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data
structures to organize records before or after transferring them.
When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O
and increased garbage collection.
Shuffle also generates a large number of intermediate files on disk.
Spark Default Shuffle Partition
Spark automatically increases the number of partitions to 200 when a DataFrame operation performs data shuffling
(join(), aggregation functions). This default shuffle partition number comes from the Spark SQL configuration
spark.sql.shuffle.partitions, which is set to 200 by default.

You can change this default shuffle partition value using conf method of the SparkSession object

spark.conf.set("spark.sql.shuffle.partitions",100)

Shuffle partition size


Based on your dataset size, number of cores, and memory, Spark shuffling can benefit or harm your jobs. When you
are dealing with a small amount of data, you should typically reduce the number of shuffle partitions; otherwise you
will end up with many partitioned files with few records in each partition, which results in running many tasks with
little data to process.

On the other hand, when you have too much data and too few partitions, this results in fewer, longer-running
tasks, and sometimes you may also get out-of-memory errors.

Getting the right size of the shuffle partition is always tricky and takes many runs with different values to achieve the
optimized number. This is one of the key properties to look for when you have performance issues on Spark jobs.
Spark Partitioning & Partition Understanding
Spark/PySpark partitioning is a way to split the data into multiple partitions so that you can execute
transformations on multiple partitions in parallel which allows completing the job faster.
You can also write partitioned data into a file system (multiple sub-directories) for faster reads by downstream
systems
Repartitioning with coalesce function
There are two functions you can use in Spark to repartition data and coalesce is one of them.

This function is defined as the following

def coalesce(numPartitions)
Returns a new :class:DataFrame that has exactly numPartitions partitions.

Similar to coalesce defined on an :class:RDD, this operation results in a narrow dependency, e.g. if you go from 1000
partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current
partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.
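A quick sketch of coalesce reducing the partition count without a shuffle:

df = spark.range(0, 1000, numPartitions=8)   # start with 8 partitions
print(df.rdd.getNumPartitions())             # 8
coalesced = df.coalesce(2)                   # narrow dependency, no shuffle
print(coalesced.rdd.getNumPartitions())      # 2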

Repartitioning with repartition function


The other method for repartitioning is repartition. It’s defined as the follows:

def repartition(numPartitions, *cols)


Returns a new :class:DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash
partitioned.

numPartitions can be an int to specify the target number of partitions or a Column. If it is a Column, it will be used as the
first partitioning column. If not specified, the default number of partitions is used.

Added optional arguments to specify the partitioning columns. Also made numPartitions
optional if partitioning columns are specified.

Repartition by number
Use the following code to repartition the data to 10 partitions.
df = df.repartition(10)
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv",
header=True)
Spark will try to evenly distribute the data to each partition. If the total partition number is greater than the actual
record count (or RDD size), some partitions will be empty.

After we run the above code, the data will be reshuffled into 10 partitions, with 10 sharded files generated.

If we repartition the data frame to 1000 partitions, how many sharded files will be generated?

The answer is 100, because the other 900 partitions are empty and each file has one record.
Repartition by column
We can also repartition by columns.

For example, let’s run the following code to repartition the data by column Country.
df = df.repartition("Country")
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)

Partition by multiple columns

from pyspark.sql.functions import year, month, dayofmonth

df = df.withColumn("Year", year("Date")).withColumn(
    "Month", month("Date")).withColumn("Day", dayofmonth("Date"))
df = df.repartition("Year", "Month", "Day", "Country")
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)
•Narrow transformation — In a narrow transformation, all the elements that are required to compute the records
in a single partition live in the single partition of the parent RDD. A limited subset of partitions is used to calculate
the result. Narrow transformations are the result of map() and filter().
•Wide transformation — In a wide transformation, all the elements that are required to compute the records in a
single partition may live in many partitions of the parent RDD. Wide transformations are the result of
groupByKey() and reduceByKey().

groupByKey vs reduceByKey

groupByKey syntax:

sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupByKey()
  .map((x, y) => (x, sum(y)))

groupByKey can cause out-of-disk problems as data is sent over the network and collected on the reduce workers.

reduceByKey syntax:

sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => (x + y))

Data is combined at each partition, with only one output for one key at each partition to send over the network.
reduceByKey requires combining all your values into another value with the exact same type.
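The same word-count pattern in PySpark, as a minimal sketch (the input path is a placeholder):

rdd = spark.sparkContext.textFile("/tmp/words.txt")
counts = (rdd.flatMap(lambda line: line.split(" "))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda x, y: x + y))   # values are pre-combined per partition before the shuffle
print(counts.collect())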
map() – Spark map() transformation applies a function to each row in a DataFrame/Dataset and returns the
new transformed Dataset.

flatMap() – Spark flatMap() transformation flattens the DataFrame/Dataset after applying the function on
every element and returns a new transformed Dataset. The returned Dataset can contain more rows than the
current DataFrame.
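A short illustration of the difference on an RDD:

rdd = spark.sparkContext.parallelize(["hello world", "spark is fast"])

mapped = rdd.map(lambda line: line.split(" "))     # one output element per input element
flat = rdd.flatMap(lambda line: line.split(" "))   # output is flattened into individual words

print(mapped.collect())   # [['hello', 'world'], ['spark', 'is', 'fast']]
print(flat.collect())     # ['hello', 'world', 'spark', 'is', 'fast']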
While both of these functions will produce the correct answer, the reduceByKey example works much better on a
large dataset.
That's because Spark knows it can combine output with a common key on each partition before shuffling the
data.
Look at the diagram below to understand what happens with reduceByKey.
Notice how pairs on the same machine with the same key are combined (by using the lambda function passed
into reduceByKey) before the data is shuffled.
Then the lambda function is called again to reduce all the values from each partition to produce one final result.
On the other hand, when calling groupByKey, all the key-value pairs are shuffled around.
This is a lot of unnecessary data being transferred over the network.
To determine which machine to shuffle a pair to, Spark calls a partitioning function on the key of the pair.
Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory.
However, it flushes out the data to disk one key at a time, so if a single key has more key-value pairs than can fit in memory,
an out-of-memory exception occurs. This will be more gracefully handled in a later release of Spark so the job can still proceed,
but it should still be avoided: when Spark needs to spill to disk, performance is severely impacted.
Day-6
Frequently Used Data Frame API Methods

1. DataFrame.agg(*exprs) – Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
2. DataFrame.collect() – Returns all the records as a list of Row.
3. DataFrame.columns – Returns all column names as a list.
4. DataFrame.count() – Returns the number of rows in this DataFrame.
5. DataFrame.createGlobalTempView(name) – Creates a global temporary view with this DataFrame.
6. DataFrame.createOrReplaceTempView(name) – Creates or replaces a local temporary view with this DataFrame.
7. DataFrame.distinct() – Returns a new DataFrame containing the distinct rows in this DataFrame.
8. DataFrame.drop(*cols) – Returns a new DataFrame that drops the specified column.
9. DataFrame.dropDuplicates([subset]) – Returns a new DataFrame with duplicate rows removed, optionally only considering certain columns.
10. DataFrame.drop_duplicates([subset]) – drop_duplicates() is an alias for dropDuplicates().
11. DataFrame.dropna([how, thresh, subset]) – Returns a new DataFrame omitting rows with null values.
12. DataFrame.fillna(value[, subset]) – Replaces null values, alias for na.fill().
13. DataFrame.filter(condition) – Filters rows using the given condition.
14. DataFrame.first() – Returns the first row as a Row.
15. DataFrame.foreach(f) – Applies the f function to all Rows of this DataFrame.
16. DataFrame.isEmpty() – Returns True if this DataFrame is empty.
17. DataFrame.join(other[, on, how]) – Joins with another DataFrame, using the given join expression.
18. DataFrame.replace(to_replace[, value, subset]) – Returns a new DataFrame replacing a value with another value.
19. DataFrame.select(*cols) – Projects a set of expressions and returns a new DataFrame.
20. DataFrame.selectExpr(*expr) – Projects a set of SQL expressions and returns a new DataFrame.
21. DataFrame.tail(num) – Returns the last num rows as a list of Row.
22. DataFrame.toDF(*cols) – Returns a new DataFrame with the new specified column names.
23. DataFrame.take(num) – Returns the first num rows as a list of Row.
24. DataFrame.toPandas() – Returns the contents of this DataFrame as a pandas.DataFrame.
25. DataFrame.union(other) – Returns a new DataFrame containing the union of rows in this and another DataFrame.
26. DataFrame.unionAll(other) – Returns a new DataFrame containing the union of rows in this and another DataFrame.
27. DataFrame.unionByName(other[, …]) – Returns a new DataFrame containing the union of rows in this and another DataFrame.
28. DataFrame.where(condition) – where() is an alias for filter().
29. DataFrame.withColumn(colName, col) – Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
30. DataFrame.withColumnRenamed(existing, new) – Returns a new DataFrame by renaming an existing column.
31. DataFrame.withMetadata(columnName, metadata) – Returns a new DataFrame by updating an existing column with metadata.
32. DataFrame.dtypes – Returns all column names and their data types as a list.
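A small sketch chaining a few of the methods above on a toy DataFrame:

df = spark.createDataFrame(
    [(1, "Elia", "NY"), (2, "Teo", "CA"), (2, "Teo", "CA")],
    schema="id LONG, name STRING, state STRING")

result = (df.dropDuplicates()
            .filter("id > 1")
            .withColumnRenamed("state", "addr_state")
            .select("name", "addr_state"))

result.show()
print(result.count(), result.columns, result.dtypes)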
Difference between a pandas DataFrame and a PySpark DataFrame

Spark DataFrame supports parallelization; Pandas DataFrame does not support parallelization.
Spark DataFrame runs on multiple nodes; Pandas DataFrame runs on a single node.
Spark follows lazy execution, which means that a task is not executed until an action is performed; Pandas follows
eager execution, which means a task is executed immediately.
Spark DataFrame is immutable; Pandas DataFrame is mutable.
Complex operations are more difficult to perform in a Spark DataFrame than in a Pandas DataFrame.
Spark DataFrame is distributed, so processing is faster for a large amount of data; Pandas DataFrame is not
distributed, so processing is slower for a large amount of data.
sparkDataFrame.count() returns the number of rows; pandasDataFrame.count() returns the number of
non-NA/null observations for each column.
Spark DataFrames are excellent for building a scalable application; Pandas DataFrames can't be used to build a
scalable application.
Spark DataFrame assures fault tolerance; Pandas DataFrame does not assure fault tolerance, and we need to
implement our own framework to assure it.
Convert PySpark DataFrames to and from pandas DataFrames

Arrow is available as an optimization when converting a PySpark DataFrame to a pandas DataFrame with
toPandas() and when creating a PySpark DataFrame from a pandas DataFrame with createDataFrame(pandas_df).

import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()
User Defined Functions:

UDFs are used to extend the functions of the framework and re-use these functions on multiple DataFrames.
For example, suppose you wanted to convert the first letter of every word in a name string to upper case; PySpark's built-in
features don't have this function, so you can create it as a UDF and reuse it as needed on many DataFrames.

1. Create a dataframe
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

columns = ["Seqno","Name"]
data = [("1", "john jones"),
("2", "tracey smith"),
("3", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df.show(truncate=False)
2. Create a Python Function
Create a function convertCase() which takes a string parameter and converts the first letter of every word to a capital
letter. UDFs take parameters of your choice and return a value.

def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr

3. Convert a Python function to PySpark UDF

from pyspark.sql.functions import col, udf


from pyspark.sql.types import StringType

# Converting function to UDF


convertUDF = udf(lambda z: convertCase(z),StringType())

Note: The default type of the udf() is StringType hence, you can also write the above statement without
return type.
4. Using UDF with DataFrame

df.select(col("Seqno"), \
convertUDF(col("Name")).alias("Name") ) \
.show(truncate=False)

Using UDF with PySpark DataFrame withColumn()

def upperCase(str):
    return str.upper()

upperCaseUDF = udf(lambda z: upperCase(z), StringType())

df.withColumn("Curated Name", upperCaseUDF(col("Name"))) \
  .show(truncate=False)
Registering PySpark UDF & use it on SQL

In order to use convertCase() function on PySpark SQL, you need to register the function
with PySpark by using spark.udf.register()

""" Using UDF on SQL """


spark.udf.register("convertUDF", convertCase,StringType())
df.createOrReplaceTempView("NAME_TABLE")
spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE") \
.show(truncate=False)

Creating UDF using annotation

@udf(returnType=StringType())
def upperCase(str):
    return str.upper()

df.withColumn("Curated Name", upperCase(col("Name"))) \
  .show(truncate=False)
Day-7
What data formats can you use in Azure Databricks?
Delta Lake
Parquet
ORC
JSON
CSV
Avro
Text
Binary
What is Delta Lake?
 Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID
transactions and scalable metadata handling.
 Delta Lake is the default storage format for all operations on Azure Databricks. Unless otherwise specified,
all tables on Azure Databricks are Delta tables.
 All tables on Azure Databricks are Delta tables by default. Whether you’re using Apache
Spark DataFrames or SQL, you get all the benefits of Delta Lake just by saving your data to the lakehouse
with default settings.
What is Delta Lake?
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID
transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on
top of your existing data lake and is fully compatible with Apache Spark APIs.
What format does Delta Lake use to store data?
Delta Lake uses versioned Parquet files to store your data in your cloud storage. Apart from the versions, Delta Lake also stores
a transaction log to keep track of all the commits made to the table or blob store directory to provide ACID transactions.
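As a minimal sketch (the path is a placeholder), saving a DataFrame in Delta format and reading it back looks like this:

# Save a DataFrame as a Delta table at a path, then read it back
df = spark.range(0, 5)
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

delta_df = spark.read.format("delta").load("/tmp/delta/events")
delta_df.show()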
WHAT IS PARQUET?
Parquet is an open source file format built to handle flat columnar storage data formats. Parquet operates well with
complex data in large volumes. It is known both for its performant data compression and for its ability to handle a wide
variety of encoding types.
o Fast queries that can fetch specific column values without reading full row data.
o Highly efficient column-wise compression.
HOW IS PARQUET DIFFERENT FROM CSV?
While CSV is simple and the most widely used data format (Excel, Google Sheets), there are several distinct advantages for
Parquet, including:

Parquet is column oriented and CSV is row oriented. Row-oriented formats are optimized for OLTP workloads while column-
oriented formats are better suited for analytical workloads.

Column-oriented databases such as AWS Redshift Spectrum bill by the amount of data scanned per query

Therefore, converting CSV to Parquet with partitioning and compression lowers overall costs and improves performance

Parquet has helped its users reduce storage requirements by at least one-third on large datasets, in addition, it greatly
improves scan and deserialization time, hence the overall costs.
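A brief sketch of writing and reading Parquet with PySpark (paths and partition column are illustrative):

# Write a DataFrame as Parquet, partitioned by a column, then read only one column back
df = spark.createDataFrame([(1, "NY"), (2, "CA")], schema="id LONG, state STRING")
df.write.mode("overwrite").partitionBy("state").parquet("/tmp/people_parquet")

parquet_df = spark.read.parquet("/tmp/people_parquet")
parquet_df.select("id").show()   # column pruning: only the id column is read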
Parquet file format consists of 2 parts –

1. Data

2. Metadata.
Data is written first in the file and the metadata is written at the end to allow for single pass writing. Let’s see
the parquet file format first and then lets us have a look at the metadata.

File Format -

A sample parquet file format is as below -


At a high level, the parquet file consists of header, one or more blocks and footer.
Each block in the parquet file is stored in the form of row groups. So, data in a parquet file is
partitioned into multiple row groups. These row groups in turn consists of one or more column
chunks which corresponds to a column in the dataset. The data for each column chunk is then
written in the form of pages. Each page contains values for a particular column only, hence pages
are very good candidates for compression as they contain similar values.
What is a columnar storage format?
In order to understand the Parquet file format in Hadoop better, first, let’s see what a columnar format is. In a column-oriented
format, the values of each column of the same type in the records are stored together

For example, if there is a record comprising ID, employee Name, and Department, then all
the values for the ID column will be stored together, values for the Name column together, and so on. If
we take the same record schema as mentioned above, having three fields ID (int), NAME (varchar), and
Department (varchar), the table will look something like this:

For this table, the data in a row-wise storage format will be stored as follows:

Whereas, the same data in a column-oriented storage format will look like this:

The columnar storage format is more efficient when you need to query a few columns from a table. It will read
only the required columns, since they are adjacent, thus minimizing IO.
Databricks uses Delta Lake by default for all reads and writes and builds upon the ACID guarantees provided
by the open source Delta Lake protocol.
ACID stands for atomicity, consistency, isolation, and durability.

Atomicity means that all transactions either succeed or fail completely.

Consistency guarantees relate to how a given state of the data is observed by simultaneous operations.

Isolation refers to how simultaneous operations potentially conflict with one another.

Durability means that committed changes are permanent

Converting and ingesting data to Delta Lake


The COPY INTO SQL command lets you load data from a file location into a Delta table. This is a re-triable and
idempotent operation; files in the source location that have already been loaded are skipped.
CREATE TABLE IF NOT EXISTS my_table
[COMMENT <table_description>]
[TBLPROPERTIES (<table_properties>)];

COPY INTO my_table


FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
table_name = 'default.loan_risks_upload'
source_data = '/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet'
source_format = 'PARQUET'

spark.sql("DROP TABLE IF EXISTS " + table_name)

spark.sql("CREATE TABLE " + table_name + " (" \


"loan_id BIGINT, " + \
"funded_amnt INT, " + \
"paid_amnt DOUBLE, " + \
"addr_state STRING)"
)

spark.sql("COPY INTO " + table_name + \


" FROM '" + source_data + "'" + \
" FILEFORMAT = " + source_format
)

loan_risks_upload_data = spark.sql("SELECT * FROM " + table_name)

display(loan_risks_upload_data)

'''
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
|0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
|1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
|2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
'''
What is Auto Loader?
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional
setup.
How does Auto Loader work?
 Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage.
 Auto Loader can load data files from AWS S3 (s3://), Azure Data Lake Storage Gen2 (ADLS Gen2, abfss://), Google
Cloud Storage (GCS, gs://), Azure Blob Storage (wasbs://), ADLS Gen1 (adl://), and Databricks File System (DBFS,
dbfs:/). Auto Loader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
 Databricks recommends Auto Loader in Delta Live Tables for incremental data ingestion.
 Databricks recommends Auto Loader whenever you use Apache Spark Structured Streaming to ingest data from
cloud object storage. APIs are available in Python and Scala.
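A minimal Auto Loader sketch using Structured Streaming; the container, storage account, checkpoint path, and target table name are placeholders, and the availableNow trigger assumes a recent Databricks Runtime.

# Incrementally ingest new JSON files from ADLS Gen2 into a Delta table
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/landing/"))

(df.writeStream
   .option("checkpointLocation", "/tmp/checkpoints/autoloader_demo")
   .trigger(availableNow=True)
   .toTable("my_bronze_table"))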
What is Auto Loader directory listing mode?

Auto Loader uses directory listing mode by default. In directory listing mode, Auto Loader identifies new files by
listing the input directory. Directory listing mode allows you to quickly start Auto Loader streams without any
permission configurations other than access to your data on cloud storage.
How does directory listing mode work?
For example, if you have files being uploaded every 5 minutes as /some/path/YYYY/MM/DD/HH/fileName, to find all the
files in these directories, the Apache Spark file source lists all subdirectories in parallel. The following algorithm estimates
the total number of API LIST directory calls to object storage:

1 (base directory) + 365 (per day) * 24 (per hour) = 8761 calls


What is the Need for Databricks Delta Lakes?
Organizations collect large amounts of data from different sources that can be — schema-based, schema-less, or
streaming data. Such large volumes of data can be stored either in a data warehouse or data lake. Companies are often in
a dilemma while selecting appropriate data storage tools for storing incoming data and then streamlining the flow of data
for analysis. However, Databricks fuses the performance of data warehouses and the affordability of data lakes in a single
Cloud-based repository called Lake House. The Lake House (Data Lake + Data Warehouse) Architecture built on top of the
data lake is called Delta Lake. Below are a few aspects that describe the need for Databricks’ Delta Lake:

 It is an open format storage layer that delivers reliability, security, and performance on your Data
Lake for both streaming and batch operations.
 It not only houses structured, semi-structured, and unstructured data but also provides Low-cost
Data Management solutions.
 Databricks Delta Lake also handles ACID (Atomicity, Consistency, Isolation, and Durability)
transactions, scalable metadata handling, and data processing on existing data lakes.
 Query Performance
As the data grows exponentially over time, query performance becomes a crucial factor. Delta improves
performance from 10 to 100 times compared to Apache Spark on the Parquet (not human-readable)
file format. Below are some techniques that assist in improving the performance:

Indexing: Databricks Delta creates and maintains Indexes on the tables to arrange queried data.
Skipping: Databricks Delta helps maintain file statistics so that only relevant portions of the data are read.
Compression: Databricks Delta consumes less memory space by efficiently managing Parquet files to optimize
queries.
Caching: Databricks Delta automatically caches highly accessed data to improve run times for commonly run
queries.

 Optimize Layout
Delta optimizes table file layout with the built-in OPTIMIZE command. End users can optimize just the portions of
the Databricks Delta Table that are most relevant instead of the entire table. This reduces the overhead cost of
storing metadata and can help speed up queries.
 System Complexity
System Complexity increases the effort required to complete data-related tasks, making it difficult while
responding to any changes. With Delta, organizations solve system complexities by:

 Providing a flexible data analytics architecture that can respond to changes.
 Writing batch and streaming data into the same table.
 Allowing a simpler architecture and quicker data ingestion to query results.
 Inferring schemas for incoming data, which reduces the effort required to manage schema changes.

 Time Travel
Time travel allows users to roll back in case of bad writes. Some Data Scientists run models on datasets for a
specific time, and this ability to reference previous versions becomes useful for Temporal Data Management.
A user can query Delta Tables for a specific timestamp because any change in Databricks Delta Table creates
new table versions. These tasks help data pipelines to audit, roll back accidental deletes, or reproduce
experiments and reports.
What is Databricks Delta Table?
A Databricks Delta Table records version changes and modifications made to a table in Delta Lake. Unlike
traditional tables that simply store data in rows and columns, a Databricks Delta Table also maintains metadata
that enables ACID transactions and time travel, allowing quicker and safer data ingestion. Data in a Databricks
Delta Table is stored as Parquet files with a transactional layer encoded on top.

Existing Parquet data files and transaction logs can be converted from 'Parquet' to 'Delta' format to reduce custom
coding around the Databricks Delta Table. Delta also provides advanced features such as a history of events and
more flexibility in changing content (update, delete, and merge operations) while avoiding duplication.
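A hedged sketch of that conversion; the Parquet directory path below is an illustrative placeholder:

from delta.tables import DeltaTable

# Convert an existing Parquet directory to Delta format in place (path is a placeholder)
DeltaTable.convertToDelta(spark, "parquet.`/data/loans_parquet`")

# SQL equivalent:
# CONVERT TO DELTA parquet.`/data/loans_parquet`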

Every transaction performed on a Delta Lake table is recorded in an ordered transaction log called the DeltaLog.
Whenever a user performs a modification operation on a table, Delta Lake breaks the operation into discrete steps
of one or more actions, which allows multiple readers and writers to work on a given Databricks Delta Table at the
same time. These actions are recorded in the ordered transaction log as commits. For instance, if a user creates a
transaction to add a new column to a Databricks Delta Table while also adding some more data, Delta Lake breaks
that transaction into its constituent parts.

Once the transaction is completed on the Databricks Delta Table, the corresponding actions are added to the
transaction log as commits such as the following (a sketch for inspecting this history appears after the list):

Update Metadata: changes the schema to include the new column in the Databricks Delta Table.
Add File: adds the new data files to the Databricks Delta Table.
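A small sketch for inspecting those commits, reusing the placeholder table path from the other examples in this document:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

# Each row returned by history() corresponds to one commit recorded in the DeltaLog
deltaTable.history().select("version", "timestamp", "operation").show(truncate=False)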
Delta is made of many components:

 Parquet data files, organized into partitions or not
 JSON files that form the transaction log
 Checkpoint files

You can restore a Delta table to an older version of the table specified either by a version number or by a
timestamp. The timestamp can be of the format yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

# restore to an earlier version number
deltaTable.restoreToVersion(1)

# restore to an earlier point in time
deltaTable.restoreToTimestamp('2021-01-01')
deltaTable.restoreToTimestamp('2021-01-01 01:01:01')

-- SQL equivalent
RESTORE TABLE your_table TO VERSION AS OF 1
How to Rollback a Delta Lake Table to a Previous Version with Restore
Delta Lake makes it easy to access different versions of your data. For example, you can time travel back to version 0
of your Delta Lake table to see the original data that was stored when you created it. During time travel we are
loading the table up to some version - in this case, we’re loading up to the initial version.
spark.read.format("delta").option("versionAsOf", "0").load("/tmp/delta-table").show()

+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+

spark.read.format("delta").option("versionAsOf", "1").load("/tmp/delta-table").show()

+---+
| id|
+---+
| 4|
| 5|
+---+

You can roll back a Delta Lake table to any previous version with the restoreToVersion command in PySpark:

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")


deltaTable.restoreToVersion(1)
Note, however, that this change does not erase version 2; instead, a metadata-only operation is
performed in which the changes in version 2 are undone. This means you can still time travel to
versions 0, 1, or 2 of the Delta Lake table, even after running the restoreToVersion command. Let’s
time travel to version 2 of the Delta Lake table to demonstrate the data is preserved:
spark.read.format("delta").option("versionAsOf", "2").load("/tmp/delta-table").show()

+---+
| id|
+---+
| 7|
| 8|
| 9|
+---+
Using the restore command resets the table’s content to an earlier version, but doesn’t remove any data. It
simply updates the transaction log to indicate that certain files should not be read.
Delta Lake restore after vacuum
 To completely remove a later version of the data after restoring to a previous version, you need
to run the Delta Lake vacuum command.
 vacuum is a widely used command that removes files that are not needed by the latest version
of the table. Running vacuum doesn’t make your Delta Lake operations any faster, but it
removes files on disk, which reduces storage costs.

deltaTable.vacuum(retentionHours=0)  # retaining less than the default 7 days requires spark.databricks.delta.retentionDurationCheck.enabled=false

deltaTable.vacuum() # vacuum files not required by versions more than 7 days old
deltaTable.vacuum(100) # vacuum files not required by versions more than 100 hours old

As expected, reading the contents still returns the table’s data as of version 1:

spark.read.format("delta").load("/tmp/delta-table").show()

+---+
| id|
+---+
| 4|
| 5|
+---+
Optimize performance with file management

To improve query speed, Delta Lake supports the ability to optimize the layout of data in storage. There are various
ways to optimize the layout.

Compaction (bin-packing)
Delta Lake can improve the speed of read queries from a table by coalescing small files into larger ones.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable) # For path-based tables


# For Hive metastore-based tables: deltaTable = DeltaTable.forName(spark, tableName)

deltaTable.optimize().executeCompaction()

# If you have a large amount of data and only want to optimize a subset of it,
# you can specify an optional partition predicate using `where`:
deltaTable.optimize().where("date='2021-11-18'").executeCompaction()

 Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no
effect.
 Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number
of tuples per file. However, the two measures are most often correlated.
OPTIMIZE makes no data-related changes to the table, so a read before and after an OPTIMIZE returns the same results.
Performing OPTIMIZE on a table that is a streaming source does not affect any current or future streams that treat this
table as a source.
OPTIMIZE returns the file statistics (min, max, total, and so on) for the files removed and the files added by the operation.

Z-Ordering (multi-dimensional clustering)


Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically
used by Delta Lake in data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta
Lake on Apache Spark needs to read.
To Z-Order data, you specify the columns to order on in the ZORDER BY clause:
from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable) # path-based table


# For Hive metastore-based tables: deltaTable = DeltaTable.forName(spark, tableName)

deltaTable.optimize().executeZOrderBy("eventType")  # pass the Z-order column name(s) as strings

# If you have a large amount of data and only want to optimize a subset of it,
# you can specify an optional partition predicate using `where`:
deltaTable.optimize().where("date='2021-11-18'").executeZOrderBy("eventType")
Z-Ordering is not idempotent. Every time Z-Ordering is executed, it will try to create a new clustering of data
across all files (new files and existing files that were part of a previous Z-Ordering) in a partition.

Z-Ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not
necessarily data size on disk. The two measures are most often correlated, but there can be situations when
that is not the case, leading to skew in optimize task times.

Multi-part checkpointing

 A Delta Lake table periodically and automatically compacts all the incremental updates to the Delta log into a
Parquet file.
 This “checkpointing” allows read queries to quickly reconstruct the current state of the table (that is, which
files to process, what is the current schema) without reading too many files having incremental updates.

 Delta Lake protocol allows splitting the checkpoint into multiple Parquet files. This parallelizes and speeds up
writing the checkpoint
 In Delta Lake, by default each checkpoint is written as a single Parquet file.
 To use this feature, set the SQL configuration spark.databricks.delta.checkpoint.partSize=<n>, where n is the
limit on the number of actions (such as AddFile) at which Delta Lake on Apache Spark starts parallelizing the
checkpoint and attempts to write a maximum of this many actions per checkpoint file. A one-line example follows below.
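As an illustration only (the value 10000 is an arbitrary placeholder, not a recommendation), the setting can be applied from a notebook session:

# Enable multi-part checkpoints once roughly this many actions accumulate per checkpoint part
spark.conf.set("spark.databricks.delta.checkpoint.partSize", "10000")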
Azure data factory - Dynamically add timestamp in copied
filename

Problem Statement:
Azure Data Factory is copying files to the target folder, and I need the copied files to have the current timestamp in their names.

Example:
SourceFolder has files --> File1.txt, File2.txt and so on
TargetFolder should have copied files with the names --> File1_2019-11-01.txt, File2_2019-11-01.txt and so on.

1. Create a Source dataset that points to Source folder which has files to be copied.

In Parameters tab - Define a parameter named - "Filename"


2. Create a Target dataset that points to Target folder where files will be copied

In Parameters tab - Define a parameter named - "TargetFilename"

3. Drag a Get Metadata activity onto the pipeline. This will give us the names of the files with a .txt extension in the source folder.

Add an argument - Child Items - to retrieve the details of the files under the Source folder. A sketch of the remaining steps follows below.
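As a hedged sketch of the remaining steps (activity and parameter names here are illustrative, not prescribed): a ForEach activity can iterate over the Get Metadata activity's output.childItems, and inside it a Copy activity can pass item().name to the source dataset's Filename parameter and a dynamic expression such as @concat(replace(item().name, '.txt', ''), '_', formatDateTime(utcNow(), 'yyyy-MM-dd'), '.txt') to the target dataset's TargetFilename parameter, producing names like File1_2019-11-01.txt.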
Updating and modifying Delta Lake tables

 Delta Lake supports upserts using the merge operation.


 Delta Lake provides numerous options for selective overwrites based on filters and partitions.
 You can manually or automatically update your table schema without rewriting data.
 Column mapping enables columns to be renamed or deleted without rewriting data.

Upsert into a Delta Lake table using merge

You can upsert data from a source table, view, or DataFrame into a target Delta table by using the MERGE
SQL operation. Delta Lake supports inserts, updates and deletes in MERGE, and it supports extended
syntax beyond the SQL standards to facilitate advanced use cases.
To merge the new data, you want to update rows where the person’s id is already present and insert the
new rows where no matching id is present.

(Related note: the PySpark distinct() function drops duplicate rows across all columns of a DataFrame, while
dropDuplicates() drops duplicates based on one or more selected columns; de-duplicating the source before a MERGE
avoids errors from multiple source rows matching the same target row.)
MERGE INTO people10m
USING people10mupdates
ON people10m.id = people10mupdates.id
WHEN MATCHED THEN
UPDATE SET
id = people10mupdates.id,
firstName = people10mupdates.firstName,
middleName = people10mupdates.middleName,
lastName = people10mupdates.lastName,
gender = people10mupdates.gender,
birthDate = people10mupdates.birthDate,
ssn = people10mupdates.ssn,
salary = people10mupdates.salary
WHEN NOT MATCHED
THEN INSERT (
id,
firstName,
middleName,
lastName,
gender,
birthDate,
ssn,
salary
)
VALUES (
people10mupdates.id,
people10mupdates.firstName,
people10mupdates.middleName,
people10mupdates.lastName,
people10mupdates.gender,
people10mupdates.birthDate,
people10mupdates.ssn,
people10mupdates.salary
)
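The same upsert can also be expressed with the Python DeltaTable API. This is a minimal sketch assuming the people10m target table and the people10mupdates source used above are registered in the metastore:

from delta.tables import DeltaTable

people = DeltaTable.forName(spark, "people10m")     # target Delta table
updates = spark.table("people10mupdates")           # source of new and changed rows

(people.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()                         # update every column when the id matches
    .whenNotMatchedInsertAll()                      # insert rows whose id is not present
    .execute())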
What is a Data Warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It
usually contains historical data derived from transaction data, but it can include data from other sources.
What is Data Warehousing?
Data Warehousing (DW) is the process of collecting and managing data from varied sources to provide meaningful business
insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data
warehouse is the core of the BI system, which is built for data analysis and reporting.
Types of Data Warehouse
Three main types of Data Warehouses (DWH) are:
1. Enterprise Data Warehouse (EDW):

Enterprise Data Warehouse (EDW) is a centralized warehouse. It provides decision support services across the enterprise and
offers a unified approach to organizing and representing data. It also provides the ability to classify data according to
subject and to give access according to those divisions.

2. Operational Data Store:

An Operational Data Store, also called an ODS, is a data store that is required when neither the data warehouse nor the
OLTP systems support an organization's reporting needs. Unlike a data warehouse, an ODS is refreshed in real time, so it is
widely preferred for routine activities such as storing employee records.
3. Data Mart:

A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or
finance. In an independent data mart, data can be collected directly from sources.
Who needs Data warehouse?
DWH (Data warehouse) is needed for all types of users like:

 Decision makers who rely on large amounts of data.
 Users who use customized, complex processes to obtain information from multiple data sources.
 People who want a simple technology to access data.
 People who want a systematic approach to making decisions.
 Users who need fast performance on huge amounts of data for reports, grids, or charts.
 Anyone who wants to discover hidden patterns of data flows and groupings; a data warehouse is a useful first step.
What Is a Data Warehouse Used For?
Here are the most common sectors where a data warehouse is used:

Airline:
In the airline industry, a data warehouse is used for operational purposes such as crew assignment, route profitability analysis, and frequent flyer program promotions.

Banking:
It is widely used in the banking sector to manage resources effectively. Some banks also use it for market research and for analyzing the performance of products and operations.

Healthcare:
The healthcare sector uses data warehouses to strategize and predict outcomes, generate patient treatment reports, and share data with tie-in insurance companies, medical aid services, etc.

Public sector:
In the public sector, a data warehouse is used for intelligence gathering. It helps government agencies maintain and analyze tax records and health policy records for every individual.

Investment and Insurance sector:
In this sector, warehouses are primarily used to analyze data patterns and customer trends, and to track market movements.

Retail chain:
In retail chains, a data warehouse is widely used for distribution and marketing. It also helps to track items, customer buying patterns, and promotions, and is used for determining pricing policy.

Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions, and distribution decisions.

Hospitality industry:
This industry uses data warehouses to design and evaluate advertising and promotion campaigns, targeting clients based on their feedback and travel patterns.
What is OLAP?
Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data for business
decisions. OLAP systems allow users to analyze database information from multiple database systems at
one time.

The primary objective is data analysis and not data processing.

What is OLTP?
Online Transaction Processing (OLTP) supports transaction-oriented applications, typically in a 3-tier
architecture. OLTP administers the day-to-day transactions of an organization.
Difference between OLTP and OLAP

Below is the difference between OLAP and OLTP in Data Warehouse:

Process: OLTP is an online transactional system that manages database modifications, whereas OLAP is an online analysis and data retrieval process.
Characteristic: OLTP is characterized by large numbers of short online transactions, whereas OLAP is characterized by a large volume of data.
Functionality: OLTP is an online database modifying system, whereas OLAP is an online database query management system.
Method: OLTP uses a traditional DBMS, whereas OLAP uses the data warehouse.
What Are Facts and Dimensions in a Data Warehouse?

Facts and dimensions are the fundamental elements that define a data warehouse. They record relevant
events of a subject or functional area (facts) and the characteristics that define them (dimensions).

Key Difference between a Fact table and a Dimension table

 The fact table contains measurements, metrics, and facts about a business process, while the Dimension table is a
companion to the fact table that contains descriptive attributes used to constrain and label queries.
 The fact table is located at the center of a star or snowflake schema, whereas the Dimension table is located at
the edges of the star or snowflake schema.
 A fact table is defined by its grain or most atomic level, whereas a Dimension table should be wordy, descriptive,
complete, and of assured quality.
 The fact table stores the numeric measures that supply a report's values, whereas the Dimension table contains the detailed descriptive data used for labels and filters.
 The fact table does not contain a hierarchy, whereas the Dimension table contains hierarchies.
What is Fact Table?
A fact table is a primary table in a dimensional model.

A Fact Table contains

Measurements/facts
Foreign key to dimension table
What is a Dimension Table?
 A dimension table contains the dimensions of a fact.
 Dimension tables are joined to the fact table via a foreign key.
 Dimension tables are de-normalized tables.
 The dimension attributes are the various columns in a dimension table.
 Dimensions offer descriptive characteristics of the facts through their attributes.
 There is no set limit on the number of dimensions.
 A dimension can also contain one or more hierarchical relationships.

Facts and Dimensions Joined in a Star Schema


Dimension attributes supply the report's filters and labels, and fact tables supply the report's numeric values.
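For example, a typical star-schema query groups on dimension attributes and aggregates fact measures; the table and column names below are illustrative only:

-- Sum a fact measure, labeled and grouped by a dimension attribute
SELECT d.region,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_store d
  ON   f.store_key = d.store_key
GROUP BY d.region;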


Snowflake Schema in Data Warehouse Model
The snowflake schema is a variant of the star schema. Here, the centralized fact table is connected to
multiple dimensions. In the snowflake schema, dimensions are present in a normalized form in multiple
related tables.
Table Distribution Types

1. Hash distributed
A hash-distributed table distributes table rows across the Compute nodes by using a
deterministic hash function to assign each row to one distribution.

 Hash-distributed tables work well for large fact tables in a star schema.
 They can have very large numbers of rows and still achieve high performance.
Consider using a hash-distributed table when:

o The table size on disk is more than 2 GB.


o The table has frequent insert, update, and delete operations.
2. Round-robin distributed
 A round-robin distributed table distributes table rows evenly across all distributions.
 The assignment of rows to distributions is random.
 Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same
distribution.
Consider using the round-robin distribution for your table in the following scenarios:

o When getting started as a simple starting point since it is the default.


o If there is no obvious joining key.
o If there is no good candidate column for hash distributing the table.
o If the table does not share a common join key with other tables.
o If the join is less significant than other joins in the query.
o When the table is a temporary staging table

3. Replicated table
 A replicated table has a full copy of the table accessible on each Compute node.
 Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since
the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed.
 2 GB is not a hard limit. If the data is static and does not change, you can replicate larger tables.
Replicated tables work well for dimension tables in a star schema. Dimension tables are typically joined to fact tables,
which are distributed differently than the dimension table. Dimensions are usually of a size that makes it feasible to store
and maintain multiple copies.
Consider using a replicated table when:

The table size on disk is less than 2 GB, regardless of the number of rows. To find the size of a table, you can use the
DBCC PDW_SHOWSPACEUSED command: DBCC PDW_SHOWSPACEUSED('ReplTableCandidate').
The table is used in joins that would otherwise require data movement. When joining tables that are not distributed on
the same column, such as a hash-distributed table to a round-robin table, data movement is required to complete the
query. If one of the tables is small, consider a replicated table. We recommend using replicated tables instead of round-
robin tables in most cases. To view data movement operations in query plans, use sys.dm_pdw_request_steps. The
BroadcastMoveOperation is the typical data movement operation that can be eliminated by using a replicated table.
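A hedged sketch of the three distribution options as dedicated SQL pool DDL; the table and column names are illustrative only:

-- Large fact table: hash-distributed on a common join key
CREATE TABLE dbo.FactSales
(   SaleId    BIGINT NOT NULL,
    StoreKey  INT,
    Amount    DECIMAL(18,2) )
WITH ( DISTRIBUTION = HASH(StoreKey), CLUSTERED COLUMNSTORE INDEX );

-- Temporary staging table with no obvious join key: round-robin
CREATE TABLE dbo.StageSales
(   SaleId    BIGINT NOT NULL,
    StoreKey  INT,
    Amount    DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN );

-- Small dimension table (under 2 GB compressed): replicated to every Compute node
CREATE TABLE dbo.DimStore
(   StoreKey  INT NOT NULL,
    StoreName NVARCHAR(100),
    Region    NVARCHAR(50) )
WITH ( DISTRIBUTION = REPLICATE );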
