Databricks
https://www.youtube.com/@TRRaveendra
@TRRaveendra
Different Types of Architectures from 1980 to the Present
Data Warehouse
Big Data: Data Lake with Data Warehouses
A data lake is a centralized repository for storing, processing, and securing massive volumes of structured, semi-structured,
and unstructured data. It can store data in its native format and handle any type of data, regardless of size.
Disadvantages of a Data Lake
1. Appending data is hard.
Adding new data can produce incorrect reads.
2. Modifying existing data is hard.
GDPR/CCPA compliance requires fine-grained changes to data already in the lake.
3. Jobs that fail midway leave corrupted files.
Half of the data lands in the data lake while the other half is missing.
4. Real-time operation is hard.
Combining streaming and batch leads to inconsistency.
5. Keeping older data versions is expensive.
Yet regulated environments require reproducibility, auditing, and governance.
6. Vast amounts of metadata are hard to manage.
In very large data lakes, the metadata itself becomes challenging to handle.
7. "Too many files" problems.
Data lakes struggle to handle millions of small files.
8. Good performance is hard to achieve.
Partitioning data for performance is error-prone and difficult to change later.
9. Data quality problems.
Guaranteeing that all data is correct and of high quality is a continual challenge.
10. No ACID transactions.
Without ACID guarantees, insert, update, and delete operations are unreliable.
11. Extra warehouse and ETL processes are needed to load data from the data lake into a data warehouse,
to compensate for the missing ACID transactions and metadata catalog maintenance.
Modern Lakehouse Architecture
Data Warehouse + Data Lake = Lakehouse
Modern Lakehouse Features
Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines such as Spark.
Delta Lake provides ACID transactions, scalable metadata management, and unified streaming and batch data processing. Delta Lake is fully
compatible with Apache Spark APIs and runs on top of your existing data lake.
Lakehouse Architecture
Agenda
➢ Introduction
➢ Key Features
Introduction
➢ Delta Lake is an open source storage layer that brings reliability to data
lakes.
➢ Delta Lake runs on top of your existing data lake and is fully compatible
with Apache Spark APIs.
Why Delta?
➢ Challenges in implementing a data lake:
➢ Missing ACID properties.
➢ Lack of Schema enforcement.
➢ Lack of Consistency.
➢ Lack of Data Quality.
➢ Too many small files.
➢ Corrupted data due to frequent job failures in production
Challenges faced by most data lakes
Source: Databricks
Delta Lake Features
An open-source storage format that brings ACID transactions to Apache Spark™ and big data workloads.
➢ ACID Transactions: Ensures data integrity and read consistency with complex, concurrent data pipelines.
➢ Schema Enforcement and Evolution: Keeps data clean by blocking writes with an unexpected schema, while still allowing intentional schema changes.
➢ Audit History: History of all the operations that happened in the table.
➢ Time Travel: Query previous versions of the table by time or version number.
➢ Deletes and upserts: Supports deleting and upserting into tables with programmatic APIs.
➢ Scalable Metadata Management: Handles tables with millions of files by scaling metadata operations with Spark.
➢ Unified Batch and Streaming Source and Sink: A Delta Lake table is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
Bronze tables contain raw data ingested from various sources (JSON files, RDBMS data,
IoT data, etc.).
Silver tables will provide a more refined view of our data. We can join fields from
various bronze tables to enrich streaming records, or update account statuses based on
recent activity.
Gold tables provide business level aggregates often used for reporting and
dashboarding. This would include aggregations such as daily active website users,
weekly sales per store, or gross revenue per quarter by department.
The end outputs are actionable insights, dashboards, and reports of business metrics.
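As a rough sketch of this Bronze/Silver/Gold flow, assuming a Databricks notebook where `spark` is predefined and hypothetical names (`bronze_events`, `silver_events`, `gold_daily_active_users`, and columns `eventId`, `timestamp`, `userId`):

```python
from pyspark.sql import functions as F

# Bronze: ingest raw JSON as-is (path and table names are illustrative).
raw = spark.read.json("/mnt/raw/events/")
raw.write.format("delta").mode("append").saveAsTable("bronze_events")

# Silver: clean and enrich the raw records.
silver = (spark.table("bronze_events")
          .dropDuplicates(["eventId"])
          .withColumn("event_date", F.to_date("timestamp")))
silver.write.format("delta").mode("overwrite").saveAsTable("silver_events")

# Gold: business-level aggregate, e.g. daily active users.
gold = (spark.table("silver_events")
        .groupBy("event_date")
        .agg(F.countDistinct("userId").alias("daily_active_users")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_daily_active_users")
```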
How Delta Lake Works
➢ Delta Lake provides a storage layer on top of the existing data lake storage. It acts as a middle layer between the Spark runtime and the storage.
➢ Delta Lake generates a delta log entry for each committed transaction.
➢ The delta log contains delta files stored as JSON, which record the operations that occurred.
➢ Delta files are sequentially numbered JSON files that together make up the log of all changes that have occurred to a table.
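To see the transaction log for yourself, you can list a table's `_delta_log` directory; a minimal sketch, assuming a Databricks notebook (`spark` and `dbutils` predefined) and a hypothetical table path:

```python
# List the transaction log of a Delta table (path is illustrative).
for f in dbutils.fs.ls("/mnt/delta/events/_delta_log/"):
    # Commit files are sequentially numbered, e.g. 00000000000000000000.json,
    # 00000000000000000001.json, ...
    print(f.name)

# Each JSON commit records the actions taken (add/remove file, metadata, etc.).
commit = spark.read.json("/mnt/delta/events/_delta_log/00000000000000000000.json")
commit.printSchema()
```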
We have seen that Spark and Delta Lake make it easy for us to ingest data from disparate sources,
and work with it as a relational database.
Creating Delta Table in SQL & Python
If your source files are in Parquet format, you can use the SQL CONVERT TO DELTA statement to convert the files in place and create an unmanaged table:
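A minimal sketch of the conversion in both SQL and Python, assuming a Databricks notebook and a hypothetical Parquet path `/mnt/data/events`:

```python
# SQL: convert Parquet files in place to Delta (unmanaged table over the same path).
spark.sql("CONVERT TO DELTA parquet.`/mnt/data/events`")

# Use this variant only if the Parquet data is partitioned; the partition columns
# must be declared explicitly.
spark.sql("CONVERT TO DELTA parquet.`/mnt/data/events` PARTITIONED BY (event_date DATE)")

# Python equivalent using the Delta Lake API.
from delta.tables import DeltaTable
DeltaTable.convertToDelta(spark, "parquet.`/mnt/data/events`")
```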
Delta Lake Storage & Types Of Files
Tables created with a specified LOCATION are considered unmanaged by the metastore. Unlike a managed table, where no
path is specified, an unmanaged table's files are not deleted when you DROP the table.
When you create a table over existing data at a path, the table in the Hive metastore automatically inherits the schema, partitioning, and table
properties of the existing data. This functionality can be used to "import" data into the metastore.
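For illustration, a minimal sketch of an unmanaged (external) table versus a managed one, using hypothetical names and a hypothetical path:

```python
# Unmanaged: LOCATION registers existing Delta data in the metastore;
# dropping the table later leaves the files at this path untouched.
spark.sql("""
  CREATE TABLE IF NOT EXISTS events
  USING DELTA
  LOCATION '/mnt/delta/events'
""")

# Managed: no LOCATION, so the metastore owns the files and DROP TABLE removes them.
spark.sql("""
  CREATE TABLE IF NOT EXISTS events_managed (
    eventId BIGINT,
    data STRING,
    event_date DATE
  ) USING DELTA
""")
```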
Batch upserts
To merge a set of updates and insertions into an existing table, you use the MERGE
INTO statement. For example, the following statement takes a stream of updates and
merges it into the events table. When there is already an event present with the same
eventId, Delta Lake updates the data column using the given expression. When there
is no matching event, Delta Lake adds a new row.
You must specify a value for every column in your table when you perform an
INSERT (for example, when there is no matching row in the existing dataset).
However, you do not need to update all values.
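A minimal sketch of such an upsert, assuming hypothetical columns `eventId` and `data` on the `events` table and a source view named `updates`:

```python
# SQL MERGE: update matching events, insert new ones.
spark.sql("""
  MERGE INTO events
  USING updates
  ON events.eventId = updates.eventId
  WHEN MATCHED THEN
    UPDATE SET events.data = updates.data
  WHEN NOT MATCHED THEN
    INSERT (eventId, data) VALUES (updates.eventId, updates.data)
""")

# Equivalent Python API.
from delta.tables import DeltaTable
events = DeltaTable.forName(spark, "events")
(events.alias("e")
 .merge(spark.table("updates").alias("u"), "e.eventId = u.eventId")
 .whenMatchedUpdate(set={"data": "u.data"})
 .whenNotMatchedInsert(values={"eventId": "u.eventId", "data": "u.data"})
 .execute())
```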
Solving Conflicts Optimistically
In order to offer ACID transactions, Delta Lake has a protocol for figuring out how commits should be ordered
(known as the concept of serializability in databases).
But what happens when two or more commits are made at the same time? Delta Lake handles these cases by
enforcing a rule of mutual exclusion, then attempting to resolve any conflict optimistically. This protocol allows
Delta Lake to deliver on the ACID principle of isolation.
1. Delta Lake records the starting table version (version 0) that is read prior to making any changes.
2. Users 1 and 2 both attempt to append some data to the table at the same time. Here, we’ve run into a conflict
because only one commit can come next and be recorded as 000001.json.
3. Delta Lake handles this conflict with the concept of “mutual exclusion,” which means that only one user can
successfully make commit 000001.json. User 1’s commit is accepted, while User 2’s is rejected.
4. Rather than throw an error for User 2, Delta Lake prefers to handle this conflict optimistically. It checks to see
whether any new commits have been made to the table, and updates the table silently to reflect those changes,
then simply retries User 2’s commit on the newly updated table (without any data processing), successfully
committing 000002.json.
Quickly Recomputing State With Checkpoint Files
Once we’ve made a total of 10 commits to the transaction log, Delta Lake saves a checkpoint file in Parquet format in
the same _delta_log subdirectory. Delta Lake automatically generates checkpoint files every 10 commits.
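A minimal sketch of spotting the checkpoint files, assuming `dbutils` and a hypothetical table path:

```python
# After ten commits the log contains a Parquet checkpoint alongside the JSON commits,
# e.g. 00000000000000000010.checkpoint.parquet, which readers use to avoid
# replaying every individual JSON commit file.
for f in dbutils.fs.ls("/mnt/delta/events/_delta_log/"):
    if "checkpoint" in f.name:
        print(f.name)
```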
Major Differences between the CSV table and Delta Lake table
• The transaction log enables ACID compliance and many other important features. We'll be looking more deeply into these features throughout the rest of this workshop.
• Parquet is a popular data format for many data lakes. It stores data in columnar format and generally provides faster performance.
Let's see how Delta Lake's Data Skipping improves query performance over CSV.
Table Partitioning
All big data lakes divide logical tables into physical partitions. This keeps
physical file sizes manageable, and can also be used to speed up query
processing (see the sketch after this list).
Summarizing partitioning, a performance-enhancing feature common to all big data lakes:
➢ Partitioning splits large tables into smaller chunks of files.
➢ Choosing a semantic partition key can make the matching queries run much faster.
However, partitioning has some limitations:
➢ We can't use partitioning to support a wide range of diverse
queries. In order to benefit, a query must filter on the partition key.
➢ We must choose a partition key of moderate cardinality.
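Here is the partitioning sketch referenced above, using a hypothetical `sales` table partitioned by a moderate-cardinality date key:

```python
# Create a table partitioned by a date column (moderate cardinality).
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (
    sale_id BIGINT,
    store_id INT,
    amount DOUBLE,
    sale_date DATE
  ) USING DELTA
  PARTITIONED BY (sale_date)
""")

# Queries that filter on the partition key only read the matching directories.
spark.sql("""
  SELECT store_id, SUM(amount) AS total
  FROM sales
  WHERE sale_date = '2023-01-15'
  GROUP BY store_id
""").show()
```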
Optimize performance with file management
To improve query speed, Delta Lake on Databricks supports optimizing the layout of
data stored in cloud storage. Delta Lake on Databricks supports several techniques: bin-packing (compaction),
data skipping, and Z-ordering.
Compaction (bin-packing)
Delta Lake on Databricks can improve the speed of read queries from a table by coalescing
small files into larger ones. You trigger compaction by running the OPTIMIZE command.
If you have a large amount of data and only want to optimize a subset of it, you can
specify an optional partition predicate using WHERE.
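A minimal sketch of both forms of OPTIMIZE, reusing the hypothetical `sales` table from the partitioning sketch:

```python
# Compact small files across the whole table.
spark.sql("OPTIMIZE sales")

# Or compact only a subset by adding a partition predicate.
spark.sql("OPTIMIZE sales WHERE sale_date >= '2023-01-01'")
```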
Data Skipping
Databricks' Data Skipping feature takes advantage of the multi-file
structure. As new data is inserted into a Databricks Delta table, file-level
min/max statistics are collected for all columns. Then, when there’s a
lookup query against the table, Databricks Delta first consults these
statistics in order to determine which files can safely be skipped. The
process works in two steps:
1. Keep track of simple statistics such as minimum and maximum values
at a certain granularity that’s correlated with I/O granularity.
2. Leverage those statistics at query planning time in order to avoid
unnecessary I/O.
Data Skipping Example
In this example, we skip file 1 because its minimum value is higher than our
desired value.
Similarly, we skip file 3 based on its maximum value. File 2 is the only
one we need to access.
Z-Ordering (multi-dimensional clustering)
Z-Ordering is another Databricks enhancement
to Delta Lake.
What is Z-Ordering?
The chess board is a 2-dimensional array. But when I persist the data, I need to express it in only one
dimension; in other words, I need to serialize it. My goal is to have squares that are close to each
other in the array remain close to each other in the physical files in my Delta partition directories. For this
illustration, I've decided that each physical file should hold 4 squares.
If I serialize row-by-row, I’ll distance myself from 2 of my closest neighbors. The same thing happens if I
serialize column-by-column.
If we interleave the bits of each dimension’s values, and sort by the result, we can write files where the
2-dimensional neighbors are also close in 1 dimension.
Even better, Z-ordering works even if the values are of different lengths. It also works across more than
2 dimensions, although it breaks down quickly after 4-5.
Z-ordering lets us skip more files, and get fewer false positives in the files we do read.
ZORDER - Examples
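Since the original example slide is not reproduced here, the following is a hedged sketch of typical ZORDER usage on the hypothetical `events` table (the `eventType` column and the partition predicate are assumptions):

```python
# Compact files and co-locate rows with similar eventType/eventId values,
# so the min/max statistics let more files be skipped for filters on those columns.
spark.sql("OPTIMIZE events ZORDER BY (eventType, eventId)")

# ZORDER can be combined with a partition predicate
# (assuming event_date is the table's partition column).
spark.sql("OPTIMIZE events WHERE event_date >= '2023-01-01' ZORDER BY (eventType)")
```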
Star Schemas and Dynamic Partition Pruning
Star Schemas and Dynamic Partition Pruning
Notice that the Dimension table introduces a level of indirection between the query and the partitions we would like to prune
(which are on the Fact table). In most data lakes, this means that we will have to scan the entire Fact table (which of course, is the
biggest table in a star schema). Databricks' query optimizer, however, will understand this query, and skip any partitions on the fact
table that do not match the filter.
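A sketch of the kind of query that benefits, assuming a hypothetical `sales` fact table partitioned by `sale_date` and a `dim_date` dimension table:

```python
# The filter is on the dimension table, but the optimizer prunes the fact table's
# sale_date partitions at run time using the dimension rows that survive the filter.
spark.sql("""
  SELECT d.fiscal_quarter, SUM(f.amount) AS revenue
  FROM sales f
  JOIN dim_date d ON f.sale_date = d.calendar_date
  WHERE d.fiscal_quarter = '2023-Q1'
  GROUP BY d.fiscal_quarter
""").show()
```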
Star Schemas and Dynamic Partition Pruning
The star schema is very effective for analytics queries, as long as all the dimensions stay the same. But what happens when
something changes in a dimension table? For example, here is a dimension table that represents our company's Suppliers. Suppose
we are keeping 5 years of history in our warehouse, and at some point, say in year 3, this Supplier moves its facilities to a new State.
If we are building reports that group Suppliers by State, how do we keep our reports accurate?
Remember, we’re reporting over a 5-year history, so the problem is that we want results that are accurate over that whole time
period.
Star Schemas and Dynamic Partition Pruning
There are several design choices available to solve the slowly changing dimension (SCD) problem. There are actually more choices than we're showing here, but
these are the most common.
These solutions are known as Type x, where x is a number from 0 to 6 (although there is no Type 5).
Type 0 is easy (but not very effective): we simply refuse to allow changes. In Type 1 we simply overwrite the old information with the new
information. In Type 2, we keep a historical trail of the information. Above Type 2, we simply implement more sophisticated ways of
keeping history. Let's look deeper into Types 1 and 2.
A Type 1 slowly changing dimension overwrites the existing data warehouse value with the new value coming from the OLTP system.
Although Type 1 does not maintain history, it is the simplest and fastest way to load dimension data. Type 1 is used when the old
value of the changed dimension is not deemed important for tracking or is a historically insignificant attribute.
Star Schemas and Dynamic Partition Pruning
Here is a Type 1 example. When the Supplier moves from CA to IL, we simply overwrite the information in the record. That means
that the older historical parts of our report will now be incorrect, because it will appear that this Supplier was always in IL.
Star Schemas and Dynamic Partition Pruning
A type 2 slowly changing dimension enables you to track the history of updates to your dimension records. When a changed record
enters the warehouse, it creates a new record to store the changed data and leaves the old record intact. Type 2 is the most common
type of slowly changing dimension because it enables you to track historically significant attributes. The old records point to all history
prior to the latest change, and the new record maintains the most current information.
Star Schemas and Dynamic Partition Pruning
Here are three different ways we might implement a Type 2 solution. Our goal here is to make
sure our historical reports are accurate throughout the entire time period.
In the top example, we add a new row to the dimension table whenever data changes, and
we keep a version number to show the order of the changes. Why do you think this might be
an ineffective solution? Is it important to know when in time each version changed?
In the middle example, we see a more effective solution. Each row for the supplier has a start
and end date. If we design our queries well, we can make sure that each time period in our
report uses the corresponding Supplier row. If there is no data in the End_Date column, we
know we have the current information.
The bottom example is similar. Effective_Date is the same as Start_Date in the middle
example. Instead of End_Date, we have a flag that says whether or not this row is current.
Our queries may get a bit more complex here, because we must read a row to get the start
date, then read the next row to determine the end date (assuming we are using non-current
rows).
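As one possible sketch of the middle (start/end date) variant, here is a two-step Type 2 load in PySpark; the `suppliers` and `supplier_updates` tables and their columns (`supplier_id`, `name`, `state`, `start_date`, `end_date`) are assumptions, not the workshop's actual tables:

```python
from pyspark.sql import functions as F

# Step 1: close the currently-open row for any supplier whose State changed.
spark.sql("""
  MERGE INTO suppliers s
  USING supplier_updates u
  ON s.supplier_id = u.supplier_id AND s.end_date IS NULL
  WHEN MATCHED AND s.state <> u.state THEN
    UPDATE SET end_date = current_date()
""")

# Step 2: append a new open row for each changed (or brand-new) supplier.
open_ids = spark.table("suppliers").filter("end_date IS NULL").select("supplier_id")
new_rows = (spark.table("supplier_updates")
            .join(open_ids, "supplier_id", "left_anti")  # no open row left => changed or new
            .withColumn("start_date", F.current_date())
            .withColumn("end_date", F.lit(None).cast("date")))
new_rows.write.format("delta").mode("append").saveAsTable("suppliers")
```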
What Is Schema Evolution?
Schema evolution is a feature that allows users to easily change a table’s current schema to
accommodate data that is changing over time. Most commonly, it’s used when performing an append
or overwrite operation, to automatically adapt the schema to include one or more new columns.
Following up on the example from the previous section, developers can easily use schema evolution to add the
new columns that were previously rejected due to a schema mismatch. Schema evolution is activated by
adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.
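A minimal sketch, assuming a DataFrame `new_df` that carries a column not yet present in the hypothetical `events` table:

```python
# Without mergeSchema, this write would be rejected by schema enforcement because
# of the unexpected column; with it, the column is added to the table schema.
(new_df.write
 .format("delta")
 .mode("append")
 .option("mergeSchema", "true")
 .saveAsTable("events"))
```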
Query an older snapshot of a table (time travel)
Delta Lake time travel allows you to query an older snapshot of a Delta table. Time
travel has many use cases, including auditing data changes, rolling back bad writes, and reproducing experiments and reports.
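A minimal sketch of time travel queries against the hypothetical `events` table (path, version number, and timestamp are illustrative):

```python
# By version number.
v5 = spark.sql("SELECT * FROM events VERSION AS OF 5")

# By timestamp.
snapshot = spark.sql("SELECT * FROM events TIMESTAMP AS OF '2023-01-15'")

# DataFrame reader option against the table path.
df = (spark.read.format("delta")
      .option("versionAsOf", 5)
      .load("/mnt/delta/events"))
```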
Delta Lake Data retention
By default, Delta tables retain the commit history for 30 days, which means you can
query a version from up to 30 days ago. However, there are some caveats:
VACUUM deletes only data files, not log files. Log files are deleted automatically and
asynchronously after checkpoint operations. The default retention period for log files
is 30 days, configurable through the delta.logRetentionDuration table property, which you
set with the ALTER TABLE SET TBLPROPERTIES SQL command.
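A minimal sketch of adjusting the retention settings on the hypothetical `events` table:

```python
# Keep commit history (log files) for 90 days and guard deleted data files for 30 days.
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 90 days',
    'delta.deletedFileRetentionDuration' = 'interval 30 days'
  )
""")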
Data Recovery Based on Snapshots
Fix accidental deletes to a table for the user 111:
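One way to sketch this repair, assuming a hypothetical `my_table` with a `userId` column whose rows for user 111 were deleted by mistake yesterday:

```python
# Re-insert the rows for user 111 exactly as they existed one day ago,
# read from yesterday's snapshot via time travel.
spark.sql("""
  INSERT INTO my_table
  SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
  WHERE userId = 111
""")
```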
Clean up snapshots
Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other
users or jobs are querying the table. Eventually however, you should clean up old snapshots. You can do this by
running the VACUUM command:
You control the age of the latest retained snapshot by using the RETAIN <N> HOURS option:
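A minimal sketch of both forms, on the hypothetical `events` table:

```python
# Remove data files no longer referenced by versions newer than the default
# retention threshold (7 days).
spark.sql("VACUUM events")

# Keep only the last 240 hours (10 days) of snapshots instead.
spark.sql("VACUUM events RETAIN 240 HOURS")
```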
Describe Detail History
You can retrieve more information about the table (for example, number of files, data size) using DESCRIBE
DETAIL.
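A minimal sketch on the hypothetical `events` table, covering both DESCRIBE DETAIL and DESCRIBE HISTORY:

```python
# Table-level metadata: format, location, number of files, size in bytes, etc.
spark.sql("DESCRIBE DETAIL events").show(truncate=False)

# Audit trail of operations: version, timestamp, operation, user, and more.
spark.sql("DESCRIBE HISTORY events").show(truncate=False)
```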
Clone a Delta table
You can create a copy of an existing Delta table at a specific version using the clone command. Clones
can be either deep or shallow.
A deep clone is a clone that copies the source table data to the clone target in addition to the metadata of the
existing table. Additionally, stream metadata is also cloned so that a stream that writes to the Delta table
can be stopped on the source table and continued on the target of the clone from where it left off.
A shallow clone is a clone that does not copy the data files to the clone target. The table metadata is
equivalent to the source. These clones are cheaper to create.
Note: Any changes made to either deep or shallow clones affect only the clones themselves and not the
source table.
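A minimal sketch of both clone types on Databricks, using the hypothetical `events` table and illustrative target names:

```python
# Deep clone: copies data files and metadata to the new table.
spark.sql("CREATE TABLE IF NOT EXISTS events_archive DEEP CLONE events")

# Shallow clone: only metadata is copied; data files are still read from the source.
spark.sql("CREATE TABLE IF NOT EXISTS events_test SHALLOW CLONE events")

# Clone a specific historical version, e.g. for archiving or ML reproducibility.
spark.sql("CREATE TABLE IF NOT EXISTS events_v5 DEEP CLONE events VERSION AS OF 5")
```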
Clone use cases
Data archiving: Data may need to be kept for longer than is feasible with time travel or for disaster
recovery. In these cases, you can create a deep clone to preserve the state of a table at a certain point in
time for archival. Incremental archiving is also possible to keep a continually updating state of a source
table for disaster recovery.
Machine learning flow reproduction: When doing machine learning, you may want to archive a
certain version of a table on which you trained an ML model. Future models can be tested using this
archived data set.
Clone use cases (continued)
Short-term experiments on a production table: In order to test out a workflow on a production table without
corrupting the table, you can easily create a shallow clone. This allows you to run arbitrary workflows on the cloned
table that contains all the production data but does not affect any production workloads.
Clone use cases (continued)
Data sharing: Other business units within a single organization may want to access the same data but may not
require the latest updates. Instead of giving access to the source table directly, clones with different permissions can
be provided for different business units. The performance of the clone can exceed that of a simple view as well.
Lakehouse Architecture
THANK YOU