0% found this document useful (0 votes)

166 views

Databricks

Databricks Spark + Delta + Data Lake = LakeHouse The document discusses how Databricks, Spark, Delta Lake, and data lakes can be combined to create a "LakeHouse" architecture. A LakeHouse provides the benefits of data lakes such as scalable storage of structured and unstructured data as well as the benefits of data warehouses like reliability, governance, and support for analytics workloads. Delta Lake adds features like ACID transactions and time travel capabilities to enable these joint capabilities.

Uploaded by

Adventure World

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

166 views

Databricks

Uploaded by

Adventure World

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 81

Databricks Spark + Delta + Data Lake = LakeHouse

Source and Credit : databricks.com and delta.io

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Different Types of Architectures from 1980 To Till

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Data Warehouse

The Data Warehouse is your company's central data store.

A Data Warehouse is required for all companies that wish to make data-driven choices since it
serves as the "Single Source of Truth" for all data in the company.
Analyzing Existing Business Data
New Business Models Launched Based on Existing Data Analysis
Enhance customer service and interaction
boost quality and resource utilisation

Are there any gaps in the Data Warehouse?

Data Warehouses only support Structured Data.

There is no support for video, audio, or text.
There is no support for data science or machine learning, and there is only limited support for
streaming.
As a result, the majority of commercial and social media data is kept in data lakes and blob storage.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
@TRRaveendra

1990’s Traditional Data Warehouses

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Big Data Data Lake with Data Warehouses

A datalake is a centralized repository for storing, processing, and securing massive volumes of organized, semistructured,
and unstructured data. It can store data in its native format and handle any type of data, regardless of size.

ADVANTAGES OF DATA LAKE

The primary benefit of data lakes is the concentration of several information sources.
At a cheaper cost, distributed and infinite storage
There are image, video, audio, json, csv, and other data storage types accessible.
Volume and Variety: A data lake can hold the massive amounts of data required by Big Data, artificial intelligence, and
machine learning. Data lakes can manage the volume, diversity, and velocity of data absorbed in any format from numerous
sources.
Ingest Speed: Format is unimportant during ingest. It employs schema-on-read rather than schema-on-write, which implies
that data is not processed for use until it is required. Data can be swiftly written.
Cost savings: In terms of storage expenses, a data lake can be substantially less expensive than a data warehouse.
This enables businesses to acquire a broader range of data, such as unstructured data like rich media, sensor data from the
Internet of Things (IoT), email, or social media.
Greater Accessibility: Data stored in a data lake makes it simple to open copies or subsets of data for multiple users or user
groups to access. Data access might be restricted, while businesses can allow more accessibility.
Advanced Algorithms: Data lakes enable businesses to run complicated queries and deep learning algorithms to identify
trends.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Disadvantages of DataLake
1. Data appending is difficult.
Adding additional data results in improper readings.
2. It is tough to modify existing data.
GDPR/CCPA necessitates fine-grained adjustments to current data lakes.
3. Jobs that fail in the middle result in corrupted files.
Half of the data is present in the data lake, but the other half is missing.
4. Operation in real time
Inconsistency results from combining streaming and batch.
5. Expensive to preserve older data versions.
Reproducibility, auditing, and governance are all required in regulated situations.
6. Difficulty in dealing with vast amounts of metadata
The metadata itself becomes challenging to handle in vast data lakes.
7. "Too many files" issues
Data lakes struggle to handle millions of little files.
8. It is difficult to obtain excellent performance.
Data partitioning for performance is error-prone and difficult to modify.
9. Problems with data quality
It is a continual challenge to guarantee that all data is correct and of high quality.
10. Incomplete ACID Transactions
Insert, update, and delete operations are not possible due to a lack of ACID Properties.
11 We require extra Warehouse and ETL processes to load from DataLake to DataWarehouse in
order to satisfy Incomplete ACID Transactions and Metadata Catalog maintenance.
7

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Modern Lakehouse Architecture

https://www.youtube.com/@TRRaveendra
@TRRaveendra
DataWarehouse + Data Lake = Lakehouse

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Modern Lakehouse Features
Delta Lake is an open-source storage framework that allows for the creation of a Lakehouse architecture with computation engines such as Spark.
DeltaLake enables ACID transactions, scalable #metadata management, and the unification of streaming and batch data processing. Delta Lake is fully
compatible with Apache Spark APIs and runs on top of your existing datalake.

1) Because it is ACID compliant, it provides guaranteed consistency.

2) Strong data storage. Data is stored in versioned parquet files.
3) It is intended to work with Apache Spark.
4) The audit logs will be kept in json format.
5) Time Travel Support

A lakehouse has the following features:

1) support for numerous data kinds and formats
2) data reliability and consistency help with a range of jobs (BI, data science, machine learning, and analytics)
3) direct use of BI techniques to source data
4) Metadata layers for data lakes
5) new query engine designs that enable high-performance SQL execution on data lakes
6) improved access to data science and machine learning tools
7) Data reading and writing at the same time.
8) Schema support combined with data governance methods.
9) Source data can be accessed directly.
10) Storage and computation resources are separated.
11) Storage formats that are standardised.
12) Structured and semi-structured data types, including IoT data, are supported.
13) End-To-End Batch and Streaming Loads.
10

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Lakehouse Architecture

Delta Lake Alternative…..

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Agenda
➢ Introduction

➢ Why Delta Lake

➢ Key Features

➢ How to use Delta Lake in Databricks

➢ Delta Lake Examples

➢ Delta Lake DML Operations

➢ Delta Lake Versioning

➢ Delta Lake SCD Type 2

➢ Delta Lake Partitioning

➢ Delta Lake Clone

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Introduction

➢ Delta Lake is an open source storage layer that brings reliability to data
lakes.

➢ Delta Lake provides ACID transactions, scalable metadata handling

➢ Delta Lake runs on top of your existing data lake and is fully compatible
with Apache Spark APIs.

➢ Delta Lake supports Parquet format

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Why Delta?
➢ Challenges in implementation of a data lake.
➢ Missing ACID properties.
➢ Lack of Schema enforcement.
➢ Lack of Consistency.
➢ Lack of Data Quality.
➢ Too many small files.
➢ Corrupted data due to Frequent job failures in prod

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Challenges faced by most of the data lakes

Source: databricks
16

https://www.youtube.com/@TRRaveendra
@TRRaveendra
17

https://www.youtube.com/@TRRaveendra
@TRRaveendra
18

https://www.youtube.com/@TRRaveendra
@TRRaveendra
19

https://www.youtube.com/@TRRaveendra
@TRRaveendra
20

https://www.youtube.com/@TRRaveendra
@TRRaveendra
21

https://www.youtube.com/@TRRaveendra
@TRRaveendra
22

https://www.youtube.com/@TRRaveendra
@TRRaveendra
23

https://www.youtube.com/@TRRaveendra
@TRRaveendra
24

https://www.youtube.com/@TRRaveendra
@TRRaveendra
25

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Delta Lake Features
An open-source storage format that brings ACID transactions to Apache Spark™ and big data workloads.

➢ Open format: Stored as Parquet format in blob storage.

➢ ACID Transactions: Ensures data integrity and read consistency with complex, concurrent data pipelines.

➢ Schema Enforcement and Evolution: Ensures data cleanliness by blocking writes with unexpected.

➢ Audit History: History of all the operations that happened in the table.

➢ Time Travel: Query previous versions of the table by time or version number.

➢ Deletes and upserts: Supports deleting and upserting into tables with programmatic APIs.

➢ Scalable Metadata management: Able to handle millions of files are scaling the metadata operations with
Spark.

➢ Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a
streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work
out of the box. 26

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Bronze tables contain raw data ingested from various sources (JSON files, RDBMS data,
IoT data, etc.).

Silver tables will provide a more refined view of our data. We can join fields from
various bronze tables to enrich streaming records, or update account statuses based on
recent activity.

Gold tables provide business level aggregates often used for reporting and
dashboarding. This would include aggregations such as daily active website users,
weekly sales per store, or gross revenue per quarter by department.
The end outputs are actionable insights, dashboards, and reports of business metrics.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
How Delta Lake Works

➢ Delta lake provides a storage layer on top of existing storage data lake. It acts as a middle layer between
Spark runtime and storage

➢ Delta Lake will generate delta logs for each committed transactions

➢ Delta logs will have delta files stored as JSON which has information about the operations occurred

➢ Delta files are sequentially increasing named JSON files and together make up the log of all changes that
have occurred to a table

https://www.youtube.com/@TRRaveendra
@TRRaveendra
29

https://www.youtube.com/@TRRaveendra
@TRRaveendra
30

https://www.youtube.com/@TRRaveendra
@TRRaveendra
We have seen that Spark and Delta Lake make it easy for us to ingest data from disparate sources,
and work with it as a relational database.

➢ Cleansing the data to remove corrupt or inaccurate information

➢ Formatting and pruning the data for downstream use

➢ Enriching the data by joining it with other data sources

➢ Aggregating the data to make it more convenient for use

https://www.youtube.com/@TRRaveendra
@TRRaveendra
32

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Creating Delta Table in SQL & Python

If your source files are in Parquet format, you can use the SQL Convert to
Delta statement to convert files in place to create an unmanaged table:

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Delta Lake Storage & Types Of Files

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Tables created with a specified LOCATION are considered unmanaged by the metastore. Unlike a managed table, where no
path is specified, an unmanaged table’s files are not deleted when you DROP the table.

the table in the Hive metastore automatically inherits the schema, partitioning, and table
properties of the existing data. This functionality can be used to “import” data into the metastore.

Reading Delta Table in SQL & Python

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Batch upserts
To merge a set of updates and insertions into an existing table, you use the MERGE
INTO statement. For example, the following statement takes a stream of updates and
merges it into the events table. When there is already an event present with the same
eventId, Delta Lake updates the data column using the given expression. When there
is no matching event, Delta Lake adds a new row.

You must specify a value for every column in your table when you perform an
INSERT (for example, when there is no matching row in the existing dataset).
However, you do not need to update all values. 36

https://www.youtube.com/@TRRaveendra
@TRRaveendra
37

https://www.youtube.com/@TRRaveendra
@TRRaveendra
38

https://www.youtube.com/@TRRaveendra
@TRRaveendra
39

https://www.youtube.com/@TRRaveendra
@TRRaveendra
40

https://www.youtube.com/@TRRaveendra
@TRRaveendra
41

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Solving Conflicts Optimistically
In order to offer ACID transactions, Delta Lake has a protocol for figuring out how commits should be ordered
(known as the concept of serializability in databases).
It should allow two or more commits are made at the same time. Delta Lake handles these cases by
implementing a rule of mutual exclusion, then attempting to solve any conflict optimistically. This protocol allows
Delta Lake to deliver on the ACID principle of isolation,

1. Record the starting table version.

2. Record reads/writes.
3. Attempt a commit.
4. If someone else wins, check whether anything you
read has changed.
5. Repeat.
42

https://www.youtube.com/@TRRaveendra
@TRRaveendra
1. Delta Lake records the starting table version of the table (version 0) that is read prior to making any changes.
2. Users 1 and 2 both attempt to append some data to the table at the same time. Here, we’ve run into a conflict
because only one commit can come next and be recorded as 000001.json.
3. Delta Lake handles this conflict with the concept of “mutual exclusion,” which means that only one user can
successfully make commit 000001.json. User 1’s commit is accepted, while User 2’s is rejected.
4. Rather than throw an error for User 2, Delta Lake prefers to handle this conflict optimistically. It checks to see
whether any new commits have been made to the table, and updates the table silently to reflect those changes,
then simply retries User 2’s commit on the newly updated table (without any data processing), successfully
committing 000002.json.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
44

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Quickly Recomputing State With Checkpoint Files
Once we’ve made a total of 10 commits to the transaction log, Delta Lake saves a checkpoint file in Parquet format in
the same _delta_log subdirectory. Delta Lake automatically generates checkpoint files every 10 commits.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Major Differences between the CSV table and Delta Lake table

•Delta Lake adds a Transaction Log

•Delta Lake data is stored in Parquet format
These differences are the heart and soul of Delta Lake.

•The transaction log enables ACID compliance and many other important features. We'll
be looking more deeply into these features throughout the rest of this workshop.

•Parquet is a popular data format for many data lakes. It stores data in columnar format,
and generally provides faster performance.
Let's see how Delta Lake's Data Skipping improves query performance over CSV.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Table Partitioning
All Big Data lakes divide logical tables into physical partitions. This keeps
physical file sizes manageable, and can also be used to speed query
processing.
Summarizing Topic 1: Partitioning
1. We have seen the benefits of Partitioning, a performance-enhancing
feature common to all Big Data lakes. As we have seen:
2. Partitioning splits large files into smaller chunks
3. We can choose a semantic partition key that can make appropriate
queries run much faster
However, partitioning has some limitations:
➢ We can't use partitioning to support a wide range of diverse
queries. In order to benefit, queries must filter on the partition key
➢ We must choose a partition key of moderate cardinality 47

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Optimize performance with file management
To improve query speed, Delta Lake on Databricks supports the ability to optimize the layout of
data stored in cloud storage. Delta Lake on Databricks supports few algorithms: bin-packing ,
Data-Skipping And Z-Ordering.

Compaction (bin-packing)
Delta Lake on Databricks can improve the speed of read queries from a table by coalescing
small files into larger ones. You trigger compaction by running the OPTIMIZE command

If you have a large amount of data and only want to optimize a subset of it, you can
specify an optional partition predicate using WHERE

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Data Skipping
Databricks' Data Skipping feature takes advantage of the multi-file
structure. As new data is inserted into a Databricks Delta table, file-level
min/max statistics are collected for all columns. Then, when there’s a
lookup query against the table, Databricks Delta first consults these
statistics in order to determine which files can safely be skipped. The
picture below illustrates the process:
1. Keep track of simple statistics such as minimum and maximum values
at a certain granularity that’s correlated with I/O granularity.
2. Leverage those statistics at query planning time in order to avoid
unnecessary I/O.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Data Skipping Example
In this example, we skip file 1 because its minimum value is higher than our
desired value.

Similarly, we skip file 3 based on its maximum value. File 2 is the only
one we need to access.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Z-Ordering (multi-dimensional clustering)
Z-Ordering is another Databricks enhancement
to Delta Lake.
What is Z-Ordering?

Relational databases use secondary indexes to help speed queries.

However, indexing becomes impractical with Big Data; the indexes
themselves become very large, and they cannot be updated at write
time in a performant manner. Z-ordering is a technique that replaces
indexes by placing data columns that are "close" in value into the same
physical files within a partition.

Z-Ordering is a technique to colocate related information in the same

set of files. This co-locality is automatically used by Delta Lake on
Databricks data-skipping algorithms to dramatically reduce the amount
of data that needs to be read. To Z-Order data, you specify the columns
to order on in the ZORDER BY clause:
51

https://www.youtube.com/@TRRaveendra
@TRRaveendra
The chess board is a 2-dimensional array. But when I persist the data, I need to express is in only one
dimension; in other words, I need to serialize it. My desire is to have squares that are close to each
other in the array remain close to each other in physical files in my Delta partition directories. For this
illustration, I’ve decided that each physical file should hold 4 squares.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
If I serialize row-by-row, I’ll distance myself from 2 of my closest neighbors. The same thing happens if I
serialize column-by-column.

Z-ordering is a solution to this problem.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
If we interleave the bits of each dimension’s values, and sort by the result, we can write files where the
2-dimensional neighbors are also close in 1 dimension.
Even better, Z-ordering works even if the values are of different lengths. It also works across more than
2 dimensions, although it breaks down quickly after 4-5.
Z-ordering lets us skip more files, and get fewer false positives in the files we do read.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
ZORDER - Examples..

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Star Schemas and Dynamic Partition Pruning

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Star Schemas and Dynamic Partition Pruning
Notice that the Dimension table introduces a level of indirection between the query and the partitions we would like to prune
(which are on the Fact table). In most data lakes, this means that we will have to scan the entire Fact table (which of course, is the
biggest table in a star schema). Databricks' query optimizer, however, will understand this query, and skip any partitions on the fact
table that do not match the filter.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Star Schemas and Dynamic Partition Pruning
The star schema is very effective for Analytics queries, as long as all the dimensions stay the same. But what happens when
something changes in a dimension table? For example, here is a dimension table that represents our company’s Suppliers. Suppose
we are keeping 5 years of history in our warehouse, and at some point, say year 3, this Supplier moves its facilities to a new State? If
we are building reports that report on Suppliers grouped by State, how do we keep our report accurate?

Remember, we’re reporting over a 5-year history, so the problem is that we want results that are accurate over that whole time
period.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Star Schemas and Dynamic Partition Pruning
There are several design choices available to solve the SCD problem. There are actually more choices than we're showing here, but
these are the most common.
These solutions are known by Type x, where x is a number from 1 to 6 (although there is actually no Type 5).

Type 0 is easy (but not very effective). We simply refuse to allow changes. In Type 1 we simply overwrite the old information with new
information. In Type 2, we keep a historical trail of the information. Above type 2, we simply implement more sophisticated ways of
keeping history. Let’s look deeper into Types 1 and 2…
A type 1 slowly changing dimension overwrites the existing data warehouse value with the new value coming from the OLTP system.
Although the type I does not maintain history, it is the simplest and fastest way to load dimension data. Type I is used when the old
value of the changed dimension is not deemed important for tracking or is an historically insignificant attribute.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Star Schemas and Dynamic Partition Pruning
Here is a Type 1 example. When the Supplier moves from CA to IL, we simply overwrite the information in the record. That means
that the older historical parts of our report will now be incorrect, because it will appear that this Supplier was always in IL.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Star Schemas and Dynamic Partition Pruning
A type 2 slowly changing dimension enables you to track the history of updates to your dimension records. When a changed record
enters the warehouse, it creates a new record to store the changed data and leaves the old record intact. Type 2 is the most common
type of slowly changing dimension because it enables you to track historically significant attributes. The old records point to all history
prior to the latest change, and the new record maintains the most current information.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Star Schemas and Dynamic Partition Pruning

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Star Schemas and Dynamic Partition Pruning
Here are three different ways we might implement a Type 2 solution. Our goal here is to make
sure our historical reports are accurate throughout the entire time period.
In the top example, we add a new row to the dimension table whenever data changes, and
we keep a version number to show the order of the changes. Why do you think this might be
an ineffective solution? Is it important to know when in time each version changed?
In the middle example, we see a more effective solution. Each row for the supplier has a start
and end date. If we design our queries well, we can make sure that each time period in our
report uses the corresponding Supplier row. If there is no data in the End_Date column, we
know we have the current information.
The bottom example is similar. Effective_Date is the same as Start_Date in the middle
example. Instead of End_Date, we have a flag that says whether or not this row is current.
Our queries may get a bit more complex here, because we must read a row to get the start
date, then read the next row to determine the end date (assuming we are using non-current
rows).

https://www.youtube.com/@TRRaveendra
@TRRaveendra
What Is Schema Evolution?
Schema evolution is a feature that allows users to easily change a table’s current schema to
accommodate data that is changing over time. Most commonly, it’s used when performing an append
or overwrite operation, to automatically adapt the schema to include one or more new columns.

Following up on the example from the previous section, developers can easily use schema evolution to add the
new columns that were previously rejected due to a schema mismatch. Schema evolution is activated by
adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
65

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Query an older snapshot of a table (time travel)
Delta Lake time travel allows you to query an older snapshot of a Delta table. Time
travel has many use cases, including:

1. Re-creating analyses, reports, or outputs (for example, the output of a

machine learning model). This could be useful for debugging or auditing,
especially in regulated industries.
2. Writing complex temporal queries.
3. Fixing mistakes in your data.
4. Providing snapshot isolation for a set of queries for fast changing tables.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Delta Lake Data retention
By default, Delta tables retain the commit history for 30 days. This means that you can
specify a version from 30 days ago. However, there are some caveats:

vacuum deletes only data files, not log files. Log files are deleted automatically and
asynchronously after checkpoint operations. The default retention period of log files
is 30 days, configurable through the delta.logRetentionPeriod property which you
set with the ALTER TABLE SET TBLPROPERTIES SQL method.

delta.logRetentionDuration = "interval <interval>": controls how long the history

for a table is kept. Each time a a checkpoint is written
delta.deletedFileRetentionDuration = "interval <interval>": controls how long
ago a file must have been deleted before being a candidate for VACUUM. The default
is interval 7 days
NOTE: VACUUM doesn’t clean up log files; log files are automatically cleaned
up after checkpoints are written. 67

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Data Recovery Based on Snapshots
Fix accidental deletes to a table for the user 111:

Fix accidental incorrect updates to a table:

Query the number of new customers added over the last

week.

Query an earlier version of the table (time travel)

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Clean up snapshots
Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other
users or jobs are querying the table. Eventually however, you should clean up old snapshots. You can do this by
running the VACUUM command:

You control the age of the latest retained snapshot by using the RETAIN <N> HOURS option:

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Describe Detail History
You can retrieve more information about the table (for example, number of files, data size) using DESCRIBE
DETAIL.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Clone a Delta table
You can create a copy of an existing Delta table at a specific version using the clone command. Clones
can be either deep or shallow.

A deep clone is a clone copies the source table data to the clone target in addition to the metadata of the
existing table. Additionally, stream metadata is also cloned such that a stream that writes to the Delta table
can be stopped on a source table and continued on the target of a clone from where it left off.

A shallow clone is a clone that does not copy the data files to the clone target. The table metadata is
equivalent to the source. These clones are cheaper to create.

Note: Any changes made to either deep or shallow clones affect only the clones themselves and not the
source table.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
72

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Clone use cases
Data archiving: Data may need to be kept for longer than is feasible with time travel or for disaster
recovery. In these cases, you can create a deep clone to preserve the state of a table at a certain point in
time for archival. Incremental archiving is also possible to keep a continually updating state of a source
table for disaster recovery.

Machine learning flow reproduction: When doing machine learning, you may want to archive a
certain version of a table on which you trained an ML model. Future models can be tested using this
archived data set.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Clone use cases Continue…
Short-term experiments on a production table: In order to test out a workflow on a production table without
corrupting the table, you can easily create a shallow clone. This allows you to run arbitrary workflows on the cloned
table that contains all the production data but does not affect any production workloads.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Clone use cases Continue…
Data sharing: Other business units within a single organization may want to access the same data but may not
require the latest updates. Instead of giving access to the source table directly, clones with different permissions can
be provided for different business units. The performance of the clone can exceed that of a simple view as well.

https://www.youtube.com/@TRRaveendra
@TRRaveendra
76

https://www.youtube.com/@TRRaveendra
@TRRaveendra
77

https://www.youtube.com/@TRRaveendra
@TRRaveendra
Lakehouse Architecture

https://www.youtube.com/@TRRaveendra
@TRRaveendra
79

https://www.youtube.com/@TRRaveendra
@TRRaveendra
80

https://www.youtube.com/@TRRaveendra
@TRRaveendra
THANK YOU
81

https://www.youtube.com/@TRRaveendra
@TRRaveendra

Big Book of Data Warehousing and Bi v9 122723 Final 0
No ratings yet
Big Book of Data Warehousing and Bi v9 122723 Final 0
88 pages
THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE: "THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE"
From Everand
THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE: "THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE"
AJIT DASH
2/5 (2)
PySpark Data Frame Questions PDF
100% (1)
PySpark Data Frame Questions PDF
57 pages
Data Analysis With Databricks Version 2
No ratings yet
Data Analysis With Databricks Version 2
137 pages
Data Analysis With Databricks
75% (4)
Data Analysis With Databricks
80 pages
Zx81 Manual
No ratings yet
Zx81 Manual
102 pages
Lakehouse: A Unified Data Architecture
No ratings yet
Lakehouse: A Unified Data Architecture
9 pages
Details of Delta Lake Tutorial
67% (3)
Details of Delta Lake Tutorial
43 pages
The Delta Lake Series Lakehouse 012921
No ratings yet
The Delta Lake Series Lakehouse 012921
19 pages
Delta Table and Pyspark Interview Questions
100% (1)
Delta Table and Pyspark Interview Questions
14 pages
Databricks For The SQL Developer: Gerhard Brueckl
No ratings yet
Databricks For The SQL Developer: Gerhard Brueckl
40 pages
HDInsight Essentials - Second Edition
From Everand
HDInsight Essentials - Second Edition
Rajesh Nadipalli
No ratings yet
The Snowflake Handbook: Optimizing Data Warehousing and Analytics
From Everand
The Snowflake Handbook: Optimizing Data Warehousing and Analytics
Robert Johnson
No ratings yet
Connect Databricks Delta Tables With DBeaver
No ratings yet
Connect Databricks Delta Tables With DBeaver
10 pages
Databricks Performance Tuning
No ratings yet
Databricks Performance Tuning
54 pages
Databricks Course Curriculum
No ratings yet
Databricks Course Curriculum
2 pages
EDW-ETL Migration Approaches With Databricks
No ratings yet
EDW-ETL Migration Approaches With Databricks
34 pages
Advanced Project For Data Engineering in Azure
100% (1)
Advanced Project For Data Engineering in Azure
5 pages
Snowflake To Lakehouse Migration Assessment 5-23
100% (1)
Snowflake To Lakehouse Migration Assessment 5-23
22 pages
Databricks Question
No ratings yet
Databricks Question
89 pages
Azure Data Factory Vs Databricks - 4 Key Differences - Hevo
No ratings yet
Azure Data Factory Vs Databricks - 4 Key Differences - Hevo
14 pages
Aksha Interview Questions
100% (1)
Aksha Interview Questions
52 pages
Data Hub Guide For Architects
No ratings yet
Data Hub Guide For Architects
83 pages
Databricks Guide
No ratings yet
Databricks Guide
27 pages
Datamesh Ebook
No ratings yet
Datamesh Ebook
46 pages
HowToCrackInterview Udemy
No ratings yet
HowToCrackInterview Udemy
58 pages
Delta Lake Cheat Sheet-1
100% (1)
Delta Lake Cheat Sheet-1
2 pages
Spark 4.0
No ratings yet
Spark 4.0
123 pages
Databricks: Building and Operating A Big Data Service Based On Apache Spark
No ratings yet
Databricks: Building and Operating A Big Data Service Based On Apache Spark
32 pages
8 Steps For A Developer To Learn Apache Spark and Delta Lake PDF
No ratings yet
8 Steps For A Developer To Learn Apache Spark and Delta Lake PDF
35 pages
4.1 The Spark UI - Databricks
No ratings yet
4.1 The Spark UI - Databricks
7 pages
Manage Data Access With Unity Catalog
No ratings yet
Manage Data Access With Unity Catalog
17 pages
Unity Catalog
No ratings yet
Unity Catalog
16 pages
Data Engineer Interview Questions
No ratings yet
Data Engineer Interview Questions
6 pages
DataBricks_Note_free__1736678274
No ratings yet
DataBricks_Note_free__1736678274
87 pages
Final Print Py Spark
No ratings yet
Final Print Py Spark
133 pages
Performance Tuning in Azure Databricks
100% (1)
Performance Tuning in Azure Databricks
124 pages
Spark Interview Q&A
No ratings yet
Spark Interview Q&A
31 pages
Designing Data Integration The ETL Pattern Approac
No ratings yet
Designing Data Integration The ETL Pattern Approac
9 pages
Microsoft Certified: Azure Data Engineer Associate - Skills Measured
No ratings yet
Microsoft Certified: Azure Data Engineer Associate - Skills Measured
4 pages
Data Warehouse - Concept and Fundamentals: Sridevi
No ratings yet
Data Warehouse - Concept and Fundamentals: Sridevi
25 pages
Cert DEWD (Edits)
No ratings yet
Cert DEWD (Edits)
158 pages
Pyspark PDF
0% (1)
Pyspark PDF
239 pages
PySpark VS SQL Interview Questions
No ratings yet
PySpark VS SQL Interview Questions
16 pages
Databuildtoolpdf 220704 142715
No ratings yet
Databuildtoolpdf 220704 142715
39 pages
Ajay Kadiyala Resume 2023 PDF
No ratings yet
Ajay Kadiyala Resume 2023 PDF
6 pages
PySpark Tutorial For Beginners - Python Examples - Spark by (Examples)
No ratings yet
PySpark Tutorial For Beginners - Python Examples - Spark by (Examples)
19 pages
Spark SQL and DataFrames - Spark 2.2.0 Documentation
No ratings yet
Spark SQL and DataFrames - Spark 2.2.0 Documentation
35 pages
Demystifying The Medallion and Lakehouse Architectures 1714820046
No ratings yet
Demystifying The Medallion and Lakehouse Architectures 1714820046
19 pages
Top Answers To Spark Interview Questions
No ratings yet
Top Answers To Spark Interview Questions
4 pages
Azure Data Engineer Interview Questions
No ratings yet
Azure Data Engineer Interview Questions
15 pages
Data Engineering Roadmap 2023
No ratings yet
Data Engineering Roadmap 2023
1 page
Data Warehousing Interview Questions
No ratings yet
Data Warehousing Interview Questions
6 pages
Pyspark Interview Code
100% (3)
Pyspark Interview Code
197 pages
Azure DataEngineer Course Outline
No ratings yet
Azure DataEngineer Course Outline
4 pages
What Is Bigquery: Enterprise Data Warehouse
No ratings yet
What Is Bigquery: Enterprise Data Warehouse
2 pages
C2 Databricks - Sparks - EE
No ratings yet
C2 Databricks - Sparks - EE
9 pages
PySpark Cheatsheet
No ratings yet
PySpark Cheatsheet
12 pages
Spark Interview QUestions
No ratings yet
Spark Interview QUestions
200 pages
Matthieu - Lamairesse - Reda - Khouani - Why The Best Serverless Data Warehouse Is A Lakehouse - (DAIWT - PARIS)
No ratings yet
Matthieu - Lamairesse - Reda - Khouani - Why The Best Serverless Data Warehouse Is A Lakehouse - (DAIWT - PARIS)
38 pages
Well Architected Lakehouse Workshop
100% (1)
Well Architected Lakehouse Workshop
49 pages
Mastering Data Engineering and Analytics with Databricks: A Hands-on Guide to Build Scalable Pipelines Using Databricks, Delta Lake, and MLflow (English Edition)
From Everand
Mastering Data Engineering and Analytics with Databricks: A Hands-on Guide to Build Scalable Pipelines Using Databricks, Delta Lake, and MLflow (English Edition)
Manoj Kumar
No ratings yet
03 Sqlfunctions
No ratings yet
03 Sqlfunctions
122 pages
Python Learning (Basics I & II)
No ratings yet
Python Learning (Basics I & II)
33 pages
Ade Mod 1 Incremental Processing With Spark Structured Streaming
No ratings yet
Ade Mod 1 Incremental Processing With Spark Structured Streaming
73 pages
Azure Data Factory Notes 1682135573
No ratings yet
Azure Data Factory Notes 1682135573
78 pages
Spark Keywords 1675605055
No ratings yet
Spark Keywords 1675605055
6 pages
Instruction On Philhealth Softcopy Report
No ratings yet
Instruction On Philhealth Softcopy Report
4 pages
Ax 25proctocol
No ratings yet
Ax 25proctocol
13 pages
VHDL Data Types and Operators
No ratings yet
VHDL Data Types and Operators
61 pages
Key Plan: Viaduct General Arrangement Plan and Profile
No ratings yet
Key Plan: Viaduct General Arrangement Plan and Profile
1 page
WLAN Fundamentals
No ratings yet
WLAN Fundamentals
53 pages
Must Prepare Python File Handling MCQ Class 12 For Term 1
No ratings yet
Must Prepare Python File Handling MCQ Class 12 For Term 1
48 pages
Presentation On Heapsort
100% (1)
Presentation On Heapsort
23 pages
FoxBox GT Rackable Info Brochure
No ratings yet
FoxBox GT Rackable Info Brochure
2 pages
Python Interview Questions PDF
No ratings yet
Python Interview Questions PDF
6 pages
5.2.1.7 Lab - Viewing The Switch MAC Address Table
No ratings yet
5.2.1.7 Lab - Viewing The Switch MAC Address Table
5 pages
Introduction To Computing - Module 3 - Hardware Components of Personal Computer
No ratings yet
Introduction To Computing - Module 3 - Hardware Components of Personal Computer
42 pages
Interview Questions
No ratings yet
Interview Questions
96 pages
The Performance of Arrays
No ratings yet
The Performance of Arrays
5 pages
My Assignment
No ratings yet
My Assignment
7 pages
Quick Install Pacsone
No ratings yet
Quick Install Pacsone
4 pages
SQL Question
No ratings yet
SQL Question
15 pages
Data and Network Security Checklist: Technology in Your Corner
No ratings yet
Data and Network Security Checklist: Technology in Your Corner
2 pages
School Billing System Project
No ratings yet
School Billing System Project
2 pages
IBM Industry Data Models - Netezza Whitepaper v1.2
No ratings yet
IBM Industry Data Models - Netezza Whitepaper v1.2
13 pages
Computer Organization
No ratings yet
Computer Organization
67 pages
Veeam Cloud Connect: Manual Configuration Guide For Microsoft Azure
No ratings yet
Veeam Cloud Connect: Manual Configuration Guide For Microsoft Azure
60 pages
Lista de Precios 06-11
No ratings yet
Lista de Precios 06-11
166 pages
Computer-Science-Model-Paper-XI - (Paper II)
No ratings yet
Computer-Science-Model-Paper-XI - (Paper II)
12 pages
2021 10 11 - Log
No ratings yet
2021 10 11 - Log
4 pages
Export Report To PDF Vba Ssrs
No ratings yet
Export Report To PDF Vba Ssrs
2 pages
Lecture 5
100% (1)
Lecture 5
9 pages
Microsoft Certified Azure Developer Associate Skills Measured
No ratings yet
Microsoft Certified Azure Developer Associate Skills Measured
4 pages
Routconf PDF
No ratings yet
Routconf PDF
10 pages
Whatisarestapi?: Api: Application Programming Interface
No ratings yet
Whatisarestapi?: Api: Application Programming Interface
3 pages

Databricks

Uploaded by

Databricks

Uploaded by

Databricks Spark + Delta + Data Lake = LakeHouse

Source and Credit : databricks.com and delta.io

The Data Warehouse is your company's central data store.

Are there any gaps in the Data Warehouse?

Data Warehouses only support Structured Data.

1990’s Traditional Data Warehouses

ADVANTAGES OF DATA LAKE

1) Because it is ACID compliant, it provides guaranteed consistency.

A lakehouse has the following features:

Delta Lake Alternative…..

➢ Why Delta Lake

➢ How to use Delta Lake in Databricks

➢ Delta Lake Examples

➢ Delta Lake DML Operations

➢ Delta Lake Versioning

➢ Delta Lake SCD Type 2

➢ Delta Lake Partitioning

➢ Delta Lake Clone

➢ Delta Lake provides ACID transactions, scalable metadata handling

➢ Delta Lake supports Parquet format

➢ Open format: Stored as Parquet format in blob storage.

➢ Cleansing the data to remove corrupt or inaccurate information

➢ Formatting and pruning the data for downstream use

➢ Enriching the data by joining it with other data sources

➢ Aggregating the data to make it more convenient for use

Reading Delta Table in SQL & Python

1. Record the starting table version.

•Delta Lake adds a Transaction Log

Relational databases use secondary indexes to help speed queries.

Z-Ordering is a technique to colocate related information in the same

Z-ordering is a solution to this problem.

1. Re-creating analyses, reports, or outputs (for example, the output of a

delta.logRetentionDuration = "interval <interval>": controls how long the history

Fix accidental incorrect updates to a table:

Query the number of new customers added over the last

Query an earlier version of the table (time travel)

You might also like