
Databricks Spark + Delta + Data Lake = LakeHouse

Source and credit: databricks.com and delta.io
Channel: @TRRaveendra (https://www.youtube.com/@TRRaveendra)

Data Architectures from the 1980s to Today

Data Warehouse

The Data Warehouse is your company's central data store.


A data warehouse is essential for any company that wants to make data-driven decisions, because it
serves as the "single source of truth" for all data in the company. Typical uses include:
Analyzing existing business data
Launching new business models based on that analysis
Enhancing customer service and interaction
Improving quality and resource utilization

Are there any gaps in the Data Warehouse?

Data warehouses only support structured data.

There is no support for video, audio, or free text.
There is no support for data science or machine learning, and only limited support for streaming.
As a result, the majority of business and social-media data is kept in data lakes and blob storage.


1990s: Traditional Data Warehouses

Big Data Data Lake with Data Warehouses

A data lake is a centralized repository for storing, processing, and securing massive volumes of structured, semi-structured,
and unstructured data. It can store data in its native format and handle any type of data, regardless of size.

ADVANTAGES OF A DATA LAKE

Centralization: the primary benefit of a data lake is that it brings together many information sources in one place.
Distributed, virtually unlimited storage at lower cost.
Support for many data formats: image, video, audio, JSON, CSV, and more.
Volume and variety: a data lake can hold the massive amounts of data required by big data, artificial intelligence, and
machine learning. Data lakes can manage the volume, variety, and velocity of data ingested in any format from numerous
sources.
Ingest speed: format is unimportant during ingest. Data lakes employ schema-on-read rather than schema-on-write, which means
that data is not processed until it is required, so it can be written quickly.
Cost savings: in terms of storage cost, a data lake can be substantially less expensive than a data warehouse.
This enables businesses to capture a broader range of data, such as unstructured data like rich media, sensor data from the
Internet of Things (IoT), email, or social media.
Greater accessibility: data stored in a data lake makes it simple to expose copies or subsets of the data to multiple users or user
groups. Access can be restricted where needed, while broader access can be granted elsewhere.
Advanced algorithms: data lakes enable businesses to run complex queries and deep learning algorithms to identify
trends.

Disadvantages of a Data Lake
1. Data appending is difficult.
Appending new data can lead to incorrect reads.
2. It is hard to modify existing data.
GDPR/CCPA compliance requires fine-grained changes to data already in the lake.
3. Jobs that fail midway leave corrupted files behind.
Half of the data is present in the data lake, but the other half is missing.
4. Real-time operation is hard.
Combining streaming and batch leads to inconsistency.
5. It is expensive to preserve older data versions.
Reproducibility, auditing, and governance are all required in regulated environments.
6. Handling vast amounts of metadata is difficult.
In very large data lakes, the metadata itself becomes challenging to manage.
7. "Too many files" problems.
Data lakes struggle to handle millions of small files.
8. It is difficult to achieve good performance.
Partitioning data for performance is error-prone and difficult to change later.
9. Data quality problems.
Guaranteeing that all data is correct and of high quality is a continual challenge.
10. Incomplete ACID transactions.
Reliable insert, update, and delete operations are not possible without ACID properties.
11. An additional warehouse and extra ETL processes are needed to load data from the data lake into a data warehouse,
to compensate for the missing ACID transactions and metadata catalog maintenance.
Modern Lakehouse Architecture

Data Warehouse + Data Lake = Lakehouse

Modern Lakehouse Features
Delta Lake is an open-source storage framework that allows for the creation of a lakehouse architecture with compute engines such as Spark.
Delta Lake enables ACID transactions, scalable metadata management, and the unification of streaming and batch data processing. Delta Lake is fully
compatible with Apache Spark APIs and runs on top of your existing data lake.

1) Because it is ACID compliant, it provides guaranteed consistency.
2) Durable data storage: data is stored in versioned Parquet files.
3) It is designed to work with Apache Spark.
4) Audit logs are kept in JSON format.
5) Time travel support.

A lakehouse has the following features:

1) Support for numerous data types and formats
2) Data reliability and consistency across a range of workloads (BI, data science, machine learning, and analytics)
3) Direct use of BI tools on source data
4) Metadata layers for data lakes
5) New query engine designs that enable high-performance SQL execution on data lakes
6) Improved access to data science and machine learning tools
7) Concurrent data reading and writing
8) Schema support combined with data governance mechanisms
9) Direct access to source data
10) Separation of storage and compute resources
11) Standardized storage formats
12) Support for structured and semi-structured data types, including IoT data
13) End-to-end batch and streaming loads
Lakehouse Architecture

Delta Lake Alternatives


Agenda
➢ Introduction

➢ Why Delta Lake

➢ Key Features

➢ How to use Delta Lake in Databricks

➢ Delta Lake Examples

➢ Delta Lake DML Operations

➢ Delta Lake Versioning

➢ Delta Lake SCD Type 2

➢ Delta Lake Partitioning

➢ Delta Lake Clone


Introduction

➢ Delta Lake is an open source storage layer that brings reliability to data
lakes.

➢ Delta Lake provides ACID transactions and scalable metadata handling

➢ Delta Lake runs on top of your existing data lake and is fully compatible
with Apache Spark APIs.

➢ Delta Lake stores data in the Parquet format

Why Delta?
➢ Challenges in implementing a data lake:
➢ Missing ACID properties.
➢ Lack of schema enforcement.
➢ Lack of consistency.
➢ Lack of data quality.
➢ Too many small files.
➢ Corrupted data due to frequent job failures in production.

Challenges faced by most of the data lakes

Source: databricks
Delta Lake Features
An open-source storage format that brings ACID transactions to Apache Spark™ and big data workloads.

➢ Open format: Stored as Parquet format in blob storage.

➢ ACID Transactions: Ensures data integrity and read consistency with complex, concurrent data pipelines.

➢ Schema Enforcement and Evolution: Ensures data cleanliness by blocking writes with an unexpected schema, while still allowing the schema to evolve when intended.

➢ Audit History: History of all the operations that happened in the table.

➢ Time Travel: Query previous versions of the table by time or version number.

➢ Deletes and upserts: Supports deleting and upserting into tables with programmatic APIs.

➢ Scalable Metadata management: Able to handle millions of files by scaling the metadata operations with
Spark.

➢ Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a
streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work
out of the box.

Bronze tables contain raw data ingested from various sources (JSON files, RDBMS data,
IoT data, etc.).

Silver tables will provide a more refined view of our data. We can join fields from
various bronze tables to enrich streaming records, or update account statuses based on
recent activity.

Gold tables provide business level aggregates often used for reporting and
dashboarding. This would include aggregations such as daily active website users,
weekly sales per store, or gross revenue per quarter by department.
The end outputs are actionable insights, dashboards, and reports of business metrics.
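As a hedged sketch of this medallion flow (table names, paths, and columns here are illustrative, and spark is the SparkSession available in a Databricks notebook):

# Bronze: ingest raw JSON as-is
bronze = spark.read.json("/mnt/raw/events")                      # hypothetical landing path
bronze.write.format("delta").mode("append").saveAsTable("events_bronze")

# Silver: cleanse and enrich by joining with another table
silver = (spark.table("events_bronze")
          .dropDuplicates(["eventId"])                           # assumes an eventId column
          .join(spark.table("accounts"), "accountId", "left"))   # hypothetical lookup table
silver.write.format("delta").mode("overwrite").saveAsTable("events_silver")

# Gold: business-level aggregate for reporting
gold = (spark.table("events_silver")
        .groupBy("date").count()
        .withColumnRenamed("count", "daily_active_users"))
gold.write.format("delta").mode("overwrite").saveAsTable("daily_active_users_gold")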

How Delta Lake Works

➢ Delta Lake provides a storage layer on top of your existing data lake storage. It acts as a middle layer between
the Spark runtime and the storage

➢ Delta Lake generates a delta log entry for each committed transaction

➢ The delta log consists of files stored as JSON which contain information about the operations that occurred

➢ Delta log files are JSON files with sequentially increasing names; together they make up the log of all changes that
have occurred to a table
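For example, the on-disk layout of a Delta table (path and file names are illustrative) looks roughly like this:

/mnt/delta/events/
    part-00000-...-c000.snappy.parquet              <- data files in Parquet format
    part-00001-...-c000.snappy.parquet
    _delta_log/
        00000000000000000000.json                   <- commit 0
        00000000000000000001.json                   <- commit 1
        ...
        00000000000000000010.checkpoint.parquet     <- checkpoint file (covered in a later slide)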

We have seen that Spark and Delta Lake make it easy for us to ingest data from disparate sources,
and work with it as a relational database.

➢ Cleansing the data to remove corrupt or inaccurate information

➢ Formatting and pruning the data for downstream use

➢ Enriching the data by joining it with other data sources

➢ Aggregating the data to make it more convenient for use

Creating Delta Table in SQL & Python

If your source files are in Parquet format, you can use the SQL CONVERT TO DELTA
statement to convert the files in place, creating an unmanaged Delta table:
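A minimal sketch of both approaches, written as PySpark calls (table names, paths, and columns are illustrative):

# Convert existing Parquet files in place into a Delta table
# (add a PARTITIONED BY clause if the source files are partitioned)
spark.sql("CONVERT TO DELTA parquet.`/mnt/data/events`")

# Create a Delta table in SQL
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (eventId BIGINT, date DATE, data STRING)
    USING DELTA
""")

# Create a Delta table from Python by writing a DataFrame
df = spark.createDataFrame([(1, "2023-01-01", "a")], ["eventId", "date", "data"])
df.write.format("delta").mode("overwrite").saveAsTable("events_py")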

Delta Lake Storage & Types Of Files

Tables created with a specified LOCATION are considered unmanaged by the metastore. Unlike a managed table, where no
path is specified, an unmanaged table’s files are not deleted when you DROP the table.

If the specified LOCATION already contains data stored in Delta Lake, the table in the Hive metastore automatically inherits the schema, partitioning, and table
properties of the existing data. This functionality can be used to “import” data into the metastore.
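For instance (path and table name are hypothetical), registering an existing Delta location as an unmanaged table:

spark.sql("""
    CREATE TABLE IF NOT EXISTS events_external
    USING DELTA
    LOCATION '/mnt/delta/events'
""")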

Reading Delta Table in SQL & Python
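A couple of hedged examples (the events table and path are the hypothetical ones used above):

# SQL
df_sql = spark.sql("SELECT * FROM events")

# Python, by table name or by path
df_tbl  = spark.read.table("events")
df_path = spark.read.format("delta").load("/mnt/delta/events")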

Batch upserts
To merge a set of updates and insertions into an existing table, you use the MERGE
INTO statement. For example, the following statement takes a stream of updates and
merges it into the events table. When there is already an event present with the same
eventId, Delta Lake updates the data column using the given expression. When there
is no matching event, Delta Lake adds a new row.

You must specify a value for every column in your table when you perform an
INSERT (for example, when there is no matching row in the existing dataset).
However, you do not need to update all values.
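A sketch of that MERGE (the updates source and the date column are assumptions consistent with the hypothetical events table above):

spark.sql("""
    MERGE INTO events
    USING updates
    ON events.eventId = updates.eventId
    WHEN MATCHED THEN
        UPDATE SET events.data = updates.data
    WHEN NOT MATCHED THEN
        INSERT (eventId, date, data) VALUES (updates.eventId, updates.date, updates.data)
""")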

Solving Conflicts Optimistically
In order to offer ACID transactions, Delta Lake has a protocol for figuring out how commits should be ordered
(known as the concept of serializability in databases).
But what happens when two or more commits are made at the same time? Delta Lake handles these cases by
implementing a rule of mutual exclusion, then attempting to resolve any conflict optimistically. This protocol allows
Delta Lake to deliver on the ACID principle of isolation.

1. Record the starting table version.


2. Record reads/writes.
3. Attempt a commit.
4. If someone else wins, check whether anything you
read has changed.
5. Repeat.
1. Delta Lake records the starting table version of the table (version 0) that is read prior to making any changes.
2. Users 1 and 2 both attempt to append some data to the table at the same time. Here, we’ve run into a conflict
because only one commit can come next and be recorded as 000001.json.
3. Delta Lake handles this conflict with the concept of “mutual exclusion,” which means that only one user can
successfully make commit 000001.json. User 1’s commit is accepted, while User 2’s is rejected.
4. Rather than throw an error for User 2, Delta Lake prefers to handle this conflict optimistically. It checks to see
whether any new commits have been made to the table, and updates the table silently to reflect those changes,
then simply retries User 2’s commit on the newly updated table (without any data processing), successfully
committing 000002.json.

Quickly Recomputing State With Checkpoint Files
Once we’ve made a total of 10 commits to the transaction log, Delta Lake saves a checkpoint file in Parquet format in
the same _delta_log subdirectory. Delta Lake automatically generates checkpoint files every 10 commits.

Major Differences between the CSV table and Delta Lake table

•Delta Lake adds a Transaction Log


•Delta Lake data is stored in Parquet format
These differences are the heart and soul of Delta Lake.

•The transaction log enables ACID compliance and many other important features. We'll
be looking more deeply into these features throughout the rest of this workshop.

•Parquet is a popular data format for many data lakes. It stores data in columnar format,
and generally provides faster performance.
Let's see how Delta Lake's Data Skipping improves query performance over CSV.

Table Partitioning
All big data lakes divide logical tables into physical partitions. This keeps
physical file sizes manageable, and can also be used to speed up query
processing (a minimal example follows this list).
Summarizing partitioning, a performance-enhancing feature common to all big data lakes:
1. Partitioning splits large tables into smaller chunks of files
2. We can choose a semantic partition key that makes appropriate
queries run much faster
However, partitioning has some limitations:
➢ We can't use partitioning to support a wide range of diverse
queries; to benefit, a query must filter on the partition key
➢ We must choose a partition key of moderate cardinality
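A minimal sketch of creating and querying a partitioned Delta table (names are illustrative):

spark.sql("""
    CREATE TABLE IF NOT EXISTS events_partitioned (eventId BIGINT, eventType STRING, date DATE)
    USING DELTA
    PARTITIONED BY (date)
""")

# Filtering on the partition key lets the engine prune partitions
spark.sql("SELECT count(*) FROM events_partitioned WHERE date = '2023-01-01'").show()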

Optimize performance with file management
To improve query speed, Delta Lake on Databricks supports the ability to optimize the layout of
data stored in cloud storage. Delta Lake on Databricks supports several techniques: bin-packing (compaction),
data skipping, and Z-ordering.

Compaction (bin-packing)
Delta Lake on Databricks can improve the speed of read queries from a table by coalescing
small files into larger ones. You trigger compaction by running the OPTIMIZE command

If you have a large amount of data and only want to optimize a subset of it, you can
specify an optional partition predicate using WHERE
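For example (table names are the hypothetical ones used earlier):

# Compact the whole table
spark.sql("OPTIMIZE events")

# Compact only recent partitions using a partition predicate
spark.sql("OPTIMIZE events_partitioned WHERE date >= '2023-01-01'")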

Data Skipping
Databricks' Data Skipping feature takes advantage of the multi-file
structure. As new data is inserted into a Databricks Delta table, file-level
min/max statistics are collected for all columns. Then, when there’s a
lookup query against the table, Databricks Delta first consults these
statistics in order to determine which files can safely be skipped. The
process works as follows:
1. Keep track of simple statistics such as minimum and maximum values
at a certain granularity that’s correlated with I/O granularity.
2. Leverage those statistics at query planning time in order to avoid
unnecessary I/O.

Data Skipping Example
In this example, suppose a query filters for a particular value and we have three files, each with min/max
statistics for that column. We skip file 1 because its minimum value is higher than our
desired value. Similarly, we skip file 3 based on its maximum value. File 2 is the only
one we need to access.

Z-Ordering (multi-dimensional clustering)
Z-Ordering is another Databricks enhancement
to Delta Lake.
What is Z-Ordering?

Relational databases use secondary indexes to help speed queries.
However, indexing becomes impractical with big data; the indexes
themselves become very large, and they cannot be updated at write
time in a performant manner. Z-ordering is a technique that replaces
indexes by placing data columns that are "close" in value into the same
physical files within a partition.

Z-Ordering is a technique to colocate related information in the same
set of files. This co-locality is automatically used by Delta Lake on
Databricks data-skipping algorithms to dramatically reduce the amount
of data that needs to be read. To Z-order data, you specify the columns
to order on in the ZORDER BY clause:
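For example (table and column are the hypothetical ones used earlier):

spark.sql("OPTIMIZE events_partitioned ZORDER BY (eventId)")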
The chess board is a 2-dimensional array. But when I persist the data, I need to express it in only one
dimension; in other words, I need to serialize it. My desire is to have squares that are close to each
other in the array remain close to each other in the physical files in my Delta partition directories. For this
illustration, I’ve decided that each physical file should hold 4 squares.

If I serialize row-by-row, I’ll distance myself from 2 of my closest neighbors. The same thing happens if I
serialize column-by-column.

Z-ordering is a solution to this problem.

If we interleave the bits of each dimension’s values, and sort by the result, we can write files where the
2-dimensional neighbors are also close in 1 dimension.
Even better, Z-ordering works even if the values are of different lengths. It also works across more than
2 dimensions, although it breaks down quickly after 4-5.
Z-ordering lets us skip more files, and get fewer false positives in the files we do read.

ZORDER BY: Examples
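Hedged examples on the hypothetical tables used earlier:

# Z-order the whole table on a frequently filtered column
spark.sql("OPTIMIZE events ZORDER BY (eventId)")

# Z-order only recent partitions, on more than one column
spark.sql("OPTIMIZE events_partitioned WHERE date >= '2023-01-01' ZORDER BY (eventId, eventType)")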

Star Schemas and Dynamic Partition Pruning

Star Schemas and Dynamic Partition Pruning
Notice that the Dimension table introduces a level of indirection between the query and the partitions we would like to prune
(which are on the Fact table). In most data lakes, this means that we will have to scan the entire Fact table (which of course, is the
biggest table in a star schema). Databricks' query optimizer, however, will understand this query, and skip any partitions on the fact
table that do not match the filter.
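A hedged sketch of such a star-join query (tables and columns are hypothetical; the fact table sales_fact is assumed to be partitioned on date_key):

spark.sql("""
    SELECT SUM(f.amount) AS total_sales
    FROM sales_fact f
    JOIN date_dim d ON f.date_key = d.date_key
    WHERE d.year = 2023
""").show()

Although the filter is on the dimension table, dynamic partition pruning lets the optimizer skip sales_fact partitions whose date_key values cannot match.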

Star Schemas and Dynamic Partition Pruning
The star schema is very effective for Analytics queries, as long as all the dimensions stay the same. But what happens when
something changes in a dimension table? For example, here is a dimension table that represents our company’s Suppliers. Suppose
we are keeping 5 years of history in our warehouse, and at some point, say year 3, this Supplier moves its facilities to a new State? If
we are building reports that report on Suppliers grouped by State, how do we keep our report accurate?

Remember, we’re reporting over a 5-year history, so the problem is that we want results that are accurate over that whole time
period.

Star Schemas and Dynamic Partition Pruning
There are several design choices available to solve the slowly changing dimension (SCD) problem. There are actually more choices than we're showing here, but
these are the most common.
These solutions are known as Type x, where x is a number from 0 to 6.

Type 0 is easy (but not very effective): we simply refuse to allow changes. In Type 1 we simply overwrite the old information with new
information. In Type 2, we keep a historical trail of the information. Above Type 2, we simply implement more sophisticated ways of
keeping history. Let's look deeper into Types 1 and 2.
A Type 1 slowly changing dimension overwrites the existing data warehouse value with the new value coming from the OLTP system.
Although Type 1 does not maintain history, it is the simplest and fastest way to load dimension data. Type 1 is used when the old
value of the changed dimension is not deemed important for tracking or is a historically insignificant attribute.

Star Schemas and Dynamic Partition Pruning
Here is a Type 1 example. When the Supplier moves from CA to IL, we simply overwrite the information in the record. That means
that the older historical parts of our report will now be incorrect, because it will appear that this Supplier was always in IL.

Star Schemas and Dynamic Partition Pruning
A type 2 slowly changing dimension enables you to track the history of updates to your dimension records. When a changed record
enters the warehouse, it creates a new record to store the changed data and leaves the old record intact. Type 2 is the most common
type of slowly changing dimension because it enables you to track historically significant attributes. The old records point to all history
prior to the latest change, and the new record maintains the most current information.

Star Schemas and Dynamic Partition Pruning

Star Schemas and Dynamic Partition Pruning
Here are three different ways we might implement a Type 2 solution. Our goal here is to make
sure our historical reports are accurate throughout the entire time period.
In the top example, we add a new row to the dimension table whenever data changes, and
we keep a version number to show the order of the changes. Why do you think this might be
an ineffective solution? Is it important to know when in time each version changed?
In the middle example, we see a more effective solution. Each row for the supplier has a start
and end date. If we design our queries well, we can make sure that each time period in our
report uses the corresponding Supplier row. If there is no data in the End_Date column, we
know we have the current information.
The bottom example is similar. Effective_Date is the same as Start_Date in the middle
example. Instead of End_Date, we have a flag that says whether or not this row is current.
Our queries may get a bit more complex here, because we must read a row to get the start
date, then read the next row to determine the end date (assuming we are using non-current
rows).
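As a hedged sketch (one of several possible implementations, not the exact code from the slides), the Start/End-date approach above can be implemented on Delta with a MERGE that closes the current row, followed by an insert of the new row; the supplier_dim and supplier_updates tables and their columns are illustrative:

# Step 1: close out current rows whose tracked attribute changed
spark.sql("""
    MERGE INTO supplier_dim AS t
    USING supplier_updates AS s
    ON t.supplier_id = s.supplier_id AND t.end_date IS NULL
    WHEN MATCHED AND t.state <> s.state THEN
        UPDATE SET end_date = current_date()
""")

# Step 2: insert a new current row for changed or brand-new suppliers
spark.sql("""
    INSERT INTO supplier_dim
    SELECT s.supplier_id, s.state, current_date() AS start_date, CAST(NULL AS DATE) AS end_date
    FROM supplier_updates s
    LEFT JOIN supplier_dim t
        ON t.supplier_id = s.supplier_id AND t.end_date IS NULL
    WHERE t.supplier_id IS NULL OR t.state <> s.state
""")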

What Is Schema Evolution?
Schema evolution is a feature that allows users to easily change a table’s current schema to
accommodate data that is changing over time. Most commonly, it’s used when performing an append
or overwrite operation, to automatically adapt the schema to include one or more new columns.

Following up on the example from the previous section, developers can easily use schema evolution to add the
new columns that were previously rejected due to a schema mismatch. Schema evolution is activated by
adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.
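A minimal sketch (new_data is an illustrative DataFrame whose schema contains an extra column not yet present in the target table):

(new_data.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # activates schema evolution for this write
    .saveAsTable("events"))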

Query an older snapshot of a table (time travel)
Delta Lake time travel allows you to query an older snapshot of a Delta table. Time
travel has many use cases, including:

1. Re-creating analyses, reports, or outputs (for example, the output of a


machine learning model). This could be useful for debugging or auditing,
especially in regulated industries.
2. Writing complex temporal queries.
3. Fixing mistakes in your data.
4. Providing snapshot isolation for a set of queries for fast changing tables.
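Hedged examples (table name, version, and timestamp are illustrative):

# SQL: by version number or by timestamp
spark.sql("SELECT * FROM events VERSION AS OF 5")
spark.sql("SELECT * FROM events TIMESTAMP AS OF '2023-01-01'")

# Python DataFrame reader option
df_v5 = (spark.read.format("delta")
         .option("versionAsOf", 5)
         .load("/mnt/delta/events"))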

Delta Lake Data Retention
By default, Delta tables retain the commit history for 30 days. This means that you can
specify a version from up to 30 days ago. However, there are some caveats:

VACUUM deletes only data files, not log files. Log files are deleted automatically and
asynchronously after checkpoint operations. The default retention period of log files
is 30 days, configurable through the delta.logRetentionDuration property, which you
set with the ALTER TABLE SET TBLPROPERTIES SQL command.

delta.logRetentionDuration = "interval <interval>": controls how long the history
for a table is kept. Each time a checkpoint is written, log entries older than the
retention interval are automatically cleaned up.
delta.deletedFileRetentionDuration = "interval <interval>": controls how long
ago a file must have been deleted before becoming a candidate for VACUUM. The default
is interval 7 days.
NOTE: VACUUM doesn’t clean up log files; log files are automatically cleaned
up after checkpoints are written.
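For example, setting these properties with ALTER TABLE SET TBLPROPERTIES (the intervals shown are illustrative):

spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")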

Data Recovery Based on Snapshots
Typical scenarios (hedged sketches for each are shown below):
Fix accidental deletes to a table for the user 111
Fix accidental incorrect updates to a table
Query the number of new customers added over the last week
Query an earlier version of the table (time travel)
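Hedged sketches based on the time-travel patterns in the Delta Lake documentation (my_table and userId are illustrative):

# Fix accidental deletes for user 111 by re-inserting yesterday's rows
spark.sql("""
    INSERT INTO my_table
    SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
    WHERE userId = 111
""")

# Fix accidental incorrect updates by merging the previous snapshot back in
spark.sql("""
    MERGE INTO my_table target
    USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
    ON source.userId = target.userId
    WHEN MATCHED THEN UPDATE SET *
""")

# Count new customers added over the last week
spark.sql("""
    SELECT count(DISTINCT userId) - (
        SELECT count(DISTINCT userId)
        FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7)
    )
    FROM my_table
""").show()

# Query an earlier version of the table (time travel)
spark.sql("SELECT * FROM my_table VERSION AS OF 3")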

Clean up snapshots
Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other
users or jobs are querying the table. Eventually however, you should clean up old snapshots. You can do this by
running the VACUUM command:

You control the age of the latest retained snapshot by using the RETAIN <N> HOURS option:
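For example (table name assumed from earlier; 168 hours is illustrative):

# Remove files no longer referenced by versions older than the default retention period (7 days)
spark.sql("VACUUM events")

# Retain only the last 168 hours (7 days) of snapshots
spark.sql("VACUUM events RETAIN 168 HOURS")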

Describe Detail History
You can retrieve more information about the table (for example, number of files, data size) using DESCRIBE
DETAIL.
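For example (table name assumed from earlier examples):

spark.sql("DESCRIBE DETAIL events").show(truncate=False)

# The companion DESCRIBE HISTORY command lists the operations recorded in the table's audit history
spark.sql("DESCRIBE HISTORY events").show(truncate=False)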

Clone a Delta table
You can create a copy of an existing Delta table at a specific version using the clone command. Clones
can be either deep or shallow.

A deep clone is a clone that copies the source table data to the clone target in addition to the metadata of the
existing table. Additionally, stream metadata is also cloned, such that a stream that writes to the Delta table
can be stopped on the source table and continued on the target of the clone from where it left off.

A shallow clone is a clone that does not copy the data files to the clone target. The table metadata is
equivalent to the source. These clones are cheaper to create.

Note: Any changes made to either deep or shallow clones affect only the clones themselves and not the
source table.
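Hedged examples (source and target table names and the version number are illustrative):

# Deep clone: copies data files and metadata
spark.sql("CREATE TABLE events_archive DEEP CLONE events")

# Shallow clone of a specific version: copies metadata only and references the source data files
spark.sql("CREATE TABLE events_test SHALLOW CLONE events VERSION AS OF 5")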

Clone use cases
Data archiving: Data may need to be kept for longer than is feasible with time travel or for disaster
recovery. In these cases, you can create a deep clone to preserve the state of a table at a certain point in
time for archival. Incremental archiving is also possible to keep a continually updating state of a source
table for disaster recovery.

Machine learning flow reproduction: When doing machine learning, you may want to archive a
certain version of a table on which you trained an ML model. Future models can be tested using this
archived data set.

Clone use cases (continued)
Short-term experiments on a production table: In order to test out a workflow on a production table without
corrupting the table, you can easily create a shallow clone. This allows you to run arbitrary workflows on the cloned
table that contains all the production data but does not affect any production workloads.

Clone use cases (continued)
Data sharing: Other business units within a single organization may want to access the same data but may not
require the latest updates. Instead of giving access to the source table directly, clones with different permissions can
be provided for different business units. The performance of the clone can exceed that of a simple view as well.

Lakehouse Architecture

THANK YOU