PySpark File Commands and Theory

Day -1

File commands

1) %fs ls "filepath" - lists the files at that location
2) %fs head "filepath" - shows the beginning of the file's content
3) %fs mkdirs "filepath" - creates a directory
4) %fs mv dbfs:/Filestore/tables/kgole/p3/p2-1.txt dbfs:/Filestore/tables/kgole/p3/pure.txt
   (moves or renames a file)

Day -2 and 3
RDD - Resilient Distributed Datasets

RDD, which stands for Resilient Distributed Dataset, is a fundamental data structure in
Apache Spark. RDD represents an immutable, distributed collection of objects that can
be processed in parallel across a cluster of machines. RDDs provide fault tolerance and
parallel processing capabilities, making them a key abstraction in Spark for distributed
data processing.

# RDD is an in-memory object and can be created in 3 ways:

# 1) from a collection of data (parallelize())
# 2) from flat files (textFile())
# 3) from an existing RDD (filter(), union(), map())

#Properties of RDD
#1) immutable - can't modify
#2) Partitioned/Distributed
#3) Resilient - Fault tolerant

# Two types of functions:

# 1) Transformations - used to create RDDs; they depend on action functions for
#    execution. This is known as lazy evaluation. (e.g. parallelize, filter, textFile)
# 2) Actions - trigger the execution of transformations (e.g. collect, first, take)
Command
1. myRDD = sc.textFile("dbfs:/FileStore/new_file_day2.txt")  # creating an RDD from a flat file
2. collectionRDD = sc.parallelize(data)  # parallelize - convert a collection into an RDD
3. originalRDD = sc.parallelize([1, 2, 3, 4, 5])
4. E.g. 1 - duplicateRDD = originalRDD.map(lambda x: x)  # creating from an existing RDD
   E.g. 2 - firstRDD = sc.parallelize([1, 2, 3])
            secondRDD = sc.parallelize([3, 4, 5])
            combinedRDD = firstRDD.union(secondRDD)
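
Since transformations are lazy, nothing above actually executes until an action is called. A minimal sketch (reusing combinedRDD from E.g. 2 above):

combinedRDD.collect()  # action: returns all elements, e.g. [1, 2, 3, 3, 4, 5]
combinedRDD.first()    # action: returns the first element, 1
combinedRDD.take(2)    # action: returns the first two elements, [1, 2]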

Day - 4 & 5
Creating a Dataframe

# there are 3 ways to create a dataframe:

# 1) from an RDD
# 2) from a data collection
# 3) by reading a file

Command
1. From RDD
data = [[1, "karan", 10000], [2, "Rick", 30000], [3, "Else", 240000], [4, "never", 23000]]

employeeRDD = sc.parallelize(data)
empDF = employeeRDD.toDF("Id: long, Name: string, salary: long")

2. From Data Collection

data = [[1, "karan", 10000], [2, "Rick", 30000], [3, "Else", 240000], [4, "never", 23000]]
empDF1 = spark.createDataFrame(data, "Id: long, Name: string, salary: long")
# here we created a dataframe from a data collection and also passed the schema

3. a) Read File (CSV)

yellowtaxiDF = spark.read.csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/YellowTaxis_202211.csv")

3. b) Read File (JSON)

paymentTypesDF = spark.read.json("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/PaymentTypes.json")

Few more commands for setting up a dataframe via read file

a) # The first row is a header row, not actual data, so we need to tell Spark to treat it as a header:
yellowtaxiDF = spark.read.option("header", "true").csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/YellowTaxis_202211.csv")
# but here every column's data type is string

b) # To get proper data types we call the read again, with one more option:
yellowtaxiDF = spark.read.option("inferSchema", "true").option("header", "true").csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/YellowTaxis_202211.csv")
# inferSchema is fine when the data is small
# inferSchema is not good when the data is huge,
# because Spark needs to read every row and every column value to decide/define the data type for each column
# performance will degrade if inferSchema is used on a huge dataset

# the default output file format for a dataframe is parquet

# snappy is the default compression method
# after loading the data you can't supply a schema manually; the schema must be given at read time
# 4 threads per core

Define schema & apply

# create schema for the Yellow Taxi data
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType, DoubleType

yellowTaxiSchema = (
    StructType([
        StructField("VendorId", IntegerType(), True),
        StructField("lpep_pickup_datetime", TimestampType(), True),
        StructField("lpep_dropoff_datetime", TimestampType(), True),
        StructField("passenger_count", DoubleType(), True),
        StructField("trip_distance", DoubleType(), True),
        StructField("RatecodeID", DoubleType(), True)
    ])
)

yellowtaxiDF = spark.read.option("header", "true").schema(yellowTaxiSchema).csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/YellowTaxis_202211.csv")
# here the schema is applied while creating the dataframe
Creating a dataframe from multiline JSON file
#read multiline JSON file
taxiBasesDF= (
spark
.read
.option("multiline", "true")
.json("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/TaxiBases.json")
)
taxiBasesDF.display()

Creating a nested schema


Adding nested schema for JSON file
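
The notes don't include the code for this part; a minimal sketch of a nested schema for a JSON file (the field names are illustrative assumptions, only the TaxiBases.json path comes from the notes):

from pyspark.sql.types import StructType, StructField, StringType

taxiBasesSchema = StructType([
    StructField("LicenseNumber", StringType(), True),
    StructField("EntityName", StringType(), True),
    StructField("Address", StructType([              # nested struct for the JSON sub-object
        StructField("Building", StringType(), True),
        StructField("Street", StringType(), True),
        StructField("City", StringType(), True)
    ]), True)
])

taxiBasesDF = (
    spark.read
    .option("multiline", "true")
    .schema(taxiBasesSchema)
    .json("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/TaxiBases.json")
)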

How to save a dataframe

1) yellowtaxiDF.write.format("csv").mode("append").csv("/dbfs/FileStore/shared_uploads") # append newly added data to the location
2) yellowtaxiDF.write.save("/dbfs/FileStore/shared_uploads") # save the file
3) yellowtaxiDF.write.format("csv").mode("overwrite").csv("/dbfs/FileStore/shared_uploads") # overwrite the existing file

Doing compression of the output file in a desired format

yellowtaxiDF.write.format("csv").option("compression", "gzip").csv("/dbfs/FileStore/shared_uploads/output") # compress the output file in the desired format

yellowtaxiDF.write.mode("overwrite").option("compression", "none").save("/dbfs/FileStore/shared_uploads/output_with_no_compression") # save without compression
Day- 6

# describe - a function that can be called on numeric columns to analyze information about the data
mytaxiDF = yellowtaxiDF.describe("passenger_count", "trip_distance")
# gives details like count, mean, stddev, min, max

Clean Data
It consists of processes like the ones below.

1. Accuracy Check: filter out inaccurate data

Both where() and filter() work the same way; here we keep only rows where the values are greater than zero.

from pyspark.sql.functions import col

yellowtaxiDF = (
    yellowtaxiDF
    .where("passenger_count > 0")
    .filter(col("trip_distance") > 0.0)
)

2. a) Completeness Check: drop rows with nulls

yellow_taxi_nullDF = yellowtaxiDF.na.drop() # drop rows containing null values

# difference between na.drop('all') and na.drop()

na.drop('all'):
    Drops rows whose values are all null (missing) across all specified columns.
    If a row contains at least one non-null value in the specified columns, it is retained.
    Useful when you want to remove only those rows where every specified column is missing.

na.drop():
    Drops rows that have any null (missing) value in the specified columns.
    If a row contains at least one null value in the specified columns, it is removed.
    It is more inclusive than na.drop('all'), as it removes rows with any missing values.
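
A minimal sketch of the difference, using a small made-up dataframe (not part of the original notes):

demoDF = spark.createDataFrame([(1, None), (None, None), (3, 40)], "a: int, b: int")

demoDF.na.drop('all').show() # keeps (1, null) and (3, 40); drops only the row where every column is null
demoDF.na.drop().show()      # keeps only (3, 40); drops every row containing any null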

2. b) Completeness Check: replace null values with default values

defaultValueMap = {'payment_type': 5, 'RateCodeID': 1}
# here we are setting default values for payment_type and RateCodeID

yellowtaxiDF = (
    yellowtaxiDF.na.fill(defaultValueMap)
)
3) Uniqueness Check: drop duplicate rows

yellowtaxiDF = (
    yellowtaxiDF.dropDuplicates() # this command drops the duplicate rows
)
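
dropDuplicates() can also be restricted to a subset of columns; a small sketch (the column names are taken from the yellow taxi schema defined earlier):

yellowtaxiDF = yellowtaxiDF.dropDuplicates(["VendorId", "lpep_pickup_datetime"]) # rows count as duplicates when these two columns match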

4) Timeliness Check: remove records outside the bounds

yellowtaxiDF = (
    yellowtaxiDF.where("tpep_pickup_datetime >= '2022-10-01' AND tpep_dropoff_datetime < '2022-11-01'")
    # here we are applying a filter which keeps data within a specific timeline
)
Transform data
1. Select Limited Columns

yellowtaxiDF.select("vendorID", "passenger_count").where("vendorID == 2").where("passenger_count > 5").show()
# here we selected two columns and then added conditions so we get filtered data

2. Rename Columns

yellowtaxi_new_DF = yellowtaxiDF.withColumnRenamed("old_col_name", "new_col_name")

3. a) Create derived columns - TripYear, TripMonth, TripDay (this is an example)

from pyspark.sql.functions import year, month, dayofmonth, col

yellowtaxiDF33 = (
    yellowtaxiDF1
    .withColumn("TripYear", year(col("PickupTime")))
    .withColumn("TripMonth", month(col("PickupTime")))
    .withColumn("TripDay", dayofmonth(col("PickupTime")))
)
3. b) Create derived column - TripTimeInMinutes

from pyspark.sql.functions import round, unix_timestamp, col  # Spark's round, which works on Columns

yellowtaxiDF = (
    yellowtaxiDF
    .withColumn("TripTimeInMinutes",
        round(
            (unix_timestamp(col("DropTime")) - unix_timestamp(col("PickupTime"))) / 60
        )
    )
)

Day- 7

Creating a dataframe from data and creating a column for it

data = [
    ('2023-06-27 09:30:00',),
    ('2023-06-28 15:45:00',),
    ('2023-06-29 12:00:00',)
]
df = spark.createDataFrame(data, ['PickupTime'])
df.display()
df.printSchema()
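
printSchema() shows PickupTime as a string here; a minimal sketch of converting it into a real timestamp column (this step is an addition, not part of the original notes):

from pyspark.sql.functions import to_timestamp, col

df = df.withColumn("PickupTime", to_timestamp(col("PickupTime"), "yyyy-MM-dd HH:mm:ss"))
df.printSchema() # PickupTime is now a timestamp column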
Day- 8
Corrupt Data Handling:

There are 3 modes for reading corrupted data:

1. Permissive
2. Dropmalformed
3. FailFast

1) PERMISSIVE mode -

In Spark, "parsing" refers to extracting structured information from unstructured or semi-structured data. Permissive is the default parse mode; it keeps corrupt records and lets you see where the corrupt data is.

rc_corrupt_1_DF = spark.read.option("mode", "PERMISSIVE").option("columnNameOfCorruptRecord", "corrupted_data").json("dbfs:/FileStore/shared_uploads/ratecode/RateCodes.json")
# you can set a column name to hold the corrupted records

2) DROPMALFORMED mode -

It drops any row that contains corrupt data.

# here a whole row containing a corrupt record gets dropped
ratecodes_json_corruptedDF = spark.read.option("mode", "DROPMALFORMED").json("dbfs:/FileStore/shared_uploads/ratecode/RateCodes.json")
3) FAILFAST mode -

It does not allow the dataframe to be created if there is any corrupt record; the read fails immediately.

ratecodes_json_corruptedDF = spark.read.option("mode", "FAILFAST").json("dbfs:/FileStore/shared_uploads/ratecode/RateCodes.json")

Day- 9

Whenever you create a table in SQL in Databricks, every transaction produces a new version of the table: creating the table is version 0, and inserting the first row counts as one transaction, which becomes version 1.
Example:
create table emp
(
  eid int,
  ename string,
  eloc string
)

insert into emp values(1, 'Ankush', 'Amravati'); -- this is one transaction

INSERT INTO emp VALUES(2, 'SIMRAN', 'JALNA'); -- this is one transaction
INSERT INTO emp VALUES(3, 'TEJAL', 'PUNE'); -- this is one transaction

select * from emp version as of 2 -- gives the records of Ankush and Simran, because Ankush was the 1st transaction and Simran the 2nd

describe history emp; -- gives all version and transaction details of the SQL table
-- a select query on the table always shows the latest version (last version 4)
-- we can always query an older version of the table
-- these are called Time Travel queries (e.g. select * from emp version as of 2)
-- so even if data is deleted, it still shows up in older versions (the data is not actually deleted from the warehouse)
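
Time travel also works from PySpark; a minimal sketch using the emp table from the example above:

versionDF = spark.sql("SELECT * FROM emp VERSION AS OF 2")
versionDF.display() # shows the table as it was after the 2nd transaction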
Parquet files and dataframes

empDF1 = spark.read.parquet('dbfs:/FileStore/shared_uploads/pafiles/part-00000-0d4d89fd-cbdf-4b95-912e-535d02f30ae0-c000.snappy.parquet')
# if you want to create a dataframe from a parquet file, you need to copy that file to another folder outside the warehouse

empDF = spark.read.format("delta").load("dbfs:/user/hive/warehouse/akshaya.db/emp/")
# but if you want to create a DF directly from the parquet files stored in the warehouse, you need to use format("delta")
Saving a Dataframe as a Table:

empDF.write.option("header", "true").format("csv").saveAsTable("empcsv")
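
A quick check that the table exists (using the table name saved above):

spark.table("empcsv").display()
spark.sql("describe extended empcsv").display() # shows the table's schema, location and provider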

Day- 10

EVERY DATABASE GETS CREATED IN THE user/hive/warehouse LOCATION (by default)

1) Way to create a database/schema

CREATE SCHEMA IF NOT EXISTS nyctaxidb;

2) Way to create a database/schema with a storage location

create schema if not exists nyctaxi_custom_location
location 'dbfs:/FileStore/schema/nyctaxidb_custom_location.db'

3) Way to create a table

CREATE OR REPLACE TABLE managed_traxizones (LocId int, Borough string, zone string, service string)
Temporary View:

Temporary views are session-scoped, meaning they are only available for the duration of the session in which they are created.

# here we created a temporary view using a CSV file

CREATE OR REPLACE TEMPORARY VIEW RATECODES_VIEW USING CSV OPTIONS (
  PATH = 'dbfs:/FileStore/shared_uploads/ratecode/RateCodes1.csv',
  header = "true",
  mode = 'FAILFAST');

Creating a table at a specified location from the records of another table:

CREATE OR REPLACE TABLE EXTERNAL_RATECODES
LOCATION 'dbfs:/user/hive/warehouse/nyctaxidb/external_data_table_rates'
AS
SELECT * from RATECODES_VIEW;


Day- 11
1) Read data directly from csv files
select * from csv.`dbfs:/FileStore/NYCtaxidata/YellowTaxis_202211.csv’

2) Read data directly from JSON files


select * from
json.`dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/PaymentTypes.json’

3) Create Temp View directly from JSON files


create or replace temp view payment_view
as select * from
json.`dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/PaymentTypes.json`

4) Create Table from Temp View and save at location


create table paynmet_types_ext
location 'dbfs:/FileStore/NYCtaxidata/nyc_ext'
select * from payment_view

Day- 12

Deep Clone:

create or replace table emp_deep_clone
deep clone emp_tbl

In a deep clone, all files of the source are copied (metadata + transactions/data).

Shallow Clone:

create or replace table emp_shallow_clone
SHALLOW CLONE emp_tbl

In a shallow clone, only the metadata gets copied.

Deep Clone vs Shallow Clone:
- Deep clone is not dependent on the original; shallow clone remains dependent on the original's data files.
- Deep clone is slower and consumes more storage; shallow clone is faster and requires less storage.
- Deep clone is useful when you need an independent copy of the original; shallow clone is useful when you want a quick, cheap copy that still references the original's data files.

Day -13

Literal or lit()
The lit() function in Apache Spark's DataFrame API is short for "literal". It creates a Column with a constant literal value.

from pyspark.sql.functions import lit

yellowtaxi_DF = yellowtaxi_DF.withColumn("taxiType", lit("Yellow"))
# here we created a column taxiType with the constant value "Yellow"

Combining two dataframes:

combined_df = yellowtaxi_DF.unionAll(greentaxi_DF)
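
unionAll()/union() match columns by position, so both dataframes must have the same column order. If the column orders differ, unionByName() is an alternative (this note is an addition to the original):

combined_df = yellowtaxi_DF.unionByName(greentaxi_DF, allowMissingColumns=True) # matches columns by name; fills columns missing on one side with nulls (Spark 3.1+)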
Day -14
UDF (user defined function)

Here we have created one UDF in PySpark:

def convertCase(str):
    result = ""
    namewordsarray = str.split(",")
    for nameword in namewordsarray:
        result = (result
                  + nameword[0:1].upper()              # Ex- for the word 'ANKUSH', returns "A"
                  + nameword[1:len(nameword)].lower()  # Ex- for the word 'ANKUSH', returns "nkush"
                  + ",")
    result = result[0:len(result)-1]  # removes the trailing comma; Ex- for 'ANKUSH,SHIRBHATE', returns 'Ankush,Shirbhate'
    return result

Option 1: Register the function as a User Defined Function (UDF). This registration option is for using the UDF in Python/Scala.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

convertCaseUDF = udf(lambda str: convertCase(str), StringType()) # here we gave a name to the UDF and wrapped our Python function in it

Use the UDF in Dataframe code:

(cabsDF
 .select(
     "Name",
     convertCaseUDF("Name").alias("ConvertCase_Name")
 ))

Option 2: Register the function as a User Defined Function (UDF). This registration option is for using the UDF in SQL.

spark.udf.register("convertCaseSqlUdf", convertCase, StringType())

Use of the UDF in SQL:

spark.sql("""
    SELECT Name,
           convertCaseSqlUdf(Name) as Name_convertedCase
    FROM cabs_vw""")

# Catalyst optimizer
# Python UDF disadvantages
# 1 Python UDFs are not optimised by the Catalyst optimizer
# 2 Spark performance degrades if UDFs are used
# 3 Use UDFs rarely
# 4 SQL UDFs are another option to Python UDFs; they are optimised by the Catalyst optimizer, resulting in better/optimised performance of PySpark/Spark. SQL and Java UDFs are preferred in Databricks for that reason
# 5 We should not use Python UDFs on huge datasets or for complex logic, because they burden the Spark engine
Catalyst optimiser:

The Catalyst optimizer is a key component of Apache Spark's SQL engine. It is responsible for optimising the execution plan of Spark SQL queries for better performance.

1. Query Optimisation: Catalyst optimises query plans to minimise data processing by applying rules like predicate pushdown, constant folding, filter pushdown, and join reordering.

2. Logical Optimisation and Physical Optimisation:
   a. Logical - Catalyst transforms the query's abstract syntax tree into a logical plan and applies optimizations like expression simplification, operation combination, and query structure optimization.
   b. Physical - During physical optimization, Catalyst translates logical plans into one or more execution strategies using Spark's RDDs, considering factors such as data distribution, partitioning, and resources.

3. Cost-Based Optimization: Catalyst estimates the execution cost of query plans to select the most efficient plan, aiding the informed application of optimization rules and the choice of execution strategies.

4. Extensibility: Catalyst's extensibility allows developers to integrate custom optimization rules and strategies, enabling Spark SQL to adapt efficiently to diverse workloads and environments.

5. Performance: Catalyst optimises queries transparently, so users can focus on writing declarative SQL or DataFrame operations without needing to manually optimise query execution.

# To run another notebook inside the current notebook, the %run magic command is used, together with the absolute path of that notebook.
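
A minimal sketch (the notebook path here is hypothetical):

%run /Shared/utilities/common_functions_notebook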

An SQL UDF is better than a Python one; if you create an SQL UDF you don't need to register it.

# here we created a UDF
Create or replace function yelling(text string)
returns string
return concat(upper(text), "!!!")

# here we used the function

select yelling(food) from foods; -- select function_name(column_name) from table, i.e. yelling(food) from foods

# Python function UDF advantages

# 1 we write our own function
# 2 we can register it as a UDF
# 3 we can then use the function
# 4 we can write functions to solve complex tasks

# We cannot use an SQL UDF on another database; it can only be used on the current database. To use it on another database we create another SQL UDF in that database.

In order to use an SQL UDF, a user must have 'USAGE' and 'SELECT' permissions on the function.
Day -16
yellowtaxidf.createOrReplaceTempView("yellowtaxi_vw") # creating a temp view from a dataframe

Joining Dataframes:

joinedDF = (
    yellowtaxidf.join
    (
        taxizonedf,
        yellowtaxidf.PULocationID == taxizonedf.PULocationID, # [condition1, condition2]
        "inner" # left, leftouter, right, rightouter, full, etc.
    )
)

Dropping the duplicate column:

.drop(col("LocationID"))

drivers_listdf.count()        # note to self: check why this is not running
registered_driversdf.count()  # note to self: check why this generates an error
Day -18

Reading stream data

yellowstream = spark.readStream.option("header", "true").schema(yellowTaxiSchema).csv("dbfs:/FileStore/syparkdataset/yellowtaxi")
# this is how we read streaming data

count() can't be applied to streaming data

yellowstream.count() # this will throw an error because it is a streaming dataframe

Checking whether a dataframe is streaming or not

yellowtaxidf.isStreaming # shows whether the dataframe is a streaming DF or not

Why you can't save a stream to a table directly

yellowstream.writeStream.table("yellowsttable") # this will throw an error because, to save as a table, we need a checkpoint location for the streaming table (streaming data is generated continuously, so we create a checkpoint directory in order to save it)

Using a checkpoint to save the streaming table

yellowstream.writeStream.option("checkpointLocation", "dbfs:/FileStore/tables/yellowstreamtable/checkpoint1").table("yellowstream_table")
# now it will run continuously, writing the table and using the checkpoint location

-- Spark: Batch & Stream

-- APIs --> Spark Streaming (based on RDDs), Structured Streaming (based on Dataframes)
-- Delta Live Tables (DLT, based on Delta tables / Structured Streaming)

-- Latency => time delay
-- The more latency there is, the more time it takes to process the data
-- Latency should be as low as possible
-- Structured Streaming --> low latency
-- Spark Streaming --> had more latency

# What is Auto Loader --> Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup
# Auto Loader provides a Structured Streaming source called 'cloudFiles'
# our autoload helper function below takes four arguments
# 'cloudFiles' is the source format through which we read from data_source
# with Auto Loader the schema is inferred automatically (tracked via cloudFiles.schemaLocation), so we don't define it manually here
# Auto Loader also picks up the header automatically; we don't pass a header option to it

Here we are defining an autoloader:

def autoload_to_table(data_source, source_format, table_name, checkpoint_directory):
    query = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", source_format)
             .option("cloudFiles.schemaLocation", checkpoint_directory)
             .load(data_source)
             .writeStream
             .option("path", 'abfss://nyctaxicontainer@avdjulyacc.dfs.core.windows.net/nyctaxidata/yellowstreamtable')
             .option("checkpointLocation", checkpoint_directory)
             .option("mergeSchema", "true") # mergeSchema true means that whenever the schema changes, the changed schema is accepted and the data table is modified
             .table(table_name))
    return query

Here we are using that autoloader:

query = autoload_to_table(
    data_source = "abfss://nyctaxicontainer@avdjulyacc.dfs.core.windows.net/nyctaxidata/yellowdata/",
    source_format = "csv",
    table_name = "yellowtripstreamtable",
    checkpoint_directory = "abfss://nyctaxicontainer@avdjulyacc.dfs.core.windows.net/nyctaxidata/yellow_checkpoint1")

Creating a view from the streaming table:

(spark.readStream
 .table("yellowtripstreamtable")
 .createOrReplaceTempView("streaming_yellowtrip_vw"))

What is Unity Catalog?


Unity Catalog is a data governance and management tool for organizing and securing
data in Databricks, which is built on top of Apache Spark. It helps you keep track of your
data, who can access it, and how it's used, all in one place.

Key Features of Unity Catalog:

1. Centralized Data Management:


● Single Interface: Provides a single interface to manage all your data across
various data sources like cloud storage (e.g., AWS S3, Azure Blob Storage),
databases, and data lakes.
● Catalogs, Schemas, and Tables: Organizes your data into catalogs,
schemas, and tables, similar to how databases are structured. This makes
it easier to find and manage your data.
2. Access Control:
● Permissions: Allows you to set detailed permissions on who can view or
modify your data. This ensures that only authorized users can access
sensitive information.
● Role-Based Access: Supports role-based access control (RBAC) to manage
permissions efficiently.
3. Data Lineage:
● Tracking Data Flow: Keeps track of how data moves and changes over time,
which is useful for auditing and understanding the impact of data changes.
4. Data Governance:
● Compliance: Helps you comply with regulatory requirements by providing
tools for data masking, encryption, and auditing.
● Data Quality: Ensures data quality by tracking and managing metadata
(data about your data).

How Unity Catalog Works:

● Catalog: A top-level container that holds schemas and tables. Think of it as a big
folder that organizes your data projects.
● Schema: A collection of tables and views, similar to a folder within a catalog. It
helps organize related data.
● Table: A structured format to store your data, much like a spreadsheet or a
database table.

Example Scenario:
Imagine you have data stored in various places: some in AWS S3, some in Azure Blob
Storage, and some in a SQL database. Managing access to all this data separately can
be complex and time-consuming. With Unity Catalog, you can:

1. Centralize Access: Bring all your data together under one unified catalog.
2. Organize Data: Structure your data into catalogs, schemas, and tables for easy
navigation.
3. Control Permissions: Set who can access which parts of the data, ensuring
security and compliance.
4. Track Changes: Monitor how data flows and changes across your systems.
Simplified Example:
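
The notes don't include the commands for this example; a minimal sketch of what the steps could look like (the catalog, schema, table and group names are illustrative):

spark.sql("CREATE CATALOG IF NOT EXISTS sales_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales_catalog.transactions")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_catalog.transactions.orders (
        order_id INT,
        amount DOUBLE
    )""")
spark.sql("GRANT SELECT ON TABLE sales_catalog.transactions.orders TO `analysts`") # permission example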

With these steps, you've organized your data into a catalog and schema, making it
easier to manage and secure.
Medallion architecture:
(A lakehouse design that refines data through Bronze (raw), Silver (cleaned), and Gold (aggregated, business-ready) layers.)

What is Hadoop -

Hadoop is an open-source framework designed for distributed storage and processing of large data sets using a cluster of computers.

Core components -

1. HDFS - a distributed file system that stores data across multiple machines
2. MapReduce - a programming model and processing engine for parallel and distributed computing on large datasets. It involves two main steps: the Map step, where data is processed in parallel, and the Reduce step, where the results are aggregated.
3. YARN - a resource management layer that allows multiple data processing engines, including MapReduce, to share and efficiently utilize cluster resources.
What is Hive -
Hive is a data warehousing and SQL-like query language built on top of Hadoop.
Hive makes it easier for users familiar with SQL to analyze and process data using
Hadoop without requiring extensive knowledge of low-level MapReduce programming.

Key features of Hive include:


1. HiveQL: A SQL-like language used for querying and managing data stored in
Hadoop. Users can write queries in HiveQL, and Hive translates them into
MapReduce jobs to execute on the Hadoop cluster.
2. Metastore: Hive maintains a metadata repository called the Metastore, which
stores schema information, table metadata, and other details about the data.
3. Hive UDF (User-Defined Functions): Users can extend Hive's functionality by
creating custom functions in Java or other supported languages.

Datalake:
A data lake is a centralised repository that allows you to store and manage vast
amounts of raw data in its native format until it's needed.
Whereas a data warehouse stores only structured and processed data. The datalake can
store structured, semi-structured, unstructured data.

Characteristics:
1. Storage of Raw Data: A data lake stores data in its raw, unprocessed form. This
includes diverse data types such as text, images, videos, logs, sensor data, and
more.
2. Scalability: Data lakes are designed to scale horizontally, meaning they can handle
a massive amount of data by adding more storage capacity and processing power
as needed.
3. Flexibility: Data lakes support various data formats and structures.
4. Schema-on-Read: you can apply a schema to the stored data while reading and analysing it (traditional databases use a schema-on-write approach).
Data Lake vs Delta Lake:

Data Lake:
1. A centralised repository which stores large amounts of raw and diverse data.
2. Lacks ACID transactions.
3. Schema on read.

Delta Lake:
1. An open-source storage layer that brings ACID transactions (Atomicity, Consistency, Isolation, Durability) to Apache Spark and big data workloads. It is built on top of existing data lake file formats like Apache Parquet and Apache Avro.
2. Supports ACID transactions.
3. Schema enforcement and schema evolution (changing the schema over time on existing data).

Every Delta table stores its data as parquet files.


CRC stands for cyclic redundancy check. It is a checksum calculated from all the data in a file to ensure accuracy.

Delta Lake uses schema enforcement and schema evolution to maintain consistency in
data structures over time.
Isolation ensures that concurrent transactions do not interfere with each other. In Delta
Lake, isolation is achieved by leveraging the underlying transaction log.
Delta Lake achieves durability by storing transaction logs and data in a fault-tolerant
manner.
