Pyspark File Commands and Theory
File commands
Day -2 and 3
RDD - Resilient Distributed Datasets
RDD, which stands for Resilient Distributed Dataset, is a fundamental data structure in
Apache Spark. RDD represents an immutable, distributed collection of objects that can
be processed in parallel across a cluster of machines. RDDs provide fault tolerance and
parallel processing capabilities, making them a key abstraction in Spark for distributed
data processing.
#Properties of RDD
#1) immutable - can't modify
#2) Partitioned/Distributed
#3) Resilient - Fault tolerant
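As a minimal sketch of these three properties (assuming an active SparkContext `sc`, as in a Databricks notebook):
numbersRDD = sc.parallelize([1, 2, 3, 4, 5, 6], 3)   # data is split across 3 partitions -> partitioned/distributed
print(numbersRDD.getNumPartitions())                 # 3
doubledRDD = numbersRDD.map(lambda x: x * 2)         # transformations return a NEW RDD; the original is never modified (immutable)
print(doubledRDD.collect())                          # [2, 4, 6, 8, 10, 12]
# resilience comes from the lineage of transformations, which lets Spark recompute lost partitions on failure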
Day - 4 & 5
Creating a Dataframe
Command
1. From RDD
data = [ [1, "karan", 10000], [2, "Rick", 30000], [3, "Else", 240000], [4, "never", 23000]]
empolyeeRDD = sc.parallelize(data)
empDF = empolyeeRDD.toDF("Id: long, Name: string, salary: long")
data = [ [1, "karan", 10000], [2, "Rick", 30000], [3, "Else", 240000], [4, "never", 23000]]
empDF1 = spark.createDataFrame(data, "Id: long, Name: string, salary: long") #here we
created a dataframe from data collection and also pass the schema
A few more commands for creating a DataFrame by reading a file
a) # The first row is a header row, not actual data, so we tell the reader to treat it as a header:
yellowtaxiDF = spark.read.option("header", "true").csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/YellowTaxis_202211.csv")
# but here every column's data type is just string..
b) # To get proper data types we call the reader again, this time asking Spark to infer the schema:
yellowtaxiDF = spark.read.option("inferSchema", "true").option("header", "true").csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/YellowTaxis_202211.csv")
# inferSchema works well when the data is small
# inferSchema is not good when the data is huge,
# because Spark has to read every row and every column value to decide/define the data type for each column
# performance will degrade if inferSchema is used on a huge dataset
StructType
yellowTaxiSchema = StructType([...])   # the field list is not shown here; an example schema is sketched below
yellowtaxiDF = spark.read.option("header", "true").schema(yellowTaxiSchema).csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/TaxiBases.json") # here the schema is applied while creating the DataFrame
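As a minimal sketch, a StructType schema for the yellow taxi data could look like the block below; the column names and types are assumptions based on the columns used later in these notes, not the original definition.
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, TimestampType

yellowTaxiSchema = StructType([
    StructField("VendorID", IntegerType(), True),        # assumed column
    StructField("PickupTime", TimestampType(), True),    # assumed column
    StructField("DropTime", TimestampType(), True),      # assumed column
    StructField("passenger_count", IntegerType(), True),
    StructField("trip_distance", DoubleType(), True),
    StructField("PULocationID", IntegerType(), True)
])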
Creating a dataframe from multiline JSON file
#read multiline JSON file
taxiBasesDF= (
spark
.read
.option("multiline", "true")
.json("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/TaxiBases.json")
)
taxiBasesDF.display()
yellowtaxiDF.write.format("csv").option("compression", "gzip").save("/dbfs/FileStore/shared_uploads/output") # to compress the output file in the desired format
yellowtaxiDF.write.mode("overwrite").option("compression", "none").save("/dbfs/FileStore/shared_uploads/output_with_no_compression") # to save without compression
Day- 6
# describe - a function that can be called on numeric columns to analyse summary information about the data
mytaxiDF = yellowtaxiDF.describe("passenger_count", "trip_distance")
# (gives details like count, mean, stddev, min, max)
Clean Data
It consists of processes like the ones below.
Both where() and filter() work the same way; here we filter out the rows where the value is zero.
yellowtaxiDF = (
    yellowtaxiDF
        .where("passenger_count > 0")   # example conditions; the column names here are assumptions
        .filter("trip_distance > 0")
)
na.drop('all'):-
This method drops rows that have all null (missing) values across all specified columns.
If a row contains at least one non-null value in the specified columns, it will be retained.
It is useful when you want to remove only those rows where all specified columns have missing values.
na.drop():-
This method drops rows that have any null (missing) values in the specified columns.
If a row contains at least one null value in the specified columns, it will be removed.
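A minimal sketch of the difference (the subset column names are assumptions):
cleanedAnyDF = yellowtaxiDF.na.drop()                                               # drops rows with ANY null value
cleanedAllDF = yellowtaxiDF.na.drop('all', subset=["payment_type", "RatecodeID"])   # drops rows only when ALL listed columns are null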
# here we are setting default values for payment type and ratecode
yellowtaxiDF = yellowtaxiDF.na.fill(defaultValueMap)
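defaultValueMap itself is not shown in these notes; it is simply a dict of column name to default value, roughly like the sketch below (column names and defaults are assumptions):
defaultValueMap = {"payment_type": 5, "RatecodeID": 1}   # assumed columns and default values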
Transform data
1. Select limited columns
yellowtaxiDF.select("vendorID", "passenger_count").where("vendorID == 2").where("passenger_count > 5").show() # here we select two columns and then add conditions so we get filtered data
2. Rename columns
yellowtaxi_new_DF = yellowtaxiDF.withColumnRenamed("old_col_name", "new_col_name")
3. a) Create derived columns - TripYear, TripMonth, TripDay
from pyspark.sql.functions import col, year, month, dayofmonth

yellowtaxiDF33 = (
    yellowtaxiDF1
        .withColumn("TripYear", year(col("PickupTime")))
        .withColumn("TripMonth", month(col("PickupTime")))
        .withColumn("TripDay", dayofmonth(col("PickupTime")))
)
3. b) Create derived column - TripTimeInMinutes
from pyspark.sql.functions import round, unix_timestamp

yellowtaxiDF = (
    yellowtaxiDF
        .withColumn("TripTimeInMinutes",
            round(
                (unix_timestamp(col("DropTime"))
                 - unix_timestamp(col("PickupTime")))
                / 60
            ))
)
Day- 7
Read modes for handling corrupt records:
1. PERMISSIVE
2. DROPMALFORMED
3. FAILFAST
1) PERMISSIVE-
rc_corrupt_1_DF = spark.read.option("mode", "PERMISSIVE").option("columnNameOfCorruptRecord", "corrupted_data").json("dbfs:/FileStore/shared_uploads/ratecode/RateCodes.json") # you can set a column name for the corrupted data
2) DROPMALFORMED-
# here the whole row that contains a corrupt record is dropped
ratecodes_json_corruptedDF = spark.read.option("mode", "DROPMALFORMED").json("dbfs:/FileStore/shared_uploads/ratecode/RateCodes.json")
3) FAILFAST-
# FAILFAST mode will not allow creation of the DataFrame if there is any corrupt record
ratecodes_json_corruptedDF = spark.read.option("mode", "FAILFAST").json("dbfs:/FileStore/shared_uploads/ratecode/RateCodes.json")
Day- 9
Whenever you create a table in SQL in Databricks, every transaction produces a new version of the table: for example, inserting the first row counts as one transaction and becomes version 1 of the table.
Example:
create table emp
(
eid int,
ename string,
eloc string
)
select * from emp version as of 2 -- gives the records of ankush and simran, because ankush was the 1st transaction and simran the 2nd
describe history emp; -- gives all versions and transaction details of the SQL table
-- a plain select query on the table always shows the latest version (here, version 4)
-- we can always query an older version of the table
-- these are called Time Travel queries (e.g. select * from emp version as of 2)
-- so even if data is deleted, it still shows up in the older versions (the data is not actually deleted from the warehouse)
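The same time-travel queries can also be run from PySpark through spark.sql; a minimal sketch using the emp table above:
emp_v2_DF = spark.sql("SELECT * FROM emp VERSION AS OF 2")   # older version of the table as a DataFrame
emp_v2_DF.display()

historyDF = spark.sql("DESCRIBE HISTORY emp")                # versions and transaction details as a DataFrame
historyDF.display()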
Parquet file and dataframe
empDF1 = spark.read.parquet('dbfs:/FileStore/shared_uploads/pafiles/part-00000-0d4d89fd-cbdf-4b95-912e-535d02f30ae0-c000.snappy.parquet') # if you want to create a dataframe from a parquet file, you need to copy that file into another folder, apart from the warehouse
Day- 10
CREATE OR REPLACE TABLE managed_traxizones (LocId int, Borogh string, zone string, service string)
Temporary View:
Temporary views are session-scoped, meaning they are only available for the duration
of the session in which they are created.
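A minimal sketch, reusing the taxiBasesDF created earlier:
taxiBasesDF.createOrReplaceTempView("taxi_bases_vw")           # session-scoped view
spark.sql("SELECT * FROM taxi_bases_vw LIMIT 10").display()    # query it with SQL in the same session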
Creating a table directly on top of a CSV file (the table name below is illustrative):
CREATE TABLE ratecodes_csv_table
USING CSV
OPTIONS (
PATH = 'dbfs:/FileStore/shared_uploads/ratecode/RateCodes1.csv',
header = "true",
mode = 'FAILFAST');
Creating a table at a specified location from the records of another table (the table names below are illustrative):
CREATE TABLE external_data_table_rates
LOCATION 'dbfs:/user/hive/warehouse/nyctaxidb/external_data_table_rates'
AS
SELECT * FROM ratecodes_csv_table;
Day- 12
Deep Clone vs Shallow Clone:
Deep Clone - useful when you need a fully independent copy of the original table; both the data and the metadata are copied.
Shallow Clone - copies only the metadata and keeps referencing the original table's data files, so it is quick and cheap but still depends on the original.
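A hedged sketch of both clone types, using the emp table from the Day 9 example (the clone table names are made up):
spark.sql("CREATE OR REPLACE TABLE emp_deep_clone DEEP CLONE emp")        # independent copy of data + metadata
spark.sql("CREATE OR REPLACE TABLE emp_shallow_clone SHALLOW CLONE emp")  # metadata only; data files are referenced from emp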
Day -13
Literal or lit()
The lit() function in Apache Spark's DataFrame API is short for "literal". It's a function that
creates a Column with a constant literal value.
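For example, lit() can be used to add a constant column to a DataFrame (the column name here is just an example):
from pyspark.sql.functions import lit

yellowtaxiDF = yellowtaxiDF.withColumn("DataSource", lit("YellowTaxi"))   # every row gets the same literal value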
(cabsDF
    .select(
        "Name",
        convertCaseUDF("Name").alias("ConvertCase_Name")
    ))
spark.sql("""
SELECT Name,
convertCaseSqlUdf(Name) as Name_convertedCase
From cabs_vw""")
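The definitions of convertCaseUDF and convertCaseSqlUdf are not shown above; a possible sketch (the actual case-conversion logic is an assumption) looks like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# plain Python function (title-case is assumed here)
def convert_case(text):
    return text.title() if text is not None else None

convertCaseUDF = udf(convert_case, StringType())                                         # UDF for the DataFrame API
convertCaseSqlUdf = spark.udf.register("convertCaseSqlUdf", convert_case, StringType())  # registered for spark.sql queries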
#Catalyst optimizer
#Python UDF's disadvantages
#1 Python UDFs are not optimised by catalyst optimizer
#2 Performance of spark degrades if UDFs are used
#3 Use UDFs rarely
#4 SQL UDFs are another option to Python UDFs; they are optimised by the Catalyst optimizer, resulting in better/optimised performance of pyspark/spark. SQL (and Java) UDFs are preferred in Databricks for that reason.
#5 We should not use Python UDFs on huge datasets or for complex logic because they slow down the Spark engine.
Catalyst optimiser:
The Catalyst optimizer is a key component of Apache Spark's SQL engine. It's responsible for
optimising the execution plan of Spark SQL queries for better performance.
1. Query Optimisation: Catalyst optimises query plans to minimise data processing by
applying rules like predicate pushdown, constant folding, filter pushdown, and join
reordering.
# To run another notebook inside the current notebook, the %run magic command is used with the absolute path of that notebook.
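For example (the notebook path below is made up; %run must be the only command in its cell):
%run /Workspace/Shared/common_functions_notebook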
A SQL UDF is better than a Python UDF; if you create a SQL UDF you don't need to register it.
# here we create a SQL UDF
Create or replace function yelling(text string)
returns string
return concat(upper(text), "!!!")
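Using the SQL UDF (no registration step needed), for example from PySpark:
spark.sql("SELECT yelling('hello spark') AS shout").display()   # returns HELLO SPARK!!!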
# A SQL UDF belongs to the database it was created in; to use it in another database we create another SQL UDF for that database.
In order to use SQL UDF, a user must have 'USAGE' and 'SELECT' permissions on the
function.
Day -16
yellowtaxidf.createOrReplaceTempView("yellowtaxi_vw") #creating a temp view of dataframe
Joining a Dataframe:
joinedDF = (
    yellowtaxidf.join(
        taxizonedf,
        yellowtaxidf.PULocationID == taxizonedf.PULocationID   # multiple conditions can be passed as a list: [condition1, condition2]
    )
)
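An optional third argument sets the join type; a short sketch (the join type shown is an example):
joinedLeftDF = (
    yellowtaxidf.join(
        taxizonedf,
        yellowtaxidf.PULocationID == taxizonedf.PULocationID,
        "left"                                                  # inner (default), left, right, full, ...
    )
)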
- Latency => time delay
-- The higher the latency, the more time it takes to process the data
-- Latency should be as low as possible
-- Structured Streaming --> low latency
-- Spark Streaming --> had higher latency
# What is Auto Loader --> Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage, without any additional setup
# Auto Loader provides a Structured Streaming source called 'cloudFiles'
# The Auto Loader call takes four arguments
# cloudFiles is the source through which we read from data_source
# With Auto Loader we do not define the schema manually; it is inferred automatically
# Auto Loader automatically treats the header as "true"; we do not set that option ourselves
.option("path",'abfss://nyctaxicontainer@avdjulyacc.dfs.core.windows.net/nyctaxidata/yellowstr
eamtable')
.option("checkpointLocation", checkpoint_directory)
.option("mergeSchema", "true") #mergeSchema true is used whenever schema
changes it will accept the changed schema and it will modify data table
.table(table_name))
return query
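Putting the fragment above into a complete picture, a minimal Auto Loader sketch could look like the block below; the source path, file format, table name and checkpoint location are assumptions, not the original notebook code.
def start_yellow_taxi_stream(data_source, table_name, checkpoint_directory):
    # incremental read with Auto Loader (the 'cloudFiles' source)
    streamDF = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")                          # assumed file format
        .option("cloudFiles.schemaLocation", checkpoint_directory)   # where the inferred schema is tracked
        .load(data_source))

    # write the stream out to a Delta table
    query = (streamDF.writeStream
        .option("checkpointLocation", checkpoint_directory)
        .option("mergeSchema", "true")                               # accept schema changes
        .toTable(table_name))                                        # .toTable in open-source PySpark; the notes used .table
    return query

query = start_yellow_taxi_stream(
    "abfss://nyctaxicontainer@avdjulyacc.dfs.core.windows.net/nyctaxidata/",   # example source folder
    "yellowstreamtable",
    "dbfs:/FileStore/checkpoints/yellowstreamtable")                           # example checkpoint location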
● Catalog: A top-level container that holds schemas and tables. Think of it as a big
folder that organizes your data projects.
● Schema: A collection of tables and views, similar to a folder within a catalog. It
helps organize related data.
● Table: A structured format to store your data, much like a spreadsheet or a
database table.
Example Scenario:
Imagine you have data stored in various places: some in AWS S3, some in Azure Blob
Storage, and some in a SQL database. Managing access to all this data separately can
be complex and time-consuming. With Unity Catalog, you can:
1. Centralize Access: Bring all your data together under one unified catalog.
2. Organize Data: Structure your data into catalogs, schemas, and tables for easy
navigation.
3. Control Permissions: Set who can access which parts of the data, ensuring
security and compliance.
4. Track Changes: Monitor how data flows and changes across your systems.
Simplified Example:
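A hedged sketch of such a setup, run from a notebook with Unity Catalog enabled (the catalog, schema, table and group names are made up):
spark.sql("CREATE CATALOG IF NOT EXISTS sales_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales_catalog.retail_schema")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_catalog.retail_schema.orders (
        order_id INT,
        amount DOUBLE
    )
""")
spark.sql("GRANT SELECT ON TABLE sales_catalog.retail_schema.orders TO `data_analysts`")   # permission example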
With these steps, you've organized your data into a catalog and schema, making it
easier to manage and secure.
Medallion architecture:
A layered design for a lakehouse that organises data into Bronze (raw, as ingested), Silver (cleaned and validated) and Gold (aggregated, business-level) tables, with data quality improving at each layer.
What is Hadoop-
Hadoop is an open-source framework for storing and processing very large datasets in a distributed way across clusters of machines.
Core components-
1. HDFS - a distributed file system that stores data across multiple machines
2. MapReduce - a programming model and processing engine for parallel and distributed computing on large datasets. It involves two main steps: the Map step, where data is processed in parallel, and the Reduce step, where the results are aggregated.
3. YARN - a resource management layer that allows multiple data processing engines, including MapReduce, to share and efficiently utilize cluster resources.
What is Hive -
Hive is a data warehousing and SQL-like query language built on top of Hadoop.
Hive makes it easier for users familiar with SQL to analyze and process data using
Hadoop without requiring extensive knowledge of low-level MapReduce programming.
Datalake:
A data lake is a centralised repository that allows you to store and manage vast
amounts of raw data in its native format until it's needed.
Whereas a data warehouse stores only structured, processed data, a data lake can store structured, semi-structured and unstructured data.
Characteristics:
1. Storage of Raw Data: A data lake stores data in its raw, unprocessed form. This
includes diverse data types such as text, images, videos, logs, sensor data, and
more.
2. Scalability: Data lakes are designed to scale horizontally, meaning they can handle
a massive amount of data by adding more storage capacity and processing power
as needed.
3. Flexibility: Data lakes support various data formats and structures.
4. Schema-on-Read: you can apply a schema to the stored data while reading and analysing it (traditional databases use a Schema-on-Write approach).
Data Lake vs Delta Lake:
Delta Lake uses schema enforcement and schema evolution to maintain consistency in
data structures over time.
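A small sketch of what that means in practice, reusing the emp table from the Day 9 example (the extra column is made up):
new_df = spark.createDataFrame([(5, "asha", "pune", "IT")], "eid int, ename string, eloc string, edept string")

# schema enforcement: appending a DataFrame with an extra column fails with an AnalysisException
# new_df.write.format("delta").mode("append").saveAsTable("emp")

# schema evolution: explicitly allow the new column and let Delta update the table schema
new_df.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable("emp")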
Isolation ensures that concurrent transactions do not interfere with each other. In Delta
Lake, isolation is achieved by leveraging the underlying transaction log.
Delta Lake achieves durability by storing transaction logs and data in a fault-tolerant
manner.