Pyspark File Commands and Theory
File commands
Day -2 and 3
RDD - Resilient Distributed Datasets
RDD, which stands for Resilient Distributed Dataset, is a fundamental data structure in
Apache Spark. RDD represents an immutable, distributed collection of objects that can
be processed in parallel across a cluster of machines. RDDs provide fault tolerance and
parallel processing capabilities, making them a key abstraction in Spark for distributed
data processing.
#Properties of RDD
#1) immutable - can't modify
#2) Partitioned/Distributed
#3) Resilient - Fault tolerant
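As a minimal sketch of these three properties (assuming an active SparkContext `sc`, as in a Databricks notebook):
numbersRDD = sc.parallelize([1, 2, 3, 4, 5, 6], 3)   # data is split across 3 partitions -> partitioned/distributed
print(numbersRDD.getNumPartitions())                 # 3
doubledRDD = numbersRDD.map(lambda x: x * 2)         # transformations return a NEW RDD; the original is never modified (immutable)
print(doubledRDD.collect())                          # [2, 4, 6, 8, 10, 12]
# resilience comes from the lineage of transformations, which lets Spark recompute lost partitions on failure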
Day - 4 & 5
Creating a Dataframe
Command
1. From RDD
data = [ [1, "karan", 10000], [2, "Rick", 30000], [3, "Else", 240000], [4, "never", 23000]]
empolyeeRDD = sc.parallelize(data)
empDF = empolyeeRDD.toDF("Id: long, Name: string, salary: long")
data = [ [1, "karan", 10000], [2, "Rick", 30000], [3, "Else", 240000], [4, "never", 23000]]
empDF1 = spark.createDataFrame(data, "Id: long, Name: string, salary: long") #here we
created a dataframe from data collection and also pass the schema
A few more commands for creating a DataFrame by reading a file
a) # The first row is a header row, not actual data, so we tell the reader to treat it as a header:
yellowtaxiDF = spark.read.option("header", "true").csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/YellowTaxis_202211.csv")
# but here every column's data type is just string..
b) # To get proper data types we call the reader again, this time asking Spark to infer the schema:
yellowtaxiDF = spark.read.option("inferSchema", "true").option("header", "true").csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/YellowTaxis_202211.csv")
# inferSchema works well when the data is small
# inferSchema is not good when the data is huge,
# because Spark has to read every row and every column value to decide/define the data type for each column
# performance will degrade if inferSchema is used on a huge dataset
StructType
yellowTaxiSchema = StructType([...])   # the field list is not shown here; an example schema is sketched below
yellowtaxiDF = spark.read.option("header", "true").schema(yellowTaxiSchema).csv("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/TaxiBases.json") # here the schema is applied while creating the DataFrame
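As a minimal sketch, a StructType schema for the yellow taxi data could look like the block below; the column names and types are assumptions based on the columns used later in these notes, not the original definition.
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, TimestampType

yellowTaxiSchema = StructType([
    StructField("VendorID", IntegerType(), True),        # assumed column
    StructField("PickupTime", TimestampType(), True),    # assumed column
    StructField("DropTime", TimestampType(), True),      # assumed column
    StructField("passenger_count", IntegerType(), True),
    StructField("trip_distance", DoubleType(), True),
    StructField("PULocationID", IntegerType(), True)
])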
Creating a dataframe from multiline JSON file
#read multiline JSON file
taxiBasesDF= (
spark
.read
.option("multiline", "true")
.json("dbfs:/FileStore/shared_uploads/karangole7074@gmail.com/TaxiBases.json")
)
taxiBasesDF.display()
yellowtaxiDF.write.format("csv").option("compression", "gzip").save("/dbfs/FileStore/shared_uploads/output") # to compress the output file in the desired format
yellowtaxiDF.write.mode("overwrite").option("compression", "none").save("/dbfs/FileStore/shared_uploads/output_with_no_compression") # to save without compression
Day- 6
# describe - a function that can be called on numeric columns to analyse summary information about the data
mytaxiDF = yellowtaxiDF.describe("passenger_count", "trip_distance")
# (gives details like count, mean, stddev, min, max)
Clean Data
It consists of processes like the ones below.
Both where() and filter() work the same way; here we filter out the rows where the value is zero.
yellowtaxiDF = (
    yellowtaxiDF
        .where("passenger_count > 0")   # example conditions; the column names here are assumptions
        .filter("trip_distance > 0")
)
na.drop('all'):-
This method drops rows that have all null (missing) values across all specified columns.
If a row contains at least one non-null value in the specified columns, it will be retained.
It is useful when you want to remove only those rows where all specified columns have missing values.
na.drop():-
This method drops rows that have any null (missing) values in the specified columns.
If a row contains at least one null value in the specified columns, it will be removed.
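A minimal sketch of the difference (the subset column names are assumptions):
cleanedAnyDF = yellowtaxiDF.na.drop()                                               # drops rows with ANY null value
cleanedAllDF = yellowtaxiDF.na.drop('all', subset=["payment_type", "RatecodeID"])   # drops rows only when ALL listed columns are null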
# here we are setting default values for payment type and ratecode
yellowtaxiDF = yellowtaxiDF.na.fill(defaultValueMap)
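defaultValueMap itself is not shown in these notes; it is simply a dict of column name to default value, roughly like the sketch below (column names and defaults are assumptions):
defaultValueMap = {"payment_type": 5, "RatecodeID": 1}   # assumed columns and default values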
Transform data
1. Select limited columns
yellowtaxiDF.select("vendorID", "passenger_count").where("vendorID == 2").where("passenger_count > 5").show() # here we select two columns and then add conditions so we get filtered data
2. Rename columns
yellowtaxi_new_DF = yellowtaxiDF.withColumnRenamed("old_col_name", "new_col_name")
3. a) Create derived columns - TripYear, TripMonth, TripDay
from pyspark.sql.functions import col, year, month, dayofmonth

yellowtaxiDF33 = (
    yellowtaxiDF1
        .withColumn("TripYear", year(col("PickupTime")))
        .withColumn("TripMonth", month(col("PickupTime")))
        .withColumn("TripDay", dayofmonth(col("PickupTime")))
)
3. b) Create derived column - TripTimeInMinutes
from pyspark.sql.functions import round, unix_timestamp

yellowtaxiDF = (
    yellowtaxiDF
        .withColumn("TripTimeInMinutes",
            round(
                (unix_timestamp(col("DropTime"))
                 - unix_timestamp(col("PickupTime")))
                / 60
            ))
)
Day- 7
Read modes for handling corrupt records:
1. PERMISSIVE
2. DROPMALFORMED
3. FAILFAST
1) PERMISSIVE-
rc_corrupt_1_DF = spark.read.option("mode", "PERMISSIVE").option("columnNameOfCorruptRecord", "corrupted_data").json("dbfs:/FileStore/shared_uploads/ratecode/RateCodes.json") # you can set a column name for the corrupted data
2) DROPMALFORMED-
# here the whole row that contains a corrupt record is dropped
ratecodes_json_corruptedDF = spark.read.option("mode", "DROPMALFORMED").json("dbfs:/FileStore/shared_uploads/ratecode/RateCodes.json")
3) FAILFAST-
# FAILFAST mode will not allow creation of the DataFrame if there is any corrupt record
ratecodes_json_corruptedDF = spark.read.option("mode", "FAILFAST").json("dbfs:/FileStore/shared_uploads/ratecode/RateCodes.json")
Day- 9
Whenever you create a table in SQL in Databricks, every transaction produces a new version of the table: for example, inserting the first row counts as one transaction and becomes version 1 of the table.
Example:
create table emp
(
eid int,
ename string,
eloc string
)
select * from emp version as of 2 -- gives the records of ankush and simran, because ankush was the 1st transaction and simran the 2nd
describe history emp; -- gives all versions and transaction details of the SQL table
-- a plain select query on the table always shows the latest version (here, version 4)
-- we can always query an older version of the table
-- these are called Time Travel queries (e.g. select * from emp version as of 2)
-- so even if data is deleted, it still shows up in the older versions (the data is not actually deleted from the warehouse)
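The same time-travel queries can also be run from PySpark through spark.sql; a minimal sketch using the emp table above:
emp_v2_DF = spark.sql("SELECT * FROM emp VERSION AS OF 2")   # older version of the table as a DataFrame
emp_v2_DF.display()

historyDF = spark.sql("DESCRIBE HISTORY emp")                # versions and transaction details as a DataFrame
historyDF.display()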
Parquet file and dataframe
empDF1 = spark.read.parquet('dbfs:/FileStore/shared_uploads/pafiles/part-00000-0d4d89fd-cbdf-4b95-912e-535d02f30ae0-c000.snappy.parquet') # if you want to create a dataframe from a parquet file, you need to copy that file into another folder, apart from the warehouse
Day- 10
CREATE OR REPLACE TABLE managed_traxizones (LocId int, Borogh string, zone string, service string)
Temporary View:
Temporary views are session-scoped, meaning they are only available for the duration
of the session in which they are created.
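A minimal sketch, reusing the taxiBasesDF created earlier:
taxiBasesDF.createOrReplaceTempView("taxi_bases_vw")           # session-scoped view
spark.sql("SELECT * FROM taxi_bases_vw LIMIT 10").display()    # query it with SQL in the same session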
Creating a table directly on top of a CSV file (the table name below is illustrative):
CREATE TABLE ratecodes_csv_table
USING CSV
OPTIONS (
PATH = 'dbfs:/FileStore/shared_uploads/ratecode/RateCodes1.csv',
header = "true",
mode = 'FAILFAST');
Creating a table at a specified location from the records of another table (the table names below are illustrative):
CREATE TABLE external_data_table_rates
LOCATION 'dbfs:/user/hive/warehouse/nyctaxidb/external_data_table_rates'
AS
SELECT * FROM ratecodes_csv_table;
Day- 12
Deep Clone vs Shallow Clone:
Deep Clone - useful when you need a fully independent copy of the original table; both the data and the metadata are copied.
Shallow Clone - copies only the metadata and keeps referencing the original table's data files, so it is quick and cheap but still depends on the original.
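A hedged sketch of both clone types, using the emp table from the Day 9 example (the clone table names are made up):
spark.sql("CREATE OR REPLACE TABLE emp_deep_clone DEEP CLONE emp")        # independent copy of data + metadata
spark.sql("CREATE OR REPLACE TABLE emp_shallow_clone SHALLOW CLONE emp")  # metadata only; data files are referenced from emp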
Day -13
Literal or lit()
The lit() function in Apache Spark's DataFrame API is short for "literal". It's a function that
creates a Column with a constant literal value.
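For example, lit() can be used to add a constant column to a DataFrame (the column name here is just an example):
from pyspark.sql.functions import lit

yellowtaxiDF = yellowtaxiDF.withColumn("DataSource", lit("YellowTaxi"))   # every row gets the same literal value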
(cabsDF
    .select(
        "Name",
        convertCaseUDF("Name").alias("ConvertCase_Name")
    ))
spark.sql("""
SELECT Name,
convertCaseSqlUdf(Name) as Name_convertedCase
From cabs_vw""")
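The definitions of convertCaseUDF and convertCaseSqlUdf are not shown above; a possible sketch (the actual case-conversion logic is an assumption) looks like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# plain Python function (title-case is assumed here)
def convert_case(text):
    return text.title() if text is not None else None

convertCaseUDF = udf(convert_case, StringType())                                         # UDF for the DataFrame API
convertCaseSqlUdf = spark.udf.register("convertCaseSqlUdf", convert_case, StringType())  # registered for spark.sql queries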
#Catalyst optimizer
#Python UDF's disadvantages
#1 Python UDFs are not optimised by catalyst optimizer
#2 Performance of spark degrades if UDFs are used
#3 Use UDFs rarely
#4 SQL UDFs are another option to Python UDFs; they are optimised by the Catalyst optimizer, resulting in better/optimised performance of pyspark/spark. SQL (and Java) UDFs are preferred in Databricks for that reason.
#5 We should not use Python UDFs on huge datasets or for complex logic because they slow down the Spark engine.
Catalyst optimiser:
The Catalyst optimizer is a key component of Apache Spark's SQL engine. It's responsible for
optimising the execution plan of Spark SQL queries for better performance.
1. Query Optimisation: Catalyst optimises query plans to minimise data processing by
applying rules like predicate pushdown, constant folding, filter pushdown, and join
reordering.
# To run another notebook inside the current notebook, the %run magic command is used with the absolute path of that notebook.
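For example (the notebook path below is made up; %run must be the only command in its cell):
%run /Workspace/Shared/common_functions_notebook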
A SQL UDF is better than a Python UDF; if you create a SQL UDF you don't need to register it.
# here we create a SQL UDF
Create or replace function yelling(text string)
returns string
return concat(upper(text), "!!!")
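Using the SQL UDF (no registration step needed), for example from PySpark:
spark.sql("SELECT yelling('hello spark') AS shout").display()   # returns HELLO SPARK!!!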
# A SQL UDF belongs to the database it was created in; to use it in another database we create another SQL UDF for that database.
In order to use SQL UDF, a user must have 'USAGE' and 'SELECT' permissions on the
function.
Day -16
yellowtaxidf.createOrReplaceTempView("yellowtaxi_vw") #creating a temp view of dataframe
Joining a Dataframe:
joinedDF = (
    yellowtaxidf.join(
        taxizonedf,
        yellowtaxidf.PULocationID == taxizonedf.PULocationID   # multiple conditions can be passed as a list: [condition1, condition2]
    )
)
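An optional third argument sets the join type; a short sketch (the join type shown is an example):
joinedLeftDF = (
    yellowtaxidf.join(
        taxizonedf,
        yellowtaxidf.PULocationID == taxizonedf.PULocationID,
        "left"                                                  # inner (default), left, right, full, ...
    )
)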
- Latency => time delay
-- The higher the latency, the more time it takes to process the data
-- Latency should be as low as possible
-- Structured Streaming --> low latency
-- Spark Streaming --> had higher latency
# What is Auto Loader --> Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage, without any additional setup
# Auto Loader provides a Structured Streaming source called 'cloudFiles'
# The Auto Loader call takes four arguments
# cloudFiles is the source through which we read from data_source
# With Auto Loader we do not define the schema manually; it is inferred automatically
# Auto Loader automatically treats the header as "true"; we do not set that option ourselves
.option("path",'abfss://nyctaxicontainer@avdjulyacc.dfs.core.windows.net/nyctaxidata/yellowstr
eamtable')
.option("checkpointLocation", checkpoint_directory)
.option("mergeSchema", "true") #mergeSchema true is used whenever schema
changes it will accept the changed schema and it will modify data table
.table(table_name))
return query
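Putting the fragment above into a complete picture, a minimal Auto Loader sketch could look like the block below; the source path, file format, table name and checkpoint location are assumptions, not the original notebook code.
def start_yellow_taxi_stream(data_source, table_name, checkpoint_directory):
    # incremental read with Auto Loader (the 'cloudFiles' source)
    streamDF = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")                          # assumed file format
        .option("cloudFiles.schemaLocation", checkpoint_directory)   # where the inferred schema is tracked
        .load(data_source))

    # write the stream out to a Delta table
    query = (streamDF.writeStream
        .option("checkpointLocation", checkpoint_directory)
        .option("mergeSchema", "true")                               # accept schema changes
        .toTable(table_name))                                        # .toTable in open-source PySpark; the notes used .table
    return query

query = start_yellow_taxi_stream(
    "abfss://nyctaxicontainer@avdjulyacc.dfs.core.windows.net/nyctaxidata/",   # example source folder
    "yellowstreamtable",
    "dbfs:/FileStore/checkpoints/yellowstreamtable")                           # example checkpoint location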
● Catalog: A top-level container that holds schemas and tables. Think of it as a big
folder that organizes your data projects.
● Schema: A collection of tables and views, similar to a folder within a catalog. It
helps organize related data.
● Table: A structured format to store your data, much like a spreadsheet or a
database table.
Example Scenario:
Imagine you have data stored in various places: some in AWS S3, some in Azure Blob
Storage, and some in a SQL database. Managing access to all this data separately can
be complex and time-consuming. With Unity Catalog, you can:
1. Centralize Access: Bring all your data together under one unified catalog.
2. Organize Data: Structure your data into catalogs, schemas, and tables for easy
navigation.
3. Control Permissions: Set who can access which parts of the data, ensuring
security and compliance.
4. Track Changes: Monitor how data flows and changes across your systems.
Simplified Example:
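A hedged sketch of such a setup, run from a notebook with Unity Catalog enabled (the catalog, schema, table and group names are made up):
spark.sql("CREATE CATALOG IF NOT EXISTS sales_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales_catalog.retail_schema")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_catalog.retail_schema.orders (
        order_id INT,
        amount DOUBLE
    )
""")
spark.sql("GRANT SELECT ON TABLE sales_catalog.retail_schema.orders TO `data_analysts`")   # permission example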
With these steps, you've organized your data into a catalog and schema, making it
easier to manage and secure.
Medallion architecture:
A layered design for a lakehouse that organises data into Bronze (raw, as ingested), Silver (cleaned and validated) and Gold (aggregated, business-level) tables, with data quality improving at each layer.
What is Hadoop-
Hadoop is an open-source framework for storing and processing very large datasets in a distributed way across clusters of machines.
Core components-
1. HDFS - a distributed file system that stores data across multiple machines
2. MapReduce - a programming model and processing engine for parallel and distributed computing on large datasets. It involves two main steps: the Map step, where data is processed in parallel, and the Reduce step, where the results are aggregated.
3. YARN - a resource management layer that allows multiple data processing engines, including MapReduce, to share and efficiently utilize cluster resources.
What is Hive -
Hive is a data warehousing and SQL-like query language built on top of Hadoop.
Hive makes it easier for users familiar with SQL to analyze and process data using
Hadoop without requiring extensive knowledge of low-level MapReduce programming.
Datalake:
A data lake is a centralised repository that allows you to store and manage vast
amounts of raw data in its native format until it's needed.
Whereas a data warehouse stores only structured, processed data, a data lake can store structured, semi-structured and unstructured data.
Characteristics:
1. Storage of Raw Data: A data lake stores data in its raw, unprocessed form. This
includes diverse data types such as text, images, videos, logs, sensor data, and
more.
2. Scalability: Data lakes are designed to scale horizontally, meaning they can handle
a massive amount of data by adding more storage capacity and processing power
as needed.
3. Flexibility: Data lakes support various data formats and structures.
4. Schema-on-Read: you can apply a schema to the stored data while reading and analysing it (traditional databases use a Schema-on-Write approach).
Data Lake vs Delta Lake:
Delta Lake uses schema enforcement and schema evolution to maintain consistency in
data structures over time.
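A small sketch of what that means in practice, reusing the emp table from the Day 9 example (the extra column is made up):
new_df = spark.createDataFrame([(5, "asha", "pune", "IT")], "eid int, ename string, eloc string, edept string")

# schema enforcement: appending a DataFrame with an extra column fails with an AnalysisException
# new_df.write.format("delta").mode("append").saveAsTable("emp")

# schema evolution: explicitly allow the new column and let Delta update the table schema
new_df.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable("emp")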
Isolation ensures that concurrent transactions do not interfere with each other. In Delta
Lake, isolation is achieved by leveraging the underlying transaction log.
Delta Lake achieves durability by storing transaction logs and data in a fault-tolerant
manner.