Shwetank Singh
GritSetGrow - [Link]
DATA AND AI
EVERYTHING
SPARK
Spark Cheat Sheet
Spark Initialization in Scala

SparkContext
import org.apache.spark.SparkContext
val sc = new SparkContext("local[*]", "app1")

SparkSession
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
val sparkConf = new SparkConf()
sparkConf.set("spark.app.name", "my first app")
sparkConf.set("spark.master", "local[2]")
val spark = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()
Read files in Scala
val ordersDf = spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .option("path", "C:/Users/Lenovo/Documents/BIG DATA/WEEK11/[Link]")
  .load()
ordersDf.show()

Read files in Python
df = spark.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .option("sep", ",") \
  .option("path", "/FileStore/tables/Employees-[Link]") \
  .load()
display(df)
Read Modes in Scala
val ordersDf = spark.read
  .format("csv")
  .option("header", true)
  .option("mode", "FAILFAST")
  .option("inferSchema", true)
  .option("path", "C:/Users/Lenovo/Documents/BIG DATA/WEEK11/[Link]")
  .load()

Read Modes in Python
df = spark.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .option("mode", "FAILFAST") \
  .option("sep", ",") \
  .option("path", "/FileStore/tables/Employees-[Link]") \
  .load()
display(df)

PERMISSIVE
Sets all fields to null when it encounters a corrupted record and places the corrupted record in a string column called _corrupt_record.
DROPMALFORMED
Drops the rows that contain malformed records.
FAILFAST
Fails immediately upon encountering malformed records.
The default is PERMISSIVE.
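A minimal Scala sketch of the PERMISSIVE behaviour described above, assuming a hypothetical CSV at /tmp/orders.csv and a schema that explicitly declares the _corrupt_record column (the path and column names are placeholders, not from this sheet):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("readModes").getOrCreate()

// PERMISSIVE (the default) keeps every row: fields that cannot be parsed become null
// and the raw text of the bad row is stored in the _corrupt_record column.
val permissiveDf = spark.read
  .format("csv")
  .option("header", true)
  .option("mode", "PERMISSIVE")
  .schema("orderid INT, orderdate STRING, custid INT, orderstatus STRING, _corrupt_record STRING")
  .option("path", "/tmp/orders.csv") // hypothetical path
  .load()

permissiveDf.show(false) // corrupted rows show nulls plus their raw text in _corrupt_record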
Write to Sink in Scala
import org.apache.spark.sql.SaveMode

ordersDf.write
  .format("json") // default format is parquet if not specified
  .mode(SaveMode.Overwrite) // 4 modes: Append, Overwrite, ErrorIfExists, Ignore
  .option("path", "C:/Users/Lenovo/Documents/BIG DATA/WEEK11/newfolder")
  .save()

Write to Sink in Python
df.write.format("csv") \
  .mode("overwrite") \
  .csv('/FileStore/tables_output/[Link]')

The default save mode is ErrorIfExists.
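A short Scala sketch of the four save modes (ordersDf and the output path are placeholders; because no .format() is set, Spark writes Parquet, the default format):

import org.apache.spark.sql.SaveMode

ordersDf.write.mode(SaveMode.Append).save("/tmp/orders_out")        // add new files next to existing data
ordersDf.write.mode(SaveMode.Overwrite).save("/tmp/orders_out")     // replace whatever is already there
ordersDf.write.mode(SaveMode.Ignore).save("/tmp/orders_out")        // silently skip the write if data exists
ordersDf.write.mode(SaveMode.ErrorIfExists).save("/tmp/orders_out") // default: fail if the path already exists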
Impose Schema in Scala (StructType)
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.StringType

val ordersSchema = StructType(List(
  StructField("orderid", IntegerType),
  StructField("orderdate", TimestampType),
  StructField("customerid", IntegerType),
  StructField("status", StringType)
))

val ordersDf = spark.read
  .format("csv")
  .schema(ordersSchema)
  .option("path", "C:/Users/Lenovo/Documents/BIG DATA/WEEK11/[Link]")
  .load()
ordersDf.show()
ordersDf.printSchema()

Impose Schema in Python (StructType)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

empSchema = StructType([
  StructField("empid", IntegerType()),
  StructField("empname", StringType()),
  StructField("city", StringType()),
  StructField("salary", IntegerType())
])

df = spark.read.format("csv") \
  .option("header", "false") \
  .schema(empSchema) \
  .option("path", "/FileStore/tables/[Link]") \
  .load()
df.printSchema()
Impose Schema in Scala (DDL string)
val ordersSchema = "orderid int, orderdate string, custid int, orderstatus string"

val ordersDf = spark.read
  .format("csv")
  .schema(ordersSchema)
  .option("path", "C:/Users/Lenovo/Documents/BIG DATA/WEEK11/[Link]")
  .load()
ordersDf.printSchema()

Impose Schema in Python (DDL string)
empschema = "empid int, empname string, city string, salary double"

df = spark.read.format("csv") \
  .option("header", "false") \
  .schema(empschema) \
  .option("path", "/FileStore/tables/[Link]") \
  .load()
df.show()
df.printSchema()
Rename columns in Scala
val newDf = ordersDf.withColumnRenamed("order_customer_id", "customer_id")

Rename columns in Pyspark
df = df.withColumnRenamed("id", "id_new")
Rename Multiple columns in Scala
val newDf = ordersDf.withColumnRenamed("order_id", "id")
  .withColumnRenamed("order_date", "date")
  .withColumnRenamed("order_customer_id", "customer_id")
  .withColumnRenamed("order_status", "status")

Rename Multiple columns in Pyspark
df = df.withColumnRenamed("id", "id_new") \
  .withColumnRenamed("name", "name_New") \
  .withColumnRenamed("City", "City_New")
Rename Multiple columns in Scala (selectExpr)
ordersDf.selectExpr("order_id as id", "order_date as date")

Rename Multiple columns in Pyspark (selectExpr)
df.selectExpr("id as NewId", "Name as NewName")
Add columns in Scala
ordersDf.withColumn("country", lit("india"))
ordersDf.withColumn("dblid", col("order_id") * 2)

Add columns in Pyspark
df.withColumn("Country", lit("India"))
df.withColumn("Incentive", col("salary") * 0.2)
Drop column in Scala
val newDf = df.drop("REGION")
val newDf2 = df.drop("ID", "REGION")

Drop column in Pyspark
newdf2 = df.drop("REGION")
newdf3 = df.drop("ID", "REGION")
Select columns in Scala
import org.apache.spark.sql.functions.{col, column, expr}
import spark.implicits._ // for the $"col" and 'col syntax

ordersDf.select("order_id", "order_customer_id", "order_status").show
ordersDf.select(column("order_id"), col("order_date"), $"order_customer_id", 'order_status).show
ordersDf.select(column("order_id"), expr("concat(order_status,'_STATUS')")).show(false)
ordersDf.selectExpr("order_id", "order_date", "concat(order_status,'_STATUS')")

Select columns in Pyspark
df.select("id", "name", "salary")
df.select(col("id"), col("name"))
df.select(col("id"), expr("concat(name,'_STATUS')"))
df.selectExpr("id", "name", "concat(name,'_STATUS')")
Filter in Scala
ordersDf.filter("weeknum == 50")
ordersDf.filter("weeknum > 45")
ordersDf.filter("country == 'India'")
ordersDf.filter("country = 'India' OR country = 'Italy'")
ordersDf.filter(ordersDf("country") === "India" && ordersDf("totalqty") > 1000)
ordersDf.filter("weeknum != 50")
ordersDf.filter("country != 'India'")
df.filter(df("salary") >= 30000 && df("salary") <= 60000).show

Filter in Pyspark
df.filter(df.id == 1)
df.filter(df.id > 5)
df.filter(df.city == "PUNE")
df.filter((df.id == 1) | (df.id == 3))
df.filter((df.city == "PUNE") & (df.salary > 50000))
df.filter(df.id != 1)
df.filter(df.city != "PUNE")
df[df["salary"].between(30000, 60000)].show()
Sort in Scala
df.sort("invoicevalue")
df.sort(col("invoicevalue").desc)
df.sort("country", "invoicevalue")
df.sort(col("country").asc, col("invoicevalue").desc)

Sort in Pyspark
df.sort(df.salary)
df.sort(df.salary.desc())
df.sort(df.city, df.salary)
df.sort(df.city, df.salary.desc())
Remove duplicates in Scala
df.distinct()
df.dropDuplicates()
df.dropDuplicates("city")
df.dropDuplicates("name", "city")

Remove duplicates in Pyspark
df.distinct()
df.dropDuplicates()
df.dropDuplicates(["city"])
df.dropDuplicates(["city", "salary"])
Union in Scala
df.union(ordersDf)

Union in Pyspark
df.union(df2)
When in Scala
df.withColumn("Tier",
  when(col("city") === "MUMBAI", 1).when(col("city") === "PUNE", 2).otherwise(0))
df.select(col("*"),
  when(col("city") === "MUMBAI", 1).when(col("city") === "PUNE", 2).otherwise(0).as("Tier"))

When in Pyspark
df.withColumn("CityTier",
  when(col("city") == "Pune", 3).when(col("city") == "Delhi", 1).when(col("city") == "Mumbai", 2).otherwise('na'))
df.select(col("*"),
  when(col("city") == "Pune", 3).when(col("city") == "Delhi", 1).when(col("city") == "Mumbai", 2).otherwise('na').alias("CityTier"))
Contains in Scala
import org.apache.spark.sql.functions.col
val filteredDf = df.filter(col("REGION").contains("ST"))
df.filter(col("empname").like("A%")).show
df.filter(col("empname").like("%N")).show
df.filter(col("empname").like("%A%")).show

Contains in Pyspark
from pyspark.sql.functions import col
filteredDf2 = df.filter(col("REGION").contains("ST"))
df.filter(col("empname").like("A%")).show()
df.filter(col("empname").like("%N")).show()
df.filter(col("empname").like("%A%")).show()
Summary in Scala
df.summary().show()

Summary in Pyspark
df.summary().show()
Case Conversion in Scala
import org.apache.spark.sql.functions.{initcap, upper, lower, col}
val df2 = df.select(initcap(col("data")))
val df3 = df.select(upper(col("data")))
val df4 = df.select(lower(col("data")))

Case Conversion in Pyspark
from pyspark.sql.functions import initcap, upper, lower, col
df.select(initcap(col("data"))).show(truncate=0)
df.select(upper(col("data"))).show(truncate=0)
df.select(lower(col("data"))).show(truncate=0)
Trim in Scala
import org.apache.spark.sql.functions.{lit, ltrim, rtrim, rpad, lpad, trim}
df.select(
  ltrim(lit("   HELLO   ")).as("ltrim"),
  rtrim(lit("   HELLO   ")).as("rtrim"),
  trim(lit("   HELLO   ")).as("trim"),
  lpad(lit("HELLO"), 3, " ").as("lp"),
  rpad(lit("HELLO"), 10, " ").as("rp")).show(2)

Trim in Pyspark
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim
df.select(
  ltrim(lit("   HELLO   ")).alias("ltrim"),
  rtrim(lit("   HELLO   ")).alias("rtrim"),
  trim(lit("   HELLO   ")).alias("trim"),
  lpad(lit("HELLO"), 3, " ").alias("lp"),
  rpad(lit("HELLO"), 10, " ").alias("rp")).show(2)
Round in Scala
import org.apache.spark.sql.functions.{round, bround, col, lit}
val roundedDf = df.select(round(col("SALES"), 1).alias("rounded"))
df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)

Round in Pyspark
from pyspark.sql.functions import lit, round, bround
df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)
Split in Scala
import org.apache.spark.sql.functions.{split, col}
df.select(split(col("data"), " ").alias("words_array")).show
df.selectExpr("words_array[0]").show

Split in Pyspark
from pyspark.sql.functions import split, col
df.select(split(col("data"), " ").alias("words_array")).show()
df.selectExpr("words_array[0]").show()
Size of array in Scala
import org.apache.spark.sql.functions.{size, col}
df.select(size(col("words_array"))).show

Size of array in Pyspark
from pyspark.sql.functions import size, col
df.select(size(col("words_array"))).show()
Array contains in Scala
import org.apache.spark.sql.functions.{array_contains, col}
df.select(array_contains(col("words_array"), "big")).show

Array contains in Pyspark
from pyspark.sql.functions import array_contains, col
df.select(array_contains(col("words_array"), "big")).show()
Explode in Scala
import org.apache.spark.sql.functions.{explode, col}
df.withColumn("exploded_words", explode(col("words_array"))).show(false)

Explode in Pyspark
from pyspark.sql.functions import explode, col
df.withColumn("exploded_words", explode(col("words_array"))).show(truncate=0)
UDF in Scala
def power3(number: Double): Double = number * number * number
spark.udf.register("power3", power3(_: Double): Double)
df.selectExpr("power3(num)").show

UDF in Pyspark
def power3(double_value): return double_value ** 3
Joins in Scala
val joincondition = ordersDf.col("order_customer_id") === customersDf.col("customer_id")

val joinedDf = ordersDf.join(customersDf, joincondition, "inner").sort("order_customer_id")

Joins in Pyspark
df.join(df2, df.id == df2.id, "inner").show()
df.join(df2, df.id == df2.id, "left").show()
df.join(df2, df.id == df2.id, "right").show()
df.join(df2, df.id == df2.id, "outer").show()
Collect set & list in Scala
import org.apache.spark.sql.functions.{collect_set, collect_list}
df.select(collect_set("Country")).show(false)
df.select(collect_list("Country")).show(false)

Collect set & list in Pyspark
from pyspark.sql.functions import collect_set, collect_list
df.select(collect_set("Country")).show()
df.select(collect_list("Country")).show()
Aggregate in Scala
import org.apache.spark.sql.functions._

// method 1: column object expressions
df.select(
  count("*").as("Rowcount"),
  sum("Quantity").as("TotalQty"),
  avg("UnitPrice").as("AvgPrice"),
  countDistinct("InvoiceNo").as("DistinctInvoices")
).show

// method 2: string expressions
df.selectExpr(
  "count(*) as Rowcount",
  "sum(Quantity) as TotalQty",
  "avg(UnitPrice) as AvgPrice",
  "count(DISTINCT InvoiceNo) as DistinctInvoices"
).show

// method 3: Spark SQL
df.createOrReplaceTempView("sales")
spark.sql("select count(*) as Rowcount, sum(Quantity) as TotalQty, avg(UnitPrice) as AvgPrice, count(DISTINCT InvoiceNo) as DistinctInvoices from sales").show

Aggregate in Pyspark
df.selectExpr(
  "count(*) as Rowcount",
  "sum(Quantity) as TotalQty",
  "avg(UnitPrice) as AvgPrice",
  "count(DISTINCT InvoiceNo) as DistinctInvoices"
).show()

df.createOrReplaceTempView("sales")
spark.sql("select count(*) as Rowcount, sum(Quantity) as TotalQty, avg(UnitPrice) as AvgPrice, count(DISTINCT InvoiceNo) as DistinctInvoices from sales").show()
Grouping Aggregate in Scala
df.groupBy("country").sum("Quantity").show

// method 1: column object expressions
df.groupBy("country", "InvoiceNo")
  .agg(sum("Quantity").as("TotalQty"),
    sum(expr("Quantity * UnitPrice")).as("InvoiceValue")).show

// method 2: string expressions
df.groupBy("country", "InvoiceNo")
  .agg(expr("sum(Quantity) as TotalQty"),
    expr("sum(Quantity * UnitPrice) as InvoiceValue")).show

// method 3: Spark SQL
df.createOrReplaceTempView("sales")
spark.sql("""select country, InvoiceNo, sum(Quantity) as TotalQty,
  sum(Quantity * UnitPrice) as InvoiceValue
  from sales group by country, InvoiceNo""").show

Grouping Aggregate in Pyspark
df.groupBy('city').sum('salary')

df.groupBy('city').agg(sum('salary').alias('TotalSalary'),
  max('salary').alias('MaxSalary'),
  min('salary').alias('MinSalary'),
  avg('salary').alias('AvgSalary'))
Window Aggregate in Scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, col}

val RowWindow = Window.partitionBy().orderBy("TotalQty")
df.withColumn("Rownum", row_number().over(RowWindow)).show

val RowWindow2 = Window.partitionBy().orderBy(col("TotalQty").desc)
df.withColumn("Rownum", row_number().over(RowWindow2)).show

val RowWindow3 = Window.partitionBy("country").orderBy(col("TotalQty").desc)
df.withColumn("Rownum", row_number().over(RowWindow3)).show

val RowWindow4 = Window.partitionBy("country", "weeknum").orderBy(col("TotalQty").desc)
df.withColumn("Rownum", row_number().over(RowWindow4)).show(100)

Window Aggregate in Pyspark
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

window = Window.partitionBy().orderBy("salary")
df.withColumn("Rownum", row_number().over(window)).show()

window = Window.partitionBy().orderBy(col("salary").desc())
df.withColumn("Rownum", row_number().over(window)).show()

window = Window.partitionBy("city").orderBy(col("salary").desc())
df.withColumn("Rownum", row_number().over(window)).show()

window = Window.partitionBy("state", "city").orderBy(col("salary").desc())
df.withColumn("Rownum", row_number().over(window)).show()
Running Total in Scala
val RunningWindow = Window.partitionBy().orderBy("country")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("RunningTotal", sum("invoicevalue").over(RunningWindow)).show

val myWindow = Window.partitionBy("country")
  .orderBy("weeknum")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val myDf = df.withColumn("RunningTotal", sum("invoicevalue").over(myWindow))

val myWindow2 = Window.partitionBy()
  .orderBy("weeknum")
  .rowsBetween(-2, Window.currentRow)
df.withColumn("RunningTotal", sum("invoicevalue").over(myWindow2)).show

Running Total in Pyspark
RunningWindow = Window.partitionBy().orderBy("city") \
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("RunningTotal", sum("salary").over(RunningWindow)).show()

RunningWindow = Window.partitionBy("city").orderBy("city") \
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("RunningTotal", sum("salary").over(RunningWindow)).show()

RunningWindow = Window.partitionBy().orderBy("city") \
  .rowsBetween(-2, Window.currentRow)
df.withColumn("RunningTotal", sum("salary").over(RunningWindow)).show()
Rank in Scala
val RunningWindow = Window.partitionBy().orderBy("invoicevalue")
df.withColumn("Ranks", rank().over(RunningWindow)).show

val RunningWindow2 = Window.partitionBy().orderBy(col("invoicevalue").desc)
df.withColumn("Ranks", rank().over(RunningWindow2)).show

val RunningWindow3 = Window.partitionBy("country").orderBy(col("invoicevalue").desc)
df.withColumn("Ranks", rank().over(RunningWindow3)).show

Rank in Pyspark
RunningWindow = Window.partitionBy().orderBy("salary")
df.withColumn("Ranks", rank().over(RunningWindow)).show()

RunningWindow = Window.partitionBy().orderBy(col("salary").desc())
df.withColumn("Ranks", rank().over(RunningWindow)).show()

RunningWindow = Window.partitionBy("city").orderBy(col("salary").desc())
df.withColumn("Ranks", rank().over(RunningWindow)).show()
Dense Rank in Scala
val RunningWindow = Window.partitionBy().orderBy("invoicevalue")
df.withColumn("Ranks", dense_rank().over(RunningWindow)).show

val RunningWindow2 = Window.partitionBy().orderBy(col("invoicevalue").desc)
df.withColumn("Ranks", dense_rank().over(RunningWindow2)).show

val RunningWindow3 = Window.partitionBy("country").orderBy(col("invoicevalue").desc)
df.withColumn("Ranks", dense_rank().over(RunningWindow3)).show

Dense Rank in Pyspark
RunningWindow = Window.partitionBy().orderBy("salary")
df.withColumn("Ranks", dense_rank().over(RunningWindow)).show()

RunningWindow = Window.partitionBy().orderBy(col("salary").desc())
df.withColumn("Ranks", dense_rank().over(RunningWindow)).show()

RunningWindow = Window.partitionBy("city").orderBy(col("salary").desc())
df.withColumn("Ranks", dense_rank().over(RunningWindow)).show()
Repartition in Scala
val newRdd = inputRDD.repartition(6)

Repartition in Pyspark
df.repartition(6).write.format("parquet").mode("overwrite").save('/FileStore/tables/Repart')
Coalesce in Scala
val newRdd = inputRDD.coalesce(6)

Coalesce in Pyspark
df.coalesce(6).write.format("parquet").mode("overwrite").save('/FileStore/tables/Repart')
Partition in Scala
ordersDf.write
  .format("csv")
  .partitionBy("order_status")
  .mode(SaveMode.Overwrite)
  .option("path", "C:/Users/Lenovo/Documents/BIG DATA/WEEK11/newfolder")
  .save()

ordersDf.write
  .format("csv")
  .partitionBy("country", "order_status")
  .mode(SaveMode.Overwrite)
  .option("path", "C:/Users/Lenovo/Documents/BIG DATA/WEEK11/newfolder")
  .save()

Partition in Pyspark
df.write.option("header", "true").partitionBy("COUNTRY").mode("overwrite").csv("/FileStore/tables/Sample_Partition_op")

df.write.option("header", "true").partitionBy("COUNTRY", "CITY").mode("overwrite").csv("/FileStore/tables/Sample_Partition_op")
Bucketing in Scala
ordersDf.write
  .format("csv")
  .mode(SaveMode.Overwrite)
  .bucketBy(4, "order_customer_id")
  .sortBy("order_customer_id")
  .saveAsTable("orders")

Bucketing in Pyspark
df.write.format("csv") \
  .mode("overwrite") \
  .bucketBy(4, "id") \
  .sortBy("id") \
  .saveAsTable("orders_bucketed")
Cast Column in Scala
val df = ordersDf.withColumn("id", ordersDf("id").cast(IntegerType))
df.select(col("id").cast("int").as("id"), col("name").cast("string").as("name"))
df.selectExpr("cast(id as int)", "name", "cast(salary as int)")

Cast Column in Pyspark
df.withColumn("id", df.id.cast('integer')).withColumn("salary", df.salary.cast('integer'))
df.select(col("id").cast('int'), col("name"), col("salary").cast('int'))
df.selectExpr('cast(id as int)', 'name', 'cast(salary as int)')
Fill nulls in Scala
df.na.fill(0)
df.na.fill("none")
df.withColumn("order_id", expr("coalesce(order_id,-1)"))

Fill nulls in Pyspark
df.na.fill(0)
df.na.fill("none")
df.withColumn("salary", expr("coalesce(salary,-1)"))
Read directly in Scala
spark.sql("select * from csv.`C:/Users/Lenovo/Documents/[Link]`")

Read directly in Pyspark
spark.sql("SELECT * FROM csv.`/user/hive/warehouse/orders_bucketed/part-00000-tid-3984408860399578289-17a5aa99-d1f9-4500-88cf-1adde09ef7fb-19-1_00000.csv`")
Literal in Scala
import org.apache.spark.sql.functions.{lit, expr}
val limitCountriesDf = df.select(expr("*"), lit(1).as("Literalcol"))
limitCountriesDf.show(10)

Literal in Pyspark
from pyspark.sql.functions import lit, expr
limitCountriesDf2 = df.select(expr("*"), lit(1).alias("Literalcol"))
limitCountriesDf2.show(10)
Spark Application Execution Flow
1. Using the spark-submit command, the user submits the Spark application to the Spark cluster.
2. spark-submit invokes the main() method specified in the command, which launches the driver program.
3. The driver program converts the code into a Directed Acyclic Graph (DAG) that holds all the RDDs and the transformations to be performed on them.
4. During this phase the driver program also applies some optimizations and then converts the DAG into a physical execution plan made up of a set of stages.
5. From this physical plan, the driver creates small execution units called tasks, which are sent to the Spark cluster.
6. The driver program then talks to the cluster manager and requests the resources needed for execution.
7. The cluster manager launches the executors on the worker nodes.
8. The executors register themselves with the driver program, so the driver has complete knowledge of all the executors.
9. The driver program then sends the tasks to the executors and starts the execution.
10. The driver program continuously monitors the tasks running on the executors until the job completes.
11. When the job is completed, when the stop() method is called, or in case of any failure, the driver program terminates and frees the allocated resources.
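A minimal Scala sketch of an application that follows this flow when launched with spark-submit (the object name, jar name, and input path are hypothetical placeholders):

// Submit with, for example:
//   spark-submit --class WordCountApp --master local[*] app.jar
import org.apache.spark.sql.SparkSession

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // spark-submit invokes this main() method, which starts the driver program.
    val spark = SparkSession.builder().appName("WordCountApp").getOrCreate()

    // Transformations below are only recorded into the DAG; nothing runs yet.
    val wordCounts = spark.sparkContext
      .textFile("/tmp/input.txt")      // hypothetical input path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // An action triggers DAG -> physical plan -> stages -> tasks; the driver asks the
    // cluster manager for executors and ships the tasks to them for execution.
    wordCounts.collect().foreach(println)

    // stop() frees the resources allocated on the cluster.
    spark.stop()
  }
}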