
Data Engineering Practical Manual

1) Practicals on RDD (Resilient Distributed Dataset) with Operations and Transformations.

a) Write a programme to demonstrate PySpark RDD Transformations.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()
rdd = spark.sparkContext.textFile("/content/data.txt") # File must be uploaded

for element in rdd.collect():
    print(element)

#Flatmap
rdd2 = rdd.flatMap(lambda x: x.split(" "))
for element in rdd2.collect():
    print(element)

#map
rdd3 = rdd2.map(lambda x: (x, 1))
for element in rdd3.collect():
    print(element)

#reduceByKey
rdd4 = rdd3.reduceByKey(lambda a, b: a + b)
for element in rdd4.collect():
    print(element)

#map
rdd5 = rdd4.map(lambda x: (x[1], x[0])).sortByKey()
for element in rdd5.collect():
    print(element)

#filter
rdd6 = rdd5.filter(lambda x: 'a' in x[1])
for element in rdd6.collect():
    print(element)
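
The transformations above can also be chained into a single expression; the following is a minimal sketch, assuming the same /content/data.txt file, that produces the word counts in one pass.

# Word count written as one chained set of transformations (sketch)
wordCounts = (spark.sparkContext.textFile("/content/data.txt")
    .flatMap(lambda x: x.split(" "))
    .map(lambda x: (x, 1))
    .reduceByKey(lambda a, b: a + b))
print(wordCounts.collect())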

b) Write a programme to demonstrate PySpark Actions.

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()
data=[("Z", 1),("A", 20),("B", 30),("C", 40),("B", 30),("B", 60)]
inputRDD = spark.sparkContext.parallelize(data)

listRdd = spark.sparkContext.parallelize([1,2,3,4,5,3,2])
#aggregate
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
agg=listRdd.aggregate(0, seqOp, combOp)
print(agg) # output 20

#aggregate 2
seqOp2 = (lambda x, y: (x[0] + y, x[1] + 1))
combOp2 = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
agg2=listRdd.aggregate((0, 0), seqOp2, combOp2)
print(agg2) # output (20,7)

agg2=listRdd.treeAggregate(0,seqOp, combOp)
print(agg2) # output 20

#fold
from operator import add
foldRes=listRdd.fold(0, add)
print(foldRes) # output 20

#reduce
redRes=listRdd.reduce(add)
print(redRes) # output 20

#treeReduce. This is similar to reduce


add = lambda x, y: x + y
redRes=listRdd.treeReduce(add)
print(redRes) # output 20

#Collect
data = listRdd.collect()
print(data)

#count, countApprox, countApproxDistinct
print("Count : "+str(listRdd.count()))
#Output: Count : 7
print("countApprox : "+str(listRdd.countApprox(1200)))
#Output: countApprox : 7
print("countApproxDistinct : "+str(listRdd.countApproxDistinct()))
#Output: countApproxDistinct : 5
print("countApproxDistinct : "+str(inputRDD.countApproxDistinct()))
#Output: countApproxDistinct : 5
#countByValue, countByValueApprox
print("countByValue : "+str(listRdd.countByValue()))

#first
print("first : "+str(listRdd.first()))
#Output: first : 1
print("first : "+str(inputRDD.first()))
#Output: first : (Z,1)

#top
print("top : "+str(listRdd.top(2)))
#Output: top : 5,4
print("top : "+str(inputRDD.top(2)))
#Output: top : (Z,1),(C,40)

#min
print("min : "+str(listRdd.min()))
#Output: min : 1
print("min : "+str(inputRDD.min()))
#Output: min : (A,20)

#max
print("max : "+str(listRdd.max()))
#Output: max : 5
print("max : "+str(inputRDD.max()))
#Output: max : (Z,1)

#take, takeOrdered, takeSample
print("take : "+str(listRdd.take(2)))
#Output: take : 1,2
print("takeOrdered : "+ str(listRdd.takeOrdered(2)))
#Output: takeOrdered : 1,2
print("takeSample : "+str(listRdd.takeSample(False, 3)))
#Output: takeSample : three randomly sampled elements
c) Write a programme to demonstrate Dataframe Column Rename operations with options.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.functions import *

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

dataDF = [(('James','','Smith'),'1991-04-01','M',3000),
(('Michael','Rose',''),'2000-05-19','M',4000),
(('Robert','','Williams'),'1978-09-05','M',4000),
(('Maria','Anne','Jones'),'1967-12-01','F',4000),
(('Jen','Mary','Brown'),'1980-02-17','F',-1)
]

schema = StructType([
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField('dob', StringType(), True),
StructField('gender', StringType(), True),
StructField('salary', IntegerType(), True)
])

df = spark.createDataFrame(data = dataDF, schema = schema)


df.printSchema()

# Example 1
df.withColumnRenamed("dob","DateOfBirth").printSchema()
# Example 2
df2 = df.withColumnRenamed("dob","DateOfBirth") \
.withColumnRenamed("salary","salary_amount")
df2.printSchema()

# Example 3
schema2 = StructType([
StructField("fname",StringType()),
StructField("middlename",StringType()),
StructField("lname",StringType())])
df.select(col("name").cast(schema2),
col("dob"),
col("gender"),
col("salary")) \
.printSchema()

# Example 4
df.select(col("name.firstname").alias("fname"),
col("name.middlename").alias("mname"),
col("name.lastname").alias("lname"),
col("dob"),col("gender"),col("salary")) \
.printSchema()

# Example 5
df4 = df.withColumn("fname",col("name.firstname")) \
.withColumn("mname",col("name.middlename")) \
.withColumn("lname",col("name.lastname")) \
.drop("name")
df4.printSchema()

# Example 6
newColumns = ["newCol1","newCol2","newCol3","newCol4"]
df.toDF(*newColumns).printSchema()
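
Several columns can also be renamed in a loop with withColumnRenamed(); a minimal sketch, assuming the df2 DataFrame produced in Example 2 above.

# Rename multiple columns in a loop (sketch)
renames = {"DateOfBirth": "dob", "salary_amount": "salary"}
dfRenamed = df2
for old_name, new_name in renames.items():
    dfRenamed = dfRenamed.withColumnRenamed(old_name, new_name)
dfRenamed.printSchema()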

d) Write a programme to demonstrate PySpark withColumn() operations with options.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType,IntegerType

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.printSchema()
df.show(truncate=False)

df2 = df.withColumn("salary",col("salary").cast("Integer"))
df2.printSchema()
df2.show(truncate=False)

df3 = df.withColumn("salary",col("salary")*100)
df3.printSchema()
df3.show(truncate=False)

df4 = df.withColumn("CopiedColumn",col("salary")* -1)


df4.printSchema()

df5 = df.withColumn("Country", lit("USA"))


df5.printSchema()

df6 = df.withColumn("Country", lit("USA")) \
    .withColumn("anotherColumn", lit("anotherValue"))
df6.printSchema()

df.withColumnRenamed("gender","sex") \
.show(truncate=False)

df4.drop("CopiedColumn") \
.show(truncate=False)

e) Write a programme to demonstrate PySpark Pivot and Unpivot DataFrame.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

data = [("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"), \


("Orange",2000,"USA"),("Orange",2000,"USA"),("Banana",400,"China"), \
("Carrots",1200,"China"),("Beans",1500,"China"),("Orange",4000,"China"), \
("Banana",2000,"Canada"),("Carrots",2000,"Canada"),("Beans",2000,"Mexico")]

columns= ["Product","Amount","Country"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

pivotDF = df.groupBy("Product").pivot("Country").sum("Amount")
pivotDF.printSchema()
pivotDF.show(truncate=False)

pivotDF = df.groupBy("Product","Country") \
.sum("Amount") \
.groupBy("Product") \
.pivot("Country") \
.sum("sum(Amount)")
pivotDF.printSchema()
pivotDF.show(truncate=False)

""" unpivot """


unpivotExpr = "stack(3, 'Canada', Canada, 'China', China, 'Mexico', Mexico) as
(Country,Total)"
unPivotDF = pivotDF.select("Product", expr(unpivotExpr)) \
.where("Total is not null")
unPivotDF.show(truncate=False)
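
pivot() also accepts the list of pivot values explicitly, which avoids the extra pass Spark otherwise makes to discover them; a minimal sketch on the same df.

countries = ["USA", "China", "Canada", "Mexico"]
pivotDF2 = df.groupBy("Product").pivot("Country", countries).sum("Amount")
pivotDF2.show(truncate=False)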
2 Practical on the DataFrame operations
a) Write a programme to demonstrate Dataframe Sorting operations.

# Imports
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, asc,desc

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

simpleData = [("James","Sales","NY",90000,34,10000), \
("Michael","Sales","NY",86000,56,20000), \
("Robert","Sales","CA",81000,30,23000), \
("Maria","Finance","CA",90000,24,23000), \
("Raman","Finance","CA",99000,40,24000), \
("Scott","Finance","NY",83000,36,19000), \
("Jen","Finance","NY",79000,53,15000), \
("Jeff","Marketing","CA",80000,25,18000), \
("Kumar","Marketing","NY",91000,50,21000) \
]
columns= ["employee_name","department","state","salary","age","bonus"]

df = spark.createDataFrame(data = simpleData, schema = columns)

df.printSchema()
df.show(truncate=False)

df.sort("department","state").show(truncate=False)
df.sort(col("department"),col("state")).show(truncate=False)

df.orderBy("department","state").show(truncate=False)
df.orderBy(col("department"),col("state")).show(truncate=False)

df.sort(df.department.asc(),df.state.asc()).show(truncate=False)
df.sort(col("department").asc(),col("state").asc()).show(truncate=False)
df.orderBy(col("department").asc(),col("state").asc()).show(truncate=False)

df.sort(df.department.asc(),df.state.desc()).show(truncate=False)
df.sort(col("department").asc(),col("state").desc()).show(truncate=False)
df.orderBy(col("department").asc(),col("state").desc()).show(truncate=False)

df.createOrReplaceTempView("EMP")
spark.sql("select employee_name,department,state,salary,age,bonus from EMP ORDER
BY department asc").show(truncate=False)
b) Write a programme to demonstrate Drop rows with NULL Values.

from pyspark.sql import SparkSession

spark: SparkSession = SparkSession.builder \
    .master("local[1]") \
    .appName("TusharsExamples") \
    .getOrCreate()

filePath="resources/small_zipcode.csv"
df = spark.read.options(header='true', inferSchema='true') \
.csv(filePath)

df.printSchema()
df.show(truncate=False)

df.na.drop().show(truncate=False)

df.na.drop(how="any").show(truncate=False)

df.na.drop(subset=["population","type"]) \
.show(truncate=False)

df.dropna().show(truncate=False)
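
na.drop() also takes how="all" (drop rows where every column is NULL) and a thresh argument (keep rows with at least that many non-NULL values); a minimal sketch on the same df.

df.na.drop(how="all").show(truncate=False)
df.na.drop(thresh=2).show(truncate=False)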

c) Write a programme to demonstrate PySpark split() Column with options.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, substring, regexp_replace
spark=SparkSession.builder.appName("sparkbyexamples").getOrCreate()

data = [('James','','Smith','1991-04-01'),
('Michael','Rose','','2000-05-19'),
('Robert','','Williams','1978-09-05'),
('Maria','Anne','Jones','1967-12-01'),
('Jen','Mary','Brown','1980-02-17')
]

columns=["firstname","middlename","lastname","dob"]
df=spark.createDataFrame(data,columns)
df.printSchema()
df.show(truncate=False)
df1 = df.withColumn('year', split(df['dob'], '-').getItem(0)) \
.withColumn('month', split(df['dob'], '-').getItem(1)) \
.withColumn('day', split(df['dob'], '-').getItem(2))
df1.printSchema()
df1.show(truncate=False)

# Alternatively we can do like below


split_col = pyspark.sql.functions.split(df['dob'], '-')
df2 = df.withColumn('year', split_col.getItem(0)) \
.withColumn('month', split_col.getItem(1)) \
.withColumn('day', split_col.getItem(2))
df2.show(truncate=False)

# Using split() function of Column class


split_col = pyspark.sql.functions.split(df['dob'], '-')
df3 = df.select("firstname","middlename","lastname","dob",
    split_col.getItem(0).alias('year'), split_col.getItem(1).alias('month'),
    split_col.getItem(2).alias('day'))
df3.show(truncate=False)
"""
df4=spark.createDataFrame([("20-13-2012-monday",)], ['date',])

df4.select(split(df4.date,'^([\d]+-[\d]+-[\d])').alias('date'),
    regexp_replace(split(df4.date,'^([\d]+-[\d]+-[\d]+)').getItem(1),'-','').alias('day')).show()
"""
df4 = spark.createDataFrame([('oneAtwoBthree',)], ['str',])
df4.select(split(df4.str, '[AB]').alias('str')).show()

df4.select(split(df4.str, '[AB]',2).alias('str')).show()
df4.select(split(df4.str, '[AB]',3).alias('str')).show()
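
split() combines naturally with explode() to turn each array element into its own row; a minimal sketch using the dob column of the df created above.

from pyspark.sql.functions import explode
df.select("firstname", explode(split(df["dob"], "-")).alias("dob_part")) \
  .show(truncate=False)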

d) Write a programme to demonstrate PySpark Concatenate Columns with options.


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat,concat_ws
spark=SparkSession.builder.appName("concate").getOrCreate()

data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df2=df.select(concat(df.firstname,df.middlename,df.lastname)
.alias("FullName"),"dob","gender","salary")
df2.show(truncate=False)

df3=df.select(concat_ws('_',df.firstname,df.middlename,df.lastname)
.alias("FullName"),"dob","gender","salary")
df3.show(truncate=False)

e) Write a programme to demonstrate PySpark fillna() & fill() & Replace NULL/None Values.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[1]") \
.appName("TusharsExamples") \
.getOrCreate()

filePath="resources/small_zipcode.csv"
df = spark.read.options(header='true', inferSchema='true') \
.csv(filePath)

df.printSchema()
df.show(truncate=False)

df.fillna(value=0).show()
df.fillna(value=0,subset=["population"]).show()
df.na.fill(value=0).show()
df.na.fill(value=0,subset=["population"]).show()

df.fillna(value="").show()
df.na.fill(value="").show()

df.fillna("unknown",["city"]) \
.fillna("",["type"]).show()

df.fillna({"city": "unknown", "type": ""}) \


.show()

df.na.fill("unknown",["city"]) \
.na.fill("",["type"]).show()

df.na.fill({"city": "unknown", "type": ""}) \


.show()
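
Existing (non-NULL) values can also be swapped out with na.replace(); a minimal sketch on the same DataFrame, where the city value shown is purely illustrative.

df.na.replace(["PARC PARQUE"], ["PARC"], "city").show(truncate=False)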

f) Write a programme to demonstrate PySpark Distinct to Drop Duplicate Rows with options.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

data = [("James", "Sales", 3000), \


("Michael", "Sales", 4600), \
("Robert", "Sales", 4100), \
("Maria", "Finance", 3000), \
("James", "Sales", 3000), \
("Scott", "Finance", 3300), \
("Jen", "Finance", 3900), \
("Jeff", "Marketing", 3000), \
("Kumar", "Marketing", 2000), \
("Saif", "Sales", 4100) \
]
columns= ["employee_name", "department", "salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

#Distinct
distinctDF = df.distinct()
print("Distinct count: "+str(distinctDF.count()))
distinctDF.show(truncate=False)

#Drop duplicates
df2 = df.dropDuplicates()
print("Distinct count: "+str(df2.count()))
df2.show(truncate=False)

#Drop duplicates on selected columns


dropDisDF = df.dropDuplicates(["department","salary"])
print("Distinct count of department salary : "+str(dropDisDF.count()))
dropDisDF.show(truncate=False)
3 Practical on the Spark Array and Map operations
a) Write a programme to demonstrate various Array type operations.

from pyspark.sql import SparkSession


from pyspark.sql.types import StringType, ArrayType,StructType,StructField
spark = SparkSession.builder \
.appName('TusharsExamples') \
.getOrCreate()

arrayCol = ArrayType(StringType(),False)

data = [
("James,,Smith",["Java","Scala","C++"],["Spark","Java"],"OH","CA"),
("Michael,Rose,",["Spark","Java","C++"],["Spark","Java"],"NY","NJ"),
("Robert,,Williams",["CSharp","VB"],["Spark","Python"],"UT","NV")
]

schema = StructType([
StructField("name",StringType(),True),
StructField("languagesAtSchool",ArrayType(StringType()),True),
StructField("languagesAtWork",ArrayType(StringType()),True),
StructField("currentState", StringType(), True),
StructField("previousState", StringType(), True)
])

df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show()

from pyspark.sql.functions import explode


df.select(df.name,explode(df.languagesAtSchool)).show()

from pyspark.sql.functions import split


df.select(split(df.name,",").alias("nameAsArray")).show()

from pyspark.sql.functions import array


df.select(df.name,array(df.currentState,df.previousState).alias("States")).show()

from pyspark.sql.functions import array_contains


df.select(df.name,array_contains(df.languagesAtSchool,"Java")
.alias("array_contains")).show()
b) Write a programme to demonstrate PySpark Convert array column to a String
with options.

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

columns = ["name","languagesAtSchool","currentState"]
data = [("James,,Smith",["Java","Scala","C++"],"CA"), \
("Michael,Rose,",["Spark","Java","C++"],"NJ"), \
("Robert,,Williams",["CSharp","VB"],"NV")]

df = spark.createDataFrame(data=data,schema=columns)
df.printSchema()
df.show(truncate=False)

from pyspark.sql.functions import col, concat_ws


df2 = df.withColumn("languagesAtSchool",
concat_ws(",",col("languagesAtSchool")))
df2.printSchema()
df2.show(truncate=False)

df.createOrReplaceTempView("ARRAY_STRING")
spark.sql("select name, concat_ws(',',languagesAtSchool) as languagesAtSchool," + \
" currentState from ARRAY_STRING") \
.show(truncate=False)

c) Write a programme to demonstrate converting a string column (StringType) to an array column.

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

data = ["Project",
"Gutenberg’s",
"Alice’s",
"Adventures",
"in",
"Wonderland",
"Project",
"Gutenberg’s",
"Adventures",
"in",
"Wonderland",
"Project",
"Gutenberg’s"]

rdd=spark.sparkContext.parallelize(data)

rdd2=rdd.map(lambda x: (x,1))
for element in rdd2.collect():
print(element)

data = [('James','Smith','M',30),
('Anna','Rose','F',41),
('Robert','Williams','M',62),
]

columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()

rdd2=df.rdd.map(lambda x:
(x[0]+","+x[1],x[2],x[3]*2)
)
df2=rdd2.toDF(["name","gender","new_salary"] )
df2.show()

#Referring Column Names


rdd2=df.rdd.map(lambda x:
(x["firstname"]+","+x["lastname"],x["gender"],x["salary"]*2)
)

#Referring Column Names


rdd2=df.rdd.map(lambda x:
(x.firstname+","+x.lastname,x.gender,x.salary*2)
)

def func1(x):
firstName=x.firstname
lastName=x.lastname
name=firstName+","+lastName
gender=x.gender.lower()
salary=x.salary*2
return (name,gender,salary)

rdd2=df.rdd.map(lambda x: func1(x))
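
To actually produce an ArrayType column from a StringType column, as the heading describes, split() can be applied to a delimited string; a minimal sketch with an illustrative comma-separated column.

from pyspark.sql.functions import split
strDF = spark.createDataFrame([("James,Smith",), ("Anna,Rose",)], ["name_csv"])
arrDF = strDF.withColumn("name_array", split(strDF["name_csv"], ","))
arrDF.printSchema()   # name_array is ArrayType(StringType)
arrDF.show(truncate=False)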
d) Write a programme to demonstrate converting Map to column.

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

dataDictionary = [
('James',{'hair':'black','eye':'brown'}),
('Michael',{'hair':'brown','eye':None}),
('Robert',{'hair':'red','eye':'black'}),
('Washington',{'hair':'grey','eye':'grey'}),
('Jefferson',{'hair':'brown','eye':''})
]

df = spark.createDataFrame(data=dataDictionary, schema = ['name','properties'])


df.printSchema()
df.show(truncate=False)

df3=df.rdd.map(lambda x: \
(x.name,x.properties["hair"],x.properties["eye"])) \
.toDF(["name","hair","eye"])
df3.printSchema()
df3.show()

df.withColumn("hair",df.properties.getItem("hair")) \
.withColumn("eye",df.properties.getItem("eye")) \
.drop("properties") \
.show()

df.withColumn("hair",df.properties["hair"]) \
.withColumn("eye",df.properties["eye"]) \
.drop("properties") \
.show()

# Functions
from pyspark.sql.functions import explode,map_keys,col
keysDF = df.select(explode(map_keys(df.properties))).distinct()
keysList = keysDF.rdd.map(lambda x:x[0]).collect()
keyCols = list(map(lambda x: col("properties").getItem(x).alias(str(x)), keysList))
df.select(df.name, *keyCols).show()
e) Write a programme to demonstrate use of explode on array & map.

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('pyspark-by-examples').getOrCreate()

arrayData = [
('James',['Java','Scala'],{'hair':'black','eye':'brown'}),
('Michael',['Spark','Java',None],{'hair':'brown','eye':None}),
('Robert',['CSharp',''],{'hair':'red','eye':''}),
('Washington',None,None),
('Jefferson',['1','2'],{})
]
df = spark.createDataFrame(data=arrayData, schema =
['name','knownLanguages','properties'])
df.printSchema()
df.show()

from pyspark.sql.functions import explode


df2 = df.select(df.name,explode(df.knownLanguages))
df2.printSchema()
df2.show()

from pyspark.sql.functions import explode


df3 = df.select(df.name,explode(df.properties))
df3.printSchema()
df3.show()

from pyspark.sql.functions import explode_outer


""" with array """
df.select(df.name,explode_outer(df.knownLanguages)).show()
""" with map """
df.select(df.name,explode_outer(df.properties)).show()

from pyspark.sql.functions import posexplode


""" with array """
df.select(df.name,posexplode(df.knownLanguages)).show()
""" with map """
df.select(df.name,posexplode(df.properties)).show()

from pyspark.sql.functions import posexplode_outer


""" with array """
df.select(df.name,posexplode_outer(df.knownLanguages)).show()

""" with map """


df.select(df.name,posexplode_outer(df.properties)).show()
f) Write a programme to demonstrate use of explode on nested array.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, flatten

spark = SparkSession.builder.appName('pyspark-by-examples').getOrCreate()

arrayArrayData = [
("James",[["Java","Scala","C++"],["Spark","Java"]]),
("Michael",[["Spark","Java","C++"],["Spark","Java"]]),
("Robert",[["CSharp","VB"],["Spark","Python"]])
]

df = spark.createDataFrame(data=arrayArrayData, schema = ['name','subjects'])


df.printSchema()
df.show(truncate=False)

""" """
df.select(df.name,explode(df.subjects)).show(truncate=False)

""" creates a single array from an array of arrays. """


df.select(df.name,flatten(df.subjects)).show(truncate=False)
4 Spark Aggregate
a) Write a programme to demonstrate PySpark Aggregate with options.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct,collect_list
from pyspark.sql.functions import collect_set,sum,avg,max,countDistinct,count
from pyspark.sql.functions import first, last, kurtosis, min, mean, skewness
from pyspark.sql.functions import stddev, stddev_samp, stddev_pop, sumDistinct
from pyspark.sql.functions import variance,var_samp, var_pop

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

simpleData = [("James", "Sales", 3000),


("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
]
schema = ["employee_name", "department", "salary"]

df = spark.createDataFrame(data=simpleData, schema = schema)


df.printSchema()
df.show(truncate=False)

print("approx_count_distinct: " + \
str(df.select(approx_count_distinct("salary")).collect()[0][0]))

print("avg: " + str(df.select(avg("salary")).collect()[0][0]))

df.select(collect_list("salary")).show(truncate=False)

df.select(collect_set("salary")).show(truncate=False)

df2 = df.select(countDistinct("department", "salary"))


df2.show(truncate=False)
print("Distinct Count of Department & Salary: "+str(df2.collect()[0][0]))

print("count: "+str(df.select(count("salary")).collect()[0]))
df.select(first("salary")).show(truncate=False)
df.select(last("salary")).show(truncate=False)
df.select(kurtosis("salary")).show(truncate=False)
df.select(max("salary")).show(truncate=False)
df.select(min("salary")).show(truncate=False)
df.select(mean("salary")).show(truncate=False)
df.select(skewness("salary")).show(truncate=False)
df.select(stddev("salary"), stddev_samp("salary"), \
stddev_pop("salary")).show(truncate=False)
df.select(sum("salary")).show(truncate=False)
df.select(sumDistinct("salary")).show(truncate=False)
df.select(variance("salary"),var_samp("salary"),var_pop("salary")) \
.show(truncate=False)

b) Write a programme to demonstrate PySpark groupBy with options.


import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col,sum,avg,max

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

simpleData = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]

schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)

df.groupBy("department").sum("salary").show(truncate=False)

df.groupBy("department").count().show(truncate=False)

df.groupBy("department","state") \
.sum("salary","bonus") \
.show(truncate=False)
df.groupBy("department") \
.agg(sum("salary").alias("sum_salary"), \
avg("salary").alias("avg_salary"), \
sum("bonus").alias("sum_bonus"), \
max("bonus").alias("max_bonus") \
)\
.show(truncate=False)

df.groupBy("department") \
.agg(sum("salary").alias("sum_salary"), \
avg("salary").alias("avg_salary"), \
sum("bonus").alias("sum_bonus"), \
max("bonus").alias("max_bonus")) \
.where(col("sum_bonus") >= 50000) \
.show(truncate=False)

c) Write a programme to demonstrate Code of PySpark Count Distinct with options.

from pyspark.sql import SparkSession


spark = SparkSession.builder \
.appName('TusharsExamples') \
.getOrCreate()

data = [("James", "Sales", 3000),


("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
]
columns = ["Name","Dept","Salary"]
df = spark.createDataFrame(data=data,schema=columns)
df.distinct().show()
print("Distinct Count: " + str(df.distinct().count()))

# Using countDistrinct()
from pyspark.sql.functions import countDistinct
df2=df.select(countDistinct("Dept","Salary"))
df2.show()
print("Distinct Count of Department & Salary: "+ str(df2.collect()[0][0]))
d) Write a programme to demonstrate Select First Row of Each Group with options.

from pyspark.sql import SparkSession,Row


spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

data = [("James","Sales",3000),("Michael","Sales",4600),
("Robert","Sales",4100),("Maria","Finance",3000),
("Raman","Finance",3000),("Scott","Finance",3300),
("Jen","Finance",3900),("Jeff","Marketing",3000),
("Kumar","Marketing",2000)
]

df = spark.createDataFrame(data,["Name","Department","Salary"])
df.show()

# Select First Row of Group


from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
w2 = Window.partitionBy("department").orderBy(col("salary"))
df.withColumn("row",row_number().over(w2)) \
.filter(col("row") == 1).drop("row") \
.show()

#Get highest salary of each group


w3 = Window.partitionBy("department").orderBy(col("salary").desc())
df.withColumn("row",row_number().over(w3)) \
.filter(col("row") == 1).drop("row") \
.show()

#Get max, min, avg, sum of each group


from pyspark.sql.functions import col, row_number,avg,sum,min,max
w4 = Window.partitionBy("department")
df.withColumn("row",row_number().over(w3)) \
.withColumn("avg", avg(col("salary")).over(w4)) \
.withColumn("sum", sum(col("salary")).over(w4)) \
.withColumn("min", min(col("salary")).over(w4)) \
.withColumn("max", max(col("salary")).over(w4)) \
.where(col("row")==1).select("department","avg","sum","min","max") \
.show()
5 PySpark SQL DateType and TimestampType
a) Write a programme to demonstrate use of PySpark date & timestamp functions.

from pyspark.sql import SparkSession


# Create SparkSession
spark = SparkSession.builder \
.appName('TusharsExamples') \
.getOrCreate()
data=[["1","2020-02-01"],["2","2019-03-01"],["3","2021-03-01"]]
df=spark.createDataFrame(data,["id","input"])
df.show()

from pyspark.sql.functions import *

#current_date()
df.select(current_date().alias("current_date")
).show(1)

#date_format()
df.select(col("input"),
date_format(col("input"), "MM-dd-yyyy").alias("date_format")
).show()

#to_date()
df.select(col("input"),
to_date(col("input"), "yyy-MM-dd").alias("to_date")
).show()

#datediff()
df.select(col("input"),
datediff(current_date(),col("input")).alias("datediff")
).show()

#months_between()
df.select(col("input"),
months_between(current_date(),col("input")).alias("months_between")
).show()

#trunc()
df.select(col("input"),
trunc(col("input"),"Month").alias("Month_Trunc"),
trunc(col("input"),"Year").alias("Month_Year"),
trunc(col("input"),"Month").alias("Month_Trunc")
).show()

#add_months() , date_add(), date_sub()


df.select(col("input"),
add_months(col("input"),3).alias("add_months"),
add_months(col("input"),-3).alias("sub_months"),
date_add(col("input"),4).alias("date_add"),
date_sub(col("input"),4).alias("date_sub")
).show()

df.select(col("input"),
year(col("input")).alias("year"),
month(col("input")).alias("month"),
next_day(col("input"),"Sunday").alias("next_day"),
weekofyear(col("input")).alias("weekofyear")
).show()

df.select(col("input"),
dayofweek(col("input")).alias("dayofweek"),
dayofmonth(col("input")).alias("dayofmonth"),
dayofyear(col("input")).alias("dayofyear"),
).show()

data=[["1","02-01-2020 11 01 19 06"],["2","03-01-2019 12 01 19 406"],["3","03-01-


2021 12 01 19 406"]]
df2=spark.createDataFrame(data,["id","input"])
df2.show(truncate=False)

#current_timestamp()
df2.select(current_timestamp().alias("current_timestamp")
).show(1,truncate=False)

#to_timestamp()
df2.select(col("input"),
to_timestamp(col("input"), "MM-dd-yyyy HH mm ss SSS").alias("to_timestamp")
).show(truncate=False)

#hour, minute,second
data=[["1","2020-02-01 11:01:19.06"],["2","2019-03-01 12:01:19.406"],["3","2021-03-
01 12:01:19.406"]]
df3=spark.createDataFrame(data,["id","input"])

df3.select(col("input"),
hour(col("input")).alias("hour"),
minute(col("input")).alias("minute"),
second(col("input")).alias("second")
).show(truncate=False)
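
unix_timestamp() and from_unixtime() convert between date/timestamp strings and epoch seconds; a minimal sketch on the date-only df created at the start of this example.

df.select(col("input"),
    unix_timestamp(col("input"), "yyyy-MM-dd").alias("epoch_seconds"),
    from_unixtime(unix_timestamp(col("input"), "yyyy-MM-dd")).alias("formatted")
  ).show(truncate=False)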

b) Write a programme to demonstrate PySpark datediff() functions uses.

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
.appName('TusharsExamples') \
.getOrCreate()
data = [("1","2019-07-01"),("2","2019-06-24"),("3","2019-08-24")]

df=spark.createDataFrame(data=data,schema=["id","date"])

from pyspark.sql.functions import *

df.select(
col("date"),
current_date().alias("current_date"),
datediff(current_date(),col("date")).alias("datediff")
).show()

df.withColumn("datesDiff", datediff(current_date(),col("date"))) \
.withColumn("montsDiff", months_between(current_date(),col("date"))) \
.withColumn("montsDiff_round",round(months_between(current_date(),col("date")),2)) \
.withColumn("yearsDiff",months_between(current_date(),col("date"))/lit(12)) \

.withColumn("yearsDiff_round",round(months_between(current_date(),col("date"))/lit(12)
,2)) \
.show()

data2 = [("1","07-01-2019"),("2","06-24-2019"),("3","08-24-2019")]
df2=spark.createDataFrame(data=data2,schema=["id","date"])
df2.select(
to_date(col("date"),"MM-dd-yyyy").alias("date"),
current_date().alias("endDate")
)

#SQL

spark.sql("select round(months_between('2019-07-01',current_date())/12,2) as
years_diff").show()
c) Write a programme to demonstrate PySpark timestamp & date utilities.

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
.appName('TusharsExamples') \
.getOrCreate()

df=spark.createDataFrame(
data = [ ("1","2019-06-24 12:01:19.000")],
schema=["id","input_timestamp"])
df.printSchema()

from pyspark.sql.functions import *

# Using Cast to convert Timestamp String to DateType


df.withColumn('date_type', col('input_timestamp').cast('date')) \
.show(truncate=False)

# Using Cast to convert TimestampType to DateType


df.withColumn('date_type', to_timestamp('input_timestamp').cast('date')) \
.show(truncate=False)

df.select(to_date(lit('06-24-2019 12:01:19.000'),'MM-dd-yyyy HH:mm:ss.SSSS')) \
    .show()

#Timestamp String to DateType


df.withColumn("date_type",to_date("input_timestamp")) \
.show(truncate=False)

#Timestamp Type to DateType


df.withColumn("date_type",to_date(current_timestamp())) \
.show(truncate=False)

df.withColumn("ts",to_timestamp(col("input_timestamp"))) \
.withColumn("datetype",to_date(col("ts"))) \
.show(truncate=False)

#SQL TimestampType to DateType


spark.sql("select to_date(current_timestamp) as date_type")
#SQL CAST TimestampType to DateType
spark.sql("select date(to_timestamp('2019-06-24 12:01:19.000')) as date_type")
#SQL CAST timestamp string to DateType
spark.sql("select date('2019-06-24 12:01:19.000') as date_type")
#SQL Timestamp String (default format) to DateType
spark.sql("select to_date('2019-06-24 12:01:19.000') as date_type")
#SQL Custom Timeformat to DateType
spark.sql("select to_date('06-24-2019 12:01:19.000','MM-dd-yyyy HH:mm:ss.SSSS') as
date_type")

d) Write a programme to demonstrate PySpark time difference utility.

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
.appName('TusharsExamples') \
.getOrCreate()

dates = [("1","2019-07-01 12:01:19.111"),


("2","2019-06-24 12:01:19.222"),
("3","2019-11-16 16:44:55.406"),
("4","2019-11-16 16:50:59.406")
]

df = spark.createDataFrame(data=dates, schema=["id","from_timestamp"])

from pyspark.sql.functions import *


df2=df.withColumn('from_timestamp',to_timestamp(col('from_timestamp')))\
.withColumn('end_timestamp', current_timestamp())\
.withColumn('DiffInSeconds',col("end_timestamp").cast("long") -
col('from_timestamp').cast("long"))
df2.show(truncate=False)

df.withColumn('from_timestamp',to_timestamp(col('from_timestamp')))\
.withColumn('end_timestamp', current_timestamp())\
.withColumn('DiffInSeconds',unix_timestamp("end_timestamp") -
unix_timestamp('from_timestamp')) \
.show(truncate=False)

df2.withColumn('DiffInMinutes',round(col('DiffInSeconds')/60))\
.show(truncate=False)

df2.withColumn('DiffInHours',round(col('DiffInSeconds')/3600))\
.show(truncate=False)

#Difference between two timestamps when input has just timestamp

data= [("12:01:19.000","13:01:19.000"),
("12:01:19.000","12:02:19.000"),
("16:44:55.406","17:44:55.406"),
("16:50:59.406","16:44:59.406")]
df3 = spark.createDataFrame(data=data, schema=["from_timestamp","to_timestamp"])

df3.withColumn("from_timestamp",to_timestamp(col("from_timestamp"),"HH:mm:ss.SS
S")) \
.withColumn("to_timestamp",to_timestamp(col("to_timestamp"),"HH:mm:ss.SSS")) \
.withColumn("DiffInSeconds", col("from_timestamp").cast("long") -
col("to_timestamp").cast("long")) \
.withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
.withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
.show(truncate=False)

df3 = spark.createDataFrame(
data=[("1","07-01-2019 12:01:19.406")],
schema=["id","input_timestamp"]
)
df3.withColumn("input_timestamp",to_timestamp(col("input_timestamp"),"MM-dd-
yyyy HH:mm:ss.SSS")) \
.withColumn("current_timestamp",current_timestamp().alias("current_timestamp"))
\
.withColumn("DiffInSeconds",current_timestamp().cast("long") -
col("input_timestamp").cast("long")) \
.withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
.withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
.withColumn("DiffInDays",round(col("DiffInSeconds")/24*3600)) \
.show(truncate=False)

#SQL
spark.sql("select unix_timestamp('2019-07-02 12:01:19') - unix_timestamp('2019-07-01 12:01:19') DiffInSeconds").show()
spark.sql("select (unix_timestamp('2019-07-02 12:01:19') - unix_timestamp('2019-07-01 12:01:19'))/60 DiffInMinutes").show()
spark.sql("select (unix_timestamp('2019-07-02 12:01:19') - unix_timestamp('2019-07-01 12:01:19'))/3600 DiffInHours").show()
6 Spark SQL Joins, Spark SQL Schema
a) Write a programme to demonstrate PySpark SQL Join with options.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]

empDF = spark.createDataFrame(data=emp, schema = empColumns)


empDF.printSchema()
empDF.show(truncate=False)

dept = [("Finance",10), \
("Marketing",20), \
("Sales",30), \
("IT",40) \
]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"inner") \
.show(truncate=False)

empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"outer") \
.show(truncate=False)
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"full") \
.show(truncate=False)
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"fullouter") \
.show(truncate=False)

empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"left") \
.show(truncate=False)
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"leftouter") \
.show(truncate=False)

empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"right") \
.show(truncate=False)
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"rightouter") \
.show(truncate=False)

empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"leftsemi") \
.show(truncate=False)

empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"leftanti") \
.show(truncate=False)

empDF.alias("emp1").join(empDF.alias("emp2"), \
col("emp1.superior_emp_id") == col("emp2.emp_id"),"inner") \
.select(col("emp1.emp_id"),col("emp1.name"), \
col("emp2.emp_id").alias("superior_emp_id"), \
col("emp2.name").alias("superior_emp_name")) \
.show(truncate=False)

empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
joinDF.show(truncate=False)

joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF2.show(truncate=False)
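
Because deptDF is a small dimension table, a broadcast hint keeps the join map-side; a minimal sketch using the same DataFrames.

from pyspark.sql.functions import broadcast
empDF.join(broadcast(deptDF), empDF.emp_dept_id == deptDF.dept_id, "inner") \
  .show(truncate=False)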

b) Write a programme to demonstrate Join on multiple DataFrames with options.

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
.appName('TusharsExamples') \
.getOrCreate()
#EMP DataFrame
empData = [(1,"Smith",10), (2,"Rose",20),
(3,"Williams",10), (4,"Jones",30)
]
empColumns = ["emp_id","name","emp_dept_id"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()

#DEPT DataFrame
deptData = [("Finance",10), ("Marketing",20),
("Sales",30),("IT",40)
]
deptColumns = ["dept_name","dept_id"]
deptDF=spark.createDataFrame(deptData,deptColumns)
deptDF.show()

#Address DataFrame
addData=[(1,"1523 Main St","SFO","CA"),
(2,"3453 Orange St","SFO","NY"),
(3,"34 Warner St","Jersey","NJ"),
(4,"221 Cavalier St","Newark","DE"),
(5,"789 Walnut St","Sandiago","CA")
]
addColumns = ["emp_id","addline1","city","state"]
addDF = spark.createDataFrame(addData,addColumns)
addDF.show()

#Join two DataFrames


empDF.join(addDF,empDF["emp_id"] == addDF["emp_id"]).show()

#Drop duplicate column


empDF.join(addDF,["emp_id"]).show()

#Join Multiple DataFrames


empDF.join(addDF,["emp_id"]) \
.join(deptDF,empDF["emp_dept_id"] == deptDF["dept_id"]) \
.show()

#Using Where for Join Condition


empDF.join(deptDF).where(empDF["emp_dept_id"] == deptDF["dept_id"]) \
.join(addDF).where(empDF["emp_id"] == addDF["emp_id"]) \
.show()

#SQL
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
addDF.createOrReplaceTempView("ADD")

spark.sql("select * from EMP e, DEPT d, ADD a " + \


"where e.emp_dept_id == d.dept_id and e.emp_id == a.emp_id") \
.show()
c) Write a programme to demonstrate PySpark Join Multiple Columns with options.

# Import pyspark
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
.appName('TusharsExamples') \
.getOrCreate()

#EMP DataFrame
empData = [(1,"Smith","2018",10,"M",3000),
(2,"Rose","2010",20,"M",4000),
(3,"Williams","2010",10,"M",1000),
(4,"Jones","2005",10,"F",2000),
(5,"Brown","2010",30,"",-1),
(6,"Brown","2010",50,"",-1)
]

empColumns = ["emp_id","name","branch_id","dept_id",
"gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()

#DEPT DataFrame
deptData = [("Finance",10,"2018"),
("Marketing",20,"2010"),
("Marketing",20,"2018"),
("Sales",30,"2005"),
("Sales",30,"2010"),
("IT",50,"2010")
]
deptColumns = ["dept_name","dept_id","branch_id"]
deptDF=spark.createDataFrame(deptData,deptColumns)
deptDF.show()

# PySpark join multiple columns


empDF.join(deptDF, (empDF["dept_id"] == deptDF["dept_id"]) &
( empDF["branch_id"] == deptDF["branch_id"])).show()

# Using where or filter


empDF.join(deptDF).where((empDF["dept_id"] == deptDF["dept_id"]) &
(empDF["branch_id"] == deptDF["branch_id"])).show()
# Create tables
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

# Spark SQL
spark.sql("SELECT * FROM EMP e, DEPT d where e.dept_id == d.dept_id"
" and e.branch_id == d.branch_id").show()
7 Spark SQL StructType & SQL Functions
a) Write a programme to demonstrate PySpark map() with options.

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

data = ["Project",
"Gutenberg’s",
"Alice’s",
"Adventures",
"in",
"Wonderland",
"Project",
"Gutenberg’s",
"Adventures",
"in",
"Wonderland",
"Project",
"Gutenberg’s"]

rdd=spark.sparkContext.parallelize(data)

rdd2=rdd.map(lambda x: (x,1))
for element in rdd2.collect():
print(element)

data = [('James','Smith','M',30),
('Anna','Rose','F',41),
('Robert','Williams','M',62),
]

columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()

rdd2=df.rdd.map(lambda x:
(x[0]+","+x[1],x[2],x[3]*2)
)
df2=rdd2.toDF(["name","gender","new_salary"] )
df2.show()

#Referring Column Names


rdd2=df.rdd.map(lambda x:
(x["firstname"]+","+x["lastname"],x["gender"],x["salary"]*2)
)
#Referring Column Names
rdd2=df.rdd.map(lambda x:
(x.firstname+","+x.lastname,x.gender,x.salary*2)
)

def func1(x):
firstName=x.firstname
lastName=x.lastname
name=firstName+","+lastName
gender=x.gender.lower()
salary=x.salary*2
return (name,gender,salary)

rdd2=df.rdd.map(lambda x: func1(x))

b) Write a programme to demonstrate Window Functions with options.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

simpleData = (("James", "Sales", 3000), \


("Michael", "Sales", 4600), \
("Robert", "Sales", 4100), \
("Maria", "Finance", 3000), \
("James", "Sales", 3000), \
("Scott", "Finance", 3300), \
("Jen", "Finance", 3900), \
("Jeff", "Marketing", 3000), \
("Kumar", "Marketing", 2000),\
("Saif", "Sales", 4100) \
)

columns= ["employee_name", "department", "salary"]

df = spark.createDataFrame(data = simpleData, schema = columns)

df.printSchema()
df.show(truncate=False)

from pyspark.sql.window import Window


from pyspark.sql.functions import row_number
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_number",row_number().over(windowSpec)) \
.show(truncate=False)

from pyspark.sql.functions import rank


df.withColumn("rank",rank().over(windowSpec)) \
.show()

from pyspark.sql.functions import dense_rank


df.withColumn("dense_rank",dense_rank().over(windowSpec)) \
.show()

from pyspark.sql.functions import percent_rank


df.withColumn("percent_rank",percent_rank().over(windowSpec)) \
.show()

from pyspark.sql.functions import ntile


df.withColumn("ntile",ntile(2).over(windowSpec)) \
.show()

from pyspark.sql.functions import cume_dist


df.withColumn("cume_dist",cume_dist().over(windowSpec)) \
.show()

from pyspark.sql.functions import lag


df.withColumn("lag",lag("salary",2).over(windowSpec)) \
.show()

from pyspark.sql.functions import lead


df.withColumn("lead",lead("salary",2).over(windowSpec)) \
.show()

windowSpecAgg = Window.partitionBy("department")
from pyspark.sql.functions import col,avg,sum,min,max,row_number
df.withColumn("row",row_number().over(windowSpec)) \
.withColumn("avg", avg(col("salary")).over(windowSpecAgg)) \
.withColumn("sum", sum(col("salary")).over(windowSpecAgg)) \
.withColumn("min", min(col("salary")).over(windowSpecAgg)) \
.withColumn("max", max(col("salary")).over(windowSpecAgg)) \
.where(col("row")==1).select("department","avg","sum","min","max") \
.show()
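
Window frames can also be bounded explicitly with rowsBetween(); the sketch below computes a running total of salary within each department.

from pyspark.sql.window import Window
from pyspark.sql.functions import sum
runningSpec = Window.partitionBy("department").orderBy("salary") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("running_total", sum("salary").over(runningSpec)).show(truncate=False)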
c) Write a programme to demonstrate PySpark JSON Functions with options.

from pyspark.sql import SparkSession,Row


spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC
PARQUE","State":"PR"}"""
df=spark.createDataFrame([(1, jsonString)],["id","value"])
df.show(truncate=False)

#Convert JSON string column to Map type


from pyspark.sql.types import MapType,StringType
from pyspark.sql.functions import from_json
df2=df.withColumn("value",from_json(df.value,MapType(StringType(),StringType())))
df2.printSchema()
df2.show(truncate=False)

from pyspark.sql.functions import to_json,col


df2.withColumn("value",to_json(col("value"))) \
.show(truncate=False)

from pyspark.sql.functions import json_tuple


df.select(col("id"),json_tuple(col("value"),"Zipcode","ZipCodeType","City")) \
.toDF("id","Zipcode","ZipCodeType","City") \
.show(truncate=False)

from pyspark.sql.functions import get_json_object


df.select(col("id"),get_json_object(col("value"),"$.ZipCodeType").alias("ZipCodeType")) \
.show(truncate=False)

from pyspark.sql.functions import schema_of_json,lit


schemaStr=spark.range(1) \
  .select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))) \
  .collect()[0][0]
print(schemaStr)
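
from_json() can also target an explicit StructType instead of a MapType; a minimal sketch on the same df.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import from_json
zipSchema = StructType([
    StructField("Zipcode", IntegerType(), True),
    StructField("ZipCodeType", StringType(), True),
    StructField("City", StringType(), True),
    StructField("State", StringType(), True)
])
df.withColumn("value", from_json(df.value, zipSchema)).printSchema()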
8 Spark SQL
a) Write a programme to demonstrate PySpark SQL examples.

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('TusharsExamples') \
.getOrCreate()

# Create DataFrame from uploaded CSV


df = spark.read \
.option("header",True) \
.csv("/Users/admin/simple-zipcodes.csv")
df.printSchema()
df.show()

# Create SQL table


spark.read \
.option("header",True) \
.csv("/Users/admin/simple-zipcodes.csv") \
.createOrReplaceTempView("Zipcodes")

# Select query
df.select("country","city","zipcode","state") \
.show(5)

spark.sql("SELECT country, city, zipcode, state FROM ZIPCODES") \


.show(5)

# where
df.select("country","city","zipcode","state") \
.where("state == 'AZ'") \
.show(5)

spark.sql(""" SELECT country, city, zipcode, state FROM ZIPCODES


WHERE state = 'AZ' """) \
.show(5)

# sorting
df.select("country","city","zipcode","state") \
.where("state in ('PR','AZ','FL')") \
.orderBy("state") \
.show(10)

spark.sql(""" SELECT country, city, zipcode, state FROM ZIPCODES


WHERE state in ('PR','AZ','FL') order by state """) \
.show(10)

# grouping
df.groupBy("state").count() \
.show()

spark.sql(""" SELECT state, count(*) as count FROM ZIPCODES


GROUP BY state""") \
.show()

b) Write a programme to demonstrate PySpark SQL expr() Function

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

from pyspark.sql.functions import expr


#Concatenate columns
data=[("James","Bond"),("Scott","Varsa")]
df=spark.createDataFrame(data).toDF("col1","col2")
df.withColumn("Name",expr(" col1 ||','|| col2")).show()

#Using CASE WHEN sql expression


data = [("James","M"),("Michael","F"),("Jen","")]
columns = ["name","gender"]
df = spark.createDataFrame(data = data, schema = columns)
df2 = df.withColumn("gender", expr("CASE WHEN gender = 'M' THEN 'Male' " +
"WHEN gender = 'F' THEN 'Female' ELSE 'unknown' END"))
df2.show()

#Add months from a value of another column


data=[("2019-01-23",1),("2019-06-24",2),("2019-09-20",3)]
df=spark.createDataFrame(data).toDF("date","increment")
df.select(df.date,df.increment,
expr("add_months(date,increment)")
.alias("inc_date")).show()

# Providing alias using 'as'


df.select(df.date,df.increment,
expr("""add_months(date,increment) as inc_date""")
).show()

# Add
df.select(df.date,df.increment,
expr("increment + 5 as new_increment")
).show()
# Using cast to convert data types
df.select("increment",expr("cast(increment as string) as str_increment")) \
.printSchema()

#Use expr() to filter the rows


data=[(100,2),(200,3000),(500,500)]
df=spark.createDataFrame(data).toDF("col1","col2")
df.filter(expr("col1 == col2")).show()

c) Write a programme to demonstrate PySpark Select Columns From DataFrame.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

data = [("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
]

columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)

df.select("firstname").show()

df.select("firstname","lastname").show()

#Using Dataframe object name


df.select(df.firstname,df.lastname).show()

# Using col function


from pyspark.sql.functions import col
df.select(col("firstname"),col("lastname")).show()

data = [(("James",None,"Smith"),"OH","M"),
(("Anna","Rose",""),"NY","F"),
(("Julia","","Williams"),"OH","F"),
(("Maria","Anne","Jones"),"NY","M"),
(("Jen","Mary","Brown"),"NY","M"),
(("Mike","Mary","Williams"),"OH","M")
]
from pyspark.sql.types import StructType,StructField, StringType
schema = StructType([
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField('state', StringType(), True),
StructField('gender', StringType(), True)
])

df2 = spark.createDataFrame(data = data, schema = schema)


df2.printSchema()
df2.show(truncate=False) # shows all columns

df2.select("name").show(truncate=False)
df2.select("name.firstname","name.lastname").show(truncate=False)
df2.select("name.*").show(truncate=False)

d) Write a programme to demonstrate mapPartitions() with options.

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()
data = [('James','Smith','M',3000),
('Anna','Rose','F',4100),
('Robert','Williams','M',6200),
]

columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()

#Example 1 mapPartitions()
def reformat(partitionData):
for row in partitionData:
yield [row.firstname+","+row.lastname,row.salary*10/100]
df2=df.rdd.mapPartitions(reformat).toDF(["name","bonus"])
df2.show()

#Example 2 mapPartitions()
def reformat2(partitionData):
updatedData = []
for row in partitionData:
name=row.firstname+","+row.lastname
bonus=row.salary*10/100
updatedData.append([name,bonus])
return iter(updatedData)

df2=df.rdd.mapPartitions(reformat2).toDF(["name","bonus"])
df2.show()

e) Write a programme to demonstrate PySpark collect() with options.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

dept = [("Finance",10), \
("Marketing",20), \
("Sales",30), \
("IT",40) \
]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

dataCollect = deptDF.collect()

print(dataCollect)

dataCollect2 = deptDF.select("dept_name").collect()
print(dataCollect2)

for row in dataCollect:
    print(row['dept_name'] + "," + str(row['dept_id']))
9 Spark Data Source API Part 1
a) Write a programme to demonstrate PySpark Read CSV.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.types import ArrayType, DoubleType, BooleanType
from pyspark.sql.functions import col,array_contains

spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

df = spark.read.csv("/content/zipcodes.csv")

df.printSchema()

df2 = spark.read.option("header",True) \
.csv("/content/zipcodes.csv")
df2.printSchema()

df3 = spark.read.options(header='True', delimiter=',') \
    .csv("/content/zipcodes.csv")
df3.printSchema()

schema = StructType() \
.add("RecordNumber",IntegerType(),True) \
.add("Zipcode",IntegerType(),True) \
.add("ZipCodeType",StringType(),True) \
.add("City",StringType(),True) \
.add("State",StringType(),True) \
.add("LocationType",StringType(),True) \
.add("Lat",DoubleType(),True) \
.add("Long",DoubleType(),True) \
.add("Xaxis",IntegerType(),True) \
.add("Yaxis",DoubleType(),True) \
.add("Zaxis",DoubleType(),True) \
.add("WorldRegion",StringType(),True) \
.add("Country",StringType(),True) \
.add("LocationText",StringType(),True) \
.add("Location",StringType(),True) \
.add("Decommisioned",BooleanType(),True) \
.add("TaxReturnsFiled",StringType(),True) \
.add("EstimatedPopulation",IntegerType(),True) \
.add("TotalWages",IntegerType(),True) \
.add("Notes",StringType(),True)
df_with_schema = spark.read.format("csv") \
.option("header", True) \
.schema(schema) \
.load("/content/zipcodes.csv")
df_with_schema.printSchema()

df2.write.option("header",True) \
.csv("/zipcodes123")

b) Write a programme to demonstrate PySpark read and write Parquet file.

# Imports
import pyspark
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("parquetFile").getOrCreate()
data =[("James ","","Smith","36636","M",3000),
("Michael ","Rose","","40288","M",4000),
("Robert ","","Williams","42114","M",4000),
("Maria ","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1)]
columns=["firstname","middlename","lastname","dob","gender","salary"]
df=spark.createDataFrame(data,columns)
df.write.mode("overwrite").parquet("/tmp/output/people.parquet")
parDF1=spark.read.parquet("/tmp/output/people.parquet")
parDF1.createOrReplaceTempView("parquetTable")
parDF1.printSchema()
parDF1.show(truncate=False)

parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")


parkSQL.show(truncate=False)

spark.sql("CREATE TEMPORARY VIEW PERSON USING parquet OPTIONS (path


\"/tmp/output/people.parquet\")")
spark.sql("SELECT * FROM PERSON").show()

df.write.partitionBy("gender","salary").mode("overwrite").parquet("/tmp/output/people2.parquet")

parDF2=spark.read.parquet("/tmp/output/people2.parquet/gender=M")
parDF2.show(truncate=False)

spark.sql("CREATE TEMPORARY VIEW PERSON2 USING parquet OPTIONS (path


\"/tmp/output/people2.parquet/gender=F\")")
spark.sql("SELECT * FROM PERSON2" ).show()
c) Spark MLlib in Python to train a linear regression model

# Imports
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Create SparkSession (needed because a DataFrame is built below)
spark = SparkSession.builder.appName('TusharsExamples').getOrCreate()

# Sample training data


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0), (5.0, 6.0)]
df = spark.createDataFrame(data, ["features", "label"])

# Define a feature vector assembler


assembler = VectorAssembler(inputCols=["features"], outputCol="features_vec")

# Transform the DataFrame with the feature vector assembler


df = assembler.transform(df)

# Create a LinearRegression model


lr = LinearRegression(featuresCol="features_vec", labelCol="label")

# Fit the model to the training data


model = lr.fit(df)

# Print the coefficients and intercept of the model


print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))

# Stop the SparkSession


spark.stop()

10 Spark Data Source API Part 2


a) Write a programme to demonstrate Writing PySpark DataFrame to Hive table.

from os.path import abspath


from pyspark.sql import SparkSession

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

# Create spark session with hive enabled


spark = SparkSession \
.builder \
.appName("TusharsExamples") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.config("spark.sql.catalogImplementation", "hive") \
.enableHiveSupport() \
.getOrCreate()

columns = ["id", "name","age","gender"]

# Create DataFrame
data = [(1, "James",30,"M"), (2, "Ann",40,"F"),
(3, "Jeff",41,"M"),(4, "Jennifer",20,"F")]
sampleDF = spark.sparkContext.parallelize(data).toDF(columns)

# Create Hive Internal table


sampleDF.write.mode('overwrite') \
.saveAsTable("employee")

df = spark.read.table("employee")
df.show()

b) Write a programme to demonstrate PySpark Save Hive Table From Temp view.

from os.path import abspath


from pyspark.sql import SparkSession

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

# Create spark session with hive enabled


spark = SparkSession \
.builder \
.appName("TusharsExamples") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.config("spark.sql.catalogImplementation", "hive") \
.enableHiveSupport() \
.getOrCreate()

columns = ["id", "name","age","gender"]

# Create DataFrame
data = [(1, "James",30,"M"), (2, "Ann",40,"F"),
(3, "Jeff",41,"M"),(4, "Jennifer",20,"F")]
sampleDF = spark.sparkContext.parallelize(data).toDF(columns)

# Create temporary view


sampleDF.createOrReplaceTempView("sampleView")
# Create a Database CT
spark.sql("CREATE DATABASE IF NOT EXISTS ct")

# Create a Table naming as sampleTable under CT database.


spark.sql("CREATE TABLE ct.sampleTable (id Int, name String, age Int, gender String)")

# Insert into sampleTable using the sampleView.


spark.sql("INSERT INTO TABLE ct.sampleTable SELECT * FROM sampleView")

# Lets view the data in the table


spark.sql("SELECT * FROM ct.sampleTable").show()

c) Write a programme to demonstrate PySpark Read Hive Table from Remote Hive.

from pyspark.sql import SparkSession

# Create spark session


spark = SparkSession \
.builder \
.appName("TusharsExamples") \
.enableHiveSupport() \
.getOrCreate()

# Read hive table using table()


df = spark.read.table("employee")
df.show()

df = spark.sql("select * from employee")


df.show()
