T09 Spark SQL
What is Spark SQL?
Spark SQL Features
Key Observations:
• Hadoop (shown in red) takes significantly more time as the number of iterations increases. Each iteration takes approximately 110 seconds.
• Spark SQL (shown in blue) is much faster and scales efficiently. The first iteration takes around 80 seconds, but subsequent iterations only take 1 second each, highlighting its performance optimization.
Conclusion:
• Spark SQL is much more efficient than Hadoop for iterative processes, offering dramatically reduced run times as the number of iterations grows.
Spark SQL Architecture
DataSource API
• Features:
• Can handle structured and semi-structured data
• Can load files in multiple formats (see the sketch below)
• Supports third-party integrations
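As a minimal sketch of the DataSource API (assuming an active SparkSession named spark; file names and connection details are placeholders, not taken from the slides):

# Read data in several formats through the same spark.read interface.
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
df_json = spark.read.json("people.json")
df_parquet = spark.read.parquet("people.parquet")

# Third-party sources plug into the same interface via format(),
# e.g. a JDBC database (URL and credentials are illustrative only).
df_jdbc = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/testdb") \
    .option("dbtable", "people") \
    .option("user", "user") \
    .option("password", "secret") \
    .load()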
DataFrames API
• Distributed collection of data organized into named columns
• Equivalent to a relational table in SQL
Spark SQL Libraries
Key Features of DataFrames
• Distributed
• Immutable
• Lazy evaluation
• Automatic optimization of code
• SQL support
• Language support available for Python, Scala, Java, and R
• Provide a data source API to read from different sources and formats
• Real-time query processing
• Many times faster than Hadoop Hive
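To illustrate the lazy evaluation and SQL support points above, a small self-contained sketch (assuming an active SparkSession named spark; the data is made up for illustration):

# A tiny DataFrame to demonstrate lazy evaluation and SQL support.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"])

# Transformations are lazy: nothing is executed yet.
adults = people.filter(people.age > 30).select("name")

# Registering a temporary view allows plain SQL over the same data.
people.createOrReplaceTempView("people")
sql_adults = spark.sql("SELECT name FROM people WHERE age > 30")

# Actions such as show() trigger the actual computation.
adults.show()
sql_adults.show()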
Example Companies Who Use Spark SQL
Spark DataFrames
• Programmatically
• From an existing RDD
• By loading data, for example from JSON, CSV, etc.
• You can create a PySpark DataFrame using toDF() and createDataFrame()
Creating a DataFrame From CSV File
df1 = spark.read.csv("fileName.csv", inferSchema=True, header=True)
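Once loaded, the inferred schema and a few rows can be checked (a usage sketch; fileName.csv is the placeholder from the slide):

df1.printSchema()   # column names and the types inferred from the file
df1.show(5)         # display the first five rows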
PySpark SQL in Colab
!pip install pyspark

from pyspark.sql import SparkSession  # entry point for DataFrames and SQL

spark = SparkSession.builder\
.master("local")\
.appName("Colab")\
.config('spark.ui.port', '4050')\
.getOrCreate()
Creating DataFrame Programmatically
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
# Build the DataFrame from the list of tuples and the column names.
df1 = spark.createDataFrame(data=emp, schema=empColumns)
type(df1)
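As an alternative sketch (not shown on the slide), an explicit schema can be supplied instead of a plain list of column names, which avoids type inference:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

empSchema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("superior_emp_id", IntegerType(), True),
    StructField("year_joined", StringType(), True),
    StructField("emp_dept_id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df1 = spark.createDataFrame(data=emp, schema=empSchema)
df1.printSchema()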
Creating DataFrame From RDD
empData = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
rdd1 = sc.parallelize(empData)
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
df1 = rdd1.toDF(empColumns)
type(df1)
Manipulating DataFrames
• df1.select(df1.salary, df1.gender).show()
• df1.orderBy(df1.name).show()
• df1.orderBy(df1.name.desc()).show()
• df1.sort(df1.gender.desc()).show()
• df1.groupBy(df1.salary).count().show()
• df1.withColumn('New_Salary',df1.salary*1.03).show()
• df1.withColumnRenamed('gender','sex').columns
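A short sketch chaining a few of the calls above on the df1 employee DataFrame (the filter condition and the groupBy/agg combination are illustrative, not taken from the slide):

from pyspark.sql import functions as F

# Select, filter, and order in one chain.
df1.select("name", "salary") \
   .filter(df1.salary > 0) \
   .orderBy(df1.salary.desc()) \
   .show()

# Group by department and aggregate salaries.
df1.groupBy("emp_dept_id") \
   .agg(F.count("*").alias("employees"),
        F.avg("salary").alias("avg_salary")) \
   .show()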
Manipulating DataFrames
• df.drop(“salary")
• df.select(“gender”).distinct().count()
• df3 = df1.union(df2)
• df.limit(3).show()
• df1.crosstab(‘salary’, ‘gender').show()
• df1.select(‘salary’,gender').dropDuplicates().show()
• df1.dropna().count()
• df1.fillna(-1).show()
• df2 = df1.sample(False, 0.25, 23)
• df1.select(‘salary').map(lambda x:(x,1)).take(3)
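A sketch tying the sampling and cleaning calls above together on the same employee DataFrame (the sample fraction, seed, and fill value are the ones shown on the slide):

# 25% sample without replacement, seed 23, appended back onto df1.
df2 = df1.sample(False, 0.25, 23)
df3 = df1.union(df2)

# Remove the exact duplicate rows introduced by the union.
df3 = df3.dropDuplicates()

# Count rows with no missing values, or fill missing numeric values.
print(df3.dropna().count())
df3.fillna(-1).show()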
PySpark SQL Functions
• PySpark – Aggregate Functions
• PySpark – Window Functions
PySpark Aggregate Functions
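The slide's illustration is not reproduced here; as a minimal sketch of common aggregate functions applied to the df1 employee DataFrame:

from pyspark.sql import functions as F

df1.select(
    F.count("salary").alias("count"),
    F.sum("salary").alias("total"),
    F.avg("salary").alias("average"),
    F.min("salary").alias("minimum"),
    F.max("salary").alias("maximum"),
).show()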
PySpark Window Functions
• PySpark Window functions operate on a group of rows and return a single value for every input row.
• PySpark SQL supports three kinds of window functions:
• Ranking functions
• Analytic functions
• Aggregate functions
https://sparkbyexamples.com/pyspark-tutorial
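A minimal sketch of a ranking window function (row_number over a per-department window on the df1 employee DataFrame; the partitioning choice is illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank employees by salary within each department.
win = Window.partitionBy("emp_dept_id").orderBy(F.col("salary").desc())

df1.withColumn("row_number", F.row_number().over(win)) \
   .select("emp_dept_id", "name", "salary", "row_number") \
   .show()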
Dr. Muzammil Behzad – Assistant Professor
King Fahd University of Petroleum and Minerals
Email: [email protected]