
ICS 474 - Big Data Analytics
Lecture 09: Spark SQL

Dr. Muzammil Behzad – Assistant Professor
Information and Computer Science,
King Fahd University of Petroleum and Minerals
Email: [email protected]
Outline
• What is Spark SQL?
• Spark SQL Features
• Spark SQL Architecture
• Spark SQL Libraries
• DataSource API
• DataFrame API
• Spark DataFrames
• Key Features of DataFrames
• Example Companies who use Spark SQL
• How to create DataFrames
• How to Manipulate DataFrames
• PySpark SQL functions
DIKW (Data, Information, Knowledge, Wisdom)

What is Spark SQL?
Spark SQL Features
Key Observations:
• Hadoop (shown in red) takes significantly
more time as the number of iterations
increases. Each iteration takes approximately
110 seconds.
• Spark SQL (shown in blue) is much faster and
scales efficiently. The first iteration takes
around 80 seconds, but subsequent iterations
only take 1 second each, highlighting its
performance optimization.

Conclusion:
• Spark SQL is much more efficient than
Hadoop for iterative processes, offering
dramatically reduced run times as the
number of iterations grows.
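Much of this gap comes from Spark keeping intermediate data in memory between iterations, while Hadoop MapReduce re-reads from disk on every pass. A minimal PySpark sketch of that idea (the file path and the status column are hypothetical; assumes an existing SparkSession named spark):

logs = spark.read.parquet("hdfs:///data/logs.parquet")
logs.cache()  # keep the data in memory after the first action computes it

for i in range(10):
    # Later iterations reuse the cached, in-memory data instead of re-reading
    # from disk, which is why only the first pass is slow.
    errors = logs.filter(logs.status == 500).count()
    print(i, errors)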
Spark SQL Architecture

DSL = Domain-Specific Language
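The architecture exposes two equivalent front ends over the same engine: SQL queries and the DataFrame DSL. A minimal sketch, assuming a SparkSession spark and a DataFrame df with name and salary columns:

# Register the DataFrame so it can be queried with SQL.
df.createOrReplaceTempView("employees")

# The same query through the SQL front end...
sql_result = spark.sql("SELECT name, salary FROM employees WHERE salary > 2000")

# ...and through the DataFrame DSL; both produce equivalent plans.
dsl_result = df.filter(df.salary > 2000).select("name", "salary")

sql_result.show()
dsl_result.show()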


DataSource API
• Apache Spark's Data Source API provides seamless
integration with numerous data storage systems.
• It enables distributed processing on large datasets
by connecting with both structured and
unstructured data repositories.
• Used for reading structured and semi-structured
data into Spark.

• Features (see the sketch below):
• Can handle structured and semi-structured data
• Can load files in multiple formats
• Third-party integration is supported
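A minimal sketch of these features (file names, the JDBC URL, and credentials are hypothetical; assumes an existing SparkSession spark):

# Built-in formats for structured and semi-structured data.
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
df_json = spark.read.json("events.json")          # semi-structured JSON
df_parquet = spark.read.parquet("sales.parquet")  # columnar Parquet

# Third-party sources plug in through the generic format()/load() entry point,
# e.g. a JDBC database table.
df_jdbc = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/shop")
           .option("dbtable", "public.customers")
           .option("user", "analytics")
           .option("password", "secret")
           .load())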

DataFrame API
• Distributed collection of data
organized into named columns
• Equivalent to a relational table in SQL

Spark SQL Libraries

1. Data Source API: Connects to various data sources like databases and file systems.
2. DataFrame API: Handles structured data in a table-like format.
3. Interpreter & Optimizer: Analyzes and optimizes queries for better performance.
4. SQL Service: Supports SQL queries for interacting with Spark data.
Spark DataFrames
• A DataFrame is a Dataset organized into named columns.
• It is conceptually equivalent to a table in a relational
database or a data frame in R/Python.
• DataFrames process structured and semi-structured data.
• They can handle petabytes of data.
• DataFrames can be constructed from a wide array of
sources such as: structured data files, semi-structured
data files, databases, or existing RDDs.
• DataFrame API is available in Scala, Java, Python, and R.

Key Features of DataFrames
• Distributed
• Immutable
• Lazy Evaluation
• Automatic optimization of code
• SQL support
• Language support available for Python, Scala, Java, and R
• Provide a data source API to read from different sources and formats
• Real-time query processing
• Many times faster than Hadoop Hive
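Two of these features, lazy evaluation and automatic optimization, can be seen directly. A minimal sketch, using the df1 employee DataFrame created on a later slide:

# Transformations are lazy: this only builds a query plan.
high_paid = df1.filter(df1.salary > 2000).select("name", "salary")

# explain() prints the plans produced by the optimizer, without running the job.
high_paid.explain(True)

# Only an action such as show() or count() actually triggers execution.
high_paid.show()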

Example Companies Who Use Spark SQL

Creating Spark DataFrames
• Programmatically
• From an existing RDD
• By loading data, for example from JSON, CSV, etc.
• You can create a PySpark DataFrame using toDF() and createDataFrame()

Creating a DataFrame From a CSV File

df1 = spark.read.csv("fileName.csv", inferSchema=True, header=True)

PySpark SQL in Colab
!pip install pyspark

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession running locally, with the Spark UI on port 4050.
spark = SparkSession.builder \
    .master("local") \
    .appName("Colab") \
    .config("spark.ui.port", "4050") \
    .getOrCreate()

Creating a DataFrame Programmatically
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]

empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]

df1 = spark.createDataFrame(data=emp, schema = empColumns)

type(df1)
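The schema and rows can then be inspected with standard DataFrame methods:

df1.printSchema()  # column names and inferred types
df1.show()         # the six employee rows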
Creating a DataFrame From an RDD
empData = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
sc = spark.sparkContext   # SparkContext obtained from the existing SparkSession
rdd1 = sc.parallelize(empData)

empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]

df1 = rdd1.toDF(empColumns)
type(df1)

Manipulating DataFrames
• df1.select(df1.salary, df1.gender).show()

• df1.filter(df1.salary > 6000).show()

• df1.filter(df1.salary > 6000).count()

• df1.filter((df1.salary > 6000) & (df1.gender == 'M')).show()

• df1.orderBy(df1.name).show()

• df1.orderBy(df1.name.desc()).show()

• df1.sort(df1.gender.desc()).show()

• df1.groupBy(df1.salary).count().show()

• df1.withColumn('New_Salary',df1.salary*1.03).show()

• df1.withColumnRenamed('gender','sex').columns

• df2 = df1.filter(df1.salary > 6000); df2.show()

• df1.drop("salary")
• df1.select("gender").distinct().count()
• df3 = df1.union(df2)
• df1.limit(3).show()
• df1.crosstab("salary", "gender").show()
• df1.select("salary", "gender").dropDuplicates().show()
• df1.dropna().count()
• df1.fillna(-1).show()
• df2 = df1.sample(False, 0.25, 23)
• df1.select("salary").rdd.map(lambda x: (x, 1)).take(3)
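These operations chain naturally. A minimal sketch combining a few of them on df1 (nothing runs until show() is called):

(df1.filter(df1.salary > 1500)
    .withColumn("new_salary", df1.salary * 1.03)
    .orderBy(df1.salary.desc())
    .select("name", "emp_dept_id", "new_salary")
    .show())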

PySpark SQL Functions
• PySpark – Aggregate Functions
• PySpark – Window Functions

PySpark Aggregate Functions
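A minimal sketch of a few common aggregate functions, applied to the df1 employee DataFrame created earlier:

from pyspark.sql import functions as F

# Per-department aggregates via groupBy().agg().
df1.groupBy("emp_dept_id").agg(
    F.count("*").alias("employees"),
    F.sum("salary").alias("total_salary"),
    F.avg("salary").alias("avg_salary"),
    F.max("salary").alias("max_salary"),
).show()

# Whole-DataFrame aggregates work the same way without a groupBy().
df1.select(F.min("salary"), F.countDistinct("emp_dept_id")).show()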
PySpark Window Functions
• PySpark Window functions operate on a group of rows and return a single value
for every input row.
• PySpark SQL supports three kinds of window functions (one of each is sketched below):
• Ranking functions
• Analytic functions
• Aggregate functions

https://sparkbyexamples.com/pyspark-tutorial
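A minimal sketch showing one function of each kind, again on the df1 employee DataFrame:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Partition by department, highest salary first.
w = Window.partitionBy("emp_dept_id").orderBy(F.desc("salary"))

df1.withColumn("rank", F.rank().over(w)) \
   .withColumn("prev_salary", F.lag("salary", 1).over(w)) \
   .withColumn("dept_avg_salary",
               F.avg("salary").over(Window.partitionBy("emp_dept_id"))) \
   .show()
# rank() -> ranking function, lag() -> analytic function, avg() -> aggregate function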

Dr. Muzammil Behzad – Assistant Professor
King Fahd University of Petroleum and Minerals
Email: [email protected]
