T09 Spark SQL
What is Spark SQL?
Spark SQL Features
Key Observations:
• Hadoop (shown in red) takes significantly more time as the number of iterations increases. Each iteration takes approximately 110 seconds.
• Spark SQL (shown in blue) is much faster and scales efficiently. The first iteration takes around 80 seconds, but subsequent iterations only take 1 second each, highlighting its performance optimization.
Conclusion:
• Spark SQL is much more efficient than Hadoop for iterative processes, offering dramatically reduced run times as the number of iterations grows.
Spark SQL Architecture
DataSource API
• Features:
• Can handle structured and semi-structured data
• Can load files in multiple formats (see the sketch below)
• Supports third-party integrations
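As a minimal sketch of the DataSource API (assuming an active SparkSession named spark; file names and connection details are placeholders, not taken from the slides):

# Read data in several formats through the same spark.read interface.
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
df_json = spark.read.json("people.json")
df_parquet = spark.read.parquet("people.parquet")

# Third-party sources plug into the same interface via format(),
# e.g. a JDBC database (URL and credentials are illustrative only).
df_jdbc = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/testdb") \
    .option("dbtable", "people") \
    .option("user", "user") \
    .option("password", "secret") \
    .load()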
DataFrames API
• Distributed collection of data organized into named columns
• Equivalent to a relational table in SQL
Spark SQL Libraries
Key Features of DataFrames
• Distributed
• Immutable
• Lazy evaluation
• Automatic optimization of code
• SQL support
• Language support available for Python, Scala, Java, and R
• Provide a data source API to read from different sources and formats
• Real-time query processing
• Many times faster than Hadoop Hive
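To illustrate the lazy evaluation and SQL support points above, a small self-contained sketch (assuming an active SparkSession named spark; the data is made up for illustration):

# A tiny DataFrame to demonstrate lazy evaluation and SQL support.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"])

# Transformations are lazy: nothing is executed yet.
adults = people.filter(people.age > 30).select("name")

# Registering a temporary view allows plain SQL over the same data.
people.createOrReplaceTempView("people")
sql_adults = spark.sql("SELECT name FROM people WHERE age > 30")

# Actions such as show() trigger the actual computation.
adults.show()
sql_adults.show()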
Example Companies Who Use Spark SQL
Spark DataFrames
• Programmatically
• From an existing RDD
• By loading data, for example from JSON, CSV, etc.
• You can create a PySpark DataFrame using toDF() and createDataFrame()
Creating a DataFrame From CSV File
df1 = spark.read.csv("fileName.csv", inferSchema=True, header=True)
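Once loaded, the inferred schema and a few rows can be checked (a usage sketch; fileName.csv is the placeholder from the slide):

df1.printSchema()   # column names and the types inferred from the file
df1.show(5)         # display the first five rows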
PySpark SQL in Colab
!pip install pyspark

from pyspark.sql import SparkSession  # entry point for DataFrames and SQL

spark = SparkSession.builder\
.master("local")\
.appName("Colab")\
.config('spark.ui.port', '4050')\
.getOrCreate()
Creating DataFrame Programmatically
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
# Build the DataFrame from the list of tuples and the column names.
df1 = spark.createDataFrame(data=emp, schema=empColumns)
type(df1)
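As an alternative sketch (not shown on the slide), an explicit schema can be supplied instead of a plain list of column names, which avoids type inference:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

empSchema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("superior_emp_id", IntegerType(), True),
    StructField("year_joined", StringType(), True),
    StructField("emp_dept_id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df1 = spark.createDataFrame(data=emp, schema=empSchema)
df1.printSchema()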
Creating DataFrame From RDD
empData = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
rdd1 = sc.parallelize(empData)
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
df1 = rdd1.toDF(empColumns)
type(df1)
Manipulating DataFrames
• df1.select(df1.salary, df1.gender).show()
• df1.orderBy(df1.name).show()
• df1.orderBy(df1.name.desc()).show()
• df1.sort(df1.gender.desc()).show()
• df1.groupBy(df1.salary).count().show()
• df1.withColumn('New_Salary',df1.salary*1.03).show()
• df1.withColumnRenamed('gender','sex').columns
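A short sketch chaining a few of the calls above on the df1 employee DataFrame (the filter condition and the groupBy/agg combination are illustrative, not taken from the slide):

from pyspark.sql import functions as F

# Select, filter, and order in one chain.
df1.select("name", "salary") \
   .filter(df1.salary > 0) \
   .orderBy(df1.salary.desc()) \
   .show()

# Group by department and aggregate salaries.
df1.groupBy("emp_dept_id") \
   .agg(F.count("*").alias("employees"),
        F.avg("salary").alias("avg_salary")) \
   .show()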
Manipulating DataFrames
• df.drop(“salary")
• df.select(“gender”).distinct().count()
• df3 = df1.union(df2)
• df.limit(3).show()
• df1.crosstab(‘salary’, ‘gender').show()
• df1.select(‘salary’,gender').dropDuplicates().show()
• df1.dropna().count()
• df1.fillna(-1).show()
• df2 = df1.sample(False, 0.25, 23)
• df1.select(‘salary').map(lambda x:(x,1)).take(3)
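A sketch tying the sampling and cleaning calls above together on the same employee DataFrame (the sample fraction, seed, and fill value are the ones shown on the slide):

# 25% sample without replacement, seed 23, appended back onto df1.
df2 = df1.sample(False, 0.25, 23)
df3 = df1.union(df2)

# Remove the exact duplicate rows introduced by the union.
df3 = df3.dropDuplicates()

# Count rows with no missing values, or fill missing numeric values.
print(df3.dropna().count())
df3.fillna(-1).show()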
PySpark SQL Functions
• PySpark – Aggregate Functions
• PySpark – Window Functions
PySpark Aggregate Functions
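The slide's illustration is not reproduced here; as a minimal sketch of common aggregate functions applied to the df1 employee DataFrame:

from pyspark.sql import functions as F

df1.select(
    F.count("salary").alias("count"),
    F.sum("salary").alias("total"),
    F.avg("salary").alias("average"),
    F.min("salary").alias("minimum"),
    F.max("salary").alias("maximum"),
).show()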
PySpark Window Functions
• PySpark Window functions operate on a group of rows and return a single value for every input row.
• PySpark SQL supports three kinds of window functions:
• Ranking functions
• Analytic functions
• Aggregate functions
https://sparkbyexamples.com/pyspark-tutorial
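A minimal sketch of a ranking window function (row_number over a per-department window on the df1 employee DataFrame; the partitioning choice is illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank employees by salary within each department.
win = Window.partitionBy("emp_dept_id").orderBy(F.col("salary").desc())

df1.withColumn("row_number", F.row_number().over(win)) \
   .select("emp_dept_id", "name", "salary", "row_number") \
   .show()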
Dr. Muzammil Behzad – Assistant Professor
King Fahd University of Petroleum and Minerals
Email: [email protected]