
Apache Spark SQL:

Relational data processing in Spark


CS562 - Lab 4
Michail Giannoulis
What we will be discussing...

● Apache Spark SQL


● DataFrame
● Catalyst Optimizer
● Examples in DSL and SQL
● Example of adding a new rule to the Catalyst Optimizer
Today's Challenges and Solutions

Challenge: Perform ETL to and from various (semi-structured or unstructured) data sources.
Solution: A DataFrame API that can perform relational operations on both external data sources and Spark's built-in RDDs.

Challenge: Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.
Solution: A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.
Why Apache Spark ?
Fast and general cluster computing system, interoperable with Hadoop

Improves efficiency through:
● In-memory computing primitives
● General computation graphs
→ Up to 100× faster (2-10× on disk)

Improves usability through:
● Rich APIs in Scala, Java, Python
● Interactive shell
→ 2-5× less code

Note: More about Hadoop versus Spark here.


Apache Spark Software Stack

(Diagram: the Apache Spark software stack, with Spark SQL highlighted)
Spark SQL
A Spark module that integrates relational processing with Spark's functional programming API.
Module characteristics:
● Supports querying data either via SQL or via the Hive Query Language
● Extends traditional relational data processing
Part of the core distribution since Spark 1.0 (April 2014).
Spark SQL Architecture
How to use Spark SQL ?
You issue SQL queries through a SQLContext or HiveContext, using the sql() method.

● The sql() method returns a DataFrame
● You can mix DataFrame methods and SQL queries in the same code

To use SQL you must either:
● Query a persisted Hive table, or
● Make a table alias for a DataFrame, using the registerTempTable() method

Note: a complete guide on how to use it can be found here; a minimal sketch is shown below.
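For example, a small sketch of the two styles (assuming the sqlContext provided by spark-shell and a hypothetical people.json input file):

// DataFrame API style
val people = sqlContext.read.json("people.json")
people.select("name", "age").show()

// SQL style: register a temporary table alias for the DataFrame, then query it
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.show()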


DataFrame API
Provides a higher-level abstraction (built on the RDD API), allowing us to use a query language to manipulate data
Formal Definition:
● A DataFrame (DF) is a size-mutable, potentially heterogeneous tabular data structure with labeled axes (i.e., rows and columns)
Characteristics:
● Supports all the RDD operations → but may return an RDD rather than a DF
● Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
● Support for a wide array of data formats and storage systems
● State-of-the-art optimization and code generation through the Spark SQL Catalyst
optimizer
● ...
Spark SQL Interfaces: Interaction with Spark

(Diagram: the Spark SQL interfaces and how they interact with the rest of Spark)

● Seamless integration with all big data tooling and infrastructure via Spark
● APIs for Python, Java and R
Why DataFrame ?
What are the advantages over Resilient Distributed Datasets ?
1. Compact binary representation
○ Columnar, compressed cache; rows for processing
2. Optimization across operations (join reordering, predicate pushdown, etc.)
3. Runtime code generation

What are the advantages over relational query languages ?
● Holistic optimization across functions composed in different languages
● Control structures (e.g. if, for)
● Logical plan analyzed eagerly → identifies code errors associated with data schema issues on the fly
Why DataFrame ?
A DF can be significantly faster than an RDD, and DFs perform the same regardless of the language:

But we have lost type safety → we get back an Array[org.apache.spark.sql.Row], because Row extends Serializable. Mapping it back to something useful, e.g. row(0).asInstanceOf[String], is ugly and error-prone.
Querying Native Datasets
Infer column names and types directly from data objects:

● Native objects accessed in-place to avoid expensive data format transformation

Benefits:
● Run relational operations on existing Spark Programs
● Combine RDDs with external structured data

RDD[String] → (User Defined Function) → RDD[User] → (toDF method) → DataFrame
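A minimal sketch of that pipeline (assuming the sc and sqlContext provided by spark-shell; the User fields, the file users.txt and its comma-separated format are made up for illustration):

// Native Scala object describing one record
case class User(name: String, age: Int)

// RDD[String]: raw text lines such as "Alice,34"
val lines = sc.textFile("users.txt")

// User-defined function: parse each line into a native object → RDD[User]
val users = lines.map { line =>
  val fields = line.split(",")
  User(fields(0), fields(1).trim.toInt)
}

// toDF: column names and types are inferred from the case class by reflection
import sqlContext.implicits._
val usersDF = users.toDF()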


User-Defined Functions (UDFs)
Easy extension of the limited set of supported operations:
● Allows inline registration of UDFs
● Compare with Pig, which requires the UDF to be written in a Java package that is then loaded into the Pig script
● Can be defined on simple data types or entire tables
● UDFs are available to other interfaces after registration (see the sketch below)
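A small sketch of inline UDF registration (Spark 1.x-style sqlContext; the UDF name, its logic and the people table are made up for illustration):

// Register a UDF so it can be called from SQL query strings
sqlContext.udf.register("isAdult", (age: Int) => age >= 18)

// Use it from SQL after registering the DataFrame as a temp table
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE isAdult(age)").show()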
DataFrame API: Transformations, Actions, Laziness
● Transformations contribute to the query plan, but they don't execute anything
● Actions cause the execution of the query

DataFrames are lazy!

What exactly does "execution of the query" mean?
● Spark initiates a distributed read of the data source
● The data flows through the transformations (the RDDs resulting from the Catalyst query plan)
● The result of the action is pulled back into the driver JVM
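For instance, a small sketch (the people DataFrame is assumed from the earlier examples):

// Transformations: nothing is read or computed yet; they only grow the query plan
val adults  = people.filter("age >= 18")
val ordered = adults.orderBy("age")

// Action: triggers the distributed read, executes the plan and pulls results into the driver
val firstTen = ordered.take(10)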
DataFrame API: Actions
DataFrame API: Basic Functions
DataFrame API: Basic Functions
DataFrame API: Language Integrated Queries

Note: More details about these functions here.


DataFrame API: Relational Operations
Relational operations (select, where, join, groupBy) are available via a domain-specific language:
● Operators take expression objects
● Operators build up an Abstract Syntax Tree (AST), which is then optimized by Catalyst

Alternatively, register the DataFrame as a temp SQL table and perform traditional SQL query strings (see the sketch below):

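A small sketch of both styles (the employees and departments DataFrames and their columns are made up for illustration):

import org.apache.spark.sql.functions.avg

// DSL: operators take expression objects and build up an AST that Catalyst optimizes
val summary = employees
  .where(employees("age") > 30)
  .join(departments, employees("deptId") === departments("id"))
  .groupBy(departments("name"))
  .agg(avg(employees("salary")).as("avg_salary"))

// Alternatively, register temp tables and issue a traditional SQL query string
employees.registerTempTable("employees")
departments.registerTempTable("departments")
val summarySql = sqlContext.sql(
  "SELECT d.name, AVG(e.salary) AS avg_salary " +
  "FROM employees e JOIN departments d ON e.deptId = d.id " +
  "WHERE e.age > 30 GROUP BY d.name")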
DataFrame API: Output Operations
DataFrame API: RDD Operations
Data Sources
Uniform way to access structured data:
● Apps can migrate across Hive, Cassandra, JSON, Parquet, etc.
● Rich semantics allow query pushdown into data sources
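For example, a small sketch (the paths and column names are hypothetical):

// The same DataFrame code works across sources; only the reader changes
val logsJson    = sqlContext.read.json("hdfs:///data/logs.json")
val logsParquet = sqlContext.read.parquet("hdfs:///data/logs.parquet")

// Filters and projections like these can be pushed down into sources that support them (e.g. Parquet)
logsParquet.filter(logsParquet("status") === 500).select("url", "time").show()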
Apache Spark Catalyst Internals

More info in the article here.

Apache Spark Execution Plan

● From the above diagram, you can already predict the amount of work that is being done by Spark Catalyst to execute your Spark SQL queries 😳
● The SQL queries of a Spark application are converted into DataFrame API calls
● The Logical Plan is converted into an Optimized Logical Plan and then into one or more Physical Plans

Note: Find more about what happens under the hood of Spark SQL here and here.
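You can inspect all of these plans for any query by calling explain(true) on it, e.g. (people is the hypothetical DataFrame from the earlier sketches):

// Prints the parsed and analyzed logical plans, the optimized logical plan and the physical plan
people.filter("age >= 18").explain(true)
sqlContext.sql("SELECT name FROM people WHERE age >= 18").explain(true)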
The Analyzer
Spark Catalyst's analyzer is responsible for resolving the types and names of the attributes in SQL queries.

● The analyzer looks at the table statistics to know the types of the referenced columns
For example:
SELECT (col1 + 1) FROM mytable;

Now, Spark needs to know:
1. Whether col1 is actually a valid column in mytable
2. The type of the referenced column, so that (col1 + 1) can be validated and the necessary type casts can be added
How does the analyzer resolve attributes ?
To resolve attributes, the analyzer will:
● Look up relations by name in the catalog
● Map named attributes to the input provided, given the operator's children
● Assign a unique ID (UID) to references to the same value
● Propagate and coerce types through expressions (e.g. 1 + col1)
The Optimizer
Spark Catalyst's optimizer is responsible for generating an optimized logical plan from the analyzed logical plan.

● Optimization is done by applying rules in batches. Each operation is represented as a TreeNode in Spark SQL
● When an analyzed plan goes through the optimizer, the tree is repeatedly transformed into a new tree by applying a set of optimization rules

For instance, a simple rule:
Replace the addition of Literal values with a new Literal.
Then, expressions of the form (1 + 5) will be replaced by 6. Spark repeatedly applies such rules to the expression tree until the tree no longer changes.
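A minimal sketch of such a rule, modeled on Spark's own constant-folding rule (assuming a Spark 2.x-era Catalyst API):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Replace any expression whose value is already fixed (e.g. 1 + 5) with a single Literal (6)
object FoldConstantExpressions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case e if e.foldable && !e.isInstanceOf[Literal] =>
      Literal.create(e.eval(), e.dataType)
  }
}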
What are the Optimization Rules ?
The optimizer applies standard rule-based optimization rules:
● Constant folding
● Predicate-pushdown
● Projection pruning
● Null propagation
● Boolean expression simplification
● …

Note: Find more optimization rules here


Optimizer: Example
● An inefficient query where the filter is applied only after the join → costly shuffle operation (find more about this example here)

(Diagram: in the original plan the join is inefficient; in the optimized plan the filter is pushed down below the join)
Physical Planner
Physical plans are the ones that can actually be executed on a cluster: they translate the optimized logical plan into RDD operations to be executed on the data source.

● The generated Optimized Logical Plan is passed through a series of Spark strategies that produce one or more Physical Plans (more about these strategies here)
● Spark uses cost-based optimization (CBO) to select the best physical plan based on the data source (i.e. table sizes)
Physical Planner: Example
Code Generation
This phase involves generating Java bytecode to run on each machine.

A comparison of the performance of evaluating the expression "x + x + x", where x is an integer, 1 billion times:

● Catalyst transforms a SQL expression tree into an abstract syntax tree (AST) of Scala code that evaluates the expression, and then generates and compiles that code
Apache Spark SQL Example

Save it as spark_sql_example.scala (Find the source code here)
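The linked source is not reproduced here; a minimal sketch of what such a script might look like (the data and column names are hypothetical, and sqlContext/sc are those provided by spark-shell):

// spark_sql_example.scala: load it inside spark-shell with :load
import sqlContext.implicits._

// Build a small DataFrame, register it, and query it via SQL and via the DSL
val people = Seq(("John", "Doe", "M", 30), ("Jane", "Doe", "F", 28))
  .toDF("firstName", "lastName", "gender", "age")

people.registerTempTable("people")
sqlContext.sql("SELECT firstName, age FROM people WHERE age > 29").show()
people.select($"firstName", ($"age" + 1).as("ageNextYear")).show()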


How to run Apache Spark correctly ?
Run your first .scala script in three simple steps:
1. Open a command line → Win + R and type cmd
2. Run the Spark shell with user-defined driver memory → spark-shell --driver-memory 5g
3. Load the script → :load <path to>\spark_sql_example.scala
Schema Inference Example
Suppose you have a text file (sample contents not shown). The file has no schema, but it appears to contain:
● First name: string
● Last name: string
● Gender: string
● Age: integer
How to see what a DataFrame contains ?
You can have Spark tell you what it thinks the data schema is by calling the printSchema() method (this is mostly useful in the shell).

You can look at the first n elements in a DataFrame with the show() method. If not specified, n defaults to 20.
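For example, on the usersDF DataFrame built in the earlier schema-inference sketch:

scala> usersDF.printSchema()
scala> usersDF.show(5)    // first 5 rows; usersDF.show() would print up to 20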
How to persist a DataFrame in memory ?
Spark can cache a DataFrame, using an in-memory columnar format, by calling:

scala> df.cache()

which just calls df.persist(MEMORY_ONLY)
● Spark will scan only those columns used by the DataFrame and will automatically tune compression to minimize memory usage and GC pressure

You can remove the cached data from memory by calling:

scala> df.unpersist()
How to select columns from a DataFrame ?
The select() method is like a SQL SELECT, allowing you to limit the results to specific columns.
● The DSL also allows you to create derived columns on the fly
● The SQL version is also available
How to filter the rows of a DataFrame ?
The filter() method allows you to filter rows out of your results.
● Both the DSL and the SQL version are available
How to sort the rows of a DataFrame ?
The orderBy() method allows you to sort the results.
● Both the DSL and the SQL version are available
● It's easy to reverse the sort order
How to change the column name of a table in a DF ?
The as() or alias() method allows you to rename a column. It's especially useful with generated columns.
● Both the DSL and the SQL version are available (see the combined sketch below)
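A combined sketch of these four operations, using the usersDF DataFrame and columns assumed from the earlier sketches:

// select: limit the results to specific columns, including an on-the-fly derived column
usersDF.select(usersDF("name"), (usersDF("age") + 1).as("ageNextYear")).show()

// filter: keep only the rows you want
usersDF.filter(usersDF("age") > 21).show()

// orderBy: sort the results (reverse the order with .desc)
usersDF.orderBy(usersDF("age").desc).show()

// The SQL versions of select, filter, orderBy and aliasing in one query
usersDF.registerTempTable("users")
sqlContext.sql("SELECT name, age + 1 AS ageNextYear FROM users WHERE age > 21 ORDER BY age DESC").show()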
Add a new optimization rule to Spark Catalyst
Implement a "collapse sorts" optimizer rule

(Diagrams: the Optimized Logical Plan with our new rule vs. without our new rule)

Query:
● import sqlContext.implicits._
● val data = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("a", "b")
● val query = data.select("a", "b").orderBy($"b".asc).filter($"b" === 2).orderBy($"a".asc)

Note: Find more information about this example here.
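A minimal sketch of what such a rule could look like. This simplified version only collapses a Sort that is the direct child of another Sort; the linked example covers more patterns (e.g. two sorts separated by a filter, as in the query above). Registering it via experimental.extraOptimizations is one assumed way to plug it in:

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// When one Sort sits directly on top of another, only the outer ordering matters,
// so the inner Sort can be removed from the plan
object CollapseSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case outer @ Sort(_, _, inner: Sort) => outer.copy(child = inner.child)
  }
}

// Plug the rule into the optimizer as an extra rule, then compare query.explain(true)
// output with and without it
sqlContext.experimental.extraOptimizations = Seq(CollapseSorts)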


Which Spark Components do people use?

(Survey 2015)
Which Languages are Used ?
Special Thanks!

Intro to DataFrames and Spark SQL (Databricks, 2015)
RDDs, DataFrames and Datasets in Apache Spark (Akmal B. Chaudhri, 2016)
Spark SQL: Relational Data Processing in Spark (Databricks, MIT and AMPLab, 2015)