
Apache Spark SQL:

Relational data processing in Spark


CS562 - Lab 4
Michail Giannoulis
What we will be discussing...

● Apache Spark SQL


● DataFrame
● Catalyst Optimizer
● Examples in DSL and SQL
● Example of adding a new rule to the Catalyst Optimizer
Today's Challenges and Solutions

Challenge: Perform ETL to and from various (semi-structured or unstructured) data sources.
Solution: A DataFrame API that can perform relational operations on both external data sources and Spark's built-in RDDs.

Challenge: Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.
Solution: A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.
Why Apache Spark ?
Fast and general cluster computing system, interoperable with Hadoop

Improves efficiency through:
● In-memory computing primitives
● General computation graphs
→ Up to 100× faster (2-10× on disk)

Improves usability through:
● Rich APIs in Scala, Java, Python
● Interactive shell
→ 2-5× less code

Note: More about Hadoop versus Spark here.


Apache Spark Software Stack

(Diagram: the Apache Spark software stack, with Spark SQL highlighted)
Spark SQL
A Spark module that integrates relational processing with Spark's functional programming API.
Module characteristics:
● Supports querying data either via SQL or via the Hive Query Language
● Extends traditional relational data processing
Part of the core distribution since Spark 1.0 (April 2014).
Spark SQL Architecture
How to use Spark SQL ?
You issue SQL queries through a SQLContext or HiveContext, using the sql() method.

● The sql() method returns a DataFrame
● You can mix DataFrame methods and SQL queries in the same code

To use SQL you must either:
● Query a persisted Hive table, or
● Make a table alias for a DataFrame, using the registerTempTable() method

Note: a complete guide on how to use it can be found here; a minimal sketch is shown below.
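For example, a small sketch of the two styles (assuming the sqlContext provided by spark-shell and a hypothetical people.json input file):

// DataFrame API style
val people = sqlContext.read.json("people.json")
people.select("name", "age").show()

// SQL style: register a temporary table alias for the DataFrame, then query it
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.show()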


DataFrame API
Provides a higher-level abstraction (built on the RDD API), allowing us to use a query language to manipulate data
Formal Definition:
● A DataFrame (DF) is a size-mutable, potentially heterogeneous tabular data structure with labeled axes (i.e., rows and columns)
Characteristics:
● Supports all the RDD operations → but may return an RDD rather than a DF
● Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
● Support for a wide array of data formats and storage systems
● State-of-the-art optimization and code generation through the Spark SQL Catalyst
optimizer
● ...
Spark SQL Interfaces: Interaction with Spark

(Diagram: the Spark SQL interfaces and how they interact with the rest of Spark)

● Seamless integration with all big data tooling and infrastructure via Spark
● APIs for Python, Java and R
Why DataFrame ?
What are the advantages over Resilient Distributed Datasets ?
1. Compact binary representation
○ Columnar, compressed cache; rows for processing
2. Optimization across operations (join reordering, predicate pushdown, etc.)
3. Runtime code generation

What are the advantages over relational query languages ?
● Holistic optimization across functions composed in different languages
● Control structures (e.g. if, for)
● Logical plan analyzed eagerly → identifies code errors associated with data schema issues on the fly
Why DataFrame ?
A DF can be significantly faster than an RDD, and DFs perform the same regardless of the language:

But we have lost type safety → we get back an Array[org.apache.spark.sql.Row], because Row extends Serializable. Mapping it back to something useful, e.g. row(0).asInstanceOf[String], is ugly and error-prone.
Querying Native Datasets
Infer column names and types directly from data objects:

● Native objects accessed in-place to avoid expensive data format transformation

Benefits:
● Run relational operations on existing Spark Programs
● Combine RDDs with external structured data

RDD[String] → (User Defined Function) → RDD[User] → (toDF method) → DataFrame
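A minimal sketch of that pipeline (assuming the sc and sqlContext provided by spark-shell; the User fields, the file users.txt and its comma-separated format are made up for illustration):

// Native Scala object describing one record
case class User(name: String, age: Int)

// RDD[String]: raw text lines such as "Alice,34"
val lines = sc.textFile("users.txt")

// User-defined function: parse each line into a native object → RDD[User]
val users = lines.map { line =>
  val fields = line.split(",")
  User(fields(0), fields(1).trim.toInt)
}

// toDF: column names and types are inferred from the case class by reflection
import sqlContext.implicits._
val usersDF = users.toDF()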


User-Defined Functions (UDFs)
Easy extension of the limited set of supported operations:
● Allows inline registration of UDFs
● Compare with Pig, which requires the UDF to be written in a Java package that is then loaded into the Pig script
● Can be defined on simple data types or entire tables
● UDFs are available to other interfaces after registration (see the sketch below)
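A small sketch of inline UDF registration (Spark 1.x-style sqlContext; the UDF name, its logic and the people table are made up for illustration):

// Register a UDF so it can be called from SQL query strings
sqlContext.udf.register("isAdult", (age: Int) => age >= 18)

// Use it from SQL after registering the DataFrame as a temp table
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE isAdult(age)").show()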
DataFrame API: Transformations, Actions, Laziness
● Transformations contribute to the query plan, but they don't execute anything
● Actions cause the execution of the query

DataFrames are lazy!

What exactly does "execution of the query" mean?
● Spark initiates a distributed read of the data source
● The data flows through the transformations (the RDDs resulting from the Catalyst query plan)
● The result of the action is pulled back into the driver JVM
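For instance, a small sketch (the people DataFrame is assumed from the earlier examples):

// Transformations: nothing is read or computed yet; they only grow the query plan
val adults  = people.filter("age >= 18")
val ordered = adults.orderBy("age")

// Action: triggers the distributed read, executes the plan and pulls results into the driver
val firstTen = ordered.take(10)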
DataFrame API: Actions
DataFrame API: Basic Functions
DataFrame API: Basic Functions
DataFrame API: Language Integrated Queries

Note: More details about these functions here.


DataFrame API: Relational Operations
Relational operations (select, where, join, groupBy) are available via a domain-specific language:
● Operators take expression objects
● Operators build up an Abstract Syntax Tree (AST), which is then optimized by Catalyst

Alternatively, register the DataFrame as a temp SQL table and perform traditional SQL query strings (see the sketch below):

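A small sketch of both styles (the employees and departments DataFrames and their columns are made up for illustration):

import org.apache.spark.sql.functions.avg

// DSL: operators take expression objects and build up an AST that Catalyst optimizes
val summary = employees
  .where(employees("age") > 30)
  .join(departments, employees("deptId") === departments("id"))
  .groupBy(departments("name"))
  .agg(avg(employees("salary")).as("avg_salary"))

// Alternatively, register temp tables and issue a traditional SQL query string
employees.registerTempTable("employees")
departments.registerTempTable("departments")
val summarySql = sqlContext.sql(
  "SELECT d.name, AVG(e.salary) AS avg_salary " +
  "FROM employees e JOIN departments d ON e.deptId = d.id " +
  "WHERE e.age > 30 GROUP BY d.name")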
DataFrame API: Output Operations
DataFrame API: RDD Operations
Data Sources
Uniform way to access structured data:
● Apps can migrate across Hive, Cassandra, JSON, Parquet, etc.
● Rich semantics allow query pushdown into data sources
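For example, a small sketch (the paths and column names are hypothetical):

// The same DataFrame code works across sources; only the reader changes
val logsJson    = sqlContext.read.json("hdfs:///data/logs.json")
val logsParquet = sqlContext.read.parquet("hdfs:///data/logs.parquet")

// Filters and projections like these can be pushed down into sources that support them (e.g. Parquet)
logsParquet.filter(logsParquet("status") === 500).select("url", "time").show()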
Apache Spark Catalyst Internals

More info in the article here.

Apache Spark Execution Plan

● From the above diagram, you can already predict the amount of work that is being done by Spark Catalyst to execute your Spark SQL queries 😳
● The SQL queries of a Spark application are converted into DataFrame API calls
● The Logical Plan is converted into an Optimized Logical Plan and then into one or more Physical Plans

Note: Find more about what happens under the hood of Spark SQL here and here.
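You can inspect all of these plans for any query by calling explain(true) on it, e.g. (people is the hypothetical DataFrame from the earlier sketches):

// Prints the parsed and analyzed logical plans, the optimized logical plan and the physical plan
people.filter("age >= 18").explain(true)
sqlContext.sql("SELECT name FROM people WHERE age >= 18").explain(true)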
The Analyzer
Spark Catalyst's analyzer is responsible for resolving the types and names of the attributes in SQL queries.

● The analyzer looks at the table statistics to know the types of the referenced columns
For example:
SELECT (col1 + 1) FROM mytable;

Now, Spark needs to know:
1. Whether col1 is actually a valid column in mytable
2. The type of the referenced column, so that (col1 + 1) can be validated and the necessary type casts can be added
How does the analyzer resolve attributes ?
To resolve attributes, the analyzer will:
● Look up relations by name in the catalog
● Map named attributes to the input provided, given the operator's children
● Assign a unique ID (UID) to references to the same value
● Propagate and coerce types through expressions (e.g. 1 + col1)
The Optimizer
Spark Catalyst's optimizer is responsible for generating an optimized logical plan from the analyzed logical plan.

● Optimization is done by applying rules in batches. Each operation is represented as a TreeNode in Spark SQL
● When an analyzed plan goes through the optimizer, the tree is repeatedly transformed into a new tree by applying a set of optimization rules

For instance, a simple rule:
Replace the addition of Literal values with a new Literal.
Then, expressions of the form (1 + 5) will be replaced by 6. Spark repeatedly applies such rules to the expression tree until the tree no longer changes.
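A minimal sketch of such a rule, modeled on Spark's own constant-folding rule (assuming a Spark 2.x-era Catalyst API):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Replace any expression whose value is already fixed (e.g. 1 + 5) with a single Literal (6)
object FoldConstantExpressions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case e if e.foldable && !e.isInstanceOf[Literal] =>
      Literal.create(e.eval(), e.dataType)
  }
}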
What are the Optimization Rules ?
The optimizer applies standard rule-based optimization rules:
● Constant folding
● Predicate-pushdown
● Projection pruning
● Null propagation
● Boolean expression simplification
● …

Note: Find more optimization rules here


Optimizer: Example
● An inefficient query where the filter is applied only after the join → costly shuffle operation (find more about this example here)

(Diagram: in the original plan the join is inefficient; in the optimized plan the filter is pushed down below the join)
Physical Planner
Physical plans are the ones that can actually be executed on a cluster: they translate the optimized logical plan into RDD operations to be executed on the data source.

● The generated Optimized Logical Plan is passed through a series of Spark strategies that produce one or more Physical Plans (more about these strategies here)
● Spark uses cost-based optimization (CBO) to select the best physical plan based on the data source (i.e. table sizes)
Physical Planner: Example
Code Generation
This phase involves generating Java bytecode to run on each machine.

A comparison of the performance of evaluating the expression "x + x + x", where x is an integer, 1 billion times:

● Catalyst transforms a SQL expression tree into an abstract syntax tree (AST) of Scala code that evaluates the expression, and then generates and compiles that code
Apache Spark SQL Example

Save it as spark_sql_example.scala (Find the source code here)
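The linked source is not reproduced here; a minimal sketch of what such a script might look like (the data and column names are hypothetical, and sqlContext/sc are those provided by spark-shell):

// spark_sql_example.scala: load it inside spark-shell with :load
import sqlContext.implicits._

// Build a small DataFrame, register it, and query it via SQL and via the DSL
val people = Seq(("John", "Doe", "M", 30), ("Jane", "Doe", "F", 28))
  .toDF("firstName", "lastName", "gender", "age")

people.registerTempTable("people")
sqlContext.sql("SELECT firstName, age FROM people WHERE age > 29").show()
people.select($"firstName", ($"age" + 1).as("ageNextYear")).show()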


How to run Apache Spark correctly ?
Run your first .scala script in three simple steps:
1. Open a command line → Win + R and type cmd
2. Run the Spark shell with user-defined driver memory → spark-shell --driver-memory 5g
3. Load the script → :load <path to>\spark_sql_example.scala
Schema Inference Example
Suppose you have a text file (sample contents not shown). The file has no schema, but it appears to contain:
● First name: string
● Last name: string
● Gender: string
● Age: integer
How to see what a DataFrame contains ?
You can have Spark tell you what it thinks the data schema is by calling the printSchema() method (this is mostly useful in the shell).

You can look at the first n elements in a DataFrame with the show() method. If not specified, n defaults to 20.
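For example, on the usersDF DataFrame built in the earlier schema-inference sketch:

scala> usersDF.printSchema()
scala> usersDF.show(5)    // first 5 rows; usersDF.show() would print up to 20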
How to persist a DataFrame in memory ?
Spark can cache a DataFrame, using an in-memory columnar format, by calling:

scala> df.cache()

which just calls df.persist(MEMORY_ONLY)
● Spark will scan only those columns used by the DataFrame and will automatically tune compression to minimize memory usage and GC pressure

You can remove the cached data from memory by calling:

scala> df.unpersist()
How to select columns from a DataFrame ?
The select() method is like a SQL SELECT, allowing you to limit the results to specific columns.
● The DSL also allows you to create derived columns on the fly
● The SQL version is also available
How to filter the rows of a DataFrame ?
The filter() method allows you to filter rows out of your results.
● Both the DSL and the SQL version are available
How to sort the rows of a DataFrame ?
The orderBy() method allows you to sort the results.
● Both the DSL and the SQL version are available
● It's easy to reverse the sort order
How to change the column name of a table in a DF ?
The as() or alias() method allows you to rename a column. It's especially useful with generated columns.
● Both the DSL and the SQL version are available (see the combined sketch below)
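A combined sketch of these four operations, using the usersDF DataFrame and columns assumed from the earlier sketches:

// select: limit the results to specific columns, including an on-the-fly derived column
usersDF.select(usersDF("name"), (usersDF("age") + 1).as("ageNextYear")).show()

// filter: keep only the rows you want
usersDF.filter(usersDF("age") > 21).show()

// orderBy: sort the results (reverse the order with .desc)
usersDF.orderBy(usersDF("age").desc).show()

// The SQL versions of select, filter, orderBy and aliasing in one query
usersDF.registerTempTable("users")
sqlContext.sql("SELECT name, age + 1 AS ageNextYear FROM users WHERE age > 21 ORDER BY age DESC").show()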
Add a new optimization rule to Spark Catalyst
Implement a "collapse sorts" optimizer rule

(Diagrams: the Optimized Logical Plan with our new rule vs. without our new rule)

Query:
● import sqlContext.implicits._
● val data = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("a", "b")
● val query = data.select("a", "b").orderBy($"b".asc).filter($"b" === 2).orderBy($"a".asc)

Note: Find more information about this example here.
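A minimal sketch of what such a rule could look like. This simplified version only collapses a Sort that is the direct child of another Sort; the linked example covers more patterns (e.g. two sorts separated by a filter, as in the query above). Registering it via experimental.extraOptimizations is one assumed way to plug it in:

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// When one Sort sits directly on top of another, only the outer ordering matters,
// so the inner Sort can be removed from the plan
object CollapseSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case outer @ Sort(_, _, inner: Sort) => outer.copy(child = inner.child)
  }
}

// Plug the rule into the optimizer as an extra rule, then compare query.explain(true)
// output with and without it
sqlContext.experimental.extraOptimizations = Seq(CollapseSorts)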


Which Spark Components do people use?

(Survey 2015)
Which Languages are Used ?
Special Thanks!

Intro to DataFrames and Spark SQL (Databricks, 2015)
RDDs, DataFrames and Datasets in Apache Spark (Akmal B. Chaudhri, 2016)
Spark SQL: Relational Data Processing in Spark (Databricks, MIT and AMPLab, 2015)