Lab 4 - Apache Spark SQL
Challenges and Solutions
● Challenge: perform ETL to and from various (semi- or unstructured) data sources.
  Solution: a DataFrame API that can perform relational operations on both external data sources and Spark’s built-in RDDs.
● Spark is up to 100× faster than Hadoop MapReduce in memory (2-10× on disk). It improves efficiency through:
  ○ In-memory computing primitives
  ○ General computation graphs
Spark SQL
A Spark module that integrates relational processing with Spark’s functional programming API.
Module Characteristics:
● Supports querying data either via SQL or via Hive Query Language
● Extends traditional relational data processing
Part of the core distribution since Spark 1.0 (April 2014).
Spark SQL Architecture
How to use Spark SQL?
You issue SQL queries through a SQLContext or HiveContext, using the sql() method.
● Seamless integration with all big data tooling and infrastructure via Spark.
● APIs for Scala, Java, Python, and R
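For example, a minimal sketch in Scala (the people.json file, the people view, and the column names are made up for illustration; since Spark 2.0, the SparkSession entry point subsumes the SQLContext / HiveContext mentioned above):

import org.apache.spark.sql.SparkSession

// SparkSession plays the role of the SQLContext / HiveContext described above (Spark 2.0+)
val spark = SparkSession.builder()
  .appName("Lab4-SparkSQL")
  .master("local[*]")        // local run; drop this when submitting to a cluster
  .getOrCreate()

// Load some structured data and expose it to SQL under a view name (file name is illustrative)
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// Issue a SQL query through the sql() method; the result is again a DataFrame
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()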
Why DataFrame ?
What are the advantages over Resilient Distributed Datasets?
1. Compact binary representation
○ Columnar, compressed cache; rows for processing
2. Optimization across operations (join reordering, predicate pushdown, etc.)
3. Runtime code generation
Benefits:
● Run relational operations on existing Spark programs
● Combine RDDs with external structured data
Alternatively, register the DataFrame as a temporary SQL table and query it with traditional SQL query strings, as in the sketch below:
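A hedged sketch of this pattern in Scala (the Employee case class, the sample values, and the employees view name are made up; spark is the session from the earlier snippet):

// An existing RDD of records from a Spark program
case class Employee(name: String, age: Int)
val employeesRDD = spark.sparkContext.parallelize(Seq(Employee("Alice", 29), Employee("Bob", 41)))

import spark.implicits._            // enables .toDF() on RDDs of case classes
val employeesDF = employeesRDD.toDF()

// Register the DataFrame as a temporary table and query it with a plain SQL string
employeesDF.createOrReplaceTempView("employees")
spark.sql("SELECT name FROM employees WHERE age > 30").show()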
DataFrame API: Output Operations
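A small illustrative sketch of common output operations (the sample rows and the out/ paths are made up; spark is the session from above):

import spark.implicits._
val results = Seq(("Alice", 29), ("Bob", 41)).toDF("name", "age")

results.show()                        // print the rows to the console
val rows = results.collect()          // materialize all rows on the driver as Array[Row]
println(results.count())              // number of rows
results.write.mode("overwrite").parquet("out/results.parquet")   // persist as Parquet files
results.write.mode("overwrite").json("out/results.json")         // or as JSON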
DataFrame API: RDD Operations
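And a sketch of dropping down to the RDD level from a DataFrame (again with made-up data):

import org.apache.spark.sql.Row
import spark.implicits._

val df = Seq(("Alice", 29), ("Bob", 41)).toDF("name", "age")

// Every DataFrame can be viewed as an RDD of Row objects and processed with RDD operations
val rowRDD = df.rdd
val greetings = rowRDD.map { case Row(name: String, age: Int) => s"$name is $age years old" }
greetings.collect().foreach(println)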
Data Sources
Uniform way to access structured data:
● Apps can migrate across Hive, Cassandra, JSON, Parquet, etc.
● Rich semantics allows query pushdown into data sources
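A sketch of the uniform read API in Scala (the file paths, JDBC URL, and table/column names are all illustrative; external sources such as Cassandra or a JDBC database need their connector or driver on the classpath):

// Built-in file sources share the same reader interface
val users  = spark.read.json("data/users.json")
val events = spark.read.parquet("data/events.parquet")

// The generic form works for any registered source, for example a JDBC database
val orders = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost/sales")   // illustrative connection string
  .option("dbtable", "orders")
  .load()

// Where the connector supports it, filters and projections like these are pushed down into the source
users.join(orders, "user_id").where("amount > 100").show()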
Apache Spark Catalyst Internals
● From the diagram above, you can already guess how much work Spark Catalyst does to execute your Spark SQL queries 😳
● The SQL queries of a Spark application are converted into DataFrame operations
● The Logical Plan is converted into an Optimized Logical Plan and then into one or more Physical Plans
Note: find out more about what is happening under the hood of Spark SQL here and here.
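One way to watch this pipeline is to ask Spark to print its plans. A minimal sketch (the mytable view and col1/col2 columns are made up and mirror the analyzer example that follows):

import spark.implicits._

// Register a tiny table so the query below has something to resolve against
Seq((5, "a"), (20, "b"), (42, "c")).toDF("col1", "col2").createOrReplaceTempView("mytable")

val q = spark.sql("SELECT col1 + 1 FROM mytable WHERE col1 > 10")
q.explain(true)   // prints the parsed, analyzed, and optimized logical plans plus the physical plan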
The Analyzer
Spark Catalyst’s analyzer is responsible for resolving the types and names of the attributes referenced in SQL queries
● The analyzer looks up the table metadata in the catalog to determine the types of the referenced columns
For example:
SELECT (col1 + 1) FROM mytable;
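Given the illustrative mytable view registered in the earlier snippet, a small sketch of inspecting what the analyzer produced (col1 and the literal 1 end up with resolved types):

val query = spark.sql("SELECT (col1 + 1) FROM mytable")

// queryExecution exposes Catalyst's intermediate plans; 'analyzed' is the plan after name/type resolution
println(query.queryExecution.analyzed)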
(Diagram: the optimizer pushes the filter down; the original join is inefficient.)
Physical Planner
Physical plans are the ones that can actually be executed on a cluster: they translate optimized logical plans into RDD operations to be executed on the data sources
● For code generation, Catalyst transforms a SQL expression tree into an abstract syntax tree (AST) of Scala code that evaluates the expression, and compiles the generated code at runtime
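A sketch of inspecting the selected physical plan, and (on Spark 2.0 or later) the code generated for it, again reusing the illustrative mytable view:

val q2 = spark.sql("SELECT col1 + 1 FROM mytable WHERE col1 > 10")
q2.explain()   // prints only the selected physical plan (scan, filter, project over the source)

// The debug helpers can also dump the code generated for this plan (Spark 2.0+)
import org.apache.spark.sql.execution.debug._
q2.debugCodegen()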
Apache Spark SQL Example
Which languages are used? (Survey 2015)
Special Thanks!
Spark SQL: Relational Data Processing in Spark (SIGMOD 2015), Databricks, MIT, and AMPLab