Spark SQL
The 8 fastest-growing tech skills worth over $110,000
No. 1: Spark, up 120%, worth $113,214
Do you know how to write code in Spark?
Can you write SQL?
“SQL is a highly sought-after technical skill due to its ability to work with
nearly all databases.”
Ibro Palic, CEO of Resumes Templates
History and Evolution of Big Data Technologies
Procedural programming interface
Declarative queries
Automatic optimization
So Far…
We have established that we need a platform with automatic optimization.
What do users want?
1. ETL from different sources
2. Advanced analytics
Introducing
Spark SQL: Relational Data Processing in Spark
Background
Apache Spark is a general-purpose cluster computing engine with
APIs in Scala, Java and Python and libraries for streaming, graph
processing and machine learning.
RDDs (Resilient Distributed Datasets) are fault-tolerant: the system can
recover lost data using the lineage graph of the RDDs, rerunning operations
such as a filter to rebuild missing partitions. They can also be explicitly
cached in memory or on disk to support iteration.
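The lineage idea can be illustrated with a small plain-Python sketch. This is not the Spark API: `ToyRDD`, `from_partitions` and `lose_partition` are hypothetical names invented for this example, and everything runs in a single process. Each derived dataset only remembers its parent and the function that produced it, so a lost partition can be rebuilt by rerunning that function.

```python
class ToyRDD:
    """Toy single-process stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute    # partition index -> list of rows
        self.parent = parent       # lineage: the dataset this one derives from
        self._materialized = {}    # partition index -> rows currently held in memory

    @classmethod
    def from_partitions(cls, partitions):
        # Assumption: source data sits in durable storage and can always be re-read.
        return cls(len(partitions), lambda i: list(partitions[i]))

    def filter(self, predicate):
        parent = self
        # Only the recipe is stored here; no rows are computed yet.
        return ToyRDD(
            self.num_partitions,
            lambda i: [row for row in parent.partition(i) if predicate(row)],
            parent=parent,
        )

    def partition(self, i):
        # Rebuild the partition from lineage if it is not in memory.
        if i not in self._materialized:
            self._materialized[i] = self._compute(i)
        return self._materialized[i]

    def lose_partition(self, i):
        # Simulate a node failure that drops one in-memory partition.
        self._materialized.pop(i, None)

    def collect(self):
        return [row for i in range(self.num_partitions) for row in self.partition(i)]


source = ToyRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
evens = source.filter(lambda n: n % 2 == 0)
first_result = evens.collect()
evens.lose_partition(1)          # partition 1 is gone from memory
recovered = evens.partition(1)   # rerun the filter on the parent partition
```

Only the lost partition is recomputed; partition 0 of `evens` stays in memory untouched.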
Shark modified the Apache Hive system to run on Spark and
implemented traditional RDBMS optimizations, such as columnar
processing, over the Spark engine.
Goals for Spark SQL
Support Relational Processing both within Spark
programs and on external data sources
Provide High Performance using established DBMS
techniques.
Easily support New Data Sources
Enable Extension with advanced analytics algorithms
such as graph processing and machine learning.
Programming Interface
DataFrame API
A DataFrame is a distributed collection of rows that share the
same schema.
Keep Track of Hashtags ##
#A Lazy Computation
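A rough way to picture lazy computation: each DataFrame operation only appends a step to a plan, and nothing runs until a result is requested. The sketch below is plain Python, not the Spark API; `LazyFrame`, `where`, `select` and `collect` are hypothetical names chosen for the illustration.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: records operations and runs them on demand."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = tuple(plan)   # logical plan: sequence of (operation, argument)

    def where(self, predicate):
        # Nothing is filtered here; the step is only appended to the plan.
        return LazyFrame(self._rows, self.plan + (("where", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self.plan + (("select", columns),))

    def collect(self):
        # An output operation finally executes the whole recorded plan.
        rows = self._rows
        for operation, argument in self.plan:
            if operation == "where":
                rows = [row for row in rows if argument(row)]
            else:  # "select"
                rows = [{column: row[column] for column in argument} for row in rows]
        return list(rows)


calls = []

def is_adult(row):
    calls.append(row["name"])   # record every time the predicate actually runs
    return row["age"] >= 18

users = LazyFrame([{"name": "Ann", "age": 31}, {"name": "Bob", "age": 12}])
adults = users.where(is_adult).select("name")
planned_operations = [operation for operation, _ in adults.plan]
calls_before_collect = len(calls)   # still 0: the plan has not run yet
result = adults.collect()
```

Because the full plan is known before anything executes, an optimizer gets the chance to rewrite it first, which is the point of keeping the computation lazy.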
Data Model and DataFrame Operations
Spark SQL uses a nested data model based on Hive.
It supports all major SQL data types, including boolean, integer, double,
decimal, string, date, and timestamp, as well as user-defined types.
Example of DataFrame Operations
DataFrame Operations Cont.
#Access DF with DSL or SQL
Real World Problems
#Heterogeneous Data Sources
Schema Inference
Spark SQL can automatically infer the schema of native objects in an
RDD using reflection.
Scala/Java: type information is extracted from the language’s type system
Python: the schema is inferred by sampling the dataset
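Why sampling is needed in a dynamically typed language can be seen in a small plain-Python sketch. It is not Spark's actual inference algorithm: `infer_schema`, its parameters and the type-widening rules (int widens to float, anything else falls back to str) are assumptions made for this example.

```python
import random

def infer_schema(records, sample_size=100, seed=0):
    """Guess a field -> type-name mapping by looking at a sample of dict records."""
    rng = random.Random(seed)
    if len(records) <= sample_size:
        sample = records
    else:
        sample = rng.sample(records, sample_size)

    schema = {}
    for record in sample:
        for field, value in record.items():
            if value is None:
                continue                      # a missing value says nothing about the type
            seen = type(value).__name__
            previous = schema.get(field)
            if previous is None or previous == seen:
                schema[field] = seen
            elif {previous, seen} == {"int", "float"}:
                schema[field] = "float"       # widen int to float
            else:
                schema[field] = "str"         # fall back to the most general type
    return schema


people = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 12.5, "city": "Oslo"},
    {"name": "Cid", "age": None},
]
schema = infer_schema(people)
```

Fields that appear in only some records (`city` here) still end up in the schema, which is what makes this style of inference useful for semi-structured data.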
In-Memory Caching
#Invoked with .cache()
User-Defined Functions
How are Spark SQL's user-defined functions different from those in
traditional database systems?
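One commonly cited difference is that a UDF can be registered inline, as an ordinary function of the host language, and then called from SQL straight away, with no separate packaging step. The sketch below shows that pattern using Python's built-in `sqlite3` module as a stand-in engine; it is not Spark, and the `tweets` table and `word_count` function are made up for the example.

```python
import sqlite3

def word_count(text):
    """Ordinary Python function that will be callable from SQL."""
    return len(text.split())

connection = sqlite3.connect(":memory:")
# Register the Python function inline under a SQL-visible name (1 = number of arguments).
connection.create_function("word_count", 1, word_count)

connection.execute("CREATE TABLE tweets (body TEXT)")
connection.executemany(
    "INSERT INTO tweets VALUES (?)",
    [("spark sql is fast",), ("hello",)],
)
rows = connection.execute(
    "SELECT body, word_count(body) FROM tweets ORDER BY rowid"
).fetchall()
connection.close()
```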
Catalyst Optimizer
Catalyst is based on functional programming constructs in Scala.
Purposes
Ability to add new optimization techniques and features to Spark SQL
Ability to extend the optimizer
Catalyst Optimization
#Trees
#Rules
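Trees and rules can be illustrated with a small plain-Python sketch. Catalyst itself is written in Scala and uses pattern matching; the node classes and function names below (`Literal`, `Attribute`, `Add`, `transform`, `optimize`) are simplified stand-ins chosen for this example, and the bottom-up traversal order is an assumption. A rule is a function from a tree node to a replacement node, and rules are applied repeatedly until the tree stops changing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    value: int

@dataclass(frozen=True)
class Attribute:
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def transform(node, rule):
    """Apply `rule` bottom-up: rewrite the children first, then the node itself."""
    if isinstance(node, Add):
        node = Add(transform(node.left, rule), transform(node.right, rule))
    rewritten = rule(node)
    return node if rewritten is None else rewritten

def fold_constants(node):
    # Add(Literal(a), Literal(b)) -> Literal(a + b); None means "no match".
    if isinstance(node, Add) and isinstance(node.left, Literal) and isinstance(node.right, Literal):
        return Literal(node.left.value + node.right.value)
    return None

def drop_zero(node):
    # Add(x, Literal(0)) -> x
    if isinstance(node, Add) and node.right == Literal(0):
        return node.left
    return None

def optimize(tree, rules):
    """Run all rules repeatedly until a fixed point is reached."""
    while True:
        new_tree = tree
        for rule in rules:
            new_tree = transform(new_tree, rule)
        if new_tree == tree:
            return tree
        tree = new_tree


# x + (1 + 2) becomes x + 3
folded = optimize(Add(Attribute("x"), Add(Literal(1), Literal(2))), [fold_constants])
# x + (1 + -1) becomes x + 0, which the second rule reduces to x
simplified = optimize(
    Add(Attribute("x"), Add(Literal(1), Literal(-1))),
    [fold_constants, drop_zero],
)
```

Each rule stays tiny and only describes the pattern it cares about; the second example shows how one rule's output can enable another rule.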
Catalyst Optimization Cont.
Rule Based Optimization
Cost Based Optimization
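Where rule-based optimization rewrites a plan unconditionally, cost-based optimization generates alternative physical plans and picks the cheapest by an estimate. A typical use is choosing a join algorithm, for example broadcasting a relation that is known to be small. The plain-Python sketch below only illustrates the idea: the cost formulas, the `BROADCAST_LIMIT_BYTES` threshold and all names are hypothetical and not taken from Spark.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    name: str
    size_bytes: int

# Hypothetical threshold: only relations at most this large may be broadcast.
BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024

def estimate_costs(left, right, num_workers):
    """Return a made-up network cost (in bytes moved) for each candidate plan."""
    # Shuffle join: assume both inputs are repartitioned across the network once.
    costs = {"shuffle_join": left.size_bytes + right.size_bytes}
    smaller = min(left, right, key=lambda relation: relation.size_bytes)
    if smaller.size_bytes <= BROADCAST_LIMIT_BYTES:
        # Broadcast join: assume the small input is copied to every worker.
        costs["broadcast_join"] = smaller.size_bytes * num_workers
    return costs

def choose_join(left, right, num_workers=4):
    costs = estimate_costs(left, right, num_workers)
    return min(costs, key=costs.get)   # cheapest candidate wins


logs = Relation("logs", 1_000_000_000)
countries = Relation("countries", 10_000)
clicks = Relation("clicks", 500_000_000)

small_side_choice = choose_join(logs, countries)
large_sides_choice = choose_join(logs, clicks)
```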
Query Planning in Spark SQL
Extension Points
#Open Source Projects
Extension Points Cont.
Data Sources
Examples:
CSV
Avro
Parquet
JDBC
Extension Points Cont.
User Defined Types (UDTs)
#Useful for Machine Learning
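The core of a user-defined type is a two-way mapping between a user's class and a structure built only from built-in types, so the engine can store and process values without knowing the class. Below is a minimal plain-Python sketch of that mapping for a two-dimensional point; `Point`, `PointUDT` and its methods are hypothetical names for this example and do not reproduce Spark's UDT interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: float
    y: float

class PointUDT:
    """Describes how a Point maps to and from built-in column types."""

    # The built-in representation: a struct of two doubles.
    data_type = (("x", "double"), ("y", "double"))

    @staticmethod
    def serialize(point):
        # User object -> row of built-in values
        return (float(point.x), float(point.y))

    @staticmethod
    def deserialize(row):
        # Row of built-in values -> user object
        return Point(row[0], row[1])


stored = PointUDT.serialize(Point(1.5, -2.0))
restored = PointUDT.deserialize(stored)
```

A feature vector for machine learning can be handled the same way, with a richer built-in structure behind it.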
Advanced Analytics Features
1. Schema Inference for Semi-structured Data
2. Query Federation to External Databases
Advanced Analytics Features Cont.
3. Integration with Spark’s Machine Learning Library
Evaluation
SQL Performance
Evaluation Cont.
DataFrames vs. Native Spark Code
Pipeline Performance
Applications
Generalized Online Aggregation
Computational Genomics
The list is endless, limited only by your imagination…
Conclusion
Our Final Hash Tags
#A platform with automatic optimization
#Complex pipelines that mix relational and complex analytics
#Large-scale data analysis
#Semi-structured data
#Data types for machine learning
#Extensible optimizer called Catalyst
#Easy to add Optimization rules, data sources and data types