Spark SQL:
Relational Data Processing in Spark
Challenges and Solutions
Challenges
• Perform ETL to and from various (semi- or unstructured) data sources
• Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.
Solutions
• A DataFrame API that can perform relational operations on both external data sources and Spark’s built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.
Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments

Example HiveQL query calling a Hive UDF:
SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
Improvement upon Existing Art
• Engine does not understand the structure of the data in RDDs or the semantics of user functions → limited optimization.
• Can only be used to query external data in the Hive catalog → limited data sources.
• Can only be invoked via SQL strings from Spark → error prone.
• Hive optimizer tailored for MapReduce → difficult to extend.
Programming Interface
DataFrame
• A distributed collection of rows with the same schema (RDDs suffer from type erasure)
• Supports relational operators (e.g. where, groupBy) as well as Spark operations.
Data Model
• Nested data model
• Supports both primitive SQL types (boolean, integer, double, decimal, string, date, timestamp) and complex types (structs, arrays, maps, and unions); also user-defined types.
• First-class support for complex data types
DataFrame Operations
• Relational operations (select, where, join, groupBy) via a DSL
• Operators take expression objects
• Operators build up an abstract syntax tree (AST), which is then optimized by Catalyst.
• Alternatively, register a DataFrame as a temporary SQL table and query it with traditional SQL strings
Advantages over Relational Query Languages
• Holistic optimization across functions composed
in different languages.
• Control structures (e.g. if, for)
• Logical plan analyzed eagerly → identifies code errors associated with data schema issues on the fly.
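The eager-analysis idea can be sketched outside Spark. The toy resolver below (plain Python with hypothetical names, not Spark's actual analyzer) checks column references against a schema at plan-construction time, so a typo fails immediately rather than deep into a job:

```python
# Toy illustration of eager analysis (plain Python, not Spark's analyzer):
# column references are resolved against the schema when the plan is BUILT.
class AnalysisError(Exception):
    pass

def select(schema, columns):
    """Build a tiny 'select' plan node, validating columns eagerly."""
    missing = [c for c in columns if c not in schema]
    if missing:
        raise AnalysisError(f"unknown column(s): {missing}")
    return {"op": "select", "columns": columns}

schema = {"id": "int", "name": "string"}
select(schema, ["name"])        # fine
try:
    select(schema, ["nmae"])    # typo caught at plan-construction time
except AnalysisError as e:
    print(e)                    # unknown column(s): ['nmae']
```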
Catalyst
Example rule: constant folding rewrites x + (1 + 2) into x + 3 — the subtree Add(Literal(1), Literal(2)) collapses, leaving Add(Attribute(x), Literal(3)).
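The same rewrite can be sketched as a tiny tree transformation in plain Python (a toy stand-in for a Catalyst rule, not Spark code):

```python
# Toy stand-in for a Catalyst rule (plain Python, not Spark code):
# fold Add(Literal, Literal) subtrees into a single Literal, bottom-up.
from dataclasses import dataclass

@dataclass
class Literal:
    value: int

@dataclass
class Attribute:
    name: str

@dataclass
class Add:
    left: object
    right: object

def fold_constants(node):
    """Apply constant folding to an expression tree, bottom-up."""
    if isinstance(node, Add):
        left = fold_constants(node.left)
        right = fold_constants(node.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return node

# x + (1 + 2)  becomes  x + 3
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
print(fold_constants(tree))  # Add(left=Attribute(name='x'), right=Literal(value=3))
```

Catalyst expresses rules like this with Scala pattern matching over its tree types; the bottom-up traversal and "rebuild the node if a child changed" shape is the same.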
Plan Optimization & Execution
SQL AST or DataFrame
  → Unresolved Logical Plan
  → Analysis (using the Catalog) → Logical Plan
  → Logical Optimization → Optimized Logical Plan
  → Physical Planning → Physical Plans
  → Cost Model (selects one) → Selected Physical Plan
  → Code Generation → RDDs
DataFrames and SQL share the same optimization/execution pipeline
An Example Catalyst Transformation
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.
Original plan (top to bottom): Project(name) → Filter(id = 1) → Project(id, name) → People.
After filter push-down: Project(name) → Project(id, name) → Filter(id = 1) → People.
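A toy version of this rule (plain Python with hypothetical plan-node classes, not Catalyst's actual ones) might look like:

```python
# Toy plan nodes and push-down rule (hypothetical classes, not Catalyst's).
from dataclasses import dataclass

@dataclass
class Scan:
    columns: tuple

@dataclass
class Project:
    columns: tuple
    child: object

@dataclass
class Filter:
    needs: tuple      # columns the predicate reads
    child: object

def push_down_filter(plan):
    """If a Filter sits on a Project and only reads columns the Project
    keeps (so they also exist below the Project), swap the two operators."""
    if (isinstance(plan, Filter) and isinstance(plan.child, Project)
            and set(plan.needs) <= set(plan.child.columns)):
        project = plan.child
        return Project(project.columns, Filter(plan.needs, project.child))
    return plan

# Filter(id = 1) on top of Project(id, name): the filter only reads 'id',
# so it can run below the projection, closer to the data.
plan = Filter(("id",), Project(("id", "name"), Scan(("id", "name", "age"))))
print(push_down_filter(plan))
```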
Advanced Analytics Features
Schema Inference for Semistructured Data
JSON
• Automatically infers a schema from a set of records, in one pass or from a sample
• A tree of STRUCT types, each of which may contain atoms, arrays, or other STRUCTs.
• Finds the most appropriate type for a field based on all data observed in that column. Array element types are determined the same way.
• Merges the schemata of single records in one reduce operation.
• The same trick is used for Python typing.
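The merge step can be sketched in plain Python (a simplified stand-in for Spark's inference, with made-up type-widening rules — int widens to float, conflicting types fall back to string):

```python
# Toy sketch of merging per-record inferred schemas in one reduce
# (plain Python, not Spark's actual inference code).
from functools import reduce

def infer(record):
    """Per-record schema: field name -> Python type name."""
    return {k: type(v).__name__ for k, v in record.items()}

def merge_types(a, b):
    if a == b:
        return a
    if {a, b} == {"int", "float"}:
        return "float"      # widen int to float
    return "str"            # most general fallback on conflict

def merge_schemas(s1, s2):
    keys = s1.keys() | s2.keys()
    return {k: merge_types(s1.get(k, s2.get(k)), s2.get(k, s1.get(k)))
            for k in keys}

records = [{"id": 1, "score": 2}, {"id": 2, "score": 3.5, "name": "a"}]
schema = reduce(merge_schemas, map(infer, records))
# fields: id -> int, score -> float (widened), name -> str
```

Because merge_schemas is associative, the per-record schemas can be combined pairwise in any order — exactly what a distributed reduce needs.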
The not-so-secret truth: Spark SQL is about more than SQL.

Spark SQL: Declarative Big Data Processing
Let developers create and run Spark programs faster:
• Write less code
• Read less data
• Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]
1. A distributed collection of rows organized into named columns.
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
Write Less Code: Compute an Average
Using Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Using Spark RDDs (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .collect()
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
Using Pig
P = load '/people' as (name, age);
G = group P by name;
R = foreach G generate … AVG(G.age);
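The (sum, count) accumulator trick used by the RDD version above is easier to follow in plain Python, without Spark — reduceByKey does the same pairwise combining, just distributed across partitions:

```python
# The (sum, count) accumulator pattern behind the RDD average
# (plain Python sketch, not Spark code).
def average_by_key(pairs):
    acc = {}                          # key -> [running sum, running count]
    for key, value in pairs:
        s, c = acc.get(key, [0, 0])
        acc[key] = [s + value, c + 1]
    # divide only once, at the end, as the final .map(...) does
    return {k: s / c for k, (s, c) in acc.items()}

data = [("alice", 30), ("bob", 40), ("alice", 34)]
print(average_by_key(data))           # {'alice': 32.0, 'bob': 40.0}
```

Keeping (sum, count) pairs rather than partial averages is what makes the combine step associative, so it is safe to merge partial results from different partitions in any order.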
Seamlessly Integrated: RDDs
Internally, DataFrame execution is done with Spark RDDs, making interoperation with outside sources and custom algorithms easy.
External Input
def buildScan(
requiredColumns: Array[String],
filters: Array[Filter]):
RDD[Row]
Custom Processing
queryResult.rdd.mapPartitions { iter =>
… Your code here …
}
Extensible Input & Output
Spark’s Data Source API allows optimizations like column
pruning and filter pushdown into custom data sources.
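A toy model of the pushdown idea (plain Python, with a hypothetical `build_scan` loosely mirroring the Scala signature above — not the real Data Source API): because the source receives the required columns and filters, it can skip rows and drop columns itself instead of handing everything back to Spark.

```python
# Toy data source honoring column pruning and filter pushdown
# (hypothetical names, not Spark's actual Data Source API).
def build_scan(rows, required_columns, filters):
    """filters: list of (column, expected_value) equality predicates."""
    for row in rows:
        if all(row[col] == want for col, want in filters):   # filter pushdown
            yield {col: row[col] for col in required_columns}  # column pruning

table = [
    {"id": 1, "name": "ann", "zip": "94105"},
    {"id": 2, "name": "bob", "zip": "10001"},
]
print(list(build_scan(table, ["name"], [("id", 1)])))   # [{'name': 'ann'}]
```

For a real source (e.g. a columnar file or a JDBC backend), doing this inside the source means less data is read and shipped, which is where the performance win comes from.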
Built-in and external data sources: { JSON }, JDBC, and more…
Seamlessly Integrated
Embedding in a full programming language makes
UDFs trivial and allows composition using
functions.
zipToCity = udf(lambda zip_code: <custom logic here>)

def add_demographics(events):
    u = sqlCtx.table("users")
    return events \
        .join(u, events.user_id == u.user_id) \
        .withColumn("city", zipToCity(events.zip))
Takes and returns a DataFrame
Editor's Notes
• #3: Some of the limitations of Spark RDDs: there is no built-in optimization engine, and no provision for handling structured data.
• #9: Traditional SQL is convenient for computing multiple things at the same time.
• #11: Catalyst is done in Scala because functional programming languages naturally support compiler functions.