Spark DataFrames: Simple and Fast Analytics on Structured Data
Michael Armbrust
Spark Summit 2015 - June 15th
About Me and SQL

• Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014); graduated from Alpha in 1.3
•  Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments (improved multi-version Hive support in 1.4)
•  Connect existing BI tools to Spark through JDBC
•  Bindings in Python, Scala, Java, and R

An example HiveQL query with a Hive UDF:

SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)

• @michaelarmbrust
•  Lead developer of Spark SQL @databricks

[Charts: # of commits per month (0-250) and # of contributors (0-200), growing over time]
The not-so-secret truth...

Spark SQL is about more than SQL.
Spark SQL: The whole story

Creating and Running Spark Programs Faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]

1.  A distributed collection of rows organized into named columns.
2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
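To make the definition concrete, here is a minimal PySpark sketch (assuming an existing SparkContext sc; the JSON path and the name/age fields are hypothetical, not from the talk):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)   # sc: an existing SparkContext

# A DataFrame: distributed rows organized into named columns.
df = sqlContext.read.format("json").load("/path/to/people.json")

df.printSchema()              # inspect the inferred schema
df.filter(df.age > 21) \
  .select("name", "age") \
  .show()                     # select and filter by column name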
Write Less Code: Input & Output

Unified interface to reading/writing data in a variety of formats:

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")

•  The read and write functions create new builders for doing I/O.
•  Builder methods specify the format, the partitioning, and the handling of existing data.
•  load(…), save(…), or saveAsTable(…) finish the I/O specification.
Write Less Code: Input & Output

Spark SQL's Data Source API can read and write DataFrames using a variety of formats:
•  Built-in: JSON, JDBC, and more…
•  External packages: find more sources at http://spark-packages.org/
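As a hedged illustration of the unified interface (the JDBC URL, table name, and output path below are placeholders, not from the talk):

people = sqlContext.read \
  .format("jdbc") \
  .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
  .option("dbtable", "people") \
  .load()

# The same DataFrame writes back out through a different format.
people.write.format("parquet").save("/tmp/people.parquet")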
ETL Using Custom Data Sources

sqlContext.read
  .format("com.databricks.spark.jira")
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
  .option("user", "marmbrus")
  .option("password", "*******")
  .option("query", """
    |project = SPARK AND
    |component = SQL AND
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
  .load()
  .repartition(1)
  .write
  .format("parquet")
  .saveAsTable("sparkSqlJira")
Write Less Code: High-Level Operations

Solve common problems concisely using DataFrame functions (sketched below):
•  Selecting columns and filtering
•  Joining different data sources
•  Aggregation (count, sum, average, etc.)
•  Plotting results with Pandas
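A sketch of these operations together (the tables people(name, age, dept_id) and departments(id, dept_name) are assumptions for illustration, not from the talk):

from pyspark.sql.functions import avg

people = sqlContext.table("people")
dept = sqlContext.table("departments")

result = (people
  .select("name", "age", "dept_id")            # selecting columns
  .filter(people.age > 21)                     # filtering rows
  .join(dept, people.dept_id == dept.id)       # joining data sources
  .groupBy("dept_name")
  .agg(avg("age")))                            # aggregation

result.toPandas().plot(kind="bar")             # plotting with Pandas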
Write Less Code: Compute an Average

Hadoop MapReduce:

private IntWritable one =
  new IntWritable(1);
private IntWritable output =
  new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (RDDs):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
   .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
   .collect()
Write Less Code: Compute an Average

Using RDDs:

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
   .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
   .collect()

Using DataFrames:

from pyspark.sql.functions import avg

sqlCtx.table("people") \
   .groupBy("name") \
   .agg(avg("age")) \
   .collect()
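Since DataFrames and SQL share the same engine, the same average can also be written as SQL (a hedged equivalent, not shown on the slide):

sqlCtx.sql("""
  SELECT name, AVG(age)
  FROM people
  GROUP BY name
""").collect()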
Full API Docs
•  Python
•  Scala
•  Java
•  R
Not Just Less Code: Faster Implementations

[Chart: Time to Aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
Demo

Combine data from JIRA with data from GitHub (Spark pull requests), running in Databricks:
•  Hosted Spark in the cloud
•  Notebooks with integrated visualization
•  Scheduled production jobs

https://accounts.cloud.databricks.com/
Demo notebook (https://demo.cloud.databricks.com/#notebook/43587):

%run /home/michael/ss.2015.demo/spark.sql.lib ...
Command took 2.08s -- by dbadmin at 6/15/2015, 1:17:07 PM on michael (54 GB)

%sql SELECT * FROM sparkSqlJira
Command took 1.95s -- by dbadmin at 6/15/2015, 1:18:46 PM on michael (54 GB)

val rawPRs = sqlContext.read
  .format("com.databricks.spark.rest")
  .option("url", "https://spark-prs.appspot.com/search-open-prs")
  .load()
display(rawPRs)

rawPRs: org.apache.spark.sql.DataFrame = [commenters: array<struct<data:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,username:string>>, components: array<string>, is_mergeable: boolean, jira_issuetype_icon_url: string, jira_issuetype_name: string, jira_priority_icon_url: string, jira_priority_name: string, last_jenkins_comment: struct<body:string,html_url:string,user:struct<login:string>>, last_jenkins_outcome: string, lines_added: bigint, lines_changed: bigint, lines_deleted: bigint, number: bigint, parsed_title: struct<jiras:array<bigint>,metadata:string,title:string>, state: string, updated_at: string, user: string]
Command took 2.01s -- by dbadmin at 6/15/2015, 1:19:08 PM on michael (54 GB)

import org.apache.spark.sql.functions._

val sparkPRs = rawPRs
  .select(
    // "Explode" nested array to create one row per item.
    explode($"components").as("component"),

    // Use a built-in function to construct the full 'SPARK-XXXX' key.
    concat("SPARK-", $"parsed_title.jiras"(0)).as("pr_jira"),

    // Other required columns.
    $"parsed_title.title",
    $"jira_issuetype_icon_url",
    $"jira_priority_icon_url",
    $"number",
    $"commenters",
    $"user",
    $"last_jenkins_outcome",
    $"is_mergeable")
  .where($"component" === "SQL") // Select only SQL PRs

sparkPRs: org.apache.spark.sql.DataFrame = [component: string, pr_jira: string, title: string, jira_issuetype_icon_url: string, jira_priority_icon_url: string, number: bigint, commenters: array<struct<data:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,username:string>>, user: string, last_jenkins_outcome: string, is_mergeable: boolean]
Command took 0.26s -- by dbadmin at 6/15/2015, 1:19:39 PM on michael (54 GB)

table("sparkSqlJira")
  .join(sparkPRs, $"key" === $"pr_jira")
  .jiraTable
Command took 7.55s -- by dbadmin at 6/15/2015, 1:20:15 PM on michael (54 GB)
Plan Optimization & Execution

SQL AST or DataFrame
  → Unresolved Logical Plan
  → Analysis (using the Catalog)
  → Logical Plan
  → Logical Optimization
  → Optimized Logical Plan
  → Physical Planning → Physical Plans → Cost Model
  → Selected Physical Plan
  → Code Generation
  → RDDs

DataFrames and SQL share the same optimization/execution pipeline.
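You can inspect the stages of this pipeline for any DataFrame with explain(); a minimal sketch (the query itself is illustrative):

from pyspark.sql.functions import avg

df = sqlCtx.table("people").groupBy("name").agg(avg("age"))

# Prints the parsed, analyzed, and optimized logical plans,
# followed by the selected physical plan.
df.explain(True)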
Seamlessly Integrated

Intermix DataFrame operations with custom Python, Java, R, or Scala code:

zipToCity = udf(lambda zipCode: <custom logic here>)

def add_demographics(events):
   u = sqlCtx.table("users")
   return (events
     .join(u, events.user_id == u.user_id)
     .withColumn("city", zipToCity(u.zip)))

add_demographics augments any DataFrame that contains user_id.
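A self-contained version of the same pattern, with a toy dictionary standing in for the real zip-to-city logic (the lookup values and the users columns are assumptions):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Toy stand-in for <custom logic here>.
zip_to_city = {"94105": "San Francisco", "10001": "New York"}
zipToCity = udf(lambda zipCode: zip_to_city.get(zipCode, "unknown"),
                StringType())

def add_demographics(events):
    u = sqlCtx.table("users")                   # assumed columns: user_id, zip
    return (events
        .join(u, events.user_id == u.user_id)   # augment events with user info
        .withColumn("city", zipToCity(u.zip)))  # add the derived city column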
Optimize Entire Pipelines

Optimization happens as late as possible, so Spark SQL can optimize even across functions:

events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = events \
  .where(events.city == "San Francisco") \
  .select(events.timestamp) \
  .collect()
def add_demographics(events):
   u = sqlCtx.table("users")                   # Load Hive table
   return (events
     .join(u, events.user_id == u.user_id)     # Join on user_id
     .withColumn("city", zipToCity(u.zip)))    # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()

Logical Plan: filter on top of join(events file, users table) — the join is expensive because it touches every user.
Physical Plan: join(scan(events), filter → scan(users)) — only join the relevant users.
def add_demographics(events):
   u = sqlCtx.table("users")                   # Load partitioned Hive table
   return (events
     .join(u, events.user_id == u.user_id)     # Join on user_id
     .withColumn("city", zipToCity(u.zip)))    # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()

Logical Plan: filter on top of join(events file, users table), as before.
Physical Plan: join(scan(events), filter → scan(users)).
Optimized Physical Plan with Predicate Pushdown and Column Pruning: join(optimized scan(events), optimized scan(users)).
Machine Learning Pipelines

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

Pipeline Model: ds0 → tokenizer → ds1 → hashingTF → ds2 → lr → ds3, yielding the fitted lr.model.
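Once fit, the resulting PipelineModel can score new data; a hedged sketch (the test path and column names are assumptions):

test_df = sqlCtx.load("/path/to/test_data")

# Runs tokenizer -> hashingTF -> fitted lr model over the test rows.
predictions = model.transform(test_df)
predictions.select("text", "prediction").show()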
Find out more during Joseph’s Talk: 3pm Today
Project Tungsten: Initial Results

[Chart: Average GC time per node (seconds) vs. data set size (relative: 1x, 2x, 4x, 8x, 16x) for four configurations: Default, Code Gen, Tungsten on-heap, and Tungsten off-heap]

Find out more during Josh's Talk: 5pm Tomorrow
Questions?

Spark SQL Office Hours Today
-  Michael Armbrust 1:45-2:30
-  Yin Huai 3:40-4:15

Spark SQL Office Hours Tomorrow
-  Reynold 1:45-2:30