
Spark in Production:
Lessons from 100+ production users

Aaron Davidson
October 28, 2015
About Databricks
Founded by the creators of Spark; remains the largest contributor

Offers a hosted service:
• Spark on EC2
• Notebooks
• Plot visualizations
• Cluster management
• Scheduled jobs
What have we learned?
Hosted service + focus on Spark = lots of user feedback
Community!

Focus on two types:


1. Lessons for Spark
2. Lessons for users

Outline: What are the problems?
● Moving beyond Python performance
● Using Spark with new languages (R)
● Network and CPU-bound workloads
● Miscellaneous common pitfalls

Python: Who uses it, anyway?

[Chart from the Spark Survey 2015]


PySpark Architecture

sc.textFile("/data")
  .filter(lambda s: "foobar" in s)
  .count()

Java-to-Python communication is expensive!

[Diagram: driver and executors reading /data, shuttling records between the JVM and Python]
Moving beyond Python performance

Using RDDs:

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames:

from pyspark.sql.functions import avg

sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()

(At least as much as possible!)
Using Spark with other languages (R)
- As adoption rises, new groups of people try Spark:
  - People who have never used Hadoop or distributed computing
  - People who are familiar with statistical languages

- Problem: it is difficult to run R programs on a cluster
  - Technically challenging to rewrite algorithms to run on a cluster
  - Requires a bigger paradigm shift than changing languages
SparkR interface
- A pattern emerges:
  - Do the distributed computation for the initial transformations in Scala/Python
  - Bring a small dataset back to a single node for plotting and quick advanced analyses

- Result: the R interface to Spark is mainly DataFrames

people <- read.df(sqlContext, "./people.json", "json")
teenagers <- filter(people, "age >= 13 AND age <= 19")
head(teenagers)

(See the SparkR docs and the talk "Enabling exploratory data science with Spark and R".)
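The same pattern is easy to express on the Python side before handing results to R; here is a minimal, hypothetical sketch (sqlCtx, the "events" table, and its columns are assumptions, not from the talk): do the distributed aggregation with DataFrames, then pull only the small result back to a single node.

from pyspark.sql import functions as F

summary = (sqlCtx.table("events")
           .groupBy("country")
           .agg(F.avg("latency_ms").alias("avg_latency_ms")))

# The aggregate is small, so it is safe to bring it back to the driver
# for local plotting or further analysis (toPandas() requires pandas).
local = summary.toPandas()   # or summary.collect() for plain Python rows
print(local.head())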
Network- and CPU-bound workloads
- Databricks uses S3 heavily, instead of HDFS
  - S3 is a key-value blob store "in the cloud"
  - Accessed over the network
  - Intended for large object storage
  - ~10-200 ms latency for reads and writes
  - Adapters for HDFS-like access (s3n/s3a) through Spark
  - Strong consistency with some caveats (updates and us-east-1)
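As a hedged illustration of that HDFS-like access path (the bucket, path, and credential mechanism below are assumptions, not from the talk), reading from S3 through s3a looks like reading any other path:

# Credentials are assumed to be supplied via Hadoop configuration, e.g.
#   --conf spark.hadoop.fs.s3a.access.key=...
#   --conf spark.hadoop.fs.s3a.secret.key=...
logs = sc.textFile("s3a://my-bucket/logs/2015/10/*")

# Every read goes over the network to S3; Spark's large sequential reads
# amortize the ~10-200 ms per-request latency.
error_count = logs.filter(lambda line: "ERROR" in line).count()
print(error_count)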
S3 as data storage

[Diagram: a "traditional" data warehouse co-locates HDFS with the executors on each instance, while Databricks executors read from Amazon S3 over the network and keep per-instance caches]
S3(N): Not as advertised
- Had performance issues using S3N out of the box
  - Could not saturate a 1 Gb/s link using 8 cores
  - Peaked around 800% CPU utilization and 100 MB/s by oversubscribing cores
S3 Performance Problem #1

val bytes = new Array[Byte](256 * 1024)
val numRead = s3File.read(bytes)
// numRead = ?

Observed return values: 8999, 1, 8999, 1, 8999, 1, ... (far less than the 256 KB requested per call)

Answer: buffering!
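A minimal sketch of the fix, written in Python rather than Spark's actual Scala internals: keep calling read() until the requested chunk is actually full (or wrap the raw stream in a buffered reader), instead of trusting a single read() call.

def read_fully(stream, size):
    """Read up to `size` bytes from a file-like object, looping rather than
    assuming one read() call returns everything requested."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)  # may return far fewer bytes than asked
        if not chunk:                   # EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Usage with any raw stream (e.g. a hypothetical S3 object stream):
# data = read_fully(s3_stream, 256 * 1024)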
S3 Performance Problem #2

sc.textFile("/data").filter(s => doCompute(s)).count()

[Timeline: "Read 128KB" and "doCompute()" alternate serially, so network and CPU utilization each sit idle for roughly half of the time]
S3: Pipelining to the rescue

[Diagram: a dedicated reading thread streams from S3 into a pipe/buffer while the user program runs doCompute(); reads and compute now overlap on the timeline]
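A rough sketch of the same idea (not Spark's actual implementation): a background reader thread keeps the network busy by filling a bounded queue while the consumer runs its doCompute()-style work on chunks it has already received.

import threading
import queue  # this module is named Queue on Python 2

CHUNK_SIZE = 128 * 1024

def pipelined_chunks(stream, read_ahead=8):
    """Yield chunks from `stream` while a background thread reads ahead,
    so network reads overlap with CPU work in the consumer."""
    buf = queue.Queue(maxsize=read_ahead)  # bounded: provides backpressure

    def reader():
        while True:
            chunk = stream.read(CHUNK_SIZE)
            buf.put(chunk)
            if not chunk:      # empty bytes doubles as the EOF sentinel
                return

    threading.Thread(target=reader, daemon=True).start()

    while True:
        chunk = buf.get()
        if not chunk:
            return
        yield chunk

# for chunk in pipelined_chunks(s3_stream):   # s3_stream is hypothetical
#     do_compute(chunk)                       # overlaps with the next read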
S3: Results
● Max network throughput (1 Gb/s on our NICs)
● Use 100% of a core across 8 threads (largely SSL)
● With this optimization, S3 has worked well:
  ○ Spark hides latency via its inherent batching (except for driver metadata lookups)
  ○ The network is pretty fast
Why is the network "pretty fast"?
r3.2xlarge:
- 120 MiB/s network
- Single 250 MiB/s disk
- At most a 2x improvement to be gained from local disk

More surprising: most workloads were CPU-bound on the read side
Why is Spark often CPU-bound?
- Users think more about the high-level details than about CPU-efficiency
  - Reasonable! Getting something to work at all is most important.
  - Need the right tracing and visualization tools to find bottlenecks.
  - Need efficient primitives for common operations (Tungsten).

- Just reading data may be expensive
  - Decompression is not cheap: between snappy, lzf/lzo, and gzip, be wary of gzip

See talk: SparkUI visualization: a lens into your application
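As one concrete, hedged example of that last point (the table name and output path below are placeholders): the codec used when writing Parquet from Spark SQL is configurable, and a cheaper-to-decompress codec than gzip can take pressure off the CPU on the read side.

# Spark 1.5-era config key; prefer snappy over gzip when downstream jobs
# are CPU-bound on decompression.
sqlCtx.setConf("spark.sql.parquet.compression.codec", "snappy")

events = sqlCtx.table("events")              # placeholder table name
events.write.parquet("/tmp/events_snappy")   # placeholder output path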
Conclusion
- DataFrames came up a lot:
  - Python performance problems? Use DataFrames.
  - Want to use R + Spark? Use DataFrames.
  - Want more performance with less work? Use DataFrames.

- DataFrames are important for Spark to progress in:
  - Expressivity in a language-neutral fashion
  - Performance from knowledge about the structure of the data
Common pitfalls
● Avoid RDD groupByKey()
  ○ The API requires all values for a single key to fit in memory
  ○ DataFrame groupBy() works as expected, though
● Avoid Cartesian products in SQL
  ○ Always ensure you have a join condition! (Check with df.explain())
● Avoid overusing cache()
  ○ Avoid vanilla cache() on data that does not fit in memory or will not be reused
  ○ Starting in Spark 1.6, this can actually hurt performance significantly
  ○ Consider persist(MEMORY_AND_DISK) instead (see the sketch below)
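A short, hedged PySpark sketch of two of these points (the toy pair RDD is made up): reduceByKey() instead of groupByKey(), and MEMORY_AND_DISK instead of vanilla cache().

from pyspark import StorageLevel

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 5)])   # toy data

# groupByKey() would materialize every value for a key in memory;
# reduceByKey() combines values map-side first.
sums = pairs.reduceByKey(lambda x, y: x + y)

# Vanilla cache() is memory-only; if the data may not fit in memory,
# spilling to disk is usually the safer choice.
sums.persist(StorageLevel.MEMORY_AND_DISK)

print(sums.collect())   # e.g. [('a', 3), ('b', 5)]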
Common pitfalls (continued)
● Be careful when joining a small table with a large one
  ○ A broadcast join is by far the best option, so make sure Spark SQL takes it (see the sketch below)
  ○ Cache the smaller table in memory, or use Parquet
● Avoid jets3t 1.9 (the default in Hadoop 2)
  ○ Inexplicably terrible performance
● Prefer S3A to S3N (new in Hadoop 2.6.0)
  ○ Uses the AWS SDK, which allows advanced features like KMS encryption
  ○ Has some nice features, like reusing HTTP connections
  ○ Recently saw a problem related to S3N buffering an entire file!
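A hedged sketch of checking whether Spark SQL actually chose the broadcast plan (the table and column names are placeholders; the threshold key is the Spark 1.x autoBroadcastJoinThreshold):

big = sqlCtx.table("big_facts")       # placeholder: large table
small = sqlCtx.table("small_dim")     # placeholder: small dimension table

# Tables estimated below this size (in bytes) are eligible for broadcast joins.
sqlCtx.setConf("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

joined = big.join(small, big["key"] == small["key"])

# Inspect the physical plan: look for a broadcast join rather than a
# shuffled join, and make sure there is a join condition at all
# (no Cartesian product).
joined.explain()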
Common pitfalls (continued)
● In the RDD API, you can manually reuse a partitioner to avoid extra shuffles (see the sketch below)
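A hedged PySpark sketch of the idea (the data is made up, and the guarantees about skipping shuffles are strongest in the Scala RDD API): pre-partition a pair RDD once, cache it, and run later key-based operations with the same partitioning.

# Toy pair RDD keyed by a small number of ids.
events = sc.parallelize([(i % 100, 1) for i in range(10000)])

# Partition once with an explicit partition count and keep the result.
events_part = events.partitionBy(16).cache()
events_part.count()   # materialize the partitioned, cached copy

# Later aggregations that use the same partitioning (same partition count
# and hash function) can reuse it instead of shuffling the data again.
totals = events_part.reduceByKey(lambda x, y: x + y, 16)
print(totals.take(5))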
Questions?
