
Spark in Production:
Lessons from 100+ production users

Aaron Davidson
October 28, 2015
About Databricks
Founded by the creators of Spark; remains the largest contributor

Offers a hosted service:
• Spark on EC2
• Notebooks
• Plot visualizations
• Cluster management
• Scheduled jobs
What have we learned?
Hosted service + focus on Spark = lots of user feedback
Community!

Focus on two types:


1. Lessons for Spark
2. Lessons for users

Outline: What are the problems?
● Moving beyond Python performance
● Using Spark with new languages (R)
● Network and CPU-bound workloads
● Miscellaneous common pitfalls

Python: Who uses it, anyway?

[Chart from the Spark Survey 2015]


PySpark Architecture

sc.textFile("/data")
  .filter(lambda s: "foobar" in s)
  .count()

Java-to-Python communication is expensive!

[Diagram: driver and executors reading /data, shuttling records between the JVM and Python]
Moving beyond Python performance

Using RDDs:

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames:

from pyspark.sql.functions import avg

sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()

(At least as much as possible!)
Using Spark with other languages (R)
- As adoption rises, new groups of people try Spark:
  - People who have never used Hadoop or distributed computing
  - People who are familiar with statistical languages

- Problem: it is difficult to run R programs on a cluster
  - Technically challenging to rewrite algorithms to run on a cluster
  - Requires a bigger paradigm shift than changing languages
SparkR interface
- A pattern emerges:
  - Do the distributed computation for the initial transformations in Scala/Python
  - Bring a small dataset back to a single node for plotting and quick advanced analyses

- Result: the R interface to Spark is mainly DataFrames

people <- read.df(sqlContext, "./people.json", "json")
teenagers <- filter(people, "age >= 13 AND age <= 19")
head(teenagers)

(See the SparkR docs and the talk "Enabling exploratory data science with Spark and R".)
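The same pattern is easy to express on the Python side before handing results to R; here is a minimal, hypothetical sketch (sqlCtx, the "events" table, and its columns are assumptions, not from the talk): do the distributed aggregation with DataFrames, then pull only the small result back to a single node.

from pyspark.sql import functions as F

summary = (sqlCtx.table("events")
           .groupBy("country")
           .agg(F.avg("latency_ms").alias("avg_latency_ms")))

# The aggregate is small, so it is safe to bring it back to the driver
# for local plotting or further analysis (toPandas() requires pandas).
local = summary.toPandas()   # or summary.collect() for plain Python rows
print(local.head())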
Network- and CPU-bound workloads
- Databricks uses S3 heavily, instead of HDFS
  - S3 is a key-value blob store "in the cloud"
  - Accessed over the network
  - Intended for large object storage
  - ~10-200 ms latency for reads and writes
  - Adapters for HDFS-like access (s3n/s3a) through Spark
  - Strong consistency with some caveats (updates and us-east-1)
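As a hedged illustration of that HDFS-like access path (the bucket, path, and credential mechanism below are assumptions, not from the talk), reading from S3 through s3a looks like reading any other path:

# Credentials are assumed to be supplied via Hadoop configuration, e.g.
#   --conf spark.hadoop.fs.s3a.access.key=...
#   --conf spark.hadoop.fs.s3a.secret.key=...
logs = sc.textFile("s3a://my-bucket/logs/2015/10/*")

# Every read goes over the network to S3; Spark's large sequential reads
# amortize the ~10-200 ms per-request latency.
error_count = logs.filter(lambda line: "ERROR" in line).count()
print(error_count)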
S3 as data storage

[Diagram: a "traditional" data warehouse co-locates HDFS with the executors on each instance, while Databricks executors read from Amazon S3 over the network and keep per-instance caches]
S3(N): Not as advertised
- Had performance issues using S3N out of the box
  - Could not saturate a 1 Gb/s link using 8 cores
  - Peaked around 800% CPU utilization and 100 MB/s by oversubscribing cores
S3 Performance Problem #1

val bytes = new Array[Byte](256 * 1024)
val numRead = s3File.read(bytes)
// numRead = ?

Observed return values: 8999, 1, 8999, 1, 8999, 1, ... (far less than the 256 KB requested per call)

Answer: buffering!
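A minimal sketch of the fix, written in Python rather than Spark's actual Scala internals: keep calling read() until the requested chunk is actually full (or wrap the raw stream in a buffered reader), instead of trusting a single read() call.

def read_fully(stream, size):
    """Read up to `size` bytes from a file-like object, looping rather than
    assuming one read() call returns everything requested."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)  # may return far fewer bytes than asked
        if not chunk:                   # EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Usage with any raw stream (e.g. a hypothetical S3 object stream):
# data = read_fully(s3_stream, 256 * 1024)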
S3 Performance Problem #2

sc.textFile("/data").filter(s => doCompute(s)).count()

[Timeline: "Read 128KB" and "doCompute()" alternate serially, so network and CPU utilization each sit idle for roughly half of the time]
S3: Pipelining to the rescue

[Diagram: a dedicated reading thread streams from S3 into a pipe/buffer while the user program runs doCompute(); reads and compute now overlap on the timeline]
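A rough sketch of the same idea (not Spark's actual implementation): a background reader thread keeps the network busy by filling a bounded queue while the consumer runs its doCompute()-style work on chunks it has already received.

import threading
import queue  # this module is named Queue on Python 2

CHUNK_SIZE = 128 * 1024

def pipelined_chunks(stream, read_ahead=8):
    """Yield chunks from `stream` while a background thread reads ahead,
    so network reads overlap with CPU work in the consumer."""
    buf = queue.Queue(maxsize=read_ahead)  # bounded: provides backpressure

    def reader():
        while True:
            chunk = stream.read(CHUNK_SIZE)
            buf.put(chunk)
            if not chunk:      # empty bytes doubles as the EOF sentinel
                return

    threading.Thread(target=reader, daemon=True).start()

    while True:
        chunk = buf.get()
        if not chunk:
            return
        yield chunk

# for chunk in pipelined_chunks(s3_stream):   # s3_stream is hypothetical
#     do_compute(chunk)                       # overlaps with the next read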
S3: Results
● Max network throughput (1 Gb/s on our NICs)
● Use 100% of a core across 8 threads (largely SSL)
● With this optimization, S3 has worked well:
  ○ Spark hides latency via its inherent batching (except for driver metadata lookups)
  ○ The network is pretty fast
Why is the network "pretty fast"?
r3.2xlarge:
- 120 MiB/s network
- Single 250 MiB/s disk
- At most a 2x improvement to be gained from local disk

More surprising: most workloads were CPU-bound on the read side
Why is Spark often CPU-bound?
- Users think more about the high-level details than about CPU-efficiency
  - Reasonable! Getting something to work at all is most important.
  - Need the right tracing and visualization tools to find bottlenecks.
  - Need efficient primitives for common operations (Tungsten).

- Just reading data may be expensive
  - Decompression is not cheap: between snappy, lzf/lzo, and gzip, be wary of gzip

See talk: SparkUI visualization: a lens into your application
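As one concrete, hedged example of that last point (the table name and output path below are placeholders): the codec used when writing Parquet from Spark SQL is configurable, and a cheaper-to-decompress codec than gzip can take pressure off the CPU on the read side.

# Spark 1.5-era config key; prefer snappy over gzip when downstream jobs
# are CPU-bound on decompression.
sqlCtx.setConf("spark.sql.parquet.compression.codec", "snappy")

events = sqlCtx.table("events")              # placeholder table name
events.write.parquet("/tmp/events_snappy")   # placeholder output path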
Conclusion
- DataFrames came up a lot:
  - Python performance problems? Use DataFrames.
  - Want to use R + Spark? Use DataFrames.
  - Want more performance with less work? Use DataFrames.

- DataFrames are important for Spark to progress in:
  - Expressivity in a language-neutral fashion
  - Performance from knowledge about the structure of the data
Common pitfalls
● Avoid RDD groupByKey()
  ○ The API requires all values for a single key to fit in memory
  ○ DataFrame groupBy() works as expected, though
● Avoid Cartesian products in SQL
  ○ Always ensure you have a join condition! (Check with df.explain())
● Avoid overusing cache()
  ○ Avoid vanilla cache() on data that does not fit in memory or will not be reused
  ○ Starting in Spark 1.6, this can actually hurt performance significantly
  ○ Consider persist(MEMORY_AND_DISK) instead (see the sketch below)
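A short, hedged PySpark sketch of two of these points (the toy pair RDD is made up): reduceByKey() instead of groupByKey(), and MEMORY_AND_DISK instead of vanilla cache().

from pyspark import StorageLevel

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 5)])   # toy data

# groupByKey() would materialize every value for a key in memory;
# reduceByKey() combines values map-side first.
sums = pairs.reduceByKey(lambda x, y: x + y)

# Vanilla cache() is memory-only; if the data may not fit in memory,
# spilling to disk is usually the safer choice.
sums.persist(StorageLevel.MEMORY_AND_DISK)

print(sums.collect())   # e.g. [('a', 3), ('b', 5)]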
Common pitfalls (continued)
● Be careful when joining a small table with a large one
  ○ A broadcast join is by far the best option, so make sure Spark SQL takes it (see the sketch below)
  ○ Cache the smaller table in memory, or use Parquet
● Avoid jets3t 1.9 (the default in Hadoop 2)
  ○ Inexplicably terrible performance
● Prefer S3A to S3N (new in Hadoop 2.6.0)
  ○ Uses the AWS SDK, which allows advanced features like KMS encryption
  ○ Has some nice features, like reusing HTTP connections
  ○ Recently saw a problem related to S3N buffering an entire file!
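A hedged sketch of checking whether Spark SQL actually chose the broadcast plan (the table and column names are placeholders; the threshold key is the Spark 1.x autoBroadcastJoinThreshold):

big = sqlCtx.table("big_facts")       # placeholder: large table
small = sqlCtx.table("small_dim")     # placeholder: small dimension table

# Tables estimated below this size (in bytes) are eligible for broadcast joins.
sqlCtx.setConf("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

joined = big.join(small, big["key"] == small["key"])

# Inspect the physical plan: look for a broadcast join rather than a
# shuffled join, and make sure there is a join condition at all
# (no Cartesian product).
joined.explain()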
Common pitfalls (continued)
● In the RDD API, you can manually reuse a partitioner to avoid extra shuffles (see the sketch below)
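A hedged PySpark sketch of the idea (the data is made up, and the guarantees about skipping shuffles are strongest in the Scala RDD API): pre-partition a pair RDD once, cache it, and run later key-based operations with the same partitioning.

# Toy pair RDD keyed by a small number of ids.
events = sc.parallelize([(i % 100, 1) for i in range(10000)])

# Partition once with an explicit partition count and keep the result.
events_part = events.partitionBy(16).cache()
events_part.count()   # materialize the partitioned, cached copy

# Later aggregations that use the same partitioning (same partition count
# and hash function) can reuse it instead of shuffling the data again.
totals = events_part.reduceByKey(lambda x, y: x + y, 16)
print(totals.take(5))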
Questions?
