
Things I Wish I’d Known About Spark When I Started (One Year Later Edition)

By Jeremy Krinsley, Pam Wu, Daniel Melemed, Jarrod Parker, Linan Zheng

About 12 months ago, we made a decision to move our entity resolution pipeline into the Scala/Spark universe. This was not without its pain points. This was our first major push as a company to productize entity resolution prototypes that had been in development for pretty much as long as the company has existed. It was also the first time our team had worked with either Scala or Spark.

Looking back over the year, there are dozens of “learning moments” that I would love to ship via wormhole to my former self.

In case the opportunity arises, here’s the transmission:

Know What You Shuffle


Shuffle is the transportation of data between workers across a Spark cluster’s network. It’s central to operations where a reorganization of data is required, referred to as wide dependencies (see Wide vs Narrow Dependencies). This kind of operation can quickly become the bottleneck of your Spark application. To use Spark well, you need to know what you shuffle, and for this it’s essential that you know your data.
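As a rough illustration (the DataFrame and column names below are invented for the example, and spark is assumed to be an existing SparkSession), a narrow operation like a filter stays within each partition, while a wide operation like a groupBy has to shuffle rows across the network:

import org.apache.spark.sql.functions.col

val events = spark.read.parquet("/data/events")          // hypothetical input
val filtered = events.filter(col("amount") > 0)          // narrow dependency: no shuffle
val perCustomer = events.groupBy("customer_id").count()  // wide dependency: shuffles rows by key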

Skew Causes Bad Shuffles


Skew is an imbalance in the distribution of your data. If you fail to account for how your data is distributed, you may find that Spark naively places an overwhelming majority of rows on one executor and a fraction on all the rest. This is skew, and it will kill your application, whether by causing out-of-memory errors, network timeouts, or exponentially long-running processes that never terminate.
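One way to spot skew before it bites is to look at the heaviest values of the key you are about to shuffle or join on. A minimal sketch, assuming a DataFrame df and a key column named customer_id:

import org.apache.spark.sql.functions.desc

// If a handful of keys account for most of the rows, expect a bad shuffle.
df.groupBy("customer_id")
  .count()
  .orderBy(desc("count"))
  .show(20)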

Partition on Well-Distributed Columns


A powerful way to control Spark shuffles is to partition your data
intelligently. Partitioning on the right column (or set of columns)
helps to balance the amount of data that has to be mapped across
the cluster network in order to perform actions. Partitioning on a
unique ID is generally a good strategy, but don’t partition on
sparsely filled columns or columns that over-represent particular
values.
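In code, that can be as simple as the sketch below (the column name and partition count are assumptions); repartition hash-partitions rows by the given column so each partition carries a similar share of the data:

import org.apache.spark.sql.functions.col

// Spread rows evenly by a well-distributed unique id before heavy wide operations.
val balanced = df.repartition(400, col("record_id"))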

Beware the Default Partition


It’s absolutely essential to model the number of partitions around the problem you’re solving. In the stage of our application where we run parallel transformations on many heterogeneously sized datasets at once, the default of 200 partitions works just about fine.

When we are dealing with billions of pairwise comparisons, we have found that partitions in the range of 4–10k work most efficiently.

Furthermore, if you run tests on a single server (or locally), you may see dramatic speed improvements by re-partitioning data down to a single partition. We recently squashed a particularly curious bug where our end-to-end test ran fine on our local 8- or 16-core machines, but would never complete on the 2-core server on which we run our CI. Combining the data down to 1 partition solved our issue.
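Both knobs are easy to reach; the numbers below are illustrative rather than recommendations:

// The shuffle default is 200 partitions; raise it for huge pairwise stages.
spark.conf.set("spark.sql.shuffle.partitions", "4000")

// Collapse tiny test data to a single partition for small local or CI machines.
val singlePartition = df.coalesce(1)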

Drive Your Jobs Into Overdrive with .par

While you can depend on Spark to do a lot of parallel heavy lifting, you can push your jobs even harder with thoughtful use of Scala’s built-in .par functionality, which can operate on iterables. The initial steps of our ER pipeline involve reading in dozens of heterogeneous datasets and applying shared transformation pipelines to each of them. A simple datasets.par.foreach cut our run times in half.
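A minimal sketch of the idea (the dataset list, transformation, and output path are all placeholders; on Scala 2.13+ .par additionally requires the scala-parallel-collections module):

import org.apache.spark.sql.DataFrame

// Kick off an independent, deterministic pipeline for each dataset from the
// driver; Spark runs the resulting jobs concurrently.
def processAll(datasets: Seq[(String, DataFrame)], transform: DataFrame => DataFrame): Unit =
  datasets.par.foreach { case (name, df) =>
    transform(df).write.mode("overwrite").parquet(s"/tmp/er-output/$name")
  }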

Of course, you can only rely on it for parts of your pipeline that are completely deterministic and carry no risk of a race condition. Overzealous use of .par can quickly result in mysteriously disappearing or overwritten data.

Joins Are Highly Flammable


Joins are by far the biggest shuffle offender, and the dangers of SQL joining are amplified by the scale Spark enables. Even joining medium-sized data can cause an explosion if there are repeated join values on both sides of your join. This is something that we at Enigma have to be particularly wary of, where ‘unique’ public data keys may result in a couple-million-row join exploding into a billion-row join!

If there is a chance your join columns have null values, you are in danger of massive skew. A great solution to this problem is to “salt” your nulls. This essentially means pre-filling arbitrary values (like UUIDs) into empty cells prior to running a join.
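A minimal sketch of the salting idea, assuming a join key named company_key and a second DataFrame otherDf: fill the nulls with fresh UUIDs so they spread evenly across partitions and never spuriously match each other:

import org.apache.spark.sql.functions.{coalesce, col, expr}

// Replace null join keys with arbitrary unique values before joining.
val salted = df.withColumn("company_key", coalesce(col("company_key"), expr("uuid()")))
val joined = salted.join(otherDf, Seq("company_key"), "left")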

Is Your Data Real Yet?


Operations in Spark are divided between transformations and actions. Transformations are lazy operations that allow Spark to optimize your query under the hood. They will set up a DataFrame for changes — like adding a column, or joining it to another — but will not execute on these plans. This can lead to surprising results. For instance, it’s important to remember that a UDF does not have a materialized value until an action is performed. Imagine, for instance, creating an id column using Spark’s built-in monotonically_increasing_id, and then trying to join on that column. If you do not place an action (such as checkpointing) between generating those ids and joining on them, the values are never materialized. The result will be non-deterministic!
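Here is a sketch of the trap, with invented column names; both sides of the join come from the same un-materialized plan, so the ids can be regenerated differently on each side:

import org.apache.spark.sql.functions.monotonically_increasing_id

val withId = df.withColumn("row_id", monotonically_increasing_id())  // still just a plan
val left = withId.select("row_id", "name")
val right = withId.select("row_id", "score")
// Without materializing withId first (e.g. via checkpointing, below), the ids
// on the two sides may not line up, and the join is non-deterministic.
val risky = left.join(right, "row_id")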

Checkpointing Is Your Friend


Checkpointing is basically the process of saving data to disk and reloading it back in, which would be redundant anywhere else besides Spark. This triggers an action on any waiting transformations and truncates the Spark query plan for that object. Not only will this action show up in your Spark UI (thus indicating where exactly you are in your job), it will also help you avoid re-triggering latent UDF actions in your DAG and conserve resources, since it can potentially allow you to release memory that would otherwise be cached for downstream access. In our experience, checkpointed data is also a valuable source for data-debugging forensics and repurposing. The training data for our pipeline, for instance, is filtered out of a 500 million row table generated halfway through our application.
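Setting it up amounts to two calls, sketched here with an assumed checkpoint directory and an assumed intermediate DataFrame:

// Checkpoints need a directory on shared storage.
spark.sparkContext.setCheckpointDir("s3://my-bucket/spark-checkpoints")

// Eagerly runs the pending transformations, writes the result out, and
// truncates the query plan for everything downstream.
val stable = intermediate.checkpoint()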

Sanity Check Your Runtime With Monitoring

The Spark UI is your friend, and so are monitoring tools like Ganglia that let you know how your run is going in real time. Yarn’s depiction of the Spark query plan can instantly communicate whether your intentions align with your execution. Is something that is supposed to be one join actually a cascade of many small joins?

The Spark UI also contains information at the job level, the stage level, and the executor level. This means you can quickly see whether the number/volume of data going to each partition or to each executor makes sense, and you can see if any part of your job is supposed to handle 10% of the data but is taking 90% of the time. Monitoring tools that allow you to view total memory and CPU usage across executors are essential for resource planning and autopsies on failed jobs.

When we first started using Spark, we used standalone clusters on Yarn and Amazon’s EMRFS. We learned the hard way that gathering Spark logs is a non-trivial task. We are happy to now use Databricks, which handles the essential matter of log aggregation for us, but if you are spinning up your own solution, a log aggregation tool like Kibana is probably essential for introspection sanity.

Error Messages Don’t Mean What They Say

It took quite a while to get used to the fact that Spark complains about one thing when the problem is really somewhere else.

- “Connection reset by peer” often implies you have skewed data and one particular worker has run out of memory.
- “java.net.SocketTimeoutException: Write timed out” might mean you have set your number of partitions too high, and the filesystem is too slow at handling the number of simultaneous writes Spark is attempting to execute.
- “Total size of serialized results… is bigger than spark.driver.maxResultSize” could mean you’ve set your number of partitions too high and the collected results can’t fit onto the driver.
- “Column x is not a member of table y”: You ran half your pipeline just to discover this SQL join error. Front-load your run-time execution with validation to avoid having to reverse-engineer these errors.
- Sometimes you will get a real out-of-memory error, but the forensic work will be to understand why: yes, you can increase the size of your individual workers to make this problem disappear, but before you do that, you should always ask yourself, “is the data well distributed?”

Scala/Spark CSV Reading Is Brittle


Coming from Python, it was a surprise to learn that naively reading CSVs in Scala/Spark often results in silent escape-character errors. The scenario: you have a CSV and naively read it into Spark:
val df = spark.read.option("header", "true").csv("quote-happy.csv")

Your DataFrame seems happy — no runtime exceptions, and you can execute operations on the DataFrame. But after careful debugging of your columns, you realize that at some point in the data, literally everything has shifted over one or several columns. It turns out that to be safe, you need to include .option("escape", "\"") in your reads.

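The safer read from the example above then looks like this:

val df = spark.read
  .option("header", "true")
  .option("escape", "\"")   // treat the double quote as the escape character
  .csv("quote-happy.csv")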
Better suggestion: Use Parquet!

Parquet Is Your Friend


The open-source file format is designed to offer read and write operations an order of magnitude more efficient than uncompressed CSVs.

Parquet is “columnar” in that it is designed to select data only from those columns specified in, say, a Spark SQL query, and skip over those that are not requested. Furthermore, it implements “predicate pushdown” for SQL-like filtering operations, efficiently running queries on only the relevant subsets of the values in a given column. Switching from uncompressed tabular file formats to Parquet is one of the most fundamental things you can do to improve Spark performance.
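For instance (paths and column names below are assumptions), a read like this only touches the two requested columns, and the filter can be pushed down so non-matching row groups are skipped entirely:

import org.apache.spark.sql.functions.col

df.write.mode("overwrite").parquet("/data/companies.parquet")

val nyCompanies = spark.read.parquet("/data/companies.parquet")
  .select("company_id", "state")   // column pruning: other columns are never read
  .filter(col("state") === "NY")   // predicate pushdown: filter applied at the file level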

If you are responsible for generating Parquet from another format — say you are using PyArrow and Pandas for some large-scale migration — be conscious that simply creating a single Parquet file gives up a major benefit of the format.
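A Spark-side sketch of the same idea (column and path names are assumptions; the PyArrow specifics differ): keep the output split across many files, optionally partitioned by a column, instead of one monolithic file:

// Many moderately sized files, partitioned on a column, parallelize far better
// than a single giant Parquet file.
df.repartition(200)
  .write
  .mode("overwrite")
  .partitionBy("ingest_date")
  .parquet("/data/companies_by_date")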

Conclusion
And there you have it, a loose assemblage of suggestions, cobbled
together from a year of using Spark. Here’s hoping my future self
has already found that wormhole and is sending me the year two
edition as you’re reading this.
