Spark Performance Tuning

The document outlines performance tuning strategies for Spark, focusing on data serialization, memory tuning, and levels of parallelism. It discusses memory management, including execution and storage memory, and provides techniques for optimizing memory consumption through data structure design, serialized RDD storage, and garbage collection tuning. Additionally, it emphasizes the importance of data locality and offers contact information for further assistance from zekeLabs.


zekeLabs

Spark - Performance Tuning

Learning made Simpler !

www.zekeLabs.com
Agenda

• Data Serialization
• Memory Tuning
• Level of Parallelism
• Memory Usage of Reduce Tasks
• Determining Memory Consumption
• Broadcasting Large Values
Performance Tuning

• Data Serialization
• Memory Tuning
• Memory Management
• Data Storage Tuning
• Garbage Collection Tuning
Data Serialization

• Java serialization
• Kryo serialization
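As a sketch of switching to Kryo in a Scala application (the `Point` class is a hypothetical example of an application class to register):

```scala
import org.apache.spark.SparkConf

// Hypothetical application class to register with Kryo.
case class Point(x: Double, y: Double)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  // Replace the default Java serializer with Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write small numeric IDs
  // instead of full class names with every object.
  .registerKryoClasses(Array(classOf[Point]))
```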
Memory Management Overview

Memory usage in Spark falls under two categories:

• execution - memory used for shuffles, joins, sorts, and aggregations
• storage - memory used for caching and for propagating internal data across the cluster

Memory layout design:

• Execution and storage memory share a unified region, M
• A minimum amount of storage memory, R, is reserved within M
• Execution can evict cached blocks until storage memory falls to R
Advantages of Design

• Applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills.
• Applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted.
• Reasonable out-of-the-box performance for a variety of workloads.
• spark.memory.fraction sets M as a fraction of the heap (default 0.6)
• spark.memory.storageFraction sets R as a fraction of M (default 0.5)
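For reference, both knobs can be set in spark-defaults.conf; the values shown below are simply the defaults, not a tuning recommendation:

```
# spark-defaults.conf
# Fraction of usable heap shared by execution + storage (M)
spark.memory.fraction         0.6
# Fraction of M immune to eviction by execution (R)
spark.memory.storageFraction  0.5
```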
Determining Memory Consumption

• Cache an RDD and check the memory it occupies on the Storage page of the web UI


• Using SizeEstimator API
import org.apache.spark.util.SizeEstimator
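A minimal sketch of the API in a spark-shell session (the measured object is arbitrary):

```scala
import org.apache.spark.util.SizeEstimator

// Estimate the in-memory footprint (in bytes) of an object,
// including the objects it references.
val sample = Array.fill(1000)(scala.util.Random.nextInt())
println(SizeEstimator.estimate(sample))
```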

Once calculated, use one of the techniques below:


• Tuning Data Structure
• Serialized RDD Storage
• Garbage Collection Tuning
Tuning Data Structures

• Design your data structures to prefer arrays of objects and primitive types over the standard Java or Scala collection classes


• Avoid nested structures with a lot of small objects and pointers when possible.
• Consider using numeric IDs or enumeration objects instead of strings for keys.
• If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight. You can add this option in spark-env.sh.
Serialized RDD Storage

• If objects are still too large, they should be serialized.


• Use the persist API with a serialized storage level such as MEMORY_ONLY_SER.
• Use Kryo if you want to cache data in serialized form.
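Assuming an existing SparkContext `sc`, a sketch of caching an RDD in serialized form:

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER stores each partition as a single serialized
// byte array: slower to access (deserialization cost), but far
// more space-efficient and friendlier to the garbage collector.
val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count() // materializes the cache
```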
Garbage Collection Tuning

Measuring the impact of GC using Java options:


• spark.executor.extraJavaOptions - -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
• The output appears in the executor logs on the worker nodes, not on the driver.
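These flags can be passed per executor at submit time; a sketch (the jar name is a placeholder):

```
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my-app.jar
```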
Advanced GC Tuning

• The heap has two regions - Young (short-lived objects) and Old (long-lived objects)
• Young is further divided into Eden, Survivor1, and Survivor2
• When Eden is full, a minor GC is run on Eden, and objects that are alive in Eden and Survivor1 are copied to Survivor2.
• If an object is old enough or Survivor2 is full, it is moved to Old.
• Finally, when Old is close to full, a full GC is invoked.
• The goal of GC tuning is to ensure that only long-lived RDDs are stored in the Old generation and that the Young generation holds only short-lived objects.
Useful steps

• If a full GC is invoked multiple times before a task completes, there isn't enough memory available for executing tasks.
• If there are many minor collections but not many major GCs, allocating more memory for Eden would help.
• If the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution. Also consider adjusting the young generation size.
More ...

• Use the G1GC garbage collector: -XX:+UseG1GC
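G1 is enabled through the same executor Java options as the GC logging flags; a minimal config fragment:

```
# spark-defaults.conf
spark.executor.extraJavaOptions  -XX:+UseG1GC
```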


Others

• Level of Parallelism - aim for 2-3 tasks per CPU core
• Reduce the memory usage of reduce tasks by increasing the level of parallelism
• Broadcast large values from the driver to every executor, instead of shipping them with each task
• Data Locality - bring the code close to the data; tuned via the spark.locality.* configuration settings (e.g. spark.locality.wait)
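Assuming an existing SparkContext `sc`, the broadcast and parallelism points above can be sketched as:

```scala
// Ship the lookup table to each executor once,
// instead of once per task inside the closure.
val lookup = Map(1 -> "one", 2 -> "two")
val bcast = sc.broadcast(lookup)

// numSlices chosen for roughly 2-3 tasks per core on a small cluster.
val named = sc.parallelize(Seq(1, 2, 1, 2), numSlices = 4)
  .map(i => bcast.value.getOrElse(i, "unknown"))
```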
THANK YOU

Let us know how we can help your organization upskill its employees to stay updated in the ever-evolving IT industry.

Get in touch:

www.zekeLabs.com | +91-8095465880 | [email protected]

Visit : www.zekeLabs.com for more details
