Spark Performance Tuning

The document outlines performance tuning strategies for Spark, focusing on data serialization, memory tuning, and levels of parallelism. It discusses memory management, including execution and storage memory, and provides techniques for optimizing memory consumption through data structure design, serialized RDD storage, and garbage collection tuning. Additionally, it emphasizes the importance of data locality and offers contact information for further assistance from zekeLabs.


zekeLabs

Spark - Performance Tuning

Learning made Simpler !

www.zekeLabs.com
Agenda

• Data Serialization
• Memory Tuning
• Level of Parallelism
• Memory Usage of Reduce Tasks
• Determining Memory Consumption
• Broadcasting Large Values
Performance Tuning

• Data Serialization
• Memory Tuning
• Memory Management
• Data Storage Tuning
• Garbage Collection Tuning
Data Serialization

• Java serialization
• Kryo serialization
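As a sketch of switching to Kryo in a Scala application (the `Point` class is a hypothetical example of an application class to register):

```scala
import org.apache.spark.SparkConf

// Hypothetical application class to register with Kryo.
case class Point(x: Double, y: Double)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  // Replace the default Java serializer with Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write small numeric IDs
  // instead of full class names with every object.
  .registerKryoClasses(Array(classOf[Point]))
```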
Memory Management Overview

Memory usage in Spark falls under two categories:

• execution - memory used for shuffles, joins, sorts, and aggregations
• storage - memory used for caching and for propagating internal data across the cluster

Memory layout design:

• Execution and storage memory share a unified region, M
• A minimum amount of storage memory, R, is reserved within M
• Execution can evict cached blocks until storage memory falls to R
Advantages of Design

• Applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills.
• Applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted.
• Reasonable out-of-the-box performance for a variety of workloads.
• spark.memory.fraction sets M as a fraction of the heap (default 0.6)
• spark.memory.storageFraction sets R as a fraction of M (default 0.5)
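For reference, both knobs can be set in spark-defaults.conf; the values shown below are simply the defaults, not a tuning recommendation:

```
# spark-defaults.conf
# Fraction of usable heap shared by execution + storage (M)
spark.memory.fraction         0.6
# Fraction of M immune to eviction by execution (R)
spark.memory.storageFraction  0.5
```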
Determining Memory Consumption

• Cache an RDD and check the memory it occupies on the Storage page of the web UI


• Using SizeEstimator API
import org.apache.spark.util.SizeEstimator
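A minimal sketch of the API in a spark-shell session (the measured object is arbitrary):

```scala
import org.apache.spark.util.SizeEstimator

// Estimate the in-memory footprint (in bytes) of an object,
// including the objects it references.
val sample = Array.fill(1000)(scala.util.Random.nextInt())
println(SizeEstimator.estimate(sample))
```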

Once calculated, use one of the techniques below:


• Tuning Data Structure
• Serialized RDD Storage
• Garbage Collection Tuning
Tuning Data Structures

• Design your data structures to prefer arrays of objects and primitive types over the standard Java or Scala collection classes


• Avoid nested structures with a lot of small objects and pointers when possible.
• Consider using numeric IDs or enumeration objects instead of strings for keys.
• If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight. You can add this option in spark-env.sh.
Serialized RDD Storage

• If objects are still too large, they should be serialized.


• Use the persist API with a serialized storage level such as MEMORY_ONLY_SER.
• Use Kryo if you want to cache data in serialized form.
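Assuming an existing SparkContext `sc`, a sketch of caching an RDD in serialized form:

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER stores each partition as a single serialized
// byte array: slower to access (deserialization cost), but far
// more space-efficient and friendlier to the garbage collector.
val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count() // materializes the cache
```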
Garbage Collection Tuning

Measuring the impact of GC using Java options:


• spark.executor.extraJavaOptions - -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
• The output appears in the executor logs on the worker nodes, not on the driver.
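These flags can be passed per executor at submit time; a sketch (the jar name is a placeholder):

```
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my-app.jar
```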
Advanced GC Tuning

• The heap has two regions - Young (short-lived objects) and Old (long-lived objects)
• Young is further divided into Eden, Survivor1, and Survivor2
• When Eden is full, a minor GC is run on Eden, and objects that are alive in Eden and Survivor1 are copied to Survivor2.
• If an object is old enough or Survivor2 is full, it is moved to Old.
• Finally, when Old is close to full, a full GC is invoked.
• The goal of GC tuning is to ensure that only long-lived RDDs are stored in the Old generation and that the Young generation holds only short-lived objects.
Useful steps

• If a full GC is invoked multiple times before a task completes, there isn't enough memory available for executing tasks.
• If there are many minor collections but not many major GCs, allocating more memory for Eden would help.
• If the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution. Also consider adjusting the young generation size.
More ...

• Use the G1GC garbage collector: -XX:+UseG1GC
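G1 is enabled through the same executor Java options as the GC logging flags; a minimal config fragment:

```
# spark-defaults.conf
spark.executor.extraJavaOptions  -XX:+UseG1GC
```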


Others

• Level of Parallelism - aim for 2-3 tasks per CPU core
• Reduce the memory usage of reduce tasks by increasing the level of parallelism
• Broadcast large values from the driver to every executor, instead of shipping them with each task
• Data Locality - bring the code close to the data; tuned via the spark.locality.* configuration settings (e.g. spark.locality.wait)
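Assuming an existing SparkContext `sc`, the broadcast and parallelism points above can be sketched as:

```scala
// Ship the lookup table to each executor once,
// instead of once per task inside the closure.
val lookup = Map(1 -> "one", 2 -> "two")
val bcast = sc.broadcast(lookup)

// numSlices chosen for roughly 2-3 tasks per core on a small cluster.
val named = sc.parallelize(Seq(1, 2, 1, 2), numSlices = 4)
  .map(i => bcast.value.getOrElse(i, "unknown"))
```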
THANK YOU

Let us know how we can help your organization upskill its employees to stay updated in the ever-evolving IT industry.

Get in touch:

www.zekeLabs.com | +91-8095465880 | [email protected]

Visit : www.zekeLabs.com for more details
