Debugging A Spark Application PDF

Uploaded by

kolodacool
Copyright
© All Rights Reserved

Debugging a Spark application

March, 2018, adarsh (http://timepasstechies.com/author/adarshgorur/)

Performance issues in Spark can be categorized into two parts:

1. Distribution performance – the program is slow because of scheduling, coordination, and data distribution.

2. Local performance – the program is slow because the code itself runs slowly on a single node.

Tools for debugging
1. Spark UI

Check which tasks take the most time, and check the summary metrics in the Spark UI. If there is a large difference between the maximum and minimum task execution times, there is a straggler.

2. Executor Logs
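
The comparison of maximum and minimum task times described under the Spark UI item can also be done offline once the task durations are written down. The sketch below is plain Python, not a Spark API: the function names, the sample durations, and the threshold of 3 times the median are all assumptions chosen for illustration.

```python
from statistics import median


def summarize(durations_ms):
    """Return min, median and max, similar in spirit to the UI summary metrics."""
    ordered = sorted(durations_ms)
    return {"min": ordered[0], "median": median(ordered), "max": ordered[-1]}


def find_stragglers(durations_ms, factor=3.0):
    """Return the indexes of tasks slower than `factor` times the median.

    The factor is a hypothetical threshold, not a Spark default.
    """
    med = median(durations_ms)
    return [i for i, d in enumerate(durations_ms) if d > factor * med]


# Hypothetical task durations in milliseconds, e.g. copied from one stage.
durations = [100, 110, 95, 105, 900]
summary = summarize(durations)
stragglers = find_stragglers(durations)
print(summary)
print("straggler task indexes:", stragglers)
```

Comparing against the median rather than the minimum keeps a single unusually fast task from making every other task look slow.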


A straggler can occur for the following reasons:
1. One of the nodes is slower than the others – To solve this, set the spark.speculation property to true, which makes Spark identify slow tasks by looking at the runtime distribution and relaunch those tasks on other nodes.

2. Data skew – This happens when one partition holds a much larger amount of data than the other partitions. To solve this, spread that data across multiple partitions.

3. Garbage collection – The GC time is shown in the Spark UI; if GC takes most of a task's execution time, that is a problem.

4. The code running in each task is slow.
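
The first two remedies in the list can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code. Only the spark.speculation key comes from the text. The helper names, the sample keys, and the rotating-salt scheme are hypothetical, and the partitioning is modelled in plain Python with a CRC32 hash rather than Spark's own partitioner.

```python
import zlib
from collections import Counter


def speculation_settings():
    """Key/value pairs for remedy 1; only spark.speculation comes from the text."""
    return [("spark.speculation", "true")]


def build_conf(settings):
    """Apply the pairs to a SparkConf.

    pyspark is imported lazily so the rest of this sketch runs without Spark.
    """
    from pyspark import SparkConf
    return SparkConf().setAll(settings)


def base_partition(key, num_partitions):
    """Deterministic stand-in for hash partitioning (hypothetical helper)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Records per partition when each key maps to exactly one partition."""
    return Counter(base_partition(k, num_partitions) for k in keys)


def salted_partition_sizes(keys, num_partitions, num_salts):
    """Records per partition when a rotating salt shifts each record's
    partition, so records sharing a key land on up to num_salts partitions."""
    sizes = Counter()
    for i, key in enumerate(keys):
        salt = i % num_salts
        sizes[(base_partition(key, num_partitions) + salt) % num_partitions] += 1
    return sizes


# Hypothetical skewed input: one hot key dominates the data set.
keys = ["hot"] * 1000 + ["a", "b", "c", "d"]
unsalted = partition_sizes(keys, 4)
salted = salted_partition_sizes(keys, 4, 4)
print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:", max(salted.values()))
```

In a real job the salt would typically be attached to the key before the shuffle, and any per-key aggregation would need a second step that merges the partial results for the salted copies of each key.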


