Debugging A Spark Application PDF
Performance debugging a Spark application
Spark performance issues fall into two categories:
1. Distribution performance – the program is slow because of scheduling, coordination and data distribution.
2. Local performance – the program is slow because the code itself is slow on a single node.
Tools for debugging
1. Spark UI
Check which tasks take the most time, and check the summary metrics in the Spark UI. If there is a large difference between the maximum and minimum task execution times, there is a straggler.
2. Data skew – A straggler can appear when one partition holds far more data than the other partitions. To solve this, spread that data across multiple partitions.
3. Garbage collection – The Spark UI shows the GC time of each task. If GC accounts for most of a task's execution time, there is a problem.
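
The three checks above can be sketched in plain Python. This is only an illustration under assumptions, not Spark code: the `TASKS` records, their field names, the 2x-median straggler rule, the 0.5 GC threshold and the `#` salt separator are all hypothetical choices made for the example. The numbers would normally be read from the task table in the Spark UI.

```python
from collections import Counter
from statistics import median

# Hypothetical task records, shaped like values copied from the Spark UI.
TASKS = [
    {"task_id": 0, "run_time_ms": 1200, "gc_time_ms": 60},
    {"task_id": 1, "run_time_ms": 1100, "gc_time_ms": 50},
    {"task_id": 2, "run_time_ms": 1300, "gc_time_ms": 900},
    {"task_id": 3, "run_time_ms": 9800, "gc_time_ms": 300},
    {"task_id": 4, "run_time_ms": 1150, "gc_time_ms": 40},
]


def find_stragglers(tasks, factor=2.0):
    """Return ids of tasks that ran longer than `factor` x the median run time."""
    typical = median([t["run_time_ms"] for t in tasks])
    return [t["task_id"] for t in tasks if t["run_time_ms"] > factor * typical]


def gc_heavy_tasks(tasks, threshold=0.5):
    """Return ids of tasks where GC took more than `threshold` of the run time."""
    heavy = []
    for t in tasks:
        run = t["run_time_ms"]
        if run > 0 and t["gc_time_ms"] / run > threshold:
            heavy.append(t["task_id"])
    return heavy


def salt_keys(keys, num_salts):
    """Append a rotating salt so one hot key becomes `num_salts` distinct keys."""
    return [f"{key}#{i % num_salts}" for i, key in enumerate(keys)]


print(find_stragglers(TASKS))   # [3]
print(gc_heavy_tasks(TASKS))    # [2]

# One hot key with 1000 records is split into 4 salted keys of 250 records each,
# so a hash partitioner is no longer forced to put all of them in one partition.
print(Counter(salt_keys(["hot"] * 1000, 4)))
```

Salting changes the keys, so an aggregation over salted keys needs a second step that strips the salt and combines the partial results per original key.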