
TRAINING SHEET

APACHE SPARK APPLICATION PERFORMANCE TUNING
Maximize the performance of your applications
Maximize the performance of your applications

This three-day hands-on training course delivers the key concepts and expertise developers need to improve the performance of their Apache Spark applications. During the course, participants will learn how to identify common sources of poor performance in Spark applications, techniques for avoiding or solving them, and best practices for Spark application monitoring.

Apache Spark Application Performance Tuning presents the architecture and concepts behind Apache Spark and the underlying data platform, then builds on this foundational understanding by teaching students how to tune Spark application code. The course format emphasizes instructor-led demonstrations that illustrate both performance issues and the techniques that address them, followed by hands-on exercises that give students an opportunity to practice what they’ve learned in an interactive notebook environment. The course applies to Spark 2.4, but also introduces the Spark 3.0 Adaptive Query Execution framework.

“Cloudera’s instructor was excellent, offering clear and concise training that was easy to understand. His wide-ranging peripheral knowledge helped apply the course materials to real-world situations. I look forward to attending another course.”
Comscore

What You Will Learn


Students who successfully complete this course will be able to:
• Understand Apache Spark’s architecture, job execution, and how techniques
such as lazy execution and pipelining can improve runtime performance
• Evaluate the performance characteristics of core data structures such as RDDs and DataFrames
• Select the file formats that will provide the best performance for your application
• Identify and resolve performance problems caused by data skew
• Use partitioning, bucketing, and join optimizations to improve Spark SQL performance
• Understand the performance overhead of Python-based RDDs, DataFrames, and
user-defined functions
• Take advantage of caching for better application performance (see the sketch after this list)
• Understand how the Catalyst and Tungsten optimizers work
• Understand how Workload XM can help troubleshoot and proactively monitor Spark application performance
• Learn about the new features in Spark 3.0 and specifically how the Adaptive
Query Execution engine improves performance
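
For illustration only (this snippet is not part of the course materials), the minimal PySpark sketch below shows two of the techniques named above: caching a DataFrame that is reused across several actions, and hinting a broadcast join so the large table is not shuffled. The file paths, table names, and join column are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
events = spark.read.parquet("/data/events")        # large
countries = spark.read.parquet("/data/countries")  # small

# Cache a DataFrame that several downstream actions reuse,
# so it is not recomputed from the source each time.
events.cache()

# Hint Spark to broadcast the small table, avoiding a
# shuffle (sort-merge join) of the large table.
joined = events.join(broadcast(countries), on="country_code")

joined.groupBy("country_name").count().show()

# Release the cached blocks once the DataFrame is no longer needed.
events.unpersist()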

What to Expect
This course is designed for software developers, engineers, and data scientists who
have experience developing Spark applications and want to learn how to improve the
performance of their code. This is not an introduction to Spark.
Spark examples and hands-on exercises are presented in Python, and the ability to
program in this language is required. Basic familiarity with the Linux command line is
assumed. Basic knowledge of SQL is helpful.

Course Details:

Spark Architecture
• RDDs
• DataFrames and Datasets
• Lazy Evaluation
• Pipelining

Data Sources and Formats
• Available Formats Overview
• Impact on Performance
• The Small Files Problem

Inferring Schemas
• The Cost of Inference
• Mitigating Tactics

Dealing With Skewed Data
• Recognizing Skew
• Mitigating Tactics

Catalyst and Tungsten Overview
• Catalyst Overview
• Tungsten Overview

Mitigating Spark Shuffles
• Denormalization
• Broadcast Joins
• Map-Side Operations
• Sort Merge Joins

Partitioned and Bucketed Tables
• Partitioned Tables
• Bucketed Tables
• Impact on Performance

Improving Join Performance
• Skewed Joins
• Bucketed Joins
• Incremental Joins

PySpark Overhead and UDFs
• PySpark Overhead
• Scalar UDFs
• Vector UDFs using Apache Arrow
• Scala UDFs

Caching Data for Reuse
• Caching Options
• Impact on Performance
• Caching Pitfalls

Workload XM (WXM) Introduction
• WXM Overview
• WXM for Spark Developers

What’s New in Spark 3.0? (see the configuration sketch below)
• Adaptive Number of Shuffle Partitions
• Skew Joins
• Convert Sort Merge Joins to Broadcast Joins
• Dynamic Partition Pruning
• Dynamic Coalesce Shuffle Partitions

Appendix A: Partition Processing
Appendix B: Broadcasting
Appendix C: Scheduling
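
As context for the “What’s New in Spark 3.0?” topics above, the sketch below shows how the Adaptive Query Execution settings introduced in Spark 3.0 are typically switched on when building a session. The property names are standard Spark 3.0 configuration keys; the application name and the choice to enable every feature are illustrative assumptions, not course recommendations.

from pyspark.sql import SparkSession

# Illustrative session; the app name is a placeholder.
spark = (
    SparkSession.builder
    .appName("aqe-sketch")
    # Master switch for Adaptive Query Execution (Spark 3.0+).
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce small shuffle partitions at runtime.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions so skew joins balance better.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Prune fact-table partitions based on the dimension side of a join.
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    .getOrCreate()
)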

Cloudera, Inc. 5470 Great America Parkway, Santa Clara, CA 95054 cloudera.com
© 2020 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered
trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their
respective companies. Information is subject to change without notice.
spark-application-performance-tuning-datasheet_103 : 201026
