This is the code repository for Troubleshooting Apache Spark [Video], published by Packt. It contains all the supporting project files necessary to work through the video course from start to finish.
In this course, you will learn how Spark's computation model works and how to leverage the DataFrame API along with its optimizations. Joining is one of the most important features of any Big Data tool, and you will learn to implement joins efficiently. Implementing efficient transformations is hard, and common problems can cause your processing to run for a very long time. You will learn how to reuse objects and how to reduce setup and startup overheads using shared variables. You will also master Spark Streaming and solve problems that arise while using that API.
- Solve long-running computation problems by leveraging lazy evaluation in Spark
- Avoid memory leaks by understanding the internal memory management of Apache Spark
- Rework pipelines that fail to scale out by using partitions
- Debug and create user-defined functions that enrich the Spark API
- Choose a proper join strategy depending on the characteristics of your input data
- Troubleshoot the APIs for joins: DataFrames or Datasets
- Write code that minimizes object creation using the proper API
- Troubleshoot real-time pipelines written in Spark Streaming
To fully benefit from the coverage included in this course, you will need experience with Apache Spark.
This course has the following software requirements:
For an optimal experience with hands-on labs and other practical activities, we recommend the following configuration:
- OS: Mac
- Processor: Not Applicable
- Memory: 4GB or above
- Storage: 50GB free space

Software Requirements

- OS: Windows or Mac
- Browser: Google Chrome
- Atom IDE, Latest Version
- Node.js LTS 8.9.1 Installed