Spark Training - Java
Prerequisite:
Candidates attending the training should have basic knowledge of Java or Scala.
HDFS
Why HDFS?
HDFS Architecture
Using HDFS and the hdfs shell commands
Spark (the version covered is the latest release, Spark 1.6)
Scala - Introduction
Objects and Classes
val, var, functions, currying, implicits
traits, actors, and file manipulation
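Since the course targets Java developers, a rough Java analogue of Scala-style currying (using java.util.function.Function) may help preview the idea before the Scala introduction; the names here are illustrative only:

```java
import java.util.function.Function;

public class CurryDemo {
    // Curried addition: taking the first argument returns a function
    // that waits for the second (Scala: def add(a: Int)(b: Int) = a + b)
    static Function<Integer, Integer> add(int a) {
        return b -> a + b;
    }

    public static void main(String[] args) {
        Function<Integer, Integer> addFive = add(5); // partial application
        System.out.println(addFive.apply(3));        // prints 8
    }
}
```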
Operations in Spark
Spark Configuration and the Spark Context
Configuring Spark properties
RDD Operations - Transformations and Actions
Transformations: map, flatMap, filter, distinct, sample, mapPartitions, mapPartitionsWithIndex,
repartition, coalesce, glom, cartesian, pipe
Actions: reduce, collect, take
Joining two RDDs
Storage levels supported in Spark
Programming at the partition level and using custom partitioners
Accumulators and Broadcast variables
Checkpointing an RDD
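The core RDD transformations and actions have close counterparts in Java 8 streams, so a standalone sketch (no Spark required) can preview their semantics; this is an analogy, not Spark API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RddPreview {
    // flatMap analogue: one input line expands to many words
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    // filter + distinct + map analogue, chained like RDD transformations
    static List<String> longWords(List<String> words) {
        return words.stream()
                .filter(w -> w.length() > 2)
                .distinct()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // reduce analogue (an "action" in Spark terms): total character count
    static int totalChars(List<String> words) {
        return words.stream().map(String::length).reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark makes big data", "data is big");
        List<String> w = words(lines);
        System.out.println(longWords(w)); // [SPARK, MAKES, BIG, DATA]
        System.out.println(totalChars(w)); // 26
    }
}
```

The key difference in Spark is that transformations are lazy and distributed across partitions; nothing executes until an action such as reduce or collect runs.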
Spark deployment modes
Spark History Server
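An accumulator is a variable that tasks may only add to, with the aggregate read back on the driver. A minimal JDK-only sketch of that semantics, using LongAdder and a parallel stream standing in for distributed tasks (illustrative names, no Spark dependency):

```java
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

public class AccumulatorSketch {
    // Returns {goodCount, badCount}; badRecords plays the role of
    // a Spark accumulator: tasks only add, the "driver" reads the sum
    static long[] process(List<String> records) {
        LongAdder badRecords = new LongAdder();
        long good = records.parallelStream()   // stands in for distributed tasks
                .filter(r -> {
                    if (r.isEmpty()) {
                        badRecords.increment(); // side-channel count, like acc.add(1)
                        return false;
                    }
                    return true;
                })
                .count();
        return new long[]{good, badRecords.sum()};
    }

    public static void main(String[] args) {
        long[] result = process(java.util.Arrays.asList("ok", "", "ok", "", "ok"));
        System.out.println(result[0] + " good, " + result[1] + " bad"); // 3 good, 2 bad
    }
}
```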
SparkSQL
The DataFrame Abstraction
Overview of Spark SQL
DataFrame manipulation on top of JSON data
The temporary-table abstraction on top of a DataFrame schema
SQL manipulation on top of Parquet files
Caching DataFrames
Connecting DataFrames to relational databases
Spark Streaming
Kafka and the need for it
Basic read from a socket
Advanced Topics
Spark SQL with Hive
The new Dataset API
Working with nested data
Spark with Alluxio
Custom Accumulators
Writing a custom RDD
Writing a custom partitioner
Internals of the persistence API: how Spark manages persistence internally
(drilling down into the source code)
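At its core, a custom partitioner maps a key to an index in [0, numPartitions). A standalone sketch of that logic, mirroring the non-negative-modulo-of-hashCode behavior of Spark's HashPartitioner (in a real job you would extend org.apache.spark.Partitioner; the class name here is illustrative):

```java
public class KeyHashPartitioner {
    private final int numPartitions;

    public KeyHashPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Non-negative modulo: Java's % can return negative values
    // for negative hash codes, which would be an invalid index
    static int nonNegativeMod(int x, int mod) {
        int r = x % mod;
        return r < 0 ? r + mod : r;
    }

    // Spark routes null keys to partition 0; otherwise hash the key
    public int getPartition(Object key) {
        return key == null ? 0 : nonNegativeMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        KeyHashPartitioner p = new KeyHashPartitioner(4);
        System.out.println(p.getPartition("IN")); // stable index in [0, 4)
        System.out.println(p.getPartition(-7));   // negative hash still lands in [0, 4)
    }
}
```

A domain-specific partitioner would replace the hash with its own rule (for example, routing keys by country code) while keeping the same contract: deterministic, and always within [0, numPartitions).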
Maven will be used as the build tool to download the dependencies, and IntelliJ IDEA will be
the IDE for developing the applications and examples.
Project: A live project showing how each of the APIs is used in industry.
[1] A CSV file with three hundred columns will be used as the dataset.
[2] Consuming and operating on two CSV files (3 MB each) that are produced every second,
through Spark Streaming.
[3] Ten to fifteen transformations in a single job, efficiently optimizing and fine-tuning all
of them.
[4] Architectural patterns for sharing data between Spark jobs.
Hands-on/Lecture Ratio:
The course is 60% hands-on and 40% discussion, with the longest discussion segments lasting 20 minutes.
Note to participants: