Spark Training - Java
Prerequisite:
Candidates attending the training should have basic knowledge of Java or Scala.
HDFS
Why HDFS?
HDFS Architecture
Using HDFS and the hdfs shell commands
Spark (the version covered is the latest release, Spark 1.6)
Scala - Introduction
Objects and Classes
val, var, functions, currying, implicits
traits, actors, and file manipulation
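Since the course targets Java developers, a rough Java analogue of Scala-style currying (using java.util.function.Function) may help preview the idea before the Scala introduction; the names here are illustrative only:

```java
import java.util.function.Function;

public class CurryDemo {
    // Curried addition: taking the first argument returns a function
    // that waits for the second (Scala: def add(a: Int)(b: Int) = a + b)
    static Function<Integer, Integer> add(int a) {
        return b -> a + b;
    }

    public static void main(String[] args) {
        Function<Integer, Integer> addFive = add(5); // partial application
        System.out.println(addFive.apply(3));        // prints 8
    }
}
```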
Operations in Spark
Spark Configuration and the Spark Context
Configuring Spark properties
RDD Operations - Transformations and Actions
Transformations: map, flatMap, filter, distinct, sample, mapPartitions, mapPartitionsWithIndex,
repartition, coalesce, glom, cartesian, pipe
Actions: reduce, collect, take
Joining two RDDs
Storage levels supported in Spark
Programming at the partition level and using custom partitioners
Accumulators and Broadcast variables
Checkpointing an RDD
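The core RDD transformations and actions have close counterparts in Java 8 streams, so a standalone sketch (no Spark required) can preview their semantics; this is an analogy, not Spark API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RddPreview {
    // flatMap analogue: one input line expands to many words
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    // filter + distinct + map analogue, chained like RDD transformations
    static List<String> longWords(List<String> words) {
        return words.stream()
                .filter(w -> w.length() > 2)
                .distinct()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // reduce analogue (an "action" in Spark terms): total character count
    static int totalChars(List<String> words) {
        return words.stream().map(String::length).reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark makes big data", "data is big");
        List<String> w = words(lines);
        System.out.println(longWords(w)); // [SPARK, MAKES, BIG, DATA]
        System.out.println(totalChars(w)); // 26
    }
}
```

The key difference in Spark is that transformations are lazy and distributed across partitions; nothing executes until an action such as reduce or collect runs.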
Spark deployment modes
Spark History Server
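An accumulator is a variable that tasks may only add to, with the aggregate read back on the driver. A minimal JDK-only sketch of that semantics, using LongAdder and a parallel stream standing in for distributed tasks (illustrative names, no Spark dependency):

```java
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

public class AccumulatorSketch {
    // Returns {goodCount, badCount}; badRecords plays the role of
    // a Spark accumulator: tasks only add, the "driver" reads the sum
    static long[] process(List<String> records) {
        LongAdder badRecords = new LongAdder();
        long good = records.parallelStream()   // stands in for distributed tasks
                .filter(r -> {
                    if (r.isEmpty()) {
                        badRecords.increment(); // side-channel count, like acc.add(1)
                        return false;
                    }
                    return true;
                })
                .count();
        return new long[]{good, badRecords.sum()};
    }

    public static void main(String[] args) {
        long[] result = process(java.util.Arrays.asList("ok", "", "ok", "", "ok"));
        System.out.println(result[0] + " good, " + result[1] + " bad"); // 3 good, 2 bad
    }
}
```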
SparkSQL
The DataFrame Abstraction
Overview of Spark SQL
DataFrame manipulation on top of JSON data
The temporary-table abstraction on top of a DataFrame schema
SQL manipulation on top of Parquet files
Caching DataFrames
Connecting DataFrames to relational databases
Spark Streaming
Kafka and the need for it
Basic read from a socket
Advanced Topics
Spark SQL with Hive
The new Dataset API
Working with nested data
Spark with Alluxio
Custom Accumulators
Writing a custom RDD
Writing a custom partitioner
Internals of the persistence API: how Spark manages persistence internally
(drilling down into the source code)
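At its core, a custom partitioner maps a key to an index in [0, numPartitions). A standalone sketch of that logic, mirroring the non-negative-modulo-of-hashCode behavior of Spark's HashPartitioner (in a real job you would extend org.apache.spark.Partitioner; the class name here is illustrative):

```java
public class KeyHashPartitioner {
    private final int numPartitions;

    public KeyHashPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Non-negative modulo: Java's % can return negative values
    // for negative hash codes, which would be an invalid index
    static int nonNegativeMod(int x, int mod) {
        int r = x % mod;
        return r < 0 ? r + mod : r;
    }

    // Spark routes null keys to partition 0; otherwise hash the key
    public int getPartition(Object key) {
        return key == null ? 0 : nonNegativeMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        KeyHashPartitioner p = new KeyHashPartitioner(4);
        System.out.println(p.getPartition("IN")); // stable index in [0, 4)
        System.out.println(p.getPartition(-7));   // negative hash still lands in [0, 4)
    }
}
```

A domain-specific partitioner would replace the hash with its own rule (for example, routing keys by country code) while keeping the same contract: deterministic, and always within [0, numPartitions).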
Maven will be used as the build tool to download the dependencies, and IntelliJ IDEA will be
the IDE for developing the applications and examples.
Project: A live project showing how each of the APIs is used in industry.
[1] A CSV file with three hundred columns will be used as the dataset.
[2] Consuming and operating on two CSV files (3 MB each) that are produced every second,
through Spark Streaming.
[3] Ten to fifteen transformations in a single job, efficiently optimizing and fine-tuning all
of them.
[4] Architectural patterns for sharing data between Spark jobs.
Hands-on/Lecture Ratio:
The course is 60% hands-on and 40% discussion, with the longest discussion segments lasting 20 minutes.
Note to participants: