Spark

The document provides an introduction to Apache Spark including its goals, architecture, and key features like RDDs. Spark is a fast, general-purpose cluster computing system that allows processing of batch, streaming, and interactive data across clusters in memory for improved performance over Hadoop.

Uploaded by

Madhavi Kareddy

Introduction to Apache Spark

Certified Apache Spark and Scala Training – DataFlair


Agenda
• Before Spark
• Need for Spark
• What is Apache Spark?
• Goals
• Why Spark?
• RDD & its Operations
• Features of Spark


Before Spark

[Figure: Batch Processing, Stream Processing, Interactive Processing, Graph Processing, Machine Learning]


Need For Spark

• Need for a powerful engine that can process data in real time (streaming) as well as in batch mode
• Need for a powerful engine that can respond in under a second and perform in-memory analytics
• Need for a powerful engine that can handle diverse workloads:
– Batch
– Streaming
– Interactive
– Graph
– Machine Learning


What is Apache Spark?

Apache Spark is a powerful open-source engine that can handle:

– Batch processing
– Real-time (stream) processing
– Interactive processing
– Graph processing
– Machine learning (iterative)
– In-memory processing


Introduction to Apache Spark

• Lightning-fast cluster computing tool
• General-purpose distributed system
• Provides APIs in Scala, Java, Python, and R


History
2009: Introduced by UC Berkeley
2010: Open sourced
2013: Donated to Apache
2014: Became a top-level project; set a world record in sorting
2015: Most active project at Apache


Sort Record
                          Hadoop MapReduce     Spark
Data size                 102.5 TB             100 TB
Time taken                72 min               23 min
Number of nodes           2100                 206
Number of cores           50400 physical       6592 virtualized
Cluster disk throughput   3150 GB/s            618 GB/s
Network                   Dedicated 10 Gbps    Virtualized 10 Gbps

Src: Databricks

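To read the sort-record figures as rates rather than raw times, the arithmetic can be sketched in Python. The numbers are the ones reported above; the variable and function names are illustrative only, and the per-node comparison ignores the hardware differences (physical versus virtualized cores) between the two clusters.

```python
# Figures from the sort-record comparison above (source: Databricks).
hadoop = {"data_tb": 102.5, "minutes": 72, "nodes": 2100}
spark = {"data_tb": 100.0, "minutes": 23, "nodes": 206}


def sort_rate(run):
    """Cluster-wide sort rate in TB per minute."""
    return run["data_tb"] / run["minutes"]


def per_node_rate(run):
    """Sort rate per node in TB per minute."""
    return sort_rate(run) / run["nodes"]


cluster_speedup = sort_rate(spark) / sort_rate(hadoop)
node_ratio = hadoop["nodes"] / spark["nodes"]
per_node_speedup = per_node_rate(spark) / per_node_rate(hadoop)

print(f"Cluster-wide sort rate: {cluster_speedup:.1f}x higher for Spark")
print(f"Nodes used: {node_ratio:.1f}x fewer for Spark")
print(f"Per-node sort rate: {per_node_speedup:.1f}x higher for Spark")
```

On these numbers, Spark sorted about 3x faster while using about 10x fewer nodes.
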
Goals
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with the existing open-source ecosystem

[Figure: "One stack to rule them all": batch, interactive, and streaming]


Why Spark?
• Up to 100x faster than Hadoop MapReduce for in-memory workloads.
• In-memory computation.
• Language support: Scala, Java, Python, and R.
• Supports real-time and batch processing.
• Lazy operations: the job is optimized before execution.
• Supports multiple transformations and actions.
• Compatible with Hadoop; can process existing Hadoop data.

[Figure: disk-based processing goes through disk around every step (Disk, Operation 1, Disk, Operation 2, Disk, …, Operation n, Disk); in-memory computation reads from disk once, chains Operation 1, Operation 2, …, Operation n in memory, and writes to disk at the end]
[Figure: Input data stream → Spark Streaming → Batches of input data → Spark Engine → Batches of processed data]
[Figure: RDD1 → map() (Transformation 1) → RDD2 → filter() (Transformation 2) → RDD3 → collect() (Action) → Result]
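The streaming flow above (an input data stream cut into batches of input data, which the engine turns into batches of processed data) can be sketched in plain Python. This is a toy model, not Spark's API: `micro_batches`, `process_batch`, and `run` are hypothetical names, and the sketch cuts batches by record count, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut a (possibly unbounded) stream into small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch: List[int]) -> int:
    """Ordinary batch logic, applied unchanged to each micro-batch."""
    return sum(batch)


def run(stream: Iterable[int], batch_size: int) -> List[int]:
    """Feed every micro-batch through the same batch function."""
    return [process_batch(b) for b in micro_batches(stream, batch_size)]


results = run(range(1, 11), batch_size=4)
print(results)  # [10, 26, 19]
```

The point of the design is that the same batch function serves both modes: a streaming job is a sequence of small batch jobs.
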


Spark Architecture


Spark Nodes
[Figure: Nodes are either the Master Node (runs the Master) or Slave Nodes (run Workers)]


Basic Spark Architecture

[Figure: Work is divided into many Sub Work units]


Resilient Distributed Dataset (RDD)
• An RDD is a simple, immutable collection of objects.
• An RDD can contain any type of Scala, Java, Python, or R object.
• Each RDD is split into partitions, which may be computed on different nodes of the cluster.

[Figure: an RDD containing objects Obj1, Obj2, Obj3, ..., Obj n]
[Figure: an RDD split into Partition1 to Partition6]
[Figure: Create RDD: Employee-data.txt, stored as blocks B1 to B12 on a Hadoop cluster, becomes an RDD with Partition-1 to Partition-5, ...]

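The idea that an RDD is split into partitions that can be computed independently can be sketched with the Python standard library. This is a toy model under stated assumptions: `partition` and `count_chars` are hypothetical helpers, the round-robin split is chosen only for simplicity, threads stand in for cluster nodes, and the records are made-up placeholders rather than real employee data.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def partition(records: List[str], num_partitions: int) -> List[List[str]]:
    """Round-robin split of the records into partitions (illustrative only)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_chars(part: List[str]) -> int:
    """Work done on one partition, independent of all the others."""
    return sum(len(record) for record in part)


# Placeholder records standing in for the lines of an input file.
records = [f"employee-{i}" for i in range(12)]
parts = partition(records, num_partitions=5)

# Each partition is processed separately; threads stand in for worker nodes.
with ThreadPoolExecutor(max_workers=5) as pool:
    partial_results = list(pool.map(count_chars, parts))

total = sum(partial_results)  # combine the per-partition results
print(len(parts), total)
```
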
RDD Operations
[Figure: RDD operations are grouped into Transformations, Actions, and Persistence]


RDD Operations – Transformation
Transformation:
• A set of operations that define how an RDD should be transformed
• Creates a new RDD from the existing one to process the data
• Lazy evaluation: computation doesn’t start until an action is invoked
• E.g. map, flatMap, filter, union, groupBy, etc.
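Lazy evaluation can be illustrated with a small pure-Python model. `ToyRDD` is a hypothetical class written for this sketch, not Spark's RDD API: its `map` and `filter` only record what should happen and return a new object, and nothing runs until the `collect` action is called.

```python
from typing import Any, Callable, Iterable, List, Tuple


class ToyRDD:
    """Toy stand-in for an RDD: never modified in place, lazy until an action runs."""

    def __init__(self, source: Iterable[Any],
                 ops: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = list(source)
        self._ops = ops  # recorded transformations, not yet applied

    # Transformations: return a NEW ToyRDD and do no work.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    # Action: replays the recorded transformations and returns the result.
    def collect(self) -> List[Any]:
        data = self._source
        for kind, fn in self._ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def double(x):
    calls.append(x)  # records when the function actually runs
    return x * 2


rdd1 = ToyRDD([1, 2, 3, 4])
rdd2 = rdd1.map(double)               # transformation: nothing runs yet
rdd3 = rdd2.filter(lambda x: x > 4)   # transformation: nothing runs yet
calls_before_action = len(calls)      # 0, no computation so far
result = rdd3.collect()               # action: triggers the computation
print(calls_before_action, result)    # 0 [6, 8]
```
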


RDD Operations – Action
Action:
• Triggers job execution.
• Returns the result to the driver or writes it to storage.
• E.g. count, collect, reduce, take, etc.


RDD Operations – Persistence
Persistence:
• Spark allows caching (persisting) an entire dataset in memory
• A cached RDD stays in memory for future operations

[Figure: Cache, Primary Storage]
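The benefit of persistence can be shown with a toy model: without caching, every action rebuilds the data from scratch; with caching, the data is built once and reused. `ToyDataset` and `expensive_load` are hypothetical names for this sketch and are not part of Spark.

```python
class ToyDataset:
    """Toy dataset that is recomputed on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data from scratch
        self._cached = None
        self._use_cache = False

    def cache(self):
        """Mark the dataset for caching; it is stored on the first action."""
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    # Two actions that both need the full data.
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


runs = {"n": 0}


def expensive_load():
    runs["n"] += 1  # count how often the data is rebuilt
    return [x * x for x in range(5)]


uncached = ToyDataset(expensive_load)
uncached.count()
uncached.total()
runs_without_cache = runs["n"]  # rebuilt once per action

runs["n"] = 0
cached = ToyDataset(expensive_load).cache()
cached_count = cached.count()
cached_total = cached.total()
runs_with_cache = runs["n"]     # built once, then reused

print(runs_without_cache, runs_with_cache)  # 2 1
```
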


RDD Operations
Parent RDD
→ Transformations (map(), flatMap(), …): create a new RDD based on custom business logic
→ RDD (RDD lineage)
→ Actions (saveAsTextFile(), count(), …): return output to the driver or export data to a storage system after computation
→ Result
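Lineage can be pictured as each dataset remembering its parent and the transformation that produced it, so a result can always be rebuilt by replaying the chain from the source. The sketch below is a toy model under that assumption; `Node` and its fields are hypothetical and do not mirror Spark's internal classes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Node:
    """One dataset in a lineage chain: a source, or a parent plus a transformation."""
    name: str
    parent: Optional["Node"] = None
    op: Optional[Callable[[List[int]], List[int]]] = None
    source: Optional[tuple] = None

    def lineage(self) -> List[str]:
        """Names of all ancestors, oldest first, ending with this node."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.name]

    def compute(self) -> List[int]:
        """Rebuild this dataset by replaying the chain from the source."""
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent.compute())


parent = Node("parent", source=(1, 2, 3))
mapped = Node("map", parent=parent, op=lambda xs: [x + 1 for x in xs])
filtered = Node("filter", parent=mapped,
                op=lambda xs: [x for x in xs if x % 2 == 0])

print(filtered.lineage())  # ['parent', 'map', 'filter']
print(filtered.compute())  # [2, 4]
```

Because nothing is modified in place, calling `compute()` again gives the same result, which is the property that makes recomputation after a failure safe.
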


Features of Spark
• Speed: up to 100x faster than Hadoop
• Duplicate elimination: processes every record exactly once
• Memory management: automatic memory management
• Diverse processing: diverse processing platform
• Fault tolerance: recovers automatically
• Window criteria: time-based window criteria


Thank You
DataFlair
