Lightning-fast cluster computing
Rahul Kavale(rahulkav@thoughtworks.com)
Unmesh Joshi(uvjoshi@thoughtworks.com)
Some properties of “Big Data”
• Big data is inherently immutable, meaning it is not supposed to be updated once generated.
• Writes are mostly coarse-grained operations.
• Commodity hardware makes more sense for storing and processing such enormous data, hence the data is distributed across a cluster of many such machines.
• The distributed nature makes the programming complicated.
Brush-up on Hadoop concepts
Distributed Storage => HDFS
Cluster Manager => YARN
Fault tolerance => achieved via replication
Job scheduling => Scheduler in YARN
Mapper
Reducer
Combiner
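The roles of the Mapper, Combiner and Reducer can be sketched on a single machine with plain Scala collections. This is only a model of the phases under stated assumptions, not Hadoop's API: the object `MiniMapReduce` and its method names are hypothetical, and the "shuffle" is simply a `groupBy` in memory.

```scala
// A single-machine model of the MapReduce phases; not Hadoop's API.
object MiniMapReduce {
  // Mapper: emit a (word, 1) pair for every word in a line.
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Combiner: pre-aggregate one mapper's output locally to shrink the shuffle.
  def combiner(pairs: Seq[(String, Int)]): Seq[(String, Int)] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }.toSeq

  // Reducer: sum all counts that arrive for one key.
  def reducer(word: String, counts: Seq[Int]): (String, Int) = (word, counts.sum)

  def run(lines: Seq[String]): Map[String, Int] = {
    // Map and combine each line as if it were a separate input split.
    val combined = lines.map(mapper).map(combiner)
    // Shuffle: bring all pairs with the same key together.
    val shuffled = combined.flatten.groupBy(_._1)
    // Reduce: one call per key.
    shuffled.map { case (word, ps) => reducer(word, ps.map(_._2)) }
  }
}
```

Calling `MiniMapReduce.run(Seq("a b a", "b c"))` counts each word across both lines; in real Hadoop each phase would run on different machines with disk and network IO in between.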
http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif
Map Reduce Programming Model
https://twitter.com/francesc/status/507942534388011008
http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop
http://www.slideshare.net/JimArgeropoulos/hadoop-101-32661121
MapReduce pain points
• Considerable latency
• Only Map and Reduce phases
• Non-trivial to test
• Results in complex workflows
• Not suitable for iterative processing
Immutability and MapReduce model
• HDFS storage is immutable or append-only.
• The MapReduce model fails to exploit the immutable nature of the data.
• Intermediate results are persisted to disk, resulting in a huge amount of IO and a serious performance hit.
Wouldn’t it be very nice if we could have
• Low latency
• Programmer friendly programming model
• Unified ecosystem
• Fault tolerance and other typical distributed system properties
• Easily testable code
• Of course open source :)
What is Apache Spark
• Cluster computing Engine
• Abstracts the storage and cluster management
• Unified interfaces to data
• API in Scala, Python, Java, R*
Where does it fit in the existing Big Data ecosystem
http://www.kdnuggets.com/2014/06/yarn-all-rage-hadoop-summit.html
Why should you care about Apache Spark
• Abstracts underlying storage
• Abstracts cluster management
• Easy programming model
• Very easy to test the code
• Highly performant
• Petabyte sort record
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
• Offers in memory caching of data
• Specialized Applications
• GraphX for graph processing
• Spark Streaming
• MLlib for machine learning
• Spark SQL
• Data exploration via Spark-Shell
Programming model for Apache Spark
Word Count example
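// sc is the SparkContext (the Spark shell creates one for you)
val file = sc.textFile("input path")
// split each line into words, pair each word with 1, then sum the 1s per word
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
counts.saveAsTextFile("destination path")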
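The word count pipeline can also be run end to end without a cluster. The following is a minimal sketch, assuming Spark 2.x or later is on the classpath; the object name `WordCount`, the helper `countWords`, the `local[2]` master and the sample sentences are illustrative choices, not part of the original talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {
  // The slide's pipeline, factored out so it can run on any RDD of lines.
  // The extra filter drops empty strings produced by repeated spaces.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(line => line.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark inside this JVM with two worker threads (demo assumption).
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("to be or not to be", "to see or not to see"))
      // collect() is the action that triggers the computation.
      countWords(lines).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Keeping the transformation logic in a function that takes and returns an RDD is one way to make it easy to exercise from a unit test with a local SparkContext.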
Comparing example with MapReduce
Spark Shell Demo
• SparkContext
• RDD
• RDD operations
RDD
• RDD stands for Resilient Distributed Dataset.
• The basic abstraction in Spark.
• Equivalent to a distributed collection.
• The interface makes the distributed nature of the underlying data transparent.
• RDDs are immutable.
• Can be created by:
  • parallelising a collection,
  • transforming an existing RDD by applying a transformation function,
  • reading from a persistent data store like HDFS.
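The three ways of creating an RDD listed above might look as follows. This is a sketch assuming Spark on the classpath and an already created `SparkContext`; the object `RddCreation`, its method names and the `path` argument are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object RddCreation {
  // 1. Parallelise a local collection.
  def fromCollection(sc: SparkContext): RDD[Int] = sc.parallelize(1 to 10)

  // 2. Transform an existing RDD. The parent is left untouched because RDDs
  //    are immutable; map returns a new RDD.
  def squares(numbers: RDD[Int]): RDD[Int] = numbers.map(n => n * n)

  // 3. Read from a persistent store. The path is a placeholder supplied by the
  //    caller, for example an HDFS URI or a local file.
  def fromStorage(sc: SparkContext, path: String): RDD[String] = sc.textFile(path)
}
```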
RDD is lazily evaluated
RDD has two types of operations
• Transformations
Build a DAG of transformations to be applied on the RDD
Do not evaluate anything
• Actions
Evaluate the DAG of transformations
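Laziness can be observed by counting how many elements a transformation has actually touched. The sketch below assumes Spark 2.x or later (for `longAccumulator`) and a local `SparkContext`; the object `LazyEvaluation` and the accumulator name are illustrative. Accumulator updates made inside transformations can be repeated if tasks are retried, so this is a demonstration device rather than a reliable metric.

```scala
import org.apache.spark.SparkContext

object LazyEvaluation {
  // Returns (elements evaluated before the action, result of count, elements evaluated after).
  def demo(sc: SparkContext): (Long, Long, Long) = {
    val evaluated = sc.longAccumulator("evaluated elements")

    // Transformation: only recorded in the DAG, nothing runs yet.
    val doubled = sc.parallelize(1 to 5).map { n => evaluated.add(1); n * 2 }
    val before = evaluated.value.longValue

    // Action: triggers evaluation of the whole DAG.
    val total = doubled.count()
    val after = evaluated.value.longValue

    (before, total, after)
  }
}
```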
RDD operations
Transformations
map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]
filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]
flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]
sample(fraction : Float) : RDD[T] ⇒ RDD[T] (Deterministic sampling)
union() : (RDD[T],RDD[T]) ⇒ RDD[T]
join() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
reduceByKey(f : (V,V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
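A few of the transformations in the table, applied to small in-memory data sets. This is a sketch assuming Spark on the classpath and an existing `SparkContext`; the object `Transformations` and the sample fruit data are made up for illustration. The final `collect()` calls are actions, used here only so the results can be inspected.

```scala
import org.apache.spark.SparkContext

object Transformations {
  def demo(sc: SparkContext): (Seq[Int], Map[String, Int], Map[String, (Int, String)]) = {
    val a = sc.parallelize(Seq(1, 2, 3, 4))
    val b = sc.parallelize(Seq(5, 6))
    // union, filter and map each return a new RDD.
    val evensTimesTen = a.union(b).filter(_ % 2 == 0).map(_ * 10)

    // Pair RDDs (RDDs of key/value tuples) gain the byKey operations and join.
    val sales = sc.parallelize(Seq(("apple", 3), ("pear", 1), ("apple", 2)))
    val totals = sales.reduceByKey(_ + _)
    val origin = sc.parallelize(Seq(("apple", "NZ"), ("pear", "IT")))
    val joined = totals.join(origin)

    (evensTimesTen.collect().sorted.toSeq, totals.collect().toMap, joined.collect().toMap)
  }
}
```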
Actions
count() : RDD[T] ⇒ Long
collect() : RDD[T] ⇒ Seq[T]
reduce(f : (T,T) ⇒ T) : RDD[T] ⇒ T
lookup(k : K) : RDD[(K, V)] ⇒ Seq[V] (On hash/range partitioned RDDs)
save(path : String) : Outputs RDD to a storage system, e.g., HDFS
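The actions above return plain values to the driver program. A short sketch, assuming Spark on the classpath and an existing `SparkContext`; the object `Actions` and the sample data are illustrative. In the Scala API, writing text output is done with `saveAsTextFile`, as in the word count example; it is left out here to avoid touching the file system.

```scala
import org.apache.spark.SparkContext

object Actions {
  // Returns (count, sum via reduce, values found for key "a").
  def demo(sc: SparkContext): (Long, Int, Seq[Int]) = {
    val nums = sc.parallelize(Seq(1, 2, 3, 4))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val howMany = nums.count()            // number of elements
    val sum = nums.reduce(_ + _)          // combine all elements into one value
    val forA = pairs.lookup("a").sorted   // all values stored under one key

    (howMany, sum, forA)
  }
}
```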
Job Execution
Spark Execution in Context of YARN
http://kb.cnblogs.com/page/198414/
Fault tolerance via lineage
MappedRDD
FilteredRDD
FlatMappedRDD
MappedRDD
HadoopRDD
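The lineage of an RDD can be printed with `toDebugString`. A small sketch, assuming Spark on the classpath and an existing `SparkContext`; the object `Lineage` is hypothetical. The exact class names in the output depend on the Spark version, and newer releases typically report `MapPartitionsRDD` instead of the per-operation classes named on the slide.

```scala
import org.apache.spark.SparkContext

object Lineage {
  // Builds a short chain of transformations and returns its lineage description.
  def describe(sc: SparkContext): String = {
    val words = sc.parallelize(Seq("a b", "c d"))
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(_.toUpperCase)
    // No action is needed: the lineage is known as soon as the DAG is built.
    words.toDebugString
  }
}
```

If a partition is lost, Spark can walk this chain back to the source and recompute only the missing partition rather than relying on replicated copies.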
Testing
Why is Spark more performant than MapReduce
Reduced IO
• No disk IO between phases since phases themselves are pipelined
• No network IO involved unless a shuffle is required
No Mandatory Shuffle
• Programs are not bound to map and reduce phases
• No mandatory shuffle and sort
In memory caching of data
• Optional in-memory caching
• The DAG engine can apply certain optimisations because, when an action is called, it knows all the transformations that have to be applied
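The effect of optional caching can be sketched by running two actions on the same RDD and counting how often the transformation executes. This assumes Spark 2.x or later, a local `SparkContext` and data small enough to stay in memory; the object `Caching` and the accumulator name are illustrative. Caching is best effort, so evicted partitions would be recomputed from lineage.

```scala
import org.apache.spark.SparkContext

object Caching {
  // Returns the number of map invocations seen after the first and second action.
  def demo(sc: SparkContext): (Long, Long) = {
    val computed = sc.longAccumulator("computed elements")

    // cache() marks the RDD to be kept in memory once it has been computed.
    val lengths = sc.parallelize(1 to 100)
      .map { n => computed.add(1); n.toString.length }
      .cache()

    lengths.count()                         // first action: computes and caches
    val afterFirstAction = computed.value.longValue

    lengths.reduce(_ + _)                   // second action: served from the cache
    val afterSecondAction = computed.value.longValue

    (afterFirstAction, afterSecondAction)
  }
}
```

Without the `cache()` call the second action would re-run the `map` over all elements, which is the situation iterative algorithms run into on MapReduce.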
Questions?
Thank You!

Scrap Your MapReduce - Apache Spark