Spark

The document provides an introduction to Apache Spark including its goals, architecture, and key features like RDDs. Spark is a fast, general-purpose cluster computing system that allows processing of batch, streaming, and interactive data across clusters in memory for improved performance over Hadoop.

Uploaded by Madhavi Kareddy

Introduction to Apache Spark

Certified Apache Spark and Scala Training – DataFlair

Agenda
• Before Spark
• Need for Spark
• What is Apache Spark?
• Goals
• Why Spark?
• RDD & its Operations
• Features of Spark


Before Spark

Each type of workload was handled separately:
• Batch Processing
• Stream Processing
• Interactive Processing
• Graph Processing
• Machine Learning

Need For Spark

• Need for a powerful engine that can process data in real time (streaming) as well as in batch mode
• Need for a powerful engine that can respond in sub-second time and perform in-memory analytics
• Need for a powerful engine that can handle diverse workloads:
– Batch
– Streaming
– Interactive
– Graph
– Machine learning

What is Apache Spark?

Apache Spark is a powerful open-source engine that can handle:

– Batch processing
– Real-time (stream) processing
– Interactive processing
– Graph processing
– Machine learning (iterative) processing
– In-memory computation

Introduction to Apache Spark

• Lightning-fast cluster computing tool
• General-purpose distributed system
• Provides APIs in Scala, Java, Python, and R

History
• 2009: Introduced by UC Berkeley
• 2010: Open sourced
• 2013: Donated to Apache
• 2014: Became a top-level Apache project; set a world record in sorting
• 2015: Most active project at Apache

Sort Record
                          Hadoop MapReduce    Spark
Data size                 102.5 TB            100 TB
Time taken                72 min              23 min
Number of nodes           2100                206
Number of cores           50400 physical      6592 virtualized
Cluster disk throughput   3150 GB/s           618 GB/s
Network                   Dedicated 10 Gbps   Virtualized 10 Gbps

Source: Databricks

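The figures in the sort record table can be turned into rough throughput numbers. The sketch below uses only the values quoted in the table and assumes 1 TB = 1000 GB; it is back-of-the-envelope arithmetic, not an official benchmark calculation.

```python
# Values copied from the sort record table; units: TB, minutes, nodes.
hadoop = {"data_tb": 102.5, "minutes": 72, "nodes": 2100}
spark = {"data_tb": 100.0, "minutes": 23, "nodes": 206}


def tb_per_minute(run):
    """Aggregate sort rate of the whole cluster."""
    return run["data_tb"] / run["minutes"]


def gb_per_minute_per_node(run):
    """Sort rate per node, assuming 1 TB = 1000 GB."""
    return tb_per_minute(run) * 1000 / run["nodes"]


speedup = tb_per_minute(spark) / tb_per_minute(hadoop)
node_ratio = hadoop["nodes"] / spark["nodes"]

print(f"Hadoop MapReduce: {tb_per_minute(hadoop):.2f} TB/min")
print(f"Spark:            {tb_per_minute(spark):.2f} TB/min")
print(f"Cluster speedup:  {speedup:.1f}x on {node_ratio:.1f}x fewer nodes")
```

On these numbers Spark sorted at roughly three times the rate of the Hadoop MapReduce run while using roughly one tenth of the nodes.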
Goals
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with the existing open-source ecosystem

[Figure: "One stack to rule them all", combining batch, interactive, and streaming.]

Why Spark?
• Up to 100x faster than Hadoop MapReduce for in-memory workloads.
• In-memory computation.
• Language support for Scala, Java, Python, and R.
• Supports both real-time and batch processing.

[Figure: Hadoop MapReduce reads from and writes to disk around every step (Disk → Operation 1 → Disk → Operation 2 → Disk → … → Operation n → Disk); Spark chains Operation 1 → Operation 2 → … → Operation n in memory and touches disk only at the start and end.]

[Figure: Input data stream → Spark Streaming → batches of input data → Spark Engine → batches of processed data.]
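The streaming figure describes a micro-batch model: the input stream is cut into small batches, and each batch is handed to the same engine that runs batch jobs. A toy sketch of that idea in plain Python follows. `micro_batches` and `process_batch` are hypothetical names used for illustration, not the Spark Streaming API, and batches here are cut by record count rather than by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterator into lists of up to batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Stand-in for the batch engine: count the words in one batch of lines."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


input_stream = ["spark streaming", "spark batch", "batch", "streaming spark", "spark"]
results = [process_batch(b) for b in micro_batches(input_stream, batch_size=2)]
for batch_result in results:
    print(batch_result)
```

The same `process_batch` function would work unchanged on a complete dataset, which is the point of combining streaming and batch in one engine.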

Why Spark?
• Lazy operations: the job is optimized before execution.
• Support for multiple transformations and actions.

[Figure: RDD1 → Transformation 1, map() → RDD2 → Transformation 2, filter() → RDD3 → Action, collect() → Result.]
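The map() → filter() → collect() chain in the figure can be imitated in plain Python to show what "lazy" means. `LazyDataset` below is a hypothetical toy class, not the Spark RDD API: its transformations only record a plan, and the plan runs when the action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source   # original data, never modified
        self._steps = steps     # tuple of ("map" | "filter", function)

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded plan executed.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def double(x):
    calls.append(x)   # record every real invocation
    return x * 2


rdd1 = LazyDataset([1, 2, 3, 4, 5])
rdd2 = rdd1.map(double)                # Transformation 1
rdd3 = rdd2.filter(lambda x: x > 4)    # Transformation 2
calls_before_action = list(calls)      # still empty: nothing has run
result = rdd3.collect()                # Action
print(calls_before_action, result)
```

Because each transformation returns a new object, `rdd1` and `rdd2` stay usable and unchanged, which mirrors the immutability of RDDs.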

Spark Architecture

Spark Nodes
• Master node: runs the Master
• Slave nodes: run the Workers

Basic Spark Architecture

[Figure: one unit of Work is divided into many Sub Work units, which are spread across the nodes.]

Resilient Distributed Dataset (RDD)
• An RDD is a simple, immutable collection of objects.
• An RDD can contain any type of Scala, Java, Python, or R object.
• Each RDD is split into partitions, which may be computed on different nodes of the cluster.

[Figure: an RDD holding objects Obj1, Obj2, Obj3, …, Obj n, divided into Partition1 through Partition6.]
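Partitioning can be sketched without Spark: split a collection into chunks, compute on each chunk independently, then combine the partial results. In the minimal sketch below, `split_into_partitions` and `sum_partition` are hypothetical helpers, and threads stand in for cluster nodes, which is a simplifying assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_partition(partition):
    """Work done independently on one partition (one 'node')."""
    return sum(partition)


data = list(range(1, 21))                      # the numbers 1..20
partitions = split_into_partitions(data, 6)    # six partitions, as in the figure
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum_partition, partitions))
total = sum(partial_sums)

print(partitions)
print(partial_sums, total)
```

No partition needs data from another one to do its share of the work, which is what allows the partitions to be placed on different nodes.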

Resilient Distributed Dataset (RDD)

[Figure: Employee-data.txt is stored in the Hadoop cluster as blocks B1 to B12; "Create RDD" produces an RDD whose partitions (Partition-1, Partition-2, Partition-3, Partition-4, Partition-5, ...) are built from those blocks.]

RDD Operations
• Transformations
• Actions
• Persistence

RDD Operations – Transformation
Transformation:
• A set of operations that define how an RDD should be transformed
• Creates a new RDD from an existing one to process the data
• Lazy evaluation: computation does not start until an action is called
• E.g. map, flatMap, filter, union, groupBy, etc.
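What each of the listed transformations does to a collection can be shown with ordinary Python lists. This is only an analogy for the semantics: the list operations below run eagerly, whereas Spark's transformations are lazy, and none of this is the Spark API.

```python
from itertools import chain

lines = ["spark is fast", "spark is general"]

# map: exactly one output element per input element
lengths = [len(line) for line in lines]

# flatMap: each input element yields zero or more output elements, flattened
words = list(chain.from_iterable(line.split() for line in lines))

# filter: keep only the elements that satisfy a predicate
long_words = [w for w in words if len(w) > 2]

# union: concatenate two collections (duplicates are kept)
combined = long_words + ["hadoop"]

# groupBy: collect elements under a key computed from each element
by_initial = {}
for w in combined:
    by_initial.setdefault(w[0], []).append(w)

print(lengths)
print(words)
print(by_initial)
```

Every step builds a new collection and leaves its input untouched, matching the bullet that a transformation creates a new RDD from an existing one.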

RDD Operations – Action
Action:
• Triggers job execution.
• Returns the result to the driver or writes it to storage.
• E.g. count, collect, reduce, take, etc.
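The listed actions have close analogues for a Python generator, which also computes nothing until it is consumed. The sketch below is an analogy in plain Python, not the Spark API; `squares` is a hypothetical helper that stands in for a lazily defined dataset.

```python
from functools import reduce
from itertools import islice


def squares():
    """A lazily evaluated 'dataset': nothing is computed until it is consumed."""
    return (n * n for n in range(1, 6))   # 1, 4, 9, 16, 25


count = sum(1 for _ in squares())               # count: number of elements
collected = list(squares())                     # collect: bring all elements to the caller
total = reduce(lambda a, b: a + b, squares())   # reduce: combine with a binary function
first_two = list(islice(squares(), 2))          # take(2): only the first two elements

print(count, collected, total, first_two)
```

Each of the four lines forces the generator to run, in the same way that an action triggers execution of the pending transformations.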

RDD Operations – Persistence
Persistence:
• Spark allows caching (persisting) an entire dataset in memory
• A cached RDD is kept in memory and reused by future operations

[Figure: the cache is held in primary storage (memory).]
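The benefit of persistence is that a dataset used more than once is computed only once. The toy class below illustrates this; `CachedDataset` and `expensive` are hypothetical names, and the class only imitates the idea behind caching an RDD rather than reproducing Spark's implementation.

```python
class CachedDataset:
    """Toy dataset that recomputes on every use unless cache() was requested."""

    def __init__(self, compute):
        self._compute = compute   # function that (re)builds the data
        self._persist = False
        self._cache = None

    def cache(self):
        # Mark the dataset for caching; nothing is computed yet.
        self._persist = True
        return self

    def collect(self):
        if self._cache is not None:
            return self._cache    # served from memory
        data = self._compute()
        if self._persist:
            self._cache = data
        return data


runs = {"n": 0}


def expensive():
    runs["n"] += 1                # count how often the data is really built
    return [x * x for x in range(5)]


uncached = CachedDataset(expensive)
uncached.collect()
uncached.collect()
runs_without_cache = runs["n"]

runs["n"] = 0
cached = CachedDataset(expensive).cache()
cached.collect()
cached.collect()
runs_with_cache = runs["n"]

print(runs_without_cache, runs_with_cache)
```

Without caching the data is rebuilt on each use; with caching the second use is served from memory.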

RDD Operations
• Transformations (map(), flatMap()…): applied to a parent RDD, they create a new RDD based on custom business logic. The chain of RDDs forms the RDD lineage.
• Actions (saveAsTextFile(), count()…): return output to the driver or export data to a storage system after computation, producing the result.
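Lineage means each dataset remembers which parent it came from and which function produced it, so a result can be rebuilt by replaying that chain. The sketch below models this in plain Python; `LineageNode` is a hypothetical class for illustration and says nothing about how Spark stores lineage internally.

```python
class LineageNode:
    """Toy dataset that remembers its parent and the function that derives it."""

    def __init__(self, name, parent=None, fn=None, source=None):
        self.name = name
        self.parent = parent    # upstream dataset, or None for the root
        self.fn = fn            # transformation applied to the parent's data
        self.source = source    # raw data, only set on the root

    def compute(self):
        """Rebuild this dataset from scratch by replaying its lineage."""
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.compute())

    def lineage(self):
        """Describe the chain of datasets from this one back to the root."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " <- ".join(names)


parent = LineageNode("parent", source=["a b", "c"])
words = LineageNode("flatMap(split)", parent,
                    lambda rows: [w for row in rows for w in row.split()])
upper = LineageNode("map(upper)", words, lambda rows: [w.upper() for w in rows])

result = upper.compute()     # the "action": replays the whole chain
rebuilt = upper.compute()    # a lost result can be recomputed the same way
print(upper.lineage())
print(result)
```

Because only the recipe is stored, nothing has to be copied in advance to recover a lost result; recomputing from the parent is enough.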

Features of Spark
• Speed: 100x faster than Hadoop
• Duplicate elimination: processes every record exactly once
• Memory management: automatic memory management
• Processing: diverse processing platform
• Fault tolerance: recovers automatically
• Window criteria: time-based window criteria

Thank You
DataFlair

/c/DataFlairWS /DataFlairWS
