
INTRODUCTION TO APACHE SPARK

Presented by:
Krishnan Kaushik (1CR21CS082)
Kottakota Rishina (1CR21CS081)
Mayank Kumar Singh (1CR21CS095)

INTRODUCTION
•What is Apache Spark?
•Open-source distributed data processing framework
•Designed for large-scale data processing
•Developed at UC Berkeley's AMPLab
•Built to handle both batch and real-time data
WHY APACHE SPARK?
Key Features:
•Speed: In-memory computing makes it up to 100x faster than Hadoop's MapReduce for certain workloads
•Ease of Use: APIs available in Java, Python, Scala, and R
•Unified Engine: Supports batch processing, real-time streaming, SQL queries, machine learning, and graph processing
•Fault Tolerance: Ensures no data loss during failures through lineage and RDDs
DISTRIBUTED COMPUTING
•How Spark Works:
• Spark distributes data across clusters for parallel processing
• Jobs are broken into smaller tasks that run in parallel on multiple nodes
• This approach enables efficient large-scale data handling (see the sketch below)
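
A minimal PySpark sketch of this idea (local mode standing in for a cluster; the app name and the numbers are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "DistributedDemo")  # 4 local worker threads stand in for nodes

    # Distribute a collection across 4 partitions; each partition is
    # processed by a separate task, potentially on a different node.
    rdd = sc.parallelize(range(1, 1001), numSlices=4)
    print(rdd.getNumPartitions())           # 4
    print(rdd.map(lambda x: x * x).sum())   # per-partition partial results are combined at the end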
CORE COMPONENTS OF APACHE SPARK
Spark Core: The engine for distributed data processing
Spark SQL: For working with structured and semi-structured data using SQL queries
Spark Streaming: For real-time stream processing
MLlib: Machine learning library for distributed model building
GraphX: Graph computation engine
RESILIENT DISTRIBUTED DATASET (RDD)
What is RDD?
• Immutable distributed collection of objects
• Provides fault tolerance and parallelism
• Two types of operations (see the sketch below):
• Transformations: e.g., map(), filter(), flatMap()
• Actions: e.g., collect(), count(), reduce()
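
A short sketch of the two operation types (assumes an existing SparkContext named sc, as in the example above):

    nums = sc.parallelize([1, 2, 3, 4, 5])

    # Transformations: each returns a new RDD; nothing executes yet
    evens = nums.filter(lambda x: x % 2 == 0)
    doubled = evens.map(lambda x: x * 2)

    # Actions: trigger execution and return results to the driver
    print(doubled.collect())                   # [4, 8]
    print(doubled.count())                     # 2
    print(doubled.reduce(lambda a, b: a + b))  # 12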

SPARK PROGRAMMING MODEL
Languages Supported: Scala, Python (PySpark), Java, R
Lazy Evaluation:
• Transformations are evaluated only when an action is called
• Optimizes job execution by letting Spark plan the whole pipeline as one job (see the sketch below)
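
A small illustration of lazy evaluation (again assuming a SparkContext named sc):

    rdd = sc.parallelize(range(10))
    mapped = rdd.map(lambda x: x * 10)   # returns instantly: only the lineage is recorded

    # No computation has happened yet; the action below runs the whole
    # pipeline, letting Spark optimize it as a single job.
    print(mapped.reduce(lambda a, b: a + b))   # 450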
BASIC TRANSFORMATIONS & ACTIONS
•Transformations create new RDDs: map(), filter(), flatMap()
•Actions trigger the execution and return results: collect(), count(), reduce()
WORD COUNT EXAMPLE (CODE)
Initialize Spark: SparkContext is created to set up the Spark environment locally.
Load Data: The text file is loaded as an RDD using textFile().
Transform Data: Lines are split into words with flatMap(), and each word is mapped to (word, 1) pairs using map().
Count Words: reduceByKey() sums up the values to get the count of each word.
Display Results: collect() retrieves the final word counts, which are printed.
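
The slide describes the code rather than showing it; a sketch matching the five steps (the file name input.txt is a placeholder):

    from pyspark import SparkContext

    sc = SparkContext("local", "WordCount")            # 1. Initialize Spark locally
    lines = sc.textFile("input.txt")                   # 2. Load the text file as an RDD
    words = lines.flatMap(lambda line: line.split())   # 3. Split lines into words...
    pairs = words.map(lambda word: (word, 1))          #    ...and map each word to (word, 1)
    counts = pairs.reduceByKey(lambda a, b: a + b)     # 4. Sum the values per word
    for word, count in counts.collect():               # 5. Retrieve and print the results
        print(word, count)
    sc.stop()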
KEY FEATURES OF RDD
Immutable: Once created, cannot be modified
Lazy Evaluation: Operations are delayed until an action is triggered
Fault Tolerance: Lost data is recomputed from lineage in case of failure
Parallelism: Processed across multiple nodes

IN-MEMORY COMPUTING
How It Works:
• Spark processes data in RAM instead of on disk
• Drastically reduces time spent on I/O operations
• Increases speed, especially for iterative algorithms that reuse the same data (see the caching sketch below)
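
A minimal caching sketch (assumes a SparkContext named sc; logs.txt is a placeholder):

    logs = sc.textFile("logs.txt")
    errors = logs.filter(lambda line: "ERROR" in line)
    errors.cache()       # keep this RDD in memory once it has been computed

    print(errors.count())                                    # first action: reads from disk, then caches
    print(errors.filter(lambda l: "timeout" in l).count())   # subsequent passes read from RAM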
SPARK SQL AND DATAFRAMES
•Spark SQL:
• Allows querying structured data via SQL
• Can load data from various sources like JSON, Parquet, etc.

•DataFrames:
• Distributed collection of data organized into named columns
• Optimized for structured data processing (see the sketch below)
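
A sketch using the SparkSession entry point (people.json and the column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SQLDemo").getOrCreate()

    df = spark.read.json("people.json")    # DataFrame: rows with named columns
    df.printSchema()

    df.createOrReplaceTempView("people")   # register the DataFrame for SQL queries
    adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()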
SPARK STREAMING AND SPARK MLLIB
•Real-Time Data Processing:
• Processes live data streams (like logs, social media feeds)
• Handles data in small batch intervals (micro-batching), as sketched below

•Use Cases:
• Log analysis, fraud detection, real-time analytics
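
A micro-batching sketch with the classic DStream API (assumes a text source on localhost:9999, e.g. started with nc -lk 9999):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StreamDemo")
    ssc = StreamingContext(sc, batchDuration=5)    # process the stream in 5-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)
    errors = lines.filter(lambda line: "ERROR" in line)
    errors.pprint()                                # print each batch's matching lines

    ssc.start()
    ssc.awaitTermination()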

•Machine Learning Library (MLlib):
• Distributed algorithms for clustering, classification, regression, etc.
• Scalable and built for large datasets (see the sketch below)
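
A hedged MLlib sketch: distributed k-means clustering via the DataFrame-based pyspark.ml API (the toy data points are made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("MLDemo").getOrCreate()
    df = spark.createDataFrame(
        [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)], ["x", "y"])

    # Assemble raw columns into the feature vector column MLlib expects
    features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

    model = KMeans(k=2, seed=42).fit(features)   # training runs in parallel across the cluster
    print(model.clusterCenters())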
CONCLUSION
•Summary:
• Apache Spark is a powerful framework for large-scale data processing
• Offers high speed, fault tolerance, and versatility
• Handles batch, real-time, SQL, ML, and graph processing in a unified environment
