
INTRODUCTION TO APACHE SPARK

Presented by:
Krishnan Kaushik (1CR21CS082)
Kottakota Rishina (1CR21CS081)
Mayank Kumar Singh (1CR21CS095)

INTRODUCTION
•What is Apache Spark?
•Open-source distributed data processing framework
•Designed for large-scale data processing
•Developed at UC Berkeley's AMPLab
•Built to handle both batch and real-time data
WHY APACHE SPARK?
Key Features:
•Speed: In-memory computing makes it up to 100x faster than Hadoop's MapReduce for certain workloads
•Ease of Use: APIs available in Java, Python, Scala, and R
•Unified Engine: Supports batch processing, real-time streaming, SQL queries, machine learning, and graph processing
•Fault Tolerance: Ensures no data loss during failures through lineage and RDDs
DISTRIBUTED COMPUTING
•How Spark Works:
• Spark distributes data across clusters for parallel processing
• Jobs are broken into smaller tasks that run in parallel on multiple nodes
• This approach enables efficient large-scale data handling (see the sketch below)
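
A minimal PySpark sketch of this idea (local mode standing in for a cluster; the app name and the numbers are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "DistributedDemo")  # 4 local worker threads stand in for nodes

    # Distribute a collection across 4 partitions; each partition is
    # processed by a separate task, potentially on a different node.
    rdd = sc.parallelize(range(1, 1001), numSlices=4)
    print(rdd.getNumPartitions())           # 4
    print(rdd.map(lambda x: x * x).sum())   # per-partition partial results are combined at the end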
CORE COMPONENTS OF APACHE SPARK
Spark Core: The engine for distributed data processing
Spark SQL: For working with structured and semi-structured data using SQL queries
Spark Streaming: For real-time stream processing
MLlib: Machine learning library for distributed model building
GraphX: Graph computation engine
RESILIENT DISTRIBUTED DATASET (RDD)
What is RDD?
• Immutable distributed collection of objects
• Provides fault tolerance and parallelism
• Two types of operations (see the sketch below):
• Transformations: e.g., map(), filter(), flatMap()
• Actions: e.g., collect(), count(), reduce()
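
A short sketch of the two operation types (assumes an existing SparkContext named sc, as in the example above):

    nums = sc.parallelize([1, 2, 3, 4, 5])

    # Transformations: each returns a new RDD; nothing executes yet
    evens = nums.filter(lambda x: x % 2 == 0)
    doubled = evens.map(lambda x: x * 2)

    # Actions: trigger execution and return results to the driver
    print(doubled.collect())                   # [4, 8]
    print(doubled.count())                     # 2
    print(doubled.reduce(lambda a, b: a + b))  # 12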

SPARK PROGRAMMING MODEL
Languages Supported: Scala, Python (PySpark), Java, R
Lazy Evaluation:
• Transformations are evaluated only when an action is called
• Optimizes job execution by letting Spark plan the whole pipeline as one job (see the sketch below)
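
A small illustration of lazy evaluation (again assuming a SparkContext named sc):

    rdd = sc.parallelize(range(10))
    mapped = rdd.map(lambda x: x * 10)   # returns instantly: only the lineage is recorded

    # No computation has happened yet; the action below runs the whole
    # pipeline, letting Spark optimize it as a single job.
    print(mapped.reduce(lambda a, b: a + b))   # 450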
BASIC TRANSFORMATIONS & ACTIONS
•Transformations create new RDDs: map(), filter(), flatMap()
•Actions trigger the execution and return results: collect(), count(), reduce()
WORD COUNT EXAMPLE (CODE)
Initialize Spark: SparkContext is created to set up the Spark environment locally.
Load Data: The text file is loaded as an RDD using textFile().
Transform Data: Lines are split into words with flatMap(), and each word is mapped to (word, 1) pairs using map().
Count Words: reduceByKey() sums up the values to get the count of each word.
Display Results: collect() retrieves the final word counts, which are printed.
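
The slide describes the code rather than showing it; a sketch matching the five steps (the file name input.txt is a placeholder):

    from pyspark import SparkContext

    sc = SparkContext("local", "WordCount")            # 1. Initialize Spark locally
    lines = sc.textFile("input.txt")                   # 2. Load the text file as an RDD
    words = lines.flatMap(lambda line: line.split())   # 3. Split lines into words...
    pairs = words.map(lambda word: (word, 1))          #    ...and map each word to (word, 1)
    counts = pairs.reduceByKey(lambda a, b: a + b)     # 4. Sum the values per word
    for word, count in counts.collect():               # 5. Retrieve and print the results
        print(word, count)
    sc.stop()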
KEY FEATURES OF RDD
Immutable: Once created, cannot be modified
Lazy Evaluation: Operations are delayed until an action is triggered
Fault Tolerance: Lost data is recomputed from lineage in case of failure
Parallelism: Processed across multiple nodes

IN-MEMORY COMPUTING
How It Works:
• Spark processes data in RAM instead of on disk
• Drastically reduces time spent on I/O operations
• Increases speed, especially for iterative algorithms that reuse the same data (see the caching sketch below)
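
A minimal caching sketch (assumes a SparkContext named sc; logs.txt is a placeholder):

    logs = sc.textFile("logs.txt")
    errors = logs.filter(lambda line: "ERROR" in line)
    errors.cache()       # keep this RDD in memory once it has been computed

    print(errors.count())                                    # first action: reads from disk, then caches
    print(errors.filter(lambda l: "timeout" in l).count())   # subsequent passes read from RAM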
SPARK SQL AND DATAFRAMES
•Spark SQL:
• Allows querying structured data via SQL
• Can load data from various sources like JSON, Parquet, etc.

•DataFrames:
• Distributed collection of data organized into named columns
• Optimized for structured data processing (see the sketch below)
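
A sketch using the SparkSession entry point (people.json and the column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SQLDemo").getOrCreate()

    df = spark.read.json("people.json")    # DataFrame: rows with named columns
    df.printSchema()

    df.createOrReplaceTempView("people")   # register the DataFrame for SQL queries
    adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()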
SPARK STREAMING AND SPARK MLLIB
•Real-Time Data Processing:
• Processes live data streams (like logs, social media feeds)
• Handles data in small batch intervals (micro-batching), as sketched below

•Use Cases:
• Log analysis, fraud detection, real-time analytics
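
A micro-batching sketch with the classic DStream API (assumes a text source on localhost:9999, e.g. started with nc -lk 9999):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StreamDemo")
    ssc = StreamingContext(sc, batchDuration=5)    # process the stream in 5-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)
    errors = lines.filter(lambda line: "ERROR" in line)
    errors.pprint()                                # print each batch's matching lines

    ssc.start()
    ssc.awaitTermination()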

•Machine Learning Library (MLlib):
• Distributed algorithms for clustering, classification, regression, etc.
• Scalable and built for large datasets (see the sketch below)
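
A hedged MLlib sketch: distributed k-means clustering via the DataFrame-based pyspark.ml API (the toy data points are made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("MLDemo").getOrCreate()
    df = spark.createDataFrame(
        [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)], ["x", "y"])

    # Assemble raw columns into the feature vector column MLlib expects
    features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

    model = KMeans(k=2, seed=42).fit(features)   # training runs in parallel across the cluster
    print(model.clusterCenters())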
CONCLUSION
•Summary:
• Apache Spark is a powerful framework for large-scale data processing
• Offers high speed, fault tolerance, and versatility
• Handles batch, real-time, SQL, ML, and graph processing in a unified environment
