
APACHE SPARK

Nguyen Van Manh Cuong

Ngo Ngoc Tuan Anh

Nguyen Gia Bao

1


WHAT IS APACHE SPARK?
Apache Spark is a unified analytics engine for
large-scale data processing. It provides high-level
APIs in many programming languages, and an
optimized engine that supports general execution
graphs.
It supports a rich set of higher-level tools including:
Spark SQL for SQL and structured data processing
pandas API on Spark for pandas workloads
MLlib for machine learning
GraphX for graph processing
Structured Streaming for incremental computation
and stream processing
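
A minimal sketch of what these high-level APIs look like in practice (PySpark assumed; the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# Entry point to the APIs listed above
spark = SparkSession.builder.appName("demo").getOrCreate()

# A DataFrame computation; the optimized engine plans the execution graph
df = spark.range(1_000_000)
print(df.selectExpr("sum(id) AS total").first()["total"])

spark.stop()
```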

2
HADOOP
MAPREDUCE
Hadoop MapReduce is "a software framework for easily writing
applications which process vast amounts of data in parallel on large
clusters of commodity hardware in a reliable, fault-tolerant manner."
The MapReduce paradigm consists of two sequential tasks:
Map filters and sorts data while converting it into key-value pairs
Reduce then takes this input and reduces its size by performing
some kind of summary over the data set
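
To make the paradigm concrete, here is an illustrative single-machine sketch in Python (real Hadoop distributes these steps across a cluster; the sample documents are made up):

```python
from collections import defaultdict

docs = ["spark is fast", "hadoop is reliable"]

# Map: convert each record into key-value pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle and sort: group values by key (the framework does this in Hadoop)
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce: summarize each group, shrinking the data set
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'fast': 1, 'hadoop': 1, 'is': 2, 'reliable': 1, 'spark': 1}
```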

3
SPARK VS HADOOP

Speed: Spark processes data up to 100 times faster, because computation happens in memory; Hadoop MapReduce is slower, since it reads and writes intermediate results on disk.

Processing model: Hadoop performs batch processing of data; Spark supports both batch and real-time processing of data.

Ease of use: Hadoop is difficult to program, as you are required to write code for every process; Spark is easy to program.

Querying: Hadoop needs other tools to perform query tasks; Spark has Spark SQL as its very own query language.

4
COMPONENTS OF
APACHE SPARK

5

SPARK SQL

6
SPARK SQL ARCHITECTURE

7
FEATURES OF
SPARK SQL

8
9
10
11
12
13
DataFrame
A DataFrame is a distributed collection of data organized into named
columns. Conceptually, it is equivalent to a relational table, with good
optimization techniques under the hood.
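
A short sketch of building and querying a DataFrame (column names and data are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A distributed collection of data organized into named columns
df = spark.createDataFrame(
    [("Alice", 29), ("Bob", 35)],
    ["name", "age"],
)
df.filter(df.age > 30).show()  # optimized like a query on a relational table
```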

14
SPARK SQL FUNCTIONS
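The examples on this slide are an image in the source deck; as a hedged illustration, Spark SQL offers both built-in column functions and a plain SQL interface (table and column names here are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", 10.0), ("bob", 3.5)], ["user", "amount"])

# Built-in column functions from pyspark.sql.functions
df.select(F.upper("user").alias("user"),
          F.round("amount", 1).alias("amount")).show()

# The same data queried with plain SQL
df.createOrReplaceTempView("payments")
spark.sql("SELECT user, SUM(amount) AS total FROM payments GROUP BY user").show()
```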

15
Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
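
A hedged sketch of the classic DStream word count using this legacy API (assumes a text source on localhost:9999, e.g. started with `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the counts computed in each batch

ssc.start()
ssc.awaitTermination()
```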

16
Spark Streaming

Spark Streaming receives live input data streams and divides the data into batches, which are then
processed by the Spark engine to generate the final stream of results in batches.

17
Spark Streaming (1st generation)

One of the first APIs to enable stream processing using high-level functional operators like map and reduce
Like the RDD API, the DStreams API is based on relatively low-level operations on Java/Python objects
Used by many organizations in production

Spark Structured Streaming (2nd generation)

Structured API through DataFrames/Datasets rather than RDDs
Easier code reuse between batch and streaming
Marked production-ready in Spark 2.2.0
Support for Java, Scala, Python, R and SQL
Focus of this talk

18
Spark Structured Streaming
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the
Spark SQL engine. You can express your streaming computation the same way you would
express a batch computation on static data.
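
For example, the canonical streaming word count looks almost exactly like its batch counterpart (the socket host/port are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# A streaming DataFrame: each row is a line of text from the socket
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# The same DataFrame operations you would use on static data
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```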

19
Spark Structured Streaming
Every trigger interval (say, every 1 second), new rows get appended to the Input Table; a query on the input then generates the “Result Table”.

20
Spark Structured Streaming

Structured Streaming does not materialize the entire table. It reads the latest available data from the
streaming data source, processes it incrementally to update the result, and then discards the source data

21
Spark Structured Streaming
We want to count words within 10-minute windows, updating every 5 minutes.
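
This is expressed with the `window` function over an event-time column (continuing the `words` DataFrame from the sketch above, assuming it also carries a `timestamp` column):

```python
from pyspark.sql.functions import window

# 10-minute windows that slide every 5 minutes, counted per word
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()
```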

22
Spark Structured Streaming
Handling Late Data and Watermarking
Watermarking lets the engine automatically track the current event time in the data and attempt to clean up old state.
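
A hedged sketch: declaring a watermark on the event-time column tells the engine how late data may arrive, so state for sufficiently old windows can be dropped (column names continue the example above):

```python
from pyspark.sql.functions import window

# Accept data up to 10 minutes late; older state becomes eligible for cleanup
windowedCounts = (words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word)
    .count())
```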

23
Spark Structured Streaming
Spark supports three types of time windows: tumbling (fixed), sliding and session.
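
Illustrative sketches of the three window types (assuming a DataFrame `df` with a `timestamp` column; `session_window` is available from Spark 3.2):

```python
from pyspark.sql.functions import window, session_window

# Tumbling (fixed): non-overlapping 10-minute windows
df.groupBy(window("timestamp", "10 minutes")).count()

# Sliding: 10-minute windows that advance every 5 minutes, so they overlap
df.groupBy(window("timestamp", "10 minutes", "5 minutes")).count()

# Session: a window closes after a 5-minute gap with no new events
df.groupBy(session_window("timestamp", "5 minutes")).count()
```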

24
Spark Structured Streaming
Asynchronous progress tracking allows streaming queries to checkpoint progress asynchronously and in parallel to the
actual data processing within a micro-batch, reducing latency associated with maintaining the offset log and commit log.
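
As a hedged sketch of enabling it (option names per the Spark 3.4 docs, which describe this for Kafka sinks; broker, topic, and checkpoint paths are placeholders):

```python
# The Kafka sink expects string key/value columns
out = counts.selectExpr("CAST(word AS STRING) AS key",
                        "CAST(count AS STRING) AS value")

query = (out.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "wordcounts")
    .option("checkpointLocation", "/tmp/checkpoints")
    .option("asyncProgressTrackingEnabled", "true")  # opt in to async tracking
    .start())
```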

25
THANK YOU!
26
