
APACHE SPARK

Nguyen Van Manh Cuong

Ngo Ngoc Tuan Anh

Nguyen Gia Bao

1


WHAT IS APACHE SPARK?
Apache Spark is a unified analytics engine for
large-scale data processing. It provides high-level
APIs in many programming languages, and an
optimized engine that supports general execution
graphs.
It supports a rich set of higher-level tools including:
Spark SQL for SQL and structured data processing
pandas API on Spark for pandas workloads
MLlib for machine learning
GraphX for graph processing
Structured Streaming for incremental computation
and stream processing
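
A minimal sketch of what these high-level APIs look like in practice (PySpark assumed; the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# Entry point to the APIs listed above
spark = SparkSession.builder.appName("demo").getOrCreate()

# A DataFrame computation; the optimized engine plans the execution graph
df = spark.range(1_000_000)
print(df.selectExpr("sum(id) AS total").first()["total"])

spark.stop()
```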

2
HADOOP
MAPREDUCE
Hadoop MapReduce is "a software framework for easily writing
applications which process vast amounts of data in parallel on large
clusters of commodity hardware in a reliable, fault-tolerant manner."
The MapReduce paradigm consists of two sequential tasks:
Map filters and sorts data while converting it into key-value pairs
Reduce then takes this input and reduces its size by performing
some kind of summary over the data set
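
To make the paradigm concrete, here is an illustrative single-machine sketch in Python (real Hadoop distributes these steps across a cluster; the sample documents are made up):

```python
from collections import defaultdict

docs = ["spark is fast", "hadoop is reliable"]

# Map: convert each record into key-value pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle and sort: group values by key (the framework does this in Hadoop)
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce: summarize each group, shrinking the data set
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'fast': 1, 'hadoop': 1, 'is': 2, 'reliable': 1, 'spark': 1}
```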

3
SPARK VS HADOOP

Speed: Spark processes data up to 100 times faster, because computation happens in memory; Hadoop MapReduce is slower, since it reads and writes intermediate results on disk.

Processing model: Hadoop performs batch processing of data; Spark supports both batch and real-time processing of data.

Ease of use: Hadoop is difficult to program, as you are required to write code for every process; Spark is easy to program.

Querying: Hadoop needs other tools to perform query tasks; Spark has Spark SQL as its very own query language.

4
COMPONENTS OF
APACHE SPARK

5

SPARK SQL

6
SPARK SQL ARCHITECTURE

7
FEATURES OF
SPARK SQL

8
9
10
11
12
13
DataFrame
A DataFrame is a distributed collection of data organized into named
columns. Conceptually, it is equivalent to a relational table, with good
optimization techniques under the hood.
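
A short sketch of building and querying a DataFrame (column names and data are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A distributed collection of data organized into named columns
df = spark.createDataFrame(
    [("Alice", 29), ("Bob", 35)],
    ["name", "age"],
)
df.filter(df.age > 30).show()  # optimized like a query on a relational table
```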

14
SPARK SQL FUNCTIONS
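The examples on this slide are an image in the source deck; as a hedged illustration, Spark SQL offers both built-in column functions and a plain SQL interface (table and column names here are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", 10.0), ("bob", 3.5)], ["user", "amount"])

# Built-in column functions from pyspark.sql.functions
df.select(F.upper("user").alias("user"),
          F.round("amount", 1).alias("amount")).show()

# The same data queried with plain SQL
df.createOrReplaceTempView("payments")
spark.sql("SELECT user, SUM(amount) AS total FROM payments GROUP BY user").show()
```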

15
Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
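
A hedged sketch of the classic DStream word count using this legacy API (assumes a text source on localhost:9999, e.g. started with `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the counts computed in each batch

ssc.start()
ssc.awaitTermination()
```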

16
Spark Streaming

Spark Streaming receives live input data streams and divides the data into batches, which are then
processed by the Spark engine to generate the final stream of results in batches.

17
Spark Streaming (1st generation)

One of the first APIs to enable stream processing using high-level functional operators like map and reduce
Like the RDD API, the DStreams API is based on relatively low-level operations on Java/Python objects
Used by many organizations in production

Spark Structured Streaming (2nd generation)

Structured API through DataFrames/Datasets rather than RDDs
Easier code reuse between batch and streaming
Marked production-ready in Spark 2.2.0
Support for Java, Scala, Python, R and SQL
Focus of this talk

18
Spark Structured Streaming
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the
Spark SQL engine. You can express your streaming computation the same way you would
express a batch computation on static data.
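
For example, the canonical streaming word count looks almost exactly like its batch counterpart (the socket host/port are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# A streaming DataFrame: each row is a line of text from the socket
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# The same DataFrame operations you would use on static data
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```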

19
Spark Structured Streaming
Every trigger interval (say, every 1 second), new rows get appended to the Input Table; a query on the input then generates the “Result Table”.

20
Spark Structured Streaming

Structured Streaming does not materialize the entire table. It reads the latest available data from the
streaming data source, processes it incrementally to update the result, and then discards the source data

21
Spark Structured Streaming
We want to count words within 10-minute windows, updating every 5 minutes.
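
This is expressed with the `window` function over an event-time column (continuing the `words` DataFrame from the sketch above, assuming it also carries a `timestamp` column):

```python
from pyspark.sql.functions import window

# 10-minute windows that slide every 5 minutes, counted per word
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()
```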

22
Spark Structured Streaming
Handling Late Data and Watermarking
Watermarking lets the engine automatically track the current event time in the data and attempt to clean up old state.
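
A hedged sketch: declaring a watermark on the event-time column tells the engine how late data may arrive, so state for sufficiently old windows can be dropped (column names continue the example above):

```python
from pyspark.sql.functions import window

# Accept data up to 10 minutes late; older state becomes eligible for cleanup
windowedCounts = (words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word)
    .count())
```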

23
Spark Structured Streaming
Spark supports three types of time windows: tumbling (fixed), sliding and session.
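
Illustrative sketches of the three window types (assuming a DataFrame `df` with a `timestamp` column; `session_window` is available from Spark 3.2):

```python
from pyspark.sql.functions import window, session_window

# Tumbling (fixed): non-overlapping 10-minute windows
df.groupBy(window("timestamp", "10 minutes")).count()

# Sliding: 10-minute windows that advance every 5 minutes, so they overlap
df.groupBy(window("timestamp", "10 minutes", "5 minutes")).count()

# Session: a window closes after a 5-minute gap with no new events
df.groupBy(session_window("timestamp", "5 minutes")).count()
```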

24
Spark Structured Streaming
Asynchronous progress tracking allows streaming queries to checkpoint progress asynchronously and in parallel to the
actual data processing within a micro-batch, reducing latency associated with maintaining the offset log and commit log.
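
As a hedged sketch of enabling it (option names per the Spark 3.4 docs, which describe this for Kafka sinks; broker, topic, and checkpoint paths are placeholders):

```python
# The Kafka sink expects string key/value columns
out = counts.selectExpr("CAST(word AS STRING) AS key",
                        "CAST(count AS STRING) AS value")

query = (out.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "wordcounts")
    .option("checkpointLocation", "/tmp/checkpoints")
    .option("asyncProgressTrackingEnabled", "true")  # opt in to async tracking
    .start())
```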

25
THANK YOU!
26
