HADOOP
MAPREDUCE
Hadoop MapReduce is "a software framework for easily writing
applications which process vast amounts of data in parallel on large
clusters of commodity hardware in a reliable, fault-tolerant manner."
The MapReduce paradigm consists of two sequential tasks:
Map filters and sorts data while converting it into key-value pairs.
Reduce then takes this input and reduces its size by performing
some kind of summary over the data set.
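The two tasks can be illustrated with the classic word-count example. This is a minimal single-process sketch of the paradigm (the function names are ours, not Hadoop's API; a real job distributes each phase across a cluster):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: filter/convert each document into (word, 1) key-value pairs."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word.lower(), 1))
    return pairs

def shuffle_phase(pairs):
    """Shuffle/sort: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: summarize each group, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 2
```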
SPARK VS HADOOP
Hadoop: it needs other query tools to perform the task.
Spark: it has Spark SQL as its very own query language.
COMPONENTS OF APACHE SPARK
SPARK SQL
SPARK SQL ARCHITECTURE
FEATURES OF SPARK SQL
DataFrame
A DataFrame is a distributed collection of data, which is organized
into named columns. Conceptually, it is equivalent to a relational
table, but with good optimization techniques under the hood.
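The core idea — named columns over rows that are partitioned across a cluster — can be modeled in a few lines. This is a toy sketch of the concept only (the names `schema`, `partitions`, and `select_where` are ours; Spark's real DataFrame adds a query optimizer and many more operations):

```python
# A toy model of a DataFrame: a schema of named columns plus row
# partitions that could live on different cluster nodes.
schema = ("name", "age")
partitions = [
    [("alice", 34), ("bob", 29)],   # partition on worker 1
    [("carol", 41)],                # partition on worker 2
]

def select_where(partitions, schema, column, predicate):
    """Apply a filter to every partition independently, as a cluster would."""
    idx = schema.index(column)
    return [[row for row in part if predicate(row[idx])] for part in partitions]

over_30 = select_where(partitions, schema, "age", lambda a: a > 30)
print(over_30)  # [[('alice', 34)], [('carol', 41)]]
```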
SPARK SQL FUNCTIONS
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput,
fault-tolerant stream processing of live data streams.
Spark Streaming
Spark Streaming receives live input data streams and divides the data into batches, which are then
processed by the Spark engine to generate the final stream of results in batches.
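The micro-batch model can be sketched in plain Python. This toy version batches by record count rather than by time interval, just to keep the idea visible (the names `micro_batches` and `process` are ours, not Spark's API):

```python
def micro_batches(stream, batch_size):
    """Divide a live stream into small batches (Spark batches by time
    interval; we batch by count here to keep the sketch simple)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def process(batch):
    """Stand-in for the Spark engine: summarize one batch of records."""
    return sum(batch)

results = [process(b) for b in micro_batches(iter([1, 2, 3, 4, 5]), 2)]
print(results)  # [3, 7, 5]
```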
Spark Streaming (DStreams):
- One of the first APIs to enable stream processing using high-level functional operators like map and reduce
- Like the RDD API, the DStreams API is based on relatively low-level operations on Java/Python objects
- Used by many organizations in production

Spark Structured Streaming:
- Structured API through DataFrames/Datasets rather than RDDs
- Easier code reuse between batch and streaming
- Marked production ready in Spark 2.2.0
- Support for Java, Scala, Python, R and SQL
- Focus of this talk
Spark Structured Streaming
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the
Spark SQL engine. You can express your streaming computation the same way you would
express a batch computation on static data.
Spark Structured Streaming
A query on the input will generate the “Result Table”. Every trigger interval (say, every 1 second),
new rows get appended to the Input Table, which eventually updates the Result Table.
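The Input Table / Result Table relationship can be simulated in a few lines. This is a conceptual toy (the names `input_table` and `run_query` are ours), not how Spark materializes state:

```python
from collections import Counter

input_table = []  # conceptually unbounded; grows every trigger

def run_query(table):
    """The streaming query: word counts over the whole Input Table."""
    return Counter(word for line in table for word in line.split())

# Trigger 1: new rows arrive and are appended to the Input Table.
input_table.append("cat dog")
result = run_query(input_table)

# Trigger 2: more rows arrive; the Result Table is updated.
input_table.append("dog owl")
result = run_query(input_table)
print(result["dog"])  # 2
```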
Spark Structured Streaming
Structured Streaming does not materialize the entire table. It reads the latest available data from the
streaming data source, processes it incrementally to update the result, and then discards the source data.
Spark Structured Streaming
We want to count words within 10-minute windows, updating every 5 minutes.
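The windowed count above can be sketched in plain Python: each event contributes to every 10-minute window (sliding every 5 minutes) that contains its event time. This is a toy model of the window assignment (windows here are aligned to minute 0; the helper name is ours, not Spark's API):

```python
from collections import Counter

def sliding_windows(event_time, size=10, slide=5):
    """Return the start minute of every window containing event_time,
    assuming windows start no earlier than minute 0."""
    first = (event_time // slide) * slide - size + slide
    return [start for start in range(max(0, first), event_time + 1, slide)
            if start <= event_time < start + size]

events = [(1, "cat"), (6, "cat"), (12, "dog")]  # (minute, word)
counts = Counter()
for minute, word in events:
    for window_start in sliding_windows(minute):
        counts[(window_start, word)] += 1

print(counts[(5, "cat")])  # 1 -- only the minute-6 event falls in [5, 15)
```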
Spark Structured Streaming
Handling Late Data and Watermarking
Watermarking lets the engine automatically track the current event time in the data and attempt to clean up old state.
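The mechanism can be sketched as: watermark = max event time seen so far minus an allowed lateness, and any window state that closed before the watermark is dropped. This is a toy model only (the lateness value, window size, and names are assumptions for illustration, not Spark's API):

```python
ALLOWED_LATENESS = 10  # minutes -- an assumed lateness threshold
WINDOW_SIZE = 10       # minutes -- an assumed window size
max_event_time = 0
window_state = {0: 3, 10: 5, 20: 1}  # window_start -> count (toy state)

def on_event(event_time):
    """Advance the watermark and clean up state for closed windows."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    for start in list(window_state):
        if start + WINDOW_SIZE <= watermark:  # window ended before watermark
            del window_state[start]           # safe to drop its state
    return watermark

wm = on_event(25)
print(wm, sorted(window_state))  # 15 [10, 20]
```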
Spark Structured Streaming
Spark supports three types of time windows: tumbling (fixed), sliding and session.
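The three window types differ in how an event time maps to windows. A toy sketch of each assignment rule (helper names and the session gap are ours, for illustration only):

```python
def tumbling_window(t, size=10):
    """Fixed, non-overlapping windows: each event falls in exactly one."""
    start = (t // size) * size
    return (start, start + size)

def sliding_window_starts(t, size=10, slide=5):
    """Overlapping windows: an event can fall in several (size // slide)."""
    return [s for s in range(0, t + 1, slide) if s <= t < s + size]

def session_windows(times, gap=5):
    """Dynamic windows: a session extends while events arrive within `gap`."""
    sessions, current = [], [times[0]]
    for t in times[1:]:
        if t - current[-1] <= gap:
            current.append(t)
        else:
            sessions.append((current[0], current[-1]))
            current = [t]
    sessions.append((current[0], current[-1]))
    return sessions

print(tumbling_window(12))              # (10, 20)
print(session_windows([1, 3, 12, 14]))  # [(1, 3), (12, 14)]
```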
Spark Structured Streaming
Asynchronous progress tracking allows streaming queries to checkpoint progress asynchronously and in parallel to the
actual data processing within a micro-batch, reducing latency associated with maintaining the offset log and commit log.
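The idea — writing progress logs in parallel with batch processing instead of blocking on them — can be sketched with a background thread. This is a conceptual toy, not Spark's implementation (all names here are ours):

```python
import queue
import threading

offset_log = []
log_queue = queue.Queue()

def offset_writer():
    """Background thread: persists offsets without blocking the batch loop."""
    while True:
        offset = log_queue.get()
        if offset is None:  # sentinel: shut down
            break
        offset_log.append(offset)  # stand-in for writing the offset log

writer = threading.Thread(target=offset_writer)
writer.start()

processed = []
for batch_id, batch in enumerate([[1, 2], [3, 4], [5]]):
    log_queue.put(batch_id)       # checkpoint progress asynchronously...
    processed.append(sum(batch))  # ...while the engine processes the batch

log_queue.put(None)
writer.join()
print(processed, offset_log)  # [3, 7, 5] [0, 1, 2]
```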
THANK YOU!