Week-5 - Lecture Notes

HBase is an open source NoSQL database that provides distributed column-oriented storage and is designed to operate at large scale on top of HDFS. It is modeled after Google's BigTable storage system and stores data in tables that are split into regions served by region servers.


Design of HBase

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]



Preface
Content of this Lecture:

In this lecture, we will discuss:

What is HBase?
HBase Architecture
HBase Components
Data Model
HBase Storage Hierarchy
Cross-Datacenter Replication
Auto Sharding and Distribution
Bloom Filter and Fold, Store, and Shift

HBase is:
An open-source NoSQL database.
A distributed column-oriented data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage.
Designed to operate on top of the Hadoop Distributed File System (HDFS) for scalability, fault tolerance, and high availability.
HBase is an implementation of the BigTable storage architecture, a distributed storage system developed by Google.
Works with structured, semi-structured, and unstructured data.


HBase
Google's BigTable was the first "blob-based" storage system
Yahoo! open-sourced it → HBase
Major Apache project today
Facebook uses HBase internally
API functions:
  Get/Put(row)
  Scan(row range, filter) – range queries
  MultiPut
Unlike Cassandra, HBase prefers consistency (over availability)
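A minimal sketch of these API calls using the HBase Java client from Scala (assuming an HBase 2.x-style client on the classpath; the table "users", the column family "personaldata", and the row keys are hypothetical names used only for illustration):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes

object HBaseQuickstart {
  def main(args: Array[String]): Unit = {
    val conf       = HBaseConfiguration.create()            // reads hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    val table      = connection.getTable(TableName.valueOf("users"))

    // Put(row): write one cell addressed by rowkey + columnfamily:qualifier
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("personaldata"), Bytes.toBytes("Name"), Bytes.toBytes("Alice"))
    table.put(put)

    // Get(row): read the cell back
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("personaldata"), Bytes.toBytes("Name"))))

    // Scan(row range): range query over sorted rowkeys
    val scan    = new Scan().withStartRow(Bytes.toBytes("row0")).withStopRow(Bytes.toBytes("row9"))
    val scanner = table.getScanner(scan)
    scanner.forEach(r => println(Bytes.toString(r.getRow)))

    scanner.close(); table.close(); connection.close()
  }
}

Because rows are kept in sorted rowkey order, the Scan over a key range is cheap; this is the property the lecture's "range queries" bullet refers to.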


HBase Architecture
Tables are split into regions and served by region servers.
Regions are vertically divided by column families into "stores".
Stores are saved as files on HDFS.
HBase uses ZooKeeper for distributed coordination.



HBase Components
Client:
  Finds the RegionServers that are serving the particular row range of interest
HMaster:
  Monitors all RegionServer instances in the cluster
Regions:
  Basic element of availability and distribution for tables
RegionServer:
  Serves and manages regions
  In a distributed cluster, a RegionServer runs on a DataNode
Data Model
Data stored in HBase is located by its "rowkey".
The RowKey is like a primary key in a relational database.
Records in HBase are stored in sorted order, according to rowkey.
Data in a row are grouped together as Column Families.
Each Column Family has one or more Columns.
The Columns in a family are stored together in a low-level storage file known as an HFile.
HBase Components

Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster, called "RegionServers."



Column Family

A column is identified by a Column Qualifier, which consists of the Column Family name concatenated with the Column name using a colon, e.g., personaldata:Name.
Column families are mapped to storage files and are stored in separate files, which can also be accessed separately.
Cell in HBase Table

Data is stored in the cells of HBase tables.
A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp.
The key consists of the row key, column name, and timestamp.
The entire cell, with the added structural information, is called a KeyValue.


HBase Data Model
Table: HBase organizes data into tables. Table names are Strings and composed of characters that are safe for use in a file system path.

Row: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[ ] (byte array).

Column Family: Data within a row is grouped by column family. Every row in a table has the same column families, although a row need not store data in all its families. Column families are Strings and composed of characters that are safe for use in a file system path.
HBase Data Model
Column Qualifier: Data within a column family is addressed via its column qualifier, or simply, column. Column qualifiers need not be specified in advance, and need not be consistent between rows. Like row keys, column qualifiers do not have a data type and are always treated as a byte[ ].

Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell's value.


HBase Data Model
Timestamp: Values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp of when the cell was written.

If the timestamp is not specified for a read, the latest version is returned. The number of cell value versions retained by HBase is configured for each column family. The default number of cell versions is three.
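As a hedged illustration of versioned cells (a sketch, assuming an HBase 1.x/2.x-style client where Get.setMaxVersions is still available, and reusing the hypothetical "users" table and "personaldata" family from the earlier sketch):

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table      = connection.getTable(TableName.valueOf("users"))

val get = new Get(Bytes.toBytes("row1"))
get.setMaxVersions(3)                        // ask for up to 3 versions instead of only the latest
val result = table.get(get)

// Each returned cell carries its own timestamp, which is its version identifier.
result.getColumnCells(Bytes.toBytes("personaldata"), Bytes.toBytes("Name")).forEach { cell =>
  println(s"${cell.getTimestamp} -> ${Bytes.toString(CellUtil.cloneValue(cell))}")
}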


HBase Architecture
[Architecture diagram] A client talks to ZooKeeper (a small group of servers running Zab, a Paxos-like consensus protocol) and to the HMaster. Each HRegionServer maintains an HLog and serves HRegions; each HRegion contains one Store per column family, and each Store consists of an in-memory MemStore plus StoreFiles (HFiles) persisted on HDFS.



HBase Storage Hierarchy
HBase Table
  Split into multiple regions: replicated across servers
  • ColumnFamily = subset of columns with similar query patterns
  • One Store per combination of ColumnFamily + region
    – MemStore for each Store: in-memory updates to the Store; flushed to disk when full
    – StoreFiles for each Store for each region: where the data lives
      - HFile

HFile
  The equivalent of an SSTable from Google's BigTable
HFile

[HFile layout diagram] An HFile consists of data blocks (Data … Data …), followed by metadata, file info, indices, and a trailer. Each data block begins with a magic header and holds (key, value) pairs. Each HBase key encodes: key length, value length, row length, row, column family length, column family, column qualifier, timestamp, and key type. Example coordinates: row key SSN:000-01-2345, column family Demographic Information, column qualifier Ethnicity.



Strong Consistency: HBase Write-Ahead Log
[Diagram] A client write (e.g., key k1) arriving at an HRegionServer is (1) appended to the HLog and then (2) applied to the MemStore of the appropriate Store in the target HRegion; MemStores are later flushed to StoreFiles (HFiles).

Write to the HLog before writing to the MemStore.
Helps recover from failure by replaying the HLog.
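To illustrate the write-ahead idea in isolation, here is a toy key-value store (a sketch of the general technique, not HBase's actual WAL implementation or file format): every edit is appended and flushed to a log before the in-memory state is touched, and recovery simply replays the log.

import java.io.{File, FileWriter, PrintWriter}
import scala.collection.mutable
import scala.io.Source

class ToyStore(logPath: String) {
  private val memstore = mutable.Map[String, String]()

  // 1. Append the edit to the log (durable) BEFORE updating in-memory state.
  def put(key: String, value: String): Unit = {
    val out = new PrintWriter(new FileWriter(logPath, true))
    try { out.println(s"$key\t$value") } finally { out.close() }
    memstore(key) = value                       // 2. then apply to the "MemStore"
  }

  def get(key: String): Option[String] = memstore.get(key)

  // On restart, replay the log to rebuild the in-memory state lost with the process.
  def recover(): Unit = if (new File(logPath).exists()) {
    Source.fromFile(logPath).getLines().foreach { line =>
      val Array(k, v) = line.split("\t", 2)
      memstore(k) = v
    }
  }
}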



Log Replay
After recovery from failure, or upon bootup (HRegionServer/HMaster):
Replay any stale logs (use timestamps to find out where the database is with respect to the logs)
Replay: add the edits back to the MemStore


Cross-Datacenter Replication
Single "Master" cluster
Other "Slave" clusters replicate the same tables
Master cluster synchronously sends HLogs over to slave clusters
Coordination among clusters is via ZooKeeper
ZooKeeper can be used like a file system to store control information:
1. /hbase/replication/state
2. /hbase/replication/peers/<peer cluster number>
3. /hbase/replication/rs/<hlog>


Auto Sharding



Distribution



Auto Sharding and Distribution

The unit of scalability in HBase is the Region
  A sorted, contiguous range of rows
  Spread "randomly" across RegionServers
  Moved around for load balancing and failover
  Split automatically or manually to scale with growing data
Capacity is solely a factor of cluster nodes vs. Regions per node


Bloom Filter
Bloom Filters are generated when an HFile is persisted
Stored at the end of each HFile
Loaded into memory
Allow checks at the row or row+column level
Can filter entire store files out of reads
Useful when data is grouped
Also useful when many misses (non-existing keys) are expected during reads
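To show why a Bloom filter can rule out store files so cheaply, here is a toy Bloom filter (a sketch of the general technique, not HBase's on-disk format): k hash functions set bits on add, and a lookup that finds any unset bit proves the key was never added.

import scala.util.hashing.MurmurHash3

class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new Array[Boolean](numBits)

  private def positions(key: String): Seq[Int] =
    (0 until numHashes).map(seed => math.abs(MurmurHash3.stringHash(key, seed)) % numBits)

  def add(key: String): Unit = positions(key).foreach(bits(_) = true)

  // "false" means definitely absent (the read can skip this HFile);
  // "true" means possibly present (false positives are possible).
  def mightContain(key: String): Boolean = positions(key).forall(bits(_))
}

val bf = new BloomFilter(numBits = 1024, numHashes = 3)
bf.add("row17:personaldata:Name")
println(bf.mightContain("row17:personaldata:Name"))   // true
println(bf.mightContain("row99:personaldata:Name"))   // almost certainly false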







Fold, Store, and Shift
The logical layout does not match the physical one
All values are stored with their full coordinates: Row Key, Column Family, Column Qualifier, and Timestamp
Columns are folded into "one row per column"
NULLs are cost-free, as nothing is stored
Versions are multiple "rows" in the folded table


Conclusion
Traditional databases (RDBMSs) work with strong consistency and offer ACID
Modern workloads don't always need such strong guarantees, but do need fast response times (availability)
Unfortunately, the CAP theorem forces a trade-off
Key-value/NoSQL systems offer BASE
  Eventual consistency, and a variety of other consistency models striving towards strong consistency
In this lecture, we have discussed: HBase Architecture, HBase Components, Data Model, HBase Storage Hierarchy, Cross-Datacenter Replication, Auto Sharding and Distribution, Bloom Filter, and Fold, Store, and Shift

Spark Streaming and
Sliding Window Analytics

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]



Preface
Content of this Lecture:

In this lecture, we will discuss real-time big data processing with Spark Streaming and Sliding Window Analytics.

We will also discuss a case study on Twitter Sentiment Analysis using Spark Streaming.


Big Streaming Data Processing



How to Process Big Streaming Data
Scales to hundreds of nodes
Achieves low latency
Efficiently recovers from failures
Integrates with batch and interactive processing


What have people been doing?
Build two stacks – one for batch, one for streaming
  Often both process the same data
Existing frameworks cannot do both:
  Either stream processing of 100s of MB/s with low latency
  Or batch processing of TBs of data with high latency


What have people been doing?
Extremely painful to maintain two different stacks
  Different programming models
  Doubles implementation effort
  Doubles operational effort


Fault-tolerant Stream Processing
Traditional processing model
  Pipeline of nodes
  Each node maintains mutable state
  Each input record updates the state and new records are sent out
Mutable state is lost if a node fails
Making stateful stream processing fault-tolerant is challenging!


What is Streaming?
Data streaming is a technique for transferring data so that it can be processed as a steady and continuous stream.

Streaming technologies are becoming increasingly important with the growth of the Internet.


Spark Ecosystem



What is Spark Streaming?
Extends Spark for doing big data stream processing
Project started in early 2012; alpha released in Spring 2013 with Spark 0.7
Moved out of alpha in Spark 0.9
Spark Streaming has built-in support to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.
In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, which has a higher-level interface, is also provided to support streaming.


What is Spark Streaming?
Framework for large-scale stream processing
  Scales to 100s of nodes
  Can achieve second-scale latencies
  Integrates with Spark's batch and interactive processing
  Provides a simple batch-like API for implementing complex algorithms
  Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.


What is Spark Streaming?
Receives data streams from input sources, processes them in a cluster, and pushes results out to databases/dashboards

Scalable, fault-tolerant, second-scale latencies


Why Spark Streaming?
Many big-data applications need to process large data streams in real time:
  Website monitoring
  Fraud detection
  Ad monetization


Why Spark Streaming?
▪ Many important applications must process large streams of live data and provide results in near-real-time
  - Social network trends
  - Website statistics
  - Intrusion detection systems
  - etc.

▪ Require large clusters to handle workloads

▪ Require latencies of a few seconds


Why Spark Streaming?
We can use Spark Streaming to stream real-time data from various sources like Twitter, stock markets, and geographical systems, and perform powerful analytics to help businesses.


Why Spark Streaming?
Need a framework for big data stream processing that:
  Scales to hundreds of nodes
  Achieves second-scale latencies
  Efficiently recovers from failures
  Integrates with batch and interactive processing


Spark Streaming Features
Scaling: Spark Streaming can easily scale to hundreds of nodes.
Speed: It achieves low latency.
Fault Tolerance: Spark has the ability to efficiently recover from failures.
Integration: Spark integrates with batch and real-time processing.
Business Analysis: Spark Streaming can be used to track the behavior of customers, which can then be used in business analysis.


Requirements

▪ Scalable to large clusters
▪ Second-scale latencies
▪ Simple programming model
▪ Integrated with batch & interactive processing
▪ Efficient fault-tolerance in stateful computations


Batch vs Stream Processing
Batch Processing
  Ability to process and analyze data at rest (stored data)
  Request-based, bulk evaluation and short-lived processing
  Enabler for retrospective, reactive and on-demand analytics
Stream Processing
  Ability to ingest, process and analyze data in motion, in real or near-real time
  Event- or micro-batch-driven, continuous evaluation and long-lived processing
  Enabler for real-time prospective, proactive and predictive analytics for the next best action
Stream Processing + Batch Processing = All Data Analytics (real-time (now) + historical (past))


Integration with Batch Processing
Many environments require processing the same data in live streaming as well as in batch post-processing

Existing frameworks cannot do both:
  Either stream processing of 100s of MB/s with low latency
  Or batch processing of TBs of data with high latency

Extremely painful to maintain two different stacks
  Different programming models
  Double implementation effort


Stateful Stream Processing
Traditional model:
  – Processing pipeline of nodes
  – Each node maintains mutable state
  – Each input record updates the state and new records are sent out

Mutable state is lost if a node fails

Making stateful stream processing fault tolerant is challenging!


Modern Data Applications approach to Insights



Existing Streaming Systems

Storm
  Replays a record if it is not processed by a node
  Processes each record at least once
  May update mutable state twice!
  Mutable state can be lost due to failure!

Trident – uses transactions to update state
  Processes each record exactly once
  Per-state transactions to an external database are slow


How does Spark Streaming work?
Run a streaming computation as a series of very small, deterministic batch jobs
▪ Chop up the live data stream into batches of X seconds
▪ Spark treats each batch of data as an RDD and processes it using RDD operations
▪ Finally, the processed results of the RDD operations are returned in batches


How does Spark Streaming work?
Run a streaming computation as a series of very small, deterministic batch jobs
▪ Batch sizes as low as ½ second, latency of about 1 second
▪ Potential for combining batch processing and streaming processing in the same system


Word Count with Kafka

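The original slide's figure is not reproduced here; as a hedged substitute, the sketch below shows a word count over a Kafka topic using the spark-streaming-kafka-0-10 direct stream. The broker address localhost:9092, the topic name "lines", and the group id are assumptions used only for illustration.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "wordcount-group",
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("lines"), kafkaParams))

    // Classic word count over each micro-batch of Kafka records.
    val counts = stream.map(_.value())
                       .flatMap(_.split(" "))
                       .map((_, 1))
                       .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}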


Any Spark Application



Spark Streaming Application: Receive data



Spark Streaming Application: Process data



Spark Streaming Architecture
Micro-batch architecture.
Operates on intervals of time: new batches are created at regular time intervals.
Divides each received batch into blocks for parallelism.
Each batch is a graph that translates into multiple jobs.
Has the ability to create larger batch windows as it processes over time.


Spark Streaming Workflow
The Spark Streaming workflow has four high-level stages. The first is to stream data from various sources. These can be streaming data sources like Akka, Kafka, Flume, AWS or Parquet for real-time streaming, or sources such as HBase, MySQL, PostgreSQL, Elasticsearch, MongoDB and Cassandra for static/batch streaming.

Once this happens, Spark can be used to perform Machine Learning on the data through its MLlib API, and Spark SQL can be used to perform further operations on this data. Finally, the streaming output can be stored in various data storage systems like HBase, Cassandra, MemSQL, Kafka, Elasticsearch, HDFS and the local file system.




Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream()

DStream: a sequence of RDDs representing a stream of data

[Diagram] The Twitter Streaming API feeds the tweets DStream: batch @ t, batch @ t+1, batch @ t+2, each stored in memory as an RDD (immutable, distributed).


Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram] Each batch of the tweets DStream (batch @ t, t+1, t+2) is passed through flatMap to produce the corresponding RDD of the new hashTags DStream ([#cat, #dog, …]); new RDDs are created for every batch.


Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage

[Diagram] Each batch of the tweets DStream is flatMapped into the hashTags DStream, and every batch is saved to HDFS.


Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })

foreach: do whatever you want with the processed data

[Diagram] Each hashTags RDD is handed to a foreach block: write to a database, update an analytics UI, do whatever you want.


Java Example

Scala
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))

hashTags.saveAsHadoopFiles("hdfs://...")

Java
JavaDStream<Status> tweets = ssc.twitterStream()
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
hashTags.saveAsHadoopFiles("hdfs://...")

(flatMap takes a Function object)


Fault-tolerance
▪ RDDs remember the sequence of operations that created them from the original fault-tolerant input data

▪ Batches of input data are replicated in memory of multiple worker nodes, and are therefore fault-tolerant

▪ Data lost due to worker failure can be recomputed from the replicated input data

[Diagram] The tweets RDD (input data replicated in memory) is flatMapped into the hashTags RDD; lost partitions are recomputed on other workers.
Key concepts
DStream – sequence of RDDs representing a stream of data
  Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
Transformations – modify data from one DStream to create another
  Standard RDD operations – map, countByValue, reduce, join, …
  Stateful operations – window, countByValueAndWindow, …
Output Operations – send data to an external entity
  saveAsHadoopFiles – saves to HDFS
  foreach – do anything with each batch of results


Example 2 – Count the hashtags

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)


val hashTags = tweets.flatMap (status => getTags(status))

val tagCounts = hashTags.countByValue()

[Diagram] For each batch (t, t+1, t+2), tweets are flatMapped into hashTags, then map/reduceByKey produce tagCounts, e.g. [(#cat, 10), (#dog, 25), ... ]


Example 3 – Count the hashtags over last 10 mins
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

sliding window operation: window length, sliding interval

[Diagram] A window of the given length slides over the DStream of data, advancing by the sliding interval.


Example 3 – Counting the hashtags over last 10 mins

val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

[Diagram] Batches t-1 … t+3 of hashTags fall into a sliding window; countByValue produces tagCounts by counting over all the data in the window.


Smart window-based countByValue

val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

[Diagram] Instead of recounting the whole window, the new tagCounts is computed incrementally: add the counts from the new batch that enters the window and subtract the counts from the batch that falls out of the window.


Smart window-based reduce

The technique to incrementally compute counts generalizes to many reduce operations

Need a function to "inverse reduce" ("subtract" for counting)

Could have implemented counting as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)


Arbitrary Stateful Computations

Specify a function to generate new state based on the previous state and new data

Example: Maintain per-user mood as state, and update it with their tweets

def updateMood(newTweets, lastMood) => newMood

moods = tweetsByUser.updateStateByKey(updateMood _)
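A self-contained sketch of such a stateful computation (assuming a local socket source on port 9999 rather than the Twitter stream): updateStateByKey folds each batch's per-hashtag counts into a running total, which requires checkpointing to be enabled.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningHashtagCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RunningHashtagCounts").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/streaming-checkpoints")      // stateful operations require checkpointing

    // Hypothetical text source; each line contains whitespace-separated words.
    val lines    = ssc.socketTextStream("localhost", 9999)
    val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))

    // Merge this batch's counts into the running total per hashtag.
    val updateCount = (newValues: Seq[Int], running: Option[Int]) =>
      Some(newValues.sum + running.getOrElse(0))

    val runningCounts = hashTags.map((_, 1)).updateStateByKey[Int](updateCount)
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}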


Arbitrary Combinations of Batch and Streaming
Computations

Inter-mix RDD and DStream operations!

Example: Join incoming tweets with a spam HDFS file to filter out bad tweets

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})


Spark Streaming – DStreams, Batches and RDDs

• These steps repeat for each batch, continuously, because we are dealing with streaming data.
• Spark Streaming has the ability to "remember" the previous RDDs… to some extent.


DStreams + RDDs = Power
Online machine learning
  Continuously learn and update data models (updateStateByKey and transform)
Combine live data streams with historical data
  Generate historical data models with Spark, etc.
  Use data models to process live data streams (transform)
CEP-style processing
  Window-based operations (reduceByWindow, etc.)


From DStreams to Spark Jobs
Every interval, an RDD graph is computed from the DStream graph
For each output operation, a Spark action is created
For each action, a Spark job is created to compute it


Input Sources
Out of the box, we provide:
  Kafka, HDFS, Flume, Akka Actors, raw TCP sockets, etc.
Very easy to write a receiver for your own data source
Also, generate your own RDDs from Spark, etc. and push them in as a "stream"


Current Spark Streaming I/O



DStream Classes
Different classes for different languages (Scala, Java)
DStream has 36 value members
Multiple types of DStreams
Separate Python API


Spark Streaming Operations



Fault-tolerance
Batches of input data are replicated in memory for fault-tolerance

Data lost due to worker failure can be recomputed from the replicated input data

▪ All transformations are fault-tolerant, and exactly-once transformations

[Diagram] The tweets RDD (input data replicated in memory) is flatMapped into the hashTags RDD; lost partitions are recomputed on other workers.




Performance
Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency

[Plots: Cluster throughput (GB/s) vs. number of nodes in the cluster for Grep and WordCount, at 1 sec and 2 sec batch sizes.]


Comparison with other systems
Higher throughput than Storm
Spark Streaming: 670k records/sec/node
Storm: 115k records/sec/node

Commercial systems: 100-500k records/sec/node

[Plots: Throughput per node (MB/s) vs. record size (bytes) for Grep and WordCount, Spark Streaming vs. Storm.]


Fast Fault Recovery

Recovers from faults/stragglers within 1 sec



Real time application: Mobile Millennium Project

Traffic transit time estimation using online machine learning on GPS observations

▪ Markov-chain Monte Carlo simulations on GPS observations
▪ Very CPU intensive; requires dozens of machines for useful computation
▪ Scales linearly with cluster size

[Plot: GPS observations processed per second vs. number of nodes in the cluster.]


Vision - one stack to rule them all



Spark program vs Spark Streaming program

Spark Streaming program on Twitter stream


val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Spark program on Twitter log file
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")


Advantage of a unified stack

Explore data interactively to identify problems:
$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
scala> val filtered = file.filter(_.contains("ERROR"))
scala> val mapped = filtered.map(...)

Use the same code in Spark for processing large logs:
object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

Use similar code in Spark Streaming for realtime processing:
object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}


Roadmap
Spark 0.8.1
  Marked alpha, but has been quite stable
  Master fault tolerance – manual recovery
  • Restart computation from a checkpoint file saved to HDFS
Spark 0.9 in Jan 2014 – out of alpha!
  Automated master fault recovery
  Performance optimizations
  Web UI, and better monitoring capabilities
Spark v2.4.0 released on November 2, 2018


Sliding Window Analytics



Spark Streaming Windowing Capabilities
Parameters:
  Window length: duration of the window
  Sliding interval: interval at which the window operation is performed
Both parameters must be multiples of the batch interval
A window creates a new DStream with a larger batch size


Spark Window Functions
Spark Window Functions for DataFrames and SQL

Introduced in Spark 1.4, Spark window functions improved the expressiveness of Spark DataFrames and Spark SQL. With window functions, you can easily calculate a moving average or cumulative sum, or reference a value in a previous row of a table. Window functions allow you to do many common calculations with DataFrames, without having to resort to RDD manipulation.

Aggregates, UDFs vs. Window functions
Window functions are complementary to existing DataFrame operations: aggregates, such as sum and avg, and UDFs. To review, aggregates calculate one result, a sum or average, for each group of rows, whereas UDFs calculate one result for each row based only on data in that row. In contrast, window functions calculate one result for each row based on a window of rows. For example, in a moving average, you calculate for each row the average of the rows surrounding the current row; this can be done with window functions.
Moving Average Example
Let us dive right into the moving average example. In this example dataset, there are two customers who have spent different amounts of money each day.

// Building the customer DataFrame. All examples are written in
// Scala with Spark 1.6.1, but the same can be done in Python or SQL.
val customers = sc.parallelize(List(("Alice", "2016-05-01", 50.00),
                                    ("Alice", "2016-05-03", 45.00),
                                    ("Alice", "2016-05-04", 55.00),
                                    ("Bob", "2016-05-01", 25.00),
                                    ("Bob", "2016-05-04", 29.00),
                                    ("Bob", "2016-05-06", 27.00))).
  toDF("name", "date", "amountSpent")


Moving Average Example
// Import the window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Create a window spec.
val wSpec1 =
  Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)

In this window spec, the data is partitioned by customer. Each customer's data is ordered by date. And the window frame is defined as starting from -1 (one row before the current row) and ending at 1 (one row after the current row), for a total of 3 rows in the sliding window.


Moving Average Example
// Calculate the moving average.
customers.withColumn("movingAvg",
  avg(customers("amountSpent")).over(wSpec1)).show()

This code adds a new column, "movingAvg", by applying the avg function on the sliding window defined in the window spec.


Window function and Window Spec definition
As shown in the above example, there are two parts to applying a window function: (1) specifying the window function, such as avg in the example, and (2) specifying the window spec, or wSpec1 in the example. For (1), you can find a full list of the window functions here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
You can use functions listed under "Aggregate Functions" and "Window Functions".

For (2), specifying a window spec, there are three components: partition by, order by, and frame.
1. "Partition by" defines how the data is grouped; in the above example, it was by customer. You have to specify a reasonable grouping because all data within a group will be collected to the same machine. Ideally, the DataFrame has already been partitioned by the desired grouping.
2. "Order by" defines how rows are ordered within a group; in the above example, it was by date.
3. "Frame" defines the boundaries of the window with respect to the current row; in the above example, the window ranged between the previous row and the next row.


Cumulative Sum
Next, let us calculate the cumulative sum of the amount spent per customer.
// Window spec: the frame ranges from the beginning (Long.MinValue) to the current row (0).
val wSpec2 =
  Window.partitionBy("name").orderBy("date").rowsBetween(Long.MinValue, 0)

// Create a new column which calculates the sum over the defined window frame.
customers.withColumn("cumSum",
  sum(customers("amountSpent")).over(wSpec2)).show()


Data from previous row
In the next example, we want to see the amount spent by the customer
in their previous visit.
// Window spec. No need to specify a frame in this case.
val wSpec3 = Window.partitionBy("name").orderBy("date")

// Use the lag function to look backwards by one row.
customers.withColumn("prevAmountSpent",
  lag(customers("amountSpent"), 1).over(wSpec3)).show()


Rank
In this example, we want to know the order of a customer’s
visit (whether this is their first, second, or third visit).
// The rank function returns what we want.

customers.withColumn("rank", rank().over(wSpec3)).show()


Case Study: Twitter Sentiment Analysis with Spark Streaming


Case Study: Twitter Sentiment Analysis
Trending topics can be used to create campaigns and attract a larger audience. Sentiment analytics helps in crisis management, service adjusting and target marketing.
Sentiment refers to the emotion behind a social media mention online.
Sentiment Analysis is categorising the tweets related to a particular topic and performing data mining using sentiment automation analytics tools.
We will be performing Twitter Sentiment Analysis as a use case for Spark Streaming.


Problem Statement
To design a Twitter Sentiment Analysis system where we populate real-time sentiments for crisis management, service adjusting and target marketing.

Sentiment Analysis is used to:
  Predict the success of a movie
  Predict political campaign success
  Decide whether to invest in a certain company
  Target advertising
  Review products and services


Importing Packages



Twitter Token Authorization



DStream Transformation

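The code shown in the lecture's screenshots is not reproduced here; the sketch below is a hedged reconstruction of the two steps just named (token authorization and the DStream transformation), assuming the spark-streaming-twitter connector (twitter4j underneath) and placeholder OAuth credentials.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterHashtagStream {
  def main(args: Array[String]): Unit = {
    // Hypothetical OAuth credentials; twitter4j reads them from system properties.
    System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
    System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
    System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")

    val conf = new SparkConf().setAppName("TwitterSentiment").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // DStream of twitter4j Status objects.
    val tweets   = TwitterUtils.createStream(ssc, None)
    // DStream transformation: extract hashtags from the tweet text and count them per batch.
    val hashTags = tweets.flatMap(_.getText.split(" ").filter(_.startsWith("#")))
    hashTags.countByValue().print()

    ssc.start()
    ssc.awaitTermination()
  }
}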


Results



Sentiment for Trump



Applying Sentiment Analysis
As we have seen from our Sentiment Analysis demonstration, we can extract sentiments of particular topics just like we did for 'Trump'. Similarly, sentiment analytics can be used in crisis management, service adjusting and target marketing by companies around the world.

Companies using Spark Streaming for Sentiment Analysis have applied the same approach to achieve the following:
1. Enhancing the customer experience
2. Gaining competitive advantage
3. Gaining business intelligence
4. Revitalizing a losing brand


References

https://spark.apache.org/streaming/

Streaming programming guide – spark.incubator.apache.org/docs/latest/streaming-programming-guide.html

https://databricks.com/speaker/tathagata-das


Conclusion
▪ Stream processing framework that ...
  - Is scalable to large clusters
  - Achieves second-scale latencies
  - Has a simple programming model
  - Integrates with batch & interactive workloads
  - Ensures efficient fault-tolerance in stateful computations



Introduction to Kafka

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]



Preface
Content of this Lecture:

Define Kafka
Use cases for Kafka
Kafka data model
Kafka architecture
Types of messaging systems
Importance of brokers

Batch vs. Streaming



Introduction: Apache Kafka
Kafka is a high-performance, real-time messaging system. It is an open source tool and is a part of the Apache projects.

The characteristics of Kafka are:
1. It is a distributed and partitioned messaging system.
2. It is highly fault-tolerant.
3. It is highly scalable.
4. It can process and send millions of messages per second to several receivers.


Kafka History
Apache Kafka was originally developed by LinkedIn and later handed over to the open source community in early 2011.
It became a main Apache project in October 2012.
A stable Apache Kafka version 0.8.2.0 was released in February 2015.
A stable Apache Kafka version 0.8.2.1 was released in May 2015, which was the latest version at the time of this lecture.


Kafka Use Cases
Kafka can be used for various purposes in an organization,
such as:



Apache Kafka: a Streaming Data Platform
➢ Most of what a business does can be thought of as event streams. They are in a
  • Retail system: orders, shipments, returns, …
  • Financial system: stock ticks, orders, …
  • Web site: page views, clicks, searches, …
  • IoT: sensor readings, …
  and so on.


Enter Kafka
Adopted at 1000s of companies worldwide



Aggregating User Activity Using Kafka – Example

Kafka can be used to aggregate user activity data such as clicks, navigation, and searches from the different websites of an organization; such user activities can be sent to a real-time monitoring system and to a Hadoop system for offline processing.


Kafka Data Model
The Kafka data model consists of messages and topics.
  Messages represent information such as lines in a log file, a row of stock market data, or an error message from a system.
  Messages are grouped into categories called topics. Example: LogMessage and StockMessage.
  The processes that publish messages into a topic in Kafka are known as producers.
  The processes that receive the messages from a topic in Kafka are known as consumers.
  The processes or servers within Kafka that process the messages are known as brokers.
  A Kafka cluster consists of a set of brokers that process the messages.


Topics
A topic is a category of messages in Kafka.
The producers publish the messages into topics.
The consumers read the messages from topics.
A topic is divided into one or more partitions.

A partition is also known as a commit log.
Each partition contains an ordered set of messages.
Each message is identified by its offset in the partition.
Messages are added at one end of the partition and consumed at the other.


Partitions
Topics are divided into partitions, which are the unit of parallelism in Kafka.
Partitions allow messages in a topic to be distributed to multiple servers.
A topic can have any number of partitions.
Each partition should fit in a single Kafka server.
The number of partitions decides the parallelism of the topic.
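For illustration, a topic with a chosen partition count can be created programmatically with the Kafka AdminClient; in the sketch below, the broker address and the topic name "user-activity" are assumptions, and the topic is created with 3 partitions and replication factor 2.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")          // hypothetical broker address

val admin = AdminClient.create(props)
val topic = new NewTopic("user-activity", 3, 2.toShort)   // name, numPartitions, replicationFactor
admin.createTopics(Collections.singletonList(topic)).all().get()   // block until the request completes
admin.close()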


Partition Distribution
Partitions can be distributed across the Kafka cluster.
Each Kafka server may handle one or more partitions.
A partition can be replicated across several servers for fault-tolerance.
One server is marked as the leader for the partition and the others are marked as followers.
The leader controls the reads and writes for the partition, whereas the followers replicate the data.
If a leader fails, one of the followers automatically becomes the leader.
ZooKeeper is used for the leader selection.


Producers
The producer is the creator of the message in Kafka.
The producers place the message into a particular topic.
The producers also decide which partition to place the message into.
Topics should already exist before a message is placed by the producer.
Messages are added at one end of the partition.
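A minimal producer sketch using the Kafka Java client from Scala (the broker address, topic name, key, and value below are assumptions): the record names the topic, and the key, when present, determines the target partition.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")        // hypothetical broker address
props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Publish one message to the "user-activity" topic; the key routes it to a partition.
producer.send(new ProducerRecord[String, String]("user-activity", "user42", "clicked /home"))
producer.flush()
producer.close()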


Consumers
The consumer is the receiver of the message in Kafka.
Each consumer belongs to a consumer group.
A consumer group may have one or more consumers.
The consumers specify what topics they want to listen to.
A message is delivered to one consumer in each consumer group that subscribes to the topic.
The consumer groups are used to control the messaging system.
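A matching consumer sketch (same assumed broker and topic): the consumer joins a consumer group, subscribes to the topic, and polls for batches of records.

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers",  "localhost:9092")        // hypothetical broker address
props.put("group.id",           "activity-monitor")      // consumer group this consumer belongs to
props.put("key.deserializer",   "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("user-activity"))

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))    // fetch the next batch of messages
  records.forEach(r => println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}"))
}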


Kafka Architecture
The Kafka architecture consists of brokers that take messages from the producers and add them to a partition of a topic. Brokers provide the messages to the consumers from the partitions.
• A topic is divided into multiple partitions.
• The messages are added to the partitions at one end and consumed in the same order.
• Each partition acts as a message queue.
• Consumers are divided into consumer groups.
• Each message is delivered to one consumer in each consumer group.
• ZooKeeper is used for coordination.


Types of Messaging Systems
Kafka architecture supports the publish-subscribe and queue system.



Example: Queue System



Example: Publish-Subscribe System



Brokers
Brokers are the Kafka processes that process the messages in Kafka.

• Each machine in the cluster can run one broker.

• They coordinate with each other using ZooKeeper.
• One broker acts as the leader for a partition and handles the delivery and persistence, whereas the others act as followers.


Kafka Guarantees
Kafka guarantees the following:

1. Messages sent by a producer to a topic and partition are appended in the same order.
2. A consumer instance gets the messages in the same order as they are produced.
3. A topic with replication factor N tolerates up to N-1 server failures.


Replication in Kafka
Kafka uses the primary-backup method of replication.
One machine (one replica) is called the leader and is chosen as the primary; the remaining machines (replicas) are chosen as the followers and act as backups.
The leader propagates the writes to the followers.
The leader waits until the writes are completed on all the replicas.
If a replica is down, it is skipped for the write until it comes back.
If the leader fails, one of the followers will be chosen as the new leader; this mechanism can tolerate n-1 failures if the replication factor is 'n'.
Persistence in Kafka
Kafka uses the Linux file system for persistence of messages
Persistence ensures no messages are lost.
Kafka relies on the file system page cache for fast reads

EL
and writes.
All the data is immediately written to a file in file system.

writes. PT
Messages are grouped as message sets for more efficient

Message sets can be compressed to reduce network


N
bandwidth.
A standardized binary message format is used among
producers, brokers, and consumers to minimize data
modification.
Big Data Computing Vu Pham Introduction to Kafka
Apache Kafka: a Streaming Data Platform
➢ Apache Kafka is an open source streaming data platform (a new category of software!) with 3 major components:
1. Kafka Core: A central hub to transport and store event streams in real time.
2. Kafka Connect: A framework to import event streams from other source data systems into Kafka and to export event streams from Kafka to destination data systems.
3. Kafka Streams: A Java library to process event streams live as they occur.


Further Learning
o Kafka Streams code examples
  o Apache Kafka: https://github.com/apache/kafka/tree/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples
  o Confluent: https://github.com/confluentinc/examples/tree/master/kafka-streams
o Source code: https://github.com/apache/kafka/tree/trunk/streams
o Kafka Streams Java docs: http://docs.confluent.io/current/streams/javadocs/index.html
o First book on Kafka Streams (MEAP): Kafka Streams in Action, https://www.manning.com/books/kafka-streams-in-action
o Kafka Streams download
  o Apache Kafka: https://kafka.apache.org/downloads
  o Confluent Platform: http://www.confluent.io/download


Conclusion
Kafka is a high-performance, real-time messaging system.
Kafka can be used as an external commit log for distributed systems.
The Kafka data model consists of messages and topics.
The Kafka architecture consists of brokers that take messages from the producers and add them to a partition of a topic.
The Kafka architecture supports two types of messaging systems: publish-subscribe and queue.
Brokers are the Kafka processes that process the messages in Kafka.
