Week-5 - Lecture Notes
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Contents of this lecture:
What is HBase?
HBase Architecture
HBase Components
Data Model
HBase Storage Hierarchy
Cross-Datacenter Replication
Auto Sharding and Distribution
Bloom Filter and Fold, Store, and Shift
HBase is:
An open-source NoSQL database.
A distributed, column-oriented data store that can scale horizontally to thousands of commodity servers and petabytes of indexed storage.
Designed to operate on top of the Hadoop Distributed File System (HDFS) for scalability, fault tolerance, and high availability.
An implementation of the BigTable storage architecture, a distributed storage system developed by Google.
Works with structured, semi-structured, and unstructured data.
Facebook uses HBase internally.
API functions:
Get/Put(row)
Scan(row range, filter) – range queries
MultiPut
Unlike Cassandra, HBase prefers consistency (over availability).
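A minimal sketch of these API calls, written in Scala against the standard HBase Java client (assuming an HBase 2.x client on the classpath); the table name, row keys, and connection settings are illustrative assumptions, not from the lecture:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes

object HBaseApiSketch {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("users")) // hypothetical table

    // Put(row): write one cell, addressed as columnFamily:qualifier
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("personaldata"), Bytes.toBytes("Name"), Bytes.toBytes("Alice"))
    table.put(put)

    // Get(row): read the row back
    val result = table.get(new Get(Bytes.toBytes("row1")))
    val name = Bytes.toString(result.getValue(Bytes.toBytes("personaldata"), Bytes.toBytes("Name")))
    println(name)

    // Scan(row range): range query over the sorted row keys
    val scanner = table.getScanner(
      new Scan().withStartRow(Bytes.toBytes("row0")).withStopRow(Bytes.toBytes("row9")))
    scanner.close()

    // MultiPut corresponds to table.put(java.util.List[Put])
    table.close(); conn.close()
  }
}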
HBase Components
HMaster: monitors all RegionServer instances in the cluster.
Regions: the basic element of availability and distribution for tables; each region is divided by column families into "stores".
RegionServer: serves and manages regions. In a distributed cluster, a RegionServer runs on a DataNode.
Data Model
Data stored in HBase is located by its "rowkey".
The rowkey is like a primary key in a relational database.
Records in HBase are stored in sorted order, according to rowkey.
Tables are divided into sequences of rows, by key range, called regions.
These regions are then assigned to the data nodes in the cluster, called "RegionServers".
A column is identified by a Column Qualifier, which consists of the Column Family name concatenated with the Column name using a colon, e.g. personaldata:Name.
Column families are mapped to storage files and are stored in separate files, which can also be accessed separately.
Cell in HBase Table
Data is stored in the cells of HBase tables.
A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp.
The key consists of the row key, column name, and timestamp.
The entire cell, with the added structural information, is called a KeyValue.
Row: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[ ] (byte array).
Column Qualifier: Like row keys, column qualifiers do not have a data type and are always treated as a byte[ ].
Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell's value.
Versions: If the timestamp is not specified for a read, the latest value is returned. The number of cell value versions retained by HBase is configured per column family. The default number of cell versions is three.
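As an illustration (not from the lecture), the hedged sketch below raises the retained version count for a column family at table-creation time and reads several versions of a cell back; all names are hypothetical and the calls assume the HBase 2.x client API:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, Get, TableDescriptorBuilder}
import org.apache.hadoop.hbase.util.Bytes

// Keep up to 5 versions of each cell in family "personaldata" (default is 3)
val family = ColumnFamilyDescriptorBuilder
  .newBuilder(Bytes.toBytes("personaldata"))
  .setMaxVersions(5)
  .build()
val tableDesc = TableDescriptorBuilder
  .newBuilder(TableName.valueOf("users")) // hypothetical table
  .setColumnFamily(family)
  .build()
// created via Connection.getAdmin().createTable(tableDesc)

// Read up to 3 versions of a row's cells; the newest version comes first
val get = new Get(Bytes.toBytes("row1")).readVersions(3)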
HBase Storage Hierarchy
[Diagram: each HRegionServer has one HLog and hosts several HRegions; each HRegion holds one Store per column family; a Store consists of a MemStore plus StoreFiles (HFiles); everything is persisted on HDFS]
• One Store per combination of ColumnFamily + region
  – MemStore for each Store: in-memory updates to the Store; flushed to disk when full
    » StoreFiles for each Store, for each region: where the data lives
      - HFile: the file format of a StoreFile; corresponds to the SSTable from Google's BigTable
HFile
[Diagram: an HFile is a magic header followed by a sequence of (key, value) records. Each record stores: key length, value length, row length, row, column family length, column family, column qualifier, timestamp, and key type, followed by the value. Example HBase key: row SSN:000-01-2345, column family Demographic Information, column qualifier Ethnicity]
HLog (Write-Ahead Log)
[Diagram: on an HRegionServer, an incoming write is appended to the HLog and then applied to the MemStore of the owning Store; the MemStore is later flushed to StoreFiles (HFiles)]
Write to the HLog before writing to the MemStore.
This helps recover from failure by replaying the HLog.
Recovery: replay any stale logs (use timestamps to find out where the database is with respect to the logs).
Replay: add the edits to the MemStore.
Cross-Datacenter Replication
Coordination among clusters is via Zookeeper.
Zookeeper can be used like a file system to store control information:
1. /hbase/replication/state
2. /hbase/replication/peers/<peer cluster number>
3. /hbase/replication/rs/<hlog>
Auto Sharding and Distribution
Regions are spread "randomly" across RegionServers.
They are moved around for load balancing and failover.
They are split automatically or manually to scale with growing data.
Bloom Filter
Allows a check at the row + column level.
Can filter entire store files from reads.
Useful when data is grouped.
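As a hedged illustration (not from the lecture), Bloom filters are enabled per column family; with the HBase 2.x client API this looks roughly like:

import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder
import org.apache.hadoop.hbase.regionserver.BloomType
import org.apache.hadoop.hbase.util.Bytes

// ROWCOL checks row + column on reads; ROW checks the row key only
val familyWithBloom = ColumnFamilyDescriptorBuilder
  .newBuilder(Bytes.toBytes("personaldata")) // hypothetical family
  .setBloomFilterType(BloomType.ROWCOL)
  .build()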
Fold, Store, and Shift
The stored key consists of Row Key, Column Family, Column Qualifier, and Timestamp.
Storage folds columns into "one row per column": each cell is stored as its own key-value.
NULLs are cost-free, as nothing is stored.
Unfortunately, the CAP theorem applies.
Key-value/NoSQL systems offer BASE:
Eventual consistency, and a variety of other consistency models striving towards strong consistency.
In this lecture, we have discussed:
HBase Architecture, HBase Components, Data model,
HBase Storage Hierarchy, Cross-Datacenter
Replication, Auto Sharding and Distribution, Bloom
Filter and Fold, Store, and Shift
Spark Streaming and
Sliding Window Analytics
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
In this lecture, we will discuss Spark Streaming and Sliding Window Analytics.
We will also discuss a case study based on Twitter Sentiment Analysis using Spark Streaming.
Spark Streaming integrates with batch and interactive processing.
Existing frameworks cannot do both:
Either stream processing of 100s of MB/s with low latency, or batch processing of TBs of data with high latency.
Running two separate stacks doubles the operational effort.
Each input record updates the state, and new records are sent out.
Mutable state is lost if a node fails.
Making stateful stream processing fault-tolerant is challenging!
Streaming technologies are becoming increasingly important with the growth of the Internet.
Spark Streaming can ingest data from sources such as Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.
In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, which has a higher-level interface, is also provided to support streaming.
Spark Streaming:
Can achieve second-scale latencies.
Integrates with Spark's batch and interactive processing.
Provides a simple batch-like API for implementing complex algorithms.
Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
Scalable, fault-tolerant, second-scale latencies.
Website monitoring
Fraud detection
Ad monetization
- Website statistics
- Intrusion detection systems
- etc.
▪ Require large clusters to handle workloads
Use cases: website monitoring, fraud detection, ad monetization.
Spark Streaming scales to hundreds of nodes, achieves second-scale latencies, and efficiently recovers from failures.
Integration: Spark integrates batch and real-time processing.
Business analysis: Spark Streaming is used to track the behavior of customers, which can be used in business analysis.
▪ Simple programming model
▪ Integrated with batch & interactive processing
▪ Efficient fault-tolerance in stateful computations
Stream Processing
Ability to ingest, process, and analyze data in motion, in real or near-real time.
Event- or micro-batch-driven; continuous evaluation and long-lived processing.
An enabler for real-time prospective, proactive, and predictive analytics for next best action.
Stream Processing + Batch Processing = All Data Analytics
real-time (now) + historical (past)
Either stream processing of 100s of MB/s with low latency, or batch processing of TBs of data with high latency.
It is extremely painful to maintain two different stacks:
Different programming models
Double the implementation effort
– Each node maintains mutable state records.
– Each input record updates the state, and new records are sent out.
Mutable state is lost if a node fails.
[Diagram: input records flow into node 1, which feeds nodes 2 and 3]
Making stateful stream processing fault-tolerant is challenging!
Storm
Replays a record if it is not processed by a node.
Processes each record at least once.
May update mutable state twice!
Mutable state can be lost due to failure!
Trident – uses transactions to update state.
Processes each record exactly once.
Per-state transactions to an external database are slow.
▪ Spark Streaming chops the live stream into batches of X seconds.
▪ Spark treats each batch of data as RDDs and processes them using RDD operations.
▪ Finally, the processed results of the RDD operations are returned in batches.
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
▪ This micro-batch design gives a latency of about 1 second.
▪ Potential for combining batch processing and streaming processing in the same system.
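A minimal, self-contained sketch of this micro-batch model (assumed names throughout; a socket source on localhost:9999 stands in for any live stream):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Each micro-batch covers 1 second of input ("batches of X seconds")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999) // assumed test source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // processed results are emitted once per batch

    ssc.start()
    ssc.awaitTermination()
  }
}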
Spark Streaming receives input data at regular time intervals.
It divides each received batch into blocks for parallelism.
Each batch is a graph that translates into multiple jobs.
It has the ability to create larger-size batch windows as it processes over time.
Once this happens, Spark can be used to perform machine learning on the data through its MLlib API, and Spark SQL can be used to perform further operations on this data. Finally, the streaming output can be stored into various data storage systems like HBase, Cassandra, MemSQL, Kafka, Elasticsearch, HDFS, and the local file system.
[Diagram: Twitter Streaming API → tweets DStream, divided into batch @ t, batch @ t+1, batch @ t+2; each batch is stored in memory as an RDD (immutable, distributed)]
transformation: modify data in one DStream to create another DStream
[Diagram: tweets DStream (batch @ t, t+1, t+2) → flatMap on each batch → hashTags DStream, e.g. [#cat, #dog, …]; new RDDs are created for every batch]
output operation: push data to external storage
[Diagram: tweets DStream → flatMap on each batch → hashTags DStream → every batch saved to HDFS]
foreach: do whatever you want with the processed data
[Diagram: tweets DStream → flatMap on each batch → hashTags DStream → foreach on every batch]
Scala
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java
JavaDStream<Status> tweets = ssc.twitterStream()
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { }) // Function object
hashTags.saveAsHadoopFiles("hdfs://...")
Input data is replicated in memory of multiple worker nodes, and is therefore fault-tolerant.
Lost partitions of the hashTags RDD can be recomputed on other workers.
Transformations – modify data from one DStream to another
  Standard RDD operations – map, countByValue, reduce, join, …
  Stateful operations – window, countByValueAndWindow, …
Output operations – send data to an external entity
  saveAsHadoopFiles – saves to HDFS
  foreach – do anything with each batch of results
val tagCounts = hashTags.countByValue()
[Diagram: tweets → flatMap → hashTags → map → reduceByKey, applied per batch (t, t+1, t+2) → tagCounts, e.g. [(#cat, 10), (#dog, 25), ... ]]
sliding window operation
[Diagram: a window of a given window length slides over the DStream of data by a given sliding interval]
[Diagram: hashTags batches at t-1 … t+3; a sliding window covers several batches; countByValue over the window yields tagCounts, a count over all the data in the window]
[Diagram: instead of recounting the whole window, add the counts from the new batch entering the window and subtract the counts from the batch that left the window to update tagCounts incrementally]
This needs a function to "inverse reduce" ("subtract", for counting).
Could have implemented counting as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
Example: maintain per-user mood as state, and update it with their tweets:
def updateMood(newTweets, lastMood) => newMood
moods = tweetsByUser.updateStateByKey(updateMood _)
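In actual Scala the update function passed to updateStateByKey has the signature (Seq[V], Option[S]) => Option[S]. A hedged sketch with assumed types (a running tweet count stands in for "mood"; checkpointing is required for stateful operations):

// tweetsByUser: DStream[(String, String)] of (userId, tweetText); assumed to exist
ssc.checkpoint("hdfs://.../checkpoints") // assumed checkpoint directory

def updateFunc(newTweets: Seq[String], state: Option[Int]): Option[Int] =
  Some(state.getOrElse(0) + newTweets.size)

val moods = tweetsByUser.updateStateByKey(updateFunc _)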
Example: join incoming tweets with a spam HDFS file to filter out bad tweets:
tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
• These steps repeat for each batch, continuously.
• Because we are dealing with streaming data, Spark Streaming has the ability to "remember" the previous RDDs, to some extent.
Combine live data streams with historical data:
Generate historical data models with Spark, etc.
Use the data models to process the live data stream (transform).
CEP-style processing:
Window-based operations (reduceByWindow, etc.)
It is very easy to write a receiver for your own data source.
You can also generate your own RDDs from Spark, etc. and push them in as a "stream".
DStream has 36 value members.
There are multiple types of DStreams.
A separate Python API exists.
[Diagram: the tweets RDD holds input data replicated in memory of multiple worker nodes; lost partitions of the derived hashTags RDD are recomputed on other workers]
▪ All transformations are fault-tolerant and exactly-once.
Commercial systems: 100–500k records/sec/node.
[Figure: cluster throughput (GB/s) versus record size (100–1000 bytes) for Grep and WordCount; Spark sustains a higher per-node rate (MB/s) than Storm in both benchmarks]
▪ Markov-chain Monte Carlo simulations on streaming observations
▪ Very CPU-intensive; requires dozens of machines for useful computation
▪ Scales linearly with cluster size
[Figure: processing rate versus number of nodes in the cluster (0 to 80), showing linear scaling]
Spark Streaming program on a Twitter stream:
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Spark program on a Twitter log file:
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")
Use the same code in Spark for processing large logs:

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

Use similar code in Spark Streaming for real-time processing:

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Spark 0.9 in Jan 2014 – out of alpha!
Automated master fault recovery
Performance optimizations
Web UI and better monitoring capabilities
Spark v2.4.0 was released on November 2, 2018.
Both of the window parameters (window length and sliding interval) must be a multiple of the batch interval.
A window creates a new DStream with a larger batch size.
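For instance (an assumed sketch, consistent with the reduceByKeyAndWindow call shown earlier), a 60-second window over the hashTags DStream that slides every 10 seconds:

// With a batch interval of Seconds(1), both 60s and 10s are valid multiples
val windowedTags = hashTags.window(Seconds(60), Seconds(10))
windowedTags.countByValue().print()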
Window functions allow you to do many common calculations with DataFrames, without having to resort to RDD manipulation.
Aggregates and UDFs vs. window functions:
Window functions are complementary to existing DataFrame operations: aggregates, such as sum and avg, and UDFs. To review, aggregates calculate one result, a sum or average, for each group of rows, whereas UDFs calculate one result for each row based only on data in that row. In contrast, window functions calculate one result for each row based on a window of rows. For example, in a moving average, you calculate for each row the average of the rows surrounding the current row; this can be done with window functions.
Moving Average Example
Let us dive right into the moving average example. In this example dataset, there are two customers who have spent different amounts of money each day.

// Building the customer DataFrame. All examples are written in Scala with Spark 1.6.1, but the same can be done in Python or SQL.
val customers = sc.parallelize(List(
  ("Alice", "2016-05-01", 50.00),
  ("Alice", "2016-05-03", 45.00),
  ("Alice", "2016-05-04", 55.00),
  ("Bob", "2016-05-01", 25.00),
  ("Bob", "2016-05-04", 29.00),
  ("Bob", "2016-05-06", 27.00))).
  toDF("name", "date", "amountSpent")
// Create a window spec.
val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
This code adds a new column, "movingAvg", by applying the avg function on the sliding window defined in the window spec:
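The statement itself did not survive in these notes; reconstructed from the description above (column name "movingAvg", the avg aggregate, and window spec wSpec1):

// assumes: import org.apache.spark.sql.expressions.Window and org.apache.spark.sql.functions._
customers.withColumn("movingAvg", avg(customers("amountSpent")).over(wSpec1)).show()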
You can use functions listed under "Aggregate Functions" and "Window Functions".
For (2) specifying a window spec, there are three components: partition by, order by, and frame.
1. "Partition by" defines how the data is grouped; in the above example, it was by customer. You have to specify a reasonable grouping because all data within a group will be collected to the same machine. Ideally, the DataFrame has already been partitioned by the desired grouping.
2. "Order by" defines how rows are ordered within a group; in the above example, it was by date.
3. "Frame" defines the boundaries of the window with respect to the current row; in the above example, the window ranged between the previous row and the next row.
// Create a window spec: the frame runs from the start of the partition to the current row.
val wSpec2 = Window.partitionBy("name").orderBy("date").rowsBetween(Long.MinValue, 0)
// Create a new column which calculates the sum over the defined window frame.
customers.withColumn("cumSum", sum(customers("amountSpent")).over(wSpec2)).show()
// Use the lag function to look backwards by one row. The window spec wSpec3 did not
// survive in these notes; presumably Window.partitionBy("name").orderBy("date").
customers.withColumn("prevAmountSpent", lag(customers("amountSpent"), 1).over(wSpec3)).show()
N
EL
customers.withColumn( "rank", rank().over(wSpec3) ).show()
PT
N
Case Study: Twitter Sentiment Analysis with Spark Streaming
Sentiment Analysis is categorizing the tweets related to a particular topic and performing data mining using Sentiment Automation Analytics Tools.
We will be performing Twitter Sentiment Analysis as a use case for Spark Streaming.
Sentiment Analysis is used to:
Predict the success of a movie
Predict political campaign success
Decide whether to invest in a certain company
Targeted advertising
Review products and services
This approach is used by companies around the world.
Companies using Spark Streaming for Sentiment Analysis have applied the same approach to achieve the following:
Enhancing the customer experience
References:
1. https://spark.apache.org/streaming/
2. Streaming programming guide – spark.incubator.apache.org/docs/latest/streaming-programming-guide.html
3. https://databricks.com/speaker/tathagata-das
Spark Streaming:
- Achieves second-scale latencies
- Has a simple programming model
- Integrates with batch & interactive workloads
- Ensures efficient fault-tolerance in stateful computations
Introduction to Kafka
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Contents of this lecture:
Define Kafka
Use cases for Kafka
Kafka data model
Kafka architecture
Types of messaging systems
Importance of brokers
Batch vs. Streaming
[Comparison table; its contents did not survive in these notes]
Introduction: Apache Kafka
Kafka is a high-performance, real-time messaging system. It is an open-source tool and is a part of the Apache projects.
The characteristics of Kafka are:
1. It is a distributed and partitioned messaging system.
2. It is highly fault-tolerant.
3. It is highly scalable.
4. It can process and send millions of messages per second to several receivers.
It became a main Apache project in October 2012.
A stable Apache Kafka version 0.8.2.0 was released in February 2015.
A stable Apache Kafka version 0.8.2.1 was released in May 2015, which is the latest version.
• Web site: page views, clicks, searches, …
• IoT: sensor readings, …
and so on.
The processes that publish messages into a topic in Kafka are known as producers.
The processes that receive the messages from a topic in Kafka are known as consumers.
The processes or servers within Kafka that process the messages are known as brokers.
A Kafka cluster consists of a set of brokers that process the messages.
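As a hedged illustration (not from the lecture), a minimal producer using the standard kafka-clients Java API from Scala; the broker address, topic name, key, and value are assumptions:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Publish a message into topic "pageviews"; the key determines the partition
    producer.send(new ProducerRecord[String, String]("pageviews", "user42", "clicked /home"))
    producer.close()
  }
}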
A partition is also known as a commit log.
Each partition contains an ordered set of messages.
Each message is identified by its offset in the partition.
Messages are added at one end of the partition and consumed at the other.
Topics are divided into partitions, which are distributed across multiple servers.
A topic can have any number of partitions.
Each partition should fit in a single Kafka server.
The number of partitions decides the parallelism of the topic.
Each partition has one server acting as the leader; the remaining servers are marked as followers.
The leader controls the reads and writes for the partition, whereas the followers replicate the data.
If a leader fails, one of the followers automatically becomes the leader.
Zookeeper is used for the leader selection.
Topics should already exist before a message is placed by the producer.
Messages are added at one end of the partition.
The consumers specify what topics they want to listen to.
A message is sent to all the consumer groups listening to a topic, with one consumer in each group receiving it.
The consumer groups are used to control the messaging system.
Messages in a partition are consumed in the same order as they are appended.
• Each partition acts as a message queue.
• Consumers are divided into consumer groups.
• Each message is delivered to one consumer in each consumer group.
• Zookeeper is used for coordination.
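A matching hedged consumer sketch (same assumed broker and topic as the producer above; the group id is also an assumption — consumers sharing it form one consumer group):

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "pageview-counters") // each message goes to one member per group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("pageviews").asJava)
    // Each record carries its partition and offset; order is preserved within a partition
    for (record <- consumer.poll(Duration.ofSeconds(1)).asScala)
      println(s"partition=${record.partition} offset=${record.offset} value=${record.value}")
    consumer.close()
  }
}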
• Brokers coordinate among each other using Zookeeper.
• One broker acts as a leader for a partition and handles the delivery and persistence, whereas the others act as followers.
Kafka provides the following guarantees:
1. Messages sent by a producer to a topic partition are appended in the same order as they are sent.
2. A consumer instance gets the messages in the same order as they are produced.
3. A topic with replication factor N tolerates up to N-1 server failures.
For each partition, one broker is chosen as the leader and the others are chosen as the followers, acting as backups.
The leader propagates the writes to the followers.
The leader waits until the writes are completed on all the replicas.
Kafka relies on the file system for the storage of messages, reads, and writes.
All the data is immediately written to a file in the file system.
Messages are grouped as message sets for more efficient writes.
2. Kafka Connect: a framework to import event streams from other source data systems into Kafka, and export event streams from Kafka to destination data systems.
3. Kafka Streams: a Java library to process event streams live as they occur.
EL
https://github.com/confluentinc/examples/tree/master/kafka-streams
PT
o
http://docs.confluent.io/current/streams/javadocs/index.html
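A tiny hedged sketch of the Kafka Streams library (the Java API called from Scala; the application id and topic names are assumptions):

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object StreamsSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-filter") // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    val builder = new StreamsBuilder()
    // Read events live from one topic, filter them, and write to another topic
    builder.stream[String, String]("pageviews")
      .filter((_, value) => value.contains("/home"))
      .to("home-pageviews")

    new KafkaStreams(builder.build(), props).start()
  }
}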
Kafka's data model consists of messages and topics.
Kafka's architecture consists of brokers that take messages from the producers and add them to a partition of a topic.
Kafka's architecture supports two types of messaging system, called publish-subscribe and queue system.
Brokers are the Kafka processes that process the messages in Kafka.