Week-5 - Lecture Notes
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Contents of this lecture:
What is HBase?
HBase Architecture
HBase Components
Data Model
HBase Storage Hierarchy
Cross-Datacenter Replication
Auto Sharding and Distribution
Bloom Filter and Fold, Store, and Shift
HBase is:
An open-source NoSQL database.
A distributed, column-oriented data store that can scale horizontally to thousands of commodity servers and petabytes of indexed storage.
Designed to operate on top of the Hadoop Distributed File System (HDFS) for scalability, fault tolerance, and high availability.
An implementation of the BigTable storage architecture, a distributed storage system developed by Google.
Works with structured, semi-structured, and unstructured data.
Facebook uses HBase internally.
API functions:
Get/Put(row)
Scan(row range, filter) – range queries
MultiPut
Unlike Cassandra, HBase prefers consistency (over availability).
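A minimal sketch of these API calls, written in Scala against the standard HBase Java client (assuming an HBase 2.x client on the classpath); the table name, row keys, and connection settings are illustrative assumptions, not from the lecture:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes

object HBaseApiSketch {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("users")) // hypothetical table

    // Put(row): write one cell, addressed as columnFamily:qualifier
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("personaldata"), Bytes.toBytes("Name"), Bytes.toBytes("Alice"))
    table.put(put)

    // Get(row): read the row back
    val result = table.get(new Get(Bytes.toBytes("row1")))
    val name = Bytes.toString(result.getValue(Bytes.toBytes("personaldata"), Bytes.toBytes("Name")))
    println(name)

    // Scan(row range): range query over the sorted row keys
    val scanner = table.getScanner(
      new Scan().withStartRow(Bytes.toBytes("row0")).withStopRow(Bytes.toBytes("row9")))
    scanner.close()

    // MultiPut corresponds to table.put(java.util.List[Put])
    table.close(); conn.close()
  }
}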
HBase Components
HMaster: monitors all RegionServer instances in the cluster.
Regions: the basic element of availability and distribution for tables; each region is divided by column families into "stores".
RegionServer: serves and manages regions. In a distributed cluster, a RegionServer runs on a DataNode.
Data Model
Data stored in HBase is located by its "rowkey".
The rowkey is like a primary key in a relational database.
Records in HBase are stored in sorted order, according to rowkey.
Tables are divided into sequences of rows, by key range, called regions.
These regions are then assigned to the data nodes in the cluster, called "RegionServers".
A column is identified by a Column Qualifier, which consists of the Column Family name concatenated with the Column name using a colon, e.g. personaldata:Name.
Column families are mapped to storage files and are stored in separate files, which can also be accessed separately.
Cell in HBase Table
Data is stored in the cells of HBase tables.
A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp.
The key consists of the row key, column name, and timestamp.
The entire cell, with the added structural information, is called a KeyValue.
Row: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[ ] (byte array).
Column Qualifier: Like row keys, column qualifiers do not have a data type and are always treated as a byte[ ].
Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell's value.
Versions: If the timestamp is not specified for a read, the latest value is returned. The number of cell value versions retained by HBase is configured per column family. The default number of cell versions is three.
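As an illustration (not from the lecture), the hedged sketch below raises the retained version count for a column family at table-creation time and reads several versions of a cell back; all names are hypothetical and the calls assume the HBase 2.x client API:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, Get, TableDescriptorBuilder}
import org.apache.hadoop.hbase.util.Bytes

// Keep up to 5 versions of each cell in family "personaldata" (default is 3)
val family = ColumnFamilyDescriptorBuilder
  .newBuilder(Bytes.toBytes("personaldata"))
  .setMaxVersions(5)
  .build()
val tableDesc = TableDescriptorBuilder
  .newBuilder(TableName.valueOf("users")) // hypothetical table
  .setColumnFamily(family)
  .build()
// created via Connection.getAdmin().createTable(tableDesc)

// Read up to 3 versions of a row's cells; the newest version comes first
val get = new Get(Bytes.toBytes("row1")).readVersions(3)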
HBase Storage Hierarchy
[Diagram: each HRegionServer has one HLog and hosts several HRegions; each HRegion holds one Store per column family; a Store consists of a MemStore plus StoreFiles (HFiles); everything is persisted on HDFS]
• One Store per combination of ColumnFamily + region
  – MemStore for each Store: in-memory updates to the Store; flushed to disk when full
    » StoreFiles for each Store, for each region: where the data lives
      - HFile: the file format of a StoreFile; corresponds to the SSTable from Google's BigTable
HFile
[Diagram: an HFile is a magic header followed by a sequence of (key, value) records. Each record stores: key length, value length, row length, row, column family length, column family, column qualifier, timestamp, and key type, followed by the value. Example HBase key: row SSN:000-01-2345, column family Demographic Information, column qualifier Ethnicity]
HLog (Write-Ahead Log)
[Diagram: on an HRegionServer, an incoming write is appended to the HLog and then applied to the MemStore of the owning Store; the MemStore is later flushed to StoreFiles (HFiles)]
Write to the HLog before writing to the MemStore.
This helps recover from failure by replaying the HLog.
Recovery: replay any stale logs (use timestamps to find out where the database is with respect to the logs).
Replay: add the edits to the MemStore.
Cross-Datacenter Replication
Coordination among clusters is via Zookeeper.
Zookeeper can be used like a file system to store control information:
1. /hbase/replication/state
2. /hbase/replication/peers/<peer cluster number>
3. /hbase/replication/rs/<hlog>
Auto Sharding and Distribution
Regions are spread "randomly" across RegionServers.
They are moved around for load balancing and failover.
They are split automatically or manually to scale with growing data.
Bloom Filter
Allows a check at the row + column level.
Can filter entire store files from reads.
Useful when data is grouped.
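As a hedged illustration (not from the lecture), Bloom filters are enabled per column family; with the HBase 2.x client API this looks roughly like:

import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder
import org.apache.hadoop.hbase.regionserver.BloomType
import org.apache.hadoop.hbase.util.Bytes

// ROWCOL checks row + column on reads; ROW checks the row key only
val familyWithBloom = ColumnFamilyDescriptorBuilder
  .newBuilder(Bytes.toBytes("personaldata")) // hypothetical family
  .setBloomFilterType(BloomType.ROWCOL)
  .build()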
Fold, Store, and Shift
The stored key consists of Row Key, Column Family, Column Qualifier, and Timestamp.
Storage folds columns into "one row per column": each cell is stored as its own key-value.
NULLs are cost-free, as nothing is stored.
Unfortunately, the CAP theorem applies.
Key-value/NoSQL systems offer BASE:
Eventual consistency, and a variety of other consistency models striving towards strong consistency.
In this lecture, we have discussed:
HBase Architecture, HBase Components, Data model,
HBase Storage Hierarchy, Cross-Datacenter
Replication, Auto Sharding and Distribution, Bloom
Filter and Fold, Store, and Shift
Spark Streaming and
Sliding Window Analytics
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
In this lecture, we will discuss Spark Streaming and Sliding Window Analytics.
We will also discuss a case study based on Twitter Sentiment Analysis using Spark Streaming.
Spark Streaming integrates with batch and interactive processing.
Existing frameworks cannot do both:
Either stream processing of 100s of MB/s with low latency, or batch processing of TBs of data with high latency.
Running two separate stacks doubles the operational effort.
Each input record updates the state, and new records are sent out.
Mutable state is lost if a node fails.
Making stateful stream processing fault-tolerant is challenging!
Streaming technologies are becoming increasingly important with the growth of the Internet.
Spark Streaming can ingest data from sources such as Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.
In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, which has a higher-level interface, is also provided to support streaming.
Spark Streaming:
Can achieve second-scale latencies.
Integrates with Spark's batch and interactive processing.
Provides a simple batch-like API for implementing complex algorithms.
Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
Scalable, fault-tolerant, second-scale latencies.
Website monitoring
Fraud detection
Ad monetization
- Website statistics
- Intrusion detection systems
- etc.
▪ Require large clusters to handle workloads
Use cases: website monitoring, fraud detection, ad monetization.
Spark Streaming scales to hundreds of nodes, achieves second-scale latencies, and efficiently recovers from failures.
Integration: Spark integrates batch and real-time processing.
Business analysis: Spark Streaming is used to track the behavior of customers, which can be used in business analysis.
▪ Simple programming model
▪ Integrated with batch & interactive processing
▪ Efficient fault-tolerance in stateful computations
Stream Processing
Ability to ingest, process, and analyze data in motion, in real or near-real time.
Event- or micro-batch-driven; continuous evaluation and long-lived processing.
An enabler for real-time prospective, proactive, and predictive analytics for next best action.
Stream Processing + Batch Processing = All Data Analytics
real-time (now) + historical (past)
Either stream processing of 100s of MB/s with low latency, or batch processing of TBs of data with high latency.
It is extremely painful to maintain two different stacks:
Different programming models
Double the implementation effort
– Each node maintains mutable state records.
– Each input record updates the state, and new records are sent out.
Mutable state is lost if a node fails.
[Diagram: input records flow into node 1, which feeds nodes 2 and 3]
Making stateful stream processing fault-tolerant is challenging!
Storm
Replays a record if it is not processed by a node.
Processes each record at least once.
May update mutable state twice!
Mutable state can be lost due to failure!
Trident – uses transactions to update state.
Processes each record exactly once.
Per-state transactions to an external database are slow.
▪ Spark Streaming chops the live stream into batches of X seconds.
▪ Spark treats each batch of data as RDDs and processes them using RDD operations.
▪ Finally, the processed results of the RDD operations are returned in batches.
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
▪ This micro-batch design gives a latency of about 1 second.
▪ Potential for combining batch processing and streaming processing in the same system.
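A minimal, self-contained sketch of this micro-batch model (assumed names throughout; a socket source on localhost:9999 stands in for any live stream):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Each micro-batch covers 1 second of input ("batches of X seconds")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999) // assumed test source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // processed results are emitted once per batch

    ssc.start()
    ssc.awaitTermination()
  }
}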
Spark Streaming receives input data at regular time intervals.
It divides each received batch into blocks for parallelism.
Each batch is a graph that translates into multiple jobs.
It has the ability to create larger-size batch windows as it processes over time.
Once this happens, Spark can be used to perform machine learning on the data through its MLlib API, and Spark SQL can be used to perform further operations on this data. Finally, the streaming output can be stored into various data storage systems like HBase, Cassandra, MemSQL, Kafka, Elasticsearch, HDFS, and the local file system.
[Diagram: Twitter Streaming API → tweets DStream, divided into batch @ t, batch @ t+1, batch @ t+2; each batch is stored in memory as an RDD (immutable, distributed)]
transformation: modify data in one DStream to create another DStream
[Diagram: tweets DStream (batch @ t, t+1, t+2) → flatMap on each batch → hashTags DStream, e.g. [#cat, #dog, …]; new RDDs are created for every batch]
output operation: push data to external storage
[Diagram: tweets DStream → flatMap on each batch → hashTags DStream → every batch saved to HDFS]
foreach: do whatever you want with the processed data
[Diagram: tweets DStream → flatMap on each batch → hashTags DStream → foreach on every batch]
Scala
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java
JavaDStream<Status> tweets = ssc.twitterStream()
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { }) // Function object
hashTags.saveAsHadoopFiles("hdfs://...")
Input data is replicated in memory of multiple worker nodes, and is therefore fault-tolerant.
Lost partitions of the hashTags RDD can be recomputed on other workers.
Transformations – modify data from one DStream to another
  Standard RDD operations – map, countByValue, reduce, join, …
  Stateful operations – window, countByValueAndWindow, …
Output operations – send data to an external entity
  saveAsHadoopFiles – saves to HDFS
  foreach – do anything with each batch of results
val tagCounts = hashTags.countByValue()
[Diagram: tweets → flatMap → hashTags → map → reduceByKey, applied per batch (t, t+1, t+2) → tagCounts, e.g. [(#cat, 10), (#dog, 25), ... ]]
sliding window operation
[Diagram: a window of a given window length slides over the DStream of data by a given sliding interval]
[Diagram: hashTags batches at t-1 … t+3; a sliding window covers several batches; countByValue over the window yields tagCounts, a count over all the data in the window]
[Diagram: instead of recounting the whole window, add the counts from the new batch entering the window and subtract the counts from the batch that left the window to update tagCounts incrementally]
This needs a function to "inverse reduce" ("subtract", for counting).
Could have implemented counting as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
Example: maintain per-user mood as state, and update it with their tweets:
def updateMood(newTweets, lastMood) => newMood
moods = tweetsByUser.updateStateByKey(updateMood _)
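In actual Scala the update function passed to updateStateByKey has the signature (Seq[V], Option[S]) => Option[S]. A hedged sketch with assumed types (a running tweet count stands in for "mood"; checkpointing is required for stateful operations):

// tweetsByUser: DStream[(String, String)] of (userId, tweetText); assumed to exist
ssc.checkpoint("hdfs://.../checkpoints") // assumed checkpoint directory

def updateFunc(newTweets: Seq[String], state: Option[Int]): Option[Int] =
  Some(state.getOrElse(0) + newTweets.size)

val moods = tweetsByUser.updateStateByKey(updateFunc _)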
Example: join incoming tweets with a spam HDFS file to filter out bad tweets:
tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
• These steps repeat for each batch, continuously.
• Because we are dealing with streaming data, Spark Streaming has the ability to "remember" the previous RDDs, to some extent.
Combine live data streams with historical data:
Generate historical data models with Spark, etc.
Use the data models to process the live data stream (transform).
CEP-style processing:
Window-based operations (reduceByWindow, etc.)
It is very easy to write a receiver for your own data source.
You can also generate your own RDDs from Spark, etc. and push them in as a "stream".
DStream has 36 value members.
There are multiple types of DStreams.
A separate Python API exists.
[Diagram: the tweets RDD holds input data replicated in memory of multiple worker nodes; lost partitions of the derived hashTags RDD are recomputed on other workers]
▪ All transformations are fault-tolerant and exactly-once.
Commercial systems: 100–500k records/sec/node.
[Figure: cluster throughput (GB/s) versus record size (100–1000 bytes) for Grep and WordCount; Spark sustains a higher per-node rate (MB/s) than Storm in both benchmarks]
▪ Markov-chain Monte Carlo simulations on streaming observations
▪ Very CPU-intensive; requires dozens of machines for useful computation
▪ Scales linearly with cluster size
[Figure: processing rate versus number of nodes in the cluster (0 to 80), showing linear scaling]
Spark Streaming program on a Twitter stream:
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Spark program on a Twitter log file:
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")
Use the same code in Spark for processing large logs:

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

Use similar code in Spark Streaming for real-time processing:

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Spark 0.9 in Jan 2014 – out of alpha!
Automated master fault recovery
Performance optimizations
Web UI and better monitoring capabilities
Spark v2.4.0 was released on November 2, 2018.
Both of the window parameters (window length and sliding interval) must be a multiple of the batch interval.
A window creates a new DStream with a larger batch size.
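For instance (an assumed sketch, consistent with the reduceByKeyAndWindow call shown earlier), a 60-second window over the hashTags DStream that slides every 10 seconds:

// With a batch interval of Seconds(1), both 60s and 10s are valid multiples
val windowedTags = hashTags.window(Seconds(60), Seconds(10))
windowedTags.countByValue().print()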
Window functions allow you to do many common calculations with DataFrames, without having to resort to RDD manipulation.
Aggregates and UDFs vs. window functions:
Window functions are complementary to existing DataFrame operations: aggregates, such as sum and avg, and UDFs. To review, aggregates calculate one result, a sum or average, for each group of rows, whereas UDFs calculate one result for each row based only on data in that row. In contrast, window functions calculate one result for each row based on a window of rows. For example, in a moving average, you calculate for each row the average of the rows surrounding the current row; this can be done with window functions.
Moving Average Example
Let us dive right into the moving average example. In this example dataset, there are two customers who have spent different amounts of money each day.

// Building the customer DataFrame. All examples are written in Scala with Spark 1.6.1, but the same can be done in Python or SQL.
val customers = sc.parallelize(List(
  ("Alice", "2016-05-01", 50.00),
  ("Alice", "2016-05-03", 45.00),
  ("Alice", "2016-05-04", 55.00),
  ("Bob", "2016-05-01", 25.00),
  ("Bob", "2016-05-04", 29.00),
  ("Bob", "2016-05-06", 27.00))).
  toDF("name", "date", "amountSpent")
// Create a window spec.
val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
This code adds a new column, "movingAvg", by applying the avg function on the sliding window defined in the window spec:
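The statement itself did not survive in these notes; reconstructed from the description above (column name "movingAvg", the avg aggregate, and window spec wSpec1):

// assumes: import org.apache.spark.sql.expressions.Window and org.apache.spark.sql.functions._
customers.withColumn("movingAvg", avg(customers("amountSpent")).over(wSpec1)).show()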
You can use functions listed under "Aggregate Functions" and "Window Functions".
For (2) specifying a window spec, there are three components: partition by, order by, and frame.
1. "Partition by" defines how the data is grouped; in the above example, it was by customer. You have to specify a reasonable grouping because all data within a group will be collected to the same machine. Ideally, the DataFrame has already been partitioned by the desired grouping.
2. "Order by" defines how rows are ordered within a group; in the above example, it was by date.
3. "Frame" defines the boundaries of the window with respect to the current row; in the above example, the window ranged between the previous row and the next row.
// Create a window spec: the frame runs from the start of the partition to the current row.
val wSpec2 = Window.partitionBy("name").orderBy("date").rowsBetween(Long.MinValue, 0)
// Create a new column which calculates the sum over the defined window frame.
customers.withColumn("cumSum", sum(customers("amountSpent")).over(wSpec2)).show()
// Use the lag function to look backwards by one row. The window spec wSpec3 did not
// survive in these notes; presumably Window.partitionBy("name").orderBy("date").
customers.withColumn("prevAmountSpent", lag(customers("amountSpent"), 1).over(wSpec3)).show()
N
EL
customers.withColumn( "rank", rank().over(wSpec3) ).show()
PT
N
Case Study: Twitter Sentiment Analysis with Spark Streaming
Sentiment Analysis is categorizing the tweets related to a particular topic and performing data mining using Sentiment Automation Analytics Tools.
We will be performing Twitter Sentiment Analysis as a use case for Spark Streaming.
Sentiment Analysis is used to:
Predict the success of a movie
Predict political campaign success
Decide whether to invest in a certain company
Targeted advertising
Review products and services
This approach is used by companies around the world.
Companies using Spark Streaming for Sentiment Analysis have applied the same approach to achieve the following:
Enhancing the customer experience
References:
1. https://spark.apache.org/streaming/
2. Streaming programming guide – spark.incubator.apache.org/docs/latest/streaming-programming-guide.html
3. https://databricks.com/speaker/tathagata-das
Spark Streaming:
- Achieves second-scale latencies
- Has a simple programming model
- Integrates with batch & interactive workloads
- Ensures efficient fault-tolerance in stateful computations
Introduction to Kafka
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Contents of this lecture:
Define Kafka
Use cases for Kafka
Kafka data model
Kafka architecture
Types of messaging systems
Importance of brokers
Batch vs. Streaming
[Comparison table; its contents did not survive in these notes]
Introduction: Apache Kafka
Kafka is a high-performance, real-time messaging system. It is an open-source tool and is a part of the Apache projects.
The characteristics of Kafka are:
1. It is a distributed and partitioned messaging system.
2. It is highly fault-tolerant.
3. It is highly scalable.
4. It can process and send millions of messages per second to several receivers.
It became a main Apache project in October 2012.
A stable Apache Kafka version 0.8.2.0 was released in February 2015.
A stable Apache Kafka version 0.8.2.1 was released in May 2015, which is the latest version.
• Web site: page views, clicks, searches, …
• IoT: sensor readings, …
and so on.
The processes that publish messages into a topic in Kafka are known as producers.
The processes that receive the messages from a topic in Kafka are known as consumers.
The processes or servers within Kafka that process the messages are known as brokers.
A Kafka cluster consists of a set of brokers that process the messages.
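As a hedged illustration (not from the lecture), a minimal producer using the standard kafka-clients Java API from Scala; the broker address, topic name, key, and value are assumptions:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Publish a message into topic "pageviews"; the key determines the partition
    producer.send(new ProducerRecord[String, String]("pageviews", "user42", "clicked /home"))
    producer.close()
  }
}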
A partition is also known as a commit log.
Each partition contains an ordered set of messages.
Each message is identified by its offset in the partition.
Messages are added at one end of the partition and consumed at the other.
Topics are divided into partitions, which are distributed across multiple servers.
A topic can have any number of partitions.
Each partition should fit in a single Kafka server.
The number of partitions decides the parallelism of the topic.
Each partition has one server acting as the leader; the remaining servers are marked as followers.
The leader controls the reads and writes for the partition, whereas the followers replicate the data.
If a leader fails, one of the followers automatically becomes the leader.
Zookeeper is used for the leader selection.
Topics should already exist before a message is placed by the producer.
Messages are added at one end of the partition.
The consumers specify what topics they want to listen to.
A message is sent to all the consumer groups listening to a topic, with one consumer in each group receiving it.
The consumer groups are used to control the messaging system.
Messages in a partition are consumed in the same order as they are appended.
• Each partition acts as a message queue.
• Consumers are divided into consumer groups.
• Each message is delivered to one consumer in each consumer group.
• Zookeeper is used for coordination.
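A matching hedged consumer sketch (same assumed broker and topic as the producer above; the group id is also an assumption — consumers sharing it form one consumer group):

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "pageview-counters") // each message goes to one member per group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("pageviews").asJava)
    // Each record carries its partition and offset; order is preserved within a partition
    for (record <- consumer.poll(Duration.ofSeconds(1)).asScala)
      println(s"partition=${record.partition} offset=${record.offset} value=${record.value}")
    consumer.close()
  }
}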
• Brokers coordinate among each other using Zookeeper.
• One broker acts as a leader for a partition and handles the delivery and persistence, whereas the others act as followers.
Kafka provides the following guarantees:
1. Messages sent by a producer to a topic partition are appended in the same order as they are sent.
2. A consumer instance gets the messages in the same order as they are produced.
3. A topic with replication factor N tolerates up to N-1 server failures.
For each partition, one broker is chosen as the leader and the others are chosen as the followers, acting as backups.
The leader propagates the writes to the followers.
The leader waits until the writes are completed on all the replicas.
Kafka relies on the file system for the storage of messages, reads, and writes.
All the data is immediately written to a file in the file system.
Messages are grouped as message sets for more efficient writes.
2. Kafka Connect: a framework to import event streams from other source data systems into Kafka, and export event streams from Kafka to destination data systems.
3. Kafka Streams: a Java library to process event streams live as they occur.
EL
https://github.com/confluentinc/examples/tree/master/kafka-streams
PT
o
http://docs.confluent.io/current/streams/javadocs/index.html
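A tiny hedged sketch of the Kafka Streams library (the Java API called from Scala; the application id and topic names are assumptions):

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object StreamsSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-filter") // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    val builder = new StreamsBuilder()
    // Read events live from one topic, filter them, and write to another topic
    builder.stream[String, String]("pageviews")
      .filter((_, value) => value.contains("/home"))
      .to("home-pageviews")

    new KafkaStreams(builder.build(), props).start()
  }
}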
Kafka's data model consists of messages and topics.
Kafka's architecture consists of brokers that take messages from the producers and add them to a partition of a topic.
Kafka's architecture supports two types of messaging system, called publish-subscribe and queue system.
Brokers are the Kafka processes that process the messages in Kafka.