BD Sec B
History: Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. Doug Cutting, who was working at Yahoo! at the time and is now Chief Architect of Cloudera, named the project after his son's toy elephant. Cutting's son was 2 years old at the time and just beginning to talk. He called his favorite stuffed yellow elephant "Hadoop" (with the stress on the first syllable). Now 12, Doug's son often exclaims, "Why don't you say my name, and why don't I get royalties? I deserve to be famous for this!"
Hadoop is an open-source framework from the ASF (Apache Software Foundation). Being an open-source project means it is freely available and we can even change its source code as per our requirements. If certain functionality does not fulfill your need, then you can change it according to your need. Most of the Hadoop code has been written by Yahoo!, IBM, Facebook and Cloudera.
The Apache Hadoop framework is written in Java. The basic Hadoop programming language is Java, but this does not mean you can code only in Java. You can code in C, C++, Perl, Python, Ruby, etc. You can code for the Hadoop framework in any language, but it is better to code in Java, as you will have lower-level control of the code.
The Hadoop framework allows distributed processing of large datasets across clusters (i.e. groups of systems connected via LAN) of computers using simple programming models. A Hadoop-based application works in an environment that provides distributed storage and computation across clusters of commodity hardware (i.e. low-end, inexpensive devices; hence Hadoop is very economical). Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
It is at the center of a growing ecosystem of big data technologies that are primarily used to support
advanced analytics initiatives, including predictive analytics, data mining and learning applications. Hadoop
can handle various forms of structured and unstructured data, giving users more flexibility for collecting,
processing and analyzing data than relational databases and data warehouses provide.
Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS level abstractions and contain the necessary Java files and scripts required to start Hadoop.
Hadoop Distributed File System (HDFS): The Hadoop Distributed File System (HDFS) is the primary
data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system (DFS) that provides high-performance access to data across highly
scalable Hadoop clusters. HDFS creates multiple replicas of data blocks and distributes them on compute
nodes in a cluster.
Hadoop YARN: YARN is the resource management and job scheduling technology in the open-source Apache Hadoop distributed processing framework. One of Apache Hadoop's core components, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.
[Figure: Hadoop cluster with multiple DataNodes]
Apache Hadoop offers a scalable, flexible and reliable distributed computing big data framework for a cluster of systems with storage capacity and local computing power, by leveraging commodity hardware. Hadoop follows a master-slave architecture for the transformation and analysis of large datasets using the Hadoop MapReduce paradigm. The 3 important Hadoop components that play a vital role in the Hadoop architecture are:
1) Hadoop Distributed File System (HDFS)- Patterned after the UNIX file system
2) Hadoop MapReduce
3) Yet Another Resource Negotiator (YARN)
Hadoop follows a master-slave architecture design for data storage and distributed data processing using HDFS and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data using Hadoop MapReduce is the JobTracker. The slave nodes in the Hadoop architecture are the other machines in the Hadoop cluster which store data and perform complex computations. Every slave node has a TaskTracker daemon and a DataNode that synchronize the processes with the JobTracker and NameNode respectively. In a Hadoop architectural implementation the master or slave systems can be set up in the cloud or on-premise.
Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network.
If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node initiates a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information is exposed and can be viewed from a web browser.
If the JobTracker failed, all ongoing work was lost. Hadoop added some checkpointing to this process: the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.
1. Hadoop Distributed File System
HDFS is the primary storage system of Hadoop. It is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data. HDFS is a distributed filesystem that runs on commodity hardware. Users can interact directly with HDFS through shell-like commands.
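Besides the shell commands, applications usually talk to HDFS through the Java FileSystem API. The sketch below is a minimal, hedged example of writing and reading a small file; the path /user/demo/hello.txt is purely illustrative, and the Configuration is assumed to pick up the cluster's core-site.xml/hdfs-site.xml from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // assumed to load the cluster's *-site.xml files
        FileSystem fs = FileSystem.get(conf);            // connects to the default file system (HDFS)

        Path file = new Path("/user/demo/hello.txt");    // hypothetical path, for illustration only

        // Write a small file into HDFS
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read it back
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}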
HDFS Architecture
The Hadoop HDFS architecture is a master/slave architecture in which the master is the NameNode, which stores metadata, and the slaves are the DataNodes, which store the actual data. The HDFS architecture consists of a single NameNode, and all the other nodes are DataNodes. Nodes are arranged in racks, and replicas of data blocks are stored on different racks in the cluster to provide fault tolerance.
[Figure: HDFS Architecture — the NameNode holds the metadata (name, replicas, ...; e.g. /home/foo/data, 3, ...) and serves metadata operations; DataNodes store the data blocks and are spread across Rack 1 and Rack 2]
HDFS Components:
There are two major components of Hadoop HDFS- NameNode and DataNode.
i. NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset. The NameNode stores metadata, i.e. the number of blocks, their locations, on which rack and on which DataNode the data is stored, and other details. It consists of files and directories.
Tasks of HDFS NameNode
. Manages the file system namespace.
. Regulates clients' access to files.
. Executes file system operations such as naming, closing and opening files and directories.
All DataNodes send a Heartbeat and a block report to the NameNode in the Hadoop cluster. This ensures that the DataNodes are alive. A block report contains a list of all blocks on a DataNode. The NameNode is also responsible for taking care of the Replication Factor of all the blocks.
FsImage
FsImage in the NameNode is an "image file". FsImage contains the entire filesystem namespace and is stored as a file in the NameNode's local file system. It also contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file's or directory's metadata.
EditLogs
EditLogs contains all the recent modifications made to the file system since the most recent FsImage. The NameNode receives a create/update/delete request from the client; this request is first recorded in the edits file.
ii. DataNode
It is also known as the Slave. The HDFS DataNode is responsible for storing the actual data in HDFS. The DataNode performs read and write operations as per the requests of the clients. Each replica block on a DataNode consists of 2 files on the file system: the first file is for the data and the second file is for recording the block's metadata. HDFS metadata includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and does handshaking. Verification of the namespace ID and the software version of the DataNode takes place during handshaking. If a mismatch is found, the DataNode goes down automatically.
Secondary NameNode?
In HDFS, when the NameNode starts, it first reads the HDFS state from an image file, FsImage. After that, it applies edits from the edits log file. The NameNode then writes the new HDFS state to the FsImage and starts normal operation with an empty edits file. Since the NameNode merges FsImage and edits files only at start-up, the edits log file could get very large over time. A side effect of a larger edits file is that the next restart of the NameNode takes longer.
The Secondary NameNode solves this issue. The Secondary NameNode downloads the FsImage and EditLogs from the NameNode and then merges the EditLogs with the FsImage (FileSystem Image). It keeps the edits log size within a limit. It stores the modified FsImage in persistent storage, which we can use in the case of NameNode failure. The Secondary NameNode thus performs a regular checkpoint in HDFS.
Checkpoint Node?
The Checkpoint node is a node which periodically creates checkpoints of the namespace. The Checkpoint node in Hadoop first downloads the FsImage and edits from the active NameNode. Then it merges them (FsImage and edits) locally, and at last it uploads the new image back to the active NameNode. It stores the latest checkpoint in a directory that has the same structure as the NameNode's directory. This permits the checkpointed image to be always available for reading by the NameNode if necessary.
Backup Node?
A Backup node provides the same checkpointing functionality as the Checkpoint node. In Hadoop, the Backup node keeps an in-memory, up-to-date copy of the file system namespace. It is always synchronized with the active NameNode state. The Backup node in the HDFS architecture does not need to download FsImage and edits files from the active NameNode to create a checkpoint, as it already has an up-to-date state of the namespace in memory. The Backup node checkpoint process is more efficient because it only needs to save the namespace into the local FsImage file and reset edits. The NameNode supports one Backup node at a time.
[Figure: HDFS read/write — a client node reads blocks from DataNodes, and writes go through a pipeline of DataNodes]
Blocks in HDFS Architecture?
HDFS in Apache Hadoop splits huge files into small chunks known as blocks. These are the smallest units of data in the filesystem. We (client and admin) do not have any control over the blocks, such as block location; the NameNode decides all such things. The default size of an HDFS block is 128 MB, which we can configure as per our need. All blocks of a file are of the same size except the last block, which can be the same size or smaller.
If the data size is less than the block size, then the block size will be equal to the data size. For example, if the file size is 129 MB, then 2 blocks will be created. One block will be of the default size of 128 MB, and the other will be 1 MB only, not 128 MB, as that would waste space. Hadoop is intelligent enough not to waste the remaining 127 MB, so it allocates a 1 MB block for the 1 MB of data. The major advantage of storing data in such a block size is that it saves disk seek time.
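A minimal sketch (plain Java arithmetic, not a Hadoop API call) of how the 129 MB example above is cut into blocks with the default 128 MB block size:

public class BlockSplitSketch {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size (configurable)
        long fileSize  = 129L * 1024 * 1024;   // the 129 MB file from the example

        long fullBlocks  = fileSize / blockSize;                    // 1 full block of 128 MB
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024);  // final block of 1 MB, not padded to 128 MB

        System.out.println(fullBlocks + " full block(s) + a final block of " + lastBlockMb + " MB");
    }
}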
Replication Management
Block replication provides fault tolerance: if one copy is not accessible or is corrupted, we can read the data from another copy. The number of copies or replicas of each block of a file in the HDFS architecture is the replication factor. The default replication factor is 3, which is again configurable. So each block is replicated three times and stored on different DataNodes.
If we store a file of 128 MB in HDFS using the default configuration, we will end up occupying a space of 384 MB (3 * 128 MB).
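The replication factor can also be set per file. Below is a hedged sketch using the HDFS FileSystem API; the path is hypothetical and the Configuration is assumed to point at a running cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // assumed to load the cluster's *-site.xml files
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");  // hypothetical file, for illustration only
            fs.setReplication(file, (short) 3);            // ask the NameNode to keep 3 replicas of each block
        }
    }
}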
The NameNode receives block reports from the DataNodes periodically to maintain the replication factor. When a block is over-replicated or under-replicated, the NameNode adds or deletes replicas as needed.
In a large Hadoop cluster, in order to reduce network traffic while reading/writing an HDFS file, the NameNode chooses a DataNode that is on the same rack or a nearby rack to serve the read/write request. The NameNode obtains rack information by maintaining the rack IDs of the DataNodes. Rack Awareness in Hadoop is the concept of choosing DataNodes based on this rack information.
In the HDFS architecture, the NameNode makes sure that all the replicas are not stored on the same rack or a single rack. It follows the Rack Awareness Algorithm to reduce latency as well as to provide fault tolerance. We know that the default replication factor is 3. According to the Rack Awareness Algorithm, the first replica of a block is stored on a local rack, the next replica is stored on another DataNode within the same rack, and the third replica is stored on a different rack. In Hadoop, "No more than one replica is placed on one node. And no more than two replicas are placed on the same rack. This has a constraint that the number of racks used for block replication should be less than the total number of block replicas".
2. InputFormat
It defines how the input files are split and read. It selects the files or other objects that are used for input.
InputFormat creates InputSplit.
3. InputSplits
It is created by the InputFormat and logically represents the data which will be processed by an individual Mapper.
One map task is created for each split; thus the number of map tasks will be equal to the number of InputSplits. The split is divided into records, and each record will be processed by the mapper.
4. RecordReader
It communicates with the InputSplit in Hadoop MapReduce and converts the data into key-value pairs suitable for reading by the mapper. By default, it uses TextInputFormat for converting data into key-value pairs. The RecordReader communicates with the InputSplit until the file reading is completed. It assigns a byte offset (a unique number) to each line present in the file. These key-value pairs are then sent to the mapper for further processing.
5. Mapper
It processes each input record (from the RecordReader) and generates a new key-value pair; this key-value pair generated by the Mapper can be completely different from the input pair. The output of the Mapper, also known as the intermediate output, is written to the local disk. The output of the Mapper is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessary copies (also, HDFS is a high-latency system). The mapper's output is passed to the combiner for further processing, as sketched below.
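As an illustration, a hedged word-count mapper in Java; the class name WordCountMapper is just a placeholder, and the input key is the byte offset supplied by the RecordReader:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line of text from the RecordReader
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                context.write(new Text(token), new IntWritable(1));   // intermediate <word, 1> pair
            }
        }
    }
}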
6. Combiner
The combiner is also known as a 'Mini-reducer'. The Hadoop MapReduce Combiner performs local aggregation on the mappers' output, which helps to minimize the data transfer between mapper and reducer. Once the combiner functionality is executed, the output is then passed to the partitioner for further work.
7. Partitioner
It comes into the picture only if we are working with more than one reducer (for one reducer the partitioner is not used). It takes the output from the combiners and performs partitioning. Partitioning of the output takes place on the basis of the key and is then sorted. A hash function applied to the key is used to derive the partition.
According to the key value in MapReduce, each combiner output is partitioned, and records having the same key value go into the same partition; each partition is then sent to a reducer. Partitioning allows an even distribution of the map output over the reducers, as sketched below.
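A minimal sketch of a custom partitioner that mirrors the hash-based logic described above (Hadoop's default HashPartitioner works the same way); the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hash the key, mask off the sign bit so the result is non-negative,
        // then take the remainder so each key always lands in the same partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}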
8. Shuffling and Sorting
Shuffling is the physical movement of the data over the network. Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.
9. Reducer
It takes the set of intermediate key-value pairs produced by the mappers as the input and then runs a reducer
function on each of them to generate the output. The output of the reducer is the final output, which is stored
in HDFS.
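A hedged word-count reducer to pair with the mapper sketched earlier; again, the class name is a placeholder:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();        // add up all the 1s emitted for this word
        }
        total.set(sum);
        context.write(key, total);     // final <word, count> pair, written to HDFS by the RecordWriter
    }
}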
10. RecordWriter
It writes the output key-value pairs from the Reducer phase to the output files.
11. OutputFormat
The way these output key-value pairs are written to output files by the RecordWriter is determined by the OutputFormat. OutputFormat instances provided by Hadoop are used to write files to HDFS or the local disk. Thus the final output of the reducer is written to HDFS by an OutputFormat instance.
Hence, in this manner, Hadoop MapReduce works over the cluster.
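Tying the pieces together, here is a hedged driver sketch that wires the mapper, combiner, reducer and input/output paths into a Job; the class names refer to the sketches above, and the input/output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);      // mapper sketched earlier
        job.setCombinerClass(WordCountReducer.class);   // combiner: local aggregation of <word, 1> pairs
        job.setReducerClass(WordCountReducer.class);    // reducer sketched earlier

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // the InputFormat splits this input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // the OutputFormat/RecordWriter write here

        System.exit(job.waitForCompletion(true) ? 0 : 1);         // submit to the cluster and wait
    }
}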
Hadoop Mapper
In a mapper task, the output is the full collection of all these <key, value> pairs. Before writing the output for each mapper task, partitioning of the output takes place on the basis of the key and then sorting is done. This partitioning specifies that all the values for each key are grouped together. The MapReduce framework generates one map task for each InputSplit generated by the InputFormat for the job. The Mapper only understands <key, value> pairs of data, so before passing data to the mapper, the data should first be converted into <key, value> pairs.
[Figure: Mapper/Reducer word-count example — an input line such as "Lion Tiger River Bus ..." is mapped to <word, 1> pairs (Lion 1, Tiger 1, River 1, Bus 1, ...) and then reduced to per-word counts such as Tiger 2 and Bus 3]
MapReduce applications are used at:
Google:
. To create the index used by the Google search engine for retrieving search results.
. To compute the PageRank of web pages.
. Statistical machine translation for translating between different languages.
Facebook:
. Data mining.
. Ad optimization.
Yahoo:
. Spam detection for Yahoo! Mail.
. Yahoo! Search.
In order to enable LZO compression, set mapred.compress.map.output to true. This is one of the most important Hadoop optimization techniques.
2.5. Usage of the most appropriate and compact writable type for data
New big data users, or users switching from Hadoop Streaming to Java MapReduce, often use the Text writable type unnecessarily. Although Text can be convenient, it is inefficient to convert numeric data to and from UTF-8 strings, and this can actually make up a significant portion of CPU time. Whenever dealing with non-textual data, consider using the binary Writables like IntWritable, FloatWritable, etc.
One of the common mistakes that many MapReduce users make is to allocate a new Writable object for
every output from a mapper or reducer. For example, to implement a word-count mapper:
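The code that originally followed this sentence is missing from these notes; below is a hedged reconstruction of the usual pattern, in which the mapper reuses a single Text and a single IntWritable instance instead of allocating new Writable objects for every output. Class and field names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReusingWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Allocated once per task and reused for every output,
    // instead of new Text()/new IntWritable() per record.
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);           // overwrite the reused Text rather than allocating a new one
                context.write(word, ONE);
            }
        }
    }
}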
Simplicity - MapReduce jobs are easy to run. Applications can be written in any language such as Java, C++ and Python.
Scalability - MapReduce can process petabytes of data.
Speed - By means of parallel processing, problems that would take days to solve are solved in hours or minutes by MapReduce.
Fault Tolerance - MapReduce takes care of failures.
Data Locality - First, it tries to put the compute on the same node where the data resides. If that cannot be done (due to reasons like the node being down, the compute on that node performing some other computation, etc.), then it tries to put the compute on the node nearest to the respective data node(s) which contain the data to be processed. This feature of MapReduce is called "Data Locality".
MapReduce - Word Count Example Flow
[Figure: word-count example flow — input → split → map → combine → shuffle → reduce → output; the map/combine stage emits a list of (key, value) pairs to an intermediate file, and the shuffle stage groups them into (key, list of values) for the reducers]
YARN (Yet Another Resource Negotiator)
YARN was introduced in Hadoop 2.0. In Hadoop 1.0, a map-reduce job is run through a job tracker and multiple task trackers. The job of the job tracker is to monitor the progress of the map-reduce job, handle resource allocation, scheduling, etc. As a single process is handling all these things, Hadoop 1.0 does not scale well; it also makes the job tracker a single point of failure. In 1.0 you can run only map-reduce jobs with Hadoop, but with YARN support in 2.0 you can run other jobs like streaming and graph processing. In 1.0 slots are fixed for map and reduce tasks, so while a map is running you cannot use reduce slots for map tasks, and because of that slots go to waste. In 2.0 there is the concept of a container, which has resources like memory and CPU cores, and any task can be run in it.
[Figure: How YARN runs an application — 1: the client on the client node submits the application to the ResourceManager on the resource manager node; 2a/2b: the ResourceManager asks a NodeManager to start and launch a container for the application process (the application master); 3: the application master asks the ResourceManager to allocate resources (heartbeat); 4a/4b: a NodeManager on another node manager node starts and launches further containers for the application's processes]
Resource Manager:
It has two main components: the Job Scheduler and the Application Manager. The job of the scheduler is to allocate resources according to the given scheduling method, and the job of the Application Manager is to monitor the progress of submitted applications such as map-reduce jobs. The Resource Manager has all the information about available resources.
Node Manager:
For each node there is a Node Manager running. It maintains the available resources on that particular node and notifies the Resource Manager about the available resources when it starts. It launches the containers by providing the needed resources (memory, CPU, etc.); these resources are allocated to a container by the Resource Manager. It manages the containers during their lifetime. It sends a heartbeat to the Resource Manager to let it know that it is alive. In case the Resource Manager doesn't receive a heartbeat from a Node Manager, it marks that node as failed.
Application Master:
It carries out the execution of the job using the different components of YARN. It is spawned under a Node Manager under the instructions of the Resource Manager. One Application Master is launched for each job. For resource allocation it talks to the Resource Manager; for launching or stopping a container it talks to the Node Manager. It aggregates the status of tasks from different nodes and notifies the status of the job to the client as the client polls on it. It also sends a periodic heartbeat to the Resource Manager so that the Resource Manager can launch a new Application Master in case of failure.
Container:
It is started by the Node Manager. It consists of resources like memory, CPU cores, etc. For running a map or reduce task, the Application Master asks the Resource Manager for resources using which a container can be run.
Apache HBase
Since 1970, the RDBMS has been the solution for data storage and maintenance related problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop. Hadoop uses a distributed file system for storing big data and MapReduce to process it. Hadoop excels in storing and processing huge data of various formats such as arbitrary, semi-structured, or even unstructured data.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs.
What is HBase
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. Its programming language is Java. HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS). HBase is well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs enabling development in practically any programming language. It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System. One can store the data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
[Figure: a Data Producer and a Data Consumer accessing data in HDFS through HBase]
Why HBase
Limitations of RDBMS:
o RDBMS get exponentially slow as the data becomes large.
o Expects data to be highly structured, i.e. the ability to fit in a well-defined schema.
o Any change in schema might require a downtime.
o For sparse datasets, too much overhead of maintaining NULL values.
Advantages of HBase:
o It can easily work on extremely large scale data.
o We can use it for both structured and semi-structured data types.
o Apache HBase has a completely distributed architecture, which results in unprecedented high write throughput.
o HBase offers high security and easy management.
o Moreover, MapReduce jobs can be backed with HBase tables.
Where to Use HBase
Apache HBase is used to have random, real-time read/write access to Big Data. It hosts very large tables on top of clusters of commodity hardware. Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts up on the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.
HBase History
Year       Event
Nov 2006   Google released the paper on BigTable.
Feb 2007   Initial HBase prototype was created as a Hadoop contribution.
Oct 2007   The first usable HBase, along with Hadoop 0.15.0, was released.
Jan 2008   HBase became a sub-project of Hadoop.
Oct 2008   HBase 0.18.1 was released.
Jan 2009   HBase 0.19.0 was released.
Sept 2009  HBase 0.20.0 was released.
May 2010   HBase became an Apache top-level project.
HBase - Architecture
In HBase, tables are split into regions and are served by the region servers. Regions are vertically divided by column families into "Stores". Stores are saved as files in HDFS. Shown below is the architecture of HBase.
Note: The term 'store' is used for regions to explain the storage structure.
[Figure: HBase Architecture — clients talk to the master server and region servers, which run on top of Hadoop HDFS]
HBase Architecture is basically a column-oriented key-value data store, and it is a natural fit for deployment as a top layer on HDFS because it works extremely well with the kind of data that Hadoop processes. Moreover, when it comes to both read and write operations it is extremely fast, and it does not lose this important quality even with humongous datasets.
There are 3 major components of HBase Architecture:
Master server
Region servers (can be added or removed as per requirement)
Zookeeper (client library)
MasterServer
The master server -
. Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
. Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
A column family is a collection of columns. A column is a collection of key-value pairs. Given below is an example schema of a table in HBase.
Rowid | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3)

COLUMN FAMILIES (example with a "personal" family):

empid | name   | city      | designation | salary
      | raju   | hyderabad | manager     | 50,000
      | ravi   | chennai   | sr.engineer | 30,000
      | rajesh | delhi     | jr.engineer | 25,000
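To make the model concrete, here is a hedged sketch using the HBase Java client to write and read one row. The table name "emp" and the column family names "personal" and "professional" are assumptions for illustration only, and the table is assumed to already exist with those families.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseEmployeeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("emp"))) {   // assumed existing table

            // Write one row: row key "1", columns in two column families
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("hyderabad"));
            put.addColumn(Bytes.toBytes("professional"), Bytes.toBytes("designation"), Bytes.toBytes("manager"));
            table.put(put);

            // Random read of a single row by its row key
            Result result = table.get(new Get(Bytes.toBytes("1")));
            String name = Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
        }
    }
}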
HDFS                                                    | HBase
HDFS is a distributed file system suitable for storing | HBase is a database built on top of HDFS.
large files.                                            |
HDFS does not support fast individual record lookups.  | HBase provides fast lookups for larger tables.
It provides high-latency batch processing.             | It provides low-latency access to single rows from
                                                        | billions of records (random access).
It provides only sequential access to data.            | HBase internally uses hash tables and provides random
                                                        | access, and it stores the data in indexed HDFS files
                                                        | for faster lookups.
RDBMS                                                   | HBase
An RDBMS is governed by its schema, which describes    | HBase is schema-less; it doesn't have the concept of a
the whole structure of the tables.                      | fixed columns schema; it defines only column families.
It is thin and built for small tables. Hard to scale.   | It is built for wide tables. HBase is horizontally scalable.
RDBMS is transactional.                                 | No transactions are there in HBase.
It will have normalized data.                           | It has de-normalized data.
It is good for structured data.                         | It is good for semi-structured as well as structured data.
Features of HBase
. HBase is linearly scalable.
. It has automatic failure support.
. It provides consistent reads and writes.
. It integrates with Hadoop, both as a source and a destination.
. It has an easy Java API for clients.
. It provides data replication across clusters.
. Horizontally scalable: You can add any number of columns anytime.
. Automatic Failover: Automatic failover is a resource that allows a system administrator to automatically switch data handling to a standby system in the event of system compromise.
. Integration with the MapReduce framework: All the commands and Java codes internally implement MapReduce to do the task, and it is built over the Hadoop Distributed File System.
. Sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column key and timestamp.
. Often referred to as a key-value store or column-family-oriented database, or as storing versioned maps of maps.
. Fundamentally, it is a platform for storing and retrieving data with random access.
. It doesn't care about data types (you can store an integer in one row and a string in another for the same column).
. It doesn't enforce relationships within your data.
. It is designed to run on a cluster of computers, built using commodity hardware.
Applications of HBase
. It is used whenever there is a need for write-heavy applications.
. HBase is used whenever we need to provide fast random access to available data.
. Companies such as Facebook, Twitter, Yahoo and Adobe use HBase internally.