Lineland - HBase Architecture 101 - Storage
Please note that this is not a UML or call graph but a merged picture
of classes and the files they handle. It is by no means complete,
but focuses on the topic of this post. I will discuss the details below.
So what does my sketch of the HBase innards really say? You can
see that HBase handles basically two kinds of file types. One is used
for the write-ahead log and the other for the actual data storage. The
files are primarily handled by the HRegionServers, but in certain
scenarios even the HMaster will have to perform low-level file
operations. You may also notice that the actual files are in fact
divided up into smaller blocks when stored within the Hadoop
Distributed Filesystem (HDFS). This is also one of the areas where
you can configure the system to handle larger or smaller data better.
More on that later.
The general flow is that a new client contacts the Zookeeper quorum
(a separate cluster of Zookeeper nodes) first to find a particular row
key. It does so by retrieving the server name (i.e. host name) that
hosts the -ROOT- region from Zookeeper. With that information it
can query that server to get the server that hosts the .META. table.
Both of these details are cached and only looked up once. Lastly
it can query the .META. server and retrieve the server that has the
row the client is looking for.
Once it has been told where the row resides, i.e. in what region, it
caches this information as well and contacts the HRegionServer
hosting that region directly. So over time the client has a pretty
complete picture of where to get rows from without needing to query
the .META. server again.
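The lookup chain above can be sketched as a small simulation. To be clear, this is not the actual HBase client code (that logic lives in HConnectionManager); the class, the host names, and the per-row cache below are invented for illustration, and the real client caches per region rather than per row key:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the lookup chain described above: ZooKeeper -> -ROOT-
// -> .META. -> user region server, with client-side caching. Not the
// real client code (see HConnectionManager); host names are invented,
// and the real client caches per region, not per row key.
public class RegionLookupSketch {
    private String rootServer;   // cached after the first ZooKeeper call
    private String metaServer;   // cached after the first -ROOT- call
    private final Map<String, String> regionCache = new HashMap<>();
    int remoteCalls = 0;         // counts the simulated remote lookups

    private String askZooKeeperForRoot()        { remoteCalls++; return "host1"; }
    private String askRootForMeta()             { remoteCalls++; return "host2"; }
    private String askMetaForRegion(String row) { remoteCalls++; return "host3"; }

    /** Returns the server hosting the given row, hitting the caches first. */
    public String locate(String rowKey) {
        if (rootServer == null) rootServer = askZooKeeperForRoot();
        if (metaServer == null) metaServer = askRootForMeta();
        return regionCache.computeIfAbsent(rowKey, this::askMetaForRegion);
    }
}
```

Only the first lookup pays the three remote round trips; subsequent requests for already-seen rows are answered entirely from the caches.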
Stay Put
Once the data is written (or not) to the WAL it is placed in the
MemStore. At the same time it is checked if the MemStore is full and,
if so, a flush to disk is requested.
Files
HBase has a configurable root directory in the HDFS but the default
is /hbase. You can simply use the DFS tool of the Hadoop command
line tool to look at the various files HBase stores.
The first set of files are the log files handled by the HLog instances
and which are created in a directory called .logs underneath the
HBase root directory. Then there is another subdirectory for each
HRegionServer and then a log for each HRegion.
Next there are files called oldlogfile.log which you may not even
see on your cluster. They are created by one of the exceptions I
mentioned earlier as far as file access is concerned. They are a result
of so called "log splits". When the HMaster starts and finds that there
is a log file that is not handled by an HRegionServer anymore it splits
the log, copying the HLogKeys to the new regions they should be in. It
places them directly in the region's directory in a file named
oldlogfile.log. Now when the respective HRegion is instantiated it
reads these files and inserts the contained data into its local
MemStore and starts a flush to persist the data right away and delete
the file.
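The regrouping step of such a log split can be sketched as follows. This is a toy model, not the HMaster's actual implementation; the class and method names are invented, and the "edits" are simplified to plain strings standing in for HLogKey entries:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a "log split" as described above: edits from an
// orphaned region server log are regrouped by the region they belong to,
// so that each region can replay its own oldlogfile.log on startup.
public class LogSplitSketch {
    /**
     * Each entry is a pair {regionName, edit}; the result maps each
     * region to the ordered list of edits destined for its oldlogfile.log.
     */
    public static Map<String, List<String>> splitLog(List<String[]> logEntries) {
        Map<String, List<String>> perRegion = new LinkedHashMap<>();
        for (String[] entry : logEntries) {
            perRegion.computeIfAbsent(entry[0], k -> new ArrayList<>())
                     .add(entry[1]);
        }
        return perRegion;
    }
}
```

The point is simply that one sequential log fans out into one replay file per region, preserving the original edit order within each region.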
/hbase/<tablename>/<encoded-regionname>/<column-family>/<filename>
In each column-family directory you can see the actual data files,
which I explain in the following section in detail.
Something that I have not shown above are split regions with their
initial daughter reference files. When a data file within a region grows
larger than the configured hbase.hregion.max.filesize then the
region is split in two. This is done initially very quickly because the
system simply creates two reference files in the new regions now
supposed to host each half. The name of the reference file is an ID
with the hashed name of the referenced region as a postfix, e.g.
1278437856009925445.3323223323. The reference files only hold little
information: the key the original region was split at and whether it is
the top or bottom reference. Of note is that these references are
then used by the HalfHFileReader class (which I also omitted from
the big picture above as it is only used temporarily) to read the
original region's data file, serving either its top or its bottom half.
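A reference file can be modeled as a tiny value object. The sketch below is illustrative only; the field and method names are invented, and the real implementation is HBase's Reference class together with HalfHFileReader:

```java
// Toy model of the split-region reference files described above. A
// reference holds only the parent region's split key and whether it
// stands for the top or the bottom half of the parent's data file.
// Names are invented; see HBase's Reference/HalfHFileReader for the
// real thing.
public class ReferenceFileSketch {
    final String splitKey;  // the key the parent region was split at
    final boolean top;      // true = top half, false = bottom half

    ReferenceFileSketch(String splitKey, boolean top) {
        this.splitKey = splitKey;
        this.top = top;
    }

    /** Builds the on-disk name: an ID with the hashed parent region name
     *  as a postfix, e.g. 1278437856009925445.3323223323 */
    static String fileName(long id, String hashedParentRegion) {
        return id + "." + hashedParentRegion;
    }
}
```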
And this also concludes the file dump here; the last thing you see is a
compaction.dir directory in each table directory. They are used
when splitting or compacting regions as noted above. They are
usually empty and are used as a scratch area to stage the new data
files before swapping them into place.
HFile
The files have a variable length, the only fixed blocks are the FileInfo
and Trailer block. As the picture shows it is the Trailer that has the
pointers to the other blocks and it is written at the end of persisting
the data to the file, finalizing the now immutable data store.
The default block size is 64KB (or 65,536 bytes), as the HFile
JavaDoc explains.
One thing you may notice is that the default block size for files in
DFS is 64MB, which is 1024 times what the HFile default block size
is. So the HBase storage file blocks do not match the Hadoop
blocks. Therefore you have to think about both parameters
separately and find the sweet spot in terms of performance for your
particular setup.
So far so good, but how can you see if a HFile is OK or what data it
contains? There is an App for that!
usage: HFile [-f <arg>] [-v] [-r <arg>] [-a] [-p] [-m] [-k]
Here is an example of what the output will look like (shortened here):
...
K: \x00\x04docA\x08mimetype\x00\x00\x01\x23y\x60\xE7\xB5\x04 V: text\x2Fxml
K: \x00\x04docB\x08mimetype\x00\x00\x01\x23x\x8C\x1C\x5E\x04 V: text\x2Fxml
K: \x00\x04docD\x08mimetype\x00\x00\x01\x23y\x1EK\x15\x04 V: text\x2Fxml
K: \x00\x04docE\x08mimetype\x00\x00\x01\x23x\xF3\x23n\x04 V: text\x2Fxml

compression=none, inMemory=false, \
firstKey=US6683275_20040127/mimetype:/1251853756871/Put, \
lastKey=US6684814_20040203/mimetype:/1251864683374/Put, \
avgKeyLen=37, avgValueLen=8, \
entries=1554, length=84447
fileinfoOffset=84055, dataIndexOffset=84277, dataIndexCount=2, metaIndexOffset=0, \
Fileinfo:
MAJOR_COMPACTION_KEY = \xFF
MAX_SEQ_ID_KEY = 32041891
hfile.AVG_VALUE_LEN = \x00\x00\x00\x08
hfile.COMPARATOR = org.apache.hadoop.hbase.KeyValue\x24KeyComparator
hfile.LASTKEY = \x00\x12US6684814_20040203\x08mimetype\x00\x00\x01\x23x\xF3\x23n\x04
The first part is the actual data stored as KeyValue pairs, explained in
detail in the next section. The second part dumps the internal
HFile.Reader properties as well as the Trailer block details and
finally the FileInfo block values. This is a great way to check if a data
file is still healthy.
KeyValues
The structure starts with two fixed length numbers indicating the
size of the key and the value part. With that info you can offset into
the array to, for example, get direct access to the value, ignoring the
key - if you know what you are doing. Otherwise you can get the
required information from the key part. Once parsed into a KeyValue
instance you have access to all of its fields.
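The length-prefixed layout just described can be sketched in a few lines. This is a simplified model, not HBase's actual KeyValue class: it only shows the two fixed-length size fields followed by the raw key and value bytes, and ignores the further internal structure of the key (row, family, qualifier, timestamp, type):

```java
import java.nio.ByteBuffer;

// Minimal sketch of the layout described above: a 4-byte key length and
// a 4-byte value length, followed by the key bytes and the value bytes.
// Simplified for illustration; the real key has more internal structure.
public class KeyValueSketch {
    /** Serializes key and value with their length prefixes. */
    public static byte[] encode(byte[] key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(8 + key.length + value.length);
        buf.putInt(key.length).putInt(value.length).put(key).put(value);
        return buf.array();
    }

    /** Uses the length prefixes to jump straight to the value,
     *  skipping over the key bytes entirely. */
    public static byte[] value(byte[] kv) {
        ByteBuffer buf = ByteBuffer.wrap(kv);
        int keyLen = buf.getInt();
        int valLen = buf.getInt();
        byte[] out = new byte[valLen];
        buf.position(8 + keyLen);  // offset past both prefixes and the key
        buf.get(out);
        return out;
    }
}
```

This is exactly the "offset into the array" trick mentioned above: the two fixed-length numbers at the front tell you where each part begins and ends.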
Update: Slightly updated with more links to JIRA issues. Also added
Zookeeper to be more precise about the current mechanisms to look
up a region.