4 How To Build A Stock Data Management Database


HBase database model design

Unlike traditional relational databases, HBase is a sparse, multi-dimensional, sorted mapping table. Each table has a set of column families, and HBase organizes the physical storage of data around the concept of column families. Each row has a sortable primary key, the rowkey, and rows are stored in lexicographic order of their rowkeys; a row can hold any number of columns. Data lives in the cell determined by a row and a column, and its type is always a byte array, byte[]. Due to the schemaless nature of HBase, rows in the same table can have completely different sets of columns. Every cell in HBase is stamped with a timestamp when it is written, so each update creates a new version, and HBase retains a configurable number of versions. A client can choose to read the version closest to a certain point in time, or fetch all versions of a cell at once.

Based on these characteristics of HBase, the stock record table for this practice project is designed as shown in Table 1. The table is divided into two column families: StockInfo holds basic stock information, and Statistic holds stock statistics.

Table 1: Stock Record Table

Rowkey                  | Column family (StockInfo) | Column family (Statistic)
                        | Code        Name          | GMV            RANGE
03_000001.SZ_20071228   | 000001.SZ   PA Bank       | 2376508379     50
03_000002.SZ_20071228   | 000002.SZ   WK A          | 601376594.4    73.3333
……
13_000004.SZ_20071228   | 000004.SZ   BA            | 2832630790     11.1484
13_000005.SZ_20071228   | 000005.SZ   SWY A         | 2935635546     12.7659

The data in HBase is sorted by rowkey.
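
To make these characteristics concrete, here is a minimal Python sketch of HBase's logical model, using the first row of Table 1; the timestamps are invented for illustration:

# Conceptual sketch of an HBase table: a sparse, sorted, multi-dimensional map
# of rowkey -> column family -> qualifier -> timestamp -> value (bytes).
table = {
    "03_000001.SZ_20071228": {
        "StockInfo": {
            "Code": {1198800000000: b"000001.SZ"},
            "Name": {1198800000000: b"PA Bank"},
        },
        "Statistic": {
            "GMV": {1198800000000: b"2376508379"},
            "RANGE": {1198800000000: b"50"},
        },
    },
}

def get_latest(rowkey: str, family: str, qualifier: str) -> bytes:
    # HBase returns the newest version by default; older versions are kept
    # up to the configured maximum and can also be fetched by timestamp.
    versions = table[rowkey][family][qualifier]
    return versions[max(versions)]

# Rows are served in lexicographic rowkey order, like sorted(table).
print(get_latest("03_000001.SZ_20071228", "StockInfo", "Name"))  # b'PA Bank'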

Rowkey design
The design of the HBase rowkey is very important. The rowkey is the unique identifier of a piece of data, and which region a piece of data is stored in depends on which pre-partition range its rowkey falls into. The main purpose of rowkey design is to distribute the data evenly across all regions, which prevents data skew to a certain extent.

Example of a rowkey design scheme: the high bits of the rowkey are a hash field, obtained by hashing the year and month of the stock record and taking the remainder; the middle is the stock code, and the low bits are the time field. For example:

hash(202004) % 299 + "_" + stock code + "_" + stock date

This improves the probability that data is balanced across the RegionServers, achieving load balancing. If there were no hash field and the first field were simply the time information, a hotspot would occur in which all new data accumulates on one RegionServer. Likewise, during data retrieval the load would be concentrated on individual RegionServers, reducing query efficiency.
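
As a sketch, this scheme can be written in Python as follows. The choice of CRC32 as the hash and the zero-padding width are assumptions for illustration; the text uses hash(yyyymm) % 299, while the sample rowkeys in Table 1 show two-digit prefixes, so the bucket count is left configurable here:

import zlib

def make_rowkey(stock_code: str, trade_date: str, buckets: int = 299) -> str:
    """Salted rowkey: <hash(yyyymm) % buckets>_<stock code>_<trade date>."""
    yyyymm = trade_date[:6]                       # year + month, e.g. "200712"
    salt = zlib.crc32(yyyymm.encode()) % buckets  # CRC32 is an assumed hash
    width = len(str(buckets - 1))                 # zero-pad so rowkeys sort correctly
    return f"{salt:0{width}d}_{stock_code}_{trade_date}"

print(make_rowkey("000001.SZ", "20071228"))  # e.g. "123_000001.SZ_20071228"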
The last point is table creation. By default, when HBase creates a table it has a single region, and the rowkey range of that region has no boundaries, that is, no startkey and no endkey. When data is written, it all goes into this default region. As the amount of data keeps growing, once the region can no longer bear it, it is split into two regions. During this process, two problems arise:

1. While data is being written to a single region, there is a write hotspot problem.
2. Region split consumes valuable cluster I/O resources.

Therefore, for load balancing, multiple empty regions are created for pre-partitioning when the table is built, so that stocks of different years and months live on different RegionServers (see the create command sketched below). Since data stored in the same column family shares the same characteristics, the stock dataset is divided into two column families, StockInfo and Statistic, which hold the basic information and the statistical information of the stock respectively.
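
As a sketch, a pre-split table of this shape could be created in the HBase shell as follows; the table name HbaseStock matches the import command used later, while the four split boundaries are illustrative assumptions (in practice one boundary per salt bucket would be listed):

create 'HbaseStock', 'StockInfo', 'Statistic', SPLITS => ['03_', '06_', '09_', '12_']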

The columns under a column family do not need to be created in advance; a column can be specified as family:qualifier when it is needed.
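
For example, a single cell can be written from the HBase shell by naming the qualifier on the fly (row and value taken from Table 1):

put 'HbaseStock', '03_000001.SZ_20071228', 'StockInfo:Name', 'PA Bank'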

Insert data
Because we usually need to manipulate a large amount of data, we need to insert batches of data
into HBase. Here I will use 1.csv as an example.

For this file I simply design the rowkey as "code-date" and delete the first (header) row, because it would otherwise be imported as a data row and break the batch operation.

First upload the local CSV file to HDFS with the following commands.

hdfs dfs -mkdir /hadoop
hdfs dfs -mkdir /hadoop/input
hdfs dfs -put /home/whh/Documents/1.csv /hadoop/input/1.csv
Since the table was already created, the next step is to import the data with a command that uses HBase's bundled MapReduce ImportTsv tool.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,StockInfo:abbreviation,StockInfo:date,\
Statistic:previous-closing-price,Statistic:opening-price,Statistic:max-price,\
Statistic:min-price,Statistic:closing-price,Statistic:trading-volume,\
Statistic:transaction,Statistic:ups-and-downs,Statistic:range,\
Statistic:average,Statistic:turnover-rate,Statistic:total \
  HbaseStock /hadoop/input/1.csv

Running this command launches a MapReduce job whose progress output appears in the terminal (the original shows the job output and a verification screenshot here).
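
As a quick check, a few rows can be scanned from the HBase shell, assuming the table name HbaseStock used above:

scan 'HbaseStock', {LIMIT => 2}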

At this point, the data for the first table has been inserted successfully; other data files can follow the same process.
