4 How To Build A Stock Data Management Database
……
Rowkey design
The design of the HBase rowkey is very important. The rowkey is the unique identifier of a piece of
data, and the region in which the data is stored depends on which pre-partition range the rowkey
falls into. The main purpose of designing the rowkey is to distribute the data evenly across all
regions, which prevents data skew to a certain extent.
Example of a rowkey design scheme: the high-order part of the rowkey is a hash field, obtained by
hashing the year and month of the stock record and taking the remainder; the middle part is the
stock code; and the low-order part is the time field. For example:
hash(202004) % 299 + "_" + stock code + "_" + stock date
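The scheme above can be sketched in Python as follows. The text does not say which hash function is used, so CRC32 stands in here as an assumption. The zero-padding of the hash field, the YYYYMMDD date format, and the sample stock code are also assumptions made for illustration.

```python
import zlib

NUM_BUCKETS = 299  # modulus taken from the example above


def salt_for_month(year_month: str, buckets: int = NUM_BUCKETS) -> int:
    """Map a year-month string such as '202004' to a bucket number."""
    # Assumption: CRC32 stands in for the unspecified hash function.
    return zlib.crc32(year_month.encode("utf-8")) % buckets


def make_rowkey(stock_code: str, stock_date: str) -> str:
    """Build '<hash field>_<stock code>_<stock date>' for a YYYYMMDD date."""
    salt = salt_for_month(stock_date[:6])
    # Assumption: zero-padding to three digits, so that the byte-wise
    # ordering of rowkeys agrees with the numeric order of the hash field.
    return f"{salt:03d}_{stock_code}_{stock_date}"


# Hypothetical stock code and date, for illustration only.
print(make_rowkey("600000", "20200415"))
```

A stable hash such as CRC32 is used rather than Python's built-in `hash()`, because the latter is randomized per process for strings and would produce different rowkeys on every run.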
This increases the likelihood that data is spread evenly across the RegionServers, which achieves
load balancing. Without the hash field, the rowkey would start with the time information, causing
a hotspot in which all new data accumulates on one RegionServer. Likewise, during data retrieval
the load would be concentrated on individual RegionServers, reducing query efficiency.
The last step is creating the table. By default, a table created in HBase has a single region. The
rowkey range of this region has no boundaries, that is, there is no startkey and no endkey. When
data is written, all of it goes to this default region. As the amount of data continues to increase,
the region eventually cannot hold the growing amount of data and is split into 2 regions. During
this process, two problems arise:
1. While all data is written to a single region, that region becomes a write hotspot.
2. Region splits consume valuable cluster I/O resources.
Therefore, for load balancing, create multiple empty regions when building the table
(pre-partitioning), so that stocks of different years and months are stored on different
RegionServers. Since data stored in the same column family shares the same characteristics, the
stock dataset is divided into two column families, StackInfo and Statisitc, which hold the basic
information and the statistical information of the stock, respectively.
The columns under a column family do not need to be created in advance; they can be specified in
the form family:qualifier when needed.
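As a rough illustration of pre-partitioning, the sketch below computes evenly spaced split points over the 299 hash buckets and shows which region a given rowkey would fall into. It assumes the hash field is zero-padded to three digits; the region count of 4 and the sample rowkeys are hypothetical.

```python
from bisect import bisect_right

NUM_BUCKETS = 299  # modulus from the rowkey example


def split_keys(num_regions: int, buckets: int = NUM_BUCKETS) -> list[str]:
    """Return num_regions - 1 split points that divide the bucket range evenly."""
    if not 1 <= num_regions <= buckets:
        raise ValueError("num_regions must be between 1 and the bucket count")
    return [f"{i * buckets // num_regions:03d}" for i in range(1, num_regions)]


def region_index(rowkey: str, splits: list[str]) -> int:
    """Index of the region whose [startkey, endkey) range contains rowkey."""
    return bisect_right(splits, rowkey)


splits = split_keys(4)
print(splits)  # ['074', '149', '224']
print(region_index("073_600000_20200415", splits))  # 0
print(region_index("074_600000_20200415", splits))  # 1
```

Split points of this kind could be passed to the `SPLITS` option of the HBase shell `create` command when the table is built, together with the two column families.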
Insert data
Because we usually need to handle a large amount of data, we need to insert data into HBase in
batches. Here I will use 1.csv as an example.
I simply design the rowkey as "code-date" and delete the first row of the file, because it would
interfere with the batch operation. The modified file is shown below.
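The preprocessing described here can also be scripted instead of done by hand. The following is a minimal sketch that drops the first row and prepends a "code-date" rowkey to every remaining row. The layout of 1.csv is not given in the text, so the column positions (code first, date second) and the sample row are assumptions.

```python
import csv
import io


def add_rowkeys(src, dst, code_col: int = 0, date_col: int = 1) -> int:
    """Copy CSV rows from src to dst, skipping the first row and
    prepending a 'code-date' rowkey. Returns the number of rows written."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    next(reader, None)  # drop the first row
    count = 0
    for row in reader:
        if not row:  # ignore blank lines
            continue
        writer.writerow([f"{row[code_col]}-{row[date_col]}"] + row)
        count += 1
    return count


# Hypothetical sample data; the real 1.csv may have different columns.
sample = io.StringIO("code,date,open,close\n600000,20200415,10.1,10.3\n")
out = io.StringIO()
add_rowkeys(sample, out)
print(out.getvalue(), end="")  # 600000-20200415,600000,20200415,10.1,10.3
```

For a real file, `src` and `dst` would be files opened with `open(..., newline="")` rather than in-memory buffers.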
First, upload the local CSV file to HDFS from the command line, then check that the upload
succeeded.
At this point, the data of the first table has been inserted successfully. The other data can be
inserted by following the same process.
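The upload and import steps in this section can be driven from a script. The sketch below only assembles the command lines. It uses `hdfs dfs -put` and `hdfs dfs -ls` for the upload and the check, and HBase's ImportTsv tool for the batch load. The text does not show which loader was actually used, so ImportTsv is an assumption, and the table name, HDFS directory, and column qualifiers are hypothetical. The column family names are written as they appear in the text.

```python
import shlex


def hdfs_put_command(local_path: str, hdfs_dir: str) -> list[str]:
    """Command that uploads a local file to an HDFS directory."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


def hdfs_ls_command(hdfs_dir: str) -> list[str]:
    """Command that lists an HDFS directory to check the upload."""
    return ["hdfs", "dfs", "-ls", hdfs_dir]


def import_tsv_command(table: str, columns: list[str], hdfs_path: str,
                       separator: str = ",") -> list[str]:
    """Command that loads a delimited file whose first field is the rowkey."""
    return [
        "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        f"-Dimporttsv.separator={separator}",
        "-Dimporttsv.columns=" + ",".join(["HBASE_ROW_KEY"] + columns),
        table, hdfs_path,
    ]


# Hypothetical table name, HDFS directory, and column qualifiers.
columns = ["StackInfo:code", "StackInfo:date", "Statisitc:open", "Statisitc:close"]
for cmd in (
    hdfs_put_command("1.csv", "/stock"),
    hdfs_ls_command("/stock"),
    import_tsv_command("stock", columns, "/stock/1.csv"),
):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Hadoop and HBase clients are installed. The number of entries in `columns` has to match the number of fields that follow the rowkey in each line of the file.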