A Course in In-Memory Data Management: Prof. Hasso Plattner
Hasso Plattner, A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases
6 Dictionary Encoding
dictionary is the representing number for that text (here: “24” for Mary).
Until now, we have not saved any storage space. The benefits take effect
when values appear more than once in a column. In our tiny example, the
value “John” occurs twice in the column “fname”, namely at positions
39 and 42. Using dictionary encoding, the long text value (we assume 49 Byte
per entry in the first name column) is represented by a short integer value
(23 bits are needed to encode the 5 million different first names we assume
to exist in the world). The more often identical values appear, the better
dictionary encoding can compress a column. As we noted in Section 3.6,
enterprise data has low entropy. Therefore, dictionary encoding is well suited
and yields a good compression ratio. We will illustrate this with the first name
and gender columns of our world-population example.
Consider the world-population table with 8 billion rows and 200 Byte per row:
Each column is split into a dictionary and an attribute vector. The dic-
tionary stores all distinct values of the column. The valueID of each value
is implicitly given by the value’s position in the dictionary and thus does
not need to be stored explicitly.
In a dictionary-encoded column, the attribute vector only stores
valueIDs, which reference the entries in the dictionary. The recordID
(row number) is stored implicitly via the position of an entry in the attribute
vector. To sum up, with dictionary encoding all information can be stored as
integers instead of other, usually larger, data types.
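The split into dictionary and attribute vector can be sketched in a few lines. This is a minimal illustration; the function and variable names are ours, not from the book:

```python
def dictionary_encode(column):
    """Split a column into a dictionary of distinct values
    and an attribute vector of valueIDs."""
    dictionary = sorted(set(column))           # valueID = position in this list
    value_to_id = {v: i for i, v in enumerate(dictionary)}
    # recordID = position of an entry in the attribute vector
    attribute_vector = [value_to_id[v] for v in column]
    return dictionary, attribute_vector

fname = ["John", "Mary", "Jane", "John", "Peter"]
dictionary, attribute_vector = dictionary_encode(fname)
# dictionary:       ['Jane', 'John', 'Mary', 'Peter']
# attribute_vector: [1, 2, 0, 1, 3]
```

The original column can always be reconstructed by looking up each valueID in the dictionary, so no information is lost.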
6.1 Compression Example
How many bits are required to represent all 5 million distinct values of the
first name column “fname”?

⌈log2(5,000,000)⌉ = 23

Therefore, 23 bits are enough to represent all distinct values of that
column. Instead of using

8 billion · 49 Byte = 365.1 GB

for the first name column, the attribute vector itself can be reduced to the
size of

8 billion · 23 bit = 21.4 GB

In addition, the dictionary needs 5 million · 49 Byte ≈ 0.23 GB.
That means we have reduced the column size by a factor of about 17, and the
result consumes only about 6% of the initial amount of main memory.
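The arithmetic above can be checked directly. The constants come from the example; we assume the book's "GB" denotes 2^30 bytes, which reproduces its figures:

```python
import math

ROWS = 8_000_000_000           # table cardinality
DISTINCT_FNAMES = 5_000_000    # column cardinality
BYTES_PER_FNAME = 49
GB = 2 ** 30                   # assumption: "GB" means 2^30 bytes

bits = math.ceil(math.log2(DISTINCT_FNAMES))              # 23 bits
uncompressed = ROWS * BYTES_PER_FNAME / GB                # ~365.1 GB
attribute_vector_size = ROWS * bits / 8 / GB              # ~21.4 GB
dictionary_size = DISTINCT_FNAMES * BYTES_PER_FNAME / GB  # ~0.23 GB
factor = uncompressed / attribute_vector_size             # ~17
```

Including the dictionary, the encoded column takes roughly 21.65 GB, i.e. about 6% of the original 365.1 GB.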
Let us look at another example, the gender column, which has only 2 distinct
values. Without compression, 1 Byte is required for each value
(“m” or “f”). So, the amount of data without compression
is:
8 billion · 1 Byte = 7.45 GB
If compression is used, then 1 bit is enough to represent the same information.
The attribute vector takes:

8 billion · 1 bit = 0.93 GB

and the dictionary takes:

2 · 1 Byte = 2 Byte
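The gender column's numbers can be verified the same way, again assuming "GB" denotes 2^30 bytes:

```python
ROWS = 8_000_000_000
GB = 2 ** 30                        # assumption: "GB" means 2^30 bytes

uncompressed = ROWS * 1 / GB        # 1 Byte per value -> ~7.45 GB
attribute_vector_size = ROWS / 8 / GB  # 1 bit per value -> ~0.93 GB
dictionary_size = 2 * 1             # 2 distinct values * 1 Byte = 2 Byte
```

Here the dictionary is negligible, so the column shrinks by roughly a factor of 8, i.e. the ratio of 1 Byte to 1 bit.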
The compression rate depends on the size of the initial data type as well
as on the column’s entropy, which is determined by two cardinalities:
• Column cardinality, which is defined as the number of distinct values in
a column, and
• Table cardinality, which is the total number of rows in the table.
Entropy is a measure of how much information is contained in a
column. It is calculated as

entropy = column cardinality / table cardinality
The smaller the entropy of the column, the better the achievable compression
rate.
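Applying this measure to the two example columns shows how far apart they lie. This is a small sketch; the function name is ours:

```python
def entropy(column_cardinality, table_cardinality):
    # The book's simplified entropy measure (not Shannon entropy):
    # fraction of distinct values relative to the number of rows.
    return column_cardinality / table_cardinality

ROWS = 8_000_000_000
fname_entropy = entropy(5_000_000, ROWS)   # 0.000625
gender_entropy = entropy(2, ROWS)          # 2.5e-10
```

The gender column's entropy is several orders of magnitude smaller, which matches its far better compression factor.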
The first and most important effect of dictionary encoding is that all oper-
ations on the table data are now performed via the attribute vectors, which
consist solely of integers. This causes an implicit speedup of all operations,
since a CPU is designed to perform operations on numbers, not on charac-
ters. When explaining dictionary encoding, a question often asked is: “But
isn’t the process of looking up all values via an additional data structure
more costly than the actual savings? We understand the benefits concerning
main memory, but what about the processor?” – First, it has to be stated that
6.3 Operations on Encoded Values
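As a first illustration of operations on encoded values, a scan with an equality predicate can be sketched: one string lookup in the dictionary, then a pure integer comparison per row over the attribute vector. Names are illustrative, not from the book:

```python
def scan_equals(dictionary, attribute_vector, value):
    """Find recordIDs whose value equals `value`: a single string
    lookup in the dictionary, then an integer-only scan."""
    try:
        value_id = dictionary.index(value)   # one string comparison pass
    except ValueError:
        return []                            # value not in column at all
    # The scan itself touches only integers.
    return [rid for rid, vid in enumerate(attribute_vector) if vid == value_id]

dictionary = ["Jane", "John", "Mary", "Peter"]
attribute_vector = [1, 2, 0, 1, 3]
scan_equals(dictionary, attribute_vector, "John")   # -> [0, 3]
```

Compared with scanning 49-Byte strings, the per-row work shrinks to comparing small integers, which is exactly the speedup the previous section describes.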