0% found this document useful (0 votes)

85 views6 pages

A Course in In-Memory Data Management: Prof. Hasso Plattner

This document discusses dictionary encoding as a technique for compressing data in in-memory databases. Dictionary encoding replaces long text values with integer IDs, reducing memory usage. It provides examples showing how dictionary encoding can compress a first name column by a factor of 17 and a gender column by a factor of 8. Sorted dictionaries can provide faster lookups but updating the dictionary is more expensive. Operations on encoded data are performed using integers, which is faster than other data types.

Uploaded by

Alexandru Moldovan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

85 views6 pages

A Course in In-Memory Data Management: Prof. Hasso Plattner

Uploaded by

Alexandru Moldovan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 6

Prof.

Hasso Plattner

A Course in
In-Memory Data Management
The Inner Mechanics
of In-Memory Databases

August 30, 2013

This learning material is part of the reading material for Prof.

Plattner’s online lecture "In-Memory Data Management" taking place at
www.openHPI.de. If you have any questions or remarks regarding the
online lecture or the reading material, please give us a note at openhpi-
[email protected]. We are glad to further improve the material.
Chapter 6
Dictionary Encoding

Since memory is the new bottleneck, it is required to minimize access to it.

Accessing a smaller number of columns can do this on the one hand; so only
required attributes are queried. On the other hand, decreasing the number
of bits used for data representation can reduce both memory consumption
and transfer times.
Dictionary encoding builds the basis for several other compression tech-
niques (see Chapter 7) that might be applied on top of the encoded columns.
The main e↵ect of dictionary encoding is that long values, such as texts, are
represented as short integer values.
Dictionary encoding is relatively simple. This means not only that it is
easy to understand, but also it is easy to implement and does not have
to rely on complex multilevel procedures, which would limit or lessen the
performance gains. First, we will explain the general algorithm how original
values are translated to integers using the example presented in Figure 6.1.

Column'“fname”' Dic2onary'for'“fname”' A7ribute'Vector'for'“fname”'

recID' fname' valueID' Value' posi2on' valueID'
…" …" …" …" …" …"
39" John" 23" John" 39" 23"
40" Mary" 24" Mary" 40" 24"
41" Jane" 25" Jane" 41" 25"
42" John" 26" Peter" 42" 23"
43" Peter" …" …" 43" 26"
…" …" …" …"

Fig. 6.1: Dictionary Encoding Example

Dictionary encoding is applied on each column of a table separately. In the

example, every distinct value in the first name column “fname” is replaced
by a distinct integer value. The position of a text value (e.g. Mary) in the

37
38 6 Dictionary Encoding

dictionary is the representing number for that text (here: “24” for Mary).
Until now, we have not saved any storage space. The benefits come to e↵ect
with values appearing more than once in a column. In our tiny example, the
value “John” can be found twice in the column “fname”, namely on position
39 and 42. Using dictionary encoding, the long text value (we assume 49 Byte
per entry in the first name column) is represented by the short integer value
(23 bit are needed to encode the 5 million di↵erent first names we assume
to exist in the world). The more often identical values appear, the better
dictionary encoding can compress a column. As we noted in Section 3.6,
enterprise data has low entropy. Therefore, dictionary encoding is well suited
and yields a good compression ratio. We will exemplify this on the first name
and gender columns of our world-population example.

6.1 Compression Example

Given is the world population table with 8 billion rows, 200 Byte per row:

Attribute # of Distinct Values Size

first name 5 million 49 Byte
last name 8 million 50 Byte
gender 2 1 Byte
country 200 49 Byte
city 1 million 49 Byte
birthday 40 000 2 Byte
Sum 200 Byte

The complete amount of data is:

8 billion rows · 200 Byte per row = 1.6 TB

Each column is split into a dictionary and an attribute vector. Each dic-
tionary stores all distinct values of the column. The valueID of each value
is implicitly given by the value’s position in the dictionary and do thus not
need to be stored explicitly.
In a dictionary-encoded column, the attribute vectors now only store
valueIDs, which correspond to the valueIDs in the dictionary. The recordID
(row number) is stored implicitly via the position of an entry in the attribute
vector. To sum up, via dictionary encoding, all information can be stored as
integers instead of other, usually larger, data types.
6.1 Compression Example 39

Dictionary Encoding Example: First Names

How many bits are required to represent all 5 million distinct values of the
first name column “fname”?

dlog2 (5, 000, 000)e = 23

Therefore, 23 bits are enough to represent all distinct values for the required
column. Instead of using

8 billion · 49 Byte = 392 billion Byte = 365.1 GB

for the first name column, the attribute vector itself can be reduced to the
size of

8 billion · 23 bit = 184 billion bit = 23 billion Byte = 21.4 GB

and an additional dictionary is introduced, which needs

49 Byte · 5 million = 245 million Byte = 0.23 GB.

The achieved compression factor can be calculated as follows:

uncompressed size 365.1 GB

= ⇡ 17
compressed size 21.4 GB + 0.23 GB

That means we reduced the column size by a factor of 17 and the result
only consumes about 6 % of the initial amount of main memory.

Dictionary Encoding Example: Gender

Let us look on another example with the gender column. It has only 2 distinct
values. For the gender representation without compression for each value
(“m” or “f”) 1 Byte is required. So, the amount of data without compression
is:
8 billion · 1 Byte = 7.45 GB
If compression is used, then 1 bit is enough to represent the same information.
The attribute vector takes:

8 billion · 1bit = 8 billion bit = 0.93 GB

The dictionary needs additionally:

2 · 1 Byte = 2 Byte

This concludes to a compression factor of:

40 6 Dictionary Encoding

uncompressed size 7.45GB

= ⇡8
compressed size 0.93 GB + 2 Byte

The compression rate depends on the size of the initial data type as well
as on the column’s entropy, which is determined by two cardinalities:
• Column cardinality, which is defined as the number of distinct values in
a column, and
• Table cardinality, which is the total number of rows in the table or column
Entropy is a measure which shows how much information is contained in a
column. It is calculated as
column cardinality
entropy =
table cardinality

The smaller the entropy of the column, the better the achievable compression
rate.

6.2 Sorted Dictionaries

The benefits of dictionary encoding can be further enhanced if sorting is

applied to the dictionary. Retrieving a value from a sorted dictionary speeds
up the lookup process from O(n), which means a full scan through the dic-
tionary, to O(log(n)), because values in the dictionary can be found using
binary search. Sadly, this optimization comes at a cost: Every time a new
value is added to the dictionary, which does not belong at the end of the
sorted sequence, the dictionary has to be re-sorted. Consequently, the posi-
tions of already present values behind the inserted value have to be moved
one position up. While sorting the dictionary is not that costly, updating the
corresponding attribute vector is. In our example, about 8 billion values have
to be checked or updated if a new first name is added to the dictionary.

6.3 Operations on Encoded Values

The first and most important e↵ect of dictionary encoding is that all oper-
ations concerning the table data are now done via attribute vectors, which
solely consist of integers. This causes an implicit speedup of all operations,
since a CPU is designed to perform operations on numbers, not on charac-
ters. When explaining dictionary encoding, a question often asked is: “But
isn’t the process of looking up all values via an additional data structure
more costly than the actual savings? We understand the benefits concerning
main memory, but what about the processor” – First, it has to be stated that
6.3 Operations on Encoded Values 41

the question is deemed appropriate. The processor has to take additional

load, but this is acceptable, given the fact that our bottleneck is memory and
bandwidth, so a slight shift of pressure in the direction of the processor is
not only accepted but also welcome. Second, the impact of retrieving the ac-
tual values for the encoded columns is rather small. When selecting tuples,
only the corresponding values from the query have to be looked up in the
dictionary for the column scan. Generally, the result set is small compared to
the total table size, so the lookup of all other selected columns to materialize
the query result is not that expensive. Carefully written queries also only se-
lect those columns that are really needed, which not only saves bandwidth
but also further reduces the number of necessary lookups. Finally, several
operations such as COUNT or NOT NULL can even be performed without
retrieving the real values at all.

LESSON 14 Audio Media and Information
29% (7)
LESSON 14 Audio Media and Information
76 pages
Foundations of Database Storage Techniques: The Future of Enterprise Compu'ng
No ratings yet
Foundations of Database Storage Techniques: The Future of Enterprise Compu'ng
8 pages
A Course in In-Memory Data Management: Prof. Hasso Plattner
No ratings yet
A Course in In-Memory Data Management: Prof. Hasso Plattner
12 pages
1611.05428v2
No ratings yet
1611.05428v2
16 pages
B210 Sizing
100% (1)
B210 Sizing
51 pages
Cs403 Short Notes
No ratings yet
Cs403 Short Notes
53 pages
Sort Comp
No ratings yet
Sort Comp
10 pages
Chapter 1: Lossless Data Compression
No ratings yet
Chapter 1: Lossless Data Compression
4 pages
asdba
No ratings yet
asdba
2 pages
Softw Pract Exp - 2013 - Lemire - Decoding billions of integers per second through vectorization
No ratings yet
Softw Pract Exp - 2013 - Lemire - Decoding billions of integers per second through vectorization
29 pages
Data Interpretation Management
No ratings yet
Data Interpretation Management
50 pages
Ulllted States Patent (10) Patent N0.: US 8,549,004 B2
No ratings yet
Ulllted States Patent (10) Patent N0.: US 8,549,004 B2
12 pages
UNIT-5 Entropy Encoding
No ratings yet
UNIT-5 Entropy Encoding
8 pages
Chapter 5_Index Compression
No ratings yet
Chapter 5_Index Compression
28 pages
Literature Survey
No ratings yet
Literature Survey
5 pages
Lecture 5p2 - Index Construction & Compressing
No ratings yet
Lecture 5p2 - Index Construction & Compressing
40 pages
Information Retrieval - 3
No ratings yet
Information Retrieval - 3
36 pages
Asymmetric Distances For Binary Embeddings: Albert Gordo, Florent Perronnin, Yunchao Gong, Svetlana Lazebnik
No ratings yet
Asymmetric Distances For Binary Embeddings: Albert Gordo, Florent Perronnin, Yunchao Gong, Svetlana Lazebnik
15 pages
Dba110lab03 Answers
No ratings yet
Dba110lab03 Answers
13 pages
Chapter 7 SQL
No ratings yet
Chapter 7 SQL
32 pages
Main Techniques and Performance of Each Compression
No ratings yet
Main Techniques and Performance of Each Compression
23 pages
Laboratory Manual On MYSQL For Year IV Database Students (Power and Control Engineering Stream Students)
No ratings yet
Laboratory Manual On MYSQL For Year IV Database Students (Power and Control Engineering Stream Students)
20 pages
A Course in In-Memory Data Management: Prof. Hasso Plattner
No ratings yet
A Course in In-Memory Data Management: Prof. Hasso Plattner
4 pages
IP - Sep
No ratings yet
IP - Sep
2 pages
Jancy-Jayakumar2019 Article SequenceStatisticalCodeBasedDa
No ratings yet
Jancy-Jayakumar2019 Article SequenceStatisticalCodeBasedDa
15 pages
[email protected]
No ratings yet
[email protected]
8 pages
Handling of Categorical Data
No ratings yet
Handling of Categorical Data
18 pages
9 - Analytics Databases
No ratings yet
9 - Analytics Databases
12 pages
Index Compression: Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
100% (1)
Index Compression: Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
56 pages
HTCS501 unit 4
No ratings yet
HTCS501 unit 4
17 pages
CH 6
No ratings yet
CH 6
21 pages
Week08 - Physical Design
No ratings yet
Week08 - Physical Design
24 pages
14.1 Databases
No ratings yet
14.1 Databases
24 pages
Algorithms For Data Compression in Wireless Computing Systems
No ratings yet
Algorithms For Data Compression in Wireless Computing Systems
7 pages
Decision Science: Solution 1
No ratings yet
Decision Science: Solution 1
11 pages
Dce Easy Solution
0% (1)
Dce Easy Solution
87 pages
ADCexp 2
No ratings yet
ADCexp 2
7 pages
Unit-2 .Net Final
No ratings yet
Unit-2 .Net Final
52 pages
UEU Basis Data Pertemuan 14
No ratings yet
UEU Basis Data Pertemuan 14
32 pages
A New Approach For Compression On Textual Data
No ratings yet
A New Approach For Compression On Textual Data
4 pages
4.2.7 Dictionaries
No ratings yet
4.2.7 Dictionaries
16 pages
Data Types Range Examples
No ratings yet
Data Types Range Examples
7 pages
Journal of Discrete Algorithms: Andrei Kelarev, Joe Ryan, Leanne Rylands, Jennifer Seberry, Xun Yi
No ratings yet
Journal of Discrete Algorithms: Andrei Kelarev, Joe Ryan, Leanne Rylands, Jennifer Seberry, Xun Yi
10 pages
Test Data Compression Using Bitmask & Dictionary Selection
No ratings yet
Test Data Compression Using Bitmask & Dictionary Selection
13 pages
Database and Data Representation Sheet_024624
No ratings yet
Database and Data Representation Sheet_024624
7 pages
Practical Research 2 - Organizing Data
No ratings yet
Practical Research 2 - Organizing Data
19 pages
Unit 1 Data Compression
No ratings yet
Unit 1 Data Compression
30 pages
Content-Based Textual Big Data Analysis and Compression: Fei Gao Ananya Dutta Jiangjiang Liu
No ratings yet
Content-Based Textual Big Data Analysis and Compression: Fei Gao Ananya Dutta Jiangjiang Liu
6 pages
Coding Interview Questions and Answers
From Everand
Coding Interview Questions and Answers
Chinmoy Mukherjee
No ratings yet
Lecture I: Data Compression Data Encoding: Efficient Information Encoding To
No ratings yet
Lecture I: Data Compression Data Encoding: Efficient Information Encoding To
48 pages
Guc 437 59 31055 2023-05-25T16 41 09
No ratings yet
Guc 437 59 31055 2023-05-25T16 41 09
15 pages
Algorithms: Compressed Matching in Dictionaries
No ratings yet
Algorithms: Compressed Matching in Dictionaries
14 pages
798 m1 Slides v1 6hc Jgr8t0sj.pdf
No ratings yet
798 m1 Slides v1 6hc Jgr8t0sj.pdf
32 pages
8.4 - Find elements common in two lists using a HashtableDict.mp4 - Copy
No ratings yet
8.4 - Find elements common in two lists using a HashtableDict.mp4 - Copy
3 pages
Attribute Value Reordering For Efficient Hybrid Olap: Owen Kaser
No ratings yet
Attribute Value Reordering For Efficient Hybrid Olap: Owen Kaser
32 pages
mysql-data-types
No ratings yet
mysql-data-types
2 pages
Very Short Notes
No ratings yet
Very Short Notes
13 pages
Notes ML for Data science
No ratings yet
Notes ML for Data science
14 pages
Lecture 16
No ratings yet
Lecture 16
26 pages
Data Compression: Debra A. Lelewer and Daniel S. Hirschberg
No ratings yet
Data Compression: Debra A. Lelewer and Daniel S. Hirschberg
36 pages
Numerical Based On Indexing: Problem 1.2
No ratings yet
Numerical Based On Indexing: Problem 1.2
3 pages
Measuring Marketing Productivity: Current Knowledge and Future Directions
No ratings yet
Measuring Marketing Productivity: Current Knowledge and Future Directions
14 pages
A Course in In-Memory Data Management: Prof. Hasso Plattner
No ratings yet
A Course in In-Memory Data Management: Prof. Hasso Plattner
6 pages
Update PDF
No ratings yet
Update PDF
6 pages
Insert PDF
No ratings yet
Insert PDF
7 pages
A Course in In-Memory Data Management: Prof. Hasso Plattner
No ratings yet
A Course in In-Memory Data Management: Prof. Hasso Plattner
5 pages
Partitioning PDF
No ratings yet
Partitioning PDF
5 pages
A Course in In-Memory Data Management: Prof. Hasso Plattner
No ratings yet
A Course in In-Memory Data Management: Prof. Hasso Plattner
9 pages
Enterprise Application Characteristics: 3.1 Diverse Applications
No ratings yet
Enterprise Application Characteristics: 3.1 Diverse Applications
4 pages
Changes in Hardware: 4.1 Memory Cells
No ratings yet
Changes in Hardware: 4.1 Memory Cells
11 pages
New Requirements For Enterprise Computing: 2.1 Processing of Event Data
No ratings yet
New Requirements For Enterprise Computing: 2.1 Processing of Event Data
8 pages
1 s2.0 S0925527311003872 Main PDF
No ratings yet
1 s2.0 S0925527311003872 Main PDF
9 pages
1 s2.0 S0925527307001892 Main PDF
No ratings yet
1 s2.0 S0925527307001892 Main PDF
17 pages
Esports Year Book PDF
No ratings yet
Esports Year Book PDF
147 pages
Esports Year Book PDF
No ratings yet
Esports Year Book PDF
147 pages
NATO's Cyber Strategies and Wireless Warfare in The Information Age
No ratings yet
NATO's Cyber Strategies and Wireless Warfare in The Information Age
7 pages
6th Sem Syllbus
No ratings yet
6th Sem Syllbus
8 pages
WinnieXu CV
No ratings yet
WinnieXu CV
2 pages
Harry Potter and The Deathly Hallows Part 1 Dutch Subs
No ratings yet
Harry Potter and The Deathly Hallows Part 1 Dutch Subs
2 pages
Datasheet Axis
No ratings yet
Datasheet Axis
3 pages
UNIT 4a
No ratings yet
UNIT 4a
34 pages
Department of Electronics & Communication Engineering: National Institute of Technology Calicut
No ratings yet
Department of Electronics & Communication Engineering: National Institute of Technology Calicut
41 pages
Fast Image Compression Using Dic Algorithm
No ratings yet
Fast Image Compression Using Dic Algorithm
21 pages
Interpolation Paper
No ratings yet
Interpolation Paper
7 pages
Flickr Solution
No ratings yet
Flickr Solution
8 pages
Apache IoTDB A Time Series Database For IoT Applications
No ratings yet
Apache IoTDB A Time Series Database For IoT Applications
27 pages
Korurek
No ratings yet
Korurek
3 pages
Lossless Image Compression Systems
No ratings yet
Lossless Image Compression Systems
15 pages
Integrating Image and Video Encoders For Enhanced Video Understanding
No ratings yet
Integrating Image and Video Encoders For Enhanced Video Understanding
18 pages
Soc Hpec18
No ratings yet
Soc Hpec18
5 pages
Parts of The Camera: CCTV Note - Ii
No ratings yet
Parts of The Camera: CCTV Note - Ii
7 pages
Download ebooks file Voice Compression and Communications Principles and Applications for Fixed and Wireless Channels 1st Edition Lajos L. Hanzo all chapters
100% (3)
Download ebooks file Voice Compression and Communications Principles and Applications for Fixed and Wireless Channels 1st Edition Lajos L. Hanzo all chapters
81 pages
NCIS Los Angeles (2009) Season 6 - E.Rev 480p MKV x264
No ratings yet
NCIS Los Angeles (2009) Season 6 - E.Rev 480p MKV x264
2 pages
Color Image Processing Methods and Applications 1st Edition Rastislav Lukac instant download
No ratings yet
Color Image Processing Methods and Applications 1st Edition Rastislav Lukac instant download
43 pages
Mini Project 2
No ratings yet
Mini Project 2
4 pages
Digital Image Processing Project Report PDF
No ratings yet
Digital Image Processing Project Report PDF
6 pages
Haivision Makito x4 Series Datasheets
No ratings yet
Haivision Makito x4 Series Datasheets
20 pages
Altivar 71 Variable Frequency Drives VFD - VW3A3408
No ratings yet
Altivar 71 Variable Frequency Drives VFD - VW3A3408
2 pages
Emu3 Next-Token Prediction is All You Need
No ratings yet
Emu3 Next-Token Prediction is All You Need
23 pages
2Mp Wifi Mini PTZ Network Camera
No ratings yet
2Mp Wifi Mini PTZ Network Camera
3 pages
Pondicherry University Syllabus
No ratings yet
Pondicherry University Syllabus
39 pages
IPC-HFW2230S-S-S2 - Ficha Técnica Dahua
No ratings yet
IPC-HFW2230S-S-S2 - Ficha Técnica Dahua
3 pages
CC Ii PDF
No ratings yet
CC Ii PDF
28 pages
10d-Animation Tips Latest 4
No ratings yet
10d-Animation Tips Latest 4
10 pages
Image Compression Using Proposed Enhanced Run Length Encoding Algorithm
No ratings yet
Image Compression Using Proposed Enhanced Run Length Encoding Algorithm
14 pages

A Course in In-Memory Data Management: Prof. Hasso Plattner

Uploaded by

A Course in In-Memory Data Management: Prof. Hasso Plattner

Uploaded by

Prof.

August 30, 2013

This learning material is part of the reading material for Prof.

Since memory is the new bottleneck, it is required to minimize access to it.

Column'“fname”' Dic2onary'for'“fname”' A7ribute'Vector'for'“fname”'

Fig. 6.1: Dictionary Encoding Example

Dictionary encoding is applied on each column of a table separately. In the

6.1 Compression Example

Attribute # of Distinct Values Size

The complete amount of data is:

8 billion rows · 200 Byte per row = 1.6 TB

Dictionary Encoding Example: First Names

dlog2 (5, 000, 000)e = 23

8 billion · 49 Byte = 392 billion Byte = 365.1 GB

8 billion · 23 bit = 184 billion bit = 23 billion Byte = 21.4 GB

and an additional dictionary is introduced, which needs

49 Byte · 5 million = 245 million Byte = 0.23 GB.

The achieved compression factor can be calculated as follows:

uncompressed size 365.1 GB

Dictionary Encoding Example: Gender

8 billion · 1bit = 8 billion bit = 0.93 GB

The dictionary needs additionally:

This concludes to a compression factor of:

uncompressed size 7.45GB

6.2 Sorted Dictionaries

The benefits of dictionary encoding can be further enhanced if sorting is

6.3 Operations on Encoded Values

the question is deemed appropriate. The processor has to take additional

You might also like