
A Crash Course on Data Compression

1. Introduction
Giulio Ermanno Pibiri
ISTI-CNR, [email protected]

@giulio_pibiri

@jermp
Overview

• What is Data Compression and why do we need it?


• Fundamental questions and undecidability
• Some applications
• Technological limitations
• Warmup
What is Data Compression?

• The process by which data is transformed into another representation that
takes less storage space:
- save space when storing data,
- save time when transmitting data.

• The process must be reversible (exactly or admitting some loss) to be useful.


• Q. What is this “process”?
A. A computer program that takes data as input and produces a data
structure that takes less space than the input.

• We seek/need efficient programs that build such data structures.


Example: command line utility gzip.
Basic Model

        Compressor            Expander
B ────────────────> C(B) ────────────────> B
(input bit-string)  (output bit-string)    (input bit-string)

Some loss can happen in the Compressor, if wanted.

• Compression ratio. The compression ratio is defined as CR = |B| / |C(B)|.


• If CR = r, then the size of the compressed output |C(B)| is r times smaller
than the input size |B|.
Space vs. Time Trade-Off

• The compression ratio depends on many factors:


the desired compression/decompression speed, related to the amount of energy
spent (CPU power); the acceptable loss of precision; (…)

• The most common one: the trade-off between the space of the compressed
data structure and the efficiency of the operations that we want to support on
the data.
Example: gzip has 9 compression “levels” (1 is fastest but gives the “worst” compression; 9 is slower but gives the “best”).

• This trade-off is becoming more and more important: nowadays,


we cannot afford the naive approach “decompress and compute”.

• Ultimate goal: allow direct computation over compressed data.


Limit

• Proposition. No algorithm can compress every bit-string.


Proof.
Proceed by contradiction: suppose an algorithm C compresses every bit-string, so that
|B| > |C(B)| > |C(C(B))| > |C(C(C(B)))| > …, as in the picture below.

B ──C──> C(B) ──C──> C(C(B)) ──C──> … ──C──> C(…C(C(B))…)

Then, by iterating C, every possible bit-string could be compressed down to 0 bits;
but the 0-bit string cannot be expanded back into every possible input. Absurd. ◼


Fundamental Question(s)

• Q. What is the best way of compressing a file for my application?


A. This is an undecidable problem.

• An extreme, but very common, example: to compress a file, you may


replace it by the program that originated the file (related to the so-called
Kolmogorov complexity).
But how would you “find” (i.e., write) such a program?

• If you think about it: most of the data we deal with (Web pages, log files,
sequencing data, etc.) is created by programs, not by humans.
Undecidability
Q. How would you compress these 100,000 pseudo-random bits?
A. With the following piece of code (random_bits.cpp).

• The program is 220 bytes long (1 char = 1 byte).

• CR = 100000/(8 ⋅ 220) ≈ 56.8

• What if n = 1000000? What happens to the CR now?

Compile with:
g++ random_bits.cpp -o random_bits
Run with:
./random_bits
Data and Information

• Data and information are not the same.


• Information is the knowledge coming from interpreting data according to a
specific semantic scheme.

• Data is foreseen to grow much faster than information in the future:
data will become more and more redundant.

• So there are great possibilities for compression!


Huge amount of research being actively carried out.
Why Data Compression?

• We use compression everywhere/anytime; even without being aware of it.


• Ever-increasing demand for storage and large-scale computing.
- Generic file compression: gzip, bzip2, LZ4, Zstd, etc.
- Multimedia: images (JPEG, PNG, GIF); sound (MP3); video (MPEG, DVD);
Spotify, Netflix, etc.
- Search engines (Google, Yandex, Bing, …);
- Distributed storage (Dropbox, Google Drive, …).

• Communication cost.
- Skype, Zoom, FaceTime, WhatsApp, etc.
- Social networks (Facebook, Instagram, Twitter, …).

• Increased software performance.


Technological Limitations

• Whatever space we have available, we are going to fill it up, by virtue of our


eager human nature.

• Moore’s Law. Number of transistors on a chip doubles every 1.5 – 2 years.


• So we get faster processors…
• But not faster memories!

Memory Hierarchies

• If a program stalls…it is likely that it is waiting for memory.


• Thus, it is more important than ever to trade processor time
for RAM/disk access time.
• Action of compression: transfer more data to the processor.
See: https://colin-scott.github.io/personal_website/research/interactive_latency.html
A Simple Experiment

• large_record consumes 40 bytes overall;
small_record consumes 6 bytes overall (a slight lie).

• uint64_t is a primitive data type for unsigned 64-bit ints;
uint8_t is for unsigned 8-bit ints.

Experiment methodology.
1. Allocate two vectors of the same size, one
holding large_record objects and the other
holding small_record objects.
2. Fill the two vectors with the same data.
3. Sort the two vectors (say, on the day attribute).

With a b-bit unsigned integer, we can represent all values in [0, 2^b).
A Simple Experiment

Implementation notes (for the code shown on the slide):
- initialise the pseudo-random generator with a fixed seed to reproduce the results;
- create the vectors and reserve space, then fill them (steps 1+2);
- use the std::sort algorithm to sort the vectors, with a lambda function
to implement the comparison (step 3);
- use std::chrono to measure time.
A Simple Experiment
• Q. Which sort will take less time?
• Hint. Remember! The smaller the data, the more data can be transferred to the processor.

Compile with:
g++ -std=c++11 -O3 sort_bench.cpp -o sort_bench

Run with:
./sort_bench 10000000

The size of the data matters!


Further Readings

• Preface and Chapter 1 of:


Alistair Moffat and Andrew Turpin. 2002. Compression and Coding Algorithms.
Springer Science & Business Media, ISBN 978-1-4615-0935-6.

• Chapter 5.5 (pages 810-825) of:


Robert Sedgewick and Kevin Wayne. 2011. Algorithms, 4th Edition.
Addison-Wesley Professional, ISBN 0-321-57351-X.