
A Crash Course on Data Compression

1. Introduction
Giulio Ermanno Pibiri
ISTI-CNR, [email protected]

@giulio_pibiri

@jermp
Overview

• What is Data Compression and why do we need it?


• Fundamental questions and undecidability
• Some applications
• Technological limitations
• Warmup
What is Data Compression?

• The process by which data is transformed into another representation that
takes less storage space:
- save space when storing data,
- save time when transmitting data.

• The process must be reversible (exactly or admitting some loss) to be useful.


• Q. What is this “process”?
A. A computer program that takes data as input and produces a data
structure that takes less space than the input.

• We seek/need efficient programs that build such data structures.


Example: command line utility gzip.
Basic Model

        Compressor            Expander
B ────────────────> C(B) ────────────────> B
(input bit-string)  (output bit-string)    (input bit-string)

Some loss can happen in the Compressor, if wanted.

• Compression ratio. The compression ratio is defined as CR = |B| / |C(B)|.


• If CR = r, then the size of the compressed output |C(B)| is r times smaller
than the input size |B|.
Space vs. Time Trade-Off

• The compression ratio depends on many factors:


the desired compression/decompression speed, related to the amount of energy
spent (CPU power); the acceptable loss of precision; (…)

• The most common one: the trade-off between the space of the compressed
data structure and the efficiency of the operations that we want to support on
the data.
Example: gzip has 9 compression “levels” (1 is fastest but gives the “worst” compression; 9 is slower but gives the “best”).

• This trade-off is becoming more and more important: nowadays,


we cannot afford the naive approach “decompress and compute”.

• Ultimate goal: allow direct computation over compressed data.


Limit

• Proposition. No algorithm can compress every bit-string.


Proof.
Proceed by contradiction: suppose an algorithm C compresses every bit-string, so that
|B| > |C(B)| > |C(C(B))| > |C(C(C(B)))| > …, as in the picture below.

B ──C──> C(B) ──C──> C(C(B)) ──C──> … ──C──> C(…C(C(B))…)

Then, by iterating C, every possible bit-string could be compressed down to 0 bits;
but the 0-bit string cannot be expanded back into every possible input. Absurd. ◼


Fundamental Question(s)

• Q. What is the best way of compressing a file for my application?


A. This is an undecidable problem.

• An extreme, but very common, example: to compress a file, you may


replace it by the program that originated the file (related to the so-called
Kolmogorov complexity).
But how would you “find” (i.e., write) such a program?

• If you think about it: most of the data we deal with (Web pages, log files,
sequencing data, etc.) is created by programs, not by humans.
Undecidability
Q. How would you compress these 100,000 pseudo-random bits?
A. With the following piece of code (random_bits.cpp).

• The program is 220 bytes long (1 char = 1 byte).

• CR = 100000/(8 ⋅ 220) ≈ 56.8

• What if n = 1000000? What happens to the CR now?

Compile with:
g++ random_bits.cpp -o random_bits
Run with:
./random_bits
Data and Information

• Data and information are not the same.


• Information is the knowledge coming from interpreting data according to a
specific semantic scheme.

• Data is foreseen to grow much faster than information in the future:
data will become more and more redundant.

• So there are great possibilities for compression!


Huge amount of research being actively carried out.
Why Data Compression?

• We use compression everywhere/anytime; even without being aware of it.


• Ever-increasing demand for storage and large-scale computing.
- Generic file compression: gzip, bzip2, LZ4, Zstd, etc.
- Multimedia: images (JPEG, PNG, GIF); sound (MP3); video (MPEG, DVD);
Spotify, Netflix, etc.
- Search engines (Google, Yandex, Bing, …);
- Distributed storage (Dropbox, Google Drive, …).

• Communication cost.
- Skype, Zoom, FaceTime, WhatsApp, etc.
- Social networks (Facebook, Instagram, Twitter, …).

• Increased software performance.


Technological Limitations

• Whatever space we have available, we are going to fill it up, by virtue of our


eager human nature.

• Moore’s Law. Number of transistors on a chip doubles every 1.5 – 2 years.


• So we get faster processors…
• But not faster memories!

Memory Hierarchies

• If a program stalls…it is likely that it is waiting for memory.


• Thus, it is more important than ever to trade processor time
for RAM/disk access time.
• Action of compression: transfer more data to the processor.
See: https://colin-scott.github.io/personal_website/research/interactive_latency.html
A Simple Experiment

• large_record consumes 40 bytes overall;
small_record consumes 6 bytes overall (a slight lie).

• uint64_t is a primitive data type for unsigned 64-bit ints;
uint8_t is for unsigned 8-bit ints.

Experiment methodology.
1. Allocate two vectors of the same size, one
holding large_record objects and the other
holding small_record objects.
2. Fill the two vectors with the same data.
3. Sort the two vectors (say, on the day attribute).

With a b-bit unsigned integer, we can represent all values in [0, 2^b).
A Simple Experiment

Implementation notes (for the code shown on the slide):
- initialise the pseudo-random generator with a fixed seed to reproduce the results;
- create the vectors and reserve space, then fill them (steps 1+2);
- use the std::sort algorithm to sort the vectors, with a lambda function
to implement the comparison (step 3);
- use std::chrono to measure time.
A Simple Experiment
• Q. Which sort will take less time?
• Hint. Remember! The smaller the data, the more data can be transferred to the processor.

Compile with:
g++ -std=c++11 -O3 sort_bench.cpp -o sort_bench

Run with:
./sort_bench 10000000

The size of the data matters!


Further Readings

• Preface and Chapter 1 of:


Alistair Moffat and Andrew Turpin. 2002. Compression and Coding Algorithms.
Springer Science & Business Media, ISBN 978-1-4615-0935-6.

• Chapter 5.5 (pages 810-825) of:


Robert Sedgewick and Kevin Wayne. 2011. Algorithms, 4th Edition.
Addison-Wesley Professional, ISBN 0-321-57351-X.