Lec 5: Data Compression (Part 3)

CS258: Information Theory

Fan Cheng
Shanghai Jiao Tong University

http://www.cs.sjtu.edu.cn/~chengfan/
[email protected]
Spring, 2023
Outline

• Kraft inequality
• Optimal codes
• Huffman coding
• Shannon-Fano-Elias coding
• Generation of discrete distribution
• Universal source coding
Random Variable Generation
• We are given a sequence of fair coin tosses and we wish to generate a random variable X with probability mass function p(x).
• Let the random variable T denote the number of coin flips used in the algorithm.

Heads vs. Tails
• Generate a random variable according to the outcome of fair coin flips: HHHH, TTTTT, HTHTHT, THTHTH, ...
• If X takes values a, b, c with probabilities 1/2, 1/4, 1/4, map
    H:  X = a
    TH: X = b
    TT: X = c
• How many fair coin flips are needed to generate X?

The entropy of X: H(X) = (1/2)·1 + (1/4)·2 + (1/4)·2 = 1.5 bits
The expected number of coin flips: E[T] = 1·(1/2) + 2·(1/4) + 2·(1/4) = 1.5
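Below is a minimal simulation sketch (added for illustration, not from the slides) of the H/TH/TT mapping above; it draws X many times, estimates E[T] empirically, and compares it with H(X) = 1.5 bits.

```python
import random
from math import log2

# Added sketch: simulate the H/TH/TT scheme, which generates X ~ (1/2, 1/4, 1/4)
# from fair coin flips, and compare the average number of flips with H(X).

def generate_x():
    """Return (symbol, number of flips used) for one draw of X."""
    first = random.choice("HT")
    if first == "H":
        return "a", 1                                   # H  -> a (probability 1/2, 1 flip)
    second = random.choice("HT")
    return ("b", 2) if second == "H" else ("c", 2)      # TH -> b, TT -> c (2 flips each)

n = 100_000
flips = 0
counts = {"a": 0, "b": 0, "c": 0}
for _ in range(n):
    x, t = generate_x()
    counts[x] += 1
    flips += t

entropy = -(0.5 * log2(0.5) + 0.25 * log2(0.25) + 0.25 * log2(0.25))
print("empirical distribution:", {k: v / n for k, v in counts.items()})
print("average flips E[T] ≈", flips / n, " vs  H(X) =", entropy)
```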


Random Variable Generation
Representation of a generation algorithm
• We can describe the algorithm mapping strings of fair bits to possible outcomes X by a binary tree.
• The leaves of the tree are marked by output symbols, and the path to each leaf is given by the sequence of bits produced by the fair coin.

The tree representing the algorithm must satisfy certain properties:
• The tree should be complete (i.e., every node is either a leaf or has two descendants in the tree). The tree may be infinite, as we will see in some examples.
• The probability of a leaf at depth k is 2^{-k}. Many leaves may be labeled with the same output symbol; the total probability of all these leaves should equal the desired probability of the output symbol.
• The expected number of fair bits required to generate X is equal to the expected depth of this tree.

(Figure: tree for generating the distribution. Intuition: each coin toss generates one fair bit.)
Random Variable Generation
Let 𝒴 denote the set of leaves of a complete tree. Consider a distribution on the leaves such that the probability of a leaf at depth k on the tree is 2^{-k}. Let Y be a random variable with this distribution.
(Lemma) For any complete tree, consider a probability distribution on the leaves such that the probability of a leaf at depth k is 2^{-k}. Then the expected depth of the tree is equal to the entropy of this distribution: E[T] = H(Y).
• The expected depth of the tree is
    E[T] = Σ_{y∈𝒴} k(y) 2^{-k(y)}
• The entropy of the distribution of Y is
    H(Y) = −Σ_{y∈𝒴} 2^{-k(y)} log 2^{-k(y)} = Σ_{y∈𝒴} k(y) 2^{-k(y)}
  where k(y) denotes the depth of leaf y. Thus,
    H(Y) = E[T]
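As a quick added check of the lemma (not from the slides), take the complete tree with leaves at depths 1, 2, 2, i.e., the tree that generates the (1/2, 1/4, 1/4) distribution above:

```latex
% Added check: expected depth vs. leaf entropy for a tree with leaves at depths 1, 2, 2
\[
  E[T] = 1\cdot\tfrac{1}{2} + 2\cdot\tfrac{1}{4} + 2\cdot\tfrac{1}{4} = 1.5,
  \qquad
  H(Y) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 1.5 ,
\]
```

so indeed E[T] = H(Y).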
Random Variable Generation
(Theorem) For any algorithm generating X, the expected number of fair bits used is at least the entropy H(X), that is,
    E[T] ≥ H(X)
• Any algorithm generating X from fair bits can be represented by a complete binary tree. Label all the leaves of this tree by distinct symbols y ∈ 𝒴 = {1, 2, ...}. If the tree is infinite, the alphabet 𝒴 is also infinite.
• Now consider the random variable Y defined on the leaves of the tree, such that for any leaf y at depth k, the probability that Y = y is 2^{-k}. By the lemma, the expected depth of this tree is equal to the entropy of Y:
    E[T] = H(Y)
• Now the random variable X is a function of Y (one or more leaves map onto an output symbol), and hence H(X) ≤ H(Y), so we have
    E[T] ≥ H(X)
Random Variable Generation
(Theorem) Let the random variable X have a dyadic distribution. The optimal algorithm to generate X from fair coin flips requires an expected number of coin tosses precisely equal to the entropy:
    E[T] = H(X)
• For the constructive part, we use the Huffman code tree for X as the tree to generate the random variable. Each x will correspond to a leaf.
• For a dyadic distribution, the Huffman code is the same as the Shannon code and achieves the entropy bound.
• For any x, the depth of the leaf in the code tree corresponding to x is the length of the corresponding codeword, which is log(1/p(x)). Hence, when this code tree is used to generate X, the leaf x will have probability 2^{-log(1/p(x))} = p(x).
• The expected number of coin flips is the expected depth of the tree, which is equal to the entropy (because the distribution is dyadic). Hence, for a dyadic distribution, the optimal generating algorithm achieves E[T] = H(X).
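A small added numerical check, assuming the dyadic example distribution (1/2, 1/4, 1/8, 1/8): the codeword lengths −log2 p(x) are integers, and using the code tree as the generating tree gives an expected depth equal to H(X).

```python
from math import log2

# Added check (assumed dyadic distribution): Shannon/Huffman codeword lengths
# -log2 p(x) are integers, and the expected leaf depth equals the entropy.
p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}

depths = {x: int(-log2(px)) for x, px in p.items()}      # leaf depths = codeword lengths
expected_depth = sum(px * depths[x] for x, px in p.items())
entropy = -sum(px * log2(px) for px in p.values())

print("leaf depths:", depths)                             # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print("E[T] =", expected_depth, "  H(X) =", entropy)      # both 1.75
```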
Random Variable Generation
• What if the distribution is not dyadic? In this case we cannot use the same idea, since the code tree for the Huffman code would generate a dyadic distribution on the leaves, not the distribution with which we started.
• Since all the leaves of the tree have probabilities of the form 2^{-k}, it follows that we should split any probability p(x) that is not of this form into atoms of this form. We can then allot these atoms to leaves on the tree.
• Find the binary expansions of the probabilities p(x). Let the binary expansion of the probability p(x) be
    p(x) = Σ_{j≥1} p_j^{(x)},  where p_j^{(x)} = 2^{-j} or 0.
  Then the atoms of the expansion are the nonzero terms p_j^{(x)}.
• Since Σ_x p(x) = 1, the sum of the probabilities of these atoms is 1. We will allot an atom of probability 2^{-j} to a leaf at depth j on the tree.
• The depths j of the atoms satisfy the Kraft inequality, so we can always construct such a tree with all the atoms at the right depths.
Random Variable Generation
Example: let X have a non-dyadic distribution.
• We find the binary expansions of these probabilities.
• The atoms of the expansion are the resulting dyadic terms; each atom of probability 2^{-j} is allotted to a leaf at depth j.
(Figure: tree to generate the distribution.)
• This procedure yields a tree that generates the random variable X. We have argued that this procedure is optimal (it gives a tree of minimum expected depth).
• (Theorem) The expected number of fair bits E[T] required by the optimal algorithm to generate a random variable X lies between H(X) and H(X) + 2:
    H(X) ≤ E[T] < H(X) + 2
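The sketch below (added, not the slides' material) carries out the atom construction for an assumed non-dyadic example distribution (2/3, 1/3): it splits each probability into dyadic atoms via its truncated binary expansion, checks the Kraft inequality, and verifies H(X) ≤ E[T] < H(X) + 2.

```python
from math import log2

# Added sketch: split each probability into dyadic "atoms" via its binary
# expansion, check the Kraft inequality, and compute the expected depth of the
# resulting generating tree. The distribution (2/3, 1/3) is an assumed example;
# the expansions are truncated at a finite depth.
p = {"a": 2/3, "b": 1/3}
MAX_DEPTH = 40

def atoms(prob, max_depth=MAX_DEPTH):
    """Return the depths j of the dyadic atoms 2^{-j} in the binary expansion of prob."""
    depths, rest = [], prob
    for j in range(1, max_depth + 1):
        if rest >= 2**-j:
            depths.append(j)
            rest -= 2**-j
    return depths

all_depths = [j for prob in p.values() for j in atoms(prob)]
kraft = sum(2**-j for j in all_depths)
expected_depth = sum(j * 2**-j for j in all_depths)
entropy = -sum(px * log2(px) for px in p.values())

print("Kraft sum   :", kraft)                     # ≈ 1, so a complete tree exists
print("E[T]        :", expected_depth)            # ≈ 2 for (2/3, 1/3)
print("H(X), H(X)+2:", entropy, entropy + 2)      # bound H(X) <= E[T] < H(X) + 2
```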
Universal Source Coding
Challenge: For many practical situations, however, the probability distribution underlying the source may be unknown.
• One possible approach is to wait until we have seen all the data, estimate the distribution from the data, use this distribution to construct the best code, and then go back to the beginning and compress the data using this code.
• This two-pass procedure is used in some applications where there is a fairly small amount of data to be compressed.
• In yet other cases, there is no probability distribution underlying the data; all we are given is an individual sequence of outcomes. How well can we compress the sequence?
• If we do not put any restrictions on the class of algorithms, we get a meaningless answer: there always exists a function that compresses a particular sequence to one bit while leaving every other sequence uncompressed. This function is clearly "overfitted" to the data.
• Assume we have a random variable X drawn according to a distribution p_θ from the family {p_θ}, where the parameter θ is unknown.
• We wish to find an efficient code for this source.
Minimax Redundancy
• If we know θ, we can construct a code with codeword length l(x) = log(1/p_θ(x)).
• What happens if we do not know the true distribution p_θ, yet wish to code as efficiently as possible? In this case, using a code with codeword lengths l(x) and implied probability q(x) = 2^{-l(x)}, we define the redundancy of the code as the difference between the expected length of the code and the lower limit for the expected length:
    R(p_θ, q) = Σ_x p_θ(x) [ l(x) − log(1/p_θ(x)) ] = Σ_x p_θ(x) log( p_θ(x)/q(x) ) = D(p_θ ∥ q)
• We wish to find a code that does well irrespective of the true distribution p_θ, and thus we define the minimax redundancy as
    R* = min_q max_θ R(p_θ, q) = min_q max_θ D(p_θ ∥ q)
Redundancy and Capacity
How to compute R*: take p_θ(x) as a transition matrix
    θ → p_θ(x) → X
This is a channel {θ, p_θ(x), X}. The capacity of this channel is given by
    C = max_{π(θ)} I(θ; X) = max_π Σ_θ Σ_x π(θ) p_θ(x) log( p_θ(x) / q_π(x) )
where
    q_π(x) = Σ_θ π(θ) p_θ(x)
(Theorem) The minimax redundancy equals the capacity of the channel whose transition matrix has rows p_θ(·):
    R* = C
Channel capacity is well understood.
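A small added numerical sketch, with an assumed two-member family of binary distributions: it estimates R* = min_q max_θ D(p_θ ∥ q) and the capacity C = max_π I(θ; X) by brute-force grid search, and the two agree (up to grid resolution) as the theorem predicts.

```python
import numpy as np

# Added sketch (assumed two-member family of binary distributions): compare the
# minimax redundancy R* = min_q max_theta D(p_theta || q) with the capacity
# C = max_pi I(theta; X) of the channel whose transition matrix has rows p_theta.
P = np.array([[0.9, 0.1],      # p_1(x)
              [0.3, 0.7]])     # p_2(x)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits (all entries positive here)."""
    return np.sum(p * np.log2(p / q))

grid = np.linspace(1e-6, 1 - 1e-6, 10001)

# Minimax redundancy: search over coding distributions q = (t, 1 - t)
R_star = min(max(kl(row, np.array([t, 1 - t])) for row in P) for t in grid)

# Capacity: maximize I(theta; X) over priors pi = (s, 1 - s)
def mutual_info(s):
    pi = np.array([s, 1 - s])
    q = pi @ P                                    # output distribution q_pi(x)
    return sum(pi[i] * kl(P[i], q) for i in range(2))

C = max(mutual_info(s) for s in grid)

print(f"minimax redundancy R* ≈ {R_star:.4f} bits")
print(f"channel capacity   C  ≈ {C:.4f} bits")    # equal up to grid error
```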
Shannon-Fano-Elias → Arithmetic Coding
Shannon-Fano-Elias coding:
• Motivation: using intervals to represent symbols.

Consider a random variable with a ternary alphabet {A, B, C}, with probabilities 0.4, 0.4, and 0.2, respectively.
Let the sequence to be encoded be ACAA:
• A → [0, 0.4)
• AC → [0.32, 0.4) (scale C's interval [0.8, 1.0) into [0, 0.4), i.e., by ratio 0.4)
• ACA → [0.32, 0.352)
• ACAA → [0.32, 0.3328)
• The procedure is incremental and can be used for any blocklength.
• Coding by intervals: a new insight.
"When the train was first invented, it was slower than a horse-drawn carriage."
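The following added sketch reproduces the interval refinement above for the sequence ACAA (symbols ordered A, B, C with probabilities 0.4, 0.4, 0.2).

```python
# Added sketch: interval refinement for the sequence "ACAA" with
# p(A) = 0.4, p(B) = 0.4, p(C) = 0.2 (symbols ordered A, B, C).
probs = {"A": 0.4, "B": 0.4, "C": 0.2}

# Cumulative lower bounds: A -> 0.0, B -> 0.4, C -> 0.8
cum, total = {}, 0.0
for sym, p in probs.items():
    cum[sym] = total
    total += p

def encode_interval(sequence):
    """Return the final [low, high) interval for the given symbol sequence."""
    low, width = 0.0, 1.0
    for sym in sequence:
        low = low + width * cum[sym]       # shift into the symbol's subinterval
        width = width * probs[sym]         # shrink by the symbol's probability
        print(f"after {sym!r}: [{low:.4f}, {low + width:.4f})")
    return low, low + width

encode_interval("ACAA")   # ends at [0.3200, 0.3328), matching the slide
```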
Lempel-Ziv Coding: Introduction
• The use of dictionaries for compression dates back to the invention of the telegraph.
    "25: Merry Christmas"
    "26: May Heaven's choicest blessings be showered on the newly married couple."
• The idea of adaptive dictionary-based schemes was not explored until Ziv and Lempel wrote their papers in 1977 and 1978. The two papers describe two distinct versions of the algorithm. We refer to these versions as LZ77 or sliding window Lempel-Ziv and LZ78 or tree-structured Lempel-Ziv.

Abraham Lempel, Yaakov Ziv

Used in gzip, pkzip, compress in Unix, GIF.
Lempel-Ziv Coding: Sliding Window
The key idea of the Lempel-Ziv algorithm is to parse the string into phrases and to replace phrases by pointers to where the same string has occurred in the past.

Sliding Window Lempel-Ziv Algorithm
• We assume that we have a string x_1, x_2, ... to be compressed from a finite alphabet. A parsing S of a string x_1 x_2 ... x_n is a division of the string into phrases, separated by commas. Let W be the length of the window.
• Assume that we have compressed the string until time i − 1. Then to find the next phrase, find the largest k such that for some j with i − W ≤ j ≤ i − 1, the string of length k starting at x_j is equal to the string of length k starting at x_i (i.e., x_{j+l} = x_{i+l} for all 0 ≤ l < k). The next phrase is then of length k (i.e., x_i ... x_{i+k−1}) and is represented by the pair (P, L), where P is the location of the beginning of the match and L is the length of the match.
• If a match is not found in the window, the next character is sent uncompressed.

Example: 0101010101010101011010101010101101, W = 7
(On the slide, the current window and the maximum repeated substring inside it are highlighted.)
Lempel-Ziv Coding: Sliding Window
Example: 0101010101010101011010101010101101, W = 6
(Again, the window and the maximum repeated substring inside it are highlighted on the slide.)

Example: parse the string ABBABBABBBAABABA step by step:
ABBABBABBBAABABA
A BBABBABBBAABABA
A, B BABBABBBAABABA

A, B, B ABBABBBAABABA

A, B, B, ABBABB BAABABA

A, B, B, ABBABB, BA ABABA

A, B, B, ABBABB, BA, A BABA

A, B, B, ABBABB, BA, A, BA BA

A, B, B, ABBABB, BA, A, BA, BA
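An added sketch of a sliding-window LZ77-style parser; the window length 4 used in the call below is an assumption, chosen because it reproduces the parse shown above.

```python
# Added sketch (not the slides' code): a simple sliding-window LZ77-style parser.
# Each phrase is either a (position, length) pointer to a match starting in the
# last `window` characters (the match may run past the current position), or a
# literal character when no match is found in the window.
def lz77_parse(s, window):
    i, tokens, phrases = 0, [], []
    while i < len(s):
        best_len, best_pos = 0, None
        for j in range(max(0, i - window), i):        # candidate match starts
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len > 0:
            tokens.append((best_pos, best_len))       # pointer (P, L)
            phrases.append(s[i:i + best_len])
            i += best_len
        else:
            tokens.append(s[i])                       # literal character
            phrases.append(s[i])
            i += 1
    return tokens, phrases

tokens, phrases = lz77_parse("ABBABBABBBAABABA", window=4)
print(phrases)   # ['A', 'B', 'B', 'ABBABB', 'BA', 'A', 'BA', 'BA']
print(tokens)
```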


Lempel-Ziv Coding: Tree-Structured
 In the 1978 paper, Ziv and Lempel described an algorithm that parses
a string into phrases, where each phrase is the shortest phrase
not seen earlier.
 This algorithm can be viewed as building a dictionary in the form of a
tree, where the nodes correspond to phrases seen so far.
 Find a string in a set of strings: Trie
ABBABBABBBAABABAA
A BBABBABBBAABABAA
A, B BABBABBBAABABAA

A, B, BA BBABBBAABABAA

A, B, BA, BB ABBBAABABAA

A, B, BA, BB, AB BBAABABAA

A, B, BA, BB, AB, BBA ABABAA

A, B, BA, BB, AB, BBA, ABA BAA

A, B, BA, BB, AB, BBA, ABA, BAA
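An added sketch of LZ78-style parsing, where each new phrase is the shortest prefix of the remaining string not seen before; it reproduces the parse above.

```python
# Added sketch: LZ78-style parsing into phrases, where each new phrase is the
# shortest prefix of the remaining string that has not been seen before.
# The set of previously seen phrases plays the role of the trie/dictionary.
def lz78_parse(s):
    seen, phrases = set(), []
    i = 0
    while i < len(s):
        k = 1
        # extend the candidate phrase until it is new (or the string ends)
        while i + k <= len(s) and s[i:i + k] in seen:
            k += 1
        phrase = s[i:i + k]
        phrases.append(phrase)
        seen.add(phrase)
        i += k
    return phrases

print(lz78_parse("ABBABBABBBAABABAA"))
# ['A', 'B', 'BA', 'BB', 'AB', 'BBA', 'ABA', 'BAA']
```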
Optimality of LZ77, LZ78
Ref. Ch. 13.5 T. Cover
Summary
Cover: 5.11, 13.1, 13.3, 13.4, 13.5
