
Ch 0 Introduction

§0.1 Overview of Information Theory and Coding


Overview
Information theory was founded by Shannon in 1948.
This theory applies to transmission (communication systems) or recording (storage systems) over/in a channel. The channel can be wireless or wired (communication: copper telephone lines or fiber-optic cables) or a magnetic or optical disk (storage).

There are three aspects that need to be considered:

- Compression
- Error Detection and Correction
- Cryptography

Information Theory is based on the Probability Theory.

A communication or compression procedure includes:

[Figure: Sent messages (symbols) → Source coding (compression; characterized by the source entropy) → Channel coding (error detection and correction; characterized by the channel capacity) → binary stream 0110101001110… → channel → Channel decoding → Source decoding (decompression) → Received messages. The chain comprises the source, encoder, channel, decoder, and receiver.]

Digital Communication and Storage Systems

A basic information processing system consists of a source, an encoder, the channel, a decoder, and a receiver (destination).

Channel: produces a received signal r which differs from the original signal c (the channel introduces noise, channel distortion, etc.). Thus, the decoder can only produce an estimate m' of the original message m.

Goal of processing: Information conveyed through (or stored in) the channel must be reproduced at the destination as reliably as possible. At the same time, the system needs to allow the transmission of as much information as possible per unit time (communication system) or per unit of storage (storage system).

Information Source
The Source Message m consists of a time sequence of symbols emitted by the
information source. The source can be:
Continuous-time Source, if this message is continuous in time, e.g., speech waveform.
Discrete-time Source, if the message is discrete in time, e.g., data sequences from a
computer.

The symbols emitted by the source can be:


Continuous in amplitude, e.g., speech waveform.
Discrete in amplitude, e.g., text with a finite symbol alphabet.
This course is primarily concerned with discrete-time and discrete-amplitude (i.e., digital) sources, as practically all new communication or storage systems fall into this category.

Since information and coding theory depends on probability theory, we review the latter first.

§0.2 Review of Random Variables and Probability

Probability
Let us consider a single experiment, such as the rolling of a die, with a number of possible outcomes. The sample space S of the experiment consists of the set of all possible outcomes.

In the case of a die, S = {1, 2, 3, 4, 5, 6}, with the integers representing the number of dots on the six faces of the die.

Event: any subset A of the sample space S.
Complement: the event Ā, consisting of all sample points of S that are not in A.

Example 0.1: For S and A defined above, find Ā.

Two events are said to be Mutually Exclusive if they have no sample points in common.

For example:
Or:

The Union of two events A and B, denoted A ∪ B, is the event consisting of all sample points that are in A or in B (or in both).

The Intersection of two events, denoted A ∩ B, is the event consisting of the sample points that are in both A and B.

Associated with each event A contained in S is its Probability, denoted by P(A).

This has the following properties: 0 ≤ P(A) ≤ 1, P(S) = 1, and P(∅) = 0.

For mutually exclusive events, $A_i \cap A_j = \emptyset$ for $i \neq j$, the probability of the union is

$$P\Big(\bigcup_i A_i\Big) = \sum_i P(A_i).$$

Example 0.2: If A = {2, 4}, find P(A).

Joint Event and Joint Probability
Instead of dealing with a single experiment, let us perform two experiments and consider
their outcomes.

For example: the two experiments can be separate tosses of a single die or a single toss of a pair of dice.

- The sample space S consists of the 36 two-tuples (i, j), where i, j ∈ {1, ..., 6}.
- Each point in the sample space is assigned the probability 1/36.
Let us denote by $A_i$, $i = 1, \ldots, n$, the outcomes of the first experiment, and by $B_j$, $j = 1, \ldots, m$, the outcomes of the second experiment.

Assuming that the outcomes $B_j$, $j = 1, \ldots, m$, are mutually exclusive and $\bigcup_j B_j = S$, it follows that

$$P(A_i) = \sum_{j=1}^{m} P(A_i, B_j).$$

If $A_i$, $i = 1, \ldots, n$, are mutually exclusive and $\bigcup_i A_i = S$, then

$$P(B_j) = \sum_{i=1}^{n} P(A_i, B_j).$$

In addition, $\sum_{i=1}^{n}\sum_{j=1}^{m} P(A_i, B_j) = 1$.

Conditional Probability
A joint event  A, B  occurs with the probability P  A, B  , which can be expressed as:

,
   
where P A B and P B A are conditional probabilities.

Example 0.3: Let us assume that we toss a dice.


The events are A  1,2,3 and B  1,3,6 , find P ( B | A) .

A conditional probability is P(A|B) = P(A, B)/P(B).

Let A and B be two events in a single experiment:

If these are mutually exclusive (A ∩ B = ∅), then P(A|B) = 0.

If A ⊂ B, then P(A|B) = P(A)/P(B) and P(B|A) = 1.

The Bayes Theorem:

If $A_i$, $i = 1, \ldots, n$, are mutually exclusive and $\bigcup_{i=1}^{n} A_i = S$, then

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}.$$

 
Statistical Independence: Let P(A|B) be the probability of occurrence of A given that B has occurred. Suppose that the occurrence of A does not depend on the occurrence of B. Then, P(A|B) = P(A), and therefore P(A, B) = P(A)P(B).

Example 0.4: Two successive experiments in tossing a die:

A = {2, 4, 6}, P(A) = 1/2  (even-numbered sample points in the first toss)
B = {2, 4, 6}, P(B) = 1/2  (even-numbered sample points in the second toss)

Determine the probability of the joint event "even-numbered outcome on the first toss (A)" and "even-numbered outcome on the second toss (B)", P(A, B).

Random Variables
- Sample space S
- For each element s ∈ S, X(s) is a Random Variable

For example:

Probability Mass Function (PMF)

$$p_X(x) = P(X = x) = \sum_{i=1}^{M} P(X = x_i)\,\delta(x - x_i) = \sum_{i=1}^{M} p(x_i)\,\delta(x - x_i),$$

where

$$\delta(x - x_i) = \begin{cases} 1, & x = x_i \\ 0, & \text{otherwise.} \end{cases}$$

For example, for a fair die, p_X(x) = 1/6 at x = 1, 2, ..., 6 (a uniform PMF).
Definition:
The Mean of the random variable X: $E(X) = \sum_{i=1}^{M} x_i\, p(x_i)$.

Example 0.5: S = {1, 2, 3, 4, 5, 6}, X(s) = s; find E(X).

Useful Distributions
Let X be a discrete random variable that has two possible values, say X = 1 or X = 0, with probabilities p and 1 - p, respectively. This is the Bernoulli distribution, and the PMF can be represented as in the figure: mass 1 - p at x = 0 and mass p at x = 1. The mean of such a random variable is E(X) = p.

The performance of a fixed number of trials with a fixed probability of success on each trial is known as a Bernoulli trial.
Let $X_i$, $i = 1, \ldots, n$, be statistically independent and identically distributed random variables with a Bernoulli distribution, and let us define a new random variable $Y = \sum_{i=1}^{n} X_i$. This random variable takes values from 0 to n. The associated probabilities can be expressed as

$$P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k},$$

where $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ is the binomial coefficient. This is the probability of having k successes in n Bernoulli trials.

The probability mass function (PMF) can be expressed as

$$p_Y(y) = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k}\,\delta(y - k).$$

This represents the binomial distribution (see the www.mathworld.com website).

The mean of a random variable with a binomial distribution is E(Y) = np.
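A quick numerical check of these formulas (a Python sketch added for illustration; the values n = 10 and p = 0.3 are arbitrary):

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(Y = k) for Y the sum of n independent Bernoulli(p) random variables."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The PMF sums to 1 and the mean equals n*p, as stated above.
print(sum(pmf))                                   # ~1.0
print(sum(k * pk for k, pk in enumerate(pmf)))    # ~3.0 = n*p
```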

Definitions:
1. The Mean of a function of the random variable X, g(X), is defined as $E[g(X)] = \sum_{i} g(x_i)\, p(x_i)$.

2. The Variance of the random variable X is defined as $\operatorname{Var}(X) = E\big[(X - E(X))^2\big] = E(X^2) - (E(X))^2$.

Example: calculate the variance of the random variable defined in Example 0.5, whose mean is 21/6.

3. The Variance of a function of the random variable X, g(X), is defined as $\operatorname{Var}(g(X)) = E\big[(g(X) - E[g(X)])^2\big]$.
Ch 1 Discrete Source and Entropy

§1.1 Discrete Sources and Entropy


1.1.1 Source Alphabets and Entropy

Overview
The Information Theory is based on the Probability Theory, as the term information
carries with it a connotation of UNPREDICTABILITY (SURPRISE) in the transmitted
signal.
The Information Source is defined by :
- The set of output symbols
- The probability rules which govern the emission of these symbols.
Finite-Discrete Source: finite number of unique symbols.
The symbol set is called the Source Alphabet.

Definition
A is a source alphabet with M possible symbols, A = {a_1, a_2, ..., a_M}. We can say that the emitted symbol is a random variable, which takes values in A. The number of elements in a set is called its Cardinality, e.g., |A| = M.

The source output symbols can be denoted as s_0, s_1, ..., s_t, ..., where s_t ∈ A

is the symbol emitted by the source at time t. Note that here t is an integer time index.
Stationary Source: the set of probabilities is not a function of time. It means that, at any given time moment, the probability that the source emits a_m is p_m = Pr(a_m).
Probability mass function: P_A = {p_1, p_2, ..., p_M}.
Since the source emits only members of its alphabet, $\sum_{m=1}^{M} p_m = 1$.
Information Sources Classification
Stationary Versus Non-Stationary Source:
For a Stationary Source the set of probabilities is not a function of time, whereas for a
Non-stationary Source it is.

Synchronous Source Versus Asynchronous Source:


A Synchronous Source emits a new symbol at a fixed time interval, Ts, whereas for an
Asynchronous Source the interval between emitted symbols is not fixed.
The latter can be approximated as synchronous, by defining a null character when the
source does not emit at time t. We say the source emits a null character at time t.

Representation of the Source Symbols


The symbols emitted by the source must be represented somehow. In digital systems, the
binary representation is used.
Pop Quiz: How many bits are required to represent the symbols 1, 2, 3, 4? Or, in general, a set of n symbols 1, 2, 3, …, n?
Answer: 2 bits for four symbols; in general, ⌈log₂ n⌉ bits.

The symbols represented in this fashion are referred to as Source Data.


Distinction between Data and Information
For example: an information source has an alphabet with only 1 symbol. The representation of this symbol is data, but this data is not information, as it is completely uninformative. Since information carries the connotation of uncertainty, the information content of this source is zero.

Question: how can one measure the information content of a source?

Answer: by the entropy of the source, defined next.

Entropy of a Source

Example: Pick a marble from a bag of 2 blue and 5 red marbles.
- Probability of picking a red marble: p_red = 5/7
- Number of choices for each red pick: 1/p_red = 7/5 = 1.4

Each transmitted Symbol 1 is just one choice out of 1/p_1 many possible choices, and therefore Symbol 1 contains log₂(1/p_1) bits of information (since $1/p_1 = 2^{\log_2(1/p_1)}$). Similarly, Symbol k contains log₂(1/p_k) bits of information.

The average information in bits per symbol for our source is the Entropy; it is calculated by

$$H(A) = \sum_{m} p_m \log_2 \frac{1}{p_m} = -\sum_{m} p_m \log_2 p_m.$$

Shannon gave this precise mathematical definition of the average amount of information conveyed per source symbol, used to measure the information content of a source.

Unit of Measure (entropy): bits/symbol (when the logarithm is taken to base 2).

Range of entropy: 0 ≤ H(A) ≤ log₂ M, where M is the cardinality of the source A; when p_m = 1/M, m = 1, ..., M (i.e., equal probabilities), H(A) takes the maximum.
Example 1.1: What is the entropy of a 4-ary source having symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}?

Example 1.2: If A = {0, 1} with probabilities P_A = {1 - p, p}, where 0 ≤ p ≤ 1, determine the range of H(A).

Example 1.3: For an M-ary source, what distribution of probabilities P(A) maximizes the entropy H(A)?

The Information Efficiency of the source is measured as the ratio of the entropy of the source to the (average) number of binary digits used to represent the source data.

Example 1.4: For a 4-ary source A = {00, 01, 10, 11} with symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}, what is the efficiency of the source?

When the entropy of the source is lower than the (average) number of bits used to represent the source data, an efficient coding scheme can be used to encode the source information using, on average, fewer binary digits. This is called Data Compression, and the encoder used for that is called a Source Encoder.
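As an illustration, the efficiency computation can be sketched in Python (the fixed 2-bit representation is the one given in Example 1.4):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

P_A = [0.5, 0.3, 0.15, 0.05]     # symbol probabilities of Example 1.4
bits_per_symbol = 2              # A = {00, 01, 10, 11}: fixed 2-bit source data

efficiency = entropy(P_A) / bits_per_symbol
print(efficiency)                # ~0.824, so the source data is compressible
```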

1.1.2 Joint and Conditional Entropy

If we have two information sources A and B, and we want to make a compound symbol C with c_ij = (a_i, b_j), find H(C).

i) If A and B are statistically independent: H(C) = H(A, B) = H(A) + H(B).

ii) If B depends on A: H(C) = H(A, B) = H(A) + H(B|A) ≤ H(A) + H(B).

Example 1.5: We often use a parity bit for error detection. For a 4-ary information source A = {0, 1, 2, 3} with P_A = {0.25, 0.25, 0.25, 0.25}, and the parity generator B = {0, 1} with

$$b = \begin{cases} 0, & \text{if } a = 0 \text{ or } 1 \\ 1, & \text{if } a = 2 \text{ or } 3, \end{cases}$$

find H(A), H(B), and H(A, B).

1.1.3 Entropy of Symbol Blocks and the Chain Rule

To find H(A_0, A_1, …, A_{n-1}), where A_t (t = 0, 1, …, n-1) is the symbol at time index t that is drawn from alphabet A.

Example 1.5: Suppose a memoryless source with A = {0, 1} having equal probabilities
emits a sequence of 6 symbols. Following the 6th symbol, suppose a 7th symbol is
transmitted which is the sum modulo 2 of the six previous symbols (this is just the
exclusive-or of the symbols emitted by A). What is the entropy of the 7-symbol
sequence?

Example 1.6: For an information source having alphabet A with |A| symbols, what is the
range of entropies possible?

§1.2 Source Coding


1.2.1 Mapping Functions and Efficiency

For an inefficient information source, i.e. H(A) < log2(|A|), the communication system
can be made more cost effective through source coding.

[Figure: the Source Encoder maps the information source sequence s0, s1, … (st ∈ A, the source alphabet) into the code words s'0, s'1, … (s't ∈ B, the code alphabet).]

In its simplest form, the encoder can be viewed as a mapping of the source alphabet A to
a code alphabet B, i.e., C: A→B. Since the encoded sequence must be decoded at the
receiver end, the mapping function C must be invertible.
Goal of coding: average information bits/symbol ~ average bits we use to represent a
symbol (i.e. code efficiency ~ 1).

Example 1.7: Let A be a 4-ary source with symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}, and let C be an encoder that maps the symbols in A into strings of binary bits, as below:

p_0 = 0.5,  C(a_0) = 0
p_1 = 0.3,  C(a_1) = 10
p_2 = 0.15, C(a_2) = 110
p_3 = 0.05, C(a_3) = 111

Determine the average number of transmitted binary digits per code word and the efficiency of the encoder.

Example 1.8: Let C be an encoder grouping the symbols in A into ordered pairs <ai, aj>; the set of all possible pairs <ai, aj> is called the Cartesian product of set A and is denoted A × A. Thus, the encoder is C: A × A → B, or C(ai, aj) = bm. Now let A be a 4-ary memoryless source with the symbol probabilities given in Example 1.7; determine the average number of transmitted binary digits per code word and the efficiency of the encoder. The code words are shown in the following table.

<ai,aj>   Pr<ai,aj>   bm        <ai,aj>   Pr<ai,aj>   bm
a0,a0     0.25        00        a2,a0     0.075       1101
a0,a1     0.15        100       a2,a1     0.045       0111
a0,a2     0.075       1100      a2,a2     0.0225      111110
a0,a3     0.025       11100     a2,a3     0.0075      1111110
a1,a0     0.15        101       a3,a0     0.025       11101
a1,a1     0.09        010       a3,a1     0.015       111101
a1,a2     0.045       0110      a3,a2     0.0075      11111110
a1,a3     0.015       111100    a3,a3     0.0025      11111111

1.2.2 Mutual Information

If we have source set A and code set B, what is the entropy relationship between them?

[Figures: three cases of the mapping C: A → B — i) a one-to-one mapping (each symbol a maps to a distinct code word b), ii) a many-to-one mapping (a_i and a_j map to the same b), and iii) a one-to-many mapping (a_i maps to b_i or b_j).]

1.2.3 Data Compression

Why Data Compression ?


Whenever space is a concern, you would like to use data compression, for example, when sending text files over a modem or the Internet. If the files are smaller, they reach the destination faster. All media, such as text, audio, graphics, or video, have "redundancy".
Compression attempts to eliminate this redundancy.
Example of Redundancy: If the representation of a media captures content that is not
perceivable by humans, then removing such content will not affect the quality of the
content. For example, capturing audio frequencies outside the human hearing range can
be avoided without any harm to the audio’s quality.

[Figure: Original message A → ENCODER → Compressed message B → DECODER → Decompressed message A'.]

Lossless Compression:
Lossy Compression:

Lossless and lossy compression are terms that describe whether or not, in the
compression of the message, all original data can be recovered when decompression is
performed.
Lossless Compression
- Every single bit of data originally transmitted remains after decompression.
After decompression, all the information is completely restored.
- One can use lossless compression whenever space is a concern, but the
information must be the same.
In other words, when a file is compressed, it takes up less space, but when it is
decompressed, it still has the same information.
- The idea is to get rid of redundancy in the information.
- Standards: ZIP, GZIP, UNIX Compress, GIF
Lossy Compression
- Certain information is permanently eliminated from the original message,
especially redundant information.
- When the message is decompressed, only a part of the original information is still
there (although the user may not notice it).
- Lossy compression is generally used for video and sound, where a certain amount
of information loss will not be detected by most users.
- Standards: JPEG (still), MPEG (audio and video), MP3 (MPEG-1, Layer 3)

Lossless Compression
When we encode characters in computers, we assign each an 8-bit code based on
(extended) ASCII chart. (Extended) ASCII: fixed 8 bits per character
For example: for “hello there!”, a number of 12 characters*8bits=96 bits are needed.

Question: Can one encode this message using fewer bits?


Answer: Yes. In general, in most files, some characters appear more often than others. So, it makes sense to assign shorter codes to characters that appear more often, and longer codes to characters that appear less often. This is exactly what C. Shannon and R. M. Fano were thinking when they created the first compression algorithm in 1950.

Kraft Inequality Theorem
Prefix Code (or Instantaneously Decodable Code): A code that has the property of being self-punctuating. Punctuating means dividing a string of symbols into words. Thus, a prefix code has punctuation built into its structure (rather than added using special punctuation symbols). It is designed in such a way that no code word is a prefix of any other (longer) code word. It is also a data compression code.
To construct an instantaneously decodable code of minimum average length (for a source A, or a given random variable a with values drawn from the source alphabet), the code must satisfy the Kraft Inequality:
For an instantaneously decodable code B for a source A, the code lengths {l_i} must satisfy the inequality

$$\sum_i 2^{-l_i} \leq 1.$$

Conversely, if the code word lengths satisfy this inequality, then there exists an instantaneously decodable code with these word lengths.
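The Kraft inequality is easy to check numerically; a minimal Python sketch (added for illustration, with radix r = 2 for binary codes):

```python
def satisfies_kraft(lengths, r=2):
    """Check the Kraft inequality sum_i r**(-l_i) <= 1 for code-word lengths l_i."""
    return sum(r ** -l for l in lengths) <= 1

print(satisfies_kraft([1, 2, 3, 3]))   # True  -> a prefix code with these lengths exists
print(satisfies_kraft([1, 1, 2]))      # False -> no instantaneously decodable code
```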

Shannon-Fano Theorem
The KRAFT INEQUALITY tells us when an instantaneously decodable code exists. But we are interested in finding the optimal code, i.e., the one that maximizes the efficiency, or minimizes the average code length, L̄. The average code length L̄ of the code B for the source A (with a as a random variable with values drawn from the source alphabet with probabilities {p_i}) is minimized if the code lengths {l_i} are given by

$$l_i = -\log_2 p_i = \log_2 \frac{1}{p_i}.$$

This quantity is called the Shannon Information (pointwise).


Example 1.9: Consider the following random variable a, with the optimal code lengths given by the Shannon information. Calculate the average code length.

a                  a0    a1    a2    a3
pi, i = 0,...,3    1/2   1/4   1/8   1/8
li, i = 0,...,3    1     2     3     3

The average code length of the optimal code is L̄ = (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/8)(3) = 1.75 bits.

Note that this is the same as the entropy of A, H(A).
Lower Bound on the Average Length
The observation about the relation between the entropy and the expected length of the optimal code can be generalized. Let B be an instantaneous code for the source A. Then, the average code length is bounded by L̄ ≥ H(A).

Upper Bound on the Average Length

Let B be a code with optimal code lengths, i.e., $l_i = \lceil -\log_2 p_i \rceil$. Then, the average length is bounded by

$$H(A) \leq \bar{L} < H(A) + 1.$$

Why is the upper bound H(A)+1 and not H(A)? Because sometimes the Shannon
information gives us fractional lengths, and we have to round them up.
Example 1.10: Consider the following random variable a, with the optimal code lengths given by the Shannon information theorem. Determine the average code length bounds.

a                  a0     a1     a2     a3     a4
pi, i = 0,...,4    0.25   0.25   0.20   0.15   0.15
li, i = 0,...,4    2.0    2.0    2.3    2.7    2.7

The entropy of the source A is H(A) ≈ 2.2855 bits.

The source coding theorem tells us that 2.2855 ≤ L̄ < 3.2855,

where L̄ is the average code length of the optimal code.


Example 1.11: For the source in Ex. 1.10, the following code matches the optimal code lengths as closely as possible; find the average code length.

a                  a0    a1    a2    a3     a4
b                  00    10    11    010    011
li, i = 0,...,4    2     2     2     3      3

The average code length for this code is L̄ = 0.25(2) + 0.25(2) + 0.20(2) + 0.15(3) + 0.15(3) = 2.3 bits.
This is very close to the optimal code length of H(A)=2.2855.

Summary
i) The motivation for data compression is to reduce the space allocated for data (an increase of source efficiency). It is obtained by reducing the redundancy that exists in the data.
ii) Compression can be lossless or lossy. In the former case, all information is completely restored after decompression, whereas in the latter case it is not (used in applications in which the information loss will not be detected by most users).
iii) The optimal code, which ensures a maximum efficiency for the source, is characterized by code-word lengths given by the Shannon information, $-\log_2 p_i$.
iv) According to the source coding theorem, the average length of the optimal code is bounded by the entropy as $H(A) \leq \bar{L} < H(A) + 1$.
v) The coding schemes for data compression include Huffman, Lempel-Ziv, and Arithmetic coding.

§1.3 Huffman Coding


Remarks
Huffman coding is used in data communications, speech coding, video compression.
Each symbol is assigned a variable-length code that depends on its frequency (probability
of occurrence). The higher the frequency, the shorter the code word. It is a variable-
length code. The number of bits for each code word is an integer (requires an integer
number of coded bits to represent an integer number of source symbols). It is a Prefix
Code (instantaneously decodable).

Encoder – Tree Building Algorithm


Huffman code words are generated by building a Huffman tree:
Step 1 : List the source symbols in a column in descending order of probabilities.

Step 2: Begin by combining the two symbols with the lowest probabilities. The combining of the two symbols forms a new compound symbol, or a branch in the tree. This step is repeated using the two lowest-probability symbols from the new set of symbols, and continues until all the original symbols have been combined into a single compound symbol.
Step 3: A tree is formed, with the top and bottom stems going from the compound symbol to the symbols which form it, labeled with 0 and 1, respectively (or the other way around). Code words are assigned by reading the labels of the tree stems from right to left, back to the original symbol.
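The tree-building steps above can be sketched in Python with a priority queue (an illustrative implementation, not the one used in the notes; tie-breaking choices may produce different, equally valid code words):

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for {symbol: probability}; returns {symbol: bit string}."""
    # Each heap item: (probability, tie-breaker, {symbol: partial code word}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-symbol alphabet
        return {sym: "0" for sym in heap[0][2]}
    count = len(heap)
    while len(heap) > 1:
        # Merge the two least-probable entries, prepending 0 to one branch, 1 to the other.
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_code({"a0": 0.50, "a1": 0.30, "a2": 0.15, "a3": 0.05})
print(codes)   # a valid prefix code; with these probabilities the lengths are 1, 2, 3, 3
```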

Example 1.12: Let the alphabet of the source A be {a0, a1, a2, a3}, and the probabilities
of emitting these symbols be {0.50 0.30 0.15 0.05}. Draw the Huffman tree and find the
Huffman codes.

[Worked example: starting from the probabilities sorted in descending order — 0.50 (a0), 0.30 (a1), 0.15 (a2), 0.05 (a3) — the Huffman tree is built step by step (Steps 1–3) and the code word for each symbol a0–a3 is read off the tree.]

Hardware implementation of encoding and decoding.

How are the Probabilities Known?
- Counting symbols in the input string: the data must be given in advance; this requires an extra pass over the input string.
- The data source's distribution is known: the data is not necessarily known in advance, but we know its distribution. Reasonable care must be taken in estimating the probabilities, since large errors lead to a serious loss in optimality. For example, a Huffman code designed for English text can have a serious loss in optimality when used for French.
More Remarks
For Huffman coding, the alphabet and its distribution must be known in advance. It
achieves entropy when occurrence probabilities are negative powers of 2 (optimal code).
The Huffman code is not unique (because of some arbitrary decisions in the tree construction). Given the Huffman tree, it is easy (and fast) to encode and decode. In general, the efficiency of Huffman coding relies on having a source alphabet A with a fairly large number of symbols. Compound symbols are obtained based on the original symbols (see, e.g., A × A). For a compound symbol formed with n symbols, the alphabet is A^n, and the set of probabilities of the compound symbols is denoted by P_{A^n}.
Question: How does one get P_{A^n}?
Answer: Easy for a memoryless source. Difficult for a source with memory!

§1.4 Lempel-Ziv (LZ) Coding

Remarks
LZ coding does not require the knowledge of the symbol probabilities beforehand. It is a
particular class of dictionary codes. They are compression codes that dynamically
construct their own coding and decoding tables by looking at the data stream itself.
In simple Huffman coding, the dependency between the symbols is ignored, while in LZ,
these dependencies are identified and exploited to perform better encoding. When all the
data is known (alphabet, probabilities, no dependencies), it’s best to use Huffman (LZ
will try to find dependencies which are not there…)

This is the compression algorithm used in most PCs. Because extra information is supplied to the receiver, these codes initially "expand". The secret is that most of the code words represent strings of source symbols. In a long message, it is more economical to encode these strings (which can be of variable length) than it is to encode individual symbols.

Definitions related to the Structure of the Dictionary
Each entry in the dictionary has an address, m. Each entry is an ordered pair, <n, ai >.
The former ( n ) is a pointer to another location in the dictionary, it is also the
transmitted code word. a_i is a symbol drawn from the source alphabet A. A fixed-length binary word of b bits is used to represent the transmitted code word. The number of entries will be less than or equal to 2^b. The total number of entries will exceed the number of symbols, M, in the source alphabet. Each transmitted code word therefore contains more bits than it would take to represent the alphabet A.

Question: Why do we use LZ coding if the code word has more bits?
Answer: Because most of these code words represent STRINGS of source symbols rather than single symbols.

Encoder
A Linked-List Algorithm (simplified for illustration purposes) is used; it includes:

Step 1: Initialization
The algorithm is initialized by constructing the first M +1 (null symbol plus M source
symbols) entries in the dictionary, as follows.
Address (m)    Dictionary Entry (n, ai)
0              0, null
1              0, a0
2              0, a1
…              …
m              0, a(m-1)
…              …
M              0, a(M-1)

Note: The 0-address entry in the dictionary is a null symbol. It is used to let the decoder know where the end of the string is. In a way, this entry is a punctuation mark. The pointers n in these first M+1 entries are zero, meaning that they initially point to the null entry at address 0.
The initialization also sets the pointer variable to zero (n = 0) and the address pointer to M+1 (m = M+1). The address pointer points to the next "blank" location in the dictionary.

Iteratively executed:
Step 2: Fetch next source symbol.

Step 3:
If the ordered pair <n, a> is already in the dictionary, then
    n = dictionary address of entry <n, a>
Else
    transmit n
    create new dictionary entry <n, a> at dictionary address m
    m = m + 1
    n = dictionary address of entry <0, a>

Step 4:
Return to Step 2.
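The linked-list encoder above can be sketched in Python as follows (an illustration; the function and variable names are ours). Running it on the stream of Example 1.13 reproduces the transmitted code words and the dictionary built in that example:

```python
def lz_encode(symbols, alphabet):
    """Simplified linked-list LZ encoder following Steps 1-4 above.

    Returns the transmitted code words (dictionary addresses) and the final
    dictionary, a list of (pointer n, symbol) pairs indexed by address m.
    """
    # Step 1: initialization -- address 0 is the null entry, addresses 1..M hold
    # (0, a) for each source symbol a.
    dictionary = [(0, None)] + [(0, a) for a in alphabet]
    index = {entry: m for m, entry in enumerate(dictionary) if m > 0}
    n = 0
    transmitted = []

    for a in symbols:                   # Step 2: fetch the next source symbol
        if (n, a) in index:             # Step 3: the string can be extended
            n = index[(n, a)]
        else:                           # ...otherwise transmit and add a new entry
            transmitted.append(n)
            index[(n, a)] = len(dictionary)
            dictionary.append((n, a))
            n = index[(0, a)]
    transmitted.append(n)               # flush the string left open at the end
    return transmitted, dictionary

seq = [1,1,0,0,0,1,0,1,1,0,0,1,0,1,1,1,0,0,0,1,1,1,1]   # the stream of Example 1.13
codes, _ = lz_encode(seq, alphabet=[0, 1])
print(codes)   # [2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6, 11]; the last 11 is the flushed string
```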
Example 1.13: A binary information source emits the sequence of symbols 110 001 011
001 011 100 011 11 etc. Construct the encoding dictionary and determine the sequence of
transmitted code symbols.
Initialize:

Source    Present   Present   transmit   Next   Dictionary
symbol    n         m                    n      entry
1         0         3                    2
1         2         3         2          2      2,1
0         2         4         2          1      2,0
0         1         5         1          1      1,0
0 1 6 5
1 5 6 5 2 5,1
0 2 7 4
1 4 7 4 2 4,1
1 2 8 3
0 3 8 3 1 3,0
0 1 9 5
1 5 9 6
0 6 9 6 1 6,0
1 1 10 1 2 1,1
1 2 11 3
1 3 11 3 2 3,1
0 2 12 4
0 4 12 4 1 4,0
0 1 13 5
1 5 13 6
1 6 13 6 2 6,1
1 2 14 3
1 3 14 11

Thus, the encoder's dictionary is:


Dictionary address Dictionary entry
0 0, null
1 0, 0
2 0, 1
3 2, 1
4 2, 0
5 1, 0
6 5, 1
7 4, 1

8 3, 0
9 6, 0
10 1, 1
11 3, 1
12 4, 0
13 6, 1
14 No entry yet

Decoder
The decoder at the receiver must also construct an identical dictionary for decoding.
Moreover, reception of any code word means that a new dictionary entry must be
constructed. Pointer n for this new dictionary entry is the same as the received code word.
Source symbol a for this entry is not yet known, since it is the root symbol of the next
string (which has not been transmitted by the encoder).

If the address of the next dictionary entry is m, we see that the decoder can only construct
a partial entry <n, ?>, since it must await the next received code word to find the root
symbol a for this entry. It can, however, fill in the missing symbol a in its previous
dictionary entry, at address m -1. It can also decode the source symbol string associated
with the received code word n.
Example 1.14: Decode the received code words transmitted in Example 1.13.

We know the received code words are 2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6, …


Address (m) n (pointer) ai (symbol) Decoded bits
0
1
2
3
4

5
6
7
8
9
… … … …

§1.5 Arithmetic Coding

Remarks
Assigns one (normally long) code word to the entire input stream. Reads the input stream symbol by symbol, appending more bits to the code word each time. The code word is a number obtained from the symbol probabilities, which therefore need to be known. It encodes symbols using a non-integer number of bits (on average), which results in a very good encoder efficiency (it allows the entropy lower bound to be approached). It is often used for data compression in image processing.

Encoder
Construct a code interval (rather than a code number), which uniquely describes a block
of successive source symbols. Any convenient b within this range is a suitable code word,
representing the entire block of symbols.

Algorithm:

For each a_i ∈ A, assign an interval I_i = [Sl_i, Sh_i).
Initialize: j = 0, L_j = 0, H_j = 1.

REPEAT
    Δ = H_j − L_j
    Read the next a_i and use its I_i = [Sl_i, Sh_i) to update
    L_{j+1} = L_j + Δ · Sl_i
    H_{j+1} = L_j + Δ · Sh_i
    j = j + 1
UNTIL all a_i have been encoded.
Select a number b that falls in the final interval as the code word.
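A minimal Python sketch of this encoding loop (added for illustration), run on the source and sequence of Example 1.15, which follows:

```python
def arithmetic_encode(symbols, intervals):
    """Return the final interval [L, H) produced by the encoding loop above.

    `intervals` maps each symbol to its assigned sub-interval [Sl, Sh) of [0, 1).
    Any convenient number b inside the returned interval is a valid code word.
    """
    L, H = 0.0, 1.0
    for a in symbols:
        delta = H - L
        Sl, Sh = intervals[a]
        L, H = L + delta * Sl, L + delta * Sh   # both updates use the old L
    return L, H

# The source and sequence of Example 1.15.
I = {"a0": (0.0, 0.5), "a1": (0.5, 0.8), "a2": (0.8, 0.95), "a3": (0.95, 1.0)}
L, H = arithmetic_encode(["a1", "a0", "a0", "a3", "a2"], I)
print(L, H)   # ~[0.57425, 0.5748125); b = 0.57470703125 lies in this interval
```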
Example 1.15: For a 4-ary source A = {a0, a1, a2, a3} with P_A = {0.5, 0.3, 0.15, 0.05}, assign each a_i ∈ A a fraction of the real number interval I_i as

a0: I0 = [0, 0.5);  a1: I1 = [0.5, 0.8);  a2: I2 = [0.8, 0.95);  a3: I3 = [0.95, 1).

Encode the sequence a1 a0 a0 a3 a2 with arithmetic coding.

j   ai   Lj        Hj      ∆         Lj+1      Hj+1
0   a1   0         1       1         0.5       0.8
1   a0   0.5       0.8     0.3       0.5       0.65
2   a0   0.5       0.65    0.15      0.5       0.575
3   a3   0.5       0.575   0.075     0.57125   0.575
4   a2   0.57125   0.575   0.00375   0.57425   0.5748125

Decoder
In order to decode the message, the symbol order and probabilities must be passed to the
decoder. The decoding process is identical to the encoding. Given the code word (the
final number), at each iteration the corresponding sub-range is entered, decoding the
symbols representing the specific range.
Given b, the decoding procedure is:

L = 0, H = 1, Δ = H − L
Repeat
    Find i such that (b − L)/Δ ∈ I_i
    Output symbol a_i and use its I_i = [Sl_i, Sh_i) to update
        H = L + Δ · Sh_i
        L = L + Δ · Sl_i
        Δ = H − L
Until the last symbol is decoded.
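A matching decoder sketch (added for illustration); for simplicity it is told how many symbols to decode, whereas the procedure above stops when the last symbol is decoded:

```python
def arithmetic_decode(b, intervals, n_symbols):
    """Recover n_symbols symbols from the code word b using the loop above."""
    L, H = 0.0, 1.0
    decoded = []
    for _ in range(n_symbols):
        delta = H - L
        value = (b - L) / delta
        # Find the symbol whose interval [Sl, Sh) contains (b - L) / delta.
        for a, (Sl, Sh) in intervals.items():
            if Sl <= value < Sh:
                decoded.append(a)
                L, H = L + delta * Sl, L + delta * Sh
                break
    return decoded

I = {"a0": (0.0, 0.5), "a1": (0.5, 0.8), "a2": (0.8, 0.95), "a3": (0.95, 1.0)}
print(arithmetic_decode(0.57470703125, I, 5))   # ['a1', 'a0', 'a0', 'a3', 'a2']
```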

Example 1.16: For the source and encoder in Example 1.15, decode b  0.57470703125 .

L         H       ∆         Ii   Next H      Next L    Next ∆      ai
0         1       1         I1   0.8         0.5       0.3         a1
0.5       0.8     0.3       I0   0.65        0.5       0.15        a0
0.5       0.65    0.15      I0   0.575       0.5       0.075       a0
0.5       0.575   0.075     I3   0.575       0.57125   0.00375     a3
0.57125   0.575   0.00375   I2   0.5748125   0.57425   0.0005625   a2

The decoded sequence is a1 a0 a0 a3 a2, as expected.

Practical Issues
Attention must be paid to the precision with which we calculate (b − L)/Δ: round-off error in this calculation can lead to an erroneous answer. Numerical overflow can also occur (see the products Δ·Sl_i and Δ·Sh_i). The limited precision of Sl_i and Sh_i limits the size of the alphabet A. In practice it is important to transmit and decode the information "on the fly"; here, however, we must read in the entire block of source symbols before being able to compute the code word, and we must also receive the entire code word b before we can begin decoding.

                    Huffman                  Arithmetic            Lempel-Ziv

Probabilities       Known in advance         Known in advance      Not known in advance

Alphabet            Known in advance         Known in advance      Not known in advance

Data Loss           None                     None                  None

Symbol Dependency   Not used                 Not used              Used for better compression

Entropy             Achieved if the          Very close            Best results for long
                    probabilities are                              messages
                    negative powers of 2

Code words          One code word for        One code word for     Code words for strings
                    each symbol              all data              of source symbols

Intuition           Intuitive                Not intuitive         Not intuitive

Ch 2 Channel and Channel Capacity

§2.1 Discrete Memoryless Channel Model

Communication Link

[Figure: the Composite Discrete-Input Discrete-Output Channel consists of the Channel Encoder, Modulator, Continuous-Input Continuous-Output Channel, Demodulator, and Channel Decoder, placed between the Source Encoder (fed by the Information Source) and the Source Decoder. The channel input sequence c0, c1, ..., ct is drawn from alphabet C with probabilities PC; the output sequence y0, y1, ..., yt is drawn from alphabet Y with probabilities PY.]

Definition
In most communication or storage systems, the signal is designed such that the output
symbols, y0,y1,...,yt , are statistically independent if the input symbols, c0,c1,...,ct , are
statistically independent. If the output set Y consists of discrete output symbols, and if the
property of statistical independence of the output sequence holds, the channel is called a
Discrete Memoryless Channel (DMC).

Transition Probability Matrix


Mathematically, we can view the channel as a probabilistic function that transforms a sequence of (usually coded) input symbols, c, into a sequence of channel output symbols, y. Because of noise and other impairments of the communication system, the transformation is not a one-to-one mapping from the set of input symbols, C, to the set of output symbols, Y. Any particular c from C may have some probability, p_{y|c}, of being transformed to an output symbol y from Y; this probability is called a (Forward) Transition Probability.
For a DMC, let p_c be the probability that symbol c is transmitted. The probability that the received symbol is y is given in terms of the transition probabilities as

$$q_y = \sum_{c \in C} p_{y|c}\, p_c.$$

The probability distribution of the output set Y, denoted by Q_Y, may be easily calculated in matrix form as

$$Q_Y = \begin{bmatrix} q_0 \\ q_1 \\ \vdots \\ q_{M_Y-1} \end{bmatrix} = \begin{bmatrix} p_{y_0|c_0} & p_{y_0|c_1} & \cdots & p_{y_0|c_{M_C-1}} \\ p_{y_1|c_0} & p_{y_1|c_1} & \cdots & p_{y_1|c_{M_C-1}} \\ \vdots & \vdots & & \vdots \\ p_{y_{M_Y-1}|c_0} & p_{y_{M_Y-1}|c_1} & \cdots & p_{y_{M_Y-1}|c_{M_C-1}} \end{bmatrix} \begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_{M_C-1} \end{bmatrix},$$

or, more compactly, $Q_Y = P_{Y|C}\, P_C$. Here,
PC : Probability distribution of the input alphabet

QY : Probability distribution of the output alphabet


P_{Y|C}: the matrix of (forward) transition probabilities p_{y|c} (the channel matrix).

Remarks: The columns of PY|C sum to unity (no matter what symbol is sent, some
output symbol must result). Numerical values for the transition probability matrix are
determined by analysis of the noise and transmission impairment properties of the
channel, and the method of modulation/demodulation.
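As an illustration, the matrix form Q_Y = P_{Y|C} P_C is a single matrix-vector product; the sketch below uses the transition matrix of Example 2.1 (which follows) with equally probable inputs:

```python
import numpy as np

# Output distribution Q_Y = P_{Y|C} P_C for the soft-decision channel of Example 2.1,
# assuming equally probable input symbols.
P_YgivenC = np.array([[0.80, 0.05],
                      [0.15, 0.15],
                      [0.05, 0.80]])
P_C = np.array([0.5, 0.5])

Q_Y = P_YgivenC @ P_C
print(Q_Y)                      # [0.425 0.15  0.425]
print(P_YgivenC.sum(axis=0))    # each column sums to 1, as noted in the remarks
```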
Hard Decision Decoding : MY = MC. Hard refers to the decision that the demodulator
makes; it is a firm decision on what symbol was transmitted.
Soft Decision Decoding : MY > MC. The final decision is left to the receiver decoder.

Example 2.1: C={0,1} , with equally probable symbols; Y={y0, y1, y2}. The transition
probability matrix of the channel is
0.80 0.05 
PY |C  0.15 0.15  . QY=?

0.05 0.80 

Remarks: The sum of the elements on each column of the transition probability matrix is
1. This is an example of soft-decision decoding.
Example 2.1 (cont’d): Calculate the entropy of Y for the previous system. Compare this
with the entropy of source C.

(how can this happen?)

Remarks: We noticed the same thing when we discussed the source encoder
(encryption encoder). It is possible for the output entropy to be greater than the input
entropy, but the “additional” information carried in the output is not related to the
information from the source. The “extra” information in the output comes from the
presence of noise in the channel during transmission, and not from the source C.
This “extra” information carried in Y is truly “useless”. In fact, it is harmful because it
produces uncertainty about what symbols were transmitted.
Question: Can we solve this problem by using only systems which employ hard-decision
decoding?

Answer:
Example 2.2: C={0,1} , with equally probable symbols; Y={0,1}. The transition
probability matrix of the channel is
0.98 0.05
PYC|   .
0.02 0.95
Calculate the entropy of Y. Compare this with the entropy of source C.

Remarks: Y carries less information than was transmitted by the source.


Question: Where did it go ?
Answer: It was lost during the transmission process. The channel is information lossy !
So far, we have looked at two examples, in which the output entropy was either greater or
less than the input entropy. What we have not considered yet is what effect all this has on
the ability to “tell from observing Y what original information was transmitted.”
Do not forget that the purpose of the receiver is to recover the original transmitted
information !
What does the observation of Y tell us about the transmitted information sequence?
As we know, Mutual information is a measure of how much the uncertainty of
generating a random variable c is reduced by observing a random variable y !

If Y tells us nothing about C (e.g., Y and C are independent, such as when somebody has cut the phone wire and no signal is getting through), then I(C; Y) = 0.

But if H(C|Y) = 0, then I(C; Y) = H(C):
Looking at Y there is no uncertainty on C. i.e., Y contains sufficient information to tell
what the transmitted sequence is. The conditional entropy is a measure of how much
information loss occurs in the channel !
Example 2.3: Calculate the mutual information for the system of Example 2.1.

Remark: The mutual information for this system is well below the entropy ( H(C)=1 )
of the source and so, this channel has a high level of information loss.
Example 2.4: Calculate the mutual information for the system of Example 2.2.

Remarks: This channel is quite lossy also. Although H(Y) was almost equal to H(C) in
Example 2.2, the mutual information is considerably less than H(C) . One cannot tell
how much information loss we are dealing with simply by comparing the input and
output entropies !

§2.2 Channel Capacity and Binary Symmetric Channel


Maximization of Mutual Information and Channel Capacity
Each time the transmitter sends a symbol, it is said to use the channel. The Channel
Capacity is the maximum average amount of information that can be sent per channel
use.

Question: Why is it not the same as the mutual information?
Answer: Because, for a fixed transition probability matrix, a change in the probability distribution of C, P_C, results in a different mutual information, I(C;Y). The maximum mutual information achieved for a given transition probability matrix is the Channel Capacity,

$$C_C = \max_{P_C} I(C;Y),$$

with units of bits per channel use.

An analytical closed-form solution to find CC is difficult to achieve for an arbitrary


channel. An efficient numerical algorithm for finding CC was derived in 1972, by Blahut
and Arimoto (see textbook).
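A minimal sketch of the Blahut-Arimoto iteration (added for illustration; it is not the textbook's implementation). It alternates between the posterior q(c|y) induced by the current input distribution and a re-weighted input distribution, and converges to the capacity-achieving P_C:

```python
import numpy as np

def blahut_arimoto(P_YgivenC, tol=1e-9, max_iter=10_000):
    """Channel capacity (bits/use) and the optimizing input distribution.

    P_YgivenC has one column per input symbol and one row per output symbol
    (each column sums to 1), as in the examples of these notes.
    """
    W = P_YgivenC.T                          # W[x, y] = p(y | x)
    n_in = W.shape[0]
    p = np.full(n_in, 1.0 / n_in)            # start from a uniform input distribution
    for _ in range(max_iter):
        q = W * p[:, None]
        q /= q.sum(axis=0, keepdims=True)    # posterior q(x | y)
        with np.errstate(divide="ignore", invalid="ignore"):
            logq = np.where(W > 0, np.log2(np.where(q > 0, q, 1.0)), 0.0)
        r = 2.0 ** np.sum(W * logq, axis=1)  # unnormalized new input weights
        p_new = r / r.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    # Mutual information achieved by the final input distribution.
    joint = W * p[:, None]
    q_y = joint.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(joint > 0, joint / (p[:, None] * q_y[None, :]), 1.0)
    C = np.sum(joint * np.log2(ratio))
    return C, p

P = np.array([[0.98, 0.05],
              [0.02, 0.95]])                 # the channel of Example 2.5 a)
C, p_opt = blahut_arimoto(P)
print(C, p_opt)                              # ~0.7859 bits/use, P_C ~ [0.513, 0.487]
```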
Example 2.5: For the following transition probability matrix, find the channel capacity,
the input and output probability distributions that achieve the channel capacity, and
mutual information given a uniform Pc.

a)
$$P_{Y|C} = \begin{bmatrix} 0.98 & 0.05 \\ 0.02 & 0.95 \end{bmatrix}, \quad C_C = 0.78585, \quad P_C = \begin{bmatrix} 0.51289 \\ 0.48711 \end{bmatrix}, \quad Q_Y = \begin{bmatrix} 0.52698 \\ 0.47302 \end{bmatrix}$$
b)
$$P_{Y|C} = \begin{bmatrix} 0.80 & 0.05 \\ 0.20 & 0.95 \end{bmatrix}, \quad C_C = 0.48130, \quad P_C = \begin{bmatrix} 0.46761 \\ 0.53239 \end{bmatrix}, \quad Q_Y = \begin{bmatrix} 0.4007 \\ 0.5993 \end{bmatrix}$$
c)
$$P_{Y|C} = \begin{bmatrix} 0.80 & 0.10 \\ 0.20 & 0.90 \end{bmatrix}, \quad C_C = 0.39775, \quad P_C = \begin{bmatrix} 0.4824 \\ 0.5176 \end{bmatrix}, \quad Q_Y = \begin{bmatrix} 0.4377 \\ 0.5623 \end{bmatrix}$$
d)
$$P_{Y|C} = \begin{bmatrix} 0.80 & 0.30 \\ 0.20 & 0.70 \end{bmatrix}, \quad C_C = 0.191238, \quad P_C = \begin{bmatrix} 0.510 \\ 0.490 \end{bmatrix}, \quad Q_Y = \begin{bmatrix} 0.555 \\ 0.445 \end{bmatrix}$$
e)
$$P_{Y|C} = \begin{bmatrix} 0.80 & 0.05 \\ 0.15 & 0.15 \\ 0.05 & 0.80 \end{bmatrix}, \quad C_C = 0.57566, \quad P_C = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}, \quad Q_Y = \begin{bmatrix} 0.425 \\ 0.150 \\ 0.425 \end{bmatrix}$$
Remarks: The channel capacity proves to be a sensitive function of the transition
probability matrix, PY|C , but a fairly weak function of PC. The last case is interesting, as
the uniform input distribution produces the maximum mutual information.
This is an example of Symmetric Channel. Note that the columns of symmetric
channel’s transition probability matrix are permutations of each other. Likewise, the top
and bottom rows are permutations of each other. The center row, which is not a
permutation of the other rows, corresponds to the output symbol y1, which, as we noticed
in Example 2.3, makes no contribution to the mutual information.

Symmetric Channels
Symmetric channels play an important role in communication systems and many such
systems attempt, by design, to achieve a symmetric channel function. The reason for the
importance of the symmetric channel is that when such a channel is possible, it
frequently has greater channel capacity than a non-symmetric channel would have.
Example 2.6:
0.79 0.05  0.4207 
 0.50095 
PY |C  0.16 0.15  , CC  0.571215, PC    , QY  0.1550 
0.05 0.80  0.49905 0.4243 

The transition probability matrix is slightly changed compared to Example 2.5e), and the
channel capacity decreases.
Example 2.7:

0.950 0.024 0.024 0.002   0.25


0.024 
0.950 0.002 0.024   0.25
PY |C  , CC  1.653488, PC   
0.024 0.002 0.950 0.024   0.25
   
0.002 0.024 0.024 0.950   0.25

This is an example of using quadrature phase-shift keying (QPSK), which is a modulation


method that produces a symmetric channel. For QPSK, MC=MY=4.

Remarks:
i) The capacity for this channel is achieved when PC is uniformly distributed. This is
always the case for a symmetric channel.
ii) The columns of the transition probability matrix are permutations of each other, and so
are the rows.
iii) When the transition probability matrix is a square matrix, this permutation property of the columns and rows is a sufficient condition for a uniformly distributed input alphabet to achieve the maximum mutual information. Indeed, the permutation condition is what gives rise to the term "symmetric channel."

Binary Symmetric Channel (BSC)


A symmetric channel of considerable importance, both theoretically and practically, is the binary symmetric channel (BSC), for which

$$P_{Y|C} = \begin{bmatrix} 1-p & p \\ p & 1-p \end{bmatrix}.$$

The parameter p is known as the Crossover Probability, and it is the probability that the demodulator/detector makes a hard-decision decoding error. The BSC is the model for essentially all binary-pulse transmission systems of practical importance.
Channel Capacity: for a uniform input probability distribution,

$$C_C = 1 + p \log_2 p + (1-p)\log_2(1-p),$$

which is often written as

$$C_C = 1 - H(p),$$

where the notation H(p) arises from the terms involving p.

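The BSC capacity as a function of the crossover probability p can be sketched as follows (illustration only):

```python
import math

def bsc_capacity(p: float) -> float:
    """C = 1 - H(p) bits per channel use for a BSC with crossover probability p."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, bsc_capacity(p))   # capacity is 1 at p = 0 and p = 1, and 0 at p = 0.5
```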

Remarks:
 The capacity is bounded by the range 0 ≤ C_C ≤ 1.
 The upper bound is achieved only if p = 0 or p = 1.
 The case p = 0 is not surprising, as it corresponds to a channel which does not
make errors (known as “noiseless” channel).

 The case p = 1 corresponds to a channel which always makes errors. If we know
that the channel output is always wrong, we can easily set things right by
decoding the opposite of what the channel output is.
 The case p = 0.5 corresponds to a channel for which the output symbol is as
likely to be correct as it is to be incorrect. Under this condition, the information
loss in the channel is total, and the channel capacity is zero. The capacity of the
BSC is a concave-upward function, possessing a single minimum at p = 0.5.
 Except for p = 0 and p = 1 cases, the capacity of the BSC is always less than the
source entropy. If we try to transmit information through the channel using the
maximum amount of information per symbol, some of this info will be lost, and
decoding errors at the receiver will result. However, if we add sufficient
redundancy to the transmitted data stream, it is possible to reduce the
probability of lost information to an arbitrarily low level.

§2.3 Block Coding and Shannon’s 2nd Theorem


Equivocation
We have seen that there is a maximum amount of information per channel use that can be supported by the channel. Any attempt to exceed this channel capacity will result in information being lost during transmission. That is, I(C;Y) < H(C), and so

$$H(C|Y) = H(C) - I(C;Y) > 0.$$
The conditional entropy H(C|Y) corresponds to our uncertainty about what the input of
the channel was, given our observation of the channel output. It is a measure of the
information loss during the transmission. For this reason, this conditional entropy is
often called the Equivocation. The equivocation has the property that 0 ≤ H(C|Y) ≤ H(C), and it is given by

$$H(C|Y) = \sum_{y \in Y} q_y \sum_{c \in C} p_{c|y} \log_2 \frac{1}{p_{c|y}}.$$
The equivocation is zero if and only if the transition probabilities py|c are either zero or
one for all pairs (yY, cC).

Entropy Rate
The entropy of a block of n symbols satisfies the inequality

$$H(C_0, C_1, \ldots, C_{n-1}) \leq n\, H(C),$$
with equality if and only if C is a memoryless source. In transmitting a block of n


symbols, we use the channel n times. Recall that channel capacity has units of bits per
channel use, and refers to an average amount of information per channel use.

Since H(C_0, C_1, ..., C_{n-1}) is the average information contained in the n-symbol block, it follows that the average information per channel use would be H(C_0, C_1, ..., C_{n-1}) / n.

However, the average bits per channel use is achieved in the limit, as n goes to infinity, such that

$$R = \lim_{n \to \infty} \frac{H(C_0, C_1, \ldots, C_{n-1})}{n} \leq H(C),$$

where R is called the Entropy Rate.
R ≤ H(C), with equality if and only if all symbols are statistically independent. Suppose that they are not, and that in the transmission of the block we deliberately introduce redundant symbols. Then R < H(C). Taking this further, suppose that we introduce a sufficient number of redundant symbols in the block so that R < C_C.

Question: Is the transmission without information loss (i.e. zero equivocation) possible in
such case?
Answer: Remarkably enough, the answer to this question is “YES”!
What is the implication of doing so ?
It is possible to send information through the channel with arbitrarily low probability of
error.
The process of adding redundancy to a block of transmitted symbols is called Channel
Coding.

Question: Does there exist a channel code that will accomplish this purpose?
Answer: The answer to this question is given by the Shannon’s second theorem.

Shannon’s 2nd Theorem


Suppose R < C_C, where C_C is the capacity of a memoryless channel. Then, for any ε > 0, there exists a block code of length n and rate R whose probability of block decoding error p_e satisfies p_e ≤ ε when the code is used on this channel.
Shannon’s second theorem (also called Shannon’s main theorem) tells us that it is
possible to transmit information over a noisy channel with arbitrarily small probability of
error. The theorem says that if the entropy rate R in a block of n symbols is smaller
than the channel capacity, then we can make the probability of error arbitrarily
small.

What error are we speaking about?


Suppose we send a block of n bits in which k < n of these bits are statistically independent "information" bits and n − k are redundant "parity" bits computed from the k information bits, according to some coding rule. The entropy of the block will then be k bits, and the average information in bits per channel use will be R = k/n.
If this entropy rate is less than the channel capacity, Shannon’s main theorem says we can
make the probability of error in recovering our original k information bits arbitrarily
small. The channel will make errors within our block of n bits, but the redundancy built
into the block will be sufficient to correct these errors and recover the k bits of
information we transmitted.
Shannon’s theorem does not say that we can do this for just any block length n we
might want to choose! The theorem says there exists a block length n for which there is
a code of rate R. The required size of the block length n depends on the upper bound ε we pick for our error probability. Actually, Shannon's theorem implies very strongly that the block length n is going to be very large if R is to approach C_C to within an arbitrarily small distance with an arbitrarily small probability of error.
The complexity and expense of an error-correcting channel code are believed to grow
rapidly as R approaches the channel capacity and the probability of a block decoding error is made arbitrarily small. It is believed by many that beyond a particular rate, called
Cutoff Rate, R0, it is prohibitively expensive to use the channel. In the case of the binary
symmetric channel, this rate is given by

$$R_0 = -\log_2\left(\tfrac{1}{2} + \sqrt{p(1-p)}\right).$$
The belief that R0 is some kind of "sound barrier" for practical error-correcting codes comes from the fact that, for certain kinds of decoding methods, the complexity of the decoder grows extremely rapidly as R exceeds R0.

§2.4 Markov Processes and Sources with Memory


Markov Process
Thus far, we have discussed memoryless sources and channels. We now turn our
attention to sources with memory. By this, we mean information sources, where the
successive symbols in a transmitted sequence are correlated with each other, i.e.,
the sources in a sense “remember” what symbols they have previously emitted, and the
probability of their next symbol depends on this history.
Sources with memory arise in a number of ways. First, natural languages, such as
English, have this property. For example, the letter “q” in English is almost always
followed by the letter “u”. Similarly, the letter “t” is followed by the letter “h”
approximately 37% of the time in English text. Many real-time signals, such as speech
waveform, are also heavily time correlated. Any time correlated signal is a source with
memory. Finally, we sometimes wish to deliberately introduce some correlation
(redundancy) in a source for purposes of block coding, as discussed in the previous
section.
Let A be the alphabet of a discrete source having MA symbols, and suppose this source
emits a time sequence of symbols (s0,s1,…,st,…) with each stA. If the conditional
probability p(st | st-1,…,s0) depends only on j previous symbols, so that
p(st | st-1,…,s0)=p(st | st-1,…,st-j),
then A is called a j-th order Markov process. The string of j symbols (s_{t-1}, ..., s_{t-j}) is called the state of the Markov process at time t. A j-th order Markov process, therefore, has N = M_A^j possible states.

Let us number these possible states from 0 to N − 1 and let π_n(t) represent the probability of being in state n at time t. The probability distribution of the system at time t can then be represented by the vector

$$\Pi_t = [\pi_0(t), \pi_1(t), \ldots, \pi_{N-1}(t)]^T.$$
For each state at time t, there are MA possible next states at time t +1, depending on which
symbol is emitted next by the source.
If we let p_{i|k} be the conditional probability of going to state i given that the present state is k, the state probability distribution at time t + 1 is governed by the transition probability matrix

$$P_{A|A} = \begin{bmatrix} p_{0|0} & p_{0|1} & \cdots & p_{0|N-1} \\ p_{1|0} & p_{1|1} & \cdots & p_{1|N-1} \\ \vdots & \vdots & & \vdots \\ p_{N-1|0} & p_{N-1|1} & \cdots & p_{N-1|N-1} \end{bmatrix}$$

and is given by $\Pi_{t+1} = P_{A|A}\, \Pi_t$.

Example 2.8: Let A be a binary first-order Markov source with A = {0, 1}. This source has 2 states, labeled "0" and "1". Let the transition probabilities be

$$P_{A|A} = \begin{bmatrix} 0.3 & 0.4 \\ 0.7 & 0.6 \end{bmatrix}.$$

What is the equation for the next probability state? Find the state probabilities at time t = 2, given that the probabilities at time t = 0 are π_0 = 1 and π_1 = 0.

The next-state equation for the state probabilities is $\Pi_{t+1} = P_{A|A}\, \Pi_t$.
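A two-line numerical check of Example 2.8 (illustration only):

```python
import numpy as np

# The first-order source of Example 2.8: next-state equation Pi_{t+1} = P_{A|A} Pi_t.
P = np.array([[0.3, 0.4],
              [0.7, 0.6]])
Pi = np.array([1.0, 0.0])        # state probabilities at t = 0

for t in range(2):
    Pi = P @ Pi
print(Pi)                        # t = 2: [0.37, 0.63]
```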
Example 2.9: Let A be a second-order binary Markov source with
Pr(a = 0 | 0,0) = 0.2    Pr(a = 1 | 0,0) = 0.8
Pr(a = 0 | 0,1) = 0.4    Pr(a = 1 | 0,1) = 0.6
Pr(a = 0 | 1,0) = 0.0    Pr(a = 1 | 1,0) = 1.0
Pr(a = 0 | 1,1) = 0.5    Pr(a = 1 | 1,1) = 0.5
If all the states are equally probable at time t = 0, what are the state probabilities at t =1 ?

Define one state for each possible pair of previously emitted symbols. The possible state transitions and their associated transition probabilities can be represented using a state diagram. [Figure: state diagram of the four states with the transition probabilities listed above.]

The next-state probability equation is $\Pi_{t+1} = P_{A|A}\, \Pi_t$.
Remarks: Every column of the transition probability matrix adds to one. Every properly
constructed transition probability matrix has this property.

Steady State Probability and the Entropy Rate
Starting from the equation for the state probabilities, it can be shown by induction that the state probabilities at time t are given by $\Pi_t = (P_{A|A})^t\, \Pi_0$.

A Markov process is said to be Ergodic if we can get from the initial state to any other state in some number of steps and if, for large t, Π_t approaches a steady-state value that is independent of the initial probability distribution, Π_0. The steady-state value Π is reached when

$$\Pi = P_{A|A}\, \Pi.$$
The Markov processes which model information sources are always ergodic.

Example 2.10: Find the steady-state probability distribution for the source in Example
2.9.
In the steady state, the state probabilities become

π_0 = 0.2 π_0 + 0.4 π_1
π_1 = 0.5 π_3
π_2 = 0.8 π_0 + 0.6 π_1
π_3 = π_2 + 0.5 π_3

It appears from this that we have four equations and four unknowns, so solving for the four probabilities is no problem. However, if we look closely, we will see that only three of the equations above are linearly independent. To solve for the probabilities, we can use any three of the above equations and the constraint equation. This equation is a consequence of the fact that the total probability must sum to unity,

π_0 + π_1 + π_2 + π_3 = 1;

it is certain that the system is in some state!

Dropping the first equation above and using the constraint, we have

π_1 = 0.5 π_3
π_2 = 0.8 π_0 + 0.6 π_1
π_3 = π_2 + 0.5 π_3
π_0 + π_1 + π_2 + π_3 = 1

which has the solution π_0 = 1/9, π_1 = π_2 = 2/9, π_3 = 4/9.
This solution is independent of the initial probability distribution. The situation
illustrated in the previous example, where only N - 1 of the equations resulting from the
transition probability expression are linearly independent and we must use the “sum to
unity” equation to obtain the solution, always occurs in the steady-state probability
solution of an ergodic Markov process.
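Both the steady-state distribution and (anticipating the next subsection) the entropy rate can be computed numerically. The sketch below assumes one particular numbering of the four states, chosen to be consistent with the steady-state equations written above:

```python
import numpy as np

# One labeling of the four states of Example 2.9, consistent with the steady-state
# equations above (state k = column k; every column sums to 1).
P = np.array([[0.2, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.5],
              [0.8, 0.6, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.5]])

# Steady state: solve (P - I) Pi = 0 together with sum(Pi) = 1, as in Example 2.10.
A = np.vstack([P - np.eye(4), np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
Pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(Pi)                        # ~[1/9, 2/9, 2/9, 4/9]

# Entropy rate of the ergodic Markov source: R = sum_n Pi_n sum_i p_{i|n} log2(1/p_{i|n}).
col = P.T                        # col[n, i] = p_{i|n}
R = sum(Pi[n] * sum(-p * np.log2(p) for p in col[n] if p > 0) for n in range(4))
print(R)                         # ~0.74 bits/symbol, below H(1/3, 2/3) ~ 0.918
```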

Entropy Rate of an Ergodic Markov Process


POP QUIZ: How do you define the entropy rate?
The entropy rate, R, is the average information per channel use (average info bits per
channel use)
$$R = \lim_{t \to \infty} \frac{H(A_0, A_1, \ldots, A_{t-1})}{t} \leq H(A),$$

with equality if and only if all symbols are statistically independent.
For ergodic Markov sources, as t grows very large, the state probabilities converge
to a steady-state value, n, for each of the N possible states (n=0,...,N-1). As t becomes
large, the average information per symbol in the block of symbols will be determined by
the probabilities of occurrence of the symbols in A, after the state probabilities converge
to their steady-state values.
Suppose we are in state S_n at time t. The conditional entropy of A is

$$H(A \mid S_n) = \sum_{a \in A} \Pr(a \mid S_n) \log_2 \frac{1}{\Pr(a \mid S_n)}.$$

Since each possible symbol a leads to a single state, S_n can lead to M_A possible next states. The remaining N − M_A states cannot be reached from S_n, and for these states the transition probability p_{i|n} = 0. Therefore, the conditional entropy can be expressed in terms of the transition probabilities as

$$H(A \mid S_n) = \sum_{i} p_{i|n} \log_2 \frac{1}{p_{i|n}}.$$

For large t, the probability of being in state S_n is given by its steady-state probability π_n. Therefore, the entropy rate of the system is

$$R = \sum_{n=0}^{N-1} \pi_n\, H(A \mid S_n).$$

This expression, in turn, is equivalent to

$$R = \sum_{n=0}^{N-1} \pi_n \sum_{i} p_{i|n} \log_2 \frac{1}{p_{i|n}},$$

where the p_{i|n} are the entries in the transition probability matrix and the π_n are the steady-state probabilities.
Example 2.11: Find the entropy rate for the source in Example 2.9. Calculate the steady-
state probability of the source emitting a “0” and the steady-state probability of the source
emitting a “1”. Calculate the entropy of a memoryless source having these symbol
probabilities and compare the result with the entropy rate of the Markov source.

With the steady-state probabilities calculated in Example 2.10, by applying the formula for the entropy rate of an ergodic Markov source, one gets R ≈ 0.740 bits/symbol.

The steady-state probabilities of emitting 0 and 1 are, respectively, P(0) = Σ_n π_n Pr(0|S_n) = 1/3 and P(1) = 2/3.

The entropy of a memoryless source having this symbol distribution is H(X) = H(1/3, 2/3) ≈ 0.918 bits/symbol.

Thus, R < H(X), as expected.

Remarks:
i) In an earlier section, we discussed how introducing redundancy into a block of
symbols can be used to reduce the entropy rate to a level below the channel capacity and
how this technique can be used for error correction at the receive side, in order to
achieve an arbitrarily small information bit error rate.
ii) In this section, we have seen that a Markov process also introduces redundancy
into the symbol block.
Question: Can this redundancy be introduced in such a way that it is useful for error
correction?
Answer: YES! This is the principle underlying a class of error correcting codes known
as convolutional codes.
iii) In the previous lecture we examined the process of transmitting information C
through a channel, which produces a channel output Y. We have found out that a noisy
channel introduces information loss if the entropy rate exceeds the channel capacity.
iv) It is natural to wonder if there might be some (possibly complicated) form of data
processing which can be performed on Y to recover the lost information. Unfortunately,
the answer to this question is NO! Once the information has been lost, it is gone!

Data Processing Inequality

If the channel output Y is further processed to produce Z (so that C → Y → Z forms a Markov chain), then
I(C; Z) ≤ I(C; Y).
This states that additional processing of the channel output can at best result in no further
loss of information, and may even result in additional information loss.

Y → [ Data Processing ] → Z
A very common example of this kind of information loss is the roundoff or truncation
error during digital signal processing in a computer or microprocessor. Another
example is quantization in an analog-to-digital converter. Designers of these systems
need to be aware of the possible impact of design decisions such as the word
length of the digital signal processor, or the number of quantization bits of the analog-to-
digital converter, on the information content.

§2.5 Constrained Channels
Channel Constraints
So far, we have considered only memoryless channels corrupted by noise, which are
modeled as discrete-input discrete-output memoryless channels. However, in many cases
we have channels which place constraints on the information sequence.

Sampler

Modulator Bandlimited + Demodulator


Channel

s(t)

at Noise Timing Symbol


Recovery Detector

Block Diagram of a Typical Communication System.


yt
The coded information a_t is presented to the modulator, which transforms the symbol
sequence into continuous-valued waveform signals, designed to be compatible with the
physical channel (a bandlimited channel). Examples of bandlimited channels are wireless
channels, telephone lines, TV cables, etc. During transmission, the information-bearing
signal is distorted by the channel and corrupted with noise. The output of the
demodulator, which attempts to combat the distortion and minimize the effect of the
noise, is sampled, and the detector attempts to reconstruct the original coded sequence a_t.
Timing recovery is required, and the performance of this block is crucial in recovering
the information. The theory and practice of performing these tasks constitute
modulation theory, which is treated in "Digital Communications" textbooks. In this
course, we are concerned with the information theory aspects of this process. What
are these aspects?

Remarks:
i) When the system needs to recover the timing information, additional information
must be transmitted for that purpose. As the maximum information rate is limited by the
channel capacity, the information needed for timing recovery is included at the expense
of user information. This may require that the sequence of transmitted symbols be
constrained in such a way as to guarantee the presence of timing information embedded
within the transmitted coded sequence.
ii) Another aspect arises from the type and severity of channel distortions imposed by the
physical bandlimited channel. We can think of the physical channel as performing a kind
of data processing on the information bearing waveform presented to it by the modulator.
But data processing might result in information loss. A given channel can thus place its
own constraints on the allowable symbol sequences which can be "processed" without
information loss.
iii) Modulation theory tells us that it is possible and desirable to model the
communication channel as a cascade of a constrained noise-free channel and an unconstrained noisy
channel (we have implicitly used such a model, except that we have not considered any
constraint on the input symbol sequence).

[Figure: Linear and Time-Invariant (LTI) channel model — the input a_t passes through the constrained channel h_t to give x_t; noise n_t is added to produce r_t, which drives the decision block whose output is y_t.]


The LTI channel is specified by a set of parameters h_t, which represent the channel
impulse response. The channel's output sequence is related to the input sequence as
x_t = Σ_i h_i a_{t−i}.
The decision block is presented with a noisy signal
r_t = x_t + n_t.
The decision block takes these inputs and produces output symbols, y_t, drawn from a
finite alphabet Y, with M_Y ≥ M_A.
If MY =MA, yt is an estimate of the transmitted symbol at, and the decision block is
said to make a Hard-decision.
If MY > MA, the decision block is said to make a Soft-decision, and the final decision
on the transmitted symbol at is made by the decoder.
Example 2.12: Let A be a source with equiprobable symbols, A={-1,1}. The bandlimited
channel has the impulse response {h0=1 h1=0 h2=-1}. Calculate the steady-state entropy
of the constrained channel’s output and the entropy rate of the sequence xt.
State of the channel at time t : St = <at-1,at-2>.
The states are as follows:
(-1,-1) is state S0, (1,-1) is state S1,
(-1, 1) is state S2, (1, 1) is state S3.
The channel can be represented as a Markov process, with the state diagram given in the
sequel.
[State diagram: four states S0 = (−1,−1), S1 = (1,−1), S2 = (−1,1), S3 = (1,1); from S0 and S1 the two outgoing transitions are labeled −1/0 and 1/2, while from S2 and S3 they are labeled −1/−2 and 1/0; every transition probability, shown in parentheses, is 0.5.]

Note that all transition probabilities, shown in parentheses, are 0.5. The arrows are
labeled at / xt . One can easily show that X={-2, 0, 2}.
The state probability equation is then given by

          [ 0.5  0   0.5  0  ]
Π_{t+1} = [ 0.5  0   0.5  0  ] Π_t
          [ 0    0.5 0    0.5]
          [ 0    0.5 0    0.5]

from which we set up 4 equations and find the steady-state probabilities, i.e.,
Π_i = 0.25, i = 0, 1, 2, 3.
The output symbol probabilities are:
P(X = 0) = 1/2,  P(X = −2) = P(X = 2) = 1/4.
The steady-state entropy of the channel output is
H(X) = (1/2) log2 2 + 2 · (1/4) log2 4 = 1.5 bits/symbol.
The entropy rate is
R = Σ_n Π_n Σ_i p_{i|n} log2(1/p_{i|n}) = 4 · 0.25 · 1 = 1 bit/symbol,
which equals the source entropy → the channel is lossless.


Note that the entropy rate is not equal to the steady-state entropy of the channel's output
symbols. While the channel is lossless, the sequences it produces do not carry
sufficient information to permit clock recovery for arbitrary input sequences. For
example, a long input sequence of "−1", of "+1", or a long sequence of alternating symbols,
"+1−1" or "−1+1", all produce a long run of zeros at the output of the channel. Timing
recovery methods can fail in such situations.
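
The numbers quoted for Example 2.12 can be checked directly from the state diagram. The short sketch below recomputes the steady-state probabilities, the output-symbol distribution, the steady-state output entropy and the entropy rate; it assumes the transition/output labelling described above (each state emits its two branches with probability 0.5).

```python
import numpy as np

# transitions[state] = list of (next_state, output_symbol); each branch has prob. 0.5
transitions = {0: [(0, 0), (1, 2)],      # S0 = (-1,-1): a=-1 -> x=0,  a=+1 -> x=2
               1: [(2, 0), (3, 2)],      # S1 = (+1,-1)
               2: [(0, -2), (1, 0)],     # S2 = (-1,+1): a=-1 -> x=-2, a=+1 -> x=0
               3: [(2, -2), (3, 0)]}     # S3 = (+1,+1)

pi = np.full(4, 0.25)                    # steady-state probabilities (all equal)

# Output-symbol probabilities in steady state
p_x = {}
for s, branches in transitions.items():
    for (_, x) in branches:
        p_x[x] = p_x.get(x, 0.0) + pi[s] * 0.5
print(p_x)                               # {0: 0.5, 2: 0.25, -2: 0.25}

H_X = -sum(p * np.log2(p) for p in p_x.values())
print(H_X)                               # 1.5 bits: steady-state entropy of the output

# Entropy rate: sum_n pi_n * (entropy of the two 0.5-probability branches out of S_n)
R = sum(pi[s] * (-2 * 0.5 * np.log2(0.5)) for s in transitions)
print(R)                                 # 1.0 bit = source entropy -> lossless channel
```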

Ch 3 Error Control Strategies

Error control strategies fall into two classes: Forward Error Correction (FEC) and Automatic Repeat Request (ARQ).

Forward Error Correction (FEC)


In a one-way communication system, the transmission or recording is strictly in one
direction, from transmitter to receiver. The error control strategy must then be FEC; that is, it
employs error-correcting codes that automatically correct errors detected at the receiver.
Examples: 1) digital storage systems, in which the information recorded can be
replayed weeks or even months after it is recorded, and 2) deep-space communication
systems. Most of the coded systems in use today employ some form of FEC, even if the
channel is not strictly one-way!
For a two-way system, however, the error control strategy can use error detection and
retransmission, which is called automatic repeat request (ARQ).

§3.1 Automatic Repeat Request

Automatic Repeat Request (ARQ)


In most communication systems, the information can be sent in both directions, and the
transmitter also acts as a receiver (transceiver), and vice versa. For example: data
networks, satellite communications, etc. Error control strategies for a two-way system
can include error detection and retransmission, called Automatic Repeat Request
(ARQ). In an ARQ system, when errors are detected at the receiver, a request is sent for
the transmitter to repeat the message, and repeat requests continue to be sent until the
message is correctly received.
ARQ SYSTEMS:
- Stop-and-Wait ARQ
- Continuous ARQ: Go-Back-N ARQ and Selective Repeat ARQ
Types
Stop-and-Wait (SW) ARQ: The transmitter sends a block of information to the receiver
and waits for a positive (ACK) or negative (NAK) acknowledgment from the receiver. If
an ACK is received (no error detected), the transmitter sends the next block. If a NAK is
received (errors detected) , the transmitter resends the previous block. When the errors
are persistent, the same block may be retransmitted several times before it is correctly
received and acknowledged.
Continuous ARQ: The transmitter sends blocks of information to the receiver
continuously and receives acknowledgments continuously. When a NAK is received, the
transmitter begins a retransmission. It may back-up to the block and resend that block
plus the N-1 blocks that follow it. This is called Go-Back-N (GBN) ARQ. Alternatively,
the transmitter may simply resend only those blocks that are negatively acknowledged.
This is known as Selective Repeat (SR) ARQ.

Comparison
GBN Versus SR ARQ
SR ARQ is more efficient than GBN ARQ, but requires more logic and buffering.
Continuous Versus SW ARQ
Continuous ARQ is more efficient than SW ARQ, but it is more expensive to implement.
For example: In a satellite communication, where the transmission rate is high and the
round-trip delay is long, continuous ARQ is used. SW ARQ is used in systems where
the time taken to transmit a block is long compared to the time taken to receive an
acknowledgment. SW ARQ is used on half-duplex channels (only one way transmission
at a time), whereas continuous ARQ is designed for use on full-duplex channels
(simultaneous two-way transmission).

Performance Measure
Throughput Efficiency: the ratio of the average number of information bits successfully
accepted by the receiver per unit of time to the total number of information digits that
could have been transmitted per unit of time.
Delay of a Scheme: The interval from the beginning of a transmission of a block to the
receipt of a positive acknowledgment for that block.

GBN Versus SR ARQ
[Figure 1, from Lin and Costello, Error Control.]

ARQ Versus FEC


The major advantage of ARQ over FEC is that error detection requires much simpler
decoding equipment than error correction. Also, ARQ is adaptive in the sense that
information is retransmitted only when errors occur. In contrast, when the channel error
rate is high, retransmissions must be sent too frequently, and the SYSTEM THROUGHPUT
is lowered by ARQ. In this situation, a HYBRID combination of FEC for the most
frequent error patterns along with error detection and retransmission for the less likely
error patterns is more efficient than ARQ alone (HYBRID ARQ).

§3.2 Forward Error Correction

Performance Measures – Error Probability


The performance of a coded communication system is in general measured by its
probability of decoding error (called the Error Probability) and by its coding gain over the
uncoded system that transmits information at the same rate (with the same modulation
format).
There are two types of error probabilities, probability of word (or block) error and
probability of bit error. The probability of block error is defined as the probability that
a decoded word (or block) at the output of the decoder is in error. This error probability is
often called the Word-Error Rate (WER) or Block-Error Rate (BLER). The
probability of bit error, also called the Bit-Error Rate (BER), is defined as the
probability that a decoded information bit at the output of the decoder is in error.
A coded communication system should be designed to keep these two error probabilities
as low as possible under certain system constraints, such as power, bandwidth and
decoding complexity.

The error probability of a coded communication system is commonly expressed in terms
of the ratio of the energy per information bit, Eb, to the one-sided power spectral density
(PSD), N0, of the channel noise.

Example 3.1: Consider a coded communication system using a (23, 12) binary Golay
code for error control. Each code word consists of 23 code digits, of which 12 are
information digits. Therefore, there are 11 redundant bits, and the code rate is R = 12/23 = 0.5217.
Suppose that BPSK modulation with coherent detection is used and that the channel is
AWGN with one-sided PSD N0. Let Eb/N0 at the input of the receiver be the signal-to-
noise ratio (SNR), which is usually expressed in dB.
The bit-error performance of the (23,12) Golay code with both hard- and soft-decision
decoding versus SNR is given, along with the performance of the uncoded system.

From Lin and Costello, Error Control

From the above figure, the coded system, with either hard- or soft-decision decoding,
provides a lower bit-error probability than the uncoded system for the same SNR, when
the SNR is above a certain threshold.
With hard-decision, this threshold is 3.7 dB.
For SNR=7dB, the BER of the uncoded system is 8x10-4, whereas the coded system
(hard-decision) achieves a BER of 2.9x10-5. This is a significant improvement in
performance.
For SNR=5dB this improvement in performance is small: 2.1x10-3 compared to 6.5x10-3.
However, with soft-decision decoding, the coded system achieves a BER of 7x10-5.

Performance Measures – Coding Gain
The other performance measure is the Coding Gain. Coding gain is defined as the
reduction in SNR required to achieve a specific error probability (BER or WER) for a
coded communication system compared to an uncoded system.

Example 3.1 (cont’d): Determine the coding gain for BER=10-5.

For a BER=10-5, the Golay-coded system with hard-decision decoding has a coding gain
of 2.15 dB over the uncoded system, whereas with soft-decision decoding, a coding gain
of more than 4 dB is achieved. This result shows that soft-decision decoding of the Golay
code achieves 1.85 dB additional coding gain compared to hard-decision decoding at a
BER of 10-5.
This additional coding gain is achieved at the expense of higher decoding complexity.
Coding gain is important in communication applications, where every dB of improved
performance results in savings in overall system cost.

Remarks:
At sufficiently low SNR, the coding gain actually becomes negative. This threshold
phenomenon is common to all coding schemes: there always exists an SNR below which
the code loses its effectiveness and actually makes the situation worse. This SNR is
called the Coding Threshold. It is important to keep this threshold low and to operate a
coded communication system at an SNR well above its coding threshold.
Another quantity that is sometimes used as a performance measure is the Asymptotic
Coding Gain (the coding gain for large SNR).

§3.3 Shannon’s Limit of Code Rate

Shannon’s Limit

In designing a coding system for error control, it is desired to minimize the SNR
required to achieve a specific error rate. This is equivalent to maximizing the coding
gain of the coded system compared to an uncoded system using the same modulation
format. A theoretical limit on the minimum SNR required for a coded system with
code rate R to achieve error-free communication (or an arbitrarily small error
probability) can be derived based on Shannon’s noisy coding theorem.

This theoretical limit, often called the Shannon Limit, simply says that for a coded
system with code rate R, error-free communication is achieved only if the SNR exceeds
this limit. As long as SNR exceeds this limit, Shannon’s theorem guarantees the existence
of a (perhaps very complex) coded system capable of achieving error-free
communication.
For transmission over a binary-input, continuous-output AWGN channel with BPSK signaling,
the Shannon limit, in terms of SNR as a function of the code rate, does not have a closed
form; however, it can be evaluated numerically.

[Figures 3 and 4, from Lin and Costello, Error Control: the Shannon limit as a function of code rate (0.188 dB at R = 1/2), and the BER of a rate R = 1/2 convolutional code compared with uncoded BPSK, showing a 5.35 dB coding gain and a 9.462 dB maximum potential coding gain.]


From Fig. 3 (Shannon limit as a function of the code rate for BPSK signaling on a
continuous-output AWGN channel), one can see that the minimum required SNR to
achieve error free communication with a coded system with rate R=1/2, is 0.188 dB. The
Shannon limit can be used as a yardstick to measure the maximum achievable coding
gain for a coded system with a given rate R over an uncoded system with the same
modulation format. For example, to achieve BER = 10^-5, an uncoded BPSK system
requires an SNR of 9.65 dB. For a coded system with code rate R = 1/2, the Shannon limit
is 0.188 dB. Therefore, the maximum potential coding gain for a coded system with code
rate R = 1/2 is 9.462 dB.
For example (Fig. 4), a rate R = 1/2 convolutional code with memory order 6 achieves
BER = 10^-5 at SNR = 4.15 dB, i.e., a coding gain of 5.35 dB compared to the
uncoded system. However, it is 3.962 dB away from the Shannon limit. This gap can be
reduced by using a more powerful code.

§3.4 Codes for Error Control

Basic Concepts in Error Control

There can be a hybrid of the two approaches, as well.

Codes for Error Control (FEC)

Types of Channels

Random Error Channels

Burst-Error Channels

Compound Channels

Random Error Channels: are memoryless channels; the noise affects each transmitted
symbol independently. Example: deep space and satellite channels, most line-of-sight
transmission.
Burst Error Channels: are channels with memory. Example: fading channels (the
channel is in a “bad state” when a deep fade occurs, which is caused by multipath
transmission) and magnetic recordings subject to dropouts caused by surface defects and
dust particles.
Compound Channels: both types of errors are encountered.

Ch 4 Error Detection and Correction

[Encoding and decoding procedure: at the transmitter, Source Encoder → ECC Encoder → DTC Encoder → Channel; at the receiver, DTC Decoder → ECC Decoder → Source Decoder.]

§4.1 Error Detection and Correction Capacity

Definition
A code can be characterized in terms of its amount of error detection capability and error
correction capability. The Error Detection Capability is the ability of the decoder to
tell if an error has been made in transmission. The Error Correction Capability is the
ability of the decoder to tell which bits are in error.

Binary code, M = {0, 1}; coded sequence (code) C.
The channel encoder maps each message m = (m_0, ..., m_{k−1}), a block of k bits, to a code
word c = (c_0, ..., c_{n−1}), a block of n bits, with n ≥ k.
• Message: block of k bits.
• Code word: block of n bits.
• The correspondence is one-to-one, so only 2^k out of the 2^n n-bit words are used as
code words.
• G is the encoding rule.

Assumptions:
- independent bits
- each message is equally probable: 2^k equally likely messages, of k bits each
- r = n − k redundant bits
Thus, the entropy rate of the code word is R = k/n; this is also called the Code Rate.

For every c_i, c_j ∈ C, i ≠ j, d_H(c_i, c_j) is the Hamming distance between
the two code words. The Hamming Distance is defined as the number of bit positions in which
the two code words differ.
There is at least one pair of code words for which the distance is smallest. This smallest distance
is called the Minimum Hamming Distance of the code, d_min.
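
These definitions translate directly into code. A minimal sketch (Python) for the Hamming distance between two words and the minimum Hamming distance of a small code by exhaustive pairwise comparison:

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def minimum_distance(code):
    """Smallest pairwise Hamming distance over all distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

# The repetition code of Example 4.1:
code = ["000", "111"]
print(minimum_distance(code))   # -> 3
```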

Example 4.1 (Repetition Code) : Given encoding rule:


G(0)→ 000
G(1) →111
i.e. only two valid code words. Find its code rate and Hamming weight.

Hamming Weight wH of a code word is defined as the number of “1” bits in the code
word (the Hamming distance between the code word and the zero code word).

Example 4.1 (cont’d) : For the received words in the 1st column of the Table below,
determine their source words.

Decision: based on the minimum Hamming distance between the received word and
the code words.
• The code corrects 1 error (d_H = 1), but it cannot simultaneously detect a 2-bit
error; such a received word is miscorrected.
• Used purely for detection, the code detects up to two bits in error (3 bits in error lead
to the other code word; d_min between the two code words is 3).

Received Word   Decoded Word   Error Flag
000             000            0
001             000            1
010             000            1
011             111            1
100             000            1
101             111            1
110             111            1
111             111            0

Example 4.2 (Repetition Code) : Given coding rule,


G(0)→ 0000
G(1) →1111
find decoded words for the received words in the table on the next page.
n = 4, k = 1, r = 3, dmin = 4, R=1/4
• Correct 1 error (dH =1) and Detect 2 errors (dH=2)
• An error of 3 or 4 bits will be miscorrected.

Received   Decoded        Received   Decoded
Word       Word           Word       Word
0000       0000           1000       0000
0001       0000           1001       error detected
0010       0000           1010       error detected
0011       error detected 1011       1111
0100       0000           1100       error detected
0101       error detected 1101       1111
0110       error detected 1110       1111
0111       1111           1111       1111
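
A short sketch of the minimum-Hamming-distance decision rule used to fill the tables above; when two code words are at the same (minimum) distance from the received word, the decoder can only flag a detected error rather than correct it (this is what happens for the weight-2 patterns of the (4, 1) code).

```python
def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def md_decode(received, code):
    """Minimum-distance decoding; returns (decoded_word_or_None, error_detected)."""
    dists = {c: hamming_distance(received, c) for c in code}
    best = min(dists.values())
    nearest = [c for c, d in dists.items() if d == best]
    error_detected = best > 0
    if len(nearest) > 1:              # tie: error detected but not correctable
        return None, error_detected
    return nearest[0], error_detected

code4 = ["0000", "1111"]
for r in ["0001", "0011", "0111"]:
    print(r, md_decode(r, code4))
# 0001 -> ('0000', True)   single error corrected
# 0011 -> (None,  True)    two errors: detected, not corrected
# 0111 -> ('1111', True)   three errors: miscorrected to 1111
```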

Hamming Distance and Code Capability


1. Detect up to t errors IF AND ONLY IF d_min ≥ t + 1.
Example: Repetition code, n = 3, k = 1, r = 2, d_min = 3. This code detects up to t = 2
errors.

2. Correct up to t errors IF AND ONLY IF d_min ≥ 2t + 1.

Example: Repetition code, n = 3, k = 1, r = 2, d_min = 3. This code corrects t = 1 error.

3. Detect up to t_d errors and correct up to t_c errors (t_d ≥ t_c) IF AND ONLY IF
d_min ≥ 2t_c + 1 and d_min ≥ t_c + t_d + 1.

Example: Repetition code, n = 3, k = 1, r = 2, d_min = 3. This code cannot simultaneously
correct (t_c = 1) and detect (t_d = 2) errors.

Number of Redundant Bits


The minimum Hamming distance is related to the number of redundant bits, r:
d_min ≤ r + 1 = n − k + 1,  or equivalently  r ≥ d_min − 1.
This gives the lower limit on the number of redundant bits needed for a certain minimum
Hamming distance (a certain detection and correction capability), and it is called the
Singleton Bound.

For example: Repetition code, n = 3, k = 1, r = 2, d_min = 3 = r + 1. See its error
detection and correction capabilities as previously discussed.

§4.2 Linear Block Codes

Definition
Linear Block Codes can be mathematically treated using the mathematics of vector
spaces.

Linear Block Codes

Binary
(We deal here only with such codes)

Non-Binary

Reed-Solomon

Galois Field has two elements, i.e., A={0,1} or A=GF(2)

(A, ⊕, ·): addition is the Exclusive-OR and multiplication is the AND of digital logic.

  ⊕ | 0 1        · | 0 1
  --+-----        --+-----
  0 | 0 1        0 | 0 0
  1 | 1 0        1 | 0 1
(A^n, +, ·): vector addition and scalar multiplication.
The vector space A^n is the set of n-tuples a = (a_0, ..., a_{n−1}), with each a_i ∈ A.
The set of code words, C, is a subset of A^n. It is a subspace (with 2^k elements); any subspace
is also a vector space.

If the sum of two code words is also a code word, the code is called a Linear Code.
Consequence: the all-zero vector is a code word, 0 ∈ C (because c_1 + c_1 = 0).

Vector Space
Linear independence: the code words c_0, ..., c_{k−1} are linearly independent if
a_0 c_0 + ... + a_{k−1} c_{k−1} = 0 only when a_0 = ... = a_{k−1} = 0;
such linearly independent code words can serve as Basis Vectors.

If c_0, ..., c_{k−1} are linearly independent, every c ∈ C can be uniquely written as
c = a_0 c_0 + ... + a_{k−1} c_{k−1}.
The Dimension of a vector space is defined as the number of basis vectors it takes to
describe (span) it.

Generating Code Word


Question: how do we generate a code word ?

c = mG, where
m = (m_0, ..., m_{k−1}) is the 1 × k message,
c = (c_0, ..., c_{n−1}) is the 1 × n code word (n ≥ k),
and G is the k × n Generator Matrix whose rows are g_0, ..., g_{k−1}:
G = [ g_0 ; g_1 ; ... ; g_{k−1} ].

A code word is a linear combination of the rows of the G matrix.
The rows form a basis, so the k rows must be linearly independent.
All the rows of G are themselves code words!

Example 4.3: For the linear block code with n = 7, k = 4, r = 3, generated by

                                              [ 1 1 0 1 0 0 0 ]
(c_0 c_1 c_2 c_3 c_4 c_5 c_6) = (m_0 m_1 m_2 m_3) [ 0 1 1 0 1 0 0 ]
                                              [ 1 1 1 0 0 1 0 ]
                                              [ 1 0 1 0 0 0 1 ]
find all the code words.
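
A sketch (Python) that generates all 2^4 code words of Example 4.3 by computing c = mG over GF(2); it also confirms the minimum weight/distance of 3 quoted later for this code.

```python
import numpy as np
from itertools import product

G = np.array([[1, 1, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0, 1]])

codewords = []
for m in product([0, 1], repeat=4):          # all 2^k messages
    c = np.mod(np.array(m) @ G, 2)           # c = mG over GF(2)
    codewords.append(c)
    print(m, c)

# For a linear code, d_min equals the minimum weight of the non-zero code words
w_min = min(int(c.sum()) for c in codewords if c.any())
print("d_min =", w_min)                      # -> 3
```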

Here, it is a linear systematic block code, since G has the form [P | I_4] (see the next section).
§4.2.1 Linear Systematic Block Codes
Definition
If the generator matrix can be written as
G = [ P | I_k ],
where P is the k × (n − k) parity-check (parity-generation) matrix and I_k is the k × k identity
matrix, then a linear block code generated by such a generator matrix is called a Linear
Systematic Block Code. Its code words have the form
c = ( mP | m ) = ( redundant checking part, n − k digits | message information part, k digits ),
n bits in total.

Example 4.3 (cont'd): n = 7, k = 4, r = 3,
c = (c_0 c_1 c_2 c_3 c_4 c_5 c_6) = (m_0 m_1 m_2 m_3) G, with G as given above; design the encoder.

Parity-check bits (first r = n − k bits):
c_0 = m_0 + m_2 + m_3
c_1 = m_0 + m_1 + m_2
c_2 = m_1 + m_2 + m_3
Information bits (last k bits):
c_3 = m_0,  c_4 = m_1,  c_5 = m_2,  c_6 = m_3

ENCODING CIRCUIT: the encoder can be designed as a simple circuit that computes the three
parity-check bits from the message bits; a sketch follows.
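
A minimal sketch of such an encoder in Python, computing the parity-check bits from the equations above and appending the message bits:

```python
def encode_743(m):
    """Systematic (7, 4) encoder of Example 4.3: c = (c0 c1 c2 | m0 m1 m2 m3)."""
    m0, m1, m2, m3 = m
    c0 = m0 ^ m2 ^ m3          # parity-check bits (XOR = addition in GF(2))
    c1 = m0 ^ m1 ^ m2
    c2 = m1 ^ m2 ^ m3
    return [c0, c1, c2, m0, m1, m2, m3]

print(encode_743([1, 0, 0, 1]))   # -> [0, 1, 1, 1, 0, 0, 1]
```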

§4.2.2 Hamming Weight and Distance

The Hamming Distance of two code words c_1, c_2, d_H(c_1, c_2), is the number of
positions in which they differ. The Hamming Weight of a code word c_i, w_H(c_i), is the
number of non-zero positions in it. It is clear that
d_H(c_1, c_2) = w_H(c_1 + c_2).

In Example 4.3 (n = 7, k = 4, r = 3), determine the Hamming weight for

c_1 = (1000001)

c_2 = (0010001)

Minimum Hamming Distance

The Minimum Hamming Distance of a linear block code is equal to the Minimum
Hamming Weight of the non-zero code vectors.
In Example 4.3: n = 7, k = 4, r = 3, d_min = w_min = 3.

§4.2.3 Error Detection and Correction Capacity

Rules
i) Detect up to t errors IF AND ONLY IF d_min ≥ t + 1.
ii) Correct up to t errors IF AND ONLY IF d_min ≥ 2t + 1.
iii) Detect up to t_d errors and correct up to t_c errors IF AND ONLY IF
d_min ≥ 2t_c + 1 and d_min ≥ t_c + t_d + 1.

In Example 4.3: n = 7, k = 4, r = 3
The minimum Hamming distance is 3, so the number of errors which can be
detected is 2 and the number of errors which can be corrected is 1. The code does
not have the capability to simultaneously detect and correct errors (see the relations
between d_min and the correction/detection capability of a code).

Error Vector
For a received vector v:
v = c + e,
where e is the Error Vector.
No error: e = (0000000). Example, an error in the first bit: e = (1000000).

GH^T = 0,
where G is the k × n Generator Matrix and H is the (n − k) × n Parity-Check Matrix
(so GH^T is the k × (n − k) all-zero matrix).

For a systematic code with G = [P | I_k], the parity-check matrix is
H = [ I_{n−k} | P^T ].

For a code word c:  cH^T = mGH^T = 0.

In Example 4.3 (n = 7, k = 4, r = 3), find its parity-check matrix.

From the generator matrix G in Example 4.3, cH^T = mGH^T = 0 gives, with
c = (c_0 c_1 c_2 m_0 m_1 m_2 m_3),
c_0 + m_0 + m_2 + m_3 = 0
c_1 + m_0 + m_1 + m_2 = 0
c_2 + m_1 + m_2 + m_3 = 0
(the Parity-Check Equations).

Syndrome Calculation and Error Detection
The Syndrome is defined as
s = vH^T,
a 1 × (n − k) vector (v is 1 × n and H^T is n × (n − k));
s = 0 if v is a code word, and s ≠ 0 if v is not a code word.

In Example 4.3 (n = 7, k = 4, r = 3), consider a received word whose error vector introduces
3 errors and is itself a code word: there is an error, but this error is undetectable!
Since the minimum Hamming distance of this code is 3, a 3-error pattern can turn one code
word into another code word.
Note: When we say that the number of errors which can be detected is 2, we refer to all
error patterns with 2 bits in error. The code is also capable of detecting many patterns with
more than 2 errors, but not all of them!

Question: What is the number of error patterns which can be detected with this code?
Answer: The total number of error patterns is 2^n − 1 (the all-zero vector is not an error!).
However, 2^k − 1 of them lead to code words, which means that they are not detectable. So,
the number of error patterns which are detectable is 2^n − 2^k.
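
A sketch that builds the parity-check matrix H = [I_{n−k} | P^T] for Example 4.3, verifies GH^T = 0, and computes the syndrome of a received word:

```python
import numpy as np

P = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 1, 1],
              [1, 0, 1]])                       # k x (n-k) part of G = [P | I_4]
G = np.hstack([P, np.eye(4, dtype=int)])
H = np.hstack([np.eye(3, dtype=int), P.T])      # H = [I_{n-k} | P^T]

print(np.mod(G @ H.T, 2))                       # all zeros: GH^T = 0

c = np.mod(np.array([1, 0, 1, 1]) @ G, 2)       # a code word
e = np.array([0, 0, 0, 0, 1, 0, 0])             # single error in position 4
v = np.mod(c + e, 2)
print(np.mod(c @ H.T, 2))                       # zero syndrome for the code word
print(np.mod(v @ H.T, 2))                       # non-zero syndrome: error detected
```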

Error Correction Capacity

Likelihood Test
Why and when is the minimum Hamming distance a good decoding rule?
Let c_1, c_2 be two code words and v be the received word.
If c_1 is the actual code word, the number of errors is t_1 = d_H(v, c_1).
If c_2 is the actual code word, the number of errors is t_2 = d_H(v, c_2).
Which of these two code words is more likely, based on v?
The most likely code word is the one with the greatest probability of occurring together with
the received word, i.e., decide c_1 if
p(v, c_1) > p(v, c_2)
(the Likelihood Ratio Test) or, equivalently, decide c_1 if
ln p(v, c_1) − ln p(v, c_2) > 0
(the Log-Likelihood Ratio Test).

The joint probabilities can be further written as
p(v, c_i) = p(v | c_i) p(c_i),  i = 1, 2.

For the BSC channel (independent errors),
p(v | c_i) = p^{t_i} (1 − p)^{n − t_i},  i = 1, 2,
where t_i = d_H(v, c_i) is the number of errors that have occurred during the transmission of
code word c_i. Since a received word corresponds to one specific error pattern, the binomial
coefficient does not appear above.
IF
Condition 1: the code words have the same probability, and
Condition 2: p < 0.5 (p is the crossover probability of the BSC channel),
then the log-likelihood ratio reduces to (t_2 − t_1) ln((1 − p)/p), which is positive exactly when
t_1 < t_2. Hence the most likely code word is the one at the smallest Hamming distance from v:
maximum-likelihood decoding coincides with minimum-Hamming-distance decoding.
§4.2.4 Decoding Linear Block Codes

Standard Array Decoder


The simplest, least clever, and often most expensive strategy for implementing error
correction is to simply look up c in a decoding table that contains all possible v . This is
called a standard-array decoder, and the lookup table is called the Standard Array. The
first word in the first column of the standard array is the zero code-word (it also means
zero error). If no error, the received words are the code words. These are given in the first
row of the standard array. For a linear block code (n, k), the first row contains 2k code
words, including the zero code-word. All 2n words are contained in the array. Each row
contains 2k words. So, the number of columns is 2k. The number of rows will then be
2n/2k =2n-k=2r. The standard array for a (7, 4) code can be seen in the table on next page.

When decoding with the standard array, we identify the column of the array where the
received vector appears. The decoded vector is the code word in the first row of that column.
Each row is called a Coset. In the first column we have all correctable error patterns.
These are called Coset Leaders. Decoding is correctly done if and only if the error
pattern caused by the channel is a coset leader (including the zero-vector). The
words on each column, except for the first element, which is a code word, are obtained by
adding the coset leader to the code word.

Question: How do we choose the coset leaders?


To minimize the probability of a decoding error, the error patterns that are more likely to
occur for a given channel should be chosen as coset leaders. For a BSC, an error pattern
of smaller weight is more probable than an error pattern of larger weight. Therefore,
when the standard array is formed, each coset leader should be chosen to be a vector of
least weight among the remaining available vectors. Choosing coset leaders this way, each
coset leader will have the minimum weight in its coset. In a column, one gets the words
which are at minimum distance from the code word that is the first element of the column.
A linear block code is capable of correcting 2^{n−k} error patterns (including the zero error).

Syndrome Decoder
The standard-array decoder becomes impractical when the block code length is large. A more
efficient method is the syndrome decoder. The Syndrome Vector is defined as
s = vH^T
(1 × (n − k), with v 1 × n and H^T n × (n − k)); s = 0 if v is a code word and s ≠ 0 otherwise.
Here v is the received vector and H is the parity-check matrix. The syndrome is
independent of the transmitted code word; it depends only on the error vector (for a specific code).
All 2^k n-tuples (n-bit words) of a coset have the same syndrome.

Steps in the Syndrome Decoder

1. For the received word, the syndrome is calculated: s = vH^T = eH^T.
2. The coset leader e corresponding to s is found (table look-up).
3. The transmitted code word is estimated as c = v + e.
(A short sketch of these steps is given below.)
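
A minimal sketch of the three steps for the (7, 4) code of Example 4.3, using a look-up table that maps the syndrome of every single-error pattern to its coset leader:

```python
import numpy as np

P = np.array([[1, 1, 0], [0, 1, 1], [1, 1, 1], [1, 0, 1]])
G = np.hstack([P, np.eye(4, dtype=int)])
H = np.hstack([np.eye(3, dtype=int), P.T])

# Offline: syndrome -> coset leader, for all correctable (single-error) patterns
table = {tuple(np.zeros(3, dtype=int)): np.zeros(7, dtype=int)}
for i in range(7):
    e = np.zeros(7, dtype=int); e[i] = 1
    table[tuple(np.mod(e @ H.T, 2))] = e

def decode(v):
    s = tuple(np.mod(v @ H.T, 2))       # step 1: syndrome
    e = table[s]                        # step 2: coset leader
    return np.mod(v + e, 2)             # step 3: c = v + e

c = np.mod(np.array([1, 0, 1, 1]) @ G, 2)
v = c.copy(); v[2] ^= 1                 # flip one bit
print(np.array_equal(decode(v), c))     # -> True
```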

Example 4.4: Design the Syndrome decoder for Example 4.3 in which n = 7, k = 4, r
=3
For the parity-check matrix in Example 4.3 and the single-bit error pattern:

§4.2.5 Hamming Codes

Definition
Hamming codes are important linear block codes, used for single-error correction in
digital communication and data storage systems. For any integer r ≥ 3, there exists a
Hamming code with the following parameters:

Code length: n = 2^r − 1
Number of information digits: k = 2^r − 1 − r
Number of parity-check digits: r = n − k
Error correction capability: t = 1 (d_min = 3)
A systematic Hamming code has parity-check matrix H = [I_r | Q], where the columns of Q
are all the r-tuples of weight 2 or more.

In Example 4.3: n = 7, k = 4, r = 3
Code length: n = 2^r − 1 = 7
Number of information digits: k = 2^r − 1 − r = 4
Number of parity-check digits: r = n − k = 3
Error correction capability: t = 1 (d_min = 3)
Thus, the code given as an example is a Hamming code.
Example 4.5: Construct the parity-check matrix for the (7, 4) systematic Hamming code.

Example 4.6: Write down the generator matrix for the Hamming code of Example
4.5.

Perfect Code
If we form the standard array for the Hamming code of length n = 2^r − 1, the n-tuples of
weight 1 can be used as coset leaders. Recall that the number of cosets is 2^n / 2^k = 2^r;
the coset leaders are then exactly the zero vector and the n = 2^r − 1 n-tuples of weight 1.
Such a code is called a Perfect Code. "PERFECT" does not mean "BEST"!
A Hamming code corrects only error patterns of a single error and no others.

Some Theorems on The Relation Between the Parity Check


Matrix and the Weight of Code Words
Theorem 1: For each code word of weight d, there exist d columns of H, such that the
vector sum of these columns is equal to the zero vector.
The converse is also true.

Theorem 2: The minimum weight (distance) of a code is equal to the smallest number of
columns of H that sum to 0.

In Example 4.3: n = 7, k = 4, r = 3

The columns of H are non-zero and distinct. Thus, no two columns add to zero, and the
minimum distance of the code is at least 3. As H consists of all non-zero r-tuples as its
columns, the vector sum of any such two columns must be a column in H, and thus, there
are three columns whose sum is zero. Hence, the minimum Hamming distance is 3.

Shortened Hamming Codes
If we delete l columns from the parity-check matrix H of a Hamming code, the dimension of
the new parity-check matrix, H', becomes r × (2^r − 1 − l). Using H' we obtain a Shortened
Hamming Code, with the following parameters:
Code length: n = 2^r − 1 − l
Number of information digits: k = 2^r − 1 − r − l
Number of parity-check digits: n − k = r
Minimum Hamming distance: d_min ≥ 3

In Example 4.3: We shorten the code (7,4)

We delete from PT all the columns of even weight, such that no three columns add to zero
(since total weight must be odd). However, for the column of weight 3, there are 3
columns in Ir , such that the 4 columns’ sum is zero. We can thus conclude that the
minimum Hamming distance of the shortened code is exactly 4. This increases the error
correction and detection capability.

The shortened code is capable of correcting all error patterns of single error and detecting
all error patterns of double errors. By shortening the code, the error correction and
detection capability is increased.

Ch 5 Cyclic Codes

§5.1 Description of Cyclic Codes

Definition
Cyclic codes are a class of linear block codes which can be implemented with extremely
cost-effective electronic circuits.
Cyclic Shift Property
A cyclic shift of c = (c_0 c_1 ... c_{n−2} c_{n−1}) is given by
c^(1) = (c_{n−1} c_0 c_1 ... c_{n−2}).
In general, the i-th cyclic shift of c can be written as
c^(i) = (c_{n−i} ... c_{n−1} c_0 ... c_{n−i−1}).
A Cyclic Code is a linear block code C, with code words c = (c_0 c_1 ... c_{n−2} c_{n−1}),
such that for every c ∈ C the vector given by the cyclic shift of c is also a code word.

Example 5.1: Verify the (6,2) repetition code


C  {(000000), (111111), (010101), (101010)}
is a cyclic code.

A cyclic shift of any of its code vectors results in a vector that is an element of C.
Check this by yourself.

Example 5.2: Verify that the (5, 2) linear block code defined by the generator matrix

G = [ 1 0 1 1 1
      0 1 1 0 1 ]

is not a cyclic code.

Its code vectors are c = mG:
00000
10111
01101
11010
The cyclic shift of (10111) is (11011), which is not an element of C. Similarly, the cyclic
shift of (01101) is (10110), which is also not a code word.

Code (or Codeword) Polynomial

Code word c = (c_0 c_1 ... c_{n−2} c_{n−1})
  ↕ one-to-one correspondence
c(X) = c_0 + c_1 X + ... + c_{n−2} X^{n−2} + c_{n−1} X^{n−1},
a Code Polynomial of degree (highest exponent of X) n − 1 or less.

Theorem: The non-zero code polynomial of minimum degree in a cyclic code is
unique, and its degree is r = n − k.

Theorem 1: A binary polynomial of degree n − 1 or less is a code polynomial if and only if
it is a multiple of g(X):

c(X) = m(X) g(X) = (m_0 + m_1 X + ... + m_{k−1} X^{k−1}) g(X),

where m(X) has degree k − 1 or less, g(X) has degree r, c(X) has degree n − 1 or less, and
m_0, ..., m_{k−1} are the k information digits to be encoded.

An (n, k) cyclic code is completely specified by its non-zero code polynomial of
minimum degree, g(X), called the Generator Polynomial.

Theorem 2: The generator polynomial g(X) of an (n, k) cyclic code is a factor of X^n + 1.

Question: For any n and k, is there an (n, k) cyclic code?

Theorem 3: If g(X) is a polynomial of degree r = n − k and is a factor of X^n + 1,
then g(X) generates an (n, k) cyclic code.

Remark: For n large, X^n + 1 may have many factors of degree n − k. Some of these
polynomials generate good codes, whereas some generate bad codes.
Example 5.3: Determine the factors of X^7 + 1 that can generate (7, 4) cyclic codes.

X^7 + 1 = (1 + X)(1 + X + X^3)(1 + X^2 + X^3).
For a (7, 4) code, r = n − k = 7 − 4 = 3, so the generator polynomial can be chosen either as
g(X) = 1 + X + X^3  or  g(X) = 1 + X^2 + X^3.

Systematic Cyclic Code

For a message m(X) = m_0 + m_1 X + ... + m_{k−1} X^{k−1}, the steps to generate the
systematic cyclic code word are:

Step 1: Multiply the message m(X) by X^{n−k}.
Step 2: Divide X^{n−k} m(X) by g(X) and obtain the remainder b(X).
Step 3: Form the code word c(X) = b(X) + X^{n−k} m(X).

Proof: X^{n−k} m(X) = a(X) g(X) + b(X), where X^{n−k} m(X) has degree ≤ n − 1, g(X) has
degree n − k, and b(X) has degree ≤ n − k − 1. Then
b(X) + X^{n−k} m(X) = a(X) g(X)
is a multiple of g(X), hence a code word, and
c(X) = b_0 + b_1 X + ... + b_{n−k−1} X^{n−k−1} + m_0 X^{n−k} + ... + m_{k−1} X^{n−1}
(parity-check bits followed by the message).

Example 5.4: Find the (7, 4) cyclic code word generated by g(X) = 1 + X + X^3 when

m(X) = 1 + X^3, i.e., m = (1001).

Step 1: Multiply the message m(X) by X^{n−k}:  X^3 m(X) = X^3 + X^6.

Step 2: Obtain the remainder b(X) from dividing X^3 m(X) by g(X):  b(X) = X + X^2.

Step 3: Combine b(X) and X^3 m(X) to form the systematic code word:
c(X) = X + X^2 + X^3 + X^6, i.e.,
c = ( 011 | 1001 )
(parity-check bits | k bits of the message).
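
The three encoding steps are easy to check numerically. A sketch using integer bit-masks to represent GF(2) polynomials (bit i = coefficient of X^i):

```python
def gf2_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division (polynomials as integer bit-masks)."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

def cyclic_encode(m, g, n, k):
    """Systematic (n, k) cyclic encoding: c(X) = b(X) + X^(n-k) m(X)."""
    shifted = m << (n - k)              # step 1: X^(n-k) m(X)
    b = gf2_mod(shifted, g)             # step 2: remainder b(X)
    return shifted | b                  # step 3: code polynomial

g = 0b1011          # g(X) = 1 + X + X^3   (bit i <-> X^i)
m = 0b1001          # m(X) = 1 + X^3, i.e. m = (1001)
c = cyclic_encode(m, g, 7, 4)
print(format(c, '07b')[::-1])           # coefficients c0..c6 -> '0111001'
```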

§5.2: Generator and Parity-check Matrices

Generator Matrix
Let C be an (n, k) cyclic code with generator polynomial
g(X) = g_0 + g_1 X + ... + g_{n−k} X^{n−k}.
Then a code polynomial can be written as
c(X) = m(X) g(X) = m_0 g(X) + m_1 X g(X) + ... + m_{k−1} X^{k−1} g(X),
which is equivalent to the fact that g(X), Xg(X), ..., X^{k−1} g(X) span C. The k × n generator
matrix is therefore

      [ g(X)        ]   [ g_0 g_1 g_2 ... g_{n−k}   0       0       ... 0       ]
G =   [ Xg(X)       ] = [ 0   g_0 g_1 ... g_{n−k−1} g_{n−k} 0       ... 0       ]
      [ ...         ]   [ 0   0   g_0 ... g_{n−k−2} g_{n−k−1} g_{n−k} ... 0     ]
      [ X^{k−1}g(X) ]   [ ...........................................          ]
                        [ 0   0   0   ... 0   g_0 .................... g_{n−k} ]

with g_0 = g_{n−k} = 1.

Systematic Generator Matrix


In general, G is not in a systematic form. However, we can bring it into systematic form
by performing row operations.
Reminder: for a systematic code, G = [P | I_k].

Example 5.5: Determine the systematic generator matrix for the (7, 4) cyclic code generated
by g(X) = 1 + X + X^3.

    [ 1 1 0 1 0 0 0 ] R1                        [ 1 1 0 1 0 0 0 ]
G = [ 0 1 1 0 1 0 0 ] R2   R3 ← R3 + R1         [ 0 1 1 0 1 0 0 ]
    [ 0 0 1 1 0 1 0 ] R3   R4 ← R4 + R1 + R2  = [ 1 1 1 0 0 1 0 ]
    [ 0 0 0 1 1 0 1 ] R4                        [ 1 0 1 0 0 0 1 ]

(systematic form)

The (7, 4) cyclic code generated by g(X) = 1 + X + X^3, when the message is (1100);
for other messages, see the code table below:

The (7, 4) cyclic code in systematic form, generated by g(X) = 1 + X + X^3, when the
message is (0011); for other messages, see the code table below:
Parity-Check Matrix

We know that
X^n + 1 = g(X) h(X),
where g(X) has degree r = n − k and h(X), the Parity-Check Polynomial, has degree k.

Let c = (c_0 c_1 ... c_{n−1}) be a code word, c(X) = a(X) g(X), with deg a(X) ≤ k − 1.
Multiplying by h(X),
c(X) h(X) = a(X) g(X) h(X) = a(X)(X^n + 1) = a(X) + X^n a(X).
Since deg a(X) ≤ k − 1, the powers X^k, X^{k+1}, ..., X^{n−1} do not appear in a(X) + X^n a(X),
i.e., the coefficients of X^k, X^{k+1}, ..., X^{n−1} in c(X) h(X) must be equal to zero:
Σ_{i=0}^{k} h_i c_{n−i−j} = 0,   1 ≤ j ≤ n − k,
from which we can set up n − k equations.

The Reciprocal of h(X) is defined as
X^k h(X^{−1}) = h_k + h_{k−1} X + ... + h_0 X^k.
It can be shown that this polynomial is also a factor of X^n + 1; thus, it generates an (n, n − k)
cyclic code, whose generator matrix is

    [ h_k h_{k−1} h_{k−2} ... h_0 0   0   ... 0   ]
H = [ 0   h_k     h_{k−1} ... h_1 h_0 0   ... 0   ]
    [ 0   0       h_k     ... h_2 h_1 h_0 ... 0   ]
    [ ...........................................  ]
    [ 0   0       0       ... 0   h_k ....... h_0 ]

with h_0 = h_k = 1.

As for a linear block code, any code word of C is orthogonal to every row of H
(cH^T = 0), so H is a parity-check matrix of the cyclic code. h(X) is called the parity
polynomial of the code; a cyclic code is also uniquely specified by h(X).
Remark: The polynomial X^k h(X^{−1}) generates the dual code of C, an (n, n − k) code.

Example 5.6: Find the dual code generator polynomial for the (7, 4) cyclic code generated
by g(X) = 1 + X + X^3.
k = 4, r = n − k = 7 − 4 = 3.
h(X) = (X^7 + 1)/g(X) = 1 + X + X^2 + X^4, and the dual code generator polynomial is
X^4 h(X^{−1}) = 1 + X^2 + X^3 + X^4,
which generates the (7, 3) dual code (with n − k = 7 − 3 = 4 parity-check digits).

§5.3 Encoder for Systematic Cyclic Codes

Find Remainder by Binary Polynomial Division


Recall the 3 steps to generate systematic cyclic code words:
Step 1: Multiply the message m(X) by X^{n−k}:
X^{n−k} m(X) = m_0 X^{n−k} + ... + m_{k−1} X^{n−1}   (degree ≤ n − 1).
Step 2: Obtain the remainder b(X) from dividing X^{n−k} m(X) by g(X).
Step 3: Combine b(X) and X^{n−k} m(X) to form the systematic code word.
In the 2nd step, assuming that m_{k−1} = 1, the remainder can be found by considering the
calculation of
X^{n−1} / (X^r + g_{r−1} X^{r−1} + ... + g_1 X + 1).

All 3 steps can be accomplished with a division circuit built from an (n − k)-stage register with
feedback based on g(X). The mechanics of the division process have a simple
implementation for binary polynomials. We assume that the bits are transmitted serially,
with the highest power of X being transmitted first.

We illustrate the mechanism using n = 7 and r = 3, i.e., the (7,4) code.

Consider the term m_3 X^6 (the highest-order term of X^3 m(X)). In the first division cycle,
X^6 is divided by X^3 + g_2 X^2 + g_1 X + 1; the remainder after this cycle is
g_2 X^5 + g_1 X^4 + X^3, represented by the register vector
S_1 = (g_2, g_1, 1)^T;   in the general case, S_1 = (g_{r−1}, g_{r−2}, ..., g_1, g_0)^T.

In the next division cycle we have g_2 X^5 + g_1 X^4 + X^3 divided by X^3 + g_2 X^2 + g_1 X + 1,
and the remainder after this cycle is

      [ g_2 1 0 ]
S_2 = [ g_1 0 1 ] S_1 = Φ S_1.
      [ 1   0 0 ]

In the general case,

    [ g_{r−1} 1 0 ... 0 ]
    [ g_{r−2} 0 1 ... 0 ]
Φ = [ ...              ]      (first column: g_{r−1}, ..., g_0; remaining columns:
    [ g_1     0 0 ... 1 ]       I_{(r−1)×(r−1)} on top of a row of zeros),
    [ g_0     0 0 ... 0 ]

and S_2 = Φ S_1.
The process continues 2 more times, for a total of k cycles (k = 4 here):
S_3 = Φ S_2 and S_4 = Φ S_3.
The process for the term m_2 X^5 is the same, except that only k − 1 = 3 cycles are involved. The
same is true for each successive term in X^r m(X), with one less shift for each decrease in
the power of X.

For a general (n, k) code, we can represent the long-division process for the remainder
vector as
S_t = Φ S_{t−1} + (g_{r−1}, g_{r−2}, ..., g_1, g_0)^T m_{k−t},   t = 1, 2, ..., k,   S_0 = 0.

Example 5.7: For k = 4 and r = 3, find the remainder vector.

Homework: Write S3 and S4 for the (7,4) code.

Encoder Circuit
After obtaining the remainder, run Step 3: Combine b(X) and Xn –k
m(X) to form the
systematic code word.
In Example 5.7: For k = 4 and r = 3, design the encoder circuit.

[Encoder circuit for X^3 m(X) = X^3(m_3 X^3 + m_2 X^2 + m_1 X + m_0): a 3-stage shift register (D flip-flops) with feedback taps g_0, g_1, g_2. After the first shift the register holds S_1 = (g_0 m_3, g_1 m_3, g_2 m_3); after the second shift it holds S_2, and so on; after k shifts the register contents are the parity-check digits b_0, b_1, b_2, which are appended to the message to form the code word.]

For the general case, the encoder is an (n − k)-stage shift register with feedback connections
determined by g(X).

Homework: Find the encoding circuit for the (7, 4) code generated by g(X) = 1 + X + X^3.

Encoding a cyclic code can also be accomplished by using its parity polynomial,
h(X) = 1 + h_1 X + ... + h_{k−1} X^{k−1} + X^k.
Since h_k = 1 (see the parity-check relation derived earlier),
c_{n−k−j} = Σ_{i=0}^{k−1} h_i c_{n−i−j},   1 ≤ j ≤ n − k.   (1)
This is known as a difference equation.
For a systematic code, c = (c_0 c_1 ... c_{n−k−1} | c_{n−k} ... c_{n−1}), where the first n − k
positions are the parity-check binary digits and the last k positions are the information
binary digits m_0, ..., m_{k−1}.

Given the k information bits, (1) is a rule for determining the n − k parity-check digits
c_0, c_1, ..., c_{n−k−1}. The encoder circuit using the parity polynomial is a k-stage shift
register with feedback connections given by h(X) (with h_0 = 1).
The Encoding Operations can be described in the following steps:
Step 1: Initially, Gate 1 is turned on and Gate 2 is turned off. The k information
digits, m(X) = m_0 + m_1 X + ... + m_{k−1} X^{k−1}, are shifted into the register and into the
communication channel simultaneously.
Step 2: As soon as the k information bits have entered the shift register, Gate 1 is turned
off and Gate 2 is turned on. The first parity-check digit,
c_{n−k−1} = Σ_{i=0}^{k−1} h_i c_{n−1−i} = m_{k−1} + h_1 m_{k−2} + ... + h_{k−1} m_0,
is formed and appears at point P.

Step 3:
The register is shifted once. The first parity-check digit is shifted into the channel and
into the register. The second parity-check digit,
c_{n−k−2} = Σ_{i=0}^{k−1} h_i c_{n−2−i},
is formed and appears at point P.


Step 4: Step 3 is repeated until n-k parity-check digits have been formed and shifted into
the channel. Then, Gate 1 is turned on and gate 2 is turned off. The next message will be
shifted into the register.
Remark 1: This is a k-stage shift register.
Remark 2:
If r > k, the k-stage encoding circuit is more economical.
Otherwise, the (n-k)-stage encoding circuit is preferable.
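
A sketch of encoding through the difference equation (1), for the (7, 4) code with h(X) = (X^7 + 1)/g(X) = 1 + X + X^2 + X^4 (as computed in Example 5.6); it reproduces the code word of Example 5.4.

```python
def encode_with_h(msg, h, n, k):
    """Systematic cyclic encoding via c_{n-k-j} = sum_i h_i * c_{n-i-j} (mod 2)."""
    c = [0] * (n - k) + list(msg)          # last k positions carry the message
    for j in range(1, n - k + 1):          # compute parity digits one by one
        c[n - k - j] = sum(h[i] * c[n - i - j] for i in range(k)) % 2
    return c

h = [1, 1, 1, 0, 1]                        # h0..h4 for h(X) = 1 + X + X^2 + X^4
print(encode_with_h([1, 0, 0, 1], h, 7, 4))   # -> [0, 1, 1, 1, 0, 0, 1]
```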

Homework: Find the encoding circuit for the (7, 4) code, generated
by g(X) = 1 + X + X^3, based on h(X).

§5.4 Syndrome Computation and Error Correction

Definition of Syndrome
Cyclic codes are linear block codes. For a received word v = (v_0 v_1 ... v_{n−1}),
the Syndrome is defined as s = vH^T.

We know that cH^T = 0, so if s = vH^T = 0, v is a code word.
For cyclic codes: the Received Polynomial v(X) = v_0 + v_1 X + ... + v_{n−1} X^{n−1}
(degree ≤ n − 1) can be written as
v(X) = a(X) g(X) + s(X),
where g(X) has degree r = n − k and s(X) has degree ≤ r − 1.

The r = n − k coefficients of s(X) form the syndrome s; s(X) = 0 if and only if v(X)
is a code polynomial (a multiple of g(X)).

Syndrome Computation Circuit


s ( X ) is the remainder of the division v(X) / g(X). It can be computed with a division
circuit, which is identical to the (n-k)-stage encoding circuit, except that the received
polynomial is shifted into the register from the left end.

The received polynomial is shifted into the register with all stages initially set to zero.
As soon as v(X) has been shifted into the register, the content in the register form the
syndrome s(X).

Properties of Syndrome
Let s(X) be the syndrome of a received polynomial v(X). The remainder s^(1)(X) resulting
from dividing X s(X) by the generator polynomial g(X) is the syndrome of v^(1)(X), the
cyclic shift of v(X) (for the proof, see the definition of the syndrome). The syndrome
s^(1)(X) of v^(1)(X) can be obtained by shifting the (syndrome) register once, with s(X) as
the initial content and with the input gate disabled; this is equivalent to dividing X s(X)
by g(X).
In general, the remainder s^(i)(X) resulting from dividing X^i s(X) by the generator
polynomial g(X) is the syndrome of v^(i)(X), the i-th cyclic shift of v(X). This
property is useful in decoding cyclic codes. The syndrome s^(i)(X) of v^(i)(X) can be
obtained by shifting the (syndrome) register i times, with s(X) as the initial content and
with the input gate disabled; this is equivalent to dividing X^i s(X) by g(X).

Example 5.8: Find the syndrome circuit for the (7, 4) cyclic code generated by
g(X) = 1 + X + X^3. Suppose that the received vector is v = (0010110).
Calculate the syndrome and compare it with the contents of the shift register after the 7th
shift. Show the contents of the shift register with the input gate disabled and comment on
the result.

For v = (0010110), v(X) = X^2 + X^4 + X^5. The remainder of v(X)/g(X) is 1 + X^2, and so the
syndrome is
s(X) = 1 + X^2, or s = (101).
For the contents of the shift register, see the next table, which is related to the syndrome
circuit.
With the input gate disabled, the syndrome of v^(1) = (0001011) is obtained by shifting
the register once, the syndrome of v^(2) = (1000101) is obtained if we shift the register
twice, and so on.
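
Both the syndrome of Example 5.8 and the shifting property can be verified with a few lines of polynomial arithmetic (same bit-mask representation as before; these helper names are ours, not part of the notes):

```python
def gf2_mod(dividend, divisor):
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

g = 0b1011                                  # g(X) = 1 + X + X^3
v = 0b0110100                               # v = (0010110): bit i <-> v_i, so X^2+X^4+X^5
print(format(gf2_mod(v, g), '03b')[::-1])   # s(X) = 1 + X^2  ->  prints '101'

# Shift property: syndrome of the cyclic shift v^(1) equals remainder of X*s(X)
def cyclic_shift(word, n):
    return ((word << 1) | (word >> (n - 1))) & ((1 << n) - 1)

s  = gf2_mod(v, g)
v1 = cyclic_shift(v, 7)                     # v^(1) = (0001011)
print(gf2_mod(v1, g) == gf2_mod(s << 1, g)) # -> True
```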
Let c(X) be the transmitted code polynomial, and let
e(X) = e_0 + e_1 X + ... + e_{n−1} X^{n−1}
be the error pattern. Then the received polynomial is
v(X) = c(X) + e(X).
As c(X) = m(X) g(X) and v(X) = a(X) g(X) + s(X), then
e(X) = [a(X) + m(X)] g(X) + s(X).
The syndrome is computed from the received vector, and the decoder has to estimate the
error pattern e(X) based on the syndrome (the error pattern itself is not known at
the decoder). The syndrome is equal to the remainder of dividing the error pattern by the
generator polynomial.

Remark: One can notice that s(X) = 0 if and only if e(X) = 0 or
e(X) is identical to a code polynomial (the error pattern is a code word).
In the latter case, the error pattern is undetectable!

Remark: The error detection circuit is simply a syndrome circuit with an OR gate whose
inputs are the syndrome digits. If the syndrome is non-zero, the output of the OR gate is 1,
and the presence of errors has been detected.
CYCLIC CODES ARE VERY EFFECTIVE FOR DETECTING ERRORS,
RANDOM OR BURST !

Burst Error Patterns


Definition:
The Burst Length of an error polynomial e(X) is defined as the number of bits from the
first error term in e(X) to the last error term, inclusive.

Example: e(X) = X^3 + X^7 has burst length b = 7 − 3 + 1 = 5.

By definition, there can be only one burst in a block.

Example: e(X) = X^3 + X^7 + X^19 + X^20 has burst length b = 20 − 3 + 1 = 18,
and not two bursts of lengths 5 and 2.

Definition: An error pattern with errors confined to i high-order positions and l-i low-
order positions is also regarded as a burst of length l. This is called an end-around burst.

Example: e  (10100000001110) is an end-around burst of length 7.

CASE 1: Suppose that e(X) is a burst of length r = n − k or less. Then
e(X) = X^j B(X),  with deg B(X) ≤ n − k − 1.
Because deg B(X) < deg g(X), g(X) is not a factor of B(X). Also, X is not a factor of g(X),
since g(X) divides X^n + 1. Therefore
e(X) = X^j B(X) is not divisible by g(X),
or, equivalently, the syndrome caused by e(X) is not equal to zero.
The (n, k) cyclic code is capable of detecting any error burst of length n − k or less.

CASE 2: Suppose that e(X) is a burst of length r + 1 = n − k + 1, and let it start from the i-th
position; thus it ends at the (i + n − k)-th position. Errors are confined to e_i, e_{i+1}, ..., e_{i+n−k},
with e_i = e_{i+n−k} = 1.
There are 2^{n−k−1} such bursts (the error bits in the first and last positions are 1, and only
the remaining n − k + 1 − 2 = n − k − 1 positions can take any value, 0 or 1). Among these,
only one cannot be detected (zero syndrome), namely
e(X) = X^i g(X).
The fraction of undetectable bursts of length n − k + 1 is therefore 2^{−(n−k−1)}.
Error Detection Capability


CASE 3: Suppose that e(X) is a burst of length l > n − k + 1 (= r + 1). There are 2^{l−2}
such bursts (the bits in the first and last positions are 1, and only the l − 2 positions in
between can take any value, 0 or 1). Among these, the undetectable ones (zero syndrome)
must be of the form
e(X) = X^i a(X) g(X),
with deg g(X) = n − k and
a(X) = a_0 + a_1 X + ... + a_{l−(n−k)−1} X^{l−(n−k)−1},  a_0 = a_{l−(n−k)−1} = 1.

The number of such bursts is 2^{l−(n−k)−2}.

The fraction of undetectable burst errors of length l is therefore
2^{l−(n−k)−2} / 2^{l−2} = 2^{−(n−k)}.
Example 5.9: Analyze the error detection capability of the (7, 4) cyclic code generated by
g(X) = 1 + X + X^3.

The minimum Hamming distance for this code is 3; thus, the code can detect up to 2
random errors (see the relation between d_min and t_d).
In total, it detects 2^n − 2^k = 128 − 16 = 112 error patterns.
The code can detect any burst error of length n − k = 3 or less.
It also detects many bursts of length greater than 3.
The fraction of undetectable bursts of length n − k + 1 = 4 is 2^{−(n−k−1)} = 2^{−2} = 1/4.
The fraction of undetectable bursts of length greater than 4 is 2^{−(n−k)} = 2^{−3} = 1/8.
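
The claim that every burst of length n − k = 3 or less is detected, and the fraction 1/4 for bursts of length 4, can be verified exhaustively (end-around bursts are ignored here for brevity). A sketch:

```python
def gf2_mod(dividend, divisor):
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

g, n = 0b1011, 7                          # (7, 4) code, g(X) = 1 + X + X^3

def bursts(length):
    """All (non-wraparound) burst error patterns of exactly this length."""
    for start in range(n - length + 1):
        if length == 1:
            yield 1 << start
        else:
            for inner in range(2 ** (length - 2)):
                e = 1 | (inner << 1) | (1 << (length - 1))   # first and last bit = 1
                yield e << start

for L in (1, 2, 3, 4):
    patterns = list(bursts(L))
    undetected = sum(1 for e in patterns if gf2_mod(e, g) == 0)
    print(L, undetected, "/", len(patterns))
# lengths 1-3: 0 undetected (all bursts of length <= n-k are detected);
# length 4: 4 of 16 undetected, i.e. a fraction 1/4 = 2^-(n-k-1)
```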

Cyclic Redundancy Check (CRC) Codes


CRC codes are error-detecting codes typically used in ARQ systems. A CRC has no error
correction capability, but it can be used in combination with an error-correcting code.
The error control system is then in the form of a concatenated code:

Tx: CRC encoder → error-correction encoder
Rx: error-correction decoder → CRC syndrome checker

§5.5 Decoding of Cyclic Codes

Decoding Steps
The decoding process consists of three steps, as for decoding of linear block codes. These
are:
i) syndrome computation;
ii) association of the syndrome with an error pattern;
iii) error correction.
Syndrome Computation: The syndrome for cyclic codes can be computed with a
division circuit whose complexity is linearly proportional to the number of parity check
binary digits, i.e., n-k.
Error Corrections: The error-correction step is simply adding (mod-2) the error-pattern
to the received vector (exclusive-or gate).
The association of the syndrome with an error pattern can be completely specified by a
decoding table. A straightforward approach to the design of a decoding circuit is therefore
a combinational logic circuit that implements this table look-up procedure. However,
the limitation of this approach is that the complexity tends to grow exponentially with the code
length and with the number of errors to be corrected.
Cyclic Codes have considerable algebraic properties, which allow a low complexity
structure of the encoder. The cyclic structure of a cyclic code allows us to decode a
received vector v(X) serially. The received digits are decoded one at a time, and each
digit is decoded with the same circuitry.

Decoding Circuit (Decoder)

Two Cases
As soon as the syndrome has been computed, the decoding circuit checks whether the
syndrome s(X) corresponds to a correctable error pattern e(X) = e_0 + e_1 X + ... + e_{n−1} X^{n−1}
with an error at the highest-order position X^{n−1}, i.e., with e_{n−1} = 1.

CASE I: If s(X) does not correspond to an error pattern with e_{n−1} = 1, the received
polynomial and the syndrome register are cyclically shifted once, simultaneously. We
obtain
v^(1)(X) = v_{n−1} + v_0 X + ... + v_{n−2} X^{n−1},
and the syndrome register forms s^(1)(X), the syndrome of v^(1)(X).

Now the second received digit, v_{n−2}, occupies the highest-order position of v^(1)(X).
The same decoding circuit checks whether s^(1)(X) corresponds to an error at location X^{n−1}.

CASE II: If s(X) of v(X) does correspond to an error pattern with e_{n−1} = 1, the first
received digit v_{n−1} is erroneous and must be corrected. The correction is carried
out by forming the sum v_{n−1} + e_{n−1}.
This correction results in a modified received polynomial
v_1(X) = v_0 + v_1 X + ... + v_{n−2} X^{n−2} + (v_{n−1} + e_{n−1}) X^{n−1}.
The effect of e_{n−1} on the syndrome is removed from the syndrome s(X). v_1(X) and the
syndrome register are then cyclically shifted once, simultaneously. The resulting polynomial is
v_1^(1)(X) = (v_{n−1} + e_{n−1}) + v_0 X + ... + v_{n−2} X^{n−1}.
Its syndrome, s_1^(1)(X), is the remainder resulting from dividing X[s(X) + X^{n−1}]
by the generator polynomial g(X).

Proof:
v(X) = a(X) g(X) + s(X)
Error correction:  v(X) + X^{n−1} = a(X) g(X) + s(X) + X^{n−1}
Shift once:  X v(X) + X^n = X a(X) g(X) + X s(X) + X^n,

so that the remainder of dividing X v(X) + X^n by g(X) equals the remainder of dividing
X[s(X) + X^{n−1}] by g(X), which is s^(1)(X) + 1, because g(X) divides X^n + 1.

Therefore, if a 1 is added to the left end of the syndrome register while it is shifted, we
obtain s_1^(1)(X). The decoding circuitry then proceeds to decode v_{n−2}. Whenever an error is
detected and corrected, its effect is removed from the syndrome.
Remarks:
 The decoding stops after n shifts (= total number of binary bits in a received
word).
 If e(X) is a correctable error pattern, the contents of the syndrome register is zero
at the end of the decoding operation, and the received vector has been correctly
decoded. Otherwise, an uncorrectable error pattern has been detected.
 This decoder applies in principle to any (n, k) cyclic code.
 But whether it is practical depends entirely on its error-pattern detection circuit.
In some cases this is a simple circuit.

Design Decoder
Example 5.10: Design the decoder for the (7,4) cyclic code generated by
g(X) = 1 + X + X^3.

d_min = 3.
The code is capable of correcting any single error over a block of 7 bits. There are 7 such error
patterns; these and the all-zero vector form all the coset leaders of the decoding table,
i.e., all correctable error patterns. Suppose that the received polynomial

v(X) = v_0 + v_1 X + ... + v_6 X^6

is shifted into the syndrome register from the left end.

Write the syndromes and error patterns in Table 1 on the next page.

We see that e_6(X) = X^6 is the only correctable error pattern with an error located at X^6. When this
error pattern occurs, the syndrome in the syndrome register is (101) after the entire v(X)
has entered the syndrome register. The detection of this syndrome indicates that v_6 is an
erroneous digit and must be corrected.
Suppose that the single error occurs at location X^i, i.e., e_i(X) = X^i, 0 ≤ i ≤ 6.
After the entire received polynomial has been shifted into the syndrome register, the
syndrome in the register will not be (101). However, after another 6 − i shifts, the contents of the
syndrome register will be (101), and the next received digit to come out of the buffer register
will be the erroneous digit. Only the syndrome (101) needs to be detected;
we use a 3-input AND gate.

In the sequel, we give an example of the decoding process when the code word
c = (1001011) (c(X) = 1 + X^3 + X^5 + X^6) is transmitted and v = (1011011)
(v(X) = 1 + X^2 + X^3 + X^5 + X^6) is received; a single error occurs at location X^2.
When the entire received polynomial has been shifted into the syndrome and buffer
registers, the syndrome register contains (001). After 4 more shifts, the content of the
syndrome register is (101), and the next digit to come out of the buffer is the
erroneous digit, v_2.
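
The behaviour traced above can be mimicked at the polynomial level: the syndrome after shifting the register i times is the remainder of X^i s(X) modulo g(X), and decoding proceeds until the pattern (101) appears. A sketch (the register itself is not simulated, only the successive syndromes):

```python
def gf2_mod(dividend, divisor):
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

g = 0b1011                                   # g(X) = 1 + X + X^3
v = 0b1101101                                # v = (1011011): bit i <-> v_i

s = gf2_mod(v, g)
print(format(s, '03b')[::-1])                # initial syndrome register: '001'

shifts = 0
while s != 0b101:                            # wait for the pattern (101) = 1 + X^2
    s = gf2_mod(s << 1, g)                   # one more shift of the syndrome register
    shifts += 1
print(shifts)                                # -> 4: the next digit out is v_2, which is corrected
```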

Ch 6 Convolutional Codes

§6.1 Description of Convolutional Codes

Compare with Linear Block Code


Convolutional codes are the second major form of error-correcting channel codes. They
differ from the linear block codes in both structural form and error correcting properties.
With linear block codes, the data stream is divided into a number of blocks of k binary
digits, each block is encoded into an n-bit code word. On the other hand, Convolutional
Codes convert the entire data stream into a single code word.
The code rate for linear block codes can be ≥ 0.95, but they have limited error correction capabilities. For convolutional codes, the code rate is usually below 0.9, but they have more powerful error-correcting capabilities and are well suited to very noisy channels with high raw error probabilities. Puncturing is used to achieve higher code rates.

Encoding
The source data is broken into frames of k0 bits per frame. M + 1 frames of source data are coded into an n0-bit code frame, where M is the Memory Depth of the shift register. Convolutional codes are encoded using shift registers. As each new data frame is read, the old data is shifted one frame to the right, and a new code word is calculated.
Characteristics of the code: Code Rate R = k0/n0, Constraint Length ν = M + 1.
For binary convolutional codes: k0=1

Example 6.1: For the R = 1/2, ν = 3 binary convolutional encoder shown below, determine its code polynomials.

R = 1/2 and ν = 3, so k0 = 1 (binary), n0 = 2, M = 2.


For each 1-bit (k0) frame of the input message m(X), we obtain a 2-bit (n0) code frame on the output, with one bit in c0(X) and one in c1(X). These are interleaved and sent as a two-bit symbol sequence.

We can associate two code polynomials,

c0(X) = m(X) g0(X)
c1(X) = m(X) g1(X)

The vector corresponding to the output is

C(X) = [c0(X) c1(X)] = m(X) [g0(X) g1(X)] = m(X) G(X)
For example, let the message be

m(X) = 1 + X + X^3
Let us assume that the highest power of X is the first symbol transmitted, and that we
first send c0, and then c1. Thus, the transmitted sequence is
C0(0) C1(0) C0(1) C1(1) … C0(t) C1(t) =
You can also input the message to the encoder directly to verify the result. The message
has 4 bits, i.e., ( 1 0 1 1), but the transmitted sequence contains 12 transmitted bits.
Therefore, the Code Rate is 4/12=1/3, not 1/2 !
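Since the encoder figure is not reproduced here, the following minimal encoding sketch assumes the generators g0(X) = 1 + X + X^2 and g1(X) = 1 + X^2 (the octal (7, 5) pair). This choice is an assumption, but it is consistent with the flush symbols 01 and 11 discussed in the next paragraph.

def conv_encode(msg_bits, g0=(1, 1, 1), g1=(1, 0, 1)):
    # rate-1/2 encoder; msg_bits are given first-transmitted first, and the
    # generator taps act on (current input, delay 1, delay 2)
    state = [0, 0]                       # the M = 2 memory cells
    out = []
    for u in list(msg_bits) + [0, 0]:    # append M zeros to flush the register
        window = [u] + state
        c0 = sum(a * b for a, b in zip(g0, window)) % 2
        c1 = sum(a * b for a, b in zip(g1, window)) % 2
        out.append((c0, c1))
        state = [u, state[0]]            # shift the register by one frame
    return out

print(conv_encode([1, 0, 1, 1]))         # message (1 0 1 1), highest power of X first
# -> [(1,1), (1,0), (0,0), (0,1), (0,1), (1,1)], i.e. 12 transmitted bits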

Effective Code Rate
In Example 6.1 the code rate came out as 1/3. The explanation is that the encoder has M = 2 memory elements and has to “flush” its buffer to complete the code sequence. The last two code symbols in the transmitted code sequence, i.e., 01 and 11, correspond to emptying the encoder’s shift register. The first 8 bits correspond to the 4 message bits at rate 1/2; counting the 4 flushing bits as well, the Effective Code Rate is 4/12 = 1/3. This reduction in the code rate is known as the Fractional Rate Loss.
For a convolutional code with rate R, K bits of information, and memory depth M, the Effective Code Rate is (for k0 = 1)

R_eff = K R / (K + M)

Convolutional codes are used efficiently when K >> M; then the effective code rate approaches the code rate.

Memory Depth and Constraint Length


For rate-R convolutional codes, the generator vector is defined as

G(X) = [g0(X) ... g_{n0-1}(X)]

and the vector of the code polynomials as

C(X) = m(X) G(X)
Convolutional codes are LINEAR, as the sum of two code polynomials is a code
polynomial. There is a strong similarity with cyclic codes, and convolutional codes have
some properties of cyclic codes.
Memory Depth: M, the number of memory elements in the encoder’s shift register.
Constraint Length: ν = M + 1. Sometimes it is defined as ν = M.
For a given code rate, if ν increases, a better error rate performance is obtained, at the expense of increased decoder complexity.

§6.2 Structure Properties of Convolutional Codes

State Diagram
The convolutional encoder is a “state machine” (it is convenient to represent its operation using a State Diagram). With M memory elements, it has 2^M states.

Example 6.2: Find the state diagram for the encoder in Example 6.1.

Since M = 2, we associate the 2^M = 4 states with the contents of the shift register.

State Diagram: used to analyze the performance of a convolutional code.
[State diagram figure: states S0–S3, each branch labeled input/output]
Trellis Diagram
The trellis diagram repeats the states at successive time steps and is used to analyze the performance of a convolutional code.
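Both the state diagram and the trellis are tabulations of the same next-state/output function of the encoder. A small sketch that enumerates it for the assumed (7, 5) encoder of Example 6.1, numbering the states by the register contents with the most recent bit as the least-significant bit (this labeling is an assumption):

def transitions(g0=(1, 1, 1), g1=(1, 0, 1)):
    # return {(state, input): (next_state, (c0, c1))} for the 4-state encoder;
    # state index = s1 + 2*s2, where s1 is the most recent input bit
    table = {}
    for s in range(4):
        s1, s2 = s & 1, (s >> 1) & 1
        for u in (0, 1):
            c0 = (g0[0] * u + g0[1] * s1 + g0[2] * s2) % 2
            c1 = (g1[0] * u + g1[1] * s1 + g1[2] * s2) % 2
            table[(s, u)] = (u + 2 * s1, (c0, c1))   # new bit becomes s1, old s1 becomes s2
    return table

for (s, u), (ns, (c0, c1)) in sorted(transitions().items()):
    print(f"S{s} --{u}/{c0}{c1}--> S{ns}")
# e.g. S0 --1/11--> S1 and S2 --0/11--> S0, matching the branch labels used below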

Adversary Paths
The error-correcting property of a convolutional code is determined by the adversary
paths through the trellis. Adversary Paths: the paths that begin in the same state and
end in the same state, and have no state in common at any step between the initial and
final states.

Adversary Paths and Hamming Distance


For the following paths,
P0 : S0 → S0 → S0 → S0 → S0
P1 : S0 → S1 → S2 → S0 → S0
P2 : S0 → S1 → S3 → S2 → S0
P3 : S0 → S0 → S1 → S2 → S0

(P0, P1) are adversaries from time index
(P2, P3) are adversaries from time index
(P0, P3) are adversaries from time index
(P0, P2) are adversaries from time index

Performance is based on the Hamming distance d_H(c_i, c_j) between the code sequences of the adversary paths in the trellis. As we can see in this simple example, the number of adversary paths grows quickly, and we wonder how we can handle the combinatorics involved. The trellis path analysis is simplified in the case of linear codes: in that case, the Hamming distance between two code sequences in the trellis is equivalent to the Hamming distance between some code word and the all-zero code sequence.

Transfer function
This information can be found using the transfer function. We will show only the non-zero adversary paths which begin and end in state S0. We modify the state diagram by removing the self-loop at the S0 state and adding a new node S0(e), representing the termination of the non-zero adversary paths.
Transfer Function Operators
Consider the branch S0 → S1, labeled 1 | 11: a source symbol of weight 1 produces a code symbol of weight 2.
Source Symbol Weight Operator: N
Code Symbol Weight Operator: D
Time Index Operator: J
For this case, D^2 N J is the operator for the transition (the exponent of D is the number of “1” bits in the code symbol, the exponent of N is the number of “1” bits in the source symbol, and each transition contributes one factor of J).
Example 6.3: Write the transfer operators for each branch of the state diagram.

Results are:

We can solve for the transfer function for all possible paths starting at S0 and ending at S0,
by writing a set of state equations for the transfer function diagram.

with X0 and X0(e) denoting the beginning and ending state S0, respectively. The transfer function T(J, N, D) is found by solving this set of equations for X0(e), with X0 = 1, using linear algebra:

T(J, N, D) = D^5 N J^3 / (1 − D N J (1 + J))
To see the individual adversary paths, apply long division:

T(J, N, D) = D^5 N J^3 + D^6 N^2 J^4 (1 + J) + ...

Proof: Check that T(J, N, D) · [1 − D N J (1 + J)] = D^5 N J^3.


The transfer function supplies us with all the information we need to completely characterize the structure and performance of the code. For example,

The term D^6 N^2 J^4 (1 + J) shows that there are exactly two paths of Hamming weight 6, and both paths correspond to source sequences of Hamming weight 2. One is reached in 4 transitions, and the other one in 5. With this information, the two paths are found to be
S0 → S1 → S3 → S2 → S0
S0 → S1 → S2 → S1 → S2 → S0

§6.3 Decoding Methods

Viterbi Algorithm
Convolutional codes are employed when significant error correction capability is required. In such cases, the decoding cannot be carried out using the syndrome method and shift-register circuits; a more powerful method is needed. Such a method was introduced
by Viterbi (1965) and quickly became known as the Viterbi algorithm. The Viterbi
Algorithm is of major practical importance, and we will introduce it primarily by means
of examples.
We have seen that a convolutional code with constraint length ν = M + 1 has 2^M states in its trellis. One way to view the Viterbi decoder is to construct it as a network of simple, identical processors, with one processor for each state in the trellis.
For example: ν = 3, M = 2, so it needs 2^2 = 4 states.
Example of node processor: It receives inputs from the node processors S0 and S2, and
supplies outputs for node processors S0 and S1.

[Trellis figure: states S0–S3 at times t, t + 1, t + 2]
Each processor does the following: 1) It monitors the received code sequence, y(X), which can be written as y(X) = c(X) + e(X), and calculates a number (a likelihood metric) that is related to the probability that the received sequence arises from a particular transmitted sequence. The likelihood metric is the accumulated Hamming distance between the received sequence and the expected transmitted sequence. The larger the distance, the less likely it is that this processor is decoding the true transmitted message.
2) Each processor must supply, as an output, its likelihood metric to each node processor connected to its output side. 3) For each of its input paths, the node processor must calculate the Hamming distance between the n0-bit code symbol y and the n0-bit code symbol it should have received if the path of the transmitted message had just made that transition (likelihood update). It adds the likelihood update to the likelihood supplied to it by the source node processor. It selects the path associated with the input-side processor having the smallest accumulated Hamming distance (the most likely path). 4) Based on which path is selected, the processor must decode the message associated with the selected path and update a record (called the Survivor Path Register) of all of the decoded message bits associated with the selected path.

The Survivor Path Register Method of Viterbi Decoding

Example 6.4: Assume that we have the convolutional code discussed in Example 6.1. At time t, assume that the processors have the following initial conditions:

Node Processor (S_i, i = 0, 1, 2, 3)   Likelihood Metric (μ_i)   Survivor Path Register
S0                                      3                         000100xxxxx
S1                                      3                         111001xxxxx
S2                                      1                         101110xxxxx
S3                                      2                         111011xxxxx

Assume that the received code-word symbol at time t is y=11. Find the resulting
likelihoods and survivor path registers for each of the node processors at time t+1.

Write down the trellis diagram (see the example discussed earlier).

For node S0, with y = 11:
  from S0, branch 0/00: metric 3 + d(11, 00) = 3 + 2 = 5
  from S2, branch 0/11: metric 1 + d(11, 11) = 1 + 0 = 1
Thus, processor S0 selects the transition S2 → S0 as the most likely transition. The resulting register for S0 becomes 1011100xxxx and the new likelihood becomes 1.
Now, let us look at node S1.
For node S1, with y = 11:
  from S0, branch 1/11: metric 3 + d(11, 11) = 3 + 0 = 3
  from S2, branch 1/00: metric 1 + d(11, 00) = 1 + 2 = 3
The likelihoods are tied. The node processor has no statistical way to choose between the paths. It resolves this dilemma by “tossing a coin”. Let us say that the node processor S2 “wins the toss”. The survivor path is then 1011101xxxx and the new likelihood metric becomes 3.
The same procedure applies for the S2 and S3 node processors.
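A compact sketch of this add–compare–select step, using the numbers of Example 6.4 and the assumed (7, 5) branch labels (S0 is fed by S0 via 0/00 and by S2 via 0/11; S1 is fed by S0 via 1/11 and by S2 via 1/00):

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def acs(y, candidates):
    # candidates: (predecessor metric, predecessor register, expected symbol, decoded bit);
    # ties are broken here in favour of the last-listed candidate, which happens to
    # reproduce the "coin toss" outcome in the text
    best = None
    for metric, register, expected, bit in candidates:
        m = metric + hamming(y, expected)
        if best is None or m <= best[0]:
            best = (m, register + [bit])
    return best

y = (1, 1)                                     # received symbol at time t
mu = {0: 3, 1: 3, 2: 1, 3: 2}                  # initial likelihood metrics
reg = {0: [0, 0, 0, 1, 0, 0], 2: [1, 0, 1, 1, 1, 0]}   # survivor registers of S0 and S2

print(acs(y, [(mu[0], reg[0], (0, 0), 0),      # S0 -> S0 via 0/00
              (mu[2], reg[2], (1, 1), 0)]))    # S2 -> S0 via 0/11
# -> (1, [1, 0, 1, 1, 1, 0, 0]): S2 wins, new metric 1, register 1011100...

print(acs(y, [(mu[0], reg[0], (1, 1), 1),      # S0 -> S1 via 1/11
              (mu[2], reg[2], (0, 0), 1)]))    # S2 -> S1 via 1/00
# -> (3, [1, 0, 1, 1, 1, 0, 1]): the 3-vs-3 tie goes to S2, register 1011101...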
Example 6.5: For the convolutional code discussed in Example 6.1, assume that it is known that the encoder’s initial state is S0. Decode the received sequence 10 10 00 01 10 01.
Since we know the initial state, we initialize the likelihoods to μ0 = 0 and μ1 = μ2 = μ3 = ∞ (actually, any large numbers will do).

The trellis figure shows the result of applying the Viterbi algorithm. The solid lines are the selected paths, the dashed lines are rejected paths; T marks a tied path, and the branch metric is shown above each branch. The accumulated Hamming distances are indicated below each node. The first two steps are easy since we know that S0 always wins (the other μ are large). The results of steps 3–6 are:

t = 3S0 : 000xxx t = 4S0 : 0000xx


 
S1 : 101xxx S1 : 0001xx
 
S2 : 010xxx S2 : 0110xx
S3 : 011xxx S3 : 1011xx

t = 5 S 0 : 01100x t = 6 S 0 : 101100
 
 S1 : 00001x  S1 : 101101
 
 S 2 : 10110x  S 2 : 101110
 S 3 : 10111x  S 3 : 101111

After the 3rd step, we cannot decide on the correct decoding of even the 1st bit (since the 4 path registers disagree on what this bit should be). By the 6th step, all 4 survivor registers agree on the first 4 decoded bits. Why? If you trace back from t = 6, all surviving paths join together at t = 4. However, note the tie: this result depends on how the tie is resolved!
After the algorithm has a chance to observe a sufficient number of received symbols, it is
able to use the sequence of information to pick the globally most likely transmitted
sequence.
Notice that the path selection for the first 4 steps through the trellis cannot be changed by any further decisions the node processors may make. This is because all the node processors now agree on the first four steps.
Received: 10 10 00 01 10 01
Most Likely: 11 10 00 01 ?? ??
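A self-contained hard-decision Viterbi sketch for this example, again under the assumed (7, 5) trellis. Ties are broken toward the higher-numbered predecessor, which happens to reproduce the survivor table quoted for t = 6; a different tie rule changes some intermediate registers, as noted above.

INF = 10**6    # "any large number will do" for the states we know we are not in

# branch table of the assumed trellis: (state, input) -> (next state, output symbol)
TRELLIS = {(0, 0): (0, (0, 0)), (0, 1): (1, (1, 1)),
           (1, 0): (2, (1, 0)), (1, 1): (3, (0, 1)),
           (2, 0): (0, (1, 1)), (2, 1): (1, (0, 0)),
           (3, 0): (2, (0, 1)), (3, 1): (3, (1, 0))}

def viterbi(received, start=0, n_states=4):
    metric = [0 if s == start else INF for s in range(n_states)]
    paths = [[] for _ in range(n_states)]
    for y in received:
        new_metric, new_paths = [None] * n_states, [None] * n_states
        for (prev, u), (nxt, out) in TRELLIS.items():
            m = metric[prev] + sum(a != b for a, b in zip(y, out))
            if new_metric[nxt] is None or m <= new_metric[nxt]:   # compare-select
                new_metric[nxt], new_paths[nxt] = m, paths[prev] + [u]
        metric, paths = new_metric, new_paths
    return metric, paths

rx = [(1, 0), (1, 0), (0, 0), (0, 1), (1, 0), (0, 1)]
metric, paths = viterbi(rx)
for s in range(4):
    print(f"S{s}: metric {metric[s]}, survivor {''.join(map(str, paths[s]))}")
# -> survivors 101100, 101101, 101110, 101111: all four agree on the first
#    four decoded bits, 1011, as in the t = 6 table above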
In any practical implementation of the Viterbi algorithm, we must use a finite number of
bits for the survivor path register. This is called the Decoding Depth.
If we use too few bits, the performance of the algorithm will be hurt by having to force the decoding decisions when we run out of register bits. In such a case, the “most likely” bits are those that lead to the best likelihood metric. Most of the time this will result in correct decoding, but sometimes it will not. An erroneous decision of this kind is called a Truncation Error.
How many bits of decoding depth are required to make the probability of
truncation error negligible?
Forney (1970) gave the answer to this question. Answer: about 5.8 times the number of bits in the encoder’s shift register, i.e., a decoding depth of roughly 5.8 M bits.

Practical Implementation for Long Code Sequences with a Large Number of Errors
When the number of errors is large (this is why we use convolutional codes), the arithmetic circuits can run out of bits for representing the likelihoods. Notice that all node decisions are relative decisions. A strategy for dealing with arithmetic overflow is to occasionally subtract the value of the lowest likelihood from each node processor’s likelihood. This leaves the relative likelihoods unchanged, while limiting the range of the likelihood numbers each node processor must be able to express.
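A one-function sketch of this normalization trick: subtracting the smallest metric from every node leaves all compare–select decisions unchanged while keeping the stored numbers small.

def normalize(metrics):
    # subtract the current minimum so the stored values stay bounded
    m = min(metrics)
    return [x - m for x in metrics]

print(normalize([4, 4, 2, 4]))   # -> [2, 2, 0, 2]; the relative ordering is preserved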

The Traceback Method of Viterbi Decoding


Each node manages a survivor path register, in which the node processor’s best estimate is stored at each moment. This method is easy to understand, but it is not an effective way of keeping track of the decoded message when a high-speed decoder is required. The survivor path registers must be interconnected to permit parallel transfer, and this interconnection is very costly to implement. The Traceback Method is an alternative way of keeping track of the decoded message sequence. This method is very popular, as its implementation in integrated circuits is more cost effective. The method exploits a priori information that the decoder has about the trellis structure of the code.
Basic Idea (for example, with M = 2):
[Figure: the encoder shift register in state S2, content (0 1)]

The “1” which appears at moment t on the output was actually applied at moment t − 2 on the input. We can use the content of the last delay in the register to decode the message, but there is a delay of M clock cycles. Since we already have a delay of at least 5.8 M clock cycles to avoid truncation errors in the Viterbi algorithm, this additional decoding delay is a small price to pay for obtaining a lower-cost hardware solution.

Instead of transferring the contents of the survivor register, each node processor is assigned a unique register in which we store a single bit. This is the last bit of the state picked by that node processor as the survivor path (in the previous example this is “1”). As we deal with binary codes, each node has two inputs (two path choices). The bit that can be chosen is different for the two possible paths (see the trellis diagram). This will always be true with the state-naming convention we are using.

Trellis Diagram
Only the surviving path decisions are shown at each time step. The solid line is the survivor path agreed on by all four node processors at the last time step shown in the figure below.

The entries into each node processor’s traceback (i.e., survivor path) register at each
trellis step are shown in the figure. The traceback process is also illustrated. It begins at
the far right side of the figure and proceeds backwards in time. Once the traceback is
completed, the decoded bit sequence is read from left to right. The path traces back to
state “00” (S0). Whatever else may have happened during the time prior to the start of
the figure, we know that the last 2 bits leading into state “00” must have been “0,0”, so,
the decoded message sequence corresponding to the solid line must be . The
last two message bits, corresponding to the final 2 steps through the trellis have not
been decoded yet (due to the extra decoding lag mentioned above).
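A minimal traceback sketch under the same state convention as before (state index = s1 + 2·s2, with s1 the most recent input bit; this labeling is an assumption). decisions[t][s] holds the single bit stored by node s at step t, namely the oldest bit of the predecessor it selected. The example decisions array below is hypothetical and is filled in only along the path S0 → S1 → S3 → S2 → S0.

def traceback(decisions, end_state):
    # walk backwards through the stored decision bits, rebuilding the state
    # sequence, then read the message off the visited states
    state, states = end_state, [end_state]
    for dec in reversed(decisions):
        state = (state >> 1) | (dec[state] << 1)   # reconstruct the predecessor state
        states.append(state)
    states.reverse()
    return [s & 1 for s in states[1:]]             # each state's newest bit is the input bit

# hypothetical decision bits for 4 trellis steps (index = state number);
# only the entries visited by the traceback matter here
decisions = [[0, 0, 0, 0],    # step 1: S1 chose predecessor S0
             [0, 0, 0, 0],    # step 2: S3 chose predecessor S1
             [0, 0, 1, 0],    # step 3: S2 chose predecessor S3
             [1, 0, 0, 0]]    # step 4: S0 chose predecessor S2
print(traceback(decisions, end_state=0))   # -> [1, 1, 0, 0]

In a real decoder the decision bits come from the compare-select step, and the last M decoded bits are still pending, exactly as described above.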

§6.4 Approaches to Increase Code Rate

Code Rate of Convolutional Codes


Let d0, d1, d2, ... be the set of all possible Hamming distances between adversaries in the transfer function of the convolutional code, such that d0 < d1 < d2 < ...
The minimum distance d0 is called the minimum free distance, d_f. The performance of a convolutional code is determined by its minimum free distance. Convolutional codes provide very powerful error correction capability, at the price of a low code rate.
Some examples:

Using Nonbinary Convolutional Codes
So far we have been looking only at convolutional codes with rate 1/n0 ( low R ). If the
source frame is increased to some k0 >1, we can achieve a rate k0/n0 convolutional code.
Example 6.6: Find the code rate and Trellis diagram for the 2-source frame encoder
shown on next page.

R= ; df = 3, it is a 4-ary code.

In Example 6.6, the number of inputs to each trellis node processor is equal to 4 (a disadvantage!). In general, a k0/n0 convolutional code requires each node to deal with 2^k0 input paths, so the complexity of the Viterbi decoder increases geometrically with k0. This is a severe problem, and non-binary convolutional codes are not popular.

Using Punctured Convolutional Codes


An alternative way to increase the code rate is puncturing. We start with a rate-1/n0 convolutional code, such as a rate-1/2 code. The transmitted code word corresponding to the rate-1/2 code is (c0 c1). We delete one of the code bits every 2 code symbols. Thus, the code sequence:

The deleted code bits are not transmitted.


On average, 3 code bits are transmitted for every two message bits, which yields a rate-2/3 code. Deleting code bits is called Puncturing the code. The rate is increased at the expense of reducing the minimum free distance.
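A small sketch of the deletion step. The puncturing pattern used here (keep both bits of the first symbol, drop c1 of the second) is one common choice and should be treated as an assumption, since the text does not fix which bit is deleted.

def puncture(symbols, pattern=((1, 1), (1, 0))):
    # symbols: (c0, c1) pairs from the rate-1/2 encoder;
    # pattern[i % period][j] == 1 means bit j of symbol i is transmitted
    period = len(pattern)
    out = []
    for i, sym in enumerate(symbols):
        out.extend(b for b, keep in zip(sym, pattern[i % period]) if keep)
    return out

base = [(1, 1), (1, 0), (0, 0), (0, 1)]   # four rate-1/2 symbols for four message bits
print(puncture(base))                     # -> [1, 1, 1, 0, 0, 0]: 6 bits for 4 message
                                          #    bits, i.e. rate 2/3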
However, it is not fair to compare the d_f of a punctured code with the d_f of the base code. Instead, we should compare it with the d_f of a non-binary code with the same R and the same number of elements in its encoder. Cain showed that there are punctured codes with the same d_f as the best known non-binary codes of the same rate and memory depth. Punctured codes with rates up to 9/10 are known. Punctured codes are still linear codes (but no longer shift invariant).
Example 6.7: Use the same encoder as in Example 6.1, but with c1 punctured in every second code word. Find its code rate and trellis diagram.

Code rate R = 2/3 (from a base code with R = 1/2), d_f = 3.
The state diagram of such a code requires 8 states rather than 4 (4 for the “time-even” trellis states and 4 for the “time-odd” trellis states). The Viterbi algorithm still requires only 4 node processors (M = 2).
The Puncturing Period: the number of bits encoded before returning to the base code.
Example 6.8: Find the puncturing period of the punctured code (7, 5), 7.
Punctured codes are specified in a manner similar to the octal generator notation. In (7, 5), 7, the pair (7, 5) gives the two generator polynomials in octal, and the trailing 7 indicates that the second message bit is encoded using only the generator polynomial 7, thus R = 2/3.
Code sequence:

Here the puncturing period is equal to 2. The decoder requires 4 node processors (M = 2), and the state diagram contains 8 (= 4 × 2, i.e., 4 states times the puncturing period) states.
Example 6.9: Find the puncturing period of the punctured code (15, 17), 15, 17.
In (15, 17), 15, 17, the pair (15, 17) gives the two generator polynomials in octal. The second message bit is encoded using only the generator polynomial 15, whereas the third message bit is encoded using only the generator polynomial 17, thus R = 3/4.
Code sequence:
Here the puncturing period is equal to 3. The Viterbi decoder requires 8 node processors (M = 3), and the state diagram contains 24 (= 8 × 3, i.e., 8 states times the puncturing period) states.
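Reading Examples 6.8 and 6.9 as puncturing patterns (rows = the two generators, columns = the message bits within one period, 1 = transmit) gives the rates directly; the exact patterns below are my reading of the examples, not taken from a table in the text.

P_7_5 = [[1, 1],          # generator 7: used for both message bits
         [1, 0]]          # generator 5: punctured on the second message bit
P_15_17 = [[1, 1, 0],     # generator 15: not used for the third message bit
           [1, 0, 1]]     # generator 17: not used for the second message bit
for name, P in [("(7,5),7", P_7_5), ("(15,17),15,17", P_15_17)]:
    k, n = len(P[0]), sum(map(sum, P))
    print(f"{name}: {k} message bits -> {n} transmitted bits, R = {k}/{n}")
# -> R = 2/3 and R = 3/4, matching the rates derived above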
The punctured codes presented here are punctured versions of known good rate-1/2 codes. However, it is not always true that puncturing a good rate-1/n0 code yields a good punctured code. There is no known systematic procedure for generating good punctured convolutional codes; good codes are discovered by computer search.

Some examples of good punctured codes:
