Bits, Signals, and Packets
Hari Balakrishnan
Christopher J. Terman
George C. Verghese
M.I.T. Department of EECS
2008-2012
© H. Balakrishnan
Contents
List of Figures
List of Tables
1 Introduction
1.1 Themes
1.2 Outline and Plan
List of Figures
3-1 Variable-length code from Figure 2-2 shown in the form of a code tree.
3-2 An example of two non-isomorphic Huffman code trees, both optimal.
3-3 Results from encoding more than one grade at a time.
3-4 Pseudo-code for the LZW adaptive variable-length encoder. Note that some details, like dealing with a full string table, are omitted for simplicity.
3-5 Pseudo-code for the LZW adaptive variable-length decoder.
3-6 LZW encoding of the string "abcabcabcabcabcabcabcabcabcabcabcabc".
3-7 LZW decoding of the sequence a, b, c, 256, 258, 257, 259, 262, 261, 264, 260, 266, 263, c.
5-1 Probability of a decoding error with the replication code that replaces each bit b with n copies of b. The code rate is 1/n.
6-1 An example of a convolutional code with two parity bits per message bit and a constraint length (shown in the rectangular window) of three, i.e., r = 2, K = 3.
6-2 Block diagram view of convolutional coding with shift registers.
6-3 State-machine view of convolutional coding.
6-4 When the probability of bit error is less than 1/2, maximum-likelihood decoding boils down to finding the message whose parity bit sequence, when transmitted, has the smallest Hamming distance to the received sequence. Ties may be broken arbitrarily. Unfortunately, for an N-bit transmit sequence, there are 2^N possibilities, far too many to check one by one. For instance, when N = 256 bits (a really small packet), the number of possibilities rivals the number of atoms in the universe!
6-5 The trellis is a convenient way of viewing the decoding task and understanding the time evolution of the state machine.
7-1 The trellis is a convenient way of viewing the decoding task and understanding the time evolution of the state machine.
7-2 The branch metric for hard decision decoding. In this example, the receiver gets the parity bits 00.
7-3 The Viterbi decoder in action. This picture shows four time steps. The bottom-most picture is the same as the one just before it, but with only the survivor paths shown.
7-4 The Viterbi decoder in action (continued from Figure 7-3). The decoded message is shown. To produce this message, start from the final state with smallest path metric and work backwards, and then reverse the bits. At each state during the forward pass, it is important to remember the arc that got us to this state, so that the backward pass can be done properly.
7-5 Branch metric for soft decision decoding.
7-6 The free distance of a convolutional code.
7-7 Error correcting performance results for different rate-1/2 codes.
7-8 Error correcting performance results for three different rate-1/2 convolutional codes. The parameters of the three convolutional codes are (111, 110) (labeled "K = 3 glist=(7, 6)"), (1110, 1101) (labeled "K = 4 glist=(14, 13)"), and (111, 101) (labeled "K = 3 glist=(7, 5)"). The top three curves below the uncoded curve are for hard decision decoding; the bottom three curves are for soft decision decoding.
List of Tables
6-1 Examples of generator polynomials for rate-1/2 convolutional codes with different constraint lengths.
MIT 6.02 DRAFT Lecture Notes
Last update: September 6, 2011
Comments, questions or bug reports?
Please contact hari at mit.edu
CHAPTER 1
Introduction
Our mission is to expose you to a variety of different technologies and techniques in elec-
trical engineering and computer science. We will do this by studying several salient prop-
erties of digital communication systems, learning both important aspects of their design,
and also the basics of how to analyze their performance. Digital communication systems
are well-suited for our goals because they incorporate ideas from a large subset of electrical
engineering and computer science.
Equally important, the ability to disseminate and exchange information over the
world’s communication networks has revolutionized the way in which people work, play,
and live. At the turn of the century when everyone was feeling centennial and reflective,
the U.S. National Academy of Engineering produced a list of 20 technologies that made the
most impact on society in the 20th century.1 This list included life-changing innovations
such as electrification, the automobile, and the airplane; joining them were four technolog-
ical achievements in the area of communication—radio and television, the telephone, the
Internet, and computers—whose technological underpinnings we will be most concerned
with in this book.
Somewhat surprisingly, the Internet came in only at #13, but the reason given by the
committee was that it was developed toward the latter part of the century and that they
believed the most dramatic and significant impacts of the Internet would occur in the 21st
century. Looking at the first decade of this century, that sentiment sounds right—the ubiq-
uitous spread of wireless networks and mobile devices, the advent of social networks, and
the ability to communicate any time and from anywhere are not just changing the face of
commerce and our ability to keep in touch with friends, but are instrumental in massive
societal and political changes.
Communication is fundamental to our modern existence. Who among you can imagine
life without the Internet and its applications and without some form of networked mobile
device? Most people feel the same way—in early 2011, over 5 billion mobile phones were
active worldwide, over a billion of which had “broadband” network connectivity. To put
this number (5 billion) in perspective, it is larger than the number of people in the world
1 "The Vertiginous March of Technology", obtained from nae.edu. Document at http://bit.ly/owMoO6
What makes our communication networks work? This course is a start at understand-
ing the answers to this question. This question is worth studying not just because of the
impact that communication systems have had on the world, but also because the technical
areas cover so many different fields in EECS. Before we dive in and describe the “roadmap”
for the course, we want to share a bit of the philosophy behind the material.
Traditionally, in both education and in research, much of “low-level communication”
has been considered an “EE” topic, covering primarily the issues governing how bits of in-
formation move across a single communication link. In a similar vein, much of “network-
ing” has been considered a “CS” topic, covering primarily the issues of how to build com-
munication networks composed of multiple links. In particular, many traditional courses
on “digital communication” rarely concern themselves with how networks are built and
how they work, while most courses on “computer networks” treat the intricacies of com-
munication over physical links as a black box. As a result, a sizable number of people
have a deep understanding of one or the other topic, but few people are expert in every
aspect of the problem. As an abstraction, however, this division is one way of conquering
the immense complexity of the topic; our goal in this course is both to understand
the important details and to understand how various abstractions allow different parts
of the system to be designed and modified without paying close attention to (or even really
understanding) what goes on elsewhere in the system.
One drawback of preserving strong boundaries between different components of a com-
munication system is that the details of how things work in another component may re-
main a mystery, even to practising engineers. In the context of communication systems,
this mystery usually manifests itself as things that are “above my layer” or “below my
layer”. And so although we will appreciate the benefits of abstraction boundaries in this
course, an important goal for us is to study the most important principles and ideas that
go into the complete design of a communication system. Our goal is to convey to you both
the breadth of the field as well as its depth.
In short, we cover communication systems all the way from the source, which has some
information it wishes to transmit, to packets, into which messages are broken for transmis-
sion over a network, to bits, each of which is a “0” or a “1”, to signals, which are analog
waveforms sent over physical communication links (such as wires, fiber-optic cables, ra-
dio, or acoustic waves). We describe the salient aspects of all the layers, starting from how
an application might encode messages to how the network handles packets to how links
manipulate bits to how bits are converted to signals for transmission. In the process, we
will study networks of different sizes, ranging from the simplest dedicated point-to-point
link, to shared media with a set of communicating nodes sharing a common physical com-
munication medium, to larger multi-hop networks that themselves are connected to other
networks to form even bigger networks.
2 It is in fact distressing that according to a recent survey conducted by TeleNav—and we can't tell if this is a joke—40% of iPhone users say they'd rather give up their toothbrushes for a week than their iPhones! http://www.telenav.com/about/pr-summer-travel/report-20110803.html
1.1 Themes
Three fundamental challenges lie at the heart of all digital communication systems and
networks: reliability, sharing, and scalability. We will spend a considerable amount of time
on the first two issues in this introductory course, but much less time on the third.
1.1.1 Reliability
A large number of factors conspire to make communication unreliable, and we will study
numerous techniques to improve reliability. A common theme across these different
techniques is that they all use redundancy in creative and efficient ways to provide reliability
from unreliable individual components, exploiting the property that the failures of these
components are independent (or perhaps only weakly dependent).
The primary challenge is to overcome a wide range of faults and disturbances that one
encounters in practice, ranging from Gaussian noise and interference that distort or corrupt
signals, leading to possible bit errors on a link, to packet losses caused by
uncorrectable bit errors, queue overflows, or link and software failures in the network. All
these problems degrade communication quality.
In practice, we are interested not only in reliability, but also in speed. Most techniques to
improve communication reliability involve some form of redundancy, which reduces the
speed of communication. The essence of many communication systems is how reliability
and speed trade off against one another.
Communication speeds have increased rapidly with time. In the early 1980s, people
would connect to the Internet over telephone links at speeds of barely a few kilobits per
second, while today 100 Megabits per second over wireless links on laptops and 1-10 Gi-
gabits per second with wired links are commonplace.
We will develop good tools to understand why communication is unreliable and how
to overcome the problems that arise. The techniques involve error-correcting codes, han-
dling distortions caused by “inter-symbol interference” using a linear time-invariant chan-
nel model, retransmission protocols to recover from packet losses that occur for various rea-
sons, and developing fault-tolerant routing protocols to find alternate paths in networks to
overcome link or node failures.
2. How do messages go from one place to another in the network—this task is facilitated
by routing protocols.
3. How can we communicate information reliably across a multi-hop network (as op-
posed to over just a single link or shared medium)?
A word on efficiency is in order. The techniques used to share the network and achieve
reliability ultimately determine the efficiency of the communication network. In general,
one can frame the efficiency question in several ways. One approach is to minimize the
capital expenditure (hardware equipment, software, link costs) and operational expenses
(people, rental costs) to build and run a network capable of meeting a set of requirements
(such as number of connected devices, level of performance and reliability, etc.). Another
approach is to maximize the bang for the buck for a given network by maximizing the
amount of “useful work” that can be done over the network. One might measure the
“useful work” by calculating the aggregate throughput (in “bits per second”, or at higher
speeds, the more convenient “megabits per second”) achieved by the different communi-
cations, the variation of that throughput among the set of nodes, and the average delay
(often called the latency, measured usually in milliseconds) achieved by the data transfers.
Largely speaking, we will be concerned with throughput and latency in this course, and
not spend much time on the broader (but no less important) questions of cost.
Of late, another aspect of efficiency that has become important in many communica-
tion systems is energy consumption. This issue is important both in the context of massive
systems such as large data centers and for mobile computing devices such as laptops and
mobile phones. Improving the energy efficiency of these systems is an important problem.
1.1.3 Scalability
In addition to reliability and efficient sharing, scalability (i.e., designing networks that scale
to large sizes) is an important design consideration for communication networks. We will
only touch on this issue, leaving most of it to later courses (6.033, 6.829).
1.2 Outline and Plan

1. The source. Ultimately, all communication is about a source wishing to send some
information in the form of messages to a receiver (or to multiple receivers). Hence,
it makes sense to understand the mathematical basis for information, to understand
how to encode the material to be sent, and for reasons of efficiency, to understand how
best to compress our messages so that we can send as little data as possible but yet
allow the receiver to decode our messages correctly. Chapters 2 and 3 describe the
key ideas behind information, entropy (expectation of information), and source coding,
which enables data compression. We will study Huffman codes and the Lempel-Ziv-
Welch algorithm, two widely used methods.
2. Bits. The main issue we will deal with here is overcoming bit errors using error-
correcting codes, specifically linear block codes and convolutional codes. These
codes use interesting and somewhat sophisticated algorithms that cleverly apply re-
dundancy to reduce or eliminate bit errors. We conclude this module with a dis-
cussion of the capacity of the binary symmetric channel, which is a useful and key
abstraction for this part of the course.
3. Signals. The main issues we will deal with are how to modulate bits over signals
and demodulate signals to recover bits, as well as understanding how distortions of
signals by communication channels can be modeled using a linear time-invariant (LTI)
abstraction. Topics include going between time-domain and frequency-domain rep-
resentations of signals, the frequency content of signals, and the frequency response
of channels and filters.
4. Packets. The main issues we will deal with are how to share a medium using a MAC
protocol, routing in multi-hop networks, and reliable data transport protocols.
CHAPTER 2
Information, Entropy, and the
Motivation for Source Codes
The theory of information developed by Claude Shannon (SM EE ’40, PhD Math ’40) in
the late 1940s is one of the most impactful ideas of the past century, and has changed the
theory and practice of many fields of technology. The development of communication
systems and networks has benefited greatly from Shannon’s work. In this chapter, we
will first develop the intuition behind information and formally define it as a mathematical
quantity and connect it to another property of data sources, entropy.
We will then show how these notions guide us to efficiently compress and decompress a
data source before communicating (or storing) it without distorting the quality of informa-
tion being received. A key underlying idea here is coding, or more precisely, source coding,
which takes each message (or “symbol”) being produced by any source of data and asso-
ciates each message (symbol) with a codeword, while achieving several desirable properties.
This mapping between input messages (symbols) and codewords is called a code. Our fo-
cus will be on lossless compression (source coding) techniques, where the recipient of any
uncorrupted message can recover the original message exactly (we deal with corrupted
bits in later chapters).
• “1” if by land.
• “2” if by sea.
(Had the sender been from Course VI, it would’ve almost certainly been “0” if by land
and “1” if by sea!)
Let’s say we have no prior knowledge of how the British might come, so each of these
choices (messages) is equally probable. In this case, the amount of information conveyed
by the sender specifying the choice is 1 bit. Intuitively, that bit, which can take on one
of two values, can be used to encode the particular choice. If we have to communicate a
sequence of such independent events, say 1000 such events, we can encode the outcome
using 1000 bits of information, each of which specifies the outcome of an associated event.
On the other hand, suppose we somehow knew that the British were far more likely
to come by land than by sea (say, because there is a severe storm forecast). Then, if the
message in fact says that the British are coming by sea, much more information is being
conveyed than if the message said that that they were coming by land. To take another ex-
ample, far more information is conveyed by my telling you that the temperature in Boston
on a January day is 75°F, than if I told you that the temperature is 32°F!
The conclusion you should draw from these examples is that any quantification of “in-
formation” about an event should depend on the probability of the event. The greater the
probability of an event, the smaller the information associated with knowing that the event
has occurred.
Following Shannon, we quantify the information conveyed by learning that an event of
probability p has occurred as

I = log(1/p) bits.   (2.1)

This definition satisfies the basic requirement that it is a decreasing function of p. But
so do an infinite number of functions, so what is the intuition behind using the logarithm
to define information? And what is the base of the logarithm?
The second question is easy to address: you can use any base, because log_a(1/p) =
log_b(1/p)/log_b(a) for any two bases a and b. Following Shannon's convention, we will use
base 2,1 in which case the unit of information is called a bit.2
The answer to the first question, why the logarithmic function, is that the resulting defi-
nition has several elegant properties, and it is the simplest function that provides
these properties. One of these properties is additivity. If you have two independent events
(i.e., events that have nothing to do with each other), then the probability that they both
occur is equal to the product of the probabilities with which they each occur. What we
would like is for the corresponding information to add up. For instance, the event that it
rained in Seattle yesterday and the event that the number of students enrolled in 6.02 ex-
ceeds 150 are independent, and if I am told something about both events, the amount of
information I now have should be the sum of the information in being told individually of
the occurrence of the two events.
The logarithmic definition provides us with the desired additivity because, given two
independent events A and B with probabilities p_A and p_B,

I_A + I_B = log(1/p_A) + log(1/p_B) = log(1/(p_A p_B)) = log(1/P(A and B)).

1 And we won't mention the base; if you see a log in this chapter, it will be to base 2 unless we mention otherwise.
2 If we were to use base 10, the unit would be Hartleys, and if we were to use the natural log, base e, it would be nats, but no one uses those units in practice.
2.1.2 Examples
Suppose that we’re faced with N equally probable choices. What is the information re-
ceived when I tell you which of the N choices occurred?
Because the probability of each choice is 1/ N, the information is log(1/(1/ N)) = log N
bits.
Now suppose there are initially N equally probable choices, and I tell you something
that narrows the possibilities down to one of M equally probable choices. How much
information have I given you about the choice?
We can answer this question by observing that the probability of what you have just
learned—that the choice lies in a particular set of M of the N equi-probable possibilities—is
M/N. Hence, the information you have received is log(1/(M/N)) = log(N/M) bits. (Note
that when M = 1, we get the expected answer of log N bits.)
We can therefore write a convenient rule: if learning something reduces N equally probable
choices to M equally probable choices, the information conveyed is log(N/M) bits.
For example, learning the exact value of a decimal digit chosen uniformly at random conveys
log2(10/1) = 3.322 bits. Note that this information is the same as the sum of the information
from the three events considered next: information is cumulative if the joint probability of the events
revealed to us factors into the product of the individual probabilities.
In this example, we can calculate the probability that they all occur together, and
compare that answer with the product of the probabilities of each of them occurring
individually. Let event A be "the digit is even", event B be "the digit is ≥ 5", and event
C be "the digit is a multiple of 3". Then, P(A and B and C) = 1/10 because there is
only one digit, 6, that satisfies all three conditions. P(A) · P(B) · P(C) = 1/2 · 1/2 ·
4/10 = 1/10 as well. The reason the information adds up is that log(1/P(A and B and C)) =
log(1/P(A)) + log(1/P(B)) + log(1/P(C)).
Note that pairwise independence between events is actually not necessary for informa-
tion from three (or more) events to add up. In this example, P(A and B) = P(A) ·
P(B|A) = 1/2 · 2/5 = 1/5, while P(A) · P(B) = 1/2 · 1/2 = 1/4.
2.1.3 Entropy
Now that we know how to measure the information contained in a given event, we can
quantify the expected information in a set of possible outcomes. Specifically, if an event i
occurs with probability pi , 1 ≤ i ≤ N out of a set of N events, then the average or expected
information is given by
H(p_1, p_2, ..., p_N) = ∑_{i=1}^{N} p_i log(1/p_i).   (2.2)
H is also called the entropy (or Shannon entropy) of the probability distribution. Like
information, it is also measured in bits. It is simply the sum of several terms, each of which
is the information of a given event weighted by the probability of that event occurring. It
is often useful to think of the entropy as the average or expected uncertainty associated with
this set of events.
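To make these definitions concrete, here is a small Python sketch (ours, not part of the notes); the function names are our own choices.

    import math

    def information(p):
        """Information, in bits, gained by learning that an event of probability p occurred."""
        return math.log2(1 / p)

    def entropy(probs):
        """Expected information (entropy), in bits, of a discrete distribution -- Equation (2.2)."""
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    print(information(1 / 2))                        # 1.0 bit for an event of probability 1/2
    print(entropy([1 / 3, 1 / 2, 1 / 12, 1 / 12]))   # about 1.626 bits (a distribution used later in this chapter)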
In the important special case of two mutually exclusive events (i.e., exactly one of the two
events can occur), occurring with probabilities p and 1 − p, respectively, the entropy is

H(p, 1 − p) = p log(1/p) + (1 − p) log(1/(1 − p)).   (2.3)

We will be lazy and refer to this special case, H(p, 1 − p), as simply H(p).
This entropy as a function of p is plotted in Figure 2-1. It is symmetric about p = 1/2,
with its maximum value of 1 bit occurring at p = 1/2. Note that H(0) = H(1) = 0;
although log(1/p) → ∞ as p → 0, lim_{p→0} p log(1/p) = 0.
It is easy to verify that the expression for H from Equation (2.2) is always non-negative.
Moreover, H(p_1, p_2, ..., p_N) ≤ log N always.
2.2 Source Codes

A natural way to represent text is with a fixed-length code such as ASCII, in which every
character is encoded as 8 bits; each character then occupies one byte of the file and can be
easily manipulated independently. For example, to find the 42nd character in the file, one
just looks at the 42nd byte and interprets those 8 bits as an ASCII character. A text file
containing 1000 characters takes 8000 bits to store. If the text file were HTML to be sent
over the network in response to an HTTP request, it would be natural to send the 1000
bytes (8000 bits) exactly as they appear in the file.
But let’s think about how we might compress the file and send fewer than 8000 bits. If
the file contained English text, we’d expect that the letter e would occur more frequently
than, say, the letter x. This observation suggests that if we encoded e for transmission
using fewer than 8 bits—and, as a trade-off, had to encode less common characters, like x,
using more than 8 bits—we’d expect the encoded message to be shorter on average than the
original method. So, for example, we might choose the bit sequence 00 to represent e and
the code 100111100 to represent x.
This intuition is consistent with the definition of the amount of information: commonly
occurring symbols have a higher pi and thus convey less information, so we need fewer
bits to encode such symbols. Similarly, infrequently occurring symbols like x have a lower
pi and thus convey more information, so we’ll use more bits when encoding such sym-
bols. This intuition helps meet our goal of matching the size of the transmitted data to the
information content of the message.
The mapping of information we wish to transmit or store into bit sequences is referred
to as a code. Two examples of codes (fixed-length and variable-length) are shown in Fig-
ure 2-2, mapping different grades to bit sequences in one-to-one fashion. The fixed-length
code is straightforward; the variable-length code may look arbitrary, but it has been carefully
designed, as we will soon learn. Each bit sequence in the code is called a codeword.
When the mapping is performed at the source of the data, generally for the purpose
of compressing the data (ideally, to match the expected number of bits to the underlying
entropy), the resulting mapping is called a source code. Source codes are distinct from
channel codes we will study in later chapters. Source codes remove redundancy and com-
press the data, while channel codes add redundancy in a controlled way to improve the error
resilience of the data in the face of bit errors and erasures caused by imperfect communi-
cation channels. This chapter and the next are about source codes.
We can generalize this insight about encoding common symbols (such as the letter e)
more succinctly than uncommon symbols into a strategy for variable-length codes:
Send commonly occurring symbols using shorter codewords (fewer bits) and
infrequently occurring symbols using longer codewords (more bits).
We’d expect that, on average, encoding the message with a variable-length code would
take fewer bits than the original fixed-length encoding. Of course, if the message were all
x’s the variable-length encoding would be longer, but our encoding scheme is designed to
optimize the expected case, not the worst case.
Here’s a simple example: suppose we had to design a system to send messages con-
taining 1000 6.02 grades of A, B, C and D (MIT students rarely, if ever, get an F in 6.02 :-)).
Examining past messages, we find that each of the four grades occurs with the probabilities
shown in Figure 2-2.
With four possible choices for each grade, if we use the fixed-length encoding, we need
2 bits to encode a grade, for a total transmission length of 2000 bits when sending 1000
grades.
Figure 2-2: Possible grades shown with probabilities, fixed- and variable-length encodings
With a fixed-length code, the size of the transmission doesn’t depend on the actual
message—sending 1000 grades always takes exactly 2000 bits.
Decoding a message sent with the fixed-length code is straightforward: take each pair
of received bits and look them up in the table above to determine the corresponding grade.
Note that it’s possible to determine, say, the 42nd grade without decoding any other of the
grades—just look at the 42nd pair of bits.
Using the variable-length code, the number of bits needed for transmitting 1000 grades
depends on the grades.
If the grades were all B, the transmission would take only 1000 bits; if they were all C’s and
D’s, the transmission would take 3000 bits. But we can use the grade probabilities given
in Figure 2-2 to compute the expected length of a transmission as
1000[(1/3)(2) + (1/2)(1) + (1/12)(3) + (1/12)(3)] = 1000 × (5/3) ≈ 1666.7 bits
So, on average, using the variable-length code would shorten the transmission of 1000
grades by 333 bits, a savings of about 17%. Note that to determine, say, the 42nd grade, we
would need to decode the first 41 grades to determine where in the encoded message the
42nd grade appears.
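The expected-length calculation above is easy to check numerically, as in this small sketch of ours (the probabilities and codeword lengths are the ones used in the calculation above):

    # Grade probabilities and codeword lengths used in the calculation above (Figure 2-2).
    probs   = {"A": 1/3, "B": 1/2, "C": 1/12, "D": 1/12}
    lengths = {"A": 2,   "B": 1,   "C": 3,    "D": 3}     # variable-length code

    expected_bits_per_grade = sum(probs[g] * lengths[g] for g in probs)
    print(1000 * expected_bits_per_grade)   # about 1666.7 bits for 1000 grades
    print(1000 * 2)                         # 2000 bits with the fixed-length code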
Using variable-length codes looks like a good approach if we want to send fewer bits
on average, but preserve all the information in the original message. On the downside,
we give up the ability to access an arbitrary message symbol without first decoding the
message up to that point.
One obvious question to ask about a particular variable-length code: is it the best en-
coding possible? Might there be a different variable-length code that could do a better job,
i.e., produce even shorter messages on the average? How short can the messages be on the
average? We turn to this question next.
2.3 How Much Compression Is Possible?

The entropy, defined by Equation (2.2), tells us the expected amount of in-
formation in a message, when the message is drawn from a set of possible messages, each
occurring with some probability. The entropy is a lower bound on the amount of informa-
tion that must be sent, on average, when transmitting data about a particular choice.
What happens if we violate this lower bound, i.e., we send fewer bits on the average
than called for by Equation (2.2)? In this case the receiver will not have sufficient informa-
tion and there will be some remaining ambiguity—exactly what ambiguity depends on the
encoding, but to construct a code of fewer than the required number of bits, some of the
choices must have been mapped into the same encoding. Thus, when the recipient receives
one of the overloaded encodings, it will not have enough information to unambiguously
determine which of the choices actually occurred.
Equation (2.2) answers our question about how much compression is possible by giving
us a lower bound on the number of bits that must be sent to resolve all ambiguities at the
recipient. Reprising the example from Figure 2-2, we can update the figure using Equation
(2.1).
Figure 2-3: Possible grades shown with probabilities and information content.
Using Equation (2.2) we can compute the expected information content when learning a
particular grade:

∑_{i=1}^{N} p_i log2(1/p_i) = (1/3)(1.58) + (1/2)(1) + (1/12)(3.58) + (1/12)(3.58) = 1.626 bits
So encoding a sequence of 1000 grades requires transmitting 1626 bits on the average. The
variable-length code given in Figure 2-2 encodes 1000 grades using 1667 bits on the aver-
age, and so doesn’t achieve the maximum possible compression. It turns out the example
code does as well as possible when encoding one grade at a time. To get closer to the lower
bound, we would need to encode sequences of grades—more on this idea below.
Finding a “good” code—one where the length of the encoded message matches the
information content (i.e., the entropy)—is challenging and one often has to think “outside
the box”. For example, consider transmitting the results of 1000 flips of an unfair coin
where probability of heads is given by p H . The information content in an unfair coin flip
can be computed using Equation (2.3):

H(p_H) = p_H log(1/p_H) + (1 − p_H) log(1/(1 − p_H)).

For p_H = 0.999, this entropy evaluates to about 0.0114 bits per flip. Can you think of a way to encode 1000
unfair coin flips using, on average, just 11.4 bits? The recipient of the encoded message
must be able to tell for each of the 1000 flips which were heads and which were tails. Hint:
with a budget of just 11 bits, one obviously can’t encode each flip separately!
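As a quick numerical check of these figures, here is a short sketch of ours that evaluates Equation (2.3) for the unfair coin:

    import math

    def binary_entropy(p):
        """Entropy, in bits, of a coin whose heads probability is p -- Equation (2.3)."""
        if p in (0.0, 1.0):
            return 0.0
        return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

    print(binary_entropy(0.999))            # about 0.0114 bits per flip
    print(1000 * binary_entropy(0.999))     # about 11.4 bits, on average, for 1000 flips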
In fact, some effective codes leverage the context in which the encoded message is be-
ing sent. For example, if the recipient is expecting to receive a Shakespeare sonnet, then
it’s possible to encode the message using just 8 bits if one knows that there are only 154
Shakespeare sonnets. That is, if the sender and receiver both know the sonnets, and the
sender just wishes to tell the receiver which sonnet to read or listen to, he can do that using
a very small number of bits, just log 154 bits if all the sonnets are equi-probable!
2.4 Why Compression?

Compressing a message before transmission has several benefits:

• Shorter messages take less time to transmit and so the complete message arrives
more quickly at the recipient. This is good for both the sender and recipient since
it frees up their network capacity for other purposes and reduces their network
charges. For high-volume senders of data (such as Google, say), the impact of send-
ing half as many bytes is economically significant.
• Using network resources sparingly is good for all the users who must share the
internal resources (packet queues and links) of the network. Fewer resources per
message means more messages can be accommodated within the network’s resource
constraints.
• Over error-prone links with non-negligible bit error rates, compressing messages be-
fore they are channel-coded using error-correcting codes can help improve through-
put because all the redundancy in the message can be designed in to improve error
resilience, after removing any other redundancies in the original message. It is better
to design in redundancy with the explicit goal of correcting bit errors, rather than
rely on whatever sub-optimal redundancies happen to exist in the original message.
Exercises
1. Several people at a party are trying to guess a 3-bit binary number. Alice is told that
the number is odd; Bob is told that it is not a multiple of 3 (i.e., not 0, 3, or 6); Charlie
is told that the number contains exactly two 1’s; and Deb is given all three of these
clues. How much information (in bits) did each player get about the number?
2. After careful data collection, Alyssa P. Hacker observes that the probability of
“HIGH” or “LOW” traffic on Storrow Drive is given by the following table:
HIGH traffic level LOW traffic level
If the Red Sox are playing P(HIGH traffic) = 0.999 P(LOW traffic) = 0.001
If the Red Sox are not playing P(HIGH traffic) = 0.25 P(LOW traffic) = 0.75
(a) If it is known that the Red Sox are playing, then how much information in bits
is conveyed by the statement that the traffic level is LOW? Give your answer as
a mathematical expression.
(b) Suppose it is known that the Red Sox are not playing. What is the entropy
of the corresponding probability distribution of traffic? Give your answer as a
mathematical expression.
3. X is an unknown 4-bit binary number picked uniformly at random from the set of all
possible 4-bit numbers. You are given another 4-bit binary number, Y, and told that
the Hamming distance between X (the unknown number) and Y (the number you
know) is two. How many bits of information about X have you been given?
4. In Blackjack the dealer starts by dealing 2 cards each to himself and his opponent:
one face down, one face up. After you look at your face-down card, you know a total
of three cards. Assuming this was the first hand played from a new deck, how many
bits of information do you have about the dealer’s face down card after having seen
three cards?
5. The following table shows the undergraduate and MEng enrollments for the School
of Engineering.
(a) When you learn a randomly chosen engineering student’s department you get
some number of bits of information. For which student department do you get
the least amount of information?
(b) After studying Huffman codes in the next chapter, design a Huffman code to
encode the departments of randomly chosen groups of students. Show your
Huffman tree and give the code for each course.
(c) If your code is used to send messages containing only the encodings of the de-
partments for each student in groups of 100 randomly chosen students, what is
the average length of such messages?
6. You’re playing an online card game that uses a deck of 100 cards containing 3 Aces,
7 Kings, 25 Queens, 31 Jacks and 34 Tens. In each round of the game the cards are
shuffled, you make a bet about what type of card will be drawn, then a single card
is drawn and the winners are paid off. The drawn card is reinserted into the deck
before the next round begins.
(a) How much information do you receive when told that a Queen has been drawn
during the current round?
(b) Give a numeric expression for the information content received when learning
about the outcome of a round.
(c) After you learn about Huffman codes in the next chapter, construct a variable-
length Huffman encoding that minimizes the length of messages that report the
outcome of a sequence of rounds. The outcome of a single round is encoded as
A (ace), K (king), Q (queen), J (jack) or X (ten). Specify your encoding for each
of A, K, Q, J and X.
(d) Again, after studying Huffman codes, use your code from part (c) to calculate
the expected length of a message reporting the outcome of 1000 rounds (i.e., a
message that contains 1000 symbols).
(e) The Nevada Gaming Commission regularly receives messages in which the out-
come for each round is encoded using the symbols A, K , Q, J , and X. They dis-
cover that a large number of messages describing the outcome of 1000 rounds
(i.e., messages with 1000 symbols) can be compressed by the LZW algorithm
into files each containing 43 bytes in total. They decide to issue an indictment
for running a crooked game. Why did the Commission issue the indictment?
7. Consider messages made up entirely of vowels (A, E, I , O, U). Here’s a table of prob-
abilities for each of the vowels:
Vowel l    p_l     log2(1/p_l)    p_l log2(1/p_l)
A          0.22    2.18           0.48
E          0.34    1.55           0.53
I          0.17    2.57           0.43
O          0.19    2.40           0.46
U          0.08    3.64           0.29
Totals     1.00    12.34          2.19
(a) Give an expression for the number of bits of information you receive when
learning that a particular vowel is either I or U.
(b) After studying Huffman codes in the next chapter, use Huffman’s algorithm
to construct a variable-length code assuming that each vowel is encoded indi-
vidually. Draw a diagram of the Huffman tree and give the encoding for each
of the vowels.
(c) Using your code from part (b) above, give an expression for the expected length
in bits of an encoded message transmitting 100 vowels.
(d) Ben Bitdiddle spends all night working on a more complicated encoding algo-
rithm and sends you email claiming that using his code the expected length in
bits of an encoded message transmitting 100 vowels is 197 bits. Would you pay
good money for his implementation?
CHAPTER 3
Compression Algorithms: Huffman
and Lempel-Ziv-Welch (LZW)
This chapter discusses source coding, specifically two algorithms to compress messages
(i.e., a sequence of symbols). The first, Huffman coding, is efficient when one knows the
probabilities of the different symbols one wishes to send. In the context of Huffman cod-
ing, a message can be thought of as a sequence of symbols, with each symbol drawn in-
dependently from some known distribution. The second, LZW (for Lempel-Ziv-Welch), is
an adaptive compression algorithm that does not assume any a priori knowledge of the
symbol probabilities. Both Huffman codes and LZW are widely used in practice, and are
a part of many real-world standards such as GIF, JPEG, MPEG, MP3, and more.
“A” → 1
“B” → 01
“C” → 000
“D” → 001
A second code for the same four symbols is:
“A” → 10
“B” → 0
“C” → 110
“D” → 111
In general, a code tree is a binary tree with the symbols at the nodes of the tree and the
edges of the tree labeled with "0" or "1" to signify the encoding. To find the encoding
of a symbol, one simply walks the path from the root (the top-most node) to that
symbol, emitting the labels on the edges traversed.
If, in a code tree, the symbols are all at the leaves, then the code is said to be prefix-free,
because no codeword is a prefix of another codeword. Prefix-free codes (and code trees)
are naturally instantaneous, which makes them attractive.1
Expected code length. Our final definition is for the expected length of a code. Given N
symbols, with symbol i occurring with probability p_i, if we have a code in which symbol i
has length ℓ_i in the code tree (i.e., the codeword is ℓ_i bits long), then the expected length of
the code is ∑_{i=1}^{N} p_i ℓ_i.
In general, codes with small expected code length are interesting and useful because
they allow us to compress messages, delivering messages without any loss of information
but consuming fewer bits than without the code. Because one of our goals in designing
communication systems is efficient sharing of the communication links among different
users or conversations, the ability to send data in as few bits as possible is important.
We say that a code is optimal if its expected code length, L, is the minimum among
all possible codes. The corresponding code tree gives us the optimal mapping between
symbols and codewords, and is usually not unique. Shannon proved that the expected
code length of any decodable code cannot be smaller than the entropy, H, of the underlying
probability distribution over the symbols. He also showed the existence of codes that
achieve entropy asymptotically, as the length of the coded messages approaches ∞. Thus,
an optimal code will have an expected code length that "matches" the entropy for long
messages.

1 Somewhat unfortunately, several papers and books use the term "prefix code" to mean the same thing as a "prefix-free code". Caveat emptor.
The rest of this chapter describes two optimal codes. First, Huffman codes, which are
optimal instantaneous codes when the probabilities of the various symbols are given and
we restrict ourselves to mapping individual symbols to codewords. It is a prefix-free,
instantaneous code, satisfying the property H ≤ L ≤ H + 1. Second, the LZW algorithm,
which adapts to the actual distribution of symbols in the message, not relying on any a
priori knowledge of symbol probabilities.
3.2 Huffman Codes

Figure 3-1: Variable-length code from Figure 2-2 shown in the form of a code tree.
To encode a symbol using the tree, start at the root and traverse the tree until you reach
the symbol to be encoded—the encoding is the concatenation of the branch labels in the
order the branches were visited. The destination node, which is always a “leaf” node for
an instantaneous or prefix-free code, determines the path, and hence the encoding. So B is
encoded as 0, C is encoded as 110, and so on. Decoding complements the process, in that
now the path (codeword) determines the symbol, as described in the previous section. So
111100 is decoded as: 111 → D, 10 → A, 0 → B.
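Because the code is prefix-free, a decoder never needs to look ahead: it can accumulate bits until they match some codeword and emit the corresponding symbol. A minimal sketch of ours, using the codewords read off the tree in Figure 3-1:

    # Codewords read off the code tree in Figure 3-1.
    codewords = {"10": "A", "0": "B", "110": "C", "111": "D"}

    def decode(bits):
        """Decode a bit string encoded with a prefix-free code."""
        symbols, current = [], ""
        for b in bits:
            current += b
            if current in codewords:          # prefix-free: the first match is the symbol
                symbols.append(codewords[current])
                current = ""
        return "".join(symbols)

    print(decode("111100"))   # -> "DAB"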
Looking at the tree, we see that the more probable symbols (e.g., B) are near the root of
the tree and so have short encodings, while less-probable symbols (e.g., C or D) are further
down and so have longer encodings. David Huffman used this observation while writing
a term paper for a graduate course taught by Bob Fano here at M.I.T. in 1951 to devise an
algorithm for building the decoding tree for an optimal variable-length code.
Huffman’s insight was to build the decoding tree bottom up, starting with the least prob-
able symbols and applying a greedy strategy. Here are the steps involved, along with a
worked example based on the variable-length code in Figure 2-2. The input to the algo-
rithm is a set of symbols and their respective probabilities of occurrence. The output is the
code tree, from which one can read off the codeword corresponding to each symbol.
1. Input: A set S of tuples, each tuple consisting of a message symbol and its associated
probability.
Example: S ← {(0.333, A), (0.5, B), (0.083, C), (0.083, D)}
2. Remove from S the two tuples with the smallest probabilities, resolving ties arbitrar-
ily. Combine the two symbols from the removed tuples to form a new tuple (which
will represent an interior node of the code tree). Compute the probability of this new
tuple by adding the two probabilities from the tuples. Add this new tuple to S. (If S
had N tuples to start, it now has N − 1, because we removed two tuples and added
one.)
Example: S ← {(0.333, A), (0.5, B), (0.167, C ∧ D)}
3. Repeat step 2 until S contains only a single tuple. (That last tuple represents the root
of the code tree.)
Example, iteration 2: S ← {(0.5, B), (0.5, A ∧ (C ∧ D))}
Example, iteration 3: S ← {(1.0, B ∧ (A ∧ (C ∧ D)))}
Et voila! The result is a code tree representing a variable-length code for the given symbols
and probabilities. As you’ll see in the Exercises, the trees aren’t always “tall and thin” with
the left branch leading to a leaf; it’s quite common for the trees to be much “bushier.” As
a simple example, consider input symbols A, B, C, D, E, F, G, H with equal probabilities
of occurrences (1/8 for each). In the first pass, one can pick any two as the two lowest-
probability symbols, so let’s pick A and B without loss of generality. The combined AB
symbol has probability 1/4, while the other six symbols have probability 1/8 each. In the
next iteration, we can pick any two of the symbols with probability 1/8, say C and D.
Continuing this process, we see that after four iterations, we would have created four
combined symbols, each with probability 1/4. Applying the algorithm, we find
that the code tree is a complete binary tree where every symbol has a codeword of length
3, corresponding to all combinations of 3-bit words (000 through 111).
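The greedy procedure just described translates almost directly into code. The following Python sketch is our own, not the notes' implementation; it uses a priority queue, and because ties are broken arbitrarily the exact codewords may differ from those in Figure 2-2, though the expected code length is the same.

    import heapq
    from itertools import count

    def huffman(probabilities):
        """Return {symbol: codeword} for a Huffman code over {symbol: probability}."""
        tiebreak = count()    # keeps heap entries comparable when probabilities are equal
        # A heap entry is (probability, tiebreaker, tree); a tree is either a symbol (leaf)
        # or a (left, right) pair representing an interior node formed by combining two tuples.
        heap = [(p, next(tiebreak), sym) for sym, p in probabilities.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)            # the two least-probable subtrees
            p2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))
        _, _, tree = heap[0]

        codewords = {}
        def walk(node, prefix):
            if isinstance(node, tuple):                # interior node: label the branches 0 and 1
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codewords[node] = prefix or "0"
            return codewords
        return walk(tree, "")

    print(huffman({"A": 1/3, "B": 1/2, "C": 1/12, "D": 1/12}))
    # e.g. {'B': '0', 'C': '100', 'D': '101', 'A': '11'}: lengths 1, 3, 3, 2, expected length 1.667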
Huffman codes have the biggest reduction in the expected length of the encoded mes-
sage when some symbols are substantially more probable than other symbols. If all sym-
bols are equiprobable, then all codewords are roughly the same length, and there are
(nearly) fixed-length encodings whose expected code lengths approach entropy and are
thus close to optimal.
Figure 3-2: An example of two non-isomorphic Huffman code trees, both optimal.
As an example, consider six symbols with probabilities 1/4, 1/4, 1/8, 1/8, 1/8, 1/8. The two code trees shown in Figure 3-2 are both valid
Huffman (optimal) codes.
Optimality. Huffman codes are optimal in the sense that there are no other codes with
shorter expected length, when we restrict ourselves to instantaneous (prefix-free) codes and
the symbols occur independently in messages drawn from a known probability distribution.
We state here some propositions that are useful in establishing the optimality of Huff-
man codes.
Proposition 3.1 In any optimal code tree for a prefix-free code, each node has either zero or two
children.
To see why, suppose an optimal code tree has a node with exactly one child. If we remove that
node and move its child up one level to take its place, we will have reduced the expected code
length, and the code will remain decodable. Hence, the original tree was not optimal, a contradiction.
Proposition 3.2 In the code tree for a Huffman code, no node has exactly one child.
To see why, note that we always combine the two lowest-probability nodes into a single
one, which means that in the code tree, each internal node (i.e., non-leaf node) comes from
two combined nodes (either internal nodes themselves, or original symbols).
Proposition 3.3 There exists an optimal code in which the two least-probable symbols:
• have codewords of maximum length, and
• are siblings, i.e., their codewords differ in exactly one bit (the last one).
Proof. Let z be the least-probable symbol. If it is not at maximum depth in the optimal code
tree, then some other symbol, call it s, must be at maximum depth. But because p_z ≤ p_s, if
we swapped z and s in the code tree, we would end up with a code whose expected length is
no larger. Hence, we may assume z has a codeword at least as long as every other codeword.
Now, symbol z must have a sibling in the optimal code tree, by Proposition 3.1. Call it
x. Let y be the symbol with the second lowest probability; i.e., p_x ≥ p_y ≥ p_z. If p_x = p_y, then
the proposition is proved. Otherwise, let's swap x and y in the code tree, so now y is a sibling of z.
The expected code length of this code tree is not larger than that of the pre-swap optimal code
tree, because p_x is strictly greater than p_y, proving the proposition.
Theorem 3.1 Huffman coding over a set of symbols with known probabilities produces a code tree
whose expected length is optimal.
Huffman coding with grouped symbols. The entropy of the distribution shown in Figure
2-2 is 1.626. The per-symbol encoding of those symbols using Huffman coding produces
a code with expected length 1.667, which is noticeably larger (e.g., if we were to encode
10,000 grades, the difference would be about 410 bits). Can we apply Huffman coding to
get closer to entropy?
One approach is to group symbols into larger “metasymbols” and encode those instead,
usually with some gain in compression but at a cost of increased encoding and decoding
complexity.
Consider encoding pairs of symbols, triples of symbols, quads of symbols, etc. Here’s a
tabulation of the results using the grades example from Figure 2-2:
Figure 3-3: Results from encoding more than one grade at a time.
We see that we can come closer to the Shannon lower bound (i.e., entropy) of 1.626 bits
by encoding grades in larger groups at a time, but at a cost of a more complex encoding
and decoding process. This approach still has two problems: first, it requires knowledge
of the individual symbol probabilities, and second, it assumes that the symbols in a message
are independent and identically distributed. In practice, however, symbol probabil-
ities change message-to-message, or even within a single message.
This last observation suggests that it would be nice to create an adaptive variable-length
encoding that takes into account the actual content of the message. The LZW algorithm,
presented in the next section, is such a method.
3.3 LZW: An Adaptive Variable-Length Source Code

Figure 3-4: Pseudo-code for the LZW adaptive variable-length encoder. Note that some details, like dealing
with a full string table, are omitted for simplicity.
When encoding a byte stream,2 the first 2^8 = 256 entries of the string table, numbered 0
through 255, are initialized to hold all the possible one-byte sequences. The other entries
will be filled in as the message byte stream is processed. The encoding strategy works as
follows and is shown in pseudo-code form in Figure 3-4. First, accumulate message bytes
as long as the accumulated sequences appear as some entry in the string table. At some
point, appending the next byte b to the accumulated sequence S would create a sequence
S + b that’s not in the string table, where + denotes appending b to S. The encoder then
executes the following steps:
1. It transmits the N-bit code for the sequence S.
2. It adds a new entry to the string table for S + b. If the encoder finds the table full
when it goes to add an entry, it reinitializes the table before the addition is made.
3. It resets S to contain only the byte b.
2 A byte is a contiguous string of 8 bits.
Figure 3-7: LZW decoding of the sequence a, b, c, 256, 258, 257, 259, 262, 261, 264, 260, 266, 263, c
This process repeats until all the message bytes are consumed, at which point the en-
coder makes a final transmission of the N-bit code for the current sequence S.
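The encoding strategy just described can be written compactly. The following Python sketch is our own simplification of the pseudo-code in Figure 3-4: it ignores the table-full case and emits codes as plain integers rather than N-bit words.

    def lzw_encode(data: bytes):
        """LZW-encode a byte string, returning a list of integer codes."""
        table = {bytes([i]): i for i in range(256)}    # codes 0..255: all one-byte sequences
        next_code = 256
        codes = []
        S = b""
        for byte in data:
            b = bytes([byte])
            if S + b in table:
                S = S + b                    # keep accumulating while the sequence is in the table
            else:
                codes.append(table[S])       # transmit the code for S
                table[S + b] = next_code     # add the new sequence S + b to the string table
                next_code += 1
                S = b                        # restart accumulation from the current byte
        if S:
            codes.append(table[S])           # final transmission
        return codes

    print(lzw_encode(b"abcabcabcabc"))
    # [97, 98, 99, 256, 258, 257, 259] -- matches the first transmissions in Figure 3-6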
Note that for every transmission done by the encoder, the encoder makes a new entry
in the string table. With a little cleverness, the decoder, shown in pseudo-code form in
Figure 3-5, can figure out what the new entry must have been as it receives each N-bit
code. With a duplicate string table constructed at the decoder as the algorithm progresses,
it is possible to recover the original message: just use each received N-bit
code as an index into the decoder's string table to retrieve the original sequence of message
bytes.
Figure 3-6 shows the encoder in action on a repeating sequence of abc. Notice that:
• The encoder algorithm is greedy—it is designed to find the longest possible match
in the string table before it makes a transmission.
• The string table is filled with sequences actually found in the message stream. No
encodings are wasted on sequences not actually found in the file.
• Since the encoder operates without any knowledge of what’s to come in the message
stream, there may be entries in the string table that don’t correspond to a sequence
that’s repeated, i.e., some of the possible N-bit codes will never be transmitted. This
property means that the encoding isn’t optimal—a prescient encoder could do a bet-
ter job.
• Note that in this example the amount of compression increases as the encoding pro-
gresses, i.e., more input bytes are consumed between transmissions.
• Eventually the table will fill and then be reinitialized, recycling the N-bit codes for
new sequences. So the encoder will eventually adapt to changes in the probabilities
of the symbols or symbol sequences.
Figure 3-7 shows the operation of the decoder on the transmit sequence produced in
Figure 3-6. As each N-bit code is received, the decoder deduces the correct entry to make
in the string table (i.e., the same entry as made at the encoder) and then uses the N-bit code
as index into the table to retrieve the original message sequence.
There is a special case, which turns out to be important, that needs to be dealt with.
There are three instances in Figure 3-7 where the decoder receives an index (262, 264, 266)
that it has not previously entered in its string table. So how does it figure out what these
correspond to? A careful analysis, which you could do, shows that this situation only
happens when the associated string table entry has its last symbol identical to its first
symbol. To handle this issue, the decoder can simply complete the partial string that it is
building up into a table entry (abc, bac, cab respectively, in the three instances in Figure 3-
7) by repeating its first symbol at the end of the string (to get abca, bacb, cabc respectively,
in our example), and then entering this into the string table. This step is captured in the
pseudo-code in Figure 3-5 by the logic of the “if” statement there.
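The corresponding decoder, including the special case just described, can be sketched as follows (again our own rendering of the logic in Figure 3-5, fed with the transmit sequence from Figure 3-7):

    def lzw_decode(codes):
        """Decode a list of integer LZW codes back into the original byte string."""
        table = {i: bytes([i]) for i in range(256)}    # codes 0..255: one-byte sequences
        next_code = 256
        output = bytearray()
        S = b""                                        # previously decoded sequence
        for code in codes:
            if code in table:
                entry = table[code]
            else:                                      # special case: code not yet in the table;
                entry = S + S[:1]                      # its string starts and ends with the same symbol
            output += entry
            if S:                                      # mirror the encoder's table construction
                table[next_code] = S + entry[:1]
                next_code += 1
            S = entry
        return bytes(output)

    # The transmit sequence from Figure 3-7, with 'a', 'b', 'c' written as byte values 97, 98, 99.
    codes = [97, 98, 99, 256, 258, 257, 259, 262, 261, 264, 260, 266, 263, 99]
    print(lzw_decode(codes) == b"abc" * 12)            # True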
We conclude this chapter with some interesting observations about LZW compression:
• A common choice for the size of the string table is 4096 (N = 12). A larger table
means the encoder has a longer memory for sequences it has seen and increases
the possibility of discovering repeated sequences across longer spans of message.
However, dedicating string table entries to remembering sequences that will never
be seen again decreases the efficiency of the encoding.
• Early in the encoding, the encoder uses entries near the beginning of the string table,
i.e., the high-order bits of the string table index will be 0 until the string table starts
to fill. So the N-bit codes we transmit at the outset will be numerically small. Some
variants of LZW transmit a variable-width code, where the width grows as the table
fills. If N = 12, the initial transmissions may be only 9 bits until entry number 511 in
the table is filled (i.e., 512 entries filled in all), then the code expands to 10 bits, and
so on, until the maximum width N is reached.
• Some variants of LZW introduce additional special transmit codes, e.g., CLEAR to
indicate when the table is reinitialized. This allows the encoder to reset the table
pre-emptively if the message stream probabilities change dramatically, causing an
observable drop in compression efficiency.
• There are many small details we haven’t discussed. For example, when sending N-
bit codes one bit at a time over a serial communication channel, we have to specify
the order in which the N bits are sent: least significant bit first, or most significant
bit first. To specify N, serialization order, algorithm version, etc., most compressed
file formats have a header where the encoder can communicate these details to the
decoder.
Exercises
1. Huffman coding is used to compactly encode the species of fish tagged by a game
warden. If 50% of the fish are bass and the rest are evenly divided among 15 other
species, how many bits would be used to encode the species when a bass is tagged?
4. Describe the contents of the string table created when encoding a very long string
of all a’s using the simple version of the LZW encoder shown in Figure 3-4. In this
example, if the decoder has received E encoded symbols (i.e., string table indices)
from the encoder, how many a’s has it been able to decode?
5. Consider the pseudo-code for the LZW decoder given in Figure 3-5. Suppose that
this decoder has received the following five codes from the LZW encoder (these are
the first five codes from a longer compression run):
After it has finished processing the fifth code, what are the entries in the translation
table and what is the cumulative output of the decoder?
table[256]:
table[257]:
table[258]:
table[259]:
cumulative output from decoder:
97 97 98 98 257 256
(b) By how many bits is the compressed message shorter than the original message
(each character in the original message is 8 bits long)?
(c) What is the first string of length 3 added to the compression table? (If there’s no
such string, your answer should be “None”.)
CHAPTER 4
Why Digital? Communication Abstractions and Digital Signaling
This chapter describes analog and digital communication, and the differences between
them. Our focus is on understanding the problems with analog communication and the
motivation for the digital abstraction. We then present basic recipes for sending and re-
ceiving digital data mapped to analog signals over communication links; these recipes are
needed because physical communication links are fundamentally analog in nature at the
lowest level. After understanding how bits get mapped to signals and vice versa, we will
present our simple layered communication model: messages → packets → bits → signals. The
rest of this book is devoted to understanding these different layers and how they interact
with each other.
data may be thought of as being capable of producing analog data from this continuous
space. In practice, of course, there is a measurement fidelity to every sensor, so the data
captured will be quantized, but the abstraction is much closer to analog than digital. Other
sources of data include sensors gathering information about the environment or device
(e.g., accelerometers on your mobile phone, GPS sensors on mobile devices, or climate
sensors to monitor weather conditions); these data sources could be inherently analog or
inherently digital depending on what they’re measuring.
Regardless of the nature of a source, converting the relevant data to digital form is the
modern way; one sees numerous advertisements for “digital” devices (e.g., cameras), with
the implicit message that somehow “digital” is superior to other methods or devices. The
question is, why?
1. The digital abstraction enables the composition of modules to build large systems.
Yet, the digital abstraction is not the natural way to communicate data. Physical com-
munication links turn out to be analog at the lowest level, so we are going to have to
convert data between digital and analog, and vice versa, as it traverses different parts of
the system between the sender and the receiver.
and acoustic media, the problem is trickier, but we can send different signals at different
amplitudes “modulated” over a “carrier waveform” (as we will see in later chapters), and
the receiver can measure the quantity of interest to infer what the sender might have sent.
Figure 4-2: If the two voltages are adequately spaced apart, we can tolerate a certain amount of noise.
[Figures: the first marks the points V0, (V0 + V1)/2, and V1 on the voltage axis; the receiver can output any value when the input voltage is in the range around the midpoint (V0 + V1)/2. The second contrasts the continuous-time received waveform with the discrete-time samples the receiver takes, one every sample interval.]
1. How to cope with differences in the sender and receiver clock frequencies?
2. How to ensure that the transmitted stream has frequent 0/1 transitions, whatever the message content?
The first problem is one of clock and data recovery. The second is solved using line
coding, of which 8b/10b coding is a common scheme. The idea is to convert groups of
bits into different groups of bits that have frequent 0/1 transitions. We describe these two
ideas in the next two sections. We also refer the reader to the two lab tasks in Problem Set
2, which describe these two issues and their implementation in considerable detail.
Figure 4-5: Transmission using a clock (top) and inferring clock edges from bit transitions between 0 and 1
and vice versa at the receiver (bottom).
Similarly, if the receiver’s clock is a little faster, the transmitter will seem to be transmitting
slower, e.g., transmitting at 5.001 samples per bit. This small difference accumulates over
time, so if the receiver uses a static sampling strategy like the one outlined in the previous
paragraph, it will eventually be sampling right at the transition points between two bits.
And to add insult to injury, the difference in the two clock frequencies will change over
time.
The fix is to have the receiver adapt the timing of its sampling based on where it detects
transitions in the voltage samples. The transition (when there is one) should happen half-
way between the chosen sample points. Or to put it another way, the receiver can look
at the voltage sample half-way between the two sample points and if it doesn’t find a
transition, it should adjust the sample index appropriately.
Figure 4-6 illustrates how the adaptation should work. The examples use a low-to-high
transition, but the same strategy can obviously be used for a high-to-low transition. The
two cases shown in the figure differ in value of the sample that’s half-way between the
current sample point and the previous sample point. Note that a transition has occurred
when two consecutive sample points represent different bit values.
• Case 1: the half-way sample is the same as the current sample. In this case the half-way
sample is in the same bit transmission as the current sample, i.e., we're sampling
too late in the bit transmission. So when moving to the next sample, increment the
index by samples per bit − 1 to move "backward".
• Case 2: the half-way sample is different from the current sample. In this case the half-way
sample is in the previous bit transmission from the current sample, i.e., we're
sampling too early in the bit transmission. So when moving to the next sample,
increment the index by samples per bit + 1 to move "forward".
Figure 4-6: The two cases of how the adaptation should work.
If there is no transition, simply increment the sample index by samples per bit to move
to the next sample. This keeps the sampling position approximately right until the next
transition provides the information necessary to make the appropriate adjustment.
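The following Python fragment sketches this adaptive sampling loop. It assumes an array samples of received voltage samples, a digitization threshold, and a known samples_per_bit; these names are illustrative, and practical details (acquiring the initial lock, filtering noise) are omitted.

def recover_bits(samples, samples_per_bit, threshold):
    """Sketch of receive sampling with transition-based timing adaptation."""
    bits = []
    i = samples_per_bit // 2            # start near the middle of the first bit
    prev_bit = None
    while i < len(samples):
        bit = 1 if samples[i] > threshold else 0
        bits.append(bit)
        step = samples_per_bit          # no transition: keep the current timing
        if prev_bit is not None and bit != prev_bit:
            # A transition occurred; inspect the half-way sample to judge timing.
            halfway = 1 if samples[i - samples_per_bit // 2] > threshold else 0
            if halfway == bit:
                step = samples_per_bit - 1   # Case 1: sampling too late, nudge backward
            else:
                step = samples_per_bit + 1   # Case 2: sampling too early, nudge forward
        prev_bit = bit
        i += step
    return bits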
If you think about it, when there is a transition, one of the two cases above will be true
and so we’ll be constantly adjusting the relative position of the sampling index. That’s fine
– if the relative position is close to correct, we’ll make the opposite adjustment next time.
But if a large correction is necessary, it will take several transitions for the correction to
happen. To facilitate this initial correction, in most protocols the transmission of message
begins with a training sequence of alternating 0- and 1-bits (remember each bit is actually
samples per bit voltage samples long). This provides many transitions for the receiver’s
adaptation circuitry to chew on.
• For electrical reasons it’s desirable to maintain DC balance on the wire, i.e., that on
the average the number of 0’s is equal to the number of 1’s.
• Transitions in the received bits indicate the start of a new bit and hence are useful in
synchronizing the sampling process at the receiver—the better the synchronization,
the faster the maximum possible symbol rate. So ideally one would like to have
frequent transitions. On the other hand each transition consumes power, so it would
be nice to minimize the number of transitions consistent with the synchronization
constraint and, of course, the need to send actual data! A signaling protocol in which
the transitions are determined solely by the message content may not achieve these goals.
To address these issues we can use an encoder (called the “line coder”) at the transmitter
to recode the message bits into a sequence that has the properties we want, and use a
decoder at the receiver to recover the original message bits. Many of today’s high-speed
data links (e.g., PCI-e and SATA) use an 8b/10b encoding scheme developed at IBM. The
8b/10b encoder converts 8-bit message symbols into 10 transmitted bits. There are 256
possible 8-bit words and 1024 possible 10-bit transmit symbols, so one can choose the
mapping from 8-bit to 10-bit so that the 10-bit transmit symbols have the following
properties:
• The maximum run of 0’s or 1’s is five bits (i.e., there is at least one transition every
five bits).
• At any given sample the maximum difference between the number of 1’s received
and the number of 0’s received is six.
• Special 7-bit sequences can be inserted into the transmission that don’t appear in any
consecutive sequence of encoded message bits, even when considering sequences
that span two transmit symbols. The receiver can do a bit-by-bit search for these
unique patterns in the incoming stream and then know how the 10-bit sequences are
aligned in the incoming stream.
Here’s how the encoder works: collections of 8-bit words are broken into groups of
words called a packet. Each packet is sent using the following wire protocol:
• A sequence of alternating 0 bits and 1 bits are sent first (recall that each bit is mul-
tiple voltage samples). This sequence is useful for making sure the receiver’s clock
recovery machinery has synchronized with the transmitter’s clock. These bits aren’t
part of the message; they’re there just to aid in clock recovery.
• Each byte (8 bits) in the packet data is line-coded to 10 bits and sent. Each 10-bit
transmit symbol is determined by table lookup using the 8-bit word as the index.
Note that all 10-bit symbols are transmitted least-significant bit (LSB) first. If the
length of the packet (without SYNC) is s bytes, then the resulting size of the line-
coded portion is 10s bits, to which the SYNC is added.
Multiple packets are sent until the complete message has been transmitted. Note that
there’s no particular specification of what happens between packets – the next packet may
follow immediately, or the transmitter may sit idle for a while, sending, say, training se-
quence samples.
If the original data in a single packet is s bytes long, and the SYNC is h bits long, then
the total number of bits sent is equal to 10s + h. The "rate" of this line code, i.e., the ratio
of the number of useful message bits to the total bits sent, is therefore equal to 8s/(10s + h).
(We will define the concept of "code rate" more carefully in Chapter 6.) If the communication
link is operating at R bits per second, then the rate at which useful message bits arrive is
given by (8s/(10s + h)) · R bits per second with 8b/10b line coding.
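As a quick numerical check, the small sketch below (illustrative names only) computes this effective rate; a 1000-byte packet with a 64-bit SYNC, for example, delivers 8000/10064, or about 0.8 of the raw link rate.

def effective_rate(s_bytes, sync_bits, link_rate_bps):
    """Useful-message throughput over an 8b/10b-coded link (sketch)."""
    useful_bits = 8 * s_bytes
    total_bits = 10 * s_bytes + sync_bits   # 10 coded bits per byte, plus the SYNC
    return link_rate_bps * useful_bits / total_bits

# effective_rate(1000, 64, 1e9) is roughly 7.95e8, i.e. about 795 Megabits/s of
# useful data on a 1 Gigabit/s link.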
The rest of this book is about these three important abstractions (packets, bits, and
signals) and how they work together. We do them in the order bits, signals, and packets, for
convenience and ease of exposition and understanding.
[Figure: the single-link pipeline runs from the original source through digitize (if needed), source coding, channel coding (bit error correction), and a mapper that produces transmit samples (voltages) sent over the physical link; the receiver applies the demapper to its received samples, then channel decoding (reducing or removing bit errors), source decoding, and finally render/display for the receiving app/user. The network view places a communication network of end-host computers between the two ends.]
Figure 4-8: Expanding on the "big picture": single link view (top) and the network view (bottom).
CHAPTER 5
Coping with Bit Errors using Error Correction Codes
Recall our main goal in designing digital communication networks: to send information
both reliably and efficiently between nodes. Meeting that goal requires the use of tech-
niques to combat bit errors, which are inevitable in both communication channels and
storage media.
The key idea we will apply to achieve reliable communication is the addition of redun-
dancy to the transmitted data, to improve the probability that the original message can be
reconstructed from the possibly corrupted data that is received. The sender has an encoder
whose job is to take the message and process it to produce the coded bits that are then sent
over the channel. The receiver has a decoder whose job is to take the received (coded) bits
and to produce its best estimate of the message. The encoder-decoder procedures together
constitute channel coding; good channel codes provide error correction capabilities that
reduce the bit error rate (i.e., the probability of a bit error).
With proper design, full error correction may be possible, provided only a small num-
ber of errors has occurred. Even when too many errors have occurred to permit correction,
it may be possible to perform error detection. Error detection provides a way for the re-
ceiver to tell (with high probability) if the message was decoded correctly or not. Error
detection usually works by the sender and receiver using a different code from the one
used to correct errors; common examples include the cyclic redundancy check (CRC) or hash
functions. These codes take n-bit messages and produce a compact “signature” of that mes-
sage that is much smaller than the message (e.g., the popular CRC-32 scheme produces a
32-bit signature of an arbitrarily long message). The sender computes and transmits the
signature along with the message bits, usually appending it to the end of the message. The
receiver, after running the decoder to correct errors, then computes the signature over its
estimate of the message bits and compares that signature to its estimate of the signature
bits in the received data. If the computed and estimated signatures are not equal, then
the receiver considers the message to have one or more bit errors; otherwise, it assumes
that the message has been received correctly. This latter assumption is probabilistic: there
is some non-zero (though very small, for good signatures) probability that the estimated
and computed signatures match, but the receiver’s decoded message is different from the
sender’s. If the signatures don’t match, the receiver and sender may use some higher-layer
protocol to arrange for the message to be retransmitted; we will study such schemes later.
We will not study error detection codes like CRC or hash functions in this course.
Our plan for this chapter is as follows. To start, we will assume a binary symmetric
channel (BSC), which we defined and explained in the previous chapter; here the probabil-
ity of a bit “flipping” is ε. Then, we will discuss and analyze a simple redundancy scheme
called a replication code, which will simply make n copies of any given bit. The replication
code has a code rate of 1/n—that is, for every useful message bit, we end up transmitting
n total bits. The overhead of the replication code of rate 1/n is 1 − 1/n, which is rather high
for the error correcting power of the code. We will then turn to the key ideas that allow
us to build powerful codes capable of correcting errors without such a high overhead (or
equivalently, capable of correcting far more errors at a given code rate compared to the
replication code).
There are two big, inter-related ideas used in essentially all error correction codes. The
first is the notion of embedding, where the messages one wishes to send are placed in a
geometrically pleasing way in a larger space so that the distance between any two valid
points in the embedding is large enough to enable the correction and detection of errors.
The second big idea is to use parity calculations, which are linear functions over the bits
we wish to send, to generate the redundancy in the bits that are actually sent. We will
study examples of embeddings and parity calculations in the context of two classes of
codes: linear block codes, which are an instance of the broad class of algebraic codes,
and convolutional codes, which are perhaps the simplest instance of the broad class of
graphical codes.
We start with a brief discussion of bit errors.
Megabits/s, as long as there is some way to detect errors when they occur.
The BSC is perhaps the simplest error model that is realistic, but real-world channels
exhibit more complex behaviors. For example, over many wireless and wired channels
as well as on storage media (like CDs, DVDs, and disks), errors can occur in bursts. That
is, the probability of any given bit being received wrongly depends on recent history: the
probability is higher if the bits in the recent past were received incorrectly. Our goal is to
develop techniques to mitigate the effects of both the BSC and burst errors. We’ll start with
techniques that work well over a BSC and then discuss how to deal with bursts.
The notation C(n, i) ("n choose i") denotes the number of ways of selecting i objects (in this case, bit positions) out of n.
Figure 5-1: Probability of a decoding error with the replication code that replaces each bit b with n copies
of b. The code rate is 1/n.
When n is even, we add a term at the end to account for the fact that the decoder has a
fifty-fifty chance of guessing correctly when it receives a codeword with an equal number
of 0’s and 1’s.
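Equation (5.1) itself does not appear on this page, but the sketch below assumes its standard form: the decoder errs when more than half of the n received copies are flipped, plus, for even n, half the probability of an exact tie. The function name is illustrative.

from math import comb

def replication_error_prob(n, eps):
    """P(decoding error) for the n-fold replication code over a BSC with bit-flip
    probability eps, assuming majority decoding with ties broken by a fair coin."""
    p = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
            for i in range(n // 2 + 1, n + 1))
    if n % 2 == 0:                       # tie term: exactly n/2 of the n bits flipped
        p += 0.5 * comb(n, n // 2) * eps**(n // 2) * (1 - eps)**(n // 2)
    return p

# replication_error_prob(3, 0.1) and replication_error_prob(4, 0.1) both come out
# to about 0.028, illustrating the n = 2ℓ − 1 versus n = 2ℓ observation below.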
Figure 5-1 shows the probability of decoding error as a function of the replication factor,
n, for the replication code, computed using Equation (5.1). The y-axis is on a log scale, and
the probability of error is more or less a straight line with negative slope (if you ignore
the flat pieces), which means that the decoding error probability decreases exponentially
with the code rate. It is also worth noting that the error probability is the same when
n = 2ℓ as when n = 2ℓ − 1. The reason, of course, is that the decoder obtains no additional
information beyond what it already had from any 2ℓ − 1 of the received bits.
Despite the exponential reduction in the probability of decoding error as n increases,
the replication code is extremely inefficient in terms of the overhead it incurs, for a given
rate, 1/n. As such, it is used only in situations when bandwidth is plentiful and there isn’t
much computation time to implement a more complex decoder.
We now turn to developing more sophisticated codes. There are two big related ideas:
embedding messages into spaces in a way that achieves structural separation and parity (linear)
computations over the message bits.
Figure 5-2: Codewords separated by a Hamming distance of 2 can be used to detect single bit errors. The
codewords are shaded in each picture. The picture on the left is a (2,1) repetition code, which maps 1-bit
messages to 2-bit codewords. The code on the right is a (3,2) code, which maps 2-bit messages to 3-bit
codewords.
If a received codeword has more than one bit error, then we can make no guarantees (the method might
return the correct message word, but there is at least one instance where it will return the
wrong answer).
There are 2^n possible n-bit strings. Define the Hamming distance (HD) between two n-
bit words, w1 and w2 , as the number of bit positions in which the messages differ. Thus
0 ≤ HD(w1 , w2 ) ≤ n.
Suppose that HD(w1 , w2 ) = 1. Consider what happens if we transmit w1 and there’s
a single bit error that inconveniently occurs at the one bit position in which w1 and w2
differ. From the receiver’s point of view it just received w2 —the receiver can’t detect the
difference between receiving w1 with an unfortunately placed bit error and receiving w2 .
In this case, we cannot guarantee that all single bit errors will be corrected if we choose a
code where w1 and w2 are both valid codewords.
What happens if we increase the Hamming distance between any two valid codewords
to 2? More formally, let's restrict ourselves to only sending some subset S = {w1 , w2 , ..., ws }
of the 2^n possible words such that
HD(wi , w j ) ≥ 2 for all wi , w j ∈ S where i ≠ j.
Thus if the transmission of wi is corrupted by a single error, the result is not an element
of S and hence can be detected as an erroneous reception by the receiver, which knows
which messages are elements of S . A simple example is shown in Figure 5-2: 00 and 11 are
valid codewords, and the receptions 01 and 10 are surely erroneous.
We define the minimum Hamming distance of a code as the minimum Hamming distance
between any two codewords in the code. From the discussion above, it should be easy to
see what happens if we use a code whose minimum Hamming distance is D. We state the
property formally:
Theorem 5.1 A code with a minimum Hamming distance of D can detect any error pattern of
D − 1 or fewer errors. Moreover, there is at least one error pattern with D errors that cannot be
detected reliably.
Hence, if our goal is to detect errors, we can use an embedding of the set of messages we
wish to transmit into a bigger space, so that the minimum Hamming distance between any
two codewords in the bigger space is at least one more than the number of errors we wish
to detect. (We will discuss how to produce such embeddings in the subsequent sections.)
But what about the problem of correcting errors? Let’s go back to Figure 5-2, with S =
{00, 11}. Suppose the received sequence is 01. The receiver can tell that a single error has
occurred, but it can’t tell whether the correct data sent was 00 or 11—both those possible
patterns are equally likely under the BSC error model.
Ah, but we can extend our approach by producing an embedding with more space
between valid codewords! Suppose we limit our selection of messages in S even further,
as follows:
HD(wi , w j ) ≥ 3 for all wi , w j ∈ S where i ≠ j (5.3)
How does it help to increase the minimum Hamming distance to 3? Let’s define one
more piece of notation: let Ewi be the set of messages resulting from corrupting wi with a
single error. For example, E000 = {001, 010, 100}. Note that HD(wi , an element of Ewi ) = 1.
With a minimum Hamming distance of 3 between the valid codewords, observe that
there is no intersection between Ewi and Ew j when i ≠ j. Why is that? Suppose there
was a message wk that was in both Ewi and Ew j . We know that HD(wi , wk ) = 1 and
HD(w j , wk ) = 1, which implies that wi and w j differ in at most two bits and consequently
HD(wi , w j ) ≤ 2. (This result is an application of Theorem 5.2 below, which states that the
Hamming distance satisfies the triangle inequality.) That contradicts our specification that
their minimum Hamming distance be 3. So the Ewi don’t intersect.
So now we can correct single bit errors as well: the received message is either a member
of S (no errors), or is a member of some particular Ewi (one error), in which case the receiver
can deduce the original message was wi . Here’s another simple example: let S = {000, 111}.
So E000 = {001, 010, 100} and E111 = {110, 101, 011} (note that E000 doesn’t intersect E111 ).
Suppose the received sequence is 101. The receiver can tell there has been a single error
because 101 ∉ S. Moreover, it can deduce that the original message was most likely 111
because 101 ∈ E111 .
We can formally state some properties from the above discussion, and specify the error-
correcting power of a code whose minimum Hamming distance is D.
Theorem 5.2 The Hamming distance between n-bit words satisfies the triangle inequality. That
is, HD(x, y) + HD(y, z) ≥ HD(x, z).
Theorem 5.3 For a BSC error model with bit error probability < 1/2, the maximum likelihood de-
coding strategy is to map any received word to the valid codeword with smallest Hamming distance
from the received one (ties may be broken arbitrarily).
Theorem 5.4 A code with a minimum Hamming distance of D can correct any error pattern of
⌊(D − 1)/2⌋ or fewer errors. Moreover, there is at least one error pattern with ⌊(D − 1)/2⌋ + 1 errors
that cannot be corrected reliably.
Equation (5.3) gives us a way of determining if single-bit error correction can always
be performed on a proposed set S of transmission messages—we could write a program
to compute the Hamming distance between all pairs of messages in S and verify that the
minimum Hamming distance was at least 3. We can also easily generalize this idea to
check if a code can always correct more errors. And we can use the observations made
above to decode any received word: just find the closest valid codeword to the received
one, and then use the known mapping between each distinct message and the codeword
to produce the message. The message will be the correct one if the actual number of errors
is no larger than the number for which error correction is guaranteed. The check for the
nearest codeword may be exponential in the number of message bits we would like to
send, making it a reasonable approach only if the number of bits is small.
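Such a program is only a few lines of Python; the brute-force sketch below (names illustrative) is practical only for small codes, for exactly the reason just mentioned.

from itertools import combinations

def hamming_distance(u, v):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(u, v))

def min_hamming_distance(codewords):
    """Minimum Hamming distance over all pairs of codewords (O(|S|^2) comparisons)."""
    return min(hamming_distance(u, v) for u, v in combinations(codewords, 2))

# A code can correct all single-bit errors iff its minimum Hamming distance is
# at least 3, e.g. min_hamming_distance(["000", "111"]) == 3.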
But how do we go about finding a good embedding (i.e., good code words)? This task
isn’t straightforward, as the following example shows. Suppose we want to reliably send
4-bit messages so that the receiver can correct all single-bit errors in the received words.
Clearly, we need to find a set of messages S with 2^4 = 16 elements. What should the members
of S be?
The answer isn't obvious. Once again, we could write a program to search through
possible sets of n-bit messages until it finds a set of size 16 with a minimum Hamming
distance of 3. An exhaustive search shows that the minimum n is 7; one example of such
a set S is the sixteen codewords of the (7, 4) Hamming code constructed in Section 5.4.3.
But such exhaustive searches are impractical when we want to send even modestly
longer messages. So we’d like some constructive technique for building S . Much of the
theory and practice of coding is devoted to finding such constructions and developing
efficient encoding and decoding strategies.
Broadly speaking, there are two classes of code constructions, each with an enormous
number of example instances. The first is the class of algebraic block codes. The second
is the class of graphical codes. We will study two simple examples of linear block codes,
which themselves are a sub-class of algebraic block codes: rectangular parity codes and
Hamming codes. We also note that the replication code discussed in Section 5.2 is an
example of a linear block code.
In the next two chapters, we will study convolutional codes, a sub-class of graphical
codes.
The rate of an (n, k) block code is defined as k/n; the larger the rate, the less the redundancy overhead incurred
by the code.
A linear code (whether a block code or not) produces codewords from message bits by
restricting the algebraic operations to linear functions over the message bits. By linear, we
mean that any given bit in a valid codeword is computed as the weighted sum of one or
more original message bits.
Linear codes, as we will see, are both powerful and efficient to implement. They are
widely used in practice. In fact, all the codes we will study—including convolutional
codes—are linear, as are most of the codes widely used in practice. We already looked
at the properties of a simple linear block code: the replication code we discussed in Sec-
tion 5.2 is a linear block code with parameters (n, 1, n).
An important and popular class of linear codes are binary linear codes. The computations
in the case of a binary code use arithmetic modulo 2, which has a special name: algebra
in a Galois Field of order 2, also denoted F2 . A field must define rules for addition and
multiplication, and their inverses. Addition in F2 is according to the following rules: 0 +
0 = 1 + 1 = 0; 1 + 0 = 0 + 1 = 1. Multiplication is as usual: 0 · 0 = 0 · 1 = 1 · 0 = 0; 1 · 1 = 1.
We leave you to figure out the additive and multiplicative inverses of 0 and 1. Our focus
in this book will be on linear codes over F2 , but there are natural generalizations to fields
of higher order (in particular, Reed Solomon codes, which are over Galois Fields of order
2^q).
A linear code is characterized by the following theorem, which is both a necessary and
a sufficient condition for a code to be linear:
Theorem 5.5 A code is linear if, and only if, the sum of any two codewords is another codeword.
For example, the block code defined by codewords 000, 101, 011 is not a linear code,
because 101 + 011 = 110 is not a codeword. But if we add 110 to the set, we get a lin-
ear code because the sum of any two codewords is now another codeword. The code
000, 101, 011, 110 has a minimum Hamming distance of 2 (that is, the smallest Hamming
distance between any two codewords in this code is 2), and can be used to detect all single-bit errors
that occur during the transmission of a code word. You can also verify that the minimum
Hamming distance of this code is equal to the smallest number of 1’s in a non-zero code-
word. In fact, that’s a general property of all linear block codes, which we state formally
below:
Theorem 5.6 Define the weight of a codeword as the number of 1’s in the word. Then, the mini-
mum Hamming distance of a linear block code is equal to the weight of the non-zero codeword with
the smallest weight.
To see why, use the property that the sum of any two codewords must also be a code-
word, and that the Hamming distance between any two codewords is equal to the weight
of their sum (i.e., weight(u + v) = HD(u, v)). (In fact, the Hamming distance between any
two bit-strings of equal length is equal to the weight of their sum.) We leave the complete
proof of this theorem as a useful and instructive exercise for the reader.
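For small codes, Theorems 5.5 and 5.6 are also easy to check by brute force, as in the sketch below (codewords are equal-length bit strings; names illustrative).

def xor_words(u, v):
    """Bitwise sum over F2 of two equal-length bit strings."""
    return "".join(str(int(a) ^ int(b)) for a, b in zip(u, v))

def is_linear(codewords):
    """Theorem 5.5: every pairwise sum must itself be a codeword."""
    S = set(codewords)
    return all(xor_words(u, v) in S for u in S for v in S)

def min_weight(codewords):
    """Smallest number of 1's in a non-zero codeword (Theorem 5.6)."""
    return min(w.count("1") for w in codewords if "1" in w)

# is_linear(["000", "101", "011", "110"]) is True, and min_weight of that code
# is 2, matching its minimum Hamming distance of 2 noted above.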
The rest of this section shows how to construct linear block codes over F2 . For simplicity,
and without much loss of generality, we will focus on correcting single-bit errors, i.e.,
on single-error correction (SEC) codes. We will show two ways of building the set S of
transmission messages to have single-error correction capability, and will describe how
the receiver can perform error correction on the (possibly corrupted) received messages.
We will start with the rectangular parity code in Section 5.4.1, and then discuss the clev-
erer and more efficient Hamming code in Section 5.4.3.
Rectangular code construction: Suppose we want to send a k-bit message M. Shape the
k bits into a rectangular array with r rows and c columns, i.e., k = rc. For example, if
k = 8, the array could be 2 × 4 or 4 × 2 (or even 8 × 1 or 1 × 8, though those are a little less
interesting). Label each data bit with a subscript giving its row and column: the first bit
would be d11 , the last bit drc . See Figure 5-3.
Define p row(i) to be the parity of all the bits in row i of the array and let R be all the
row parity bits collected into a sequence:
R = [p row(1), p row(2), . . . , p row(r)].
Similarly, define p col( j) to be the parity of all the bits in column j of the array and let C be
all the column parity bits collected into a sequence:
C = [p col(1), p col(2), . . . , p col(c)].
The transmitted codeword consists of the k = rc data bits followed by the r + c parity bits in R and C.
Figure 5-3: A 2 × 4 arrangement for an 8-bit message with row and column parity.
(a)  0 1 1 0 | 0      (b)  1 0 0 1 | 1      (c)  0 1 1 1 | 1
     1 1 0 1 | 1           0 0 1 0 | 1           1 1 1 0 | 1
     -----------           -----------           -----------
     1 0 1 1               1 0 1 0               1 0 0 0
Figure 5-4: Example received 8-bit messages. Which, if any, have one error? Which, if any, have two?
Proof of single-error correction property: This rectangular code is an SEC code for all
values of r and c. We will show that it can correct all single bit errors by showing that its
minimum Hamming distance is 3 (i.e., the Hamming distance between any two codewords
is at least 3). Consider two different uncoded messages, Mi and M j . There are three cases
to discuss:
• If Mi and M j differ by a single bit, then the row and column parity calculations
involving that bit will result in different values. Thus, the corresponding codewords,
wi and w j , will differ by three bits: the different data bit, the different row parity bit,
and the different column parity bit. So in this case HD(wi , w j ) = 3.
• If Mi and M j differ by two bits, then either (1) the differing bits are in the same
row, in which case the row parity calculation is unchanged but two column parity
calculations will differ, (2) the differing bits are in the same column, in which case the
column parity calculation is unchanged but two row parity calculations will differ,
or (3) the differing bits are in different rows and columns, in which case there will be
two row and two column parity calculations that differ. So in this case HD(wi , w j ) ≥
4.
Hence we can conclude that HD(wi , w j ) ≥ 3 and our simple “rectangular” code will be
able to correct all single-bit errors.
Decoding the rectangular code: How can the receiver’s decoder correctly deduce M
from the received w, which may or may not have a single bit error? (If w has more than
one error, then the decoder does not have to produce a correct answer.)
Upon receiving a possibly corrupted w, the receiver checks the parity for the rows and
columns by computing the sum of the appropriate data bits and the corresponding parity
bit (all arithmetic in F2 ). This sum will be 1 if there is a parity error. Then:
• If there are no parity errors, then there has not been a single error, so the receiver can
use the data bits as-is for M. This situation is shown in Figure 5-4(a).
Figure 5-5: A codeword in systematic form for a block code. Any linear code can be transformed into an
equivalent systematic code.
• If there is single row or column parity error, then the corresponding parity bit is in
error. But the data bits are okay and can be used as-is for M. This situation is shown
in Figure 5-4(c), which has a parity error only in the fourth column.
• If there is one row and one column parity error, then the data bit in that row and
column has an error. The decoder repairs the error by flipping that data bit and then
uses the repaired data bits for M. This situation is shown in Figure 5-4(b), where
there are parity errors in the first row and fourth column indicating that d14 should
be flipped to be a 0.
• Other combinations of row and column parity errors indicate that multiple errors
have occurred. There’s no “right” action the receiver can undertake because it
doesn’t have sufficient information to determine which bits are in error. A common
approach is to use the data bits as-is for M. If they happen to be in error, that will be
detected by the error detection code (mentioned near the beginning of this chapter).
This recipe will produce the most likely message, M, from the received codeword if there
has been at most a single transmission error.
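This encode/decode recipe takes only a few lines of Python. The sketch below assumes the codeword layout described above (the r·c data bits in row-major order, followed by the r row parities R and the c column parities C); all names are illustrative.

def rect_encode(data, r, c):
    """Rectangular parity encoder: data is a list of r*c bits in row-major order."""
    rows = [data[i*c:(i+1)*c] for i in range(r)]
    R = [sum(row) % 2 for row in rows]                              # row parities
    C = [sum(rows[i][j] for i in range(r)) % 2 for j in range(c)]   # column parities
    return data + R + C

def rect_decode(word, r, c):
    """Correct at most one error and return the r*c data bits."""
    data, R, C = word[:r*c], word[r*c:r*c+r], word[r*c+r:]
    rows = [data[i*c:(i+1)*c] for i in range(r)]
    row_err = [(sum(rows[i]) + R[i]) % 2 for i in range(r)]
    col_err = [(sum(rows[i][j] for i in range(r)) + C[j]) % 2 for j in range(c)]
    if sum(row_err) == 1 and sum(col_err) == 1:
        i, j = row_err.index(1), col_err.index(1)
        data = list(data)
        data[i*c + j] ^= 1              # flip the single offending data bit
    return data                         # otherwise, use the data bits as-is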
In the rectangular code the number of parity bits grows at least as fast as √k (it should
be easy to verify that the smallest number of parity bits occurs when the number of rows,
r, and the number of columns, c, are equal). Given a fixed amount of communication
“bandwidth” or resource, we’re interested in devoting as much of it as possible to sending
message bits, not parity bits. Are there other SEC codes that have better code rates than
our simple rectangular code? A natural question to ask is: how little redundancy can we get
away with and still manage to correct errors?
The Hamming code uses a clever construction that exploits the intuition developed while
answering this question, which we take up next.
In a systematic code, every n-bit codeword can be represented as the original k-bit message followed by the
n − k parity bits (it actually doesn't matter how the original message bits and parity bits
are interspersed). Figure 5-5 shows a codeword in systematic form.
So, given a systematic code, how many parity bits do we absolutely need? We need
to choose n so that single error correction is possible. Since there are n − k parity bits,
each combination of these bits must represent some error condition that we must be able
to correct (or infer that there were no errors). There are 2^(n−k) possible distinct parity bit
combinations, which means that we can distinguish at most that many error conditions.
We therefore arrive at the constraint
n + 1 ≤ 2^(n−k) (5.4)
i.e., there have to be enough parity bits to distinguish all corrective actions that might
need to be taken (including no action). Given k, we can determine n − k, the number of
parity bits needed to satisfy this constraint. Taking the log (to base 2) of both sides, we
can see that the number of parity bits must grow at least logarithmically with the number
of message bits. Not all codes achieve this minimum (e.g., the rectangular code doesn’t),
but the Hamming code, which we describe next, does.
We also note that the reasoning here for an SEC code can be extended to determine a
lower bound on the number of parity bits needed to correct t > 1 errors.
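For a given k, the smallest number of parity bits permitted by constraint (5.4) is easy to compute; here is a small illustrative helper.

def min_sec_parity_bits(k):
    """Smallest p satisfying (k + p) + 1 <= 2**p, i.e. n + 1 <= 2**(n - k) with n = k + p."""
    p = 1
    while k + p + 1 > 2 ** p:
        p += 1
    return p

# min_sec_parity_bits(4) == 3: the (7, 4) Hamming code achieves this bound.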
Figure 5-6: Venn diagrams of Hamming codes showing which data bits are protected by each parity bit.
To perform error correction, the following three sums over the received data and parity bits are computed
by the receiver:
E1 = (d1 + d2 + d4 + p1 ) mod 2
E2 = (d1 + d3 + d4 + p2 ) mod 2
E3 = (d2 + d3 + d4 + p3 ) mod 2
where each Ei is called a syndrome bit because it helps the receiver diagnose the “illness”
(errors) in the received data. For each combination of syndrome bits, we can look for
the bits in each codeword that appear in all the Ei computations that produced 1; these
bits are potential candidates for having an error since any of them could have caused the
observed parity errors. Now eliminate from the candidates those bits that appear in any Ei
computations that produced 0 since those calculations prove those bits didn’t have errors.
We’ll be left with either no bits (no errors occurred) or one bit (the bit with the single error).
For example, if E1 = 1, E2 = 0 and E3 = 1, we notice that bits d2 and d4 both appear
in the computations for E1 and E3 . However, d4 appears in the computation for E2 and
should be eliminated, leaving d2 as the sole candidate as the bit with the error.
Another example: suppose E1 = 1, E2 = 0 and E3 = 0. Any of the bits appearing in the
computation for E1 could have caused the observed parity error. Eliminating those that
appear in the computations for E2 and E3 , we’re left with p1 , which must be the bit with
the error.
Applying this reasoning to each possible combination of parity errors, we can make a
table that shows the appropriate corrective action for each combination of the syndrome
bits:
E3 E2 E1 Corrective Action
000 no errors
001 p1 has an error, flip to correct
010 p2 has an error, flip to correct
011 d1 has an error, flip to correct
100 p3 has an error, flip to correct
101 d2 has an error, flip to correct
110 d3 has an error, flip to correct
111 d4 has an error, flip to correct
The following indexing of the codeword bits, starting from 1, makes the corrective action easy to determine:
index          1    2    3    4    5    6    7
binary index  001  010  011  100  101  110  111
(7,4) code    p1   p2   d1   p3   d2   d3   d4
This table was constructed by first allocating the parity bits to indices that are powers
of two (e.g., 1, 2, 4, . . . ). Then the data bits are allocated to the so-far unassigned indices,
starting with the smallest index. It’s easy to see how to extend this construction to any
number of data bits, remembering to add additional parity bits at indices that are a power
of two.
Allocating the data bits to parity computations is accomplished by looking at their re-
spective indices in the table above. Note that we’re talking about the index in the table, not
the subscript of the bit. Specifically, di is included in the computation of p j if (and only if)
the logical AND of binary index(di ) and binary index(p j ) is non-zero. Put another way, di
is included in the computation of p j if, and only if, index(p j ) contributes to index(di ) when
writing the latter as sums of powers of 2.
So the computation of p1 (with an index of 1) includes all data bits with odd indices: d1 ,
d2 and d4 . And the computation of p2 (with an index of 2) includes d1 , d3 and d4 . Finally,
the computation of p3 (with an index of 4) includes d2 , d3 and d4 . You should verify that
these calculations match the Ei equations given above.
If the parity/syndrome computations are constructed this way, it turns out that E3 E2 E1 ,
treated as a binary number, gives the index of the bit that should be corrected. For exam-
ple, if E3 E2 E1 = 101, then we should correct the message bit with index 5, i.e., d2 . This
corrective action is exactly the one described in the earlier table we built by inspection.
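A sketch of this syndrome decoder in Python, using the bit ordering p1 p2 d1 p3 d2 d3 d4 from the table above (the function name and list representation are illustrative):

def hamming74_decode(w):
    """w = [p1, p2, d1, p3, d2, d3, d4]; returns the corrected [d1, d2, d3, d4]."""
    p1, p2, d1, p3, d2, d3, d4 = w
    E1 = (d1 + d2 + d4 + p1) % 2
    E2 = (d1 + d3 + d4 + p2) % 2
    E3 = (d2 + d3 + d4 + p3) % 2
    syndrome = 4 * E3 + 2 * E2 + E1     # 1-based index of the bit in error; 0 if none
    if syndrome != 0:
        w = list(w)
        w[syndrome - 1] ^= 1            # flip the offending bit (data or parity)
    return [w[2], w[4], w[5], w[6]]     # the data bits d1, d2, d3, d4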
The Hamming code’s syndrome calculation and subsequent corrective action can be ef-
ficiently implemented using digital logic and so these codes are widely used in contexts
where single error correction needs to be fast, e.g., correction of memory errors when fetch-
ing data from DRAM.
Figure 5-7: Dividing a long message into multiple SEC-protected blocks of k bits each, adding parity bits
to each constituent block. The red vertical rectangles refer to bit errors.
Figure 5-8: Interleaving can help recover from burst errors: code each block row-wise with an SEC, but
transmit them in interleaved fashion in columnar order. As long as a set of burst errors corrupts some set
of kth bits, the receiver can recover from all the errors in the burst.
Dividing a long message into separately SEC-coded blocks, as in Figure 5-7, does not by itself cope well with a
channel experiencing burst errors. The reason is shown in Figure 5-8 (left), where each
block of the message is protected by its SEC parity bits. The different blocks are shown as
different rows. When a burst error occurs, multiple bits in an SEC block are corrupted, and
the SEC can’t recover from them.
Interleaving is a commonly used technique to recover from burst errors on a channel
even when the individual blocks are protected with a code that, on the face of it, is not
suited for burst errors. The idea is simple: code the blocks as before, but transmit them in
a “columnar” fashion, as shown in Figure 5-8 (right). That is, send the first bit of block 1,
then the first bit of block 2, and so on until all the first bits of each block in a set of some
predefined size are sent. Then, send the second bits of each block in sequence, then the
third bits, and so on.
What happens on a burst error? Chances are that it corrupts a set of “first” bits, or a
set of “second” bits, or a set of “third” bits, etc., because those are the bits sent in order on
the channel. As long as only a set of kth bits are corrupted, the receiver can correct all the
errors. The reason is that each coded block will now have at most one error. Thus, SEC
codes are a useful primitive to correct against burst errors, in concert with interleaving.
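A minimal sketch of this interleaving in Python (the blocks are equal-length lists of coded bits; names illustrative):

def interleave(blocks):
    """Transmit column-wise: first bit of every block, then the second bits, and so on."""
    return [block[j] for j in range(len(blocks[0])) for block in blocks]

def deinterleave(stream, num_blocks):
    """Invert interleave(): reassemble the original row-wise blocks."""
    return [stream[i::num_blocks] for i in range(num_blocks)]

# A burst of up to num_blocks consecutive errors lands in different blocks, so
# each block sees at most one error and its SEC code can repair it.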
Acknowledgments
Many thanks to Katrina LaCurts for carefully reading these notes and making several use-
ful comments.
Exercises
D0 D1 D2 D3 D4 | P0
D5 D6 D7 D8 D9 | P1
D10 D11 D12 D13 D14 | P2
-------------------------
P3 P4 P5 P6 P7 |
Here, D0–D14 are data bits, P0–P2 are row parity bits and P3–P7 are column parity
bits. What are n, k, and d for this linear code?
3. Consider a rectangular parity code as described in Section 5.4.1. Ben Bitdiddle would
like to use this code at a variety of different code rates and experiment with them on
some channel.
(a) Is it possible to obtain a rate lower than 1/3 with this code? Explain your an-
swer.
(b) Suppose he is interested in code rates like 1/2, 2/3, 3/4, etc.; i.e., in general a
rate of (n − 1)/n, for integer n > 1. Is it always possible to pick the parameters of
the code (i.e., the block size and the number of rows and columns over which to
construct the parity bits) so that any such code rate is achievable? Explain your
answer.
4. Two-Bit Communications (TBC), a slightly suspect network provider, uses the fol-
lowing linear block code over its channels. All arithmetic is in F2 .
P0 = D0 , P1 = (D0 + D1 ), P2 = D1 .
5. Pairwise Communications has developed a linear block code over F2 with three data
and three parity bits, which it calls the pairwise code:
P1 = D1 + D2 (Each Di is a data bit; each Pi is a parity bit.)
P2 = D2 + D3
P3 = D3 + D1
(a) Fill in the values of the following three attributes of this code:
(i) Code rate =
(ii) Number of 1s in a minimum-weight non-zero codeword =
(iii) Minimum Hamming distance of the code =
6. Consider the same “pairwise code” as in the previous problem. The receiver com-
putes three syndrome bits from the (possibly corrupted) received data and parity
bits: E1 = D1 + D2 + P1 , E2 = D2 + D3 + P2 , and E3 = D3 + D1 + P3 . The receiver
performs maximum likelihood decoding using the syndrome bits. For the combi-
nations of syndrome bits in the table below, state what the maximum-likelihood de-
coder believes has occurred: no errors, a single error in a specific bit (state which one),
or multiple errors.
E3 E2 E1 Error pattern [No errors / Error in bit ... (specify bit) / Multiple errors]
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
7. Alyssa P. Hacker extends the aforementioned pairwise code by adding an overall parity
bit. That is, she computes P4 = ∑_{i=1}^{3} (Di + Pi ), and appends P4 to each original codeword
to produce the new set of codewords. What improvement in error correction
or detection capabilities, if any, does Alyssa’s extended code show over Pairwise’s
original code? Explain your answer.
8. For each of the sets of codewords below, determine whether the code is a linear block
code over F2 or not. Also give the rate of each code.
(a) {000,001,010,011}.
(b) {000, 011, 110, 101}.
(c) {111, 100, 001, 010}.
(d) {00000, 01111, 10100, 11011}.
(e) {00000}.
9. For any linear block code over F2 with minimum Hamming distance at least 2t + 1
between codewords, show that:
2^(n−k) ≥ 1 + C(n, 1) + C(n, 2) + · · · + C(n, t),
where C(n, i) denotes the number of ways of choosing i objects out of n.
10. For each (n, k, d) combination below, state whether a linear block code with those
parameters exists or not. Please provide a brief explanation for each case: if such a
code exists, give an example; if not, you may rely on a suitable necessary condition.
11. Using the Hamming code construction for the (7, 4) code, construct the parity equa-
tions for the (15, 11) code. How many equations does this code have? How many
message bits contribute to each parity bit?
12. Prove Theorems 5.2 and 5.3. (Don’t worry too much if you can’t prove the latter; we
will give the proof when we discuss convolutional codes in Lecture 8.)
13. The weight of a codeword in a linear block code over F2 is the number of 1’s in
the word. Show that any linear block code must either: (1) have only even weight
codewords, or (2) have an equal number of even and odd weight codewords.
Hint: Proof by contradiction.
14. There are N people in a room, each wearing a hat colored red or blue, standing in a
line in order of increasing height. Each person can see only the hats of the people in
front, and does not know the color of his or her own hat. They play a game as a team,
whose rules are simple. Each person gets to say one word: “red” or “blue”. If the
word they say correctly guesses the color of their hat, the team gets 1 point; if they
guess wrong, 0 points. Before the game begins, they can get together to agree on a
protocol (i.e., what word they will say under what conditions). Once they determine
the protocol, they stop talking, form the line, and are given their hats at random.
Can you develop a protocol that will maximize their score? What score does your
protocol achieve?
CHAPTER 6
Convolutional Codes: Construction and Encoding
This chapter introduces a widely used class of codes, called convolutional codes, which
are used in a variety of systems including today’s popular wireless standards (such as
802.11) and in satellite communications. They are also used as a building block in more
powerful modern codes, such as turbo codes, which are used in wide-area cellular wireless
network standards such as 3G, LTE, and 4G. Convolutional codes are beautiful because
they are intuitive, one can understand them in many different ways, and there is a way
to decode them so as to recover the most likely message from among the set of all possible
transmitted messages. This chapter discusses the encoding of convolutional codes; the
next one discusses how to decode convolutional codes efficiently.
Like the block codes discussed in the previous chapter, convolutional codes involve
the computation of parity bits from message bits and their transmission, and they are also
linear codes. Unlike block codes in systematic form, however, the sender does not send the
message bits followed by (or interspersed with) the parity bits; in a convolutional code, the
sender sends only the parity bits. These codes were invented by Peter Elias ’44, an MIT EECS
faculty member, in the mid-1950s. For several years, it was not known just how powerful
these codes are and how best to decode them. The answers to these questions started
emerging in the 1960s, with the work of people like Andrew Viterbi ’57, G. David Forney
(SM ’65, Sc.D. ’67, and MIT EECS faculty member), Jim Omura SB ’63, and many others.
Figure 6-1: An example of a convolutional code with two parity bits per message bit and a constraint length
(shown in the rectangular window) of three. I.e., r = 2, K = 3.
A larger constraint length implies a greater resilience to bit errors. The trade-off, though, is that it will take consider-
ably longer to decode codes of long constraint length (we will see in the next chapter that
the complexity of decoding is exponential in the constraint length), so one cannot increase
the constraint length arbitrarily and expect fast decoding.
If a convolutional code produces r parity bits per window and slides the window for-
ward by one bit at a time, its rate (when calculated over long messages) is 1/r. The greater
the value of r, the higher the resilience to bit errors, but the trade-off is that a proportionally
higher amount of communication bandwidth is devoted to coding overhead. In practice,
we would like to pick r and the constraint length to be as small as possible while providing
a low enough resulting probability of a bit error.
In 6.02, we will use K (upper case) to refer to the constraint length, a somewhat unfortu-
nate choice because we have used k (lower case) in previous lectures to refer to the number
of message bits that get encoded to produce coded bits. Although “L” might be a better
way to refer to the constraint length, we’ll use K because many papers and documents in
the field use K (in fact, many papers use k in lower case, which is especially confusing).
Because we will rarely refer to a “block” of size k while talking about convolutional codes,
we hope that this notation won’t cause confusion.
Armed with this notation, we can describe the encoding process succinctly. The encoder
looks at K bits at a time and produces r parity bits according to carefully chosen functions
that operate over various subsets of the K bits.1 One example is shown in Figure 6-1, which
shows a scheme with K = 3 and r = 2 (the rate of this code, 1/r = 1/2). The encoder spits
out r bits, which are sent sequentially, slides the window by 1 to the right, and then repeats
the process. That’s essentially it.
At the transmitter, the two principal remaining details that we must describe are:
1. What are good parity functions and how can we represent them conveniently?
2. How can we implement the encoder efficiently?
The rest of this lecture will discuss these issues, and also explain why these codes are
called “convolutional”.
1 By convention, we will assume that each message has K − 1 "0" bits padded in front, so that the initial
conditions work out properly.
In general, one can view each parity equation as being produced by combining the mes-
sage bits, X, and a generator polynomial, g. In the first example above, the generator poly-
nomial coefficients are (1, 1, 1) and (1, 1, 0), while in the second, they are (1, 1, 1), (1, 1, 0),
and (1, 0, 1).
We denote by gi the K-element generator polynomial for parity bit pi . We can then write
pi [n] as follows:
pi [n] = ( ∑_{j=0}^{K−1} gi [ j] x[n − j] ) mod 2.     (6.3)
The form of the above equation is a convolution of g and x—hence the term “convolu-
tional code”. The number of generator polynomials is equal to the number of generated
parity bits, r, in each sliding window. The rate of the code is 1/r if the encoder slides the
window one bit at a time.
6.2.1 An Example
Let’s consider the two generator polynomials of Equations 6.1 (Figure 6-1). Here, the gen-
erator polynomials are
g0 = 1, 1, 1
g1 = 1, 1, 0 (6.4)
If the message sequence is X = [1, 0, 1, 1, . . .] (as usual, x[n] = 0 ∀n < 0), then the parity
bits work out as follows:
p0 [0] = (1 + 0 + 0) = 1
p1 [0] = (1 + 0) = 1
p0 [1] = (0 + 1 + 0) = 1
p1 [1] = (0 + 1) = 1
p0 [2] = (1 + 0 + 1) = 0
p1 [2] = (1 + 0) = 1
p0 [3] = (1 + 1 + 0) = 0
p1 [3] = (1 + 1) = 0. (6.5)
Therefore, the bits transmitted over the channel are [1, 1, 1, 1, 0, 1, 0, 0, . . .].
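The encoding procedure of Equation (6.3) is itself only a few lines of Python. The sketch below (illustrative names) pads K − 1 zeros in front of the message, as assumed above; with the generators of Equation (6.4) and the message [1, 0, 1, 1] it reproduces the parity stream just computed.

def conv_encode(msg, generators):
    """Convolutional encoder: r = len(generators) parity bits per message bit.
    Each generator is a list of K coefficients; x[n] = 0 for n < 0 (Equation 6.3)."""
    K = len(generators[0])
    x = [0] * (K - 1) + list(msg)       # K - 1 zero bits padded in front
    out = []
    for n in range(len(msg)):
        window = x[n:n + K][::-1]       # x[n], x[n-1], ..., x[n-K+1]
        for g in generators:
            out.append(sum(gj * xj for gj, xj in zip(g, window)) % 2)
    return out

# conv_encode([1, 0, 1, 1], [[1, 1, 1], [1, 1, 0]]) -> [1, 1, 1, 1, 0, 1, 0, 0]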
There are many possible generator polynomials, but understanding how to construct good
ones is outside the scope of 6.02. Some examples (found by J. Busgang) are shown in
Table 6-1.
Constraint length g0 g1
3 110 111
4 1101 1110
5 11010 11101
6 110101 111011
7 110101 110101
8 110111 1110011
9 110111 111001101
10 110111001 1110011001
Table 6-1: Examples of generator polynomials for rate 1/2 convolutional codes with different constraint
lengths.
Figure 6-2: Block diagram view of convolutional coding with shift registers.
We can think of the K − 1 most recent message bits held in the shift registers
as the state of the encoder. This block diagram takes message bits in one bit at a time, and
spits out parity bits (two per input bit, in this case).
Input message bits, x[n], arrive from the left. (These bits arrive after being processed
by the receiver’s sampling and demapping procedures). The block diagram calculates the
parity bits using the incoming bits and the state of the encoder (the K − 1 previous bits;
two in this example). After the r parity bits are produced, the state of the encoder shifts
by 1, with x[n] taking the place of x[n − 1], x[n − 1] taking the place of x[n − 2], and so on,
with x[n − K + 1] being discarded. This block diagram is directly amenable to a hardware
implementation using shift registers.
An important point to note: the state machine for a convolutional code is identical for
all codes with a given constraint length, K, and the number of states is always 2^(K−1). Only
the pi labels change depending on the number of generator polynomials and the values of
their coefficients. Each state is labeled with x[n − 1]x[n − 2] . . . x[n − K + 1]. Each arc is
labeled with x[n]/p0 p1 . . .. In this example, if the message is 101100, the transmitted bits
are 11 11 01 00 01 10.
This state-machine view is an elegant way to explain what the transmitter does, and also
what the receiver ought to do to decode the message, as we now explain. The transmitter
begins in the initial state (labeled “STARTING STATE” in Figure 6-3) and processes the
message one bit at a time. For each message bit, it makes the state transition from the
current state to the new one depending on the value of the input bit, and sends the parity
bits that are on the corresponding arc.
The receiver, of course, does not have direct knowledge of the transmitter’s state tran-
sitions. It only sees the received sequence of parity bits, with possible bit errors. Its task is
to determine the best possible sequence of transmitter states that could have produced
the parity bit sequence. This task is called decoding, which we introduce next, and study
in more detail in the next chapter.
Figure 6-4: When the probability of bit error is less than 1/2, maximum-likelihood decoding boils down
to finding the message whose parity bit sequence, when transmitted, has the smallest Hamming distance
to the received sequence. Ties may be broken arbitrarily. Unfortunately, for an N-bit transmit sequence,
there are 2^N possibilities, which makes it hugely intractable to simply go through in sequence because
of the sheer number. For instance, when N = 256 bits (a really small packet), the number of possibilities
rivals the number of atoms in the universe!
Consider a simple example: suppose the channel’s bit-error probability is 0.001, and the receiver digitizes the sequence of analog samples into the bits 1101001. Is the sender more likely to have sent 1100111 or 1100001? The first has a Hamming distance of 3, and the probability of receiving that sequence is (0.999)^4 (0.001)^3 ≈ 9.9 × 10^{−10}. The second choice has a Hamming distance of 1 and a probability of (0.999)^6 (0.001)^1 ≈ 9.9 × 10^{−4}, which is six orders of magnitude higher and is overwhelmingly more likely.
Thus, the most likely sequence of parity bits that was transmitted must be the one with
the smallest Hamming distance from the sequence of parity bits received. Given a choice
of possible transmitted messages, the decoder should pick the one with the smallest such
Hamming distance.
Determining the nearest valid codeword to a received word is easier said than done for
convolutional codes. For example, see Figure 6-4, which shows a convolutional code with
K = 3 and rate 1/2. If the receiver gets 111011000110, then some errors have occurred,
because no valid transmitted sequence matches the received one. The last column in the
example shows d, the Hamming distance to all the possible transmitted sequences, with
the smallest one circled.
Figure 6-5: The trellis is a convenient way of viewing the decoding task and understanding the time evolution of the state machine.
To determine the most-likely 4-bit message that led to the parity
sequence received, the receiver could look for the message whose transmitted parity bits
have smallest Hamming distance from the received bits. (If there are ties for the smallest,
we can break them arbitrarily, because all these possibilities have the same resulting post-
coded BER.)
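Written out directly, this search is just a loop over every candidate message. The sketch below is my own (it reuses the encoder from the earlier sketch inline, and is only workable for very small messages):

from itertools import product

def conv_encode(msg, gens):
    # Same mod-2 convolution as the earlier sketch (Equation 6.3).
    K = len(gens[0])
    return [sum(g[j] * msg[n - j] for j in range(K) if n - j >= 0) % 2
            for n in range(len(msg)) for g in gens]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def brute_force_decode(received, gens, msg_len):
    """Try all 2**msg_len messages; keep the one whose codeword is closest."""
    best_msg, best_dist = None, None
    for msg in product((0, 1), repeat=msg_len):
        d = hamming(conv_encode(list(msg), gens), received)
        if best_dist is None or d < best_dist:      # ties broken arbitrarily
            best_msg, best_dist = msg, d
    return best_msg, best_dist

# A made-up 8-bit received sequence for a 4-bit message with the (111, 110) code.
print(brute_force_decode([1, 1, 1, 1, 0, 0, 0, 0], [(1, 1, 1), (1, 1, 0)], 4))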
The straightforward approach of simply going through the list of possible transmit sequences and comparing Hamming distances is horribly intractable. The reason is that a transmit sequence of N bits has 2^N possible strings, a number that is simply too large for even small values of N, like 256 bits. We need a better plan for the receiver to navigate this unbelievably large space of possibilities and quickly determine the valid message with
smallest Hamming distance. We will study a powerful and widely applicable method for
solving this problem, called Viterbi decoding, in the next lecture. This decoding method
uses a special structure called the trellis, which we describe next.
The trellis (Figure 6-5) unrolls the state machine in time, with one column of states per message bit; the figure also shows the links between states that are traversed in the trellis given the message 101100.
We can now think about what the decoder needs to do in terms of this trellis. It gets a
sequence of parity bits, and needs to determine the best path through the trellis—that is,
the sequence of states in the trellis that can explain the observed, and possibly corrupted,
sequence of received parity bits.
The Viterbi decoder finds a maximum-likelihood path through the trellis. We will
study it in the next chapter.
Problems and exercises on convolutional coding are at the end of the next chapter, after we
discuss the decoding process.
CHAPTER 7
Viterbi Decoding of Convolutional Codes
This chapter describes an elegant and efficient method to decode convolutional codes,
whose construction and encoding we described in the previous chapter. This decoding
method avoids explicitly enumerating the 2^N possible combinations of N-bit parity bit
sequences. This method was invented by Andrew Viterbi ’57 and bears his name.
Figure 7-1: The trellis is a convenient way of viewing the decoding task and understanding the time evo-
lution of the state machine.
The trellis introduced in the previous chapter is central to understanding the decoding procedure for convolutional codes (Figure 7-1). Suppose we
have the entire trellis in front of us for a code, and now receive a sequence of digitized
bits (or voltage samples). If there are no errors (i.e., the signal-to-noise ratio, SNR, is high
enough), then there will be some path through the states of the trellis that would exactly
match the received sequence. That path (specifically, the concatenation of the encoding of
each state along the path) corresponds to the transmitted parity bits. From there, getting
to the original message is easy because the top arc emanating from each node in the trellis
corresponds to a “0” bit and the bottom arc corresponds to a “1” bit.
When there are bit errors, what can we do? As explained earlier, finding the most likely
transmitted message sequence is appealing because it minimizes the BER. If we can come
up with a way to capture the errors introduced by going from one state to the next, then
we can accumulate those errors along a path and come up with an estimate of the total
number of errors along the path. Then, the path with the smallest such accumulation of
errors is the path we want, and the transmitted message sequence can be easily determined
by the concatenation of states explained above.
To solve this problem, we need a way to capture any errors that occur in going through
the states of the trellis, and a way to navigate the trellis without actually materializing the
entire trellis (i.e., without enumerating all possible paths through it and then finding the
one with smallest accumulated error). The Viterbi decoder solves these problems. It is
an example of a more general approach to solving optimization problems, called dynamic
programming. Later in the course, we will apply similar concepts in network routing, an
unrelated problem, to find good paths in multi-hop networks.
Figure 7-2: The branch metric for hard decision decoding. In this example, the receiver gets the parity bits
00.
Among all the possible states at time step i, the most likely state is the one with the
smallest path metric. If there is more than one such state, they are all equally good possi-
bilities.
Now, how do we determine the path metric at time step i + 1, PM[s, i + 1], for each state
s? To answer this question, first observe that if the transmitter is at state s at time step i + 1,
then it must have been in only one of two possible states at time step i. These two predecessor
states, labeled α and β , are always the same for a given state. In fact, they depend only
on the constraint length of the code and not on the parity functions. Figure 7-2 shows the
predecessor states for each state (the other end of each arrow). For instance, for state 00,
α = 00 and β = 01; for state 01, α = 10 and β = 11.
Any message sequence that leaves the transmitter in state s at time i + 1 must have left
the transmitter in state α or state β at time i. For example, in Figure 7-2, to arrive in state
’01’ at time i + 1, one of the following two properties must hold:
1. The transmitter was in state ‘10’ at time i and the ith message bit was a 0. If that is
the case, then the transmitter sent ‘11’ as the parity bits and there were two bit errors,
because we received the bits 00. Then, the path metric of the new state, PM[‘01’, i + 1]
is equal to PM[‘10’, i] + 2, because the new state is ‘01’ and the corresponding path
metric is larger by 2 because there are 2 errors.
2. The other (mutually exclusive) possibility is that the transmitter was in state ‘11’ at time i and the ith message bit was a 0. If that is the case, then the transmitter sent ‘01’ as the parity bits and there was one bit error, because we received 00. The path metric of the new state, PM[‘01’, i + 1], is equal to PM[‘11’, i] + 1.
Formalizing the above intuition, the path metric update works out to

PM[s, i + 1] = min(PM[α, i] + BM[α → s], PM[β, i] + BM[β → s]),

where BM[α → s] is the branch metric of the transition from state α to state s: for hard decision decoding, the Hamming distance between the received parity bits and the parity bits the transmitter would have sent on that transition.
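A compact hard-decision Viterbi decoder can be written directly from this recursion. The following is my own sketch (the names and structure are mine, not the 6.02 reference code); it assumes the encoder starts in the all-zeroes state, as in the examples above.

def viterbi_decode(received, generators):
    K = len(generators[0])
    nstates = 2 ** (K - 1)

    def step(state, bit):
        """Return (next_state, expected parity bits) for input `bit`."""
        prev = [(state >> (K - 2 - i)) & 1 for i in range(K - 1)]
        window = [bit] + prev
        parity = [sum(g[j] * window[j] for j in range(K)) % 2 for g in generators]
        return (bit << (K - 2)) | (state >> 1), parity

    r = len(generators)
    nsteps = len(received) // r
    INF = float("inf")
    pm = [0] + [INF] * (nstates - 1)            # start in the all-zeroes state
    history = []                                 # (predecessor, input bit) per step
    for i in range(nsteps):
        rx = received[i * r:(i + 1) * r]
        new_pm = [INF] * nstates
        back = [None] * nstates
        for s in range(nstates):
            for bit in (0, 1):
                nxt, parity = step(s, bit)
                bm = sum(p != x for p, x in zip(parity, rx))   # branch metric
                if pm[s] + bm < new_pm[nxt]:                   # PM recursion above
                    new_pm[nxt] = pm[s] + bm
                    back[nxt] = (s, bit)
        pm, history = new_pm, history + [back]

    # Backward pass: start from the state with the smallest path metric and
    # follow the remembered arcs backwards, then reverse the collected bits.
    state = pm.index(min(pm))
    bits = []
    for back in reversed(history):
        state, bit = back[state]
        bits.append(bit)
    return list(reversed(bits)), min(pm)

gens = [(1, 1, 1), (1, 1, 0)]
print(viterbi_decode([1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0], gens))
# -> ([1, 0, 1, 1, 0, 0], 0): an error-free reception of the message 101100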
Figure 7-3 shows the decoding algorithm in action from one time step to the next. This
example shows a received bit sequence of 11 10 11 00 01 10 and how the receiver processes
it. The fourth picture from the top shows all four states with the same path metric. At this stage, any of these four states and the paths leading up to them are equally likely candidates for the transmitted bit sequence (they all have a Hamming distance of 2). The bottom-most picture shows
the same situation with only the survivor paths shown. A survivor path is one that has
a chance of being the maximum-likelihood path; there are many other paths that can be
pruned away because there is no way in which they can be most likely. The reason why
the Viterbi decoder is practical is that the number of survivor paths is much, much smaller
than the total number of paths in the trellis.
Another important point about the Viterbi decoder is that future knowledge will help it
break any ties, and in fact may even cause paths that were considered “most likely” at a
certain time step to change. Figure 7-4 continues the example in Figure 7-3, proceeding un-
til all the received parity bits are decoded to produce the most likely transmitted message,
which has two bit errors.
In soft decision decoding, the receiver works directly with the received voltage samples rather than first digitizing them into bits. The branch metric is then the squared Euclidean distance between the received voltage samples and the expected parity bits,

BM_soft[u, v] = \sum_{i=1}^{p} (u_i − v_i)^2,

where u = u_1, u_2, . . . , u_p are the expected p parity bits (each a 0 or 1) and v = v_1, v_2, . . . , v_p are the corresponding received voltage samples. Figure 7-5 shows the soft decision branch metric for p = 2 when u is 00.
With soft decision decoding, the decoding algorithm is identical to the one previously described for hard decision decoding, except that the branch metric is no longer an integer Hamming distance but a non-negative real number (if the voltages are all between 0 and 1, then each squared term of the branch metric is between 0 and 1, and the branch metric itself is between 0 and p).
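As a small illustration (my own sketch, assuming the squared-distance metric shown in Figure 7-5), the soft branch metric can replace the Hamming-distance branch metric in the hard-decision Viterbi sketch above without any other changes:

def soft_branch_metric(expected_bits, received_volts):
    # Squared Euclidean distance between the received voltage samples and
    # the expected parity bits (each expected bit is 0 or 1).
    return sum((u - v) ** 2 for u, v in zip(expected_bits, received_volts))

print(soft_branch_metric([0, 0], [0.2, 0.4]))   # ≈ 0.2 (= 0.2**2 + 0.4**2)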
It turns out that this soft decision metric is closely related to the probability of the decoding
being correct when the channel experiences additive Gaussian noise. First, let’s look at the
simple case of 1 parity bit (the more general case is a straightforward extension). Suppose
the receiver gets the ith parity bit as v_i volts. (In hard decision decoding, it would decode v_i as 0 or 1 depending on whether v_i was smaller or larger than 0.5.) What is the probability that v_i would have been received given that bit u_i (either 0 or 1) was sent? With zero-mean additive Gaussian noise, the PDF of this event is given by

f(v_i \mid u_i) = \frac{e^{-d_i^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}},   (7.3)

where d_i = v_i − u_i.
3. What is the reduction in the bit error rate, and how does that compare with other
codes?
Why do we define a “free distance”, rather than just call it the Hamming distance, if it is defined the same way? The reason is that any code with a minimum Hamming distance D (whether linear or not) can correct all patterns of up to ⌊(D − 1)/2⌋ errors. If we just applied the same notion to convolutional codes, we would conclude that we can correct all single-bit errors in the example given, or in general, that we can correct some fixed number of errors.
Now, convolutional coding produces an unbounded bit stream; these codes are markedly distinct from block codes in this regard. As a result, the ⌊(D − 1)/2⌋ formula is not too instructive because it doesn’t capture the true error correction properties of the code. A convolutional code (with Viterbi decoding) can correct t = ⌊(D − 1)/2⌋ errors as long as these errors are “far enough apart”. So the notion we use is the free distance because, in a sense, errors can keep occurring, and as long as no more than t of them occur in a closely spaced burst, the decoder can correct them all.
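The free distance itself can be computed mechanically: it is the smallest total parity weight of any path that leaves the 00 state and later returns to it. The sketch below is my own (it runs Dijkstra’s algorithm over the encoder state machine) and reproduces the free distances quoted in this chapter.

import heapq

def free_distance(generators):
    K = len(generators[0])

    def step(state, bit):
        prev = [(state >> (K - 2 - i)) & 1 for i in range(K - 1)]
        window = [bit] + prev
        weight = sum(sum(g[j] * window[j] for j in range(K)) % 2 for g in generators)
        return (bit << (K - 2)) | (state >> 1), weight

    # Force a departure from state 0 with an input 1, then find the cheapest
    # (minimum parity weight) way back to state 0.
    start, w0 = step(0, 1)
    best = {start: w0}
    heap = [(w0, start)]
    while heap:
        d, s = heapq.heappop(heap)
        if s == 0:
            return d
        if d > best.get(s, float("inf")):
            continue
        for bit in (0, 1):
            nxt, w = step(s, bit)
            if d + w < best.get(nxt, float("inf")):
                best[nxt] = d + w
                heapq.heappush(heap, (d + w, nxt))
    return None

print(free_distance([(1, 1, 1), (1, 1, 0)]))        # 4, as in Figure 7-6
print(free_distance([(1, 1, 1), (1, 0, 1)]))        # 5, the (111, 101) code below
print(free_distance([(1, 1, 1, 0), (1, 1, 0, 1)]))  # 6, the (1110, 1101) code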
2. A convolutional code with generators (111, 110) and constraint length K = 3, shown in the picture as “K = 3”.
Some observations:
1. The probability of error is roughly the same for the rectangular parity code and hard
decision decoding with K = 3. The free distance of the K = 3 convolutional code is 4,
which means it can correct one bit error over blocks that are similar in length to the
rectangular parity code we are comparing with. Intuitively, both schemes essentially
produce parity bits that are built from similar amounts of history. In the rectangular
parity case, the row parity bit comes from two successive message bits, while the
column parity comes from two message bits with one skipped in between. But we
also send the message bits, so we’re mimicking a similar constraint length (amount
of memory) to the K = 3 convolutional code.
1. You will produce similar pictures in one of your lab tasks using your implementations of the Viterbi and rectangular parity code decoders.
2. The probability of error for a given amount of noise is noticeably lower for the K = 4 code compared to the K = 3 code; the reason is that the free distance of this K = 4 code is 6, and it takes 4 trellis edges to achieve that (000 → 100 → 010 → 001 → 000), meaning that the code can correct up to 2 bit errors in sliding windows of length 2 · 4 = 8 bits.
3. The probability of error for a given amount of noise is dramatically lower with soft
decision decoding than hard decision decoding. In fact, K = 3 and soft decoding
beats K = 4 and hard decoding in these graphs. For a given error probability (and
signal), the degree of noise that can be tolerated with soft decoding is much higher
(about 2.5–3 dB, which is a good rule-of-thumb to apply in practice for the gain from
soft decoding, all other things being equal).
Figure 7-8 shows a comparison of three different convolutional codes together with the uncoded case. Two of the codes are the same as in Figure 7-7, i.e., (111, 110) and (1110, 1101); these were picked because they were recommended by Bussgang’s paper. The third code is (111, 101), with parity equations

p_0[n] = (x[n] + x[n − 1] + x[n − 2]) mod 2
p_1[n] = (x[n] + x[n − 2]) mod 2.
The results of this comparison are shown in Figure 7-8. These graphs show the probability of decoding error (BER after decoding) for experiments that transmit messages of length 500,000 bits each. (Because the BER of the best codes in this set is on the order of 10^{−6}, we actually need to run the experiment over even longer messages when the SNR is higher than 3 dB; that’s why we don’t see results for one of the codes, where the experiment encountered no errors.)
Interestingly, these results show that the code (111, 101) is stronger than the other two
codes, even though its constraint length, 3, is smaller than that of (1110, 1101). To under-
stand why, we can calculate the free distance of this code, which turns out to be 5. This free
distance is smaller than that of (1110, 1101), whose free distance is 6, but the number of trel-
lis edges to go from state 00 back to state 00 in the (111, 101) case is only 3, corresponding to
a 6-bit block. The relevant state transitions are 00 → 10 → 01 → 00 and the corresponding
path metrics are 0 → 2 → 3 → 5. Hence, its error correcting power is marginally stronger
than the (1110, 1101) code.
7.6 Summary
From its relatively modest, though hugely impactful, beginnings as a method to decode convolutional codes, Viterbi decoding has become one of the most widely used algorithms in a wide range of fields and engineering systems. Modern disk drives with “PRML” read channels, speech recognition systems, natural language systems, and a variety of communication networks use this scheme or its variants.
In fact, a more modern view of the soft decision decoding technique described in this
lecture is to think of the procedure as finding the most likely set of traversed states in
a Hidden Markov Model (HMM). Some underlying phenomenon is modeled as a Markov
state machine with probabilistic transitions between its states; we see noisy observations
84 CHAPTER 7. VITERBI DECODING OF CONVOLUTIONAL CODES
from each state, and would like to piece together the observations to determine the most
likely sequence of states traversed. It turns out that the Viterbi decoder is an excellent
starting point to solve this class of problems (and sometimes the complete solution).
On the other hand, despite its undeniable success, Viterbi decoding isn’t the only way to decode convolutional codes. For one thing, its computational complexity is exponential in the constraint length, K, because it requires each of the 2^{K−1} states to be enumerated at every time step. When K is large, one may use other decoding methods such as BCJR or Fano’s sequential decoding scheme.
Convolutional codes themselves are very popular over both wired and wireless links. They are sometimes used as the “inner code” with an outer block error correcting code, but they may also be used with just an outer error detection code. They are also used as a component in more powerful codes like turbo codes, which are currently among the highest-performing codes used in practice.
(a) What is the rate of this code? How many states are in the state machine repre-
sentation of this code?
(b) Suppose the decoder reaches the state “110” during the forward pass of the
Viterbi algorithm with this convolutional code.
i. How many predecessor states (i.e., immediately preceding states) does state
“110” have?
ii. What are the bit-sequence representations of the predecessor states of state
“110”?
iii. What are the expected parity bits for the transitions from each of these pre-
decessor states to state “110”? Specify each predecessor state and the ex-
pected parity bits associated with the corresponding transition below.
(c) To increase the rate of the given code, Lem E. Tweakit punctures the p0 parity
stream using the vector (1 0 1 1 0), which means that every second and fifth bit
produced on the stream are not sent. In addition, she punctures the p1 parity
stream using the vector (1 1 0 1 1). She sends the p2 parity stream unchanged.
What is the rate of the punctured code?
3. Let conv_encode(x) be the resulting bit-stream after encoding bit-string x with a convolutional code, C. Similarly, let conv_decode(y) be the result of decoding y to produce the maximum-likelihood estimate of the encoded message. Suppose we send a message M using code C over some channel. Let P = conv_encode(M) and
let R be the result of sending P over the channel and digitizing the received samples
at the receiver (i.e., R is another bit-stream). Suppose we use Viterbi decoding on
R, knowing C, and find that the maximum-likelihood estimate of M is M̂. During
the decoding, we find that the minimum path metric among all the states in the final
stage of the trellis is D_min.
D_min is the Hamming distance between ________ and ________. Fill in the blanks, explaining your answer.
Figure 7-3: The Viterbi decoder in action. This picture shows four time steps. The bottom-most picture is
the same as the one just before it, but with only the survivor paths shown.
Figure 7-4: The Viterbi decoder in action (continued from Figure 7-3). The decoded message is shown. To produce this message, start from the final state with smallest path metric and work backwards, and then reverse the bits. At each state during the forward pass, it is important to remember the arc that got us to this state, so that the backward pass can be done properly.
Figure 7-5: When the expected parity bits are 0,0, the soft decision branch metric is V_{p0}^2 + V_{p1}^2, where (V_{p0}, V_{p1}) are the received voltages; it ranges from 0 at (0.0, 0.0) to 2 at (1.0, 1.0).
Figure 7-6: The free distance of a convolutional code, illustrated on the trellis of the K = 3 example (states labeled x[n−1]x[n−2], arcs out of each 00 state labeled 0/00 and 1/11). The free distance is the difference in path metrics between the all-zeroes output and the path with the smallest non-zero path metric going from the initial 00 state to some future 00 state. It is 4 in this example. The path 00 → 10 → 01 → 00 has a shorter length, but a higher path metric (of 5), so it is not the free distance.
Figure 7-7: Error correcting performance results for different rate-1/2 codes.
Figure 7-8: Error correcting performance results for three different rate-1/2 convolutional codes. The pa-
rameters of the three convolutional codes are (111, 110) (labeled “K = 3 glist=(7, 6)”), (1110, 1101) (labeled
“K = 4 glist=(14, 13)”), and (111, 101) (labeled “K = 3 glist=(7, 5)”). The top three curves below the uncoded
curve are for hard decision decoding; the bottom three curves are for soft decision decoding.