INFORMATION THEORY AND RELIABLE COMMUNICATION

Robert G. Gallager
Massachusetts Institute of Technology

JOHN WILEY & SONS
New York · Chichester · Brisbane · Toronto · Singapore

Copyright © 1968 by John Wiley & Sons, Inc.

All rights reserved. Reproduction or translation of any part of this work beyond that permitted by Sections 107 or 108 of the 1976 United States Copyright Act without the permission of the copyright owner is unlawful. Requests for permission or further information should be addressed to the Permissions Department, John Wiley & Sons, Inc.

ISBN 0-471-29048-3
Library of Congress Catalog Card Number: 68-26850
Printed in the United States of America

PREFACE

This book is designed primarily for use as a first-year graduate text in information theory, suitable for both engineers and mathematicians. It is assumed that the reader has some understanding of freshman calculus and elementary probability, and in the later chapters some introductory random process theory. Unfortunately there is one more requirement that is harder to meet. The reader must have a reasonable level of mathematical maturity and capability for abstract thought. The major results of the theory are quite subtle and abstract and must sometimes be arrived at by what appears to be rather devious routes. Fortunately, recent simplifications in the theory have made the major results more accessible than in the past.

Because of the subtlety and abstractness of the subject, it is necessary to be more rigorous than is usual in engineering. I have attempted to soften this wherever possible by preceding the proof of difficult theorems both with some explanation of why the theorem is important and with an intuitive explanation of why it is true. An attempt has also been made to provide the simplest and most elementary proof of each theorem, and many of the proofs here are new. I have carefully avoided the rather obnoxious practice in many elementary textbooks of quoting obscure mathematical theorems in the middle of a proof to make it come out right.

There are a number of reasons for the stress on proving theorems here. One of the main reasons is that the engineer who attempts to apply the theory will rapidly find that engineering problems are rarely solved by applying theorems to them. The theorems seldom apply exactly and one must understand the proofs to see whether the theorems provide any insight into the problem. Another reason for the stress on proofs is that the techniques used in the proofs are often more useful in doing new research in the area than the theorems themselves. A final reason for emphasizing the precise statement of results and careful proofs is that the text has been designed as an integral part of a course in information theory rather than as the whole course. Philosophy, intuitive understanding, examples, and applications, for example, are better developed in the give and take of a classroom, whereas precise statements and details are better presented in the permanent record of a textbook. Enough of the intuition has been presented here for the instructor and the independent student, but added classroom stress is needed for the beginning graduate student.

A large number of exercises and problems are given at the end of the text. These range from simple numerical examples to significant generalizations of the theory.
There are relatively few examples worked out in the text, and the student who needs examples should pause frequently in his reading to work out some of the simpler exercises at the end of the book.

There are a number of ways to organize the material here into a one-semester course. Chapter 1 should always be read first (and probably last also). After this, my own preference is to cover the following sections in order: 2.1-2.4, 3.1-3.4, 4.1-4.5, 5.1-5.6, 6.1-6.5, and finally either 6.8-6.9 or 6.6-6.7 or 8.1-8.3. Another possibility, for students who have some background in random processes, is to start with Sections 8.1 and 8.2 and then to proceed with the previous outline, using the white Gaussian noise channel as an example throughout. Another possibility, for students with a strong practical motivation, is to start with Chapter 6 (omitting Section 6.2), then to cover Sections 5.1 to 5.5, then 6.2, then Chapters 2 and 4 and Sections 8.1 and 8.2. Other possible course outlines can be made up with the help of the following table of prerequisites.

Table of Prerequisites [table garbled in the source scan; it lists, for each group of sections, the earlier sections that should be read first]

As a general rule, the latter topics in each chapter are more difficult and are presented in a more terse manner than the earlier topics. They are included primarily for the benefit of advanced students and workers in the field, although most of them can be covered in a second semester. Instructors are cautioned not to spend too much time on Chapter 3, particularly in a one-semester course. The material in Sections 4.1-4.5, 5.1-5.6, and 6.1-6.5 is simpler and far more significant than that of Sections 3.5-3.6, even though it may be less familiar to some instructors.

I apologize to the many authors of significant papers in information theory whom I neglected to cite. I tried to list the references that I found useful in the preparation of this book along with references for selected advanced material. Many papers of historical significance were neglected, and the authors cited are not necessarily the ones who have made the greatest contributions to the field.

Robert G. Gallager

ACKNOWLEDGMENTS

I am grateful for the patient support of the Research Laboratory of Electronics and of the Electrical Engineering Department at MIT while this book was being written. The work at the Research Laboratory of Electronics was supported by the National Aeronautics and Space Administration under Grant NSG-334. I am particularly grateful to R. M. Fano, who stimulated my early interest in information theory and to whom I owe much of my conceptual understanding of the subject. This text was started over four years ago with the original idea of making it a revision, under joint authorship, of The Transmission of Information by R. M. Fano. As the years passed and the text grew and changed, it became obvious that it was a totally different book. However, my debt to The Transmission of Information is obvious to anyone familiar with both books.

I am also very grateful to P. Elias, J. M. Wozencraft, and C. E. Shannon for their ideas and teachings, which I have used liberally here. Another debt is owed to the many students who have taken the information theory course at MIT and who have made candid comments about the many experiments in different ways of presenting the material here.
Finally, I am indebted to the many colleagues who have been very generous in providing detailed criticisms of different parts of the manuscript. J. L. Massey has been particularly helpful in this respect. Also, G. D. Forney, H. Yudkin, A. Wyner, P. Elias, R. Kahn, R. S. Kennedy, J. Max, J. Pinkston, E. Berlekamp, A. Kohlenberg, I. Jacobs, D. Sakrison, T. Kailath, L. Seidman, and F. Preparata have all made a number of criticisms that significantly improved the manuscript.

R.G.G.

CONTENTS

1 Communication Systems and Information Theory
1.1 Introduction
1.2 Source Models and Source Coding
1.3 Channel Models and Channel Coding
Historical Notes and References

2 A Measure of Information
2.1 Discrete Probability: Review and Notation
2.2 Definition of Mutual Information
2.3 Average Mutual Information and Entropy
2.4 Probability and Mutual Information for Continuous Ensembles
2.5 Mutual Information for Arbitrary Ensembles
Summary and Conclusions
Historical Notes and References

3 Coding for Discrete Sources
3.1 Fixed-Length Codes
3.2 Variable-Length Code Words
3.3 A Source Coding Theorem
3.4 An Optimum Variable-Length Encoding Procedure
3.5 Discrete Stationary Sources
3.6 Markov Sources
Summary and Conclusions
Historical Notes and References

4 Discrete Memoryless Channels and Capacity
4.1 Classification of Channels
4.2 Discrete Memoryless Channels
4.3 The Converse to the Coding Theorem
4.4 Convex Functions
4.5 Finding Channel Capacity for a Discrete Memoryless Channel
4.6 Discrete Channels with Memory
    Indecomposable Channels
Summary and Conclusions
Historical Notes and References
Appendix 4A

5 The Noisy-Channel Coding Theorem
5.1 Block Codes
5.2 Decoding Block Codes
5.3 Error Probability for Two Code Words
5.4 The Generalized Chebyshev Inequality and the Chernoff Bound
5.5 Randomly Chosen Code Words
5.6 Many Code Words—The Coding Theorem
    Properties of the Random Coding Exponent
5.7 Error Probability for an Expurgated Ensemble of Codes
5.8 Lower Bounds to Error Probability
    Block Error Probability at Rates above Capacity
5.9 The Coding Theorem for Finite-State Channels
    State Known at Receiver
Summary and Conclusions
Historical Notes and References
Appendix 5A
Appendix 5B

6 Techniques for Coding and Decoding
6.1 Parity-Check Codes
    Generator Matrices
    Parity-Check Matrices for Systematic Parity-Check Codes
    Decoding Tables
    Hamming Codes
6.2 The Coding Theorem for Parity-Check Codes
6.3 Group Theory
    Subgroups
    Cyclic Subgroups
6.4 Fields and Polynomials
    Polynomials
6.5 Cyclic Codes
6.6 Galois Fields
    Maximal Length Codes and Hamming Codes
    Existence of Galois Fields
6.7 BCH Codes
    Iterative Algorithm for Finding σ(D)
6.8 Convolutional Codes and Threshold Decoding
6.9 Sequential Decoding
    Computation for Sequential Decoding
    Error Probability for Sequential Decoding
6.10 Coding for Burst Noise Channels
    Cyclic Codes
    Convolutional Codes
Summary and Conclusions
Historical Notes and References
Appendix 6A
Appendix 6B

7 Memoryless Channels with Discrete Time
7.1 Introduction
7.2 Unconstrained Inputs
7.3 Constrained Inputs
7.4 Additive Noise and Additive Gaussian Noise
    Additive Gaussian Noise with an Energy Constrained Input
7.5 Parallel Additive Gaussian Noise Channels
Summary and Conclusions
Historical Notes and References

8 Waveform Channels
8.1 Orthonormal Expansions of Signals and White Gaussian Noise
    Gaussian Random Processes
    Mutual Information for Continuous-Time Channels
8.2 White Gaussian Noise and Orthogonal Signals
    Error Probability for Two Code Words
    Error Probability for Orthogonal Code Words
8.3 Heuristic Treatment of Capacity for Channels with Additive Gaussian Noise and Bandwidth Constraints
8.4 Representation of Linear Filters and Nonwhite Noise
    Filtered Noise and the Karhunen-Loeve Expansion
    Low-Pass Ideal Filters
8.5 Additive Gaussian Noise Channels with an Input Constrained in Power and Frequency
8.6 Fading Dispersive Channels
Summary and Conclusions
Historical Notes and References

9 Source Coding with a Fidelity Criterion
9.1 Introduction
9.2 Discrete Memoryless Sources and Single-Letter Distortion Measures
9.3 The Coding Theorem for Sources with a Fidelity Criterion
9.4 Calculation of R(d*)
9.5 The Converse to the Noisy-Channel Coding Theorem Revisited
9.6 Discrete-Time Sources with Continuous Amplitudes
9.7 Gaussian Sources with Square Difference Distortion
    Gaussian Random-Process Sources
9.8 Discrete Ergodic Sources
Summary and Conclusions
Historical Notes and References

Exercises and Problems
References and Selected Reading
Glossary of Symbols
Index

Information Theory and Reliable Communication

Chapter 1

COMMUNICATION SYSTEMS AND INFORMATION THEORY

1.1 Introduction

Communication theory deals primarily with systems for transmitting information or data from one point to another. A rather general block diagram for visualizing the behavior of such systems is given in Figure 1.1.1. The source output in Figure 1.1.1 might represent, for example, a voice waveform, a sequence of binary digits from a magnetic tape, the output of a set of sensors in a space probe, a sensory input to a biological organism, or a target in a radar system. The channel might represent, for example, a telephone line, a high frequency radio link, a space communication link, a storage medium, or a biological organism (for the case where the source output is a sensory input to that organism). The channel is usually subject to various types of noise disturbances, which on a telephone line, for example, might take the form of a time-varying frequency response, crosstalk from other lines, thermal noise, and impulsive switching noise. The encoder in Figure 1.1.1 represents any processing of the source output performed prior to transmission. The processing might include, for example, any combination of modulation, data reduction, and insertion of redundancy to combat the channel noise. The decoder represents the processing of the channel output with the objective of producing at the destination an acceptable replica of (or response to) the source output.

In the early 1940's, C. E. Shannon (1948) developed a mathematical theory, called information theory, for dealing with the more fundamental aspects of communication systems.
The distinguishing characteristics of this theory are, first, a great emphasis on probability theory and, second, a primary concern with the encoder and decoder, both in terms of their functional roles and in terms of the existence (or nonexistence) of encoders and decoders that achieve a given level of performance. In the past 20 years, information theory has been made more precise, has been extended, and has been brought to the point where it is being applied in practical communication systems. Our purpose in this book is to present this theory, both bringing out its logical cohesion and indicating where and how it can be applied.

As in any mathematical theory, the theory deals only with mathematical models and not with physical sources and physical channels. One would think, therefore, that the appropriate way to begin the development of the theory would be with a discussion of how to construct appropriate mathematical models for physical sources and channels. This, however, is not the way that theories are constructed, primarily because physical reality is rarely simple enough to be precisely modeled by mathematically tractable models.

Figure 1.1.1. Block diagram of communication system.

Our procedure here will be rather to start by studying the simplest classes of mathematical models of sources and channels, using the insight and the results gained to study progressively more complicated classes of models. Naturally, the choice of classes of models to study will be influenced and motivated by the more important aspects of real sources and channels, but our view of what aspects are important will be modified by the theoretical results. Finally, after understanding the theory, we shall find it useful in the study of real communication systems in two ways. First, it will provide a framework within which to construct detailed models of real sources and channels. Second, and more important, the relationships established by the theory provide an indication of the types of tradeoffs that exist in constructing encoders and decoders for given systems. While the above comments apply to almost any mathematical theory, they are particularly necessary here because quite an extensive theory must be developed before the more important implications for the design of communication systems will become apparent.

In order to further simplify our study of source models and channel models, it is helpful to partly isolate the effect of the source in a communication system from that of the channel. This can be done by breaking the encoder and decoder of Figure 1.1.1 each into two parts as shown in Figure 1.1.2. The purpose of the source encoder is to represent the source output by a sequence of binary digits, and one of the major questions of concern is to determine how many binary digits per unit time are required to represent the output of any given source model. The purpose of the channel encoder and decoder is to allow the binary data sequences to be reliably reproduced at the output of the channel decoder, and one of the major questions of concern here is if and how this can be done. It is not obvious, of course, whether restricting the encoder and decoder to the form of Figure 1.1.2 imposes any fundamental limitations on the performance of the communication system.
One of the most important results of the theory, however, is that under very broad conditions no such limitations are imposed (this does not say, however, that an encoder and decoder of the form in Figure 1.1.2 is always the most economical way to achieve a given performance).

Figure 1.1.2. Block diagram of communication system with encoder and decoder each split in two parts.

From a practical standpoint, the splitting of encoder and decoder in Figure 1.1.2 is particularly convenient since it makes the design of the channel encoder and decoder virtually independent of the source encoder and decoder, using binary data as an interface. This, of course, facilitates the use of different sources on the same channel.

In the next two sections, we shall briefly describe the classes of source models and channel models to be studied in later chapters and the encoding and decoding of these sources and channels. Since the emphasis in information theory is primarily on this encoding and decoding, it should be clear that the theory is not equally applicable to all communication situations. For example, if the source is a radar target, there is no opportunity for encoding the source output (unless we want to look at the choice of radar signals as a form of encoding), and thus we cannot expect the theory to produce any more than peripheral insight. Similarly, if the source output is a sensory input to a biological organism, we might consider the organism to be a combination of encoding, channel, and decoding, but we have no control over the encoding and decoding, and it is not at all clear that this is the most fruitful model for such studies of a biological organism. Thus, again, information theory might provide some insight into the behavior of such organisms, but it can certainly not be regarded as a magical key for understanding.

1.2 Source Models and Source Coding

We now briefly describe the mathematical models of sources that the theory will deal with. Naturally, these models will be presented more carefully in subsequent chapters. All source models in information theory are random process (or random sequence) models. Discrete memoryless sources constitute the simplest class of source models. These are sources for which the output is a sequence (in time) of letters, each letter being a selection from some fixed alphabet consisting of, say, the letters a_1, a_2, ..., a_K. The letters in the source output sequence are random, statistically independent selections from the alphabet, the selection being made according to some fixed probability assignment Q(a_1), ..., Q(a_K).

Method 1: a_1 -> 00, a_2 -> 01, a_3 -> 10, a_4 -> 11
Method 2: a_1 -> 0, a_2 -> 10, a_3 -> 110, a_4 -> 111

Figure 1.2.1. Two ways of converting a four-letter alphabet into binary digits.

It undoubtedly seems rather peculiar at first to model physical sources, which presumably produce meaningful information, by a random process model. The following example will help to clarify the reason for this. Suppose that a measurement is to be performed repeatedly and that the result of each measurement is one of the four events a_1, a_2, a_3, or a_4.
Suppose that this sequence of measurements is to be stored in binary form and suppose that two ways of performing the conversion to binary digits have been proposed, as indicated in Figure 1.2.1.

In the first method above, two binary digits are required to represent each source digit, whereas in the second method, a variable number is required. If it is known that a_1 will be the result of the great majority of the measurements, then method 2 will allow a long sequence of measurements to be stored with many fewer binary digits than method 1. In Chapter 3, methods for encoding the output of a discrete source into binary data will be discussed in detail. The important point here is that the relative effectiveness of the two methods in Figure 1.2.1 depends critically upon the frequency of occurrence of the different events, and that this is incorporated into a mathematical model of the source by assigning probabilities to the set of source letters. More familiar, but more complicated, examples of the same type are given by shorthand, where short symbols are used for commonly occurring words, and in Morse code, where short sequences of dots and dashes are assigned to common letters and longer sequences to uncommon letters.

Closely related to the encoding of a source output into binary data is the measure of information (or uncertainty) of the letters of a source alphabet, which will be discussed in Chapter 2. If the kth letter of the source alphabet has probability Q(a_k), then the self-information of that letter (measured in bits) is defined as I(a_k) = -log_2 Q(a_k). From an intuitive standpoint, as will be seen in more detail in Chapter 2, this technical definition has many of the same qualities as the nontechnical meaning of information. In particular, if Q(a_k) = 1, then I(a_k) = 0, corresponding to the fact that the occurrence of a_k is not at all informative since it had to occur. Similarly, the smaller the probability of a_k, the larger its self-information. On the other hand, it is not hard to see that this technical definition of information also lacks some qualities of the nontechnical meaning. For example, no matter how unlikely an event is, we do not consider it informative (in the nontechnical sense) unless it happens to interest us. This does not mean that there is something inadequate about the definition of self-information; the usefulness of a definition in a theory comes from the insight that it provides and the theorems that it simplifies. The definition here turns out to be useful in the theory primarily because it does separate out the notion of unexpectedness in information from that of interest or meaning.

The average value of self-information over the letters of the alphabet is a particularly important quantity known as the entropy of a source letter, and it is given by

Sum from k = 1 to K of -Q(a_k) log_2 Q(a_k).

The major significance of the entropy of a source letter comes from the source coding theorem, which is treated in Chapter 3. This states that, if H is the entropy of a source letter for a discrete memoryless source, then the sequence of source outputs cannot be represented by a binary sequence using fewer than H binary digits per source digit on the average, but it can be represented by a binary sequence using as close to H binary digits per source digit on the average as desired.
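As a small numerical illustration of these ideas, the sketch below evaluates the two conversion methods of Figure 1.2.1 together with the entropy of a source letter for an assumed probability assignment in which a_1 dominates; the probabilities are hypothetical and not taken from the text.

```python
from math import log2

# Assumed probability assignment Q(a_k); a_1 is much more likely than the rest.
Q = {'a1': 0.85, 'a2': 0.05, 'a3': 0.05, 'a4': 0.05}

method1 = {'a1': '00', 'a2': '01', 'a3': '10', 'a4': '11'}
method2 = {'a1': '0',  'a2': '10', 'a3': '110', 'a4': '111'}

# Average number of binary digits per source letter for each method.
for name, code in (('Method 1', method1), ('Method 2', method2)):
    avg = sum(Q[a] * len(code[a]) for a in Q)
    print(f"{name}: {avg:.3f} binary digits per source letter")

# Entropy of a source letter, -sum Q(a_k) log2 Q(a_k), in bits.
H = -sum(p * log2(p) for p in Q.values())
print(f"Entropy H: {H:.3f} bits per source letter")
# Method 2 (about 1.25) beats Method 1 (exactly 2) here, and, consistent with
# the source coding theorem, neither does better than H (about 0.85).
```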
Some feeling for this result may be obtained by noting that, if for some integer L, a source has an alphabet of 2^L equally likely letters, then the entropy of a source letter is L bits. On the other hand, if we observe that there are 2^L different sequences of L binary digits, then we see that each of these sequences can be assigned to a different letter of the source alphabet, thus representing the output of the source by L binary digits per source digit. This example also goes a long way towards showing why a logarithm appears in the definition of self-information and entropy.

The entropy of a source is also frequently given in the units of bits per second. If, for a discrete memoryless source, the entropy of a source letter is H, and if the source produces one letter each τ_s seconds, then the entropy in bits per second is just H/τ_s, and the source coding theorem indicates that the source output can be represented by a binary sequence of arbitrarily close to H/τ_s binary digits per second.

As a more complicated class of source models, we shall consider discrete sources with memory in which successive letters from the source are statistically dependent. In Section 3.5, the entropy for these sources (in bits per digit or bits per second) is defined in an analogous but more complicated way and the source coding theorem is shown to apply if the source is ergodic. Finally, in Chapter 9, we shall consider nondiscrete sources. The most familiar example of a nondiscrete source is one where the source output is a random process. When we attempt to encode a random process into a binary sequence, the situation is conceptually very different from the encoding of a discrete source. A random process can be encoded into binary data, for example, by sampling the random waveform, then quantizing the samples, and then encoding the quantized samples into binary data. The difference between this and the binary encoding discussed previously is that the sample waveform cannot be precisely reconstructed from the binary sequence, and thus such an encoding must be evaluated both in terms of the number of binary digits required per second and some measure of the distortion between the source waveform and the waveform reconstructed from the binary digits. In Chapter 9 we shall treat the problem of finding the minimum number of binary digits per second required to encode a source output so that the average distortion between the source output and a replica constructed from the binary sequence is within a given level. The major point here is that a nondiscrete source can be encoded with distortion into a binary sequence and that the required number of binary digits per unit time depends on the permissible distortion.
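The sketch below is a rough illustration of this last point, assuming a hypothetical uniform source and a uniform quantizer (neither is taken from the text): each sample is encoded into a fixed number of binary digits, and the resulting mean square distortion shrinks as more digits are spent per sample.

```python
import numpy as np

# Encoding a continuous-amplitude source with distortion: quantize each
# sample to b binary digits and measure the mean square error.  The uniform
# source on [-1, 1) and the uniform quantizer are assumed examples.
rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=100_000)

for b in (2, 4, 6):
    levels = 2 ** b
    step = 2.0 / levels
    # Map each sample to the midpoint of its quantization interval.
    quantized = (np.floor((samples + 1.0) / step) + 0.5) * step - 1.0
    mse = np.mean((samples - quantized) ** 2)
    print(f"{b} binary digits per sample: mean square distortion = {mse:.6f}")
# More binary digits per sample buy smaller distortion (roughly step**2 / 12),
# which is the kind of tradeoff Chapter 9 makes precise.
```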
1.3 Channel Models and Channel Coding

In order to specify a mathematical model for a channel, we shall specify first the set of possible inputs to the channel, second, the set of possible outputs, and third, for each input, a probability measure on the set of outputs. Discrete memoryless channels constitute the simplest class of channel models and are defined as follows: the input is a sequence of letters from a finite alphabet, say a_1, ..., a_K, and the output is a sequence of letters from the same or a different alphabet, say b_1, ..., b_J. Finally, each letter in the output sequence is statistically dependent only on the letter in the corresponding position of the input sequence and is determined by a fixed conditional probability assignment P(b_j | a_k) defined for each letter a_k in the input alphabet and each letter b_j in the output alphabet.
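As an illustration of this kind of specification, the sketch below stores an assumed transition probability assignment P(b_j | a_k) as a matrix and computes the resulting output letter probabilities for an assumed input assignment; the numbers are hypothetical.

```python
import numpy as np

# A discrete memoryless channel is specified by transition probabilities
# P(b_j | a_k).  Rows are input letters a_k, columns are output letters b_j.
# The numbers here are an assumed example, not taken from the text.
P = np.array([[0.9, 0.1],     # P(b_1|a_1), P(b_2|a_1)
              [0.2, 0.8]])    # P(b_1|a_2), P(b_2|a_2)

Q = np.array([0.6, 0.4])      # assumed input probabilities Q(a_1), Q(a_2)

# Output probabilities: P(b_j) = sum over k of Q(a_k) P(b_j | a_k)
P_out = Q @ P
print(P_out)                  # [0.62 0.38]
```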
In the simplest case, each letter from the DDD will be a decision (perhaps incorrect) on what letter entered the DDM in the corresponding time interval, and in this case the alphabet 6,,... , by will be the same as the alphabet at the input to the DDM. In more sophisticated cases, the output from the DDD will also contain information about how reliable the decision is, and in this case the output alphabet for the DDD will be larger than the input alphabet to the DDM. Tt can be seen from Figure 1.3.2 that the combination of DDM, waveform channel, and DDD can together be considered as a discrete channel, and it i this that gives discrete channels their importance as models of physical channels. If the noise is independent between successive intervals of r, seconds, as will be the case for additive white Gaussian noise, then the above discrete channel will also be memoryless. By discussing encoding and decoding for discrete channels as a class, we shall first find out something about the discrete channel encoder and decoder in Figure 1.3.2, and second, we shall be able to use these results to say something about how a DDM and DDD should be designed in such a system. One of the most important parameters of a channel is its capacity. In ‘Channel Models and Channel Coding 9 Chapter 4 we define and show how to calculate capacity for a broad class of discrete channels, and in Chapters 7 and 8, the treatment is extended to nondiscrete channels. Capacity is defined using an information measure similar to that used in discussing sources and the capacity can be interpreted as the maximum average amount of information (in bits per second) that can be transmitted over the channel. It turns out that the capacity of a non- discrete channel can be approached arbitrarily closely by the capacity of a discrete channel made up of an appropriately chosen digital data modulator and digital data demodulator combined with the nondiserete channel. The significance of the capacity of a channel comes primarily from the noisy-channel coding theorem and its converse. In imprecise terms, this coding theorem states that, for a broad class of channels, if the channel has capacity C bits per second and if binary data enters the channel encoder (see Figure 1.1.2) at a rate (in binary digits per second) R < C, then by appro- priate design of the encoder and decoder, it is possible to reproduce the binary digits at the decoder output with a probability of error as small as desired. This result is precisely stated and proved in Chapter 5 for discrete channels and in Chapters 7 and 8 for nondiscrete channels. The far-reaching significance of this theorem will be discussed later in this section, but not much intuitive plausibility can be given until Chapter 5, If we combine this result with the source coding theorem referred to in the last section, we find that, if'a discrete source has an entropy (in bits per second) less than C, then the source output can be recreated at the destination with arbitrarily small error probability through the use of appropriate coding and decoding. Similarly, for a nondiserete source, if R is the minimum number of binary digits per second required to reproduce the source output within a given level of average distortion, and if R < C, then the source output can be trans- mitted over the channel and reproduced within that level of distortion. 
The converse to the coding theorem is stated and proved in varying degrees of generality in Chapters 4, 7, and 8, In imprecise terms, it states that if the entropy of a discrete source, in bits per second, is greater than C, then, independent of the encoding and decoding used in transmitting the source output over the channel, the error probability in recreating the source output at the destination cannot be less than some positive number which depends on the source and on C. Also, as shown in Chapter 9, if R is the minimum number of binary digits per second required to reproduce a source within a given level of average distortion, and if R > C, then, independent of the encoding and decoding, the source output cannot be transmitted over the channel and reproduced within that given average level of distortion. The most surprising and important of the above results is the noisy channel coding theorem which we now discuss in greater detail. Suppose that we want to transmit data over a discrete channel and that the channel accepts 10 Communication Systems and Information Theory aninputletter once each 7, seconds. Suppose also that binary data are entering the channel encoder at a rate of R binary digits per second. Let us consider a particular kind of channel encoder, called a block encoder, which operates in the following way : the encoder accumulates the binary digits at the encoder input for some fixed period of T seconds, where T is a design parameter of the encoder. During this period, TR binary digits enter the encoder (for simplicity, we here ignore the difficulty that TR might not be an integer) We can visualize the encoder as containing a list of all 2” possible sequences of TR binary digits and containing alongside each of these sequences a code word consisting of a sequence of N= T/z, channel input letters. Upon receiving a particular sequence of TR binary digits, the encoder finds that Binary input Code word sequences to outputs from encoder encoder 0 a OL . ag 1 ayaa, eee ir Figure 1.3.3. Example of discrete channel encoder, TR=2,N=3. sequence in the list and transmits over the channel the corresponding code word in the list, It takes J seconds to transmit the N letter code word over the channel and, by that time, another sequence of TR binary digits has entered the encoder, and the transmission of the next code word begins. A simple example of such an encoder is given in Figure 1.3.3. For that example, if the binary sequence OO11 - enters the encoder, the 00 is the encoder input in the first T-second interval and at the end of this interval the code word ayaa, is formed and transmitted in the second T-second interval, Similarly, 11 is the encoder input in the second T-second time interval and the corresponding code word, transmitted in the third time interval, is aya ‘The decoder for such a block encoder works in a similar way. The decoder accumulates NV received digits from the channel corresponding to a trans- mitted code word and makes a decision (perhaps incorrect) concerning the corresponding TR binary digits that entered the encoder. This decision making can be considered as being built into the decoder by means of a list of all possible received sequences of N digits, and corresponding to each of these sequences the appropriate sequence of 7R binary digits. For a given discrete channel and a given rate R of binary digits per second entering the encoder, we are free to choose first T (or, equivalently, N Tir, second the set of 2"* code words, and third the decision rule. 
The ‘Channel Models and Channel Coding u probability of error in the decoded binary data, the complexity of the system, and the delay in decoding all depend on these choices. In Chapter 5 the follow- ing relationship is established between the parameter T and the probability, P,, of decoding a block of TR binary digits incorrectly: it is shown for a broad class of channels that it is possible to choose the 2”* code words and the decision rule in such a way that P. S exp [—TE(R)] The function E(R) is a function of R (the number of binary digits per second entering the encoder) and depends upon the channel model but is independent of T. It is shown that E(A) is decreasing with R but is positive for all R less than channel capacity (see Figure 1.3.4), It turns out that the above bound on P, is quite tight and it is not unreasonable to interpret exp (—TE(R)] as an estimate of the minimum probability of error (over all choices of the E(R) L_* Figure 1.3.4. Sketch of function E(R) for a typical ‘channel model, 27® code words and all decision rules) that can be achieved using a block encoder with the constraint time 7. Thus, to make P, small, it is necessary to choose T large and the closer R is to C, the larger T must be. In Chapter 6 we shall discuss ways of implementing channel encoders and decoders. It is difficult to make simple statements about either the complexity or the error probability of such devices. Roughly, however, it is not hard to see that the complexity increases with the constraint time T (in the best techniques, approximately linearly with 7), that P, decreases with T for fixed R, and that T must increase with R to achieve a fixed value of P,. Thus, roughly, there is a tradeoff between complexity, rate, and error probability The closer R is to capacity and the lower P, is, the greater the required encoder and decoder complexity is In view of the above tradcoffs, we can see more clearly the practical ad- vantages of the separation of encoder and decoder in Figure 1.3.2 into two 2 Communication Systems and Information Theory parts, In recent years, the cost of digital logic has been steadily decreasing, Whereas no such revolution has occurred with analog hardware. Thus it is desirable in a complex system to put as much of the complexity as possible in the digital part of the system. This is not to say, of course, that com- pletely analog communication systems are outmoded, but simply that there are many advantages to a primarily digital system that did not exist ten years ago. Historical Notes and References ‘Much of modern communication theory stems from the works of Shannon (1948), Wiener (1949), and Kotel'nikov (1947), All of these men recognized clearly the fundamental role of noise in limiting the performance of com- munication systems and also the desirability of modeling both signal and noise as random processes. Wiener was interested in finding the best linear filter to separate the signal from additive noise with a prescribed delay and his work had an important influence on subsequent research in modulation theory. Also Wiener’s interest in reception with negative delay (that is, prediction) along with Kolmogorov's (1941) work on prediction in theabsence of noise have had an important impact on control theory. Similarly, Kotel nikov was interested in the detection and estimation of signals at the receiver. 
While his work is not as widely known and used in the United States as it should be, it provides considerable insight into both analog modulation and digital data modulation. Shannon's work had a much more digital flavor than the others and, more important, focused jointly on the encoder and decoder. Because of this joint emphasis and the freedom from restrictions to particular types of receiver structures, Shannon's theory provides the most general conceptual framework known within which to study efficient and reliable communication. For theoretically solid introductory texts on communication theory, see Wozencraft and Jacobs (1965) or Sakrison (1968).

Chapter 2

A MEASURE OF INFORMATION

The concepts of information and communication in our civilization are far too broad and pervading to expect any quantitative measure of information to apply universally. As explained in the last chapter, however, there are many communication situations, particularly those involving transmission and processing of data, in which the information (or data) and the channel are appropriately represented by probabilistic models. The measures of information to be defined in this chapter are appropriate to these probabilistic situations, and the question as to how appropriate these measures are generally revolves around the question of the appropriateness of the probabilistic model.

2.1 Discrete Probability: Review and Notation

We may visualize a probabilistic model as an experiment with an outcome chosen from a set of possible alternatives with a probability measure on the alternatives. The set of possible alternatives is called the sample space, each alternative being an element of the sample space. For a discrete set of alternatives, a probability measure simply involves the assignment of a probability to each alternative. The probabilities are of course nonnegative and sum to one. A sample space and its probability measure will be called an ensemble* and will be denoted by a capital letter; the outcome will be denoted by the same letter, lower case.

* In most of the mathematical literature, what we call an ensemble here is called a probability space.

For an ensemble U with a sample space {a_1, a_2, ..., a_K}, the probability that the outcome u will be a particular element a_k of the sample space will be denoted by P_U(a_k). The probability that the outcome will be an arbitrary element u is denoted P_U(u). In this expression, the subscript U is used as a reminder of which ensemble is under consideration and the argument u is used as a variable that takes on values from the sample space. When no confusion can arise, the subscript will be omitted.

As an example, the ensemble U might represent the output of a source at a given time where the source alphabet is the set of letters {a_1, ..., a_K} and P(a_k) is the probability that the output will be letter a_k.

We shall usually be concerned with experiments having a number of outcomes rather than a single outcome. For example, we might be interested in a sequence of source letters, or in the input and output to a channel, or in a sequence of inputs and outputs to a channel.

Suppose that we denote the outcomes by x and y in a two-outcome experiment and suppose that x is a selection from the set of alternatives a_1, ..., a_K and y is a selection from the set of alternatives b_1, ..., b_J.
The set {a_1, ..., a_K} is called the sample space for X, the set {b_1, ..., b_J} is called the sample space for Y, and the set of pairs {a_k, b_j}, 1 <= k <= K, 1 <= j <= J, is called the joint sample space for the joint ensemble XY. The probability of the pair of outcomes x = a_k, y = b_j is denoted P_{XY}(a_k, b_j). If P_X(a_k) > 0, the conditional probability that outcome y is b_j, given that outcome x is a_k, is defined as

P_{Y|X}(b_j | a_k) = P_{XY}(a_k, b_j) / P_X(a_k).   (2.1.4)

In abbreviated notation, this is

P(y | x) = P(x,y) / P(x).   (2.1.5)

Likewise

P(x | y) = P(x,y) / P(y).   (2.1.6)

The events x = a_k and y = b_j are defined to be statistically independent if

P_{XY}(a_k, b_j) = P_X(a_k) P_Y(b_j).   (2.1.7)

If P_X(a_k) > 0, this is equivalent to

P_{Y|X}(b_j | a_k) = P_Y(b_j),   (2.1.8)

so that the conditioning does not alter the probability that y = b_j. The ensembles X and Y are statistically independent if (2.1.7) is satisfied for all pairs a_k, b_j in the joint sample space.

Next consider an experiment with many outcomes, say u_1, u_2, ..., u_N, each selected from a set of possible alternatives. The set of possible alternatives for outcome u_n is called the sample space for U_n, 1 <= n <= N.

[...]

H(X | Y) = Sum over x,y of P(x,y) log [1 / P(x|y)].   (2.2.16)

This is interpreted as the average information (over x and y) required to specify x after y is known.†

† See, for example, R. C. Tolman, The Principles of Statistical Mechanics, p. 68 (Tolman's H is the negative of the entropy).

If (2.2.13) is averaged over the XY ensemble, we find that the average mutual information between x and y is the difference between the entropy of X and the conditional entropy of X given Y:

I(X;Y) = H(X) - H(X | Y).   (2.2.17)

This equation shows that we can interpret I(X;Y) as the average amount of uncertainty in X resolved by the observation of the outcome in the Y ensemble, H(X|Y) being the average remaining uncertainty in X after the observation.

We can obtain some additional relations between self- and mutual informations by considering a joint ensemble XY to be a single ensemble whose elements are the xy pairs of the joint sample space. The self-information of an x,y pair is then

I(x,y) = log [1 / P(x,y)].   (2.2.18)

Since P(x,y) = P(x)P(y|x) = P(y)P(x|y), we obtain

I(x,y) = I(x) + I(y|x) = I(y) + I(x|y).   (2.2.19)

The mutual information can also be written in terms of I(x,y) as

I(x;y) = I(x) + I(y) - I(x,y).   (2.2.20)

Averaging these expressions over the joint XY ensemble, we obtain

H(XY) = H(X) + H(Y|X) = H(Y) + H(X|Y),   (2.2.21)

I(X;Y) = H(X) + H(Y) - H(XY).   (2.2.22)

Next, let u_1, ..., u_N be the outcomes in a joint ensemble U_1 ... U_N. The conditional mutual information between u_1 and u_2 given u_3 is defined, consistently with (2.2.1), as

I(u_1; u_2 | u_3) = log [P(u_1 | u_2, u_3) / P(u_1 | u_3)]   (2.2.23)
= I(u_1 | u_3) - I(u_1 | u_2, u_3).   (2.2.24)

The average conditional mutual information is then given by

I(U_1; U_2 | U_3) = Sum of P(u_1, u_2, u_3) log [P(u_1 | u_2, u_3) / P(u_1 | u_3)]   (2.2.25)
= H(U_1 | U_3) - H(U_1 | U_2 U_3).   (2.2.26)
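A quick numerical check of these identities, using an assumed joint probability assignment (not from the text), is sketched below; it verifies that (2.2.17), (2.2.22), and the direct average of the mutual information all give the same value of I(X;Y).

```python
from math import log2

# Assumed joint probabilities P(x,y) for a small joint ensemble XY.
P = {('a1', 'b1'): 0.4, ('a1', 'b2'): 0.1,
     ('a2', 'b1'): 0.2, ('a2', 'b2'): 0.3}

Px, Py = {}, {}
for (x, y), p in P.items():
    Px[x] = Px.get(x, 0.0) + p
    Py[y] = Py.get(y, 0.0) + p

H_X  = -sum(p * log2(p) for p in Px.values())
H_Y  = -sum(p * log2(p) for p in Py.values())
H_XY = -sum(p * log2(p) for p in P.values())
H_X_given_Y = -sum(p * log2(p / Py[y]) for (x, y), p in P.items())

I1 = H_X - H_X_given_Y                                              # (2.2.17)
I2 = H_X + H_Y - H_XY                                               # (2.2.22)
I3 = sum(p * log2(p / (Px[x] * Py[y])) for (x, y), p in P.items())  # direct average

print(round(I1, 6), round(I2, 6), round(I3, 6))   # all three agree
```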
We could now develop an unlimited number of relations between conditional and unconditional mutual and self-informations by using joint outcomes in place of single outcomes in these quantities. One relation of particular interest is that the mutual information provided about a particular outcome u_1 by a particular pair of outcomes u_2 u_3 is equal to the information provided about u_1 by u_2 plus that provided about u_1 by u_3 conditioned on u_2. To see this, we have

I(u_1; u_2 u_3) = log [P(u_1 | u_2, u_3) / P(u_1)]
= log [P(u_1 | u_2) / P(u_1)] + log [P(u_1 | u_2, u_3) / P(u_1 | u_2)]
= I(u_1; u_2) + I(u_1; u_3 | u_2).   (2.2.27)

A second relationship, following from the chain rule of probability, P(u_1, u_2, ..., u_N) = P(u_1)P(u_2 | u_1) ... P(u_N | u_1, ..., u_{N-1}), is

I(u_1, u_2, ..., u_N) = I(u_1) + I(u_2 | u_1) + ... + I(u_N | u_1, ..., u_{N-1}).   (2.2.28)

Averaging (2.2.27) and (2.2.28) over the joint ensemble, we obtain

I(U_1; U_2 U_3) = I(U_1; U_2) + I(U_1; U_3 | U_2),   (2.2.29)*

H(U_1 ... U_N) = H(U_1) + H(U_2 | U_1) + ... + H(U_N | U_1 ... U_{N-1}).   (2.2.30)

* All equations and theorems marked with an asterisk in this section and the next are also valid for nondiscrete ensembles (see Sections 2.4 and 2.5).

Example 2.3. Consider the channel of Figure 2.2.1 again, but consider using it three times in succession so that the input is a sequence x_1 x_2 x_3 of three binary digits and the output is a sequence y_1 y_2 y_3 of three binary digits. Suppose also that we constrain the input to be a triple repetition of the same digit, using the sequence a_1 a_1 a_1 with probability 1/2 and the sequence a_2 a_2 a_2 with probability 1/2. Finally, assume that the channel acts independently on each digit or, in other words, that

P(y_1 y_2 y_3 | x_1 x_2 x_3) = P(y_1 | x_1) P(y_2 | x_2) P(y_3 | x_3).   (2.2.31)

We shall analyze the mutual information when the sequence a_1 a_1 a_1 is sent and the sequence b_2 b_1 b_1 is received. We shall see that the first output provides negative information about the input but that the next two outputs provide enough positive information to overcome this initial confusion. As in (2.2.6), we have

I(a_1; b_2) = log 2ε.   (2.2.32)

The conditional information provided by the second output is

I(a_1; b_1 | b_2) = log [P(a_1 | b_2 b_1) / P(a_1 | b_2)] = -log 2ε.   (2.2.33)

We see that the conditional information provided by the second output exactly counterbalances the negative information on the first output. This is intuitively satisfying since, after the reception of b_2 b_1, the receiver is just as uncertain of the input as it was initially. The conditional information provided by the third received digit is

I(a_1; b_1 | b_2 b_1) = log 2(1 - ε).   (2.2.34)

The total information provided by the three received digits about the input is then positive, corresponding to the a posteriori probability of the input a_1 a_1 a_1 being larger than the a priori probability of a_1 a_1 a_1.
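The bookkeeping in this example is easy to check numerically; the short sketch below does so for an assumed crossover probability ε = 0.1, treating Figure 2.2.1 as a binary symmetric channel (an assumption consistent with the information values quoted above).

```python
from math import log2

# Example 2.3 numerically: the input is a1a1a1 or a2a2a2, each with
# probability 1/2, each digit goes through the channel independently, and
# b2 b1 b1 is received.  eps = 0.1 is an assumed value for illustration.
eps = 0.1

# Likelihoods of the received sequence b2 b1 b1 under the two possible inputs.
like_a1 = eps * (1 - eps) ** 2           # input a1a1a1
like_a2 = (1 - eps) * eps ** 2           # input a2a2a2
post_a1 = like_a1 / (like_a1 + like_a2)  # a posteriori probability of a1a1a1

info_first = log2(2 * eps)               # first output alone: negative
info_total = log2(post_a1 / 0.5)         # all three outputs together
print(round(info_first, 3), round(info_total, 3))
# info_total equals log2(2*(1 - eps)) > 0: the last two outputs more than
# undo the misleading first output.
```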
| Since the entropy of an ensemble is maximized when the elements are equiprobable, we might surmise that the entropy of an ensemble is increased whenever the probability of one element is incrementally increased at the expense of some more probable element; this result is proven in Problem 2.15. In the next theorem, we show that even though mutual information as a random variable can be negative, the average mutual information is always nonnegative. ‘Theorem 2.3.2.* Let YY be a discrete joint ensemble, The average mutual information between X and Y satisfies WXGY) >0 (2.3.5) with equality if and only if X and ¥ are statistically independent. Proof. We show that —I(X;¥) < 0. P(x) IY) = (log oS Ps) nO Plx|y) (2.3.6) Consider the sum in (2.3.6) to be over only those y for which P(z,y) > 0. For these terms, P(x) > 0, P(x | y) > 0, and (2.3.2) can be applied to each term. =X; Slog. S Pen] 5 -1] 23.7) ) Re ty) = (log of SPP) — Pees o]}s° 238) Average Mutual Information and Entropy 25 Equation 2.3.7 is satisfied with equality if and only if P(@) = Ple|y) whenever P(x.) > 0. Since the sum in (2.3.8) is over only those :ry pairs for which P(x.) > 0, (2.3.8) is satisfied with equality ifand only if P(@)P(y) = 0 when P(xy) = 0. Thus both inequalities are satisfied with equality, and consequently ((X; ¥) = 0, if and only if Y and ¥ are statistically independent. | As an immediate consequence of this theorem, we can use the relationship = H(X) — H(X| ¥) to obtain H(X) > H(X| Y) (2.3.9) K with equality if and only if X and Y are statistically independent. Thus any conditioning on an ensemble can only reduce the entropy of the ensemble. It is important to note that (2.3.9) involves an averaging over both the X and Y ensembles. The quantity = P(e | y log P(e | y) can be either larger or smaller than H(X) (see Problem 2.16). Applying (2.3.9) to each term of (2.2.30), letting U,, play the role of X and U,-+* U,,.. the role of ¥, we have x HO, UY ST HU,) (2.3.10) with equality if and only if the ensembles are statistically independent. ‘Theorem 2.3.3.* Let XYZ be a discrete joint ensemble. Then WXYZ) 20 (2.3.11) with equality if and only if, conditional on each z, X and Y are statistically independent; that is, if P(ry| P(r | 2)P(y | 2) (2.3.12) for cach clement in the joint sample space for which P(2) > 0. Proof. Repeat the steps of the proof of Theorem 2.3.2, adding in the conditioning 01 Combining (2.3.11) and (2.2.26), we see that H(X|Z)> H(X|ZY) (2.3.13) with equality if and only if (2.3.12) is satisfied ‘The situation in which 1(X; ¥ |Z) = 0 has a number of interesting inter- pretations. We can visualize the situation as a pair of channels in cascade as 6 A Measure of Information shown in Figure 2.3.2. The ¥ ensemble is the input to the first channel, the Z ensemble is both the output from the first channel and the input to the second channel, and the ¥ ensemble is the output from the second channel. We assume that the output of the second channel depends statistically only on the input to the second channel; that is, that PY |) =P |2,2; alla,y,2 with P@2)>0 (2.3.14) Multiplying both sides by PC 2), we obtain (23.12), so that U(X; ¥|Z)=0 (2.3.15)* For such a pair of cascaded channels, it is reasonable to expect the average mutual information between V and ¥ to be no greater than that through x Zz y a o 6 a @ be Peay) laren eee eo an & by Figure 2.3.2. Cascaded channels. se, From cither channel separately. We now show that this is indeed the (2.2.29), we have both of the following equations. 
The situation in which I(X;Y | Z) = 0 has a number of interesting interpretations. We can visualize the situation as a pair of channels in cascade as shown in Figure 2.3.2. The X ensemble is the input to the first channel, the Z ensemble is both the output from the first channel and the input to the second channel, and the Y ensemble is the output from the second channel. We assume that the output of the second channel depends statistically only on the input to the second channel; that is, that

P(y | z, x) = P(y | z);  all x, y, z with P(x,z) > 0.   (2.3.14)

Multiplying both sides by P(x | z), we obtain (2.3.12), so that

I(X;Y | Z) = 0.   (2.3.15)*

Figure 2.3.2. Cascaded channels.

For such a pair of cascaded channels, it is reasonable to expect the average mutual information between X and Y to be no greater than that through either channel separately. We now show that this is indeed the case. From (2.2.29), we have both of the following equations:

I(X; YZ) = I(X;Y) + I(X;Z | Y)   (2.3.16)*
= I(X;Z) + I(X;Y | Z).   (2.3.17)*

Equating the right-hand sides and using (2.3.15), we have

I(X;Z) = I(X;Y) + I(X;Z | Y).   (2.3.18)*

From (2.3.11), I(X;Z | Y) >= 0 and, thus, (2.3.15) implies that

I(X;Z) >= I(X;Y).   (2.3.19a)*

From the symmetry of (2.3.12) between X and Y, it also follows that

I(Z;Y) >= I(X;Y).   (2.3.19b)*

Writing out (2.3.19) in terms of entropies, we have

H(X) - H(X | Z) >= H(X) - H(X | Y),

H(X | Z) <= H(X | Y).   (2.3.20)

The average uncertainty H(X | Z) about the input of a channel given the output is called the equivocation on the channel, and thus (2.3.20) yields the intuitively satisfying result that this uncertainty or equivocation can never decrease as we go further from the input on a sequence of cascaded channels.

Equations 2.3.19 and 2.3.20 become somewhat more surprising if we interpret the second box in Figure 2.3.2 as a data processor, processing the output of the first box, which is now the channel. Whether this processing on the ensemble Z is deterministic or probabilistic, it can never decrease the equivocation about X nor increase the mutual information about X. This does not mean that we should never process the output of a channel and, in fact, processing is usually necessary to make any use of the output of the channel. Instead, it means that average mutual information must be interpreted as an average measure of available statistical data rather than in terms of the usefulness of the presentation. This result will be discussed in more detail in Chapter 4.

2.4 Probability and Mutual Information for Continuous Ensembles

Consider an ensemble X where the outcome x is a selection from the sample space consisting of the set of real numbers. A probability measure on this sample space is most easily given in terms of the distribution function F_X(a) = Pr[x <= a].

[...]

Next, suppose that z is a reversible transformation of y, so that z = f(y). We can then consider y as the channel output and z as the transformed output, yielding*

p_Z(z) = p_Y(y) |dy/dz|,   (2.4.20)
p_{XZ}(x,z) = p_{XY}(x,y) |dy/dz|,   (2.4.21)
I_{X;Z}(x;z) = I_{X;Y}(x;y).   (2.4.22)

* There are some minor mathematical problems here. Since z is uniquely specified by y, the joint probability density p_{YZ}(y,z) will have impulse functions in it. Since this is a special case of the more general ensembles to be discussed later, we shall ignore these mathematical details for the time being.

Combining these equations, we have I(X;Y) = I(X;Z) and, consequently, the average mutual information between two ensembles is invariant to any reversible transformation of one of the outcomes. The same argument of course can be applied independently to any reversible transformation of the other outcome.

Let us next consider whether a meaningful definition of self-information can be made for a continuous ensemble. Let X be an ensemble with a real-valued outcome x and a finite probability density p(x). Let the x axis be quantized into intervals of length Δ, so that the self-information of an interval from x_1 - Δ to x_1 is

log [1 / Pr(x_1 - Δ < x <= x_1)].   (2.4.23)

In the limit as Δ approaches 0, Pr[x_1 - Δ < x <= x_1] approaches Δ p_X(x_1), which approaches 0. Thus the self-information of an interval approaches infinity as the length of the interval approaches 0. This result is not surprising if we think of representing real numbers by their decimal expansions. Since an infinite sequence of decimal digits is required to exactly specify an arbitrary real number, we would expect the self-information to be infinite.
The difficulty here lies in demanding an exact specification of a real number. From a physical standpoint, we are always satisfied with an approximate specification, but any appropriate generalization of the concept of self-information must involve the kind of approximation desired. This problem will be treated from a fundamental standpoint in Chapter 9, but we shall use the term self-information only on discrete ensembles.

For the purposes of calculating and manipulating various average mutual informations and conditional mutual informations, it is often useful to define the entropy of a continuous ensemble. If an ensemble X has a probability density p(x), we define the entropy of X by

H(X) = Integral of p(x) log [1/p(x)] dx.   (2.4.24)

Likewise, conditional entropy is defined by

H(X | Y) = Double integral of p(x,y) log [1/p(x|y)] dx dy.   (2.4.25)

Using these definitions, we have, as in (2.2.17) and (2.2.22),

I(X;Y) = H(X) - H(X | Y)   (2.4.26)
= H(Y) - H(Y | X)   (2.4.27)
= H(X) + H(Y) - H(XY).   (2.4.28)

These entropies are not necessarily positive, not necessarily finite, not invariant to transformations of the outcomes, and not interpretable as average self-informations.

Example 2.4. The following example of the preceding definitions will be useful later in dealing with additive Gaussian noise channels. Let the input x to a channel be a zero-mean Gaussian random variable with probability density

p(x) = [1/sqrt(2πE)] exp[-x²/(2E)].   (2.4.29)

The parameter E is the mean square value or "energy" of the input. Suppose that the output of the channel, y, is the sum of the input and an independent zero-mean Gaussian random variable of variance σ². The conditional probability density of the output given the input is then

p(y | x) = [1/sqrt(2πσ²)] exp[-(y - x)²/(2σ²)].   (2.4.30)

That is, given x, y has a Gaussian distribution of variance σ² centered around x. The joint probability density p(x,y) is given by p(x)p(y|x) and the joint ensemble XY is fully specified. It is most convenient to calculate the average mutual information I(X;Y) from (2.4.27):

H(Y | X) = -Integral of p(x) Integral of p(y|x) log p(y|x) dy dx   (2.4.31)
= Integral of p(x) Integral of p(y|x) [log sqrt(2πσ²) + ((y - x)²/(2σ²)) log e] dy dx
= Integral of p(x) [log sqrt(2πσ²) + (1/2) log e] dx   (2.4.32)
= (1/2) log 2πeσ².   (2.4.33)

In (2.4.32), we have used the fact that the integral of p(y|x)(y - x)² dy is simply the variance of the conditional distribution, or σ². We next observe that the channel output is the sum of two independent Gaussian random variables and is thus Gaussian* with variance E + σ²:

p(y) = [1/sqrt(2π(E + σ²))] exp[-y²/(2(E + σ²))].   (2.4.34)

* See Problem 2.22.

Calculating H(Y) in the same way as H(Y | X), we obtain

H(Y) = (1/2) log 2πe(E + σ²),   (2.4.35)

I(X;Y) = H(Y) - H(Y | X) = (1/2) log (1 + E/σ²).   (2.4.36)

We observe that, as σ² approaches 0, the output y approximates the input x more and more exactly and I(X;Y) approaches infinity. This is to be expected since we have already concluded that the self-information of any given sample value of x should be infinite.
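The closed form (2.4.36) is easy to evaluate; the sketch below does so for a few assumed values of E and σ², none of which come from the text.

```python
from math import log2

def gaussian_mi(E, sigma2):
    """I(X;Y) = (1/2) log2(1 + E/sigma2) in bits, from (2.4.36)."""
    return 0.5 * log2(1 + E / sigma2)

# Assumed example values of signal energy E and noise variance sigma^2.
for E, sigma2 in [(1.0, 1.0), (10.0, 1.0), (100.0, 1.0)]:
    print(E, sigma2, round(gaussian_mi(E, sigma2), 3))
# Roughly 0.5, 1.73, and 3.33 bits: the information grows without bound as
# the noise variance shrinks relative to E, consistent with the remark that
# I(X;Y) approaches infinity as sigma^2 goes to 0.
```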
We shall often be interested in joint ensembles for which some of the outcomes are discrete and some continuous. The easiest way to specify a probability measure on such ensembles is to specify the joint probability of the discrete outcomes taking on each possible joint alternative and to specify the conditional joint probability density on the continuous outcomes conditional on each joint alternative for the discrete outcomes. For example, if outcome x has the sample space {a_1, ..., a_K} and outcome y has the set of real numbers as its sample space, we specify P_X(a_k) for 1 <= k <= K and p_{Y|X}(y_1 | a_k) for all real numbers y_1 and 1 <= k <= K. The joint probability of a_k and y_1 is then

p_{XY}(a_k, y_1) = P_X(a_k) p_{Y|X}(y_1 | a_k).   (2.4.37)

The conditional probability of an x alternative given a y alternative, y_1, for which p_Y(y_1) > 0 is

P_{X|Y}(a_k | y_1) = P_X(a_k) p_{Y|X}(y_1 | a_k) / p_Y(y_1).   (2.4.38)

The mutual information and average mutual information between x and y are given by

I_{X;Y}(a_k; y_1) = log [P_{X|Y}(a_k | y_1) / P_X(a_k)],   (2.4.39)

I(X;Y) = Sum from k = 1 to K of Integral of P_X(a_k) p_{Y|X}(y_1 | a_k) log [p_{Y|X}(y_1 | a_k) / p_Y(y_1)] dy_1.   (2.4.40)

Conditional mutual information is defined in the analogous way. All of the relationships with asterisks in Sections 2.2 and 2.3 clearly hold for these mixed discrete and continuous ensembles.

2.5 Mutual Information for Arbitrary Ensembles

The previously discussed discrete ensembles and continuous ensembles with probability densities appear to be adequate to treat virtually all of the problems of engineering interest in information theory, particularly if we employ some judicious limiting operations to treat more general cases. However, in order to state general theorems precisely without a plethora of special cases, a more abstract point of view is often desirable. A detailed treatment of such a point of view requires measure theory and is beyond the
