
Entropy

Tomasz Downarowicz
Institute of Mathematics and Computer Science, Wroclaw University of Technology

Extracted from Entropy in Dynamical Systems, Cambridge University Press, New Mathematical Monographs 18, Cambridge 2011

1 Introduction
Nowadays, nearly every kind of information is turned into digital form. Digital cameras turn every image into a computer file. The same happens to musical recordings or movies. Even our mathematical work is registered mainly as computer files. Analog information is nearly extinct. While studying dynamical systems (in any understanding of this term), sooner or later one is forced to face the following question: How can the information about the evolution of a given dynamical system be most precisely turned into a digital form? Researchers specializing in dynamical systems are responsible for providing the theoretical background for such a transition. So suppose that we do observe a dynamical system, and that we indeed turn our observation into digital form. That means that, from time to time, we produce a digital report, a computer file, containing all our observations since the last report. Assume for simplicity that such reports are produced at equal time distances, say, at integer times. Of course, due to the bounded capacity of our recording devices and the limited time between the reports, our files have bounded size (in bits). Because the variety of digital files of bounded size is finite, we can say that at every integer moment of time we produce just one symbol, where the collection of all possible symbols, i.e. the alphabet, is finite. An illustrative example is filming a scene using a digital camera. Every unit of time, the camera registers an image, which is in fact a bitmap of some fixed size (the camera resolution). The camera turns the live scene into a sequence of bitmaps. We can treat every such bitmap as a single symbol in the alphabet of the language of the camera. The sequence of symbols is produced as long as the observation is being conducted. We have no reason to restrict the global observation time, and we can agree that it goes on forever; we usually also assume that the observation has been conducted forever in the past as well. In this manner, the history of our recording takes on the form of a bilateral sequence of symbols from some finite alphabet, say Λ.

Advancing in time by a unit corresponds, on one hand, to the unit-time evolution of the dynamical system and, on the other, to shifting the enumeration of our sequence of symbols. This way we have come to the conclusion that the digital form of the observation is nothing else but an element of the space Λ^ℤ of all bilateral sequences of symbols, and the action on this space is the familiar shift transformation σ advancing the enumeration: for x = (x(n))_{n∈ℤ}, (σx)(n) = x(n+1). Now, in most situations such a digitalization of the dynamical system will be lossy, i.e., it will capture only some aspects of the observed dynamical system, and much of the information will be lost. For example, the digital camera will not be able to register objects hidden behind other objects; moreover, it will not see objects smaller than one pixel or their movements until they pass from one pixel to another. However, it may happen that, after a while, each object will eventually become detectable, and that we will be able to reconstruct its trajectory from the recorded information. Of course, lossy digitalization is always possible and hence presents a lesser kind of challenge. We will be much more interested in lossless digitalization. When and how is it possible to digitalize a dynamical system so that no information is lost, i.e., in such a way that after viewing the entire sequence of symbols we can completely reconstruct the evolution of the system? In this note the task of encoding a system with the smallest possible alphabet is referred to as data compression. The reader will find answers to the above question at two major levels: measure-theoretic and topological. In the first case the digitalization is governed by the Kolmogorov-Sinai entropy of the dynamical system, the first major subject of this note. In the topological setup the situation is more complicated. Topological entropy, our second most important notion, turns out to be insufficient to decide about digitalization that respects the topological structure. Thus another parameter, called symbolic extension entropy, emerges as the third main object discussed here.

2 A few words about the history of entropy


Below we review very briefly the development of the notion of entropy, focusing on the achievements crucial for the genesis of the basic concepts of entropy discussed in this note. For a more complete survey we refer to the expository article [Katok, 2007]. The term entropy was coined by the German physicist Rudolf Clausius from the Greek en- = in + trope = a turning [Clausius, 1850]. The word was chosen by analogy with energy and was designed to denote the form of energy that any energy eventually and inevitably turns into: useless heat. The idea was inspired by an earlier formulation by the French physicist and mathematician Nicolas Léonard Sadi Carnot [Carnot, 1824] of what is now known as the second Law of Thermodynamics: entropy represents the energy no longer capable of performing work, and in any isolated system it can only grow. The Austrian physicist Ludwig Boltzmann put entropy into the probabilistic setup of statistical mechanics [Boltzmann, 1877]. Entropy was also generalized around 1932 to quantum mechanics by John von Neumann [see von Neumann, 1968].

Later this led to the invention of entropy as a term in probability and information theory by the American electronic engineer and mathematician Claude Elwood Shannon, now recognized as the father of information theory. Many of the notions have not changed much since they first occurred in Shannon's seminal paper A Mathematical Theory of Communication [Shannon, 1948]. Dynamical entropy in dynamical systems was created by one of the most influential mathematicians of modern times, Andrei Nikolaevich Kolmogorov [Kolmogorov, 1958, 1959], and improved by his student Yakov Grigorevich Sinai, who practically brought it to its contemporary form [Sinai, 1959]. The most important theorem about the dynamical entropy, the Shannon-McMillan-Breiman Theorem, gives this notion a very deep meaning. The theorem was conceived by Shannon [Shannon, 1948] and proved in increasing strength by Brockway McMillan [McMillan, 1953] (L¹-convergence), Leo Breiman [Breiman, 1957] (almost everywhere convergence), and Kai Lai Chung [Chung, 1961] (for countable partitions). In 1970 Wolfgang Krieger obtained one of the most important results from the point of view of data compression, about the existence (and cardinality) of finite generators for automorphisms with finite entropy [Krieger, 1970]. In 1970 Donald Ornstein proved that Kolmogorov-Sinai entropy is a complete invariant in the class of Bernoulli systems, a fact considered one of the most important features of entropy (alternatively, of Bernoulli systems) [Ornstein, 1970]. In 1965, Roy L. Adler, Alan G. Konheim and M. Harry McAndrew carried the concept of dynamical entropy over to topological dynamics [Adler et al., 1965], and in 1970 Efim I. Dinaburg and (independently) in 1971 Rufus Bowen redefined it in the language of metric spaces [Dinaburg, 1970; Bowen, 1971]. With regard to entropy in topological systems, probably the most important theorem is the Variational Principle, proved by L. Wayne Goodwyn (the easy direction) and Timothy Goodman (the hard direction), which connects the notions of topological and Kolmogorov-Sinai entropy [Goodwyn, 1971; Goodman, 1971] (earlier Dinaburg proved both directions for finite-dimensional spaces [Dinaburg, 1970]). The theory of symbolic extensions of topological systems was initiated by Mike Boyle around 1990 [Boyle, 1991]. The outcome of this early work is published in [Boyle et al., 2002]. The author of this note contributed to establishing that invariant measures and their entropies play a crucial role in computing the so-called symbolic extension entropy [Downarowicz, 2001; Boyle and Downarowicz, 2004; Downarowicz, 2005]. Dynamical entropy generalizing the Kolmogorov-Sinai dynamical entropy to noncommutative dynamics arose as an adaptation of von Neumann's quantum entropy in a work of Robert Alicki, Johan Andries, Mark Fannes and Pim Tuyls [Alicki et al., 1996] and was then applied to doubly stochastic operators by Igor I. Makarov [Makarov, 2000]. The axiomatic approach to entropy of doubly stochastic operators, as well as topological entropy of Markov operators, has been developed in [Downarowicz and Frej, 2005]. The term entropy is used in many other branches of science, sometimes distant from physics or mathematics (such as sociology), where it no longer maintains its rigorous quantitative character. Usually, it roughly means disorder, chaos, decay of diversity, or a tendency toward uniform distribution of kinds.

3 Multiple meanings of entropy


In the following paragraphs we review some of the various meanings of the word entropy and try to explain how they are connected. We devote a few pages to explaining how dynamical entropy corresponds to the data compression rate.

3.1 Entropy in physics

In classical physics, a physical system is a collection of objects (bodies) whose state is parametrized by several characteristics such as the distribution of density, pressure, temperature, velocity, chemical potential, etc. The change of entropy of a physical system, as it passes from one state to another, equals

ΔS = ∫ dQ/T,

where dQ denotes an element of heat being absorbed (or emitted; then it has negative sign) by a body, T is the absolute temperature of that body at that moment, and the integration is over all elements of heat active in the passage. The above formula allows one to compare the entropies of different states of a system, or to compute the entropy of each state up to an additive constant (this is satisfactory in most cases). Notice that when an element dQ of heat is transmitted from a warmer body of temperature T₁ to a cooler one of temperature T₂, the entropy of the first body changes by −dQ/T₁, while that of the other rises by dQ/T₂. Since T₂ < T₁, the absolute value of the latter fraction is larger, and jointly the entropy of the two-body system increases (while the global energy remains the same). A system is isolated if it does not exchange energy or matter (or even information) with its surroundings. By virtue of the first Law of Thermodynamics, the conservation of energy principle, an isolated system can pass only between states of the same global energy. The second Law of Thermodynamics introduces irreversibility of the evolution: an isolated system cannot pass from a state of higher entropy to a state of lower entropy. Equivalently, it says that it is impossible to perform a process whose only final effect is the transmission of heat from a cooler medium to a warmer one. Any such transmission must involve outside work; the elements participating in the work will also change their states, and the overall entropy will rise. The first and second laws of thermodynamics together imply that an isolated system will tend to the state of maximal entropy among all states of the same energy. The energy distributed in this state is incapable of any further activity. The state of maximal entropy is often called the thermodynamical death of the system.
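To make the sign argument concrete, here is a tiny numerical sketch (my own illustration, not part of the original text; the temperatures and the amount of heat are made-up values in arbitrary units):

```python
# Entropy bookkeeping for a small amount of heat dQ passing spontaneously
# from a warmer body at temperature T1 to a cooler body at T2.
dQ, T1, T2 = 1.0, 400.0, 300.0
dS_warm = -dQ / T1            # the warmer body emits the heat: its entropy drops
dS_cool = +dQ / T2            # the cooler body absorbs it: its entropy rises by more
print(dS_warm + dS_cool > 0)  # True: the entropy of the two-body system increases
```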

Ludwig Boltzmann gave another, probabilistic meaning to entropy. For each state A, the (negative) difference between the entropy of A and the entropy of the maximal state B is nearly proportional to the logarithm of the probability that the system spontaneously assumes state A:

S(A) − S_max ≈ k log₂(Prob(A)).

The proportionality factor k is known as the Boltzmann constant. In this approach the probability of the maximal state is almost equal to 1, while the probabilities of states of lower entropy are exponentially small. This provides another interpretation of the second Law of Thermodynamics: the system spontaneously assumes the state of maximal entropy simply because all other states are extremely unlikely.

Example. Consider a physical system consisting of an ideal gas enclosed in a cylindrical container of volume 1. [The original figure shows the cylinder in state B, with the piston at position p, and in state A, with the piston at the center.] The state B of maximal entropy is clearly the one where both pressure and temperature are constant (P₀ and T₀, respectively) throughout the container. Any other state can be achieved only with help from outside. Suppose one places a piston at a position p < 1/2 in the cylinder (the left figure; thermodynamically, this is still the state B) and then slowly moves the piston to the center of the cylinder (position 1/2), allowing heat to flow between the cylinder and its environment, where the temperature is T₀, which stabilizes the temperature at T₀ all the time. Let A be the final state (the right figure). Note that both states A and B have the same energy level inside the system. To compute the jump of entropy one needs to examine what exactly happens during the passage. The force acting on the piston at position x is proportional to the difference between the pressures:

F = c( P₀(1−p)/(1−x) − P₀ p/x ).

Thus, the work done while moving the piston equals

W = ∫_p^{1/2} F dx = cP₀( (1−p) ln(1−p) + p ln p + ln 2 ).

The function (1−p) ln(1−p) + p ln p is negative and assumes its minimal value −ln 2 at p = 1/2. Thus the above work W is positive and represents the amount of energy delivered to the system from outside. During the process the compressed gas on the right emits heat, while the decompressed gas on the left absorbs heat. By conservation of energy (applied to the enhanced system including the outside world), the gas altogether will emit heat to the environment equivalent to the delivered work, Q = W. Since the temperature is constant all the time, the change in entropy between states B and A of the gas is simply −1/T₀ times Q, i.e.,

ΔS = (cP₀/T₀)( −(1−p) ln(1−p) − p ln p − ln 2 ).

Clearly ΔS is negative. This confirms, as was already expected, that the outside intervention has lowered the entropy of the gas.

This example illustrates very clearly Boltzmann's interpretation of entropy. Assume that there are N particles of the gas independently wandering inside the container. For each particle the probability of falling in the left or the right half of the container is 1/2. The state A of the gas occurs spontaneously if pN and (1−p)N particles fall in the left and right halves of the container, respectively. By elementary combinatorial formulae, the probability of such an event equals

Prob(A) = N! / ( (pN)! ((1−p)N)! ) · 2^(−N).

By Stirling's formula (ln n! ≈ n ln n − n for large n), the logarithm of Prob(A) equals approximately

N( −(1−p) ln(1−p) − p ln p − ln 2 ),

which is indeed proportional to the drop ΔS of entropy between the states B and A (see above).
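The proportionality between ΔS and ln Prob(A) can be checked numerically; below is a minimal sketch (my own, not part of the original text) that sets c = P₀ = T₀ = 1, uses natural logarithms throughout, and picks made-up sample values of p and N:

```python
import math

def entropy_drop(p, c=1.0, P0=1.0, T0=1.0):
    """Entropy change of the gas when the piston is moved from position p to 1/2."""
    return (c * P0 / T0) * (-(1 - p) * math.log(1 - p) - p * math.log(p) - math.log(2))

def log_prob_spontaneous(p, N):
    """Exact natural log of Prob(A): pN of N independent particles land in the left half."""
    k = round(p * N)
    return (math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
            - N * math.log(2))

p, N = 0.3, 10**6
print(entropy_drop(p))                 # negative: the intervention lowered the entropy
print(log_prob_spontaneous(p, N) / N)  # per-particle value, matching entropy_drop(p) closely
```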

3.2 Shannon entropy

In probability theory, a probability vector p is a sequence of finitely many nonnegative numbers {p₁, p₂, ..., pₙ} whose sum equals 1. The Shannon entropy of a probability vector p is defined as

H(p) = − Σ_{i=1}^{n} pᵢ log₂ pᵢ

(where 0 log₂ 0 = 0). Probability vectors occur naturally in connection with finite partitions of a probability space. Consider an abstract space Ω equipped with a probability measure μ assigning probabilities to measurable subsets of Ω. A finite partition P of Ω is a collection of pairwise disjoint measurable sets {A₁, A₂, ..., Aₙ} whose union is Ω. Then the probabilities pᵢ = μ(Aᵢ) form a probability vector p_P. One associates the entropy of this vector with the (ordered) partition P: H_μ(P) = H(p_P). In this setup entropy can be linked with information. Given a measurable set A, the information I(A) associated with A is defined as −log₂(μ(A)). The information function I_P associated with a partition P = {A₁, A₂, ..., Aₙ} is defined on the space Ω and it assumes the constant value I(Aᵢ) at all points belonging to the set Aᵢ. Formally,

I_P(ω) = − Σ_{i=1}^{n} log₂(μ(Aᵢ)) · 1_{Aᵢ}(ω),

where 1_{Aᵢ} is the characteristic function of Aᵢ. One easily verifies that the expected value of this function with respect to μ coincides with the entropy H_μ(P).
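As a quick sanity check of these definitions (my own sketch, not part of the original text), the following computes H(p) and verifies that the expected value of the information function equals H_μ(P) for a toy probability vector:

```python
import math

def shannon_entropy(p):
    """H(p) = -sum p_i log2 p_i, with the convention 0 log2 0 = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# A toy partition: mu(A_i) given directly as the probability vector p_P.
p = [1/8, 1/4, 1/8, 1/2]
info = [-math.log2(x) for x in p]           # I(A_i) = -log2 mu(A_i)
expected_info = sum(m * i for m, i in zip(p, info))
print(shannon_entropy(p), expected_info)    # both equal 1.75 = 7/4
```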

We shall now give an interpretation of the information function and entropy, the key notions in entropy theory. The partition P of the space Ω associates with each element ω the information that answers the question "in which Aᵢ are you?". That is the best knowledge we can acquire about the points based solely on the partition. One bit of information is equivalent to acquiring an answer to a binary question, i.e., a question with a choice between two possibilities. Unless the partition has two elements, the question "in which Aᵢ are you?" is not binary. But it can be replaced by a series of binary questions, and one is free to use any arrangement (tree) of such questions. In such an arrangement, the number of questions N(ω) (i.e., the amount of information in bits) needed to determine the location of the point within the partition may vary from point to point (see the Example below). The smaller the expected value of N(ω), the better the arrangement. It turns out that the best arrangement satisfies

I_P(ω) ≤ N(ω) ≤ I_P(ω) + 1

for μ-almost every ω. The difference between I_P(ω) and N(ω) follows from the crudeness of measuring information by counting binary questions; the outcome is always a positive integer. The real number I_P(ω) can be interpreted as the precise value. Entropy is the expected amount of information needed to locate a point in the partition.

Example. Consider the unit square representing the space Ω, where the probability μ is the Lebesgue measure (i.e., the surface area), and the partition P of Ω into four sets Aᵢ of probabilities 1/8, 1/4, 1/8, 1/2, respectively, as shown in the figure. The information function equals −log₂(1/8) = 3 on A₁ and A₃, −log₂(1/4) = 2 on A₂ and −log₂(1/2) = 1 on A₄. The entropy of P equals

H_μ(P) = (1/8)·3 + (1/4)·2 + (1/8)·3 + (1/2)·1 = 7/4.

The arrangement of questions that optimizes the expected number of questions asked is the following:

1. Are you in the left half? The answer "no" locates ω in A₄ using one bit. Otherwise the next question is:
2. Are you in the central square of the left half? The "yes" answer locates ω in A₂ using two bits. If not, the last question is:
3. Are you in the top half of the whole square? Now "yes" or "no" locate ω in A₁ or A₃, respectively. This takes three bits.

[The original diagram shows the question tree: Question 1: no → A₄ (1 bit); yes → Question 2: yes → A₂ (2 bits); no → Question 3: yes → A₁ (3 bits), no → A₃ (3 bits).]

In this example the number of questions equals exactly the information function at every point and the expected number of questions equals the entropy 7/4. There does not exist a better arrangement of questions. Of course, such accuracy is possible only when the probabilities of the sets Aᵢ are integer powers of 2; in general the information is not integer valued.
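The optimal arrangement of binary questions is exactly an optimal prefix-free (Huffman) code for the probability vector p_P. A small sketch (my own, not part of the original text) recovers the question counts 3, 2, 3, 1 of the example and the expected value 7/4:

```python
import heapq, math

def huffman_lengths(probs):
    """Code-word lengths (= numbers of questions) of an optimal binary question tree."""
    lengths = [0] * len(probs)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tag = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                     # every symbol under the merged node gets one bit deeper
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tag, s1 + s2))
        tag += 1
    return lengths

probs = [1/8, 1/4, 1/8, 1/2]                  # A1, A2, A3, A4 from the example
L = huffman_lengths(probs)
print(L)                                      # [3, 2, 3, 1]
print(sum(p * l for p, l in zip(probs, L)))   # expected number of questions: 1.75
print(-sum(p * math.log2(p) for p in probs))  # entropy: 1.75
```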

Another interpretation of Shannon entropy deals with the notion of uncertainty. Let X be a random variable defined on the probability space Ω and assuming values in a finite set {x₁, x₂, ..., xₙ}. The variable X generates a partition P of Ω into the sets Aᵢ = {ω : X(ω) = xᵢ} (called the preimage partition). The probabilities pᵢ = μ(Aᵢ) = Prob{X = xᵢ} form a probability vector called the distribution of X. Suppose an experimenter knows the distribution of X and tries to guess the outcome of X before performing the experiment, i.e., before picking some ω and reading the value X(ω). His/her uncertainty about the outcome is the expected value of the information he/she is missing to be certain. As explained above, that is exactly the entropy H_μ(P).

3.3 Connection between Shannon and Boltzmann entropy

Both notions in the title of this subsection refer to probability, and there is an evident similarity in the formulae. But the analogy fails to be obvious. In the literature many different attempts toward understanding the relation can be found. In simple words, the interpretation relies on the distinction between the macroscopic state considered in classical thermodynamics and the microscopic states of statistical mechanics. A thermodynamical state A (a distribution of pressure, temperature, etc.) can be realized in many different ways at the microscopic level, where one distinguishes all individual particles, their positions and velocity vectors. As explained above, the difference of Boltzmann entropies S(A) − S_max is proportional to log₂(Prob(A)), the logarithm of the probability of the macroscopic state A in the probability space Ω of all microscopic states. This leads to the equation

S_max − S(A) = k · I(A),    (3.1)

where I(A) is the probabilistic information associated with the set A ⊂ Ω. So, Boltzmann entropy seems to be closer to Shannon information than to Shannon entropy. This interpretation causes additional confusion, because S(A) appears in this equation with a negative sign, which reverses the direction of monotonicity; the more information is associated with a macrostate A, the smaller its Boltzmann entropy. This is usually explained by interpreting what it means to associate information with a state. Namely, the information about the state of the system is information available to an outside observer. Thus it is reasonable to assume that this information actually escapes from the system, and hence it should receive the negative sign. Indeed, it is the knowledge about the system possessed by an outside observer that increases the usefulness of the energy contained in that system to do physical work, i.e., it decreases the system's entropy. The interpretation goes further: each microstate in a system appearing to the observer as being in macrostate A still hides the information about its identity. Let I_h(A) denote the joint information still hiding in the system if its state is identified as A. This quantity is clearly maximal at the maximal state, where it equals S_max/k. In a state A it is diminished by I(A), the information already stolen by the observer. So, one has

I_h(A) = S_max/k − I(A).

This, together with (3.1), yields S(A) = k · I_h(A), which provides a new interpretation of the Boltzmann entropy: it is proportional to the information still hiding in the system provided the macrostate A has been detected. So far the entropy has been determined only up to an additive constant: we can compute the change of entropy when the system passes from one state to another. It is very hard to determine the proper additive constant of the Boltzmann entropy, because the entropy of the maximal state depends on the level of precision of identifying the microstates. Without a quantum approach, the space Ω is infinite and so is the maximal entropy. However, if the space of states is assumed finite, the absolute entropy obtains a new interpretation, already in terms of the Shannon entropy (not just of the information function). Namely, in such a case, the highest possible Shannon entropy H_μ(P) is achieved when P is the partition of the space into single states and μ is the uniform measure on Ω, i.e., such that each state has probability (#Ω)⁻¹. It is thus natural to set S_max = k · H_μ(P) = k log₂ #Ω. The detection that the system is in state A is equivalent to acquiring the information I(A) = −log₂(μ(A)) = −log₂(#A/#Ω). By the equation (3.1) we get

S(A) = k( log₂ #Ω + log₂(#A/#Ω) ) = k log₂ #A.

The latter equals (k times) the Shannon entropy of the normalized uniform measure restricted to A. In this manner we have compared the Boltzmann entropy directly with the Shannon entropy and we have gotten rid of the unknown additive constant. The whole interpretation above is the subject of many discussions, as it makes the entropy of a system depend on the seemingly nonphysical notion of the knowledge of a mysterious observer. The classical Maxwell's paradox [Maxwell, 1871] is based on the assumption that it is possible to acquire information about the parameters of individual particles without any expense of heat or work. To avoid such paradoxes, one must agree that every bit of acquired information has its physical entropy equivalent (equal to the Boltzmann constant k), by which the entropy of the memory of the observer increases. In consequence, erasing one bit of information from a memory (say, of a computer) at temperature T results in the emission of heat in the amount kT to the environment. Such calculations set limits on the theoretical maximal speed of computers, because the heat can be driven away only with a limited speed.
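For a finite state space the last formula is easy to play with numerically; here is a minimal sketch (my own, not part of the original text; k is set to 1 and the set sizes are made up):

```python
import math

k = 1.0
n_omega, n_A = 2**20, 2**12              # sizes of the whole state space and of the macrostate A
S_max = k * math.log2(n_omega)           # entropy of the maximal state
I_A = -math.log2(n_A / n_omega)          # information gained by detecting the macrostate A
print(S_max - k * I_A, k * math.log2(n_A))  # equal: S(A) = k log2 #A
```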

3.4 Dynamical entropy

This is the key entropy notion in ergodic theory; a version of the Kolmogorov-Sinai entropy for one partition. It refers to Shannon entropy, but it differs significantly, as it makes sense only in the context of a measure-preserving transformation. Let T be a measurable transformation of the space Ω which preserves the probability measure μ, i.e., such that μ(T⁻¹(A)) = μ(A) for every measurable set A ⊂ Ω. Let P be a finite measurable partition of Ω and let Pⁿ denote the partition P ∨ T⁻¹(P) ∨ ... ∨ T^{-(n-1)}(P) (the least common refinement of n preimages of P). By a subadditivity argument, the sequence of Shannon entropies (1/n)H_μ(Pⁿ) converges to its infimum. The limit

h_μ(T, P) = lim_{n→∞} (1/n) H_μ(Pⁿ)    (3.2)

is called the dynamical entropy of the process generated by P under the action of T. This notion has a very important physical interpretation, which we now try to capture. First of all, one should understand that in the passage from a physical system to its mathematical model (a dynamical system) (Ω, μ, T), the points ω ∈ Ω should not be interpreted as particles, nor the transformation T as the way the particles move around the system. Such an interpretation is sometimes possible, but it has a rather restricted range of applications. Usually a point ω (later we will use the letter x) represents the physical state of the entire system. The space Ω is hence called the phase space. The transformation T is interpreted as the set of physical rules causing the system that is currently in some state ω to assume, in the following instant of time (for simplicity we consider models with discrete time), the state Tω. Such a model is deterministic in the sense that the initial state has the entire future evolution imprinted in it. Usually, however, the observer cannot fully determine the identity of the initial state. He knows only the values of a few measurements, which give only rough information, and the future of the system is, from his standpoint, random. In particular, the values of his future measurements are random variables. As time passes, the observer learns more and more about the evolution (by repeating his measurements), through which, in fact, he learns about the initial state ω. A finite-valued random variable X imposes a finite partition P of the phase space Ω. After time n, the observer has learned the values X(ω), X(Tω), ..., X(Tⁿ⁻¹ω), i.e., he/she has learned which element of the partition Pⁿ contains ω. His/her acquired information about the identity of ω equals I_{Pⁿ}(ω), the expected value of which is H_μ(Pⁿ). It is now seen directly from the definition that: the dynamical entropy equals the average (over time and the phase space) gain in one step of information about the initial state. Notice that it does not matter whether in the end (at time infinity) the observer determines the initial state completely or not. What matters is the gain of information in one step. If the transformation T is invertible, we can also assume that the evolution of the system runs from time −∞, i.e., it has an infinite past. In such a case ω should be called the current state rather than the initial state (in a process that runs from time −∞, there is no initial state). Then the entropy h_μ(T, P) can be computed alternatively using


conditional entropy:

h_μ(T, P) = lim_{n→∞} H_μ(P | T⁻¹(P) ∨ T⁻²(P) ∨ ... ∨ T^{-(n-1)}(P)) = H_μ(P | P⁻),

where P⁻ is the sigma-algebra generated by all partitions T⁻ⁿ(P) (n > 0) and is called the past. This formula provides another interpretation: the dynamical entropy equals the expected amount of information about the current state ω acquired, in addition to what was already known from the infinite past, by learning the element of the partition P to which ω belongs. Notice that in this last formulation the averaging over time is absent.
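To see the definition at work, here is a minimal numerical sketch (my own, not part of the original text): a two-state Markov chain plays the role of the process generated by a two-element partition, and the quantities (1/n)H_μ(Pⁿ) are estimated from empirical n-block frequencies; they decrease toward the known entropy of the chain. The transition matrix and all helper names are my own choices:

```python
import math, random
from collections import Counter

def block_entropy(seq, n):
    """Empirical Shannon entropy (in bits) of the n-blocks occurring in seq."""
    blocks = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Simulate a two-state Markov chain with transition matrix [[0.9, 0.1], [0.5, 0.5]].
P = [[0.9, 0.1], [0.5, 0.5]]
random.seed(0)
state, seq = 0, []
for _ in range(200_000):
    seq.append(state)
    state = 0 if random.random() < P[state][0] else 1

for n in (1, 2, 4, 8):
    print(n, block_entropy(seq, n) / n)       # decreases with n toward the entropy of the chain

# Known value for a Markov chain: sum over states of pi(state) * H(transition row).
pi0 = P[1][0] / (P[0][1] + P[1][0])           # stationary probability of state 0
row_H = lambda row: -sum(q * math.log2(q) for q in row if q > 0)
print("theory:", pi0 * row_H(P[0]) + (1 - pi0) * row_H(P[1]))
```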

3.5 Dynamical entropy as data compression rate

The interpretation of entropy given in this paragraph is going to be fundamental for our understanding of dynamical entropy; in fact, we will also refer to a similar interpretation when discussing topological dynamics. We will distinguish two kinds of data compression: horizontal and vertical. In horizontal data compression we are interested in replacing computer files by other files, as short as possible. We want to shrink them horizontally. Vertical data compression concerns infinite sequences of symbols interpreted as signals. Such signals occur for instance in any everlasting data transmission, such as television or radio broadcasting. Vertical data compression attempts to losslessly translate the signal, maintaining the same speed of transmission (average lengths of incoming files) but using a smaller alphabet. We call it vertical simply by contrast to horizontal. One can imagine that the symbols of a large alphabet, say of cardinality 2ᵏ, are binary columns of k zeros or ones, and then vertical data compression will reduce not the length but the height of the signal. This kind of compression is useful for data transmission in real time; a compression device translates the incoming signal into the optimized alphabet and sends it out at the same speed as the signal arrives (perhaps with some delay). First we discuss the connection between entropy and horizontal data compression. Consider a collection of computer files, each in the form of a long string B (we will call it a block) of symbols belonging to some finite alphabet Λ. For simplicity let us assume that all files are binary, i.e., that Λ = {0, 1}. Suppose we want to compress them to save disk space. To do it, we must establish a coding algorithm Φ which replaces our files B by some other (preferably shorter) files Φ(B) so that no information is lost, i.e., we must also have a decoding algorithm Φ⁻¹ allowing us to reconstruct the original files when needed. Of course, we assume that our algorithm is efficient, that is, it compresses the files as much as possible. Such an algorithm allows us to measure the effective information content of every file: a file carries s bits of information (regardless of its original size) if it can be compressed to a binary file of length s(B) = s. This complies with our previous interpretation of information: each symbol in the compressed file is an answer to a binary question, and s(B) is the optimized number of answers needed to identify the original file B.


Somewhat surprisingly, the amount of information s(B) depends not only on the initial size m = m(B) of the original file B but also on subtle properties of its structure. Evidently s(B) is not the simple-minded Shannon information function: there are 2ᵐ binary blocks of a given length m, all of them equally likely, so that each has probability 2⁻ᵐ, and hence each should carry the same amount of information, equal to m log₂ 2 = m. But s(B) does not behave that simply!

Example. Consider the two bitmaps shown in this figure. They have the same dimensions

and the same density, i.e., the same number of black pixels. As uncompressed computer files, they occupy exactly the same amount of disk space. However, if we compress them using nearly any available zipping program, the sizes of the zipped files will differ significantly. The left-hand side picture will shrink nearly 40 times, while the right-hand side one only 8 times. Why? To quickly get an intuitive understanding of this phenomenon, imagine that you try to pass these pictures over the phone to another person, so that he/she can literally copy them based on your verbal description. The left picture can be precisely described in a few sentences containing the precise coordinates of only two points, while the second picture, if we want it precisely copied, requires tediously dictating the coordinates of nearly all black pixels. Evidently, the right-hand side picture carries more information. A file can be strongly compressed if it reveals some regularity or predictability which can be used to shorten its description. The more it looks random, the more information must be passed over to the recipient, and the less it can be compressed, no matter how intelligent the zipping algorithm.

How can we a priori, i.e., without experimenting with compression algorithms, just by looking at the file's internal structure, predict the compression rate s(B)/m(B) of a given block B? Here is an idea: the compression rate should be interpreted as the average information content per symbol. Recall that the dynamical entropy was interpreted similarly, as the expected gain of information per step. If we treat our long block as a portion of the orbit of some point representing a shift-invariant measure μ on the symbolic space Λ^{ℕ∪{0}} of all sequences over Λ, then the global information carried by this block should be approximately equal to its length (number of steps in the shift map) times the dynamical entropy of μ. It will be only an approximation, but it should work. The alphabet Λ plays the role of the finite partition P of the symbolic space, and the partition Pⁿ used in the definition of the dynamical entropy can be identified with Λⁿ, the collection of all blocks over Λ of length n. Any shift-invariant measure μ on Λ^{ℕ∪{0}} assigns values μ(A) to all blocks A ∈ Λⁿ (n ∈ ℕ) following some rules of

consistency; we skip discussing them now. It is enough to say that a long block B (of a very large length m) nearly determines a shift-invariant measure: for subblocks A of lengths n much smaller than m (but still very large) it determines their frequencies:

μ_B(A) = #{1 ≤ i ≤ m − n + 1 : B[i, i+n−1] = A} / (m − n + 1),

i.e., it associates with A the probability of seeing A in B at a randomly chosen window of length n. Of course, this measure is not completely defined (values on longer blocks are not determined), so we cannot perform the full computation of the dynamical entropy. But instead, we can use the approximate value (1/n) H_{μ_B}(Λⁿ) (see (3.2)), which is defined and practically computable for some reasonable length n. We call it the combinatorial entropy of the block B. In other words, we decide that the compression rate should be approximately

s(B)/m(B) ≈ (1/n) H_{μ_B}(Λⁿ).    (3.3)
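A direct implementation of the combinatorial entropy (my own sketch, not part of the original text; it uses all overlapping windows of length n, as in the frequency formula above):

```python
import math, random
from collections import Counter

def combinatorial_entropy(B, n):
    """(1/n) times the Shannon entropy (bits) of the empirical distribution of n-subblocks of B."""
    windows = [B[i:i + n] for i in range(len(B) - n + 1)]
    counts = Counter(windows)
    total = len(windows)
    H = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return H / n

print(combinatorial_entropy("01" * 500, 4))   # periodic block: about 0.25, compresses very well
random.seed(1)
coin = "".join(random.choice("01") for _ in range(1000))
print(combinatorial_entropy(coin, 4))         # "random" block: close to 1, hardly compressible
```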

This idea works perfectly well; in most cases the combinatorial entropy estimates the compression rate very accurately. We replace a rigorous proof with a simple example.

Example. We will construct a lossless compression algorithm and apply it to a file B of a finite length m. The compressed file will consist of a decoding instruction followed by the coded image Φ(B) of B. To save on the output length, the decoding instruction must be relatively short compared to m. This is easily achieved in codes which refer to relatively short components of the block B. For example, the instruction of the code may consist of the complete list of subblocks A of some carefully chosen length n (appearing in B) followed by the list of their images Φ(A). The images may have different lengths (as short as possible). The assignment A ↦ Φ(A) will depend on B; therefore it must be included in the output file. The coded image Φ(B) is obtained by cutting B into subblocks B = A₁A₂...Aₖ of length n and concatenating the images of these subblocks: Φ(B) = Φ(A₁)Φ(A₂)···Φ(Aₖ). There are additional issues here: in order for such a code to be invertible, the images Φ(A) must form a prefix-free family (i.e., no block in this family is a prefix of another). Then there is always a unique way of cutting Φ(B) back into the images Φ(Aᵢ). But this does not essentially affect the computations. For best compression results, it is reasonable to assign the shortest images to the subblocks appearing in B with the highest frequencies. For instance, consider a long binary block
B = 010001111001111...110 = 010, 001, 111, 001, 111, ..., 110.

On the right, B is shown divided into subblocks of length n = 3. Suppose that the frequencies of the subblocks in this division are:

000: 0%    001: 40%    010: 10%    011: 10%
100: 0%    101: 0%     110: 10%    111: 30%

The theoretical value of the compression rate (obtained using the formula (3.3) for n = 3) is

( −0.4 log₂(0.4) − 0.3 log₂(0.3) − 3 · 0.1 log₂(0.1) ) / 3 ≈ 68.2%.


A binary prefix-free code giving the shortest images to the most frequent subblocks is

001 → 0,   111 → 10,   010 → 110,   011 → 1110,   110 → 1111.

The compression rate achieved on B using this code equals

(0.4 · 1 + 0.3 · 2 + 0.1 · 3 + 0.1 · 4 + 0.1 · 4) / 3 = 70%

(ignoring the finite length of the decoding instruction, which is simply a recording of the above code). This code is nearly optimal (at least for this file).
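The mechanics of this example can be reproduced in a few lines (my own sketch, not part of the original text; the sample block B is assembled to have exactly the frequencies listed above):

```python
import random

# The prefix-free code from the example.
code = {"001": "0", "111": "10", "010": "110", "011": "1110", "110": "1111"}

def encode(B, code, n=3):
    """Cut B into disjoint n-blocks and concatenate their images."""
    return "".join(code[B[i:i + n]] for i in range(0, len(B), n))

def decode(bits, code):
    """Prefix-freeness guarantees a unique way of cutting the image back into code words."""
    inverse, out, buf = {v: k for k, v in code.items()}, [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

random.seed(0)
pieces = ["001"] * 40 + ["111"] * 30 + ["010"] * 10 + ["011"] * 10 + ["110"] * 10
random.shuffle(pieces)
B = "".join(pieces)                       # a block with the frequencies of the example

image = encode(B, code)
assert decode(image, code) == B           # lossless
print(len(image) / len(B))                # 0.7, the 70% rate computed above
```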

We now focus on the vertical data compression. Its connection with the dynamical entropy is easier to describe but requires a more advanced apparatus. Since we are dealing with an infinite sequence (the signal), we can assume it represents some genuine (not only approximate, as it was for a long but finite block) shift-invariant probability measure μ on the symbolic space Λ^ℤ. Recall that the dynamical entropy h = h_μ(σ, Λ) (where σ denotes the shift map) is the expected amount of new information per step (i.e., per incoming symbol of the signal). We intend to replace the alphabet by a possibly smaller one. It is obvious that if we manage to losslessly replace the alphabet Λ by another, say Λ₀, then the entropy h cannot exceed log₂ #Λ₀. Conversely, it turns out that any alphabet of cardinality #Λ₀ > 2ʰ is sufficient to encode the signal. This is a consequence of the famous Krieger Generator Theorem. Thus we have the following connection:

log₂ #Λ₀ − 1 ≤ h ≤ log₂ #Λ₀,

where Λ₀ is the smallest alphabet allowing one to encode the signal. In this manner the cardinality of the optimal alphabet is completely determined by the entropy. If 2ʰ happens to be an integer we seem to have two choices, but there is an easy way to decide which one to choose.

3.6 Topological entropy

By a topological dynamical system we understand a pair (X, T), where X is a compact metric space and T : X → X is continuous. Just like in the measure-theoretic case, we are interested in a notion of entropy that captures the complexity of the dynamics, interpreted as the amount of information transmitted by the system per unit of time. Again, the initial state carries complete information about the evolution (both forward and backward in time, or just forward, depending on whether T is invertible or not), but the observer cannot read all this information immediately. Since we do not fix any particular measure, we want to use the metric (or, more generally, the topology) to describe the amount of information about the initial state acquired by the observer in one step (one measurement). A reasonable interpretation relies on the notion of a topological resolution. Intuitively, a resolution is a parameter measuring the ability of the observer to distinguish between points.

A resolution is topological when this ability agrees with the topological structure of the space. The simplest such resolution is based on the metric and a positive number ε: two points are indistinguishable if they are less than ε apart. Another way to define a topological resolution (applicable in all topological spaces) refers to an open cover of X. Points cannot be distinguished when they belong to a common cell of the cover. By compactness, the observer is able to see only a finite number N of classes of indistinguishability and classify the current state of the system into one of them. The logarithm to base 2 of N roughly corresponds to the number of binary questions whose answers are equivalent to what he has learned, i.e., to the amount of acquired information. The static entropy, instead of an expectation (which requires a measure), will now be replaced by the supremum, over the space, of this information. The rest is done just like in the measure-theoretic case; we define the topological dynamical entropy with respect to a resolution as the average (along the time) information acquired per step. Finally we pass to the supremum as the resolution refines. Notice that indistinguishability is not an equivalence relation; the classes often overlap without being equal. This makes the interpretation of a topological resolution a bit fuzzy and its usage in rigorous computations rather complicated.

We will describe in more detail topological entropy in the sense of Dinaburg and Bowen, using the metric [comp. Dinaburg, 1970; Bowen, 1971]. Let X be endowed with a metric d. For n ∈ ℕ, by dₙ we will mean the metric

dₙ(x, y) = max{d(Tⁱx, Tⁱy) : i = 0, ..., n−1}.

Of course d₁ = d, dₙ₊₁ ≥ dₙ for each natural n, and, by compactness of X, all these metrics are pairwise uniformly equivalent. Following the concept of indistinguishability in the resolution determined by a distance ε > 0, a set F ⊂ X is said to be (n, ε)-separated if the distances between distinct points of F in the metric dₙ are at least ε:

x, y ∈ F, x ≠ y  ⟹  dₙ(x, y) ≥ ε.

By compactness, the cardinalities of (n, ε)-separated sets in X are finite and bounded. By s(n, ε) we will denote the maximal cardinality of an (n, ε)-separated set:

s(n, ε) = max{#F : F is (n, ε)-separated}.

It is clear that s(n, ε) (hence also the first two terms defined below) depends decreasingly on ε. So, we can apply the general scheme:

H(n, ε) = log s(n, ε),
h(T, ε) = lim sup_{n→∞} (1/n) H(n, ε),
h(T) = lim_{ε→0} h(T, ε).

For the interpretation, suppose we observe the system through a device whose resolution is determined by the distance ε. Then we can distinguish between two n-orbits (x, Tx, ..., Tⁿ⁻¹x) and (y, Ty, ..., Tⁿ⁻¹y) if and only if for at least one i ∈ {0, ..., n−1} the points Tⁱx, Tⁱy can be distinguished (which means their distance is at least ε), i.e., when the points x, y are (n, ε)-separated. Thus, s(n, ε) is the maximal number of pairwise distinguishable n-orbits that exists in the system. The term h(T, ε) is hence the rate of the exponential growth of the number of ε-distinguishable n-orbits (and h(T) is the limit of such rates as the resolutions refine). The connection between topological entropy and Kolmogorov-Sinai entropy is established by the celebrated Variational Principle [Goodwyn, 1971; Goodman, 1971]. By M_T(X) we will denote the set of all T-invariant Borel probability measures on X (which is nonempty by an appropriate fixed-point theorem).

Theorem 3.4 (Variational Principle). h(T) = sup{h_μ(T) : μ ∈ M_T(X)}.
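Topological entropy can also be estimated numerically straight from the definition of (n, ε)-separated sets. Below is a rough sketch (my own, not part of the original text): for the doubling map on the circle, whose topological entropy is log₂ 2 = 1 bit per step, it greedily builds (n, ε)-separated sets out of a finite grid of initial points, so the printed values are crude estimates of (1/n) log₂ s(n, ε):

```python
import math

def circle_dist(a, b):
    """Distance on the unit circle represented as [0, 1)."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)

def d_n(x, y, T, n):
    """The Bowen metric d_n: maximal distance along the first n steps of the two orbits."""
    best = 0.0
    for _ in range(n):
        best = max(best, circle_dist(x, y))
        x, y = T(x), T(y)
    return best

def separated_count(T, n, eps, candidates):
    """Greedy lower bound for s(n, eps): keep a point if it is (n, eps)-separated from those kept."""
    kept = []
    for x in candidates:
        if all(d_n(x, y, T, n) >= eps for y in kept):
            kept.append(x)
    return len(kept)

doubling = lambda x: (2.0 * x) % 1.0          # the doubling map of the circle
grid = [i / 2048 for i in range(2048)]
for n in (4, 6, 8):
    s = separated_count(doubling, n, eps=0.2, candidates=grid)
    print(n, math.log2(s) / n)                # slowly decreases toward the true value 1 as n grows
```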

3.7 Symbolic extension entropy

Given a topological dynamical system (X, T) of finite entropy, we are interested in computing the amount of information per unit of time transferred in this system. Suppose we want to compute verbatim the vertical data compression, as described in Section 3.5: the logarithm of the minimal cardinality of the alphabet allowing one to losslessly encode the system in real time. Since we work with topological dynamical systems, we want the coding to respect not only the measurable but also the topological structure. There is nothing like the Krieger Generator Theorem in topological dynamics. A topological dynamical system of finite entropy (even an invertible one) need not be conjugate to a subshift over a finite alphabet. Thus, we must first create its lossless digitalization, which has only one possible form: a subshift in which the original system occurs as a factor. In other words, we seek a symbolic extension. Only then can we try to optimize the alphabet. It turns out that such a vertical data compression is governed not by the topological entropy but by a different (possibly much larger) parameter. Precisely, what are symbolic extensions? First of all, we do not accept extensions in the form of subshifts over infinite alphabets, even countable ones, like ℕ₀. Such extensions are useless in terms of vertical data compression. So, in all we are about to say, a symbolic extension always means a subshift over a finite alphabet. Notice that if a system has infinite topological entropy, it simply does not have any symbolic extensions, as these have finite topological entropy and the entropy of a factor is smaller than or equal to that of the extension. Thus we will focus exclusively on systems (X, T) with finite topological entropy. Next, we want to explain that, no matter whether T is invertible or not, we are always going to seek a symbolic extension in the form of a bilateral subshift, i.e., a subset of Λ^ℤ, never of Λ^{ℕ₀}. Without any assumptions, the system may happen to possess an invertible factor of positive topological entropy, which eliminates the existence of unilateral symbolic extensions. So, if we want a unified theory of symbolic extensions, we should agree on the definition of a symbolic extension given below:


Definition 3.5. Let (X, T) be a topological dynamical system. By a symbolic extension of (X, T) we understand a bilateral subshift (Y, S), where Y ⊂ Λ^ℤ (Λ finite) and S is the shift map, together with a topological factor map π : Y → X.

The construction of symbolic extensions with minimized topological entropy relies on controlling the measure-theoretic entropies of the measures in the extension. This leads to the following notion:

Definition 3.6. Let (Y, S) be an extension of (X, T) via a map π : Y → X. The extension entropy function is defined on the collection M_T(X) of all T-invariant measures on X as

h^π_ext(μ) = sup{h(ν, S) : ν ∈ M_S(Y), πν = μ},

where h denotes the entropy function on M_S(Y).

We now introduce the key notions of this section:

Definition 3.7. Let (X, T) be a topological dynamical system. The symbolic extension entropy function is defined on M_T(X) as

h_sex(μ) = h_sex(μ, T) = inf{h^π_ext(μ) : π is a symbolic extension of (X, T)}.

The topological symbolic extension entropy of (X, T) is

h_sex(T) = inf{h(S) : (Y, S) is a symbolic extension of (X, T)}.

In both cases, the infimum over the empty set is ∞.

It turns out that the following equality holds (see [Boyle and Downarowicz, 2004]):

Theorem 3.8 (Symbolic Extension Entropy Variational Principle). h_sex(T) = sup{h_sex(μ) : μ ∈ M_T(X)}.

A system has no symbolic extensions if and only if h_sex(T) = ∞, if and only if h_sex(μ, T) = ∞ for all (equivalently, some) invariant measures μ. If h_sex(T) is finite, we can create a symbolic extension whose topological entropy is only a bit larger, for example, smaller than log(2^{h_sex(T)} + 1). The alphabet in this symbolic extension can be optimized to contain 2^{h_sex(T)} + 1 elements. On the other hand, there is no way to do better than that. In this manner the symbolic extension entropy controls the vertical data compression in the topological sense. An effective method allowing one to establish (or at least estimate) the symbolic extension entropy function (and hence the topological symbolic extension entropy) is quite complicated and refers to the notion of an entropy structure, which describes how entropy emerges for invariant measures as the topological resolution refines. In most dynamical systems the topological symbolic extension entropy is essentially larger than the topological entropy; it can even be infinite (i.e., a system has no symbolic extension)

in systems with finite topological entropy. The details are described in [Boyle and Downarowicz, 2004]. Let us mention that, historically, in the first attempt to capture the entropy jump occurring when passing to a symbolic extension, Mike Boyle defined the topological residual entropy, which, in our notation, is the difference h_res(T) = h_sex(T) − h(T) [see Boyle, 1991; Boyle et al., 2002].

3.8 Entropy as disorder

The interplay between Shannon and Boltzmann entropy has led to associating with the word entropy some colloquial understanding. In all its strict meanings (described above), entropy can be viewed as a measure of disorder and chaos, as long as by order one understands that things are segregated by their kind (e.g. by similar properties or parameter values). Chaos is the state of a system (physical or dynamical) in which elements of all kinds are mixed evenly throughout the space. For example, a container with gas is in its state of maximal entropy when the temperature and pressure are constant. That means there is approximately the same number of particles in every unit of the volume, and the proportion between slow and fast particles is everywhere the same. States of lower entropy occur when particles are organized: slower ones in one area, faster ones in another. A signal (an infinite sequence of symbols) has large entropy (i.e., compression rate) when all subblocks of a given length n appear with equal frequencies in all sufficiently long blocks. Any trace of organization and logic in the structure of a file allows for its compression and hence lowers its entropy. These observations generated a colloquial meaning of entropy. To have order in the house means to have food separated from utensils and plates, clothing arranged in the closet by type, trash segregated and deposited in appropriate recycling containers, etc. When these things get mixed together, entropy increases, causing disorder and chaos. Entropy is a term in the social sciences, too. In a social system, order is associated with a classification of the individuals by some criteria (stratification, education, skills, etc.) and assigning to them appropriate positions and roles in the system. Law and other mechanisms are enforced to keep such order. When this classification and assignment fails, the system falls into a chaos called entropy. Individuals lose their specialization. Everybody must do all kinds of things in order to survive. Ergo, entropy equals lack of diversity.

References
Adler, R. L., Konheim, A. G., and McAndrew, M. H. 1965. Topological entropy. Trans. Amer. Math. Soc., 114, 309-319.
Alicki, R., Andries, J., Fannes, M., and Tuyls, P. 1996. An algebraic approach to the Kolmogorov-Sinai entropy. Rev. Math. Phys., 8(2), 167-184.

Boltzmann, L. 1877. Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respektive den Sätzen über das Wärmegleichgewicht. Wiener Berichte, 76, 373-435.
Bowen, R. 1971. Entropy for group endomorphisms and homogeneous spaces. Trans. Amer. Math. Soc., 153, 401-414.
Boyle, M. 1991. Quotients of subshifts. Adler conference lecture. Unpublished.
Boyle, M., and Downarowicz, T. 2004. The entropy theory of symbolic extensions. Invent. Math., 156(1), 119-161.
Boyle, M., Fiebig, D., and Fiebig, U.-R. 2002. Residual entropy, conditional entropy and subshift covers. Forum Math., 14(5), 713-757.
Breiman, L. 1957. The individual ergodic theorem of information theory. Ann. Math. Statist., 28, 809-811.
Carnot, S. 1824. Reflections on the Motive Power of Fire and on Machines Fitted to Develop that Power. Paris: Bachelier. French title: Réflexions sur la puissance motrice du feu et sur les machines propres à développer cette puissance.
Chung, K. L. 1961. A note on the ergodic theorem of information theory. Ann. Math. Statist., 32, 612-614.
Clausius, R. 1850. Über die bewegende Kraft der Wärme, Part I, Part II. Annalen der Physik, 79, 368-397, 500-524. English translation: On the Moving Force of Heat, and the Laws regarding the Nature of Heat itself which are deducible therefrom. Phil. Mag. (1851), 2, 1-21, 102-119.
Dinaburg, E. I. 1970. A correlation between topological entropy and metric entropy. Dokl. Akad. Nauk SSSR, 190, 19-22.
Downarowicz, T. 2001. Entropy of a symbolic extension of a dynamical system. Ergodic Theory Dynam. Systems, 21(4), 1051-1070.
Downarowicz, T. 2005. Entropy structure. J. Anal. Math., 96, 57-116.
Downarowicz, T., and Frej, B. 2005. Measure-theoretic and topological entropy of operators on function spaces. Ergodic Theory Dynam. Systems, 25(2), 455-481.
Goodman, T. N. T. 1971. Relating topological entropy and measure entropy. Bull. London Math. Soc., 3, 176-180.
Goodwyn, L. W. 1971. Topological entropy bounds measure-theoretic entropy. pp. 69-84. Lecture Notes in Math., Vol. 235.
Katok, A. 2007. Fifty years of entropy in dynamics: 1958-2007. J. Mod. Dyn., 1(4), 545-596.
Kolmogorov, A. N. 1958. A new metric invariant of transient dynamical systems and automorphisms in Lebesgue spaces. Dokl. Akad. Nauk SSSR (N.S.), 119, 861-864.

Kolmogorov, A. N. 1959. Entropy per unit time as a metric invariant of automorphisms. Dokl. Akad. Nauk SSSR, 124, 754-755.
Krieger, W. 1970. On entropy and generators of measure-preserving transformations. Trans. Amer. Math. Soc., 149, 453-464.
Makarov, I. I. 2000. Dynamical entropy for Markov operators. J. Dynam. Control Systems, 6(1), 1-11.
Maxwell, J. C. 1871. Theory of Heat. Reprinted by Dover Publications, Inc.
McMillan, B. 1953. The basic theorems of information theory. Ann. Math. Statistics, 24, 196-219.
Ornstein, D. S. 1970. Bernoulli shifts with the same entropy are isomorphic. Advances in Math., 4, 337-352.
Shannon, C. E. 1948. A mathematical theory of communication. Bell System Tech. J., 27, 379-423, 623-656.
Sinai, Y. G. 1959. On the concept of entropy for a dynamic system. Dokl. Akad. Nauk SSSR, 124, 768-771.
von Neumann, J. 1968. Mathematische Grundlagen der Quantenmechanik. Unveränderter Nachdruck der ersten Auflage von 1932. Die Grundlehren der mathematischen Wissenschaften, Band 38. Berlin: Springer-Verlag.

Institute of Mathematics and Computer Science
Wroclaw University of Technology
Wybrzeże Wyspiańskiego 27, 50-370 Wroclaw, POLAND
e-mail: [email protected]

