20.5 Arithmetic Coding
Chapter 20.
Less-Numerical Algorithms
Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43108-5) Copyright (C) 1988-1992 by Cambridge University Press. Programs Copyright (C) 1988-1992 by Numerical Recipes Software. Permission is granted for internet users to make one paper copy for their own personal use. Further reproduction, or any copying of machine-readable files (including this one) to any server computer, is strictly prohibited. To order Numerical Recipes books or CDROMs, visit website http://www.nr.com or call 1-800-872-7423 (North America only), or send email to [email protected] (outside North America).
Figure 20.5.1. Arithmetic coding of the message IOU... in the fictitious language Vowellish. Successive characters give successively finer subdivisions of the initial interval between 0 and 1. The final value can be output as the digits of a fraction in any desired radix. Note how the subinterval allocated to a character is proportional to its probability of occurrence.
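To make the subdivision described in the caption concrete, the following sketch narrows the interval between 0 and 1 one character at a time, using ordinary double arithmetic. The symbol order and the probabilities are assumptions chosen for illustration (they are not stated in this excerpt), and the name narrow_interval is hypothetical. It is a teaching sketch only; the routines in this section work with multiple-precision integers instead, because floating precision is exhausted after a handful of characters.

```c
/* Assumed Vowellish model, listed from the bottom of the unit interval upward.
   These probabilities are illustrative assumptions, not taken from the text. */
static const char   vowel[] = { 'U', 'O', 'I', 'E', 'A' };
static const double prob[]  = { 0.07, 0.30, 0.09, 0.42, 0.12 };

/* Narrow [*lo, *hi) once per character of msg.
   Returns 0 on success, -1 if msg contains a character outside the model. */
int narrow_interval(const char *msg, double *lo, double *hi)
{
    *lo = 0.0;
    *hi = 1.0;
    for (; *msg; msg++) {
        double cum = 0.0;            /* cumulative probability below this symbol */
        double width = *hi - *lo;    /* size of the current interval */
        int i, found = 0;
        for (i = 0; i < 5; i++) {
            if (vowel[i] == *msg) {
                /* The new subinterval is proportional to prob[i]. */
                *hi = *lo + width * (cum + prob[i]);
                *lo = *lo + width * cum;
                found = 1;
                break;
            }
            cum += prob[i];
        }
        if (!found) return -1;
    }
    return 0;
}
```

With these assumed probabilities, the message IOU narrows first to [0.37, 0.46), then to [0.3763, 0.4033), and finally to roughly [0.3763, 0.37819). Any fraction inside the final interval identifies the message so far.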
The routine arcmak constructs the cumulative frequency distribution table used to partition the interval at each stage. In the principal routine arcode, when an interval of size jdif is to be partitioned in the proportions of some n to some ntot, say, then we must compute (n*jdif)/ntot. With integer arithmetic, the numerator is likely to overflow; and, unfortunately, an expression like jdif/(ntot/n) is not equivalent. In the implementation below, we resort to double precision floating arithmetic for this calculation. Not only is this inefficient, but different roundoff errors can (albeit very rarely) make different machines encode differently, though any one type of machine will decode exactly what it encoded, since identical roundoff errors occur in the two processes. For serious use, one needs to replace this floating calculation with an integer computation in a double register (not available to the C programmer).

The internally set variable minint, which is the minimum allowed number of discrete steps between the upper and lower bounds, determines when new low-significance digits are added. minint must be large enough to provide resolution of all the input characters. That is, we must have $p_i \times \mathtt{minint} > 1$ for all $i$. A value of $100 N_{ch}$, or $1.1/\min p_i$, whichever is larger, is generally adequate. However, for safety, the routine below takes minint to be as large as possible, with the product minint*nradd just smaller than overflow. This results in some time inefficiency, and in a few unnecessary characters being output at the end of a message. You can
decrease minint if you want to live closer to the edge.

A final safety feature in arcmak is its refusal to believe zero values in the table nfreq; a 0 is treated as if it were a 1. If this were not done, the occurrence in a message of a single character whose nfreq entry is zero would result in scrambling the entire rest of the message. If you want to live dangerously, with a very slightly more efficient coding, you can delete the IMAX( ,1) operation.
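For readers who do want a smaller minint, the rule of thumb quoted above (the larger of 100 times the number of characters and 1.1 divided by the smallest probability) could be evaluated from a frequency table roughly as follows. This is a sketch under stated assumptions: the name suggest_minint is hypothetical, the table is zero-based (unlike the unit-offset arrays used by the routines here), probabilities are taken as count/total, and zero counts are treated as 1 to mirror the safety feature just described. If it is used together with arcmak, the extra end-of-message entry that arcmak appends (with count 1) should be included in the table passed in.

```c
/* Hypothetical helper, not part of the original routines.
   nfreq[0..nch-1] holds the symbol counts. Returns the larger of
   100*nch and 1.1/min(p_i), the latter rounded up to the next integer. */
unsigned long suggest_minint(const unsigned long nfreq[], unsigned long nch)
{
    unsigned long i, f, total = 0, fmin = 0;
    unsigned long by_count, by_prob;

    for (i = 0; i < nch; i++) {
        f = nfreq[i] ? nfreq[i] : 1;       /* a 0 is treated as if it were a 1 */
        total += f;
        if (i == 0 || f < fmin) fmin = f;
    }
    if (nch == 0) return 0;                /* empty table: nothing to resolve */

    by_count = 100UL * nch;
    /* min p_i = fmin/total, so 1.1/min p_i = 1.1*total/fmin; round upward. */
    by_prob = (unsigned long)(1.1 * (double)total / (double)fmin) + 1;
    return by_count > by_prob ? by_count : by_prob;
}
```

For five symbols with counts 12, 42, 9, 30, 7 the count-based term (500) dominates; for a very skewed table such as counts 1 and 1000, the probability-based term (1102) does.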
#include "nrutil.h"
#include <limits.h>                  /* ANSI header file containing integer ranges. */
#define MC 512
#ifdef ULONG_MAX                     /* Maximum value of unsigned long. */
#define MAXINT (ULONG_MAX >> 1)
#else
#define MAXINT 2147483647
#endif
/* Here MC is the largest anticipated value of nchh; MAXINT is a large positive
   integer that does not overflow. */

typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;

void arcmak(unsigned long nfreq[], unsigned long nchh, unsigned long nradd,
    arithcode *acode)
/* Given a table nfreq[1..nchh] of the frequency of occurrence of nchh symbols, and
   given a desired output radix nradd, initialize the cumulative frequency table and
   other variables for arithmetic compression in the structure acode. */
{
    unsigned long j;

    if (nchh > MC) nrerror("input radix may not exceed MC in arcmak.");
    if (nradd > 256) nrerror("output radix may not exceed 256 in arcmak.");
    acode->minint=MAXINT/nradd;
    acode->nch=nchh;
    acode->nrad=nradd;
    acode->ncumfq[1]=0;
    for (j=2;j<=acode->nch+1;j++)
        acode->ncumfq[j]=acode->ncumfq[j-1]+IMAX(nfreq[j-1],1);
    acode->ncum=acode->ncumfq[acode->nch+2]=acode->ncumfq[acode->nch+1]+1;
}
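As a worked illustration of the table that arcmak builds, the following standalone sketch restates just its table-building loop so that it can be run without nrutil.h. The name build_cumfreq is hypothetical; the unit-offset indexing follows the listing above, and the interpretation of the final slot as the end-of-message symbol is inferred from the convention that ich=nch means end of message.

```c
/* Hypothetical restatement of the cumulative-table loop in arcmak.
   nfreq[1..nch] are the counts (index 0 is unused); ncumfq must have room
   for indices 1..nch+2. Returns the total that arcmak stores in ncum. */
unsigned long build_cumfreq(const unsigned long nfreq[], unsigned long nch,
                            unsigned long ncumfq[])
{
    unsigned long j;

    ncumfq[1] = 0;
    for (j = 2; j <= nch + 1; j++) {
        unsigned long f = nfreq[j-1] ? nfreq[j-1] : 1;   /* a 0 is treated as a 1 */
        ncumfq[j] = ncumfq[j-1] + f;
    }
    /* One extra step of size 1, used by the end-of-message symbol ich = nch. */
    ncumfq[nch+2] = ncumfq[nch+1] + 1;
    return ncumfq[nch+2];
}
```

For counts 5, 0, 2 the table comes out as 0, 5, 6, 8, 9: the zero count is widened to 1, and the last step of 1 is reserved for end of message. Symbol ich (numbered from 0) then owns the range from ncumfq[ich+1] up to ncumfq[ich+2].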
The structure acode must be defined and allocated in your main program with statements like this:
#include "nrutil.h"
#define MC 512          /* Maximum anticipated value of nchh in arcmak. */
#define NWK 20          /* Keep this value the same as in arcode, below. */
typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;
...
arithcode acode;
...
acode.ilob=(unsigned long *)lvector(1,NWK);        /* Allocate space within acode. */
acode.iupb=(unsigned long *)lvector(1,NWK);
acode.ncumfq=(unsigned long *)lvector(1,MC+2);
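If nrutil.h is not at hand, the same unit-offset arrays can be obtained from the standard library by allocating one extra element and leaving index 0 unused. The following is a minimal sketch of that idea; the helper names arithcode_alloc and arithcode_free are hypothetical and are not part of the original routines.

```c
#include <stdlib.h>

#define MC 512          /* Maximum anticipated value of nchh in arcmak. */
#define NWK 20          /* Must match the value used in arcode. */

typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;

/* Hypothetical helper: allocate the three work arrays so that indices
   1..NWK (ilob, iupb) and 1..MC+2 (ncumfq) are valid; index 0 is unused.
   Returns 0 on success, -1 if any allocation fails. */
int arithcode_alloc(arithcode *a)
{
    a->ilob   = calloc(NWK + 1, sizeof *a->ilob);
    a->iupb   = calloc(NWK + 1, sizeof *a->iupb);
    a->ncumfq = calloc(MC + 3, sizeof *a->ncumfq);
    if (!a->ilob || !a->iupb || !a->ncumfq) {
        free(a->ilob);
        free(a->iupb);
        free(a->ncumfq);
        a->ilob = a->iupb = a->ncumfq = NULL;
        return -1;
    }
    return 0;
}

/* Release the arrays obtained from arithcode_alloc. */
void arithcode_free(arithcode *a)
{
    free(a->ilob);
    free(a->iupb);
    free(a->ncumfq);
    a->ilob = a->iupb = a->ncumfq = NULL;
}
```

Arrays obtained this way should be released with arithcode_free rather than with the nrutil routine free_lvector, since the two allocation schemes offset their pointers differently.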
Individual characters in a message are coded or decoded by the routine arcode, which in turn uses the utility arcsum.
#include <stdio.h>
#include <stdlib.h>
#define NWK 20
#define JTRY(j,k,m) ((long)((((double)(k))*((double)(j)))/((double)(m))))
/* This macro is used to calculate (k*j)/m without overflow. Program efficiency can be
   improved by substituting an assembly language routine that does integer multiply to
   a double register. */

typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;

void arcode(unsigned long *ich, unsigned char **codep, unsigned long *lcode,
    unsigned long *lcd, int isign, arithcode *acode)
/* Compress (isign = 1) or decompress (isign = -1) the single character ich into or out
   of the character array *codep[1..lcode], starting with byte *codep[lcd] and (if
   necessary) incrementing lcd so that, on return, lcd points to the first unused byte
   in *codep. Note that the structure acode contains both information on the code, and
   also state information on the particular output being written into the array *codep.
   An initializing call with isign=0 is required before beginning any *codep array,
   whether for encoding or decoding. This is in addition to the initializing call to
   arcmak that is required to initialize the code itself. A call with ich=nch (as set
   in arcmak) has the reserved meaning "end of message." */
{
    void arcsum(unsigned long iin[], unsigned long iout[], unsigned long ja,
        int nwk, unsigned long nrad, unsigned long nc);
    void nrerror(char error_text[]);
    int j,k;
    unsigned long ihi,ja,jh,jl,m;

    if (!isign) {           /* Initialize enough digits of the upper and lower bounds. */
        acode->jdif=acode->nrad-1;
        for (j=NWK;j>=1;j--) {
            acode->iupb[j]=acode->nrad-1;
            acode->ilob[j]=0;
            acode->nc=j;
            if (acode->jdif > acode->minint) return;    /* Initialization complete. */
            acode->jdif=(acode->jdif+1)*acode->nrad-1;
        }
        nrerror("NWK too small in arcode.");
    } else {
        if (isign > 0) {    /* If encoding, check for valid input character. */
            if (*ich > acode->nch) nrerror("bad ich in arcode.");
        }
        else {              /* If decoding, locate the character ich by bisection. */
            ja=(*codep)[*lcd]-acode->ilob[acode->nc];
            for (j=acode->nc+1;j<=NWK;j++) {
                ja *= acode->nrad;
                ja += ((*codep)[*lcd+j-acode->nc]-acode->ilob[j]);
            }
            ihi=acode->nch+1;
            *ich=0;
            while (ihi-(*ich) > 1) {
                m=(*ich+ihi)>>1;
                if (ja >= JTRY(acode->jdif,acode->ncumfq[m+1],acode->ncum))
                    *ich=m;
                else ihi=m;
            }
            if (*ich == acode->nch) return;     /* Detected end of message. */
        }
        /* Following code is common for encoding and decoding. Convert character ich
           to a new subrange [ilob,iupb). */
        jh=JTRY(acode->jdif,acode->ncumfq[*ich+2],acode->ncum);
        jl=JTRY(acode->jdif,acode->ncumfq[*ich+1],acode->ncum);
        acode->jdif=jh-jl;
        arcsum(acode->ilob,acode->iupb,jh,NWK,acode->nrad,acode->nc);
        arcsum(acode->ilob,acode->ilob,jl,NWK,acode->nrad,acode->nc);
        /* How many leading digits to output (if encoding) or skip over? */
        for (j=acode->nc;j<=NWK;j++) {
            if (*ich != acode->nch && acode->iupb[j] != acode->ilob[j]) break;
            if (*lcd > *lcode) {
                fprintf(stderr,"Reached the end of the 'code' array.\n");
                fprintf(stderr,"Attempting to expand its size.\n");
                *lcode += *lcode/2;
                if ((*codep=(unsigned char *)realloc(*codep,
                    (unsigned)(*lcode*sizeof(unsigned char)))) == NULL) {
                    nrerror("Size expansion failed");
                }
            }
            if (isign > 0) (*codep)[*lcd]=(unsigned char)acode->ilob[j];
            ++(*lcd);
        }
        if (j > NWK) return;    /* Ran out of message. Did someone forget to encode a
                                   terminating nch? */
        acode->nc=j;
        for(j=0;acode->jdif<acode->minint;j++)      /* How many digits to shift? */
            acode->jdif *= acode->nrad;
        if (acode->nc-j < 1) nrerror("NWK too small in arcode.");
        if (j) {                /* Shift them. */
            for (k=acode->nc;k<=NWK;k++) {
                acode->iupb[k-j]=acode->iupb[k];
                acode->ilob[k-j]=acode->ilob[k];
            }
        }
        acode->nc -= j;
        for (k=NWK-j+1;k<=NWK;k++) acode->iupb[k]=acode->ilob[k]=0;
    }
    return;                     /* Normal return. */
}
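The comment on the JTRY macro suggests replacing the floating calculation with an integer multiply into a double-width register. On compilers that provide a 64-bit integer type, something similar can be written portably. The following is a sketch under stated assumptions, not a drop-in replacement: the names jtry_exact and jtry_rearranged are hypothetical, all three operands are assumed to fit in 32 bits (where unsigned long is 64 bits wide, a still wider intermediate would be needed), and k is assumed not to exceed m, as holds when k is a cumulative frequency and m is the total. The second function shows why the rearrangement j/(m/k) is not an acceptable shortcut.

```c
#include <stdint.h>

/* Hypothetical exact counterpart of JTRY(j,k,m): returns floor(k*j/m).
   Assumes j, k, m each fit in 32 bits, m != 0, and k <= m, so that the
   quotient also fits in 32 bits. */
uint32_t jtry_exact(uint32_t j, uint32_t k, uint32_t m)
{
    uint64_t product = (uint64_t)k * (uint64_t)j;   /* always below 2^64 */
    return (uint32_t)(product / m);
}

/* The tempting rearrangement j/(m/k). It avoids the wide product, but the
   inner division truncates first, so the result generally differs.
   Assumes 0 < k <= m. */
uint32_t jtry_rearranged(uint32_t j, uint32_t k, uint32_t m)
{
    return j / (m / k);
}
```

For j=100, k=3, m=7 the exact form gives 42, while the rearranged form gives 50. Because an exact integer routine can round differently from the double-precision macro in rare cases, encoder and decoder would both have to use the same routine.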
void arcsum(unsigned long iin[], unsigned long iout[], unsigned long ja,
    int nwk, unsigned long nrad, unsigned long nc)
/* Used by arcode. Add the integer ja to the radix nrad multiple-precision integer
   iin[nc..nwk]. Return the result in iout[nc..nwk]. */
{
    int j,karry=0;
    unsigned long jtmp;

    for (j=nwk;j>nc;j--) {
        jtmp=ja;
        ja /= nrad;
        iout[j]=iin[j]+(jtmp-ja*nrad)+karry;
        if (iout[j] >= nrad) {
            iout[j] -= nrad;
            karry=1;
        } else karry=0;
    }
    iout[nc]=iin[nc]+ja+karry;
}
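The digit-by-digit addition performed by arcsum is easiest to follow in radix 10. The sketch below restates the same carry-propagating loop under a different, hypothetical name (radix_add) so that it can be compiled on its own; it keeps the unit-offset convention, with the most significant digit at index nc and the least significant at index nwk, and it may be called with the same array as input and output.

```c
/* Hypothetical restatement of the addition in arcsum: add the integer ja to
   the radix-nrad number held in iin[nc..nwk], writing the sum to iout[nc..nwk].
   As in arcsum, the leading digit iout[nc] is not reduced modulo nrad. */
void radix_add(const unsigned long iin[], unsigned long iout[], unsigned long ja,
               int nwk, unsigned long nrad, unsigned long nc)
{
    int j;
    unsigned long carry = 0;

    for (j = nwk; j > (int)nc; j--) {
        unsigned long digit = ja % nrad;   /* next low-order digit of ja */
        unsigned long sum;
        ja /= nrad;
        sum = iin[j] + digit + carry;
        if (sum >= nrad) {
            sum -= nrad;
            carry = 1;
        } else {
            carry = 0;
        }
        iout[j] = sum;
    }
    iout[nc] = iin[nc] + ja + carry;       /* whatever remains goes in the top digit */
}
```

For example, with nrad=10, nc=1, nwk=4, and digits 0, 1, 2, 3 (the number 0123), adding ja=989 yields digits 1, 1, 1, 2, that is, 1112.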
If radix-changing, rather than compression, is your primary aim (for example, to convert an arbitrary file into printable characters), then you are of course free to set all the components of nfreq equal, say, to 1.
CITED REFERENCES AND FURTHER READING:
Bell, T.C., Cleary, J.G., and Witten, I.H. 1990, Text Compression (Englewood Cliffs, NJ: Prentice-Hall).
Nelson, M. 1991, The Data Compression Book (Redwood City, CA: M&T Books).
Witten, I.H., Neal, R.M., and Cleary, J.G. 1987, Communications of the ACM, vol. 30, pp. 520–540. [1]
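$$
\pi_{i+1} = \pi_i \, \frac{X_{i+1} + 1}{Y_i + 1}, \qquad
Y_{i+1} = \frac{Y_i \sqrt{X_{i+1}} + 1/\sqrt{X_{i+1}}}{Y_i + 1}
\tag{20.6.2}
$$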
The value $\pi$ emerges as the limit $\pi_\infty$. Now, to the question of how to do arithmetic to arbitrary precision: In a high-level language like C, a natural choice is to work in radix (base) 256, so that character arrays can be directly interpreted as strings of digits. At the very end of our calculation, we will want to convert our answer to radix 10, but that is essentially a frill for the benefit of human ears, accustomed to the familiar chant, three point