Lecture 10: Audio/Video Compression
Brown
Lecture Topics
• Basic Compression
• Audio and hearing
• Audio compression
• Video compression
– Transform coding (DCT)
• Codecs
Outcome: After this lecture you should have a much better idea of audio and
video representation, and how each is compressed. We will also talk about
a few standards.
Basic Compression
Data Compression
• Two categories
– Information Preserving
• Error free compression
• Original data can be recovered completely
– Lossy
• Original data is approximated
• Less than perfect
• Generally allows much higher compression
Basics
• Data Compression
– Process of reducing the amount of data
required to represent a given quantity of
information
• Data vs. Information
– Data and Information are not the same thing
– Data
• the means by which information is conveyed
• various amounts of data can convey the same
information
– Information
• “A signal that contains no uncertainty”
Redundancy
• Redundancy
– “data” that provides no relevant information
– “data” that restates what is already known
• For example
– Consider that N1 and N2 denote the
number of “data units” in two sets that
represent the same information
– where Cr is the “Compression Ratio”
• Cr = N1 / N2
EXAMPLE
N1 = 10 and N2 = 1 data units can encode the same information
Compression ratio:
Cr = N1/N2 = 10 (or 10:1)
implying 90% of the data in N1 is redundant
Variable Length Coding (1)
Our binary system (called natural binary) is not always that
good at representing data from a compression point of view
Consider the following string of data:
abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa
There are 4 different pieces of information (let’s say 4 symbols)
a, b, c, d
In natural binary we would need at least 2 bits to represent this,
assigning bits as follows:
a=00, b=01, c=10, d=11
Variable Length Coding (2)
Now, consider the occurrence of each symbol: a,b,c,d
abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa
a = 21/35 (60%)
b = 8/35 (23%)
c = 5/35 (14%)
d = 1/35 (3%)
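As a side note (not on the slide), the entropy of this symbol distribution gives a lower bound on the average bits per symbol any code can achieve; a small sketch:

```python
# Entropy of the symbol counts from the example string (a:21, b:8, c:5, d:1).
import math

counts = {"a": 21, "b": 8, "c": 5, "d": 1}
total = sum(counts.values())  # 35 symbols

# H = -sum(p * log2(p)): the theoretical lower bound on bits per symbol.
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
print(f"{entropy:.2f} bits/symbol")  # ~1.48, versus 2 bits in natural binary
```

So a variable-length code has room to beat the 2-bit fixed code here.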
Variable Length Coding (3)
Idea of variable length coding: assign fewer bits to encode the
more frequent symbols, and more bits for the less frequent
symbols
Huffman encoding
• This is an example of error-free coding: the information is
exactly the same, but the data is different
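A minimal Huffman-code construction for the example string's symbol counts (a:21, b:8, c:5, d:1), using a heap; this is a sketch of the classic algorithm, not a full codec:

```python
# Build Huffman codes by repeatedly merging the two least-frequent trees.
import heapq

def huffman_codes(freqs):
    # Heap entries are (frequency, tiebreaker, tree); a tree is either a
    # symbol or a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"a": 21, "b": 8, "c": 5, "d": 1})
total_bits = sum(f * len(codes[s]) for s, f in
                 {"a": 21, "b": 8, "c": 5, "d": 1}.items())
print(codes, total_bits)  # 55 bits versus 35 * 2 = 70 in natural binary
```

The frequent symbol `a` gets a 1-bit code, the rare `d` a 3-bit code, and the whole string shrinks from 70 to 55 bits with no information lost.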
Quantization (Lossy)
Another thing we can do is actually quantize the data such that it
cannot be recovered completely
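A toy illustration (the sample values are made up): quantizing 16-bit samples down to 4 bits maps many inputs to one level, so the originals cannot be recovered.

```python
# Quantize 16-bit samples to 16 levels (4 bits); this is lossy.
samples = [1000, 1007, -3120, 24000]

step = 2 ** 12  # 16-bit range split into 2^4 = 16 levels
quantized = [s // step for s in samples]              # many-to-one mapping
restored = [q * step + step // 2 for q in quantized]  # best reconstruction

print(quantized)  # 1000 and 1007 collapse to the same level
print(restored)   # approximations, not the original samples
```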
Lossy vs. Lossless
• For things like text documents and computer data files, lossy
compression doesn’t make sense
– An approximation of the original is no good!
• But for audio and visual data, small errors are not easily
detectable by our senses
– An approximation is acceptable
Audio
Sound waves are important because they are what we
perceive (hear).
Sound Representation
• We can consider sound as amplitudes
over time
Sound Waves
• We can decompose a sound wave into frequencies and
their amplitudes
[Figure: an original waveform, its decomposition into 10 component
frequencies, and the result of adding the frequencies back up]
Human Response
[Figure: human hearing response across frequency, from the
“cannot hear” threshold up to the pain threshold]
Sound pressure level in decibels: 20 log(P2/R) = X dB
Examples in dB
Example:
A sound pressure, P2, at 86 dB means:
20 log(P2/R) = 86 dB // R is the reference pressure, 20 µPa; the factor 20 is 10·log of the squared (power) ratio
log(P2/R) = 4.3 // divide by 20
P2/R = 10^4.3 // raise 10 to both sides to remove the log
This means that P2 is 10^4.3 (~20,000) times stronger than the reference sound
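Checking the slide's arithmetic in one line:

```python
# A sound at 86 dB SPL is 10^(86/20) times the reference pressure.
ratio = 10 ** (86 / 20)
print(round(ratio))  # 19953, i.e. roughly 20,000x the reference
```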
dB at Different Frequencies
If we change the reference level for each frequency (because our response differs by frequency),
we get dBs defined differently for each frequency. These are the equal-loudness dB curves based on
human hearing experiments from the 1930s, called the Fletcher-Munson curves.
Sound Examples
Equalization
Not surprising that when we manipulate audio, we do it at varying frequencies.
We rarely manipulate the compression wave directly.
Digital Representation
• Wave-form coding
• Just sample the waveform in time
– We call this pulse-code modulation (PCM)
• Two values to control
• Number of samples to capture per second
• Quantization (number of bits used to represent each sample)
[Figure: a sound waveform, amplitude sampled over time]
Base-Line High-Quality Sampling
• Humans can hear frequencies from about 16Hz to
20,000Hz
– That is, 16 oscillations per second (low pitch: bass)
– Up to 20,000 oscillations per second (very high pitch: high treble)
• To capture these frequencies we need at least twice that
number of samples per second
– This comes from sampling theory; the minimum is known as the Nyquist rate
– So, 44.1 kHz is considered the baseline sampling rate for high-quality audio
• 16 bits per sample allows -32768 to +32767 discrete
quantities to be captured
– This seems to be sufficient to quantize the amplitude,
although you can read online that people will debate this
http://www.cs.columbia.edu/~hgs/audio/44.1.html
Uncompressed Audio
• CD audio uses:
– 44,100 samples per second
– 16 bits per sample
– 2 channels (stereo)
= 1,411,200 bit/s = 1,411.2 kbit/s
(around 10 MB per minute of recording)
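Reproducing the CD-audio numbers on the slide:

```python
# Uncompressed CD audio: samples/s * bits/sample * channels.
bitrate = 44_100 * 16 * 2
print(bitrate)  # 1411200 bit/s

bytes_per_minute = bitrate // 8 * 60
print(bytes_per_minute / 1e6)  # ~10.6 MB per minute of stereo audio
```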
Cheap Compression (Lossy)
• Fewer samples per second
• Fewer bits per sample
– e.g., 8-bit or 4-bit quantization
• The resolution of the amplitudes will be coarser
• Maybe OK for some sounds?
Some Tricks for Compression
• aLaw and µLaw encoding
– Non-linear amplitude encoding
• Assign more bits to low amplitudes, fewer bits to high
amplitudes
– The average is still 8 bits per sample, but low amplitudes
are quantized with more bits
– Can call these “log compressors”
• 8 kHz temporal sampling with aLaw or µLaw bit
assignment is often used for telephony
– The claim is that it reduces perceivable quantization noise
– The quantization scheme is fixed (based on empirical studies)
• aLaw and µLaw are two different quantization standards built on the
same idea
http://en.wikipedia.org/wiki/Mu-law
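A sketch of µ-law companding using the standard formula with µ = 255 (continuous form; real codecs then quantize the companded value to 8 bits):

```python
# mu-law companding: expand small amplitudes before uniform quantization,
# so quiet sounds get more of the available levels.
import math

MU = 255.0

def mulaw_encode(x):
    # x in [-1, 1] -> companded value in [-1, 1]
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_decode(y):
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

quiet = 0.01
print(mulaw_encode(quiet))                # ~0.23: 1% amplitude uses ~23% of the range
print(mulaw_decode(mulaw_encode(quiet)))  # round-trips back to ~0.01
```

The log curve is why a quiet signal survives 8-bit quantization far better than under plain linear PCM.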
Brown 29
Adaptive Differential PCM (ADPCM)
• Like a log compressor
– But adds an additional trick
• Encode the difference between samples, not
just the samples themselves
– We call this delta (differential) pulse-code modulation
ADPCM claims to get almost 50% better
compression (with similar quality) than µLaw.
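A sketch of the "encode the difference" idea behind DPCM (real ADPCM additionally adapts the quantizer step size; the sample values here are made up):

```python
# Neighboring samples are similar, so the differences are small numbers
# that need fewer bits to encode than the raw samples.
samples = [13, 14, 14, 15, 13, 12, 12]

# Encode: first sample as-is, then successive differences.
deltas = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
print(deltas)  # [13, 1, 0, 1, -2, -1, 0]: mostly tiny values

# Decode: a running sum recovers the original exactly (lossless here).
restored = []
acc = 0
for d in deltas:
    acc += d
    restored.append(acc)
print(restored == samples)  # True
```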
Variations on ADPCM
• Encoding the difference between two samples is a
simple form of “predictive coding”: you try to
predict what the next value is going to be, and encode
only the difference
• There are better predictors (e.g., looking at a longer
history of previous samples instead of just one, as
ADPCM does)
• GSM 06.10 is a predictive-coder standard for voice
Any of these look familiar?
Exploiting Psychoauditory Properties
• For even better compression we can
exploit properties of the human auditory
system
• Loudness
– A quiet sound near a relatively loud sound
won’t be perceived
• Frequency
– A low sound near a relatively high sound
won’t be perceived
Temporal Masking
• Masking can affect our perception of sound up to 200
milliseconds after a “masking event”
(mask = loud noise or high frequency)
– Call this post-masking
• And even up to 20 milliseconds before a mask
– Call this pre-masking (amazing!)
– That’s right: a loud sound at time T will affect your
perception at time T minus 20 ms
Compressors can exploit this
MP3
• Actually part of the MPEG-1 standard for audio
• Input data is 44.1 kHz sampling at 16 bits
– (other sampling rates can be used, but 44.1 kHz is most common)
• Algorithm cuts the audio into “frames”
– Each lasts a fraction of a second
• Transforms the frame into 32 frequency bands
– Looks at these bands and determines how to distribute bits
– Bands with more activity need more bits
• Apply psychoacoustic masking: throw away bands based
on human perception (see previous slide)
– Where most of the compression comes in
– Lots of bands may become 0
• Then applies Huffman encoding to the output to squeeze
the bits a little further
http://www.mp3-converter.com/mp3codec/
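A toy, heavily simplified version of the band-masking step: zero out bands whose energy falls below a threshold set by their loudest neighbor. Real MP3 thresholds come from detailed psychoacoustic models, and the band energies and mask ratio here are invented for illustration.

```python
# Hypothetical energies for 8 frequency bands of one frame.
band_energy = [0.9, 0.02, 0.6, 0.01, 0.005, 0.3, 0.0, 0.0]

MASK_RATIO = 0.05  # assumed: a band under 5% of its loudest neighbor is masked

kept = []
for i, e in enumerate(band_energy):
    neighbors = band_energy[max(0, i - 1):i + 2]  # band and its neighbors
    kept.append(e if e >= MASK_RATIO * max(neighbors) else 0.0)

print(kept)  # masked bands become 0 and cost almost no bits to encode
```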
MP3 Bitrate
• In MP3, we can set the bitrate
• This is considered a constant bitrate encoding,
versus a variable bitrate
• The encoder changes the amount of
quantization performed based on specified
bitrate
• This can be complex: since Huffman encoding is
also used, the encoder may need to try several
different quantization levels and run each
through Huffman encoding to see the resulting size
– Padding with 0s is sometimes used to hit the target exactly
Other formats
• Advanced Audio Coding (AAC)
– Similar to MP3, but a more flexible encoding standard
– Supports a larger range of temporal sampling rates
– Part of the MPEG-2 and MPEG-4 standards
• Ogg Vorbis
– A free format that has become popular
– Again, similar to MP3 in its use of masking
Video
Images and Video
• Image: 2D array of pixels
• Video: a sequence of images over time
Video Cameras: Interlaced and Progressive Scan
• Interlaced
– Cameras sometimes capture 2 fields
(alternating scan lines) to make a frame
– So, we capture 60 fields per second
• Progressive scan
– Captures the entire frame at the same time
• 30 full frames per second
– No interlace effect
– But the motion may not look as smooth
Transform Coding
[Figure: an 8x8 image written as a sum of 64 single-pixel basis images]
In the pixel domain, you can consider each pixel intensity as the coefficient of a
basis image that occupies only that pixel (all other pixels are 0). For an 8x8 image, we
would have 64 such bases. Summing these bases gives you back the 8x8 image.
DCT Frequency Domain
In the frequency domain we can transform the 8x8 pixels into their corresponding frequency
representation (the Discrete Cosine Transform). This also requires 64 coefficients, each one
giving the amount of its basis to use. Sum up all the bases and you have the
original image.
Why Frequency Domain?
• It turns out we can remove many of the
frequencies and the viewer will not notice
– Call this psychovisual redundancy
• Humans are not very sensitive to high frequencies
• If we heavily quantize the high-frequency
coefficients and then transform back to the pixel
domain
– The pixel values are different, but visually we can’t tell
• This is lossy compression, and it provides the
most significant gain in image compression
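A sketch of this idea with NumPy: DCT an 8x8 block, zero the high-frequency coefficients, invert, and the pixels come back only slightly changed. The block values are a made-up smooth gradient; real JPEG quantizes each coefficient rather than zeroing outright.

```python
import numpy as np

N = 8
k = np.arange(N)
# Orthonormal DCT-II basis matrix: row u is the u-th cosine basis.
C = np.sqrt(2 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
C[0, :] = np.sqrt(1 / N)

# A smooth, hypothetical 8x8 block of pixel intensities.
block = np.fromfunction(lambda i, j: 100 + 4 * i + 2 * j + 0.5 * i * j, (N, N))

coeffs = C @ block @ C.T                     # forward 2D DCT
print(np.allclose(C.T @ coeffs @ C, block))  # True: invertible, lossless so far

# Throw away the high frequencies: zero every coefficient with u + v >= 8.
u, v = np.meshgrid(k, k, indexing="ij")
coeffs[u + v >= 8] = 0.0

approx = C.T @ coeffs @ C                    # inverse 2D DCT
print(np.abs(approx - block).max())          # tiny: visually indistinguishable
```

Nearly half the coefficients are gone, yet the reconstructed pixels barely differ, which is exactly where the compression gain comes from.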
JPEG Approach
Take the original image and break it into 8x8 blocks, then for each block:
1. Normalize: f(x,y) - 128 (values now range from -128 to 127)
2. Transform: compute the DCT coefficients T(u,v) and quantize them
3. Differentially code the DC component (difference from the previous block's DC)
4. Run-length encode (RLE) the AC coefficients as a zigzag vector
5. Huffman encode the result into the JPEG bitstream
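The zigzag-plus-run-length step for the AC coefficients can be sketched as follows (the coefficient values are invented; real JPEG further packs the pairs into Huffman symbols):

```python
import numpy as np

N = 8
coeffs = np.zeros((N, N))
# A typical quantized block: a few low-frequency values, the rest zero.
coeffs[0, 0], coeffs[0, 1], coeffs[1, 0], coeffs[2, 2] = 120, -30, 15, 4

# Zigzag order: diagonals of constant u+v, alternating direction.
order = sorted(((u, v) for u in range(N) for v in range(N)),
               key=lambda p: (p[0] + p[1],
                              p[0] if (p[0] + p[1]) % 2 else p[1]))
scan = [coeffs[u, v] for u, v in order]

# Run-length encode the AC tail as (run_of_zeros, value) pairs.
pairs, run = [], 0
for x in scan[1:]:       # skip scan[0]: the DC term is coded separately
    if x == 0:
        run += 1
    else:
        pairs.append((run, x))
        run = 0
pairs.append((0, "EOB"))  # end-of-block marker
print(pairs)
```

Because the zigzag visits low frequencies first, the zeros cluster at the end and collapse into a couple of pairs plus an end-of-block marker.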
GOP
• Group of Pictures
[Figure: example GOPs of size 12 and size 6]
• DVD (MPEG2)
– GOP size is fixed at 18 for NTSC, 15 for PAL
Motion Compensation
Break the image into 16x16 macroblocks. Each macroblock is
broken into four 8x8 blocks and encoded in JPEG style.
In a predictive frame, search a reference frame for a similar
block, and encode the motion vector plus the difference.
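A sketch of the motion search: find where a 16x16 block from the current frame best matches the reference frame by minimizing the sum of absolute differences (SAD) over a small search window. The frames are synthetic, with a known shift of (2, 3).

```python
import numpy as np

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, (64, 64)).astype(float)  # reference frame

# Current frame: the reference shifted down 2 rows and right 3 columns.
cur = np.roll(ref, (2, 3), axis=(0, 1))
block = cur[24:40, 24:40]  # the 16x16 block to predict, at (24, 24)

best, best_mv = float("inf"), None
for dy in range(-4, 5):
    for dx in range(-4, 5):
        cand = ref[24 + dy:40 + dy, 24 + dx:40 + dx]
        sad = np.abs(block - cand).sum()  # sum of absolute differences
        if sad < best:
            best, best_mv = sad, (dy, dx)

print(best_mv)  # (-2, -3): the block came from 2 rows up, 3 columns left
```

With the motion vector found, only the (here, zero) residual needs encoding, which is why P-frames are so much smaller than I-frames.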
I,P,B frames
• I frames are the largest in terms of bits
• P frames are significantly smaller
• B frames even smaller
Errors in MPEG
• If data becomes corrupted (e.g., during
transmission), the 8x8 and 16x16 blocks
used in the encoding scheme become
apparent
Errors in MPEG
• If motion vector data becomes corrupted,
we see very strange things
This is really bad, because if one frame gets corrupted, subsequent frames
use it for reconstruction (they predicted from the correct frame; now that
frame is bad, so their predictions are wrong too).
Video should become good again when an I-frame is encountered. This is one
reason for keeping GOPs short.
CODECs
• Codecs
– Means “coder/decoder”
– Often when we download a “codec” we really just need the decoder
• Codec refers to the algorithms used to encode and
decode the audio/video file
• Note that while MPEG is a standard, its exact
implementation is not specified
– Some encoders might do a better job
– Better motion compensation, better use of DCT quantization, etc.
– Almost all encoders are based on MPEG or MPEG-like encoding
• i.e., motion compensation with a GOP structure
• Also note that encoding and decoding are asymmetric
– Encoding takes much more computation than decoding
AVI, MOV, and DV
• AVI (Windows)
– Audio Video Interleave (AVI)
– A wrapper (container) format that can hold data
encoded by other codecs
• i.e., you can have AVIs using different codecs
• MOV (Apple)
– Similar: it is a wrapper, or container, format
– Many different codecs are supported
• DV (Digital Video) format
– Uses intraframe-only compression
• (similar to motion JPEG)
Interleaving Audio
• Technically, audio and video are separate
• You can keep the audio in a different file, or in a separate
logical region of the video file
• But this might not be good for access, especially on a
hard drive (you’d get disk thrashing)
• So we instead interleave the audio data with the video data
Constant Bitrate Video
• Some video encoders support constant
bitrate
– The video is compressed at several different
quality levels
– The highest quality that fits the desired
bitrate is chosen
– This means quality may change based on the
image content
• More motion and image content: lower quality
• Less motion and image content: higher quality
Temporal Downsampling
• Another way to get compression is to reduce the
frame rate
– That is, show fewer frames per second
• A rule of thumb is that humans feel comfortable at 24
frames per second
– Less than this can lead to temporal aliasing, where we
feel uncomfortable and motion looks unnatural
• For animation it can be as low as 4-8 frames
per second
– This is acceptable as long as the animation has
limited motion
– Once motion is too large, we see temporal aliasing
again
Spatial Downsampling
• Another easy way to gain compression is
spatial downsampling
– Make the image smaller
• This is acceptable for internet use
– Unacceptable for TV, film, broadcast
• Keep in mind that once you downsample
you can never get back the original
– Resizing back to the larger size does not
restore the lost details
– As a matter of fact, it often looks worse
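A small sketch of why spatial downsampling is irreversible, on a made-up 8x8 "image":

```python
import numpy as np

img = np.arange(64, dtype=float).reshape(8, 8)  # hypothetical image

# Downsample: average each 2x2 block -> 4x4.
small = img.reshape(4, 2, 4, 2).mean(axis=(1, 3))

# "Resize back": repeat each pixel 2x2 -> 8x8 again.
back = small.repeat(2, axis=0).repeat(2, axis=1)

print(np.array_equal(back, img))  # False: the original detail is gone
```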
Summary
• Compression is necessary to save space
– Even with today’s storage, uncompressed
data is too large
• Several types of compression
– Redundancy coding, RLE, transform coding,
psycho-coding
• Psycho-coding allows us to “throw away”
information (lossy coding)
• This gives us the greatest “gain”
• It is also often what “quality” refers to in
compression
• Higher quality means less is “thrown away”