Lecture10 AudioVideo Compression

This lecture covers audio and video compression. It discusses how audio can be represented digitally through pulse-code modulation by sampling the audio waveform over time. It also explains how lossy compression techniques like quantization can reduce audio file sizes by approximating the original signal. The lecture outlines concepts like the Fletcher-Munson curves that describe human auditory perception and how audio is typically equalized at different frequencies. After the lecture, students should understand audio and video representation and common compression standards.


CS3343: Digital Media Production

Audio and Video


Compression

Brown
Lecture Topics
• Basic Compression
• Audio and hearing
• Audio compression
• Video compression
– Transform coding (DCT)
• Codecs

Outcome: After this lecture you should have a much better idea of audio and
video representation and how it is compressed. We will also talk about
a few standards.
Basic Compression

Data Compression
• Two categories
– Information Preserving
• Error free compression
• Original data can be recovered completely

– Lossy
• Original data is approximated
• Less than perfect
• Generally allows much higher compression

Basics
• Data Compression
– Process of reducing the amount of data
required to represent a given quantity of
information
• Data vs. Information
– Data and Information are not the same thing
– Data
• the means by which information is conveyed
• various amounts of data can convey the same
information
– Information
• “A signal that contains no uncertainty”
Redundancy
• Redundancy
– “data” that provides no relevant information
– “data” that restates what is already known
• For example
– Consider that N1 and N2 denote the
number of “data units” in two sets that
represent the same information
– where Cr is the “Compression Ratio”
• Cr = N1 / N2

EXAMPLE
N1 = 10 & N2 = 1 data units can encode the same information
Compression Ratio:
Cr = N1/N2 = 10 (or 10:1)
Implying 90% of the data in N1 is redundant
Variable Length Coding (1)
Our binary system (called natural binary) is not always that
good at representing data from a compression point of view
Consider the following string of data:

abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa
There are 4 different pieces of information (let’s say 4 symbols)
a, b, c, d
In natural binary we would need at least 2 bits to represent this,
assigning bits as follows:
a=00, b=01, c=10, d=11

There are 35 pieces of data, that is 35 × 2 bits = 70 bits
Variable Length Coding (2)
Now, consider the occurrence of each symbol: a,b,c,d

abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa

a = 21/35 (60%)
b = 8/35 (23%)
c = 5/35 (14%)
d = 1/35 (3%)
Variable Length Coding (3)
Idea of variable length coding: assign fewer bits to the more
frequent symbols, and more bits to the less frequent symbols

Bit assignment using VLC:

a=1, b=01, c=001, d=0001

Now, compute the number of bits used to encode the same data:

21*(1) + 8*(2) + 5*(3) + 1*(4) = 56 bits

So, we have a compression ratio of 70:56 or 1.25,
meaning 20% of the data using natural binary encoding
is redundant
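These counts can be checked with a short script (a sketch for illustration; the symbol counts and code lengths are taken directly from the example above):

```python
# Count symbol frequencies in the example string and compare the cost of
# fixed-length (2-bit) encoding against the variable-length assignment
# a=1, b=01, c=001, d=0001.
from collections import Counter

data = "abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa"
counts = Counter(data)            # {'a': 21, 'b': 8, 'c': 5, 'd': 1}

fixed_bits = len(data) * 2        # 35 symbols * 2 bits = 70 bits
code_len = {"a": 1, "b": 2, "c": 3, "d": 4}
vlc_bits = sum(counts[s] * code_len[s] for s in counts)

print(fixed_bits, vlc_bits)             # 70 56
print(round(fixed_bits / vlc_bits, 2))  # 1.25
```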
Huffman encoding
• This is an example of error-free coding: the information is
completely the same, the data is different

• This idea of variable length coding is used in many places
– Area codes for cities in large countries (China/India)

• Prof. David A. Huffman developed an algorithm to take a data set
and compute its “optimal” encoding (bit assignment)
– He developed this as a student at MIT

• This is very commonly applied in many compression techniques as a
final stage

• There are other VLC techniques, such as Arithmetic coding and
LZW coding (ZIP)
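Huffman's algorithm is short enough to sketch. This is a minimal heap-based illustration (not a production coder) applied to the string from the previous slides:

```python
# Build Huffman codes by repeatedly merging the two lowest-frequency
# nodes. Each heap entry is [frequency, tie-breaker, {symbol: code}].
import heapq
from collections import Counter

def huffman_codes(data):
    """Return a {symbol: bitstring} prefix-free code table."""
    heap = [[freq, i, {sym: ""}]
            for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)    # least frequent subtree
        f2, i2, t2 = heapq.heappop(heap)   # next least frequent
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, [f1 + f2, i2, merged])
    return heap[0][2]

codes = huffman_codes("abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa")
# 'a' (most frequent) gets a 1-bit code; 'd' (rarest) gets the longest
print(sorted(len(codes[s]) for s in "abcd"))   # [1, 2, 3, 3]
```

Note the total here comes out to 21·1 + 8·2 + 5·3 + 1·3 = 55 bits, one bit better than the `a=1, b=01, c=001, d=0001` assignment, which is why Huffman's result is called optimal.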


Run-length-encoding (RLE)
• Consider the following
aaaaaaaabbbaaaaaaaabbbba

We could instead represent this by “runs”

8a,3b,8a,4b,1a

RLE is often more compact, especially when the data contains
lots of runs of the same symbol(s)

RLE is also lossless, you can always reconstruct the
original

Fax machines transmit RLE-encoded linescans of
black-and-white documents
Quantization (Lossy)
Another thing we can do is actually quantize the data such that it
cannot be recovered completely

Consider the following string of numbers:

ORIGINAL = {10, 12, 15, 18, 20, 3, 5, 2, 13}
(all the numbers are unique so Huffman coding won’t help)

Integer divide this by 5 (i.e. quantize it)

QUANTIZED = {2, 2, 3, 3, 4, 0, 1, 0, 2}

The natural binary range is smaller, and we could also use Huffman
encoding to get a bit more compression.

Of course, the values are not the same; on “reconstruction” (multiply by
5) we get only an approximation of the original:

RECOVERED = {10, 10, 15, 15, 20, 0, 5, 0, 10}
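The quantize/reconstruct round trip above can be sketched directly (a minimal illustration using the slide's numbers and step size):

```python
# Uniform quantization: integer-divide by a step size, then reconstruct
# by multiplying back. The reconstruction error is bounded by the step.
def quantize(xs, step):
    return [x // step for x in xs]

def dequantize(qs, step):
    return [q * step for q in qs]

original = [10, 12, 15, 18, 20, 3, 5, 2, 13]
q = quantize(original, 5)
print(q)                    # [2, 2, 3, 3, 4, 0, 1, 0, 2]
recovered = dequantize(q, 5)
print(recovered)            # [10, 10, 15, 15, 20, 0, 5, 0, 10]
assert all(abs(a - b) < 5 for a, b in zip(original, recovered))
```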
Lossy vs. Lossless
• For things like text documents and computer data files, lossy
compression doesn’t make sense
– An approximation of the original is no good!

• But for data like audio or visual, small errors are not easily
detectable by our senses
– An approximation is acceptable

• This is one reason we can get significant compression of images
and audio, vs. other types of data
– Lossless 10:1 is typically possible
– Lossy 300:1 is possible with no significant perceptual loss

• With lossy we can even talk about “quality”
• The more like the original, the higher the quality
• The less like the original, the lower the quality
Audio

Audio

Sound is a compression wave made by moving particles in a medium.

Sound can be transmitted through gases, plasma, liquids, and even solids.
Sound waves are important, because
they are perceived/heard by us.

Sound Representation
• We can consider sound as amplitudes
over time

Sound Waves
• We can decompose a sound wave into frequencies and
their amplitude
[Figure: an original waveform decomposed into 10 frequencies; adding the frequencies back up reproduces the original signal]
Human Response
[Figure: human hearing range vs. frequency, between the pain threshold and the region we cannot hear]

Human response is not uniform. We have “better” hearing at different frequencies.
Measure of sound
• Our range is huge, from ~0.0002 to 130
– This is measured in terms of pressure, in micropascals (µPa)
– 20 µPa is generally considered the lower limit of our hearing

• We use a log scale (decibels = dB) to represent this range
– dB is not a unit; the unit is still µPa

• X dB is expressed as a ratio between a reference pressure (R)
and another pressure (P2):

20 log(P2/R) = X dB
Examples in dB
Example:
A sound pressure, P2, at 86 dB means:
20 log(P2/R) = 86 dB   // the factor 20 arises because power is proportional to pressure squared
log(P2/R) = 4.3        // divide by 20
P2/R = 10^4.3          // raise 10 to both sides to remove the log
This means that P2 is 10^4.3 (~20000) times stronger than the reference sound

A sound pressure, P2, at 10 dB means:
20 log(P2/R) = 10 dB
log(P2/R) = 0.5        // divide by 20
P2/R = 10^0.5          // raise 10 to both sides to remove the log
This means that P2 is 10^0.5 (~3.16) times stronger than the reference sound

A sound pressure at 0 dB means:
20 log(P2/R) = 0 dB
log(P2/R) = 0
P2/R = 1
This means P2 is the same as the reference. So, 0 dB is the starting point.
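These worked examples follow from inverting 20 log(P2/R) = X, and can be checked in a couple of lines:

```python
# X dB = 20*log10(P2/R), so the pressure ratio P2/R is 10**(X/20).
import math

def pressure_ratio(db):
    return 10 ** (db / 20)

print(round(pressure_ratio(86)))      # 19953 -- roughly 20000x the reference
print(round(pressure_ratio(10), 2))   # 3.16
print(pressure_ratio(0))              # 1.0 -- same as the reference
```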
dB in different Frequencies

If we change the reference level for each frequency (because our response is different),
we have dB defined differently for each frequency. These are the dB curves based on human hearing
experiments from the 1930s, called the Fletcher-Munson curves.
Sound Examples

Equalization
Not surprising that when we manipulate audio, we do it at varying frequencies.
We rarely manipulate the compression wave directly.

Digital Representation
• Wave-form coding
• Just sample the waveform in time
– We call this pulse-code modulation (PCM)
• Two values to control
• Number of samples to capture per second
• Quantization values (number of bits used to represent that sample)
[Figure: a sound waveform (amplitude over time), digitized at sample points uniform in time]
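The two control values above (samples per second and bits per sample) can be illustrated with a small sketch that samples a hypothetical 440 Hz tone; the tone and rates here are assumptions for illustration only:

```python
# PCM sketch: sample a sine wave at a chosen rate and quantize each
# sample to a signed 16-bit integer range.
import math

def pcm_sample(freq_hz, sample_rate, n_samples, bits=16):
    max_amp = 2 ** (bits - 1) - 1            # 32767 for 16-bit samples
    return [round(max_amp * math.sin(2 * math.pi * freq_hz * n / sample_rate))
            for n in range(n_samples)]

samples = pcm_sample(440, 44100, 44100)      # one second of A440 at CD rate
assert max(samples) <= 32767 and min(samples) >= -32768
```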
Base-Line High-Quality Sampling
• Human hearing can hear frequencies from about 16Hz to
20,000Hz
– That is 16 oscillations per second (low pitch - bass)
– To 20,000 oscillations per second (very high pitch – high treble)
• To capture these frequencies we need at least twice the
number of samples
– This comes from sampling theory, known as the Nyquist Rate
– So, 44.1 kHz is considered the base-line high-quality sampling rate
• 16 bits per sample allows -32768 to +32767 discrete
quantities to be captured
– This seems to be sufficient bits to quantize the amplitude,
although you can read online that people will debate this

http://www.cs.columbia.edu/~hgs/audio/44.1.html
Uncompressed Audio
• CD audio uses:
– 44,100 samples per second
– 16 bits per sample
– 2 channels (stereo)
= 1,411,200 bit/s = 1,411.2 kbit/s
(Around 10 MB for a minute of recording)

• This is often considered the “gold standard” for
audio encoding
– That is, we would say this is uncompressed
Cheap Compression (Lossy)
• Less sampling
– There will be frequencies you can’t reproduce in the sampled
signal
– Maybe OK for some sounds

• Fewer bits
– 8-bit, 4-bit quantization
• The resolution of the amplitudes will be coarser
• Maybe OK for some sounds?
Some Tricks for Compression
• aLaw and µLaw encoding
– Non-linear Amplitude Encoding
• Assign more bits to low amplitudes, fewer bits to higher
amplitudes
– The average is still 8 bits per sample, but quantization of low
amplitudes gets more levels
– Can call these “log compressors”
• 8 kHz temporal sampling with aLaw or µLaw bit
assignment is often used for telephony
– The claim is that it reduces perceivable noise in the quantization
– The quantization scheme is fixed (based on empirical studies)
• aLaw and µLaw are two different quantization standards, but the
same idea

http://en.wikipedia.org/wiki/Mu-law
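The continuous µ-law companding curve can be sketched as follows (an illustration of the idea only; real telephony codecs use a segmented 8-bit approximation of this curve):

```python
# mu-law companding: F(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu),
# with mu = 255. Low amplitudes are stretched across more of the
# output range, so they survive coarse quantization better.
import math

MU = 255.0

def mulaw_compress(x):           # x in [-1, 1]
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):             # exact inverse of the curve
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# A quiet sample uses far more of the compressed range than its amplitude:
print(round(mulaw_compress(0.01), 3))   # 0.228
assert abs(mulaw_expand(mulaw_compress(0.5)) - 0.5) < 1e-9
```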
Adaptive Differential PCM (ADPCM)
• Like a log compressor
– But with an additional trick
• Encode the difference between samples, not
just the samples themselves
– We call this delta-pulse-code modulation

Assume two samples have values 13 and 14:
PCM would record 13 and 14
Delta-PCM could record “1” (i.e. 14-13=1)

ADPCM claims to get almost 50% better
compression (with similar quality) than µLaw.
Variations on ADPCM
• Encoding the difference between two samples is a
simple form of “predictive coding”, where you try to
predict what the next value is going to be, and encode
only the difference
• There are better predictors (like looking at a longer
history of previous samples instead of just one – i.e.
ADPCM)
• GSM 06.10 is a predictive coder standard for voice

Note that delta coders and predictive coders do not have
to be lossy. That is, if the differences are encoded
exactly, no data is lost
Any of these look familiar?

Exploiting Psychoauditory Properties
• For even better compression we can
exploit properties of the human auditory
system
• Loudness
– A quiet sound around a relatively loud sound
won’t be perceived
• Frequency
– A low sound around a relatively high sound
won’t be perceived
Temporal Masking
• Masking can affect our perception of sound up to 200
milliseconds after a “masking event”
(mask = loud noise or high frequency)
– Call this Post-Masking
• And even up to 20 milliseconds before a mask
– Call this Pre-Masking (amazing!)
– That’s right, if there is a loud sound at time T, it will affect your
perception at time T minus 20 ms
Compressors can exploit this

Other bands are set to 0

MP3
• Actually part of the MPEG1 standard for audio
• Input data is at 44.1 kHz sampling and 16 bits
– (other sampling rates can be used, but 44.1 kHz is most common)
• Algorithm cuts the audio into “frames”
– Each lasts a fraction of a second
• Transforms the frame into 32 frequency bands
– Looks at these bands and determines how to distribute bits
– Bands with more activity need more bits
• Apply psychoaudio masking: throw away bands based
on human perception (see previous slide)
– This is where most of the compression comes in
– Lots of bands may become 0
• Then applies Huffman encoding to the output to squeeze
the bits a little further
http://www.mp3-converter.com/mp3codec/
MP3 Bitrate
• In MP3, we can set the bitrate
• This is considered a constant bitrate encoding,
versus a variable bitrate
• The encoder changes the amount of
quantization performed based on the specified
bitrate
• This can be complex: since Huffman encoding is
also used, the encoder may need to try several
different quantization levels and then run
through Huffman encoding to see the result
– Padding with 0s is sometimes used
Other formats
• Advanced Audio Coding (AAC)
– Similar to MP3, but has a more flexible encoding standard
– Supports a larger range of temporal sampling
– Part of the MPEG2 and MPEG4 standards
• Ogg Vorbis
– A free format that has become popular
– Again, similar to MP3 in its use of masking
Video

Images and Video
• Image: 2D array of pixels

• Video: Images/frames over time

Video cameras: interlacing and progressive scan
• Interlaced
– Cameras sometimes
capture 2 fields to make a frame
– So, we capture 60 fields per second

• Progressive Scan
– Captures the entire frame at the same time
• 30 full frames per second
– No interlace effect
– But the motion may not look as smooth
Transform Coding

[Figure: an 8x8 image expressed as a sum of 64 single-pixel basis images]

In the pixel domain, you can consider each pixel intensity as the coefficient of a
basis image that occupies only that pixel (all other pixels are 0). For an 8x8 image, we
would have 64 basis images. Summing these basis images gives you the 8x8 image.
DCT Frequency Domain

In the frequency domain we can transform the 8x8 pixels into their corresponding frequency
representation (Discrete Cosine Transform). This also requires 64 coefficients, each one
corresponding to the amount of each basis to use. Sum all the basis images up and you have the
original image.
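The 2D DCT-II itself can be sketched directly from its definition (a deliberately unoptimized illustration; real coders use fast factored transforms). A constant block should put all of its energy into the single DC coefficient:

```python
# Direct 8x8 DCT-II: out[u][v] = alpha(u)*alpha(v) *
#   sum over x,y of block[x][y]*cos((2x+1)u*pi/16)*cos((2y+1)v*pi/16)
import math

N = 8

def dct2(block):
    def alpha(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    for x in range(N) for y in range(N))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

flat = [[100] * N for _ in range(N)]       # a flat gray block
coeffs = dct2(flat)
print(round(coeffs[0][0]))                 # 800 -- all energy in the DC coefficient
assert all(abs(coeffs[u][v]) < 1e-6
           for u in range(N) for v in range(N) if (u, v) != (0, 0))
```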
Why Frequency Domain?
• It turns out we can remove many of the
frequencies and the viewer will not notice.
– Call this psychovisual redundancy
• Humans are not so sensitive to high frequencies
• If we heavily quantize the high-frequency
coefficients and then transform back to the pixel
domain
– Pixel values are different, but visually we can’t tell
• This is a lossy compression and provides the
most significant gain in image compression
JPEG Approach
JPEG pipeline:
1) Take the original image and break it into 8x8 blocks
2) Normalize: f(x,y) - 128 (values now range from -128 to 127)
3) Apply the DCT (Forward DCT) to each block
4) Quantize the DCT coefficients using a quantization “table” T(u,v):
   C’(u,v) = round(C(u,v)/T(u,v))
5) Differentially code the DC component (difference from the previous block)
6) Order the AC coefficients into a vector in “zig-zag” order and RLE-encode it
7) Huffman-encode the result into the JPEG bitstream

JPEG applies almost every compression trick known:
1) Transform coding, 2) psychovisual loss, 3) RLE, 4) Difference coding, and 5) Huffman.
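The quantization step C’(u,v) = round(C(u,v)/T(u,v)) can be sketched on a small corner of a block. The table values here are the upper-left of the standard JPEG luminance table; the coefficient values are made-up examples for illustration:

```python
# Quantize DCT coefficients with a table, then dequantize. Larger table
# entries (used for higher frequencies) discard more detail.
T = [[16, 11, 10], [12, 12, 14], [14, 13, 16]]          # T(u,v) excerpt
C = [[-415.4, -30.2, -61.2],                            # example C(u,v)
     [4.5, -21.9, -60.8],
     [-46.8, 7.4, 77.1]]

Cq = [[round(C[u][v] / T[u][v]) for v in range(3)] for u in range(3)]
print(Cq)     # [[-26, -3, -6], [0, -2, -4], [-3, 1, 5]]

# Dequantization multiplies back -- the rounding error is the loss:
Cr = [[Cq[u][v] * T[u][v] for v in range(3)] for u in range(3)]
```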
JPEG for Color Images

Y = 0.299 R + 0.587 G + 0.114 B
I = 0.596 R - 0.275 G - 0.321 B
Q = 0.212 R - 0.523 G + 0.311 B

Convert the image from RGB to YIQ, where Y = brightness and I, Q are chroma.
(Note: quantization of color info 2:1:1.)

The eye is less sensitive to chroma, so downsample the chroma channels by 2.
This downsampling by 2 is a lossy compression.
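The conversion matrix above translates directly into code (a sketch; channel values are assumed to be normalized to [0, 1]):

```python
# RGB-to-YIQ conversion: Y carries brightness, I and Q carry chroma
# (which JPEG-style coders then downsample).
def rgb_to_yiq(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.275 * g - 0.321 * b
    q = 0.212 * r - 0.523 * g + 0.311 * b
    return y, i, q

y, i, q = rgb_to_yiq(1.0, 1.0, 1.0)   # pure white
print(round(y, 3))                    # 1.0 -- full brightness
# i and q are both ~0: white has no chroma, as expected
```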
JPEG Quality
• The amount of quantization applied to the
DCT coefficients amounts to a “quality”
measurement

“Format Options” refer to different ways
to choose the quantization. Baseline
uses the original quantization tables specified
by JPEG. Optimize may generate tables
specific to your image.

• JPEG has a 100% quality setting – it means no
quantization
MPEG is based on JPEG
• JPEG is the foundation
– MPEG extends this to video
– Some encoders just use lots of JPEGs; we call this Motion JPEG
• MPEG incorporates motion compensation prediction to help exploit
similarity between frames
– This is done with 16x16 Macroblocks
• Two types of frames
– I frames = intra-coded (JPEG)
– P&B frames = inter-coded
• predicted frames (motion compensation + JPEG-like compression)
• P predicts itself from either I or P frames
• B predicts itself from I or P in either direction (front or back)

• Frames are organized into a series of I and P/B frames
– Call this the Group of Pictures (GOP)
– This is the basic building block
GOP
• Group of Pictures
[Figure: example GOPs of length 12 and length 6]

• DVD (MPEG2)
– GOP size is fixed at 18 for NTSC, 15 for PAL
Motion Compensation
• Break the image into 16x16 macroblocks
• In a predictive frame, search a reference frame for a similar block
• Encode only the difference and the motion vector
• The macroblock is broken into four 8x8 blocks and encoded in JPEG style

(Best case, you find exactly the same pixels, so the difference = 0)
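The search step can be sketched as exhaustive block matching (an illustration only: a 4x4 block on an 8x8 frame for brevity, where MPEG uses 16x16 macroblocks and smarter search patterns):

```python
# Block matching: find the offset in the reference frame whose block
# best matches the current block under sum-of-absolute-differences.
def sad(ref, cur, ox, oy, bs):
    return sum(abs(ref[oy + y][ox + x] - cur[y][x])
               for y in range(bs) for x in range(bs))

def best_match(ref, cur, bs):
    h, w = len(ref), len(ref[0])
    return min((sad(ref, cur, ox, oy, bs), (ox, oy))
               for oy in range(h - bs + 1)
               for ox in range(w - bs + 1))

# Reference frame: zeros with a bright 4x4 patch at offset (3, 2)
ref = [[0] * 8 for _ in range(8)]
for y in range(2, 6):
    for x in range(3, 7):
        ref[y][x] = 200
cur = [[200] * 4 for _ in range(4)]        # current block = that patch
cost, vector = best_match(ref, cur, 4)
print(cost, vector)                        # 0 (3, 2) -- exact match, difference is 0
```

Only the motion vector (3, 2) and the (here empty) difference block would need to be encoded.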
I,P,B frames
• I frames are the largest in terms of bits
• P frames are significantly smaller
• B frames even smaller

• Why don’t we just have a very large GOP?
– E.g. one I frame with 127 P/B frames

• Consider random access
– How can you jump to a frame in the middle of a
GOP?
• Consider if we encounter errors
– What happens at the decoder if there is an error?
MPEG Quality
• Very similar to JPEG quality
• You are setting the amount of quantization
in the DCT components

Quality control in the encoder

Errors in MPEG
• If data becomes corrupted (e.g. during
transmission), the 8x8 and 16x16 blocks
used in the encoding scheme become
apparent

Errors in MPEG
• If motion vector data becomes corrupted,
we see very strange things

This is really bad, because if one frame gets corrupted, subsequent frames are
using it for reconstruction (i.e. they predicted from the correct frame; now the
frame is bad, so their prediction is wrong).

Video should become good again when an I-frame is encountered. This is one reason
for short GOPs.
CODECs
• Codecs
– Means “Coder/Decoder”
– Often when we download one, it means a decoder
• Codec refers to the algorithms used to encode and
decode the audio/video file
• Note that while MPEG is a standard, its exact
implementation is not specified
– Some encoders might do a better job
– Better motion compensation, better use of DCT quantization, etc.
– Almost all encoders are based on MPEG or MPEG-like encoding
• I.e. motion compensation with a GOP structure
• Also, note that encoding and decoding are asymmetric
– Encoding takes much more computation than decoding
AVI, MOV, and DV
• AVI (Windows)
– Audio Video Interleave (AVI)
– Is a wrapper format that encapsulates other
formats
• I.e. you can have AVIs using different codecs
• MOV (Apple)
– Similar; it is a wrapper or container format
– Many different codecs are supported
• DV (Digital Video) format
– Uses intraframe-only compression
• (similar to Motion JPEG)
Interleaving Audio
• Technically audio and video are separate
• You can have the audio in a different file or logical region
of the video file
• But this might not be good for access, especially on a
hard drive (you’d have disk thrashing)
• We instead interleave the audio data with the video data
Constant Bitrate Video
• Some video encoders support a constant
bitrate
– The video is compressed at several different
qualities
– The highest quality closest to the desired
bitrate is chosen
– This means quality may change based on the
image content
• More motion and image content, lower quality
• Less motion and image content, higher quality
Temporal Downsampling
• Another way to get compression is to reduce the
framerate
– That is, show fewer frames per second
• A rule of thumb is that humans feel comfortable at 24
frames per second
– Less than this can lead to temporal aliasing, where we
feel uncomfortable and motion looks unnatural
• For animations it can be as low as 4-8 frames
per second
– This is acceptable as long as the animation has
limited motion
– Once motion is too large, we see temporal aliasing
again
Spatial Downsampling
• Another easy way to gain compression is
spatial down-sampling
– Make the image smaller
• This is acceptable for internet use
– Unacceptable for TV, film, broadcast
• Keep in mind that once you down-sample
you can never get back the original
– Resizing back to the larger size does not
restore the lost details
– As a matter of fact, it often looks worse
Summary
• Compression is necessary to save space
– Even with today’s storage, uncompressed
data is too large
• Several types of compression
– Redundancy coding, RLE, transform coding,
psycho-coding
• Psycho-coding allows us to “throw away”
information (lossy coding)
• This gives us the greatest “gain”
• It is also often what “quality” is associated with in
compression
• Higher quality means less is “thrown away”
