Multimedia Signals and Systems
PREFACE
Multimedia computing and communications have emerged as a major
research and development area. Multimedia computers in particular open a
wide range of possibilities by combining different types of digital media
such as text, graphics, audio and video. The emergence of the World Wide
Web, unthinkable even two decades ago, has also fuelled the growth of
multimedia computing.
There are several books on multimedia systems that can be divided into
two major categories. In the first category, the books are purely technical,
providing detailed theories of multimedia engineering, with an emphasis on
signal processing. These books are more suitable for graduate students and
researchers in the multimedia area. In the second category, there are several
books on multimedia, which are primarily about content creation and
management.
Because the number of multimedia users is increasing daily, there is a
strong need for books somewhere between these two extremes. People with
an engineering or even a non-engineering background are now familiar with
buzzwords such as JPEG, GIF, WAV, MP3, and MPEG files. These files
can be edited or manipulated with a wide variety of software tools.
However, the curious-minded may wonder how these files actually work to
ultimately provide us with impressive images or audio.
This book intends to fill this gap by explaining multimedia signal
processing at a less technical level. However, in order to understand the
digital signal processing techniques, readers must still be familiar with
discrete-time signals and systems, especially sampling theory, analog-to-
digital conversion, digital filter theory, and the Fourier transform.
The book has 15 Chapters, with Chapter 1 being the introductory chapter.
The remaining 14 chapters can be divided into three parts. The first part
consists of Chapters 2-4. These chapters focus on the multimedia signals,
namely audio and image, their acquisition techniques, and properties of
human auditory and visual systems. The second part consists of Chapters 5-
11. These chapters focus on the signal processing aspects, and are strongly
linked in order to introduce the signal processing techniques step-by-step.
The third part consists of Chapters 12-15, which present a few select
multimedia systems. These chapters can be read independently. The
objective of including this section is to introduce readers to the intricacies of
a few select, frequently used multimedia systems.
Chapter 1
Introduction

Figure 1.1. Multimedia applications.
Presentation: This refers to the tools and devices for the input and
output of information. Paper, screens, and speakers are output media,
while the keyboard, mouse, microphone, and camera are input media.
Storage: This refers to the data carrier that enables the storage of
information. Paper, microfilm, floppy disk, hard disk, CD, and DVD are
examples of storage media.
Transmission: This characterizes different information carriers that
enable continuous data transmission. Optical fibers, coaxial cable, and
free air space (for wireless transmission) are examples of transmission
media.
Discrete/Continuous: Media can be divided into two types: time-
independent or discrete media, and time-dependent or continuous media.
For time-independent media (such as text and graphics), data processing
is not time critical. In time-dependent media, data representation and
processing is time critical. Figure 1.3 shows a few popular examples of
discrete and continuous media data, and their typical applications. Note
that the multimedia signals are not limited to these traditional examples.
Other signals can also be considered multimedia data. For example,
the outputs of sensors such as smoke detectors, air pressure sensors, and
thermometers can be considered continuous media data.
Figure 1.3. Different types of multimedia and their typical applications.
Discrete media may be non-interactive (e.g., books, slideshows) or
interactive (e.g., net-talk, browsing), and continuous media may likewise be
non-interactive (e.g., DVD movies, TV/audio broadcasting) or interactive
(e.g., video conferencing, interactive television).
QUESTIONS
1. What is a multimedia system? List a few potential multimedia applications that are
likely to be introduced in the near future, and twenty years from now.
2. Do you think that integrating the minor senses with the existing multimedia system
will enhance its capability? What are the possible technical difficulties in the
integration process?
3. Classify the media with respect to the following criteria: i) perception, ii)
representation, and iii) presentation.
4. What are the properties of a multimedia system?
5. What is continuous media? What are the difficulties of incorporating continuous
media in a multimedia system?
6. List some typical applications that require high computational power.
7. Why is a real-time operating system important for designing an efficient multimedia
system?
8. Explain the impact of high-speed networks on multimedia applications.
9. Explain with a schematic the four main domains of a multimedia system.
Chapter 2
Audio Fundamentals
Sound Intensity
The sound intensity or amplitude of a sound corresponds to the loudness
with which it is heard by the human ear. For sound or audio recording and
reproduction, the sound intensity is expressed in two ways. First, it can be
expressed at the acoustic level, which is the intensity perceived by the ear.
Second, it can be expressed at an electrical level after the sound is converted
to an electrical signal. Both types of intensities are expressed in decibels
(dB), which is a relative measure.
The acoustic intensity of sound is generally measured in terms of the
sound pressure level:

Sound intensity (in dB) = 20 \log_{10}(P / P_{Ref})    (2.1)

where P is the sound pressure and P_{Ref} is a standard reference pressure.
At the electrical level, the intensity is often expressed in dBm; the suffix m
indicates that the power is measured with respect to 1 mW.
Envelope
An important characteristic of a sound is its envelope. When a sound is
generated, it does not last forever. The rise and fall of the intensity of the
sound (or a musical note) is known as the envelope. A typical envelope has
four sections: attack, decay, sustain and release, as shown in Fig. 2.1.
During the attack, the intensity of a note increases from silence to a high
level. The intensity then decays to a middle level where it is sustained for a
short period of time. The intensity drops from the sustain level to zero
during the release period.
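As an illustration, the following MATLAB sketch synthesizes a one-second note shaped by a simple ADSR envelope. The 440 Hz frequency, the section durations, and the 0.6 sustain level are illustrative assumptions, not values from the text.

% Apply a simple ADSR (attack-decay-sustain-release) envelope to a 440 Hz tone.
fs = 8000;                          % sampling frequency (assumed)
t = (0:fs-1)/fs;                    % one second of samples
tone = sin(2*pi*440*t);
env = [linspace(0,1,0.1*fs), ...    % attack: silence to peak (0.1 sec)
       linspace(1,0.6,0.1*fs), ...  % decay: peak to sustain level (0.1 sec)
       0.6*ones(1,0.6*fs), ...      % sustain (0.6 sec)
       linspace(0.6,0,0.2*fs)];     % release: sustain to silence (0.2 sec)
sound(tone.*env, fs);               % play the shaped note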
Figure 2.1. A typical envelope of a musical note, showing the attack, decay,
sustain, and release sections (time in seconds).
Figure 2.2. Anatomy of the human ear. The coiled cochlea and basilar
membrane are straightened for clarity of illustration.
The resonant behavior of the basilar membrane (in Fig. 2.2) is similar to
the behavior of a transform analyzer. According to the uncertainty principle
of transforms, there is a tradeoff between frequency resolution and time
resolution. The human auditory system has evolved a compromise that
provides fine frequency resolution at low frequencies and fine time
resolution at high frequencies.
The frequency resolution of the ear is not uniform throughout the entire
audio spectrum (20Hz-20KHz). The sensitivity is highest at the low
frequencies, and decreases at higher frequencies. At low frequencies, the ear
can distinguish tones a few Hz apart. However, at high frequencies, the
tones must differ by hundreds of Hz to be distinguished. Hence, the human
ear can be considered a spectrum analyzer with logarithmic bands.
Experimental results show that the audio spectrum can be divided into
several critical bands (as shown in Table 2.2). Here, a critical band refers to
a band of frequencies that are likely to be masked by a strong tone at the
center frequency of the band. The width of the critical bands is smaller at
lower frequencies. It is observed in Table 2.2 that the critical band centered
at 1 KHz is about 160 Hz wide. Thus, a noise or error signal that is 160 Hz
wide and centered at 1 KHz will be masked by a strong 1 KHz sine tone
unless its level exceeds the masking threshold set by that tone.
When the frequency sensitivity and the noise masking properties are
combined, we obtain the threshold of hearing as shown in Fig. 2.5. Any
audio signal whose amplitude is below the masking threshold is inaudible to
the human ear. For example, if a 1 KHz, 60 dB tone and a 1.1 KHz, 25 dB
tone are simultaneously present, we will not be able to hear the 1.1 KHz
tone; it will be masked by the 1 KHz tone.
Table 2.2. An example of critical bands in the human hearing range, showing the
increase in bandwidth with absolute frequency. A critical band arises around an
audible tone at any frequency; for example, a strong 220 Hz tone is likely to mask
the frequencies in the band 170-270 Hz.
Band (Bark)   Lower cut-off (Hz)   Upper cut-off (Hz)   Bandwidth (Hz)   Center frequency (Hz)
1 --- 100 --- 50
2 100 200 100 150
3 200 300 100 250
4 300 400 100 350
5 400 510 110 450
6 510 630 120 570
7 630 770 140 700
8 770 920 150 840
9 920 1080 160 1000
10 1080 1270 190 1170
11 1270 1480 210 1370
12 1480 1720 240 1600
13 1720 2000 280 1850
14 2000 2320 320 2150
15 2320 2700 380 2500
16 2700 3150 450 2900
17 3150 3700 550 3400
18 3700 4400 700 4000
19 4400 5300 900 4800
20 5300 6400 1100 5800
21 6400 7700 1300 7000
22 7700 9500 1800 8500
23 9500 12000 2500 10500
24 12000 15500 3500 13500
25 15500 22050 6550 18775
• Example 2.1
In this example, the psychoacoustic characteristics are demonstrated
using a few test audio signals. The following MATLAB code generates
three test audio files (in wav format), each with a duration of two seconds.
The first one second of audio consists of a pure sinusoidal tone with a
frequency of 2000 Hz. The next one second of audio contains a mixture of
2000 Hz and 2150 Hz tones. The two tones have similar energy in the test1
audio file. However, the 2000 Hz tone has 20 dB higher energy than the
2150 Hz tone in the test2 audio file. In the test3 audio file, the 2000 Hz tone
has 40 dB higher energy than the 2150 Hz tone. The power spectral density
of test3 (for duration 1-2 seconds) is shown in Fig. 2.6.
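A minimal sketch of such a generator is shown below. The 44.1 KHz sampling rate and the use of audiowrite (wavwrite in older MATLAB releases) are assumptions, not details taken from the original code.

% Generate the three 2-second test files used in this example.
fs = 44100;                            % sampling frequency (assumed)
t1 = (0:fs-1)/fs;                      % first second: pure 2000 Hz tone
t2 = (fs:2*fs-1)/fs;                   % second second: 2000 Hz + 2150 Hz mixture
gain = [1 0.1 0.01];                   % 0, -20, and -40 dB relative levels for the 2150 Hz tone
for k = 1:3
    x1 = 0.5*sin(2*pi*2000*t1);
    x2 = 0.5*sin(2*pi*2000*t2) + 0.5*gain(k)*sin(2*pi*2150*t2);
    audiowrite(sprintf('test%d.wav', k), [x1 x2], fs);
end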
It can easily be verified by playing the files that the transition from the
pure tone (first one second) to the mixture (the next one second) is very
sharp in the test1 audio file. In the second file (i.e., test2.wav), the transition
is barely identifiable. In the third file, the 2150 Hz signal is completely
inaudible. •
Figure 2.6. Power spectral density (in dB) of the test3 signal for the
duration 1-2 seconds (frequency in KHz).
Multichannel Audio
A brief introduction to the human auditory system was provided earlier in
this chapter. When sound is received by the two ears, the brain decodes the
two resulting signals and determines the direction of the sound. Historically,
sound reproduction has progressed from single-channel systems toward
multichannel systems; Table 2.3 summarizes the major milestones for home
and cinema applications.

Table 2.3. History of multichannel audio for home and cinema applications
Decades Major Milestones
1930s Experiment with three channel audio at Bell Laboratories
1950s 4-7 channels (cinema)
2 channel stereo audio (home)
1970s Stereo surround (cinema)
Four channel stereo audio (home)
Mono and stereo video cassettes (home)
1980s 3-4 channel surround audio (home)
2 channel digital CD audio (home)
1990s 5.1 channel surround audio (home)
It has been found that more realistic sound reproduction can be obtained
by having one or more reproduction channels that emit sound behind the
listener [4]. This is the principle of surround sound, which has been widely
used in movie theater presentations, and has recently become popular for
home theater systems. There are a variety of configurations for arranging the
speakers around the listener (see Table 2.4). The most popular configuration
for today's high-end home listening environment is the standard surround
system, which employs 5 channels with three speakers in the front and two
speakers at the rear (see Fig. 2.7). For cinema applications, however, more
rear speakers may be necessary depending on the size of the theater hall.
Table 2.4 shows that standard surround sound can be generated with 5 full
audio channels (with up to 20 KHz bandwidth). However, it has been
observed that adding a low-bandwidth (equivalent of 0.1) subwoofer channel
(termed LFE in Fig. 2.7) enhances the quality of the reproduction. These
systems are typically known as 5.1 channels - i.e., five full-bandwidth
channels and one low-bandwidth channel - and have become very popular
for high-end home audio systems.
Table 2.4. Configuration of speakers in surround sound systems. The code (p/q)
refers to the speaker configuration in which p speakers are in the front, and q
speakers are at the rear. "x" indicates the presence of a speaker in a given
configuration. F-L: front left, F-C: front center, F-R: front right, M-L: mid left, M-
R: mid right, R-L: rear left, R-C: rear center, R-R: rear right.
Name                Code   F-L  F-C  F-R  M-L  M-R  R-L  R-C  R-R
Mono                1/0          x
Stereo              2/0     x         x
3-ch. stereo        3/0     x    x    x
3-ch. surround      2/1     x         x                   x
Quadraphonic surr.  2/2     x         x              x         x
Standard surround   3/2     x    x    x              x         x
Enhanced surround   5/2     x    x    x    x    x    x         x
Multi-track Audio
Figure 2.8(a) shows a schematic of a simple stereo audio recording
appropriate for a musical performance. The sound waves are picked up by the
two microphones and converted to electrical signals. The resulting sound
quality is affected by the choice of microphones and their placement. The
two-track recorder records the two channels (left and right) separately into
two different tracks. For playback, the recorded signal is fed to a power
amplifier, whose output then drives the two speakers.
Figure 2.8(b) shows the audio recording of a more complex musical
performance with several microphones [5]. Here, each microphone is placed
close to each singer or instrument. To obtain a balanced musical recording,
all the microphones are plugged into a mixer that can control individually
the volume of signals coming from each microphone. The output of the
mixer can be recorded on a multi-track tape for future editing, but the sound
editing might require playback of the music several times for fine
adjustment of individual components. On completion of the editing process,
the audio signal can be recorded on a two-track stereo tape or one-track
mono tape (see Fig. 2.8(c)).
Figure 2.7. Speaker placement in 5.1-channel surround sound: (a) home
configuration, (b) cinema configuration. LFE denotes the low-frequency
effects (subwoofer) channel.
Figure 2.8. Live audio recording. (a) Two-track recorder, (b) four-track
recorder, (c) conversion of a four-track recorded signal to two-track.
Suppose, for example, that during editing it is found that the piano signal
was not blending well with the other components. With only one or two
recorded tracks, one might have to repeat the entire musical performance.
However, in multi-track audio, the track corresponding to the piano
component can simply be replaced by a new recording of just the piano
component.
When the output of the sound module is converted to analog form and
driven to the speakers, we hear the sound of the music. The sound module
can also be connected to a sequencer that records the MIDI signal, which
can be saved on a floppy disk, a CD, or a hard disk.
Figure 2.10. A simple MIDI system.
Figure 2.11 shows the bit-stream organization of a MIDI file. The file
starts with the header chunk, which is followed by different tracks. Each
track contains a track header and a track chunk. The format of the header
and track chunks is shown in Table 2.5. The header chunk contains a four-
byte chunk ID, which is always "MThd". This is followed by the chunk size,
format type, number of tracks, and time division. There are three types of
standard MIDI files:
• Type 0 - which combines all the tracks or staves into a single track.
• Type 1 - saves the files as separate tracks or staves for a complete
score with the tempo and time signature information only included
in the first track.
• Type 2 - saves the files as separate tracks or staves and also includes
the tempo and time signatures for each track.
The header chunk also contains the time-division, which defines the default
unit of delta-time for this MIDI file. The time-division is a 16-bit binary
value, which may be in either of two formats, depending on the value of the
most significant bit (MSB). If the MSB is 0, then bits 0-14 represent the
number of delta-time units in each quarter-note. However, if the MSB is 1,
then bits 0-7 represent the number of delta-time units per SMPTE frame,
and bits 8-14 form a negative number, representing the number of SMPTE
frames per second (see Table 2.6).
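As a sketch of this bit logic, the following hypothetical MATLAB helper (describe_time_division is not a standard function) interprets a 16-bit time-division value:

% Interpret the 16-bit time-division word of a MIDI header chunk.
function describe_time_division(td)
if bitand(td, 32768) == 0                 % MSB = 0: ticks per quarter-note
    fprintf('%d delta-time units per quarter-note\n', bitand(td, 32767));
else                                      % MSB = 1: SMPTE format
    fps = 256 - bitshift(td, -8);         % bits 8-14: negative (two's complement) SMPTE frames/sec
    ticks = bitand(td, 255);              % bits 0-7: delta-time units per frame
    fprintf('%d SMPTE frames/sec, %d units per frame\n', fps, ticks);
end
end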
Each track chunk contains a chunk ID (which is always "MTrk"), the chunk
size, and the track event data. The track event data contains a stream of MIDI
events that define information about the sequence and how it is played.
This is the actual music data that we hear. Musical control information such
as playing a note and adjusting a MIDI channel's modulation value are
26 Multimedia Signals and Systems
defined by MIDI channel events. There are three types of events: MIDI
Control Events, System Exclusive Events and Meta Events.
Figure 2.11. Bit-stream organization of a MIDI file. The header chunk is
followed by the tracks; each track consists of a track header and a track
chunk, and the track chunk carries the actual music data as a sequence of
status and data bytes.
The MIDI channel event format is shown in Table 2.7. It is observed that
each MIDI channel event consists of a variable-length delta-time and a 2-3
byte description that determines the MIDI channel it corresponds to, the
type of event it is, and one or two event-type-specific values. A few selected
MIDI Channel Events, and their numeric value and parameters are shown in
Table 2.8.
Table 2.7. MIDI channel event format
Delta Time: variable-length | Event Type Value: 4 bits | MIDI Channel: 4 bits | Parameter 1: 1 byte | Parameter 2: 1 byte
The Note On event also specifies a velocity, which in turn relates to the
volume at which the note is played. However, specifying a velocity of 0 for
a Note On event is the same as using the Note Off event. Most MIDI files
use this method as it maximizes the use of running status, where a command
byte can be omitted and the previous command is then assumed.
Note that when a device has received a Note Off message, the note may
not cease abruptly. Some sounds, such as organ and trumpet sounds, will do
so. Others, such as piano and guitar sounds, will decay (fade out) instead,
albeit more quickly after the Note Off message is received.
A large number of devices are employed in a professional recording
environment. Hence, MIDI protocol has been designed to enable computers,
synthesizers, keyboards, and other musical devices to communicate with
each other. In the protocol, each musical device is given a number. Table 2.9
lists the MIDI instruments and their corresponding numbers.
Table 2.9 shows the names of the instruments whose sound would be
heard when the corresponding number is selected on MIDI synthesizers.
These sounds are the same for all MIDI channels except channel 10, which
has only percussion sounds and some sound "effects."
On MIDI channel 10, each MIDI note number (e.g., "Key#") corresponds
to a different drum sound, as shown in Table 2.10. While many current
instruments also have additional sounds above or below the range shown
here, and may even have additional "kits" with variations of these sounds,
only these sounds are supported by General MIDI Level 1 devices.
• Example 2.2
In this example, we demonstrate the MIDI format using a small one-track
MIDI file. Consider a MIDI file (consisting of only 42 bytes) as follows.
4D 54 68 64 00 00 00 06 00 01 00 01 00 78 4D 54 72 6B 00 00 00 14
01 C3 02 01 93 43 64 78 4A 64 00 43 00 00 4A 00 00 FF 2F 00
The explanation of different bytes is shown in Fig. 2.12. The first 22 bytes
contain the header chunk and the track header, whereas the next 20 bytes
contain the music information such as the delta-time, the instrument and the
note's volume.
Figure 2.12. Explanation of the different bytes of the example MIDI file.
The header chunk ("MThd") specifies format type 1, one track, and a time
division of 120; the track header ("MTrk") specifies a chunk size of 20 bytes.
After the track header, the first byte represents the delta-time for the
following note. The next byte is 0xC3 (0x denotes the hexadecimal format),
which indicates a change of instrument (denoted by 0xC) on channel 4. The
next byte is 0x02, which corresponds to the Electric Grand Piano. Different
instruments can also be set to different channels (there are 16 channels: 0 to
F) and can be played simultaneously.
The MIDI file can be created using the following MATLAB code:
data = hex2dbytes('4D546864000000060001000100784D54726B0000001401C30201934364784A64004300004A0000FF2F00');
fid = fopen('F:\ex2_2.mid', 'wb');
fwrite(fid, data);
fclose('all');
Note that "hex2dbytes" is a MATLAB function (included in the CD) that
creates a stream of bytes from the hex data, which is then written as a MIDI
file. The ex2_2.mid file is included in the CD. The readers can verify that
the file creates a musical tone from the piano. •
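The hex2dbytes function from the CD is not reproduced in the text; a minimal sketch of what such a function might look like is given below. This is an assumed implementation, not the one supplied on the CD.

function bytes = hex2dbytes(hexstr)
% Convert a string of hexadecimal digits into a vector of byte values.
hexstr = hexstr(~isspace(hexstr));            % remove any embedded spaces
bytes = hex2dec(reshape(hexstr, 2, []).');    % two hex digits per byte
end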
Most conventional music can be represented very efficiently using the
MIDI standard. However, a MIDI file may produce different qualities of
sound depending on the playback hardware. For example, a part of the MIDI
sequence may correspond to a particular note on piano. If the quality of the
piano is poor, or the piano is improperly tuned, the sound produced from the
piano may not be of high quality. Another problem with the MIDI standard
is that spoken dialogs or songs are difficult to represent parametrically.
Hence, the human voice may need to be represented by sampled data in a
MIDI file. However, this will increase the size of the MIDI file.
REFERENCES
1. M. T. Smith, Audio Engineer's Reference Book, 2nd Edition, Focal Press, Oxford,
1999.
2. K. C. Pohlmann, Principles of Digital Audio, McGraw-Hill, New York 2000.
3. D. W. Robinson and R. S. Dadson, "A re-determination of the equal-loudness relations
for pure tones," British Journal of Applied Physics, Vol. 7, pp. 166-181, 1956.
4. D. R. Begault, 3-D Sound for Virtual Reality and Multimedia, Academic Press, 1994.
5. B. Bartlett and J. Bartlett, Practical Recording Techniques, 2nd Edition, Focal Press,
1998.
6. J. Rothstein, MIDI: A Comprehensive Introduction, 2nd Edition, Madison, Wis., 1995.
7. M. Joe, MIDI Specification 1.0, http://www.ibiblio.org/emusic-l/info-docs-FAQs/MIDI-doc/.
QUESTIONS
1. What are the characteristics of sound?
2. What is the audible frequency range? Does the audible frequency range of a person
remain the same throughout life?
3. How is the loudness of sound measured?
4. What is the envelope of a sound? Explain the differences between the envelopes of a
few musical instruments.
5. Explain the anatomy of the human ear. How does it convert the acoustic energy to
electrical energy?
6. Draw the block schematic of a typical audio (stereo) recording system. Explain
briefly the function of each block.
7. Explain the functionality of the human ear.
8. What are the critical bands? Does the width of a critical band depend on its center
frequency?
9. What is the difference between echo and reverberation?
10. What is multi-channel audio? How does it improve sound quality?
11. Explain how surround sound improves the sound perception compared to the
two-channel audio.
12. What is multi-track recording? How does it help in sound recording and editing?
13. What is MIDI? What are the advantages of the MIDI representation over digital
audio?
14. How many different types of instruments are included in the General MIDI
standard?
15. Create a single-track MIDI file that plays one note of an electric guitar. Change the
volume from a moderate to loud level.
16. Why do we need a percussion key map?
Chapter 3
Human Visual Systems and Perceptions
3.1 INTRODUCTION
Electromagnetic (EM) waves can have widely different wavelengths,
ranging from long wavelengths (corresponding to the dc/power band) to
extremely short wavelengths (gamma rays). However, the eye can detect
only waves in a narrow spectrum, extending from approximately 400 to 700
nanometers. Sunlight has a spectrum in this wavelength range, and is the
major source of the Earth's EM waves. Indeed, it is believed that the
evolution of the HVS was influenced by the presence of sunlight.
Fig. 3.1 shows a typical object detection process. If L(\lambda) is the incident
energy distribution coming from a light source, the light received by an
observer from an object can be expressed as

I(\lambda) = \rho(\lambda) L(\lambda)    (3.1)

where \rho(\lambda) is the reflectivity of the object.
The human visual system can be considered an optical system (see Fig.
3.2). When light from an object falls on the eye, the pupil of the eye acts as
an aperture, an image is created on the retina, and the viewer sees the
object. The perceived size of an object depends on the angle it creates on the
retina. The retina can resolve detail better when the retinal image is larger
and spread over more of its photoreceptors. The pupil controls the amount of
light entering the eye. For typical illumination, the pupil is about 2 mm in
diameter. For low illumination, the size increases to allow more light,
whereas for high illumination the size decreases to limit the amount of light.
Figure 3.2. Image formation at the human eye. The angle
\theta = 2 \arctan(x_1 / 2y_1) = 2 \arctan(x_2 / 2y_2) determines the retinal image size.
The retina contains two types of photoreceptors, called rods and cones.
There are approximately 100 million rods in a normal eye, and these cells
are relatively long and thin. These cells provide scotopic vision, which is the
visual response of the HVS at low illumination. The cone cells are shorter,
thicker, and less sensitive than the rods, and are fewer in number
(approximately 6-7 million). These cells provide photopic vision, the visual
response at higher illumination. In mesopic vision (i.e., at intermediate
illumination), both rods and cones are active. Since electronic displays
are well lit, photopic vision is of primary importance in display
engineering. Color vision is provided by the cones. Because these
receptors are not active at low illumination, the color of an object cannot be
detected properly at low illumination.
The distribution of rod and cone cells (per sq. mm) is shown in Fig. 3.3. It
is observed that the retina is not a homogeneous structure. The cone cells are
densely packed in the fovea (a circular area of about 1.5 mm diameter in the
center), and their density falls off rapidly outside a circle of 1° radius. There
are no rods or cones in the vicinity of the optic nerve, and thus the eye has a
blind spot in this region. When light stimulates a rod or cone, a photochemical
transition occurs, producing a nerve impulse. There are about 0.8 million
nerve fibers. In many regions of the retina, the rods/cones are interconnected
to the nerve fibers on a many-to-one basis.
The central area, known as the foveal area, is capable of providing high-
resolution viewing, but performance falls off in the extra-foveal region. The
foveal area subtends an angle of only approximately 1-2 degrees, while the
home television generally subtends 5-15° (when a viewer sits at a distance
of about 6 times the height of the TV receiver, a vertical viewing angle of
9.5° is created) depending on viewing distance. Hence, home televisions are
generally watched with extrafoveal vision.
Figure 3.3. Distribution of rod and cone receptors in the retina (receptor
density versus perimetric angle from the fovea, nasal and temporal sides).
L = \int_0^{\infty} I(\lambda) V(\lambda) \, d\lambda    (3.2)

where V(\lambda) is the relative luminous efficiency function of the HVS.
Figure 3.4. Relative luminous efficiency of the HVS as a function of
wavelength (in nm).
Figure 3.5. Simultaneous contrast: small circles in the middle have equal
luminance values, but have different brightness.
Figure 3.6. Contrast and luminance. Figure 3.7. Brightness (B) and luminance (L).
Figure 3.8. Square wave gratings. The left grating (5 cycles/deg) has a
higher frequency than the right (2 cycles/deg).
A related parameter is the visual acuity of the eye, which is defined as the
smallest angular separation at which the individual lines (if we assume that
the image consists of horizontal and vertical lines of pixels) can be
distinguished. Visual acuity varies widely, from as little as 0.5' to as much
as 5' (minutes of arc), depending on the contrast ratio and the keenness of
the individual's vision. An acuity of 1.7' is often assumed for practical
purposes, and this corresponds to approximately 18 (= 60/(2 x 1.7))
cycles/degree.
Figure 3.10. Frequency response of the eye. (a) 1-D and (b) 2-D frequency response.
Figure 3.11. Line-spread function of the HVS.
Figure 3.12. Mach band effect. (a) The unit step function, (b) the staircase
function, (c) brightness corresponding to image (a), (d) brightness
corresponding to image (b).
• Example 3.1
In this example, the Mach band effect in Fig. 3.12(c) is explained using the
line-spread function of Fig. 3.11. The line-spread function (i.e., the impulse
response in one spatial dimension) of the HVS is shown in Fig. 3.11. The
input shown in Fig. 3.12(a) can be considered a step input. Assuming
that the HVS is linear and shift invariant, the step response of the system can
be obtained by integrating the impulse response of the system. If we
integrate the line-spread function in Fig. 3.11, a function similar to Fig.
3.12(c) is obtained. Note that points A, B, C, and D in the two figures
roughly correspond to each other. •
Note that if two different spectral distributions C_1(\lambda) and C_2(\lambda) produce
identical tristimulus values {a_R, a_G, a_B}, the two colors will be perceived
as identical. Hence, two colors that look identical can have different spectral
distributions.
Not all humans, however, are so lucky - many people are color blind.
Monochromatic people have rod cells and only one type of cone cells; these
people are completely color blind. Dichromatic people have only two types
of cones (i.e., one cone type is absent). They can see some colors perfectly,
and misjudge other colors (depending on which type of cone cells is absent).
Figure 3.13. Perceptual representation of color: (a) hue and saturation, with
the spectral colors on the periphery and the line of grays at the center, (b) the
color solid, with brightness varying from black to white.
Figure 3.14. Typical absorption spectra of the three types of cones in the
human retina (relative absorbance versus wavelength, 400-750 nm).
Figure 3.15. Primary colors can be added to obtain different composite colors. The
corresponding color figure is included in "Chapter 3 supplement" in the CD.
Figure 3.16. Subtractive color mixing of blue and yellow pigments. Each
pigment absorbs part of the spectrum; the wavelengths reflected by both
pigments are perceived as green.
Figure 3.17. Tristimulus curves for (a) the CIE spectral primary system, and
(b) the CIE {X,Y,Z} system.
Figure 3.18. The 1931 CIE chromaticity diagram. {R_1, G_1, B_1} and
{R_2, G_2, B_2} are the locations of the red, green and blue color television
primaries for PAL system I and the NTSC system, respectively [2].
[X; Y; Z] = M_{RGB->XYZ} [R; G; B]    (3.7)

The matrix

M_{RGB->XYZ} = [X_R X_G X_B; Y_R Y_G Y_B; Z_R Z_G Z_B]    (3.8)
Figure 3.19. Geometrical representation of a color C in
{R, G, B} and {X, Y, Z} space.
The coefficients of the matrix M_{RGB->XYZ} can be calculated as follows.
We note that {X, Y, Z} = {X_R, Y_R, Z_R} when R = 1 and G = B = 0. So,
project one unit of the R-vector onto the {X, Y, Z} coordinates; the projections
provide the values of X_R, Y_R, Z_R. Similarly, the projection of a unit G-vector
onto the {X, Y, Z} coordinates will provide the values of X_G, Y_G, Z_G. The
coefficients X_B, Y_B, Z_B can be obtained by projecting a unit B-vector onto the
{X, Y, Z} coordinates. Table 3.1 shows the transformation matrices
corresponding to several color coordinate systems.
A brief introduction to the color systems that are widely used in practice
will now be presented.
Table 3.1. Widely used color coordinate systems.

CIE spectral primary system {R, G, B}: monochromatic primary sources,
red = 700 nm, green = 546.1 nm, blue = 435.8 nm. The reference white has a
flat spectrum and R = G = B = 1.

CIE {X, Y, Z} system:
[X; Y; Z] = [0.490 0.310 0.200; 0.177 0.813 0.011; 0.000 0.010 0.990] [R; G; B]
All tristimulus values are positive; Y = luminance.

CIE uniform chromaticity scale (UCS) {U, V, W}:
[U; V; W] = [0.667 0 0; 0 1 0; -0.5 1.5 0.5] [X; Y; Z]
MacAdam ellipses are almost circles.

NTSC receiver primary system {R_N, G_N, B_N}:
[R_N; G_N; B_N] = [1.910 -0.533 -0.288; -0.985 2.000 -0.028; 0.058 -0.118 0.896] [X; Y; Z]
Linear transformation of {X, Y, Z}, based on the TV phosphor primaries.

NTSC transmission system {Y, I, Q}:
[Y; I; Q] = [0.299 0.587 0.114; 0.596 -0.274 -0.322; 0.211 -0.523 0.312] [R_N; G_N; B_N]
Y = luminance; used for TV transmission.

Table 3.2. Transformations from the NTSC receiver primary system to other
color coordinate systems.

CIE spectral primary {R, G, B}:
[R; G; B] = [1.167 -0.146 -0.151; 0.114 0.753 0.158; -0.001 0.059 1.127] [R_N; G_N; B_N]

CIE UCS {U, V, W}:
[U; V; W] = [0.405 0.116 0.133; 0.299 0.587 0.114; 0.145 0.827 0.627] [R_N; G_N; B_N]

CIE {X, Y, Z}:
[X; Y; Z] = [0.607 0.174 0.201; 0.299 0.587 0.114; 0.000 0.066 1.117] [R_N; G_N; B_N]

NTSC transmission {Y, I, Q}:
[Y; I; Q] = [0.299 0.587 0.114; 0.596 -0.274 -0.322; 0.211 -0.523 0.312] [R_N; G_N; B_N]
• Example 3.2
The NTSC receiver primary magenta corresponds to the tristimulus values
R_N = B_N = 1, G_N = 0. The tristimulus and chromaticity values of magenta are
determined in i) the CIE spectral primary and ii) the {X, Y, Z} coordinate systems.
CIE spectral primaries
The tristimulus values of magenta can be obtained using the
transformation matrix (corresponding to the CIE primary system) shown in
Table 3.2:
R = 1.167 R_N - 0.146 G_N - 0.151 B_N = 1.167 - 0.151 = 1.016
{X, Y, Z} coordinates
The tristimulus values of magenta can be obtained using the transformation
matrix (corresponding to the {X, Y, Z} system) shown in Table 3.2:
X = 0.607 R_N + 0.174 G_N + 0.201 B_N = 0.607 + 0.201 = 0.808
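The remaining tristimulus components and the chromaticity values follow from the same matrices. A small MATLAB sketch of the {X, Y, Z} computation is shown below, using the matrix from Table 3.2; the chromaticity values are obtained by normalizing the tristimulus values to unit sum.

% Tristimulus and chromaticity values of NTSC magenta (R_N = B_N = 1, G_N = 0).
M = [0.607 0.174 0.201;
     0.299 0.587 0.114;
     0.000 0.066 1.117];             % {X,Y,Z} from {R_N,G_N,B_N}, Table 3.2
xyz = M*[1; 0; 1];                   % tristimulus values: X = 0.808, Y = 0.413, Z = 1.117
chrom = xyz/sum(xyz);                % chromaticity values x, y, z (x + y + z = 1)
fprintf('x = %.3f, y = %.3f, z = %.3f\n', chrom);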
Figure 3.20. Perceptual distance experiment using two concentric circles in
{R, G, B} color space. (a) Inner circle: r=0.2, g=0.6, b=0.2; outer circle: r=0.2,
g=0.62, b=0.2. (b) Inner circle: r=0.2, g=0.2, b=0.6; outer circle: r=0.2, g=0.2,
b=0.62. The values of r, g, and b are normalized. The Euclidean distance in both
cases is 0.02. The corresponding color figure is included in "Chapter 3
supplement" in the CD.
Figure 3.21. Flicker sensitivity of the eye as a function of temporal
frequency, at several luminance levels.
It is observed in Fig. 3.21 that the eye is more sensitive to flicker at high
luminance. Moreover, the eye acts as a lowpass filter for temporal
frequencies. The flicker sensitivity is negligible above 50/60 Hz. This
property is the basis for designing television systems and computer displays
that employ at least a 50 Hz field or frame rate.
REFERENCES
1. R. C. Gonzalez and Richard E. Woods, Digital Image Processing, Addison Wesley,
1993.
2. A. K. Jain, Fundamentals of Digital Image Processing, Prentice Hall, 1989.
3. T. N. Cornsweet, Visual Perception, Academic Press, New York, 1970.
4. D. E. Pearson, Transmission and Display of Pictorial Information, Pentech Press,
1975.
5. H. G. Grassmann, "Theory of compound colors," Philosophical Magazine, Vol. 4,
No. 7, pp. 254-264, 1854.
6. A. N. Netravali and B. G. Haskell, Digital Pictures: Representation, Compression
and Standards, Plenum Press, New York, 1994.
QUESTIONS
1. Explain the importance of studying the HVS model for designing a multimedia
system.
2. Explain the image formation process on the retina. How many different types of
receptors are present in the retina?
3. What is the relative luminous efficiency of the HVS?
4. Define luminance and brightness. Even if two regions of an image have identical
luminance values, they may not have the same brightness. Why?
5. Define spatial frequencies and modulation transfer function. What is the sensitivity of
the HVS to different spatial frequencies?
6. Consider a sinusoidal grating being viewed by an observer on a computer monitor.
The distance between the observer and the monitor is X, and the grating has a
frequency of 2 cycles/degree. If the observer moves forward (towards the monitor) by
X/4, what will be the new grating frequency? Assume a narrow angle vision.
7. Explain the frequency sensitivity of the HVS, and the Mach band effect.
8. Why do we need vision models? Explain a simple monochrome vision model.
9. What are the three perceptual attributes of color?
10. What are the three primary colors that are used for additive color reproduction? What
is the basis of this model from the HVS perspective?
11. What are tristimulus values of a color?
12. Why is there a large number of color spaces? Compare the {R,G,B}, {X,Y,Z}, and
{Y,I,Q} color spaces.
13. What is the principle of subtractive color mixing? What are the three primaries
typically used? What will be the reproduced color for i) yellow-green mixture, and ii)
yellow-red?
14. Which method (subtractive or additive) of color reproduction is used in i) printing, ii)
painting, iii) computer monitors, and iv) television monitors?
15. What is a chromaticity diagram? How does it help in color analysis? Which colors are
found near the edges of the diagram? Which colors are found near the center?
16. Given a spectral color on the chromaticity diagram, how would you locate the
coordinates of the color for different saturation value?
17. Determine the chromaticity values of i) green, ii) cyan, and iii) blue in the CIE
spectral primary coordinate system.
18. The reference yellow in the NTSC receiver primary system corresponds to
R_N = G_N = 1, B_N = 0. Determine the chromaticity coordinates of yellow in the
CIE spectral primary system, and in the {X,Y,Z} coordinates.
19. Repeat the previous problem for the reference white that corresponds to
R_N = G_N = B_N = 1.
20. A 2 x 2 color image is represented in CIE spectral primary system with the following
R, G, and B values (in 8-bit representation).
Chapter 4
Multimedia Data Acquisition

If G(\omega) = 0 for |\omega| > 2\pi B, g(t) is said to be bandlimited to B Hz.
The signal g(t) is sampled uniformly at a rate of f_s samples per second
(i.e., with a sampling interval T = 1/f_s sec). The sampled signal g_s(t) can be
represented as

g_s(t) = g(t) s(t), where s(t) = \sum_{k=-\infty}^{\infty} \delta(t - kT)
Figure 4.1. The waveform of the audio signal "test44k". The full signal has a
duration of 30 seconds, of which the first 1.3 seconds are shown here.
Note that G_s(\omega) is a periodic extension of G(\omega) (as shown in Fig. 4.3). Two
consecutive periods of G_s(\omega) will not overlap if f_s > 2B. Hence, G(\omega)
can be obtained exactly from G_s(\omega) by lowpass filtering if the following
condition is satisfied:

f_s > 2B    (4.4)
Figure 4.2. The amplitude spectrum (in dB) of the audio signal shown in
Fig. 4.1 (frequency in KHz).
Figure 4.3. Time-domain and frequency-domain signals illustrating the process of
bandlimited sampling. g(t): input signal, f_s: sampling frequency, s(t): sampling
waveform, g_s(t): sampled signal. G(\omega), S(\omega), and G_s(\omega) are the Fourier
transforms of g(t), s(t), and g_s(t), respectively.
g(t) = \sum_{k=-\infty}^{\infty} g(kT) \, \frac{\sin[(f_s t - k)\pi]}{(f_s t - k)\pi}    (4.5)
Eq. (4.5) states that the signal values at the non-sampling instances can be
calculated exactly using a weighted sum of all the sampled values. Note that
the frequency-domain and time-domain approaches, although apparently
different, are equivalent, since the sinc function on the right side of Eq. (4.5)
is the impulse response of the ideal lowpass filter.
• Example 4.1
Consider the following audio signal with a single sinusoidal tone at 4.5
KHz:

g(t) = 5 \cos(2\pi \cdot 4500 t)

Sample the signal at a rate of i) 8000 samples/sec, and ii) 10000
samples/sec. Reconstruct the signal by passing it through an ideal lowpass
filter with a cut-off frequency of one-half of the sampling frequency.
Assume that the passband gains of the two filters are i) 1/8000, and ii)
1/10000. Determine the reconstructed audio signal in both cases.
The Fourier transform of the input signal can be expressed as

G(\omega) = 5\pi [\delta(\omega - 9000\pi) + \delta(\omega + 9000\pi)]

Case 1:
Using Eq. (4.3), the Fourier transform of the sampled signal can be
expressed as

G_s(\omega) = \frac{1}{T_1} \sum_{n=-\infty}^{\infty} G(\omega - 2\pi n f_s) = 8000 \sum_{n=-\infty}^{\infty} G(\omega - 16000 n \pi)    [since 1/T_1 = 8000]

The reconstruction filter (cut-off 4000 Hz, gain 1/8000) passes the aliased
impulses at \omega = \pm(16000\pi - 9000\pi) = \pm 7000\pi, so the
reconstructed signal is 5 \cos(7000\pi t), a 3500 Hz tone.
Case 2:
The Fourier transform of the sampled signal can be expressed as

G_s(\omega) = \frac{1}{T_2} \sum_{n=-\infty}^{\infty} G(\omega - 2\pi n f_s) = 10000 \sum_{n=-\infty}^{\infty} G(\omega - 20000 n \pi)    [since 1/T_2 = 10000]
Note that the reconstructed signal is identical to the original signal in Case
2. However, in Case 1, the reconstructed signal is different from the original
signal. The original and reconstructed signals are plotted in Fig. 4.4. It is
observed that the two signals are identical only at the sampling instances. At
other time instances, the two signals differ because of the aliasing.
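The aliasing in Case 1 can also be verified numerically. The following MATLAB sketch shows that the 4.5 KHz tone and its 3.5 KHz alias are indistinguishable at the 8000 samples/sec sampling instances:

% The 4500 Hz tone and its 3500 Hz alias agree at every sampling instance.
fs = 8000;                       % Case 1 sampling rate
n = 0:79;                        % 80 sampling instances
g = 5*cos(2*pi*4500*n/fs);       % samples of the original tone
ga = 5*cos(2*pi*3500*n/fs);      % samples of the aliased (reconstructed) tone
max(abs(g - ga))                 % ~1e-14: identical up to rounding error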
Aliasing
When an arbitrary audio signal is sampled at a rate of f_s samples/sec, the
frequency range that can be reconstructed exactly is

0 \le f \le f_s / 2    (4.6)
Figure 4.4. The original and reconstructed signals in Case 1 of Example 4.1.
The two signals agree only at the sampling instances.
I(\omega_h, \omega_v) = \int \int i(x, y) \, e^{-j(\omega_h x + \omega_v y)} \, dx \, dy    (4.8)
where \omega_h and \omega_v are the horizontal and vertical spatial frequencies,
respectively, and are expressed in radians/degree. Note that \omega_h = 2\pi f_h and
\omega_v = 2\pi f_v, where f_h and f_v are the horizontal and vertical frequencies in
cycles/degree (see Section 3.2.3 for a definition). Let the image be
considered bandlimited, with the maximum horizontal and vertical
frequencies denoted by f_H and f_V. Assuming a rectangular sampling
grid, if the image is sampled at spatial intervals of (\Delta x, \Delta y), the sampled
image can be expressed as

i_s(x, y) = i(x, y) s(x, y)    (4.9)
where s(x, y) is the 2-D sampling function (shown in Fig. 4.5), defined as

s(x, y) = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} \delta(x - m \Delta x, \, y - n \Delta y)
Note that the sampling frequencies in the horizontal and vertical directions
are 1/\Delta x and 1/\Delta y (in samples/degree), respectively. Extending the
Nyquist criterion to two dimensions, it can be shown that the image i(x, y)
can be recovered exactly if the following conditions are true:

1/\Delta x > 2 f_H  and  1/\Delta y > 2 f_V    (4.10)

Figure 4.5. Two-dimensional sampling function.
• Example 4.2
Consider the following image grating with horizontal and vertical
frequencies of 4 and 6 cycles/degree, respectively:

i(x, y) = 255 \cos[2\pi(4x + 6y)]    (4.11)
Case 1:
Using Eq. (4.3), the Fourier transform of the sampled image can be
obtained as a periodic replication of the original spectrum at multiples of the
sampling frequency.
The spectrum of the original and sampled image is shown in Fig. 4.6. The
circular black dots represent two-dimensional impulse functions. It is
observed that the spectrum of the sampled image is periodic. The 2-D filter
specified in Eq. (4.12) is shown as a square in Fig. 4.6(b). The filter will
pass the two impulse functions located at (4, -4) and (-4, 4) cycles/degree.
The filtered spectrum can then be expressed as

\hat{I}(\omega_h, \omega_v) = H_{2D}(\omega_h, \omega_v) \, I_s(\omega_h, \omega_v) = 255\pi [\delta(\omega_h - 4 \cdot 2\pi, \omega_v + 4 \cdot 2\pi) + \delta(\omega_h + 4 \cdot 2\pi, \omega_v - 4 \cdot 2\pi)]

Figure 4.6. Spectrum of (a) the original image and (b) the sampled image.
A suitable resolution can be determined from Eq. (4.10) and the properties of
the human visual system discussed in Chapter 3. It was observed in Chapter 3
that the details of an image visible to our eyes depend on the angle the
image subtends on the retina, and that the sensitivity of human eyes is
low for frequency components above 20 cycles/degree. Therefore, as a
rough approximation, we can assume that the maximum horizontal and
vertical frequencies present in an image are 20 cycles/degree, i.e.,
f_H = f_V = 20. The corresponding Nyquist sampling rate will be 40
samples/degree in both directions.
Figure 4.7. Example of aliasing in images. (a) Original image, and (b)
undersampled image reconstructed by lowpass filtering.
• Example 4.3
A 4"x6" photo is to be scanned. Determine the minimum scanning
resolution.
Figure 4.8. Photo scanning. (a) The viewing angles of a 4"x6" image,
(b) the angle formation at the eyes.
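A sketch of the calculation is given below. The 15-inch viewing distance is an assumption introduced here for illustration; the book's worked solution may use a different value.

% Minimum scanning resolution of a 4"x6" photo, viewed from 15 inches (assumed).
D = 15; W = 6;                   % viewing distance and photo width, in inches
theta = 2*atand(W/(2*D));        % horizontal viewing angle, ~22.6 degrees
samples = 40*theta;              % at the Nyquist rate of 40 samples/degree
dpi = samples/W                  % ~151 samples/inch along the width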
• Example 4.4
Consider an audio signal with a spectrum of 0-20 KHz. The audio signal
has to be sampled at 8 KHz. Design an anti-aliasing filter suitable for this
application.
The sampling frequency is 8 KHz. An ideal lowpass filter with a sharp
cut-off frequency of 4 KHz would suppress all components with a frequency
higher than 4 KHz. However, it is physically impractical to design an ideal
filter. In this example, we design a lowpass filter with the following
characteristics:
i) The passband is 0-3200 Hz. The gain in the passband is Gp > -2 dB.
ii) The transition band is 3200-4000 Hz.
iii) The stopband is > 4000 Hz. The gain in the stopband is Gs < -20 dB.
There is a rich body of literature [3] available on filter design, which is
beyond the scope of this book. Consequently, the filter design procedure is
only briefly illustrated using a MATLAB example. There are several ways to
design a filter with the above design constraints. The following MATLAB
code (shown in Table 4.2) calculates the transfer functions corresponding to
the Butterworth, Chebyshev-1, and Chebyshev-2 filters.
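Table 4.2 itself is not reproduced here; the following sketch shows one way such transfer functions can be obtained with the functions of MATLAB's Signal Processing Toolbox. The variable names are assumptions.

% Analog anti-aliasing filter design for the constraints of Example 4.4.
Wp = 2*pi*3200; Ws = 2*pi*4000;           % passband and stopband edges (rad/sec)
Rp = 2; Rs = 20;                          % passband ripple and stopband attenuation (dB)
[n, Wn] = buttord(Wp, Ws, Rp, Rs, 's');   % minimum Butterworth order
[bb, ab] = butter(n, Wn, 's');
[n, Wn] = cheb1ord(Wp, Ws, Rp, Rs, 's');  % Chebyshev-1: passband ripple
[b1, a1] = cheby1(n, Rp, Wn, 's');
[n, Wn] = cheb2ord(Wp, Ws, Rp, Rs, 's');  % Chebyshev-2: stopband ripple
[b2, a2] = cheby2(n, Rs, Wn, 's');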
Figure 4.9. Lowpass/anti-aliasing filter characteristics. (a) An ideal lowpass filter has
flat passband gain, instantaneous cutoff, and zero stopband gain; (b) a realizable
lowpass filter has a gradual transition from passband to stopband. Note that f_s is
the sampling frequency.
Note that H(s) is also known as the transfer function of the filter. Using
the MATLAB design procedure, it can be shown that the transfer functions
corresponding to the three filters are as follows:
Butterworth:

H(s) = \frac{4.909 \times 10^{45}}{D(s)}    (4.14)

where D(s) = s^{13} + 2.714 \times 10^4 s^{12} + 3.682 \times 10^8 s^{11} + 3.298 \times 10^{12} s^{10} + 2.171 \times 10^{16} s^9 + 1.107 \times 10^{20} s^8 + 4.493 \times 10^{23} s^7 + 1.47 \times 10^{27} s^6 + 3.874 \times 10^{30} s^5 + 8.129 \times 10^{33} s^4 + 1.322 \times 10^{37} s^3 + 1.579 \times 10^{40} s^2 + 1.245 \times 10^{43} s + 4.909 \times 10^{45}

Chebyshev-1:

Chebyshev-2:

H(s) = \frac{1962 s^4 - 1.36 \times 10^{-8} s^3 + 1.196 \times 10^{11} s^2 - 0.2839 s + 1.457 \times 10^{18}}{s^5 + 12670 s^4 + 7.84 \times 10^7 s^3 + 3.215 \times 10^{11} s^2 + 7.672 \times 10^{14} s + 1.457 \times 10^{18}}    (4.16)
The characteristics of the filters are shown in Fig. 4.10. It is observed that
all three filters satisfy the design constraints. The Butterworth filter (13th
order, cut-off frequency = 3271 Hz) has a monotonic gain response. The
Chebyshev-1 filter (5th order, Wp = 3200, Ws = 3903) has ripples in the
passband and monotonic gain in the stopband, whereas the Chebyshev-2
filter has monotonic gain in the passband and ripples in the stopband.
The design of the 1-D anti-aliasing filter can be extended to the design of
a 2-D anti-aliasing filter for images. In this case, the filter will have two
cut-off frequencies, one each for the horizontal and vertical directions.
Figure 4.10. Gain versus frequency (in Hz) for the ideal, Butterworth,
Chebyshev-1, and Chebyshev-2 filters.
An anti-aliasing filter is first applied to limit the aliasing energy in the
sampled audio signal. Sample-and-hold and analog-to-digital conversion are
then applied to obtain the actual digital audio samples.
As discussed in Chapter 2, a typical audio system can have up to 7-8
channels. The above procedure is applied separately to each channel of the
audio system. The samples from different audio channels are then
multiplexed, compressed, and stored for future playback.
Figure 4.11. Digitization of audio signal. (a) Sampling and digitization, (b) N-
channel audio recording and storing process.
Blocks and their functions:
Amplifier - Amplifies the audio signal before any noise (e.g., dithering
and quantization noise) is introduced.
Dither Generator - Adds a small amount of random noise. Ironically, this
increases the perceptual quality when the signal is quantized.
Anti-aliasing Filter - A lowpass filter that ensures the audio signal is
bandlimited; it eliminates the aliasing components after sampling.
Sample and Hold - Holds the value of the audio signal so that it can be
sampled at each sampling instance.
Analog-to-Digital Converter - Computes the equivalent digital
representation of the analog signal.
Multiplexer - Multiplexes the bitstreams coming from the different channels.
Compression - Reduces redundancy and compresses the audio data while
maintaining an acceptable audio quality.
• Example 4.5
Consider an audio recording system where the microphone generates a
continuous voltage in the range [-1, 1] volts. Calculate the decision and
reconstruction levels for an eight-level quantizer.
For a uniform quantizer with N = 8 levels, the step size is
\Delta = (1 - (-1))/8 = 0.25, and the decision and reconstruction levels can be
calculated from the following equations:

d_i = -1 + i\Delta, i = 0, 1, ..., 8
r_i = d_i + \Delta/2, i = 0, 1, ..., 7

The decision and reconstruction levels of the quantizer are shown in Table
4.4.
In the quantization process, an input sample value is represented by the
nearest allowable output level, and in this process some noise is introduced
into the quantized signal. The quantization error (also known as quantization
noise) is the difference between the actual value of the analog signal and its
quantized value. In other words,

e(nT) = g(nT) - \hat{g}(nT) = g(nT) - Q[g(nT)]    (4.18)

where g(nT) is the input signal, \hat{g}(nT) is the quantizer output, and e(nT)
is the error signal at sampling instance nT. Figure 4.13 shows the
quantization of the audio signal by the quantizer. It is observed that the
largest (i.e., worst) quantization error is half of the decision interval, i.e.,
0.125. •
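The levels of Table 4.4, and the quantization itself, can be reproduced with a few lines of MATLAB. This is an illustrative sketch, not the book's code:

% Decision and reconstruction levels of a uniform 8-level quantizer on [-1, 1].
N = 8; delta = 2/N;                        % step size = 0.25
d = -1 + (0:N)*delta;                      % decision levels d_0, ..., d_8
r = d(1:N) + delta/2;                      % reconstruction levels (interval midpoints)
x = 0.3;                                   % an input sample
q = r(min(N, 1+floor((x+1)/delta)))        % quantized value 0.375; error <= delta/2 = 0.125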
If more bits are used to represent a sample, the number of output levels (N)
will increase. This will decrease the quantization error, and therefore
provide superior audio quality.
Figure 4.13. Quantization of the audio signal by the eight-level quantizer:
(a) quantizer input and output, and (b) the quantization error (horizontal
axes: sampling time).
Bit-rate
The bit-rate of a quantized audio signal can be calculated as

bit-rate = f_s \times B bits/channel/second    (4.20)
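For example, CD-quality audio (f_s = 44.1 KHz, B = 16 bits/sample, two channels) requires 44100 x 16 x 2 = 1,411,200 bits/sec, i.e., 176.4 Kbytes/sec, consistent with the CD entry in Table 4.6.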
SNR = \frac{\sum_{n=0}^{N-1} [f(n)]^2}{\sum_{n=0}^{N-1} [f(n) - \hat{f}(n)]^2}    (4.21)

where f(n) and \hat{f}(n) are the nth samples of the original and quantized
audio signal, and N is the number of audio samples.
Traditionally, the signal to noise ratio (SNR) has been the most popular
error measure in electrical engineering. It provides useful information in
most cases, and is mathematically tractable. For this reason, it is also widely
used in audio coding. Unfortunately, the SNR values do not correlate well
with the subjective ratings, especially at high compression ratios. Several
new distortion measures have been proposed for better adaptation to human
auditory system.
The quality of the digital audio sample is determined by the number of
bits per sample. If the error e(nT) is assumed to be statistically independent
and uniformly distributed in the interval [-Q/2, Q/2], the mean squared
value of the quantization noise for a signal with a dynamic range of 1 (i.e.,
Q = 2^{-B}) can be expressed as

E = \frac{1}{Q} \int_{-Q/2}^{Q/2} e^2 \, de = \frac{Q^2}{12} = \frac{2^{-2B}}{12}    (4.24)

where B is the number of bits used to represent each sample. Note that
e(nT) is a discrete-time analog signal, and hence the integral has been used
in Eq. (4.24). On the decibel scale, the mean squared noise can be expressed as

E (in dB) = 10 \log(Q^2/12) = 10 \log(2^{-2B}/12) = -6B - 10.8    (4.25)
The above relation shows that each additional bit of analog-to-digital
quantization reduces the noise by approximately 6 dB, and thus increases
the SNR by the same amount. As a rule of thumb, a signal quantized with B
bits/sample is expected to provide

SNR (in dB) \approx 6B    (4.26)

As a result, the SNR of 16-bit quantized audio will be on the order of 96
dB. Typically, an audio signal with more than 90 dB SNR is considered to
be of excellent quality for most applications.
• Example 4.6
Consider a real audio signal "chord", which is a stereo (i.e., two-channel)
audio signal digitized with a 22.050 KHz sampling frequency, with a
precision of 16 bits/sample. The normalized sample values (i.e., the dynamic
range is normalized to 1) of the left channel are shown in Fig. 4.14(a). The
signal has a duration of 1.1 sec, and hence there are 24231 samples. Estimate
the SNRs of the signal if the signal is quantized with 5-12 bits/sample.
For this example, we may consider the 16-bit audio signal to be the
original signal. A sample with value x (-0.5 \le x \le 0.5) can be quantized to
b bits with the following operation:

y = round(x \cdot 2^b)/2^b

where y is the new sample value with b-bit precision. All sample values are
quantized using the above equation; the noise energy is calculated, and the
SNR is calculated in dB. Figure 4.14(b) shows the quantization noise for 8
bits/sample quantization. It is observed that the noise has a dynamic range of
[-2^{-(b+1)}, 2^{-(b+1)}] (in this case, [-0.0020, 0.0020]). The probability
density function of the error values is shown in Fig. 4.14(c). The pdf is seen
to be close to the uniform density. Finally, the SNR for quantization with
different bit-rates is shown in Fig. 4.14(d). It is observed that the SNR
increases by 6 dB for each increment of precision by 1 bit. However, the
overall SNR is about 15 dB different from what is expected from Eq. (4.26).
This is mainly because the SNR also depends on the signal characteristics,
which Eq. (4.26) does not take into consideration. •
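The experiment can be reproduced with a short MATLAB script. The sketch below assumes that chord.wav is available (e.g., from the book's CD), and scales the samples to the [-0.5, 0.5] range used above.

% Estimate the SNR of the "chord" signal quantized to b = 5, ..., 12 bits/sample.
[x, fs] = audioread('chord.wav');     % 16-bit reference samples, normalized to [-1, 1)
x = x(:,1)/2;                         % left channel, dynamic range [-0.5, 0.5]
for b = 5:12
    y = round(x*2^b)/2^b;             % quantize to b bits
    e = x - y;                        % quantization noise
    fprintf('b = %2d: SNR = %.1f dB\n', b, 10*log10(sum(x.^2)/sum(e.^2)));
end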
Table 4.6 shows the sampling frequency, bits/sample, and the
uncompressed data rate for several digital audio applications. From our
experience, we can follow the improvement of the audio quality as the bit-
rate increases. The telephone and AM radio employ lower sampling
rates of 8 and 11.025 KHz, respectively, and provide a dark and muffled
audio quality. At this bit-rate, it may not be advantageous to use the stereo
mode. FM radio provides a good audio quality because of its higher bit
resolution and stereo mode. However, it produces a darker sound than
CDs because of the lower sampling rate. CDs and digital audio tape
(DAT) generally provide excellent audio quality, and are the recognized
standards of audio quality. For extremely high quality audio, however, the
DVD audio format is used, which has a sampling rate of 192 KHz and 24
bits/channel/sample resolution. Note that the DVD audio format, mentioned
in Table 4.6, is not the same as the Dolby digital audio format used in a
DVD movie. It is simply a superb-quality audio, without any corresponding
video, available on a DVD.
Figure 4.14. The audio signal chord.wav. (a) The signal waveform, (b)
Quantization error at 8 bits/sample, (c) probability density function of the
quantization error at 8 bits/sample, and (d) SNR versus bits/sample.
Table 4.6. Digital audio at various sampling rates and resolutions.

Quality     | Sample Rate (KHz) | Bits per Sample | Mono/Stereo | Data Rate (if uncompressed) | Frequency Band (Hz)
Telephone   | 8                 | 8, µ-law²       | Mono        | 8 Kbytes/sec                | 200-3,400
AM Radio    | 11.025            | 8               | Mono        | 11.0 Kbytes/sec             |
FM Radio    | 22.050            | 16              | Stereo      | 88.2 Kbytes/sec             |
CD          | 44.1              | 16, linear PCM  | Stereo      | 176.4 Kbytes/sec            | 20-20,000
DAT¹        | 48                | 16              | Stereo      | 192.0 Kbytes/sec            | 20-20,000
DVD Audio   | 192               | 24              | Stereo      | 1152.0 Kbytes/sec           | 20-20,000

¹Digital audio tape. ²More details about µ-law in Chapter 7.
A color image typically has three components (as discussed in Chapter
3). For electronic display, the {R, G, B} components are commonly used,
and each of the three components is quantized separately. High resolution
sensors generally employ 8 bits/channel/pixel; hence, a total of 24 bits is
required to represent a color pixel. Fig. 4.16 shows the Lena image with 24
bits/pixel resolution.
Note that 24 bits/pixel corresponds to 16 million (= 2^24) colors, which is
many more than the human eye can differentiate. Therefore, many
electronic devices use a lower resolution display (e.g., 8 or 16 bits). This
is generally obtained using a color map.
There are primarily two types of fidelity measures: subjective and objective.
An impairment scale similar to that shown in Table 4.5 can be employed to
evaluate image quality. However, visual testing with real subjects is a
cumbersome procedure; therefore, objective measures are mostly used in
practice.
As in audio, the SNR and MSE are two popular measures used to evaluate
the distortion in an image. In addition, peak signal to noise ratio (PSNR) and
the mean absolute error (MAE) are two other popular distortion measures. If
f(m,n) and f̂(m,n) are the original and distorted images of size M×N,
the distortion with respect to the various criteria is calculated as follows:

SNR (in dB) = 10 log10 [ ΣΣ [f(m,n)]² / ΣΣ [f(m,n) - f̂(m,n)]² ]    (4.27)

PSNR (in dB) = 10 log10 [ 255² / MSE ]    (4.28)

Mean Square Error (MSE) = (1/MN) ΣΣ [f(m,n) - f̂(m,n)]²    (4.29)

Mean Absolute Error (MAE) = (1/MN) ΣΣ |f(m,n) - f̂(m,n)|    (4.30)

where the double sums run over 0 ≤ m ≤ M-1 and 0 ≤ n ≤ N-1, and the
255 peak in Eq. (4.28) corresponds to 8-bit pixels.
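The four measures of Eqs. (4.27)-(4.30) can be computed in a few lines of
MATLAB. This is only a sketch: the file name is hypothetical, and the
distortion is simulated here with additive noise.

% Distortion measures between an original image f and a distorted image fd.
f  = double(imread('lena.tif'));          % hypothetical file name
fd = f + 5*randn(size(f));                % simulated distortion
e    = f - fd;
MSE  = mean(e(:).^2);                            % Eq. (4.29)
MAE  = mean(abs(e(:)));                          % Eq. (4.30)
SNR  = 10*log10(sum(f(:).^2)/sum(e(:).^2));      % Eq. (4.27)
PSNR = 10*log10(255^2/MSE);                      % Eq. (4.28), 8-bit peak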
QUESTIONS
1. Consider an audio signal with a single sinusoidal tone at 6 KHz. Sample the signal at
a rate of 10000 samples/sec. Reconstruct the signal by passing it through an ideal
lowpass filter with cut-off frequency of 5 KHz. Determine the frequency of the
reconstructed audio signal.
2. What is aliasing in digital audio? How can it be avoided?
3. You are recording an audio source that has a bandwidth of 5 KHz. The digitized
audio should have an average SNR of at least 60 dB. Assuming that there are two
stereo channels, and no digital compression, what should be the minimum data rate in
bits/sec?
4. You are required to design anti-aliasing filters for i) telephony system, ii) FM radio,
and iii) CD recording. The audio sampling rates are given in Table 4.6. Determine
typical frequency characteristic requirements of the filters for the three applications.
Using MATLAB, calculate the transfer functions of the corresponding Butterworth
and Chebyshev filters. Which filters will you choose, and why?
5. What should be the typical Nyquist sampling frequencies for full bandwidth audio
and image signals?
6. You want to design an image display monitor with an aspect ratio of 4:3 (i.e.,
width:height). A user is expected to observe the images at a distance of 4 times the
monitor diagonal. What should be the minimum monitor resolution? Assume that the
HVS sensitivity is circularly symmetric, and is negligible above 25 cycles/degree.
7. Generate a simple color map with 64 colors. To observe the different colors, create
images with different combinations of R = {32, 96, 160, 224}, G = {32, 96, 160, 224},
and B = {32, 96, 160, 224}, and save them in any standard image format such as
TIFF or PPM. Display the colors and verify the additive color mixing property
discussed in Chapter 3.
Chapter 5
Transforms and Subband Decomposition
The properties of audio and video signals, and the digitization process,
have been discussed in the previous chapters. Once a signal is digitized,
further processing may be needed for various applications, such as
compression and enhancement. The processing of multimedia signals can be
done more effectively when the limitations of our auditory or visual systems
are taken into account. For example, it was shown in Chapter 3 that the
human ear is not very sensitive to audio frequencies above 10-12 KHz.
Similarly, the eyes do not respond well above 20 cycles/degree. This
dependency of our sensory systems on the frequency spectrum of audio or
visual signals has led to the development of transform and subband-based
signal processing techniques. In these techniques, the signals are
decomposed into various frequency or scale components, which are then
suitably modified depending on the application at hand. In this Chapter, we
discuss mainly two types of signal decomposition techniques: transform-based
decomposition and subband decomposition.
Although a wide variety of transforms have been proposed in the
literature, the unitary transforms are the most popular for representing
signals. A unitary transform is a special type of linear transform where
certain orthogonal conditions are satisfied. There are several unitary
transforms that are used in signal processing. In this Chapter, we present
only a few selected transforms, namely Fourier, cosine and wavelet
transforms that are widely used for multimedia signal processing. In
addition, we also discuss subband decomposition that is very popular for
compression and filtering applications. The application of these techniques
will be covered in the later Chapters.
Since audio signals are one-dimensional, these signals are represented
with one-dimensional (1-D) basis functions. On the other hand, an image
can be represented by two-dimensional (2-D) basis functions that are also
known as basis images. Similarly, a video signal can be represented by
three-dimensional basis functions. However, a video signal is generally
analyzed frame by frame, so 1-D and 2-D transforms suffice for most
multimedia applications.
For a 1-D sequence f(n) of length N, the forward and inverse transforms
can be written in matrix form as F = U f (Eq. (5.1)) and f = U*ᵀ F
(Eq. (5.2)), where the symbols "*" and "T" denote complex conjugation and
matrix transpose, respectively. Eq. (5.2) can be considered as a series
representation of the sequence f(n). The columns of U*ᵀ, i.e., the vectors
{u*(k,n), 0 ≤ n ≤ N-1}, are called the basis vectors of U.
The basis vectors satisfy the following orthogonality and completeness
conditions:

Σ_{n=0}^{N-1} u(k,n) u*(k₀,n) = δ(k - k₀)    (5.3)

Σ_{k=0}^{N-1} u(k,n) u*(k,n₀) = δ(n - n₀)    (5.4)

where
δ(k) = 1 for k = 0, and 0 otherwise    (5.5)
The above two properties ensure that the basis vectors corresponding to
the unitary transform are orthogonal to each other and the set of basis
vectors is complete. The completeness property ensures that there exists no
non-zero vector (which is not in the set) that is orthogonal to all the basis
vectors. The above two conditions will be true if and only if the following
condition is true.
U⁻¹ = U*ᵀ    (5.6)
We now present the Fourier and cosine transforms as special cases of
unitary transforms.

The discrete Fourier transform (DFT) pair of an N-point sequence f(n) is
defined as

F(k) = (1/√N) Σ_{n=0}^{N-1} f(n) W_N^{kn},  0 ≤ k ≤ N-1    (5.7)

f(n) = (1/√N) Σ_{k=0}^{N-1} F(k) W_N^{-kn},  0 ≤ n ≤ N-1    (5.8)

where F(k) is the kth DFT coefficient, and W_N = exp(-j2π/N). The N×N
unitary DFT matrix is then given by

U = (1/√N) [W_N^{kn}],  0 ≤ k, n ≤ N-1    (5.9)
• Example 5.1
Consider a 4-point sequence f = [2, 5, 7, 6]. Calculate the corresponding
DFT coefficients. Reconstruct the input sequence from the DFT coefficients
and the DFT basis functions.
86 Multimedia Signals and Systems
Using Eq. (5.9) with N = 4, and writing W = e^(-j2π/4) = -j:

[F(0)]       [1  1    1    1  ] [2]   [ 10        ]
[F(1)] = 1/2 [1  W    W^2  W^3] [5] = [ -2.5+j0.5 ]    (5.11)
[F(2)]       [1  W^2  W^4  W^6] [7]   [ -1        ]
[F(3)]       [1  W^3  W^6  W^9] [6]   [ -2.5-j0.5 ]
The input sequence can be calculated from its DFT coefficients using the
inverse DFT transformation matrix:

[f(0)]       [1  1     1     1   ] [ 10        ]   [2]
[f(1)] = 1/2 [1  W^-1  W^-2  W^-3] [ -2.5+j0.5 ] = [5]    (5.12)
[f(2)]       [1  W^-2  W^-4  W^-6] [ -1        ]   [7]
[f(3)]       [1  W^-3  W^-6  W^-9] [ -2.5-j0.5 ]   [6]
The four basis vectors corresponding to the 4-point transform are the
columns of the 4×4 inverse matrix shown in Eq. (5.12). The input sequence
can also be reconstructed using the basis vectors as follows:

f = 1/2 { 10·u₀ + (-2.5+j0.5)·u₁ + (-1)·u₂ + (-2.5-j0.5)·u₃ }

where u_k = [1, W^-k, W^-2k, W^-3k]ᵀ is the kth basis vector. •
Figure 5.1. The basis vectors of the 8-point DFT. a) Real parts, and b) imaginary parts.
The DFT pair is also frequently defined without the symmetric scaling
factors:

F(k) = Σ_{n=0}^{N-1} f(n) W_N^{kn},  0 ≤ k ≤ N-1    (5.13)

f(n) = (1/N) Σ_{k=0}^{N-1} F(k) W_N^{-kn},  0 ≤ n ≤ N-1    (5.14)

Note that the only difference between the transform pairs {Eqs. (5.7) and
(5.8)} and {Eqs. (5.13) and (5.14)} is the scaling factors. Due to the
different scaling factors, the transform represented by Eqs. (5.13) and
(5.14) is orthogonal, but not orthonormal or unitary.
Although the total energy does not change in the Fourier domain, the
energy however is redistributed among the Fourier coefficients. This is
illustrated with the following example.
Table 5.1. Comparison of N² versus N log₂N for various values of N.

N    | N²         | N log₂N (FFT) | Computational saving (N/log₂N)
32   | 1,024      | 160           | 6.4
256  | 65,536     | 2,048         | 32
1024 | 1,048,576  | 10,240        | 102
8192 | 67,108,864 | 106,496       | 630
• Example 5.2
Construct a 1-D signal from the pixel values of horizontal line# 100 of
the (gray-level) Lena image. In order to avoid a large DC component, make
the signal zero mean (by subtracting the average value of the pixels from
each pixel amplitude). Calculate the DFT, and the total energy in the first
20 Fourier coefficients. Compare it with the total energy of the signal.
The 1-D signal is plotted in Fig. 5.2(a). The abrupt amplitude fluctuations
correspond to the edges in the horizontal direction. It is observed that the
energy of the pixels is distributed across the horizontal direction. The DFT
amplitude spectrum of the zero-mean signal is shown in Fig. 5.2(b). The
low frequency DFT coefficients lie at the two extreme sides (the left side
corresponds to positive low frequencies and the right side to negative low
frequencies), whereas the high frequency coefficients lie in the middle
region. The total energy of the signal (lying across all 512 coefficients)
is 11.3×10⁵. The first twenty low frequency coefficients together have an
energy of 8.6×10⁵, which is approximately 76% of the total energy. The
remaining 492 coefficients contain only 24% of the total energy. •
This energy preservation and compaction is the primary reason for the
popularity of transform-based image compression techniques. This will be
discussed in more detail in Chapter 8.
Figure 5.2. Energy compaction with the DFT. a) Horizontal line (line# 100)
from the Lena image, and b) the DFT spectrum of the scan line.
The discrete cosine transform (DCT) pair is defined as

F(k) = α(k) Σ_{n=0}^{N-1} f(n) cos[(2n+1)πk / 2N],  0 ≤ k ≤ N-1    (5.18)

f(n) = Σ_{k=0}^{N-1} α(k) cos[(2n+1)πk / 2N] F(k),  0 ≤ n ≤ N-1    (5.19)

where α(0) = √(1/N) and α(k) = √(2/N) for 1 ≤ k ≤ N-1. Note that the
coefficients α(k) are used in Eqs. (5.18) and (5.19) to keep the norm of the
basis functions unity. It can be shown that the discrete cosine transform is
real and unitary, unlike the DFT, which requires complex operations.
• Example 5.3
Consider a 4-point sequence f = [2, 5, 7, 6]. Calculate the DCT
coefficients. Reconstruct the input sequence from the DCT coefficients.
Substituting N = 4 in Eq. (5.18), we obtain the DCT coefficients as
follows:

[F(0)]        [1/√2       1/√2       1/√2        1/√2      ] [2]   [ 10   ]
[F(1)] = 1/√2 [cos(π/8)   cos(3π/8)  cos(5π/8)   cos(7π/8) ] [5] = [ -3.15]    (5.20)
[F(2)]        [cos(2π/8)  cos(6π/8)  cos(10π/8)  cos(14π/8)] [7]   [ -2   ]
[F(3)]        [cos(3π/8)  cos(9π/8)  cos(15π/8)  cos(21π/8)] [6]   [ 0.22 ]

The input sequence can be calculated using the inverse DCT (Eq. (5.19)). •
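The dct and idct functions of the MATLAB Signal Processing Toolbox use the
same normalization as Eqs. (5.18) and (5.19), so Example 5.3 can be checked
directly:

f = [2 5 7 6];
F = dct(f)        % returns [10, -3.1543, -2, 0.2242], as in Eq. (5.20)
f_rec = idct(F)   % inverse DCT: recovers [2 5 7 6]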
Using different values of N in Eq. (5.18), the DCT of a sequence of
arbitrary length can be calculated. The basis vectors corresponding to the
8-point DCT are shown in Fig. 5.3(a), and the frequency characteristics of
the DCT basis vectors are shown in Fig. 5.3(b).
Note that in many signal processing applications, such as audio and image
compression, a long sequence of samples is divided into blocks of
nonoverlapping samples. The DCT (or DFT) is then calculated for each
block of data. This is traditionally known as a block transform.
Properties of DCT
i) Energy Compaction
The DCT has an excellent energy compaction performance. This is
demonstrated with an example.
• Example 5.4
Consider the image scan line in Example 5.2. Calculate the DCT, and the
total energy in the first 20 DCT coefficients. Compare it with the total
energy of the signal, and compare the energy compaction performance of
the DCT and the DFT.
Figure 5.4 shows the DCT coefficients of the scan line shown in Fig.
5.2(a). It can be shown (the MATLAB code is included in the CD) that the
first 20 DCT coefficients have a total energy of 9.4×10⁵, which is about 83%
of the total energy. Comparing with the results obtained in Example 5.2, it
can be said that the DCT provides better compaction than the DFT. •
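A sketch of the comparison in Examples 5.2 and 5.4 is given below; the
image file name is illustrative. The 20 low frequency DFT coefficients are
taken from the two ends of the spectrum, as discussed in Example 5.2.

img = double(imread('lena.tif'));     % illustrative file name
x = img(100,:) - mean(img(100,:));    % zero-mean scan line (line# 100)
total = sum(x.^2);                    % total signal energy
Ef = abs(fft(x)).^2/length(x);        % DFT energies (unitary scaling)
Ec = dct(x).^2;                       % DCT energies
dft_frac = (sum(Ef(1:10)) + sum(Ef(end-9:end)))/total   % approx. 0.76
dct_frac = sum(Ec(1:20))/total                          % approx. 0.83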
Figure 5.3. Eight-point discrete cosine transform. a) The basis vectors, and
b) the frequency characteristics of each basis function. The frequency axis
is normalized to [0, π/2].
Figure 5.4. DCT coefficients of the 1-D signal shown in Fig. 5.2(a).
Although the DCT basis functions are discrete cosine functions, the DCT
is not simply the real part of the DFT. It is, however, related to the DFT:
it can be shown that the DCT of an N×1 sequence {f(0), f(1), ..., f(N-1)}
is related to the DFT of the 2N×1 mirror-extended sequence

{f(N-1), f(N-2), ..., f(1), f(0), f(0), f(1), ..., f(N-1)}    (5.22)
Figure 5.5. Different types of filters: lowpass, highpass, bandpass, and bandstop.
IIR filters can produce a steep transition band with only a few
coefficients; therefore, a good frequency response can be obtained at a low
implementation cost. However, stability is a prime concern in IIR filtering
since there is a feedback element. FIR filters, on the other hand, are more
robust and can provide a linear phase response. Hence, most practical
systems employ FIR filtering.
Figure 5.6. Impulse response of a lowpass filter. a) Ideal lowpass filter impulse
response, b) rectangular and Hamming windows, c) windowed impulse
responses. Note that an ideal filter has an infinitely long impulse response;
in Fig. (a) only a finite number of impulses are shown.
The ideal impulse responses are truncated to obtain FIR filters with a
finite number of taps (or impulses), as shown in Fig. 5.6(a). However, a
direct truncation results in a poor frequency response (high ripples in the
filter characteristics) due to the Gibbs phenomenon. Hence, the truncation is
generally done using a tapered window (e.g., triangular, Hanning, Hamming,
Kaiser) [9].
Figure 5.6(b) shows both rectangular (i.e., direct truncation) and Hamming
windows with 41 taps. The windowed filter impulse responses are shown in
Fig. 5.6(c), and the frequency characteristics of these windows are shown in
Fig. 5.7. It is observed that the ripples near the transition band are
significantly smaller for the Hamming window. However, the Hamming
window increases the width of the transition band (see Fig. 5.7(b)).
Typically, Hamming and Kaiser windows (the latter being an optimal
window) are used in FIR filter design.
Figure 5.7. Gain responses of lowpass filters with 41-tap rectangular and
Hamming windows. a) Gain response, b) gain response in dB.
The impulse response of an FIR filter with a desired cut-off frequency can
easily be designed using the MATLAB function fir1, which uses the
Hamming window by default. The fir1 function accepts the cut-off frequency
normalized with respect to the Nyquist frequency (i.e., half the sampling
frequency). For example, if we want to design a lowpass filter with a cutoff
frequency of 1600 Hz for an audio signal sampled at 8000 samples/sec, the
normalized cutoff frequency will be 1600/4000 or 0.4. The following
MATLAB code will provide 9-tap lowpass and highpass digital filters with a
normalized cutoff frequency of 0.4.

filter_lowpass = fir1(8, 0.4);    % 8th order, i.e., 9-tap filter
filter_highpass = fir1(8, 0.4, 'high');
For bandpass and bandstop filters, two cutoff frequencies (lower and
upper) are required. The following code will provide the impulse responses
of 9-tap bandpass and bandstop filters with cutoff frequencies [0.4, 0.8].

filter_bandpass = fir1(8, [0.4 0.8]);
filter_bandstop = fir1(8, [0.4 0.8], 'stop');
The filter coefficients obtained using the above code are shown in Table
5.2, and the corresponding gain characteristics are shown in Fig. 5.8. Note
that the gain characteristics are far from ideal due to the short filter
lengths. If the filter length is increased, better frequency characteristics
can be obtained, as shown in Fig. 5.9. However, this also increases the
computational complexity (i.e., the implementation cost) of the filtering
operation.
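The gain characteristics shown in Figs. 5.8 and 5.9 can be plotted with the
freqz function; for example, for the 9-tap lowpass filter of Table 5.2:

b = fir1(8, 0.4);             % 9-tap lowpass filter
[H, w] = freqz(b, 1, 512);    % frequency response at 512 points
plot(w/pi, abs(H))            % gain versus normalized frequency
xlabel('Normalized frequency'), ylabel('Gain')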
Table 5.2. Examples of a few 9-tap FIR digital filters. The lowpass and
highpass cut-off frequency is 0.4; the bandpass and bandstop cut-off
frequencies are [0.4, 0.8].

Filter | Coefficients
Filter Coefficients
Lowpass [-0.0061 -0.0136 0.0512 0.2657 0.4057 0.2657
0.0512 -0.0136 -0.0061]
Highpass [0.0060 0.0133 -0.0501 -0.2598 0.5951 -0.2598 -
0.0501 0.0133 0.0060]
Bandpass [0.0032 0.0478 -0.1802 -0.1363 0.5450 -0.1363 -
0.1802 0.0478 0.0032]
Bandstop [-0.0023 -0.0354 0.1336 0.1011 0.6061 0.1011
0.1336 -0.0354 -0.0023]
When a signal is passed through a bank of bandpass filters, the filter
outputs contain different frequency band information of the input signal.
Hence, the output coefficients corresponding to a particular filter are
collectively known as a subband. In other words, an N-band filterbank has N
subbands, each representing a part (1/N) of the entire frequency spectrum.
The bank of filters is designed in such a way that there is minimum overlap
between the passbands of the individual filters.
Figure 5.10. A typical subband processing system. The input signal is
decomposed by an analysis filterbank of bandpass filters; the subbands are
processed (e.g., encoding or feature extraction), and the output signal is
reconstructed by a synthesis filterbank.
Two-band Filterbank
A two-band FIR filterbank is shown in Fig. 5.11(a). Here, the input
signal x(n) is passed through the lowpass and highpass filters with impulse
responses h(n) and g(n), respectively. A typical frequency response of the
filters is shown in Fig. 5.11(b). The passband of the lowpass filter is
approximately [0, Fs/4], and that of the highpass filter is approximately
[Fs/4, Fs/2], where Fs is the sampling frequency. Because of the filter
characteristics, the bandwidths of the intermediate signals x0(n) and x1(n)
are approximately half the bandwidth of x(n). Therefore, x0(n) and x1(n)
can be decimated by 2, without violating the Nyquist criterion, to obtain
v0(n) and v1(n). The overall data-rate after the decimation is the same as
that of the input, but the frequency components have been separated into
two bands. After v0(n) and v1(n) are generated, they can be decomposed
again by another two-band filterbank to obtain a total of four bands. This
process can be applied repeatedly to obtain a larger number of bands.
Figure 5.11. Two-band filterbank. a) The analysis section: the input x(n) is
filtered by the lowpass filter h(n) and the highpass filter g(n), and the
outputs x0(n) and x1(n) are decimated by 2 to produce v0(n) and v1(n);
b) typical frequency responses of the two filters.
Once the decimated outputs of the filterbank are calculated, the subband
data is ready for further processing. For example, the coefficients may be
quantized to achieve compression. Assume that the processed signals are
represented by v̂0(n) and v̂1(n) for the lowpass and highpass bands,
respectively. The reconstruction of x(n) proceeds as follows. The signals
v̂0(n) and v̂1(n) are upsampled by 2 to obtain x̂0(n) and x̂1(n); note that
the upsampling (by 2) operation inserts a zero between every two samples.
The signals x̂0(n) and x̂1(n) are then passed through the synthesis filters
h̃(n) and g̃(n), respectively, and the outputs are added to obtain the
reconstructed signal x̂(n).
Figure 5.11(b) shows the frequency response of a special class of filters,
where the lowpass and highpass filters have responses symmetric about the
frequency Fs/4. Hence, these filters are known as quadrature mirror filters
(QMF). The lowpass and highpass filters do not strictly have a bandwidth of
Fs/4; hence, there will be aliasing energy in v0(n) and v1(n). However,
these filters have a special property: if the original subband coefficients
(i.e., with no quantization) are passed through the synthesis filterbank, the
aliasing in the synthesis bank will cancel the aliasing energy in v0(n) and
v1(n), and the input signal x(n) is reconstructed without any distortion.
Filterbanks can be divided into several categories. A filterbank is called
perfect reconstruction or reversible if the input sequence can be perfectly
reconstructed from the unquantized subband coefficients using the synthesis
filterbank; otherwise, it is called non-perfect reconstruction or
irreversible. Another important category is the paraunitary filterbank,
where the inverse transformation matrix is simply the transpose of the
forward transformation matrix (similar to an orthonormal transform).
An important question in filterbank analysis is how to design a
filterbank, and especially how to ensure that it is paraunitary. It can be
shown that a two-band filterbank will be paraunitary if Eqs. (5.24)-(5.27)
are satisfied (see Ref. [10] for more details).

Σ_{n=0}^{N-1} h(n) h(n - 2k) = δ(k)    (5.24)

The first condition establishes the paraunitary property. The other three
conditions provide a convenient way to calculate the three other filters
from a given filter. The above conditions state that if we can find a filter
h(n) such that it, together with the derived filters g(n), h̃(n), and g̃(n),
satisfies the four above equations, we are sure that the filterbank is
paraunitary. The following example illustrates how to obtain the other
filters from h(n).
• Example 5.5
It can be easily verified that Eq. (5.24) is satisfied for the h(n) given in
Eq. (5.28). As per the convention used in Fig. 5.11, h̃(n) is the lowpass
filter in the synthesis section. Eq. (5.25) states that h̃(0) = -g(0),
h̃(1) = g(1), h̃(2) = -g(2), and h̃(3) = g(3). In other words:
• Example 5.6
Consider the lowpass filter h(n) = [0.7071, 0.7071].
Using an approach similar to Example 5.5, the other filters can be obtained
as follows:
g(n) = [0.7071, -0.7071]
h̃(n) = [0.7071, 0.7071]
g̃(n) = [-0.7071, 0.7071] •
• Example 5.7
Consider the filterbank in Example 5.6. Calculate the output at various
stages of the filterbank for the input x(n) = [1, 2, 5, 7, 8, 1, 4, 5].
After the first stage filters:
The outputs x0(n) and x1(n) of the filters can be calculated by convolving
the input with h(n) and g(n). Since the input signal has a finite width,
circular convolution should be performed in order to achieve perfect
reconstruction at the output of the synthesis filterbank.

x0(n) = [2.1213, 4.9497, 8.4853, 10.6066, 6.3640, 3.5355, 6.3640, 4.2426]
x1(n) = [-0.7071, -2.1213, -1.4142, -0.7071, 4.9497, -2.1213, -0.7071, 2.8284]

Note that the wrapped-around convolution has been pushed to the end of
x0(n) and x1(n). For example, x0(0) = 1×0.7071 + 2×0.7071 = 2.1213,
whereas x0(7) = 5×0.7071 + 1×0.7071 = 4.2426.
The signals are then decimated by 2 to obtain v0(n) and v1(n). For the
synthesis section, v0(n) and v1(n) are upsampled by 2 and passed through
the synthesis filters [e.g., x̂0(0) = 0.0×0.7071 + 2.1213×0.7071 = 1.5,
x̂0(1) = 2.1213×0.7071 + 0.0×0.7071 = 1.5].
Reconstructed Output
The reconstructed signal is calculated by adding x̂0(n) and x̂1(n).
•
Note that the signal has been reconstructed perfectly because the original
output of the analysis filterbank was fed to the synthesis filterbank. If
even a single coefficient is changed marginally, the output will not be
identical to the input.
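For the Haar filters of Example 5.6, the convolve-decimate and
upsample-filter steps collapse into pairwise sums and differences, which
makes the perfect reconstruction property easy to verify. The sketch below
implements this special case directly rather than the general circular
convolution:

% Two-band Haar filterbank: analysis, synthesis, perfect reconstruction.
x = [1 2 5 7 8 1 4 5];  s = 1/sqrt(2);
v0 = s*(x(1:2:end) + x(2:2:end));      % lowpass band after decimation by 2
v1 = s*(x(1:2:end) - x(2:2:end));      % highpass band after decimation by 2
x_rec = zeros(1, length(x));
x_rec(1:2:end) = s*(v0 + v1);          % synthesis filtering, even samples
x_rec(2:2:end) = s*(v0 - v1);          % synthesis filtering, odd samples
max(abs(x - x_rec))                    % = 0: perfect reconstruction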
• Example 5.8
Consider a 12-point sequence f = [{2, 5, 7, 6}, {1, 3, 9, 4}, {6, 8, 5, 7}].
Calculate the 2nd DCT coefficient of each block of 4 samples. Can we
implement the DCT as a filterbank?
Using Eq. (5.18), the 2nd coefficient of each block can be calculated as
{-3.1543, -3.5834, 0.1585}. The DCT can indeed be implemented using a
filterbank: the coefficients can be calculated by passing the input sequence
through a digital filter with impulse response h = α(1)·[cos(π/8), cos(3π/8),
cos(5π/8), cos(7π/8)], which is the 2nd basis function (see Example 5.3).
Note that when the digital filter is in operation, each block of four input
samples produces one sample at the output of the 2nd filter. •
The filter characteristics of the 8-point DCT were shown in Fig. 5.3(b). It
was observed that the neighboring filters have substantial overlap, and
hence may not be efficient in many applications.
The DWT coefficients at scale j+1 can be calculated from the coefficients
at scale j as follows:

c_{j+1,k} = Σ_m h(m - 2k) c_{j,m},  0 ≤ k ≤ N/2^(j+1) - 1    (5.30)

d_{j+1,k} = Σ_m g(m - 2k) c_{j,m},  0 ≤ k ≤ N/2^(j+1) - 1    (5.31)

where
c_{p,q} = lowpass coefficient of the pth scale at the qth location,
d_{p,q} = highpass coefficient of the pth scale at the qth location.
There are several points to be noted:
i) The input sequence can be considered as the DWT coefficients at the
0th scale.
ii) The lowpass DWT coefficients c_{1,k} can be obtained by convolving
c_{0,m} with h[-n], and decimating the convolution output by a factor of
2 (because of the "2k" factor). Similarly, d_{1,k} can be obtained by
convolving c_{0,m} with g[-n], and decimating the output by a factor of 2.
The numbers of c_{1,k} and d_{1,k} coefficients are N/2 each.
iii) c_{2,k} and d_{2,k} can be obtained from c_{1,k} using a step similar
to step (ii). The numbers of c_{2,k} and d_{2,k} coefficients are N/4 each.
Higher scale coefficients can be obtained in a similar manner.
iv) Only the lowpass coefficients of a given stage are used for further
decomposition (there is no d_{j,m} on the right hand side of Eqs. (5.30)
and (5.31)).
Eqs. (5.30) and (5.31) provide a simple algorithm for calculating the
(j+1)-scale wavelet coefficients from the j-scale coefficients. The same set
of equations can be used to calculate the (j+2)-scale coefficients from the
(j+1)-scale scaling coefficients. Figure 5.12(a) shows the schematic of
computing the DWT coefficients of lower resolution scales from the wavelet
coefficients of a given scale by passing them through a lowpass filter (LPF)
and a highpass filter (HPF), and by decimating the filters' outputs by a
factor of two. This recursive calculation of the DWT coefficients is known
as the tree algorithm [6].
Figure 5.12. Calculation of wavelet transform coefficients using the tree
algorithm [6]. a) Signal decomposition using analysis filters, and b) signal
reconstruction using synthesis filters.
The 8×8 H matrix on the right side of Eq. (5.34) can be considered as the
forward transformation matrix of the wavelet transform. This representation
is comparable to other transforms, such as the Fourier (Eq. (5.7)) and DCT
(Eq. (5.18)) matrices. If the transform is unitary, the H matrix should be
unitary, i.e., H⁻¹ = Hᵀ.
Table 5.3 shows four orthogonal wavelets from a wavelet family first
constructed by Daubechies [7]. The first wavelet is also known as Haar
wavelet. Since the wavelets are orthogonal, the corresponding g[n] can be
calculated using Eq. (5.33).
Table 5.3. Minimum phase Daubechies wavelets for N = 2, 4, 8, and 12 taps.

N=2:  h[0] = 0.70710678, h[1] = 0.70710678
N=4:  h[0] = 0.482962913144, h[1] = 0.836516303737,
      h[2] = 0.224143868042, h[3] = -0.129409522551
N=8:  h[0] = 0.230377813308, h[1] = 0.714846570552,
      h[2] = 0.630880767939, h[3] = -0.027983769416,
      h[4] = -0.187034811719, h[5] = 0.030841381835,
      h[6] = 0.032883011666, h[7] = -0.010597401785
N=12: h[0] = 0.111540743350, h[1] = 0.494623890398,
      h[2] = 0.751133908021, h[3] = 0.315250351709,
      h[4] = -0.226264693965, h[5] = -0.129766867567,
      h[6] = 0.097501605587, h[7] = 0.027522865530,
      h[8] = -0.031582039318, h[9] = 0.000553842201,
      h[10] = 0.004777257511, h[11] = -0.001077301085
• Example 5.9
Consider a 4-point sequence f = [2, 5, 7, 6]. Decompose the sequence using
the 2-tap wavelet given in Table 5.3, which is also known as the Haar
wavelet. Reconstruct the input sequence from the Haar coefficients.
The filter coefficients of the Haar wavelet are h = [0.707, 0.707]. The 1st
stage (i.e., scale 1) Haar coefficients can be calculated as:

[c1,0]   [h[0]   h[1]   0      0   ] [2]   [0.707   0.707   0       0    ] [2]   [ 4.95]
[c1,1] = [0      0      h[0]   h[1]] [5] = [0       0       0.707   0.707] [5] = [ 9.19]
[d1,0]   [h[1]  -h[0]   0      0   ] [7]   [0.707  -0.707   0       0    ] [7]   [-2.12]
[d1,1]   [0      0      h[1]  -h[0]] [6]   [0       0       0.707  -0.707] [6]   [ 0.71]

The scale-2 coefficients are calculated from the scale-1 lowpass
coefficients:

[c2,0]   [h[0]   h[1]] [c1,0]   [0.707   0.707] [4.95]   [ 10]
[d2,0] = [h[1]  -h[0]] [c1,1] = [0.707  -0.707] [9.19] = [ -3]

For reconstruction, the scale-1 lowpass coefficients are first recovered
from the scale-2 coefficients:

[c1,0]   [0.707   0.707] [10]   [4.95]
[c1,1] = [0.707  -0.707] [-3] = [9.19]

and the input sequence is then recovered from the scale-1 coefficients:

[f(0)]   [0.707   0      0.707   0    ] [ 4.95]   [2]
[f(1)] = [0.707   0     -0.707   0    ] [ 9.19] = [5]
[f(2)]   [0       0.707  0       0.707] [-2.12]   [7]
[f(3)]   [0       0.707  0      -0.707] [ 0.71]   [6]
•
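For the Haar filters, the tree algorithm of Fig. 5.12 reduces to repeated
pairwise sums and differences of the lowpass coefficients. The following
minimal sketch reproduces the numbers of Example 5.9:

f = [2 5 7 6];  s = 1/sqrt(2);
c1 = s*(f(1:2:end) + f(2:2:end));   % scale-1 lowpass:  [4.95  9.19]
d1 = s*(f(1:2:end) - f(2:2:end));   % scale-1 highpass: [-2.12  0.71]
c2 = s*(c1(1) + c1(2));             % scale-2 lowpass:  10
d2 = s*(c1(1) - c1(2));             % scale-2 highpass: -3
coeffs = [c2 d2 d1]                 % DWT coefficients: [10  -3  -2.12  0.71]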
Figure 5.13(a) shows the basis functions corresponding to the 8-point Haar
wavelet (h = [0.707, 0.707]), and the frequency responses of these basis
functions are shown in Fig. 5.13(b). It is observed that the four basis
functions corresponding to n = 4, 5, 6, and 7 have a support of only two
samples, and hence can be considered compact in the time domain. These
basis functions are shifted versions of each other, and have identical gain
responses; they correspond to the highest frequency subband, with the
highest time resolution but the poorest frequency resolution. Next consider
the basis functions for n = 2 and 3. These functions have a support width of
4 samples, and have moderate time and frequency resolution. Finally, the
basis functions corresponding to n = 0 and 1 have a support width of 8
samples; these functions have poorer time resolution, but better frequency
resolution. This adaptive time-frequency resolution is useful in the
analysis of nonstationary signals (such as audio and images).
A 2-D transform pair for an N×N image f(m,n) can be written as

θ(k,l) = Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} u(m,n;k,l) f(m,n),  0 ≤ k,l ≤ N-1    (5.35)

f(m,n) = Σ_{k=0}^{N-1} Σ_{l=0}^{N-1} v(m,n;k,l) θ(k,l),  0 ≤ m,n ≤ N-1    (5.36)

where u(·) and v(·) are the forward and inverse transform kernels.
An important property of unitary transforms is energy preservation:

Σ_{m,n} |f(m,n)|² = Σ_{k,l} |θ(k,l)|²    (5.37)

Eq. (5.37) states that a unitary transform preserves the signal energy or,
equivalently, the length of the vector f in the N²-dimensional vector space.
A unitary transformation can be considered simply as a rotation of the
vector f in the N²-dimensional vector space. In other words, a unitary
transform rotates the basis coordinates, and the components of θ (i.e., the
transform coefficients) are the projections of f on the new basis
coordinates.
Figure 5.13. Eight-point Haar wavelet. a) The basis vectors, and b) the
frequency responses of the basis functions.
The left side of Eq. (5.38) is the total squared error between the original
and the reconstructed image, whereas the right side is the total squared
error between the original and noisy coefficients. The relationship
basically states that the total squared reconstruction error in the pixel
domain is equal to the total squared quantization error in the transform
domain. The squared error will be zero if and only if all N² original
coefficients are used for the reconstruction. This property has been used
extensively in the design of transform-based data compression.
iii) Separable Basis Images
In most cases of practical interest, the 2-D kernels are separable and
symmetric. Therefore, the 2-D kernel can be expressed as the product of two
1-D orthogonal basis functions. If the 1-D transform operator is denoted by
Φ, the forward and inverse transformations can be expressed as

θ = Φ f Φᵀ    (5.39)

f = Φ*ᵀ θ Φ*    (5.40)

The above formulations reveal that the image transformation can be done
in two stages: by taking the unitary transform Φ of each row of the image
array, and then by applying the transformation Φ* to each column of the
intermediate result.
The 2-D DFT of an N×N image f(m,n) is defined as

F(k,l) = (1/N) Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} f(m,n) W_N^{km} W_N^{ln},  0 ≤ k,l ≤ N-1    (5.41)

f(m,n) = (1/N) Σ_{k=0}^{N-1} Σ_{l=0}^{N-1} F(k,l) W_N^{-km} W_N^{-ln},  0 ≤ m,n ≤ N-1    (5.42)
• Example 5.10
Calculate the DFT, and plot the amplitude spectrum, of the following
32×32 image:

f(m,n) = 1 for 15 ≤ m ≤ 17 and 8 ≤ n ≤ 24, and 0 otherwise.
The image is shown in Fig. 5.14(a); it is basically a rectangular pulse.
The DFT is calculated using MATLAB, and the amplitude spectrum is
shown in Fig. 5.14(b). The spectrum has been centralized (i.e., the DC
frequency is in the middle) for clarity. It is observed that the amplitude
spectrum is a 2-D sinc pattern, which is expected. Note that the rectangle is
wider in the horizontal direction than in the vertical direction. This has
the opposite effect in the frequency domain: the sinc pattern is narrower
along the horizontal frequency axis and wider along the vertical frequency
axis; in other words, the spectrum has better frequency resolution in the
horizontal direction. This is generally true for all transforms, due to the
uncertainty principle: an improved time (or space) resolution degrades the
frequency resolution, and vice versa. •
Figure 5.14. Discrete Fourier transform of a rectangle. (a) The 2-D input
pattern, and (b) its centralized amplitude spectrum. Note that the spectrum
has been centralized (the DC frequency is in the middle), and zero padding
has been applied to make the spectrum smooth.
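The spectrum of Fig. 5.14 can be reproduced with the built-in fft2
function; fftshift centralizes the DC term, and the zero padding (to
128×128 here) smooths the displayed spectrum:

img = zeros(32);                          % the 32x32 image of Example 5.10
img(16:18, 9:25) = 1;                     % rectangle (MATLAB indices are 1-based)
S = fftshift(abs(fft2(img, 128, 128)));   % zero-padded, centralized spectrum
mesh(S)                                   % displays the 2-D sinc-like pattern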
i) Separability
It can be easily shown that the 2-D DFT is separable. Therefore, the 2-D
DFT of an image can be calculated in two simple steps:
a) Calculate the 1-D DFT of each row of the image; each row is replaced
by its DFT coefficients.
b) Calculate the 1-D DFT of each column of the intermediate result.
Figure 5.15. 2-D Fourier transform calculation using the separable approach.
• Example 5.11
Using the separable approach, calculate the 2-D DFT of the image:

[2  3  -4  7]
[1  5   2  8]
[2  4   9  3]
[6  3   5  0]

To calculate the 2-D DFT, the 1-D transform of each row is first
calculated. The 1-D DFT of [2, 3, -4, 7] is [4, 3+j2, -6, 3-j2]. Similarly,
the 1-D DFTs of the other rows can be calculated, and the coefficient
matrix after the row transforms will be:

1 [ 8     6+j4   -12     6-j4 ]
- [ 16   -1+j3   -10    -1-j3 ]
2 [ 18   -7-j     4     -7+j  ]
  [ 14    1-j3    8      1+j3 ]

After the column transforms, the final DFT coefficient matrix will be

         1 [ 56       -1+3j   -10       -1-3j ]
F(k,l) = - [ -10-2j   19+7j   -16+18j    7-3j ]
         4 [ -4       -1+3j    -6       -1-3j ]
           [ -10+2j    7+3j   -16-18j   19-7j ]

The results can be verified easily by direct calculation using Eq. (5.41). •
ii) Symmetry
For a real-valued input image, the DFT coefficients satisfy a conjugate
symmetry property. Consider the 2-D image in Example 5.11: the symmetry
property can easily be verified from the DFT coefficient matrix shown there.
Since natural images have real-valued pixels, this property is applicable
most of the time.
iii) Energy Compaction
The DFT provides significant energy compaction for most images. Figure
5.16 shows the Lena image and its DFT amplitude spectrum. It is observed
that most of the energy in the spectrum is concentrated in the four corners,
which represent the lower frequencies.
Figure 5.16. Fourier spectrum of image data. a) Lena image, and b) its
amplitude spectrum. The white pixels in (b) represent higher amplitude
coefficients.
The 2-D DCT of an N×N image f(m,n) is defined as

θ(k,l) = α(k)α(l) Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} f(m,n) cos[(2m+1)kπ / 2N] cos[(2n+1)lπ / 2N]    (5.44)
• Example 5.12
Using the separable approach, calculate the 2-D DCT of the 4×4 image of
Example 5.11.
Applying the 1-D DCT (Eq. (5.18)) to each row, the coefficient matrix
after the row transforms will be:

[ 4   -1.37    5   -5.92 ]
[ 8   -3.76    1   -3.85 ]
[ 9   -2.01   -4    3.00 ]
[ 7    3.38   -1    2.93 ]

After applying the 1-D DCT to each column, the final DCT coefficient
matrix will be

[ 14.00   -1.88    0.50   -1.93 ]
[ -2.23   -3.58    5.27   -7.64 ]
[ -3.00    3.89    3.50   -1.07 ]
[ -0.16   -0.14   -1.64    2.08 ]

The results can be verified easily by direct calculation using Eq. (5.44). •
As in the 1-D case, the 2-D DCT compacts the energy of typical images
very efficiently. Fig. 5.18 shows the DCT spectrum of the Lena image. It is
observed that most of the energy is concentrated in the low frequency region
(upper left corner). It has been shown that the DCT provides near optimal
energy compaction performance for most natural images [3]. As a result, it
has been accepted as the transform kernel in most existing image and video
coding standards. More details about DCT-based coding will be provided in
Chapters 7 and 8.
Figure 5.17. 8×8 DCT basis images.
Figure 5.18. DCT spectrum of the Lena image.
Figure 5.19. Two-dimensional wavelet transform. a) 2-D DWT by 1-D row
and column transforms, and b) equivalent block schematic. L: output of the
lowpass filter after decimation; H: output of the highpass filter after
decimation.
• Example 5.13
Consider the Lena image of Example 5.2. i) Calculate the two-stage
wavelet transform of the image using the Daub 4-tap wavelet, ii) display the
transform coefficients, iii) calculate the root mean square energy (RMSE)
of the pixels of the different bands, and iv) calculate the percentage of
the total energy compacted in the LL band.
As in the 1-D case (see Fig. 5.12), multi-stage 2-D wavelet decomposition
can be performed by decomposing the LL band recursively. The two-stage
DWT calculation program is provided in the accompanying CD. Fig. 5.20(a)
shows the 2-level wavelet decomposition of the Lena image, and the RMSE
values of the different bands are shown in Fig. 5.20(b). It can be easily
shown that about 99.5% of the total energy is compacted in the lowpass
subband. •
Several features can be noted in Fig. 5.20. First, all subbands provide
spatial and frequency information simultaneously; the spatial structure of
the image is visible in all subbands. A low resolution image can be observed
in the lowest resolution lowpass band, while the highpass bands provide the
detail (i.e., edge) information about the image at various scales. Secondly,
the wavelet decomposition provides a multiresolution representation of the
image. A coarse image is already available in the lowest resolution band; a
higher resolution image can be obtained by calculating the inverse transform
of the four lowest resolution subbands, and the full resolution image can be
obtained by calculating the inverse transform of all seven subbands.
Thirdly, the coefficients of the highpass subbands have small magnitudes.
Therefore, superior compression of images can be achieved by quantizing the
wavelet coefficients.
Figure 5.20. Wavelet decomposition of the Lena image. a) Magnitude of the
wavelet coefficients (the highpass coefficients have been magnified to show
the details), and b) root mean square energy of the different wavelet bands
(LL band: 530.4; highpass bands: 26.4, 15.7, 11.3, 8.4, 5.4, and 3.4).
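One stage of the 2-D DWT of Fig. 5.19 can be sketched by applying the 1-D
analysis step first to the rows and then to the columns. For brevity, the
function below (to be saved as haar2d.m) uses the Haar filters; the Daub
4-tap filters of Table 5.3 can be substituted with a general
convolve-and-decimate step.

function [LL, HL, LH, HH] = haar2d(img)
% One-stage 2-D Haar DWT, returning the four subbands of Fig. 5.19.
s = 1/sqrt(2);
L  = s*(img(:,1:2:end) + img(:,2:2:end));   % rows: lowpass + decimation
H  = s*(img(:,1:2:end) - img(:,2:2:end));   % rows: highpass + decimation
LL = s*(L(1:2:end,:) + L(2:2:end,:));       % columns of L: lowpass
LH = s*(L(1:2:end,:) - L(2:2:end,:));       % columns of L: highpass
HL = s*(H(1:2:end,:) + H(2:2:end,:));       % columns of H: lowpass
HH = s*(H(1:2:end,:) - H(2:2:end,:));       % columns of H: highpass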
REFERENCES
1. A. K. Jain, Fundamentals of Digital Image Processing, Prentice Hall, 1989.
2. E. O. Brigham, The Fast Fourier Transform, Prentice Hall, 1974.
3. K. R. Rao and R. C. Yip, The Transform and Data Compression Handbook, CRC
Press, New York, 2000.
4. P. P. Vaidyanathan, Multirate Systems and Filterbanks, Prentice Hall, 1992.
5. C. S. Burrus, R. A. Gopinath, and H. Guo, Introduction to Wavelets and Wavelet
Transforms, Prentice Hall, 1998.
6. S. G. Mallat, "A theory for multiresolution signal decomposition: the wavelet
representation," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 11,
pp. 674-693, July 1989.
7. M. K. Mandal, Wavelet Theory and Implementation, Chapter 3 of M.A.Sc. Thesis,
Wavelets for Image Compression, University of Ottawa, 1995 (included in the CD).
8. R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison Wesley, 1993.
9. B. P. Lathi, Signal Processing and Linear Systems, Berkeley Cambridge Press, 1998.
QUESTIONS
1. Show that the 4×4 forward DFT transformation matrix given in Eq. (5.11) is unitary.
2. Show that the 4×4 forward DCT matrix given in Eq. (5.20) is unitary.
3. Show that the 4×4 forward DWT (Haar) transformation matrix given in Example 5.9
is unitary.
4. Calculate (by hand or using a calculator) the DFT, DCT and Haar coefficients of the
signal f(n) = [1, 1, 0, 0]. Calculate the total energy in the signal domain, and in the
frequency (i.e., DFT, DCT, and Haar) domains. Show that the total energies in all
three transform domains are individually identical to the total signal energy.
5. Calculate the DCT of the sequence f(n) = [0.5, -0.5, -0.5, 0.5]. You will find that
only one DCT coefficient is non-zero. Can you explain why this is so?
[Hint: show that the signal f(n) is one of the basis vectors.]
6. Using MATLAB, calculate the DFT coefficients of the following signal:
f(n) = 1 for 0 ≤ n ≤ 4, and 0 for 5 ≤ n ≤ 15.
Plot the amplitude spectrum (in this particular case you can verify that the spectrum
follows a sinc pattern). Show that the DFT preserves the signal energy.
7. Repeat the above problem for the DCT.
8. Repeat the above problem for the 2-stage Daub-4 wavelet.
9. Calculate the total energy of the 4 largest (in absolute value) coefficients in each of
the above three cases. Compare the energy compaction performance of the three
transforms (DFT, DCT, and DWT) in this particular case.
10. Can we consider the block transforms as a special case of filterbanks?
11. Compare and contrast orthogonal transforms and subband decomposition.
12. Show that the Daub-4 wavelet filter coefficients satisfy Eq. (5.24).
13. Write down the 8×8 forward transform matrix for the Haar wavelet. Show that the
matrix is unitary.
14. Using the separable approach, calculate the DFT and DCT coefficients of the
following 3×3 image.
Chapter 6
Text Representation and Compression

Table 6.1. ASCII character set. DE: decimal equivalent; CH: character.
DE CH    DE CH      DE CH    DE CH    DE CH
0  NUL   26 SUB     52 4     78 N     104 h
1  SOH   27 ESC     53 5     79 O     105 i
2  STX   28 FS      54 6     80 P     106 j
3  ETX   29 GS      55 7     81 Q     107 k
4  EOT   30 RS      56 8     82 R     108 l
5  ENQ   31 US      57 9     83 S     109 m
6  ACK   32 Space   58 :     84 T     110 n
7  BEL   33 !       59 ;     85 U     111 o
8  BS    34 "       60 <     86 V     112 p
9  TAB   35 #       61 =     87 W     113 q
10 LF    36 $       62 >     88 X     114 r
11 VT    37 %       63 ?     89 Y     115 s
12 FF    38 &       64 @     90 Z     116 t
13 CR    39 '       65 A     91 [     117 u
14 SO    40 (       66 B     92 \     118 v
15 SI    41 )       67 C     93 ]     119 w
16 DLE   42 *       68 D     94 ^     120 x
17 DC1   43 +       69 E     95 _     121 y
18 DC2   44 ,       70 F     96 `     122 z
19 DC3   45 -       71 G     97 a     123 {
20 DC4   46 .       72 H     98 b     124 |
21 NAK   47 /       73 I     99 c     125 }
22 SYN   48 0       74 J     100 d    126 ~
23 ETB   49 1       75 K     101 e    127 DEL
24 CAN   50 2       76 L     102 f
25 EM    51 3       77 M     103 g
Table 6.2. ISO 8859-1 (Latin-1) character set, an extension of ASCII. The
codes represented by EMP (empty) have not been defined, and are left for
future expansion.

128 EMP   154 EMP   180 ´   206 Î   232 è
129 EMP   155 EMP   181 µ   207 Ï   233 é
130 EMP   156 EMP   182 ¶   208 Ð   234 ê
131 EMP   157 EMP   183 ·   209 Ñ   235 ë
132 EMP   158 EMP   184 ¸   210 Ò   236 ì
133 EMP   159 EMP   185 ¹   211 Ó   237 í
134 EMP   160 EMP   186 º   212 Ô   238 î
135 EMP   161 ¡     187 »   213 Õ   239 ï
136 EMP   162 ¢     188 ¼   214 Ö   240 ð
137 EMP   163 £     189 ½   215 ×   241 ñ
138 EMP   164 ¤     190 ¾   216 Ø   242 ò
139 EMP   165 ¥     191 ¿   217 Ù   243 ó
140 EMP   166 ¦     192 À   218 Ú   244 ô
141 EMP   167 §     193 Á   219 Û   245 õ
142 EMP   168 ¨     194 Â   220 Ü   246 ö
143 EMP   169 ©     195 Ã   221 Ý   247 ÷
144 EMP   170 ª     196 Ä   222 Þ   248 ø
145 EMP   171 «     197 Å   223 ß   249 ù
146 EMP   172 ¬     198 Æ   224 à   250 ú
147 EMP   173 -     199 Ç   225 á   251 û
148 EMP   174 ®     200 È   226 â   252 ü
149 EMP   175 ¯     201 É   227 ã   253 ý
150 EMP   176 °     202 Ê   228 ä   254 þ
151 EMP   177 ±     203 Ë   229 å   255 ÿ
152 EMP   178 ²     204 Ì   230 æ
153 EMP   179 ³     205 Í   231 ç
• Example 6.1
Consider a book that consists of 800 pages. Each page has 40 lines, and
each line has on average 80 characters (including spaces). If the book is
stored in digital form, how much storage space is required?
The book contains approximately 800×40×80 or 2.56 million characters.
Assuming that each character is represented by eight bits (or one byte), the
book will require 2.56 million bytes or 2.44 Mbytes. •
In the above example, it was observed that a typical document may
require several Mbytes of storage space. The important question is: can the
same document be stored using less space? This is the topic of the next
section, where it will be shown that text can be compressed without losing
any information. In fact, most readers are likely to be aware of various
tools, such as WinZip (for the Windows OS) and compress and gzip [3] (for
the Unix OS), that employ compression techniques to save storage space.
Σ_{k=0}^{K-1} p[k] = 1    (6.2)
• Example 6.2
Consider the string of characters X = "aaabbbbbbbccaaabbcbbbb".
Determine the alphabet; calculate the histogram and probability density
function of the characters. Show that relation (6.2) is satisfied.
The alphabet of the above source contains three characters, {a, b, c}. In
the character string, there are 6 a's, 13 b's, and 3 c's. Hence, the
histogram will be h(a) = 6, h(b) = 13, and h(c) = 3. Since there are a total
of 22 characters, the probability density function will be p(a) = 6/22 ≈ 0.27,
p(b) = 13/22 ≈ 0.59, and p(c) = 3/22 ≈ 0.14. The sum of the three
probabilities is 1, and hence relation (6.2) is satisfied. •
In Eq. (6.4), the base of the logarithm has been taken as 2 in order to
express the information in bits per symbol.
Although each symbol can be analyzed individually, it is often useful to
deal with large blocks of symbols rather than individual symbols. The
output of a source S can be grouped into blocks of N symbols, and the
symbols can be assumed to be generated by a source S^N with an alphabet of
size K^N. The source S^N is called the N-th extension of the source S. A
total of K^N possible symbol patterns can be generated by such a source.
Let the probability of a specific pattern s be given by p(s). The entropy,
i.e., the average information, of the source S^N can be calculated as

H(S^N) = -Σ_s p(s) log₂ p(s)    (6.5)

For a memoryless source, the entropy of the N-th extension is simply

H(S^N) = N H(S)    (6.6)

The entropy of a source S with an alphabet of size K is bounded as

0 ≤ H(S) ≤ log₂ K    (6.7)

The left side equality holds if p[k] is zero for all source symbols s_k
except one, in which case the source is totally predictable. The right side
equality holds when every source symbol s_k has the same probability.
redundancy of the source is defined as:
redundancy = log2 K - H(S) (6.8)
Eq. (6.7) states that if a source has an alphabet of size K, the maximum
entropy the source can have is log2 K . If the entropy is equal to log2 K , the
source is said to have zero redundancy. In most cases, the information
contains dome redundancy.
• Example 6.4
Consider an information source with the alphabet {a, b, c, d}. The symbols
have equal probabilities of occurrence. Calculate the entropy and the
redundancy of the source. What is the average bit-rate required to transmit
the symbols generated by the source? Design a suitable coder to encode the
symbols.
Since p(a) = p(b) = p(c) = p(d) = 0.25, the entropy of the source will be
equal to -4×0.25×log₂ 0.25, or 2 bits/symbol. The redundancy is zero since
the entropy is equal to log₂ K. The average bit-rate required to transmit
the symbols is 2 bits/symbol. The coder design in this case is trivial:
assign a: 00, b: 01, c: 10, and d: 11.
Note that it might be possible to encode a four-symbol source using less
than 2 bits/symbol if the entropy is less than 2 bits/symbol. •
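The entropy and redundancy calculations of Eqs. (6.4) and (6.8) take only a
couple of lines in MATLAB; for the source of Example 6.4:

p = [0.25 0.25 0.25 0.25];          % symbol probabilities
H = -sum(p.*log2(p))                % entropy: 2 bits/symbol
redundancy = log2(length(p)) - H    % Eq. (6.8): 0 for this source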
• Example 6.5
What is the average bit-rate required for a three-symbol source? What is
the minimum bit-rate required to encode the three-symbol source considered
in Example 6.2?
A fixed-length coder requires 2 bits/symbol (since ⌈log₂ 3⌉ = 2). The
minimum bit-rate is given by the entropy; for the source in Example 6.2,
this is about 1.35 bits/symbol. •

The Huffman coding technique assigns a variable length code (VLC) to each
source symbol such that the number of bits in the code is approximately
inversely proportional to the probability of occurrence of that symbol.
Table 6.3 shows the alphabet of a source along with the probabilities of
the individual symbols. Since there are six symbols, straightforward pulse
code modulation (PCM) would require 3 bits per symbol. However, 3 bits can
represent 8 symbols, so the bits would be under-utilized: an information
source with six symbols requires at most log₂ 6 or 2.58 bits. Moreover,
since the probabilities of occurrence of the symbols are not uniform, there
is further scope for improvement. The entropy of the source can be
calculated using Eq. (6.4):

H = -(0.3 log₂ 0.3 + 2×0.2 log₂ 0.2 + 3×0.1 log₂ 0.1) = 2.44 bits/symbol
Table 6.3. Alphabet of a six-symbol source.

Symbol | Probability
e      | 0.3
a      | 0.2
c      | 0.2
b      | 0.1
d      | 0.1
!      | 0.1

The Huffman code is designed in two phases. First the source is reduced
step by step, merging the two smallest probabilities at each step, as shown
in Fig. 6.1.

Symbol | Prob. | 1    | 2    | 3    | 4
e      | 0.3   | 0.3  | 0.3  | 0.4  | 0.6
a      | 0.2   | 0.2  | 0.3  | 0.3  | 0.4
c      | 0.2   | 0.2  | 0.2  | 0.3  |
b      | 0.1   | 0.2  | 0.2  |      |
d      | 0.1   | 0.1  |      |      |
!      | 0.1   |      |      |      |

Figure 6.1. Huffman source reduction process.
Code Assignment
Step 1: After the source reduction process is over, the Huffman code
assignment process starts (see Fig. 6.2). The first two symbols are assigned
the trivial codewords "0" and "1". In this case, we assign "0" to the symbol
with probability 0.6, and "1" to the symbol with probability 0.4.
Step 2: The probability 0.6 was obtained by merging the two probabilities
0.3 and 0.3. Hence, the codes assigned to these probabilities are "0
followed by 0" and "0 followed by 1". After this assignment, there are three
symbols with probabilities 0.4, 0.3, and 0.3, with codes 1, 00, and 01,
respectively.
Step 3: The probability 0.4 was obtained by merging two probabilities 0.2
and 0.2. In this step, the code assigned to these probabilities (0.2 and 0.2)
are "1 followed by 0", and "1 followed by 1." After this assignment, there
are four symbols with probabilities 0.3, 0.3, 0.2 and 0.2 with codes 00, 01,
10, and 11, respectively.
Step 4: The second probability 0.3 was obtained by merging the
probabilities 0.2 and 0.1. In this step, the codes assigned to these
probabilities (0.2 and 0.1) are "01 followed by 0" and "01 followed by 1".
After this assignment, there are five symbols with probabilities 0.3, 0.2,
0.2, 0.2, and 0.1 with codes 00, 10, 11, 010, and 011, respectively.
Step 5: Finally, the probability 0.2 with code 010 was obtained by merging
the two probabilities 0.1 and 0.1. The codes assigned to these two
probabilities are "010 followed by 0" and "010 followed by 1". The final
codes are therefore e: 00, a: 10, c: 11, b: 011, d: 0100, and !: 0101.
Sym. | Prob. | Code
e    | 0.3   | 00
a    | 0.2   | 10
c    | 0.2   | 11
b    | 0.1   | 011
d    | 0.1   | 0100
!    | 0.1   | 0101

Figure 6.2. Huffman code assignment process.
The average length of the Huffman code is R = (0.3 + 0.2 + 0.2)×2 + 0.1×3
+ 0.2×4 = 2.5 bits/symbol, which is close to the first order entropy (2.44
bits/symbol) of the source.
• Example 6.6
i) Using the Huffman code designed above (see Fig. 6.2), calculate the
Huffman code for the text string "baecedeac!".
ii) A string of characters was Huffman coded, and the code is
"00011000100110000010111". Determine the text string. Is the decoding
unique?
The Huffman code for the text string "baecedeac!" is
"0111000110001000010110101". Note that 10 symbols (or characters) are
compressed to 25 bits, so an average of 2.5 bits is used to represent each
symbol.
The given code can be parsed as "00, 011, 00, 0100, 11, 00, 00, 0101, 11";
in other words, the decoded string is "ebedcee!c". Decoding of a Huffman
code is unique for a given Huffman table; one cannot obtain a different
output string. •
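The encoding in part i) of Example 6.6 can be verified with a simple lookup
table holding the codewords of Fig. 6.2:

code = containers.Map({'a','b','c','d','e','!'}, ...
                      {'10','011','11','0100','00','0101'});
msg  = 'baecedeac!';
bits = '';
for ch = msg
    bits = [bits code(ch)];    % append the codeword for each symbol
end
bits                           % 0111000110001000010110101 (25 bits)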
Note that the Huffman code has been used extensively since its
introduction by Huffman in 1952. However, it has several limitations. First,
it requires at least one bit to represent each symbol occurrence; therefore,
we cannot design a Huffman code that assigns, say, 0.5 bits to a symbol.
Hence, if the entropy of a source is less than 1 bit/symbol, Huffman coding
is not efficient. Even when the entropy is greater than one, in most cases
the Huffman code requires a higher average bit-rate than the entropy (see
Problem 16). Second, it cannot efficiently adapt to changing source
statistics. Although dynamic Huffman coding schemes have been developed to
address these issues, these schemes are difficult to implement.
Consider the set of all possible messages s_m that a source can emit, with
probabilities p(s_m) satisfying

Σ_m p(s_m) = 1    (6.9)
If one wishes to encode an entire message at once, a code table can be
generated using the Huffman coding technique, with the length of each
codeword approximately equal to -log₂ p(s_m). Since there is a large number
of code words, the individual p(s_m)'s will be very small; therefore, the
code words will be significantly longer than those in the example shown in
Table 6.3.
Encoding an entire message at once increases the efficiency of the coder:
the code will provide an average bit-rate close to the entropy even if the
entropy is much less than one bit/symbol. However, generating a Huffman
table of K^m messages would be very difficult, if not almost impossible.
Arithmetic coding provides a convenient way to achieve this goal, and
provides optimal performance. The coding principle is explained below.
Assume that the p(s",) of different possible messages are known. In Eq.
(6.9), it has been shown that their overall sum is equal to 1. Divide the
interval [0,1) in the real axis into K m sub-intervals, such that the width of
each sub-interval corresponds exactly to the probability of occurrence of a
message. Then assume that the sub-intervals are [LpR/), V=1,2, .... .K m ).
Once the subintervals are determined, any real number in that sub-interval
will correspond to a unique message, and hence the entire input message can
be decoded. Since the sub-intervals are non-overlapping, a codeword for
S m can be constructed by expanding any point in the interval in binary form
and retaining only the n/ =I-Iog2 p(sm) l bits after the decimal point.
Consequently, the number of bits required to represent the message can
differ from the source entropy by a maximum of one bit.
The above procedure might seem too difficult to implement. However, a
simple recursive technique can be employed to implement such a scheme.
To explain the coding technique with a concrete example, consider the
information source given in Table 6.3.
Step 1: List the symbols and their probabilities (not in any particular
order) as shown in Table 6.4.
Step 2: Divide the interval [0,1) into 6 (=K) sub-intervals since there are
six symbols. The width of each sub-interval is equal to the probability of the
individual symbol occurrences.
Step 3: Now suppose we want to transmit the message baad!. Since the
first symbol is b, we chose the interval [0.2,0.3). This is shown in Table 6.5,
3rd row, as well as in Fig. 6.3. Once the sub-interval is chosen, it is divided
into six second-level sub-intervals. Note that the widths of these sub-
intervals are smaller than the first-level sub-intervals. This is shown in Fig.
6.3.
Step 4: The second symbol is "a". Therefore, we choose the sub-interval
corresponding to "a" in the interval [0.2,0.3), which is [0.2,0.22) (see Fig.
6.3). The sub-interval [0.2,0.22) is further divided into six third-level sub-
intervals.
Step 5: The third symbol is again "a". Therefore, the sub-interval
[0.2,0.204) is chosen. This sub-interval is again divided into six fourth-level
sub-intervals.
Step 6: The fourth symbol is "d". Therefore, the sub-interval (see Fig. 6.3)
[0.2020, 0.2024) is chosen. This sub-interval is again divided into six fifth-
level sub-intervals.
Step 7: The fifth symbol is "!". Hence, the sub-interval [0.20236,0.2024)
is chosen.
Since there is no other symbol, the encoding is now complete. The
transmitter can send an arbitrary number in the interval [0.20236, 0.2024)
and the decoder will be able to reconstruct the input sequence (of course, the
decoder has to know the information provided in Table 6.4).
Table 6.4. Example of a fixed model for the alphabet {a, b, c, d, e, !}.

Symbol | Probability | Range
a      | 0.2         | [0.0, 0.2)
b      | 0.1         | [0.2, 0.3)
c      | 0.2         | [0.3, 0.5)
d      | 0.1         | [0.5, 0.6)
e      | 0.3         | [0.6, 0.9)
!      | 0.1         | [0.9, 1.0)
However, the decoder faces the problem of detecting the end of the message
in order to stop decoding. For example, the single number 0.20236 could
represent baad!, baad!a, or baad!aa. To resolve this ambiguity, a special
EOF (end of file) symbol is normally used. This EOF marker is known to both
the encoder and the decoder, and thus the ambiguity can be resolved. In this
case, "!" can be used as the EOF marker.
Figure 6.3. Successive subdivision of the interval [0,1) during the
arithmetic encoding of the message baad!.
The procedure described so far produces the final real number that can be
sent to the decoder. The real number is generally represented by a stream of
bits, which can be generated once the final real number is known. However,
we do not need to wait until the final real number is calculated. The
bitstream can be generated while the input symbols are coming in and the
sub-intervals are being narrowed. This is shown in Table 6.5. After seeing
the first symbol b, the encoder narrows the range to [0.2, 0.3). As soon as
it is determined that the interval lies within [0.0, 0.5), the encoder knows
that the first output bit is "0". Hence, it can start sending the output
bits. After the second symbol, the encoder narrows the range further to
[0.2, 0.22). At this moment the encoder knows that the second output bit is
"0" because the interval falls within [0, 0.25). Similarly, the third output
bit will be "1" as the interval falls within [0.125, 0.25), and the fourth
output bit will be "1" as the interval falls within [0.1875, 0.25).
The decoding scheme is very simple. Suppose the number 0.20238 is sent.
The decoder knows that the first symbol is b, as the number lies in [0.2,0.3).
The second symbol is a, since the number lies in the first one-fifth of the
range, i.e., within [0.2, 0.22). The process continues in this manner.
In the above example, a fixed model has been used, which is not suitable
for coding nonstationary sources where the source statistics change with
time. In these cases, an adaptive model should be employed to achieve a
superior performance. Witten et al. [7] have presented an efficient
implementation of adaptive arithmetic coding that provides a bit-rate very
close to the entropy of the source.
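Since the coding consists of repeatedly rescaling a current interval, it can
be sketched in a few lines. The following is a minimal floating-point sketch
(a practical coder would instead use the incremental integer arithmetic of
Witten et al. [7]); the symbol probabilities are those implied by the
sub-intervals of Steps 3-7 and Table 6.4.

```python
# Minimal floating-point sketch of arithmetic coding with the fixed model
# of Table 6.4 (a practical coder uses incremental integer arithmetic).
MODEL = {'a': 0.2, 'b': 0.1, 'c': 0.2, 'd': 0.1, 'e': 0.3, '!': 0.1}

def cumulative(model):
    """Map each symbol to its sub-interval [low, high) of [0, 1)."""
    intervals, cum = {}, 0.0
    for sym, p in model.items():
        intervals[sym] = (cum, cum + p)
        cum += p
    return intervals

def encode(message, model):
    """Narrow [low, high) once per symbol; any number inside codes the message."""
    iv, low, high = cumulative(model), 0.0, 1.0
    for sym in message:
        width = high - low
        low, high = low + width * iv[sym][0], low + width * iv[sym][1]
    return low, high

def decode(x, model, eof='!'):
    """Locate x in a sub-interval, emit the symbol, rescale, repeat until EOF."""
    iv, out = cumulative(model), []
    while not out or out[-1] != eof:
        for sym, (s_lo, s_hi) in iv.items():
            if s_lo <= x < s_hi:
                out.append(sym)
                x = (x - s_lo) / (s_hi - s_lo)   # rescale x to [0, 1)
                break
    return ''.join(out)

print(encode('baad!', MODEL))    # ~(0.20236, 0.2024), as in Step 7
print(decode(0.20238, MODEL))    # baad!
```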
• Example 6.8
Consider a 20-character dictionary or sliding window and an 8-character
look-ahead window. The input source string is
RYYXXZERTYGJJJASDERRXXZERFEERXXYURPP
Calculate the output of the text compressor.
Step 1
The sliding window is loaded with the first 20 characters from the input
source, with the symbol (data byte) 'R' in window position 0, and the
symbol 'R' in window position 19. The next eight symbols of the input,
XXZERFEE, are loaded into the look-ahead window.
Sliding window (20 characters):   RYYXXZERTYGJJJASDERR
Look-ahead window (8 characters): XXZERFEE
Step 2
Search for a sequence in the sliding window that begins with the character
in look-ahead position 0 (i.e., 'X'). The five-character sequence "XXZER",
beginning at sliding window offset 4, matches the pattern in the look-ahead
buffer. These five characters can therefore be replaced by an
(offset, length) record as shown below:
RYYXXZERTYGJJJASDERR | (4,5)
Step 3
The sliding window is then shifted over five characters. The five characters
"RYYXX" are moved out of the sliding window, and the five new characters
"RXXYU" are moved into the look-ahead window.
There is no match in the sliding window for the first three characters
("FEE") of the look-ahead window. Hence, these characters are output
directly, the sliding window is shifted by 3 characters, and three new
characters come into the look-ahead window.
Again, a search for a sequence in the sliding window that matches the look-
ahead window sequence is carried out. There is a sequence of 3 characters
starting at sliding window offset 12 ("RXX") that matches the pattern in the
look-ahead window. Hence, these 3 characters are replaced by (12,3).
Sliding window (20 characters):   TYGJJJASDERRXXZERFEE
Look-ahead window (8 characters): RXXYURPP
The above procedure is repeated as new symbols come into the look-ahead
window. The final compressed output, ignoring the first 20 characters, is
[(4,5), FEE, (12,3), ...]. •
During the decompression, a sliding window of identical size is required.
However, the look-ahead window is not required. The uncompressed data is
put into the sliding window. When a record (offset, length) is detected, the
decompressor points to the position of the offset, and begins to copy the
specified number of symbols and shifts them into the same sliding window.
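The sliding-window scheme of Example 6.8 can be sketched as follows. This is
only an illustration of the principle, not any particular compressor's
format: the first window-full of characters is assumed to be transmitted
uncompressed, offsets count from the start of the window as in the example,
and matches shorter than three characters are emitted as literals.

```python
# A minimal LZ77-style sketch with a 20-character sliding window and an
# 8-character look-ahead window, as in Example 6.8.
def lz77_encode(data, window=20, lookahead=8):
    out, pos = [], window               # first `window` characters sent raw
    while pos < len(data):
        search = data[pos - window:pos]
        match_len, match_off = 0, 0
        # find the longest window match for a prefix of the look-ahead buffer
        for length in range(min(lookahead, len(data) - pos), 0, -1):
            idx = search.find(data[pos:pos + length])
            if idx >= 0:
                match_len, match_off = length, idx + 1
                break
        if match_len >= 3:              # short matches are not worth a record
            out.append((match_off, match_len))
            pos += match_len
        else:
            out.append(data[pos])       # literal character
            pos += 1
    return out

src = 'RYYXXZERTYGJJJASDERRXXZERFEERXXYURPP'
print(lz77_encode(src))   # [(4, 5), 'F', 'E', 'E', (12, 3), 'Y', 'U', 'R', 'P', 'P']
```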
• Example 6.9
Assume that the text string to be encoded is "PUP PUB PUPPY ... ".
Show how the dictionary is built up, and the first few compressor output
symbols.
Step 1
At first, the encoder starts with a dictionary containing only the null
string. The first character 'P' is read in from the source. There is no
match in the dictionary, and hence it is considered as matching the null
string. The encoder therefore outputs the dictionary index '0' (which
corresponds to the null string) together with the character 'P', and adds
the string "P" to the dictionary as entry 1 (see Fig. 6.4). The second
character 'U' is read from the source; again there is no match in the
dictionary. The encoder outputs the dictionary index '0' and the character
'U', and adds "U" to the dictionary as entry 2.
Step 2
The next character 'P' is read in. It matches an existing dictionary phrase,
and hence the next character ' ' is read in, which creates a new phrase
("P ") with no match in the dictionary. The encoder outputs the dictionary
index '1' and the character ' '. The string "P " is then added to the
dictionary as entry 3.
DICTIONARY                    CODED OUTPUT
0   (null)
1   P                         (0, P)
2   U                         (0, U)
3   "P "                      (1, ' ')
4   PU                        (1, U)
5   B                         (0, B)
6   " "                       (0, ' ')
7   PUP                       (4, P)
8   PY                        (1, Y)
Step 3:
The next character 'P' is read in. It matches an existing dictionary phrase,
so the next character 'U' is read, which creates a new phrase ("PU") with no
match in the dictionary. The encoder outputs the dictionary index '1' and
the character 'U'. The string "PU" is then added to the dictionary as
entry 4.
Step 4:
The next character 'B' is read in; there is no match in the dictionary. The
encoder outputs the dictionary index '0' (the null string) and the character
'B', and adds "B" to the dictionary as entry 5. The next character ' ' is
then read, which also does not match any dictionary entry. Again, the
encoder outputs the index '0' and the character ' ', and adds " " to the
dictionary as entry 6.
Step 5:
When the next character 'P' is read in, it matches a dictionary phrase.
So the next character 'U' is read in; the string "PU" again matches a
dictionary phrase. When the following character 'P' is read, the string
"PUP" has no match in the dictionary. The encoder outputs the dictionary
index '4' (which corresponds to "PU") and the character 'P', and adds "PUP"
to the dictionary as entry 7.
The encoding continues in this manner. The encoded output for this example
is (0,P), (0,U), (1,' '), (1,U), (0,B), (0,' '), (4,P), (1,Y), ... . •
The decoder rebuilds the same dictionary as it receives the
(index, character) pairs. For example, on receiving the pair (1, ' '), the
decoder looks up entry 1 ('P'), appends the space, adds the new string "P "
as entry 3 in the dictionary, and sends "P " to the output. The string
decoded so far is then "PUP ". The same procedure is used for decoding the
remaining encoded symbols and building the dictionary.
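The dictionary build-up of Example 6.9 can be sketched directly; the short
LZ78 sketch below reproduces the (index, character) pairs derived in
Steps 1-5.

```python
# A minimal LZ78 sketch: the dictionary starts with the null string
# (entry 0) and grows by one phrase per emitted (index, character) pair.
def lz78_encode(text):
    dictionary = {'': 0}               # entry 0: the null string
    out, phrase = [], ''
    for ch in text:
        if phrase + ch in dictionary:  # keep extending the current match
            phrase += ch
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ''
    if phrase:                         # flush a trailing match, if any
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

print(lz78_encode('PUP PUB PUPPY'))
# [(0,'P'), (0,'U'), (1,' '), (1,'U'), (0,'B'), (0,' '), (4,'P'), (1,'Y')]
```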
6.5 SUMMARY
Several lossless coding techniques have been presented in this Chapter.
The first set of techniques is based on entropy coding. Although these
techniques were presented for encoding text data, they are also used
extensively for audio and image coding (more details in Chapters 7 and 8).
The second set of techniques is based on dictionary-based approaches. Most
text-compression schemes employ these approaches to achieve high
compression.
REFERENCES
1. ASCII Table and Description, http://www.asciitable.com/
2. Unicode in the Unix Environment, http://czyborra.com/
3. P. Deutsch, GZIP file format specification version 4.3, RFC 1952, May 1996. Can be
downloaded from http://www.ietf.org/rfc/rfc1952.txt.
4. M. Nelson, The Data Compression Book, 2nd Edition, M&T Books, New York, 1996.
5. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication,
University of Illinois Press, Urbana, IL, 1949.
6. D. A. Huffman, "A method for the construction of minimum redundancy codes,"
Proc. of the IRE, Vol. 40, Sept. 1952.
7. I. H. Witten, R. M. Neal and J. G. Cleary, "Arithmetic coding for data compression,"
Communications of the ACM, Vol. 30, pp. 520-540, June 1987.
8. J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE
Trans. on Information Theory, Vol. 23, No.3, pp. 337-343, 1977.
9. J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate
coding," IEEE Trans. on Information Theory, Vol. 24, No.5, pp. 530-536, 1978.
10. T. A. Welch, "A technique for high-performance data compression," IEEE Computer,
Vol. 17, pp. 8-19, June 1984.
QUESTIONS
1. How many bits/character are generally required for simple text files?
2. Why are there so many extensions of the 7-bit ASCII character set?
3. A book has 900 pages. Assume that each page contains on average 40 lines, and
each line contains 75 characters. What would be the file size if the book is stored in
digital form?
4. What are the main principles of the text compression techniques?
5. Justify the inverse logarithmic relationship between the information contained in a
symbol and its probability of occurrence.
6. Why is entropy a key concept in coding theory? What is the significance of
Shannon's noiseless coding theorem?
7. Calculate the entropy of a 5-symbol source with symbol probabilities {0.25, 0.2,
0.35, 0.12, 0.08}.
8. Show that the first order entropy of a source with alphabet-size K is equal to
log2(K) only when all the symbols are equi-probable.
9. Design a Huffman table for the information source given in Example 6.2. Show
that the average bit-rate is 1.41 bits/symbol.
10. How many unique sets of Huffman codes are possible for a three-symbol source?
Construct them.
11. An information source has generated a string "AbGOODbDOG" where the symbol
"b" corresponds to a blank space. Determine the symbols you need to represent the
string. Calculate the entropy of the information source (with the limited
information given by the string). Design a Huffman code to encode the symbols.
How many bits does the Huffman coder need on average to encode a symbol?
12. How many distinct text files are possible that contain 2000 English letters? Assume
an alphabet size of 26.
13. Consider the information source given in Example 6.2. Determine the sub-interval
for arithmetic coding of the string "abbbc".
14. Repeat the above experiment with an arithmetic coder.
15. Why do we need a special EOF symbol in arithmetic coding? Do we need one such
symbol in Huffman coding?
16. Show that Huffman coding performs optimally (i.e., it provides a bit-rate
identical to the entropy of the source) only if all the symbol probabilities are integral
powers of 1/2.
17. Explain the principle of dictionary-based compression. What are the different
approaches in this compression method?
18. Explain sliding window-based compression technique. Consider the text string
"PJRKTYLLLMNPPRKLLLMRYKMNPPRLMRY". Compress the file using the
sliding window of length 16 and a look-ahead window of length 5.
19. Explain the LZ78 compression method. Encode the sentence "DOG EAT DOGS".
20. Compare the LZ77 and LZ78 compression techniques.
Chapter 7
Digital Audio Compression

Sampling frequency = 56000 / (2 × 16) Hz = 1750 Hz
In order to avoid aliasing, the audio signal has to be passed
through a lowpass filter with a cut-off frequency of 875 Hz or
lower.
iii) We want to use a sampling frequency of 44.1 kHz, as well as a 16
bits/sample/channel representation. What is the minimum compression
ratio we need in order to transmit the audio signal?
Uncompressed bit-rate = 44100 × 16 × 2 bit/s = 1.41 × 10^6 bit/s
Minimum compression factor = 1.41 × 10^6 / 56000 = 25.2 •
The above example demonstrates that low bit-rate audio can be obtained
by reducing the bit resolution of each audio sample, or by reducing the
sampling frequency. Although these methods can reduce the bit-rate, the
quality may be degraded significantly. In this Chapter, we will present
techniques that achieve superior compression performance (good quality at
low bit-rate) by removing redundancies in an audio signal. There are
primarily three types of redundancies:
Statistical Redundancy: The audio samples do not have equal probability
of occurrence, resulting in statistical redundancy. The sample values that
occur more frequently are given a smaller number of bits than the less
probable sample values.
Temporal redundancy: The audio signal is time varying. However, there
is often a strong correlation between the neighboring sample values. This
inter-sample redundancy is typically removed by employing compression
techniques such as predictive coding, and transform coding.
Knowledge redundancy: When the signal to be coded is limited in its
scope, a common knowledge can be associated with it at both the encoder
and decoder. The encoder can then transmit only the necessary information
(i.e. the change) required for reconstruction. For example, an orchestra can
be encoded by storing the names of the instruments, and how they are
played. The MIDI files exploit knowledge redundancy, and provide
excellent quality music at a very low bit-rate.
In addition to the above techniques, the human auditory system properties
can be exploited to improve the subjective quality of audio signal.
In this Chapter, several audio compression techniques, which exploit the
above-mentioned redundancies, will be presented. Unlike text data, which
typically employs lossless compression, audio signals are generally
compressed using lossy techniques. A brief overview of rate-distortion
theory for lossy coding is presented below.
d_avg = E{d(S, Ŝ)}    (7.1)
where Ŝ is the reconstructed vector, d(S, Ŝ) represents the distortion
between S and Ŝ, and E{x} is the expected value of the variable x.
R(d_avg) is the minimum value of this information over all transition
distributions yielding an average distortion d_avg.
Shannon's source coding theorem [1] states that for a given distortion
d_avg, there exists a rate-distortion function R(d_avg) corresponding to an
information source, which is the minimum bit-rate required to transmit the
signals coming from the source with distortion equal to (or less than)
d_avg.
[Figure: a typical rate-distortion function, showing the rate R(d_avg) as a
decreasing function of the distortion d_avg.]
• Example 7.2
Consider an audio acquisition system that has 10,000 mono audio samples
with 3-bit resolution, with sample levels 0 to 7. The number of occurrences
of the eight sample levels is [700, 900, 1500, 3000, 1700, 1100, 800, 300].
Calculate and plot the probability density function of each symbol. Calculate
the entropy of the source.
The probability of the occurrence of each sample level is given by
p[0] = 700/10000 = 0.07
p[1] = 900/10000 = 0.09
p[2] = 1500/10000 = 0.15
p[3] = 3000/10000 = 0.30
p[4] = 1700/10000 = 0.17
p[5] = 1100/10000 = 0.11
p[6] = 800/10000 = 0.08
p[7] = 300/10000 = 0.03
The pdf is plotted in Fig. 7.2.
H = -(0.07 log2 0.07 + 0.09 log2 0.09 + 0.15 log2 0.15 + 0.30 log2 0.30 +
0.17 log2 0.17 + 0.11 log2 0.11 + 0.08 log2 0.08 + 0.03 log2 0.03)
= 2.74 bits/sample
Therefore, the entropy of the source is 2.74 bits/audio sample. •
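The calculation is easily verified in a few lines, assuming the sample
counts listed above.

```python
# A short check of Example 7.2, using the sample counts given above.
import math

counts = [700, 900, 1500, 3000, 1700, 1100, 800, 300]
total = sum(counts)                                  # 10,000 samples
pdf = [c / total for c in counts]
entropy = -sum(p * math.log2(p) for p in pdf)
print(pdf)        # [0.07, 0.09, 0.15, 0.30, 0.17, 0.11, 0.08, 0.03]
print(entropy)    # ~2.74 bits/sample
```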
Figure 7.2. Probability density function of the eight sample levels.
The waveform corresponding to the audio signal "chord.wav" was shown
in Fig. 4.14(a). If the signal is quantized with 8 bits/sample, the dynamic
range of the sample values will be [-128, 127]. The pdf of the sample values
is shown in Fig. 7.3. The entropy of the signal is 3.95 bits/sample. As a
result, a direct application of entropy coding will produce a compression
ratio of about 2:1 for this signal without any loss.
Figure 7.3. Probability density function of the sample values of chord.wav.
In uniform quantization, the same step-size is used for all samples, which
is not very efficient since the smaller values (that occur frequently) are
quantized with the same step-size that is used to quantize larger amplitude
samples. A superior performance can be obtained using a nonuniform quantizer
that quantizes frequently occurring sample values more precisely than
others. However, designing a nonuniform quantizer may be a tedious process.
A performance close to that of nonuniform quantization can be achieved with
a nonlinear transformation followed by uniform quantization. Here, the
signal undergoes a fixed nonlinear transformation (known as companding), as
shown in Fig. 7.4. A uniform quantization is then applied on the transformed
signal, and the digitized sample values are stored. During playback, we
dequantize the signal, and perform the inverse nonlinear transformation
(known as expanding). The original audio, along with some quantization
noise, is obtained. Implementing such a system is easier than implementing
a nonuniform quantizer.
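As an illustrative sketch, the commonly used mu-law characteristic can serve
as the compressor function (the discussion above does not fix a particular
function); samples are assumed normalized to [-1, 1].

```python
# A sketch of companding with the mu-law characteristic, one possible
# choice of compressor function. Input samples normalized to [-1, 1].
import numpy as np

MU = 255.0

def compress(x):
    """mu-law compressor: fine resolution for small amplitudes."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def expand(y):
    """Inverse transformation (expander), applied after dequantization."""
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.array([-0.8, -0.05, 0.01, 0.3, 0.9])
step = 1 / 128.0                          # uniform quantizer on compressed signal
y = np.round(compress(x) / step) * step
print(expand(y))                          # close to x; small values kept precise
```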
Figure 7.4. Companding and expanding. a) Block schematic of companding
and expanding, b) the compressor and expander functions.
Audio signals generally vary slowly relative to the sampling rate, so the
neighboring samples are reasonably well correlated. Hence, the value of an
audio signal at any time instance k can be predicted from the values of the
signal at time (k-m), where m is a small integer. This property can be
exploited to achieve a compression ratio of 2 to 4 without degrading the
audio quality significantly. Differential pulse code modulation (DPCM)
[2, 3] is one of the simplest but most effective techniques in this
category, and is presented below.
Figure 7.7. Schematic of a DPCM (a) encoder and (b) decoder. s_n: original
input signal, ŝ_n: predicted input, s'_n: reconstructed output, e_n:
prediction error, ê_n: quantized prediction error.
• Example 7.4
Consider the chord audio signal. Determine the optimal first order, second
order and third order prediction coefficients.
With the normalized autocorrelation values R(0) = 1, R(1) = 0.97,
R(2) = 0.89 and R(3) = 0.76 (and R(-k) = R(k)), the normal equations are:
First order: R(0) a_1 = R(1), i.e., a_1 = 0.97.
Second order:
[ R(0)  R(1) ] [a_1]   [R(1)]          [ 1     0.97 ] [a_1]   [0.97]
[ R(-1) R(0) ] [a_2] = [R(2)],   or    [ 0.97  1    ] [a_2] = [0.89]
Third order:
[ R(0)  R(1)  R(2) ] [a_1]   [R(1)]          [ 1     0.97  0.89 ] [a_1]   [0.97]
[ R(-1) R(0)  R(1) ] [a_2] = [R(2)],   or    [ 0.97  1     0.97 ] [a_2] = [0.89]
[ R(-2) R(-1) R(0) ] [a_3]   [R(3)]          [ 0.89  0.97  1    ] [a_3]   [0.76]
Solving the third order system gives a_1 = 1.16, a_2 = 0.49, a_3 = -0.75.
The optimal predictors (of different orders) for the chord signal are
ŝ_n = 0.97 s'_{n-1}    (1st order predictor)    (7.6)
ŝ_n = 1.81 s'_{n-1} - 0.86 s'_{n-2}    (2nd order predictor)    (7.7)
ŝ_n = 1.16 s'_{n-1} + 0.49 s'_{n-2} - 0.75 s'_{n-3}    (3rd order predictor)    (7.8)
where s'_n denotes the reconstructed sample at instant n.
The first order predictor produces an error energy of 2.79, whereas the
second order predictor produces an error energy of 0.25. To verify the
validity of the optimal coefficients, total error energies were calculated
in the neighborhood of the optimal coefficients, and are shown in Fig. 7.8.
It is observed that, for the first order case, the error energy is smallest
at the optimal coefficient. For the second order case, however, it is
observed that the error energy is at (or very close to) its minimum for a
large number of coefficient pairs. These coefficients generally follow the
relationship a_1 + a_2 ≈ 0.95. In other words, so long as the above
relationship is satisfied, a good prediction is expected. •
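The normal equations are small enough to solve directly. A minimal sketch,
assuming the autocorrelation values R(0) = 1, R(1) = 0.97, R(2) = 0.89 and
R(3) = 0.76 quoted in the example:

```python
# Solve the normal equations of Example 7.4 for predictors of order 1-3.
import numpy as np

R = [1.0, 0.97, 0.89, 0.76]      # autocorrelation values from the example

def optimal_predictor(order):
    """Solve the order-p normal equations (symmetric Toeplitz system)."""
    A = np.array([[R[abs(i - j)] for j in range(order)] for i in range(order)])
    r = np.array(R[1:order + 1])
    return np.linalg.solve(A, r)

for p in (1, 2, 3):
    print(p, np.round(optimal_predictor(p), 2))
# 1 [0.97]
# 2 [ 1.81 -0.86]
# 3 [ 1.16  0.49 -0.75]
```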
After the prediction error sequence e_n has been obtained, it has to be
encoded losslessly if the signal is to be reconstructed perfectly. In lossy
coding, however, a reasonably good reconstruction quality is often
acceptable. As a result, the error sequence is typically quantized using the
scheme shown in Fig. 7.7(a). Note that the quantization is the only
operation in the DPCM scheme that introduces noise in the reconstructed
signal. The quantized error sequence ê_n can be entropy coded to achieve
further compression. The quantized difference value is in turn used to
predict the next data sample s_{n+1}. The following example illustrates the
various steps of DPCM coding.
Figure 7.8. Total error energy in the neighborhood of the optimal
coefficients for the first order and second order predictors.
• Example 7.5
The first four samples of a digital audio sequence are [70, 75, 80, 82, ...].
It would require at least seven bits to encode each of these audio samples
directly. The audio samples are to be encoded using DPCM with the first
order predictor given in Eq. (7.6). The prediction errors are quantized with
a step-size of 2 (i.e., divided by 2 and rounded to the nearest integer),
and stored losslessly. Determine the approximate number of bits required to
represent each sample, and the reconstruction error at each sampling
instance.
The various steps at different sampling instances are shown in Table 7.1.
It is observed that the first audio sample is encoded as a reference sample.
From the second sample onwards, the prediction is employed. With a small
quantization step-size of 2, the reconstruction error is very small
(magnitude less than 1), and tolerable. However, the number of bits required
to encode the quantized error signal is significantly reduced (3, 2, and 2
bits for coding the 2nd, 3rd and 4th predicted error samples,
respectively). •
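The steps of this example can be traced with a short script, assuming (as in
the example) that the first sample is sent as a reference and that the
predictor operates on the reconstructed samples:

```python
# DPCM trace for Example 7.5: predictor 0.97, quantization step-size 2.
samples = [70, 75, 80, 82]
step = 2

recon = [float(samples[0])]                 # first sample sent as reference
for s in samples[1:]:
    pred = 0.97 * recon[-1]                 # predict from last reconstruction
    q = round((s - pred) / step)            # quantized prediction error (sent)
    recon.append(pred + q * step)           # decoder-side reconstruction
    print(f'pred={pred:.2f}  q={q}  recon={recon[-1]:.2f}  err={recon[-1] - s:+.2f}')
# q takes the values 4, 3, 2 (3, 2 and 2 bits), and |err| < 1 throughout
```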
• Example 7.6
Consider again the chord audio signal. The signal is encoded using a
DPCM coder with the predictors given in Eqs. (7.6) and (7.7). The error
sequence is quantized with step-sizes {1, 2, 3, 4, 5}. It is assumed that
the quantized error coefficients would be encoded using an entropy-coding
method. Hence, the entropy is considered as the bit-rate.
The rate-distortion performance of the DPCM coder is shown in Fig. 7.9.
It is observed that as the bit-rate increases, the SNR also improves. The
second predictor is seen to provide a marginal improvement over the first
order predictor. •
Table 7.1. Various steps of DPCM coding of the sequence [70, 75, 80, 82, ...].
a_1 = 0.97; the error sequence is quantized by 2 and rounded to the nearest
integer.

Sampling instance                 0      1       2       3
Original signal s_n               70     75      80      82
Predicted value ŝ_n               -      67.90   73.62   77.23
Prediction error e_n              -      7.10    6.38    4.77
Transmitted value                 70     4       3       2
Reconstructed signal s'_n         70     75.90   79.62   81.23
Reconstruction error              0      0.90    -0.38   -0.77
[Figure: rate-distortion performance (SNR in dB versus bit-rate) of the
DPCM coder for the chord signal.]
Figure 7.6. Perceptual bit allocation. The bits are allocated only
to tones that are above the masking levels.
The bit allocation in the encoder is thus highly flexible. The decoder
dequantizes the quantized data, which is then passed through the synthesis
filterbank, resulting in the reconstructed audio output, which can then be
played back.
Figure 7.6. Perceptual coding of different subbands. Bands 2 and 4 are perceptually
unimportant since the signal levels are below the masking levels. Bands 1 and 3 are
important since the signal levels are above the masking levels. The bits are
allocated according to the peak level above the masking threshold corresponding to
the respective subbands.
[Figure: block schematic of the MPEG-1 audio (a) encoder and (b) decoder,
including the psychoacoustic model and the optional ancillary data.]
Filter Bank
In order to achieve superior performance, the bandwidths of the individual
filters in the analysis filterbank should match the critical bands
(Table 2.2, Chapter 2) of the ear. Therefore, the bandwidth of the subbands
should be small in the low frequency range, and large in the high frequency
range. However, in order to simplify the codec design, the MPEG-1 filterbank
contains 32 bands of equal width. The filters are relatively simple and
provide a good time resolution with a reasonable frequency resolution. Note
that the filterbank is not reversible, i.e., even if the subband
coefficients are not quantized, the reconstructed audio will not be
identical to the original audio.
Layer Coding
Since there are 32 filters, each subband filter in the filterbank produces
one sample for each set of 32 input samples (as shown in Fig. 7.9). The
samples are then grouped together for quantization, although they are
grouped differently in the three layers [9].
The Layer 1 algorithm encodes the audio in frames of 384 audio samples,
obtained by grouping 12 samples from each of the 32 subbands. Each group of
12 samples gets a bit allocation, ranging from 0-15 bits, depending on the
masking level. If the bit allocation is not zero, each group also gets a
scale factor (represented by 6 bits) that scales the samples to make full
use of the range of the quantizer. Together, the bit allocation and scale
factor can provide up to 20 bits of resolution, corresponding to an SNR of
120 dB. The decoder multiplies the dequantized output by the scale factor to
reconstruct the subband samples.
The Layer 2 encoder uses a frame size of 1152 samples per audio channel,
grouping 36 samples from each subband. There is one bit allocation for each
subband, and up to three scale factors for each trio of 12 samples [9]. The
different scale factors are used only when needed to avoid audible
distortion. In this layer, three consecutive quantized values may be coded
with a single codeword.
Figure 7.9. Grouping of subband samples for quantization. A Layer 1 frame
groups 12 samples from each of the 32 subbands (384 samples); Layer 2 and
Layer 3 frames group three sets of 12 samples from each subband
(1152 samples).
Figure 7.10. MPEG Audio Layer 3 Processing. The MDCT window can
be long or short. Alias reduction is applied only to long blocks.
[Figure: MPEG AAC encoder, showing the perceptual model, the rate-distortion
control process, and the bitstream multiplexer producing the coded audio
bitstream.]
[Figure: AC-3 decoder, showing bitstream unpacking, mantissa decoding with
floating-to-fixed point normalization, and the inverse TDAC transform
producing the 5.1-channel output.]
REFERENCES
1. C. E. Shannon, "A Mathematical theory of communication," Bell Systems Technical
Journal, Vol. XXVII, No.3, pp. 379-423, 1948.
2. N. S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications
to Speech and Video, Prentice-Hall, New Jersey, 1984.
3. A. Gersho, "Advances in speech and audio compression," Proc. of the IEEE, Vol. 82,
No. 6, pp. 900-918, June 1994.
4. B. Tang, A. Shen, A. Alwan, G. Pottie, "Perceptually based embedded subband
speech coder," IEEE Transactions on Speech and Audio Processing, Vol. 5, No.2,
pp. 131-140, Mar 1997.
QUESTIONS
1. What are the main principles of the audio compression techniques?
2. You are quantizing an analog audio signal. How much quantization noise (in dB) do
you expect if you use 8, 12, 16 and 24 bits/sample/channel?
3. Calculate the probability density function and entropy of the audio signal bell1.wav
provided in the CD. What compression factor can be achieved using the entropy
coding method?
4. What is the principle of DPCM coding method?
5. The autocorrelation matrix of an audio signal is given by
[ 1    0.9  0.8 ]
[ 0.9  1    0.9 ]
[ 0.8  0.9  1   ]
Calculate the optimal first and second order predictors. Can you calculate the third
order predictor with the given information?
6. Consider a DPCM coder with a first order predictor ŝ_n = 0.97 s_{n-1}. Assume that
the signal has a dynamic range of [-128, 127]. Encode the audio signal bell1.wav. Use
the quantization step-sizes {1, 2, 3, 4, 5, 6}, and plot the rate-distortion performance.
7. Explain how the psychoacoustics properties of the human ear can be exploited for
audio compression.
8. How is the bit-allocation performed in a perceptual subband coder? How is the
masking threshold determined for an audio signal?
9. In audio coding, the audio signal is generally divided into blocks of audio samples,
which are then coded individually. What are the advantages of this blocking method?
Compare the performance of long and short blocks.
10. Briefly review the different audio compression standards. Which standards are
targeted for telephony applications? Which standards have been developed for general
purpose high quality audio storage and transmission?
11. Explain the MPEG-1 audio compression standard. Why were three coding layers
developed? Compare the complexity and performance of the different layers.
12. How are the audio data samples grouped in the different MPEG-1 layers?
13. Explain briefly the MP3 audio compression algorithm. How does it achieve superior
performance over MPEG-1 Layers 1 and 2?
14. Draw a schematic of the MPEG Advanced Audio Coding method. How is it different
from the MPEG-1 compression methods?
15. Explain the encoding and decoding principles of the AC-3 codec. How is it different
from the MPEG-1 codec?
16. What is the advantage of encoding mantissas and exponents separately in the AC-3
codec?
Chapter 8
Digital Image Compression
The term image compression refers to the process of reducing the amount
of data needed to represent an image with an acceptable subjective quality.
This is generally achieved by reducing various redundancies present in an
image. In addition, the properties of the human visual system can be
exploited to further increase the compression ratio.
In this Chapter, the basic concept of information theory as applied to
image compression will be presented first. Several important parameters
such as entropy, rate and distortion measures are then defined. This is
followed by the lossless and lossy image compression techniques. An
overview of the JPEG standard and the newly established JPEG2000
standard is presented at the end.
Similar to audio data, generally image data is also highly redundant. High
compression ratio can therefore be achieved by removing the redundancies.
As in audio, images generally have statistical redundancy that can be
exploited. A still image, however, has two spatial coordinates instead of
the single temporal coordinate of audio. Therefore, images have spatial
redundancy instead of temporal redundancy. The 2-D structure of an image
can be exploited to achieve a superior compression. Lastly, properties of
human visual system can be exploited to achieve a higher compression.
Statistical redundancy: Refers to the non-uniform probabilities of
occurrence of the different pixel values. The pixel values that occur more
frequently should be given a smaller number of bits than the less probable
values.
Spatial redundancy: Refers to the correlation between neighboring pixels
in an image. The spatial redundancy is typically removed by employing
compression techniques such as predictive coding, and transform coding.
Structural redundancy: Note that the image is originally a projection of
3-D objects onto a 2-D plane. Therefore, if the image is encoded using
structural image models that take into account the 3-D properties of the
scene, a high compression ratio can be achieved. For example, a
segmentation coding approach that considers an image as an assembly of
many regions and encodes the contour and texture of each region separately,
can efficiently exploit the structural redundancy in an image/video
sequence.
Psychovisual redundancy: The human visual system (HVS) properties can
be exploited to achieve superior compression. The typical HVS properties
that are used in image compression are i) greater sensitivity to distortion in
smooth areas compared to areas with sharp changes (i.e., areas with higher
spatial frequencies), ii) greater sensitivity to distortion in dark areas in
images, and iii) greater sensitivity to signal changes in the luminance
component compared to the chrominance component in color images.
Although the image compression techniques are based on removing
different types of redundancies, the compression techniques are generally
classified into two categories - i) lossless technique, and ii) lossy technique.
In lossless technique, the reconstructed image is identical to the original
image, and hence no distortion is introduced. Typical applications of
lossless techniques are medical imaging where distortion is unacceptable,
and satellite images where the images might be too important to introduce
any noise. However, the lossless techniques do not provide a high
compression ratio (less than 3: 1). On the other hand, lossy compression
techniques generally provide a high compression ratio. Furthermore, these
techniques introduce noise in the images that might be acceptable for many
applications, including the World Wide Web.
In this Chapter, two lossless techniques are presented: i) entropy coding,
and ii) run-length coding. The entropy coding techniques remove the
statistical redundancy present in an image, whereas the run-length coding
technique removes the spatial redundancy. A few highly efficient lossy
compression techniques are then presented. The DPCM and transform coding
are primarily based on the removal of spatial redundancy. However, the
lossless techniques and the HVS properties can be jointly exploited with
these to achieve a superior compression.
compression. In the case of images, the pixel values also do not have a
uniform pdf. Hence, entropy coding can also be employed to compress two-
dimensional images. Given an image source, the entropy of a generated
image can be calculated. A suitable encoder can then be designed to achieve
a bit-rate close to the entropy. As in the case of audio, Huffman and
arithmetic coding techniques can help to achieve image compression.
Figure 8.1 shows the pdf of the gray levels of individual pixels
corresponding to the Lena image. The entropy corresponding to this pdf is
7.45 bits/pixel. Note that the entropy is very close to the PCM rate (8 bits),
and the compression ratio is negligible. This is true for most natural images,
and hence entropy coding is not generally applied directly to encode the
pixels. It will be shown later in this section that significant compression can
be achieved by entropy coding the transform coefficients of an image.
Figure 8.1. Probability density function of the pixel values of the Lena
image.
By convention, the first run in each scan line is always considered to be a
white run. In case the first pixel is black, the first run-length will be
zero. Each scan line is terminated with a special end-of-line (EOL) symbol.
• Example 8.1
Determine the run-length code of a scanned fax image whose two scan
lines are shown below (assume that 1 and 0 correspond to white, and black
pixels, respectively).
Faximage=[11111111111000000000000000000000011111111111111111
00000000000000111111111111111111110000000000000000]
In the first scan line, there are 11 ones, followed by 22 zeros, followed by
17 ones. The line is terminated by the EOL symbol. The second scan
line starts with a black pixel. As a result, the run-length code of the
second line should start with a white run of zero length. This is followed
by 14 zeros, followed by 20 ones, followed by 16 zeros. The line is again
terminated by the EOL symbol. The overall run-length code would look as
follows:
RLC Code = [ ..... ,11,22,17,EOL,0,14,20,16,EOL, .... ] (8.1)
•
Eq. (8.1) shows that the RLC code can provide a compact representation
of long runs.
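The run extraction itself is straightforward; a minimal sketch of coding one
scan line, using the zero-length leading white run convention described
above:

```python
# Run-length coding of one fax scan line ('1' = white, '0' = black).
from itertools import groupby

def rlc_line(line):
    """Return the run-lengths of a scan line, starting with a white run."""
    runs = [(bit, len(list(g))) for bit, g in groupby(line)]
    code = [] if runs and runs[0][0] == '1' else [0]   # leading white run of 0
    code += [length for _, length in runs]
    return code + ['EOL']

print(rlc_line('1' * 11 + '0' * 22 + '1' * 17))   # [11, 22, 17, 'EOL']
print(rlc_line('0' * 14 + '1' * 20 + '0' * 16))   # [0, 14, 20, 16, 'EOL']
```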
The number of bits required to encode the RLC will depend on how the
RLC symbols are encoded. Typically, these symbols are encoded using an
entropy coder (Huffman or arithmetic coder). Assume that a given fax
image has L pixels in a scan line. Then the run-lengths l can have a value
between 0 and L (both inclusive). The average run-lengths of black pixels
(l_b) and white pixels (l_w) can be expressed as

l_b = Σ_{l=0}^{L} l p_b(l)    (8.2)
and
l_w = Σ_{l=0}^{L} l p_w(l)    (8.3)
where p_b(l) and p_w(l) are the probabilities of occurrence of a black run
of length l and a white run of length l, respectively. The entropies of the
black (h_b) and white (h_w) runs can be calculated as follows:

h_b = -Σ_{l=0}^{L} p_b(l) log2 p_b(l)    (8.4)
h_w = -Σ_{l=0}^{L} p_w(l) log2 p_w(l)    (8.5)

The compression ratio that can be achieved using RLC is on the order of

r = (l_b + l_w) / (h_b + h_w)    (8.6)
It is observed in Eq. (8.6) that longer runs (i.e., a larger numerator)
provide a higher compression ratio, which is expected. Eq. (8.6) provides an
upper limit on the compression ratio. Generally, the compression ratio
achieved in practice is 20-30% lower.
The RLC was developed in the 1950s and has become, along with its two-
dimensional extensions, the standard approach for facsimile (FAX) coding.
The RLC cannot be applied to images with high details, as the efficiency
will be very low. However, significant compression may be achieved by
first splitting the images into a set of bit planes which are then individually
run-length coded.
RLC is also used to provide high compression in transform coders. Most
of the high frequency coefficients in a transform coder become zero after
quantization, and long runs of zeros are produced. Run-length coding can
then be used very efficiently along with a VLC. Run-length coding is
generally extended to two dimensions by defining a connected area to be a
contiguous group of pixels having identical values. To compress an image
using two-dimensional RLC, only the values that specify the connected area
and its intensity are stored/transmitted.
The lossless compression techniques generally result in a low
compression ratio (typically 2 to 3). Therefore, they are not employed when
a high compression ratio is required. A high compression ratio is generally
achieved when some loss of information can be tolerated. Here, the
objective is to reduce the bit rate subject to some constraints on the image
quality. Some of the most popular lossy compression techniques are
predictive coding, transform coding, wavelet/subband coding, vector
quantization, and fractal coding. In the next few sections, we briefly
discuss each of these coding techniques.
In predictive coding, a pixel is predicted from the values of its
neighboring pixels. The differential pulse code modulation (DPCM) scheme
discussed in Chapter 7 can be extended to two dimensions for encoding
images. The following are some examples of typical predictors:

ŝ_n = 0.97 s_{n-1}    (1st order, 1-D predictor)    (8.7)
ŝ_{m,n} = 0.48 s_{m,n-1} + 0.48 s_{m-1,n}    (2nd order, 2-D predictor)    (8.8)
ŝ_{m,n} = 0.8 s_{m,n-1} - 0.62 s_{m-1,n-1} + 0.8 s_{m-1,n}    (3rd order, 2-D predictor)    (8.9)
• Example 8.2
Using the predictor given in Eq. (8.9), calculate the prediction error
output for the following 4×4 image. Assume no quantization of the error
signal.
[ 20  21  22  21 ]
[ 18  19  20  19 ]
[ 19  15  14  16 ]
[ 17  16  15  13 ]
In order to calculate the DPCM output for the first row and the first
column, the 1st order predictor given in Eq. (8.7) is used. For the other
rows and columns, the 3rd order predictor given in Eq. (8.9) is used. The
predicted pixel values are given below (the first pixel is taken as a
reference).
[ 20.00  19.40  20.37  21.34 ]
[ 19.40  18.80  19.78  19.16 ]
[ 17.46  19.24  16.22  14.00 ]
[ 18.43  13.82  14.70  16.12 ]
The prediction error output is obtained by subtracting these values from
the original pixels. •
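The prediction can be reproduced with a short script, assuming (as in the
example) the first-order predictor along the first row and column and the
third-order predictor elsewhere:

```python
# 2-D DPCM prediction for Example 8.2 (no quantization of the error).
import numpy as np

img = np.array([[20, 21, 22, 21],
                [18, 19, 20, 19],
                [19, 15, 14, 16],
                [17, 16, 15, 13]], dtype=float)

pred = np.zeros_like(img)
pred[0, 0] = img[0, 0]                      # reference pixel
pred[0, 1:] = 0.97 * img[0, :-1]            # first row: Eq. (8.7)
pred[1:, 0] = 0.97 * img[:-1, 0]            # first column: Eq. (8.7)
for m in range(1, 4):                       # interior pixels: Eq. (8.9)
    for n in range(1, 4):
        pred[m, n] = (0.8 * img[m, n - 1]
                      - 0.62 * img[m - 1, n - 1]
                      + 0.8 * img[m - 1, n])

print(pred)          # first row: 20, 19.4, 20.37, 21.34 as in the text
print(img - pred)    # prediction error output
```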
• Example 8.3
The Lena image is encoded using the DPCM predictor given in Eq. (8.9).
The prediction error sequence is quantized with step-sizes {10, 15, 20, 25,
30}. The entropy of the quantized coefficients is used to calculate the
overall bit-rate. Figure 8.2(a) shows the bit-rate versus the PSNR of the
reconstructed signal. Typical prediction error coefficients are shown in
Fig. 8.2(b). It is observed that the prediction performance is very good in
most regions, except near the edges. •
Figure 8.2. a) Bit-rate (in bits/pixel) versus PSNR (in dB) for DPCM coding
of the Lena image, b) typical prediction error coefficients.
[Figure: block schematic of a transform coding system, from the input image
to the receiver.]
most natural images. It can be demonstrated that for a first order Markov
source model, the DCT basis functions become identical to the KLT basis
functions as the adjacent pixel correlation coefficient approaches unity.
Natural images generally exhibit high pixel-to-pixel correlation, and
therefore the DCT provides a compression performance virtually
indistinguishable from the KLT. In addition, the DCT has a fast
implementation like the DFT, with a computational complexity of O(N log N)
for an N-point transform. Unlike the DFT, the DCT avoids the generation of
spurious spectral components at the boundaries, resulting in higher
compression efficiency. Hence, the DCT has been adopted as the transform
kernel in image and video coding standards, such as JPEG, MPEG and H.261.
It was mentioned in Chapter 5 that the DCT of an N×1 sequence is related
to the DFT of a 2N×1 even symmetric sequence (see Section 3, Chapter 5).
Because of this even symmetry, the reconstructed signal from quantized
DCT coefficients better preserves the edges.
• Example 8.4
Consider the 8-point signal [0 2 4 6 8 10 12 14]. Calculate the DFT
and the DCT of the signal. In order to compress the signal, ignore the
smallest three coefficients (out of 8) in both the Fourier and DCT domains,
and reconstruct the signal. Compare the results.
The DFT and DCT coefficients of the signal are shown in Fig. 8.4(a).
Note that the DFT coefficients are complex, and only the absolute values are
shown. The three smallest coefficients are made zero, and the inverse
transforms are calculated. When the reconstructed signal is rounded off to
the nearest integers, the DCT coefficients provide error-free
reconstruction. However, the DFT provides substantial error. The
reconstructed signals are shown in Fig. 8.4(b), where the reconstruction
error is particularly noticeable at the edges. •
Figure 8.4. Edge preservation performance of the DFT and DCT. a) The
DFT and DCT coefficients at various stages, b) the reconstructed signals.
Note that the amplitudes of the DFT coefficients are shown in (a).
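The experiment is easy to repeat; the sketch below assumes scipy's
orthonormal DCT-II and the standard FFT, and zeroes the three
smallest-magnitude coefficients in each domain:

```python
# Repeat Example 8.4: drop the 3 smallest coefficients in each domain.
import numpy as np
from scipy.fft import dct, idct, fft, ifft

x = np.arange(0, 16, 2, dtype=float)      # the 8-point signal [0, 2, ..., 14]

X = fft(x)                                # DFT: zero the 3 smallest coefficients
X[np.argsort(np.abs(X))[:3]] = 0
x_dft = np.real(ifft(X))

C = dct(x, norm='ortho')                  # DCT: the same experiment
C[np.argsort(np.abs(C))[:3]] = 0
x_dct = idct(C, norm='ortho')

print(np.round(x_dft, 2))                 # visible error, largest at the edges
print(np.round(x_dct, 2))                 # rounds back to the original ramp
```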
• Example 8.5
Consider a 512×512 image. Calculate the complexity of the 2-D DFT
calculation using the radix-2 FFT method. Divide the image into blocks of
8×8, and calculate the complexity of the 2-D DFT calculation for all blocks.
Compare the two complexities.
The complexity of an N×N DFT calculation using the separable approach is
2N × (N/2) log2 N, or N^2 log2 N, butterfly operations. Therefore, the
direct 512×512 DFT calculation will have a complexity of about 2.4×10^6
butterfly operations.
If the image is divided into blocks of 8×8, there will be 4096 blocks.
The complexity of the 2-D DFT for each block is 8^2 log2 8, or 192,
butterfly operations. Consequently, the overall complexity will be
4096 × 192, or 0.79×10^6, butterfly operations. As a result, the blocking
method reduces the complexity to one-third. •
."
."
...
.1
f.
•n ••
f
:)
.oo
...
.oo
• 02
...• .u ·1' 0
totffiC...". w...... ,
I. n 41 ~:-._~.,.:----:-
.. -...J 0 I
auanuttd Cotonkt.nt "alu..
10 II
(a) (b)
Figure 8.5. Probability density function of wavelet coefficients.
a) Coefficients rounded to the nearest integer (entropy = 3.67 bits/pixel),
b) coefficients rounded to the nearest integer after quantization by 8
(entropy = 0.86 bits/pixel).
• Example 8.6
Consider the 512×512 Lena image. Divide the image into non-overlapping
blocks of 8×8. Calculate the DCT of each block, and calculate the mean
energy of the DC and 63 AC coefficients. In addition, decompose the image
for three stages using a Daub-4 wavelet. Calculate the mean energy of the
lowpass and the nine highpass bands. Compare the two sets of energies.
The Lena image is divided into 8×8 blocks. There are 4096 such blocks.
The DC coefficients from each block are extracted and arranged in 2-D
according to the respective block's relative position. Similarly, all 4096
AC(0,1) coefficients are extracted and represented in 2-D, and likewise for
all other AC coefficients. The 64 2-D coefficient blocks are arranged as
shown in Fig. 8.6(a), and the 2-D DWT coefficients are shown in Fig. 8.6(b).
It is observed that both the DCT and DWT coefficients provide spatial
(i.e., the image structure) as well as frequency (i.e., the edge
information) information. Note that the spatial information provided by the
DCT is due to the block DCT (if one calculates a single 512×512 DCT, the
spatial information will not be available). Figs. 8.6(c) and 8.6(d) show the
first four bands of the DCT and DWT coefficients. It is observed that
different bands provide lowpass information, and horizontal, vertical and
diagonal edges.
The main difference between the DCT and DWT coefficients lies in the
highpass bands. The highpass DCT bands provide higher frequency
resolution, but lower spatial resolution. As a result, there are more
frequency bands, but it is difficult to recognize the spatial information
(can you identify the Lena image by looking at a high frequency band?). On
the other hand, the wavelet subbands provide higher spatial resolution and
lower frequency resolution. As a result, the number of subbands is small,
but the spatial resolution is superior - one can recognize the Lena image
by looking at a highpass band.
At the decoder, the index is mapped back to the codeword, and the
codeword is used to represent the original data vector. If an image block
has K pixels, and each pixel is represented by p bits, theoretically
(2^p)^K combinations are possible. Hence, for lossless representation, a
codebook with (2^p)^K codewords would be required. In practice, however,
only a limited number of combinations occur frequently, which reduces the
required codebook size substantially.
[Figure: vector quantization encoder and decoder, each holding an identical
codebook of K codewords.]
Note that the CCITT has since been renamed the ITU-T (ITU:
International Telecommunication Union). The CCITT developed the CCITT
G3 [10] and G4 [11] schemes for transmitting fax images. Later, the ISO
established another (more efficient) fax coding standard known as JBIG
(Joint Bi-level Imaging Group). The first general purpose still image
compression standard was developed by the Joint Photographic Experts
Group (JPEG) [6], which was formed by the CCITT and ISO in 1986. The
JPEG standard algorithm encodes both gray level and color images. Most
recently, the ISO has established the JPEG2000 standard, which provides
superior performance compared to the JPEG standard, in addition to several
useful functionalities. In the remainder of this Chapter, a brief overview
of the JPEG and JPEG2000 coding schemes is presented.
In JPEG, the image is first divided into 8×8 blocks, and 128 is subtracted
from each pixel (see Fig. 8.10(b)). The 2-D DCT of each block is then
calculated (see Fig. 8.10(c)).
[Figure 8.10. JPEG encoding: the original image is divided into 8×8 blocks,
which are transformed, quantized, and entropy coded to form the compressed
bitstream.]
The quantization table, shown in Fig. 8.11(a), has been derived from the
contrast sensitivity of the human visual system. Imagine a scenario
where a 720×576 image is being viewed from a distance of six times the
screen width. If the image is encoded by JPEG, the quantization step-sizes
corresponding to Fig. 8.11(a) will produce distortion at the threshold of
visibility. In other words, if the quantization step-sizes mentioned in the
default table are used, the reconstructed image will be almost as good as
the original. Note that the quantization step sizes of a few higher
frequency DCT coefficients, such as (0,1) and (0,2), are smaller than the
step-size for the DC (0,0) coefficient. This is because of the frequency
sensitivity of the HVS. Coefficient (0,1) corresponds to a cosine wave of
1/2 cycle over a span of 8 pixels. Therefore, it corresponds to a spatial
frequency of 1 cycle per 16 pixels.
Figure 8.11. Quantization of DCT coefficients. a) Quantization table, b) the
quantized coefficients, and c) the dequantized coefficients.
In order to achieve a good compression ratio, the quantized coefficients
shown in Fig. 8.11(b) should be represented losslessly using as few bits as
possible. To achieve this, the quantized 2-D DCT coefficients are first
converted to a 1-D sequence before being encoded losslessly. For a typical
image block, the significant DCT coefficients (i.e., coefficients with large
amplitude) are generally found in the low frequency region, that is, in the
top-left corner of the 8×8 matrix (see Fig. 8.11(b)). A superior compression
can be achieved if all non-zero coefficients are encoded first. After the
last non-zero coefficient is transmitted, a special end-of-block (EOB) code
is sent to indicate that the remaining coefficients in the sequence are all
zero. A special 45° diagonal zigzag scan (see Fig. 8.12) is employed to
generate the 1-D sequence of quantized coefficients.
For encoding, the 64 quantized DCT coefficients are classified into two
categories - one DC coefficient, and 63 AC (i.e., higher frequency)
coefficients (see Fig. 8.11(b)). The DC coefficients of neighboring
blocks have substantial redundancy for most natural images. Therefore, the
DC coefficient of each block is DPCM coded. The AC coefficients of a
typical image have a nonuniform probability density function, which can be
encoded efficiently by entropy coding. In addition, long runs of zero-valued
AC coefficients typically remain after quantization; these runs are encoded
efficiently using a joint run-length and Huffman code.
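A minimal sketch of the level-shift, quantization and zigzag-scan stages is
given below; it is not the standard's entropy coder (the Huffman stage is
omitted, and a symbolic EOB marker is used), and the flat quantization table
in the usage line is hypothetical:

```python
# Quantize an 8x8 block and zigzag-scan it into a 1-D sequence with an EOB.
import numpy as np
from scipy.fft import dctn

def zigzag_indices(n=8):
    """Return (row, col) pairs in the 45-degree zigzag order of Fig. 8.12."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def encode_block(block, qtable):
    """Level-shift, transform, quantize, and zigzag-scan one 8x8 block."""
    coeffs = dctn(block - 128.0, norm='ortho')       # 2-D DCT of shifted block
    q = np.round(coeffs / qtable).astype(int)        # uniform quantization
    seq = [q[r, c] for r, c in zigzag_indices()]
    while seq and seq[-1] == 0:                      # strip trailing zeros...
        seq.pop()
    return seq + ['EOB']                             # ...and mark end-of-block

block = np.tile(np.arange(8) * 16.0, (8, 1))         # a synthetic test block
print(encode_block(block, np.full((8, 8), 16.0)))    # hypothetical flat table
```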
• Example 8.7
In this example, we evaluate the performance of a simplified JPEG-like
coder. The image is divided into 8×8 blocks, and 128 is subtracted from the
pixel values. The DCT coefficients corresponding to the Lena image are
quantized using the JPEG default quantization table. In order to achieve
different rate-distortion performances, we scale the quantization table with
factors {0.8, 1.0, 1.2, 1.8, 2.4}. To simplify the implementation, we assume
that all 64 quantized DCT coefficients of a block will be encoded using an
ideal entropy coder (although we do not know how to implement such an
ideal entropy coder). In other words, we use the entropy to calculate the
bit-rate instead of the Huffman and run-length coding of the standard.
The rate-distortion performance is shown in Fig. 8.14. •
[Figure: an 8×8 level-shifted image block, the corresponding reconstructed
block, and the reconstruction error.]
Figure 8.14. PSNR (in dB) versus bit-rate for the simplified JPEG-like
coder.
Figure 8.15. Structure of the JPEG bit stream, consisting of tables,
headers, and restart segments.
The coding performance for the color Lena image at different bit-rates
(such as 0.95, 0.53, 0.36, and 0.18 bits per pixel) is included in the
accompanying CD. It is demonstrated that at 0.53 bpp (a compression ratio of
45:1), the reconstructed image quality is reasonably good. Note that a
higher compression ratio can be achieved for color images since the HVS is
less sensitive to the chrominance components.
Figure 8.16. Reconstructed images from the JPEG bitstream at different
bit-rates: (a) 0.90, (b) 0.56, (c) 0.37, (d) 0.25, (e) 0.13 bits per pixel.
Lossless Mode:
In this mode, the reconstructed image is identical to the original image.
Here, the DCT is not used for energy compaction. Instead, a predictive
technique similar to the coding of the DC coefficients is employed, and a
given pixel is predicted from the neighboring three causal pixels. There are
eight modes of prediction; the mode that provides the best prediction is
employed to code a pixel.
In JPEG2000, the wavelet coefficients of each subband are partitioned into
small code blocks, and the entropy coding is performed independently on
each code block. The JPEG2000 standard employs this "block coding" scheme,
which is generally not used in high performance wavelet coding schemes, for
several reasons. First, local variations in the statistics of the image from
block to block can be exploited to achieve superior compression. Second, it
provides support for applications requiring random access to the image.
Finally, parallel implementation of the coder is easier, and the memory
requirement is smaller.
[Figure: the JPEG2000 coder structure, from the input original image data
through transform, quantization, and entropy coding to the compressed
output data.]

Quantization
For lossless compression, the DWT coefficients (DWTC) are coded with
their actual values. However, the DWTC must be quantized to achieve high
compression ratios. The quantization is tile specific, lossy, and non-
reversible. Uniform scalar dead-zone quantization is performed within each
subband, with a different level of quantization for each subband, in Part I
of the standard. The lower resolution subbands are generally quantized using
smaller step-sizes since the human visual system is more sensitive at lower
frequencies. The sign of the DWT coefficients is encoded separately using
context information. The quantization scheme may be implemented in
two ways. In the first (explicit) method, the quantization step-size for
each subband is calculated independently, whereas in the second (implicit)
method, quantization steps derived from the LL subband are used to calculate
the quantization steps for the other subbands. Part II of the standard
allows both generalized uniform scalar dead-zone quantization and
trellis-coded quantization (TCQ).
[Figure: partitioning of the wavelet subbands (LL, HL, LH and HH at
different resolution levels) into code blocks and precincts.]
Entropy Coding
The entropy coding in JPEG2000 employs a bit-modeling algorithm.
Here, the wavelet coefficients of each code block are represented as a
combination of bits distributed over different bit-planes. Furthermore,
these bits are assembled into coding passes according to their significance
status. The bit modeling provides a hierarchical representation of the
wavelet coefficients by ordering their bit-planes from the MSB to the LSB.
Hence, the formed bit streams, with their inherent hierarchy, can be stored
or transferred at any bit rate without destroying the completeness of the
content of the image. The entropy coding process is a fully reversible
(i.e., lossless) compression step.

[Figure: quality layers 1-4 of a JPEG2000 code stream.]
Enhanced Features
Some enhanced features are also available in the JPEG2000 standard. These
include region of interest (ROI) coding, error resilience, and manipulation
in the compressed domain. In ROI coding, part of an image is stored at a
higher quality than the rest of the image. There are two methods to achieve
this. In the first method, known as the mask computation method, an ROI
mask is employed to mark out the region of the image that is to be stored
with higher quality. During quantization, the wavelet coefficients of this
region are quantized less coarsely. In the second method, known as the
Scaling Base and Maxshift method, the highest background coefficient value
is first found, and the DWTC of the ROI are then shifted upward. After this
shifting, all the ROI coefficients are larger than the largest background
coefficient, and can be easily identified.
Error resilience syntax and tools are also included in the standard. The
error resilience is generally achieved by the following approaches: data
partitioning and resynchronization, error detection and concealment, and
Quality of Service (QoS) transmission based on priority. Error resilience can
be achieved at two levels: entropy coding level (code block) and packet
level (resynchronization markers).
Coding Performance
The coding performance of the JPEG2000 algorithm has been evaluated
with the JasPer [9] implementation. The performance corresponding to the
Lena image is shown in Fig. 8.20. The coding has been performed with six
levels of wavelet decomposition, a code-block size of 64×64, a precinct size
of 512×512, and one quality layer. It is observed that the reconstructed
image quality is good even at compression ratios of 60:1 and 150:1 for
gray level and color images, respectively.
The JPEG2000 coding performance for the color Lena image at different
bit-rates (such as 0.95, 0.53, 0.36, and 0.18 bits per pixel) is included in
the accompanying CD. It can be seen that even at 0.18 bpp (a compression
ratio of 133:1), the reconstructed image quality is reasonably good.
The performance of JPEG2000 is compared with the JPEG standard in Fig.
8.21. It is observed that for a given bit-rate, JPEG2000 provides about a
3-4 dB improvement over the JPEG algorithm.
The portable graymap (PGM) is a simple format for gray level images. The
data is stored in uncompressed form. The pixel values are stored either
in ASCII or in binary format. The data is preceded by a few bytes of header
information (storing primarily the file type, and the numbers of rows and
columns).
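A minimal sketch of writing such a file in the binary (P5) variant, assuming
8-bit pixels:

```python
# Write a 2-D uint8 array as a binary PGM file: a short ASCII header
# (magic number, dimensions, maximum value) followed by raw pixel bytes.
import numpy as np

def write_pgm(filename, image):
    rows, cols = image.shape
    with open(filename, 'wb') as f:
        f.write(f'P5\n{cols} {rows}\n255\n'.encode('ascii'))   # header
        f.write(image.astype(np.uint8).tobytes())              # pixel data

write_pgm('gradient.pgm', np.tile(np.arange(256, dtype=np.uint8), (64, 1)))
```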
The portable pixmap (PPM) is the extension of the PGM format for
representing color pixels in uncompressed form. It can store the R, G, and B
channels in different orders.
Figure 8.21. Performance comparison of the JPEG2000 and JPEG standards
(PSNR in dB versus bits per pixel): (a) gray-level, and (b) color Lena image.
REFERENCES
1. R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison Wesley,
1993.
2. M. K. Mandal, S. Panchanathan and T. Aboulnasr, "Choice of wavelets for image
compression," Lecture Notes in Computer Science, Vol. 1133, pp. 239-249, Springer-
Verlag, 1996.
QUESTIONS
1. Do you think applying entropy coding directly on images will provide a high
compression ratio? Why?
2. You are transmitting an uncompressed image using packets of 20 bits. Each packet
contains a start bit and a stop bit (assume that there is no other overhead). How many
minutes would it take to transmit a 512x512 image with 256 gray levels if the
transmission rate is 2400 bits/sec?
3. Justify the use of run-length coding in FAX transmission.
4. Determine the run-lengths for the following scan line. Assume that "0" represents a
black pixel, and the scan line starts with a white-run.
[000000111111111000111111100000000011111110000000000]
5. Explain the principle of transform domain image compression. For what types of
images does the transform domain image compression algorithm provide a high
compression ratio?
6. What are the advantages of the transform coding?
7. Explain the advantage of using unitary transform over non-unitary transform for
image coding.
8. What is the optimum transform for image compression? Can a fixed transform
provide optimal performance for all images?
9. Can it be said that the DCT provides better energy compaction compared to the DFT
for all images?
10. The first eight pixels of a scanned image are as follows:
f =[50, 60, 55, 65, 68, 62, 67, 75].
Calculate the DFT and DCT coefficients of the
signal. Retain the 5 largest-magnitude coefficients in both cases, and drop the three
lowest-magnitude coefficients. Reconstruct the signal by calculating the inverse
transforms, and determine which transform reconstructs the edge better. Compare the
results with those obtained in Example 8.4.
11. Why is block coding generally used in DCT-based image coding?
12. What is transform coding gain? What should be its value for efficient compression?
13. Why is coding a vector of pixels or coefficients more efficient than coding individual
coefficients?
14. Compare and contrast the DCT and DWT for image coding.
15. What are the advantages and disadvantages of the vector quantization?
16. When will vector quantization offer no advantage over scalar quantization?
17. What are the advantages and disadvantages of the fractal image coding?
18. Why are the DC coefficients in JPEG coded using a DPCM coding?
19. Explain how the psychovisual properties are exploited in JPEG algorithm.
20. Why was zigzag scanning chosen over horizontal or vertical scanning in JPEG
algorithm?
21. What are the advantages of using joint Huffman and run-length coding to encode the
AC coefficients in JPEG?
22. How does JPEG algorithm exploit the psychovisual properties?
23. Explain the usefulness of progressive coding.
24. Compare and contrast JPEG and JPEG2000 Standards image coding algorithms.
25. How does JPEG2000 achieve embedded coding?
26. What is EBCOT algorithm?
Chapter 9
Digital Video Compression Techniques
For digital video storage and processing, analog video has to be
converted to digital form. Assume that each of the components {R_N, G_N, B_N}
has a dynamic range of 0-1 volt. The luminance component Y (= 0.299 R_N +
0.587 G_N + 0.114 B_N) then also has a dynamic range of 0-1 volt.
Note that each of the Y′, C_B, and C_R components has 8-bit resolution,
and could in principle use the full dynamic range [0, 255]. However, the
levels below 16 and above 235 (for Y′) or 240 (for C_B/C_R) are not used,
in order to provide working margins for various operations such as coding
and filtering.
Color Subsampling
After digitization we obtain three color channels: Y′, C_B, and C_R.
Since our visual system has a low sensitivity to the chroma components
(i.e., C_B and C_R), these components are subsampled to reduce the effective
number of chroma pixels [1]. The subsampling factors typically used are
denoted by the following conventions (see Fig. 9.1):
4:4:4 : No chroma subsampling; each pixel has Y, Cr, and Cb values.
4:2:2 : The Cr and Cb signals are subsampled horizontally by a factor of 2.
4:1:1 : The Cr and Cb signals are subsampled horizontally by a factor of 4.
4:2:0 : The Cr and Cb signals are subsampled by a factor of 2 in both the
horizontal and vertical directions.
• Example 9.1
Determine the reduction in bitrate due to 4:2:2 and 4:2:0 color subsampling.
Assume that there are N color pixels in the video. When there is no
subsampling, the video will have a size of 3N bytes (assuming 8-bit
resolution for each color channel). When 4:2:2 color subsampling is used,
there will be N Y samples, N/2 Cr samples, and N/2 Cb samples. Therefore,
the 4:2:2 video will have a size of (N + N/2 + N/2) or 2N bytes. When 4:2:0
subsampling is used, there will be N Y, N/4 Cr, and N/4 Cb samples. As a
result, the size of the 4:2:0 video will be (N + N/4 + N/4) or 1.5N bytes.
Therefore, 4:2:2 subsampling provides a 33% reduction in bitrate, while
4:2:0 provides a 50% reduction. •
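The arithmetic of Example 9.1 can be verified with a few lines of MATLAB;
the frame size below is only illustrative.

N = 640*480;                % number of color pixels (illustrative)
bytes_444 = N + N + N;      % no subsampling: 3N bytes
bytes_422 = N + N/2 + N/2;  % 4:2:2 -> 2N bytes
bytes_420 = N + N/4 + N/4;  % 4:2:0 -> 1.5N bytes
fprintf('4:4:4 %d, 4:2:2 %d, 4:2:0 %d bytes\n', bytes_444, bytes_422, bytes_420);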
Figure 9.1. Chroma subsampling: (a) 4:4:4, (b) 4:2:2, (c) 4:1:1, and (d)-(e) 4:2:0.
The 4:2:0 subsampling is widely used in video applications, especially at lower
bit-rates. Fig. (d) shows the 4:2:0 format adopted in the MPEG standards, whereas
Fig. (e) shows the format adopted in the H.261/H.263 standards.
The motion vector of a block (with horizontal and vertical components u
and v) can be estimated using the MAD criterion as

$$(\hat{u},\hat{v}) \;=\; \arg\min_{\substack{(u,v)\in Z^2\\ |u|\le \Delta_u,\ |v|\le \Delta_v}} \;\sum_{x=0}^{K-1}\sum_{y=0}^{L-1} \bigl|\, i(x,y;k) - i(x-u,\,y-v;\,k-1) \,\bigr| \qquad (9.3)$$

where Z is the set of all integers, which signifies that the motion
vectors have one-pixel accuracy. The (k-1) index reflects the assumption
that the motion estimation is performed with respect to the immediately
previous frame. The double sum gives the total absolute difference between
the current block and a candidate block in the previous frame. The
"arg min" operation finds the motion vector (û, v̂) of the candidate block
that produces the minimum absolute difference. The next example
illustrates the motion estimation procedure for a small block.
• Example 9.2
Assume that we want to predict the 2x2 block shown in Fig. 9.4(b) from
the reference frame shown in Fig. 9.4(a). Calculate the best motion vector
with respect to the MAD criterion, and the corresponding prediction error.
The current block [40 41; 41 43] is matched with all (25 in this case)
possible candidate blocks. The total absolute difference corresponding to
each candidate block is shown in Fig. 9.4(c). It is observed that the
minimum absolute difference is 1, and the best-matched block is two pixels
above the current block. Hence, the motion vector is (0,-2), meaning that
there is no horizontal motion, and a vertical motion of 2 pixels downwards.
The prediction difference block [0 0; -1 0] is obtained by subtracting the
predicted block from the current block. •
Note that Eq. (9.3) employs the minimum absolute error as the matching
criterion. However, the matching is sometimes performed using the
minimum mean square error (MSE) criterion:

$$(\hat{u},\hat{v}) \;=\; \arg\min_{\substack{(u,v)\in Z^2\\ |u|\le \Delta_u,\ |v|\le \Delta_v}} \;\sum_{x=0}^{K-1}\sum_{y=0}^{L-1} \bigl[\, i(x,y;k) - i(x-u,\,y-v;\,k-1) \,\bigr]^2 \qquad (9.4)$$

The MSE criterion has the advantage of providing a better SNR for the
predicted frame. However, it makes little difference to the subjective quality
of the predicted frame, and has a higher computational complexity. Hence,
the MAD criterion is used more frequently.
(a) Reference frame:
37 39 40 41 41 42
43 44 42 43 43 44
43 43 43 43 43 44
44 45 43 44 44 45
44 45 44 46 47 48
47 48 45 47 48 49

(b) Current block:
40 41
41 43

(c) Total absolute difference at the 25 candidate positions:
 8  6  1  3  5
 8  7  6  7  9
10  9  8  9 11
13 12 12 16 19
19 17 17 23 27

(d) Best match: MV(0,-2), with prediction error block [0 0; -1 0].
Figure 9.4. Illustration of motion estimation. (a) previous reference frame,
(b) the block from the current frame to be predicted, (c) motion prediction
error for different search grid points, (d) motion prediction with the best
motion vector (0,-2), and the corresponding error block.
Eq. (9.3) estimates the motion vector with integer-pixel accuracy.
However, more accurate motion estimation is possible by estimating motion
vectors with fractional-pixel accuracy, especially half-pixel accuracy. This is
generally done by interpolating the image blocks. Although fractional-pixel
motion estimation provides superior predictions, it has a higher complexity.
In order to obtain the minimum prediction error, the argument in Eqs.
(9.3) and (9.4) should be evaluated for all possible candidate blocks; this
is known as the full search algorithm (FSA). A few other motion estimation
techniques will be presented shortly. A MATLAB sketch of the full search
is given below.
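The following sketch implements the full search with the MAD criterion on
the small reference frame and 2x2 block of Example 9.2; it is a minimal
illustration, not an optimized implementation.

ref = [37 39 40 41 41 42; 43 44 42 43 43 44; 43 43 43 43 43 44;
       44 45 43 44 44 45; 44 45 44 46 47 48; 47 48 45 47 48 49];
blk = [40 41; 41 43];      % current block, located at rows 3-4, cols 3-4
best = Inf;
for r = 1:5                % top-left corners of the 25 candidate blocks
  for c = 1:5
    cand = ref(r:r+1, c:c+1);
    sad = sum(abs(blk(:) - cand(:)));  % total absolute difference
    if sad < best
      best = sad;
      mv = [c-3, r-3];     % displacement relative to the block position
    end
  end
end
fprintf('MV = (%d,%d), minimum SAD = %d\n', mv(1), mv(2), best);  % expect (0,-2), 1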
• Example 9.3
Calculate the motion vectors corresponding to the frame (size: 240x352)
shown in Fig. 9.5(b) with respect to the reference frame (these frames have
been taken from the well-known football sequence for evaluating motion
compensation performance) shown in Fig. 9.5(a). Assume a block size of
16x16, and a search window of [-16,16] in both horizontal and vertical
directions. Calculate the motion prediction gain, and estimate the motion
estimation complexity.
The football sequence is a high-motion sequence. If the current frame is
predicted from the previous frame without motion estimation (i.e., the
motion vectors are assumed to be zero), the frame difference energy is
significant. Figure 9.5(c) shows the frame difference signal. The prediction
gain (as defined in Eq. (9.5)) is 17.86.
Motion prediction is performed using search windows of [-7,7] and [-16,16].
The performance is shown in Figs. 9.5(d) and 9.5(e). It is observed that
the search window [-7,7] provides a frame difference that is significantly
lower than that of Fig. 9.5(c). The prediction gain in this case is 55.64. The
search window [-16,16] further reduces the error energy (as shown in Fig.
9.5(e)), providing a prediction gain of 60.11.
The motion vectors are shown in Fig. 9.6(a). It is observed that most
motion vectors are zero (in both the horizontal and vertical directions). The
histogram of the motion vectors is shown in Fig. 9.6(b). It is observed that
177 out of the total 330 motion vectors are zero, and 306 motion vectors lie
within the search range [-7,7].
(Figure 9.5. (a) reference frame, (b) current frame, and frame-difference
signals: (c) without motion compensation, (d) with search window [-7,7],
and (e) with search window [-16,16].)
Figure 9.6. Motion vectors corresponding to Fig. 9.5. (a) Full search motion vectors
(search range ±16) for the entire frame, (b) histogram of the motion vectors. Each grid
point corresponds to a 16x16 block. A downward motion vector means that the
current block is best predicted from an upper block in the previous frame. A right
arrow means that the current block is best predicted from a left block in the previous
frame. The histogram has a peak value of 177 at the center (i.e., zero motion), which
has been clipped to 15 in the figure.
Since the frame size is 240x352 and the motion block size is 16x16, there
are 330 blocks in the current video frame. For each block, there are 1089
(=33x33) candidate blocks, since the search range is [-16,16]. The total
absolute difference calculation for a 16x16 block requires 512 operations
(256 subtractions + 256 additions), ignoring the absolute value calculation.
For 1089 candidate blocks, the total is 557,568 (=1089x512) operations.
For the entire frame, the approximate number of arithmetic operations is
therefore 184 million (=557568x330). Note that the blocks at the outer edge
of the frame will not have 1089 candidate blocks for motion prediction, so
the actual complexity will be a little less than 184 million operations.
The football frames and the MATLAB code for this example are provided
in the accompanying CD. •
In the second step, the cost function is calculated at eight points
surrounding the best grid point of the first step: (2,2), (2,4), (2,6), (4,6),
(6,6), (6,4), (6,2), and (4,2). Note that the distance between the search
points is now 2 (compared to 4 in the previous step). Assume further that
the grid point (4,6) provides the minimum cost function at the second step.
In the third step, the cost function of the eight points surrounding the grid
point (4,6) is calculated, and the grid point providing the minimum cost
function is selected as the best motion vector. In this example, the lowest
cost function is provided by grid point (3,6), and hence the motion vector
is (6,3) (i.e., the horizontal motion is +6, and the vertical motion is +3).
This algorithm reduces the number of search points from the $(2p+1)^2$
required by the full search (when $\Delta_u = \Delta_v = p$) to the order of
$1 + 8\log_2 p$.
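As a quick check of these counts in MATLAB, for a search range of p = 7:

p = 7;
full_search = (2*p + 1)^2;          % 225 candidate positions
three_step  = 1 + 8*ceil(log2(p));  % 25 positions (3 steps of 8 points, plus the centre)
fprintf('full search: %d, three-step: %d points\n', full_search, three_step);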
(Figure: illustration of the three-step search on the [-7,7] search grid. The
markers distinguish the points examined in Step 1, Step 2, and Step 3; the
search converges to the best-matched block.)
Conjugate Direction Search
In this algorithm, the search for the optimal motion vector is done in two
steps. In the first step, the minimum cost function is estimated in one
direction. In the second step, the search is carried out in the other direction,
starting at the grid point found in the first step.
(Figure: illustration of the conjugate direction search on the [-7,7] search
grid; the search proceeds along one direction and then the other to reach the
best-matched block.)
• Example 9.4
In this example, the performance of the three-step and conjugate
direction methods is evaluated using the two football frames considered in
Example 9.3. The three-step and conjugate direction methods provide
prediction gains of 47.6 and 36.7, respectively (whereas the full search
algorithm provides a gain of 55.64). Figs. 9.10(a) and 9.10(b) show the
motion-predicted error frames corresponding to the three-step and
conjugate direction methods. Note that the three-step search provides a
gain close to that of the FSA at a substantially reduced complexity (you can
verify this by executing the MATLAB code and comparing the runtimes).
The conjugate direction method reduces the complexity further; however,
the prediction gain also drops. The motion vectors corresponding to the two
search methods are shown in Fig. 9.11. •
(a) (b)
Figure 9.10. Motion compensation performance. (a) frame difference signal
corresponding to the three-step search with search window (±7,±7) (G_MP = 47.6),
(b) frame difference signal corresponding to the conjugate direction method with
search window (±7,±7) (G_MP = 36.7).
(Figure 9.11. Motion vectors corresponding to (a) the three-step search, and
(b) the conjugate direction method.)
• Example 9.5
Football is a fast-moving video sequence, and hence its motion
prediction gain is only in the range of 50. In this example, we consider the
slow-moving sequence Claire, which is a typical teleconferencing sequence.
Figures 9.12(a) and 9.12(b) show the reference frame and the frame to be
predicted, respectively. A straightforward frame difference (i.e., with no
motion prediction) produces a prediction gain of 360, while the FSA
produces a prediction gain of 1580. The corresponding error frames are
shown in Figs. 9.12(c)-(d). It is observed that a high prediction gain is
achieved in this case. As a result, the overall bit-rate will be lower than for
the football sequence (at a similar quality level). •
(Figure 9.12. Claire sequence: (a) reference frame, (b) frame to be
predicted, (c) frame difference without motion prediction, and (d) FSA
prediction error.)
Figure 9.13. Error surface. (a) Ideal monotonic error surface with one (global)
minimum, (b) an irregular error surface with several local minima.
In principle, the chroma components of the current frame could be
motion predicted from the Y, Cb, and Cr components of the reference frame.
However, it has been found that there is a strong correlation among the
motion vectors of the three components (when an object moves, it should be
reflected equally in all three components). Hence, motion vectors are
generally not calculated for the Cb and Cr components, in order to reduce
the computational complexity. Instead, the motion vector of a block in the
Y component is employed for motion compensation of the corresponding
blocks in the Cb and Cr components. However, suitable scaling is required,
since the Cb and Cr components may have a different size than the Y
component.
Figure 9.14. Example of prediction from a future frame. The head block
cannot be predicted from the previous reference frame; the future
reference frame can be used here to predict the block.
9.5.1 Motion-JPEG
Motion-JPEG is the simplest (nonstandard) video codec. It uses the
JPEG still-image coding standard to encode and decode each frame
individually. Since there is no motion estimation, the complexity of the
coding algorithm is very low. However, the compression performance of
this codec is not very good, since it does not exploit the temporal correlation
among the video frames. Nevertheless, because of its simplicity, many early
video coding applications used this approach.
Figure 9.15. Simplified block diagram of the MPEG-1 video encoder. VLC:
variable length coding, MUX: multiplexer.
Figure 9.16. Example of a group of pictures (GOP) used in MPEG. I-frames
are used as reference (R) frames for P-frames, and both I- and P-frames are
used as reference frames for B-frames.
Since the B-frames are predicted from both the previous and future
reference frames, the frames are reordered before motion prediction and
encoding. The frames are encoded in the order
I1 P1 B1 B2 P2 B3 B4 I2 B5 B6. Sending a frame out of sequence requires
additional memory at both the encoder and the decoder, and also causes
delay. Although a larger number of B-frames provides superior compression
performance, the number of B-frames between two reference frames is kept
small to reduce cost and minimize delay. It has been found that I-frame-only
coding requires more than twice the bit-rate of IBBP coding. If the delay
corresponding to IBBP is unacceptable, an IB sequence may be a useful
compromise.
The decoding of MPEG-1 video is the inverse of the encoding process, as
shown in Fig. 9.17. The coded bitstream is first demultiplexed to obtain the
motion vectors, the quantizer information, and the entropy-coded quantized
DCT coefficients. An I-frame is obtained in three steps: the compressed
coefficients are entropy decoded, the coefficients are then dequantized, and
the inverse DCT of the dequantized coefficients is calculated. This
completes the I-frame reconstruction.
The P-frames are obtained in two steps. In the first step, the
corresponding error frame is calculated by decoding the compressed error
coefficients. In the second step, the decoded error frame is added to the
motion-compensated prediction obtained from the previously reconstructed
reference frame.
(Figure 9.17. Simplified block diagram of the MPEG-1 video decoder:
demultiplexer, VLC decoder, inverse quantizer, inverse DCT, and motion
compensator/predictor.)
The MPEG-2 standard supports a wide variety of picture formats, and
provides error correction capability that is suitable for cable TV and
satellite links. A schematic of the MPEG-2 video and audio coders and the
associated packetizing scheme for transmission through a network is shown
in Fig. 9.18.
(Figure 9.18. Video and audio encoders produce packetized elementary
streams (PES), which are multiplexed into a program stream (PS) or a
transport stream (TS).)
The simple profile at the main level does not employ B-frames, resulting
in a simpler hardware implementation. The main profile has been designed
for the majority of applications. The low level supports SIF resolution
(352x288), whereas the main level supports SDTV (standard definition TV)
resolution. The High-1440 level supports high-definition resolution (with an
aspect ratio of 4:3) that is double (1440x1152) that of SDTV. The 16:9
high-definition video (1920x1152) is supported at the high level. Table 9.3
shows the applications of this profile at different levels. The SNR and
Spatial profiles support scalable video at different resolutions and bitrates.
The simple, main, SNR, and spatial profiles support only 4:2:0 chroma
sampling. The high profile supports both 4:2:0 and 4:2:2 sampling, as well
as SNR and spatial scalability. A special 4:2:2 profile has been defined by
MPEG-2 to improve compatibility with digital production equipment at a
complexity lower than that of the high profile.
Figure 9.21 below outlines the basic approach of the MPEG-4 video
algorithms for encoding rectangular as well as arbitrarily-shaped input
image sequences. The basic coding structure involves shape coding (for
arbitrarily-shaped VOs) and motion compensation, as well as DCT-based
texture coding (using the standard 8x8 DCT or a shape-adaptive DCT). An
important advantage of the content-based coding approach of MPEG-4 is
that the compression efficiency can be significantly improved for some
video sequences by using appropriate, dedicated object-based motion
prediction "tools" for each object in a scene. A number of motion prediction
techniques allow efficient coding and flexible presentation of the objects:
In certain situations, a technique known as sprite panorama may be used
in MPEG-4 to achieve superior performance. Consider a video sequence in
which a person is walking down a street (see Fig. 9.22(a)). Since the person
is walking slowly, several consecutive images will have a similar
background, and the background may move slowly due to camera
movement or operation. A sprite panorama image can be generated from the
static background (i.e., the street image). The foreground object, i.e., the
person's image, is separated from the background. The large panorama
image (i.e., Fig. 9.22(b)) and the person's image (i.e., Fig. 9.22(c)) are
transmitted to the receiver separately, only once, at the beginning. These
images are stored in the sprite buffer at the receiver. For each consecutive
frame, the position of the person within the sprite panorama, and the camera
parameters, if any, are sent to the receiver. The receiver can then reconstruct
the individual frames from the panorama image and the picture of the
person. Since the individual frames (other than the panorama image) are
represented only by the camera operation and the positions of the objects,
very high compression performance may be achieved with this method.
(Figure 9.21. Basic MPEG-4 coding structure for VOPs or video frames.
Figure 9.22. Sprite coding: (a) a video frame, (b) the panorama (sprite)
image, and (c) the foreground object.)
Scene description
In addition to providing support for coding individual objects, MPEG-4
also provides facilities to compose such objects into a scene. The necessary
composition information forms the scene description, which is then coded
and transmitted together with the media objects. Starting from the Virtual
Reality Modeling Language (VRML), MPEG has developed a binary
language for scene description called BIFS (BInary Format for Scenes).
In order to facilitate the development of authoring, manipulation, and
interaction tools, scene descriptions are coded independently from the
streams related to primitive media objects. Special care is devoted to the
identification of the parameters belonging to the scene description. This is
done by differentiating the parameters that are used to improve the coding
efficiency of an object (e.g., motion vectors in video coding algorithms)
from those that are used as modifiers of an object (e.g., the position of the
object in the scene). Since MPEG-4 should allow the modification of the
latter set of parameters without having to decode the primitive media
objects themselves, these parameters are placed in the scene description and
not in the primitive media objects.
An MPEG-4 scene can be represented by a hierarchical structure using a
directed graph. Each node of the graph is a media object, as illustrated in
Fig. 9.23. Nodes at the leaves of the tree are primitive nodes, whereas the
nodes that are parents of one or more other nodes are called compound
nodes. The tree structure may not be static; node attributes (e.g., positioning
parameters) can be changed, and nodes can also be added, replaced, or
removed. This logical structure provides an efficient way to perform the
object-based coding.
The MPEG-4 codec evaluated here is the reference software provided by
Microsoft [15]. Although the MPEG-4 codec software has both frame-based
and object-based coding options, the frame-based technique is used in this
test.
(Figure: PSNR comparison of M-JPEG, MPEG-2 (I-frames only), and
MPEG-2 (IP) coding for the Hallway, Football, and Table Tennis
sequences.)
Coding performance is evaluated using the first 100 frames of the test
sequence "bream" (shown in Fig. 9.25) with a frame size of 352x288.
Assuming a sampling rate of 30 frames/second, the clip has a duration of
approximately 3.3 seconds. Note that each codec has several parameters that
can be fine-tuned to achieve superior performance; in this experiment, the
default parameters have been employed. Therefore, it may be possible to
obtain better performance than that reported here.
Figure 9.26. Performance of the MPEG-1 and MPEG-4 codecs for the bream sequence
at 1.1 Mbits/s. The PSNR of the luminance component of the individual frames is shown.
Comparison at 56 kbits/s
Figure 9.27. Performance of the H.263 and MPEG-4 codecs at 1.1 Mbits/s. The PSNR
of the luminance component of the individual frames is shown.
Figure 9.28. Performance comparison of the H.263 and MPEG-4 codecs at 56
kbps. (a) the PSNR and (b) the bitrate of the individual encoded (luminance)
frames of the bream sequence. The average bitrate is 56 kbps.
Figure 9.28 shows the PSNR and the bitrates of the individual encoded
frames. It is difficult to compare the PSNR and the sizes of the individual
compressed frames directly, since the encoded frames do not correspond to
each other for the two codecs. However, the plots provide some information
on how each codec operates. The I-frame of the MPEG-4 codec uses
fewer bits than that of the H.263 codec, and the saved bits are spent encoding
extra frames (34 frames compared to 24 for the H.263 codec). It has
been observed that, overall, MPEG-4 provides superior video quality.
REFERENCES
1. A. N. Netravali and B. G. Haskell, Digital Pictures, Plenum Press, New York,
Second Edition, 1995.
2. F. Dufaux and F. Moscheni, "Motion estimation techniques for digital TV: a review
and a new contribution," Proc. of the IEEE, Vol. 83, No. 6, pp. 858-876, June 1995.
3. B. G. Haskell, P. G. Howard, Y. A. Lecun, A. Puri, J. Ostermann, M. R. Civanlar, L.
Rabiner, L. Bottou, and P. Haffner, "Image and video coding - emerging standards
and beyond," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 8, No.
7, pp. 814-837, Nov. 1998.
4. M. Liou, "Overview of the px64 kbit/s video coding standard," Communications of
the ACM, Vol. 34, No. 4, pp. 60-63, April 1991.
5. ITU-T Recommendation H.263, Video coding for low bitrate communications, 1996.
6. MPEG Homepage: http://mpeg.telecomitalialab.com/standards.htm.
7. D. Le Gall, "MPEG: a video compression standard for multimedia applications,"
Communications of the ACM, Vol. 34, pp. 46-58, April 1991.
8. J. L. Mitchell, MPEG Video: Compression Standard, Chapman & Hall, New York,
1996.
9. R. Koenen, F. Pereira, and L. Chiariglione, "MPEG-4: context and objectives,"
Signal Processing: Image Communication, Vol. 9, pp. 295-304, May 1997.
10. ISO/IEC JTC1/SC29/WG11 N4030, MPEG-4 Overview (Singapore Version),
March 2001.
11. G. Cote, B. Erol, M. Gallant, and F. Kossentini, "H.263+: video coding at low bit
rates," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 8, No. 7, pp.
849-866, Nov. 1998.
12. L. Boroczky and A. Y. Ngai, "Comparison of MPEG-2 and M-JPEG video coding at
low bit-rates," SMPTE Journal, March 1999.
13. K. O. Lillevold, "Telenor R&D H.263 Codec (Version 2.0)," ftp://bonde.nta.no/pub/
tmn/software/, June 1996.
14. MPEG Software Simulation Group (MSSG), "MPEG-2 Video Encoder/Decoder
(Version 1.2)," ftp://ftp.mpegtv.com/pub/mpeg/mssg/, July 1996.
15. Microsoft Corporation, "MPEG-4 Visual Reference Software (Version 2.3.0
FDAM1-2.3-001213)," http://numbernine.net/robotbugs/mpeg4.htm, Dec. 2000.
QUESTIONS
1. A multimedia presentation contains the following types of data:
i) 10,000 characters of text (8-bit, high ASCII)
ii) 200 color images (400x300, 24 bits)
iii) 15 minutes of audio (44.1 kHz, 16 bits/channel, 2 channels)
iv) 15 minutes of video (640x480, 24 bits, 30 frames/sec)
Calculate the disk space required to store the multimedia presentation. What
percentage of the space does each data type occupy?
2. What is the principle behind color subsampling? Consider the video data in the above
problem. Assume that the video is stored in i) 4:2:2, or ii) 4:2:0 format. How much
space would be required in each case to store the video?
3. What is motion compensation in video coding? Why is it so effective?
4. Compare and contrast various motion estimation techniques.
5. While performing motion estimation for the football sequence using a full search
algorithm, the displaced block difference energy of a 16x16 block was found to be as
given in the following table. The energy values shown are normalized for clarity. The
search range is (-7,7) in both the horizontal and vertical directions. Calculate the
motion vector and the motion-predicted error energy for i) the three-step, ii) the 2-D
logarithmic, and iii) the conjugate direction search. Are these fast-search techniques
able to find the global minimum?
86 91 94 95 94 89 83 76 69 65 64 68 73 76 75
75 80 86 90 93 91 84 74 63 52 46 45 50 58 65
62 65 71 78 82 83 80 71 60 48 36 28 28 35 47
54 53 56 61 68 73 75 73 64 51 36 21 13 15 28
57 53 51 54 59 65 70 70 66 56 43 27 13 9 18
65 60 54 53 57 61 64 66 65 60 53 43 30 21 21
72 71 64 60 59 60 61 63 63 62 60 58 50 39 32
75 76 73 68 64 61 61 63 64 64 64 64 60 52 43
75 79 81 79 74 68 64 63 64 64 65 66 65 61 53
75 79 82 82 78 72 66 63 62 63 65 67 67 67 63
74 77 78 77 76 70 63 60 60 63 65 68 69 68 64
74 77 78 77 76 70 63 60 60 63 65 68 68 67 64
73 74 72 72 71 67 61 60 62 65 66 68 68 67 64
73 71 69 67 67 64 60 60 63 66 67 69 71 70 68
73 70 65 61 59 59 58 58 62 66 68 71 73 72 70
6. Repeat the above problem for the following table. Do the fast search algorithms find
the global minimum? If we had used the search range (-15,15) instead of (-7,7), what
would be the performance of the three fast search techniques?
24 30 36 42 47 51 55 57 58 59 61 62 63 62 61
25 32 39 46 51 55 59 61 62 64 66 68 68 66 65
28 34 41 48 53 59 64 66 67 69 72 73 73 71 69
30 37 43 50 53 58 66 70 70 73 77 78 78 76 74
31 38 46 53 53 55 64 71 72 76 81 84 83 81 80
31 38 47 55 56 54 57 64 71 78 83 86 86 84 84
32 38 46 55 59 57 52 51 62 78 85 86 88 88 88
34 39 45 54 59 60 53 41 47 70 83 86 88 90 90
36 41 47 53 58 61 57 45 44 59 76 85 89 91 92
37 43 49 54 57 60 59 55 54 59 70 83 91 94 95
39 43 52 56 57 57 59 64 67 68 72 82 92 98 99
43 44 50 57 56 52 54 64 73 77 77 80 89 96 99
48 47 47 55 57 50 48 60 72 80 80 79 85 94 97
51 52 49 51 54 51 48 53 64 77 81 79 82 88 90
50 53 54 50 46 49 50 49 53 66 75 76 78 77 78
Chapter 10
Digital Audio Processing
Digital audio has become very popular in the last two decades. With the
growth of multimedia systems and the WWW, audio processing techniques
are becoming increasingly important. There is a wide variety of audio
processing techniques, such as filtering, equalization, noise suppression,
compression, addition of sound effects, and synthesis. Audio compression
techniques were discussed in Chapter 7. In this chapter, we present a few
selected audio processing techniques. The coverage is necessarily brief
compared to the rich literature available; readers interested in learning audio
processing techniques in greater detail may consult the books listed in the
reference section.
Figure 10.1. The audio signal bell.wav: (a) the time-domain signal, and (b) its
amplitude spectrum.
Figure 10.2. Lowpass filtering of an audio signal: (a) lowpass filter, and (b) power
spectral density of the filter output. The filter is a 64-tap FIR filter (designed using a
Hamming window) with a cutoff frequency of 4000 Hz.
Bandpass filtering
The filtering is again performed using the fir1 command:
filt_bp = fir1(64, [4000 6000]/11025);  % bandpass filter design
x_bpf = filter(filt_bp, 1, x);          % bandpass filtering of the audio signal
The lower and upper cut-off frequencies are chosen as 4000 and 6000 Hz,
respectively (the frequencies are normalized by the Nyquist frequency of
11,025 Hz). The filter is a 64th-order bandpass filter with real coefficients
and linear phase. The gain of the bandpass filter is shown in Fig. 10.3(a),
and the power spectral density of the filtered output is shown in Fig.
10.3(b). The complete MATLAB function for bandpass filtering, as well as
the audio output, is included in the CD.
Figure 10.3. Bandpass filtering: (a) bandpass filter characteristics, and (b) power
spectral density of the filter output. The filter is a 64-tap FIR filter with lower and
upper cutoff frequencies of 4000 and 6000 Hz, respectively.
Highpass filtering
The filtering is again performed using the fir1 command:
filt_high = fir1(64, 4000/11025, 'high');  % highpass filter design
x_hpf = filter(filt_high, 1, x);           % highpass filtering of the audio signal
The cut-off frequency is chosen as 4000 Hz. The filter is a 64th-order
highpass filter with real coefficients and linear phase. The gain of the
highpass filter is shown in Fig. 10.4(a), and the power spectral density of the
filtered output is shown in Fig. 10.4(b). The audio output is again provided
in the accompanying CD in wav format to demonstrate the highpass effect. •
An equalizer adjusts the relative strengths of different frequency bands in
the output sound. The amount by which a selected frequency band is
boosted or cut is generally indicated in decibels (dB). When an equalizer's
level control is set to 0 dB, no boost or cut is applied to the selected band(s).
Figure 10.5. Frequency response of (a) lowpass, and (b) highpass shelving filters
(boost and cut curves around the center frequency).
Note that lowpass and highpass filters typically attempt to remove a
frequency band completely. A shelving filter, however, does not try to
remove the frequency components completely; rather, the selected bands are
boosted or suppressed while the other bands are left unchanged.
Many audio systems have a mid control, in addition to bass and treble. This
control uses a bandpass filter, typically known as a peaking filter, that
boosts or cuts the mid-frequency range. The gain characteristic of a typical
peaking filter is shown in Fig. 10.6. In consumer audio systems, the
passbands and stopbands of the filters are fixed during the manufacturing
process, and hence the user has no control over them. Tone control systems
can be implemented by placing the filters in series or in parallel; a simple
sketch of a band boost in this spirit is given after Fig. 10.6.
Figure 10.6. Frequency response characteristic of a peaking filter.
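The boost behaviour of a shelving-style control can be mimicked with the
FIR tools already used in this chapter. The sketch below boosts the band
below 200 Hz by about 6 dB while passing the rest of the signal unchanged;
the cutoff, gain, and sampling rate are illustrative assumptions, and this is
not a true IIR shelving design.

fs = 44100;                   % assumed sampling rate
lp = fir1(64, 200/(fs/2));    % lowpass filter isolating the bass band
g = 10^(6/20) - 1;            % additional gain so the band is boosted by 6 dB
y = x + g*filter(lp, 1, x);   % x: input audio vector; other bands unchanged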
Graphic Equalizers
Graphic equalizers are more sophisticated than tone control systems. The
input signal is typically passed through a bank of 5-7 bandpass filters, as
shown in Fig. 10.7. The outputs of the filters are weighted by the
corresponding gain factors, and added to reconstruct the signal. The filters
are characterized by their normalized cut-off frequencies; therefore, the
same filters will work for audio signals with different sampling frequencies.
• Example 10.2
Consider the audio signal test44k, whose waveform and spectrum were
shown in Chapter 4 (see Figs. 4.1 and 4.2). The sampling frequency of the
audio signal is 44.1 kHz. A 5-band equalizer will be designed using a bank
of filters. The cut-off frequencies of the 5 bandpass filters are chosen as
shown in Table 10.1.
(Figure 10.7. Schematic of a graphic equalizer: the digital audio input is
passed through a bank of bandpass filters, weighted, and summed to form
the equalized audio output.)
Table 10.1. Cut-off frequencies of the bandpass filters
Filter #   Cut-off frequency (Hz)     Normalized cut-off frequency
           Lower       Upper          Lower       Upper
1          20          1200           0.0009      0.0544
2          1200        2500           0.0544      0.1134
3          2500        5000           0.1134      0.2268
4          5000        10000          0.2268      0.4535
5          10000       20000          0.4535      0.9070
Note that the frequencies are chosen in accordance with octave spacing,
which is appropriate given the approximately logarithmic frequency
sensitivity of the human ear (see Chapter 2). FIR filters, each of order 32
(i.e., 33 taps), are designed using the MATLAB fir1 function. For example,
the third bandpass filter is designed using the following code:
bpf(3,:) = fir1(32, [0.1134 0.2268]);
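A minimal sketch completing the filter bank of Example 10.2 is given
below; the per-band gain factors are illustrative assumptions.

edges = [0.0009 0.0544 0.1134 0.2268 0.4535 0.9070];  % from Table 10.1
gains = [1.0 2.0 0.5 1.5 1.0];    % assumed per-band gain settings
y = zeros(size(x));               % x: input audio sampled at 44.1 kHz
for k = 1:5
  bpf = fir1(32, [edges(k) edges(k+1)]);  % 33-tap bandpass filter
  y = y + gains(k)*filter(bpf, 1, x);     % weight and sum the band outputs
end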
(Figure: overall frequency response of the 5-band equalizer for the selected
gain settings.)
The complete MATLAB code, as well as the equalized audio output (in
wav format), is provided in the CD. The audio output demonstrates that the
different audio bands are boosted according to the selected gain factors. •
Noise reduction techniques are employed to reduce the noise contained in
a signal and improve the quality of the audio signal.
There are several enhancement techniques for noise reduction. Typical
methods include:
i) Spectral Subtraction: This technique suppresses noise by subtracting
an estimated noise bias found during non-speech activity. This method
is described later in detail.
ii) Wiener Filtering: This minimizes the overall mean square error in the
process of inverse filtering and noise smoothing. The method focuses
on estimating the model parameters that characterize the speech signal,
and requires a priori knowledge of the noise and speech statistics.
iii) Adaptive Noise Canceling: This method employs an adaptive filter
that acts on a reference signal to produce a noise estimate, which is
then subtracted from the primary input. Typically, the LMS algorithm
is used in the adaptation process: the weights of the filter are adjusted
to minimize the mean square energy of the overall output.
In this section, two techniques, namely digital filtering and the spectral
subtraction method (SSM), are presented.
The gain characteristic of the 129-tap filter is shown in Fig. 10.11(a), and
the power spectral density of the filtered output is shown in Fig. 10.11(b). It
is observed that although the signal strength has been reduced at 8 kHz, the
noise is still present (it can be heard when the audio file is played).
Figure 10.11. Filtering of the noisy audio signal: (a) gain response of the filter,
and (b) power spectral density of the filtered signal.
In order to suppress the noise further, a 201-tap FIR filter (i.e., of filter
order 200) is employed, whose gain characteristic is shown in Fig. 10.12(a).
This filter has a sharp bandstop attenuation. The power spectral density of
the corresponding output is shown in Fig. 10.12(b). It is observed that the
noise component has been suppressed completely (unfortunately, the
neighboring frequency components have also been suppressed). When the
output file is played, the noise can no longer be detected. •
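A sketch of such a bandstop design in MATLAB is shown below; the exact
band edges used in the example are not given in the text, so the 7.5-8.5 kHz
band here is an assumption.

fs = 22050;                                   % assumed sampling rate of the noisy signal
bsf = fir1(200, [7500 8500]/(fs/2), 'stop');  % order 200 -> 201 taps
y = filter(bsf, 1, x);                        % x: noisy audio vector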
Eq. (10.3) states that if the noise spectrum (both amplitude and phase)
were known exactly, the spectrum of the noise-free signal could be obtained
simply by subtracting the noise spectrum from the spectrum of the noisy
speech signal. In practice, however, only an estimate of the noise amplitude
spectrum is available. In SSM, the spectrum of the noise-reduced signal is
estimated as

$$\hat{S}(\Omega) = g(\Omega)\,Y(\Omega), \qquad g(\Omega) = \frac{|Y(\Omega)| - |\hat{N}(\Omega)|}{|Y(\Omega)|} \qquad (10.4)$$

where Y(Ω) is the spectrum of the noisy signal and |N̂(Ω)| is the estimated
noise amplitude spectrum. When the estimated noise amplitude equals the
noisy-signal amplitude, g(Ω) = 0. However, in a highly noisy environment,
the noise level may exceed the signal level, resulting in a negative gain that
does not make much sense. Hence, in SSM, it is required that g(Ω) ≥ λ,
where λ is a threshold in the range [0.01, 0.1].
(Figure 10.12. (a) Gain response of the 201-tap bandstop filter (filter order
200), and (b) power spectral density of the filtered output.)
Eq. (10.4) assumes that the estimated amplitude of the signal at a given
frequency is the amplitude of the noisy signal modulated by g(Ω); the phase
of the signal is assumed to be identical to that of the noisy signal. Fig.
10.13 shows the block schematic of the spectral subtraction method, which
has three major steps [3]. First, the noise spectrum is estimated while the
speaker is silent (it is assumed that the noise spectrum does not change
rapidly). The noise spectrum is then subtracted from the amplitude spectrum
of the noisy input. Finally, using this new amplitude spectrum together with
the phase spectrum of the original noisy signal, the time-domain audio is
recovered by taking the inverse Fourier transform.
(Figure 10.13. Block schematic of the spectral subtraction method.)
• Example 10.4
The performance of the SSM for suppressing noise is now evaluated with
the audio signal noisy_audio2. The waveform of the noisy speech signal is
shown in Fig. 10.14. The time duration of this test signal is about 2.7
seconds, and the sampling frequency is 22,050 Hz. It is observed that there
are gaps in the speech waveform; these represent the silence periods in
which the noise can be heard. The power spectral density of the noise,
estimated separately, is shown in Fig. 10.15. It is observed that the noise
occupies a wide band of frequencies. Hence, direct digital filtering is not
appropriate for enhancing this signal.
(Figure 10.14. Waveform of the noisy speech signal.)
The SSM is applied to enhance this signal. The gaps in the speech signal
are used to estimate the background noise spectrum. A convenient way to
do this is to choose a frame length short enough that many frames cover
segments of the signal that contain no speech. Assuming such a frame
length has been chosen, the spectrum of each frame is calculated. For each
frequency bin, the minimum value of the spectral component over all frames
is then found, and this minimum value is used as the estimate of the noise.
Note that this estimate tends to be an underestimate of the noise; therefore,
it is multiplied by a scale factor to obtain the noise spectrum that is used in
the SSM. For this example, good results are obtained with a scale factor of
20. Since the noise may have time-varying statistics, in some cases the noise
estimation should be restarted from time to time.
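A minimal MATLAB sketch of the per-frame subtraction step (following
Fig. 10.13) is given below; frame, Nhat, and lambda are assumed to have
been prepared as described above.

Y = fft(frame);          % spectrum of one windowed 512-sample frame
mag = abs(Y);
g = max((mag - Nhat)./max(mag, eps), lambda);  % subtraction gain, floored at lambda
S = g .* Y;              % scale the amplitude; the noisy phase is kept
clean = real(ifft(S));   % enhanced time-domain frame (overlap-add these frames)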
(Figure 10.15. Power spectral density of the noise signal.)
The MATLAB code for the SSM processing is included in the CD. The
audio frame length is chosen to be 512, with an overlap of 50% between
two consecutive audio blocks. The value of λ is set to 0.025. The window
function is derived from a Hamming window.
(Figure 10.16. Waveform of the enhanced speech signal.)
Figure 10.17. Power spectral density (in dB) of the original and enhanced signals.
• Example 10.5
In Example 2.2, a small MIDI file was created that generates the note G3
played on an Electric Grand Piano, with velocity (which relates to volume)
100. In this example, a MIDI file will be created that generates the note E6
played on a Rhodes Piano with velocity 68.
MIDI File in Example 2.2
4D 54 68 64 00 00 00 06 00 01 00 01 00 78 4D 54 72 6B 00 00 00 14
01 C3 02 01 93 43 64 78 4A 64 00 43 00 00 4A 00 00 FF 2F 00
In the original file, the instrument byte following the program-change status
byte (0xC3) selects the Electric Grand Piano; to generate a note for the
Rhodes Piano, this instrument number has to be changed to 0x04. The 6th
byte, 0x43, generates note G3; in order to create note E6, this byte should be
changed to 0x64 (see Table 10.4 for the decimal equivalents of the piano
note numbers). The velocity of 100 is due to the 7th and 10th bytes (0x64
each) in the second line; in order to create a velocity of 68, these bytes
should be changed from 0x64 to 0x44.
The new MIDI file can now be represented in hex format as follows:
4D 54 68 64 00 00 00 06 00 01 00 01 00 78 4D 54 72 6B 00 00 00 14
01 C3 04 01 93 64 44 78 4A 44 00 43 00 00 4A 00 00 FF 2F 00
The MIDI file can be created using the following MATLAB code (the
original code used a hex2dbytes helper supplied on the CD; the hex-to-byte
conversion is written out explicitly here):
hexstr = ['4D5468640000000600010001', '00784D54726B0000001401C3', ...
          '0401936444784A44004300004A0000FF2F00'];
data = uint8(hex2dec(reshape(hexstr, 2, []).'));  % hex string -> byte array
fid = fopen('ex10_5.mid', 'wb');
fwrite(fid, data);
fclose(fid);
The ex10_5.mid file is included in the CD. The MIDI files ex2_2.mid and
ex10_5.mid can be played to hear the difference between the two files. •
Table 10.4: Note numbers for the piano. The entries are expressed in decimal format.
Octave  C    C#   D    D#   E    F    F#   G    G#   A    A#   B
-2      0    1    2    3    4    5    6    7    8    9    10   11
-1      12   13   14   15   16   17   18   19   20   21   22   23
0       24   25   26   27   28   29   30   31   32   33   34   35
1       36   37   38   39   40   41   42   43   44   45   46   47
2       48   49   50   51   52   53   54   55   56   57   58   59
3       60   61   62   63   64   65   66   67   68   69   70   71
4       72   73   74   75   76   77   78   79   80   81   82   83
5       84   85   86   87   88   89   90   91   92   93   94   95
6       96   97   98   99   100  101  102  103  104  105  106  107
7       108  109  110  111  112  113  114  115  116  117  118  119
8       120  121  122  123  124  125  126  127
The two-track MIDI file can be represented in the hex format below. Note
that the text in parentheses is only for illustration purposes.
(Header chunk) 4D 54 68 64 00 00 00 06 00 02 00 02 00 78
(Track 1) 4D 54 72 6B 00 00 00 14
01 C3 02 01 93 25 64 78 32 64 00 25 00 00 32 00 00 FF 2F 00
(Track 2) 4D 54 72 6B 00 00 00 14
01 C3 03 01 93 43 64 78 18 64 00 43 00 00 18 00 00 FF 2F 00
When the second track is added to the file, the format is changed to 0x02
(10th byte in the first line) so that each track represents an independent
sequence. The number of tracks is also changed to 02 (12th byte in the first
line). A few delta-times and notes are changed to illustrate different sounds.
The Electric Grand Piano is played in the first track (0x02 in line 3), and the
Honky-Tonk Piano is played in the second track (0x03 in line 5). The two
tracks can also be played simultaneously if the format type is changed to
0x01. •
10.5 DIGITAL AUDIO AND MIDI EDITING TOOLS
In this section, a brief overview of a few selected freeware/shareware
audio and MIDI processing tools is presented. Table 10.5 lists a few audio
processing tools, which can be used for simple to more complex audio
processing. Some of these tools can be downloaded freely from the WWW.
Table 10.5. A few audio processing tools
Name of software           Operating system          Features
Glame                      Linux                     Powerful, fast, stable, and easily extensible
                                                     sound editor. Freely available from the WWW.
Digital Audio Processor    Linux                     Freely available from the WWW. Reasonably
                                                     powerful.
Cool Edit                  Win95/98/ME/              Very powerful, easy to use. Capable of mixing
                           NT/2000/XP                up to 128 high-quality stereo tracks with any
                                                     sound card.
Sound Forge XP             Microsoft Windows         Sound Forge XP Studio provides an intuitive,
Studio 5.0                 98SE, Me, or 2000         easy-to-use interface and is designed for the
                                                     everyday user.
gAlan                      Windows                   gAlan allows you to build synthesizers, effects
                           98/XP/2000                chains, mixers, sequencers, drum machines,
                                                     and more.
REFERENCES
1. J. R. Deller, J. G. Proakis, and J. H. Hansen, Discrete-Time Processing of Speech
Signals, Ch. 8, Maxwell Macmillan, Toronto, 2000.
2. S. F. Boll, "Suppression of acoustic noise using spectral subtraction," IEEE Trans.
on ASSP, Vol. 27, pp. 113-120, April 1979.
3. B. P. Lathi, Signal Processing and Linear Systems, Berkeley Cambridge Press,
1998.
4. J. Eargle, Handbook of Recording Engineering, Van Nostrand Reinhold, New
York, 1992.
5. S. Orfanidis, Introduction to Signal Processing, Prentice Hall, New Jersey, 1996.
6. S. Lehman, Harmony Central: Effects Resources, www.harmony-central.com/Effects/,
1996.
QUESTIONS
1. Name a few techniques typically used for audio processing.
2. Discuss how different filtering techniques can be used for processing audio signals.
3. Record an audio signal (or find one on the web) with a sampling rate of 44,100
samples/sec. Filter the signal with lowpass and highpass filters with cut-off
frequencies of 6000 Hz.
4. Repeat the above problem with bandpass and bandstop filters with cutoff
frequencies of 4000 Hz (lower cutoff) and 8000 Hz (upper cutoff).
5. What is audio equalization? Which methods are typically used for audio
equalization?
6. What is the tone control method? Draw the typical frequency characteristics of
shelving and peaking filters.
7. What is graphic equalization?
8. Record a music signal (or find one from the web), and repeat Example 10.2. Can
you think of ways to improve the performance of the equalization system
implemented for Example 10.2?
9. Discuss typical noise suppression techniques in digital audio.
10. When is digital filtering more suitable for noise suppression?
11. What is spectral subtraction method?
12. Find a noisy audio signal from the WWW. Apply both filtering and spectral
subtraction methods to suppress the noise, and compare their performance.
13. Create a single-track MIDI file that generates note D6 on Chorused piano. Play it
using a MIDI player (such as Windows Media player), and verify if it matches with the
expected sound.
14. Change the note and velocity in the MIDI file created in the previous problem, and play
the files to observe the effect.
15. Consider Example 10.6. Change the MIDI code such that two tracks are played
simultaneously.
Chapter 11
Digital Image and Video Processing
3. Cubic Interpolation
Figure 11.1. One-dimensional interpolation: (a) nearest neighbor
interpolation, and (b) linear interpolation. The amplitude of the signal is
known at points A, B, C, and D; the amplitude at points m and n is
estimated using the interpolation method.
Bilinear Interpolation
In linear interpolation (LI), also known as first-order interpolation, the
amplitude of a pixel is calculated using the weighted average of the 2
surrounding pixels in a scan line. The LI method assumes that the intensity
along a scan line is a piecewise linear function whose nodes are the given
pixel values. Figure 11.1(b) illustrates linear interpolation along a scan line.
Assume that the amplitude of the pixel at the new location m is to be
determined, and that m is at a distance a from the pixel at x-1 (equivalently,
at a distance 1-a from the pixel at x). Linear interpolation then produces the
pixel value
A(1-a) + Ba   (11.1)
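Eq. (11.1) can be tried directly in MATLAB; the values below are
illustrative.

A = 100; B = 140;   % known amplitudes at positions x-1 and x
a = 0.25;           % distance of the new sample m from x-1
m = A*(1-a) + B*a;  % interpolated amplitude: 110
% The same result using the built-in routine:
m2 = interp1([0 1], [A B], a, 'linear');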
• Example 11.1
Consider the airplane256 image of size 256x256 provided in the CD, in
which it is difficult to read the "F16" marking. The NN and LI interpolation
and resampling methods are used to increase the size of the image to
512x512. The interpolated images are shown in Fig. 11.2. Because of the
small print area, it is difficult to judge the quality of the interpolation in
Figs. 11.2(a) and 11.2(b); hence, the tail region is blown up in Figs. 11.2(c)
and (d). The jaggedness can easily be observed in Fig. 11.2(c), while the LI
method produces a smoother image (Fig. 11.2(d)). •
Figure 11.2. Interpolation of the airplane image; the image size is
increased from 256x256 to 512x512. (a) Interpolated image using the NN
method, (b) interpolated image using the LI method, (c) blow-up of Fig. (a),
and (d) blow-up of Fig. (b).
(11.2)
• Example 11.2
In this example, we crop the regions marked in Fig. 11.3. The original
image size is 588x819 (rows x columns). For the rectangular crop, the four
corners of the mask are selected as (60,150), (360,150), (360,550), and
(60,550); the cropped image is shown in Fig. 11.5(a). To extract the faces
using an elliptical mask, the chosen center pixels are (195,300) and
(195,455), and the major (vertical) and minor (horizontal) radii are of length
115 and 100 pixels, respectively. The cropped images are shown in Figs.
11.5(b)-(c). •
Figure 11.4. Masks for image cropping: (a) rectangular mask,
(b) elliptical mask.
Two example images (with 256 gray levels) are shown in Figs. 11.6(a)
and (b). The first image looks dark, whereas the second image has a low
contrast. The histograms of the images are shown in Figs. (c) and (d). It is
observed that the first image has a large number of dark pixels (30% of the
pixels have 0 values), mostly due to the subject's hair; the remaining pixels
are distributed mostly in the gray-level range [60,190]. On the other hand,
the second image looks very bright, as most pixels have values between 200
and 255.
(Figure 11.6. (a) a dark image, (b) a low-contrast image, (c) histogram of
the image in (a), and (d) histogram of the image in (b).)
In this section, two techniques are presented to change the brightness and
contrast of an image. Both methods change the histogram of the image.
However, this is done in two different ways.
A typical piecewise-linear mapping for contrast stretching is

$$y = \begin{cases} \alpha f & 0 \le f < p \\ \beta (f-p) + \alpha p & p \le f < q \\ \gamma (f-q) + \beta(q-p) + \alpha p & q \le f \le N-1 \end{cases} \qquad (11.3)$$

where the slopes α, β, and γ control the stretching in the three gray-level
segments [0,p), [p,q), and [q,N-1].
Figure 11.7. Mapping function for contrast stretching.
• Example 11.3
We want to stretch the contrast of the image shown in Fig. 11.6(a).
Ignoring the peak at the first bin of the histogram, most of the pixels have
gray values in the range [60,190]. If we want to stretch the contrast of the
pixels corresponding to these gray levels, we may use the following
mapping function:

$$y = \begin{cases} 0.17 f & 0 \le f < 60 \\ 1.846(f-60) + 10.2 & 60 \le f < 190 \\ 0.1(f-190) + 250 & 190 \le f < 255 \end{cases} \qquad (11.4)$$
The mapping function is shown in Fig. 11.8(a), and the mapped image is
shown in Fig. 11.9(a). The output image does have an improved contrast,
but it still looks predominantly dark, because the image already had a large
number of dark pixels and the first segment of the mapping function in Eq.
(11.4) increased the number of dark pixels further, which did not help
improve the overall brightness. Let the gray levels be mapped by another
mapping function (mapping-2), defined as follows:

$$y = \begin{cases} 0.17 f & 0 \le f < 60 \\ 1.846(f-60) + 10.2 & 60 \le f < 190 \\ 0.1(f-190) + 250 & 190 \le f < 255 \end{cases} \qquad (11.5)$$
Figure 11.8. Contrast stretching of the image shown in Fig. 11.6(a): (a) input-
output mapping functions, and (b) modified histogram with mapping-2.
Figure 11.10. Mapping of a continuous probability density function.
• Example 11.4
Consider again the images in Figs. 11.6(a) and 11.6(b). We want to
generate images whose pixel gray levels have a uniform density function.
The histeq function in MATLAB is used to perform the histogram
equalization:
image_eq = histeq(inp_image);
The output image and its histogram corresponding to Fig. 11.6(a) are
shown in Figs. 11.11(a) and 11.11(c), respectively. Similarly, the output
image and its histogram corresponding to Fig. 11.6(b) are shown in Figs.
11.11(b) and 11.11(d), respectively. It is observed that the images have a
higher contrast than the originals, and the histograms of the equalized
images are very close to uniform. •
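For readers without the Image Processing Toolbox, a minimal sketch of the
underlying mapping (the scaled cumulative histogram) is shown below; the
input file name is illustrative.

img = imread('dark_image.png');    % illustrative 8-bit grayscale image
h = histc(double(img(:)), 0:255);  % 256-bin histogram
cdf = cumsum(h) / numel(img);      % cumulative distribution of gray levels
lut = uint8(round(255 * cdf));     % equalizing mapping y = T(f)
eq_img = lut(double(img) + 1);     % apply the lookup table to every pixel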
Figure 11.11. Histogram equalization: (a) histogram-equalized image of Fig. 11.6(a),
(b) histogram-equalized image of Fig. 11.6(b), (c) histogram of the image shown in
(a), and (d) histogram of the image shown in (b).
$$y(m,n) = f(m,n) + \lambda\, f_{hp}(m,n) \qquad (11.8)$$

The first component outputs the original image, whereas the second
component outputs a highpass-filtered image. Overall, a high-frequency-
emphasized output is produced.
• Example 11.5
A blurred Lena image is shown in Fig. 11.12(a). The image is passed
through the three sharpening filters F1, F2, and F3 mentioned above. The
sharpened images are shown in Figs. 11.12(b)-(d). It is observed that Fig.
11.12(d) has the most sharpened edges among the four images since its
corresponding high frequency component is strongest. However, it also has
the most noise (the highpass filter amplifies noise as well). Hence, there is a
trade-off between edge sharpening and noise content. •
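The exact F1-F3 kernels are not reproduced in this section; as a hedged illustration only, a commonly used sharpening kernel (identity plus a 4-neighbor Laplacian) plays the same role of adding a highpass component to the original image:

    % Common sharpening kernel (an assumption; not necessarily the text's F1-F3)
    F = [0 -1 0; -1 5 -1; 0 -1 0];
    sharpened = imfilter(blurred_img, F, 'replicate');  % blurred_img is hypothetical
    imshow(sharpened);

Increasing the weight of the Laplacian term strengthens the edges but also amplifies the noise, which is exactly the trade-off observed in the example.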
Digital video is one of the most exciting media to work with, and there are
numerous types of digital video processing techniques. The video
compression techniques, which provide a high storage and transmission
efficiency for various applications, were introduced in Chapter 9. In this
section, we focus on the content manipulation in digital video.
A digital video is a sequence of frames, which are still pictures at specific
time instances. Consider a digital movie of 1-hour duration with a rate of 30
frames/sec; it would contain 108,000 frames. In order to perform a content-
based search, the individual frames may be grouped into scenes and shots,
as shown in Fig. 11.13. A shot is defined as a sequence of frames generated
by a single continuous camera operation.
Figure 11.13. A video sequence can be grouped in terms of scenes and shots. In
order to increase the granularity further, the frames can be divided into regions
or objects, as suggested by the MPEG-4 standard.
Assume that N_R and N_C are the number of rows and columns of the
video frames, and we want to insert K wipe frames between the two
boundary frames. The r-th transition frame in the left-to-right wipe mode
can be calculated from the following equation:

W_LR^(r) = I2(1:N_R, 1 : rN_C/(K+1)) ⊕_H I1(1:N_R, rN_C/(K+1)+1 : N_C)        (11.9)
where ⊕_H denotes the addition of two image blocks coming from two
different frames, arranged in the horizontal direction. Note that Eq. (11.9)
follows the MATLAB convention for specifying the rows and columns. The
schematic of the above operation is shown in Fig. 11.14. Similarly, the wipe
operation in the top-to-bottom mode can be expressed as
W_TB^(r) = I2(1 : rN_R/(K+1), 1:N_C) ⊕_V I1(rN_R/(K+1)+1 : N_R, 1:N_C)        (11.10)
where ⊕_V denotes the addition of two image blocks coming from two
different frames, arranged in the vertical direction.
[Figure 11.14. Schematic of the wipe operation: in the left-to-right mode, the first rN_C/(K+1) columns of the r-th transition frame come from Frame-2, and the remaining columns from Frame-1.]
• Example 11.6
Figure 11.15 shows two frames from two different shots of the video
sequence shown in Fig. 9.2. Assume that Fig. 11.15(a) is the last frame
of one shot, and Fig. 11.15(b) is the first frame of another shot. We would
like to make a transition from shot-1 to shot-2 using the wipe operation.
The wipe operation is carried out with K=7, i.e., with seven transition frames.
The wipe frames corresponding to the left-to-right mode are shown in Fig.
11.16, and the wipe frames corresponding to the top-to-bottom mode are
shown in Fig. 11.17.
The transition frames have been included in the CD. In order to observe
the transition frames in motion, the MATLAB movie function can be used. •
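A minimal MATLAB sketch of the left-to-right wipe of Eq. (11.9), assuming two same-size uint8 grayscale frames in the hypothetical variables I1 and I2:

    K = 7;                                     % number of transition frames
    [NR, NC] = size(I1);
    for r = 1:K
        c = round(r*NC/(K+1));                 % boundary column of the wipe
        W = [I2(:,1:c), I1(:,c+1:NC)];         % Eq. (11.9): I2 enters from the left
        M(r) = im2frame(repmat(W,[1 1 3]));    % store as a truecolor movie frame
    end
    movie(M, 1, 5);                            % play the transition once at 5 frames/s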
Figure 11.15. Two example frames. a) The last frame of the first shot,
b) the first frame of the second shot.
Figure 11.16. Wipe operation from left-to-right. Seven transition
frames are shown in Figs. (a)-(g).
[Figure 11.17. Wipe operation from top-to-bottom. Seven transition frames are shown in Figs. (a)-(g).]
11.3.1.2 Dissolve
In the dissolve operation, the transition frames are calculated using the
weighted average of the frames I1 and I2. The relative weights of the two
frames are varied to obtain different transition frames. Assume again that K
dissolve frames would be inserted for gradual transition. The r-th transition
frame can be calculated using the following equation:

D^(r) = ((K+1-r)/(K+1)) I1 + (r/(K+1)) I2,    r = 1, 2, ..., K        (11.11)

It is observed in the above equation that for smaller r, the weight of I1 is
higher than that of I2. Hence, the transition frames look closer to the start
frame for smaller r. However, as r increases, the weight of the second frame
increases, and the transition frames gradually move closer to I2.
• Example 11.7
Consider again the boundary frames in Example 11.6. In this example, we
would like to achieve the transition using the dissolve operation.
The transition frames, shown in Fig. 11.18, are calculated using Eq.
(11.11) with K=8. Note that there is no directivity involved as in the wipe
operation; all regions are transformed simultaneously. •
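A corresponding MATLAB sketch of the dissolve of Eq. (11.11), with the same assumed frames I1 and I2:

    K = 8;
    for r = 1:K
        a = r/(K+1);                                  % weight of I2 grows with r
        D = uint8((1-a)*double(I1) + a*double(I2));   % weighted average of the frames
        imshow(D); pause(0.2);                        % display the r-th transition frame
    end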
• Example 11.8
Consider the boundary frames in Example 11.6. Here, we would like to
achieve the transition using the fade out and fade in operations.
The fade out transition frames, shown in Figs. 11.20(a)-(e), are calculated
using Eq. (11.12) with K=5. The fade in transition frames, shown in Figs.
11.20(g)-(k), are calculated using Eq. (11.13) with K=5. Note that the
middle frame (shown in Fig. 11.20(f)) is a gray frame, which could also be
black or white. Unlike the wipe operation, no directivity is involved in the
fade operation; all regions transition simultaneously. •
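Eqs. (11.12) and (11.13) are not reproduced above; a plausible reading, consistent with the gray middle frame, is a dissolve toward (and then away from) a constant frame. A hedged MATLAB sketch under that assumption:

    K = 5;
    mid = 128*ones(size(I1), 'uint8');                    % assumed gray middle frame
    for r = 1:K
        a = r/(K+1);
        Fout = uint8((1-a)*double(I1)  + a*double(mid));  % fade out of shot-1
        Fin  = uint8((1-a)*double(mid) + a*double(I2));   % fade in to shot-2
    end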
Table 11.1. Shots with abrupt transitions in the Beverly Hills sequence.

Sequence        Abrupt transition (AT) at frame #
Beverly Hills   13, 17, 29, 46, 54, 80, 88, 125, 141, 155, 170, 187, 200, 221, 252, 269,
                282, 318, 344, 358, 382, 419, 435, 457
Figure 11.20. Fade out and in operations. Figs. (a)-(e) show the fade
out operation, and Figs. (g)-(k) show the fade in operation. Fig. (f)
shows the middle frame.
where f_p(m,n) represents the gray level of the p-th frame, with frame
size M × N. If the distance is greater than a pre-specified threshold, then an
AT is declared between the i-th and j-th frames.
Although pixel intensity matching is simpler to implement, this method is
sensitive to motion and camera operations, which might result in false
detections.
Histogram-based Method
In this technique, the histograms of consecutive video frames are
compared to detect shot changes. The absolute distance between the
histograms of frames i and j is calculated using the following equation:

D_H(i,j) = Σ_{k=0}^{G-1} |H_i(k) - H_j(k)|        (11.15)
where H_p is the histogram of the p-th frame, and G is the total number of
gray levels in the histogram. If the distance is greater than a pre-specified
threshold, then an AT is declared.
Although the histogram-based technique provides good performance, it
fails when the histograms across different shots are similar. In addition, the
histograms within a shot may be different due to changes in lighting
conditions, such as flashes and flickering objects, resulting in false
detection.
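A minimal MATLAB sketch of this detector, assuming the video is available as a cell array frames of grayscale images (the names and the threshold value are hypothetical):

    G = 256;  T = 1e4;                       % gray levels; threshold is data-dependent
    for i = 1:numel(frames)-1
        Hi = imhist(frames{i},   G);
        Hj = imhist(frames{i+1}, G);
        DH(i) = sum(abs(Hi - Hj));           % Eq. (11.15) for consecutive frames
    end
    AT = find(DH > T);                       % candidate abrupt transitions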
Motion Vector Method
The object and camera motions are important features of video content.
Within a single camera shot, the motion vectors generally show relatively
small continuous changes. However, the continuity of motion vectors is
disrupted between frames across different shots. Hence, the continuity of
motion vectors can be used to detect ATs. This is especially useful since
most of the coding techniques perform motion compensation and hence, the
motion vectors are already available.
Here, we outline a technique suitable in a GOP framework (as used in the
MPEG standard) employing I, P, or B frames. The GOP duration in a coded
digital video is generally fixed irrespective of the temporal change. Hence,
the ATs can occur at I-, P-, or B-frames. Note that there are no motion
vectors associated with I-frames. On the other hand, P-frames have only
forward motion vectors, and B-frames have both forward and backward
motion vectors. Hence, the criterion for deciding an AT at I-, P-, or
B-frames is different. We define the following parameters that will be
useful for making a decision.
For P-frames:

R_MP = (Number of Motion Compensated Blocks) / (Total Number of Blocks)        (11.16)

For B-frames:

R_FB = (Number of Forward Motion Compensated Blocks) / (Number of Backward Motion Compensated Blocks)        (11.17)
We assume that an AT does not occur more than once between two
consecutive reference frames. A frame[n] may be considered a candidate for
an abrupt transition if the following holds true.
Case-1: frame[n] is an I-frame
1. The R_FB[n-1] (i.e., the R_FB corresponding to frame (n-1), which is a
B-frame) is large.
Case-2: frame[n] is a P-frame
1. The R_MP[n] is very small, and
It is observed that both methods find most of the abrupt transitions. The
motion vector method is particularly suitable for joint coding and indexing
[7]. The motion vectors are required for video coding anyway; hence, no
extra calculation is required. •
Besides their role in determining video shots, camera operations are also
effective in creating professional-quality digital video. Hence, a brief
overview of different camera operations is presented in the following
section.
[Figure 11.21. Basic camera operations, including dollying, tracking, and changing focal length for zooming.]
Optical flow gives the distribution of velocity, with respect to an observer,
over the points in an image [4]. The motion fields generated by tilt, pan, and
zoom are shown in Fig. 11.22. It is observed that the motion vectors due to
pan and tilt point in one direction. Conversely, the motion vectors diverge
from or converge to a focal point in the case of zoom. Figure 11.22 shows an
ideal situation with no disparity. However, in reality, some disparity might
be observed due to irregular object motion and other kinds of noise. Note
that the effects of object motion and camera operations on motion vectors
are very similar, and hence difficult to distinguish.
[Figure 11.22. Ideal motion fields generated by pan, tilt, and zoom camera operations.]
11.5 SUMMARY
A wide variety of image and video processing algorithms are available in
the literature. In the first part of this Chapter, a few selected image
processing algorithms - namely resizing, cropping, brightness and contrast
improvement, and sharpening - were presented. This was followed by a
brief overview of selected video processing algorithms. Here, the focus was
again on popular algorithms. Several algorithms for shot/scene transition
were discussed. This was followed by an overview of video segmentation
techniques, including the effect of camera motions on the motion vector
field, and how this effect can be exploited to obtain video segmentation.
Finally, a few image and video editing/processing package tools and their
key features were tabulated.
Table 11.2. A few selected freeware/commercial video editing tools.
Adobe Premiere (Windows): An excellent tool for professional digital video
editing. It supports Photoshop layers, native Photoshop files, and Illustrator
artwork.
Video Wave 5.0 (Windows 2000/XP): Excellent video-editing tools. Includes
high-quality multimedia bundles; clear, intuitive layout.
Final Cut Pro 3 (Mac OS 9/OS X): Another excellent video editing package.
Real-time effects playback on DV without PCI hardware.
VideoStudio 6.0 (Windows 98/2000/ME/XP): Video-editing software that can
trim video, add soundtracks, create compelling titles and drop in stunning
effects. Easy-to-learn interface.
REFERENCES
1. A. K. Jain, Fundamentals of Digital Image Processing, Prentice Hall, 1989.
2. R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison Wesley,
1993.
3. W. K. Pratt, Digital Image Processing, John Wiley and Sons, Third Edition, 2001.
QUESTIONS
1. Create a gray level image or download one from the WWW. Create two images
with size 60% and 140% of the original image in both horizontal and vertical
directions.
2. Select a rectangular and an elliptical region in the image from the previous
problem. Crop the selected regions from the image.
3. Create a dark gray level image or download one from the WWW. Plot the gray
level histogram of the image. Using the contrast stretching method, increase the
average brightness level of the image. Compare the histograms of the original and
contrast-stretched images.
4. Using the MATLAB "histeq" function, perform the histogram equalization of the
image. Compare the histograms of the original and equalized images.
5. Install a few image processing tools (several of these tools are downloadable from
the WWW) listed in Table 11.1. Compare their features.
6. Create two images (or download them from the web). Calculate seven transition
frames corresponding to the right-to-left and bottom-to-top wipe modes. Display the
transition frames using the MATLAB movie function.
7. Consider the previous problem. Calculate the transition frames corresponding to the
dissolve operations.
8. Consider the previous problem again. Calculate the transition frames corresponding
to the fade out and fade in operations.
9. What is video segmentation? Describe a few techniques that are typically used to
segment a video.
10. Create a digital video sequence or download it from the WWW. Calculate and plot
the difference of histograms of the consecutive frames and choose the peaks as the
location of the abrupt transitions. Verify manually the correctness of the AT
locations determined by the histogram method.
11. Explain the different types of camera motions that are used in a video. Draw the
schematic of motion fields for different types of camera motions.
12. Consider a still image, and divide it into small blocks of size 4x4. Transform the
image such that the motion vectors between two consecutive frames match the
motion vectors for the zoom operation. Display the image frames using the MATLAB
movie function.
Chapter 12
ANALOG AND DIGITAL TELEVISION
The existing television standards, which are analog in nature, have been
around for the last six decades. There are three analog television (TV)
standards in the world [1]. The NTSC standard is used primarily in North
America and Japan, the SECAM system is used in France and Russia, and
the Phase Alternation Line (PAL) system is used in most of Western Europe
and Asia. The working principles of these three systems are very similar,
differing mostly in the sampling rate, image resolution, and color
representation.
The block schematic of a television system is shown in Figs. 12.1 and
12.2. It typically has three subsystems:
i) TV Camera: For capturing the video.
ii) Transmission System: For transmitting the video from the camera
(or TV station) to the TV receiver.
iii) TV Receiver: For receiving and displaying the video.
[Figures 12.1 and 12.2. Block schematic of a television system, showing the composite video signal, vertical sync, the RF channel, and microphone inputs.]
For a color television system, the color video frames are passed through
red, green and blue filters (as shown in Fig. 12.2) to create three 2-D image
signals. Three separate electron beams are then employed to scan each color
channel. However, in many systems, a single electron beam scans all three
channels with time multiplexing.
On the receiver side (see Fig. 12.3), the TV signal is captured by the
antenna, and received by an RF tuner. The signal is passed through a video
intermediate frequency (I.F.) system that demodulates the signal and
separates the video and audio signal. This is followed by the video amplifier
stage that separates the luminance and the chrominance signals (for color
TV), and provides the horizontal and vertical sync pulses to the sync
separator. The sync separator passes these on to the vertical and horizontal
deflection systems. The video amplifier produces three color channels (R, G,
and B). These signals are then fed to the corresponding electron guns in the
TV receiver, which emit beams of electrons whose vertical and horizontal
positions are controlled by the horizontal and vertical deflection systems.
The electron beams corresponding to the R, G, and B channels hit the
corresponding red, green and blue dots on the phosphor screen (more details
about the display device are provided in Chapter 15). The phosphor screen
then emits light and we see the moving pictures.
We have noted that the 2-D images are scanned and converted to a 1-D
signal for transmission. There are primarily two scanning methods:
progressive and interlace. Figure 12.4 explains these two scanning methods.
Progressive scanning is generally used in computer monitors with a frame
Figure 12.4. Progressive and interlace scanning. a) Progressive scanning for
an 8-line display, b) interlace scanning for an 8-line display, and c) interlace
scanning for NTSC television.
However, the two color channels are generated differently in NTSC, PAL
and SECAM systems. The generation of the chrominance signals in
different TV systems is presented below.
The angle 33° has been chosen to minimize the bandwidth requirement
for the Q signal. In summary, the {Y, I, Q} components for the NTSC TV
transmission system can be calculated from the normalized camera output
using the following matrix multiplication (the standard coefficient values
are assumed here, as the original matrix is not legible in the source):

[Y]   [0.299   0.587   0.114] [R]
[I] = [0.596  -0.274  -0.322] [G]
[Q]   [0.211  -0.523   0.312] [B]
The envelope of the chroma signal C(t) relative to the luminance signal
Y(t), i.e., √(I² + Q²)/Y, approximately provides the saturation of the color
with respect to the reference white. On the other hand, the phase of C(t),
i.e., tan⁻¹(Q(t)/I(t)), approximately provides the hue of the color. In
addition to the decreased bandwidth, there is another advantage of using the
{I, Q} signal. The chroma signal is generally distorted during the
transmission process. However, the HVS is less sensitive to saturation
distortion than to hue distortion.
The composite video signal for the NTSC transmission is formed by
adding the luminance and the chroma signal as follows:

v_NTSC(t) = Y(t) + C(t)
The {I, Q} coordinates are shown in Fig. 12.5 [4]. The choice of the {I, Q}
coordinates relates to the HVS characteristics as a function of the field of
view and spatial dimensions of the objects being observed. The color acuity
decreases for small objects. Small objects, represented by frequencies above
1.5 MHz, produce no color sensation at the HVS. Conversely, intermediate
sized objects, represented by frequencies in the range 0.5-1.5 MHz, are
observed satisfactorily if reproduced along the orange-cyan axis. Large
objects, represented by frequencies 0-0.5 MHz, require all three colors for
good color reproduction. The {I, Q} coordinates have been chosen
accordingly to achieve superior performance.
Note that the chrominance signals {U, V} are not rotated in the case of
the PAL system (see Fig. 12.6). As a result, the {U, V} signals require a larger
bandwidth. In practice, each chrominance channel requires approximately
1.5 MHz bandwidth. The chroma signal is formed as

C(t) = U cos(2πf_c t) + V sin(2πf_c t)
"" ""-
OJ 535
islD 550
0.7
505
Si5
o.e
500
"
"
580
u.s
495 Ife
D.4
/ ~
,t- Oa
\4110 /'
7:
D.3 ~/ )
D.2
is ~
0.1 \m f ........- ~
V
0
0.1
\. lJ--D.2 0.3 D.4 u.s o.a 0.7 D.J
X
Figure 12.5. Color coordinates used for NTSC TV transmission system. The color
difference coordinates {U v,} are rotated to generate the {I, Q} signals.
"
(510
~
- 535
I'--- m
" ""
I. 7
S05
5e5
5DD
510
\.115
I~
" "'-
-
\ • .0
--
-........
\ / R ....- .>
--
-......
1.2
, \m lUI V
a.
y ....-
I., 1.2 11.3 1.7 D.I
Figure 12.6. Color coordinates used for PAL and SECAM TV transmission systems .
• Example 12.1
In this example, the advantage of the {Y, I, Q} and {Y, U, V} channels is
demonstrated. The original color Lena image is represented in {R, G, B}
space, and the color image is decomposed into {Y, I, Q} and {Y, U, V}
components. The average energy of each component is shown in Table 12.3.
It is observed that the energy is distributed among the R, G, and B
components. However, in the {Y, I, Q} and {Y, U, V} color spaces, most of the
energy is compacted in the Y component. Figure 12.7 shows the Y, I, and Q
images. It is observed that the I and Q components have a smaller magnitude.
The histograms of the different components are shown in Fig. 12.8. The
MATLAB code used to obtain these results is included in the CD. Readers
can expect similar results with most images.
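The book's own MATLAB code for this example is included in the CD; as a minimal stand-in sketch (the file name is hypothetical), the same measurement can be made with the Image Processing Toolbox function rgb2ntsc, which converts {R, G, B} to {Y, I, Q}:

    rgb = im2double(imread('lena.png'));      % hypothetical input file
    yiq = rgb2ntsc(rgb);                      % {R,G,B} -> {Y,I,Q}
    for k = 1:3
        c = yiq(:,:,k);  c = c - mean(c(:));  % zero mean, as in Table 12.3
        E(k) = sum(c(:).^2);                  % energy of the k-th component
    end
    E_percent = 100*E/sum(E)                  % percentage energy of Y, I, Q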
Table 12.3. Energy compaction ability of different color systems for the Lena image.
The entries are the percentage of the total energy in each component. Signals are
made zero mean before calculating the energy.

        First component   Second component   Third component
        (R/Y)             (G/I/U)            (B/Q/V)
RGB     37.82             43.95              18.24
YIQ     85.40             11.04              3.56
YUV     87.36             5.40               7.24
[Figure 12.7 shows the Y, I, and Q component images; Figure 12.8 shows the histograms of the different components.]
The frequency of the sound carrier is 4.5 MHz above the frequency of the
video carrier.
Figure 12.10. NTSC signal spectrum with 6 MHz channel assignment. The Y
(luminance) signal is modulated by AM vestigial sideband, with the video carrier
located 1.25 MHz above the lower channel edge. The chroma carrier is 3.58 MHz,
and the audio (FM) carrier 4.5 MHz, above the video carrier. The I and Q signals
have double-sided bandwidths of approximately 2.6 MHz and 1 MHz, respectively.
The modulated I and Q signals are then combined to produce the chroma
signal. In composite video, the chroma signal is superimposed on the
luminance signal, as shown in Fig. 12.9. As mentioned in the previous
section, the NTSC signal has an overall bandwidth of 6 MHz.
Figure 12.11. A typical NTSC encoder. The R(t), G(t), and B(t) signals pass
through a matrix, and a sync/timing/blanking generator adds the subcarrier,
sync, setup, and burst to produce the NTSC output.
[Figure 12.12. Generation of the luminance signal Y(t) = 0.299R(t) + 0.587G(t) + 0.114B(t) in the first stage, and of the chroma signal C(t) by subcarrier modulation in the second stage.]
demodulation of C(t) to recover I(t) and Q(t). The R, G, and B signals are
then obtained from the {Y, I, Q} signals by a matrix multiplication.
[Figure 12.13. A typical NTSC decoder: the composite signal is separated into the luminance Y(t) and chroma C(t) components, and a matrix recovers R(t), G(t), and B(t).]
Figure 12.14. In the Y/C model, the resultant chroma is kept separate from the
luminance component. It is much easier to separate the two components, and
better picture quality is often achieved (compared to composite video).
digital systems a significant advantage over analog systems. The picture quality
is much better compared to analog TV. In addition, multilingual closed
captioning is possible with DTV. Finally, DTV paves the way for eventual
computer-TV integration, and it is highly likely that in a few years' time,
there will not be much difference between a computer and a television.
Digital TV has two main formats: standard definition and high definition.
Standard definition TV (SDTV) provides almost twice the resolution of
analog TV. Even more impressive, high-definition TV (HDTV) provides
almost four times the resolution of analog TV. Note that the working
principles of the SDTV and HDTV standards are the same; the only
difference is the higher resolution of the HDTV standard.
Some of the features that have been incorporated in the various HDTV
standards are as follows:
• Higher spatial resolution
• Higher temporal resolution (up to 60 Hz progressive)
• Higher aspect ratio (16:9 compared to existing 4:3)
• Multichannel CD-quality surround sound with at least 4-6 channels
• Reduced artifacts as compared to analog TV by removing composite
format and interlace artifacts
Digital TV Transmission
Figure 12.15 shows a generic block schematic of a terrestrial DTV/HDTV
system, which has been adopted by the ITU-R. The DTV system has three
main subsystems:
1. Source coding and compression
2. Service multiplex and transport
3. RF/Transmission
The source coding and compression subsystem employs bit rate reduction
methods to compress the video, audio, and ancillary digital data. The term
"ancillary data" generally refers to the control data, conditional access control
data, and data associated with the program audio and video services, such as
closed captioning.
The service multiplex and transport subsystem divides the digital data stream
into packets of information. Each packet is labeled properly so that it can be
identified uniquely. The audio, video, and ancillary data packets are
multiplexed to generate a single data stream.
The RF/Transmission subsystem performs the channel coding and
modulation. The channel coder takes the input data stream (which has been
compressed), and adds additional information so that the receiver is able to
reconstruct the data even in the presence of bit errors due to channel noise.
On the receiver side, the compressed bit-stream is received and decoded. The
decoded signal is then displayed on the DTV/HDTV monitor.
Like analog television standards, there are three major DTV/HDTV
standards in the world [7]. In Japan, the HDTV standard employs the MUSE
(multiple sub-Nyquist sampling encoding) technique. In Europe, the
standard is known as DVB (digital video broadcasting). In North America,
the system is known as the ATSC (Advanced Television Systems Committee)
DTV standard.
[Figure 12.15. A generic digital terrestrial television broadcasting model. The video and audio subsystems perform source coding and compression; the service multiplex combines the audio, video, ancillary data, and control data into one stream, which is then modulated in the RF/transmission subsystem.]
Figure 12.15. A generic digital terrestrial television broadcasting model.
The DVB-S (satellite) system employs the QPSK (quadrature phase shift
keying) technique, and the DVB-C (cable) system employs the QAM
(quadrature amplitude modulation) technique.
The development of HDTV in North America was started by the Federal
Communications Commission (FCC) in 1987. It chartered an advisory
committee (ACATS) to develop an HDTV system for North America.
Testing of proposed HDTV systems was performed during 1990-1993. The
Grand Alliance [8] was formed in 1993 to develop an overall superior HDTV
standard combining these proposed systems. An overview of the ATSC
HDTV standard is presented below.
The interlaced formats are supported in the standard for a smooth transition
from the interlaced NTSC to the progressive HDTV standard. The progressive scan
[Figure 12.16. Simplified schematic of the ATSC (or GA) HDTV system: an adaptive preprocessor, a coarse motion estimator, and a forward analyzer (rate control, perceptual modeling) feed the video coding loop (I, P, B frames); the input audio is compressed by an AC-3 audio encoder, and the video and audio buffers feed the HDTV channel.]
Figure 12.16. Simplified schematic of the ATSC (or GA) HDTV system.
The audio data in the GA HDTV system is encoded using the AC-3
coding algorithm. The audio, video, and ancillary data are then multiplexed
and modulated for transmission through terrestrial, cable or satellite
systems.
The decoding of HDTV video is the inverse of the encoding process. First,
the HDTV signal is demodulated. The demodulated bitstream is then
separated into audio, video, and ancillary bitstreams. The audio decoder (see
Chapter 7) and video decoder (see Chapter 9) are then applied to reconstruct
the audio and video signals, which are then fed to the loudspeakers and the
display system (e.g., cathode ray tube) for watching a channel on the TV
screen.
[Figure 12.17. HDTV transport packet: the actual video data plus header fields providing packet synchronization; type of data in packet (audio, video, data map tables); packet loss/misordering protection; encryption control; priority (optional); time synchronization; media synchronization; random access flag; and bit-stream splice point flag.]
Figure 12.17. HDTV transport packet.
The spectrum of the HDTV signal is shown in Fig. 12.18. Note that the
spectrum is flat except at the two edges, where a nominal square root raised
cosine response results in 620 kHz transition regions. The suppressed carrier
of the VSB is a small pilot tone, located 310 kHz from the lower band edge,
where it can be hidden from today's NTSC television receivers. The VSB
signal can be protected from interference from the strong energy of NTSC
carriers by comb filters at the receiver.
REFERENCES
1. J. Watkinson, Television Fundamentals, Focal Press, Boston, 1996.
2. M. Robin and M. Poulin, Digital Television Fundamentals: Design and Installation of
Video and Audio Systems, McGraw-Hill, New York, 1998.
QUESTIONS
1. How many television systems have been developed worldwide? Compare and
contrast these standards.
2. Explain horizontal and vertical scanning.
3. What is the advantage of interlaced video over progressive video? What is the
disadvantage?
4. How long are the horizontal and vertical retrace periods in NTSC and PAL systems?
What happens during the horizontal and vertical retrace period?
5. What are the advantages of using {Y, I, Q} and {Y,U,V} color spaces over {R, G, B}
color space for TV transmission?
6. Choose an image of your choice. Repeat Example 12.1 and determine the energy
compaction performance of the {Y, I, Q} and {Y,U,V} color spaces. You may use the
MATLAB code provided in the CD.
7. Sketch the NTSC signal spectrum. Why do I and Q channels require less bandwidth
compared to Y channel?
8. How is composite video generated in the NTSC system? How is the chroma signal
generated?
9. Why is a good comb filter necessary for a good quality TV receiver?
10. Compare and contrast component, composite, and S-video.
11. What are the advantages of digital television over analog television?
12. Draw the schematic of a generic digital TV system. Explain the operation of different
subsystems.
13. How many HDTV systems have been developed worldwide? Compare them.
14. Draw the schematic of the GA HDTV encoding system.
15. Draw the schematic of the GA HDTV decoding system.
16. What is the packet length of the HDTV bitstream? What are the advantages and
disadvantages of shorter and longer packets?
17. Draw a typical HDTV signal spectrum. Explain the different parts of the spectrum.
Where is the pilot carrier located?
18. Explain the channel coding and modulation techniques of the GA HDTV system.
19. Explain the principles of the Reed Solomon code. What is the maximum number of
errors that can be corrected by (p,q) RS code?
20. Explain the principle of the Trellis code.
Chapter 13
CONTENT CREATION AND MANAGEMENT
[Figure: a generic multimedia authoring environment, including paint/draw tools, a word processor, a database, and audio/video capture and editing tools.]
Collecting Content Material: From the content lists created in the design
stage, the author must collect all the content material for the project.
Content material is obtained either by selecting it from available internal
sources such as libraries or creating it in-house for the project.
Several multimedia creation tools are available commercially for
developing content. Tools can be categorized according to the operations
they perform on the basic multimedia objects. There are five broad
categories of multimedia creation tools:
i) Painting and Drawing Tools
ii) Audio Editing Programs (see Chapter 10)
iii) 3-D Modeling and Animation Tools
iv) Image Editing Tools (see Chapter 11)
v) Animation, Video and Digital Movies (see Chapter 11)
A few select audio, image and video editing tools have been presented
in Chapters 10 and 11. More tools can be found in the references [3].
Assembly: The entire application or project is put together in the
assembly stage. Presentation packages (for example, PowerPoint) do their
assembly while you are entering the content for various screens. Once all
the screens are assembled and placed in order, the presentation is ready.
Testing: Once the application is built and all content materials are
assembled, the project is tested to ensure that everything works and looks
as intended. An important concern in testing is to make sure the
application works in the actual environment of the end user, not just on the
authoring machine.
Distribution: If the application is intended to be run by someone else on a
different machine, the authoring system needs to have an option to make
this possible.
[Figure: multimedia authoring workflow from concept to design, combining vector/bitmapped graphics, audio, and text.]
Figure 13.4 shows the view of the ToolBook design window for a
Windows environment. This tool is conceptually easy to understand and has
a library of basic behaviors, e.g., mouse-click and transitions. It has its own
programming language, OpenScript, that is easy to learn. It can import a
variety of different types of media, and is convenient for applications such
as computer-based training, marking and assessment. The multimedia
projects developed by ToolBook are called Books. A Book is divided into
pages, and the pages can contain text fields, buttons and other multimedia
objects. In Fig. 13.4 you can see the buttons previous [A] and next [B]; by
clicking on these buttons you can navigate through the pages of the Book.
Different objects [C] in the catalog can be placed on the pages of the Book
to create an interactive presentation.
[Figure 13.4. View of the ToolBook design window. A: previous button, B: next button, C: objects in the catalog.]
Figure 13.5. View of Authorware. A: Icons in the design window, and B: Icon palette.
In time-based authoring tools, sequentially organized graphic frames are
played back at a speed that can be set. Other elements (such as audio events)
are triggered at a given time or location in the sequence of events.
Macromedia Director is a popular time-based tool, developed by
Macromedia for the Macintosh/Windows environment. It has a library of
basic behaviors, e.g., mouse click and transitions. Director has its own
programming language called Lingo, which is easy to learn, and it can import
different types of media. In Director, the assembling and sequencing of the
elements of the project are done by the Cast [A] and Score [B]. The cast
contains multimedia objects such as still images, sound files, text, palettes,
and QuickTime movies. The score is a sequencer for displaying, animating
and playing cast members. Figure 13.6 shows the cast window and score
window of Macromedia Director, which are important for developing a
multimedia presentation.
[Figure 13.6. The cast and score windows of Macromedia Director.]
When a message is sent to them, the objects perform their jobs without
external input. Object-oriented tools are particularly useful for games, which
contain many components with many personalities. Examples of a few
object-oriented tools are:
mTropolis (for Mac/Windows) - developed by Quark
Apple Media Tool (for Mac/Windows) - developed by Apple
MediaForge (for Windows) - developed by ClearSand Corporation
Figure 13.8. Transferring knowledge between platforms.
[Figure 13.9. SGML document processing: from the information (source document) to the presentation (target document).]
Figure 13.9. SGML document processing: from the information to the presentation.
[Figure 13.10. Authoring is the process of creating multimedia applications; the reader de-linearizes the presented material through the available entry and exit points.]
Figure 13.11. Organization of information/knowledge units. a) Linear,
and b) nonlinear organizations.
If a link consists of text and video data, then this is a hypermedia, multimedia,
and hypertext system. The World Wide Web (WWW) is the best example of
a hypermedia application.
[Figure: a hypertext example. In the sentence "The popular games in Canada are ice-hockey, ...", anchors link to related documents, e.g., information about the Olympics.]
13.4.4 HTML
The hypertext mark-up language (HTML) is the primary language [9] of
all web pages on the WWW today. It can be considered a (simplified)
subset of SGML, with an emphasis on document formatting. HTML
consists of a set of tags that are converted into meaningful notations and
formats when displayed by a web browser. An example of an HTML
document is given in Table 13.3, demonstrating that every start tag has an
end tag. These two tags instruct the browser where the effect of a specific
tag begins and ends.
Example 13.1
A small HTML document is shown in Table 13.3. When the HTML
document is viewed through a browser, a table will be displayed showing
the names and courses taken by a group of students (see Fig. 13.14). An
HTML document employs a large number of tags to generate a formatted
document. A brief explanation of the meaning of the different tags is as follows.
Typical HTML documents [9] start with an <HTML> tag and end with
an </HTML> tag. Between these two tags is the body of the document. The
content between the <title> and </title> tags specifies the title of the
document. The <head> tag provides the header information of a webpage.
The content between the <h2> and </h2> tags displays the headings of the
document. The <table> tag generates a table with the border width specified
in the border attribute. The tags <TR> and </TR> create a table row. The
content between the <TH> and </TH> tags specifies the table headings. The
content between the <TD> and </TD> tags specifies the content of each cell
of the row.
When this file is saved as "Example13_1.html" and opened using a
typical web browser such as Internet Explorer or Netscape, the table shown
in Fig. 13.14 will be displayed by the browser.
Limitations of HTML
Although HTML has served its purpose, and has made the WWW very
popular, it has several limitations. The foremost limitation is that it is not
intelligent. It displays the data in the desired format without knowing
what is being displayed. Consider the table created in Example 13.1. If we
want to redisplay the table with the student IDs in increasing order, the
browser will not be able to do it. The browser does not know that the first
column is displaying the student IDs of a group of students. It simply treats
all data as dumb data without any meaning.
Second, the layout of an HTML document is static in the browser's
window. The readers or the browsers do not have any control over the
format in which the document is viewed. The control on the browser's side
[Figure 13.14. The table of Example 13.1 displayed in a web browser: student numbers and names, with entries for the courses Multimedia Systems (by M. Mandal) and Digital Logic Design (by X. Sun).]
13.4.5 XML
In order to address the limitations of HTML, a new markup language
was developed in 1996 by the WWW Consortium's (also called the W3
Consortium) XML working group [10]. This is known as XML (eXtensible
Markup Language). XML is a simplified version of SGML, but contains
its most useful features.
A major feature of XML is that content developers can define their own
markup language, and encode the information of a document much more
precisely than is possible with HTML [11]. The programs processing these
documents can "understand" them better, and therefore process the
information in ways that are not possible with HTML. Imagine that you
want to build a computer by assembling different parts from different
companies according to a document type definition (DTD) that has been
tailored for different computer parts. You can write a program that would
choose parts from the specified companies, and give you an overall price.
Conversely, you can write a program that would select the least expensive
parts from all quoted prices, and give you the minimum price of a computer
with a given configuration. In another case, given a price budget, the XML
code will tell you the best computer configuration you can buy. The
possibilities are almost endless because the information is encoded in a way
that a computer can understand.
The same table that was generated in Example 13.1 using HTML is now
generated using XML. Note that XML does not have any predefined
tags, and hence the browser does not know how to display the XML
document. The XML document is generally displayed using an XSL
(extensible stylesheet language) document that describes how the document
should be displayed. XSL is a standard recommended by the World
Wide Web Consortium. In this example, the XML is embedded in an HTML
document, which describes the structure of the XML document.
The flexibility of XML in defining the tags can be observed in the
code. Here, we can define user-friendly tags that make the document easier
to understand. It is also easier for search engines to look for specific tags
if they are looking for some relevant information. The code in Example 13.2
is also easy to expand; if a new student arrives, we just need to add the code
in column 1 of Table 13.5. The table will be expanded automatically and the
new entry will be accommodated in it. In comparison, for Example 13.1,
we would have to add the code given in column 2 of Table 13.5. Such
processes are more complicated for the user, and make searches more
difficult.
Due to the rapid growth of the multimedia industry, several standards are
being developed for creating and managing multimedia content. ISO/IEC
JTC 1/SC 29 is responsible for developing standards for representing
multimedia and hypermedia information. Figure 13.16 shows a schematic of
the different standards bodies and their activities. There are three main working
groups under it. WG 1 is responsible for developing still image coding
standards such as JBIG, JPEG, and JPEG-2000. WG 11 takes care of
primarily the video content, while WG 12 takes care of coding
multimedia and hypermedia information. The first two working groups
mainly focus on the information content, whereas the last working group
[Figure 13.16. ISO/IEC JTC 1/SC 29 (coding of audio, picture, multimedia and hypermedia information) and its working groups: WG 1 - coding of still pictures (JBIG/JPEG); WG 11 - coding of moving pictures and associated audio (MPEG); WG 12 - coding of multimedia and hypermedia information (MHEG).]
Database design has so far concentrated on text data. Efficient
search engines are available for searching data based on a text query.
However, multimedia data is significantly different from text data.
Content-based audio and visual retrieval systems are more appropriate
for multimedia data queries. A block diagram of a typical multimedia data
archival and retrieval system is shown in Fig. 13.17. Data matching and
retrieval techniques are generally based on feature vectors. For example,
images can be searched based on their color, texture, and shape. Generating
feature vectors for efficient content-based retrieval is an active area of
research [14]. A detailed review of content-based retrieval techniques for
images and video is presented in [15] (a copy of the thesis is included in
the CD-ROM for interested readers).
In order to manage multimedia (e.g., audio, still image, graphics, 3D
models, video) content efficiently, the MPEG group of ISO is developing the
MPEG-7 standard. Note that MPEG has so far developed MPEG-1,
MPEG-2, and MPEG-4, which are primarily video compression standards.
Figure 13.18. Scope of the MPEG-7 standard.
13.7 SUMMARY
The first part of this Chapter presented an overview of the principles of
multimedia authoring. Each multimedia project has its own underlying
structure and purpose and will require different features and functions.
Hence, the tools should be chosen according to one's needs. Note that
today's word-processing software is powerful enough for most applications.
REFERENCES
1. A. C. Luther, Authoring Interactive Multimedia, AP Professional, 1994.
2. D. E. Wolfgram, Creating Multimedia Presentations, QUE, 1994.
3. N. Chapman and J. Chapman, Digital Multimedia, John Wiley & Sons, 2000.
4. T. Vaughan, Multimedia: Making it Work, McGraw-Hill, 1998.
5. Asymetrix Website, http://www.asymetrix.com
6. Macromedia Website, http://www.macromedia.com
7. P. Barker, Exploring Hypermedia, 1993.
8. R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications and
Applications, Prentice Hall, 1996.
9. Hypertext Markup Language, http://www.w3.org/MarkUp/
10. Extensible Markup Language, http://www.w3.org/TR/REC-xml
11. C. F. Goldfarb and P. Prescod, The XML Handbook, Prentice Hall, Second Edition,
2000.
12. Web authoring tools, http://www.w3.org/WAI/AU/tools
13. ADOBE website, http://www.adobe.com
14. V. Castelli and L. D. Bergman, Image Databases: Search and Retrieval of Digital
Imagery, John Wiley & Sons, 2002.
15. M. K. Mandal, Wavelet Based Coding and Indexing of Images and Video, Ph.D.
Thesis, University of Ottawa, Fall 1998.
16. ISO/IEC JTC1/SC29/WG11 N2207, MPEG-7 Context and Objectives, March
1998.
17. ISO/IEC JTC1/SC29/WG11 N3500, MPEG-21 Multimedia Framework PDTR,
July 2000.
18. J. R. Smith, "Multimedia content management in MPEG-21 framework," Proc. of
the SPIE: Internet Multimedia Management Systems III, Vol. 4862, Boston, July
31-Aug 1, 2002.
19. Multimedia and Hypermedia Information Coding Expert Group,
http://www.mheg.org/
QUESTIONS
1. What is meant by multimedia? List the various components of multimedia.
2. Explain briefly a generic environment required for multimedia authoring.
3. Describe the steps of multimedia authoring.
Chapter 14
OPTICAL STORAGE MEDIA
14.1.1 Cross-section of a CD
A CD is a flat circular disc with a thickness of approximately 1.2 mm. The
physical dimensions of a typical CD are shown in Fig. 14.1. The outer
diameter of a CD is 12 cm, and its inner hole diameter is 15 mm. However,
the maximum and minimum recording diameters are 11.7 cm and 5.0 cm,
respectively. Note that CDs are also available with an 8 cm outer diameter.
However, these CDs have a lower storage capacity, and hence are not
popular.
[Figure 14.1. Physical dimensions of a typical CD (12 cm outer diameter), showing the unused and data recording areas.]
When the laser beam is focused on the lands, most of the light will fall on the
surface, and will be returned in the same direction. However, when it falls
on the pits, the light will be scattered, and only a small portion of the light
will be returned in the same direction. In other words, the intensity of the
laser light reflected from the lands is larger than that reflected from the pits,
and thus the lands and pits (i.e., the recorded binary data pattern) are
detected. When the laser beam is focused on the reflective layer, the spot
size of the beam is about 1 mm at the surface of the CD, but about 1.7 μm
at the reflective layer. Hence, if a small scratch or dust particle (smaller
than 1 mm) is present on the CD surface, the laser beam can still be focused
and the data can be read. However, the focus control must be very accurate
for correct reading.
Figure 14.2. Different layers in a CD (label, protective layer, reflective layer,
substrate layer). a) The lands and pits in a CD, b) reflected light intensity.
The laser light is focused from the transparent substrate side.
Table 14.1. Physical parameters of CD and DVD.

                      CD        Single-Layer DVD   Double-Layer DVD
Disc Diameter         12/8 cm   12/8 cm            12/8 cm
Track Pitch           1.6 μm    0.74 μm            0.74 μm
Minimum Pit Length    0.83 μm   0.4 μm             0.44 μm
Maximum Pit Length    3.05 μm   1.87 μm            2.05 μm
Laser Wavelength      780 nm    650 or 635 nm      650 or 635 nm
Layers of DVD
Table 14.3. Reading speed of CD and DVD.

                            CD                  Single-Layer DVD   Double-Layer DVD
Read Velocity               1.2 m/sec           3.49 m/sec         3.68 m/sec
Basic Read Rate (data)      1.41/1.23 Mbits/s   11.08 Mbits/s      11.08 Mbits/s
Basic Read Rate (channel)   4.32 Mbits/s        26.16 Mbits/s      26.16 Mbits/s
Note that CD and DVD drives are available which can read data much
faster. Table 14.4 shows the raw data rate of CD and DVD drives with
different speed factors. A 12x CD (2x DVD) drive spins 12 times faster than
a 1x CD (DVD) drive, and reads data 12 times faster. Note also that the speed
factors are meaningful for reading raw data. When we play audio from CDs
or movies from DVDs, higher speed drives and x-factors make no difference;
the audio or video data is always played back at 1x speed.
The core of the playback system is the optical pickup mechanism. Fig. 14.5
shows a popular optical readout system known as the three-beam method. In
this method, a laser beam is produced by a laser diode. When the laser beam
is passed through the diffraction grating, three beams are produced. The laser
beams enter a polarizing prism, and only vertically polarized light passes
through. When the beams pass through the quarter(1/4)-wave plate, the
beam's polarization plane is rotated by 45°. The polarized light is then
passed through the actuator (i.e., the tracking and focusing system) and
focused on the CD surface. The laser beams are reflected by the CD/DVD
lands and pits, although the lands produce a larger intensity than the pits.
The reflected light is passed through the 1/4-wave plate again. This second
pass makes the laser beam 90° polarized, i.e., the originally vertically
polarized laser beams are now horizontally polarized. The horizontally
polarized light is reflected by the prism and hits the three photodetectors.
The voltage generated at the photodetectors determines whether a land or a
pit has been read by the laser beam.
Table 14.4. Speed factors of CD and DVD drives.

CD Factor   DVD Factor   Raw Data Rate
1x          0.17x        4.32 Mbit/s
4x          0.67x        17.3 Mbit/s
6x          1x           26.2 Mbit/s
12x         2x           52.4 Mbit/s
24x         4x           105 Mbit/s
30x         5x           131 Mbit/s
36x         6x           157 Mbit/s
Figure 14.4. Layers in a DVD (each substrate is 0.6 mm thick). a) Single sided,
single layer DVD (4.7 GB), b) single sided, double layer (8.5 GB), c) double
sided, single layer (9.4 GB), and d) double sided, double layer (17 GB).
The three-beam principle employs three laser beams - the main laser beam
along with two subbeams - to detect a spot. The two subbeams are on either
side of the main beam. The reflections from the central spot (by the main
beam) and the two side spots (by the side beams) are detected by three
photodetectors. The central photodetector produces the digital signal
corresponding to the central spot. The outputs of the two side photodetectors
are compared, and an error signal is produced. This error signal is then fed
to the tracking servomotor, which moves the objective lens and keeps the main
laser beam centered on the track. In addition, if the laser beam is not focused
properly, the objective lens is moved towards or away from the CD (within
±1 μm) to obtain a more accurate focus.
[Figure 14.5. The three-beam optical pickup: laser, diffraction grating, prism beam splitter, 1/4-wave plate, and actuator (tracking and focusing), reading the recorded pattern.]
Although the outer dimensions of a CD and a DVD are identical, the inner
dimensions are not the same. The layers in a DVD are thinner (0.6 mm) than
in a CD (1.2 mm). A DVD player is required to read both CDs and DVDs.
This is generally achieved by controlling the focal length of the lens (see Fig.
14.6). A DVD drive generally distinguishes the separate layers of a multiple-
layer disc by selective focus. The separation between the two layers, about
20-70 μm, is sufficient for each layer to have a distinct focus, allowing the
layers to be distinguished.
[Figure 14.6. Reading the CD data surface (through 1.2 mm of substrate) and the DVD data surface (through 0.6 mm) with the same laser and photodetector by adjusting the focus.]
CD-ROM
The CD-ROM recording is very similar to the gramophone (LP) recording
process. A stamping master, which is usually made of nickel, is created such
that it is the mirror image of the lands and pits that are to be created on a
CD. To create the pits, the raised bumps on the stamper are pressed onto the
liquid plastic (polycarbonate, polyolefine, acrylic or similar material) during
the injection molding process. Once the pits are generated, the stamped
surface is covered with a thin reflective layer of aluminum, which provides
different reflectivity of lands and pits, and helps in the data reading process.
The DVD creation process is similar to that of a CD. However, more care
is needed as the pits are smaller. In addition, there are more layers in a
DVD. Hence, the recording is done layer-by-layer, and the substrates of the
different layers are glued together.
CD-R Media
The CD-R recording process is distinct from the CD-ROM recording
process. The CD-R recording is primarily done by the consumers, and
therefore there is no mass production of the same content.
The CD-R disc has a special dye layer (see Fig. 14.7) that is not present in
a CD-ROM. The dye is photoreactive, and changes its reflectivity in
response to the high-power mode of the recorder's laser. The CD-R records
information by burning the dye layer in the disc. There are three popular
types of photoreactive dyes used by CD-R discs [7]. The green dye is based
on a cyanine compound, and has a useful life in the range of 75 years. The
gold dye is based on a phthalocyanine compound, has a useful life of about
100 years, and is better suited for high-speed (2x-4x) recording. The blue
dye is based on a cyanine compound with an alloyed silver substrate. These
CDs are more resistant to UV radiation, and have low block error rates.
Figure 14.7. Different layers in a CD-R: label, protective lacquer, reflective
gold coating, photoreactive dye layer, and polycarbonate base.
Rewritable CDs
CD-RW is based on the phase change principle. The reflective layer in the
CD-RW is made from a material that changes its reflectivity depending on
whether it is in an amorphous or crystalline state. In its crystalline state, the
medium has a reflectivity of about 15-25%. In its amorphous state, the
reflectivity is reduced by a few percent, which is enough for the detection of
the state. The most common medium used in CD-RW is an alloy of antimony,
indium, silver, and tellurium.
In a blank CD-RW, the entire reflective medium is in its crystalline state.
When recording, the drive increases the laser power to between 8-15 mW,
and heats the medium to above its 500-700°C melting point. The operation
is straightforward, and equivalent to the CD-R writing process except for
the laser power. To completely erase a disc and restore its original crystalline
state, the disc is annealed: the reflective layer is heated to about 200°C, and
held at that temperature while the material recrystallizes.
Note that recordable media are not as durable as commercially
stamped CDs. Recordable CDs are photosensitive, and should not be
exposed to direct sunlight. The label side of a recordable CD is often
protected only by a thin lacquer coating. This coating is susceptible to
damage from solvents such as acetone and alcohol. Sharp-tipped pens
should not be used to write on the labels, as they may scratch the lacquer
surface and damage the data.
Figure 14.8. Comparison of playback in magnetic and optical storage.
The CD-ROM is targeted for mass distribution of data or software. CD-R
and CD-RW are used for backup of large multimedia or computer data. The
CD-V, also known as video-CD, is used for short, low-quality video. The
CD-I (or CD-interactive) is the standard for the storage and distribution of
interactive multimedia applications. SA-CD is the standard for recording
very high quality audio signals.
There are primarily five DVD standards, documented in various Books, as
shown in Table 14.5. Book A defines the physical specifications of the
DVDs, such as the file system and data format. Book B defines the standard
for video applications, whereas Book C defines the standard for audio
applications. Books D and E define the standards for DVD-R and DVD-RW
storage, respectively.
14.2.1 Audio CD
The CD was first introduced to the market with audio data represented
using 16 bits/sample/channel precision and a 44.1 kHz sampling rate. The
data rate of a stereo audio CD can be calculated as follows:

DataRate_CD-DA = 16 bits/sample × 2 channels × 44100 samples/s = 1.4112 × 10^6 bits/s        (14.1)
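This arithmetic, together with the CIRC expansion discussed in the next section, can be checked with a few lines of MATLAB:

    bits_per_sample = 16;  channels = 2;  fs = 44100;
    raw_rate  = bits_per_sample * channels * fs    % 1.4112e6 bits/s, Eq. (14.1)
    circ_rate = raw_rate * 32/24                   % 1.8816e6 bits/s after CIRC coding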
The quantization of the analog audio signal to digital audio signal will
introduce quantization noise in the audio signal and degrade the quality. It
was mentioned in Chapter 4 that increasing the sample precision by 1 bit
improves the SNR of the signal by approximately 6 dB. Hence, with 16 bits
precision, the SNR of the digitized audio would be in the range of 96 dB. In
practice, CD-DA player manufacturers claim about a 90-95 dB dynamic
range, which can be considered excellent quality for most audio
applications. Note that, by comparison, LP records provide an SNR of only
about 70 dB. There is another advantage of the CD over the LP. Each groove of an LP
stereo record contains two signals, one each for the left and right channels.
These signals are read and reproduced simultaneously by the turntable. In
the case of CD, the left and right channel information is recorded separately.
Therefore, the cross-talk between the two channels is minimized, improving
the audio quality.
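These numbers are easy to verify programmatically. The following minimal Python sketch (ours, not from the text) reproduces the stereo data rate of Eq. (14.1) and estimates the quantization SNR; the 6.02b + 1.76 dB formula is the standard approximation for a full-scale sinusoid, a slightly more precise version of the 6 dB/bit rule quoted above.

# Sketch: CD-DA stereo data rate and approximate quantization SNR.
bits_per_sample = 16
channels = 2
sampling_rate = 44100                      # samples/s per channel

data_rate = bits_per_sample * channels * sampling_rate
print(data_rate)                           # 1411200 bits/s, as in Eq. (14.1)

snr_db = 6.02 * bits_per_sample + 1.76     # full-scale sinusoid approximation
print(round(snr_db, 1))                    # about 98 dB, i.e., "in the range of 96 dB"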
Eq. (14.1) shows that the CD-DA data rate is 1.4112 × 10^6 bits/s.
However, the audio data needs to be encoded and formatted before being stored
on the CD, resulting in an increase in the raw data rate. Fig. 14.9 shows the
schematic of the CD-DA data encoding process. The encoder takes 24 bytes
(i.e., 192 bits) of raw audio data, and produces 588 bits of output, known as
an audio frame. There are four steps in the encoding process, which are now
briefly discussed.
Figure 14.9. Schematic of the CD-DA data encoding process, from the audio input to the bitstream stamped on the CD.
CIRC Encoding
In order to protect the data from bit errors, an error correction code is
applied to the raw audio data. The bit errors may occur for various reasons,
such as jitter, dust, scratches, and fingerprints. The CD-DA employs the
cross-interleave Reed-Solomon code (CIRC) to correct the random and burst
errors. The CIRC code converts 24 bytes of audio data into 32 bytes of
CIRC encoded data, increasing the data rate by a ratio of 32:24. Hence, the
data rate of the CIRC encoded audio data is 1.8816 × 10^6 bits/s. Note that
although the CIRC code detects and corrects most bit errors, some bit
errors may still go undetected.
Control Word
For each 32 bytes of CIRC encoded data, an 8-bit control word is added,
resulting in 33 bytes of output. The control word contains information such
as the music track separator flag, track number, and time.
EFM Modulation
The bits in a CD-DA are represented with lands and pits as shown in Fig.
14.10. Here, the ones are represented by a change from land-to-pit, or pit-to-
land, and the zeros are represented by a continuation of the lands and pits.
However, this scheme is not used exactly in practice. For example, consider
the bit sequence 111111111. In this case, the pits and lands will occur
alternately. It is difficult to design a laser system with sufficient resolution
to read the sequence of lands and pits changing so frequently. When the CD
standard was established, it was decided that at least two lands and two pits
should always occur consecutively in a sequence. But, pits and lands should
not be too long, in order to correctly recover the synchronization signal. As a
result, it was decided that at most ten zeros can follow one another.
Figure 14.10. Representation of bits on a CD-DA: ones correspond to land-to-pit or pit-to-land transitions in the reflective layer; zeros to continuations of a land or a pit.
In order to satisfy the above length constraints, the CD-DA system adopts
a modulation scheme, known as eight-to-fourteen modulation (EFM). In this
scheme, 8-bit words are coded as 14 bits. Note that with 14 bits, 16384
different symbols can be represented. Out of a possible 16384 symbols, only
256 symbols are chosen such that the minimum and maximum consecutive
lengths of lands and pits are satisfied. An example of EFM is shown in
Table 14.6. By limiting the maximum number of consecutive bits with the
same logic level, the overall DC content is reduced. In addition, by limiting
the individual bit inversions (i.e., 1 → 0 or 0 → 1) between consecutive bits,
the overall bandwidth is lowered. Because of this modulation, the EFM
produces 462-bit data from a 264-bit (33-byte) input.
The DVD also employs the EFM scheme, although with minor
modifications to improve the robustness of the MPEG data recording. The
modified scheme is known as EFMPlus.
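The EFM run-length constraints can be checked mechanically. The Python sketch below (illustrative, not part of any standard) enumerates the 14-bit words that satisfy the two constraints stated above; the CD standard selects 256 of the qualifying words as codewords, a choice that also takes DC content into account.

# Sketch: count 14-bit words satisfying the EFM run-length rules --
# at least two 0s between consecutive 1s, and at most ten consecutive 0s.
def satisfies_efm(word, n=14):
    bits = [(word >> i) & 1 for i in range(n - 1, -1, -1)]
    ones = [i for i, b in enumerate(bits) if b == 1]
    if any(b - a - 1 < 2 for a, b in zip(ones, ones[1:])):
        return False                      # 1s too close together
    run = 0
    for b in bits:
        run = run + 1 if b == 0 else 0
        if run > 10:
            return False                  # too many consecutive 0s
    return True

print(sum(satisfies_efm(w) for w in range(2 ** 14)))  # number of candidates (>= 256)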
Sync/Merging
Even with the EFM, the run-length conditions may not be satisfied at the
boundaries between consecutive 14-bit symbols. Hence, in addition to the EFM,
three extra bits are added to each 14-bit symbol. These three merging bits are
selected such that the DC content of the signal is further reduced. The exact
values of the merging bits depend on the adjacent bits of the 14-bit symbol.
These extra merging bits produce a 561-bit output for a 462-bit input. A 27-bit
synchronization pattern (a 24-bit sync word plus its 3 merging bits) is then
added to each group of 33 symbols, bringing the frame to the 588 bits
mentioned earlier.
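Putting the four steps together, the frame arithmetic can be verified in a few lines of Python. This is a sketch; the 27-bit figure (the 24-bit synchronization word plus its 3 merging bits) is taken from the CD-DA specification rather than from the text above.

# Sketch of the CD-DA frame arithmetic: 24 audio bytes -> 588-bit frame.
audio_bytes = 24
circ_bytes = audio_bytes * 32 // 24       # CIRC: 24 -> 32 bytes
framed_bytes = circ_bytes + 1             # + 1 control-word byte = 33
efm_bits = framed_bytes * 14              # EFM: 8 -> 14 bits = 462
merged_bits = framed_bytes * (14 + 3)     # + 3 merging bits/symbol = 561
frame_bits = merged_bits + 24 + 3         # + sync word and its merging bits
print(frame_bits)                         # 588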
The smallest data unit in CD-DA is called a block and contains 2352 bytes
of useful audio data. Since the stereo audio data rate is 1.4112 x 10 6 bits / s,
the audio playback rate is 75 blocks/s. A typical CD-DA contains audio data
for 74 minutes. Therefore, a CD-DA contains 74*60*75 or 333000 blocks.
The storage capacity of the CD can then be calculated as
Capacity_CD-DA = 333,000 blocks × 2352 bytes/block ≈ 783.216 × 10^6 bytes of audio data.
The CIRC and EFM coding yield a low BER, but not low enough for computer
storage applications. Hence, in the CD-ROM format, 280 bytes (out of 2352)
are used for additional error detection and correction (in addition, 8 bytes
are not used). In other words, 2048 bytes in each block are available for user
data storage. The two formats are also known as Mode-1 (for computer
storage) and Mode-2 (for audio recording and playback). With these extra
error correction bytes, an average uncorrectable BER smaller than 10^-12
can be achieved.
The user data storage capacity of the CD-ROM can be calculated as
follows:
Capacity_CD-ROM = 333,000 blocks × 2048 bytes/block ≈ 681.984 × 10^6 bytes ≈ 650 MBytes.
Although a CD has a raw capacity of more than 2 Gbytes, the effective
data storage capacity is only 650 Mbytes. This is the storage capacity
generally quoted on commercially available CD-Rs and CD-RWs.
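Both capacity figures follow directly from the block count; a small Python sketch using the payloads quoted above:

# Sketch: CD capacity for audio (2352 user bytes/block) and for
# computer data (Mode-1, 2048 user bytes/block).
blocks = 74 * 60 * 75                     # 74 minutes at 75 blocks/s = 333000
print(blocks * 2352)                      # ~783.216e6 bytes (CD-DA)
print(blocks * 2048)                      # ~681.984e6 bytes, i.e., ~650 MB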
In a DVD, 2048 bytes of user data are stored in one sector. However, when
the header, EDC, ECC, and sync signals are added, the data size becomes 2418
bytes. The EFMPlus (eight-to-sixteen) modulation again doubles the size,
making the physical sector size 4836 bytes. The formatting and modulation
comprise an effective overhead of 136%. Thus, although the DVD user data
rate (1x speed) is 11.08 Mbits/s, the raw data rate is approximately 26.16
Mbits/s.
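A quick Python check of the sector overhead and raw data rate quoted above:

# Sketch: DVD sector overhead and raw data rate at 1x speed.
user_bytes = 2048
recorded_bytes = 2418                     # after header, EDC, ECC, and sync
physical_bytes = recorded_bytes * 2       # EFMPlus (8-to-16) modulation
print(physical_bytes)                     # 4836
print(round(physical_bytes / user_bytes - 1, 2))   # ~1.36, i.e., 136% overhead

user_rate = 11.08e6                       # user data rate at 1x, bits/s
print(round(user_rate * physical_bytes / user_bytes / 1e6, 2))  # ~26.16 Mbits/s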
Table 14.7. Comparison of Video-CD and DVD-Video.

                        Video-CD             DVD-Video
Video data rate         1.44 Mbits/s         1-10 Mbits/s, variable (video, audio,
  (video, audio)                             subtitles); average = 3.5 Mbits/s
Video compression       MPEG-1               MPEG-2 (MPEG-1 is allowed)
Sound tracks            2-channel MPEG       NTSC: Dolby Digital; PAL: MPEG-2
Subtitles               Open caption only    Up to 32 languages
There are primarily two types of DVD players available on the market.
They may come as part of a personal computer, or as stand-alone DVD
players. Both types can generally play CD-Audio and Video-CD, in addition
to DVD-Video. With current technology, the DVD-ROM player in a PC
works better with a separate hardware-based MPEG-2 decoder. The raw
(uncompressed) data rate of DVD video is about 124 Mbits/s, and the image
resolution is approximately 720 x 480 pixels. This is about four times the
resolution of MPEG-1 or VHS quality (equivalent to 240 lines, compared to
the 480-line quality of DVD). DVD-Video supports both 4:3 and 16:9 aspect
ratios. The associated audio can take many forms: the DVD-Video standard
accommodates eight tracks of audio, with each track being a single data
stream that may comprise one or more audio channels.
Each of these channels may use any one of the five specified encoding
systems.
The average video data rate is approximately 3.5 Mbits/s. When audio and
subtitles are added, the overall bit rate is approximately 4.962 Mbits/s. A DVD
with 4.7 GB of net storage space can then accommodate roughly
4.7 × 8 × 1024/(4.962 × 60) ≈ 129 minutes of video; hence, a single-sided,
single-layer DVD is commonly quoted as storing up to 133 minutes of video.
Although the "average" bit rate for digital video is often quoted as 3.5
Mbits/s, the actual bit rate varies according to the movie length, picture
complexity, and the number of audio channels required.
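Because the playing time depends on the assumed average rate and on the unit convention for a gigabyte, it is instructive to compute it for several rates. In the Python sketch below, 1 GB is taken as 10^9 bytes; the 3.5 and 4.962 Mbits/s rates are from the text, while 4.7 Mbits/s is added to show where the commonly quoted 133-minute figure lands.

# Sketch: playing time of a 4.7 GB single-layer DVD versus average
# combined (video + audio + subtitle) bit rate.
capacity_bits = 4.7e9 * 8

for rate_mbps in (3.5, 4.7, 4.962):
    minutes = capacity_bits / (rate_mbps * 1e6) / 60
    print(rate_mbps, round(minutes))      # 3.5 -> 179, 4.7 -> 133, 4.962 -> 126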
In order to prevent copyright violations, video DVDs may include copy
protection. There are primarily three types of protection used in consumer
video applications. In analog copy protection, the video output is altered so
that the signal appears corrupted to VCRs when recording is attempted. The
current analog protection system (APS) adds a rapid modulation to the
colorburst signal and confuses the automatic gain control of the VCR. In
serial copy protection, the copy generation management system adds
information to line 21 of the NTSC video signal to tell the equipment if
copying is permitted. In digital encoding, the digital media files are
encrypted so that the video player cannot decode the signal without knowing
the key.
Commercial DVDs are marked with a regional code (see Table 14.8). The
DVD player checks to see if the region code in the DVD matches the code
in its hardware. If it does not match, the DVD will not play.
A DVD drive does everything a CD drive does, and more. As DVD drives
become inexpensive, they are generally the better purchase. Although a
DVD movie can be played using a computer drive and watched on the
monitor, computer DVD drives generally do not have the video jack found
on home video players (although they typically provide audio output).
Hence, these drives cannot be directly used to watch a movie on a
television. In addition, computer DVD drives mostly use a software-
based MPEG-2 decoder, which may be slow if the CPU is not powerful
enough.
REFERENCES
1. J. D. Lenk, Lenk's Laser Handbook, McGraw Hill, 1992.
2. E. W. Williams, The CD-ROM and Optical Disc Record System, Oxford University
Press, New York, 1994.
3. L. Boden, Mastering CD-ROM Technology, John Wiley & Sons, New York, 1995.
4. J. Taylor, DVD Demystified: The Guidebook for DVD-video and DVD-ROM,
McGraw-Hill, New York, 1998.
5. A. Khurshudov, Essential Guide to Computer Data Storage: From Floppy to DVD,
Prentice Hall PTR, 2001.
6. L. Baert, L. Theunissen, and G. Vergult, Digital Audio and Compact Disc
Technology, Second Edition, Newnes, Oxford, 1992.
7. W. L. Rosch, Hardware Bible, QUE, Fifth Edition, Indianapolis, 1999.
QUESTIONS
1. What are the advantages (and disadvantages) of optical storage media over
magnetic storage media?
2. Compare and contrast the storage principle of the CD and gramophone record.
3. Compare and contrast the physical dimensions of CD and DVD.
4. How does a DVD drive read a CD?
5. Compare and contrast the writing principles of CD-ROM, CD-R, and CD-RW.
6. Why do we need CIRC encoding and EFM modulation? What is EFMPlus?
7. What is the storage capacity of a DVD? What is the raw storage capacity including
the modulation bytes?
8. How many minutes of a digital movie can be stored on a DVD?
9. You have recorded computer data on a CD-ROM (74-minute equivalent) to its full
capacity. How long will it take to read the CD at 12x speed? Assuming a BER of
10^-12, what is the probability of a bit error while reading the entire CD?
Chapter 15
Electronic Displays
Display Size: The required display size varies widely with the application.
Portable devices require a small display size of only a few inches. Typical
computer monitors use a display size in the range 15-24", whereas typical
home televisions use a display size in the range 13-48". Note that the weight
and cost of a display increase rapidly with the screen size. Hence, large
displays (>70") are generally obtained by optically projecting the image from
a smaller display. The cathode ray tube, liquid crystal display, and digital
micro-mirror devices are typically used in projection systems.
Display Thickness and Weight: These parameters are also important to
consumers, especially when there are constraints of physical space and
weight. The current trend is to employ thinner [4, 5] and less bulky display
devices.
Resolution: The resolution of a display device is expressed in terms of the
number of pixels in the horizontal and vertical directions. The resolution is
limited significantly by the display size. Typical computer monitors have
resolutions from 640 x 480 to 1600 x 1200. On the other hand, typical digital
TVs have resolutions in the range of 720 x 480. While high-definition TVs
have resolutions of up to 1920 x 1080, portable devices may have resolutions
as low as 50 x 50.
Color: This is another important aspect of a display device. Some devices
such as digital watches, and calculators may not need a color display.
However, entertainment applications generally require good color display.
The number of colors supported by a typical display ranges from 256 to 16
million. Note that 256 and 16 million colors correspond to 8-bit and 24-
bit/pixel resolutions, respectively.
Brightness and Contrast Ratio: Brightness is the result of all illumination
falling on the surface, and the reflectance of the surface, whereas the
contrast ratio is the ratio of maximum brightness and minimum brightness
visible. Note that more ambient illumination decreases the contrast ratio.
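The effect of ambient illumination on the contrast ratio can be illustrated with a small calculation. In the Python sketch below, all luminance values are illustrative assumptions, not figures from the text: reflected ambient light adds to both the brightest and darkest displayed levels, compressing their ratio.

# Sketch: contrast ratio shrinks as ambient light is added to both levels.
def contrast_ratio(l_max, l_min, ambient):
    return (l_max + ambient) / (l_min + ambient)

for ambient in (0.0, 1.0, 10.0):          # reflected ambient light, cd/m^2
    print(ambient, round(contrast_ratio(500.0, 0.5, ambient)))
# 0 -> 1000:1, 1 -> ~334:1, 10 -> ~49:1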
Aspect Ratio: This is the ratio of the width to the height of a display device.
The aspect ratio of most computer monitors and analog TVs is 4:3, whereas
HDTV has an aspect ratio of 16:9.
Angle of View: Angle of view is the angle up to which the output of the
display device can be read by a viewer. Computer monitors can have a
narrow angle of view. However, TVs and projection displays must have a
large angle of view, since they are viewed by small or large groups of
people.
Light Source: In all display devices, light comes from the display screen
and enters our eyes, and we see the pictures. Display devices can be
divided into two categories based on how the light comes from the screen. In
an emissive display, the light (photons) is emitted by the display screen
itself. This is the case in most home TV receivers, which use cathode
ray tubes. In a non-emissive display, the device does not emit any
light; rather, it transmits or reflects light coming from another source.
Note that LCD screens are of the non-emissive type. Emissive devices
generally require high power, and hence are not efficient for portable
applications. Non-emissive display devices, on the other hand, depend on an
external light source for illumination, and hence require lower power to
operate.
Figure 15.1. Schematic of a CRT, showing the electron guns, the shadow-mask screen assembly, and the phosphor screen.
Note that the CRT is evacuated so that the electron beam emitted from the
cathode passes through the tube easily. The raster scanning is accomplished
by deflecting the electron beam using a set of magnetic coils around the
neck of the CRT. A monochrome CRT has one electron gun, whereas a
color CRT has three electron guns, one each for the red, green, and blue
channels. The phosphor screen of a color display has three dots (red, green,
and blue) for each pixel. These three phosphor dots are known as a triplet or
triad.
The electron guns point the beam at the individual phosphor dots while
scanning. However, part of the beam can spill over and hit the other dots in
the triplet, or even the neighboring pixel dots. This spilling over results in
loss of color purity. CRT devices use a masking screen to stop these spilled-
over electrons and thereby improve the quality of the displayed image. There are
primarily two types of masking screens - shadow mask and aperture grille
(see Figs. 15.2 and 15.3).
The shadow mask (see Fig. 15.2) is a steel plate with a large number of
holes (one hole for each color pixel). The three electron beams
corresponding to each color pixel pass through these triplet holes and hit the
phosphor screen. The spilled-over electrons are stopped by the shadow
mask. The shadow mask is located very close to the phosphor screen, and
hence the phosphor dots on the screen must be spaced at the same distance
as the holes in the mask. Since the hole spacing is the same as the dot
spacing, it is often called the dot-pitch of the CRT. Typical dot pitch is 0.28
mm or 0.40 mm.
Figure 15.2. Shadow mask CRT: the three phosphor dots (RGB triad) of each pixel and the three electron guns are arranged in a triangle, with the shadow mask just in front of the phosphor screen.
Some CRTs, such as Sony Trinitron CRTs, use an aperture grille instead of
a shadow mask. Here, instead of circular holes, the electron beams pass
through thin vertical windows (or slots), as shown in Fig. 15.3. The
resolution of an aperture grille is specified by the slot pitch, which is the
distance between two adjacent slots.
Figure 15.3. RGB phosphor arrangements, illustrating the dot pitch and the slot pitch ((a)-(c)).
The field emission display (FED) works on a principle similar to that of the
CRT, but uses vacuum technology that confines the emission within the
small space of a flat panel.
A simplified schematic of the FED is shown in Fig. 15.4. The electrons
are generated in high vacuum by the micro-tips of the cathode and are
accelerated towards the phosphors by the positive voltage applied on the
transparent anode. The electron beams hit the phosphor screen, and photons
of different colors are emitted from the phosphors and observed by the
viewer.
Figure 15.4. Simplified schematic of the FED.
As mentioned earlier, FEDs employ multiple cathodes (micro-tips) for each
color channel of a pixel, unlike the single electron gun per channel in the
CRT. The distance between the cathode and the phosphor screen in an
FED can be made as small as 1/10" (as shown in Fig. 15.4), resulting in a
flat panel display. Note that while CRTs require a high anode voltage, the
anode voltage in typical FEDs can vary from one to several thousand volts.
Typical FEDs have several important features. They are thin and lightweight,
and have low power consumption. High resolution, brightness, and contrast
can be achieved easily in an FED. FEDs can produce self-emissive,
distortion-free images, and can provide a wide angle of view (as wide as
170°). In addition, FEDs can withstand a wide range of temperature and
humidity, and have stable characteristics in severe environmental
conditions.
Note that the FED technology is in its infancy, being only a few years old.
Hence, only small-screen displays (for automotive or small portable
applications such as PDAs) have been developed so far. However, FEDs are
expected to enter the consumer TV market in the near future.
Note that the PDP is a truly digital display device where the cells are
either ON or OFF, resulting in either dark or bright pixels. Hence, a
mechanism must be devised to obtain a large number of brightness levels.
This is done by a method similar to a digital-to-analog converter. For
example, consider the decimal number 11, which can be represented as 1011
in binary format, where the weights of 1, 0, 1, and 1 are 8, 4, 2, and 1,
respectively (see Fig. 15.7(a)).
Figure 15.5. Structure of a PDP cell: front glass with display electrodes and dielectric layer, barrier ribs, red and green phosphors, and rear glass.
Figure 15.6. Voltage-current characteristics of a PDP cell. Vf is the firing or
breakdown voltage; in the normal glow region the cell is in the ON state.
The cell brightness in a PDP can be controlled as follows. Assume that the
PDP cell has to be made ON for τ time units to obtain a brightness level of
one (i.e., the level just brighter than a completely dark pixel). A brightness
level corresponding to gray level 11 can then be obtained by first switching
ON the PDP cell for 8τ time units, switching it OFF for 4τ time units, and then
switching it ON for 2τ, followed by another τ time units (see Fig. 15.7(b)).
This switching ON and OFF is performed very quickly, and due to the lowpass
nature of the human visual system, it appears as a smooth brightness level.
Even though at any instant the pixel illumination is either on or off, the overall
brightness corresponds to gray level 11.
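The binary-weighted switching scheme is easy to express in code. The following Python sketch (the function name is ours) derives the ON/OFF pattern for a given gray level, reproducing the 8τ, 4τ, 2τ, τ sequence for gray level 11; for 8-bit gray levels one would use n_bits=8.

# Sketch: binary-weighted sub-fields for PDP brightness control.
def subfield_pattern(gray, n_bits=4):
    """(weight, on) pairs from the most to the least significant bit."""
    return [(1 << k, bool(gray & (1 << k)))
            for k in range(n_bits - 1, -1, -1)]

for weight, on in subfield_pattern(11):   # 11 = binary 1011
    print(("ON " if on else "OFF") + " for " + str(weight) + "*tau")
# ON for 8*tau, OFF for 4*tau, ON for 2*tau, ON for 1*tau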
Figure 15.7. (a) Binary representation of gray level 11; (b) the corresponding ON/OFF switching pattern.
Figure 15.8. Timing diagram for the PDP display, showing the write, sustain, and erase periods. The time intervals are not drawn to scale.
The angle of view of a PDP display is very wide, and hence the display is
suitable for almost any viewing situation.
LCD displays are based on liquid crystal technology. Typical crystals are
solid, and the molecular positions in a crystal are generally ordered, similar
to the molecules in other solid materials. The molecules in a liquid crystal,
however, do not exhibit any positional order, although they do possess a
certain degree of orientational order.
The liquid crystals (LCs) can be classified according to the symmetry of
their positional order. LCs can be nematic, chiral nematic, smectic, or chiral
smectic. In nematic liquid crystals, all positional order is lost; only the
orientational order remains. Figure 15.9(a) shows the molecules in a nematic
crystal [9, 10]. In a chiral nematic, the molecules are asymmetric, and this
causes an angular twist between the molecules (see Fig. 15.9(b)). In smectic
LCs, the molecules tend to exist in layers, and are oriented perpendicular to the
plane of the layers. Figure 15.9(c) shows the molecules in a smectic material
with two layers. Smectic materials exhibit better viewing angle
characteristics and contrast ratio. In chiral smectic LCs, the molecules form a
helical structure.
The nematic LC molecules have a special property: they can be
aligned by grooves in the plastic so as to twist the polarization of the light
that passes through them. The amount of twist depends on the electric field (or
current) applied. Ordinary light has no particular orientation, so LCs do not
visibly alter it. But in polarized light, the oscillations of the photons are
aligned in a single direction. A polarizing filter creates polarized light by
blocking all oscillation directions except one.
Figure 15.10. Schematic of an LCD panel: fluorescent light source, row and column contacts, alignment layers, LC molecules, color filters, and pixels.
Note that the LCD does not emit light; it merely modulates the ambient
light, and hence the display power consumption is minimal. Because of its
low power consumption and light weight, the LCD is very popular in
portable applications. Displays may either be viewed directly or can be
projected onto large screens.
LCDs, however, have two limitations. First, the angle of view of LCDs is
limited (<30°) because of the use of polarized light. Second, LCD panels are
difficult to manufacture in large screen sizes, which increases their cost.
Projection TV
Digital projection TV is gaining importance due to the growth of home
theaters and digital cinema [13]. There are several competing technologies
for projection TV, such as the CRT, active matrix LCD, and liquid crystal
light valves (LCLV). Although these technologies are capable of producing
good quality images, they also have their limitations. The CRT and LCD
systems have limitations in producing high brightness, in addition to their
stability and uniformity problems [3, 14]. The LCLVs can produce high
brightness, but they are expensive and suffer from stability problems.
The DMD is a highly suitable display device for projection use; the
associated technique is typically known as digital light processing (DLP).
Figure 15.12 shows the schematic of a DMD-based projection TV. The light
from the light source is passed through the color filter, which generates
colored light (red, green, and blue). The colored light is reflected from the
DMD mirrors onto the projection lens, which projects the image onto the
screen.
Figure 15.12. Schematic of a DMD-based projection TV (light source, color filter, DMD chip, projection lens, and screen).
REFERENCES
1. J. S. Castellano, Handbook of Display Technology, Academic Press, 1992.
2. J. C. Whitaker, Electronic Displays: Technology, Design, and Applications, McGraw-
Hill, New York, 1994.
3. A. C. Luther, Principles of Digital Audio and Video, Artech House, 1997.
4. M. Huang, M. Harrison, and P. Wilshaw, "Displays - the future is flat," European
Semiconductor, Vol. 20, No. 2, pp. 15-16, Feb 1998.
5. S. Tominetti and M. Amiotti, "Getters for flat-panel displays," Proc. of the IEEE,
Vol. 90, No. 4, pp. 540-558, April 2002.
6. C. Ajluni, "FED technology takes display industry by storm," Electronic Design,
Vol. 42, No. 22, pp. 56-66, Oct 25, 1994.
7. S. Itoh and M. Tanaka, "Current status of field-emission displays," Proc. of the IEEE,
Vol. 90, No. 4, pp. 514-520, April 2002.
8. H. Uchiike and T. Hirakawa, "Color plasma displays," Proc. of the IEEE, Vol. 90, No.
4, pp. 533-539, April 2002.
9. V. G. Chigrinov, Liquid Crystal Devices: Physics and Applications, Artech House,
1999.
10. H. Kawamoto, "The history of liquid-crystal displays," Proc. of the IEEE, Vol. 90,
No. 4, pp. 460-500, April 2002.
11. L. J. Hornbeck, "Digital light processing for high-brightness, high-resolution
applications," Proc. of SPIE: Projection Displays III, Vol. 3013, pp. 27-40, 1997.
12. P. F. Van Kessel, L. J. Hornbeck, R. E. Meier, and M. R. Douglass, "A MEMS-based
projection display," Proc. of the IEEE, Vol. 86, No. 8, pp. 1687-1704, August 1998.
13. E. H. Stupp, M. S. Brennesholtz, and M. Brenner, Projection Displays, John Wiley &
Sons Ltd, 1998.
QUESTIONS
1. Describe the different parameters that are important for a display device.
2. Explain the working principle of the CRT.
3. What is a shadow mask, and what is an aperture grille? What is meant by the statement "The
dot pitch of a TV is 0.30 mm"?
4. What is the working principle of field emission displays?
5. Compare and contrast the CRT display and the FED.
6. Explain briefly the working principle of a plasma display panel (PDP).
7. How do we achieve gray scale in a PDP? How do we achieve color?
8. How does an LCD work? Is it an emissive display? How do we achieve gray scale in
an LCD? How is color reproduced in an LCD?
9. Compare passive and active matrix LCDs.
10. Draw the schematic of the optical switching in a DMD. What is a typical mirror size
in a DMD? Which technology is used to fabricate a DMD?
11. Compare and contrast the LCD and the DMD.
12. Explain the principle of DMD-based projection displays.
Appendix
Images (CD:\data\images)
{airplane, baboon, Lena}.tif % standard 512x512 gray level images
{banff1, banff2, lakelouise, niagra, geeta}.tif % miscellaneous images
lenablur.tif % blurred Lena image
airplane256.tif % 256x256 airplane image
Video (CD:\data\video)
{claire1, claire2}.tif % two frames from the Claire sequence
{football000, football002}.tif % two frames from the football sequence
{shot1, shot3}.tif % frames from two video shots
Chapter 2
test{1,2,3,4,5}.wav % Output of Example 2.1
Examp2_2.mid % Output of Example 2.2
Chapter 10
bell1_lpf.wav % LPF output of Example 10.1
bell1_hpf.wav % HPF output of Example 10.1
bell1_bpf.wav % BPF output of Example 10.1
Examp10_2.wav % Output of Example 10.2
Video
CD:\data\chap11\wipe1 % Transition frames in Example 11.6
CD:\data\chap11\wipe2 % Transition frames in Example 11.6
CD:\data\chap11\dissolve % Transition frames in Example 11.7
CD:\data\chap11\fade % Transition frames in Example 11.8