VIDEO FORMATION, PERCEPTION, AND REPRESENTATION
In this first chapter, we describe what a video signal is, how it is captured and perceived, how it is stored and transmitted, and what the important parameters are that determine the quality and the bandwidth (which in turn determines the data rate) of a video signal. We first present the underlying physics for color perception and specification (Sec. 1.1). We then describe the principles and typical devices for video capture and display (Sec. 1.2). As will be seen, analog video is captured, stored, and transmitted in a raster scan format, using either progressive or interlaced scans. As an example, we review the analog color television (TV) system (Sec. 1.4), and give insights into how certain critical parameters, such as the frame rate and line rate, are chosen, what the spectral content of a color TV signal is, and how the different components of the signal can be multiplexed into a composite signal. Finally, Section 1.5 introduces the ITU-R BT.601 video format (formerly CCIR 601), the digitized version of the analog color TV signal. We present some of the considerations that have gone into the selection of the various digitization parameters. We also describe several other digital video formats, including high-definition TV (HDTV). The compression standards developed for different applications and their associated video formats are summarized.
The purpose of this chapter is to give the reader background knowledge about analog and digital video, and to provide insights into common video system design problems. As such, the presentation is intentionally more qualitative than quantitative. In later chapters, we will come back to certain problems mentioned in this chapter and provide more rigorous descriptions and solutions.
1.1 Color Perception and Specification
A video signal is a sequence of two dimensional (2D) images projected from a
dynamic three dimensional (3D) scene onto the image plane of a video camera. The
color value at any point in a video frame records the emitted or reflected light at a particular 3D point in the observed scene. To understand what the color value means physically, we review in this section the basics of light physics and describe the attributes that characterize light and its color. We also describe the principles of human color perception and different ways to specify a color signal.
1.1.1 Light and Color
Light is an electromagnetic wave with wavelengths in the range of 380 to 780 nanometers (nm), to which the human eye is sensitive. The energy of light is measured by flux, with a unit of watt, which is the rate at which energy is emitted. The radiant intensity of a light, which is directly related to the brightness of the light we perceive, is defined as the flux radiated into a unit solid angle in a particular direction, measured in watts/solid-angle. A light source usually emits energy over a range of wavelengths, and its intensity can vary in both space and time. In this book, we use C(X, t, λ) to represent the radiant intensity distribution of a light, which specifies the light intensity at wavelength λ, spatial location X = (X, Y, Z), and time t.
The perceived color of a light depends on its spectral content (i.e. the wavelength
composition). For example, a light that has its energy concentrated near 700 nm
appears red. A light that has equal energy in the entire visible band appears white.
In general, a light that has a very narrow bandwidth is referred to as a spectral
color. On the other hand, a white light is said to be achromatic.
There are two types of light sources: the illuminating source, which emits an electromagnetic wave, and the reflecting source, which reflects an incident wave.1 Illuminating light sources include the sun, light bulbs, television (TV) monitors, etc. The perceived color of an illuminating light source depends on the wavelength range in which it emits energy. Illuminating light follows an additive rule: the perceived color of several mixed illuminating light sources depends on the sum of the spectra of all the light sources. For example, combining red, green, and blue lights in the right proportions creates white.
Reflecting light sources are those that reflect an incident light (which could itself be a reflected light). When a light beam hits an object, the energy in a certain wavelength range is absorbed, while the rest is reflected. The color of a reflected light depends on the spectral content of the incident light and the wavelength range that is absorbed. A reflecting light source follows a subtractive rule: the perceived color of several mixed reflecting light sources depends on the remaining, unabsorbed wavelengths. The most notable reflecting light sources are color dyes and paints. For example, if the incident light is white, a dye that absorbs the wavelengths near 700 nm (red) appears cyan. In this sense, we say that cyan is the complement of
1 The illuminating and reflecting light sources are also referred to as primary and secondary light sources, respectively. We do not use those terms, to avoid confusion with the primary colors associated with light. In other places, illuminating and reflecting lights are also called additive colors and subtractive colors, respectively.
Figure 1.1. Solid line: Frequency responses of the three types of cones on the human retina. The blue response curve is magnified by a factor of 20 in the figure. Dashed line: The luminous efficiency function. From [10, Fig. 1].
red (or white minus red). Similarly, magenta and yellow are complements of green
and blue, respectively. Mixing cyan, magenta, and yellow dyes produces black,
which absorbs the entire visible spectrum.
1.1.2 Human Perception of Color
The perception of light in human beings starts with the photoreceptors located in the retina (the surface at the rear of the eyeball). There are two types of receptors: cones, which function under bright light and can perceive color tone, and rods, which work under low ambient light and can only extract luminance information. The visual information from the retina is passed via the optic nerve fibers to the brain area called the visual cortex, where visual processing and understanding are accomplished. There are three types of cones, which have overlapping pass-bands in the visible spectrum with peaks at red (near 570 nm), green (near 535 nm), and blue (near 445 nm) wavelengths, respectively, as shown in Figure 1.1. The responses
of these receptors to an incoming light distribution C(λ) can be described by

    C_i = \int C(\lambda) a_i(\lambda) \, d\lambda, \qquad i = r, g, b,        (1.1.1)

where a_r(λ), a_g(λ), a_b(λ) are referred to as the frequency responses or relative absorption functions of the red, green, and blue cones. The combination of these three types of receptors enables a human being to perceive any color. This implies that the perceived color depends only on three numbers, C_r, C_g, C_b, rather than on the complete light spectrum C(λ). This is known as the tri-receptor theory of color vision, first discovered by Young [14].
There are two attributes that describe the color sensation of a human being: luminance and chrominance. The term luminance refers to the perceived brightness of the light, which is proportional to the total energy in the visible band. The term chrominance describes the perceived color tone of a light, which depends on the wavelength composition of the light. Chrominance is in turn characterized by two attributes: hue and saturation. Hue specifies the color tone, which depends on the peak wavelength of the light, while saturation describes how pure the color is, which depends on the spread or bandwidth of the light spectrum. In this book, we use the word color to refer to both the luminance and chrominance attributes of a light, although it is customary to use the word color to refer only to the chrominance aspect of a light.
Experiments have shown that there exists a secondary processing stage in the human visual system (HVS), which converts the three color values obtained by the cones into one value that is proportional to the luminance and two other values that are responsible for the perception of chrominance. This is known as the opponent color model of the HVS [3, 9]. It has been found that the same amount of energy produces different sensations of brightness at different wavelengths, and this wavelength-dependent variation of the brightness sensation is characterized by a relative luminous efficiency function, a_y(λ), which is also shown (as a dashed line) in Fig. 1.1. It is essentially the sum of the frequency responses of all three types of cones. We can see that the green wavelengths contribute most to the perceived brightness, the red wavelengths the second most, and the blue the least. The luminance
(often denoted by Y) is related to the incoming light spectrum by

    Y = \int C(\lambda) a_y(\lambda) \, d\lambda.        (1.1.2)
In the above equations, we have neglected the time and space variables, since we are only concerned with the perceived color or luminance at a fixed spatial and temporal location. We have also neglected the scaling factor commonly associated with each equation, which depends on the desired units for describing the color intensities and luminance.
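To make these integrals concrete, the following sketch evaluates Eqs. (1.1.1) and (1.1.2) numerically for a sampled spectrum. The Gaussian-shaped absorption curves and the equal-energy spectrum are illustrative assumptions only, not the measured cone responses plotted in Fig. 1.1.

```python
import numpy as np

# A minimal sketch of Eqs. (1.1.1) and (1.1.2): the cone responses and the
# luminance are integrals of the light spectrum C(lambda) weighted by
# absorption functions a_i(lambda).  The Gaussian curves and the flat
# spectrum below are illustrative assumptions, not the measured cone data.

wavelengths = np.linspace(380.0, 780.0, 401)          # visible band, in nm
d_lambda = wavelengths[1] - wavelengths[0]

def gaussian(peak_nm, width_nm):
    return np.exp(-0.5 * ((wavelengths - peak_nm) / width_nm) ** 2)

a = {                                                  # assumed absorption functions
    "r": gaussian(570.0, 50.0),
    "g": gaussian(535.0, 50.0),
    "b": gaussian(445.0, 30.0),
}
a_y = a["r"] + a["g"] + a["b"]                         # rough luminous efficiency

C = np.ones_like(wavelengths)                          # equal-energy ("white") light

# Eq. (1.1.1): C_i = integral of C(lambda) a_i(lambda) d(lambda), i = r, g, b
cone_responses = {name: float(np.sum(C * curve) * d_lambda) for name, curve in a.items()}
# Eq. (1.1.2): Y = integral of C(lambda) a_y(lambda) d(lambda)
Y = float(np.sum(C * a_y) * d_lambda)

print(cone_responses, Y)
```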
1.1.3 The Trichromatic Theory of Color Mixture
A very important finding in color physics is that most colors can be produced by mixing three properly chosen primary colors. This is known as the trichromatic theory of color mixture, first demonstrated by Maxwell in 1855 [9, 13]. Let C_k, k = 1, 2, 3, represent the colors of three primary color sources, and C a given color. Then the theory essentially says

    C = \sum_{k=1,2,3} T_k C_k,        (1.1.3)

where the T_k's are the amounts of the three primary colors required to match color C. The T_k's are known as tristimulus values. In general, some of the T_k's can be
negative. Assuming only T_1 is negative, this means that one cannot match color C by mixing C_1, C_2, C_3, but one can match color C + |T_1|C_1 with T_2 C_2 + T_3 C_3. In practice, the primary colors should be chosen so that most natural colors can be reproduced using positive combinations of the primary colors. The most popular primary set for illuminating light sources contains red, green, and blue, known as the RGB primary. The most common primary set for reflecting light sources contains cyan, magenta, and yellow, known as the CMY primary. In fact, the RGB and CMY primary sets are complements of each other, in that mixing two colors in one set produces one color in the other set. For example, mixing red with green yields yellow. This complementary relation is best illustrated by a color wheel, which can be found in many image processing books, e.g., [9, 4].
For a chosen primary set, one way to determine the tristimulus values of any color is to first determine the color matching functions, m_i(λ), for the primary colors C_i, i = 1, 2, 3. These functions give the tristimulus values of a spectral color with wavelength λ, for all λ in the visible band, and can be determined by visual experiments under controlled viewing conditions. The tristimulus values for any color with spectrum C(λ) can then be obtained by [9]

    T_i = \int C(\lambda) m_i(\lambda) \, d\lambda, \qquad i = 1, 2, 3.        (1.1.4)

To produce all visible colors with positive mixing, the matching functions associated with the primary colors must be positive.
The above theory forms the basis for color capture and display. To record the color of an incoming light, a camera needs three sensors with frequency responses similar to the color matching functions of a chosen primary set. This can be accomplished by optical or electronic filters with the desired frequency responses. Similarly, to display a color picture, the display device needs to emit three optical beams of the chosen primary colors with appropriate intensities, as specified by the tristimulus values. In practice, electron beams that strike phosphors with the red, green, and blue colors are used. All present display systems use an RGB primary, although the standard spectra specified for the primary colors may differ slightly. Likewise, a color printer can produce different colors by mixing three dyes with the chosen primary colors in appropriate proportions. Most color printers use the CMY primary. For a more vivid and wide-range color rendition, some color printers use four primaries, adding black (K) to the CMY set. This is known as the CMYK primary, which can render black more faithfully.
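As a small illustration of the complementary (subtractive) relation between the RGB and CMY(K) primaries, the sketch below uses the common idealized conversion C = 1 − R, M = 1 − G, Y = 1 − B, with black (K) extracted as the part common to all three dyes; actual printer color management is considerably more involved.

```python
def rgb_to_cmyk(r, g, b):
    """Idealized RGB -> CMYK conversion for normalized values in [0, 1].

    Uses the simple complement rule c = 1 - r, etc., then pulls out the
    common black component k.  Real devices use measured color profiles.
    """
    c, m, y = 1.0 - r, 1.0 - g, 1.0 - b
    k = min(c, m, y)                    # amount of black shared by all three dyes
    if k == 1.0:                        # pure black: avoid division by zero
        return 0.0, 0.0, 0.0, 1.0
    scale = 1.0 - k
    return (c - k) / scale, (m - k) / scale, (y - k) / scale, k

# Red light (1, 0, 0) is reproduced on paper by a cyan-free dye mixture:
print(rgb_to_cmyk(1.0, 0.0, 0.0))       # -> (0.0, 1.0, 1.0, 0.0)
```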
1.1.4 Color Specification by Tristimulus Values
Tristimulus Values We have introduced the tristimulus representation of a color in Sec. 1.1.3, which specifies the proportions, i.e., the T_k's in Eq. (1.1.3), of the three primary colors needed to create the desired color. In order to make the color specification independent of the absolute energy of the primary colors, these values
plane at time t. In general, the video signal has a finite spatial and temporal range. The spatial range depends on the camera viewing area, while the temporal range depends on the duration over which the video is captured. A point in the image plane is called a pixel (meaning picture element) or simply pel.4 For most camera systems, the projection operator P(·) can be approximated by a perspective projection. This is discussed in more detail in Sec. 5.1.
If the camera absorption function is the same as the relative luminous efficiency function of the human being, i.e., a_c(λ) = a_y(λ), then a luminance image is formed. If the absorption function is non-zero over a narrow band, then a monochrome (or monotone) image is formed. To perceive all visible colors, according to the trichromatic color vision theory (see Sec. 1.1.2), three sensors are needed, each with a frequency response similar to the color matching function of a selected primary color. As described before, most color cameras use red, green, and blue sensors for color acquisition.
If the camera has only one luminance sensor, ψ(x, t) is a scalar function that represents the luminance of the projected light. In this book, we use the word gray-scale to refer to such a video. The term black-and-white will be used strictly to describe an image that has only two colors: black and white. On the other hand, if the camera has three separate sensors, each tuned to a chosen primary color, the signal is a vector function that contains three color values at every point. Instead of specifying these color values directly, one can use other color coordinates (each consisting of three values) to characterize the light, as explained in the previous section.
Note that for special purposes, one may use sensors that work in a frequency range that is invisible to the human eye. For example, in X-ray imaging, the sensor is sensitive to the spectral range of X-rays. An infra-red camera, on the other hand, is sensitive to the infra-red range and can function under very low ambient light. These cameras can "see" things that cannot be perceived by the human eye. Yet another example is the range camera, in which the sensor emits a laser beam and measures the time it takes for the beam to reach an object and be reflected back to the sensor. Because the round-trip time is proportional to the distance between the sensor and the object surface, the image intensity at any point in a range image describes the distance, or range, of the corresponding 3D point from the camera.
1.2.2 Video Cameras
All analog cameras of today capture a video frame by frame, with a certain time spacing between the frames. Some cameras (e.g., TV cameras and consumer video camcorders) acquire a frame by scanning consecutive lines with a certain line spacing. Similarly, all display devices present a video as a consecutive set of frames, and with TV monitors the scan lines are played back sequentially as separate lines. Such capture and display mechanisms are designed to take advantage
4 Strictly speaking, the notion of pixel or pel is only defined in digital imagery, in which each image or frame in a video is represented by a finite 2D array of pixels.
of the fact that the HVS cannot perceive very high frequency changes in time and space. This property of the HVS will be discussed more extensively in Sec. 2.4.
There are basically two types of video imagers: (1) tube-based imagers such as vidicons, plumbicons, or orthicons, and (2) solid-state sensors such as charge-coupled devices (CCD). The lens of a camera focuses the image of a scene onto the photosensitive surface of the camera's imager, which converts optical signals into electrical signals. The photosensitive surface of a tube imager is typically scanned line by line (known as raster scan) with an electron beam or by other electronic methods, and the scanned lines in each frame are then converted into an electrical signal representing variations of light intensity as variations in voltage. Different lines are therefore captured at slightly different times in a continuous manner. With progressive scan, the electron beam scans every line consecutively; with interlaced scan, the beam scans every other line in one half of the frame time (a field) and then scans the other half of the lines. We will discuss raster scan in more detail in Sec. 1.3. With a CCD camera, the photosensitive surface is composed of a 2D array of sensors, each corresponding to one pixel, and the optical signal reaching each sensor is converted to an electronic signal. The sensor values captured in each frame time are first stored in a buffer and then read out sequentially, one line at a time, to form a raster signal. Unlike tube-based cameras, all the values read out in the same frame are captured at the same time. With an interlaced scan camera, alternate lines are read out in each field.
To capture color, there are usually three types of photosensitive surfaces or CCD sensors, each with a frequency response determined by the color matching function of the chosen primary color, as described in Sec. 1.1.3. To reduce cost, most consumer cameras use a single CCD chip for color imaging. This is accomplished by dividing the sensor area of each pixel into three or four sub-areas, each sensitive to a different primary color. The three captured color signals can either be converted to one luminance signal and two chrominance signals and sent out as a component color video, or multiplexed into a composite signal. This subject is explained further in Sec. 1.2.4.
Many cameras of today are CCD-based because, for the same spatial resolution, they can be made much smaller and lighter than tube-based cameras. Advances in CCD technology have made it possible to capture a very high resolution image array on a very small chip. For example, 1/3-in CCD's with 380 K pixels are commonly found in consumer camcorders, whereas a 2/3-in CCD with 2 million pixels has been developed for HDTV. Tube-based cameras are bulkier and more costly, and are only used in special applications, such as those requiring very high resolution or high sensitivity under low ambient light. In addition to the circuitry for color imaging, most cameras also implement color coordinate conversion (from RGB to luminance and chrominance) and compositing of the luminance and chrominance signals. For digital output, analog-to-digital (A/D) conversion is also incorporated. Figure 1.2 shows the typical processing involved in a professional video camera. The camera provides outputs in both digital and analog form and, in the analog case, in both component and composite formats.
Figure 1.2. Schematic block diagram of a professional color video camera. From [6, Fig. 7(a)].
To improve the image quality, digital processing is introduced within the camera. For an excellent exposition of video camera and display technologies, see [6].
1.2.3 Video Display
To display a video, the most common device is the cathode ray tube (CRT). In a CRT monitor, an electron gun emits an electron beam across the screen line by line, exciting phosphors with intensities proportional to the intensity of the video signal at the corresponding locations. To display a color image, three beams are emitted by three separate guns, exciting red, green, and blue phosphors with the desired intensity combination at each location. To be more precise, each color pixel consists of three elements arranged in a small triangle, known as a triad.
The CRT can produce an image with a very large dynamic range, so that the displayed image can be very bright, sufficient for viewing in daylight or from a distance. However, the thickness of a CRT needs to be about the same as the width of the screen, for the electrons to reach the sides of the screen. A large-screen monitor is thus too bulky, unsuitable for applications requiring thin and portable devices. To circumvent this problem, various flat panel displays have been developed. One popular device is the liquid crystal display (LCD). The principal idea behind the LCD is to change the optical properties, and consequently the brightness/color, of the liquid crystal by an applied electric field. The electric field can be generated/adapted either by an array of transistors, as in LCD's using active-matrix thin-film transistors (TFT), or by using plasma. The plasma technology eliminates the need for TFT's and makes large-screen LCD's possible. There are also new designs for flat CRT's. A more comprehensive description of video display technologies can be found in [6].
The raster scan and display mechanisms described above apply only to TV cameras and displays. With movie cameras, the color pattern seen by the camera at any frame instant is completely recorded on film. For display, consecutive recorded frames are played back using an analog optical projection system.
Figure 1.3. Progressive (a) and Interlaced (b) Raster Scan Formats.
Interlaced Scan The raster scan format described above is more accurately known as progressive scan (also known as sequential or non-interlaced scan), in which the horizontal lines are scanned successively. In the interlaced scan, each frame is scanned in two fields and each field contains half the number of lines in a frame. The time interval between two fields, i.e., the field interval, is half of the frame interval, while the line spacing in a field is twice that desired for a frame. The scan lines in two successive fields are shifted by half of the line spacing in each field. This is illustrated in Fig. 1.3(b). Following the terminology used in the MPEG standard, we call the field containing the first line and the following alternating lines of a frame the top field, and the field containing the second line and the following
alternating lines the bottom field.6 In certain systems, the top field is sampled first, while in other systems the bottom field is sampled first. It is important to remember that two adjacent lines in a frame are separated in time by the field interval. This fact leads to the infamous zig-zag artifacts in an interlaced video that contains fast-moving objects with vertical edges. The motivation for using the interlaced scan is to trade off vertical resolution for enhanced temporal resolution, given the total number of lines that can be recorded within a given time. A more thorough comparison of the progressive and interlaced scans in terms of their sampling efficiency is given later in Sec. 3.3.2.
The interlaced scan introduced above should more precisely be called 2:1 interlace. In general, one can divide a frame into K ≥ 2 fields, each separated in time by Δt/K. This is known as K:1 interlace, and K is called the interlace order. In a digital video where each line is represented by discrete samples, the samples on the same line may also appear in different fields. For example, the samples in a frame may be divided into two fields using a checker-board pattern. The most general definition of the interlace order is the ratio of the number of samples in a frame to the number of samples in each field.
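As a simple illustration of 2:1 interlace in the sampled domain, the following sketch splits a frame stored as a 2D array into its top and bottom fields by taking alternate lines; it is only a didactic sketch and does not model the half-frame-interval time offset between the two fields.

```python
import numpy as np

def split_into_fields(frame):
    """Split a frame (2D array) into top and bottom fields (2:1 interlace).

    The top field holds line 0 and every other line after it; the bottom
    field holds line 1 and every other line after it.  In a real interlaced
    signal the two fields are also captured half a frame interval apart.
    """
    return frame[0::2, :], frame[1::2, :]

frame = np.arange(6 * 4).reshape(6, 4)       # a toy 6-line "frame"
top, bottom = split_into_fields(frame)
print(top.shape, bottom.shape)               # (3, 4) (3, 4)
```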
1.3.2 Characterization of a Video Raster
A raster is described by two basic parameters: the frame rate (frames/second, fps, or Hz), denoted by f_{s,t}, and the line number (lines/frame or lines/picture-height), denoted by f_{s,y}. These two parameters define the temporal and vertical sampling rates of a raster scan. From them, one can derive another important parameter, the line rate (lines/second), denoted by f_l = f_{s,t} f_{s,y}.7 We can also derive the temporal sampling interval or frame interval, Δ_t = 1/f_{s,t}, the vertical sampling interval or line spacing, Δ_y = picture-height/f_{s,y}, and the line interval, T_l = 1/f_l = Δ_t/f_{s,y}, which is the time used to scan one line. Note that the line interval T_l includes the time for the sensor to move from the end of a line to the beginning of the next line, which is known as the horizontal retrace time, or simply horizontal retrace, denoted by T_h. The actual scanning time for a line is T'_l = T_l − T_h. Similarly, the frame interval Δ_t includes the time for the sensor to move from the end of the bottom line in a frame to the beginning of the top line of the next frame, called the vertical retrace time, or simply vertical retrace, denoted by T_v. The number of lines that are actually scanned in a frame time, known as the active lines, is f'_{s,y} = (Δ_t − T_v)/T_l = f_{s,y} − T_v/T_l. Normally, T_v is chosen to be an integer multiple of T_l.
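As a numerical illustration of these timing relations, the sketch below derives the line rate, line interval, and active line count from NTSC-like parameters; the retrace durations used (10.9 μs horizontal, 45 line intervals of vertical retrace per frame) are assumed example values, not normative ones.

```python
# Raster-timing relations of Sec. 1.3.2 with NTSC-like numbers.
# The retrace durations are assumed example values, not normative figures.
f_st = 30000 / 1001              # frame rate f_{s,t}, ~29.97 frames/s
f_sy = 525                       # line number f_{s,y}, lines/frame

f_l = f_st * f_sy                # line rate = f_{s,t} * f_{s,y}, ~15734 lines/s
delta_t = 1 / f_st               # frame interval (s)
T_l = 1 / f_l                    # line interval (s), ~63.6 us

T_h = 10.9e-6                    # assumed horizontal retrace time (s)
T_l_active = T_l - T_h           # actual scanning time per line

T_v = 45 * T_l                   # assumed vertical retrace: 45 line intervals/frame
f_sy_active = (delta_t - T_v) / T_l     # active lines = f_{s,y} - T_v / T_l

print(f"line rate     = {f_l:.1f} lines/s")
print(f"line interval = {T_l * 1e6:.2f} us, of which {T_l_active * 1e6:.2f} us active")
print(f"active lines  = {f_sy_active:.0f} of {f_sy}")
```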
A typical waveform of an interlaced raster signal is shown in Fig. 1.4(a). Notice that portions of the signal during the horizontal and vertical retrace periods are held at a constant level below the level corresponding to black. These are called
6 A more conventional definition is to call the field that contains all the even lines the even field, and the field containing all the odd lines the odd field. This definition depends on whether the first line is numbered 0 or 1, and is therefore ambiguous.
7 The frame rate and line rate are also known as the vertical sweep frequency and the horizontal
sweep frequency, respectively.
Figure 1.4. A Typical Interlaced Raster Scan: (a) Waveform, (b) Spectrum.
sync signals. The display device starts the retrace process upon detecting these sync signals.
Figure 1.4(b) shows the spectrum of a typical raster signal. It can be seen that the spectrum contains peaks at the line rate f_l and its harmonics. This is because adjacent scan lines are very similar, so that the signal is nearly periodic with a period of T_l. The width of each harmonic lobe is determined by the maximum vertical frequency in a frame. The overall bandwidth of the signal is determined by the maximum horizontal spatial frequency.
The frame rate is one of the most important parameters that determine the quality of a video raster. For example, the TV industry uses an interlaced scan with a frame rate of 25-30 Hz, with an effective temporal refresh rate of 50-60 Hz, while the movie industry uses a frame rate of 24 Hz.8 On the other hand, in the
8 To reduce the visibility of flicker, a rotating blade is used to create the illusion of 72 frames/sec.
Figure 1.5. Analog Color TV systems: Video Production, Transmission, and Reception.
computer industry, 72 Hz has become a de facto standard. The line number used in a raster scan is also a key factor affecting the video quality. In analog TV, a line number of about 500-600 is used, while for computer displays a much higher line number is used (e.g., the SVGA display has 1024 lines). These frame rates and line numbers are determined based on the visual temporal and spatial thresholds under different viewing environments, as described later in Sec. 2.4. Higher frame rates and line rates are necessary in computer applications to accommodate a significantly shorter viewing distance and higher frequency content (line graphics and text) in the displayed material.
The width-to-height ratio of a video frame is known as the image aspect ratio (IAR). For example, an IAR of 4:3 is used in standard-definition TV (SDTV) and computer displays, while a higher IAR is used in wide-screen movies (up to 2.2) and in HDTV (IAR = 16:9) for a more dramatic visual sensation.
America as well as some other parts of Asia, including Japan and Taiwan; the PAL system, used in most of Western Europe and Asia, including China, and the Middle East countries; and the SECAM system, used in the former Soviet Union, Eastern Europe, France, as well as the Middle East. We will compare these systems in terms of their spatial and temporal resolution, color coordinates, and multiplexing mechanisms. The material presented here is mainly from [9, 10]. More complete coverage of color TV systems can be found in [5, 1].
and

    \begin{bmatrix} \tilde{R} \\ \tilde{G} \\ \tilde{B} \end{bmatrix} =
    \begin{bmatrix} 1.0 & 0.956 & 0.620 \\ 1.0 & -0.272 & -0.647 \\ 1.0 & -1.108 & 1.700 \end{bmatrix}
    \begin{bmatrix} Y \\ I \\ Q \end{bmatrix}.        (1.4.4)

With the YIQ coordinate, tan^{-1}(Q/I) approximates the hue, and \sqrt{I^2 + Q^2}/Y reflects the saturation. In an NTSC composite video, the I and Q components are multiplexed into one signal, so that the phase of the modulated signal is tan^{-1}(Q/I),
MHz. Although the potential bandwidth of the chrominance signal could be just as high, it is usually significantly lower than that of the luminance signal. Furthermore, the HVS has been found to be much less sensitive to changes in chrominance. Because of this, the two chrominance signals are typically bandlimited to a much narrower bandwidth. As mentioned previously, the human eye is more sensitive to spatial variations in the orange-to-cyan color range, represented by the I component, than in the green-to-purple range, represented by the Q component. Therefore, the I component is bandlimited to 1.5 MHz, and the Q component to 0.5 MHz.10 Table 1.1 lists the signal bandwidths of the different TV systems.
1.4.4 Multiplexing of Luminance, Chrominance, and Audio
In order to make the color TV signal compatible with the monochrome TV system, all three analog TV systems use the composite video format, in which the three color components as well as the audio component are multiplexed into one signal. Here, we briefly describe the mechanism used by NTSC. First, the two chrominance components I(t) and Q(t) are combined into a single signal C(t) using quadrature amplitude modulation (QAM). The color sub-carrier frequency f_c is chosen to be an odd multiple of half the line rate, f_c = 455 f_l/2 ≈ 3.58 MHz. This value is chosen to satisfy the following criteria: i) it should be high enough that the luminance component there has very low energy; ii) it should be midway between two line-rate harmonics, where the luminance component is strong; and iii) it should be sufficiently far away from the audio sub-carrier, which is set at 4.5 MHz (= 286 f_l), the same as in monochrome TV. Figure 1.6(a) shows how the harmonic peaks of the luminance and chrominance signals interleave with each other. Finally, the audio signal is frequency modulated (FM) using an audio sub-carrier frequency of f_a = 4.5 MHz and added to the composite video signal, to form the final multiplexed signal.
Because the I component has a bandwidth of 1.5 MHz, the modulated chrominance signal has a maximum frequency of up to about 5.08 MHz. In order to avoid interference with the audio signal, the chrominance signal is bandlimited in the upper sideband to 0.6 MHz. Notice that the lower sideband of the I signal will run into the upper part of the Y signal. For this reason, the I signal is sometimes bandlimited to 0.6 MHz on both sidebands. Finally, the entire composite signal, with a bandwidth of about 4.75 MHz, is modulated onto a picture carrier frequency, f_p, using vestigial sideband modulation (VSB), so that the lower sideband extends only to 1.25 MHz below f_p and the overall signal occupies 6 MHz. This process is the same as in the monochrome TV system. The picture carrier f_p depends on the broadcast channel. Figure 1.6(b) illustrates the spectral composition of the NTSC composite signal. The signal bandwidths and modulation methods in the three color TV systems are summarized in Table 1.1.
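To make the choice of sub-carrier frequencies concrete, the following sketch computes the NTSC line rate, color sub-carrier, and audio sub-carrier from the relations quoted above; all numbers follow directly from the 525-line, ~29.97 Hz parameters of this section.

```python
# NTSC sub-carrier relations from Sec. 1.4.4.
f_st = 30000 / 1001              # NTSC frame rate, ~29.97 Hz
f_sy = 525                       # lines per frame
f_l = f_st * f_sy                # line rate, ~15734 Hz

f_c = 455 * f_l / 2              # color sub-carrier: an odd multiple of f_l / 2
f_a = 286 * f_l                  # audio sub-carrier, ~4.5 MHz

print(f"line rate         = {f_l:.2f} Hz")
print(f"color sub-carrier = {f_c / 1e6:.4f} MHz")   # ~3.58 MHz
print(f"audio sub-carrier = {f_a / 1e6:.4f} MHz")   # ~4.5 MHz
# f_c falls midway between the 227th and 228th line-rate harmonics, so the
# chrominance harmonics interleave with the luminance harmonics.
print(f"gap to audio sub-carrier = {(f_a - f_c) / 1e6:.3f} MHz")
```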
At a television receiver, the composite signal first has to be demodulated to the baseband, and then the audio and the three components of the video signal must
10 In [9], the bandwidths of I and Q are cited as 1.3 and 0.6 MHz, respectively.
Figure 1.6. Multiplexing of luminance, chrominance, and audio signals in the NTSC system. (a) The interleaving between luminance and chrominance harmonics; (b) The overall spectral composition of the NTSC composite signal.
be demultiplexed. To separate the video and audio signals, a low-pass filter can be used. This process is the same in a monochrome TV as in a color TV. To further separate the chrominance signal from the luminance signal, ideally a comb filter should be used, to take advantage of the interleaving of the harmonic frequencies of these two signals. Most high-end TV sets implement a digital comb filter with null frequencies at the harmonics corresponding to the chrominance component to accomplish this. Low-end TV sets, however, use a simple RC circuit to perform low-pass filtering with a cut-off frequency at 3 MHz, which retains the unwanted I component in the extracted luminance signal, and vice versa. This leads to cross-color and cross-luminance artifacts. Cross-color refers to spurious colors created by high-frequency luminance content that is close to the color sub-carrier frequency. Cross-luminance refers to false high-frequency edge patterns caused by the modulated chrominance information. For a good illustration of the effects of different filters, see [2]. After extracting the chrominance signal, a corresponding color-demodulation method is used to separate the two chrominance components. Finally, the three color components are converted to the RGB coordinate for display.
1.4.5 Analog Video Recording
Along with the development of analog TV systems, various video tape recording (VTR) technologies have been developed to allow professional video production (recording and editing) as well as consumer-level recording (home video) and playback (VCR). Table 1.2 summarizes common video tape formats.
1.5 Digital Video
1.5.1 Notations
A digital video can be obtained either by sampling a raster scan, or directly using a
digital video camera. Presently all digital cameras use CCD sensors. As with analog
cameras, a digital camera samples the imaged scene as discrete frames. Each frame
comprises the output values from a CCD array, which is by nature discrete both horizontally and vertically. A digital video is defined by the frame rate, f_{s,t}, the line number, f_{s,y}, as well as the number of samples per line, f_{s,x}. From these, one can find the temporal sampling interval or frame interval, Δ_t = 1/f_{s,t}, the vertical sampling interval, Δ_y = picture-height/f_{s,y}, and the horizontal sampling interval, Δ_x = picture-width/f_{s,x}. In this book, we will use ψ(m, n, k) to represent a digital video, where the integer indices m and n are the column and row indices, and k is the frame number. The actual spatial and temporal locations corresponding to the integer indices are x = mΔ_x, y = nΔ_y, and t = kΔ_t. For convenience, we use the notation ψ(x, y, t) to describe a video signal in a general context, which could be either analog or digital. We will use ψ(m, n, k) only when specifically addressing digital video.
In addition to the above parameters, another important parameter of a digital video is the number of bits used to represent a pixel value (luminance only, or three color values), denoted by N_b. Conventionally, the luminance or each of the three color values is specified with 8 bits, or 256 levels. Therefore, N_b = 8 for a monochrome video, and N_b = 24 for a color video. The data rate, R, of a digital video is determined by R = f_{s,t} f_{s,x} f_{s,y} N_b, with a unit of bits/second (bps). Usually it is measured in kilobits/second (Kbps) or megabits/second (Mbps). In general, the spatial and temporal sampling rates can be different for the luminance and chrominance components of a digital video. In this case, N_b should reflect the equivalent number of bits used for each pixel at the luminance sampling resolution. For example, if the horizontal and vertical sampling rates for each chrominance component are both half of those for the luminance, then there are two chrominance samples for every four Y samples. If each sample is represented with 8 bits, the equivalent number of bits per sample at the Y resolution is (4*8+2*8)/4 = 12 bits.
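The following sketch simply evaluates the data-rate formula R = f_{s,t} f_{s,x} f_{s,y} N_b, together with the equivalent-N_b arithmetic above, for a BT.601-like 4:2:0 frame; the helper function is illustrative only.

```python
def raw_data_rate_bps(frame_rate, samples_per_line, lines_per_frame, bits_per_pixel):
    """R = f_{s,t} * f_{s,x} * f_{s,y} * N_b, in bits per second."""
    return frame_rate * samples_per_line * lines_per_frame * bits_per_pixel

# 4:2:0 sampling: 4 Y + 1 Cb + 1 Cr samples per 4 pixels, 8 bits each,
# so the equivalent N_b = (4*8 + 2*8) / 4 = 12 bits per pixel.
n_b = (4 * 8 + 2 * 8) / 4

# BT.601 active area of the 525-line system: 720 x 480 at ~30 frames/s.
r = raw_data_rate_bps(30, 720, 480, n_b)
print(f"{r / 1e6:.0f} Mbps")     # ~124 Mbps, as quoted in Sec. 1.5.3
```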
When displaying a digital video on a monitor, each pixel is rendered as a rectangular region with the constant color specified for that pixel. The ratio of the width to the height of this rectangular area is known as the pixel aspect ratio (PAR). It is related to the IAR of the display area and the image dimensions by PAR = IAR · f_{s,y}/f_{s,x}. For proper display of a digitized video, one must specify either the PAR or the IAR, along with f_{s,x} and f_{s,y}. The display device should conform to the PAR specified for the video (or derived from the specified IAR); otherwise, object shapes will be distorted. For example, a person will appear fatter and shorter if the display PAR is larger than the PAR specified for the video. In the computer industry, a PAR of 1.0 is normally used. In the TV industry, on the other hand, non-square pixels are used for historical reasons. The rationale behind this is explained in Sec. 1.5.2.
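A small sketch of the PAR/IAR relation just stated, using the 4:3 aspect ratio and the 720 x 480 active area of the BT.601 525-line system discussed below.

```python
def pixel_aspect_ratio(iar, samples_per_line, lines_per_frame):
    """PAR = (picture-width / f_{s,x}) / (picture-height / f_{s,y})
           = IAR * f_{s,y} / f_{s,x}."""
    return iar * lines_per_frame / samples_per_line

# BT.601 525-line active area (720 x 480) shown with a 4:3 image aspect ratio:
print(pixel_aspect_ratio(4 / 3, 720, 480))    # ~0.889, i.e., non-square pixels
```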
[Figure: BT.601 digital video frame formats. The 525-line system has 480 active lines and the 625-line system has 576 active lines; in each line, 122 + 16 pels (525-line system) or 132 + 12 pels (625-line system) fall outside the active area.]
to [9]. Here we only present the transformation matrix for deriving this coordinate from the digital RGB coordinate. Assuming that the RGB values are in the range of (0-255), the YCbCr values are related to the RGB values by

    \begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} =
    \begin{bmatrix} 0.257 & 0.504 & 0.098 \\ -0.148 & -0.291 & 0.439 \\ 0.439 & -0.368 & -0.071 \end{bmatrix}
    \begin{bmatrix} R \\ G \\ B \end{bmatrix} +
    \begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix}.        (1.5.3)
The inverse relation is

    \begin{bmatrix} R \\ G \\ B \end{bmatrix} =
    \begin{bmatrix} 1.164 & 0.0 & 1.596 \\ 1.164 & -0.392 & -0.813 \\ 1.164 & 2.017 & 0.0 \end{bmatrix}
    \begin{bmatrix} Y - 16 \\ C_b - 128 \\ C_r - 128 \end{bmatrix}.        (1.5.4)
In the above relations, R = 255R̃, G = 255G̃, B = 255B̃ are the digital equivalents of the normalized RGB primaries R̃, G̃, and B̃, as defined in either the NTSC or the PAL/SECAM system. In the YCbCr coordinate, Y reflects the luminance and is scaled to have a range of (16-235); Cb and Cr are scaled versions of the color differences B − Y and R − Y, respectively. The scaling and shifting are designed so that they have a range of (16-240). The maximum value of Cr corresponds to red (Cr = 240, or R = 255, G = B = 0), whereas the minimum value yields cyan (Cr = 16, or R = 0, G = B = 255). Similarly, the maximum and minimum values of Cb correspond to blue (Cb = 240, or R = G = 0, B = 255) and yellow (Cb = 16, or R = G = 255, B = 0).
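The sketch below is a direct transcription of Eqs. (1.5.3) and (1.5.4); it assumes 8-bit R, G, B inputs in the range (0-255) and performs no rounding or clipping beyond what the matrices themselves give.

```python
import numpy as np

# Eq. (1.5.3): 8-bit RGB -> YCbCr.
A = np.array([[ 0.257,  0.504,  0.098],
              [-0.148, -0.291,  0.439],
              [ 0.439, -0.368, -0.071]])
offset = np.array([16.0, 128.0, 128.0])

def rgb_to_ycbcr(rgb):
    return A @ np.asarray(rgb, dtype=float) + offset

# Eq. (1.5.4): the inverse transformation.
A_inv = np.array([[1.164,  0.0,    1.596],
                  [1.164, -0.392, -0.813],
                  [1.164,  2.017,  0.0  ]])

def ycbcr_to_rgb(ycbcr):
    return A_inv @ (np.asarray(ycbcr, dtype=float) - offset)

print(rgb_to_ycbcr([255, 0, 0]))      # red: Cr close to its maximum (~240)
print(rgb_to_ycbcr([255, 255, 255]))  # white: Y ~235, Cb = Cr = 128
print(ycbcr_to_rgb([235, 128, 128]))  # back to approximately (255, 255, 255)
```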
The spatial sampling rate introduced previously refers to the luminance component, Y. For the chrominance components, Cb and Cr, usually only half of this sampling rate is used, i.e., f_{s,c} = f_s/2. This leads to half the number of pixels in each line, but the same number of lines per frame. This is known as the 4:2:2 format, implying that there are 2 Cb samples and 2 Cr samples for every 4 Y samples. To further reduce the required data rate, BT.601 also defines the 4:1:1 format, in which
Figure 1.8. BT.601 chrominance subsampling formats. Note that the two adjacent lines in any one component belong to two different fields.
the chrominance components are subsampled along each line by a factor of 4, i.e., there is 1 Cb sample and 1 Cr sample for every 4 Y samples. This sampling method, however, yields very asymmetric resolutions in the horizontal and vertical directions. Another sampling format has therefore been developed, which subsamples the Cb and Cr components by half in both the horizontal and vertical directions. In this format, there is also 1 Cb sample and 1 Cr sample for every 4 Y samples; but to avoid confusion with the previously defined 4:1:1 format, this format is designated 4:2:0. For applications requiring very high resolution, the 4:4:4 format is defined, which samples the chrominance components at exactly the same resolution as the luminance component. The relative positions of the luminance and chrominance samples for the different formats are shown in Fig. 1.8.12
In Chap. 4, we will discuss solutions for converting videos with different spatial/temporal resolutions. The conversion between different color subsampling formats is considered in one of the exercise problems.
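As an illustration of chrominance subsampling, the following sketch reduces a full-resolution chrominance plane to the 4:2:2 and 4:2:0 grids by simple block averaging; a real converter would use proper anti-alias filters, and the exact sample positions of Fig. 1.8 are not modeled.

```python
import numpy as np

def subsample_422(chroma):
    """4:2:2: average horizontal pairs; keep every line."""
    return 0.5 * (chroma[:, 0::2] + chroma[:, 1::2])

def subsample_420(chroma):
    """4:2:0: average 2x2 blocks, halving both directions."""
    return 0.25 * (chroma[0::2, 0::2] + chroma[0::2, 1::2] +
                   chroma[1::2, 0::2] + chroma[1::2, 1::2])

cb = np.random.rand(480, 720)          # a full-resolution chroma plane
print(subsample_422(cb).shape)         # (480, 360)
print(subsample_420(cb).shape)         # (240, 360)
```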
The raw data rate of a BT.601 signal depends on the color sub-sampling factor. With the most common 4:2:2 format, there are two chrominance samples per two Y samples, each represented with 8 bits. Therefore, the equivalent bit rate for each Y sample is N_b = 16 bits, and the raw data rate is f_s N_b = 216 Mbps. The raw data rate corresponding to the active area is f_{s,t} f'_{s,x} f'_{s,y} N_b = 166 Mbps. With the 4:2:0 format, there are two chrominance samples per four Y samples, and the equivalent bit rate for each Y sample is N_b = 12 bits. Therefore the raw data rate is 162 Mbps, with 124 Mbps in the active area. For the 4:4:4 format, the equivalent bit rate for each Y sample is N_b = 24 bits, and the raw data rate is 324 Mbps, with 249 Mbps in the active area. The resolutions and data rates of the different BT.601 signals are
12 For the 4:2:0 format, the Cr and Cb samples may also be positioned at the center of the four corresponding Y samples, as shown in Fig. 13.14(a).
MPEG2-based DVD's, which started in the mid 1990s, opened the era of high quality digital video entertainment. MPEG2 technology is also the cornerstone of the next generation TV system, which will be fully digital, employing digital compression and transmission technology. Table 1.3 lists the details of the video formats discussed above, along with their main applications, compression methods, and compressed bit rates. More on compression standards will be presented in Chap. 13.
The BT.601 format is the standard picture format for digital TV (DTV). To further enhance the video quality, several HDTV formats have also been standardized by the Society of Motion Picture and Television Engineers (SMPTE); these are also listed in Table 1.3. A distinct feature of HDTV is its wider aspect ratio, 16:9 as opposed to 4:3 in SDTV. The picture resolution is doubled to tripled in both the horizontal and vertical dimensions. Furthermore, progressive scan is used to reduce interlacing artifacts. A high profile has been developed in the MPEG2 video compression standard for compressing HDTV video. Typically it can reduce the data rate to about 20 Mbps while retaining the very high quality required. This video bit rate is chosen so that the combined bit stream with audio, when transmitted using digital modulation techniques, can still fit into a 6 MHz terrestrial channel, which is the assigned channel bandwidth for HDTV broadcast in the U.S.
may eventually overtake tape-based systems, which are slower and have less storage
capacity.
1.5.5 Video Quality Measure
To conduct video processing, it is necessary to define an objective measure of the difference between an original video and the processed one. This is especially important, e.g., in video coding applications, where one must measure the distortion caused by compression. Ideally, such a measure should correlate well with the perceived difference between two video sequences. Finding such a measure, however, turns out to be an extremely difficult task. Although various quality measures have been proposed, those that correlate well with visual perception are quite complicated to compute. Most video processing systems of today are designed to minimize the mean square error (MSE) between two video sequences ψ_1 and ψ_2, which is defined as

    \mathrm{MSE} = \sigma_e^2 = \frac{1}{N} \sum_k \sum_{m,n} \left( \psi_1(m,n,k) - \psi_2(m,n,k) \right)^2,        (1.5.5)
where N is the total number of pixels in either sequence. For a color video, the MSE is computed separately for each color component.

Format            Video format   Raw data rate   Compressed rate   Compression       Application
Digital Betacam   BT.601 4:2:2   166 Mbps        80 Mbps           frame DCT         Professional
DVCPRO50          BT.601 4:2:2   166 Mbps        50 Mbps           frame/field DCT   Professional
DVCPRO25 (DV)     BT.601 4:1:1   124 Mbps        25 Mbps           frame/field DCT   Consumer
Instead of the MSE, the peak signal-to-noise ratio (PSNR), in decibels (dB), is more often used as a quality measure in video coding. The PSNR is defined as

    \mathrm{PSNR} = 10 \log_{10} \frac{\psi_{\max}^2}{\sigma_e^2},        (1.5.6)

where ψ_max is the peak (maximum) intensity value of the video signal. For the most common 8-bit video, ψ_max = 255. Note that for a fixed peak value, the PSNR is completely determined by the MSE. The PSNR is more commonly used than the MSE because people tend to associate the quality of an image with a certain range of PSNR. As a rule of thumb, for the luminance component, a PSNR over 40 dB typically indicates an excellent image (i.e., very close to the original); between 30 and 40 dB usually means a good image (i.e., the distortion is visible but acceptable); between 20 and 30 dB is quite poor; and a PSNR lower than 20 dB is unacceptable.
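The following sketch is a direct implementation of Eqs. (1.5.5) and (1.5.6), assuming two 8-bit gray-scale sequences stored as numpy arrays of identical shape; the test data are synthetic.

```python
import numpy as np

def mse(seq1, seq2):
    """Eq. (1.5.5): mean square error between two sequences of equal shape."""
    diff = seq1.astype(float) - seq2.astype(float)
    return float(np.mean(diff ** 2))

def psnr(seq1, seq2, peak=255.0):
    """Eq. (1.5.6): PSNR in dB, with peak = 255 for 8-bit video."""
    e2 = mse(seq1, seq2)
    if e2 == 0:
        return float("inf")           # identical sequences
    return 10.0 * np.log10(peak ** 2 / e2)

original = np.random.randint(0, 256, size=(10, 480, 720))            # 10 frames
noisy = np.clip(original + np.random.normal(0.0, 5.0, original.shape), 0, 255)
print(f"PSNR = {psnr(original, noisy):.1f} dB")                      # mid-30s dB
```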
1.5 For the following colors in the digital RGB coordinate, determine their values
in the YCbCr coordinate.
(a) (255,255, 255); (b) (0, 255, 0); (c) (255,255,0); (d) (0, 255, 255).
1.6 In Sec. 1.5.2, we say that the maximum value of Cr corresponds to red, whereas the minimum value yields cyan. Similarly, the maximum and minimum values of Cb correspond to blue and yellow, respectively. Verify these statements using the YCbCr to RGB coordinate transformation.
1.7 In Fig. 1.4, we show the spectrum of a typical raster signal. Why is the
spectrum of the video signal nearly periodic? What does the width of harmonic
lobes depend on?
1.8 What are the pros and cons of progressive vs. interlaced scans? For the same line number per frame, what is the relation between the maximum temporal frequency that a progressive raster can have and that of an interlaced raster which divides each frame into two fields? What about the relation between the maximum vertical frequencies?
1.9 In Sec. 1.4.3, we estimated the bandwidth of the NTSC signal based on its
scan parameters. Following the same approach, estimate the bandwidth of
the PAL and SECAM signals.
1.10 Describe the process for forming a composite color video signal. How should you select the color sub-carrier frequency and the audio sub-carrier frequency?
1.11 What are the pros and cons of using component vs. composite format?
1.12 Project: Using an oscilloscope, i) draw the waveform, and ii) measure the spectrum of a composite video signal output from a TV set or a camcorder.
1.13 Project: Digitize a composite video signal using an A/D converter, and use Matlab to determine its spectrum. Also perform filtering to separate the luminance, chrominance, and audio signals.
1.8 Bibliography
[1] K. B. Benson, editor. Television Engineering Handbook. McGraw Hill, 1992. Revised by J. C. Whitaker.
[2] J. F. Blinn. NTSC: nice technology, super color. IEEE Computer Graphics and Applications Magazine, pages 17-23, Mar. 1993.
[3] R. M. Boynton. Human Color Vision. Holt, Rinehart and Winston, New York, 1979.
[4] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Addison-Wesley, 2nd edition, 1992.
[5] B. Grob and C. E. Herndon. Basic Television and Video Systems. Glencoe
McGraw Hill, 6th edition, 1999.
[6] Y. Hashimoto, M. Yamamoto, and T. Asaida. Cameras and display systems. Proc. of IEEE, pages 1032-1043, July 1995.
[7] B. G. Haskell, A. Puri, and A. N. Netravali. Digital Video: An Introduction to
MPEG-2. Chapman & Hall, New York, 1997.
[8] ITU-R. BT.601-5: Studio encoding parameters of digital television for standard
4:3 and wide-screen 16:9 aspect ratios, 1998. (Formerly CCIR 601).
[9] A. N. Netravali and B. G. Haskell. Digital Pictures - Representation, Compres-
sion and Standards. Plenum Press, 2nd edition, 1995.
[10] D. H. Pritchard. US color television fundamentals. IEEE Trans. Consum. Electron., CE-23:467-478, 1977.
[11] M. Umemoto, Y. Eto, and T. Fukinuki. Digital video recording. Proc. of IEEE, pages 1044-1054, July 1995.
[12] J. Watkinson. The Art of Digital Video. Focal Press, Oxford, 2nd edition,
1994.
[13] G. Wyszecki and W. S. Stiles. Color Science. John Wiley, New York, 1967.
[14] T. Young. On the theory of light and colors. Philosophical Transactions of the Royal Society of London, 92:20-71, 1802.