Computer Graphics Notes
The generation of graphical images using a computer, as opposed to "image processing" which
manipulates images that are already in the computer. Creating a frame of "Toy Story" or
"Jurassic Park" is computer graphics; comparing an image of a face from an ATM camera
against a database of known criminal mugshots is image processing. Note that the line between
the two can sometimes be hazy, and a given task may require both sets of skills.
Mathematics + computer science + art = computer graphics: the rendering of images on a device.
Rendering - creating images from models
models - objects constructed from geometric primitives (points, lines, polygons) specified by
their vertices
Models exist in n-dimensional 'mathematically pure' space
o n typically 2 or 3
o n can be > 3 with scientific data
Common Uses
Movies, such as Toy Story, Who Framed Roger Rabbit, The Hollow Man, Shrek,
Monsters Inc, Jurassic Park, & The Perfect Storm
Advertisements
Football game annotations.
scientific/medical visualization
CAD/CAM
multimedia
computer interfaces (Windows, X, Aqua)
virtual reality
special effects
artistic expression
way cool video games
Software
Many application programs available to produce computer graphics, either as 2D images, 3D
models, or animated sequences (Corel Draw, Photoshop, AutoCAD, Maya, SoftImage, etc.)
We will deal with the lower level routines which do the work of converting models into a
displayable form on the display device.
Several 'common' graphics languages/libraries/APIs (Application Programming Interfaces):
GKS
DirectX
X
Postscript
OpenGL
We will be using OpenGL in this course on the linux machines in the CS Computer Graphics lab
to give a common grading platform. OpenGL is available for all the major platforms, and is
accelerated on almost all current graphics cards, but is not necessarily available on all of the
machines here in the university. If you want to work on your machine at home you should be
able to get OpenGL libraries for it for free. Otherwise there is Mesa. Mesa is virtually identical
to OpenGL, is free, and runs on a wider variety of platforms. For more information on Mesa you
can check out: http://www.mesa3d.org . The only thing that should need to change to compile
your code here is the Makefile.
Mesa, like OpenGL, is usually accessed through function calls from a C or C++ program (for
example from Code::Blocks, which is cross platform, or VC++).
You need to keep redrawing the image on the screen to keep it from fading away. Vector
displays redraw as quickly as possible given the number of objects on the screen; CRT based
raster displays redraw the image (or refresh the screen) at a fixed rate (e.g. 60 times per second)
no matter how complex the scene.
For those who spent their early youth in the arcades, vector games included:
Asteroids
Battlezone
Lunar Lander
Star Trek
Star Wars
Tempest
Initially these games were monochrome (white, green, or some other single colour), as in Asteroids;
then coloured filters were used to colour different parts of the screen using the same monochrome
electron gun, as in Battlezone; and finally, when RGB electron guns were cheap enough, true
multi-colour vector games were possible.
512 x 512 x 24bit (each pixel has 8 bits for red, 8 bits for green, and 8 bits for blue.)
Each pixel can be black->bright red (0-255) combined with black->bright green (0-255)
combined with black->bright blue (0-255)
786,432 bytes total
Each of the 512 x 512 pixels can be one of 16 million colours
note however, that a 512 x 512 display has only 262,144 pixels so only 262,144 colours
can be displayed simultaneously.
A 1280 x 1024 display (common workstation screen size) has only 1,310,720 pixels, far
fewer than the 16,000,000 possible colours. ( 3.75 MB for this configuration )
8 bit colour display using a colour map:
want benefits of 24 bit colour with only 8 bit display
512 x 512 x 8bit (each pixel is 8 bits deep so values 0-255 are possible.)
Each of the 512 x 512 pixels can be one of 256 index values into a video lookup table.
video lookup table has 256 24bit RGB values where each value has 8 bits for red, 8 bits
for green, and 8 bits for blue.
16 million colours are possible, but only 256 of them can be displayed at the same time.
Memory needs are 512x512x1byte plus 256x3bytes = 262,912 bytes, much less than the
786,432 needed without the colour map.
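A small sketch of how the colour-map lookup works (the array names are hypothetical):

unsigned char frameBuffer[512][512];   /* one 8-bit index per pixel       */
unsigned char colourMap[256][3];       /* 256 entries of 24-bit R, G, B   */

/* Return the 24-bit colour of pixel (x, y) by looking up its index. */
void pixelColour(int x, int y, unsigned char rgb[3])
{
    unsigned char index = frameBuffer[y][x];
    rgb[0] = colourMap[index][0];   /* red   */
    rgb[1] = colourMap[index][1];   /* green */
    rgb[2] = colourMap[index][2];   /* blue  */
}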
Here are 2 sample colourmaps each with 256 24bit RGB values:
depth of frame buffer (e.g. 8 bit) determines number of simultaneous colours possible
width of colour map (e.g. 24 bit) determines number of colours that can be chosen from
Size of various frame buffers:
screen size    Monochrome    8-bit    24-bit
512 x 512      32K           256K     768K
640 x 480      38K           300K     900K
1280 x 1024    160K          1.3M     3.8M
We can only swap buffers in between screen refreshes - when electron gun of the CRT is off and
moving from bottom of screen to the top. Note that different CRT monitors refresh at different
rates. Also note that as we move more towards LCD displays and away from CRT displays that
the hardware changes, but there are still timing constraints. In the case of LCDs the refresh rate
is only important when the images are changing. For more info on the two types of displays the
following link gives a nice description:
http://www.ncdot.org/network/developers/guide/display_tech.html
If you can clear the frame buffer and then draw the scene in less than 1/60th of a second you will get 60
frames per second.
If it takes longer than 1/60th of a second to draw the image (e.g. 1/59th of a second) then by the
time you are ready to swap frame buffers the electron gun is already redrawing the current frame
buffer again, so you must wait for it to finish before swapping buffers (e.g. 1/30th of a second
after the last swap buffers if the frame takes 1/59th of a second.)
If the screen refreshes 60 times a second you can have a new image on the screen:
1/60th of a second ... 60 frames per second if all frames take equal time
1/30th of a second ... 30 frames per second if all frames take equal time
1/20th of a second ... 20 frames per second if all frames take equal time
1/15th of a second ... 15 frames per second if all frames take equal time
1/12th of a second ... 12 frames per second if all frames take equal time
1/10th of a second ... 10 frames per second if all frames take equal time
One of the most obvious implications of this is that a very small change in the time it takes to
draw a scene into the frame buffer can have a major impact on the speed of the application.
Can your program run at 45 frames per second (fps)?
Yes: if 30 frames take 1/60th of a second each and the next 15 frames take 1/30th of a second each,
you will display 45 frames in one second.
For smooth motion you want at least 10 frames per second, and preferably more than 15 frames
per second. More is better.
Mathematics VS Engineering
We like to think about a scene as mathematical primitives in a world-space. This scene is then
rendered into the frame buffer. This allows a logical separation of the world from the view of
that world.
mathematically, points are infinitely small
mathematically, line segments are infinitely thin
these mathematical elements need to be converted into discrete pixels
as usual, there is an obvious easy way of doing these conversions, and then there is the way it is
actually done (for efficiency.)
To generate the outline of a triangular 2D polygon in world-space using OpenGL a programmer can
write code like the following:
glBegin(GL_LINE_LOOP);
glVertex2f(1.5, 3.0);
glVertex2f(4.0, 4.0);
glVertex2f(4.0, 1.0);
glEnd();
To generate a filled triangular 2D polygon in world-space using OpenGL a programmer can
write code like the following:
glBegin(GL_POLYGON);
glVertex2f(1.5, 3.0);
glVertex2f(4.0, 4.0);
glVertex2f(4.0, 1.0);
glEnd();
We will not limit ourselves to these 'easier' polygons.
Note that large complex objects are often reduced to a large number of triangles ( i.e.
triangulated ), for a number of reasons: three vertices are guaranteed to lie in a single plane,
triangles are always convex, and graphics hardware is optimized for them.
How are line segments and polygons in world-space converted into illuminated pixels on the
screen?
First these coordinates in world-space must be converted to coordinates in the viewport (ie pixel
coordinates in the frame buffer.) This may involve the conversion from a 2D world to a 2D
frame buffer (which we will study in a couple weeks), or the reduction from a 3D world to a 2D
frame buffer (which we will study a couple weeks later.)
Then these coordinates in the viewport must be used to draw lines and polygons made up of
individual pixels (rasterization.) This is the topic we will discuss now.
Most of the algorithms in Computer Graphics will follow the same pattern below. There is the
simple (braindead) algorithm that works, but is too slow. Then that algorithm is repeatedly
refined, making it more complicated to understand, but much faster for the computer to
implement.
Braindead Algorithm
given a line segment from leftmost (Xo,Yo) to rightmost (X1,Y1):
Y=mX+B
m = deltaY / deltaX = (Y1 - Yo) / ( X1 - Xo)
Assuming |m| <= 1 we start at the leftmost edge of the line, and move right one pixel-column at a
time illuminating the appropriate pixel in that column.
start = round(Xo)
stop = round(X1)
for (Xi = start; Xi <= stop; Xi++)
illuminate(Xi, round(m * Xi + B));
Why is this bad? Each iteration has:
comparison
fractional multiplication
2 additions
call to round()
Addition is OK, fractional multiplication is bad, and a function call is very bad as this is done A
LOT. So we need more complex algorithms which use simpler operations to increase the speed.
if m=1 then each row and each column have a pixel filled in
if 0 <= m < 1 then each column has a pixel and each row has >= 1, so we increment X each
iteration and compute Y.
if m > 1 then each row has a pixel and each column has >= 1, so we increment Y each iteration
and compute X.
illuminate X, round(Y)
add 1 to X (moving one pixel column to the right)
add m to Y
This guarantees there is one pixel illuminated in each column for the line
If |m| > 1 then we must reverse the roles of X and Y, incrementing Y by 1 and incrementing X by
1/m in each iteration.
Horizontal and vertical lines are subsets of the 2 cases given above.
need such a common, primitive function to be VERY fast.
features:
+ incremental
- rounding Y is a time consuming operation
- real variables have limited precision, which can cause a cumulative error in long line segments (e.g. 1/3)
- Y must be a floating point variable
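A minimal C sketch of this incremental approach, assuming integer endpoints, |m| <= 1, and a
hypothetical setPixel() routine that writes one pixel into the frame buffer:

#include <math.h>

void setPixel(int x, int y);   /* hypothetical: writes one pixel to the frame buffer */

/* Incremental (DDA) scan conversion of a line with |slope| <= 1. */
void lineDDA(int x0, int y0, int x1, int y1)
{
    double m = (double)(y1 - y0) / (double)(x1 - x0);
    double y = y0;
    int    x;

    for (x = x0; x <= x1; x++) {
        setPixel(x, (int)floor(y + 0.5));   /* round Y to the nearest pixel row        */
        y += m;                             /* add the slope instead of recomputing mX + B */
    }
}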
Assuming 0 <= m <= 1 we start at the leftmost edge of the line, and move right one pixel-column
at a time illuminating the pixel either in the current row (the pixel to the EAST) or the next
higher row (the pixel to the NORTHEAST.)
Y=mX+B
m = deltaY / deltaX = (Y1 - Yo) / ( X1 - Xo)
can rewrite the equation in the form: F(X,Y) = ax + by + c = 0
Y = (deltaY / deltaX) * X + B
0 = (deltaY / deltaX) * X - Y + B
0 = deltaY * X - deltaX * Y + deltaX * B
F(X,Y) = deltaY * X - deltaX * Y + deltaX * B
so for any point (Xi,Yi) we can plug Xi,Yi into the above equation and
F(Xi,Yi) = 0 -> (Xi,Yi) is on the line
F(Xi,Yi) > 0 -> (Xi,Yi) is below the line
F(Xi,Yi) < 0 -> (Xi,Yi) is above the line
Given that we have illuminated the pixel at (Xp,Yp) we will next either illuminate
the pixel to the EAST (Xp+ 1,Yp)
or the pixel to the NORTHEAST (Xp+ 1,Yp+ 1)
To decide we look at the Midpoint between the EAST and NORTHEAST pixel and see which
side of the midpoint the line falls on.
line above the midpoint -> illuminate the NORTHEAST pixel
line below the midpoint -> illuminate the EAST pixel
line exactly on the midpoint -> CHOOSE TO illuminate the EAST pixel
We create a decision variable called d
We plug the Midpoint into the above F() for the line and see where the midpoint falls in relation
to the line.
d = F(Xp+1,Yp+0.5)
d > 0 -> pick NORTHEAST pixel
d < 0 -> pick EAST pixel
d = 0 -> ***CHOOSE*** to pick EAST pixel
That tells us which pixel to illuminate next. Now we need to compute what the decision variable is
for the next iteration.
if we pick the EAST pixel the next midpoint is (Xp+2, Yp+0.5), so
dnew = F(Xp+2, Yp+0.5) = d + deltaY, that is we add deltaE = deltaY to d
if we pick the NORTHEAST pixel the next midpoint is (Xp+2, Yp+1.5), so
dnew = F(Xp+2, Yp+1.5) = d + deltaY - deltaX, that is we add deltaNE = deltaY - deltaX to d
The starting value is d = F(Xo+1, Yo+0.5) = deltaY - deltaX/2; since only the sign of d matters we
can multiply everything by 2 to stay with integers, starting with d = 2*deltaY - deltaX and using
deltaE = 2*deltaY and deltaNE = 2*(deltaY - deltaX).
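Putting this together, a C sketch of the midpoint line algorithm for integer endpoints with
0 <= slope <= 1 (setPixel() is again a hypothetical frame buffer write):

void setPixel(int x, int y);   /* hypothetical frame buffer write */

/* Midpoint line scan conversion for integer endpoints with 0 <= slope <= 1. */
void midpointLine(int x0, int y0, int x1, int y1)
{
    int dx      = x1 - x0;
    int dy      = y1 - y0;
    int d       = 2 * dy - dx;        /* initial decision value, scaled by 2 */
    int deltaE  = 2 * dy;             /* change in d when E is chosen        */
    int deltaNE = 2 * (dy - dx);      /* change in d when NE is chosen       */
    int x = x0;
    int y = y0;

    setPixel(x, y);
    while (x < x1) {
        if (d <= 0) {                 /* line is on or below the midpoint: choose E */
            d += deltaE;
        } else {                      /* line is above the midpoint: choose NE      */
            d += deltaNE;
            y++;
        }
        x++;
        setPixel(x, y);
    }
}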
Drawing Circles
The algorithm used to draw circles is very similar to the Midpoint Line algorithm.
8 way-symmetry - for a circle centered at (0,0) and given that point (x,y) is on the circle, the
following points are also on the circle:
(-x, y)
( x,-y)
(-x,-y)
( y, x)
(-y, x)
( y,-x)
(-y,-x)
So it is only necessary to compute the pixels for 1/8 of the circle and then simply illuminate the
appropriate pixels in the other 7/8.
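A small C sketch of that step for a circle centred at the origin (setPixel() is a hypothetical
frame buffer write):

void setPixel(int x, int y);   /* hypothetical frame buffer write */

/* For a circle centred at (0,0), illuminate (x,y) and its 7 symmetric points. */
void draw8Points(int x, int y)
{
    setPixel( x,  y);   setPixel(-x,  y);
    setPixel( x, -y);   setPixel(-x, -y);
    setPixel( y,  x);   setPixel(-y,  x);
    setPixel( y, -x);   setPixel(-y, -x);
}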
given a circle centered at (0,0) with radius R:
R^2 = X^2 + Y^2
F(X,Y) = X^2 + Y^2 - R^2
We choose to work in the 1/8 of the circle (45 degrees) from x=0 (y-axis) to x = y = R/sqrt(2) (45
degrees clockwise from the y-axis.)
so for any point (Xi,Yi) we can plug Xi,Yi into the above equation and
F(Xi,Yi) = 0 -> (Xi,Yi) is on the circle
F(Xi,Yi) > 0 -> (Xi,Yi) is outside the circle
F(Xi,Yi) < 0 -> (Xi,Yi) is inside the circle
Given that we have illuminated the pixel at (Xp,Yp) we will next either illuminate
h = d - 0.25
h is initialized as 1 - R (instead of 1.25 - R)
h is compared as h < -0.25 (instead of d < 0)
but since h starts off as an integer (assuming an integral R) and h is only incremented by integral
amounts (deltaE and deltaSE) we can ignore the -0.25 and compare h < 0.
void midpointCircle(int radius)
{
    int x = 0;
    int y = radius;
    int d = 1 - radius;

    draw8Points(x, y);
    while (y > x) {
        if (d < 0) {                  /* midpoint inside the circle: move E   */
            d += 2 * x + 3;
        } else {                      /* midpoint outside the circle: move SE */
            d += 2 * (x - y) + 5;
            y--;
        }
        x++;
        draw8Points(x, y);
    }
}
The full algorithm ( was ? ) given (in C) in the red book ( version ??? ) as program 3.4 on p.83.
The full algorithm is given (in C) in the white book as figure 3.16 on p.86.
This is still somewhat bad in that there is a multiplication to compute the new value of the
decision variable. The book shows a more complicated algorithm which does this multiplication
only once.
ellipses F(x,y) = b^2 X^2 + a^2 Y^2 - a^2 b^2 = 0 are handled in a similar manner, except that
1/4 of the ellipse must be dealt with at a time and that 1/4 must be broken into 2 parts based on
where the slope of the tangent to the ellipse is -1 (in first quadrant.)
Clipping
Since we have a separation between the models and the image created from those models, there
can be parts of the model that do not appear in the current view when they are rendered.
pixels outside the clip rectangle are clipped, and are not displayed.
can clip analytically - knowing where the clip rectangle is clipping can be done before scan-line
converting a graphics primitive (point, line, polygon) by altering the graphics primitive so the
new version lies entirely within the clip rectangle
can clip by brute force (scissoring) - scan convert the entire primitive but only display those
pixels within the clip rectangle by checking each pixel to see if it is visible.
as with scan conversion, this must be done as quickly as possible as it is a very common
operation.
Point Clipping
point (X,Y)
clipping rectangle with corners (Xmin,Ymin) (Xmax,Ymax)
point is within the clip rectangle if:
Xmin <= X <= Xmax
Ymin <= Y <= Ymax
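A trivial C version of this test (the function name is just illustrative):

/* Return 1 if point (x, y) lies inside the clip rectangle, 0 otherwise. */
int clipPoint(double x, double y,
              double xmin, double ymin, double xmax, double ymax)
{
    return (xmin <= x && x <= xmax &&
            ymin <= y && y <= ymax);
}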
(the sign bit being the most significant bit in the binary representation of the value. This bit is '1'
if the number is negative, and '0' if the number is positive.)
The frame buffer itself, in the center, has code 0000.
1001 | 1000 | 1010
-----+------+-----
0001 | 0000 | 0010
-----+------+-----
0101 | 0100 | 0110
The full algorithm ( was? ) given (in C) in the red book ( ??? Edition ) as program 3.7 on p.105.
The full algorithm is given (in C) in the white book as figure 3.41 on p.116.
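A small C sketch of computing the region code of a point, matching the bit layout in the grid
above (8 = above, 4 = below, 2 = right, 1 = left):

#define CODE_ABOVE 8   /* y > ymax */
#define CODE_BELOW 4   /* y < ymin */
#define CODE_RIGHT 2   /* x > xmax */
#define CODE_LEFT  1   /* x < xmin */

/* Compute the Cohen-Sutherland region code of (x, y) against the clip rectangle. */
int outcode(double x, double y,
            double xmin, double ymin, double xmax, double ymax)
{
    int code = 0;
    if (y > ymax) code |= CODE_ABOVE;
    if (y < ymin) code |= CODE_BELOW;
    if (x > xmax) code |= CODE_RIGHT;
    if (x < xmin) code |= CODE_LEFT;
    return code;
}

A line segment can then be trivially accepted when both endpoint codes are 0000, and trivially
rejected when the bitwise AND of the two endpoint codes is nonzero.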
The important thing to note is what coordinate system is being used by the package you are
working with, both for the creation of models and the displaying of them. Also note that if the
two packages use different coordinate systems, then the model(s) may need to be inverted in
some fashion when they are loaded in for viewing.
OpenGL generally uses a right-hand coordinate system.
Viewport Coordinate System - This coordinate system refers to a subset of the screen
space where the model window is to be displayed. Typically the viewport will occupy the
entire screen window, or even the entire screen, but it is also possible to set up multiple
smaller viewports within a single screen window.
Transformations in 2 Dimensions
One of the most common and important tasks in computer graphics is to transform the
coordinates ( position, orientation, and size ) of either objects within the graphical scene or the
camera that is viewing the scene. It is also frequently necessary to transform coordinates from
one coordinate system to another, ( e.g. world coordinates to viewpoint coordinates to screen
coordinates. ) All of these transformations can be efficiently and succinctly handled using some
simple matrix representations, which we will see can be particularly useful for combining
multiple transformations into a single composite transform matrix.
We will look first at simple translation, scaling, and rotation in 2D, then extend our results to 3D,
and finally see how multiple transformations can be easily combined into a composite transform.
Translation in 2D
point (X,Y) is to be translated by amount Dx and Dy to a new location (X',Y')
X' = Dx + X
Y' = Dy + Y
or P' = T + P where
P' = | X' |     T = | Dx |     P = | X |
     | Y' |         | Dy |         | Y |
Scaling in 2D
point (X,Y) is to be scaled by amount Sx and Sy to location (X',Y')
X' = Sx * X
Y' = Sy * Y
or P' = S * P where
P' = | X' |     S = | Sx  0  |     P = | X |
     | Y' |         | 0   Sy |         | Y |
scaling is performed about the origin (0,0) not about the center of the line/polygon/whatever
Scale > 1 enlarge the object and move it away from the origin.
Scale = 1 leave the object alone
Scale < 1 shrink the object and move it towards the origin.
uniform scaling: Sx = Sy
differential scaling Sx != Sy -> alters proportions
Rotation in 2D
point (X,Y) is to be rotated about the origin by angle theta to location (X',Y')
X' = X * cos(theta) - Y * sin(theta)
Y' = X * sin(theta) + Y *cos(theta)
note that this does involve sin and cos which are much more costly than addition or
multiplication
or P' = R * P where
P' = | X' |     R = | cos(theta)  -sin(theta) |     P = | X |
     | Y' |         | sin(theta)   cos(theta) |         | Y |
rotation is performed about the origin (0,0) not about the center of the line/polygon/whatever
The solution is to give each point a third coordinate (X, Y, W), which will allow translations to
be handled as a multiplication also.
( Note that we are not really moving into the third dimension yet. The third coordinate is being
added to the mathematics solely in order to combine the addition and multiplication of 2-D
coordinates. )
Two triples (X,Y,W) and (X',Y',W') represent the same point if they are multiples of each other
e.g. (1,2,3) and (2,4,6).
At least one of the three coordinates must be nonzero.
If W is 0 then the point is at infinity. This situation will rarely occur in practice in computer
graphics.
If W is nonzero we can divide the triple by W to get the cartesian coordinates of X and Y which
will be identical for triples representing the same point (X/W, Y/W, 1). This step can be
considered as mapping the point from 3-D space onto the plane W=1.
Conversely, if the 2-D cartesian coordinates of a point are known as ( X, Y ), then the
homogeneous coordinates can be given as ( X, Y, 1 )
So, how does this apply to translation, scaling, and rotation of 2D coordinates?
P' = | X' |     T(Dx,Dy) = | 1  0  Dx |     P = | X |
     | Y' |                | 0  1  Dy |         | Y |
     | 1  |                | 0  0  1  |         | 1 |
Composition of 2D Transformations
There are many situations in which the final transformation of a point is a combination of several
( often many ) individual transformations. For example, the position of the finger of a robot
might be a function of the rotation of the robot's hand, arm, and torso, as well as the position of
the robot on the railroad train and the position of the train in the world, and the rotation of the
planet around the sun, and . . .
Applying each transformation individually to all points in a model would take a lot of time.
Instead of applying several transformations matrices to each point we want to combine the
transformations to produce 1 matrix which can be applied to each point.
In the simplest case we want to apply the same type of transformation (translation, rotation,
scaling) more than once.
translation is additive as expected
scaling is multiplicative as expected
rotation is additive as expected
But what if we want to combine different types of transformations?
a very common reason for doing this is to rotate a polygon about an arbitrary point (e.g. the
center of the polygon) rather than around the origin.
Translate so that P1 is at the origin T(-Dx,-Dy)
Rotate R(theta)
Translate so that the point at the origin is at P1 T(Dx,Dy)
note the order of operations here is right to left:
P' = T(Dx,Dy) * R(theta) * T(-Dx,-Dy) * P
i.e.
P' = T(Dx,Dy) * { R(theta) * [ T(-Dx,-Dy) * P ] }
i.e.
P' = [ T(Dx,Dy) * R(theta) * T(-Dx,-Dy) ] * P
The matrix that results from these 3 steps can then be applied to all of the points in the polygon.
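A small C sketch of building that composite matrix and applying it, using 3x3 homogeneous
matrices (the function names are just illustrative):

#include <math.h>

/* Build M = T(dx,dy) * R(theta) * T(-dx,-dy), i.e. rotation about the point (dx, dy). */
void rotateAbout(double dx, double dy, double theta, double m[3][3])
{
    double c = cos(theta), s = sin(theta);

    /* the composite worked out by hand: rotate about the origin, then correct
       for having translated the centre of rotation to the origin */
    m[0][0] = c;  m[0][1] = -s;  m[0][2] = dx - c * dx + s * dy;
    m[1][0] = s;  m[1][1] =  c;  m[1][2] = dy - s * dx - c * dy;
    m[2][0] = 0;  m[2][1] =  0;  m[2][2] = 1;
}

/* Apply a 3x3 homogeneous transform to the 2D point (x, y). */
void transformPoint(double m[3][3], double *x, double *y)
{
    double nx = m[0][0] * *x + m[0][1] * *y + m[0][2];
    double ny = m[1][0] * *x + m[1][1] * *y + m[1][2];
    *x = nx;
    *y = ny;
}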
another common reason for doing this is to scale a polygon about an arbitrary point (e.g. the
center of the polygon) rather than around the origin.
Translate so that P1 is at the origin
Scale
Translate so that the point at the origin is at P1
How do we determine the 'center' of the polygon?
Window to Viewport
Transformations in 3D
3D Transformations
Similar to 2D transformations, which used 3x3 matrices, 3D transformations use 4x4 matrices and
homogeneous points (X, Y, Z, W).
3D Translation: point (X,Y,Z) is to be translated by amount Dx, Dy and Dz to location (X',Y',Z')
X' = Dx + X
Y' = Dy + Y
Z' = Dz + Z
or P' = T * P where
P' = | X' |     T(Dx,Dy,Dz) = | 1  0  0  Dx |     P = | X |
     | Y' |                   | 0  1  0  Dy |         | Y |
     | Z' |                   | 0  0  1  Dz |         | Z |
     | 1  |                   | 0  0  0  1  |         | 1 |
3D Scaling:
P' = | X' |     S(Sx,Sy,Sz) = | Sx  0   0   0 |     P = | X |
     | Y' |                   | 0   Sy  0   0 |         | Y |
     | Z' |                   | 0   0   Sz  0 |         | Z |
     | 1  |                   | 0   0   0   1 |         | 1 |
3D Rotation:
For 3D rotation we need to pick an axis to rotate about. The most common choices are the X-axis, the Y-axis, and the Z-axis.
P' = | X' |     P = | X |
     | Y' |         | Y |
     | Z' |         | Z |
     | 1  |         | 1 |

Rz(theta) = | cos(theta)  -sin(theta)  0  0 |
            | sin(theta)   cos(theta)  0  0 |
            |     0            0       1  0 |
            |     0            0       0  1 |

Rx(theta) = | 1      0            0       0 |
            | 0  cos(theta)  -sin(theta)  0 |
            | 0  sin(theta)   cos(theta)  0 |
            | 0      0            0       1 |

Ry(theta) = |  cos(theta)  0  sin(theta)  0 |
            |      0       1      0       0 |
            | -sin(theta)  0  cos(theta)  0 |
            |      0       0      0       1 |
2. An alternate axis of rotation can be chosen, other than the cartesian axes, and the point
rotated a given amount about this axis. For any given orientation change there exists a
single unique axis and rotation angle ( 0 <= theta <= 180 degrees ) that will yield the
desired rotation. This alternative approach is the basis for "quaternions", which will not
likely be discussed further in this course. ( Quaternions are used heavily in the
WorldToolKit package, which is no longer produced, and can be useful for interpolating
rotations between two oblique angles. )
Composition is handled in a similar way to the 2D case, multiplying the transformation matrices
from right to left.
The planet is rotated about its Y-axis by the percentage of year that has passed
turning its coordinate system in the process
The planet is translated 2 units on its now rotated X-axis to its position in orbit
The planet is rotated about its Y-axis by the percentage of day that has passed.
Since the planet is still at (0,0,0) by its coordinate system, it rotates about its
center.
The planet is drawn as a circle with radius 0.2
If you think about the single coordinate system then the operations on the matrix are done in the
REVERSE order from which they are called:
Initially the transformation matrix is the identity matrix
The sun is drawn as a circle with radius 1 at (0,0,0)
The planet is rotated about the Y-axis by the percentage of year that has passed.
Since the planet is no longer at the origin it rotates about the origin at a radius of
2.
if the matrix operations are not performed in reverse order then the year and day rotation
percentages get reversed.
Either way of thinking about it is equivalent, and regardless of how you think about it, that is
how OpenGL function calls must be issued.
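A hedged OpenGL sketch of this sun/planet example; drawCircle() is a hypothetical helper that
draws a circle of the given radius at the current origin, and yearPercent / dayPercent are
assumed to be fractions of a full revolution:

#include <GL/gl.h>

void drawCircle(float radius);   /* hypothetical: circle at the current origin */

/* Draw the sun at the origin and a planet orbiting 2 units away.
   Reading top to bottom this is the "turning coordinate system" view;
   OpenGL applies the matrix operations to the geometry in reverse order. */
void drawSolarSystem(float yearPercent, float dayPercent)
{
    glMatrixMode(GL_MODELVIEW);
    glPushMatrix();

    drawCircle(1.0f);                               /* the sun, radius 1      */

    glRotatef(yearPercent * 360.0f, 0.0f, 1.0f, 0.0f);   /* revolve about the sun  */
    glTranslatef(2.0f, 0.0f, 0.0f);                      /* move out to the orbit  */
    glRotatef(dayPercent * 360.0f, 0.0f, 1.0f, 0.0f);    /* spin about own centre  */
    drawCircle(0.2f);                                    /* the planet, radius 0.2 */

    glPopMatrix();
}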
Say you have three polygonal drawing functions available to you:
Object Hierarchies
Single polygons are generally too small to be of interest ... it's hard to think of a single polygon as
an 'object' unless you are writing Tetris(tm).
Even in a 2D world it is more convenient to think of objects which are a collection of polygons
forming a recognizable shape: a car, a house, or a laser-toting mutant monster from Nebula-9.
This object can then be moved/rotated/scaled as a single entity, at least at the conceptual level.
This is especially true in a 3D world where you need more than one (planar) polygon to create a
3D object.
Creating an object polygon by polygon is very slow when you want to create a very large
complex object. On the other hand it does give you much more control over the object than
creating it from higher-level primitives (cube, cone, sphere)
The following two examples are from Silicon Graphics' OpenInventor(tm) a library which sits on
top of OpenGL and allows higher-level objects to be created. The first shows a tree constructed
from a cube and a cone. The second shows the same tree but constructed from triangular
polygons.
pine tree built from objects
pine tree built from triangular polygons
note, triangular polygons are often used instead of 4-sided ones because the 3 vertices in the
triangle are guaranteed to form a plane, while the 4 vertices of a 4-sided polygon may not all fall
in the same plane which may cause problems later on.
Hierarchies are typically stored as Directed Acyclic Graphs, that is they are trees where a node
can have multiple parents as long as no cycle is generated.
Hierarchies store all information necessary to draw an object:
polygon information
material information
transformation information
Hierarchies are useful when you want to be able to manipulate an object on multiple levels:
With an arm you may want to rotate the entire arm about the shoulder, or just the lower
arm about the elbow, or just the wrist or just a finger. If you rotate the entire arm then
you want the rest of the arm parts to follow along as though they were joined like a real
arm - if you rotate the arm then the elbow should come along for the ride.
With a car the wheels should rotate but if the car body is moving then the wheels should
also be moving the same amount.
An object hierarchy gives a high degree of encapsulation.
An object hierarchy allows inheritance:
Attributes to be set once and then used by multiple sub-objects.
For example, at the top of the hierarchy the object could be set to draw only as a
wireframe, or with different lighting models, or different colours, or different texture
maps. This would then be inherited by the sub-objects and not have to be explicitly set for
each of them.
Fonts
Text is handled in one of two ways.
bitmaps:
rectangular array of 0s and 1s
00000000
00011000
00100100
01000010
01111110
01000010
01000010
01000010
polygons:
rescalable so that the definition can generate a 'smooth' character of any size
can be either 2D or 3D
can be rotated
treated like any other line/polygon to be displayed
slower
OpenGL provides minimal font support - only bitmapped fonts. Fortunately there are free 3D
fonts available such as the Hershey library.
General 3D Concepts
Taking 2D objects and mapping onto a 2D screen is pretty straightforward. The window is the
same plane as the 2D world. Now we are taking 3D objects and mapping them onto a 2D screen.
Here is where the advantage of separating the model world from its rendered image becomes
more obvious. The easiest way to think about converting 3D world into 2D image is the way we
do it in real life - with a camera.
Let's say we have an object in the real world (e.g. the Sears Tower.) The tower sits there in its
3Dness. You can move around the tower, on the ground, on the water, in the air, and take
pictures of it, converting it to a 2D image. Depending on where you put the camera and the
settings on the camera, and other factors such as light levels, you get different looking images.
In the computer we have a synthetic camera taking still or moving pictures of a synthetic
environment. While this synthetic camera gives you a much wider range of options than a real
camera, you will find it is VERY easy to take a picture of nothing at all.
Projections
projection is 'formed' on the view plane (planar geometric projection)
rays (projectors) projected from the center of projection pass through each point of the models
and intersect projection plane.
Since everything is synthetic, the projection plane can be in front of the models, inside the
models, or behind the models.
2 main types: perspective projection and parallel projection.
parallel :
o the center of projection is at infinity, so all the projectors are parallel
o relative proportions and parallel lines are preserved
o useful for engineering and architectural drawings where measurements matter
perspective :
o the center of projection is a finite distance from the view plane
o objects further from the center of projection appear smaller (foreshortening)
o parallel lines that are not parallel to the view plane converge to vanishing points
o produces images that look more like what a camera or the eye would see
which type of projection is used depends on the needs of the user - whether the goal is the
mathematically correct depiction of length and angles, or a realistic looking image of the object.
specifying a 3D view
Danger, watch out for acronyms being tossed out!
Need to know the type of projection
Need to know the clipping volume
in OpenGL there are the following functions:
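For reference, a hedged sketch of the sort of OpenGL/GLU calls that specify a projection and a
view; the specific numbers are only illustrative, and the GLU library is assumed to be available:

#include <GL/gl.h>
#include <GL/glu.h>

/* Typical projection setup: either a perspective or a parallel (orthographic)
   view volume, followed by a camera position for the modelview matrix. */
void setupCamera(int usePerspective)
{
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    if (usePerspective)
        gluPerspective(60.0, 4.0 / 3.0, 1.0, 100.0);   /* fov, aspect, near, far           */
    else
        glOrtho(-10.0, 10.0, -10.0, 10.0, 1.0, 100.0); /* left, right, bottom, top, near, far */

    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    gluLookAt(0.0, 0.0, 10.0,    /* eye position                  */
              0.0, 0.0,  0.0,    /* point the camera looks at     */
              0.0, 1.0,  0.0);   /* up direction (similar to VUP) */
}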
since the View Plane (n=0) is infinite (as it is a plane) we need to declare a region of that plane
to be our window.
Projection Reference Point (PRP) defines the Center of Projection and Direction of
Projection (DOP)
PRP given in VRC coordinate system (that is, its position is given relative to the VRP)
parallel projection - DOP is from PRP to CW, and all projectors are parallel.
perspective projection - Center of Projection is the PRP
n-point perspective
Perspective projections are categorized by the number of axes the view plane cuts (ie 1-point
perspective, 2-point perspective or 3-point perspective)
If the plane cuts the z axis only, then lines parallel to the z axis will meet at infinity; lines
parallel to the x or y axis will not meet at infinity because they are parallel to the view
plane. This is 1-point perspective.
If the plane cuts the x and z axes, then lines parallel to the x axis or the z axis will meet at
infinity; lines parallel to the y axis will not meet at infinity because they are parallel to
the view plane. This is 2-point perspective.
If the plane cuts the x, y, and z axis then lines parallel to the x, y, or z axis will meet at
infinity. This is 3-point perspective.
The n-point perspectives can work with any combination of the x, y, z axis.
View Volumes
The front plane's location is given by the front distance F relative to the VRP
The back plane's location is given by the back distance B relative to the VRP
In both cases the positive direction is in the direction of the VPN
Viewing volume has 6 clipping planes (left, right, top, bottom, near (hither), far (yon)) instead of
the 4 clipping lines we had in the 2D case, so clipping is a bit more complicated
perspective - viewing volume is a frustum of a 4-sided pyramid
parallel - viewing volume is a rectangular parallelepiped (ie a box)
Parallel Examples
In the red version of the Foley vanDam book see P.211-212.
In the white version of the Foley vanDam book see P.250-252.
Perspective Examples
Here are some examples using the same house as in the book (figure 6.18 in the red version of
the Foley vanDam book, figure 6.24 in the white version of the Foley vanDam book, but using
Andy Johnson's solution to the former HW3 as the viewing program:
The object is a house with vertices:
0, 0,30
16, 0,30
16,10,30
8,16,30
0,10,30
0, 0,54
16, 0,54
16,10,54
8,16,54
0,10,54
VRP = 0 0 54
VPN = 0 0  1
VUP = 0 1  0
PRP = 8 7 30
U: -1 to 17
V: -2 to 16
VRP = 16 0 54
or
VRP = 8 7 54
VPN = 0 0  1
VUP = 0 1  0
PRP = 0 0 30
U: -9 to 9
V: -9 to 9
VPN= 1 0 0
VUP= 0 1 0
PRP= 12 8 16
U: -1 to 25
V: -5 to 21
VRP = 16 0 54
VPN =  1 0  1
VUP =  0 1  0
PRP =  6 8 10
U: -22 to 22
V: -2 to 18
For other examples, in the red version of the Foley vanDam book see P.206-211.
In the white version of the Foley vanDam book see P.245-250.
This is easy: we just take (x, y, z) for every point and add a W=1 to get (x, y, z, 1)
As we did previously, we are going to use homogeneous coordinates to make it easy to compose
multiple matrices.
2. Normalizing the homogeneous coordinates
It is hard to clip against any view volume the user can come up with, so first we normalize the
homogeneous coordinates so we can clip against a known (easy) view volume (the canonical
view volumes).
That is, we are choosing a canonical view volume and we will manipulate the world so that the
parts of the world that are in the existing view volume are in the new canonical view volume.
This also allows easier projection into 2D.
The canonical parallel view volume is the box bounded by the planes:
x = -1, x = 1, y = -1, y = 1, z = 0, z = -1
The canonical perspective view volume is the pyramid bounded by the planes:
x = z, x = -z, y = z, y = -z, z = -zmin, z = -1
giving matrix R:

R = | r1x  r1y  r1z  0 |
    | r2x  r2y  r2z  0 |
    | r3x  r3y  r3z  0 |
    |  0    0    0   1 |

The direction of projection is then sheared to be parallel to the z-axis using the shear matrix
SHpar = SHxy(shx, shy):

SHpar = | 1  0  shx  0 |
        | 0  1  shy  0 |
        | 0  0   1   0 |
        | 0  0   0   1 |
where:
shx = - DOPx / DOPz
shy = - DOPy / DOPz
note that if this is an orthographic projection (rather than an oblique one) DOPx = DOPy=0 so
shx = shy = 0 and the shear matrix becomes the identity matrix.
*** note that the red book has 2 misprints in this section. Equation 6.18 should have dopx as the
first element in the DOP vector, and equation 6.22 should have dopx / dopz. The white book has
the correct versions of the formulas
Step 2.4 Translate and Scale the sheared volume into canonical view volume:
Tpar = T( -(umax+umin)/2, -(vmax+vmin)/2, -F)
Spar = S( 2/(umax-umin), 2/(vmax-vmin), 1/(F-B))
where F and B are the front and back distances for the view volume.
So, finally we have the following procedure for computing Npar:
Npar = Spar * Tpar * SHpar * R * T(-VRP)
Step 2.1 Translate VRP to the origin is the same as step 2.1 for Npar: T(-VRP)
Step 2.2 Rotate VRC so n-axis (VPN) is z-axis, u-axis is x-axis, and v-axis is y-axis is the same
as step 2.2 for Npar:
Step 2.3 Translate PRP to the origin which is T(-PRP)
Step 2.4 is the same as step 2.3 for Npar. The PRP is now at the origin but the CW may not be
on the Z axis. If it isn't then we need to shear to put the CW onto the Z axis.
Step 2.5 scales into the canonical view volume
up until step 2.3, the VRP was at the origin, afterwards it may not be. The new location of the
VRP is:
VRP' = SHpar * T(-PRP) * | 0 |
                         | 0 |
                         | 0 |
                         | 1 |
so
Sper = ( 2VRP'z / [(umax-umin)(VRP'z+B)] , 2VRP'z / [(vmax-vmin)(VRP'z+B)], -1 /
(VRP'z+B))
So, finally we have the following procedure for computing Nper:
Nper = Sper * SHpar * T(-PRP) * R * T(-VRP)
You should note that in both these cases, with Npar and Nper, the matrix depends only on the
camera parameters, so if the camera parameters do not change, these matrices do not need to be
recomputed. Conversely if there is constant change in the camera, these matrices will need to be
constantly recreated.
This is easy: we just take (x, y, z, W) and divide all the terms by W to get (x/W, y/W, z/W, 1) and
then we ignore the 1 to go back to 3D coordinates. We probably do not even need to divide by W
as it should still be 1.
4. Clip in 3D against the appropriate view volume
At this point we want to keep everything that is inside the canonical view volume, and clip away
everything that is outside the canonical view volume.
We can take the Cohen-Sutherland algorithm we used in 2D and extend it to 3D, except now
there are 6 bits instead of four.
For the parallel case the 6 bits are:
P273 in the white book shows the appropriate equations for doing this.
5. Back to homogeneous coordinates again
For the parallel case the projection plane is normal to the z-axis at z=0. For the perspective case
the projection plane is normal to the z axis at z=d. In this case we set z = -1.
In the parallel case, since there is no forced perspective, Xp = X and Yp = Y and Z is set to 0 to
do the projection onto the projection plane. Points that are further away in Z still retain the same
X and Y values - those values do not change with distance.
For the parallel case, Mort is:
Mort = | 1  0  0  0 |
       | 0  1  0  0 |
       | 0  0  0  0 |
       | 0  0  0  1 |
Multiplying the Mort matrix and the vector (X, Y, Z, 1) holding a given point gives the
resulting vector (X, Y, 0, 1).
In the perspective case where there is forced perspective, the projected X and Y values do
depend on the Z value. Objects that are further away should appear smaller than similar objects
that are closer.
For the perspective case, Mper is:

Mper = | 1  0   0  0 |
       | 0  1   0  0 |
       | 0  0   1  0 |
       | 0  0  -1  0 |
Multiplying the Mper matrix and the vector (X, Y, Z, 1) holding a given point, gives the
resulting vector (X, Y, Z, -Z).
7. Translate and Scale into device coordinates
All of the points that were in the original view volume are still within the following range:
-1 <= X <= 1
-1 <= Y <= 1
-1 <= Z <= 0
In the parallel case all of the points are at their correct X and Y locations on the Z=0 plane. In the
perspective case the points are scattered in the space, but each has a W value of -Z which will
map each (X, Y, Z) point to the appropriate (X', Y') place on the projection plane.
Now we will map these points into the viewport by moving to device coordinates.
This involves the following steps:
1. translate view volume so its corner (-1, -1, -1) is at the origin
2. scale to match the size of the 3D viewport (which keeps the corner at the origin)
3. translate the origin to the lower left hand corner of the viewport
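A small C sketch of those three steps for the x and y coordinates, assuming a viewport whose
lower-left corner is at (vxmin, vymin) and whose size in pixels is vwidth by vheight:

/* Map a point from the canonical range [-1, 1] into viewport pixel coordinates. */
void toViewport(double x, double y,
                double vxmin, double vymin, double vwidth, double vheight,
                double *sx, double *sy)
{
    *sx = (x + 1.0) * 0.5 * vwidth  + vxmin;   /* translate, scale, translate */
    *sy = (y + 1.0) * 0.5 * vheight + vymin;
}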
Again we just take (x, y, z, W) and divide all the terms by W to get (x/W, y/W, z/W, 1) and then
we ignore the 1 to go back to 3D coordinates.
In the parallel case dropping the W takes us back to 3D coordinates with Z=0, which really
means we now have 2D coordinates on the projection plane.
In the perspective projection case, dividing by W will affect the transformation of the points.
Dividing by W (which is -Z) takes us back to the 3D coordinates of (-X/Z, -Y/Z, -1, 1). Dropping
the W takes us to the 3D coordinates of (-X/Z, -Y/Z, -1) which positions all of the points onto the
Z=-1 projection plane which is what we wanted. Dropping the Z coordinate gives us the 2D
location on the Z=-1 plane.
In the second method, instead of projecting right away, the perspective canonical view volume is
first transformed into the parallel canonical view volume with the matrix M:

M = | 1  0       0               0         |
    | 0  1       0               0         |
    | 0  0  1/(1+Zmin)   -Zmin/(1+Zmin)    |
    | 0  0      -1               0         |
Now in both the parallel and perspective cases the clipping routine is the same.
Again we have 6 planes to clip against:
X = -W
X = W
Y = -W
Y = W
Z = -W
Z = 0
but since W can be positive or negative the region defined by those planes is different depending
on W.
2nd method step 4: Then we know all of the points that were in the original view volume are
now within the following range:
-1 <= X <= 1
-1 <= Y <= 1
-1 <= Z <= 0
Now we need to map these points into the viewport by moving to the device coordinates.
This involves the following steps:
1. translate view volume so its corner (-1, -1, -1) is at the origin
2. scale to match the size of the 3D viewport (which keeps the corner at the origin)
3. translate the origin to the lower left hand corner of the viewport
Illumination Models
No Lighting ( Emissive Lighting )
I = Ki
I: intensity
Ki: object's intrinsic intensity, 0.0 - 1.0 for each of R, G, and B
Ambient
Directional light - produced by a light source an infinite distance from the scene. All of
the light rays emanating from the light strike the polygons in the scene from a single
parallel direction, and with equal intensity everywhere.
o Sunlight is for all intents and purposes a directional light.
o Characterized by color, intensity, and direction.
Point light - a light that gives off equal amounts of light in all directions. Polygons, and
parts of polygons which are closer to the light appear brighter than those that are further
away. The angle at which light from a point light source hits an object is a function of the
positions of both the object and the light source. The intensity of the light source hitting
the object is a function of the distance between them. Different graphics programs may (
or may not ) allow the programmer to adjust the falloff function in different ways.
o A bare bulb hanging from a cord is essentially a point light.
o Characterized by color, intensity, location, and falloff function.
Spotlight - light that radiates light in a cone with more light in the center of the cone,
gradually tapering off towards the sides of the cone. The simplest spotlight would just be
a point light that is restricted to a certain angle around its primary axis of direction. Think of something like a flashlight or car headlight, as opposed to a bare bulb hanging on
a wire. More advanced spotlights have a falloff function making the light more intense at
the center of the cone and softer at the edges.
o Characterized as a point light, an axis of direction, a radius about that axis, and
possibly a radial falloff function.
Here is the same object (Christina Vasilakis' SoftImage Owl) under different lighting conditions:
bounding boxes of the components of the owl
self-luminous owl
theta is constant
L' is constant
Directional lights are faster than point lights because L' does not need to be recomputed for each
polygon.
It is rare that we have an object in the real world illuminated only by a single light. Even on a
dark night there is some ambient light. To make sure all sides of an object get at least a little
light we add some ambient light to the point or directional light:
I = Ia Ka + Ip Kd(N' * L')
Currently there is no distinction made between an object close to a point light and an object far
away from that light. Only the angle has been used so far. It helps to introduce a term based on
distance from the light. So we add in a light source attenuation factor: Fatt.
I = Ia Ka + Fatt Ip Kd(N' * L')
Coming up with an appropriate value for Fatt is rather tricky.
It can take a fair amount of time to balance all the various types of lights in a scene to give the
desired effect (just as it takes a fair amount of time in real life to set up proper lighting)
Specular Reflection
I = Ip cos^n(a) W(theta)
I: intensity
Ip: intensity of point light
n: specular-reflection exponent (higher is sharper falloff)
W: gives specular component of non-specular materials
So if we put all the lighting models depending on light together we add up their various
components to get:
I = Ia Ka + Ip Kd(N' * L') + Ip cos^n(a) W(theta)
As shown in the following figures:
[Images: Ambient + Diffuse + Specular = Result]
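A C sketch of evaluating that combined equation for a single light and a single colour channel,
treating W(theta) as a constant specular coefficient and leaving out Fatt; dot3 and all the
parameter names are just for illustration:

#include <math.h>

/* Dot product of two 3D vectors. */
static double dot3(const double a[3], const double b[3])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

/* One colour channel of I = Ia*Ka + Ip*Kd*(N.L) + Ip*W*(R.V)^n,
   with all vectors assumed to be normalized. */
double illuminate(double Ia, double Ka,          /* ambient light and coefficient  */
                  double Ip, double Kd,          /* point light and diffuse coeff  */
                  double W,  double n,           /* specular fraction and exponent */
                  const double N[3], const double L[3],
                  const double R[3], const double V[3])
{
    double diffuse  = dot3(N, L);
    double specular = dot3(R, V);

    if (diffuse  < 0.0) diffuse  = 0.0;    /* light is behind the surface       */
    if (specular < 0.0) specular = 0.0;    /* highlight faces away from viewer  */

    return Ia * Ka + Ip * Kd * diffuse + Ip * W * pow(specular, n);
}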
These properties describe how light is reflected off the surface of the polygon. a polygon with
diffuse color (1, 0, 0) reflects all of the red light it is hit with, and absorbs all of the blue and
green. If this red polygon is hit with a white light it will appear red. If it is hit with a blue light, or a
green light, or an aqua light it will appear black (as those lights have no red component.) If it is
hit with a yellow light or a purple light it will appear red (as the polygon will reflect the red
component of the light.)
The following pictures will help to illustrate this:
[Images: balls lit by white, red, green, blue, purple, yellow, and aqua lights]
One important thing to note about all of the above equations is that each object is dealt with
separately. That is, one object does not block light from reaching another object. The creation of
realistic shadows is quite expensive if done right, and is a currently active area of research in
computer graphics. ( Consider, for example, a plant with many leaves, each of which could cast
shadows on other leaves or on the other nearby objects, and then further consider the leaves
fluttering in the breeze and lit by diffuse or unusual light sources. )
Multiple Lights
With multiple lights the effects of all the lights are additive.
Shading Models
We often use polygons to simulate curved surfaces. In these cases we want the colours of the
polygons to flow smoothly into each other.
flat shading
Gouraud shading ( color interpolation shading )
Phong shading ( normal interpolation shading )
Flat Shading
Given a single normal to the plane the lighting equations and the material properties are used to
generate a single colour. The polygon is filled with that colour.
Here is another of the OpenGL samples with a flat shaded scene:
Gouraud Shading
Given a normal at each vertex of the polygon, the colour at each vertex is determined from the
lighting equations and the material properties. Linear interpolation of the colour values at each
vertex are used to generate colour values for each pixel on the edges. Linear interpolation across
each scan line is used to then fill in the colour of the polygon.
Here is another of the OpenGL samples with a smooth shaded scene:
Phong Shading
Where Gouraud shading uses normals at the vertices and then interpolates the resulting colours
across the polygon, Phong shading goes further and interpolates the normals. Linear interpolation
of the normal values at each vertex are used to generate normal values for the pixels on the
edges. Linear interpolation across each scan line is used to then generate normals at each pixel
across the scan line.
Whether we are interpolating normals or colours the procedure is the same:
To find the intensity of Ip, we need to know the intensity of Ia and Ib. To find the intensity of Ia
we need to know the intensity of I1 and I2. To find the intensity of Ib we need to know the
intensity of I1 and I3.
Ia = (Ys - Y2) / (Y1 - Y2) * I1 + (Y1 - Ys) / (Y1 - Y2) * I2
Ib = (Ys - Y3) / (Y1 - Y3) * I1 + (Y1 - Ys) / (Y1 - Y3) * I3
Ip = (Xb - Xp) / (Xb - Xa) * Ia + (Xp - Xa) / (Xb - Xa) * Ib
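A direct C transcription of those three interpolation equations (the parameter names follow the
figure the notes refer to):

/* Interpolate intensity down the two edges and then across the scan line,
   exactly as in the three equations above. */
double scanlineIntensity(double I1, double I2, double I3,
                         double Y1, double Y2, double Y3, double Ys,
                         double Xa, double Xb, double Xp)
{
    double Ia = (Ys - Y2) / (Y1 - Y2) * I1 + (Y1 - Ys) / (Y1 - Y2) * I2;
    double Ib = (Ys - Y3) / (Y1 - Y3) * I1 + (Y1 - Ys) / (Y1 - Y3) * I3;

    return (Xb - Xp) / (Xb - Xa) * Ia + (Xp - Xa) / (Xb - Xa) * Ib;
}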
Fog
We talked earlier about how atmospheric effects give us a sense of depth as particles in the air
make objects that are further away look less distinct than near objects.
Fog, or atmospheric attenuation, allows us to simulate this effect.
Fog is implemented by blending the calculated color of a pixel with a given background color (
usually grey or black ), in a mixing ratio that is somehow proportional to the distance between
the camera and the object. Objects that are farther away get a greater fraction of the background
color relative to the object's color, and hence "fade away" into the background. In this sense, fog
can ( sort of ) be thought of as a shading effect.
Fog is typically given a starting distance, an ending distance, and a colour. The fog begins at the
starting distance and all the colours slowly transition to the fog colour towards the ending
distance. At the ending distance all colours are the fog colour.
Here are those o-so-everpresent computer graphics teapots from the OpenGL samples:
To use fog in OpenGL you need to tell the computer a few things:
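For example, a typical linear fog setup might look like this (the distances and colour are only
illustrative):

#include <GL/gl.h>

/* Enable linear fog that starts 10 units away and is fully opaque at 50 units. */
void setupFog(void)
{
    GLfloat fogColour[4] = { 0.0f, 0.0f, 0.0f, 1.0f };   /* black, like the sky */

    glEnable(GL_FOG);
    glFogi(GL_FOG_MODE, GL_LINEAR);       /* blend linearly between start and end */
    glFogf(GL_FOG_START, 10.0f);          /* distance where fog begins            */
    glFogf(GL_FOG_END,   50.0f);          /* distance where fog colour takes over */
    glFogfv(GL_FOG_COLOR, fogColour);     /* the colour everything fades towards  */
}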
Here is a scene from battalion without fog. The monster sees a very sharp edge to the world
Here is the same scene with fog. The monster sees a much softer horizon as objects further away
tend towards the black colour of the sky
AntiAliasing
Lines and the edges of polygons still look jagged at this point. This is especially noticeable when
moving through a static scene looking at sharp edges.
This is known as aliasing, and is caused by the conversion from the mathematical edge to a
discrete set of pixels. We saw near the beginning of the course how to scan convert a line into the
frame buffer, but at that point we only dealt with placing the pixel or not placing the pixel. Now
we will deal with coverage.
The mathematical line will likely not exactly cover pixel boundaries - some pixels will be mostly
covered by the line (or edge), and others only slightly. Instead of making a yes/no decision we
can assign a value to this coverage (from say 0 to 1) for each pixel and then use these values to
blend the colour of the line (or edge) with the existing contents of the frame buffer.
In OpenGL you give hints, setting GL_POINT_SMOOTH_HINT, GL_LINE_SMOOTH_HINT, or
GL_POLYGON_SMOOTH_HINT to either GL_FASTEST or GL_NICEST, to tell OpenGL how hard to
try to smooth things out using the alpha (transparency) values.
You also need to enable or disable that smoothing
glEnable(GL_POINT_SMOOTH);
glEnable(GL_LINE_SMOOTH);
glEnable(GL_POLYGON_SMOOTH);
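Putting the hints and the enables together, a minimal sketch (the blending setup is an assumption
about how the coverage/alpha values are typically applied, not something spelled out above):

#include <GL/gl.h>

/* Ask OpenGL to antialias points and lines using coverage-based alpha blending. */
void setupSmoothing(void)
{
    glHint(GL_POINT_SMOOTH_HINT, GL_NICEST);
    glHint(GL_LINE_SMOOTH_HINT,  GL_NICEST);

    glEnable(GL_POINT_SMOOTH);
    glEnable(GL_LINE_SMOOTH);

    glEnable(GL_BLEND);                                /* blend by coverage (alpha) */
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
}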
In the beginning of the semester we dealt with simple wireframe drawings of the models. The
main reason for this is so that we did not have to deal with hidden surface removal. Now we
want to deal with more sophisticated images so we need to deal with which parts of the model
obscure other parts of the model.
The following sets of images show a wireframe version, a wireframe version with hidden line
removal, and a solid polygonal representation of the same object.
If we do not have a way of determining which surfaces are visible, then visibility depends on the
order in which the surfaces are drawn, with surfaces drawn later appearing in front of surfaces
drawn previously, as shown below:
Here the fins on the back are visible because they are drawn after the body, the shadow is drawn
on top of the monster because it is drawn last. Both legs are visible and the eyes just look really
weird.
General Principles
We do not want to draw surfaces that are hidden. If we can quickly compute which surfaces are
hidden, we can bypass them and draw only the surfaces that are visible.
For example, if we have a solid 6 sided cube, at most 3 of the 6 sides are visible at any
one time, so at least 3 of the sides do not even need to be drawn because they are the back
sides.
We also want to avoid having to draw the polygons in a particular order. We would like to tell
the graphics routines to draw all the polygons in whatever order we choose and let the graphics
routines determine which polygons are in front of which other polygons.
With the same cube as above we do not want to have to compute for ourselves which
order to draw the visible faces, and then tell the graphics routines to draw them in that
order.
The idea is to speed up the drawing, and give the programmer an easier time, by doing some
computation before drawing.
Unfortunately these computations can take a lot of time, so special purpose hardware is often
used to speed up the process.
Techniques
Two types of approaches:
object space algorithms do their work on the objects themselves before they are converted to
pixels in the frame buffer. The resolution of the display device is irrelevant here as this
calculation is done at the mathematical level of the objects
for each object a in the scene
determine which parts of object a are visible
(involves comparing the polygons in object a to other polygons in a
and to polygons in every other object in the scene)
image space algorithms do their work as the objects are being converted to pixels in the frame
buffer. The resolution of the display device is important here as this is done on a pixel by pixel
basis.
for each pixel in the frame buffer
determine which polygon is closest to the viewer at that pixel location
colour the pixel with the colour of that polygon at that location
As in our discussion of vector vs raster graphics earlier in the term the mathematical (object
space) algorithms tended to be used with the vector hardware whereas the pixel based (image
space) algorithms tended to be used with the raster hardware.
When we talked about 3D transformations we reached a point near the end when we converted
the 3D (or 4D with homogeneous coordinates) to 2D by ignoring the Z values. Now we will use
those Z values to determine which parts of which polygons (or lines) are in front of which parts
of other polygons.
There are different levels of checking that can be done.
Object
Polygon
part of a Polygon
There are also times when we may not want to cull out polygons that are behind other polygons.
If the frontmost polygon is transparent then we want to be able to 'see through' it to the polygons
that are behind it as shown below:
Coherence
We used the idea of coherence before in our line drawing algorithm. We want to exploit 'local
similarity' to reduce the amount of computation needed (this is how compression algorithms
work.)
Face - properties (such as colour, lighting) vary smoothly across a face (or polygon)
Depth - adjacent areas on a surface have similar depths
Frame - images at successive time intervals tend to be similar
Scan Line - adjacent scan lines tend to have similar spans of objects
Area - adjacent pixels tend to be covered by the same face
Object - if objects are separate from each other (ie they do not overlap) then we only need
to compare polygons of the same object, and not one object to another
Edge - edges only disappear when they go behind another edge or face
Implied Edge - line of intersection of 2 faces can be determined by the endpoints of the
intersection
Extents
Rather than dealing with a complex object, it is often easier to deal with a simpler version of the
object.
in 2D: a bounding box
in 3D: a bounding volume (though we still call it a bounding box)
We convert a complex object into a simpler outline, generally in the shape of a box and then we
can work with the boxes. Every part of the object is guaranteed to fall within the bounding box.
Checks can then be made on the bounding box to make quick decisions (ie does a ray pass
through the box.) For more detail, checks would then be made on the object in the box.
There are many ways to define the bounding box. The simplest way is to take the minimum and
maximum X, Y, and Z values to create a box. You can also have bounding boxes that rotate with
the object, bounding spheres, bounding cylinders, etc.
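For example, the simplest (axis-aligned) bounding box can be computed like this:

#include <float.h>

/* Compute the axis-aligned bounding box of n 3D points stored as (x,y,z) triples. */
void boundingBox(double pts[][3], int n, double min[3], double max[3])
{
    int i, k;

    for (k = 0; k < 3; k++) {
        min[k] =  DBL_MAX;
        max[k] = -DBL_MAX;
    }
    for (i = 0; i < n; i++)
        for (k = 0; k < 3; k++) {
            if (pts[i][k] < min[k]) min[k] = pts[i][k];
            if (pts[i][k] > max[k]) max[k] = pts[i][k];
        }
}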
Back-Face Culling
Back-face culling (an object space algorithm) works on 'solid' objects which you are looking at
from the outside. That is, the polygons of the surface of the object completely enclose the object.
Every planar polygon has a surface normal, that is, a vector that is normal to the surface of the
polygon. Actually every planar polygon has two normals.
Given that this polygon is part of a 'solid' object we are interested in the normal that points OUT,
rather than the normal that points in.
OpenGL specifies that all polygons be drawn such that the vertices are given in counterclockwise
order as you look at the visible side of polygon in order to generate the 'correct' normal.
Any polygons whose normal points away from the viewer is a 'back-facing' polygon and does not
need to be further investigated.
To find back facing polygons the dot product of the surface normal of each polygon is taken with
a vector from the center of projection to any point on the polygon.
The dot product is then used to determine what direction the polygon is facing:
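A small sketch of that test: n is the polygon's outward normal, v is a vector from the center of
projection to a point on the polygon, and a positive dot product means the polygon faces away
from the viewer.

/* Return 1 if the polygon is back-facing and can be culled, 0 otherwise.
   n is the outward surface normal, v points from the viewer to the polygon. */
int isBackFacing(const double n[3], const double v[3])
{
    double d = n[0]*v[0] + n[1]*v[1] + n[2]*v[2];
    return d > 0.0;   /* normal points away from the viewer */
}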
Back-face culling can very quickly remove unnecessary polygons. Unfortunately there are often
times when back-face culling can not be used. For example if you wish to make an open-topped
box - the inside and the outside of the box both need to be visible, so either two sets of polygons
must be generated, one set facing out and another facing in, or back-face culling must be turned
off to draw that object.
in OpenGL back-face culling is turned on using:
glCullFace(GL_BACK);
glEnable(GL_CULL_FACE);
Depth Buffer
Early on we talked about the frame buffer which holds the colour for each pixel to be displayed.
This buffer could contain a variable number of bytes for each pixel depending on whether it was
a greyscale, RGB, or colour-indexed frame buffer. All of the elements of the frame buffer are
initially set to be the background colour. As lines and polygons are drawn the colour is set to be
the colour of the line or polygon at that point.
We now introduce another buffer which is the same size as the frame buffer but contains depth
information instead of colour information.
z-buffering (an image-space algorithm) is another buffer which maintains the depth for each
pixel. All of the elements of the z-buffer are initially set to be 'very far away.' Whenever a pixel
colour is to be changed, the depth of this new colour is compared to the current depth in the z-buffer. If this colour is 'closer' than the previous colour the pixel is given the new colour, and the
z-buffer entry for that pixel is updated as well. Otherwise the pixel retains the old colour, and the
z-buffer retains its old value.
Here is a pseudo-code algorithm
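A minimal sketch of the standard z-buffer test (zBuffer and frameBuffer are hypothetical arrays
indexed by pixel, and smaller z is taken to mean closer):

/* Standard z-buffer update for one candidate pixel.
   zBuffer[] starts out filled with a 'very far away' value. */
void writePixel(int x, int y, double z, unsigned int colour,
                double *zBuffer, unsigned int *frameBuffer, int width)
{
    int index = y * width + x;

    if (z < zBuffer[index]) {          /* new point is closer than what is stored */
        zBuffer[index]     = z;
        frameBuffer[index] = colour;
    }                                  /* otherwise keep the old colour and depth */
}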
This is very nice since the order of drawing polygons does not matter, the algorithm will always
display the colour of the closest point.
The biggest problem with the z-buffer is its finite precision, which is why it is important to set
the near and far clipping planes to be as close together as possible to increase the resolution of
the z-buffer within that range. Otherwise even though one polygon may mathematically be 'in
front' of another that difference may disappear due to roundoff error.
These days with memory getting cheaper it is easy to implement a software z-buffer and
hardware z-buffering is becoming more common.
In OpenGL the z-buffer and frame buffer are cleared using:
glClear(GL_DEPTH_BUFFER_BIT | GL_COLOR_BUFFER_BIT);
The depth-buffer is especially useful when it is difficult to order the polygons in the scene based
on their depth, such as in the case shown below:
Warnock's Algorithm
Warnock's algorithm (an image-space algorithm) recursively subdivides the display area. For each
area, every polygon is classified as surrounding the area, intersecting it, contained in it, or disjoint
from it. The area can be filled immediately if one of the following cases holds:
1. all the polygons are disjoint from the area (fill with the background colour)
2. there is only one intersecting or only one contained polygon (fill with the background colour,
then scan-convert the part of the polygon inside the area)
3. there is a single surrounding polygon and no intersecting or contained polygons (fill with that
polygon's colour)
4. a surrounding polygon is in front of all other intersecting, contained, or surrounding polygons
(fill with that surrounding polygon's colour)
Otherwise the area is subdivided into four quadrants and the process is repeated, down to the
pixel level if necessary.
Below is an example scanned out of the text, where the numbers refer to the numbered cases
listed above:
Here is a place where the use of bounding boxes can speed up the process. Given that the
bounding box is always at least as large as the polygon or object, checks for contained and
disjoint polygons can be made using the bounding boxes, while checks for intersecting and
surrounding cannot.
Depth Sort
The depth sort (painter's) algorithm sorts the polygons by their maximum z value and paints them
from back to front, so that nearer polygons cover farther ones. Before the polygon P currently
farthest from the viewer is drawn, it is tested against every polygon Q whose z extent overlaps
P's, to make sure that P really can be written before Q. Five tests are applied in order of
increasing cost; as soon as one succeeds, P does not obscure Q and can be written before Q. If at
least one comparison is true for each of the Qs then P is drawn and the next polygon from the
back is chosen as the new P.
1. do the x extents of P and Q fail to overlap?
2. do the y extents of P and Q fail to overlap?
3. is P entirely on the opposite side of Q's plane from the viewpoint?
4. is Q entirely on the same side of P's plane as the viewpoint?
5. do the projections of P and Q onto the screen fail to overlap?
If all 5 tests fail we quickly check to see if switching P and Q will work. Tests 1, 2, and 5 do not
differentiate between P and Q, but 3 and 4 do. So we rewrite 3 and 4:
3. is Q entirely on the opposite side of P's plane from the viewpoint?
4. is P entirely on the same side of Q's plane as the viewpoint?
If either of these two tests succeeds then Q and P are swapped and the new P (formerly Q) is
tested against all the polygons whose z extent overlaps its z extent.
If these two tests still do not work then either P or Q is split into 2 polygons using the plane of
the other. These 2 smaller polygons are then put into their proper places in the sorted list and the
algorithm continues.
Beware of the dreaded infinite loop: a polygon that has been moved to the end of the list is
marked, and if it would be swapped a second time it is split instead.
BSP Trees
Another popular way of dealing with these problems (especially in games) are Binary Space
Partition trees. It is a depth sort algorithm with a large amount of preprocessing to create a data
structure to hold the polygons.
First generate a 3D BSP tree for all of the polygons in the scene.
Then display the polygons according to their order in the tree:
1. polygons behind the current node
2. the current node
3. polygons in front of the current node
Each node in the tree is a polygon. Extending that polygon generates a plane. That plane cuts
space into 2 parts. We use the front-facing normal of the polygon to define the half of space
that is 'in front' of the polygon. Each node has two children: the front child (the polygons in
front of this node) and the back child (the polygons behind this node).
In doing this we may need to split some polygons into two.
Then when we are drawing the polygons we first check whether the viewpoint is in front of or
behind the root node's plane. Based on this we know which child to deal with first: we first draw
(recursively) the subtree on the far side of that plane from the viewpoint, then the root node's
polygon, then the subtree on the near side, and so on recursively until we have drawn all the
polygons.
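A sketch of that recursive back-to-front traversal in C (the Polygon contents, draw_polygon(), and the exact node layout are illustrative assumptions, not a fixed API):

    typedef struct { double x, y, z; } Vec3;

    typedef struct {
        double a, b, c, d;            /* plane equation ax + by + cz + d = 0 */
        /* ... vertices, colour, etc. ... */
    } Polygon;

    typedef struct BSPNode {
        Polygon        *poly;         /* polygon stored at this node         */
        struct BSPNode *front;        /* polygons in front of this plane     */
        struct BSPNode *back;         /* polygons behind this plane          */
    } BSPNode;

    /* > 0 when 'eye' is on the front side of the polygon's plane */
    double side_of(const Polygon *p, Vec3 eye)
    {
        return p->a * eye.x + p->b * eye.y + p->c * eye.z + p->d;
    }

    void bsp_draw(const BSPNode *node, Vec3 eye)
    {
        if (node == NULL)
            return;
        if (side_of(node->poly, eye) > 0.0) {   /* viewer in front of the plane */
            bsp_draw(node->back, eye);          /* far subtree first            */
            draw_polygon(node->poly);
            bsp_draw(node->front, eye);         /* near subtree last            */
        } else {                                /* viewer behind the plane      */
            bsp_draw(node->front, eye);
            draw_polygon(node->poly);
            bsp_draw(node->back, eye);
        }
    }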
Compared to depth sort it takes more time to set up, but less time to traverse, since there are
no special cases.
If the position or orientation of the polygons change then parts of the tree will need to be
recomputed.
Here is an example, originally by Nicolas Holzschuch, showing the construction and use of a BSP
tree for 6 polygons.
Scan-Line Algorithm
In the scan-line algorithm for filling a single polygon we had a simple 0/1 variable to deal with
being in or out of the polygon. Since there are multiple polygons here, we have a Polygon Table.
The Polygon Table contains, for each polygon: the coefficients of its plane equation, its shading
or colour information, and an in-out flag (initialized to false).
Again the edges are moved from the global edge table to the active edge table when the scan line
corresponding to the bottom of the edge is reached.
Moving across a scan line the flag for a polygon is flipped when an edge of that polygon is
crossed.
If no flags are true then nothing is drawn
If one flag is true then the colour of that polygon is used
If more than one flag is true then the frontmost polygon must be determined.
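A rough sketch of the inner loop for one scan line, assuming the active edges are already sorted by x (ActiveEdge, visible_polygon(), put_pixel(), polygon_colour(), and BACKGROUND are illustrative placeholders):

    typedef struct {
        int    poly_id;   /* which polygon this edge belongs to             */
        double x;         /* x where the edge crosses the current scan line */
    } ActiveEdge;

    void fill_scan_line(int y, const ActiveEdge *aet, int nedges, int in[])
    {
        for (int i = 0; i + 1 < nedges; i++) {
            in[aet[i].poly_id] = !in[aet[i].poly_id];      /* crossed an edge */

            for (int x = (int)aet[i].x; x < (int)aet[i + 1].x; x++) {
                /* visible_polygon() returns -1 if no flag is set, the single
                 * flagged polygon if one is set, or the frontmost polygon at
                 * (x, y) if several flags are set.                           */
                int id = visible_polygon(in, x, y);
                put_pixel(x, y, id < 0 ? BACKGROUND : polygon_colour(id));
            }
        }
    }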
Below is an example from the textbook (figure red:13.11, white:15.34)
Scan line    AET contents     Comments
---------    ------------     --------
alpha        AB AC            one polygon
beta         AB AC FD FE      two separate polygons
gamma        AB DE CB FE      two overlapping polygons
gamma+1      AB DE CB FE      two overlapping polygons
gamma+2      AB CB DE FE      two separate polygons
Ray Casting
In ray casting, a ray is fired from the center of projection (COP) through the center of each pixel.
So, given a ray (vector) and an object, the key idea is computing whether the ray intersects the
object and, if so, where.
The ray is represented by the vector from (Xo, Yo, Zo) at the COP to (X1, Y1, Z1) at the center
of the pixel. We can parameterize this vector by introducing t:
X = Xo + t(X1 - Xo)
Y = Yo + t(Y1 - Yo)
Z = Zo + t(Z1 - Zo)
or
X = Xo + t(deltaX)
Y = Yo + t(deltaY)
Z = Zo + t(deltaZ)
t equal to 0 represents the COP, t equal to 1 represents the pixel
t < 0 represents points behind the COP
t > 1 represents points on the other side of the view plane from the COP
We want to find out what the value of t is where the ray intersects the object. This way we can
take the smallest value of t that is in front of the COP as defining the location of the nearest
object along that vector.
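As a concrete illustration (the sphere is just an example object, not something specified in these notes), intersecting this parameterized ray with a sphere of centre C and radius r means substituting the three equations above into (X - Cx)^2 + (Y - Cy)^2 + (Z - Cz)^2 = r^2 and solving the resulting quadratic for t:

    #include <math.h>

    typedef struct { double x, y, z; } Vec3;

    /* Returns 1 and sets *t to the nearest intersection in front of the COP,
     * or returns 0 if the ray P(t) = origin + t*delta misses the sphere. */
    int ray_sphere(Vec3 origin, Vec3 delta, Vec3 centre, double radius, double *t)
    {
        Vec3 oc = { origin.x - centre.x, origin.y - centre.y, origin.z - centre.z };
        double a = delta.x*delta.x + delta.y*delta.y + delta.z*delta.z;
        double b = 2.0 * (oc.x*delta.x + oc.y*delta.y + oc.z*delta.z);
        double c = oc.x*oc.x + oc.y*oc.y + oc.z*oc.z - radius*radius;
        double disc = b*b - 4.0*a*c;

        if (disc < 0.0)
            return 0;                                /* ray misses the sphere      */

        double t0 = (-b - sqrt(disc)) / (2.0 * a);   /* nearer of the two roots    */
        double t1 = (-b + sqrt(disc)) / (2.0 * a);
        *t = (t0 > 0.0) ? t0 : t1;                   /* ignore hits behind the COP */
        return *t > 0.0;
    }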
The problem is that this can take a lot of time, especially if there are lots of pixels and lots of
objects.
The rendered frames of 'Toy Story', for example, took between 45 minutes and 20 hours each.
So minimizing the number of comparisons is critical.
- Bounding boxes can be used to perform initial checks on complicated objects
- Hierarchies of bounding boxes can be used, where a successful intersection with a
bounding box then leads to tests against several smaller bounding boxes within the larger
bounding box.
- The space of the scene can be partitioned. These partitions are then treated like buckets
in a hash table, and objects within each partition are assigned to that partition. Checks can
then be made against this constant number of partitions first before going on to checking
the objects themselves. These partitions could be equal sized volumes, or contain equal
numbers of polygons.
Full blown ray tracing takes this one step further by reflecting rays off of shiny objects, to see
what the ray hits next, and then reflecting the ray off of that, and so on, until a limiting number
of reflections have been encountered. For transparent or semi-transparent objects, rays are passed
through the object, taking into account any deflection or filtering that may take place ( e.g.
through a colored glass bottle or chess piece ), again proceeding until some limit is met. Then the
contributions of all the reflections and transmissions are added together to determine the final
color value for each pixel. The resulting images can be incredibly beautiful and realistic, and
usually take a LONG time to compute.
Comparison
From table 13.1 in the red book (15.3 in the white book), here is the relative performance of the
various algorithms, where smaller is better and the depth sort of one hundred polygons is set to 1:
                  # of polygonal faces in the scene
Algorithm         100      2500     60000
----------        ----     ----     -----
Depth Sort        1        10       507
z-buffer          54       54       54
scan line         5        21       100
Warnock           11       64       307
This table is somewhat bogus as z-buffer performance degrades as the number of polygonal faces
increases.
To get a better sense of this, here are the number of polygons in the following models:
MultiMedia
Introduction
Multimedia has become an integral part of almost any presentation. It has found a variety of
applications, ranging from entertainment to education. The growth of the Internet has also
increased the demand for multimedia content.
Definition
Multimedia is the media that uses multiple forms of information content and information
processing (e.g. text, audio, graphics, animation, video, interactivity) to inform or entertain the
user. Multimedia also refers to the use of electronic media to store and experience multimedia
content. Multimedia is similar to traditional mixed media in fine art, but with a broader scope.
The term "rich media" is synonymous for interactive multimedia.
Features of Multimedia
Multimedia presentations may be viewed in person on stage, projected, transmitted, or played
locally with a media player. A broadcast may be a live or recorded multimedia presentation.
Broadcasts and recordings can use either analog or digital electronic media technology. Digital
online multimedia may be downloaded or streamed. Streaming multimedia may be live or
on-demand.
Multimedia games and simulations may be used in a physical environment with special effects,
with multiple users in an online network, or locally with an offline computer, game system, or
simulator.
Applications of Multimedia
Multimedia finds its application in various areas including, but not limited to, advertisements,
art, education, entertainment, engineering, medicine, mathematics, business, scientific research
and spatial, temporal applications.
A few application areas of multimedia are listed below:
Creative industries
Creative industries use multimedia for a variety of purposes ranging from fine arts, to
entertainment, to commercial art, to journalism, to media and software services provided for any
of the industries listed below. An individual multimedia designer may cover the spectrum
throughout their career. Requests for their skills range from the technical, to the analytical, to the
creative.
Commercial
Much of the electronic old and new media utilized by commercial artists is multimedia. Exciting
presentations are used to grab and keep attention in advertising. Industrial, business-to-business,
and interoffice communications are often developed by creative services firms as advanced
multimedia presentations, going beyond simple slide shows, to sell ideas or liven up training.
Commercial multimedia developers may be hired to design for governmental services and
nonprofit services applications as well.
Text in Multimedia
Words and symbols in any form, spoken or written, are the most common system of
communication. They deliver the most widely understood meaning to the greatest number of
people.
Most academic text, such as journals and e-magazines, is available in a Web-browser-readable
form.
Sound in Multimedia
Both the Macintosh and Windows provide a set of system sounds that are available as soon as the
operating system is installed. On the Macintosh you can choose one of several sounds for the
system alert. In Windows, system sounds are WAV files and they reside in the windows\Media
subdirectory.
There are still more choices of audio if Microsoft Office is installed. Windows makes use of
WAV files as the default file format for audio and Macintosh systems use SND as default file
format for audio.
Digital Audio
Digital audio is created when a sound wave is converted into numbers, a process referred to as
digitizing. It is possible to digitize sound from a microphone, a synthesizer, existing tape
recordings, live radio and television broadcasts, and popular CDs. You can digitize sound from
a natural source or from a prerecorded source.
Digital Image
A digital image is represented by a matrix of numeric values each representing a quantized
intensity value. When I is a two-dimensional matrix, then I(r,c) is the intensity value at the
position corresponding to row r and column c of the matrix.
The points at which an image is sampled are known as picture elements, commonly abbreviated
as pixels. The pixel values of intensity images are called gray scale levels (we encode here the
color of the image). The intensity at each pixel is represented by an integer and is determined
from the continuous image by averaging over a small neighborhood around the pixel location. If
there are just two intensity values, for example, black, and white, they are represented by the
numbers 0 and 1; such images are called binary-valued images. If 8-bit integers are used to store
each pixel value, the gray levels range from 0 (black) to 255 (white).
Bitmaps
A bitmap is a simple information matrix describing the individual dots that are the smallest
elements of resolution on a computer screen or other display or printing device. A one-dimensional
matrix is required for monochrome (black and white); greater depth (more bits of
information) is required to describe the more than 16 million colors the picture elements may have,
as illustrated in the following figure. The state of all the pixels on a computer screen makes up the
image seen by the viewer, whether in combinations of black and white or colored pixels in a line
of text, a photograph-like picture, or a simple background pattern.
Clip Art
A clip art collection may contain a random assortment of images, or it may contain a series of
graphics, photographs, sound, and video related to a single topic. For example, Corel,
Micrografx, and Fractal Design bundle extensive clip art collections with their image-editing
software.
Principles of Animation
Animation is the rapid display of a sequence of images of 2-D artwork or model positions in
order to create an illusion of movement. It is an optical illusion of motion due to the phenomenon
of persistence of vision, and can be created and demonstrated in a number of ways. The most
common method of presenting animation is as a motion picture or video program, although
several other forms of presenting animation also exist.
Animation is possible because of a biological phenomenon known as persistence of vision and a
psychological phenomenon called phi. An object seen by the human eye remains chemically
mapped on the eye's retina for a brief time after viewing. Combined with the human mind's need
to conceptually complete a perceived action, this makes it possible for a series of images that are
changed very slightly and very rapidly, one after the other, to seemingly blend together into a
visual illusion of movement. The following shows a few cels, or frames, of a rotating logo. When
the images are progressively and rapidly changed, the arrow of the compass is perceived to be
spinning. Television video builds 30 entire frames or pictures every second; the speed with which
each frame is replaced by the next one makes the images appear to blend smoothly into
movement. To make an object travel across the screen while it changes its shape, just change the
shape and also move or translate it a few pixels for each frame.
Software          File formats
--------          ------------
Director          *.dir, *.dcr
AnimationPro      *.fli, *.flc
3D Studio Max     *.max
                  *.pics
CompuServe        *.gif
Flash             *.fla, *.swf
Following is a list of a few software packages used for computerized animation:
3D Studio Max
Flash
AnimationPro
Video
Analog versus Digital
Digital video has supplanted analog video as the method of choice for making video for
multimedia use. While broadcast stations and professional production and postproduction houses
remain greatly invested in analog video hardware (according to Sony, there are more than
350,000 Betacam SP devices in use today), digital video gear produces excellent finished
products at a fraction of the cost of analog. A digital camcorder directly connected to a computer
workstation eliminates the image-degrading analog-to-digital conversion step typically
performed by expensive video capture cards, and brings the power of nonlinear video editing and
production to everyday users.
3.
4.
5.
6.
7.
allow user to access or change the content, structure and programming of the
project. If you are going to distribute your project widely, you should distribute it
in the run-time version.
8. Cross-Platform features:- It is also increasingly important to use tools that make
transfer across platforms easy. For many developers, the Macintosh remains the
multimedia authoring platform of choice, but 80% of that developer's target
market may be on Windows platforms. If you develop on a Macintosh, look for tools
that provide a compatible authoring system for Windows or offer a run-time
player for the other platform.
9. Internet Playability:- Because the Web has become a significant delivery medium
for multimedia, authoring systems typically provide a means to convert their
output so that it can be delivered within the context of HTML or DHTML, either
with a special plug-in or by embedding Java, JavaScript or other code structures in the
HTML document.
JPEG Compression
In JPEG compression the image is first divided into 8 by 8 blocks of pixels. Since each block is
processed without reference to the others, we'll concentrate on a single block. In particular, we'll
focus on the block highlighted below.
Here is the same block blown up so that the individual pixels are more apparent. Notice that
there is not tremendous variation over the 8 by 8 block (though other blocks may have more).
Remember that the goal of data compression is to represent the data in a way that reveals some
redundancy. We may think of the color of each pixel as represented by a three-dimensional
vector (R,G,B) consisting of its red, green, and blue components. In a typical image, there is a
significant amount of correlation between these components. For this reason, we will use a color
space transform to produce a new vector whose components represent luminance, Y, and blue
and red chrominance, Cb and Cr.
The luminance describes the brightness of the pixel while the chrominance carries information
about its hue. These three quantities are typically less correlated than the (R, G, B) components.
Furthermore, psychovisual experiments demonstrate that the human eye is more sensitive to
luminance than chrominance, which means that we may neglect larger changes in the
chrominance without affecting our perception of the image.
Since this transformation is invertible, we will be able to recover the (R,G,B) vector from the (Y,
Cb, Cr) vector. This is important when we wish to reconstruct the image. (To be precise, we
usually add 128 to the chrominance components so that they are represented as numbers between
0 and 255.)
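A small sketch of this transform on a single pixel, using the usual JFIF coefficients (these particular constants come from the JFIF specification rather than from these notes; inputs are assumed to be in the range 0-255):

    /* Convert one (R, G, B) pixel to (Y, Cb, Cr); the +128 recentres the
     * chrominance values into the 0-255 range, as mentioned above. */
    void rgb_to_ycbcr(double r, double g, double b,
                      double *y, double *cb, double *cr)
    {
        *y  =  0.299    * r + 0.587    * g + 0.114    * b;
        *cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0;
        *cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0;
    }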
When we apply this transformation to each pixel in our block
we obtain three new blocks, one corresponding to each component. These are shown below
where brighter pixels correspond to larger values.
(the Y, Cb, and Cr component blocks)
As is typical, the luminance shows more variation than the chrominance. For this reason,
greater compression ratios are sometimes achieved by assuming the chrominance values are
constant on 2 by 2 blocks, thereby recording fewer of these values. For instance, the image
editing software Gimp provides the following menu when saving an image as a JPEG file:
The Discrete Cosine Transform
The Discrete Cosine Transform (DCT) expresses the eight values f0, f1, ..., f7 in one row of the
block as a combination of cosine functions:

fx = (1/2) * sum from w=0 to 7 of Cw * Fw * cos( (2x+1) * w * pi / 16 )

Don't worry about the factor of 1/2 in front or the constants Cw (Cw = 1 for all w except
C0 = 1/sqrt(2)). What is important in this expression is that the function fx is being represented as a
linear combination of cosine functions of varying frequencies with coefficients Fw. Shown below
are the graphs of four of the cosine functions with corresponding frequencies w.
(graphs of cos((2x+1)*w*pi/16) for w = 0, 1, 2, 3)
Of course, the cosine functions with higher frequencies demonstrate more rapid variations.
Therefore, if the values fx change relatively slowly, the coefficients Fw for larger frequencies
should be relatively small. We could therefore choose not to record those coefficients in an effort
to reduce the file size of our image.
The DCT coefficients may be found using

Fw = (1/2) * Cw * sum from x=0 to 7 of fx * cos( (2x+1) * w * pi / 16 )
Notice that this implies that the DCT is invertible. For instance, we will begin with fx and record
the values Fw. When we wish to reconstruct the image, however, we will have the coefficients Fw
and recompute the fx.
Rather than applying the DCT to only the rows of our blocks, we will exploit the two-dimensional nature of our image. The Discrete Cosine Transform is first applied to the rows of
our block. If the image does not change too rapidly in the vertical direction, then the coefficients
shouldn't either. For this reason, we may fix a value of w and apply the Discrete Cosine
Transform to the collection of eight values of Fw we get from the eight rows. This results in
coefficients Fw,u where w is the horizontal frequency and u represents a vertical frequency.
We store these coefficients in another 8 by 8 block as shown:
Notice that when we move down or to the right, we encounter coefficients corresponding to
higher frequencies, which we expect to be less significant.
The DCT coefficients may be efficiently computed through a Fast Discrete Cosine Transform, in
the same spirit that the Fast Fourier Transform efficiently computes the Discrete Fourier
Transform.
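A direct, unoptimized C sketch of this 2-D transform on one 8 by 8 block, following the formulas above (a real encoder would use a fast DCT instead):

    #include <math.h>

    #define PI 3.14159265358979323846

    /* F[u][w] = (1/4) * C(u) * C(w) *
     *           sum over x,y of f[y][x] * cos((2x+1)w*pi/16) * cos((2y+1)u*pi/16) */
    void dct_8x8(double f[8][8], double F[8][8])
    {
        for (int u = 0; u < 8; u++) {              /* vertical frequency   */
            for (int w = 0; w < 8; w++) {          /* horizontal frequency */
                double sum = 0.0;
                for (int y = 0; y < 8; y++)
                    for (int x = 0; x < 8; x++)
                        sum += f[y][x]
                             * cos((2 * x + 1) * w * PI / 16.0)
                             * cos((2 * y + 1) * u * PI / 16.0);
                double cw = (w == 0) ? 1.0 / sqrt(2.0) : 1.0;
                double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
                F[u][w] = 0.25 * cw * cu * sum;    /* the two factors of 1/2 */
            }
        }
    }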
Quantization
Of course, the coefficients Fw,u, are real numbers, which will be stored as integers. This means
that we will need to round the coefficients; as we'll see, we do this in a way that facilitates
greater compression. Rather than simply rounding the coefficients Fw,u, we will first divide by a
quantizing factor and then record
round(Fw,u / Qw,u)
This allows us to emphasize certain frequencies over others. More specifically, the human eye is
not particularly sensitive to rapid variations in the image. This means we may deemphasize the
higher frequencies, without significantly affecting the visual quality of the image, by choosing a
larger quantizing factor for higher frequencies.
Remember also that, when a JPEG file is created, the algorithm asks for a parameter to control
the quality of the image and how much the image is compressed. This parameter, which we'll
call q, is an integer from 1 to 100. You should think of q as being a measure of the quality of the
image: higher values of q correspond to higher quality images and larger file sizes. From q, a
quantity alpha is created using

alpha = 50 / q          if 1 <= q <= 50
alpha = 2 - q / 50      if 50 <= q <= 99

so larger values of q give smaller values of alpha. The coefficients are then recorded as
round(Fw,u / (alpha * Qw,u)).
Naturally, information will be lost through this rounding process. When either alpha or Qw,u is
increased (remember that large values of alpha correspond to smaller values of the quality
parameter q), more information is lost, and the file size decreases.
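A small sketch of this quantization step for one block (the 8 by 8 matrix Q and the factor alpha are passed in; the function names are illustrative):

    #include <math.h>

    /* Divide each DCT coefficient by alpha * Q and round to the nearest integer. */
    void quantize_8x8(double F[8][8], double Q[8][8], double alpha, int out[8][8])
    {
        for (int u = 0; u < 8; u++)
            for (int w = 0; w < 8; w++)
                out[u][w] = (int)lround(F[u][w] / (alpha * Q[u][w]));
    }

    /* A decoder recovers approximate coefficients as out[u][w] * alpha * Q[u][w]. */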
Here are typical values for Qw,u recommended by the JPEG standard. First, for the luminance
coefficients:
These values are chosen to emphasize the lower frequencies. Let's see how this works in our
example. Remember that we have the following blocks of values:
(the Y, Cb, and Cr blocks of quantized coefficients)
The entry in the upper left corner essentially represents the average over the block. Moving to
the right increases the horizontal frequency while moving down increases the vertical frequency.
What is important here is that there are lots of zeroes. We now order the coefficients as shown
below so that the lower frequencies appear first.
Reconstructing the image from the information is rather straightforward. The quantization
matrices are stored in the file so that approximate values of the DCT coefficients may be
recomputed. From here, the (Y, Cb, Cr) vector is found through the Inverse Discrete Cosine
Transform. Then the (R, G, B) vector is recovered by inverting the color space transform.
Here is the reconstruction of the 8 by 8 block with the parameter q set to 50
Original
Reconstructed (q = 50)
and, below, with the quality parameter q set to 10. As expected, the higher value of the parameter
q gives a higher quality image.
Original
Reconstructed (q = 10)
JPEG 2000
While the JPEG compression algorithm has been quite successful, several factors created the
need for a new algorithm, two of which we will now describe.
First, the JPEG algorithm's use of the DCT leads to discontinuities at the boundaries of the 8 by 8
blocks. For instance, the color of a pixel on the edge of a block can be influenced by that of a
pixel anywhere in the block, but not by an adjacent pixel in another block. This leads to blocking
artifacts demonstrated by the version of our image created with the quality parameter q set to 5
(by the way, the size of this image file is only 1702 bytes) and explains why JPEG is not an ideal
format for storing line art.
In addition, the JPEG algorithm allows us to recover the image at only one resolution. In some
instances, it is desirable to also recover the image at lower resolutions, allowing, for instance, the
image to be displayed at progressively higher resolutions while the full image is being
downloaded.
To address these demands, among others, the JPEG 2000 standard was introduced in December
2000. While there are several differences between the two algorithms, we'll concentrate on the
fact that JPEG 2000 uses a wavelet transform in place of the DCT.
Before we explain the wavelet transform used in JPEG 2000, we'll consider a simpler example of
a wavelet transform. As before, we'll imagine that we are working with luminance-chrominance
values for each pixel. The DCT worked by applying the transform to one row at a time, then
transforming the columns. The wavelet transform will work in a similar way.
To this end, we imagine that we have a sequence f0, f1, ..., fn describing the values of one of the
three components in a row of pixels. As before, we wish to separate rapid changes in the
sequence from slower changes, so we create a sequence of wavelet coefficients:

a2i   = (f2i + f2i+1) / 2
a2i+1 = (f2i - f2i+1) / 2
Notice that the even coefficients record the average of two successive values--we call this the
low pass band since information about high frequency changes is lost--while the odd coefficients
record the difference in two successive values--we call this the high pass band as high frequency
information is passed on. The number of low pass coefficients is half the number of values in the
original sequence (as is the number of high pass coefficients).
It is important to note that we may recover the original f values from the wavelet coefficients, as
we'll need to do when reconstructing the image:

f2i   = a2i + a2i+1
f2i+1 = a2i - a2i+1
We reorder the wavelet coefficients by listing the low pass coefficients first followed by the high
pass coefficients. Just as with the 2-dimensional DCT, we may now apply the same operation to
transform the wavelet coefficients vertically. This results in a 2-dimensional grid of wavelet
coefficients divided into four blocks by the low and high pass bands:
As before, we use the fact that the human eye is less sensitive to rapid variations to deemphasize
the rapid changes seen with the high pass coefficients through a quantization process analogous
to that seen in the JPEG algorithm. Notice that the LL region is obtained by averaging the values
in a 2 by 2 block and so represents a lower resolution version of the image.
In practice, our image is broken into tiles, usually of size 64 by 64. The reason for choosing a
power of 2 will be apparent soon. We'll demonstrate using our image with the tile indicated.
(This tile is 128 by 128 so that it may be more easily seen on this page.)
Notice that, if we transmit the coefficients in the LL region first, we could reconstruct the image
at a lower resolution before all the coefficients had arrived, one of aims of the JPEG 2000
algorithm.
We may now perform the same operation on the lower resolution image in the LL region thereby
obtaining images of lower and lower resolution.
The wavelet coefficients may be computed through a lifting process: for instance, a2i = (f2i + f2i+1)/2
first replaces f2i, and then a2i+1 = a2i - f2i+1 (which equals (f2i - f2i+1)/2) replaces f2i+1.
The advantage is that the coefficients may be computed without using additional computer
memory--a0 first replaces f0 and then a1 replaces f1. Also, in the wavelet transforms that are used
in the JPEG 2000 algorithm, the lifting process enables faster computation of the coefficients.
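A sketch of that in-place lifting step for one row of n values (n even), matching the averaging/differencing transform described above:

    /* Replace f[2i] with the average and f[2i+1] with the half-difference,
     * in place, so no extra memory is needed. */
    void lift_row(double f[], int n)
    {
        for (int i = 0; i + 1 < n; i += 2) {
            f[i]     = (f[i] + f[i + 1]) / 2.0;  /* a(2i)   = average, replaces f(2i)  */
            f[i + 1] = f[i] - f[i + 1];          /* a(2i+1) = a(2i) - f(2i+1)
                                                  *         = (f(2i) - f(2i+1)) / 2    */
        }
    }

    /* The inverse undoes the steps in reverse order:
     *     f(2i+1) = a(2i) - a(2i+1);    f(2i) = a(2i) + a(2i+1);                      */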
MPEG
The most common form of the video signal in use today is still analog. This signal is obtained through a
process known as scanning. In this section the analog representation of the video signal and its
disadvantages are discussed. This section also describes the move towards a digital representation of the
video signal. After describing the need for compression of the video signal, it describes the MPEG
compression technique for video signals.
Today, technology is attempting to integrate the video, computer and telecommunication
industries together on a single multimedia platform. The video signal is required to be scalable,
platform independent, able to provide interactivity, and robust. The analog signal unfortunately fails
to address these requirements. Moving to digital not only eliminates most of the above
mentioned problems but also opens the door to a whole range of digital video processing techniques
which can make the picture sharper.
Digital video too has its share of bottlenecks. The most important one is the huge bandwidth and
storage requirement: in spite of being digital, the signal still needs to be stored and transmitted. The
logical solution to this problem is digital video compression.
There are many different redundancies present in the video signal data:
Spatial.
Temporal.
Psychovisual.
Coding.
Spatial redundancy occurs because neighboring pixels in each individual frame of a video signal
are related. The pixels in consecutive frames of the signal are also related, leading to temporal
redundancy. The human visual system does not treat all visual information with equal
sensitivity, leading to psychovisual redundancy. Finally, not all parameters occur with the same
probability in an image; as a result, they do not all require an equal number of bits to code them
(Huffman coding).
There are several different compression standards around today (for example, CCITT recommendation H.261).
MPEG, which stands for Moving Picture Experts Group, is a joint committee of the ISO and
IEC. It has been responsible for the MPEG-1 (ISO/IEC 11172) and MPEG-2 (ISO/IEC 13818)
standards in the past and is currently developing the MPEG-4 standard. MPEG standards are
generic and universal. There are three main parts in the MPEG-1 and MPEG-2 specifications,
namely Systems, Video and Audio. The Video part defines the syntax and semantics of the
compressed video bitstream. The Audio part defines the same for the audio bitstream, while the
Systems part specifies the combination of one or more elementary streams of video and audio, as
well as other data, into single or multiple streams suitable for storage or transmission.
The MPEG-2 standard contains a fourth part called DSM-CC, which defines a set of protocols
for the retrieval and storage of MPEG data. We shall now examine the structure of a non-scalable
video bitstream in some detail to understand the video compression.
The video bitstream consists of video sequences. Each video sequence consists of a variable
number of groups of pictures (GOPs). A GOP contains a variable number of pictures (P), Figure 3.
Mathematically, each picture is really a union of the pixel values of the luminance and the two chrominance
components. The picture can also be subsampled at a lower resolution in the chrominance
domain because the human eye is less sensitive to high-frequency color shifts (there are more rods than
cones on the retina). There are three formats:
1. 4:4:4---the chrominance and luminance planes are subsampled at the same resolution.
2. 4:2:2---the chrominance planes are subsampled at half resolution in the horizontal direction.
3. 4:2:0---the chrominance information is sub-sampled at half the rate both vertically and
horizontally.
These formats are shown in Format.fig.
Pictures can be divided into three main types based on their compression schemes.
I or Intra pictures.
P or Predicted pictures.
B or Bidirectional pictures.
The frames that can be predicted from previous frames are called P-frames. But what happens if
transmission errors occur in a sequence of P-frames? To avoid the propagation of transmission errors
and to allow periodic resynchronization, a complete frame which does not rely on information from
other frames is transmitted approximately once every 12 frames. These stand-alone frames are "intra
coded" and are called I-frames. The coding technique for I pictures falls in the category of transform
coding. Each picture is divided into non-overlapping 8x8 pixel blocks. Four of these blocks are
additionally arranged into a bigger block of size 16x16, called a macroblock. The DCT is applied to each 8x8
block individually, Figure 4. This transform converts the data into a series of coefficients which represent
the magnitudes of the cosine functions at increasing frequencies. The quantization process allows the high-energy,
low-frequency coefficients to be coded with a greater number of bits, while using fewer or zero
bits for the high-frequency coefficients. Retaining only a subset of the coefficients reduces the total number
of parameters needed for representation by a substantial amount. The quantization process also helps
in allowing the encoder to output bitstreams at a specified bitrate.
The DCT coefficients are coded using a combination of two special coding schemes: run-length
and Huffman. The coefficients are scanned in a zigzag pattern to create a 1-D sequence (MPEG-2
provides an alternative scanning method). The resulting 1-D sequence usually contains a large
number of zeros due to the DCT and the quantization process. Each non-zero coefficient is
associated with a pair of values: first, its position in the block, which is indicated by the number
of zeros between itself and the previous non-zero coefficient; second, its coefficient value. Based
on these two values, it is given a variable-length code from a lookup table. This is done in a
manner so that a highly probable combination gets a code with fewer bits, while the unlikely
ones get longer codes. However, since spatial redundancy is limited, the I pictures provide
only moderate compression.
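A rough sketch of the zig-zag scan and (run, value) pairing for one quantized 8x8 block (the zigzag[] table below is the standard scan order; the variable-length lookup itself is not shown):

    /* Raster index of the k-th coefficient in zig-zag order. */
    static const int zigzag[64] = {
         0,  1,  8, 16,  9,  2,  3, 10,
        17, 24, 32, 25, 18, 11,  4,  5,
        12, 19, 26, 33, 40, 48, 41, 34,
        27, 20, 13,  6,  7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36,
        29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46,
        53, 60, 61, 54, 47, 55, 62, 63
    };

    /* Emit (zero-run, value) pairs for the non-zero coefficients;
     * each pair would then be given a variable-length (Huffman) code. */
    int run_length_encode(const int coeff[64], int runs[64], int values[64])
    {
        int npairs = 0, run = 0;
        for (int k = 0; k < 64; k++) {
            int v = coeff[zigzag[k]];
            if (v == 0) {
                run++;                    /* count zeros since last non-zero */
            } else {
                runs[npairs]   = run;
                values[npairs] = v;
                npairs++;
                run = 0;
            }
        }
        return npairs;
    }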
The P and B pictures are where MPEG derives its maximum compression efficiency. It is done
by a technique called motion compensation (MC) based prediction, which exploits the temporal
redundancy. Since frames are closely related, it is assumed that the current picture can be modelled
as a translation of the picture at the previous time. It is then possible to accurately "predict" the
data of one frame based on the data of a previous frame. In P pictures, each 16x16 macroblock is
predicted from a macroblock of a previously encoded I picture. Since frames are
snapshots in time of a moving object, the macroblocks in the two frames may not correspond to
the same spatial location. The encoder searches the previous frame (for P-frames, or the frames
before and after for B-frames) in half-pixel increments for other macroblock locations that are a
close match to the information contained in the current macroblock. The displacements in
the horizontal and vertical directions of the best-match macroblocks from the co-sited macroblock
are called motion vectors, Figure 5.
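A much simplified full-search sketch using the sum of absolute differences over the luminance (real encoders search in half-pixel steps and use far cleverer search strategies; the flat frame layout here is an assumption):

    #include <stdlib.h>
    #include <limits.h>

    /* Find the motion vector (*best_dx, *best_dy) for the 16x16 macroblock at
     * (mbx, mby) in 'cur' by searching a +/- 'range' pixel window in 'ref'. */
    void motion_search(const unsigned char *cur, const unsigned char *ref,
                       int width, int height, int mbx, int mby, int range,
                       int *best_dx, int *best_dy)
    {
        long best_sad = LONG_MAX;
        *best_dx = *best_dy = 0;

        for (int dy = -range; dy <= range; dy++) {
            for (int dx = -range; dx <= range; dx++) {
                int rx = mbx + dx, ry = mby + dy;
                if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height)
                    continue;                          /* candidate falls off-frame   */

                long sad = 0;                          /* sum of absolute differences */
                for (int y = 0; y < 16; y++)
                    for (int x = 0; x < 16; x++)
                        sad += abs((int)cur[(mby + y) * width + (mbx + x)]
                                 - (int)ref[(ry  + y) * width + (rx  + x)]);

                if (sad < best_sad) {                  /* best match so far */
                    best_sad = sad;
                    *best_dx = dx;
                    *best_dy = dy;
                }
            }
        }
    }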
If no matching macroblock is found in the neighboring region, the macroblock is intra coded
and its DCT coefficients are encoded. If a matching block is found in the search region, the
coefficients are not transmitted, but a motion vector is used instead.
The motion vectors can also be used for motion prediction in case of corrupted data, and sophisticated
decoder algorithms can use these vectors for error concealment (refer to article 1).
For B pictures, MC prediction and interpolation are performed using reference frames present on
either side of them.
Compared to I and P pictures, B pictures provide the maximum compression. There are other
advantages of B pictures as well: B pictures are themselves never used for predictions and hence do not
propagate errors. MPEG-2 allows for both frame-based and field-based MC. Field-based MC is
especially useful when the video signal includes fast motion.