INDEX

S. No Topic Page No
Week 1
1 Introduction to Digital Image Processing 1
2 Introduction to Computer Vision 65
3 Introduction to Computer Vision and Basic Concepts of Image Formation 115
4 Shape From Shading 144
5 Image Formation: Geometric Camera Models - I 183
6 Image Formation: Geometric Camera Model - II 205
7 Image Formation: Geometric Camera Model - III 225
8 Image Formation in a Stereo Vision Setup 279
9 Image Reconstruction from a Series of Projections 321
10 Image Reconstruction from a Series of Projections 347
Week 2
11 Image Transforms - I 365
12 Image Transforms - II 389
13 Image Transforms - III 422
14 Image Transforms - IV 463
Week 3
15 Image Enhancement 517
16 Image Filtering - I 554
17 Image Filtering - II 613
18 Colour Image Processing - I 665
19 Colour Image Processing - II 690
Week 4
20 Image Segmentation 751
21 Image Features and Edge Detection 798
22 Edge Detection 839
23 Hough Transform 860
24 Image Texture Analysis - I 883
Week 5
25 Image Texture Analysis - II 921
26 Object Boundary and Shape Representations - I 943
27 Object Boundary and Shape Representations - II 966
28 Interest Point Detectors 997
Week 6
29 Image Features - HOG and SIFT 1040
30 Introduction to Machine Learning - I 1098
31 Introduction to Machine Learning - II 1121
32 Introduction to Machine Learning - III 1170
Week 7
33 Introduction to Machine Learning - IV 1211
34 Introduction to Machine Learning - V 1235
35 Artificial Neural Network for Pattern Classification - I 1284
36 Artificial Neural Network for Pattern Classification - II 1315
Week 8
37 Introduction to Deep Learning 1342
38 Gesture Recognition 1393
39 Background Modelling and Motion Estimation 1445
40 Object Tracking 1477
41 Programming Examples 1518

Computer Vision and Image Processing
Professor M.K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture 1
Introduction to Computer Vision
Welcome to NPTEL MOOC's course on Computer Vision and Image Processing, Fundamentals
and Applications. This is my first class, so in this course, I will discuss some fundamental
concepts of computer vision and image processing. And finally, I will discuss some important
applications of computer vision.

So what is computer vision? Computer vision is a field of Computer Science; the objective is to
build a machine so that it can process and interpret images and videos just like the human visual
system does. So I can say it may be a complement of biological vision. So let us see the formal
definition of computer vision.

(Refer Slide Time: 01:11)

So, building machines that see. It is mainly the modeling of biological perception, and the input to a
computer vision system is mainly digital images and videos. The definition is: computer vision is a
field of Computer Science that works on enabling computers to see, identify, and process images in
the same way that human vision does, and then provide appropriate output. So this is the formal
definition of computer vision.

(Refer Slide Time: 01:43)

So in this block diagram, I want to show the similarity between the human visual system and a
computer vision system. So in the first block diagram, I have shown the human visual system.

So for human visual system, we have eyes to see images or we can see videos or maybe we can
see objects. And after this, we do the processing in our brain and after this, we take intelligent
decisions. In case of the computer vision, we have cameras, and they are maybe single camera or
maybe multiple cameras for image or the video acquisition.

And after this, we have to do pre-processing, that is, image pre-processing we have to do. And
finally, we have to apply the pattern recognition and artificial intelligence algorithms for decision
making. So you can see the similarity between the human visual system and the computer vision
system.

One basic difference I can highlight between the human eye and the camera of a computer vision
system. In both image acquisition devices, the light is converted into an electrical signal, but there
is a basic difference I can show you. Suppose I consider a camera; this is, suppose, the lens of the
camera; this is a convex lens. I want to focus this object.

So to focus this object, I can move the lens maybe in the forward direction or maybe in the
backward direction, and by this process, I can focus a particular object in the camera. In case of
the human eye, that is not possible.

Suppose I want to focus this particular object and I have the lens of my eye. I cannot move the
lens in the forward or the backward direction. What I can do is change the shape of the lens; I can
change it like this or like this. By this process, I can change the focal length. So that is the only
difference between the human eye and the camera.

In both cases, the light is converted into an electrical signal. In the human eye, we have the retina,
and in the case of the camera, we have photo sensors, which convert the light into an electrical signal.

(Refer Slide Time: 4:06)

In this case, I have shown the distinction between computer vision, image processing and the
computer graphics. In computer vision, my input is image and output is interpretation. The
interpretation of the image, the interpretation of the video; that is computer vision.

In case of the image processing, my input is image and output is image. So I can give one
example. Suppose if I want to improve the visual quality of an image, then in this case, I have to
do the image processing. And then in that case, my input is image and output is also image.
Suppose if I want to remove noises in an image, then in this case, I have to do image processing.

And in case of the computer graphics, my input is model. So from the model I have to generate
the image. So you can see the distinction between image processing, computer vision, and
computer graphics.

(Refer Slide Time: 05:03)

In this figure if you see, one is analysis, image analysis; another one is synthesis. So what is
analysis? From the image, if I do some interpretation, if I get some model, then it is computer
vision. And in synthesis, if the model is available, if I can generate the image from the model
that is called synthesis.

So in this case, the analysis means computer vision, and synthesis means computer graphics. So
you can see the distinction between computer vision and computer graphics.

(Refer Slide Time: 05:38)

And this interpretation, I can show like this by Bayesian inference. So what is the Bayesian
inference? The probability of the world given the image. So it is the probability of the world given
the image, which is equal to the probability of the image given the world, multiplied by the
probability of the world, divided by the probability of the image. So this is by using Bayes' law.

So in this case, probability of the world, that means, the model, world means the model, given
the image is computer vision. That means, from the image, I have to do some interpretation; I
have to get the model. That is computer vision.

And if you see this one, computer graphics: from the model, I have to generate the image. That is
computer graphics. What is the probability of the world? The probability of the world means the
modeling of the objects in the world. So by using this Bayesian inference, you can see the
definitions of computer vision and computer graphics.
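Written compactly, the Bayes relation described above is

$$P(\text{world} \mid \text{image}) = \frac{P(\text{image} \mid \text{world})\, P(\text{world})}{P(\text{image})}$$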

(Refer Slide Time: 06:41)

And some of the related disciplines of computer vision are computer graphics, artificial intelligence,
machine learning, cognitive science, algorithms, and image processing.

(Refer Slide Time: 06:52)

Now, let us see the fundamental steps of computer vision. In this block diagram, I have shown
some of the steps. The first one is the input image. I have the input image. So for this, I have
cameras.

After getting the input image, I can do some pre-processing: if I want to improve the visual
quality of an image, I have to do image pre-processing, or if I want to remove the noise in
the image, I have to do image pre-processing. After this, I can do image segmentation.
Segmentation means the partitioning of an image into connected homogeneous regions. So I can
do the image segmentation.

After this what I can do, the feature extractions. So from the image, I have to extract some
features. So in this block diagram, if you see, this is the feature vector. So this feature extraction
is this, image pre-processing is this, so after this, we have to extract some features. Based on
these features, I can do classification.

So this is a typical block diagram of a computer vision or maybe the image analysis system. So
first, I have to do the image pre-processing, after this we have to extract some features, and based
on these features, I have to do the classification.
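As a rough illustration of these steps, here is a minimal OpenCV-Python sketch; the file name, the Otsu threshold, and the toy area-based classification rule are my own assumptions for illustration, not part of the lecture.

```python
# Pipeline sketch: pre-processing -> segmentation -> feature extraction -> classification.
import cv2

image = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)      # assumed input image

# 1. Pre-processing: suppress noise to improve visual quality
smoothed = cv2.GaussianBlur(image, (5, 5), 0)

# 2. Segmentation: partition the image into homogeneous regions
_, mask = cv2.threshold(smoothed, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# 3. Feature extraction: one feature vector (area, perimeter) per region
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
features = [(cv2.contourArea(c), cv2.arcLength(c, True)) for c in contours]

# 4. Classification: a toy rule-based decision on each feature vector
labels = ["large object" if area > 500 else "small object" for area, _ in features]
print(labels)
```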

(Refer Slide Time: 08:10)

Now, what is image processing? So image, if I consider a digital image, digital image is nothing
but a 2D array of numbers. So I have x-coordinate and the y-coordinates, and it is a 2D array of
numbers.

So image processing means, I have to manipulate these numbers. Suppose, if I want to remove
the noise of an image, then in this case, I have to manipulate these numbers. Suppose, if I want to
improve the brightness of an image, if I want to improve the contrast of an image, I have to
manipulate these numbers. So that means the manipulation of these numbers is nothing but the
image processing.
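A small sketch of this idea, assuming NumPy; the 3x3 array and the particular operations (a brightness shift and a contrast stretch) are illustrative assumptions only.

```python
# "Manipulating the numbers" of a digital image: operate directly on the 2D array.
import numpy as np

image = np.array([[52, 55, 61],
                  [79, 61, 76],
                  [62, 59, 55]], dtype=np.uint8)   # toy 3x3 digital image

# brightness shift: add a constant, clipping back to the 0..255 range
brighter = np.clip(image.astype(np.int32) + 40, 0, 255).astype(np.uint8)

# contrast stretch: map the min..max intensity range onto 0..255
stretched = ((image - image.min()) / (image.max() - image.min()) * 255).astype(np.uint8)

print(brighter)
print(stretched)
```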

(Refer Slide Time: 08:52)

Now, in this block diagram, I have shown the typical image processing sequence. So you have
seen that the light source is available; the light is coming from the source and it is reflected by the
object. And after this, we have the imaging system, that is nothing but the camera. This camera
actually converts the light photon into an electrical signal.

So here, if you see here, that is, the light is coming from the source and it is reflected by the
object. And after this, we have the imaging system that is the camera. So this is the analog signal
I am getting. Analog means it is a continuous function of time. This analog image, I can convert
into digital image. So for this, I have to do the sampling. That sampling is called the spatial
sampling; sampling along the x-direction, sampling along the y-direction. And after this, I have
to do the quantization; the quantization of the intensity values.

So like this, I can get the digital image from the analog image. This intensity value, or the pixel
value, actually depends on the amount of light reflected by the object. So that means there is a
property; that property is called the reflectance property. So later on, I will discuss what the
reflectance property is. Sometimes it is called the albedo of a surface.

After doing, after getting the digital image, what I have to do? I can store this image in the digital
computers in a memory. I can process this image in the digital computers, I can display the
image, or I can store the image. So this is a typical image processing sequence.

(Refer Slide Time: 10:38)

This image formation principle here I have shown again. From the light source, the light is
coming. Light is reflected by the surface. So here, I am considering the surface reflectance. And
after this, the radiance is coming and in this case, I have the camera, this is the camera. So I am
getting the analog signal. This analog image, I can convert into digital image. So this is this is the
simple image formation principle.

(Refer Slide Time: 11:11)

And here I have shown the sources of illumination. So here, I have shown the electromagnetic
spectrum. In this case, I have shown the gamma rays, x-rays, ultraviolet, infrared, microwaves,
FM radio. So if you see this portion, this portion is the visible spectrum that is from 400 to 700
nanometers.

(Refer Slide Time: 11:38)

So based on this electromagnetic spectrum, I have these types of images: X-ray images, images
from the visible region of the spectrum, infrared images, synthetic aperture radar images,
ultrasound images, or something like mammograms also. So we have a number of images like this.

(Refer Slide Time: 11:55)

Now, this computer vision actually depends on the physics of imaging. That means the principle
of cameras; the first point is the cameras. Next is the measurement of light, that is, how much
light is coming from the source and how much light is reflected by the surface; the measurement
of light is called radiometry. And also the concept of colors, different types of colors. So the
physics of imaging is quite important.

(Refer Slide Time: 12:34)

Now, let us consider this. If you see the same person under different lighting conditions, a shading
effect is there. So you can see the same person, but a different appearance. So in this case, I have
to develop some computer vision system so that the computer can identify these two faces, this
face and this face, under different lighting conditions and different shading conditions.

(Refer Slide Time: 13:03)

I can give another example. So in this case, two faces, but different shading conditions, different
lighting conditions.

(Refer Slide Time: 13:08)

So that is why computer vision depends on geometry, physics, and the nature of objects in the
world. So these are the main things: one is geometry, I can consider the geometry of the camera.
Physics, that is the measurement of light and also the basic geometrical configuration of the
camera. And also the nature of objects in the world.

(Refer Slide Time: 13:33)

Now, you see here, computer vision is an inverse problem. Why is it an inverse problem? Because,
if you see, if I want to take one image, that is nothing but a 3D to 2D projection. So objects in
the world are 3D objects, but if I take the image, then in this case, I am getting the 2D
image. That is the 3D to 2D projection.

So one dimension is lost, that is, the z-dimension. Because, of the x, y, and z coordinates, after
the projection I will only have the x and y coordinates, but this information will be missing; that is
the depth information.

But in case of the vision, computer vision and image understanding, that is 2D to 3D
reconstruction because from the image, I have to get the model, I have to do the interpretation.
So that is why I can say vision is an ill-posed inverse problem. Loss of the third dimension that is
the depth information is, if I consider z is the depth information, depth information will be
missing. So I have to determine the depth information.

(Refer Slide Time: 14:40)

Another point: vision is an ill-posed inverse problem, so here I am showing two cases. In the first
case, I am considering two different surfaces illuminated by the light source. In the first case,
what am I getting here? I am getting the reflectance something like this; the reflectance is uniform,
but the illumination will be like this. So this surface is not illuminated, only this surface is
illuminated. So corresponding to this, I am getting the image here; this is the image. So this is the
reflectance and this is the image.

In the second case, I am considering this surface, suppose, as the scene, and it is illuminated by this
light source. Corresponding to this case, this is the reflectance, and the illumination will be
something like this. This is also illuminated and this is also illuminated. So in this case also, I am
getting this image.

These two images are identical. So that is why it is very difficult to identify whether the scene is
this or this. That is why I can say vision is an ill-posed inverse problem. Here I have given two
examples: one is this example, because of the 2D to 3D reconstruction problem, and another one is
this, the reflectance-illumination problem.

(Refer Slide Time: 16:05)

Now, let us consider approaches to vision.

(Refer Slide Time: 16:08)

So, for computer vision, first we have to do some modeling, build a simple model of the world.
After this, find the appropriate algorithms. After this, we have to experiment on the real world.
And after this, we have to update the model. That means, after getting the model and after
selecting the algorithms, we have to do some experimentation.

That means we are doing some training, and after training we are updating the model. So first I
have to do the training and after this, we have to go for testing. So that is the modeling plus
algorithms.

(Refer Slide Time: 16:46)

So now, let us consider the approaches of vision. The first one is early vision in one image.
So in this case, we are representing small patches of the image. So maybe something like finding the
correspondence between points in different images, or maybe something like edge detection, or
maybe the representation of different textures. So these are the examples in early vision.

(Refer Slide Time: 17:14)

So I can give some of the examples like this. The edge detection and the boundary detection, this
is one example.
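For instance, a minimal edge-detection sketch in OpenCV-Python; the Canny detector, its thresholds, and the file names are illustrative choices here, not prescribed by the lecture.

```python
# Edge detection on a grayscale image (assumed example).
import cv2

image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(image, threshold1=100, threshold2=200)   # binary edge map
cv2.imwrite("edges.png", edges)
```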

(Refer Slide Time: 17:20)

The second example I can show is the representation of texture. So here you see, this pattern is
repeated here; I am getting the texture pattern. So how to describe a texture?

(Refer Slide Time: 17:33)

Now, shape from texture. Now, this is one important research problem in computer vision that is
how to determine shape from texture information. Here, I have given one example. So you see
from the deformation of the texture, whether it is possible to find the shape information. That is
one important problem of computer vision. I can give another two examples; two or three
examples I can see here.

(Refer Slide Time: 17:57)

Shape from textures, I have different types of textures, if you see. So this texture actually
indirectly gives some shape information. The texture variation, if you see the texture
deformation, it indirectly gives some shape information.

(Refer Slide Time: 18:13)

The next one is early vision in multiple images. So in this case, I can consider multiple views. So
instead of considering a single camera, I can consider two cameras. If I consider two cameras, then
it is something like stereo vision; I can also consider multiple cameras. And another research
problem in computer vision is structure from motion.

(Refer Slide Time: 18:38)

Now, this is one example of the stereo image. In stereo image, I can get the depth information. I
have the x and y coordinates and also I can determine the depth information. And this is one
example of a stereo image.

(Refer Slide Time: 18:55)

This is another example of a stereo image and here I have shown one stereo matching system. So
I have two cameras; one is the left camera another one is the right camera. So corresponding to
this, I am getting two images; the left image and the right image. And from the left and then right
image, I can determine the disparity map. This actually, disparity map, gives the depth
information.

So I have two cameras the left and the right camera and from this I am getting two images, one
in the left image another one is the right image. From this I can determine the disparity map and
this disparity map will give the information of the depth of a scene.
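A minimal sketch of this left-image/right-image-to-disparity step with OpenCV-Python; the file names and the block-matching parameters are assumptions, and the stereo pair is assumed to be rectified.

```python
# Disparity map from a rectified stereo pair (larger disparity => closer to the cameras).
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)

# rescale to 0..255 just for saving/visualization
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", vis)
```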

(Refer Slide Time: 19:34)

Another research problem I have already mentioned the structure from motion. So from the
motion information, whether it is possible to determine the structure information. Like in human
visual system, we use the motion parallax.

So suppose if I consider, suppose this is the object. The object is moving suppose. This object is
moving like this and I have the camera. This object is moving, so in this case, this surface is
close to the camera as compared to this surface, this is the closer to the camera.

So from the camera, it looks like this it moves faster as compared to this surface, this plane; this
is one plane, another plane is this. The object is moving like this, so if I take number of images,
what I am getting. This surface or this plane moves faster as compared to this. So this is similar
to the motion parallax.

So from this, actually the problem is how to determine the shape information; the structure from
the motion. So I have number of images like this. So in this case, I have to find the
correspondence between the points because I can get number of images and from these images, I
can find a correspondence like this between the points. And from this, I will try to determine the
structure. So that is one research problem, the structure from motion.

(Refer Slide Time: 21:11)

So this is another example I have given, the motion application. So I have given two important
research directions: one is shape from texture, another one is structure from motion. These are two
very important research problems in computer vision.

(Refer Slide Time: 21:29)

The next one is the mid-level vision. So in the mid-level vision, we can consider the problems
like the segmentation of an image or a video and also we can consider the problems like tracking
in the video.

(Refer Slide Time: 21:46)

So I can give some examples like segmentation. For an image, segmentation means the
partitioning of an image into connected homogeneous regions. I am doing the segmentation, so
this is a segmentation.

(Refer Slide Time: 21:57)

And here, I have given one example of a skin color segmentation. So skin color is detected, if
you see here the results, the skin color is detected.
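A hedged sketch of skin-colour segmentation by thresholding in HSV space with OpenCV-Python; the threshold ranges below are rough, commonly used values that I am assuming here, not taken from the lecture.

```python
# Skin-colour segmentation by colour thresholding (assumed example).
import cv2
import numpy as np

frame = cv2.imread("frame.png")
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

lower = np.array([0, 40, 60], dtype=np.uint8)      # rough lower bound for skin tones
upper = np.array([25, 255, 255], dtype=np.uint8)   # rough upper bound for skin tones
skin_mask = cv2.inRange(hsv, lower, upper)         # white where the pixel looks like skin

skin_only = cv2.bitwise_and(frame, frame, mask=skin_mask)
cv2.imwrite("skin.png", skin_only)
```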

(Refer Slide Time: 22:06)

Next, another problem is tracking in video. So in a video, I have a number of frames. Tracking
means finding the correspondence between the frames. So this is one example of tracking. I can
give other examples of tracking.

(Refer Slide Time: 22:24)

You can see here the videos; I am playing the videos. These are some examples of tracking. If you
see the videos, in this case, I am doing the labeling also. This is called background subtraction:
the background is removed and the foreground is considered, and I am doing the tracking.
Tracking of cars, tracking of persons, and you see the tracking of players like this.
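A minimal background-subtraction sketch in OpenCV-Python; the video file name and the MOG2 parameters are assumptions, and the model simply separates moving foreground pixels from the static background, as in the examples above.

```python
# Background subtraction on a video: moving pixels become white in the foreground mask.
import cv2

capture = cv2.VideoCapture("traffic.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = capture.read()
    if not ok:
        break
    foreground = subtractor.apply(frame)        # per-frame foreground mask
    cv2.imshow("foreground", foreground)
    if cv2.waitKey(30) & 0xFF == 27:            # press Esc to quit
        break

capture.release()
cv2.destroyAllWindows()
```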

(Refer Slide Time: 23:00)

So now, I can give another example. This is the tracking by considering two cameras; this is one
view and this is another view. So in this case, I have to find a correspondence between these two
views, two images. And you see, the tracking I am doing and you can see this is the trajectory I
am determining. So this is mainly the tracking in a stereo vision setup. So this is one camera and
this is another camera.

(Refer Slide Time: 23:29)

And finally, we will go for high-level vision. One aspect is geometry. So what is the geometry?
The relationship between the object geometry and the image geometry, that I can determine.
That is one example of high-level vision.

(Refer Slide Time: 23:47)

Another aspect of high-level vision is the probabilistic one, the concept of classifiers: in pattern
classification, we have to use pattern classifiers, and this is one example of high-level vision.
That means we have to consider the classifiers, and for this, we can use the concept of
probabilities.

(Refer Slide Time: 24:08)

So, like this, this is one example of classification. Here you see a 3D model search engine. So for
this, I have to extract some features and after extracting the features, I can do the classification.
And this is one example of a 3D model search engine.

(Refer Slide Time: 24:25)

And for pattern classification, we have to extract some features and this is the feature vector.
And after extracting the features, you can see in this example, I have two objects; one is object
one, another one is object two. And I have considered two features; one is the width, another one
is the lightness.

This is the decision boundary between two objects. These are the sample points of the object one
and these are the sample points corresponding to object two. So we have to find a decision
boundary and after this, we have to go for pattern classification.

(Refer Slide Time: 25:00)

Also we can do the grouping of similar objects. That is called a clustering. So in this example, if
I consider color as a feature, then in this case, this, this, this, this will be in the same class. This,
this, this, and this will be in the same class. Like this, I can do the grouping.

But if I consider the shape as a feature, then in this case, this will be in one class and this will be in
one class; like this, I can do the grouping. So in the pattern classification portion of this
course, I will discuss clustering. So how to do the clustering? This is the basic
concept of finding the similarity between the patterns, between the objects.
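A hedged sketch of such a grouping using k-means on a colour feature with OpenCV-Python; the toy feature vectors and the choice of three clusters are assumptions for illustration, since the lecture only describes the idea of clustering.

```python
# Group six objects by their mean colour feature using k-means.
import cv2
import numpy as np

colours = np.float32([[250, 10, 10], [240, 20, 15],    # reddish objects
                      [15, 240, 20], [10, 250, 30],    # greenish objects
                      [20, 15, 245], [25, 10, 250]])   # bluish objects

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centres = cv2.kmeans(colours, 3, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)
print(labels.ravel())   # objects with the same label fall into the same group
```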

(Refer Slide Time: 25:44)

Now the question is, is computer vision as good as human vision? We know that human vision
is more powerful as compared to computer vision. So I will explain why human vision is more
powerful than computer vision. I will give one or two examples, but there are many examples
in this case.

(Refer Slide Time: 26:03)

So this is one example. In this image, if you see, I have a number of objects here. Whether it is
possible to identify all the objects in the image: a human can identify most of the objects
present in this image. But in case of computer vision, we have to develop some algorithms so
that the computer can identify the objects present in this image. That is very difficult. Even for
a human, it is very difficult to identify all the objects present in this image.

(Refer Slide Time: 26:35)

And in human visual systems, this is the block diagram of the human visual system. We have
eyes and we have the optic nerves. What is the function of the optic nerves? The signal, the eye
actually converts the light photon into electrical signal, the electrical signal is transmitted via
optic nerves, and the brain processes this signal for visual perceptions. So this is the structure of
the human visual systems.

(Refer Slide Time: 27:04)

Some very important features of human vision are depth perception, relative position, occlusion,
shading, sharpness, size, perspective, structure, and motion parallax. I can give some examples
relating to these concepts.

(Refer Slide Time: 27:26)

There are many theories regarding human stereo vision, the binocular vision I can say. So
we have two eyes, and if I want to find the distance between these two: this is one surface, another
surface is this; this is one plane, another plane is this. So the distance between these two
is the depth information.

So one theory I am explaining; it is not the general theory, but I can give one simple theory.
This eye measures this distance, the other eye measures that distance, and from these two
pieces of information, the brain determines that distance; d is the depth information. That is one
theory. There are many theories regarding stereo vision, the binocular vision.

(Refer Slide Time: 28:12)

Also, if you see this example, shape similarity. So in the shape similarity, human can easily do.
This object, this is similar to this; this is similar to this; this is similar to this. We can identify
this, the human can do this. You can see the position is different, orientation is different, here,
the orientation is different if you see these two objects.

Still we can identify that this shape is similar to this, this shape is similar to this; the human
vision can do this.

(Refer Slide Time: 28:40)

And also, another thing is that texture recognition. So here you see, I am considering three types
of texture; this is one texture, another texture, another textures. So human can identify different
types of textures.

So when we recognize different types of textures, so what is going within our brain, that is very
difficult to explain. In this case, this is different, this is different and this is different.

(Refer Slide Time: 29:08)

Similarly, another one is the color perceptions. So in this case, we can recognize different types
of colors, like these, the primary colors; the red color, green color, and the blue colors. And
based on these colors, I can identify different objects present in an image. So that is called a
color perception. So we can recognize different types of colors and by using this color
information also, we can recognize objects present in an image.

(Refer Slide Time: 29:37)

Now, you see the difference between computer vision versus the human vision. I can give two
examples; I can give many examples, but here I am showing only two examples. The butterfly
example. So this is a butterfly, this is a butterfly. So human can identify that this is a butterfly.
Even I can develop some computer vision algorithms so that the computer can identify that this
is the butterfly.

After this what I am doing, I am just giving, adding two lines here. Still human can identify this
is the butterfly, but in this case, the computer vision fails. So for this, I have to modify the
computer vision algorithms, so that the computer can identify that this is the butterfly.

After this, again I am putting some noise. Still human can identify this is the butterfly, but again
the computer vision fails. So again I have to modify my algorithms, so that the computer can
recognize this is the butterfly. So you can see that human visual system is more powerful than
the computer vision in this example.

(Refer Slide Time: 30:42)

Similarly, I am giving another example, character recognition. The human can recognize all these
characters. Also, the computer can recognize them; I can develop some computer programs so
that the computer can recognize all these characters.

(Refer Slide Time: 31:01)

After this, I am putting some noise in the image. Then in this case, I have to modify my image
processing algorithms to remove the noise, and if you see the next image, I am giving more
noises. Then still human can recognize these characters, but in case of the computer vision, I
have to modify my algorithms.

I have to do some pre-processing and after this I can do the recognition, I can go for recognition.
So you see the difference between the human visual systems and the computer vision systems.

(Refer Slide Time: 31:34)

So in summary, I can say that emulation of human-like vision on a computer is the desired goal
of computer vision. Human vision possesses some important attributes: one is perception, which
is very important; another one is cognition.

(Refer Slide Time: 31:56)

So this is the summary, and I can say that computer vision and image understanding
processes are not very robust; small changes in exposure parameters or internal parameters of
the algorithms used can lead to significantly different results. So that is the summary of my
discussion up till now.

(Refer Slide Time: 32:18)

Then, another thing is the interesting facts of human visual systems. So some of the cases like in
human visual systems, if you see, we do some interpolation here. If you see this image, that
means, we are doing some interpolation. That is something like I am getting the circle here or
something like this. We do some interpolation in our brain that we cannot explain what is going
on in our brain during the interpolation; why actually we are doing the interpolation that is very
difficult to explain.

(Refer Slide Time: 32:45)

Similarly, I am giving another example that is interpolation. You can see I am getting square
something like this. This is the square.

(Refer Slide Time: 32:57)

And another example you see is this optical illusion. What is going on within our brain is also
very difficult to explain in these cases.

(Refer Slide Time: 33:06)

Then why the computer? What is the need for computer vision? We have already discussed
that human vision is more powerful than computer vision. Then what is the importance of
computer vision? So in this case, I will give some applications of computer vision. I will
discuss them one by one and I will show the applications of computer vision.

(Refer Slide Time: 33:28)

Computer vision applications I have shown here: multimedia (movies, news, and sports), video
surveillance, security, medical imaging and medical image processing, augmented reality, optical
character recognition, 3D object recognition, games, inspection of industrial components, robotic
automation and robotic vision, human-computer interaction, and also image segmentation. So
there are many applications.

So here, I have given some biomedical examples, like medical imaging. We can use computer
vision for all these cases, like virtual surgery, laparoscopic surgery, or for medical image analysis
such as CT scan and MRI image analysis.

(Refer Slide Time: 34:29)

So, like this, medical image processing has many, many applications. In this example, I have
shown the polyps which are visible in an endoscopic video. So we have taken some
endoscopic videos and from these, we can determine the polyps; these are the polyps.

(Refer Slide Time: 34:51)

Like some examples like chromosome analysis, 3D medical imaging like MRI images and CT
scan images. And something like the virtual surgery and some applications are there.

(Refer Slide Time: 35:06)

Machine vision for medical image analysis, like the detection of tumors; if you see, the detection
of the tumors here. So this is one example of medical image analysis.

(Refer Slide Time: 35:28)

Another one is this heart image, which mainly shows the arteries and the veins. So we can
determine the blockages in the arteries and the veins by using medical image analysis
principles.

(Refer Slide Time: 35:44)

So other applications like human-computer interactions. So this example is, I am showing some
American Sign Language. So I can interact with the computers by using gestures. So this is one
important application. So instead of using mouse and a keyboard, I can use my gestures to
interact with the computers.

So in this example, I have shown some static American Sign Language signs. The computer can
identify these signs.

(Refer Slide Time: 36:11)

In the second example, I have shown some dynamic signs, so the computer can identify this one.
So for human-computer interaction, this is one application of computer vision; the human
computer interactions.

(Refer Slide Time: 36:26)

This is the vision-based interface. We have cameras and in this case the camera detects the
movement of the hand. And after this, we can recognize the movement of the hands, so we can
recognize the gestures.

(Refer Slide Time: 36:40)

And some of the research areas, like these, are very important: hand tracking, hand gestures,
arm gestures, body tracking, activity analysis, head tracking, gaze tracking, lip reading, and
facial expression recognition. So you can see some very important research areas of computer
vision.

(Refer Slide Time: 37:06)

And here, I am showing some human-computer interaction examples. The last example is a TV
controlled by using gestures; in our laboratory, we have developed this one. So this is
human-computer interaction by using gestures, and this is a TV controlled by using gestures.

(Refer Slide Time: 37:28)

Hand gesture animation, this is one application of computer vision. The virtual keyboard is
another example.

(Refer Slide Time: 37:37)

The virtual reality applications: in the MIT Media Laboratory, there are many research activities
like facial expression recognition, face recognition, human-computer interaction, and body
activity analysis. So these are applications of computer vision.

(Refer Slide Time: 38:03)

And even in the biometrics also I have many, many applications. If you see this example, this is
the biometrics in the early 90s. So for identification of a particular person, we have to do lots of
measurement; many, many measurements we have to do. After this, we can identify a particular
person.

(Refer Slide Time: 38:20)

But if you see here, the biometrics now: these are the different biometrics; one is face
recognition, infrared imaging, fingerprint recognition, iris recognition, signature recognition;
there are many, many biometric modalities. These modalities are now available.

(Refer Slide Time: 38:43)

Like another example is optical character recognition. In this case also, the computer vision is
useful, optical character recognition.

(Refer Slide Time: 38:50)

Iris recognition is one example of biometrics.

(Refer Slide Time: 38:57)

Vision-based biometrics: login without a password. Without using a password, you can use
fingerprints or maybe face recognition. So we can do login without a password.

(Refer Slide Time: 39:14)

And for automatic video surveillance, computer vision can also be used; this is one important
application of computer vision. Applications include access control in special areas, anomaly
detection and alarming, and crowd flux statistics and congestion analysis. In this case, I have
given these examples like tracking; you can see this tracking example, and the tracking of
vehicles.

(Refer Slide Time: 39:41)

And this is another application, person identification by gait. Gait means the walking style.
So from the walking style, we can identify a particular person. This is one application of
computer vision.

(Refer Slide Time: 39:55)

And if you see this, face recognition. This is the problem definition of face recognition. We have
to identify this face. We have a database, and in the database, all the faces are available; whether
this particular face is available in the database, we have to see. And there may be conditions like
this: we are not getting a good quality image, maybe a noisy image or maybe an occluded image.
This is one example of the occluded image.

This is the pose variation, different poses. One portion is occluded. Even in all these cases, I
have to identify whether this person is available in the database or not. If he is available, then we
can recognize him. So this is face recognition, and the research problems may be like this: face
recognition with age progression, which is a very important research problem, face recognition
for different pose variations, and also for different illumination conditions.

(Refer Slide Time: 40:57)

Also, another problem is facial expression recognition; we have done something on facial
expression recognition. Like this: anger, fear, happy, and so on. These expressions we can
recognize, and this is also an important application of computer vision.

(Refer Slide Time: 41:16)

Smile detection in the camera, that is also one application of computer vision.

(Refer Slide Time: 41:20)

Face detection in camera and skin color segmentation, that is also one important research area of
computer vision. So I have to identify the skin color. Here, I have shown one example. So we
have detected the skin colors like this in the video and that is one example of skin color
segmentation.

(Refer Slide Time: 41:41)

And special effects. So in special effects in movies, computer vision has many applications.

(Refer Slide Time: 41:47)

Even in the robotics also, computer vision has applications. That is the robotic vision.

(Refer Slide Time: 41:55)

So these are the applications of computer vision, and one very interesting application I can show
here is finding defects in VLSI components. In this example, I have shown defects in the PCB,
the printed circuit board. We have used computer vision for identifying the defects in the PCB,
and some of the defects we can identify like this. So this is one example.

(Refer Slide Time: 42:22)

Another example I am showing here is an automated method for solving a jigsaw puzzle. So
here, you see the jigsaw puzzle, and after the simulation, what am I getting? I am getting one
meaningful image. So in this case also, we are using some algorithms to solve the jigsaw puzzle.

(Refer Slide Time: 42:44)

So overview of this course, the course on computer vision and image processing, fundamentals
and applications, so let us see what is in this course.

(Refer Slide Time: 42:54)

The proposed course is a combination of three courses. The first one is image processing,
another one is computer vision, and I will discuss some of the machine learning algorithms
which are mainly used in computer vision applications. So in this case, we will discuss some of
the statistical machine learning algorithms and also the artificial neural networks and deep
networks used for computer vision applications.

(Refer Slide Time: 43:21)

The course is something like this. The pre-requisites are basic coordinate geometry, matrix
algebra, linear algebra, random processes, and also the fundamental concepts of digital image
processing. And regarding the programming, we can use MATLAB or maybe OpenCV-Python
as the programming environment.

(Refer Slide Time: 43:40)

And the course outline is this. The part one is introduction to computer vision. So in this case, we
will discuss the concept of image formation and also the concept of radiometry, radiometry
means the measurement of light. And also we will discuss about the cameras, the camera models.

The part two is the basic image processing concepts. So we will discuss about the image
processing concepts like image transforms, image filtering, color image processing, image
segmentation, so briefly we will discuss about this.

Part three is about image features. So in this case, we will discuss texture features, color
features, edge detection, and some other features like HOG features and SIFT features; we will
discuss these features.

Part four is machine learning algorithms for computer vision. So in this case, we will discuss
some statistical machine learning algorithms for different computer vision tasks. And also, we
will discuss artificial neural networks and deep networks for computer vision applications.

And finally, I will discuss in part five, application of computer vision. Some of the applications
like medical image segmentations, motion estimations, face and facial expression recognitions,
gesture recognition, image fusion. So these applications I am going to discuss.

(Refer Slide Time: 45:05)

Regarding the text books, you can see the first book is by Ballard and Brown. That book is a
very old book and is available on the internet; you can download the book. Another book is by
Forsyth and Ponce; you can see this book. And I have my own book, Computer Vision and
Image Processing by M K Bhuyan, published by CRC Press.

For machine learning, I will be considering this book, the Duda and the Hart. And for image
processing, you may see the book by Gonzalez and by A K Jain. These are the books. So I hope
you will enjoy the course.

So in this lecture, I have mainly defined computer vision and also shown the difference
between the human visual system and computer vision systems. After this, I have discussed
some important applications of computer vision. In the next class, I will discuss the
fundamental concepts of image processing. So let me finish here today. Thank you.

Computer Vision and Image Processing: Fundamentals and Applications
Professor M. K. Bhuyan
Indian Institute of Technology, Guwahati
Lecture - 2
Introduction to Digital Image Processing
Welcome to NPTEL MOOC's course on Computer Vision and Image Processing,
Fundamentals and Applications. In my last class, I discussed some fundamental
concepts of computer vision and also highlighted some applications of computer
vision.

So, in my class today, I will discuss the introduction to digital image processing. So, what do
you mean by a digital image? I can say a digital image means a 2D array of quantized intensity
values, or maybe a 2D array of numbers. So digital image processing means how to
manipulate these numbers.

Suppose, if I want to improve the visual quality of an image, then in this case, I have to
manipulate these numbers. These numbers mean the quantized intensity values. And suppose,
if I want to remove noises then also I have to manipulate these numbers. So, manipulation of
these numbers is mainly the image processing.

In case of the image acquisition, the light is converted into electrical signal. So, I am getting
analog signal, the analog image; the analog image can be converted into digital image by the
process of sampling and the quantization. So, I have to do the sampling along the x direction
and sampling along the y direction and that is called the spatial sampling.

After sampling, I have to do the quantization of the intensity values. So, to convert an analog
image into digital image, I have to do sampling and the quantization. So, what is digital
image, I will show you.

(Refer Slide Time: 02:07)

Digital image processing means processing of digital images on digital hardware, usually a
computer. So, this is the definition of digital image processing; that is, the manipulation of the
quantized intensity values, the manipulation of the numbers.

(Refer Slide Time: 02:22)

Now, here I have shown a typical digital image processing sequence. So, if you see, the light
is coming from the source and it is reflected by the object, and I have the imaging system.
So, this imaging system converts the light photon into an electrical signal.

And in this case, I am getting the analog image. So analog means it is a continuous function
of time. To convert the analog signal, the analog image into digital, I have to do the sampling
and the quantization. So, sampling in the x direction, sampling in the y direction and that is
called spatial sampling.

And after this, I have to do quantization of the intensity value. After this I am getting the
digital image. This digital image I can store into digital storage, I can process the image in the
digital computers, I can display the image, the processed image and I can store the image. So,
this is a typical digital image processing sequence.

(Refer Slide Time: 03:14)

The same thing I have shown here in this figure. So, image sensors are available for image
acquisition and after this, I will be getting the digital image, and that digital image I can
process with image processing hardware. And I also have image processing software, so the
processed image can be displayed and it can be stored. These are the components of a general-
purpose image processing system.

(Refer Slide Time: 03:40)

The concept of image formation here, I have shown and the light is coming from the source
and it is reflected by the surface. So, I have shown the surface normal also. So, suppose, this
is my surface normal and I have the optics, optics is mainly the camera.

So, light is reflected by the surface and I am getting the image in the image plane, that is, the
sensors. So, sensor means it converts the light photon into electrical signal. This pixel
intensity value depends on the amount of light coming from the surface.

So, if I consider one pixel at this point, this pixel value, the pixel intensity value depends on
the amount of light reflected by the surface. So that means, that the pixel value actually
depends on the surface property.

The surface property is called the reflectance property of the surface, that is, the amount of
light reflected by the surface. So, this is a typical image formation system.

(Refer Slide Time: 04:40)

So, I have shown in this case, this is the electromagnetic spectrum. The source of illumination
I have shown the gamma rays, X-rays, ultraviolet, infrared ray, microwaves, FM radio. And if
you see the wavelengths from 400 to 700 nanometers, this is the visible range of the
spectrum, the electromagnetic spectrum.

(Refer Slide Time: 05:08)

So, based on this source of illumination, I may have these types of images: maybe the digital
photo, the image sequence used for video broadcasting, something like multi-sensor data such
as satellite images, visible images, infrared images, and images obtained from the microwave
bands, medical images like ultrasound, gamma-ray images, X-ray images, and the radio band
images like the MRI images, and astronomical images. So, based on this electromagnetic
spectrum, I may have these types of images.

(Refer Slide Time: 05:44)

So here, I have given some examples. The first one is the X-ray, and next one is the photo
that is obtained in the visible region of the electromagnetic spectrum, the next one is the
infrared image, another one is the radar image; so, this is the radar image; ultrasound image,
and the mammograms. So different types of images.

(Refer Slide Time: 06:06)

And some of the applications of image processing I have shown here: applications like
multimedia and medical imaging, that is, medical image processing and medical image analysis.
There are many applications, even in forensics and in biometrics, like fingerprint recognition,
iris recognition, video surveillance, and remote sensing. There are many applications of image
processing.

And one important application of image processing is image fusion. If you see, in the first
example I have considered two images, one is the MRI image and another one is the PET
image. So, these two images are fused and I am getting the fused image.

So that means the important information from MRI images and the PET images, we have
considered and the redundant information I am neglecting, I am not considering. So, this is
the fused image. In this fused image, I have more information as compared to the original
source images. So, I have two images, one is the MRI image, another one is the PET image,
and the fused image I am getting from these two images.

In the second example, I have considered the IR image and an image from the visible region of
the spectrum. These two images I am fusing and I am getting the fused image. So, it has more
information as compared to the source images. And in this case, you will remember that I am
neglecting the redundant information; I am not considering the redundant information. So,
these are the applications of image processing.
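As a very rough sketch of combining two registered source images into one fused image, here is a plain weighted average with OpenCV-Python; real MRI/PET or IR/visible fusion uses far more elaborate rules, and the file names, the assumption of pre-registered images, and the 0.5/0.5 weights are illustrative assumptions only.

```python
# Naive image fusion: pixel-wise weighted average of two registered images.
import cv2

mri = cv2.imread("mri.png", cv2.IMREAD_GRAYSCALE)
pet = cv2.imread("pet.png", cv2.IMREAD_GRAYSCALE)   # assumed same size and already registered

fused = cv2.addWeighted(mri, 0.5, pet, 0.5, 0)
cv2.imwrite("fused.png", fused)
```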

(Refer Slide Time: 07:37)

So, what is analog image? Analog means it is a continuous function of time. This is the
definition of the analog image. So here I have shown one analog signal that is obtained from
the camera.

So, I have the horizontal lines and the vertical blanking signals; this is mainly the video signal.
So, this analog signal can be converted into a digital signal by the process of sampling and
quantization.

(Refer Slide Time: 08:09)

So, regarding TV standards, I know some of the standards, like PAL, the phase alternating line
standard. This is one TV standard. Another standard is, you know, NTSC. Another standard is
SECAM. So, I am not going to discuss these in detail, but in India, we use the PAL system.

So, in the PAL system, we have 625 lines, and we consider 25 frames per second. The PAL
system, the phase alternating line system, is a TV standard with 625 lines and 25 frames per
second. In NTSC, generally 30 frames per second are used. So, I am not going to discuss the
TV standards further.

Now, let us consider a video signal, and suppose the bandwidth of the video signal is 4
megahertz. Let us consider this example, 4 megahertz. So, what will be my sampling
frequency then?

The bandwidth is 4 megahertz, so, as per the Nyquist sampling theorem, my sampling
frequency will be 8 megahertz.

Now, I want to count the number of pixels per frame. So, how many pixels per frame? The
sampling rate is $8 \times 10^{6}$ samples per second; dividing by 25 frames per second, this
will be equal to 320000 pixels per frame. So, this is approximately 512 × 512.

So that means the image size will be 512 × 512. The image is represented by M × N; this is the
size of the image, because the image is a 2D array of numbers. So, I have M number of rows
and N number of columns; my rows are like this, and I have the columns like this. So, the
image is represented by M × N, and in this example, I have seen that the size of the image will
be 512 × 512. So, this I have shown in this example.
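The pixels-per-frame arithmetic from the example above can be written out as a small check (the values are the ones given in the lecture: 4 MHz video bandwidth and PAL at 25 frames per second):

```python
# Pixels per frame from the Nyquist rate and the frame rate.
bandwidth_hz = 4e6
sampling_rate = 2 * bandwidth_hz            # Nyquist rate: 8e6 samples per second
frames_per_second = 25

pixels_per_frame = sampling_rate / frames_per_second
print(pixels_per_frame)                     # 320000.0, roughly a 512 x 512 image
```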

(Refer Slide Time: 11:39)

Now, I have already explained how to get the digital image. So, in this diagram, you have
seen that I have shown one continuous-tone image, that is, the analog image, and after this, I
have to do the sampling: the sampling along the x direction and the sampling along the y
direction.

And after this, I have to do the quantization, the quantization of the intensity values. So, I
have done the quantization, and after this, I am getting the 2D array of numbers. So, this is a
sampled and quantized image. So, this is the representation of a digital image.

(Refer Slide Time: 12:13)

The same thing I am showing the sampling and the quantization. So, sampling along the x
direction and the sampling along the y direction I have shown and after this, I am doing the
quantization of the intensity value.

(Refer Slide Time: 12:25)

So, I am sampling the 2D space on a regular grid, and after this, I have to quantize each sample
value. First, I have to do the sampling and after this, I have to do the quantization. The result
is a 2D array of numbers, and this is the digital image.

So, I have shown the values: the 0 is a grayscale value, the 34 is a grayscale value, and this 102
is a pixel value. So, I am considering this as one digital image.
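As a hedged sketch of the quantization step described above (the sample values and the choice of 256 grey levels are assumptions for illustration):

```python
# Uniform quantization: map sampled intensities onto a reduced number of grey levels.
import numpy as np

samples = np.array([[0.05, 0.40, 0.13],
                    [0.31, 0.24, 0.98],
                    [0.77, 0.52, 0.61]])      # sampled analog intensities in [0, 1]

levels = 256                                   # 8-bit quantization
digital = np.round(samples * (levels - 1)).astype(np.uint8)
print(digital)                                 # the 2D array of quantized numbers
```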

(Refer Slide Time: 12:58)

And in this case, I have already shown that the image is represented by M × N; M × N is the
size of the image. So here I have shown the rows: I am considering M number of rows and N
number of columns, going from 0 to N − 1 and from 0 to M − 1.

And this is the origin of the image, that is, the point (0, 0). And I have shown the pixels; these
are the pixels of the image.

(Refer Slide Time: 13:34)

Now, let us consider the neighborhood of a pixel. So, in this figure, I have shown a particular
pixel, the pixel P. And for this pixel P, I have shown two cases; one is the 8-neighborhood I
have considered.

Corresponding to the first case, I am considering the neighborhood pixel the P1, P2, P3, P4,
P5, P6, P7, P8, that is called the 8 connected neighborhood. In the second case, if you see the
neighborhood pixel I am considering P2, P4, P6, P8. Corresponding to the center pixel, the
center pixel is x , y ; this is center pixel.

So, the first one is the 8-connected neighbourhood and the second one is the 4-connected neighbourhood, which I can write as N4(P).

The 4-connected neighborhood of the pixel (x, y) consists of the pixels {(x, y − 1), (x, y + 1), (x + 1, y), (x − 1, y)}.

Similarly, you can define the 8-connected neighborhood N8(P), which also includes the four diagonal neighbors.
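A minimal Python sketch of these two neighborhoods, with (row, column) coordinates and no boundary checking, just to make the definitions concrete:

def n4(x, y):
    # 4-connected neighborhood N4(P): the pixels above, below, left, and right of (x, y)
    return [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]

def n8(x, y):
    # 8-connected neighborhood N8(P): N4(P) plus the four diagonal pixels
    diagonals = [(x - 1, y - 1), (x - 1, y + 1), (x + 1, y - 1), (x + 1, y + 1)]
    return n4(x, y) + diagonals

print(n4(2, 2))   # [(1, 2), (3, 2), (2, 1), (2, 3)]
print(n8(2, 2))   # the four pixels above plus (1, 1), (1, 3), (3, 1), (3, 3)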

(Refer Slide Time: 15:31)

Now, you can see, corresponding to this neighborhood, I can determine the Euclidean
distance, the city block distance, and also the chessboard distance. So how to define the
distances?

The Euclidean distance dE between two points (i, j) and (k, l) is defined as dE = [(i − k)^2 + (j − l)^2]^(1/2).

Another distance is the L1 distance, which is called the city block distance and is written as the d4 distance. The city block distance between (i, j) and (k, l) is d4 = |i − k| + |j − l|.

Another distance is the chessboard distance, sometimes called the Chebyshev distance. The d8 distance between two pixels (i, j) and (k, l) takes the maximum, d8 = max(|i − k|, |j − l|).

So, I have defined these distances: the Euclidean distance, the city block distance, and the chessboard distance.

So, corresponding to these distances you can see, in the first figure, this figure, I have
determined the distance between the center pixel, the center pixel is this pixel and the
neighborhood pixels. You can see the distance is 1.41 to the diagonal pixel and to this pixel it
is the distance is 1, to this pixel it is 1.41; like this I am getting the distances. All the
distances I can compute to the neighborhood pixel.

Similarly, if I consider the city block distance corresponding to the center pixel, the center
pixel is this, I can find a distance between the neighborhood pixel. So, if I find the distance
between the center pixel and this pixel the distance will be 2. This distance will be 1, this
distance will be 2, like this, I am determining the distance. I am calculating the distance.

In the third case, I am considering the chessboard distance. So, corresponding to the center
pixel, this center pixel, I am determining the distance between the neighborhood pixels and in
this case, I am getting 1 1 1 1, like this distance I am getting. So, you can see these distances I
can compute.

(Refer Slide Time: 19:26)

So I can give one example. Suppose I want to determine the Euclidean distance between two points A and B: d(A, B) = √((xA − xB)^2 + (yA − yB)^2).

Suppose xA is 70, xB is 330, yA is 40, and yB is 228. Then the Euclidean distance is √((70 − 330)^2 + (40 − 228)^2), which is approximately 321.

And if you compute the L1 distance, then d(A, B) = |xA − xB| + |yA − yB|, which gives the value 448. Also, you can determine the chessboard distance.

The chessboard distance d(A, B) between the two points takes the maximum, max(|xA − xB|, |yA − yB|), which gives the value 260. You can verify this.

So, in this example, I have shown how to calculate the Euclidean distance, the city block distance, and the chessboard distance. These distances are important when we consider the neighborhood of a particular pixel.
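A small Python sketch of the three distance measures, reusing the example points A = (70, 40) and B = (330, 228) from above:

import math

def euclidean(a, b):                 # dE
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def city_block(a, b):                # d4, the L1 distance
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def chessboard(a, b):                # d8, the Chebyshev distance
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

A, B = (70, 40), (330, 228)
print(euclidean(A, B))               # about 321
print(city_block(A, B))              # 448
print(chessboard(A, B))              # 260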

(Refer Slide Time: 21:47)

So, what is a digital image? A digital image is nothing but a 2D array of numbers representing the sampled version of an image. The image is defined over a grid, and each grid location is called a pixel.

So, I have shown the pixels; the image is represented over a finite grid and its intensity data is represented by a finite number of bits. That concept I can explain like this.

(Refer Slide Time: 22:16)

Suppose the image is represented as M × N; the size of the image is M × N, where M is the number of rows and N is the number of columns. And suppose I am considering L grayscale values, from 0 to L − 1, which can be normalized between 0 and 1.

Suppose the number of grayscale levels is L = 2^K. Then the number of bits required to store an image is equal to M × N × K.
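A quick sketch of this storage calculation, assuming a 512 × 512 image with 8 bits per pixel:

# Number of bits needed to store an M x N image with K bits per pixel.
M, N, K = 512, 512, 8                # rows, columns, bits per pixel (L = 2**K = 256 levels)
bits = M * N * K
print(bits, "bits =", bits // 8, "bytes")   # 2097152 bits = 262144 bytes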

Now, what is the definition of the dynamic range? Suppose I have an image and I know its highest and lowest pixel values. The difference between these is called the dynamic range. The contrast of an image depends on the dynamic range; for a good-contrast image, the dynamic range will be larger.

Now, the resolution of an image. How do we define the resolution of an image? One is the spatial resolution, which means the number of pairs of lines per unit distance. This is the definition of the spatial resolution.

Another one is the intensity resolution, defined as the smallest discernible change in the intensity level. So, if I consider 8 bits, 2^8 = 256; that means I am using 256 intensity levels.

So spatial resolution means the number of pairs of lines per unit distance, that is, how many pixels per unit distance.

And intensity resolution means the smallest discernible change in the intensity level, that is, how many intensity levels I am using. If I consider a video, I also have to consider how many frames per second; that corresponds to the temporal resolution.

So, for video I have temporal resolution, and for an image I have spatial resolution and intensity resolution. And a binary image is represented by 1 bit per pixel. That concept I can show you.

(Refer Slide Time: 26:22)

So here, I have shown the pixels and the intensities. This is the two dimensional array of
numbers and here, I have shown the pixel value 255, 18, 19. So these are the pixel values.

(Refer Slide Time: 26:57)

So already I have defined the resolution of an image, one is spatial resolution, another one
intensity resolutions. So, suppose if I consider, suppose 0 to 255 levels; 0 to 255 levels. So
how many, how many bits we need?

For this, we need 8 bits. And if I consider 0 to 127 then I need 7 bits; if I consider 0 to 63
then in this case, I need 6 bit; if I consider 0 to 31 then in this case, I need 5 bits; if I consider
0 to 15 then in this case, I will need 4 bits; if I consider 0 to 7 then I need 3 bits; if I consider

0 to 3, I need 2 bits; and if I consider 0 to 1, only two levels, 0 to 1 means two levels, so I
need only one bit.

So, this is called the binary image. A binary image needs only 1 bit per pixel, while a grayscale image is represented by 8 bits.
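As an illustrative sketch (not from the lecture), the intensity resolution of an 8-bit image can be reduced to k bits like this, assuming a NumPy array with values 0 to 255:

import numpy as np

def reduce_gray_levels(img, k):
    # Keep only 2**k gray levels by quantizing the 0..255 range into equal steps.
    step = 256 // (2 ** k)
    return (img // step) * step

img = np.arange(256, dtype=np.uint8).reshape(16, 16)   # a toy 16 x 16 gradient
print(np.unique(reduce_gray_levels(img, 1)))           # 2 levels: a binary-like image
print(len(np.unique(reduce_gray_levels(img, 4))))      # 16 levels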

(Refer Slide Time: 28:19)

And the image is represented by f(x, y). What is f(x, y)? It is the intensity at the position (x, y). This is the definition of an image.

(Refer Slide Time: 28:33)

So here, I have shown the image as a function because image is nothing but this f (x, y); f (x,
y). So, the intensity value I have shown f (x, y) here.

(Refer Slide Time: 28:46)

And there is another definition of a digital image. In this definition, I have shown the rows
and the columns. So, I have shown the pixels. So, in this definition, if you see, the image is
represented by these parameters, one is x and y coordinate; that is the spatial coordinate. And
if I consider the stereo image, then one important parameter is the depth information. So that
corresponds to the depth information.

So, suppose if I have two images obtained by two cameras, so from two images, I can
determine the depth information. So, in this case, the z represents the depth information. And
if I consider the color image, so lambda means the color information. And if I considered a
video, video means the number of frames per second. So I have to consider temporal
information, so t is the temporal information.

So the image is represented by (x, y, z, λ, t); x and y are the spatial coordinates, z is the depth information, λ is the color information, and t is the temporal information. That is one definition of a digital image.

(Refer Slide Time: 29:54)

And some typical values: the number of rows may be 256, 512, 525, or 625; the number of columns 256 or 512; and the number of gray-level values 2, 64, or 256.

(Refer Slide Time: 30:15)

And in this case, I have shown the dimensionality and resolution of an image. The image is 2-dimensional, while the video is 3-dimensional: the x and y coordinates plus the number of frames per second, that is, the temporal information.

(Refer Slide Time: 30:41)

So in this example, I have shown, I have changed the spatial resolution. If you see the first
image, the first image is 256 × 256 pixels; the second image is 128 × 128, this image; the
third image is 64 × 64.

So that means I am decreasing spatial resolution and because of this, I am getting some
effects like checkerboard effect. This effect I have observed here and this is called a
checkerboard effect because I am decreasing the spatial resolution of an image.

(Refer Slide Time: 31:13)

Now, I am considering changing the intensity resolution that is called the gray-level
resolution. So, first example is I am considering the 8 bit image, next one is the 4 bit image,
next one is 2 bit image, and 1 bit image is nothing but the binary image. So I am changing the
gray-level resolution the intensity resolution.

(Refer Slide Time: 31:36)

This is another example. The first image has 16 gray levels, the second one 8 gray levels, the next one 4 gray levels, and the last one 2 gray levels, that is, a binary image. If you see, this smooth portion is not properly represented in the images with fewer gray levels. This effect is called the false contouring effect.

So, I have two effects: the checkerboard effect, which is due to reduced spatial resolution, and the false contouring effect, which is due to an insufficient number of intensity levels. I have given these two examples.

(Refer Slide Time: 32:17)

The image types I have shown here. I have RGB image, the grayscale image, and the black
and white image that is the binary image. So, in case of the color image, corresponding to a
particular pixel, I have 3 components; one is the R component that is the red, another one is
the G component that is the green; and the blue component that is the B.

So RGB value that is the primary colors. So, corresponding to a particular pixel, I have 3
components; R component, G component, and the Blue component. And this pixel, particular
pixel is called the vector pixel, because I have 3 components.

(Refer Slide Time: 32:57)

So, this is one example of a binary image; a binary image contains only 1s and 0s. If I apply an edge detection principle, then I will get a binary image. So, this is one example of the binary image.

(Refer Slide Time: 33:16)

The next example is the grayscale image. So, in this case, I am considering 256 numbers of
intensity levels and if you see, it is the grayscale image.

(Refer Slide Time: 33:26)

And after this, I have considered the color image. So, this is the color image because this is
called the vector pixels. In the vector pixels, I have three components; R component, G
component, and the blue component.

Now, I have different types of color models. I will discuss the color models like the RGB color model, the YCbCr color model, and the HSI color model when I discuss color image processing. Because I need 8 bits for the R component, 8 bits for the G component, and 8 bits for the B component, it is called a 24-bit color image.

(Refer Slide Time: 34:02)

And this is the example of the true color image, here you see corresponding to a particular
pixel, I have the red pixels, the green pixels, and the blue pixels. So that means, pixel means
the red information, green information, the blue information corresponding to a particular
pixel. The red value, the green value and the blue value.

(Refer Slide Time: 34:22)

Another type is the indexed image. For the indexed image, I have a color map, which is something like a lookup table. Corresponding to each index, three values are available: an R value, a G value, and a B value.

So, corresponding to this particular RGB value, the index number is 6, and like this I have all the index numbers. So, this is the indexed image: I need a color map, and in the color map each index number is associated with the color values R, G, and B. Corresponding to index 6, I have these RGB values.

(Refer Slide Time: 35:14)

So, the types of digital image: binary image, 1 bit per pixel; grayscale image, 8 bits per pixel; true color or RGB image, 24 bits per pixel; and indexed image, 8 bits per pixel.

(Refer Slide Time: 35:29)

So the next question is: what is image processing? Image processing is the manipulation and analysis of a digital image by digital hardware, for better clarity, that is, for human interpretation and for automatic machine processing of the scene data.

(Refer Slide Time: 35:52)

So, I have given some examples here. One is image enhancement, which means improving the visual quality of an image. If you compare the two images, the input image and the output image, you can see that I am improving the visual quality of the image.

So, some of the examples like I can increase the contrast of an image, I can improve the
brightness of an image. Like this, I can do many things, that is one example.

(Refer Slide Time: 36:27)

In this example, another example, I am doing the image enhancement. So first one is the
input image, the second one is the better-quality image that is the enhanced image.

(Refer Slide Time: 36:39)

In this example, I am considering the blurred image. So blurred image maybe because of
some motion blur. Suppose, the camera is moving, then in this case, I am getting the blurred
image. And in this case, I can apply some image restoration technique to deblur the image.

So, in this example, I have the blurred image and I can apply some image restoration technique to get the deblurred image.

(Refer Slide Time: 37:05)

In this example, I have considered a noisy image and after this, I am getting the filtered
image, a better-quality image.

(Refer Slide Time: 37:14)

Now, the image processing, what is actually image processing now I can show. Because
image is represented by f (x, y), what is f (x, y)? Intensity at the point, the point is x, y. Now,
in this case, I have g (x, y); g( x, y) is the output image and f (x, y) is the input image. So, I
can change the range of an image or I can change the domain of an image.

What do you mean by the range of an image? Changing the range means changing the intensity values, the pixel values, by some transformation. Changing the domain of an image means changing the spatial positions, the x and y coordinates.

So, in this example, in the first case I am changing the range of the image, that is, the intensity values; in the second case I am changing the spatial coordinates.

(Refer Slide Time: 38:14)

So, I have shown two examples. In the first one, I take the input image and change the range of the image, that is, the pixel values. In the second example, I take the input image and change the domain of the image; in this case, I am doing scaling of the image.

(Refer Slide Time: 38:37)

This is one transformation, here if you see I am doing some mapping. This pixel is mapped to
this point, this pixel is mapped to this point. So, this is my, this is my input image and this is
my output image. So, I can do the mapping, that means, I am changing the spatial
coordinates.

(Refer Slide Time: 38:55)

I can do rotation on an image, this is the rotation of an image. I can do the scaling of an
image. That means, I can change the domain of an image, I can say the spatial position of the
pixels.

(Refer Slide Time: 39:06)

Now, the linear point operation on an image. I can give one example here. Here, g(n) is the output image and f(n) is the input image, and I am considering one operation h, which is called a point operation, the linear point operation. A point operation is memoryless; we do not need any memory.

Now, in this case, the output image is g(n) = P × f(n) + K, where K is the offset and P is the scaling factor. Now, based on this point operation, how do we get the negative image?

If I consider P = −1 and K = L − 1, where L is the number of gray levels, from 0 to L − 1, then the output is g(n) = −f(n) + L − 1.

So, if I consider this transformation, then in this case, I will be getting the negative image. So,
this is my input image and then this is my output image.
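A minimal sketch of this negative transformation, assuming an 8-bit grayscale NumPy array with L = 256 levels:

import numpy as np

L = 256
f = np.array([[0, 64], [128, 255]], dtype=np.uint8)    # toy 2 x 2 input image
g = (L - 1) - f.astype(np.int32)                       # g(n) = P*f(n) + K with P = -1, K = L - 1
print(g)                                               # [[255 191] [127   0]]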

(Refer Slide Time: 40:11)

Also, I can apply a nonlinear point operation, for example a logarithmic operation. I am going to discuss these point operations after two or three lectures, when I discuss image enhancement techniques. So, this is about nonlinear and linear point operations on an image.

(Refer Slide Time: 40:36)

Here, I am giving one example, the addition of images. So, n number of images I am
considering. After this, I am doing the addition, addition of n number of images. The second
example is the subtraction of two images.

(Refer Slide Time: 40:51)

What happens if I add images? Suppose I have a number of noisy frames of the same scene. If I do the averaging, then the noise can be reduced; this is image averaging for noise reduction.

So, I am summing the n input images and averaging them, and this example shows image averaging for noise reduction.
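A short sketch of image averaging for noise reduction, using synthetic noisy frames (in practice the n frames would come from the camera):

import numpy as np

rng = np.random.default_rng(0)
clean = np.full((64, 64), 100.0)                        # constant "scene"
frames = [clean + rng.normal(0, 20, clean.shape) for _ in range(16)]
average = np.mean(frames, axis=0)                       # average of the 16 noisy frames

print(np.std(frames[0] - clean))   # noise of a single frame, about 20
print(np.std(average - clean))     # roughly 20 / sqrt(16) = 5 after averaging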

(Refer Slide Time: 41:22)

And I have already explained how to find the difference between two images. As an example, consider these two images: in the first, only the background is available; in the second, both the background and the foreground are available.

In the first image there are no moving objects, while in the second image the background is there along with a moving object.

So, if I subtract the first image from the second image, I will get the foreground; that is, I can determine the moving objects. This is called the principle of change detection: subtract the images and you get the moving objects in the image.
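A minimal sketch of this change detection by subtraction, with an assumed threshold value:

import numpy as np

def detect_change(background, current, threshold=30):
    # Absolute difference between the two frames, thresholded to mark moving pixels.
    diff = np.abs(current.astype(np.int32) - background.astype(np.int32))
    return (diff > threshold).astype(np.uint8)          # 1 where a moving object is present

background = np.zeros((4, 4), dtype=np.uint8)
current = background.copy()
current[1:3, 1:3] = 200                                 # a small "moving object"
print(detect_change(background, current))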

(Refer Slide Time: 42:11)

Like this, we can do some geometric image operations. I can do image translation: I change the coordinates, here written as n1 and n2 instead of x and y. So (n1 − b1, n2 − b2) means I am translating the image.

In the second example, the coordinate n1 is divided by c and the coordinate n2 is divided by d; that means I can zoom the image, zoom in and zoom out. I will explain zooming later on.

So, this is also a geometric image operation, and this is one example of image rotation: I can rotate the image. For the rotation, the transformation matrix is

[ cos θ   −sin θ ]
[ sin θ    cos θ ]

So, by using this transformation matrix you can rotate an image.
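A small sketch of rotating pixel coordinates with this 2 × 2 matrix (rotation about the origin; a complete implementation would also interpolate the intensity values):

import numpy as np

def rotate_points(points, theta):
    # 2 x 2 rotation matrix [[cos, -sin], [sin, cos]] applied to (x, y) coordinates.
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return points @ R.T

pts = np.array([[1.0, 0.0], [0.0, 1.0]])
print(rotate_points(pts, np.pi / 2))   # approximately [[0, 1], [-1, 0]]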

(Refer Slide Time: 43:21)

In this example I have shown zooming. After zooming, I have done some interpolation: in the first case nearest neighbour interpolation, and in the second case bilinear interpolation, which is more accurate. So, I have shown two interpolation techniques applied after zooming.
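A short sketch of zooming with the two interpolation methods, assuming OpenCV is available and img is a grayscale NumPy array:

import cv2
import numpy as np

img = (np.arange(16, dtype=np.uint8).reshape(4, 4)) * 16   # toy 4 x 4 image
zoom_nn = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_NEAREST)
zoom_bl = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_LINEAR)
print(zoom_nn.shape, zoom_bl.shape)    # both (16, 16); the bilinear result looks smoother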

(Refer Slide Time: 43:40)

So, these are the image processing steps. The first is image acquisition. After this, I have to get the digital image, for which I have to do the sampling and the quantization. Compression is important because raw images and raw video are very bulky; so, for storing images and video, or for transmitting them, I have to do image compression and video compression.

Next one is the image enhancement. Image enhancement is to improve the visual quality of
an image. So, one example is, I can improve the contrast of an image or maybe I can improve
the brightness of an image. That is image enhancement.

And restoration has a similar aim: to improve the visual quality of an image. But there is a basic difference between enhancement and restoration. Image enhancement is mainly subjective; it depends on the subjective preference of an observer.

So, I can say the image is not good, the contrast is not good, and based on my observation I can improve the visual quality of the image. That is subjective, and there is no general mathematical theory for image enhancement.

But in case of the image restoration, I have mathematical models. So, I can give one or two
examples like this. Suppose I am taking some image and camera is moving, then in this case,
I am getting the blurred image. So, in this case, I can deblur the image by using the
restoration principle, the image restoration principle.

So, for this I have to develop a mathematical model and based on this mathematical model, I
can improve the visual quality of the image. So, for this, I have to apply image restoration.
That is image restoration is objective and image enhancement is subjective, the subjective
preference of an observer.

I can give another example of the image restoration. Suppose the object is not properly
focused then in this case, I am getting the blurred image. Then also I can improve the visual
quality of the image. I can de-blur the image.

And suppose I take an image in a foggy environment; then also I do not get a good-quality image, and by using the image enhancement or restoration principles I can improve its visual quality. So, you see the fundamental difference between image enhancement and restoration.

In image enhancement there is no general mathematical model, but in image restoration I can develop a mathematical model and, by using this mathematical model, I can improve the visual quality of an image.

The next step is image segmentation, which is the partitioning of an image into connected homogeneous regions. I will discuss image segmentation later on.

After this, for computer vision, I have to extract features, and based on the features I go for recognition, that is, object recognition, which is called image interpretation or understanding.

(Refer Slide Time: 46:57)

So, this is a typical image acquisition system. We have the CCD, the charge-coupled device camera. Here you see the CCD chip; the photons are converted into an electrical signal. After this, I get the video signal, that is, an analog signal, and the analog signal can be converted into digital form, as I have already explained.

(Refer Slide Time: 47:16)

So, this is the image sensor. The light photons are converted into an electrical signal; after this, we do the sampling and hold, and then we can convert the analog signal into a digital image.

(Refer Slide Time: 47:30)

For still images, 512 × 512 and 256 × 256 are some typical sizes. For video, typical sizes are 720 × 480 and 1024 × 768, the latter being used for high-definition TV. Regarding intensity resolution, 8 bits is sufficient for most purposes, but the best applications, such as television production and printing, need 10 bits.

And for medical imaging, we generally consider 12- to 16-bit images. Medical imaging is one important application where we need more bits, that is, more intensity levels.

(Refer Slide Time: 48:16)

And you can see why we need compression: raw video is very bulky. In this example, consider an uncompressed digital video with a frame size of 1024 × 768. If I consider 24 bits per pixel, because it is a color image, and 25 frames per second, that means about 472 Mbps, which is very difficult to store and also to transmit. That is why compression of the image and the video is important. So, I have given this example.
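A quick check of the quoted bit rate:

# Uncompressed bit rate of a 1024 x 768, 24-bit, 25 frames-per-second video.
width, height = 1024, 768
bits_per_pixel = 24
frames_per_second = 25
bit_rate = width * height * bits_per_pixel * frames_per_second
print(bit_rate / 1e6, "Mbps")          # about 472 Mbps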

(Refer Slide Time: 49:00)

And image enhancement, as I have already explained, is to improve the visual quality of an image: we can improve the contrast, sharpen the edges, and remove noise.

(Refer Slide Time: 49:16)

Enhancement and image restoration I have already explained. Next is reconstruction: we mainly have 2D images, so if I want 3D information, I have to reconstruct it from the 2D projections; that is called reconstruction.

This concept, image reconstruction from 2D projections, I am going to explain after two or three lectures when I discuss the Radon transform. Reconstruction of 3D information from 2D projections is used in the CT scan, that is, computed tomography.

After this, we have to extract image features, for example color features, texture features, shape features, and motion features. And finally, we go for recognition or classification of the objects, that is, machine learning. So, this is feature extraction and recognition.

(Refer Slide Time: 50:30)

So, these are some examples, the image filtering. Original image, the noisy image, and the
filtered image. So, this is one example.

(Refer Slide Time: 50:36)

Second example is the image restoration. Degraded image after processing I am getting the
restored image. That is a very good quality image, this is the noisy image.

(Refer Slide Time: 50:47)

And for contrast improvement, there is a technique. This technique also I am going to explain
later on. That is by histogram equalization technique, I can improve the visual quality of an
image. That means I can improve the contrast of an image. So, this concept I am going to
explain when I will discuss the image enhancement principles.

(Refer Slide Time: 51:08)

This is one example of contrast enhancement. The input is not a good-contrast image, but the processed result is a good-contrast image. So, by using the histogram equalization technique, I can improve the contrast of an image.

(Refer Slide Time: 51:23)

And after this, I have to extract features. The features may be something like edges I can
consider, the boundary of an objects I can consider. So, I can extract many, many features
from the image or maybe from the video.

(Refer Slide Time: 51:38)

And after feature extraction this one example is the edge detection technique. So, this is my
input image and I am extracting the edges of the image.

(Refer Slide Time: 51:45)

And the segmentation is the partitioning of an image into connected homogeneous region.
This homogeneity may be defined in terms of the gray value, color value, texture value, shape
information, and the motion information. So, based on this principle, based on these values, I
can define the homogeneity.

I am going to discuss the segmentation principle again in my next classes. Suppose I want to partition an image into several partitions; each partition is a homogeneous region, and homogeneity can be defined in terms of gray value, color, texture, and so on. Within a region, the gray value is almost the same, or maybe the color is the same. Like this I can do the partitioning of an image.

(Refer Slide Time: 52:47)

So, this is one example of the segmentation, image segmentation. This is my image, input
image and after this, I am doing the segmentation.

(Refer Slide Time: 52:54)

And finally, after feature extraction, I will go for the object recognition. So, this is my last
step that is I have to extract the features. The features maybe the color is a features, texture
maybe the features. So, I can extract different types of features from the image and after this,
I will go for the object recognition.

(Refer Slide Time: 53:16)

So first one is the feature extraction, feature model matching, hypothesis formation, and
object verification, I have to do this.

(Refer Slide Time: 53:27)

And finally, I will go for the image understanding. So maybe I can apply some supervised
technique and that is the part of the artificial intelligence or maybe the machine learning. So,
I am going to discuss about this in my last lecture the machine learning. So, this is the final
step the image understanding.

So, in this class, I have discussed about the concept of image processing, what is the meaning
of the digital image. And after this, I have discussed about the types of digital image, the
RGB image, the grayscale image, the binary image, index image. So, I have discussed about
this.

And also, the concept of the resolution, resolution of an image. So first, I discussed about the
spatial resolution, after this, I discussed the intensity resolution. After this, I discussed about
the image processing techniques. So, I can change the domain of an image, I can change the
range of an image. So, I have given some examples I can do the rotation of an image, I can do
the scaling of an image. Like this I can do many operations, I can do zooming also.

After this, I have considered the typical image processing system, the computer vision
systems. So first, I have to do some image processing and in this case, before image
processing, I have to get the image. The analog image can be converted into digital image by
sampling and the quantization.

After this, I can improve the visual quality of the image by using the enhancement and the
restoration techniques. And after this, I can extract the features from the image. And finally, I

will go for object recognition. That is the pattern classification or the machine learning
techniques I can apply.

So, this is about my second lecture. In the next class, I am going to explain the concept of the image formation principle, and I will also discuss radiometry; radiometry means the measurement of light. So, let me stop here today. Thank you.

Computer Vision and Image Processing
Professor M.K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture 3
Image Formation - Radiometry

Welcome to NPTEL MOOC's course on Computer Vision and Image Processing,


Fundamentals and Applications. In my last class, I discussed the concept of image
processing. Image is represented by f (x, y); f (x, y) means the intensity at a particular point.
The point is (x , y).

What is a digital image? A digital image is nothing but a 2D array of quantized intensity values, and digital image processing means the manipulation of these numbers. So, in my last class, I discussed two operations: changing the range of an image and changing the domain of an image.

f (x, y) is the image, so I can change the range of an image, and also I can change the domain
of an image. Suppose, if I change the pixel values that means, I am changing the range of an
image. And if I consider, suppose zooming of an image or if I do the scaling of an image or
maybe if I do the rotation of an image, that means, I am changing the domain of an image.

Also, I discussed about the concept of image enhancement, image restoration, image
reconstruction. So, image enhancement means to improve the visual quality of an image. So
today, I am going to discuss about the image formation concept, and first I will discuss the
concept of radiometry.

Radiometry means the measurement of light. You know that the pixel intensity value in an image depends on the amount of light reflected by the surface; that is why the measurement of light is quite important, and that measurement is called radiometry. I can also take into account the sensitivity of the camera and the sensitivity of the human eye; that is called photometry. So, you can see the difference between photometry and radiometry.

Now, I will discuss the concept of image formation and after this, I will discuss some
radiometric parameters.

(Refer Slide Time: 02:35)

So, in my next slide, you can see I have shown one image formation principle. So, I have
considered the point of light source, that is, the input light distribution, I am considering I. I
am considering the imaging system; the imaging system is characterized by PSF. The PSF is
called the point spread function. PSF is the point spread function and I have, I am getting
image in the image plane, that is, the output distribution.

The object function is an object or a scene that is being imaged. Light, the light from the
source is incident on the scene or the object surface and it is reflected back to the camera by
the imaging system. The point spread function that is nothing but the impulse response of the
system, and in this case, if I consider the point light source, PSF actually indicates the
spreading of the object function and it is a characteristic of the imaging system; the imaging
system is the camera.

A good imaging system generally has a narrow PSF, whereas a poor imaging system has a broad PSF. So, in this diagram, you can see that the output image is nothing but the PSF convolved with the object function, plus noise. So, in this image formation principle, the PSF is quite important: it is a characteristic of the imaging system, the camera, and I am also considering the noise of the imaging system.

(Refer Slide Time: 04:11)

And in this figure, I have shown the image formation concept that is the radiometry. So, light
is coming from the source and it is reflected by the surface. So, I have the optics; optics
means the, it is the camera lens and I am getting the image in the sensor. The sensor converts
the optical photons into electrical signal.

So, this, the pixel intensity depends on the amount of light reflected by the surface. So, that is
why I have to determine the amount of light reflected by the surface and the surface property,
that is called the reflectance property of the surface, it is also called the albedo of the surface.
So, this measurement is quite important. The measurement of light, and also the surface
property, the reflectance property of the surface. So, in this diagram, I have shown the image
formation concept.

So, light is coming from the source and it is reflected by the surface and I have the optics that
is the lens. And I am getting the output, that means, in the sensor, I am getting the response.
So, that, the light photon is converted into electrical signal. So here, if you see, the light is
coming from the source.

Now, the surface property is important. This is the surface; the surface property is called the
reflectance property of the surface. Sometimes, it is called the albedo of the surface. So, light
is reflected by the surface and we have the optics, and I am getting the electrical signal in the
sensor. So, this is a typical image formation process.

(Refer Slide Time: 05:47)

For image formation, the concept of the radiometry; radiometry means the measurement of
light, so in this case, I have to determine the brightness. That means, I have to do the
measurement; the measurement of the light. And another one is the interactions between the
light and the surface that I have to consider.

So, in my next diagram, you can see. In this case, I have shown one plane and, in this case, I
am considering the spherical coordinate system. So, in the next slide, you can understand this
concept.

(Refer Slide Time: 06:14)

So, I am going to the next slide. So let us discuss about some radiometric quantities.

(Refer Slide Time: 06:19)

So, this diagram here, if you see here, so if I consider a point, a particular point, that point has
a tangent plane. So, if I consider this point here, this is the point, this point has a tangent
plane and thus a hemisphere. So, this is the hemisphere I am considering.

Now, I have shown, this is the incoming light; θi and φ i, that is the incident angle. And if I
consider this one, this is the outgoing angle, that is, light is reflected by the surface. So, this is
the incoming ray and this is the outgoing ray. So, in this case, I am considering the polar
coordinates. The polar coordinate is θ and φ .

(Refer Slide Time: 06:56)

Now, let us define the angles. In this figure, the first one is the plane angle. You know the definition of the plane angle: θ = s/r radians. In the second case, I am considering the concept of the solid angle; the solid angle is A/r² steradians, as I will show you.

So what is the definition of the solid angle? Suppose dA is a small area at a distance r. Then the solid angle is defined as dω = dA/r², and the unit is the steradian.

You know that the area of a sphere is 4πr². Therefore, what will be the total solid angle? The total solid angle will be equal to 4π steradians, because the area of the sphere is 4πr².

So in this case, I have shown this diagram. If you see here, the solid angle in spherical
coordinate system. So the solid angle subtended by a region at a point is the area projected on
a unit sphere centered at that point. So this definition I can see in the next slide, what is the
meaning of this.

(Refer Slide Time: 08:49)

So, I am considering the solid angle subtended by an object from a point P, using a unit sphere of radius 1 centered at P and the projection of the object onto this unit sphere.

So the solid angle subtended by an object from a particular point P is the area of the projection of the object onto the unit sphere centered at the point P. This definition of the solid angle is very important because, based on it, you can define the radiometric quantities.

(Refer Slide Time: 09:36)

And one important point is here. A source is tilted with respect to the direction in which the
illumination is traveling. It looks smaller to a patch of surface. Similarly, as a patch is tilted

with respect to the direction in which the illumination is traveling, it looks smaller to the
source. This effect is called a foreshortening effect.

Suppose I consider a tilted surface. If I observe it from the side, say with my camera, then, since the surface is tilted with respect to my viewing direction, a small area dA on it looks smaller. Similarly, if I consider a light source with the light travelling in a given direction, a small tilted area looks smaller to the source. This is called the foreshortening effect.

So, to account for the foreshortening effect, I am considering the factor cos θ; cos θ is basically the foreshortening factor. In this diagram, you can see the effective area: if the actual area is dA, the effective area seen from the point P is dA cos θ. That is why the solid angle will be dA cos θ / r² steradians. So, this foreshortening effect is very important.

(Refer Slide Time: 11:11)

Now, let us define some radiometric quantities. The first quantity is the radiant intensity, I = dφ/dω. What is the meaning of this? Suppose I am considering the optical flux φ, whose unit is the watt.

Now, let us consider this diagram. This is a surface, and this is the light source. Light goes from the source and is reflected by the surface, that is, by the object I am considering.

Based on this, what is the definition of the radiant intensity? The radiant intensity of the source is I = dφ/dω, the amount of flux per unit solid angle. So, corresponding to the source, I am defining the radiant intensity.

After this, the light falls on the surface, and at that point I can determine the irradiance, that is, the illumination. The irradiance is E = dφ/dA, the amount of flux falling on the area dA. What will be the unit? The unit will be watt per meter square. After this, the light is reflected by the surface, so now I consider the surface itself as a source.

Now, again, I want to determine the radiance. What is the definition of the radiance? Since I now consider the surface as a source, the radiance is L = dE/dω, and if I substitute E = dφ/dA, this becomes L = d²φ/(dω dA).

So, this is our definition of the radiance; from the irradiance, you can determine the radiance. And if I consider the foreshortening effect, then I have to use the foreshortened area, so L = d²φ/(dω dA cos θ); the dA cos θ appears because of the foreshortening effect.

So these are the basic parameters. First I have to understand the concept of the solid angle; after this I have defined the radiant intensity, then the irradiance, which means the illumination at a particular point, and then the radiance. These parameters have to be understood first; they are the parameters for the measurement of light.

(Refer Slide Time: 14:47)

So, again, what is the radiance? Radiance measures the distribution of light in space: radiant power per unit foreshortened area per unit solid angle. That is the definition of the radiance. And I have already discussed the unit; the unit is watt per meter square per steradian (W m⁻² sr⁻¹).

So, in a vacuum, radiance leaving a particular point p in the direction of q is the same as the
radiance arriving at q from p, because there is no loss of power. So, this is one important
concept. So already I have defined the radiance.

Now, in case of the spherical coordinate, I can consider a radiance like this. So, L is the
radiance and θ and φ , I am considering the angles of the spherical coordinate system.

(Refer Slide Time: 15:32)

Radiance is constant along a straight line. In this diagram, suppose this is point 1 and this is point 2. The power leaving 1 towards 2 I can determine as the product of the radiance, the foreshortened area, and the solid angle.

Similarly, I can calculate the power arriving at 2 from 1. These two powers will be equal because there is no loss of radiance; radiance is constant along the straight line if I consider vacuum as the medium.

(Refer Slide Time: 16:31)

Now, what is the definition of an irradiance? So already I have defined the irradiance.
Irradiance means illumination. So, a surface experience radiance coming in from a solid
angle, the solid angle is d ω , then the irradiance will be radiance cos θ d ω . That is the
irradiance.

And what will be the total power arriving at the particular surface? Then in this case, I have
to calculate by adding irradiance over all incoming angles. So in this case, I am considering
all the angles I am considering. This is the solid angle, total solid angle I am considering. I
am calculating the total power arriving at the surface, so that I can determine.

(Refer Slide Time: 17:12)

So, in the spherical coordinate system, the radiance at a point P is represented by L(P, θ, φ). If a small surface patch dA is illuminated by incoming radiance through a solid angle dΩ at a given incident angle, then the irradiance can be written in terms of the radiance, with cos θ as the foreshortening factor, as dE = L(P, θ, φ) cos θ dΩ.

To get the total irradiance, the radiance is integrated over the entire hemisphere. That is why the integration is over φ from 0 to 2π and over θ from 0 to π/2, with dΩ = sin θ dθ dφ:

E(P) = ∫ from 0 to 2π ∫ from 0 to π/2 L(P, θ, φ) cos θ sin θ dθ dφ.
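As a small numerical sketch (assuming a constant radiance L over the hemisphere), the integral above reduces to πL, which can be checked with a simple Riemann sum:

import numpy as np

L = 1.0                                              # constant radiance (assumed)
theta = np.linspace(0.0, np.pi / 2, 100000)
dtheta = theta[1] - theta[0]
# E = 2*pi * integral of L * cos(theta) * sin(theta) d(theta), since the phi integral gives 2*pi
E = 2 * np.pi * np.sum(L * np.cos(theta) * np.sin(theta)) * dtheta
print(E, np.pi * L)                                  # both approximately 3.14159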

(Refer Slide Time: 18:29)

So, finally, the summary is the radiance means the energy carried by a ray. So power per unit
area perpendicular to the direction of the travel per unit solid angle. So that is the definition
of a radiance. So, what is the unit of the radiance? The unit is watt per meter square steradian.
And radiance is constant along the straight line, that is in vacuum.

And irradiance, energy arriving at a surface; incident power in a given direction per unit area.
So, the unit is watt per meter square. This is the definition of the irradiance. Next, I am considering one important derivation, the radiometry of thin lenses. So, let us consider one diagram.
radiometry of thin lenses. So, let us consider one diagram.

(Refer Slide Time: 19:15)

Suppose I am considering a surface, the lens of the camera, and the image plane. The concept I am going to explain here is the radiometry of a thin lens.

On the surface, which is the object, I consider a small area dA₀; the total area is A₀. At this point I draw the surface normal. The ray leaving the surface towards the lens makes an angle θ with the surface normal, and it makes an angle α with the optical axis of the lens; in this case, you see that α is not equal to θ.

The distance from the lens to the image plane I denote f_p, and the distance from the lens to the object point I denote f₀. So, I am considering the radiometry of the thin lens.

Now, consider the respective solid angles seen from the lens. Looking from the lens towards the object I get one solid angle, and looking towards the image plane I get another; I also consider a small area dA_p in the image plane.

Light comes from the light source, is incident on the surface, is reflected back through the camera lens, and forms the image in the image plane.

The solid angles subtended at the lens by dA₀ and by dA_p are equal, so dA₀ cos θ / f₀² = dA_p cos α / f_p². From this, you can calculate dA_p = dA₀ cos θ f_p² / (f₀² cos α). After this, I want to determine the radiance at the object point.

So, what is the radiance? By the definition of radiance, L_θ = d²φ / (dω dA₀ cos θ). From this, the flux is dφ = L_θ dA₀ cos θ dω, where dω is the solid angle subtended by the lens.

After this, the light reaches the image plane, and I get the irradiance at that point: E_p = dφ / dA_p. Putting in the values, E_p = L_θ dA₀ cos θ dω × f₀² cos α / (dA₀ cos θ f_p²) = L_θ dω f₀² cos α / f_p².

So what is the solid angle of the lens? The foreshortened area of the lens is (π/4) d² cos α, and the distance from the object point to the lens along the ray is f₀ / cos α, because x cos α = f₀ gives x = f₀ / cos α. Therefore dω = (π/4) d² cos α / (f₀ / cos α)² = (π d² / 4) cos³α / f₀².

So, after putting this value of dω into the expression, I get E_p, the irradiance at the image point:

E_p = (π d² / (4 f_p²)) cos⁴α · L_θ.

This is one important expression for the irradiance I am determining.

(Refer Slide Time: 26:59)

In this slide, I am showing the same thing. First, I equate the respective solid angles seen from the lens, as already explained, and from that I calculate dA_p. After this, I use the radiance at the object point to get dφ, and then I compute the solid angle dω of the lens.

And finally, I get E_p, the irradiance, which is the important expression E_p = (π d² / (4 f_p²)) cos⁴α · L_θ.

(Refer Slide Time: 27:40)

Now, what will be E_total, the total irradiance? In this case, I am considering radiance arriving from different angles θ1, θ2, and so on. I calculate E_p1 for L_θ1, E_p2 for L_θ2, and so on, and then I sum them to get E_total, the total irradiance at that image point.

So, like this I can calculate it. The expression I gave is the important one: the irradiance at the image plane. The image irradiance is proportional to the scene radiance L; that means the grayscale value, the pixel value of an image, depends on L.

Also, the irradiance is proportional to the area of the lens. The area of the lens is π d²/4, and the irradiance is inversely proportional to the square of the distance f_p between the lens centre and the image plane. The term cos⁴ α is very important.

It indicates a systematic optical defect of the lens, which is called vignetting. So what is the vignetting effect? Optical rays with a larger angle α are attenuated more; hence, the pixels closer to the image border will be darker. This effect is called the vignetting effect, and it can be compensated by a radiometrically-calibrated lens.

So, this is one important observation about the radiometry of a thin lens. We can calculate E_total, the total irradiance at the image plane, and it is proportional to the area of the lens, inversely proportional to the square of the distance between the lens centre and the image plane, and proportional to the term cos⁴ α and to L, where L is the scene radiance.

This cos⁴ α term is very important because it corresponds to the systematic defect of the lens that is called vignetting. Because of this, if α is more, the pixels in the border area of the image will be darker as compared to the central portion of the image.
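
As a rough numerical illustration of this cos⁴ α falloff (my own sketch, not part of the lecture), one can compute the relative irradiance at each pixel of an idealised camera and use its inverse as a flat-field gain; here f is an assumed distance from the lens centre to the image plane, expressed in pixels:

import numpy as np

def cos4_falloff(height, width, f):
    # Pixel coordinates measured from the image centre (assumed principal point).
    y, x = np.mgrid[0:height, 0:width]
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r2 = (x - cx) ** 2 + (y - cy) ** 2
    # cos(alpha) = f / sqrt(r^2 + f^2) for a pixel at radius r from the centre.
    cos_alpha = f / np.sqrt(r2 + f ** 2)
    return cos_alpha ** 4                     # relative irradiance E_p / E_centre

falloff = cos4_falloff(480, 640, f=800.0)     # hypothetical image size and distance
gain = 1.0 / falloff                          # flat-field gain to compensate vignetting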

(Refer Slide Time: 30:09)

Now, let us consider light at surfaces. There are many effects when light strikes a surface: the light may be absorbed, it may be transmitted, it may be reflected (something like a mirror), or it may be scattered; an example of scattering is milk.

We make some assumptions: the surface does not fluoresce, the surface does not emit light, and all the light leaving a point is due to the light arriving at that point. With these assumptions, light may be absorbed, transmitted, reflected, or scattered.
(Refer Slide Time: 30:44)

So, to represent the characteristics of a surface, I am defining one radiometric quantity, and that quantity is the BRDF, the bidirectional reflectance distribution function. This is basically used to describe the reflectance property of a surface.

So, what is the definition of the BRDF? It is the ratio of the radiance in the outgoing direction to the incident irradiance; the incident irradiance is considered because the light arriving at the surface is what gets reflected.

In this case, I am considering spherical coordinates: θ_i and φ_i correspond to the incident direction, and θ_o and φ_o correspond to the outgoing direction. So, written in spherical coordinates, the BRDF is L_o(θ_o, φ_o) divided by L_i(θ_i, φ_i) cos θ_i dω_i. The cos θ_i is considered because of the foreshortening effect.
(Refer Slide Time: 31:56)

Now, the radiance leaving a surface in a particular direction can be determined from the BRDF. The BRDF we have already defined; this is the incoming radiance, cos θ is the foreshortening factor, and dω is the solid angle. So from this, you can determine the radiance leaving a surface in a particular direction.

For the radiance leaving a surface due to its irradiance, I have to consider the contributions from every incoming direction. So in this case, I can calculate the radiance leaving a surface due to its irradiance by considering all the incoming angles, and that is why I am considering the integration.

The integration is there because I am considering the contribution from every incoming direction. If I consider discrete light sources, that is, a number of point light sources, then the integral is replaced by a summation. So this is how the BRDF is used.
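
For the discrete-source case just mentioned, a small sketch of the summation L_o = Σ_i f · L_i · cos θ_i · Δω_i might look like this (my own illustration, assuming for simplicity a constant Lambertian BRDF of 1/π and unit solid angles unless given):

import numpy as np

def outgoing_radiance(normal, light_dirs, light_radiances, brdf=1.0 / np.pi, solid_angles=None):
    # Sum the contributions of discrete point sources: L_o = sum_i brdf * L_i * cos(theta_i) * dw_i.
    n = normal / np.linalg.norm(normal)
    if solid_angles is None:
        solid_angles = np.ones(len(light_dirs))          # assumed unit solid angle per source
    L_o = 0.0
    for s, L_i, dw in zip(light_dirs, light_radiances, solid_angles):
        s = np.asarray(s, dtype=float)
        cos_theta = max(np.dot(n, s / np.linalg.norm(s)), 0.0)   # foreshortening, clamped at 0
        L_o += brdf * L_i * cos_theta * dw
    return L_o

# Example: two sources illuminating a patch whose normal is (0, 0, 1).
print(outgoing_radiance(np.array([0.0, 0.0, 1.0]),
                        light_dirs=[(0, 0, 1), (1, 0, 1)],
                        light_radiances=[1.0, 0.5]))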

(Refer Slide Time: 32:58)

Another important definition is radiosity. Suppose I consider a surface like cotton cloth; for cotton cloth, the reflected light does not depend on angle. That means the radiance leaving the surface is independent of angle. That is one important point.

Since the radiance leaving the surface is independent of angle, the appropriate radiometric unit is radiosity. So what is the definition of radiosity? It is the total power leaving a point on the surface per unit area of the surface, so its unit is watt per square metre.

This is independent of direction because in this case I am considering the cotton cloth. This type of surface is called a diffuse surface. We will come across two types of surface: the diffuse or Lambertian surface, which I will discuss, and the specular surface, where specular means a mirror-like surface. So one is the Lambertian or diffuse surface and the other is the specular surface, and I will explain what a diffuse surface is and what a specular surface is.

So, radiosity, you can determine from radiance, that is, if you see this expression, I am
determining the radiosity. Radiosity means total power leaving a point on the surface per unit
area on the surface. That means, sum the radiance leaving the surface over all the exit angles.
So that is why I am taking the integration and I am considering all the exit angles, total solid
angle I am considering, so I can determine the radiosity from radiance.

(Refer Slide Time: 34:24)

So if you see this expression here, it is the radiosity of a surface whose radiance is independent of angle, that is, something like the cotton cloth, the diffuse surface. In the case of a diffuse surface, the radiance leaving the surface is independent of angle; that is the definition of a diffuse surface. If I consider a mirror-like surface, this is not the case.

So here, I am calculating the radiosity. Since the radiance is independent of angle, it is constant, so I take it out of the integration. The cos θ appears together with sin θ because I am writing the solid angle in spherical coordinates, and in this spherical coordinate system the integration runs from 0 to π/2 in θ and from 0 to 2π in φ. So I am calculating the radiosity for the diffuse surface.
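
Written out, the integral being referred to on the slide (with the radiance L constant because the surface is diffuse) is, as I understand it,

\begin{align}
B = \int_{\Omega} L\,\cos\theta \; d\omega
  = L \int_{0}^{2\pi}\!\!\int_{0}^{\pi/2} \cos\theta\,\sin\theta \; d\theta\, d\phi
  = \pi L .
\end{align}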

(Refer Slide Time: 35:12)

Another important definition is the directional hemispheric reflectance. This is also a very important definition: for many surfaces, the light leaving the surface is largely independent of the exit angle. For such surfaces, I am defining one parameter, and that parameter is the directional hemispheric reflectance.

Its range is 0 to 1. So what is the definition of the directional hemispheric reflectance? It is the fraction of the incident irradiance in a given direction that is reflected by the surface, considering all the exit angles.

So the incident irradiance is in one particular direction, while the light is reflected by the surface in all directions; that is why I am considering the integration over the outgoing hemisphere. And I have already defined the BRDF, so I am just putting the BRDF here, multiplied by cos θ dω inside the integral. So you can see the relationship between the directional hemispheric reflectance and the BRDF.
(Refer Slide Time: 36:10)

So I have shown the diffuse surface, which is called the Lambertian surface. For a Lambertian, or ideal diffuse, surface, the radiance leaving the surface is independent of angle. If you see this diagram, the light is reflected back in different directions; these are the incoming rays and these are the reflected rays, and the radiance leaving the surface is independent of angle.

In the case of the diffuse surface, the directional hemispheric reflectance is called the albedo. This definition is very important. Albedo means the reflectance property of the surface; for a diffuse surface it is the diffuse reflectance, and that is the albedo.
(Refer Slide Time: 36:55)

So, for a Lambertian surface, the BRDF is constant because it is independent of direction. In this case, I can relate it to the directional hemispheric reflectance.

I have already explained the directional hemispheric reflectance. Since the BRDF is constant, I take this constant term out of the integral; the cos θ sin θ comes from the solid angle in spherical coordinates, and the integral evaluates to π. Here, ρ is the albedo, the directional hemispheric reflectance. So the BRDF is equal to the directional hemispheric reflectance divided by π. This expression is important.
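
Spelling out the step just described (my reading of the slide), with f denoting the constant BRDF of the Lambertian surface:

\begin{align}
\rho_{dh}
= \int_{\Omega} f\,\cos\theta \; d\omega
= f \int_{0}^{2\pi}\!\!\int_{0}^{\pi/2} \cos\theta\,\sin\theta \; d\theta\, d\phi
= \pi f
\;\;\Rightarrow\;\;
f_{\text{Lambertian}} = \frac{\rho_{dh}}{\pi}.
\end{align}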

(Refer Slide Time: 37:39)

So I am giving some examples of the Lambertian surface. These are Lambertian surfaces; the radiance leaving the surface is independent of angle.

(Refer Slide Time: 37:51)

And in the second example, I have given some examples of the non-Lambertian surface; non-Lambertian means specular, that is, a mirror-like surface. These highlighted portions are the specular parts.

So up till now, I have considered the definitions of some radiometric parameters. The first parameter I discussed was the radiance, and after this, the irradiance. Then I discussed the concept of the BRDF, the bidirectional reflectance distribution function.

After this, I considered the directional hemispheric reflectance, and I defined the concept of the diffuse surface, that is, the Lambertian surface, and also the specular surface. In the two examples, I have shown diffuse surfaces and specular surfaces. There are many applications in computer vision in which you have to identify the specular surface.

For example, in the case of medical image processing, if I take images or videos of the internal organs of a human, you will get some specular surfaces. These specular regions have to be removed for image segmentation. So, identification of the specular surface and the diffuse surface is important.

For this, we need to understand the definition of the albedo. Albedo is the reflectance property of the surface, and here I am considering the diffuse surface. In the next slide, I am going to show one application of this in face recognition.

(Refer Slide Time: 39:36)

If you see this slide, it is taken from a paper. Here, these are the input images, and from these input images the specular areas are being extracted.

So, if you see, the nose is a specular portion; if you see the lips, the lips are also specular. So the specular regions are determined. Similarly, in the other case, the nose portion is specular, and the specular portion is determined there too.

You can look at this paper for the details of how the specular portion of the face is extracted. So, in face recognition also, the application of specularity is important. The remaining regions of the face are the diffuse areas, while the highlights on the nose, lips, and eyes are the specular parts.

So, in this class, I have discussed the concept of radiometry, and then one important derivation, that is, image formation in a thin lens. After this, I defined some radiometric quantities like radiance, irradiance, and the BRDF, and also the directional hemispheric reflectance and the concept of specularity. So, let me finish here today. Thank you.
Computer Vision and Image Processing – Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture No. 04
Shape from Shading

Welcome to the NPTEL MOOC course on Computer Vision and Image Processing: Fundamentals and Applications. In my last class I discussed the concept of radiometry. Radiometry means the measurement of light, the measurement of electromagnetic radiation, while photometry quantifies the sensitivity of the camera or of the human eye. In my last class I mentioned some radiometric quantities like the radiant intensity of a source, and I also discussed the irradiance of a surface and the radiance.

I also discussed the types of surfaces, the Lambertian or diffuse surface and the specular surface, and for this I discussed two parameters, one is the BRDF and the other is the directional hemispheric reflectance. Today I am going to discuss another topic, that is, shape from shading. I want to determine the shape information from shading, where shading means the variable levels of darkness. So, from one image I want to determine the shape information, that is, from shading.

Before going to this concept of shape from shading, I am going to briefly go over what I discussed in the last class. The first concept I discussed was the Lambertian surface, and in this case I discussed one quantity, and that quantity is the albedo of a surface.
(Refer Slide Time: 01:54)

So, what is a Lambertian surface? Examples are cotton cloth or matte paper. For a Lambertian surface, the radiance leaving the surface is independent of angle, and this Lambertian surface is also called the diffuse surface. For the Lambertian surface, we have defined a parameter, and that parameter is the albedo, which is also called the diffuse reflectance. And for a Lambertian surface, the BRDF is independent of angle.

In this example I have shown one surface that is Lambertian. You can see here that these are the incoming light directions, and the radiance leaving the surface is independent of angle; radiance leaves in all directions. So, this is the definition of a Lambertian surface, that is, the radiance leaving the surface is independent of angle.
(Refer Slide Time: 02:43)

In this slide I have given examples of the Lambertian, that is, the diffuse surface. If you see this surface, or this one, they are examples of Lambertian surfaces.

(Refer Slide Time: 02:54)

In this example I have given the example of the specular surface, that is the non-Lambertian
surface. If you see this portion, this is the mirror-like portion, so that is nothing but the specular
surface that is the mirror-like surface. This is one example of the non-Lambertian surface.

(Refer Slide Time: 03:13)

In the last class I also discussed this concept in face recognition: how to identify the specular portion of the image. You have seen these input images, and in these input images I have two components, one is the diffuse component and the other is the specular component. The specular portions are mainly the nose region and the eyeballs. In this paper, the Lambertian and the specular components are separated: one part is the diffuse surface and the other is the specular surface, that is, the specular portion of the face, which is separated from the original image.

Similarly, in the second example, I have considered input images like this and I am determining the specular portion of each image, which is shown here. You can see this paper to understand this concept, that is, how to segment out the specular portion of the image.
(Refer Slide Time: 04:14)

Now, one standard model is the Phong model. For real surfaces we can consider both components, one is the diffuse component and the other is the specular component; that is, for real surfaces we consider Lambertian plus specular, and that is called the Phong model. In this case, you can see that the surface radiance is represented as the sum of two terms: the first one is the diffuse component and the second one is the specular component.

Here ρ_1 corresponds to the diffuse albedo and ρ_2 corresponds to the specular albedo; one is the diffuse albedo and the other is the specular albedo. In this model, the radiance leaving a specular surface is proportional to cos^n(θ_o − θ_s), and one parameter is important here, the parameter n.

If I consider large values of n, they produce a small and narrow specular lobe, which corresponds to sharp specularities. Small values of n give wide specular lobes, that is, large, spread-out specularities. That is the meaning of the parameter n: based on n, I get either sharp specularities or broad highlights with ambiguous boundaries. This Phong model is very important because in this case we consider both components, the diffuse component and the specular component.
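
As a small illustration of the two-term model above (my own sketch, not the lecture's code), one can evaluate the diffuse term plus a specular lobe for a single point source; here the lobe is written in the common reflection-vector form (r · v)^n, which plays the role of the cos^n(θ_o − θ_s) term, and rho_d, rho_s, n are assumed material parameters:

import numpy as np

def phong_radiance(normal, light_dir, view_dir, rho_d, rho_s, n):
    # Diffuse (Lambertian) term plus a specular lobe, for one point source of unit radiance.
    N = normal / np.linalg.norm(normal)
    S = light_dir / np.linalg.norm(light_dir)
    V = view_dir / np.linalg.norm(view_dir)
    diffuse = rho_d * max(np.dot(N, S), 0.0)
    # Mirror direction of the source about the normal; the lobe peaks when V aligns with it.
    R = 2.0 * np.dot(N, S) * N - S
    specular = rho_s * max(np.dot(R, V), 0.0) ** n
    return diffuse + specular

# Larger n gives a narrower, sharper highlight; smaller n gives a wider lobe.
print(phong_radiance(np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0]),
                     np.array([0.0, -1.0, 1.0]), rho_d=0.7, rho_s=0.3, n=50))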

(Refer Slide Time: 05:52)

One example I can show is specularity removal in colonoscopic images; the same idea can also be applied to endoscopic images. We have the colonoscopic images, and you can see some specular components, the mirror-like highlights. By using suitable algorithms, which I am not discussing here, we can segment out the specular components; the specular components are determined in the colonoscopic image and the results are shown. This is one application that shows why specularity is important: specularity removal in colonoscopic images.
(Refer Slide Time: 06:26)

Now, the next concept is sources and shading. One parameter I want to explain is the exitance of a source. What is the meaning of the exitance of a source? It is the internally generated power radiated per unit area of the radiating surface; that is, the power internally generated by the source. Now, a source can have both radiosity and exitance: radiosity because it reflects light, and exitance because it emits light.

So here, in this expression, I have considered the radiosity B(x) and the exitance E(x). Exitance means the power internally generated by the source, and we also have to consider the power reflected by the other surfaces. Suppose this is the source, and the source is emitting power; I also have to consider other surfaces like this, and the power coming from those surfaces. That is called inter-reflection, that is, radiosity due to incoming radiance.

So, I have to consider these two components, one is the exitance and the other is the radiosity due to the incoming radiance. And one thing is important: there are two models, one is the local shading model and the other is the global shading model. Suppose I consider a particular surface, and suppose the light sources corresponding to these points are visible.

In the local shading model, the surface has radiosity due only to the sources visible at each point. Suppose, corresponding to this point, these sources are visible; we consider the radiosity due only to the sources visible at each point, and that is called the local shading model. But in the global shading model, I have to consider the surface radiosity due to the radiance reflected from other surfaces as well.

Suppose I consider other surfaces like this; the light reflected from these surfaces also has to be considered. That is, we have to consider the surface radiosity due to the radiance reflected from other surfaces, which is the inter-reflection, also called ambient illumination. So there are two models: in the local shading model we consider only the sources, while in the global shading model I also consider the radiance reflected from other surfaces.

(Refer Slide Time: 09:13)

Now the next concept is the shape from shading. That is a very important concept. So, how to
determine shape information from shading?

(Refer Slide Time: 09:20)

So, an image is essentially 2D, but our world is 3D; relating the two is nothing but the image formation process, which I have already discussed. This is my object and this is my imaging system; the imaging system is nothing but the camera, and I have the image here in the image plane. Light is coming in this direction, and like this I am getting the image here. This process is nothing but a 3D to 2D projection.

In the case of the human visual system, the human brain reconstructs the 3D information. There are many cues: one cue is motion parallax, another one is binocular disparity. By using these cues, the human visual system recovers the shape information from the 2D images. So, what is motion parallax? I want to explain. Suppose I have this camera and this is my object, and the object is moving. Say one surface is plane 1 and another is plane 2, and plane 1 is closer to the camera than plane 2.

So, seen from the camera, plane 1 appears to move quickly as compared to plane 2; that is motion parallax. In other words: objects moving at a constant speed across the frame will appear to move a greater distance, or a greater amount, if they are closer to an observer or camera than they would if they were at a greater distance. That is motion parallax.

So, objects moving at a constant speed across the frame appear to move a greater amount if they are closer to the observer or camera than they would if they were at a greater distance. This concept I have already explained; that is motion parallax, and based on motion parallax we can get the shape information.

Another cue is binocular disparity. For binocular disparity we consider two cameras, and by using these two cameras we get two images. From these two images we can determine the disparity values, and from the disparity values we can determine the depth information. So, by using these two cues, the human visual system recovers the shape of objects in a 3D scene from the 2D images. These are the two concepts I have explained: one is motion parallax and the other is binocular vision.

In binocular vision, as I have already explained, we have two cameras and we get two images, one left image and one right image. From these two images I can determine the disparity, the disparity in observation, and after this we can determine the depth information. That is binocular vision, which we are going to discuss in later classes as stereo vision.

Now, in the shape from shading problem, I have only one image, and from this one image I want to get the shape information. So, how to do this? We will discuss now.

(Refer Slide Time: 14:42)

In this slide I have shown one example, but in this case it is not shading, it is a shadow, and we cannot reconstruct shape from a shadow alone. This is one example of a shadow: suppose light is coming in this direction, and because of this light I am getting one shadow here.

The shadow principle is something like this. Suppose I have a light source and I am considering one obstacle, one object here; then I am getting the shadow here. This portion is called the penumbra, and this portion is called the umbra. The umbra is the portion which cannot see the source at all; from these points the source is completely blocked. What is the penumbra? The penumbra is the portion which can see a part of the source; that is the definition of the penumbra. So one region is the umbra and the other is the penumbra, and this is the formation of a shadow.

Now I have discussed the concept of the shadow. In computer vision applications, one application is the removal of shadows. One example I can give is object tracking: in object tracking there may be a shadow of the object, and when the object is moving the shadow is also moving; that is the cast shadow. In this case I have to remove the shadow. In this class, however, I am not discussing the shadow; I am discussing shading. Shading means variable levels of darkness. Now let us see what shading is.

(Refer Slide Time: 17:07)

I have given some examples of shading. If you see all these images, you can observe the variable levels of darkness. This gives a cue for the actual 3D shape, and in this case there is a relationship between the intensity and the shape. So, if you see these images, I am getting a feeling of the 3D shape. That is shape from shading.

(Refer Slide Time: 17:37)

Another example I am giving here. In these images you can see the variable levels of darkness, but this particular image has a constant intensity value. If I consider the first two images, the intensity gives a strong feeling of the scene structure; because of the shading, I am getting a feeling of the 3D shape. That is the shading example I am giving here.

(Refer Slide Time: 18:07)

Another example I am giving. This region is not shading because it has a constant intensity value, but if you see this image, because of the shading I am getting a feeling of the 3D shape. So, this is another example of shading; the question is how to get the shape information from the shading information.

(Refer Slide Time: 18:24)

Another example I am giving, the shading example. From the shading you can get a feeling of the 3D shape. Here also I am giving another example of shading, and here too you have a feeling of the 3D shape from the shading. Shading means, as I have already mentioned, variable levels of darkness.

(Refer Slide Time: 18:47)

The 3D shape from shading problem is: from one image, only one image, I want to get the shape information. In this case the shading information is available; if you see this image you can see the shading. From this one image I want to get the 3D shape information, and the 3D shape information I am getting is shown here.

Similarly, in the second example I am considering one face image, which also has shading. From this single image I want to get the 3D shape information. That is the problem of shape from shading.

(Refer Slide Time: 19:21)

So, what is the concept of shape from shading? So, I have only one image, single image that is
I(x, y), one image. So, from one image I want to get the 3D shape information, that is the Z(x,y),
that is the depth information I want to determine. 3D shape of the object in the image I want to
determine. Similarly, in the second example also I am considering one image. And from this
image I want to get the 3D shape information. That is the problem of shape from shading. So,
shading gives a feeling of the 3D shape.

(Refer Slide Time: 19:54)

The shape from shading problem is: the input is a single image and the output is the 3D shape of the object in the image. But the problem is ill-posed. Why is it ill-posed? Because many shapes can give rise to the same image; that is why the problem is ill-posed. So I have to consider some common assumptions. The lighting is known, which means I know the direction of the light source and also the lighting conditions.

Another assumption is that I have to consider Lambertian reflectance and also a uniform albedo; the albedo lies between 0 and 1, as I have already explained. So these assumptions I have to consider: one is the lighting, another is the Lambertian reflectance, and also the uniform albedo.
(Refer Slide Time: 20:36)

Now, how to represent a particular surface? So, in this example I have considered one surface,
this surface I am considering, and corresponding to this surface suppose I am considering one
point here. This point I am considering, this is the point. Corresponding to this point, I can draw
a tangent plane. So, my tangent plane is this. And corresponding to this tangent plane I have a
surface normal. So, my surface normal is this, this is my surface normal.

So, I am explaining again. Corresponding to this red surface and corresponding to this particular
point, I can consider one tangent plane and corresponding to this tangent plane, how to represent
the tangent plane? The tangent plane is represented by the surface normal. So, this is my surface
normal. So, this surface, the particular surface can be represented by surface normals. So, I have
the surface normals like this. So, I have the surface normals. So, the surface is represented by
surface normals.

(Refer Slide Time: 21:35)

The shape from shading problem is, given a grayscale image, I have a grayscale image and the
albedo, and the light source direction is available, then in this case I want to get the scene
geometry. So, that means this case, I have to get the information of the surface normals. So, these
are the surface normals. So, this, the grayscale image is available. In this case you can see the
variable levels of darkness. So, from this I want to determine the surface normals. If I can
determine the surface normals that means I can reconstruct scene geometry and this is the shape
from shading problem.

(Refer Slide Time: 22:15)

Here, I have shown the incident and emittance angles. Light is coming from the source like this; this is the direction of illumination, and I am considering a surface. Corresponding to this point, I am considering a tangent plane; this is my incoming direction and this is my outgoing direction. And corresponding to this point, the point is, say, P, I am considering the surface normal. My objective is basically to get all the surface normals so that I can get the orientation information of the surface.

(Refer Slide Time: 22:54)

So, what determines the scene radiance? In this case, the scene radiance depends, first, on the amount of light that falls on the surface, and second, on the fraction of the light that is reflected, which depends on the albedo of the surface. Here, if you see, the light is coming from the source, so this is the incident direction and this is the outgoing direction, and this is my camera. And what do we have corresponding to this point? This is my surface normal. The vector s corresponds to the source direction, n is the surface normal, and ρ is the albedo of the surface. That means, in this case, what will be my image? I is the image irradiance, and it is given by the albedo times the dot product of the surface normal and the source vector, that is, I(x, y) = ρ (n · s).
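
To see this image model in action, here is a small sketch (my own illustration, not from the lecture) that renders a Lambertian image I = ρ (n · s) from a given per-pixel normal map, an albedo map, and a known light direction:

import numpy as np

def render_lambertian(normals, albedo, light_dir):
    # normals: H x W x 3 array of unit surface normals; albedo: H x W array in [0, 1];
    # light_dir: 3-vector giving the (known) source direction.
    s = np.asarray(light_dir, dtype=float)
    s = s / np.linalg.norm(s)
    shading = np.clip(np.tensordot(normals, s, axes=([2], [0])), 0.0, None)  # n . s, clamped
    return albedo * shading                                                  # I(x, y) = rho * (n . s)

# Tiny example: a 2 x 2 fronto-parallel patch with albedo 0.5, lit from the viewing direction.
normals = np.zeros((2, 2, 3)); normals[..., 2] = 1.0
print(render_lambertian(normals, 0.5 * np.ones((2, 2)), light_dir=[0, 0, 1]))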

(Refer Slide Time: 24:01)

So, in case of the Lambertian surface, already I have explained the Lambertian surface that
means the radiance leaving a surface is independent of the angle. That is the definition of
Lambertian surface. Now in case of the Lambertian surface, it reflects all light without absorbing
and I am not considering the specular surface, I am not considering the specularity. Brightness of
the surface as seen from the camera is linearly correlated to the amount of light falling on the
surface.

So, brightness of the surface, the brightness of the surface is observed by the camera. So, the
brightness actually depends on the amount of light falling on the surface. So, amount of light
coming from the source, this is a source and light is reflected back to the camera and this is the
surface normal, this is the source vector and I have shown the incoming direction and the
outgoing direction. So, brightness of the surface as seen from a camera is linearly correlated to
the amount of light falling on the surface. So, this is very clear.

(Refer Slide Time: 25:04)

Now, how do we represent the surface orientation? You already know that a smooth surface has a tangent plane at every point. For this surface, with the x, y, and z coordinates as shown, the surface orientation is represented by the gradient values p and q. What is p? p is the partial derivative of z with respect to x, that is, the change of z with respect to x. And what is q? q is the partial derivative of z with respect to y, that is, the change of z in the y direction. So if I move by δx in the x direction, z changes by p δx, and if I move by δy in the y direction, z changes by q δy. The surface orientation is represented by this gradient, the pair p and q.
(Refer Slide Time: 26:00)

Here I am showing again that the surface orientation is represented by the gradient: p is the partial derivative of z with respect to x, and q is the partial derivative of z with respect to y. That is how the surface orientation is defined. In this case, I am considering the surface tangent vectors: one in the x direction, r_x = (1, 0, p), and another in the y direction, r_y = (0, 1, q). And I want to determine the surface normal.

The surface normal n can be determined from the cross product of r_x and r_y; taking this cross product gives a vector proportional to (p, q, -1), and this can be normalized. After normalization, the unit surface normal is (p, q, -1) divided by √(p² + q² + 1).

In this diagram, I am again showing the tangent plane corresponding to the point P, and again: p is the partial derivative of z with respect to x, and q is the partial derivative of z with respect to y. That means the surface orientation is represented by the gradient values p and q, and from them I can determine the surface normal.
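
For completeness, the cross product behind this statement (my own working, using the tangent vectors of the graph z = f(x, y)) is

\begin{align}
r_x \times r_y
= \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ 1 & 0 & p \\ 0 & 1 & q \end{vmatrix}
= (-p,\, -q,\, 1)
\;\propto\; (p,\, q,\, -1),
\qquad
\hat{n} = \frac{(p,\, q,\, -1)}{\sqrt{p^2 + q^2 + 1}} .
\end{align}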

(Refer Slide Time: 27:27)

So, the same thing I am showing here. Let the imaged surface be z = f(x, y); in the next slide I am going to explain what z = f(x, y) means. Then the surface normal can be obtained as the cross product of the two surface tangent vectors. In my last slide I showed the surface vectors r_x and r_y, and from these you can determine the surface normal, which is proportional to (p, q, -1). This concept of the image surface z = f(x, y) I am going to discuss in my next slide.

(Refer Slide Time: 27:57)

The dependence of the surface radiance on the local surface orientation can be expressed in gradient space, and the reflectance map R(p, q) is used for that purpose. What is R(p, q)? It gives the relationship between the surface orientation and the brightness; that is called the reflectance map. So R(p, q) relates the surface orientation, given by the parameters p and q, to the brightness. For this, what am I doing? I am taking the dot product between the surface normal and the source vector, where s is the source vector.

In this case, (p, q, 1) is a vector normal to the surface, that is, the surface normal, and (p_s, q_s, 1) is the vector in the direction of the source. If I take the dot product between the surface normal and the source vector, then I get the reflectance map.
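
Written out explicitly, the Lambertian reflectance map that this dot product presumably yields (my own expansion, not copied from the slide) is

\begin{align}
R(p, q)
= \frac{1 + p\,p_s + q\,q_s}
       {\sqrt{1 + p^2 + q^2}\;\sqrt{1 + p_s^2 + q_s^2}} .
\end{align}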

(Refer Slide Time: 29:00)

In this case, I am giving one example of a reflectance map. The reflectance map R(p, q) can be visualized in the gradient space as nested iso-contours, each corresponding to the same observed irradiance. So, if I consider the brightness as a function of the surface orientation: corresponding to this point I have one contour, and along this contour the brightness value is, say, 0.8.

For different values of p and q along that contour, the brightness is constant, namely 0.8. Along the next contour, the reflectance map value is 0.7. That means I am showing the brightness as a function of surface orientation: corresponding to this orientation, the brightness is 0.8. That is why I am considering iso-contours, which correspond to the same observed irradiance, where irradiance here means the brightness. So I have given one example of the reflectance map.

(Refer Slide Time: 30:14)

Here, I have given some reflectance map examples, showing brightness as a function of surface orientation. In this example, the brightness is almost uniform, and corresponding to it I get a reflectance map like this. Corresponding to this dark object, I get a reflectance map something like this. And corresponding to this image, this will be the reflectance map. The reflectance map you can plot in each case.

(Refer Slide Time: 30:37)

A contour of constant intensity in the p-q space, which is called the gradient space, is given by c equal to the dot product between the surface normal and the source vector; this is the surface normal and this is the source vector. So I obtain a contour of constant intensity, as in the expression I showed in my last slide. Hence, for each intensity value and for each source direction, there is a contour on which the surface orientation could lie.

Now, one important point is that the image irradiance is related to the scene radiance. That is why I can write the image irradiance I(x, y) in terms of the reflectance map: the image irradiance is related to the scene radiance, which means the image irradiance is related to the reflectance map. So this is my reflectance map and this is the image irradiance.

(Refer Slide Time: 31:42)

Here again I am showing the same concept. Light is coming from the source, this is the surface, the light is reflected by the surface, and this is my observation, which may be a camera. Given a 3D surface, the lighting, and the viewing direction, we can compute the gray level I(x, y) of a pixel of the surface: if I know the lighting and the viewing direction, I can compute the gray-level pixel value I(x, y). In this case, I find the gradient of the surface, that is, p and q, and then by using this expression I can determine the gray levels.
(Refer Slide Time: 32:22)

Now, shape from shading: can we reconstruct the shape from a single image? I have only one image, and corresponding to, say, this point, the blue curve is the corresponding reflectance map. But there is a problem: if you see the previous equation, two variables are available, p and q, but there is only one equation. Only one equation is available and I have two variables, so how to compute this?

(Refer Slide Time: 32:52)

One solution is that I can consider more images of different brightness: I can consider this image, this image, this image, and so on, and from these I want to get the 3D information. Similarly, in this example I am considering several images, maybe two, three, four, or even seven images, of different brightness, that is, different illumination conditions, and from these I am getting the 3D information.

(Refer Slide Time: 33:25)

So, like this, we take several pictures of the same object from the same viewpoint but under different lighting conditions. I take one image corresponding to a particular lighting condition; corresponding to this first image I get a reflectance map, drawn in the same colour. Corresponding to the second image, taken under another lighting condition, I get the reflectance map given by the green curves. And again I consider another image, say the blue one, under a different lighting condition, and I get the blue reflectance map.

All of these intersect at one point, and that intersection corresponds to that particular surface point. From this I can determine the p and q values, and the p and q values give the orientation of the surface. So this is one technique, and this technique is called photometric stereo.
(Refer Slide Time: 34:23)

Here again I am showing this another example. Three different lighting conditions, same scene
and in this case what I am getting? All these are intersecting at this point corresponding to that
particular point. So, these are intersecting. And this gives the value of p and q. So, I am getting
three reflectance maps, one is corresponding to the, this image, the first image. Corresponding to
the second image I am getting another reflectance map. Corresponding to the third image I am
getting another reflectance map, and from this I want to determine the solution. The solution is p
and q.

(Refer Slide Time: 34:56)

So now, for shape from shading, there are many algorithms to obtain the solution, and one very important algorithm is photometric stereo. Photometric stereo is a very important algorithm and application of computer vision. In this case, the input is several images of the same object under different lighting conditions but with the same pose; that is my case: I have several images of the same object, different lighting conditions, same pose. And what will be my output? The output is the 3D shape of the object in the image. I can also determine the albedo of the surface, and the lighting I can determine as well. So this is called photometric stereo.

(Refer Slide Time: 35:46)

So, as I have already explained, in the case of the camera, a 3D point (x, y, z) is projected onto the (x, y) coordinates; I am doing the projection. This is the principle of the camera, which means the depth information is lost. I can show it like this: this is the z direction, this is the x direction, and this is the y direction. I am considering one arbitrary surface, and for imaging I am doing the projection onto this image plane; light travels in this direction and I am getting the image.

So, due to this projection I am getting the image with only the x and y coordinates; the z coordinate is missing. Now, the surface representation: how to represent a particular surface? A surface is represented as (x, y, f(x, y)), where x and y are the spatial coordinates and f(x, y), as I told you in my previous slide, is called the depth map; sometimes it is called the height map, and sometimes the dense depth map. This representation of a surface has a standard name: it is called a Monge patch. So what is stored at the point (x, y)? The value at the point (x, y) of this map is f(x, y).

Now, what is the concept of photometric stereo? In photometric stereo, suppose this is the surface, light is coming from the source like this, this is the surface normal, and suppose this is the point (x, y). The photometric stereo concept is: for different views, where different views here means a number of light sources, the same point is illuminated by the different light sources, we observe the resulting pixel values, and from them we find f(x, y), that is, the depth map.

The setup of photometric stereo is as follows. Suppose this is the surface, and light falls on the point (x, y) from a source S_1. Like this, I am considering many light sources: another source may be S_2, another may be S_3. So I am getting a number of images: the light is reflected towards my camera, and after this I am getting the image I(x, y). This is the photometric stereo setup, with light coming from the sources S_1, S_2, S_3.

But remember, at any particular time only one source is considered. So at a particular time only the first source, S_1, is considered, and corresponding to this I am getting one image; S_2 and the other sources are not considered. That is why I have to obstruct the light from the other sources, and for that an occluder is considered. So, at a particular time one source is considered, say S_1; corresponding to S_1, the point is illuminated by that light source and I am getting one image, say I_1(x, y).

Next, I consider the second source, and from the second source I am getting another image, I_2(x, y). Like this, I am getting a number of images, and from these images I want to get the information f(x, y), that is, the depth map I want to determine. So, this is the case; I have explained the concept of photometric stereo.

(Refer Slide Time: 40:38)

So, let us see the mathematical derivation of photometric stereo. Here I have the surface, the light sources, and the camera; the camera is along the z direction. I am considering the point (x, y), and the surface is represented by (x, y, f(x, y)), which, as I have already explained, is a Monge patch. Now, what will be my image? If my camera is linear, the pixel value at the point (x, y) is

I(x, y) = k ρ(x, y) (n(x, y) · S_1),

where k accounts for the linear camera, ρ(x, y) is the surface albedo (because the light is reflected by the surface), n(x, y) is the surface normal, and S_1 is the source vector, since I know the direction of the source. Now I am defining r(x, y) = ρ(x, y) n(x, y). What does it mean? It corresponds to the surface characteristics, because it combines the albedo and the surface normal. The factor k corresponds to the characteristics of the camera, since I am considering a linear camera, and S_1 corresponds to the characteristics of the source. So I am getting this equation for the pixel value I(x, y).

(Refer Slide Time: 42:03)

Now suppose we consider n sources, for each of which C_i is known. If you see the previous slide, the source and the camera together are represented by the factor C_1; so for n sources I get C_1, C_2, C_3, and so on. The bar shown here denotes the transpose operation. Corresponding to these sources I am getting the images: corresponding to the first source I get one image, corresponding to the second source I get the second image, and like this I get a number of images. But remember, only one source is on at a particular time.

I have already explained this equation in the previous slide: one factor represents the surface characteristics and the other represents the characteristics of the camera and the source. This equation I can write in vector form as I(x, y) = C r(x, y), where I(x, y) stacks the pixel values of the images obtained for the different sources, and C stacks the vectors C_i, which characterize the sources and the linear camera. So C is known, and since I am getting the images, I is also known.

What is unknown? The unknown is r(x, y), which is basically the surface characteristics. From it I can determine the albedo of the surface, and my main objective is to determine the surface normal, because the surface normal gives the orientation of the surface; that means I can determine the shape of the surface.

(Refer Slide Time: 43:52)

Now I am considering one matrix. Since at a particular time only one source is on (the occluder I showed in my previous diagram takes care of this), I build a diagonal matrix from the measured values: in the row corresponding to source S_1 the entry is the image value I_1 and the rest of the entries are 0; in the row corresponding to the second source the entry is I_2(x, y) and the remaining entries are zeros, and so on. That concept I have already explained, and this gives the matrix I(x, y).

Now, multiply equation 1, which I wrote earlier, by this matrix. The matrix is introduced to account for the shadow effect: it has the effect of zeroing the contribution from shadowed regions. Multiplying the equation by it, I get the relation shown, and I can write it in matrix form.

In this equation, which quantities are known? The image values are known and the source matrix is known; the unknown is r(x, y), which I have to determine. That is the concept of photometric stereo.
(Refer Slide Time: 45:11)

In this case, the albedo of the surface lies between 0 and 1. You know the expression r(x, y) = ρ(x, y) n(x, y), which combines the surface albedo and the surface normal. Since n(x, y) is a unit vector, you can determine the albedo of the surface as the magnitude of r(x, y), and the surface normal as r(x, y) divided by its magnitude.

As for the surface normal itself, I have already explained that from the surface vectors r_x and r_y the normal is proportional to (p, q, -1), and after normalization I get (p, q, -1)/√(p² + q² + 1). So, from these equations you can determine the albedo of the surface and also the surface normal.
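
To make the pipeline concrete, here is a small sketch (my own illustration, not the lecture's code) of the least-squares recovery of r(x, y), the albedo, and the unit normal at one pixel, assuming the stacked source/camera matrix C (one row per source, as described above) and the vector of measured intensities are given; shadowed measurements could additionally be suppressed with the diagonal matrix just discussed:

import numpy as np

def recover_albedo_normal(C, intensities):
    # C: n x 3 matrix, row i = k * S_i (camera gain times source direction).
    # intensities: length-n vector of pixel values I_i(x, y) for the same pixel.
    # Solves I = C r in the least-squares sense, then splits r into albedo and unit normal.
    C = np.asarray(C, dtype=float)
    I = np.asarray(intensities, dtype=float)
    r, *_ = np.linalg.lstsq(C, I, rcond=None)   # least-squares solution of C r = I
    albedo = np.linalg.norm(r)                  # |r| = rho, because |n| = 1
    normal = r / (albedo + 1e-12)               # unit surface normal
    return albedo, normal

# Hypothetical example: three sources, a fronto-parallel patch with albedo 0.8.
C = np.array([[0.0, 0.0, 1.0], [0.5, 0.0, 1.0], [0.0, 0.5, 1.0]])
true_r = 0.8 * np.array([0.0, 0.0, 1.0])
print(recover_albedo_normal(C, C @ true_r))     # ~ (0.8, [0, 0, 1])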

(Refer Slide Time: 45:56)

To recover the depth map, f(x, y) has to be computed from the estimated value of the unit normal. For this, I consider the three components of r(x, y): the first component is r_1, the second is r_2, and the third is r_3. From these I can determine the gradient values: p is r_1 divided by r_3, and next, the q value is r_2 divided by r_3. So, from r_1, r_2, and r_3 I can determine the gradient values p and q.
(Refer Slide Time: 46:44)

After this, we have to perform a check: the partial derivative of p with respect to y should be approximately equal to the partial derivative of q with respect to x. This is called the test of integrability; the mixed second-order partial derivatives must be equal. Finally, I can reconstruct the surface by integration using this equation: the gradients f_x and f_y are available and c is the integration constant, so from this I can determine f(x, y), and f(x, y) is the surface I am getting.
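
A very simple way to carry out this integration numerically (my own sketch: a path-by-path cumulative sum, ignoring noise and the more robust least-squares integrators used in practice) is:

import numpy as np

def integrate_gradients(p, q):
    # Rebuild a height map f from gradient fields p = df/dx, q = df/dy (same shape),
    # by integrating p along the first row and then q down each column.
    h, w = p.shape
    f = np.zeros((h, w))
    f[0, 1:] = np.cumsum(p[0, 1:])                       # integrate p along the top row
    f[1:, :] = f[0, :] + np.cumsum(q[1:, :], axis=0)     # then integrate q down each column
    return f                                             # defined up to an additive constant c

# Tiny check with a planar surface f = 2x + 3y: p = 2, q = 3 everywhere.
p = np.full((4, 5), 2.0)
q = np.full((4, 5), 3.0)
print(integrate_gradients(p, q))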

(Refer Slide Time: 47:21)

So, the normals are recovered using these concepts; these are the surface normals, and if I integrate over the surface normals, the surface is recovered. The shape of the surface is recovered by integration.

So, in this class I discussed the concept of photometric stereo, and before that, the concept of shape from shading. There are many algorithms to solve this problem, but one popular algorithm is photometric stereo. In this case I consider a number of light sources; the surface is illuminated by the light sources, but by only one at a particular time. I get a number of images, and from these images I obtain the surface normals. From the surface normals I can get the information about the shape of the surface, and the surface can be reconstructed. So, this is the main concept, the concept of shape from shading.

In my next class I will discuss some geometric transformations like the affine transformation. So, let me stop here today. Thank you.
Computer Vision and Image Processing – Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture No. 05
Image Formation: Geometric Camera Models

So, welcome to NPTEL MOOC course on Computer Vision and Image Processing Fundamentals
and Applications. So, in my last class I discussed about the concept of shape from shading. So,
image acquisition process is nothing but the 3D to 2D transformation. So, in this process the
depth information I am going to lose. So, depth information is not available. So, that is why we
considered the concept of shape from shading. So, in shape from shading I want to get the shape
information from the shading information.

Shading means variable levels of darkness. One algorithm for solving the shape from shading problem is photometric stereo. Suppose I want to determine the shape of a surface: that surface is illuminated by a number of light sources, but only one source at a particular time. In this process I am getting a number of images, and from these images I want to get the shape information. So, that is the objective of photometric stereo.

So, today I am going to discuss the image formation concept; in particular, I will discuss the concept of geometric camera models. To understand the geometric camera models, I have to discuss the concept of geometric transformations, because to understand concepts like the perspective projection and the orthographic projection, the concept of geometric transformation is quite important. So, let us see what a geometric transformation is.

183
(Refer Slide Time: 02:11)

So, suppose I consider one object that is moving in an image: it may be translated, it may be rotated or it may be scaled. These operations can be represented mathematically, and the combined operation of translation, rotation and scaling is called the affine transformation. So, how do we represent these operations mathematically? Today I am going to discuss the geometric transformations, and after this I will discuss the geometric camera models. So, let us discuss the geometric transformations.

(Refer Slide Time: 02:49)

184
The first topic is 2D transformations; I will discuss 2D transformations first, and after this I will discuss 3D transformations. Consider a point P(x, y). This point is translated by a vector (x0, y0), that is, by x0 units parallel to the x axis and y0 units parallel to the y axis.

The new point will be P'(x', y'). This translation operation can be written as x' = x + x0 and y' = y + y0. Suppose I define the column vectors: P is the initial point (x, y), P' is the final point (x', y'), and the translation vector is T = (x0, y0).

(Refer Slide Time: 3:49)

Now the translation can be represented in matrix form as [x'; y'] = [1 0; 0 1][x; y] + [x0; y0]. Equivalently, you can write [x'; y'] = [1 0 x0; 0 1 y0][x; y; 1]. So, the translation operation shown in the previous slide can be written in matrix form like this.

But note that the first expression is an asymmetric expression. This leads to the concept of the homogeneous coordinate system. Suppose a point is represented by the pair of numbers (x, y); each such point can be represented by a triple (x, y, w). If the w coordinate is

185
non-zero, then the point (x, y, w) can be represented as (x/w, y/w, 1).

Here x/w and y/w are the Cartesian coordinates of the homogeneous point; that means a triple of coordinates represents a point in a 3D space. The idea is this: consider a triple (tx, ty, tw) with tw not equal to 0. All triples of this form, obtained by scaling, actually represent a line (a ray) in the 3D space, and they all correspond to the same 2D point. And if I use the homogeneous coordinate system, the translation becomes a symmetric expression, so I can use symmetric expressions in place of the asymmetric one.
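To make the symmetric (homogeneous) form concrete, here is a small NumPy sketch; the function name and the sample numbers are illustrative assumptions only.

import numpy as np

def translation_2d(x0, y0):
    """3x3 homogeneous matrix that translates a 2D point by (x0, y0)."""
    return np.array([[1.0, 0.0, x0],
                     [0.0, 1.0, y0],
                     [0.0, 0.0, 1.0]])

P = np.array([2.0, 3.0, 1.0])        # the point P(2, 3) in homogeneous coordinates (x, y, 1)
P_new = translation_2d(5, -1) @ P    # translate by (x0, y0) = (5, -1)
print(P_new[:2] / P_new[2])          # back to Cartesian coordinates: [7. 2.]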

(Refer Slide Time: 06:07)

Now let us consider concept of rotation of a point x y by an angle θ in the clockwise direction. In
this figure I have shown. Suppose the point is P x y and that point is rotated by an angle θ . So,
what is the new position? The new position is P' x ' y ' . That point P x y is rotated by an angle θ in
the clockwise direction.

Now, from simple trigonometry: what is x? x is nothing but r cos α, because the length of the vector is r; and y = r sin α. From this you can determine the new coordinates: x' = r cos(α − θ), which expands to x' = x cos θ + y sin θ, and y' = r sin(α − θ) = y cos θ − x sin θ.

186
Then this rotation operation can be represented in matrix form: (x', y') are the final coordinates, (x, y) are the initial coordinates, and the transformation matrix is [cos θ  sin θ; −sin θ  cos θ]. So, this matrix is the transformation matrix for rotation, and the rotation operation can be represented like this.

(Refer Slide Time: 07:32)

Next we consider the scaling of a point by factors Sx and Sy along the x and y directions respectively. This scaling operation can be written in matrix form as [x'; y'] = [Sx 0; 0 Sy][x; y]. Now, if the scaling is not uniform for the whole object, it is called shearing. For example, shearing parallel to the x axis can be represented as x' = x + ky and y' = y.

So, shearing parallel to the x axis means x' = x + ky and y' = y, and in matrix form this shearing operation is [x'; y'] = [1 k; 0 1][x; y].
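As a quick check of these three 2D matrices, here is a small sketch that applies the clockwise rotation, the scaling and the x-shear to the same point; the angle, scale factors and shear parameter are arbitrary assumptions.

import numpy as np

theta, sx, sy, k = np.deg2rad(30), 2.0, 0.5, 1.5
P = np.array([1.0, 1.0])

# Clockwise rotation: x' = x cos(theta) + y sin(theta), y' = y cos(theta) - x sin(theta)
R = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

S = np.array([[sx, 0.0],      # scaling by Sx along x and Sy along y
              [0.0, sy]])

H = np.array([[1.0, k],       # shear parallel to the x axis: x' = x + k*y, y' = y
              [0.0, 1.0]])

print(R @ P, S @ P, H @ P)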

(Refer Slide Time: 8:38)

187
So, here I have shown the example. So, how to do the shearing? So, in this case shearing parallel
to x axis I am shown here. So, this is the, suppose object and I have done the shearing. So, this is
the matrix for shearing. In homogenous coordinate system this transformation I can write like
this. So, the translation operation, the rotation operation, scaling operation and this operation is
the shearing operations. The next I have shown one example.

(Refer Slide Time: 09:07)

This is another case, the reflection about the x axis. If I do this operation, (x', y') is the final position, (x, y) is the initial position, and the transformation matrix is [1 0; 0 −1]. If I apply this transformation, what am I getting? I am performing

188
the reflection: you can see that this point is reflected about the x axis. So, the main operations are rotation, scaling and translation, and these operations can be represented together in matrix form like this. In the combined matrix, the upper 2 x 2 sub-matrix is a composite rotation and scale matrix, whereas Tx and Ty are the composite translations.

(Refer Slide Time: 9:50)

Suppose I want to rotate a vector about an arbitrary point P in the 2D xy plane. How do we do this? First, the vector needs to be translated such that the point P is at the origin. After this translation operation I can apply the rotation operation.

So, to repeat the objective: to rotate about an arbitrary point P, first translate so that P sits at the origin, then do the rotation, and finally we need to translate the point back

189
such that the point at the origin returns to the point P. This complete operation can be written as T(−r) R(θ) T(r), where the matrices act from right to left: T(r) is the translation that brings the point P to the origin, R(θ) is the rotation, and T(−r) is the translation back, in the reverse direction.

This is the combined operation. But you have to remember that this concatenation is not commutative; matrix multiplication is not commutative, so the order of the operations is important. If I first do the rotation and after this the translation, the result will be different. And one more important point: in the previous case I showed a transformation matrix in which I considered all the operations, rotation, scaling and translation, together. This combined operation is called the affine transformation.
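As an illustration only, here is a small NumPy sketch of rotating about an arbitrary pivot point and of the non-commutativity of the order of operations; the pivot, the angle and the use of the counter-clockwise sign convention are my own assumptions.

import numpy as np

def T(tx, ty):                       # homogeneous 2D translation
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)

def R(theta):                        # homogeneous 2D rotation (counter-clockwise here)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

pivot = np.array([4.0, 2.0])         # arbitrary point P about which we rotate
theta = np.deg2rad(90)

# Rotate about P: bring P to the origin, rotate, translate back (right-to-left).
M = T(*pivot) @ R(theta) @ T(-pivot[0], -pivot[1])

point = np.array([5.0, 2.0, 1.0])    # homogeneous point to be rotated
print(M @ point)                     # rotated about (4, 2), not about the origin

# Order matters: translate-then-rotate is not the same as rotate-then-translate.
A = R(theta) @ T(1, 0)
B = T(1, 0) @ R(theta)
print(np.allclose(A, B))             # False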

(Refer Slide Time: 12:42)

So, next I am considering 3D transformations. In a 3D transformation, all the 3D points are converted to homogeneous coordinates: (x, y, z) is converted into (x, y, z, 1). Translation of a point is given by x' = x + x0, y' = y + y0 and z' = z + z0, that is, translation along the x direction, along the y direction and along the z direction.

190
(Refer Slide Time: 13:11)

So, I can write this translation in matrix form. The displacement vector is (x0, y0, z0), and the point (x, y, z) is translated to the point (x', y', z'). I can use the unified, that is the symmetric, expression, which means the transformation matrix for translation is now a square matrix: T = [1 0 0 x0; 0 1 0 y0; 0 0 1 z0; 0 0 0 1]. So, I am considering translation along the x direction, along the y direction and along the z direction.

191
(Refer Slide Time: 14:01)

Next I am considering the scaling operation, that is, 3D scaling. The transformation matrix for scaling is S = [Sx 0 0 0; 0 Sy 0 0; 0 0 Sz 0; 0 0 0 1], where Sx is the scaling in the x direction, Sy is the scaling in the y direction and Sz is the scaling in the z direction.

(Refer Slide Time: 14:32)

And let us consider the concept of the rotation. Now the three cases of rotations; rotation around
x axis, rotation around y axis and rotation around z axis. If I consider rotation around z axis, only
x and y coordinates change and z coordinates remains same. Rotation around z axis means it is a

192
rotation in a plane which is parallel to x y plane. So, here in this figure I have shown the rotation
along the x axis by an angle α , rotation along the y axis by an angle β , rotation around z axis by
an angle θ .

So, corresponding to these, I have the transformation matrices: R(θ) for rotation around the z axis, R(α) for rotation around the x axis and R(β) for rotation around the y axis. So, I have considered these transformations.
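A minimal sketch of how these 4 x 4 homogeneous matrices can be built and composed is given below; the particular angles, scale factors and the counter-clockwise sign convention are illustrative assumptions.

import numpy as np

def rot_z(theta):                    # 4x4 rotation about the z axis by theta
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0, 0],
                     [s,  c, 0, 0],
                     [0,  0, 1, 0],
                     [0,  0, 0, 1]], dtype=float)

def translate(x0, y0, z0):           # 4x4 translation
    M = np.eye(4)
    M[:3, 3] = [x0, y0, z0]
    return M

def scale(sx, sy, sz):               # 4x4 scaling
    return np.diag([sx, sy, sz, 1.0])

point = np.array([1.0, 0.0, 0.0, 1.0])                  # homogeneous 3D point
M = translate(0, 0, 5) @ scale(2, 2, 2) @ rot_z(np.deg2rad(90))
print(M @ point)                                        # matrices applied right-to-left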

(Refer Slide Time: 15:34)

And finally, the transformation matrix for the shear is given by this matrix. So, up to now I have discussed the concept of transformations: I discussed the concept of 2D transformation and
after this I have discussed the 3D transformation. So, in case of the 2D transformation I
discussed the translation, rotation and scaling and also I discussed about shearing. In case of the
3D transformation I considered again the translation operations, rotation operations and scaling
operations.

I can apply these operations one by one, and the combined operation is called the affine transformation. This is very important. If I understand the concept of the affine transformation, then it is easy to understand the concept of image formation in the camera. We have a number of projections, like the perspective projection and the orthographic projection, and for these projections I have to understand this concept, the concept of the homogeneous coordinate

193
systems. As I have already discussed, the matrix operation is not commutative, so the order of the operations is important.

(Refer Slide Time: 16:47)

I can give one example here; see the next slide. In the first figure I am considering one vector, the vector a. First I do the translation of this vector by an amount ∆x in the x direction and ∆y in the y direction, and after this I do the rotation of this vector by an angle θ; corresponding to this I get the point P1.

In the second case, first I do the rotation of the same vector and after this the translation, and I get P2. It can be observed that P1 is not equal to P2, because matrix multiplication is not commutative and the order of operations is quite important. Now, in the case of the affine transformation, suppose I first do a translation, after this a scaling, and after this a rotation; how can we write this?

So, suppose the new point is P' = R(θ) S T P = A P. What am I doing here? First I apply the translation T to the point P, after this the scaling S, and after this the rotation R(θ). This combined operation is called the affine

194
transformation. So, what is the matrix A? The matrix A = R(θ) S T, and this matrix will be a 4 x 4 matrix (for 3D points in homogeneous coordinates).

So, this matrix A I can consider as the affine matrix: first the translation, after this the scaling, and after this the rotation, that is, A = R(θ) S T. In the first case shown earlier I did the translation and after this the rotation; in the second case, first the rotation and after this the translation; and in that case P1 is not equal to P2.

So, these matrix operations are quite important. I can give one example: suppose I have some object which appears one way in one image and differently in another image. Now I have to determine what type of transformation is going on: the object may be rotated, it may be translated, and some scaling operation may have been done.

So, corresponding to this, you have to determine the transformation matrix. For this you can consider some of the keypoints, and based on these keypoint correspondences you can understand what type of transformation is going on, that is, you can find the transformation matrix, whether the object is rotated, scaled, or translated.

So, in one image I have this object and in the second image I have that one; what type of transformation is going on? That you can determine. So, this is about the geometric transformation, and these transformations are important for understanding the concept of image formation in the camera.
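Purely as an illustration of this keypoint idea, here is a small least-squares sketch in NumPy; the function name fit_affine, the synthetic correspondences and the particular rotation/scale/translation values are my own assumptions, not part of the lecture.

import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src keypoints to dst keypoints.

    src, dst : (N, 2) arrays of matched keypoints, N >= 3.
    Returns a 3x3 homogeneous affine matrix A with dst ~ A @ src.
    """
    N = src.shape[0]
    X = np.hstack([src, np.ones((N, 1))])           # rows [x, y, 1]
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)     # solve X @ M = dst
    A = np.eye(3)
    A[:2, :] = M.T                                  # rows [a b tx; c d ty]
    return A

# Synthetic check: rotate by 30 degrees, scale by 1.5 and translate by (2, -1).
theta = np.deg2rad(30)
A_true = np.array([[1.5 * np.cos(theta), -1.5 * np.sin(theta),  2.0],
                   [1.5 * np.sin(theta),  1.5 * np.cos(theta), -1.0],
                   [0.0, 0.0, 1.0]])
src = np.random.rand(10, 2) * 10
dst = (A_true @ np.hstack([src, np.ones((10, 1))]).T).T[:, :2]
print(np.allclose(fit_affine(src, dst), A_true))    # True up to numerical error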

195
(Refer Slide Time: 20:33)

Now, after this, I will discuss the concept of image formation in the camera. In the next slide I am going to discuss the geometric camera models. The first topic is the image formation principle, which I explained in my last class: the output image is the PSF convolved with the object function, plus the noise of the imaging system.

The object function is the object or scene that is being imaged. And what is the PSF? The PSF, the Point Spread Function, is the impulse response when the input and the output are the intensity of light in an imaging system. That means it represents the response of the system to a point

196
source. So, the PSF indicates the spreading of the object function, and it is a characteristic of the imaging system; the PSF is quite important.

So, this is my object and I am getting the image here: the object function is convolved with the PSF, and the noise of the imaging system is added. And one important point from the last class: a good or sharp imaging system generally has a narrow PSF, whereas a poor imaging system has a broad PSF.
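As a toy illustration of this model g = h * f + n (not from the lecture slides), here is a short NumPy/SciPy sketch; the Gaussian PSF shape, the image size and the noise level are arbitrary assumptions.

import numpy as np
from scipy.signal import convolve2d

def gaussian_psf(size=9, sigma=2.0):
    """Simple normalized Gaussian PSF; a narrow sigma models a sharp imaging system."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    h = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return h / h.sum()

f = np.zeros((64, 64))                       # object function: a bright square on a dark background
f[24:40, 24:40] = 1.0

h = gaussian_psf(sigma=2.0)                  # PSF of the imaging system
noise = 0.01 * np.random.randn(*f.shape)     # sensor noise
g = convolve2d(f, h, mode="same") + noise    # observed image: g = h * f + n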

(Refer Slide Time: 21:46)

And here in this case I have shown simple image formation process. So, I have shown one object
and I have shown the film that is the sensor I am considering. The sensor converts the light
photon into electrical signal. But is this case if I consider, I will be getting the blurred image.
Because it is the (inter) intersections of all the light rays if I consider. So, because of this I am
not getting the fine quality image. I am getting the blurred type of image. So, that is why I can
consider this setup.

197
(Refer Slide Time: 22:21)

That means I am considering one barrier, and in the barrier there is a small opening, an aperture. The barrier blocks most of the rays, and because of this structure it reduces blurring. In my earlier slide I was not getting a good-quality image because many rays were incident on the sensor; in this second case I am getting a good-quality image because blurring is reduced. This opening is called the aperture of the imaging system.

(Refer Slide Time: 22:58)

198
So, suppose I shrink the aperture; what will happen if I reduce its size? In the first case the aperture size is 2 mm, next 1 mm, after this 0.6 mm, and next 0.35 mm; that means I am gradually reducing the size of the aperture. In this case I am getting sharper images, but we cannot make the aperture as small as we like, because if the aperture is very small less light gets through, and I also have to consider the effect of diffraction. That is why I cannot make the aperture arbitrarily small.

(Refer Slide Time: 23:38)

And in this case I have shown the image formation in a pinhole camera. So, this is a very simple
camera. So, I have one opening here, small opening, hole is there. And this is my object, and this
is the image.

So, I am getting the inverted image here. This structure is very similar to the human eye: instead of this opening the eye has the pupil, and in place of the camera sensor the eye has the retina. The retina has two types of photoreceptors, the rods and the cones; rods are responsible for monochromatic vision and cones are responsible for colour vision. So, this structure is very similar to the human eye, and I am getting an inverted image here.

199
(Refer Slide Time: 24:25)

And in this case I have shown the image formed by a convex lens. I have the object, the convex lens, and the sensor, the film. The portion of the object that is properly focused gives a sharp image here; the portion that is not properly focused gives rise to a blurred patch, and this blurred patch is called the circle of confusion.

In the second diagram I have considered a lens with a small aperture; I have already discussed why the aperture is needed. This is the optical centre, the centre of projection; this is the optical axis; and this is the focal point of the convex lens. This is the focal length, and I have shown the image formation process in the convex lens.

200
(Refer Slide Time: 25:40)

And there is the concept of depth of field. Here I have shown two cases: in the first case the aperture is wide, and in the second case the aperture is narrow. In the first case only this portion of the object, from this distance to this distance, is properly focused; the rest of it is not.

So, corresponding to that portion I am getting a sharp image, and the background is blurred because it is not in focus; only that range of depth is properly focused. In the second case the aperture is narrow, so the in-focus range is larger; that is why both the foreground and the background come out properly.

That means the aperture size affects the depth of field. And it is quite evident that a smaller
aperture increases the range in which the object is approximately in focus of a camera. So, large
aperture corresponds to a small depth of field and vice versa.

201
(Refer Slide Time: 26:55)

And in this case I have shown the derivation of the thin lens equation. I am considering the convex lens, and from the similar triangles I get the equation 1/d0 + 1/di = 1/f. What is d0? d0 is the distance between the object and the lens. What is di? The distance between the lens and the image. And f is the focal length; I have also shown the focal point.

So, in the first example I am getting the real image, and in the second case the virtual image. If d0 is greater than f, then an object point which satisfies the above equation will lie in focus. When d0 is less than f, di becomes negative, and in this case I will be getting the virtual image. So, I have shown this derivation of the thin lens equation; you already know these equations.
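As a quick numerical sanity check of 1/d0 + 1/di = 1/f (the distances below are illustrative values only):

def image_distance(d0, f):
    """Solve the thin-lens equation 1/d0 + 1/di = 1/f for di."""
    return 1.0 / (1.0 / f - 1.0 / d0)

print(image_distance(d0=30.0, f=10.0))   # 15.0  -> positive di, real image (d0 > f)
print(image_distance(d0=5.0, f=10.0))    # -10.0 -> negative di, virtual image (d0 < f)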

202
(Refer Slide Time: 27:54)

And what is the magnification done by the lens of a camera? The magnification is defined as M = −di/d0. M is positive for a virtual image and negative for a real image, and a magnitude greater than 1 indicates magnification. So, that is the definition of the magnification done by the lens of a camera. Now consider the monocular setup: suppose this is the object and this is the image.

Here this is x, this is x', this is z and this is f; that is the monocular vision setup. From the similar triangles you know the relation x/z = x'/f. Now suppose I consider two cameras looking at the same surface.

I can call one the left camera and the other the right camera; this setup is called binocular vision. In monocular vision, because image acquisition is nothing but a 3D to 2D projection, I do not have the depth information; but with binocular vision I can get the depth information.

So, in my next class I will discuss the geometric camera models, that is, some projection models such as the perspective projection and the orthographic projection, and after this I will discuss stereo vision. By using stereo vision you can get the

203
depth information from disparity; that concept I am going to discuss in the next class. So, today I have discussed the geometric transformations, which are very important, like the affine transformation. Based on the affine transformation you can understand the concept of the perspective projection and the orthographic projection; and remember the concept of the homogeneous coordinate system. So, that is all for today. Let me stop here. Thank you.

204
Computer Vision and Image Processing – Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture No. 06
Image Formation: Geometric Camera Models

Welcome to the NPTEL MOOC course on Computer Vision and Image Processing, Fundamentals and Applications. In my last class I discussed the concept of affine transformation, including operations like translation, rotation and scaling. Understanding these operations is quite important for understanding the concept of image formation in a camera. Today I am going to discuss some important projection techniques in the camera: mainly the perspective projection, the weak perspective projection and also the orthographic projection. In my last class I also discussed image formation in a simple setup.

(Refer Slide Time: 01:14)

So, I have considered this set up. So, I have considered the object and this is the film. So, in this
case I am getting the blurred image because of the intersections of the rays. If you, see the
different rays are coming from the objects. And in this case I am getting the blurred image.

205
(Refer Slide Time: 1:35)

If I consider another configuration, in this configuration I am considering a barrier. I have the object here and the sensor area here. The aperture is nothing but a small opening in the barrier, and with this opening I am getting a good-quality image; the aperture actually reduces blurring. In my previous slide I was getting a blurred image, but in this case I am getting at least a reasonably good-quality image because of the aperture.

(Refer Slide Time: 2:05)

206
I have shown some examples: the aperture size is 2 mm, then 1 mm, then 0.6 mm, then 0.35 mm. If I reduce the aperture, what happens? I get a sharper image: this is the blurred image corresponding to the aperture size of 2 mm, and this is the sharp image I am getting at 0.35 mm.

But I cannot make the aperture as small as possible. Why? Because less light can pass through the aperture, and I also have to consider the effect of diffraction. That is why I cannot make the aperture arbitrarily small.

(Refer Slide Time: 02:46)

And here I have shown the image formation in a pinhole camera. This is a very simple camera configuration: this is my object, and I am getting the inverted image in my image plane through one small opening. This configuration is very similar to our human eye: the opening is nothing but the pupil of the eye, and the image plane is nothing but the retina. We are getting the inverted image, and since this is a 3D to 2D projection, the human brain reconstructs the 3D information.

207
(Refer Slide Time: 03:28)

Last class also I had shown the image formed by convex lens. So, here I have shown the convex
lens. This is my convex lens. And I have the object. And in this case if you see, this portion of
the object is properly focused. So, I am getting the image, the sharp image. But if I consider at
this portion of the object, the second, this portion, this is not properly focused. So, that is why
corresponding to that portion I am getting the circle of confusion. I am getting the blurred image
corresponding to that portion.

In the second diagram, if you see this diagram, I am considering the convex lens with an
aperture. So, I am considering the aperture and I have shown the focal length. f is the focal
length and the focal point also I have shown and this is the configuration to get the image.
Optical axis also I have shown. So, lens with an aperture.

208
(Refer Slide Time: 04:24)

Next, in the last class I discussed about the concept of the depth of field. The depth of field
depends on the aperture size. So, in the first case I had considered the wide aperture. In the
second case I am considering the small aperture. Corresponding to the wide aperture only this
portion, if I consider this range, from this to this range, this range is properly focused.
Corresponding to the small aperture if you see the range, so this range, this range is properly
focused that means the aperture size affects the depth of the field. So, a small aperture increases
the range in which the object is approximately in focus of a camera. So, large aperture
corresponds to small depth of field and vice versa.

So, in the first case only this range is in focus, which means I am getting only the foreground sharp in the image; the background is blurred because it is not focused properly. In the second case the range is larger, so I am getting both the foreground and the background. So, you can understand that the depth of field depends on the aperture.

209
(Refer Slide Time: 05:27)

And in the last class I also showed the derivation of the thin lens equation. In the first case I have shown the real image: this is the object and I am getting the inverted image. Here I have shown the focal length, the distance di between the image and the lens, and the distance d0 between the object and the lens. In the second case I have shown the virtual image and the object. From this configuration I can get the lens equation: 1/d0 + 1/di = 1/f. You already know this equation.

Now in the second case if you see, d 0 < f. Then in this case di will be negative. And in this case I
am getting the virtual image. So, in the second case I am getting the virtual image. In the first
case I am getting the real image but that is inverted image.

210
(Refer Slide Time: 6:23)

And I have also defined the magnification done by the lens of a camera: M = −di/d0. The magnification is positive for the virtual image and negative for the real image, and a magnitude greater than 1 corresponds to magnification.

(Refer Slide Time: 06:47)

Now let us consider homogeneous coordinates. I discussed homogeneous coordinates in my geometric transformation class. A homogeneous coordinate representation means

211
extending an n-dimensional space into an (n+1)-dimensional space. In this diagram I have shown the image plane and the scene. A point in the 2D image is treated as a ray in 3D perspective space: the point (x, y) in the image plane corresponds to the ray (sx, sy, s) in the perspective space.

That means all the points on this ray correspond to the single point (x, y, 1) in the image plane; this is nothing but a 3D to 2D projection, because corresponding to all the points on the ray I am getting only one point in the image.

And if I have the ray coordinates (sx, sy, s) in the perspective space and I want to get the image coordinates, what do I have to do? I divide sx by s and sy by s, that is, by the third coordinate, and then neglect the third coordinate; then I am getting (x, y). So, I can get the image coordinates from the scene coordinates like this.

(Refer Slide Time: 8:39)

The point (x, y) can be written in homogeneous coordinates as (x, y, 1), adding one extra coordinate, and the scene coordinate (x, y, z) can be represented in homogeneous coordinates as (x, y, z, 1). And suppose I want to convert

212
back from homogeneous coordinates; how do I do the conversion? If I want to convert from homogeneous to image coordinates, what do I have to do?

The x coordinate is divided by the third coordinate w, the y coordinate is divided by w, and w is divided by itself; after this I neglect the third coordinate, so I am getting (x/w, y/w). Similarly, in the 3D case I consider (x/w, y/w, z/w) and neglect the fourth coordinate, which is 1.
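A tiny helper sketch of these two conversions (the function names and sample values are illustrative only):

import numpy as np

def to_homogeneous(p):
    """Append a 1: (x, y) -> (x, y, 1) or (x, y, z) -> (x, y, z, 1)."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def from_homogeneous(p):
    """Divide by the last coordinate and drop it: (sx, sy, s) -> (x, y)."""
    p = np.asarray(p, dtype=float)
    return p[:-1] / p[-1]

print(to_homogeneous([3.0, 4.0]))          # [3. 4. 1.]
print(from_homogeneous([6.0, 8.0, 2.0]))   # [3. 4.]  (any scale s gives the same image point)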

(Refer Slide Time: 09:36)

So, let us consider the case of perspective projection of a point. So, here I have shown, in the
first diagram if you see I have shown the camera, this is the camera centre of projection. Camera
looks along the z axis. The focal point of the camera is at the origin, that is origin is 0 0 0. So, I
am considering this one that is the focal point. And image plane is parallel to xy plane. So, I have
this image plane that is parallel to xy plane.

And what is d? The d is the distance between the centre of projection and the image plane. So,
this is my centre of projection that is the focal point of the camera. And here I have shown one
example of the perspective projections. So, my object is this, this is my image plane and this is
my camera, camera means the viewing direction.

213
And in the third case I have shown the perspective projection of a point. So, first I have shown
the centre of projection, this is the centre of projection. And I have considered the object, object
is y. And this is the view plane. View plane is nothing but the image plane. So, what is the
distance between the centre of projection and view plane? The distance is d. The distance
between centre of projection and the object is z.

So, from this you can see I can get this expression ys is equal to dy divided by z. So, what is ys?
ys is nothing but the size of the image in the image plane; that is the perspective projection of a
point. If I increase z, what is z? The distance between the centre of projection and the object. If
suppose it is some high value then what will be the image size? The image size will be reduced.
That is the concept of the perspective projection.

(Refer Slide Time: 11:18)

Here I have shown again the modelling of projection, with the pinhole camera model as an approximation, and after this the projection model. In this case I have considered the optical centre, that is the centre of projection, at the origin; COP means the centre of projection. The object is (x, y, z), and PP means the plane of projection, that is, the image plane.

We put the image plane, that is, the plane of projection, in front of the COP. Why? Because I want to avoid an inverted image; that is why the plane of projection is placed in front of the COP. This is the positive z axis, and the

214
camera looks down the negative z axis. So, what is the projected point in the PP, the plane of projection? It is (x', y', −d). Why −d? Because I am considering the negative z direction.

(Refer Slide Time: 12:18)

Now, with the same model, consider the projection equations. If you look at the similar triangles, there is a big triangle and a small triangle, and from these triangles you can see that the scene point (x, y, z) is mapped into the image point (−d·x/z, −d·y/z, −d); I then neglect the third coordinate. That means the point (x, y, z) is mapped into (−d·x/z, −d·y/z) in the image plane. This is the concept of the perspective projection. So, how do we write this equation in matrix form?

215
(Refer Slide Time: 13:03)

In the next slide I show this equation; this is the equation corresponding to the perspective projection. The mapping can be represented in matrix form: this is the image coordinate I am getting, and this is called the perspective projection matrix. The matrix can also be formulated as a 4 x 4 matrix: I consider the 4 x 4 formulation because the previous matrix equation is not a symmetric expression, whereas the 4 x 4 formulation is symmetric. So, that is my perspective projection.
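One common way to write such a 4 x 4 perspective matrix and apply it in homogeneous coordinates is sketched below; the exact placement of the 1/d entry follows one standard convention and may differ from the slide, so treat it as an assumption.

import numpy as np

d = 1.0                                  # distance from the COP to the image plane
# Perspective projection matrix for a camera looking down the negative z axis.
P = np.array([[1, 0, 0,        0],
              [0, 1, 0,        0],
              [0, 0, 1,        0],
              [0, 0, -1.0 / d, 0]])

X = np.array([2.0, 3.0, -4.0, 1.0])      # scene point (x, y, z) in homogeneous form
x = P @ X                                # gives [x, y, z, -z/d]
print(x[:2] / x[3])                      # image point (-d*x/z, -d*y/z) = (0.5, 0.75)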

(Refer Slide Time: 13:42)

216
Now, how does scaling change the projection? This is my perspective projection. Let us scale the x, y and z coordinates of the scene point each by c in the previous equation. Corresponding to this I get the same output: the image coordinate is still (−d·x/z, −d·y/z).

What is the meaning of this? In the image, a larger object further away, whose coordinates (x, y, z) are each scaled by c, can have the same size as a smaller object that is closer. That is the interpretation of scaling by c in the perspective projection.

(Refer Slide Time: 14:33)

So the physical interpretation is that distant objects appear smaller; this is the outcome of the perspective projection. If I consider this object, the distant object appears smaller in the image plane; this is my image plane.

217
(Refer Slide Time: 14:50)

After this I will discuss some simplified projection models: the weak perspective projection and the orthographic projection. What is weak perspective? We have the matrix formulation for the perspective projection. Now suppose I consider a surface, and the relative depth of a set of points on it is much smaller than the average distance z_av to the COP; then these points can be considered as a group.

218
Suppose I have another object in the image, so there are object number 1 and object number 2. If the relative depth of the points on an object is much smaller than the average distance, I can consider them as a group, effectively at the same depth. So, if the relative depth of points on the objects is much smaller than the average distance z_av to the COP, the centre of projection, then I have this equation.

And if I take c = −d/z_av, then the x coordinate is scaled by c and the y coordinate is scaled by c; this is nothing but simple scaling. The projection is reduced to a uniform scaling of all the object point coordinates, and this concept is called the weak perspective projection. The weak perspective model approximates the perspective projection model; it is just a scaling of the x coordinate and the y coordinate.

(Refer Slide Time: 16:41)

So, the meaning is that it gives the perspective effect, but not over the scale of individual objects. In this case I collect the points into groups at about the same depth, as shown in the previous slide, and then divide each point by the depth of its group. It is the same thing: a scaling of the x coordinate and a scaling of the y coordinate. This weak perspective projection approximates the perspective projection.

219
(Refer Slide Time: 17:10)

Next I have considered the orthographic projection. In the orthographic projection, d tends to infinity, where d is the distance between the centre of projection and the image plane, and z tends to minus infinity (minus because I am considering the negative z direction), where z is the distance between the centre of projection and the object. In this limit −d/z tends to 1.

And if −d/z tends to 1, that means the scene coordinate (x, y, z) is mapped into (x, y) in the image plane. This is nothing but a simple 3D to 2D projection: the coordinate (x, y, z) is mapped into (x, y). This is called the orthographic projection, or the parallel projection. In this figure I have shown the scene coordinates (x, y, z), but in the image I have only the two coordinates (x, y); again, the point (x, y, z) is mapped into (x, y).

220
(Refer Slide Time: 18:18)

In this diagram also I have shown the same concept: the point (x, y, z) is mapped into (x, y), that is, the mapping from world coordinates to image coordinates. And what is the projection matrix in homogeneous coordinates? In homogeneous coordinates I can write the projection matrix in this form; that is the orthographic projection, and this is my image coordinate.

(Refer Slide Time: 18:40)

The concepts of the orthographic projection and the weak perspective projection are shown in this diagram. This is my image plane and this is my object. In the case of the weak perspective

221
projection, what am I getting? A scaling of the x coordinate and a scaling of the y coordinate; that is the weak perspective projection, and I am getting the corresponding image.

In the case of the orthographic projection, what is my assumption? The assumption is that the distance between the image plane and the centre of projection is very high, and the distance between the centre of projection and the object is also very high. In this case the point (x, y, z) is mapped into just the point (x, y). So, here the image plane is far away, both distances are very high, and I have to consider the orthographic projection. This is my orthographic projection.

(Refer Slide Time: 19:32)

222
And here I have given two examples: the first one is the perspective projection and the second one is the orthographic projection. I have shown one image corresponding to the orthographic projection; in this case there is no z information, because it is a mapping from (x, y, z) to (x, y).

And finally I want to show one example of the orthographic projection: the focal point is at infinity, the rays are parallel, and the rays are orthogonal to the image plane. In this case the mapping is from (x, y, z) to (x, y), so the z information is missing, and I am getting the orthographic view.

223
So, in this class I discussed the concept of projection. So, 3 projection techniques I have
discussed. One is the perspective projection, one is the weak perspective projection, another one
is the orthographic projection. So, these concepts are very important. So, in the next class I will
discuss the concept of camera calibration. So, camera has extrinsic parameters and the intrinsic
parameters. And based on this how to do the camera calibration? So, that concept I am going to
explain in the next class. So, let me stop here today. Thank you.

224
Computer Vision and Image Processing – Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, India
Lecture – 07
Image Formation - Geometric Camera Model - III

Welcome to the NPTEL MOOCS course on Computer Vision and Image Processing, Fundamentals and Applications. I have been discussing camera projection techniques. The first technique is the perspective projection, in which distant objects appear smaller. After this, suppose the relative depth of a group of points on a particular object is much smaller than the distance to the COP.

Then in this case I will be getting the weak perspective projection. The weak perspective projection approximates the perspective projection. Finally, I have another projection technique, the orthographic projection, which is nothing but a parallel projection: the (x, y, z) coordinate is mapped into the (x, y) coordinate in the image plane. Today I am going to discuss these projections, and finally I will discuss the concept of camera calibration.

So, camera has intrinsic and the extrinsic parameters. So, I have to estimate these parameters
for camera calibration. So, what are the intrinsic parameters and what are the extrinsic
parameters that concept I am going to discuss today and based on this I can do camera
calibration.

(Refer Slide Time: 02:07)

225
So, let us first revisit the projections that I discussed in my last class: the perspective projection, the weak perspective projection and the orthographic projection. The first one is the perspective projection, and the main idea is that distant objects appear smaller. In the figure I have shown the COP, the centre of projection, and the PP, the plane of projection, which is nothing but the image plane.

I have also shown the object (x, y, z), the camera reference frame, and d, the distance between the COP and the PP, which is nothing but the focal length. In these equations, the first transformation corresponds to the perspective projection.

In the second case I am scaling the x, y and z coordinates, and after the scaling I am getting the same result as the perspective projection, a concept I already discussed in my last class. So, what is the interpretation of this mathematics?

It means that a larger object further away, with its x, y and z coordinates each scaled, can have the same size in the image as a smaller object that is closer. In other words, distant objects appear smaller in the perspective projection.

226
(Refer Slide Time: 04:07)

So, in this figure I have shown this concept that is the distant object appears smaller. You can
see in this figure. So, I have shown the image plane so you can see this is the image plane and
I have shown the projection of the objects one is A another one is B another one is C like this
I have shown and you can see the projections in the image plane.

(Refer Slide Time: 04:31)

I have already discussed this concept, the weak perspective projection. The idea is that if the relative depth of points on a particular object is much smaller than the average distance to the COP, the centre of projection, then I will be getting the weak perspective projection, and the weak perspective projection approximates the

227
perspective projection. In the weak perspective projection it is nothing but a scaling of the x coordinate and a scaling of the y coordinate.

(Refer Slide Time: 05:07)

And finally another projection technique is the orthographic projection. So, it is nothing but
the parallel projection orthographic projection. The x, y, z point that is mapped into x, y point
in the image plane that is the parallel projection. So, you can see in this figure I have shown
the world coordinates and corresponding to this I have shown the image coordinates and this
is nothing, but the mapping of x, y, z coordinate into x, y coordinate in the image plane.

(Refer Slide Time: 05:39)

228
And here also I have shown the orthographic projection and in this case it is nothing but the
mapping from x, y, z into x, y coordinate that is the concept of the parallel projection the
orthographic projection.

(Refer Slide Time: 05:53)

And in this figure I want to show the distinction between the orthographic projection and the weak perspective projection. In the first case, the image plane is far away from the centre of projection, and I have the orthographic projection: the point (x, y, z) is mapped into the (x, y) coordinate in the image plane.

So the first one shown is the orthographic projection. In the case of the weak perspective projection, it is nothing but a scaling of the x coordinate and a scaling of the y coordinate, and that is the weak perspective image I have shown in the image plane.

229
(Refer Slide Time: 06:43)

And in this figure I have shown the perspective projection another one is the orthographic
projections.

(Refer Slide Time: 06:50)

In summary: in the perspective projection the 3D point is (x, y, z) and the corresponding 2D image position is (f·x/z, f·y/z), where f is the focal length. In my earlier figures I used d in place of f; now I am using f, the focal length, in this equation. So, corresponding to the perspective projection the image point is (f·x/z, f·y/z).

230
In the case of the weak perspective projection it is a scaling of the x coordinate and the y coordinate, and in the case of the orthographic projection the point (x, y, z) is mapped into the point (x, y) in the image plane. So, in summary, one is the perspective projection, one is the weak perspective projection, and one is the orthographic projection.
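To see the three models side by side, here is a tiny numerical sketch; the focal length, the average depth and the test point are arbitrary assumptions, and the sign convention of the optical axis is ignored for simplicity.

import numpy as np

f = 1.0                                   # focal length
z_av = 10.0                               # average depth of the object points

def perspective(X):                       # (x, y, z) -> (f*x/z, f*y/z)
    x, y, z = X
    return np.array([f * x / z, f * y / z])

def weak_perspective(X):                  # uniform scaling by f/z_av for the whole object
    x, y, _ = X
    return (f / z_av) * np.array([x, y])

def orthographic(X):                      # (x, y, z) -> (x, y)
    return np.asarray(X[:2], dtype=float)

X = np.array([2.0, 1.0, 10.5])            # a point at roughly the average depth
print(perspective(X), weak_perspective(X), orthographic(X))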

(Refer Slide Time: 07:55)

Now there is the concept of the vanishing point. What is the definition of a vanishing point? Here I have shown the image plane in the figure and also the COP, the centre of projection. Suppose I consider a point at infinity and look at the projection of this point in the image plane.

Then I get a point v, and that point is the vanishing point. Now let us consider another point, A; its projection in the image plane is A'. Suppose the point A is moved along this direction to a new position, which again has a projection in the image plane.

So I get the corresponding projection point, and as I keep moving the point further and further along this direction, the projection point moves closer and closer to the vanishing point v.

231
So, ultimately, as the point A moves to infinity along this direction, its projection becomes the vanishing point; that means the projection of a point at infinity is nothing but the vanishing point. And two parallel lines have the same vanishing point, as I show in the next figure.

(Refer Slide Time: 10:03)

In the next figure I have shown the vanishing point v, the COP, the centre of projection, and the image plane, and you can see that the two parallel lines have the same vanishing point v.

The ray from the COP through v is parallel to these lines. So, from this figure you can understand the concept of the vanishing point: two parallel lines have the same vanishing point.
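The vanishing point of a family of parallel 3D lines with direction (dx, dy, dz) is simply the image of that direction, so it can be sketched as below; the focal length and the direction values are illustrative assumptions.

import numpy as np

def vanishing_point(direction, f=1.0):
    """Image of the point at infinity of lines with this 3D direction.

    For a pinhole camera with focal length f looking along z, a line
    X(t) = A + t*D projects to (f*X/Z, f*Y/Z); letting t go to infinity the
    starting point A drops out and only the direction D = (dx, dy, dz) remains.
    """
    dx, dy, dz = direction
    if abs(dz) < 1e-12:
        return None                     # direction parallel to the image plane: no finite VP
    return np.array([f * dx / dz, f * dy / dz])

# Two parallel lines (same direction, different starting points) share one vanishing point.
print(vanishing_point([1.0, 0.5, 2.0]))   # [0.5  0.25]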

232
(Refer Slide Time: 10:54)

In this figure I have shown some real images and the corresponding vanishing points. The first one shows railway tracks: these are parallel lines, so the vanishing point is at this point. Similarly, in the second figure I have parallel lines on a plane, and the vanishing point is somewhere here.

So, I get the vanishing points corresponding to the parallel lines. The vanishing point is quite important; one application is robotic path planning, where a robot can determine the vanishing point and plan its path based on it.

233
(Refer Slide Time: 11:43)

And here also I have shown one vanishing point you can see the vanishing point here and
corresponding to the parallel lines.

(Refer Slide Time: 11:51)

And in this case I have shown one point perspective, two point perspective and the three
point perspective and in this case I have shown one vanishing point corresponding to one
point perspective. The next one is two vanishing points corresponding to two point
perspective and third one is three vanishing points corresponding to three point perspective
and you can see the horizon line you can see. So, I will define what is the horizon line in the
next slide.

234
(Refer Slide Time: 12:25)

So, here you can see the each set of parallel lines meets at different point that is the vanishing
point. So, if I consider suppose this is the parallel lines all these are parallel lines they will
meet at a vanishing point the vanishing point is suppose v1. Similarly, if I consider another
set of parallel lines suppose this set of parallel lines they will meet at another vanishing point.
The vanishing point is suppose v2 and this parallel lines are on the same plane.

Since all these parallel lines lie on the same plane, I get collinear vanishing points; that means the points v1 and v2 are collinear. The line joining v1 and v2 is called the horizon for that plane, and that is the definition of the horizon line.
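A small sketch of this construction, assuming the two vanishing points have already been located in the image (the pixel values below are hypothetical): in homogeneous coordinates, the horizon line is simply the cross product of the two vanishing points.

    import numpy as np

    v1 = np.array([320.0, 180.0, 1.0])   # first vanishing point, homogeneous (hypothetical)
    v2 = np.array([900.0, 195.0, 1.0])   # second vanishing point, homogeneous (hypothetical)

    horizon = np.cross(v1, v2)           # line through both points: l = v1 x v2
    # a homogeneous point p lies on the horizon when horizon . p == 0
    print("horizon line coefficients (a, b, c):", horizon)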

In this case, you can see I have two vanishing points, and one important application of the vanishing point is the detection of fake images. Suppose I consider copy-and-paste forgery, where one portion of an image is cut and pasted onto another image; then, based on the vanishing points, I can determine whether that image is a forged version of the original image.

The forging I am considering the copy and paste forgery. So I can determine the vanishing
points and based on the vanishing points I can say whether it is the original image or the
forged image that I can determine.

(Refer Slide Time: 14:20)

Now, what is camera calibration? Camera projection is nothing but a 3D to 2D projection: 3D world points are projected onto the 2D image plane. Camera calibration is the process of estimating the intrinsic and extrinsic parameters. So, what do you mean by intrinsic parameters? The intrinsic parameters deal with the camera's internal characteristics, such as the focal length of the camera, the distortion parameters of the camera and the image center. And what are the extrinsic parameters?

Extrinsic parameters describe the camera's position and orientation in the world. That means I can estimate the intrinsic parameters, like the focal length, the distortion parameters and the image center, and also the extrinsic parameters, that is, the position and the orientation of the camera with respect to the world coordinate frame.

So, in this case, calibration involves finding the quantities internal to the camera that affect the imaging process, such as the image center, the focal length and the lens distortion parameters; these are the internal parameters.

(Refer Slide Time: 15:49)

Now, this precise calibration is required for 3D interpretation of images, reconstruction of world models, and robot interaction with the world. I can give one example. In the real world I have two line segments with points A, B and C in the world coordinate frame; the ratio AB divided by AC is some ratio a divided by b, and suppose the angle between the segments is alpha.

Corresponding to this, I will get one image in the image plane, and in this case I am considering the ratio a by b and also the angle alpha. So, I have the image in the image plane corresponding to this scene in the world, and these characteristics, the ratio a by b and the angle alpha, should be preserved in the image.

So, in the image the angle should also be alpha, and the ratio between the corresponding points, A dash B dash divided by A dash C dash, should also be maintained. These characteristics should be preserved in the image. Now, in the case of robot interaction, the robot captures images.

The information available in the image should perfectly match the real-world scene; that is important for hand-eye coordination in robot interaction. I am repeating this: the robot takes images using the camera, and the information available in those images should perfectly match the real-world scene. For all these cases I need camera calibration.

(Refer Slide Time: 18:19)

So the parameters of the camera already I have mentioned the extrinsic parameters and the
intrinsic parameters of the camera and in this figure you can see that I have shown the camera
reference frame, I have shown the COP, COP is the center of projections and PP is the plane
of projections I have shown and our object is nothing, but the x, y, z is the object. So, what is
the extrinsic parameters?

Extrinsic parameters define the location and the orientation of the camera reference frame
with respect to the known world reference frame. I am repeating this so extrinsic parameters
define the location and orientation of the camera reference frame with respect to a known
world reference frame that is the extrinsic parameters and what about the intrinsic
parameters?

The intrinsic parameters link the pixel coordinates of an image point with the corresponding coordinates in the camera reference frame. So, what is the meaning of the intrinsic and the extrinsic parameters? I have several coordinate systems: number one is the object coordinate system, and then I have another coordinate system, the world coordinate system.

And because I have the camera, I have the camera reference frame, the camera coordinate system. Number four, I have the image plane, that is the image reference frame, and after this I have the pixel coordinates, the pixel reference frame. So, I have these reference frames: the object reference frame, the world reference frame, the camera reference frame, the image reference frame and the pixel coordinate frame.

So extrinsic parameters means that is the location and the orientation of the camera reference
frame with respect to the known world reference frame that is the extrinsic parameters and
the intrinsic parameters link the pixel coordinate of an image point with a corresponding
coordinate in the camera reference frame. So I have the pixel coordinates and I can find the
correspondence between the pixel coordinate and the camera reference frame.

So this is about the extrinsic parameters and the intrinsic parameters. In my next slide you
can understand this concept.

(Refer Slide Time: 21:07)

So, first one is I have the object reference frame. So here I have shown the object reference
frame so the coordinates are xb, yb and zb that is the object reference frame I have shown and
also I have shown the world reference frame that here I have shown the coordinate is xw, yw
and zw that is the world coordinate reference frame and I have the camera reference frame so
you can see the COP the center of projection here.

So, I have the camera reference frame xc, yc and zc. If I consider the image frame, the image is 2D, so the image reference frame has coordinates xf and yf. The object frame, world frame and camera frame are 3D, but the image reference frame is 2D, so I have only an x coordinate and a y coordinate.

And after this I have the pixel coordinates so corresponding to this image plane I have the
pixel so this corresponds to the pixel coordinates so you can see xp and the yp the pixel
coordinates. So, now I will show all this one by one. One is the object reference frame, one is
the world reference frame, one is the camera reference frame. One is the image reference
frame, one is the pixel reference frame.

(Refer Slide Time: 22:41)

The first one is the object coordinate frame. I am considering the notation X0, Y0, Z0, and it is useful for modeling objects. So, in this figure I have shown the object coordinate frame, which I have already shown as xb, yb and zb.

(Refer Slide Time: 23:05)

The next one is the world coordinate frame. So my notation is Xw, Yw and Zw. So, in the
figure you can see one is Xw, one is Yw and one is Zw that is the world coordinate frame and
it is useful for interrelating objects in 3D that is the world coordinate frame.

(Refer Slide Time: 23:32)

The next one is the camera coordinate frame. So the 3D coordinate system for the camera and
my notations are Xc, Yc and Zc for the camera coordinate frame. So, here in the figure you
can see one is xc, yc and the zc for the camera coordinate systems and it is useful for
representing objects with respect to the location of the camera.

(Refer Slide Time: 24:03)

The next one is the image plane coordinate because in case of the image plane I have the
CCD sensors the charge coupled devices are available and we have the 2D coordinate system
and in this case you can see my x coordinate is xf and y coordinate is yf and my notation is x,
y corresponding to the image plane coordinate frame. So, it describes the coordinate of the
3D points projected on the image plane.

So, that is the image plane coordinate frame.

(Refer Slide Time: 24:44)

And finally I have the pixel coordinates. So you can see the pixels and you can see the size of
the pixels also so if you see this is a pixel, this is the size of the pixels and corresponding to
this pixel my notation is x im and y im that is the pixel coordinate. So, this is the pixel
coordinate I have the x im and the y im in the pixel coordinate frame. So that means I have
five frames.

(Refer Slide Time: 25:13)

So, in summary, the first one is the object coordinate system, which is 3D; the next one is the world coordinate system, also 3D; and after this we have the camera coordinate system, also 3D. These are the object coordinates, world coordinates and camera coordinates.

The transformations among these involve the external parameters. After this I have the image plane coordinates, which are 2D, and the pixel coordinates, also 2D; these involve the internal parameters of the camera. The external parameters mainly relate the world coordinates and the camera coordinates.

(Refer Slide Time: 25:55)

So, the world to pixel coordinates if you see the world and the pixel coordinates are related
by some additional parameters. The additional parameters are the position and the orientation
of the camera, the focal length of the lens of the camera, the position of the principal points in
the image plane and also the size of the pixels.

So, these are the parameters so that is the relationship between world and the pixel
coordinates. So, for this I need the position and the orientation of the camera, the focal length
of the camera lens. The position of the principal points in the image plane and the size of the
pixels.

(Refer Slide Time: 26:40)

And already I have defined what are the extrinsic parameters and the intrinsic parameters?
Extrinsic parameters that define the location and the orientation of the camera reference
frame with respect to a known world reference frame that is the definition of the extrinsic
parameters and for intrinsic parameters, what is the definition of the intrinsic parameters?

The intrinsic parameters necessary to link the pixel coordinates of an image point with a
corresponding coordinates in the camera reference frame. So that is the meaning of the
intrinsic parameters.

(Refer Slide Time: 27:19)

Now, in the next slide you can see I have shown the extrinsic parameters and the intrinsic
parameters. So, I have shown the object coordinates, world coordinates, camera coordinates
and the image plane coordinates and the pixel coordinates. So, in this figure also you can see
the extrinsic parameters and the intrinsic parameters.

(Refer Slide Time: 27:42)

Now, in this figure I have shown how to estimate the extrinsic and the intrinsic parameters. For this, I have to determine a set of known correspondences between point features in the world and their projections on the image.

So, in this case you can see I am considering one calibration object here this is the calibration
object I am considering and in this case I have to estimate the intrinsic and the extrinsic
parameters by considering a set of known correspondence between point features that I have
shown in the calibration object in the world and their projections are on the image.

So, corresponding to this known feature point if you see this point I will be getting one image
in the image plane so I will be getting one image plane and corresponding to this features I
can find the correspondence between the world coordinate and the camera coordinates that I
can find. So, what is the calibration object?

The parameters of the camera are estimated using an object with known geometry. In this case I have considered a calibration object whose geometry I know, and from this I can estimate the parameters.

(Refer Slide Time: 29:18)

Now in this case first I am considering the concept of the extrinsic parameters. So, for this
what I have to do I have to develop some transformation between the unknown camera
reference frame and the known world reference frame. So, in this figure I have shown two
reference frames one is the world reference frames. So, first one is the world reference frames
suppose the center is C1 and another one is the camera reference frame.

And suppose its center is C2. So, if I want to align the camera reference frame with the world reference frame, what do I have to do? First, I have to translate the origin of the camera reference frame to the world reference frame so that both origins coincide exactly; that means I first do the translation operation so that the origin of the camera reference frame coincides with the origin of the world reference frame. After the translation, I have to rotate the camera reference frame so that it aligns with the world reference frame.

So, first I do the translation so that C1 and C2 overlap, and after this I do the rotation so that the camera axes align with the world axes. These are the operations: first the translation, then the rotation. This transformation is needed to align the two coordinate reference frames, the world reference frame and the camera reference frame.

(Refer Slide Time: 31:40)

In this figure also I have shown the same concept. So, for alignment of this two coordinate
systems first I am doing the translation and after this I am considering rotation. So that means
the 3D translation I am considering that is describing relative displacement of the origin of
the two reference frames because I have to overlap the origins C1 and C2 so for this I have to
do the translation.

And after this for aligning the axes of this two coordinate systems I have to do the rotation.
So for this I am considering the 3 by 3 rotation matrix I am considering. So you can see what
is the transformation from the camera to world. So Pw is the world point and Pc is the camera
point in the camera reference frame. So, first I have to do the translation of the world
reference frame and after this I have to do the rotation.

So, first I am doing the translation of the world reference frame and after this I am doing the
rotation of this frame world frame so that the world reference frame will be coinciding with
the camera reference frame.

(Refer Slide Time: 32:50)

So in this case the same thing I am showing here so Pc is the camera coordinates and this Pw
is the world coordinates and what transformation I have to do? I have to do the translation
first and after this I have to do the rotation. So, here I have shown the 3 by 3 rotation matrix
and I have shown the Pc this is the points of the camera the x coordinate, y coordinate and the
z coordinate I have shown corresponding to the camera coordinates.

And corresponding to the world coordinate I have shown the X coordinate, Y coordinate and
the Z coordinate. So, from this equation I can write in the matrix form like this so Xc, Yc, Zc
are the camera coordinates and I am considering the 3 by 3 rotation matrix you can see this is
the rotation matrix. The elements are r11, r12, r13 like this. It is a 3 by 3 matrix and Xw, Yw,
Zw these are the world coordinates. And I am doing the translation along the x direction,
translation along the y direction, translation along the z directions.
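A minimal sketch of this world-to-camera transformation in Python, with a hypothetical rotation and translation chosen only to show the mechanics of Pc = R(Pw - T):

    import numpy as np

    theta = np.radians(10.0)                       # hypothetical rotation about the y axis
    R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                  [ 0.0,           1.0, 0.0          ],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    T = np.array([0.5, 0.0, 2.0])                  # hypothetical translation vector

    Pw = np.array([1.0, 1.0, 10.0])                # a world point (Xw, Yw, Zw)
    Pc = R @ (Pw - T)                              # camera coordinates (Xc, Yc, Zc)
    print(Pc)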

(Refer Slide Time: 34:05)

The same thing I am showing here: Pc = R(Pw - T), where R is the rotation matrix already defined. The X coordinate of the camera point is Xc = R1 transpose (Pw - T); similarly, the Y coordinate is Yc = R2 transpose (Pw - T), and in the same way I get the Z coordinate. Here Ri transpose corresponds to the i-th row of the rotation matrix.

(Refer Slide Time: 34:53)

Now, for the intrinsic parameters, I have to consider the focal length of the camera, the transformation between the image plane coordinates and the pixel coordinates, and also the geometric distortions introduced by the optics.

So, these are the internal parameters of the camera: the focal length, which I need for the perspective projection equation; the transformation between image plane coordinates and pixel coordinates; and the geometric distortion introduced by the lens.

(Refer Slide Time: 35:40)

So, for camera coordinates to the image plane coordinates what transformation I have to do
already I have shown here. So, again I have to do the translation and also I have to do the
rotation so in this case already I have defined the camera coordinates Xc, Yc, Zc and the
world coordinates are I have defined that is Pw I have considered and I am doing the
translation and the rotation.

And if I consider perspective projection, the perspective projection equation is x = f Xc / Zc. Substituting the values of Xc and Zc from the previous equation into the perspective projection equation, I get the projection in terms of the world coordinates.

(Refer Slide Time: 36:36)

Next, I go from image plane coordinates to pixel coordinates. Here I consider ox, oy, the coordinates of the principal point; the principal point in the image plane can be taken as the center of the image. So, if the size of the image is N by M, then ox will be N divided by 2 and oy will be M divided by 2; that is, the principal point is at the center of the image.

And also I need to consider the effective size of the pixels so that I need to consider. So, in
this case I am considering sx and sy I am considering that corresponds to the effective size of
the pixels in the horizontal in the vertical directions that is in millimeters. So, that means in
this figure you can see I am considering the scene point the point is P in the world and I have
shown the image plane.

So, this is the image plane, and I have also shown the pixels, the principal point ox, oy, and the camera projection center; this is the camera reference frame. From this, the image plane coordinate x is obtained from x im minus ox, because I have to translate with respect to the principal point in the image, and similarly for the y coordinate. I also have to scale the x and y coordinates, because I am considering the effective size of the pixels. So, ox, oy is the image center and sx, sy correspond to the size of the pixels.

The negative signs in the equations are due to the opposite orientation of the x, y axes in the camera and the image reference frames. In the figure you can see the opposite orientations of these two coordinate systems: in the image the x and y directions point one way, but in the pixel coordinates the x and y directions point the opposite way.

So, for this I am considering the negative sign. So that is the intrinsic parameters I am
considering and from image plane coordinate to the pixel coordinate.

(Refer Slide Time: 39:19)

So, using matrix notation, since we have already defined these equations, I can write them in matrix form in terms of x im, y im, 1; here sx and sy appear because of the pixel size, which means I am doing the scaling.
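As a sketch, the equations just described can be inverted to go from image plane coordinates to pixel coordinates; the principal point and pixel sizes below are hypothetical values.

    ox, oy = 320.0, 240.0        # principal point in pixels (hypothetical)
    sx, sy = 1.0e-5, 1.0e-5      # effective pixel size in metres (hypothetical)

    def image_plane_to_pixel(x, y):
        # invert  x = -(x_im - ox) * sx  and  y = -(y_im - oy) * sy
        x_im = -x / sx + ox
        y_im = -y / sy + oy
        return x_im, y_im

    print(image_plane_to_pixel(0.0006, -0.0004))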

(Refer Slide Time: 39:38)

And finally the relating pixel coordinates with the world coordinate and already I have
defined this equations that is the perspective projection equations and that is the perspective
projection equations and we know this already I have defined about this the image and the
pixel coordinates. So from this you can get x im and the y im you can see this equations so I
will be getting x im and the y im from this equations.

(Refer Slide Time: 40:05)

I also have to consider image distortions due to the optics, that is, the lens. In this case I have shown radial distortion: one figure is in the world frame of reference and one is in the image. In the world frame of reference this is a circle, but in the image plane I am not getting a circle because of the radial distortion. Here x, y are the coordinates of the distorted pixels in the equation.

You can see x and y I am considering and I am showing that this is the equation of the circle
x square plus y square is equal to r square it is the equation of the circle I am considering. So
that means in the world it is a circle, but because of the radial distortions I am not getting the
circle in the image plane and what is x, y? X, y is the coordinates of the distorted points. The
distortion is a radial displacement of the image points.

So this is the distortion that distortion is called the radial distortion and the distortion is the
radial displacement of the image point. The displacement is null there is no displacement at
the image center and it increases with the distance of the point from the image center and the
optics introduces image distortions that become evident at the periphery of the image. So, this
distortion I can model by using this equations.

So, I am doing the corrections so x corrected and the y corrected I am getting by modeling of
this the radial distortions and I am considering the parameters k1, k2, k3 these are the
intrinsic parameters and if you see this image the optics introduces image distortions that
become evident at the periphery of the image. You can see in the periphery this distortion is
evident that is the radial distortion is evident.

And so you can see the k1, k2, k3 these are the intrinsic parameters of the camera. So that I
need to consider during the camera calibration.
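A common way to write the radial model with the coefficients k1, k2, k3 is sketched below; the coefficient values are hypothetical, and whether the polynomial maps distorted to corrected coordinates or the other way around depends on the convention adopted, so treat it as illustrative rather than the exact model on the slide.

    k1, k2, k3 = -0.30, 0.12, 0.0        # hypothetical radial distortion coefficients

    def correct_radial(x, y):
        # x, y: normalised image coordinates of a distorted point; r^2 = x^2 + y^2
        r2 = x * x + y * y
        factor = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
        return x * factor, y * factor     # x_corrected, y_corrected

    print(correct_radial(0.4, 0.3))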

(Refer Slide Time: 42:36)

And you can see here I have shown the distorted image because of the radial distortion and
after this it is corrected by using this equation that is I am doing the modeling. So, by using
this it is corrected the x coordinate is corrected and y coordinate is corrected and you can see
the corrected image here.

The second image is the corrected image the first image is the distorted image because of the
radial distortion. And you can see the distortion is evident at the periphery of the image. So,
at the periphery of the image the distortion is visible.

(Refer Slide Time: 43:16)

And there is another distortion due to the camera lens the distortion in the tangential direction
that is the tangential distortions. So, again in the figure I have shown this is in the world
frame of the reference and image I have shown this is the image plane of reference I have
shown and by using this equations I can do the corrections of the x coordinates and the y
coordinates.

I am not going to discuss about the tangential distortions, but these are the distortions one is
the tangential distortion another one is the radial distortions that is because of the optics.

(Refer Slide Time: 43:51)

After this I combine the extrinsic with the intrinsic parameters. I have shown the matrix M in, which contains the internal parameters of the camera, and the matrix M ex, which contains the external parameters; for the external parameters I consider the rotation and the translation. After combining these two, I relate the homogeneous image coordinates x, y, z with the world coordinates Xw, Yw, Zw.

And I am considering the matrix M int that is the internal parameters I am considering that is
the transformation for camera to image reference frame and I am considering M external that
is the external parameters of the camera extrinsic parameters of the camera that is world to
camera reference frame.

(Refer Slide Time: 44:52)

And if I use the homogeneous coordinate system I can write like this xh, yh, w and I am
considering the matrix M in and M ex and in this case you can see the elements of the matrix
M I am combining this two matrix one is the M in and another one is M ex I am combining
these two matrices and I am getting the matrix M and what are the elements of the M? m11,
m12, m13, m14 m21, m22, m23, m24.

These are the elements of the matrix and finally from the previous equations you can get the
matrix M and in this case you can see I am defining fx and the fy I am considering fx and fy.
What is fx? Fx is nothing, but f divided by sx. So fx is the focal length expressed in the
effective horizontal pixel size that is the focal length in the horizontal pixels and similarly I
can define fy.

So I am repeating this: fx is the focal length expressed in effective horizontal pixel size. The pixel sizes are sx and sy, so fx = f / sx is the focal length in horizontal pixels. Now I consider the projection matrix M, which is a 3 by 4 matrix. In the projection matrix I combine two matrices, the internal parameter matrix M in and the external parameter matrix M ex; combining these two, I get the projection matrix M. So, I have to estimate the extrinsic and the intrinsic parameters of the camera, and that is called camera calibration.

So all these parameters I have to determine the parameters are here you can see m11, m12,
m13 these are the parameters I have to define. And already I have defined that typically I can
consider the 3D objects of the known geometry with image features that I can consider for
calibration.

(Refer Slide Time: 47:23)

So, for camera calibration I will explain some methods, but before going to that, I first consider the homogeneous coordinate system: the pixel coordinates x im and y im are obtained by dividing by w, as in these equations. So, I combine the extrinsic parameters with the intrinsic parameters of the camera, and finally I get x im and y im using the homogeneous coordinate system.
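Putting the pieces together, a sketch of the full projection with hypothetical parameter values might look like this; the minus signs on fx and fy follow the axis-orientation convention mentioned above, while other texts use positive values.

    import numpy as np

    fx, fy = 800.0, 800.0                  # f / sx and f / sy, in pixels (hypothetical)
    ox, oy = 320.0, 240.0                  # principal point (hypothetical)
    R = np.eye(3)                          # hypothetical rotation
    T = np.array([0.0, 0.0, 0.0])          # hypothetical translation

    M_int = np.array([[-fx, 0.0, ox],
                      [0.0, -fy, oy],
                      [0.0, 0.0, 1.0]])
    M_ext = np.hstack([R, (-R @ T).reshape(3, 1)])   # world -> camera: [R | -R T]
    M = M_int @ M_ext                                # the 3 x 4 projection matrix

    Pw_h = np.array([0.3, 0.2, 4.0, 1.0])            # world point in homogeneous coordinates
    xh, yh, w = M @ Pw_h
    print(xh / w, yh / w)                            # x_im, y_im after dividing by w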

(Refer Slide Time: 47:58)

For calibration, as already defined, I need a calibration object, that means a 3D object of known geometry, located in a known position in space, from which I can extract image features that can be located accurately. Based on this calibration object I can do the calibration; that is the meaning of the calibration object.

(Refer Slide Time: 48:27)

In my next slide I have shown one calibration object: two orthogonal grids of equally spaced black squares can be considered as a calibration object. Assume that the world reference frame is centered at the lower left corner of the right grid, with axes parallel to the three directions identified by the calibration pattern.

So, I can show the world reference frame. You can see the world reference frame is this is
centered at the lower left corner of the right grid and with axes parallel to the three directions
identified by the calibration pattern.

(Refer Slide Time: 49:08)

From this I can obtain the 3D coordinates Xw, Yw, Zw. Given the size of the squares and the number of squares, all known by construction, the coordinates of each vertex can be computed in the world reference frame by simple mathematics.

(Refer Slide Time: 49:34)

And after this obtain 2D coordinates that is nothing, but x im and the y im I can obtain. The
projection of the vertices on the image can be found by intersecting the edge lines of the
corresponding square side or maybe by using the corner detection principles. So after
detecting the edges and the corners I can obtain the 2D image coordinates. So that means I
am considering the calibration object for camera calibration.

(Refer Slide Time: 50:08)

So, the problem of camera calibration is this: compute the extrinsic and the intrinsic camera parameters from N corresponding pairs of points. What are the corresponding points? I have the world coordinate points and the corresponding image coordinate points, and from these I have to estimate the extrinsic and the intrinsic camera parameters.

These corresponding pairs of points I can obtain from the calibrating object. There are many methods for camera calibration, but in this class I will briefly discuss only two approaches: one is the direct camera calibration technique and the other is the indirect camera calibration technique.

(Refer Slide Time: 50:58)

In indirect camera calibration, we estimate the elements of the projection matrix and, if needed, compute the intrinsic and the extrinsic camera parameters from the entries of the projection matrix. You have seen the projection matrix: it is a 3 by 4 matrix, and it incorporates both the internal parameters and the external parameters. So, what is M int?

This is M int I am considering the internal parameters in this case I am considering the focal
length the size of the pixels I am considering, but I am not considering the camera distortions
parameters. So, if I consider the camera distortion parameters the computation will be
difficult. So that is why I am not considering the camera distortion parameters, but here only I
am considering the focal length and the size of the pixels.

And in the matrix M ex I consider the rotation and the translation, as already explained; corresponding to these, I have the projection matrix M. So, in indirect camera calibration I have to estimate the elements of the projection matrix. Indirect camera calibration is somewhat easier compared to direct camera calibration.

(Refer Slide Time: 52:21)

Direct camera calibration is nothing but the direct recovery of the intrinsic and the extrinsic camera parameters; I can directly recover both sets of parameters of the camera. I will briefly explain both principles, the indirect camera calibration and the direct camera calibration.

(Refer Slide Time: 52:49)

The first one is the indirect camera calibration I am showing so here you can see this equation
already I have defined in my previous slide. So I have the projection matrix and the
projection matrix I have this one m11, m12, m13 these are the elements of the projection
matrix and you can see I have the image coordinates x and y I am considering. So, just I am
replacing x im y im with x, y for simplicity. So already I have defined this equations in my
previous slide.

(Refer Slide Time: 53:24)

In this projection matrix, if I divide every entry by m11, each element is divided by m11, which means I have only 11 independent entries. So, I need at least 11 equations for computing the projection matrix, and since each world-image point correspondence gives two equations, we need at least 6 correspondences.

And this projection matrix has 11 independent entries because the other elements are divided
by m11 and for this I need 11 equations for computing M.

(Refer Slide Time: 54:14)

And step 1 so already you have this equations and I have this correspondence one is the
world coordinates and the other one is the image coordinate I have this information and from
this equation you will be getting this one. If you see from this equation you will be getting
this you can see by simple mathematics you will be getting this one.

(Refer Slide Time: 54:38)

From these equations I get one homogeneous system of equations: the matrix A has 12 columns and one pair of rows per correspondence (2N rows in all), and I consider the solution of Am = 0. So, what is m here? m is a vector.

The elements of the vector are m11, m12, and so on up to m34, transposed. Since A has rank 11, the vector m can be recovered using a technique like the singular value decomposition (SVD).

The singular value decomposition is A = U D V transpose. The vector m can be recovered as the column of V corresponding to the zero singular value of A (in practice, the smallest singular value).

(Refer Slide Time: 56:19)

So, as already explained, the matrix A has rank 11, we need at least 11 equations (at least 6 correspondences), and I can apply the singular value decomposition to recover the vector m.
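A minimal sketch of this step is given below, under the assumption that at least 6 world-image correspondences are available; the exact arrangement of the rows of A may differ from the slides, but the idea of solving Am = 0 with the SVD is the same.

    import numpy as np

    def estimate_projection_matrix(world_pts, image_pts):
        # world_pts: N x 3 array of (Xw, Yw, Zw); image_pts: N x 2 array of pixel (x, y)
        rows = []
        for (X, Y, Z), (x, y) in zip(world_pts, image_pts):
            rows.append([X, Y, Z, 1, 0, 0, 0, 0, -x * X, -x * Y, -x * Z, -x])
            rows.append([0, 0, 0, 0, X, Y, Z, 1, -y * X, -y * Y, -y * Z, -y])
        A = np.array(rows, dtype=float)          # 2N x 12 homogeneous system A m = 0
        _, _, Vt = np.linalg.svd(A)
        m = Vt[-1]                               # right singular vector of the smallest singular value
        return m.reshape(3, 4)                   # the projection matrix M, defined up to scale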

(Refer Slide Time: 56:34)

And after this I can find the intrinsic and the extrinsic parameters, the step number 2. So, I
can find the values of m11, m22 like this I can find so this is the projection matrix.

(Refer Slide Time: 56:46)

Corresponding to the entries m11, m12, m13, and so on, I define the following vectors q1, q2, q3, q4: q1 = (m11, m12, m13), q2 = (m21, m22, m23), and similarly q3 = (m31, m32, m33), while q4 collects the remaining entries m14, m24, m34.

(Refer Slide Time: 57:10)

After this I get the solutions: I can determine ox, oy and also fx and fy using these equations. For the detailed solution you can see the book chapter (the book can also be downloaded from the internet); from it you get fx and fy, the focal lengths, which are intrinsic parameters of the camera. This is about the indirect camera calibration. Next, I am going to discuss the direct camera calibration.
calibration. The next I am going to discuss the direct camera calibration.

(Refer Slide Time: 57:56)

I have already defined the equation Pc = R(Pw - T), which can be written as Pc = R Pw - R T. Now, for simplicity, we replace -R T with T, so that Pc = R Pw + T, where Pc is the camera coordinates, R is the rotation matrix, Pw is the world coordinates and T is the translation vector.

So, I am considering the translations and the rotational matrix I am considering. So, this
transformation I am getting that is the transformation for the camera coordinates and the
world coordinates.

(Refer Slide Time: 58:46)

And from camera coordinates the pixel coordinates already I have obtained this equations in
my previous slide and from this I can get the relation between the world coordinates and the
pixel coordinates. So this I am getting. So, I am getting the relationship between the world
coordinates and the pixel coordinates you can see this equations.

(Refer Slide Time: 59:11)

And after this, the intrinsic parameters. What are the intrinsic parameters? The focal length; sx and sy, the size of the pixels; and ox and oy, the center point. You can see my matrix M in, which contains the internal, that is the intrinsic, parameters of the camera, and I can define four independent parameters.

I define fx = f / sx, the focal length in horizontal pixels; alpha = sy / sx, the aspect ratio; and the image center coordinates ox, oy, which I have already defined.

(Refer Slide Time: 59:59)

So, the main steps: assuming ox and oy are known, we estimate all the other parameters, and finally we can also estimate ox and oy. The detailed derivation you can see in the book, which, as I told you, can be downloaded from the internet, for both the indirect camera calibration method and the direct camera calibration method.

In my class I have explained the concept of camera calibration and why camera calibration is important, and after this I discussed the indirect and the direct camera calibration techniques.

(Refer Slide Time: 1:00:40)

Finally, some comments: to improve the accuracy of camera calibration, it is a good idea to estimate the parameters several times using different images of the calibrating object and average the results. Precise calibration also depends on how accurately the world and the image points are located, that is, on the localization errors.

(Refer Slide Time: 1:01:10)

In theory, direct and indirect camera calibration should produce the same results, but in practice we obtain different solutions due to different error propagations. One important point: indirect camera calibration is simpler and should be preferred when we do not need to compute the intrinsic and the extrinsic camera parameters explicitly. So, indirect camera calibration is simpler compared to direct camera calibration.

(Refer Slide Time: 1:01:39)

I have already explained this concept of how to do camera calibration. In this figure I am considering a calibrating object: I have the known 3D points of the object, and I compare their projections with the corresponding pixel coordinates of the points. I repeat this for many points and estimate the re-projection error.

So, repeat for many points and estimate the re-projection error I can estimate. So, in this case
I am doing the calibration by considering one calibrating object. So, in this figure also I have
shown the world frame, the image plane and the camera frame I have shown and by using the
calibrating objects I can do the camera calibration.

(Refer Slide Time: 1:02:29)

This is about camera calibration. Now I will discuss one or two issues of the digital camera. Here I have shown the digital camera sensor, that is, the image sensor that converts light photons into an electrical signal. So, I get an analog signal, which I can convert into a digital signal, that is, the digital image, by the process of sampling and quantization.

So, I am showing the image sensor of a digital camera; it may be a CCD, a charge-coupled device, or a CMOS, a complementary metal oxide semiconductor, sensor.

(Refer Slide Time: 1:03:17)

Some of the issues with digital cameras are as follows. The first one is noise: because of low light we will observe noise, and in the figure you can see the image is blurred and noisy because of the low-light conditions. If I compress the image, then I will get artifacts, for example with JPEG or JPEG 2000 compression.

So I will get artifacts because of the compression. Another issue is the color fringing artifact called chromatic aberration: rays of different wavelengths focus in different planes. This is also called purple fringing, and I will explain it later on.

So purple fringing, or chromatic aberration, means that rays of different wavelengths focus in different planes. Blooming is mainly charge overflowing into neighboring pixels. Also, over-sharpening can produce halos, as shown in the figure. For stabilization against camera shake, I can consider mechanical stabilization, mechanical compensation, or electronic compensation. So, these are the issues with digital cameras.

(Refer Slide Time: 1:04:56)

One effect I discussed in my radiometry class is the concept of vignetting. In that class I derived the radiometry for a thin lens: E = (pi d^2 / 4 f^2) cos^4(alpha) L, where E is the irradiance and L is the radiance. So, you can see that the irradiance is proportional to the scene radiance.

That means the gray value of an image depends on the radiance L, and the term cos^4(alpha) indicates a systematic optical defect of the lens, called the vignetting effect. So, the vignetting effect depends on the term cos^4(alpha). What is the interpretation of cos^4(alpha)? Optical rays with a larger span of angle alpha are attenuated more.

Hence, pixels closer to the image borders will be darker; you can see that the image pixels at the border are darker because of the vignetting effect. Optical rays with a larger span of angle alpha are attenuated more, and for this reason the pixels closer to the image borders are darker. This vignetting effect can be compensated by a radiometrically calibrated lens, and I have already discussed it in my radiometry class.
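A tiny numerical sketch of this cos^4(alpha) fall-off, with hypothetical lens diameter, focal length and radiance values, shows how quickly the irradiance drops towards the image border.

    import numpy as np

    d, f, L = 0.02, 0.05, 100.0              # lens diameter, focal length, radiance (hypothetical)

    def irradiance(alpha_deg):
        a = np.radians(alpha_deg)
        return (np.pi * d ** 2) / (4.0 * f ** 2) * np.cos(a) ** 4 * L   # E = (pi d^2 / 4 f^2) cos^4(a) L

    for alpha in [0, 10, 20, 30, 40]:
        print(alpha, irradiance(alpha) / irradiance(0))   # relative darkening away from the centre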

(Refer Slide Time: 1:06:52)

Chromatic aberration, by definition, means that rays of different wavelengths focus in different planes. Here I am showing different wavelengths, the red and the blue, focusing in different planes, and because of this I get a colored fringe at the boundary; that is the chromatic aberration, the purple fringing. So, this is mainly rays of different wavelengths focusing in different planes.

(Refer Slide Time: 1:07:25)

One application of this is image forgery detection based on chromatic aberration, that is, purple fringing. You can see the original image and the manipulated image; the duplicated regions were detected, because it is a copy-and-paste forgery in which one part of the image is taken from another image.

For that other camera, the purple fringing effect will be different from that of the original camera, and based on this property I can detect the image forgery. That means the pasted portion was taken with another camera that has different purple fringing, that is, chromatic aberration, characteristics, and based on these characteristics I can identify whether the image is the original image or a forged image.

So, this is one application of the image forgery detection based on chromatic aberration. So in
this class, I discussed the concept of camera calibrations. So, how to determine intrinsic
parameters and the extrinsic parameters of the camera. I have discussed two methods one is
the direct camera calibration method another one is the indirect camera calibration methods.
So, briefly I discussed how to do the camera calibrations.

And how to estimate the camera parameters and how to find a projection matrix. So, this is
about this camera calibrations and let me stop here today. Thank you.

Computer Vision and Image Processing – Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, India
Lecture – 08
Image Formation in a Stereo Vision Setup

Welcome to the NPTEL MOOCs course on Computer Vision and Image Processing: Fundamentals and Applications. Up till now, I have discussed the image formation concept for a single camera. A single camera setup has many disadvantages; one major disadvantage is its limited field of view. For some applications, like object detection or object tracking, we need a large field of view.

So, in a single camera based setup it is not possible to get the wide field of view the large
field of view. So, that is why we have to use multiple cameras instead of single camera. And
another problem is you know that in image formation system it is nothing, but the 3D to 2D
projection x, y, z coordinate is projected into x, y coordinate and because of this the depth
information is lost.

This is the concept of image formation. Because of this, if I consider only a single camera it is very difficult to get the depth information. I have discussed some techniques to get the depth information, the shape information; some cues like shape from shading (shading means the variable levels of darkness), and after this shape from texture, which I have highlighted.

Shape from motion or maybe the shape from focus and defocus images and shape from
contours. Depth estimation from a single camera is an ill-posed problem. So, for depth
estimation we need minimum two cameras. In a stereo vision setup that is the binocular setup
we have two cameras. In two cameras we have two images. One is the left image another one
is the right image.

From this two images I can get the depth information. So, today I am going to discuss about
the image formation principle in a stereo vision setup.

(Refer Slide Time: 02:25)

So, let us see what is the stereo vision setup? Now already I told you that image formation
process is nothing, but the 3D to 2D transformation. The 3D scene coordinate is transformed
into a 2D image coordinates. In case of the human visual system, human brain reconstruct the
3D information. In case of the computer vision we have to develop some algorithm so that
computer can determine the depth information. So, for this we can consider a stereo vision
setup.

(Refer Slide Time: 02:58)

Here I have shown the image formation concept that I have already explained, that is, the pinhole camera model; it is a very simple linear model and it corresponds to perspective projection. I have shown the pinhole camera and the image plane, and if I consider this object, I get its 2D image in the image plane. So, this is nothing but perspective projection.

(Refer Slide Time: 03:31)

Now in case of the stereo vision, the goal of the stereo analysis, we have the 2D image points
here. We have the 2D coordinates, the 2D coordinate I can write like this x, y so this is the
2D coordinate x, y and from the 2D coordinate I have to get the 3D coordinate of the scene
the 3D coordinate is x, y, z. So, I have to determine this.

This is the objective of the stereo analysis the goal of stereo analysis. From the 2D image
coordinate, I have to get the 3D scene coordinates. And in this case I need minimum two
images for this estimation the depth estimation the 3D information.

(Refer Slide Time: 04:21)

Here I have shown the 3D reconstruction: I have shown the left image and the right image, and from these two images I am determining the 3D information. As I already told you, we need a minimum of two images, so in this case I have considered a binocular setup.

So, in the binocular image setup I have two images one is the left image another one is the
right image. From these two images I am getting the 3D information the depth information I
can determine.

(Refer Slide Time: 04:53)

Let me show one diagram. Suppose this is the image plane and I have two points, P and Q. If I project them onto the image plane, I get the same projection point, P dash equal to Q dash, for both P and Q. So, the depth information is lost here.

The depth information d is lost in this case. But now consider a stereo vision setup, again with the two points P and Q. Corresponding to the point P I have one projection, P1 dash, and corresponding to the point Q I have another projection, Q1 dash.

I am considering two cameras, C1 and C2, which are the centers of projection. Corresponding to the second camera C2, I get the same projection point for both points, P2 dash equal to Q2 dash. So this is where the depth information comes from: the second setup is the stereo vision setup, the binocular setup.

You can see you can see in one image plane I am getting this projection in second image
plane I am getting this projection. So, from this information I want to get the depth
information. So, I can show how to get the depth information from these two images one is
the left image I can consider is a left image and this is the right image. So, C1 and C2 are
mainly the camera projection center C1 is the left camera and C2 is the right camera.

So, in this binocular stereo vision setup I have two images, and you can see the projections of the points P and Q in the left image and in the right image. In the right image I have the same point, P2 dash equal to Q2 dash, but in the left image the projection points are different. From this information I want to get the depth information; that is the fundamental concept of stereo vision.
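To make this concrete, a standard linear triangulation sketch is shown below; it assumes the two projection matrices are already known, for example from calibration as discussed earlier, and that (xl, xr) is a matched pair of projections of the same scene point.

    import numpy as np

    def triangulate(Ml, Mr, xl, xr):
        # Ml, Mr: 3 x 4 projection matrices of the left and right cameras
        # xl, xr: pixel coordinates (x, y) of the matched point in the two images
        A = np.array([xl[0] * Ml[2] - Ml[0],
                      xl[1] * Ml[2] - Ml[1],
                      xr[0] * Mr[2] - Mr[0],
                      xr[1] * Mr[2] - Mr[1]])
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        return X[:3] / X[3]                  # the reconstructed 3D point (X, Y, Z)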

(Refer Slide Time: 08:14)

So in this setup I have shown the image formation in a stereo vision setup. So this
configuration is called the epipolar geometry this is called the epipolar geometry. So, what is
the definition of the epipolar planes? So, here you see I have the left camera projection
center, this is the left camera projection center, this is the right camera projection center and
corresponding to this two cameras I have the image plane.

One is the left image plane this is the left image plane and this is the right image plane. So, I
have two image planes and I am considering the point the point is suppose M that is the scene
coordinate that is the 3D point I am considering the point is M. So, first I have to define the
epipolar plane. The epipolar plane is nothing, but the plane joining the points the camera
projection centers and the 3D point the 3D point is M that plane is called the epipolar plane.

This plane is called the epipolar plane. The plane corresponding to the camera projection
centers the left camera and the right camera and also the 3D point the 3D point is M and also
I have the image planes. So, I have two image plane one is the left image plane I have shown
already this is the left image plane and this is the right image plane. So now I want to define
the epipolar lines. So, I have two epipolar lines.

One is the left epipolar line and one is the right epipolar line. An epipolar line is the intersection of an image plane with the epipolar plane; that is the definition of the epipolar lines. Corresponding to this I also have the baseline, the line between the cameras.

So, I have the left camera this is my left camera and this is my right camera, this is the left
camera projection center. So the line between the left camera projection center and the right
camera projection center that is called the baseline. So, this line is called the baseline. So, I
have two epipoles one is the right epipole this is the right epipole another one is the left
epipole this is the left epipole.

An epipole is the intersection of an image plane with the baseline. So, I have two epipoles, the left epipole and the right epipole. The left epipole el in this diagram is the image of the right camera projection center, which is C2.

Here I am writing the right camera projection center as cr; so the left epipole el is the image of the right camera projection center cr in the left image plane, and similarly I can define the right epipole er. This camera configuration is called the general camera configuration; it is the stereo vision setup, and in this case the image planes are not parallel.
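This definition translates directly into a short sketch: assuming the two projection matrices and the camera centres are known (from calibration), each epipole is just the projection of the other camera's centre.

    import numpy as np

    def epipoles(Ml, Mr, Cl, Cr):
        # Cl, Cr: left and right camera projection centres in world coordinates (3-vectors)
        el = Ml @ np.append(Cr, 1.0)     # left epipole: image of the right centre in the left view
        er = Mr @ np.append(Cl, 1.0)     # right epipole: image of the left centre in the right view
        return el[:2] / el[2], er[:2] / er[2]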

(Refer Slide Time: 12:35)

Next, we can see the same configuration I have shown here the epipolar geometry I have
shown. So this points P1, P2, P3, P4, Pw these are the points that is the object points in the
world these are the objects basically object points in the world and I have the left camera
projection center Ol and the right camera projection center Or and already I have defined the
epipolar plane.

The plane through the points Ol, Or and Pw is the epipolar plane. I have two image planes, one is the left image plane and one is the right image plane, and I have shown the projections of the points. Corresponding to the points P1, P2, P3, P4, Pw, I have only one projection point in the left image plane.

In the right image plane, however, you can see separate projections of the points P1, P2, P3 and P4. So I have shown the projections of the points in both image planes, and in this case I have also shown the epipolar lines, the left epipolar line and the right epipolar line, as well as the epipoles.

One is the left epipole el and the other is the right epipole er. So, you can see the image formation in a stereo vision setup: I have shown the two images, the left image and the right image. In the left image I have shown the projection of the points P1, P2, P3, P4 and Pw, and in the right image I have shown the projections of the points P1, P2, P3, P4. From these two images I want to get the depth information.

(Refer Slide Time: 14:31)

So, to get the depth information I have to match the points, one point in the left image with the corresponding point in the right image. In this diagram I have shown the matching of the points: the green point in the left image is matched with the green point in the right image, and similarly the yellow point and the red point are matched.

So, I have to do the matching, and after the matching I can find the correspondence between the two images, the left image and the right image. In this case you can see that the epipolar lines are not parallel, and therefore, if I want to find the correspondence between the points, I have to search both along the horizontal direction and along the vertical direction.

Suppose I want to find the correspondence for the green point; that means I have to search along the horizontal direction and along the vertical direction. In two directions I have to search to find the correspondence between the two images, the left image and the right image.

(Refer Slide Time: 15:51)

But in this configuration the epipolar lines are now parallel. I am making the epipolar lines parallel, and this configuration is called the canonical stereo configuration. In this case you can see that the 2D search problem is converted into a 1D search problem.

So, you have to search only along the epipolar lines: the red point is searched along its line, the green point along its epipolar line, and the yellow point along its epipolar line. That means the 2D search is converted into a 1D search. This is called image rectification.

(Refer Slide Time: 16:31)

So, here I have shown the same concept: the 2D correspondence problem is converted into a 1D search problem, which reduces the computational complexity. In this configuration, the general camera configuration that I already described, the epipolar lines are not parallel.

So, in this case, if I want to find the correspondence between the left image and the right image, I have to search along the horizontal direction and along the vertical direction; that is, I have to do a 2D search.

(Refer Slide Time: 17:09)

Suppose I am considering this configuration: this is one camera projection center C1 and this is C2, I have shown the two image planes, and the epipoles are e1 and e2. This configuration is called the canonical stereo configuration. In this configuration the optical axes of the cameras are parallel, the epipolar lines are parallel in the images, and the epipoles move to infinity.

You can see that the epipoles e1 and e2 move to infinity, and the baseline is aligned with the horizontal coordinate axis. So, this configuration is called the canonical stereo configuration, and in this case the computation is simpler. I can convert the general camera configuration, which I have already defined, into this canonical configuration.

By applying a suitable geometric transformation, I can convert the general camera configuration into the canonical camera configuration; this is called image rectification. Image rectification thus converts the search problem from a 2D search into a 1D search.

So, I have shown how to convert the general camera configuration into the canonical camera configuration by a geometric transformation, and that concept is called image rectification. After image rectification the epipolar lines are parallel and the epipoles move to infinity, and in that case the 2D search problem is converted into a 1D search problem. This search is important for finding the correspondence between the left image and the right image.
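
As a rough illustration of how such a rectifying transformation is obtained in practice, here is a minimal OpenCV sketch; this calibrated-rectification route and all numeric values are my own illustrative assumptions, not something prescribed in the lecture. It assumes the intrinsic matrices, distortion coefficients, and the rotation R and translation T between the cameras are already known from calibration.

```python
import cv2
import numpy as np

# Placeholder calibration data (assumed known from a prior calibration step).
image_size = (640, 480)
K1 = K2 = np.array([[500.0, 0.0, 320.0],
                    [0.0, 500.0, 240.0],
                    [0.0, 0.0, 1.0]])
d1 = d2 = np.zeros(5)                        # assume no lens distortion
R = np.eye(3)                                # relative rotation between the cameras
T = np.array([[0.1], [0.0], [0.0]])          # 10 cm horizontal baseline

# Rectifying rotations R1, R2 and new projection matrices P1, P2.
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, d1, K2, d2, image_size, R, T)

# Warp each image so that its epipolar lines become horizontal scanlines.
left_img = np.zeros((480, 640), np.uint8)    # stand-ins for the captured images
right_img = np.zeros((480, 640), np.uint8)
m1x, m1y = cv2.initUndistortRectifyMap(K1, d1, R1, P1, image_size, cv2.CV_32FC1)
m2x, m2y = cv2.initUndistortRectifyMap(K2, d2, R2, P2, image_size, cv2.CV_32FC1)
left_rect = cv2.remap(left_img, m1x, m1y, cv2.INTER_LINEAR)
right_rect = cv2.remap(right_img, m2x, m2y, cv2.INTER_LINEAR)
```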

(Refer Slide Time: 21:33)

In this case I have shown two setups: the first is the general camera configuration, in which the epipolar lines are not parallel, and the second is the canonical configuration, in which the epipolar lines are parallel. Now, I have to apply some transformation to convert the general camera configuration into the canonical configuration.

Here I show what has to be done: I consider a 3 by 3 homography matrix. By using this transformation I can transform the coordinates of the original image plane; for this I am considering the 3 by 3 homography matrix.
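
To make the role of this 3 by 3 homography concrete, here is a minimal sketch of how a pixel coordinate of the original image plane is transformed in homogeneous coordinates; the matrix entries are made up purely for illustration.

```python
import numpy as np

# Hypothetical rectifying homography (illustrative values only).
H = np.array([[1.02, 0.01, -5.0],
              [0.00, 1.00,  2.0],
              [1e-5, 0.00,  1.0]])

x = np.array([120.0, 80.0, 1.0])              # pixel (u, v) written as (u, v, 1)
xp = H @ x                                    # transformed homogeneous coordinate
u_new, v_new = xp[0] / xp[2], xp[1] / xp[2]   # divide by the third component
print(u_new, v_new)
```

Applying the same mapping to every pixel (for example with OpenCV's warpPerspective) produces the rectified image in which corresponding epipolar lines become aligned.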

(Refer Slide Time: 22:46)

In this diagram I have shown the stereo images before and after rectification. In the first case the epipolar lines are not parallel, and correspondingly the image planes are also not parallel; in the second case, after rectification, the epipolar lines are parallel. So, in the first case, if I want to find the correspondence between the images, I have to search along both the horizontal and the vertical directions.

Suppose I want to find the correspondence between two points: in the first case I have to search the image in two dimensions, but in the second case I only have to search along a particular line, and that line is the epipolar line. That means the 2D search is converted into a 1D search because of image rectification.

(Refer Slide Time: 23:39)

Here again I have shown another example, before rectification and after rectification, of how to find the correspondence between the two images, the reference image and the target image.

(Refer Slide Time: 23:51)

Now, here I have shown the elementary stereo geometry in the rectified configuration, and from this configuration I want to determine the depth information. Let us see this configuration. I have two cameras with projection centers C and C'; this is the left camera and this is the right camera, and the scene point is P = (x, y, z).

The distance between the two camera projection centers, the baseline, is B, and there is one image plane corresponding to the left camera and one corresponding to the right camera. The image coordinate of the point in the left image is u and in the right image it is u', and I have considered the focal length f.

From this configuration I want to find the depth information. This process is called triangulation. Triangulation is the process of obtaining a 3D scene point from its projected points in the two image planes. So, from the two images I want to recover the 3D scene point, that is, its x, y and z coordinates.

(Refer Slide Time: 25:25)

I am showing the same configuration here. Now I define a term called the disparity: the disparity in the horizontal direction between the points u and u' is u - u'. In this diagram you can see similar right-angled triangles, one is the triangle UCB and the other is the triangle CPA, and the scene point has coordinates (x, y, z).

Suppose this side is the positive x direction and the left side is the negative x direction for this point. If I consider the two similar right-angled triangles, UCB and CPA, I can write u / f = -(h + x) / z. Why is that? The distance from the origin of the x axis, taken here midway between the two cameras, to the camera center is h, and the distance from the origin to the point is x, and in this case the displacement is measured in the negative direction.

Because the displacement is in the negative direction, I include a minus sign: the relevant horizontal distance is (h + x), taken with a negative sign, which gives u / f = -(h + x) / z. Similarly, I can consider the other pair of similar right-angled triangles.

For the other pair of triangles I get u' / f, where f is the focal length, and the relevant distance, measured in the positive direction, is (h - x); so the second equation is u' / f = (h - x) / z. From these two equations, obtained from the two pairs of similar right-angled triangles, I can determine z, the depth information.

Combining the two equations gives z = bf / d. Here b is the baseline, the distance between the two camera projection centers (b = 2h in this figure), f is the focal length of the camera, and d is the disparity, the horizontal displacement between u and u'. That is the depth information. If the disparity is larger, the depth is smaller; for example, an object very close to the camera has a large disparity.

And correspondingly its depth is small. So the disparity is nothing but the horizontal displacement of the matching points. After determining the depth z, you can determine the remaining coordinates, the x coordinate and the y coordinate. So you can see how to obtain the depth information in this stereo vision setup.
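
A minimal numerical sketch of these triangulation relations is given below. The baseline and focal length are assumed values, and for simplicity the left camera center is taken as the origin (a common simplification, slightly different from the mid-baseline origin used in the derivation above).

```python
def triangulate(u_left, v_left, u_right, b=0.12, f=700.0):
    """Recover (X, Y, Z) of a scene point from a rectified stereo match."""
    d = u_left - u_right          # horizontal disparity in pixels
    Z = b * f / d                 # depth from z = b*f/d
    X = u_left * Z / f            # remaining coordinates from the pinhole
    Y = v_left * Z / f            # relations u = f*X/Z and v = f*Y/Z
    return X, Y, Z

# Disparity of 20 pixels with b = 0.12 m and f = 700 px gives Z = 4.2 m.
print(triangulate(u_left=50.0, v_left=20.0, u_right=30.0))
```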

(Refer Slide Time: 29:01)

Now I have to find the disparity between the two images. Up till now I have discussed the stereo vision setup, the concepts of the epipolar plane and the epipolar lines, and the general camera configuration. After image rectification I get the canonical camera configuration, in which the epipolar lines are parallel.

That is, after rectification the 2D search problem is converted into a 1D search problem. After this I discussed how to get the depth information from the stereo vision setup. The next thing to consider is determining the disparity from the two images, the left image and the right image; for determining the disparity I have to find the matching points.

So, first I have to find the matching between the left image and the right image. After doing the matching, I can go for the reconstruction of the 3D information. From the two images I find the disparity pixel by pixel, and if I find the disparity for all the pixels of the images I get the disparity map.

From the disparity map I can determine the depth map, the depth information; that is the reconstruction. So, here I have shown the two steps: matching, which is finding the correspondence between the two images, and reconstruction.

(Refer Slide Time: 30:35)

For the matching I have to assume that the camera model parameters are known. I have already discussed the camera parameters: the extrinsic (external) parameters, that is, the position and orientation of the camera with respect to the world coordinates, and the intrinsic (internal) parameters, such as the focal length, the image center and the lens distortion. I consider these parameters to be known.
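
As a small illustration of those internal parameters, the intrinsic matrix K collects the focal length and the image center; the numbers below are placeholders, and lens distortion is normally described by a separate set of coefficients.

```python
import numpy as np

f = 800.0               # focal length in pixel units (assumed)
cx, cy = 320.0, 240.0   # principal point / image center (assumed)

# Intrinsic (internal) camera matrix of the pinhole model.
K = np.array([[f, 0.0, cx],
              [0.0, f, cy],
              [0.0, 0.0, 1.0]])
```

The extrinsic (external) parameters would then be a rotation matrix R and a translation vector t giving the pose of each camera with respect to the world coordinates.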

(Refer Slide Time: 31:05)

Here I have considered two images, the left image and the right image, and I want to show the disparity. In diagram (c) I have shown the overlap between these two images; I simply overlap them and you can see the disparity between the two images. The yellow lines show the disparity, and this portion has been magnified in the next image.

So, you can see the disparity here; the yellow arrows show the disparity between the two images, where the left image is shown in (a) and the right image in (b). In figure (e) I have shown the disparity in terms of gray-level intensity values: low gray-level values correspond to low disparity values, whereas high gray-level values correspond to high disparity values.

One point I have explained is the concept of disparity: objects nearer to the camera undergo a larger shift than more distant objects, and this is reflected in the disparity values. Objects nearer to the camera have high disparity values, while objects farther from the camera have low disparity values. That means an object nearer to the camera has a small depth, and an object farther from the camera has a low disparity value and hence a large depth.

So, you can see the relationship between disparity and depth; the same thing is shown here with a color map of the disparity values.

(Refer Slide Time: 32:56)

In this figure I have shown the same concept again. This portion is binocularly visible, that means it is visible from both cameras. I have two cameras: this is the left camera projection center and this is the right camera projection center. So, this portion is binocularly visible, but these other portions are only monocularly visible.

I also have the virtual image planes, the left image plane and the right image plane. Corresponding to the scene point P there is a value pL in the left image and a value pR in the right image; in the right image it is 3. From these you can find the horizontal displacement pR - pL, which is 6; that is, the disparity is 6.

From this disparity we can determine the depth information by using the equation I have already explained: z = bf / d, where d is the disparity and b is the distance between the camera projection centers. Using this equation you can find the depth information.
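
Just to plug numbers into z = bf/d for the disparity of 6 found in this example; the baseline and focal length below are assumed values, not quantities given in the figure.

```python
b = 0.10        # assumed baseline in metres
f = 600.0       # assumed focal length in pixel units
d = 6.0         # disparity pR - pL from the example above
z = b * f / d   # depth: 0.10 * 600 / 6 = 10.0 (in the same units as b)
print(z)
```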

(Refer Slide Time: 34:09)

This disparity map is quite useful for several applications; one application I am showing here is image segmentation using depth. In this example I show image segmentation using 3D information, that is, using the disparity information. Image segmentation is nothing but the partitioning of an image into connected homogeneous regions.

In this case I have considered two images, the left image and the right image. In (c) I have shown the disparity map obtained from these two images, then the result of color-based image segmentation, and in (e) the result of disparity-map-based segmentation. From the disparity we can do the segmentation because the depths of the objects are different.

If I consider the depth of one object and the depth of another, they are different, and based on this disparity information we do the segmentation; this gives accurate segmentation using three-dimensional information.

(Refer Slide Time: 35:16)

Here I have shown that after determining the disparity map we can find the 3D map; that is, the disparity map can be converted into a 3D map. If we know the geometry of the imaging system, that is, parameters like the baseline, the camera focal length and the pixel size, then I can get the 3D information from the disparity map; that is called reconstruction.

(Refer Slide Time: 35:43)

Now, for stereo correspondence, as I already told you, I have to find the correspondence between the two images, the left image and the right image; that is, I have to match points between them. This is called stereo correspondence, and for the stereo vision setup I am showing the same thing here.

This is my left image and this is my right image. Suppose I have two scene points, P and Q. Corresponding to P I have the projection P1' in the left image and P2' in the right image, and corresponding to Q I have the projections Q1' and Q2'.

In this case I have to find the correspondence between P1' and P2', and similarly between Q1' and Q2'. For finding the correspondence, one important requirement is that there should be no self-occlusion.

If I want to find the correspondence there should not be any self-occlusion, and another important issue is constant image intensity: if the image intensity is constant, it is very difficult to find the correspondence between the points. So, for a flat object, a non-textured object, a white object, or a surface of uniform brightness, it is very difficult to find the correspondence between the points.

I can give one example of self-occlusion. Suppose I consider an object with a shape like this; I have the left image and the right image, and a portion of the object is not visible from the cameras. That is one example of self-occlusion.

In this case it is very difficult to find the correspondence between the images. I am considering this object, and these points are not visible to the left camera, but the right camera may see them. So, in this case it is very difficult to find the correspondence between the images.

I can show this in the diagram I have already drawn: C1 is one camera projection center and C2 is the other. One portion is binocularly visible, while other portions are only monocularly visible. I can show another example with two cameras, C1 and C2.

In this example, one portion of the object is occluded in the left image because it is not visible to camera C1, and another portion is not visible to the right camera, so it is occluded in the right image. This illustrates how difficult it can be to find the stereo correspondence between the two images.

So, one problem is self-occlusion and another problem is constant image intensity: for a flat object, a non-textured object, a white object, or an object with uniform brightness, it is very difficult to find the correspondence between the images.

(Refer Slide Time: 42:27)

In this case I have shown two images, the left image and the right image. A point in the first camera image plane is denoted by x and is represented by (x1, x2, 1) in the homogeneous coordinate system, and the corresponding point in the second image plane is represented by x' = (x1', x2', 1), also in homogeneous coordinates.

The homography matrix H_pi relates the two: I can apply a transformation so that x is mapped to x' by the homography matrix. Here I have shown one example with the left image and the right image.

In this case I have also shown the common field of view, the portion that is visible to both cameras. So, I have shown one example of the common field of view.

(Refer Slide Time: 43:39)

As I already told you, disparity means the horizontal displacement between corresponding points, and after determining the disparity we can determine the depth information. This is my left image, this is my right image, and this is the disparity map. For all the pixels of the image I determine the disparity, and from these values I obtain the disparity map.

So, for all the pixels of the left image and the right image I find the disparity; from this I get the disparity map, and from the disparity map I can do the 3D reconstruction.
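
As a practical sketch of this dense pipeline, the snippet below uses OpenCV's semi-global block matcher as a stand-in for the matching methods that will be discussed later; the parameter values, baseline and focal length are illustrative assumptions, and the inputs are expected to be rectified grayscale images.

```python
import cv2
import numpy as np

# Stand-ins for rectified grayscale stereo images.
left_rect = np.zeros((480, 640), np.uint8)
right_rect = np.zeros((480, 640), np.uint8)

# Block matcher: the disparity search range must be a multiple of 16.
stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
disp = stereo.compute(left_rect, right_rect).astype(np.float32) / 16.0  # fixed-point output

# Convert the disparity map to a depth map with z = b*f/d (assumed b and f).
b, f = 0.12, 700.0
valid = disp > 0
depth = np.zeros_like(disp)
depth[valid] = b * f / disp[valid]
```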

(Refer Slide Time: 44:20)

In this diagram I have shown the common field of view: I consider the two cameras, the left camera and the right camera, and I have marked the portion that forms the common field of view.

(Refer Slide Time: 44:30)

Here I have shown the same example again: the left input image and the right input image. I am showing some corner points, marked as red points, and I want to find the correspondence between these points. So, in this case I am finding the correspondence between the two sets of points.

That is, within the common field of view I am finding the correspondence between the points of the left image and the right image. Now, let us consider the matching problem. For the matching problem I will consider two approaches: the pixel-based approach and the feature-based approach.

If I want to find the correspondence between the left image and the right image, I can compare pixel intensity values and find the correspondence based on them; that is called the pixel-based method. The other method is the feature-based method, in which I extract some important features from the images and do the matching based on these features. I am going to explain these two approaches.

(Refer Slide Time: 45:44)

Now, there are many challenges for matching, such as image noise, differing gains and contrasts of the cameras, and also issues like perspective distortion, occlusion and specular reflections. I am going to discuss these challenges one by one.

(Refer Slide Time: 46:04)

The choice of camera setup is very important. The baseline is the distance between the projection centers of the left camera and the right camera. With a small baseline the matching is easier, while with a large baseline the depth precision is better.

(Refer Slide Time: 46:27)

I also have to consider some assumptions for stereo matching, such as the epipolar constraint, the uniqueness constraint, minimum and maximum disparity limits, the ordering constraint and the local continuity constraint. These are the constraints, that is, the assumptions I have to consider for stereo matching, and I will explain them one by one.

(Refer Slide Time: 46:52)

The first constraint is the epipolar constraint. Under this constraint the matching point of a pixel in the left image lies on the corresponding epipolar line in the right image; as I have already explained, this converts the 2D search problem into a 1D search problem, so I have to look for the correspondence along the epipolar line only.

This is called the epipolar constraint. Next I consider the uniqueness constraint: in most cases a pixel from the first image corresponds to only one pixel in the second image. This is true in most cases; opaque objects satisfy this constraint, but transparent objects violate it.

This is because many points in three-dimensional space may be projected onto the same point in an image plane, and in that situation the condition is not satisfied. I will give another example in which the uniqueness constraint is not satisfied.

(Refer Slide Time: 48:09)

I am showing one diagram here. Suppose I have a surface, and I have the left image and the right image. Corresponding to each surface point I have a projection: this is my left image, and along this line of sight each of these surface points has its projection.

This is my right camera and right image. These surface points all project to the same point in the left camera, because they lie along a single ray, but in the right camera they project to different points. So, for the left camera I get one projection, while for the right camera these points appear as different image points.

In this case it is very difficult to find the correspondence between the left image and the right image, and this situation is very similar to the case of self-occlusion that I explained earlier. As per the uniqueness constraint, in most cases there exists at most one matching pixel in the right image corresponding to each pixel in the left image; that is the uniqueness constraint.

(Refer Slide Time: 50:30)

The next constraint is the photometric compatibility constraint: the intensity values of the pixels in a region of the left image and in its corresponding matching region of the right image differ only slightly. This slight difference in intensity is due to the different camera positions from which the images are captured. This is fairly obvious, and it is called the photometric compatibility constraint.

Another one is the geometric similarity constraint. As per this constraint, the geometric characteristics of the features, such as the length or orientation of a line, or the contours and regions found in the first and second images, do not differ much. That is, the features available in the first image and those available in the second image are geometrically similar; that is the geometric similarity constraint.

(Refer Slide Time: 51:27)

Very important is the ordering constraint. This constraint says that for regions having similar depth, the order of the pixels in the left image and the order of their matching pixels in the right image are the same. Here I have shown two cases, (a) and (b), illustrating the ordering constraint in two scenarios. In the first example the ordering constraint is fulfilled. Why is it fulfilled?

Because the points A, B, C and their corresponding matching points A', B' and C' follow the same spatial order. That means the points A, B, C in the left image and A', B', C' in the right image follow the same spatial order; that is the ordering constraint. In the second case, however, the constraint fails, as the order of the points A and B in the left image is different from the order of their corresponding matching points A' and B'.

So, in the second example the ordering constraint is not satisfied. This is the illustration of the ordering constraint in the two scenarios.

(Refer Slide Time: 52:44)

The next one is the disparity continuity constraint. As per this constraint, there is an abrupt change in the disparity value at object boundaries, that is, at edges and boundaries, whereas the disparity values do not change significantly over smooth regions. This is quite obvious: at boundaries or edges the disparity changes abruptly.

Over a smooth region, on the other hand, the disparity values do not change significantly. This is called the disparity continuity constraint. Another one is the disparity limit constraint. This constraint imposes a global limit on the maximum allowable disparity value between the stereo images. It is mainly based on psycho-visual experiments, which show that the human visual system can fuse the stereo images only if the disparity values do not exceed a particular limit.

This is called the disparity limit constraint. These constraints, these assumptions, are quite important, because based on them we are going to find the correspondence between the images.

(Refer Slide Time: 54:05)

Next, we will discuss some issues related to accurate disparity estimation.

(Refer Slide Time: 54:15)

One important issue is occlusion. Here I have given one example of occlusion. Occlusion arises mainly because of the different fields of view of the cameras. In this example, one portion is visible only in the left image and not in the right image, while another portion is visible in the right image but not in the reference image.

In that case it is very difficult to find the correspondence for that portion of the image. Another case is the overlapping of different objects located at different distances from the camera; where objects overlap, it is also very difficult to find the correspondence for that portion of the image, because that portion is occluded.

(Refer Slide Time: 55:17)

The next issue is photometric variation. The optical characteristics of the two cameras may differ slightly, and this leads to photometric variations between the stereo image pair. In this example you can see the photometric variations between the left image and the right image, particularly in the highlighted portion, because the optical characteristics of the two cameras differ slightly.

(Refer Slide Time: 55:45)

Now let us consider another problem, image sensor noise. Because of noise in the camera, noise is present in the image, and then it is also very difficult to find the correspondence between the images. One technique we can apply is preprocessing of the image, that is, removing the noise by filtering.

(Refer Slide Time: 56:06)

The next point is specularity and reflections. I have already explained what specular reflection is: for a specular surface, the radiance leaving the surface depends on the viewing angle. So the disparity map obtained from the stereo image pair may not give the actual information for specular surfaces. In this example I have shown specular surfaces, and in this case also it is very difficult to find the correspondence between the images.

(Refer Slide Time: 56:37)

I am giving another example here where you can see a specular surface. A specular surface is nothing but a mirror-like surface, and in this case also it is very difficult to find the correspondence between the images.

(Refer Slide Time: 56:51)

One important effect is the foreshortening effect. I have explained the concept of the foreshortening factor, which is cos(alpha) or cos(theta). What is the effect? The appearance of an object depends on the direction of the viewpoint, a concept I have explained already. Hence, an object may appear compressed and occupy a smaller area in one image as compared to the other image.

In this example, the same portion of the scene occupies a smaller area in one image and a larger area in the other, and in this case it is very difficult to find the correspondence between the images.

(Refer Slide Time: 57:34)

The next problem is perspective distortion. Perspective distortion is a geometric deformation of an object and its surrounding area. It arises mainly because of the projection of a three-dimensional scene onto a two-dimensional image plane, that is, the 3D-to-2D projection.

Because of this, the perspective transformation makes an object appear larger or smaller than its original size. In this image one portion looks small while another portion looks large because of the perspective distortion.

(Refer Slide Time: 58:13)

Another problem is textureless regions. In this example I have shown some textureless regions, and for those portions of the images it is very difficult to find the correspondence.

(Refer Slide Time: 58:28)

Here I am giving one example of repetitive structures. You can see that this structure is repeated, and in this case also it is very difficult to find the correspondence between the points of the left image and the right image.

(Refer Slide Time: 58:43)

The next point is discontinuity. What is discontinuity? Generally we assume that the surfaces present in the scene are smooth. However, this assumption is not fulfilled when multiple objects are present in the scene; because of this we have discontinuities between the different objects in a scene.

There is an abrupt change in the disparity value in the boundary regions; this is called discontinuity, and it arises because of the multiple objects present in the scene. In this example I have marked the discontinuous region; this is one of the problems in finding the disparity map. So, up till now I have discussed the concept of stereo vision.

First I discussed the general camera configuration, in which the epipolar lines are not parallel. After this I discussed the concept of image rectification: after rectification the epipolar lines are parallel, and the 2D search problem is converted into a 1D search problem. So, first I considered the general camera configuration and then the canonical camera configuration.

In the canonical configuration the epipolar lines are parallel, so the 2D search problem is converted into a 1D search problem. After this, I discussed the concept of the disparity map: how to find disparity maps, and how the depth information can be determined from them. The depth is very important. For all the pixels of the left image and the right image I can find the disparity values.

From the disparity values I can build the disparity map, and from the disparity map I can determine the depth map. After this, I discussed some assumptions used to find the correspondence between the left image and the right image, like the epipolar constraint and the continuity constraint; we discussed many such constraints. Finally, I discussed some problems in finding the matching points for stereo correspondence, like occlusion and perspective distortion.

These problems are quite important. In the next class I am going to discuss the concept of stereo matching: one approach is the pixel-based approach and the other is the feature-based approach. I will discuss these concepts in the next class. Let me stop here today. Thank you.

Computer Vision and Image Processing – Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, India
Lecture – 09
Image Reconstruction from a Series of Projections

Welcome to the NPTEL MOOCs course on Computer Vision and Image Processing, Fundamentals and Applications. In my last class, I discussed the concept of stereo vision, that is, image formation in a stereo vision setup. In that discussion, I described two camera configurations: the general camera configuration and the canonical camera configuration.

In the general camera configuration the epipolar lines are not parallel, but in the canonical configuration the epipolar lines are parallel. I can convert the general camera configuration into the canonical camera configuration by a transformation, and that is called image rectification. After image rectification the 2D search problem is converted into a 1D search problem, and that is the advantage of image rectification.

After this, I discussed the concept of the disparity map: we can find the disparity by finding the horizontal displacement of corresponding pixels in the left and right images, and if I compute the disparity values for every pixel of the images I can determine the disparity map, and from the disparity map the depth map. I also discussed some constraints, that is, assumptions used in finding the disparity map.

I also discussed some problems in finding the matching, such as noise, perspective distortion, foreshortening and specular surfaces. All these cases I covered in my last class. Today I am considering the second part of the stereo vision setup, that is, how to find the matching between the left image and the right image, which is called stereo correspondence.

After finding the stereo correspondence, I have to determine the disparity map, and from the disparity map I can determine the depth map. For matching the left image and the right image there are mainly two approaches: the pixel-based method and the feature-based method. In the pixel-based method, I compare the pixel intensity values of the left image and the right image.

Based on this comparison I can find the matching pixels of the left image and the right image, and from this we actually get a dense disparity map. In the second approach, I can select some image features and find the correspondence between the two images based on these features. This method gives a sparse disparity map.

So, the pixel-based method gives a dense disparity map, while the feature-based approach gives a sparse disparity map. In the pixel-based method I consider a particular window, called the search window, and within this window I find the corresponding pixels. Let us now discuss these two approaches, the pixel-based approach and the feature-based approach.

(Refer Slide Time: 03:30)

In the last class I showed the disparity map; the disparity is nothing but the horizontal displacement between corresponding points. Here I have shown the left image, then the right image, and then the disparity map. This disparity map is determined from the horizontal displacement between corresponding points.

(Refer Slide Time: 03:52)

As I already told you, there are two approaches for stereo matching: the pixel-based (or area-based) approach and the feature-based approach. In the pixel-based approach I compare the pixel intensity values of the left and right images to find the corresponding pixels, and in the feature-based technique I extract some features and do the matching based on these features.

First let us consider the pixel-based, that is, the lower-level method: the pixel-based method for matching the right image and the left image.

(Refer Slide Time: 04:32)

I also discussed the matching challenges: problems like perspective distortion, image noise, differing camera gains, specular reflections and occlusion. These cases I have already discussed in the context of stereo matching.

(Refer Slide Time: 04:49)

I also considered some assumptions, that is, constraints like the epipolar constraint, the uniqueness constraint, the maximum and minimum disparity values, the ordering constraint and the local continuity assumption. These assumptions are considered for stereo matching between the left image and the right image.

(Refer Slide Time: 05:10)

In pixel-based correspondence I search for the corresponding pixels. Suppose I have a pixel in the left image and I want to find the corresponding pixel in the right image; for this I have to do the matching, and I consider a window, the search window, within which I look for the corresponding pixel in the right image.

So, for a pixel in the left image I find the corresponding pixel in the right image by searching within this window. The window is shifted step by step, and I look for the corresponding pixel in the right image.

(Refer Slide Time: 05:50)

I am showing the same thing here, the pixel correspondence between the left image and the right image; I consider one window and do the searching within it. That is the intensity-based method.

(Refer Slide Time: 06:04)

In the intensity-based method we can use similarity measures like SAD (sum of absolute differences), SSD (sum of squared differences), or a measure like the cross-correlation. Based on these we can do the matching between the left image and the right image. Let me show you how the matching is done.

(Refer Slide Time: 06:30)

Suppose I have two images, f and g, and I want to find the correspondence between them. For this I consider the window-based method: I take a particular window W, and the pixels (i, j) are taken within this window, where the window means a particular region. The first measure I can consider is the maximum absolute difference over the window, max over (i, j) in W of |f(i, j) - g(i, j)|.

I can use that measure, or I can use another one, the sum over (i, j) in W of |f(i, j) - g(i, j)|. The first one is nothing but the chessboard distance and the second one is the Manhattan (city-block) distance; you already know about these distances. I can also consider a third distance.

This third measure, the sum over (i, j) in W of (f(i, j) - g(i, j))^2, is called SSD, the sum of squared differences; it corresponds to the Euclidean distance, which is very popular. Using these measures we can find the corresponding pixels. The cross-correlation between the two images, again computed over the window, is the sum over (i, j) in W of f(i, j) g(i, j).

So I have defined the chessboard distance, the Manhattan distance, the Euclidean distance (SSD) and the cross-correlation. Now, what is SSD? SSD is nothing but the summation of (f - g)^2, which I can expand as the summation of f^2 + g^2 - 2 f g.

In this expression the f^2 term and the g^2 term can be treated as constants over the window, and the remaining term is nothing but the cross-correlation; that means minimizing the SSD essentially amounts to maximizing the cross-correlation. By using these measures we can find the similarity between the pixels and, based on that, do the matching.
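
A minimal NumPy sketch of this window-based search along a (rectified) scanline, using the SSD measure just described, is given below; the window size and search range are arbitrary choices.

```python
import numpy as np

def best_disparity(left, right, row, col, win=3, max_d=32):
    """Disparity that minimises SSD between the (2*win+1)^2 patch around
    (row, col) in the left image and shifted patches in the right image."""
    patch_l = left[row - win:row + win + 1, col - win:col + win + 1].astype(np.float64)
    best, best_cost = 0, np.inf
    for d in range(max_d):
        c = col - d                                 # search along the same row
        if c - win < 0:
            break
        patch_r = right[row - win:row + win + 1, c - win:c + win + 1].astype(np.float64)
        cost = np.sum((patch_l - patch_r) ** 2)     # SSD; use np.abs for SAD
        if cost < best_cost:
            best, best_cost = d, cost
    return best

# Toy usage: the right view is the left view shifted 5 pixels to the left.
L = np.random.randint(0, 255, (100, 120)).astype(np.uint8)
R = np.roll(L, -5, axis=1)
print(best_disparity(L, R, row=50, col=60))         # expected output: 5
```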

(Refer Slide Time: 10:20)

The next point I want to show you is feature-based matching. In feature-based matching we can consider features like edge points, lines, corner points or boundary pixels; by using these features we can compare the left image and the right image, and in this case I get a sparse disparity map.

In the pixel-based method we get a dense disparity map, but here I get a sparse disparity map. Let me give one example of the feature-based technique.

(Refer Slide Time: 10:55)

For example, suppose I consider a stereo pair: the left image contains some objects, and I also have the right image. From these images I can extract features, such as edge points.

I can also consider line segments or corner points. Based on these features I can define a similarity measure S. One term is (theta_l - theta_r), where l refers to the left image and r to the right image; here theta is the orientation of a particular line, theta_l being its orientation in the left image and theta_r its orientation in the right image.

Each term carries a weight: w1 for the orientation term, w2 for a term (l_l - l_r)^2 involving the lengths of the line, and w3 for a term (i_l - i_r)^2 involving intensities, each difference being squared, so the similarity measure takes a form like S = w1 (theta_l - theta_r)^2 + w2 (l_l - l_r)^2 + w3 (i_l - i_r)^2. The first quantity is the orientation of a particular line, theta_l in the left image and theta_r in the right image.

Similarly, l_l and l_r are the lengths of a particular line in the left and right images, and i_l and i_r are the average intensities along that line in the left and right images. In this way I can compare features using this similarity measure; a small sketch of such a score is given below. This image matching problem is quite important for applications such as image fusion.

In my first class I explained the concept of image fusion; the image matching problem is quite important for applications like image fusion, where we have, for example, multi-focus or multi-exposure images. Another application is image mosaicking, which means combining multiple images.

Image mosaicking is constructing, from multiple images of the scene, a larger image; the output of the image mosaic is the union of the input images. As for image fusion, suppose I have a multi-focus image and a multi-exposure image, or an image from infrared imaging and another from the visible region of the spectrum, that is, a visible image.

In that case I can fuse the two images, keeping the important information from both and discarding the redundant information; that is the concept of image fusion, while in image mosaicking the output is mainly the union of the two input images. For these applications we have to do image matching.

(Refer Slide Time: 15:58)

In the intensity-based method, as I have explained, we compare pixel values; we get a dense disparity map, we need textured images, and the method is sensitive to illumination changes. The feature-based method, on the other hand, is relatively insensitive to illumination changes because we consider only the features, not the raw pixel values.

With the feature-based method we get a sparse disparity map, and because we compare only features, this method is faster than the correlation-based method. So you can see the comparison between the intensity-based method and the feature-based method.

(Refer Slide Time: 16:37)

If I consider three or more viewpoints, I get more information for the matching, but I also have to consider additional epipolar constraints. If I use more viewpoints, that is, more cameras, the common field of view also increases.

(Refer Slide Time: 16:52)

In summary, I have shown here the left image and the right image obtained from the stereo cameras, the left camera and the right camera. After this we have to do camera calibration, which means the estimation of the intrinsic and extrinsic parameters of the cameras. After camera calibration, we get the stereo image pair.

The left image and the right image correspond to the general camera configuration. After this I can do image rectification, which converts the 2D search problem into a 1D search problem. After the image rectification we do the stereo correspondence, that is, the stereo matching.

I have already explained the two methods, the pixel-based method and the feature-based method, by which we can find the correspondence between the left image and the right image. After determining the disparity map, the next step is to determine the depth map from it, and finally this depth map can be used in applications.

In this case I have shown one application, robotic vision, where this depth map can be used. So, this is the summary of the discussion. First we do the camera calibration, the estimation of the intrinsic and extrinsic parameters; then we take the stereo image pair and rectify it by the image rectification procedure.

After this we find the disparity map; for this we have to find the disparity between corresponding pixels, and if I determine the disparity values for all the pixels of the image I can build the disparity map. After finding the disparity map, we can determine the depth map, and the depth map can then be used in any application.

(Refer Slide Time: 18:50)

In summary, we have considered stereo vision and two problems within it, the matching problem and the reconstruction problem; for matching we have considered two approaches, the area-based approach and the feature-based approach. This is about the stereo vision concept.

(Refer Slide Time: 19:11)

The next topic in this class is image reconstruction from a series of projections, which is a very important topic. It is nothing but obtaining a 3D representation of a particular object from a number of projections. We take a number of projections, and from these projections I can get a 3D representation of the object for better visualization.

This concept is applied in the CT scan, where CT means computed tomography. X-ray imaging, if you look at it, is nothing but a 3D-to-2D projection, which means we lose information; but in a CT scan we can get a 3D representation of the internal part of an organ, or of some object, from the projections. For this I will be discussing one concept called the Radon transform.

Using the Radon transform we can find the projections at different angles, and after getting the projections we have to do the back-projection, that is, the inverse Radon transform, so that we can reconstruct the object. So, first comes the projection using the Radon transform, and then the inverse Radon transform for reconstruction. This principle is called image reconstruction from a series of projections, and the main application I have highlighted is the CT scan.
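
A hedged sketch of this projection and back-projection idea using scikit-image is shown below (assuming a reasonably recent version of the library is available); the Shepp-Logan phantom and the 180 projection angles are just a convenient test setup.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon

image = shepp_logan_phantom()                    # standard CT test image
angles = np.linspace(0.0, 180.0, 180, endpoint=False)

sinogram = radon(image, theta=angles)            # forward projections (Radon transform)
reconstruction = iradon(sinogram, theta=angles)  # filtered back-projection (inverse)

print(sinogram.shape, reconstruction.shape)
```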

(Refer Slide Time: 20:43)

Here you see non-intrusive medical diagnosis; the first example I have shown is the X-ray. An X-ray is nothing but a 3D-to-2D projection. In the X-ray image, the pixel intensity mainly depends on the X-ray energy reaching the detector, which in turn depends on the absorption of the X-rays by the object.

This is my object, and the intensity of the pixels of the X-ray image depends on the amount of X-ray energy absorbed by the object. That means it is nothing but a 3D-to-2D projection. In the second case, I have the object, the X-ray source, and the detectors, the photodetectors.

In this case the source is rotating and the detector is also rotating, so I get a number of projections; I can take 10 projections, 15 projections, and so on. From these projections I want to get the 3D representation of the object.

Here you can see a computer for processing, and then the reconstructed cross-section, that is, the 3D representation of the object obtained from the projections. If you compare the X-ray image with this reconstructed image, you can see visually that the second image has more information than the first. So the second principle is CT scanning, where CT means computed tomography. This is the principle of the X-ray and the CT scan.

(Refer Slide Time: 22:41)

Here also I am showing the same principle. The source is rotating and also the detector is
rotating then in this case I will be getting number of projections. From these projections, my
objective is to do the reconstruction that is after reconstruction it is nothing, but the 3D
representation of the object.

So in this arrangement, in this configuration, I am getting a number of projections because the
source is rotating and the detector is rotating. And in this case I will be getting a number of
projections like this: this is one projection and this is another projection, and like this I will be
getting a number of projections.

(Refer Slide Time: 23:19)

The same principle I am showing here the source is rotating and you can see the detector is
also rotating. This is my source and the detector is rotating. So, in this case you can see I am
getting one projection and suppose it is rotated by a particular angle and corresponding to this
I am getting another projection like this I am getting the number of projections. This is
mainly the rotation of the source and the detectors.

This arrangement if I compare this arrangement another arrangement shown in the b the
second arrangement. In this case the source is not moving instead I am considering the fan
beam emission the beam is rotating. This is the source the x-ray source now the beam is
rotating and this is my object and this is the detector. So, in this case also I am getting
number of projections.

The first configuration and the second configuration the concept is very similar, but the
principle is slightly different. In the second case the source is not moving, but instead we are
considering the fan beam emission. So, the beam is rotating and in this case I am getting
number of projections. So, from the projections my objective would be 3D representation of
the object that is 3D reconstruction.

(Refer Slide Time: 24:37)

Now let us consider the Radon transform. What is Radon transform? In this case I am
showing the object first after this the coordinate system corresponding to the object is x and y
coordinate. One coordinate is x another coordinate is y so this is corresponding to the object.
This is the direction of the x-ray I am considering the x-ray in this direction. So, in this case I
am considering another coordinate system that is s and u.

In polar coordinate I am considering s and theta. So one is s another one is theta. So s theta
represents the coordinates of the x-ray relative to the object. So, in this diagram I have shown
two coordinate system one is x and y coordinate system another one is the s and theta
coordinate system. What is s and theta means I am considering s and theta coordinate system
and another one is x, y.

In polar coordinate is s theta in rectangular coordinate in Cartesian coordinate it will be s u.


So, I have two coordinate system one is x, y coordinate system another one is the s theta
coordinate system. So s theta represents the coordinate of the x-ray relative to the object.
Now in this case you can see corresponding to this x-ray I am getting the projection, the
projection is represented by g s theta.

So what is the image reconstruction problem? The image reconstruction problem is to
determine f(x, y) from g s theta. What is g s theta? g s theta is the projection corresponding to
the object; the object is f(x, y), and in this case I am getting the projection g s theta. So, it is a
linear transformation, the projection from f(x, y) to g s theta, and in this case we are
considering a particular angle, the angle theta.

At a particular angle theta I am determining the g s theta, g s theta is the projection that
means I am getting one projection. If I change theta then in this case I will be getting another
projection so like this I can get number of projections. So fix theta to get 1D signal that is g
theta S that means I am getting one projection that is g s theta corresponding to a particular
angle.

Now what is g s theta? G s theta is nothing, but it is the line integral f x, y along the direction
the direction is in the direction of u along this direction that is the definition of the g s theta.
The projection of an object f x, y along a particular line, line is I am considering this line
suppose line number is 1 and second line is number 2 I have two lines 1 and 2. So, projection
of an object f x, y along a particular line is given by g s theta so it is f x, y du that is the line
integral.

And already I told you what is the image reconstruction problem? The image reconstruction
problem is determination of f x, y from g s theta that is the image reconstruction problem.
Now the concept of the Radon transform is like this the value of the 2D function at an
arbitrary point is uniquely obtained by the integrals along the lines of all directions passing
through the points.

So that means in this case I am considering the line integral. Now what is the equation of the
line the line number is 1 suppose the equation of the line number 1 is x cos theta plus y sin
theta is equal to 0. In this case what is the line 1? Line passing through the origin and whose
normal vector is in the theta direction. The normal vector is this is the normal vector whose
normal vector is in the theta direction.

What is the line number 2? The line number 2 is the line whose normal vector is in the theta
direction and whose distance from the origin is s. So, corresponding to line number 1 the
equation is x cos theta plus y sin theta = 0; that is the equation of the line, and corresponding
to this the s value is 0. I am considering a particular theta, and the projection is g(0, theta).

The projection g(0, theta) is the double integration from minus infinity to plus infinity of
f(x, y), the object, times the delta function of (x cos theta + y sin theta), with respect to dx dy:

g(0, θ) = ∫∫ f(x, y) δ(x cos θ + y sin θ) dx dy

Why am I considering the delta function? The delta function is the Dirac delta function, whose
basic property is ∫ δ(s) ds = 1, with δ(s) nonzero only at s = 0 and zero everywhere else.

So, in this case I am considering this Dirac delta function because I want to determine the
integration only along this particular line. Along this line x cos θ + y sin θ = 0, so the argument
of the delta function is zero and the delta picks out exactly the points on the line. That is why I
am considering the Dirac delta function.

What is the equation of the second line? For line number 2 you can see
(x − s cos θ) cos θ + (y − s sin θ) sin θ = 0, and simplifying, I will be getting the equation of
the second line. The equation of the second line is x cos θ + y sin θ − s = 0. So this is the
equation of the second line. The second line is the line
whose normal vector is in the theta direction.

And whose distance from the origin is s so that means this line is in the theta direction and
what is the distance from the origin? The distance from the origin is s so equation of the line
number two is x cos theta plus y sin theta minus s equal to 0 that is the line number 2.

(Refer Slide Time: 30:30)

Corresponding to line number 2, I have g s theta:

g(s, θ) = ∫∫ f(x, y) δ(x cos θ + y sin θ − s) dx dy

where I am again considering the Dirac delta function. This expression is called the Radon
transform. So, corresponding to line number 2, this is the Radon transform.

And already I have define the Radon transform for 1 that is g 0 theta corresponding to the line
number 1 and if I consider the line number 2 the g s theta is this expression that is the Radon
transform. This g s theta I can display as an image and that is something like this I can put
like an image so it is s and theta suppose theta maybe 0 degree, 90 degree, 180 degree. So, I
can plot this one that is theta versus s I can plot like an image.

And this is called the Sinogram g s theta is displayed as an image. So, this is called the
Sinogram. Now in this case I am considering one case that means suppose already I have
explained the rotation case. Rotation of a particular point or vector suppose so one vector is
suppose x, y and suppose this is r at an angle theta 0 and this vector is rotated x dash y dash
the new position is x dash y dash this vector is rotated and this angle is theta.

So, corresponding to this rotation of the vector, what is my transformation matrix? The
transformation matrix you can determine as [cos θ  −sin θ; sin θ  cos θ]; this is my
transformation matrix for the rotation. So you know about this rotation matrix. Now suppose I
consider axes rotation: one is the x, y coordinate system and I have another coordinate
system, s and u.

So, how I am getting the s and u coordinates system by rotation of the x, y coordinate system
by an angle theta. So, I have shown the theta here this is theta. So by rotation I am getting the
s and u coordinate system. So, I have two coordinate system one is x and y coordinate system
another one is the s and u coordinate system. Corresponding to this, my transformation matrix
the rotation matrix will be simply cos theta sin theta minus sin theta cos theta you can verify
this one.

In the first case I am considering rotation of vector in the second case I am considering the
axes rotation. So, here I have shown I have the x, y coordinate system. How to get the s and u
coordinate system? You can see here in this diagram also x and y coordinate and I have the
another coordinate system and that is s and u. So how to get s and u coordinate system? If I
rotate x, y coordinate system by an angle theta then in this case I will be getting s and u
coordinate system that is the axes rotation.

So corresponding to this, this is the transformation equation; the transformation matrix is
[cos θ  sin θ; −sin θ  cos θ]. From this equation you can see what the value of s will be:
s = x cos θ + y sin θ and u = −x sin θ + y cos θ, and inversely x = s cos θ − u sin θ and
y = s sin θ + u cos θ.

Now based on this I want to verify whether x cos theta plus y sin theta minus s should be
equal to 0 because you know that corresponding to the line number 2 it is x cos theta plus y
sin theta is equal to s this is the equation of the second line the line number 2 so that I want to
verify. So you know x cos theta plus y sin theta minus s that should be equal to 0. From this
you can see x cos theta plus y sin theta minus s should be equal to 0.

So, if I put these values here, putting in the value of x and the value of y, I get
(s cos θ − u sin θ) cos θ + (s sin θ + u cos θ) sin θ − s = s (cos²θ + sin²θ) − s = 0. That means
my equation is correct: x cos θ + y sin θ − s = 0 is the equation of line number 2.

(Refer Slide Time: 35:16)

Now, in the transition from the x, y coordinate system to the s and u coordinate system there
is no expansion and no shrinkage; that means the area element stays the same, dx dy = ds du.
So that means in this case we have the Radon transform equation:

g(s, θ) = ∫∫ f(x, y) δ(x cos θ + y sin θ − s) dx dy

This is called the Radon transform, and corresponding to line number 2 my equation is
x cos θ + y sin θ − s = 0, and we have the values of the x and y coordinates:
x = s cos θ − u sin θ and y = s sin θ + u cos θ.

So, putting all these values into the Radon transform equation for g s theta, I am just
substituting the values of x and y; the argument x cos θ + y sin θ − s becomes 0, so the delta
function selects exactly the points on the line. Also, I am changing dx dy into ds du, because
the transition from the x, y coordinates to the s, u coordinates yields no expansion or
shrinkage, so dx dy = ds du. Just putting ds du here, I have the final expression:

g(s, θ) = ∫ f(s cos θ − u sin θ, s sin θ + u cos θ) du

This equation is called the Ray Sum equation. This is a very important equation; it means I am
getting one projection along this particular line, line number 2. That means I am just doing the
summation along this particular line. So this is the Ray Sum equation.
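To make the Ray Sum concrete, here is a small numerical sketch of my own (not from the lecture): the projection at one fixed angle is obtained by rotating the object f(x, y) so that the beam direction lines up with one image axis and then summing along that axis. The test object and angles are arbitrary choices.

import numpy as np
from scipy.ndimage import rotate

def ray_sum(f, theta_deg):
    # Rotate f by -theta so the integration direction u becomes the row axis,
    # then the line integral reduces to a column-wise sum.
    f_rot = rotate(f, -theta_deg, reshape=False, order=1)
    return f_rot.sum(axis=0)   # one value of g(s, theta) per s sample

f = np.zeros((128, 128))
f[50:80, 60:90] = 1.0          # simple test object f(x, y)
g_0 = ray_sum(f, 0.0)          # projection for theta = 0
g_45 = ray_sum(f, 45.0)        # projection for theta = 45 degrees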

(Refer Slide Time: 37:36)

Now this diagram I have shown that is one example of the image Radon transform this is
called the Sinogram. So corresponding to this image, I have the Radon transform the Radon
transform you can display like an image and this is called the Sinogram.

(Refer Slide Time: 37:58)

Now representation in polar coordinate system. So how to represent in the polar coordinate
system? So we have the x and y coordinate systems and suppose I am considering this line
and in this case you know s coordinate and the theta coordinate you know that is the direction
of the normal so theta is this and suppose if I consider this is the point P and suppose if I
consider this vector the vector is r.

So what will be x? If I take the projection here, x will be r cos φ; I can determine the x
component, and also I can determine the y component, r sin φ. And what is s? My equation is
s = x cos θ + y sin θ; putting in the value of x, r cos φ, and the value of y, r sin φ, finally I will
be getting s = r cos(θ − φ). So corresponding to this I have this relation.

So, this equation that is r cos theta minus phi I am showing here r cos theta is this equation I
am showing this diagram I have shown that is the cosine function I am showing cos theta
minus phi. So, I can write this the point P the point is mapped into a sinusoid in the s theta
plane. So, in this case I have shown this is the s theta plane this is s and this is theta. So that
means the point P is mapped into a sinusoid in the s theta plane that is the case I am showing
the mapping.

So, the point P is mapped into a sinusoid, the sinusoid s = r cos(θ − φ), in the s–θ plane, and
for a fixed point (r, φ) we have a locus of all the points in the s–θ plane. So, this is the
representation of the equation s = x cos θ + y sin θ in polar coordinates; in polar coordinates
I can represent this equation in this way.

(Refer Slide Time: 40:02)

So here I want to show some examples of the Radon transform.

(Refer Slide Time: 40:05)

The first one is the Radon transform corresponding to this image that is the full circle in
Radon transform and corresponding to this I have shown the Sinogram.

(Refer Slide Time: 40:13)

In the second example I am considering a thin stick (a bar), and corresponding to this I can
find the Radon transform in the same way. So, you can write a computer program for the
Radon transform. Up till now I have discussed the concept of the Radon transform, that is,
how to determine the Radon transform: I am just determining the projection along a
particular line.
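If you want to try such a program, a possible sketch using scikit-image (assuming the library is installed) is given below; the phantom image and the angle range are only illustrative choices.

import numpy as np
from skimage.transform import radon
from skimage.data import shepp_logan_phantom

image = shepp_logan_phantom()          # a standard test object f(x, y)
theta = np.arange(0.0, 180.0, 1.0)     # projection angles in degrees
sinogram = radon(image, theta=theta)   # column k holds g(s, theta_k)
print(sinogram.shape)                  # (number of s samples, number of angles)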

I have shown the equations of two lines: one is x cos θ + y sin θ = 0, and the equation of the
second line is x cos θ + y sin θ = s. So, I am just defining the projection along a particular
line. First I have defined the Radon transform, and after this I have considered some
equations to find the Ray Sum equation.

So, Ray Sum equation also I have explained. So, that means this concept I am explaining that
is how to get the projection of an object. The object is f x, y and corresponding to this f x, y
the projection is represented by g s theta. So, in my next class I will discuss about the image
reconstruction that means how to reconstruct the image f x, y from g s theta, g s theta is the
projection.

So, I can apply some techniques like the back projection technique or another technique is the
Fourier transform technique. There are some other techniques also mainly I will discuss these
two techniques, one is the back projection technique and another is the Fourier transform
technique. So, how to reconstruct f x, y from g s theta; that is f x, y the object, so I can

345
determine the projection along this lines, along the line 1, along the line 2 for a particular
angle.

So, like this if I change the angles then in this case I will be getting number of projections.
The theta angle I can change and corresponding to each and every theta I will be getting one
projections. Now, the next class I will be discussing how to reconstruct f x, y from g s theta.
So, let me stop here today. Thank you.

Computer Vision and Image Processing – Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, India
Lecture – 10
Image Reconstruction from a Series of Projections

Welcome to NPTEL MOOCS course on Computer Vision and Image Processing –


Fundamentals and Applications. In my last class, I discussed about the concept of the Radon
transform that I have shown the distinction between x-ray imaging and the CT scan a
computed tomography. This CT scan is nothing, but the 3D representation of the object from
number of projections.

So, I can consider number of projections and from this I can get 3D representation of the
object a particular object and I have discussed about how to determine the projection. The
projection is determined like this g s theta I can determine from f (x, y), f (x, y) is the object.
So, this class the last class I am going to continue for Radon transform and also the inverse
Radon transform.

(Refer Slide Time: 01:22)

So, what is the Radon transform? If you see the projection of f (x, y) along a particular line
that we have define that is g s θ . Now what is s θ ? S and theta represent the coordinate of the
x-ray so we have the x-ray here coordinate of x-ray relative to the object. So, I have the
object and I have the x-ray. So, s θ coordinates the coordinate is s theta that is the coordinate
of the x-ray relative to the object.

And in this case s is defined in the range from minus infinity to plus infinity, and the range of
θ is from 0 to π. So, this is the range for s and θ. Now, in this case I am considering g s θ;
g s θ is nothing but the line integral of the object function f(x, y). The image reconstruction
problem can be defined as the process of determining f(x, y) from g s θ.

g s θ is the projection, so in this case I want to determine f(x, y) from g s θ. In my last class,
I have shown that corresponding to line number 1 the equation of the line is
x cos θ + y sin θ = 0, and corresponding to this my projection is g(0, θ). So, if you see here,
g(0, θ) is the projection at this point.

And after this I am considering g s theta, what is g s theta? The projection of the second line,
if I consider the second line, so what is the equation of the second line? Equation of the
second line is x cos theta plus y sin theta minus s that is the equation of the second line and
corresponding to the second line what is the projection? The projection is g s theta that is the
projection of f x, y.

The equation of the second line is x cos theta plus y sin theta is equal to s that is the equation
of the second line. So, if you see this expression the g 0 theta, what is the meaning of this? G
0 theta is the integration along the line passing through the origin of x, y coordinate and
whose normal vector is in the theta direction and that is given by g 0 theta and like this I am
considering g s theta.

And in this case I am considering the Dirac delta function because I am just determining the
line integral, that is, the integral only along the particular line: along line number 1 or along
line number 2.

(Refer Slide Time: 04:11)

After this I considered the concept of the Ray Sum. In the Ray Sum expression, if you see
g s theta, we have this expression, and after this I am putting in the values of x and y in terms
of s and u. Because x cos θ + y sin θ − s = 0, the delta term becomes δ(0) and we integrate
over ds du; also, if you remember, dx dy = ds du because in this coordinate transformation
from x, y to s, u there is no shrinkage and no expansion.

So, that is why dxdy is equal to dsdu that we have considered and corresponding to this I am
having the Ray Sum equation, this is the Ray Sum equation. So, what is the meaning of this
equation? This is the equation that sum of the input function f x, y along the x-ray direction
the direction of the x-ray, x-ray is this whose distance from the origin is s so this distance is s
if you see this distance is s and whose normal vector is in the theta direction.

So, the normal vector is in the θ direction; that is, the Ray Sum is the sum of the input
function f(x, y) along the x-ray path whose distance from the origin is s and whose normal
vector is in the θ direction. So, the Radon transform maps the spatial domain (x, y) into the
s–θ domain; this is the mapping. In the last class we explained this concept, the concept of
the Ray Sum.

(Refer Slide Time: 06:00)

And this is the Sinogram the Sinogram we have explained the Sinogram is nothing, but the
Radon transform is displayed as an image.

(Refer Slide Time: 06:11)

And I have shown some examples of the Radon transform.

(Refer Slide Time: 06:14)

The examples like this you can see one is the full circle in the Radon transform.

(Refer Slide Time: 06:20)

And the second one is the thin stick in the Radon transform. So, my input is this input and
corresponding to this, this is my Radon transform.

(Refer Slide Time: 06:29)

Now the inverting a Radon transform. So, if I want to reconstruct f x, y from g s theta I have
to do inverse Radon transform. Now I can get the reconstructed value of f x, y from the
projection the projection is g s theta. So, in this example I have shown reconstruction from
18, 36 and the 90 number of projections. So, inverse Radon transform that is the back
projection I am considering and based on this back projection I am getting the reconstructed
image that is the reconstructed image that is nothing, but the 3D representation of a particular
object.

So, to construct a particular image or the object f x, y I can apply two approaches. There are
many approaches, but in this case I can consider two approaches one is the back projection
method and another one is the Fourier transform method. So, now I am going to explain the
concept of the back projection method and after this the Fourier transform method.

So, in this example I have shown here that is just I am doing the back projection and after the
back projection the image the f x, y is reconstructed from g s theta. I can give another
example how to do the back projection. So, in the next slide I am showing how to do the back
projection.

(Refer Slide Time: 07:48)

Suppose I have the object like this image suppose this is one internal structure here and
suppose the x-ray is coming from this side these are the x-ray. So, corresponding to this I can
determine the Ray Sum and because this object is present here I will be getting the 1D
absorption profile something like this I will show you. So, I will be getting the profile
something like this is the 1D absorption profile.

This actually amplitude depends on the absorption. So, in this case we have the maximum
absorption here. So, corresponding to this absorption I have the profile so that is the
projection in this direction I am doing the projection along this direction. Now I am
considering another projection the angle will be different now. So, from this side I am doing
the projection of the same scene this is the object, the same scene I am considering.

But in this case I am getting the projection in another direction. So, the projection will be
something like this, this is the 1D projection because in the object this position the absorption
will be maximum. So, corresponding to this I have the 1D absorption profile. This amplitude
depends on the absorption. Now from this two projections, I have the two projections that is I
am determining g s theta.

So, the first projection is this and the second projection is this. To reconstruct this object I
have to do back projection. How to do the back projection? Here you can see I have the
profile, and I am back projecting this profile. Corresponding to this portion, you see we are
duplicating the same 1D signal across the image, perpendicular to the direction of the beam;
that means the same 1D signal is repeated.

The 1D signal is this duplicating the same 1D signal across the image perpendicular to the
direction of the beam. Similarly, corresponding to the second one just I want to do the back
projection. So, my 1D profile is this, this is the 1D profile now I am doing the back projection
and corresponding to this the same procedure I am applying. So, I have this profile.

Now, if I combine this two the first one and the back projected image I will be getting this I
will be getting. So, if you see intensity at this particular point it will be double as compared to
this portion. So, at this portion I have the more information so this information I have the
most information. Similarly, if I do the projection in another projection then again after this
we can do the back projection.

Suppose, in this angle I am doing the projection and back projection then in this case also you
will see the information the more information I will be getting at this point as compared to
this points. So like this I can do the reconstruction that is nothing, but the 3D representation
of the object. So in that case intensity will be twice that of the individual back projected
image. This back projected image is called the Laminogram.

So, we have seen that how to do the back projection. So, first I am doing the projection in one
direction and again I am doing the projection in another direction. And after this I am doing
the back projection and after this I am combining. So, like this I can get the back projected
image that means I am reconstructing the object the reconstruction is possible.
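A toy sketch of this smearing idea, written only for illustration: each 1D profile is duplicated across the image perpendicular to its beam, the smeared layer is rotated to that beam's angle, and the two layers are added, so the crossing region comes out brightest. Sign and orientation conventions are simplified here.

import numpy as np
from scipy.ndimage import rotate

def smear(profile, theta_deg, size):
    # Duplicate the 1D profile along every row, then rotate the whole layer
    # so that it lies perpendicular to its own beam direction.
    layer = np.tile(profile, (size, 1))
    return rotate(layer, theta_deg, reshape=False, order=1)

n = 128
f = np.zeros((n, n)); f[54:74, 54:74] = 1.0          # toy object with one bright block
g_a = rotate(f, 0, reshape=False).sum(axis=0)        # 1D profile, first direction
g_b = rotate(f, -90, reshape=False).sum(axis=0)      # 1D profile, second direction
laminogram = smear(g_a, 0, n) + smear(g_b, 90, n)    # brightest where the smears cross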

(Refer Slide Time: 13:15)

Now, I will discuss the concept of inverse Radon transform and in this case I will discuss the
back projection method. So, what is the back projection method? So already physically I have
explained what is the back projection? So this expression you know that is the Radon
transform g s theta that is the projection g s theta corresponding to the object the object is f x,
y.

Now, let us consider the back projected image for a particular angle θk; that is given by
fθk(x, y) = g(x cos θk + y sin θk, θk). In general I can write this expression as
fθ(x, y) = g(x cos θ + y sin θ, θ); that means θ is fixed, and corresponding to that θ I am
getting fθ(x, y).

So, in this case how to get a particular projection? So, I am explaining again the particular
projection I am getting the theta is fixed and we are varying the variable s that means we
simply sums the pixel of f x, y along the line defined by the specified value of this
parameters. So, parameters are mainly s and theta so that means we have to first fix theta and
we have to vary s so theta is fix and vary s that means we are summing the pixels of f x, y
along the line defined by the specified value of these two parameters.

The parameters are s and theta. So, I can increment the value of s to cover the whole image,
but in this case I have to keep theta fixed. If I keep theta fixed, I will be getting one projection,
and like this I will be getting a number of projections. So, in this figure you can see that
corresponding to this particular angle I am getting one projection, corresponding to this angle
I am getting one projection, and corresponding to this angle I am getting one projection.

After this what I am doing I am just doing the back projection in this direction the back
projection I am doing like this I am doing the back projection. So, if you see the intensity
value at this point or this point if you see the information at this points so I will be getting
maximum information because of the information from this, information from this, and
information from this that means I can get 3D information that is 3D representation of the
object.

So, ultimately the back projected image f(x, y) I can determine by using the expression
f(x, y) = ∫ from 0 to π of fθ(x, y) dθ, because θ varies from 0 to π. So, I am considering all
the angles and I am determining f(x, y).

(Refer Slide Time: 16:06)

And already I have shown this one and if I consider the discrete value then instead of
integration I have to consider summation. So, discrete values if I consider the theta 1, theta 2,
theta 3 like this so that in this case instead of considering the integral I am considering the
summation. So, if I consider only the discrete directions then approximate reconstruction this
approximate reconstruction is given by this expression.

So, it is f(x, y) = ∫ from 0 to π of g(x cos θ + y sin θ, θ) dθ; in polar form also we can write
this expression, and with only discrete angles the integral becomes a summation that gives
the approximate reconstructed value of f(x, y). This concept, the concept of back projection,
is quite important. How to get a particular projection I have already explained: you get the
projection corresponding to a particular θ, where θ is fixed and we vary s.

After varying s so that it will cover the entire image f x, y then corresponding to this I am
getting one projection. After this I am changing theta and the same procedure is repeated and
corresponding to this I am getting another projection so like this I will be getting number of
projections. After this I will go for the back projection the back projection of the profile g s
theta and from this I am getting f x, y. So, this is the concept of the back projection method
by which we can do inverse Radon transform.
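The following is a minimal sketch of this unfiltered back projection, written for illustration under the assumption that the sinogram's column k is the projection at angle thetas_deg[k] and its rows sample s symmetrically about the image centre; it is not the lecturer's code.

import numpy as np

def backproject(sinogram, thetas_deg):
    n = sinogram.shape[0]
    centre = n // 2
    xs = np.arange(n) - centre
    X, Y = np.meshgrid(xs, xs)                      # pixel coordinates (x, y)
    recon = np.zeros((n, n))
    for k, theta in enumerate(np.deg2rad(thetas_deg)):
        s = X * np.cos(theta) + Y * np.sin(theta)   # s = x cos(theta) + y sin(theta)
        idx = np.clip(np.round(s).astype(int) + centre, 0, n - 1)
        recon += sinogram[idx, k]                   # smear g(s, theta) back over the image
    return recon / len(thetas_deg)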

(Refer Slide Time: 17:29)

The next method is the Fourier transform method, that is, the inverse Radon transform by the
Fourier transform method. What is the Fourier transform method here? I take the one-
dimensional Fourier transform of the Radon transform g s θ with respect to the variable s,
and that gives G_θ(ε), where ε is the frequency variable.

The claim is that G_θ(ε) is identical to one cross section of the 2D Fourier transform of the
object: the 2D Fourier transform F(fx, fy) of the object f(x, y), sliced by the plane at an angle
θ with the fx coordinate (fx is the frequency coordinate) and perpendicular to the fx–fy plane.
So these two will be identical.

So I am considering the one slice of the 2D Fourier transform of the object. So, pictorially I
am going to show this projection theorem so then you can understand what is the projection
theorem. So, mainly for the time being you can see so I am determining g s theta and after
this I am taking the 1D Fourier transform of g s theta that is 1D Fourier transform of Radon
transform I am determining.

After this, that 1D Fourier transform is denoted by G_θ(ε). Then I determine the 2D Fourier
transform of the object; the object is f(x, y) and its 2D Fourier transform is F(fx, fy). After
this we consider one slice of this 2D Fourier transform.

The slice is the slice by the plane at an angle θ with the fx coordinate (fx is the frequency
coordinate) and perpendicular to the fx–fy plane. So, I am considering one slice of this. Then
G_θ(ε) = F(ε cos θ, ε sin θ), so these are identical. Now, how to prove this?

So, first I have to determine the 1D Fourier transform of the Radon transform. The Radon
transform is g s θ, and its 1D Fourier transform with respect to s is
G_θ(ε) = ∫ g(s, θ) e^(−j2πεs) ds; this is the expression for the 1D Fourier transform of the
Radon transform that I am determining.

After this, what I am considering you have the G theta this one G theta I have and in place of
G s theta that I have the equation for the Ray Sum equation that already I have defined the
Ray Sum equation I am just putting here the Ray Sum equation. So, just putting the Ray Sum
equation in this expression then I will be getting this one and also I am considering dx dy is
equal to ds du that already you know this. So, this condition I am considering so this is the
Ray Sum equation already I have defined. So from this we are getting this.

(Refer Slide Time: 20:42)

Now what is G_θ(ε)? Substituting the Ray Sum into the 1D Fourier transform and noting,
from the previous slide, that s cos θ − u sin θ is nothing but x, s sin θ + u cos θ is nothing but
y, and s = x cos θ + y sin θ, I have

G_θ(ε) = ∫∫ f(x, y) e^(−j2πε(x cos θ + y sin θ)) dx dy

and this expression I can write in the form

G_θ(ε) = ∫∫ f(x, y) e^(−j2π(x ε cos θ + y ε sin θ)) dx dy

just by grouping the terms, one portion as ε cos θ and the other as ε sin θ. This expression is
very similar to the 2D Fourier transform of f(x, y),

F(u, v) = ∫∫ f(x, y) e^(−j2π(xu + yv)) dx dy

which is the expression for the 2D Fourier transform of f(x, y). So, comparing the two
expressions, you can see that u = ε cos θ and v = ε sin θ, so G_θ(ε) = F(ε cos θ, ε sin θ).
That means the projection theorem is proved.

(Refer Slide Time: 22:49)

So, pictorially I want to explain this one. So, what is the meaning of this projection theorem?
So, this is the projection theorem that is one dimensional Fourier transform of the Radon
transform. So what is this? So, if you see here I am showing the projection here. The
projection is g s theta. So, projection along this direction this is the projection I am
considering.

I can determine the 1D Fourier transform of g s theta; the 1D Fourier transform is G_θ(ε),
corresponding to the frequency variable ε. So, first I am calculating the 1D Fourier transform
of the Radon transform. Now, corresponding to this object f(x, y), I am determining the 2D
Fourier transform of the object. So the 2D Fourier transform of the object is F(fx, fy).

So, this is the 2D Fourier transform of the object. Now I am considering the one cross section
of this the cross section is I am considering this red line I am considering this is the cross
section I am considering. The cross section of the 2D Fourier transform of the object I am
considering F epsilon cos theta epsilon sin theta so one cross section I am considering.

This cross section is mainly the cross section of the 2D Fourier transform of the object slice
by a plane the plane I am considering you can see by a plane at an angle theta with respect to
the coordinate fx. And this plane is perpendicular to the plane the plane is fx, fy so I am
getting one slice.

So, slice means the cross section of the 2D Fourier transform of the object sliced by a plane at
an angle of theta with respect of fx axes and perpendicular to the plane a plane is fx, fy so
that is the plane I am considering. So, this is the one cross section of this. This cross section
and the 1D Fourier transform of the Radon transform that is identical.

So, if you see the expression that is the 1D Fourier transform of the Radon transform I am
getting this profile, and if I consider one slice of the 2D Fourier transform of the object
and I am taking one slice then it will be identical. So, projection at an angle theta generates
one cross section of F fx, fy. So, the projection of theta at a particular angle theta it will
generate the one cross section of fx, fy.

So, I am repeating this so projection at an angle theta generates one cross section of F fx, fy
that is the Fourier transform of the original object. So, if I consider the projection for all the
thetas so for all the angles they will generate the entire profile the profile is this profile F fx,
fy and after this if I take the inverse Fourier transform of this the inverse Fourier transform of
this that will give f x,y. So, if I take the inverse Fourier transform that will give f x,y that is
the reconstructed f x, y.

(Refer Slide Time: 26:19)

So, this projection theorem I have explained; I am explaining it again. The projection at a
particular angle θ generates one cross section of the Fourier transform of the original object,
and the projections for all θ together generate the entire profile of it. Finally, I can take the
inverse Fourier transform of this 2D Fourier transform to get the reconstructed f(x, y).

So this procedure I can consider how to actually reconstruct f x, y. So, the steps I can
consider the first step may be take projections. So, at different, different angles I can take the
projections that means in this case I am considering g s theta. After this what I have to do, I
have to determine the Fourier transform of the Radon transform, Fourier transform of g s
theta that is the 1D Fourier transform.

So this Fourier transform is nothing, but the 1D Fourier transform of g s theta. So, 1D Fourier
transform I am getting this profile so like this I have to consider all the 1D Fourier transform
and I have to combine. After combining, I will be getting F fx, fy I will be getting that is
number three step is this and number four I have to do inverse Fourier transform of this
inverse Fourier transform that will give the approximate value of f x, y that will give the
approximate value.
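As a quick numerical check of this projection theorem (a sketch of my own, for one orientation only): the 1D FFT of a projection taken along one image axis equals the corresponding central row of the 2D FFT of the object.

import numpy as np

f = np.zeros((128, 128))
f[40:90, 50:70] = 1.0                          # simple test object f(x, y)

projection = f.sum(axis=0)                     # projection obtained by summing along one axis
slice_1d = np.fft.fft(projection)              # 1D Fourier transform of that projection

F2 = np.fft.fft2(f)                            # 2D Fourier transform of the object
central_slice = F2[0, :]                       # the matching central slice of F(fx, fy)

print(np.allclose(slice_1d, central_slice))    # True, up to numerical precision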

So, these are the steps of the projection theorem; that is, by using the Fourier transform
method we can determine the reconstructed f(x, y). One more thing is important: in the
Radon domain, if I take the 1D Fourier transform of each projection and combine them all, I
get the 2D Fourier transform of the object.

Instead of the 1D Fourier transform I can also take the 1D wavelet transform of each
projection (the wavelet transform I am going to discuss in the coming classes), and in that
case the resulting transformation is a very important transformation called the ridgelet
transform.

So, how to get the ridgelet transform? First, I have to determine the Radon transform and
after this I have to consider 1D wavelet transform then in this case I will be getting the
ridgelet transform. So one is the 2D Fourier transform another one is the ridgelet transform I
will be getting.

(Refer Slide Time: 30:06)

So, here I have shown some reconstruction by considering the back projection method.
Reconstruction from 32 angles and another example I am considering reconstruction from 64
angles. So, up till now I discussed the concept of the Radon transform and also the inverse
Radon transform. So, the Radon transform is nothing but g s theta, and from g s theta I can
reconstruct f(x, y).

There are two methods I have explained the two approaches one is the back projection
method and another one is the Fourier transform method. In the back projection method, I
have already explained so how to get a particular projection you make theta fix and vary s the
variable s and corresponding to this I will be getting one projection the projection is g s theta.
Like this I will be getting number of projections at the particular time the theta is fixed and
varying s and I will be getting one projection like this I will be getting number of projections.

So, from all the projections the g s theta I am doing the back projection and by using the back
projection I can reconstruct the image the image if f x, y this is one popular technique that is
image reconstruction by the back projection technique. The second technique is the Fourier
transform technique. In the Fourier technique, what we have to consider 1D Fourier
transform of the Radon transform I have to determine.

And also you can see the 2D Fourier transform of the object I can determine that is f x, y is
available so from this I can determine the 2D Fourier transform. So, in this method in the
Fourier transform method what I have to determine are all the 1D Fourier transforms of the
Radon transform. So, I have g s theta because I have the number of projections.
projections, I can determine the 1D Fourier transform of the Radon transform I can
determine.

After this I can combine all this 1D Fourier transform then I will be getting the 2D Fourier
transform of the object after this I have to apply the inverse Fourier transform technique to
reconstruct the original f x, y. So, this technique is called the Fourier transform technique. So
these two techniques, the two approaches are very important the back projection approach
another one is the Fourier transform approach.
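In practice you can experiment with these reconstructions using scikit-image (assuming it is installed); iradon performs filtered back projection with a ramp filter by default, which gives a much sharper result than the plain back projection sketched earlier.

import numpy as np
from skimage.transform import radon, iradon
from skimage.data import shepp_logan_phantom

image = shepp_logan_phantom()
theta = np.arange(0.0, 180.0, 1.0)
sinogram = radon(image, theta=theta)              # forward projections g(s, theta)

reconstruction = iradon(sinogram, theta=theta)    # filtered back projection (default ramp filter)
mse = np.mean((reconstruction - image) ** 2)      # rough quality measure for this square phantom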

So, I think we have covered the concept; the concept is the 3D representation of a particular
object from number of projections. So, let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M.K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, India
Lecture 11
Image Transforms
Welcome to the NPTEL MOOCs course on Computer Vision and Image Processing –
Fundamentals and Applications. Already I have defined the image mathematically: the image
is represented by f(x, y), where f(x, y) means the intensity at a particular point (x, y); that is
called the spatial domain representation of an image. In an image, we have frequency
information. Suppose I consider edges or a boundary; that is nothing but high-frequency
information.

In edges in the boundary, there is an abrupt change of grayscale intensity value, that is why
the high-frequency information is present in the edges in the boundary. If I consider the
constant intensity region, the homogeneous region that corresponds to the low-frequency
information.

So, I can convert the spatial domain information into the frequency domain information for
better analysis of an image, I can do some transformation, I can apply the DFT the discrete
Fourier transform, I can apply the DCT the discrete cosine transform like this so that the
spatial domain information can be converted into a frequency domain information.

In the case of the frequency, what is the definition of the frequency? Frequency means,
spatial rate of change of grayscale value that is the definition of frequency. Now, in this
transformation, that signal is represented as a vector and the transformation changes the basis
of the signal space, that is the definition of the image transformation.

And the transformation is quite useful for compact representation of data. So, that means, the
image can be represented by using few transformation coefficients. So, I can apply DFT, I
can apply Discrete Cosine transform and the image can be represented by using transform
coefficients that means, the compact representation of an image that is the image
transformation.

And because of this transformation, it is easy to compute convolution or correlation, because
convolution in the spatial domain is nothing but multiplication in the frequency domain. So,
because of this transformation, I can easily do convolution. Let us see what is the meaning of
the image transformation; in my next slide, I will explain the concept of image transformation.

(Refer Time Slide: 03:01)

So, in this block diagram you have seen the input is f (x, y) that is the image, after this, we
are doing some transformation. Operator transformation I am getting F (u, v), so F (u, v) is
nothing but that is the transformed image, u is the spatial frequency along the x-direction and
v is the spatial frequency along the y-direction.

After this, we can do some operations in the frequency domain that is the operation R I am
doing in the frequency domain and after this, I am doing inverse transformation. So, that we
will get the spatial domain information after processing. So, I am getting g(x, y). So, in the
case of the image transformation, the signal data is represented as vectors and the
transformation changes the basis of the signal space.
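A small sketch of this block diagram in code (my own illustration, with an assumed ideal low-pass mask as the frequency-domain operation R):

import numpy as np

f = np.random.rand(256, 256)                   # stand-in for the input image f(x, y)

F = np.fft.fftshift(np.fft.fft2(f))            # forward transform, DC term moved to the centre
u = np.arange(256) - 128
U, V = np.meshgrid(u, u)
R = (np.sqrt(U**2 + V**2) < 30).astype(float)  # the frequency-domain operation R (assumed low-pass)
G = F * R                                      # operate on F(u, v)

g = np.real(np.fft.ifft2(np.fft.ifftshift(G))) # inverse transform gives the processed image g(x, y)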

And one important point is the transformation is usually linear but not shift-invariant. And
already I had explained that it is useful for a compact representation of data and because of
these transformations, I can separate noise and the salient image features because the image is
represented by few transform coefficients.

So, it is easy to separate noise and the salient image features, and also it is useful for image
compression because the image I am considering by only considering few coefficients that is
which are more important than we are considering and neglecting the remaining coefficients,
so like this, we can do the image compression. And the transform may be orthogonal or
maybe the non-orthogonal.

So, suppose if I consider one transformation matrix, the transformation matrix is T that is a
complex matrix and if I take that T inverse that is equal to T complex conjugate transpose
then this matrix T is called the unitary matrix. It is called the unitary matrix and in this case,
the transformation is called the unitary transformation. And suppose that T is real, the
transformation matrix is real then T inverse is equal to T transpose then in this case and this
matrix is called the orthogonal matrix.

So, I have defined unitary matrix and the orthogonal matrix. Based on this condition my
transformation maybe unitary transformation or maybe the orthogonal transformation. So,
transformation may be orthogonal if this condition is satisfied or maybe the unitary and if this
condition is not satisfied then in this case it will be non-unitary or maybe the non-orthogonal
transformation.

One definition is the transformation may be complete or under complete. What is the
meaning of this? Suppose, let us consider x n, I am representing the input data x n is
represented like this, N minus 1 K is equal to 0 to N minus 1, X K and phi n K and this is the
basis function. So, my input data that is input vector is represented like this x n is equal to X
K that means, I am doing some transformation X K and the basis function is this.

So, in this case, K runs from 0 to N minus 1. This x(n) can also be approximately represented
with K running only from 0 to M minus 1, that is, x̂(n) = Σ from K = 0 to M−1 of X(K) φ(n, K);
this is the approximate representation of x(n). In the first case I am considering K = 0 to
N − 1, that means I am considering all N coefficients, X(0), X(1), X(2), and so on.

But in the second case I am considering only M coefficients, where M is less than N. If I
consider the first one, my transformation will be a complete transformation, and if I consider
the second case, that is the under-complete case. In both cases I can determine the mean
square error, which is nothing but (1/N) Σ from n = 0 to N−1 of |x(n) − x̂(n)|², the average
squared difference between the original sequence and its approximation.
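A small numerical illustration of the complete versus under-complete case (my own sketch, using the unitary DFT as the basis): keeping all N coefficients reconstructs x(n) exactly, while keeping only the first M < N coefficients leaves a mean square error.

import numpy as np

N, M = 64, 16
x = np.cumsum(np.random.randn(N))                 # some test sequence x(n)

X = np.fft.fft(x) / np.sqrt(N)                    # unitary forward transform: all N coefficients

X_trunc = X.copy()
X_trunc[M:] = 0.0                                 # under-complete: keep only K = 0 .. M-1

x_full = np.real(np.fft.ifft(X * np.sqrt(N)))          # complete case: exact reconstruction
x_approx = np.real(np.fft.ifft(X_trunc * np.sqrt(N)))  # under-complete approximation x_hat(n)

mse = np.mean(np.abs(x - x_approx) ** 2)          # (1/N) * sum |x(n) - x_hat(n)|^2
print(np.allclose(x, x_full), mse)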

And in this case the mean square error also I can determine, so a transformation may be
complete or under complete, so you can understand this. And another one is the
transformation can be applied for the whole image that means, if I consider a whole image,
this is the whole image, this N-by-N image.

So, for the entire image, I can apply to the transformation, or otherwise, I can apply the block
by block. So, suppose if I consider is the block, one block I am applying the transformation,
next block I am applying the transformation like this I can apply the transformation block by
block. So, that the information may be applied to image blocks or maybe to the whole image.

So, this is the definition of the image transformation and the transformation may be
orthogonal or the non-orthogonal, complete or under complete, applied to the image blocks or
the whole image. Now, based on this orthogonality concept, I can show you a different type
of transformation. So, what are the different types of the transformation?

(Refer Time Slide: 09:44)

So, for image transformation, the first one I am considering is the orthogonal sinusoidal basis
function. The first case is the orthogonal sinusoidal basis function, the second case is the
orthogonal non-sinusoidal basis function, the third is where the basis function depends on the
statistics of the input signal, and another one is the directional transformation. So, I can
classify image transformations like this. In the first one, the transformation basis function is
orthogonal and also sinusoidal.

So, examples like the DFT the Discrete Fourier Transform, or the DCT the Discrete Cosine
Transform, or maybe the Discrete Sine Transform DST these are the examples. So, basis
function is orthogonal and also the sinusoidal function. Another one is orthogonal non-
sinusoidal basis function. So, I can give some examples like the Harr transformation, Harr
transform that is we will discuss in the wavelet transform.

In the wavelet transform you can understand what is Harr transformation, another transform
like Walsh transform, Hadamard transform, or maybe slant transform. In this case, basis
function is orthogonal but, in this case, it is non-sinusoidal basis function. Another case, the
next one is the basis function depends on the statistics of the input data. Then in this case, I
can give you one example one is the KL transformation, I will discuss about the KL
transform and another one is singular value decomposition SVD, so I can give these
examples.

So, basis function is not fixed, but basis function depends on the statistics of the input data,
the input signal that is the examples like the KL transform and singular value decomposition.
And regarding the directional transformation, so already I have discussed about the Radon
transform that is one example is the Radon transform and another example I can give I will
discuss later on that is the Hough transform.

So, I have these types of transformations one is the DFT, DCT, and DST that is the
orthogonal sinusoidal basis function and another one is the Harr transformation, Walsh
transformation, Hadamard transformation, Slant transformation orthogonal non-sinusoidal
basis function and another one is the basis function depends on the statistics of the input
signal that is the KL transform and the SVD and directional transformation like Radon
transform and the Hough transform.

(Refer Time Slide: 14:21)

Now, let us consider a one-dimensional sequence x(n), with n from 0 to N minus 1. Now, I am
doing a transformation, the transformation X = Tx; the small x is the input data, the capital X
is the transformed data, and T is the transformation matrix. So, I can write
X(k) = Σ from n = 0 to N−1 of t(k, n) x(n), where t(k, n) is the transformation kernel and x(n)
is the input data, the input sequence.

And in this case, if I consider a transformation is something like this the T inverse equal to T
complex conjugate transpose. So, already I have defined this is called the unitary matrix and
the transformation will be the unitary transformation. And for the reconstruction I can use
this formula that is reconstruction of the original sequence small x is equal to T complex
conjugate transpose X because the T complex conjugate transpose is nothing but T inverse.

So, in this case I get the reconstruction formula: x(n) = Σ from k = 0 to N−1 of t*(k, n) X(k),
for n from 0 to N minus 1, where t* denotes the complex conjugate.

(Refer Time Slide: 15:44)

Now, let us consider this case, the X is equal to T x, so X capital X and that is the transform
data, T is the transformation and x is the input sequence, so already I have defined this one.
So, in the matrix from I can represent in this form, the first one is the transform data. So, this
equation I am writing in the matrix from.

So, the first one is the transform data and after this I am considering the transformation
matrix and after this, I am considering the input data. In this case that x naught the small x
naught, x1, xn minus 1 that is the input data and x naught, x1, xn minus 1 that is the
transform data there is a capital and I am considering that this permission matrix.

(Refer Time Slide: 16:34)

Now, let us consider the 1D transformation. I think you already know the DFT, so let us write
the 1D DFT first. The one-dimensional DFT I can write like this:
X(K) = Σ from n = 0 to N−1 of x(n) e^(−j2πnK/N), that is,
X(K) = Σ from n = 0 to N−1 of x(n) W^(nK).

So, in this case W is the twiddle factor of the DFT, and x(n) I can write as the inverse:
x(n) = (1/N) Σ from K = 0 to N−1 of X(K) W^(−nK). This is the 1D DFT. In this case I am
considering the data vector, and I can also write the 1D DFT in matrix form.

So this is the 1D DFT, and in matrix form I can represent it with the transform data on the left, the transformation matrix in the middle and the input data on the right. For the inverse, the inverse DFT (IDFT) can be represented in a similar way.

Now, if I look at this transformation matrix W, I find that W inverse is equal to W complex conjugate transpose divided by N. That means, by the definition of a unitary matrix, the DFT matrix as written is not unitary. But I can make the DFT matrix unitary, as you can see in the next slide.

(Refer Slide Time: 19:14)

So, I now define the DFT with a normalization: F(k) = (1/root N) times the sum of f(n) W^(nk), and the inverse DFT with the same 1/root N factor. With this definition the transformation matrix satisfies T inverse equal to T complex conjugate transpose, which means I am getting the unitary DFT. From this discussion you should have understood the meaning of the unitary transformation and the orthogonal transformation.

In the case of the unitary transformation, the unitary matrix satisfies T inverse equal to T complex conjugate transpose; that is the definition of the unitary matrix. And if the matrix T is real, then T inverse is equal to T transpose, which is the definition of the orthogonal matrix. Based on this I may have a unitary transformation or an orthogonal transformation.

In this example I have shown that the DFT transformation is not unitary as usually written, but I can make it unitary by defining the DFT with the 1/root N factor, so that the DFT transformation matrix becomes unitary. So that is the definition of the unitary transformation and the orthogonal transformation.
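
To make this concrete, here is a small illustrative check (my own sketch, not from the slides) that the plain DFT matrix is not unitary while the 1/root N scaled version is:

    import numpy as np

    N = 4
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # DFT matrix without scaling
    F = W / np.sqrt(N)                             # DFT matrix with the 1/sqrt(N) factor

    I = np.eye(N)
    print(np.allclose(W.conj().T @ W, I))  # False: plain DFT matrix is not unitary
    print(np.allclose(F.conj().T @ F, I))  # True: scaled DFT matrix is unitary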

(Refer Slide Time: 20:36)

Now, let us consider the unitary transformation and its basis. I take a transformation matrix T and assume a unitary transformation. The rows of the transformation matrix are t00, t01, ..., t0,N-1 in the first row, t10, t11, ..., t1,N-1 in the second row, and so on. Then what will be my T inverse?

T inverse is very easy to calculate because T is a unitary matrix: I simply take the complex conjugate and then the transpose of the matrix. So T inverse is equal to T complex conjugate transpose, and that gives me T inverse.

(Refer Slide Time: 21:18)

After this, I show the transformation here: this is the input data, this is T inverse, and F0, F1, ... (the capital F's) are the transform data, so I can represent the reconstruction in this form. In this case the columns of T inverse are independent: each column is independent of the others, and together they form a basis for the N-dimensional space.

(Refer Slide Time: 21:56)

And I can give one example of a unitary transformation with a 2 x 2 matrix: the transformation matrix is T = (1/root 2) [1, j; j, 1]. Then I can determine T inverse, which is T complex conjugate transpose: T inverse = (1/root 2) [1, -j; -j, 1]. So this is the definition of the unitary transformation.

As already explained, examples of unitary transformations include the DCT, the DST (Discrete Sine Transform) and the Hadamard transformation, and the KL transformation is also unitary. But in the case of the KL transformation, the transformation matrix or kernel depends on the statistics of the input data.

In the case of the DCT the transformation kernel is fixed, whereas in the case of the KL transformation the kernel is not fixed; it depends on the statistics of the input data, but it is still orthogonal.

(Refer Slide Time: 23:02)

Now, the properties of the unitary transformation. The rows of the transformation matrix T form an orthonormal basis for the N-dimensional complex space; that is a very important property. Another point is that the magnitude of the determinant of T is equal to 1, which you can verify. A further property is that all the eigenvalues of T have unit magnitude: from Tf = lambda f you can determine the eigenvalues and eigenvectors, and every eigenvalue of a unitary matrix satisfies |lambda| = 1.

(Refer Slide Time: 23:39)

Another important property is Parseval's theorem: T is energy preserving; a unitary transformation preserves energy, and that is Parseval's theorem. In the mathematics shown here, F complex conjugate transpose times F corresponds to the energy in the transform domain, while small f complex conjugate transpose times f is the energy in the data domain. If you follow the steps, T complex conjugate transpose times T is nothing but the identity matrix I. From this you can see that F complex conjugate transpose F, the energy in the transform domain, is equal to the energy in the data domain; so the energy is preserved.

The energy is conserved, but it is unevenly distributed among the coefficients. What is the meaning of this? Suppose after the transformation I obtain a set of coefficients; most of the energy is concentrated in a few of these coefficients, and for the rest of the coefficients the energy is negligible. That means energy is conserved but unevenly distributed among the coefficients.

So, in the case of image compression I can neglect the low-energy coefficients, because image compression is nothing but the compact representation of data: I can discard the redundant information and keep only the coefficients having significant energy, neglecting the remaining ones. This property is very important: the energy is conserved, but most of it is available in only a few coefficients.
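
As a small sketch of these two properties (my own example; the lecture's figures use images rather than this synthetic signal), the unitary DFT of a smooth, correlated signal conserves energy and packs most of it into a few coefficients:

    import numpy as np

    n = np.arange(256)
    x = np.sin(2 * np.pi * n / 256) + 0.05 * np.random.randn(256)  # smooth, correlated signal

    X = np.fft.fft(x, norm='ortho')   # unitary DFT (1/sqrt(N) scaling)

    # Parseval: energy in the data domain equals energy in the transform domain
    print(np.allclose(np.sum(np.abs(x) ** 2), np.sum(np.abs(X) ** 2)))

    # Energy compaction: fraction of total energy in the 16 largest coefficients
    idx = np.argsort(np.abs(X))[::-1]
    print(np.sum(np.abs(X[idx[:16]]) ** 2) / np.sum(np.abs(X) ** 2))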

(Refer Slide Time: 25:59)

The next property is the decorrelating property. Here the input data is highly correlated, but after the transformation the transformed data will be less correlated, or you may even consider it uncorrelated. That is the concept: the input data is highly correlated, and after the transformation the transformed data becomes uncorrelated.

If I consider a data vector, I can determine its covariance matrix; after the transformation I get the transformed data, capital F, from the input data, small f. For the transformed data I can also determine a covariance matrix, C_F, just as I can for the input data.

My original data is highly correlated, but after the transformation the covariance matrix becomes a diagonal covariance matrix: all the off-diagonal elements are 0. Off-diagonal elements equal to 0 correspond to perfect decorrelation; that is the meaning of the decorrelating property.

So, after the transformation the transformed data will be less correlated, or, in the case of perfect decorrelation, the off-diagonal elements of the covariance matrix will be 0. These, then, are the properties of the unitary transformation. The first property is quite important: the rows of the transformation matrix are independent and form a basis for the N-dimensional space. Another is that the magnitude of the determinant of the transformation matrix T is equal to 1.

One important property is Parseval's theorem: the energy in the transform domain is equal to the energy in the data domain. Another is the energy compaction property: after the transformation most of the energy is available in a few coefficients, so I can neglect the remaining coefficients for a compact representation of the data.

And the last property is the decorrelating property: my input data is highly correlated, and after the transformation the transformed data will be uncorrelated, or at least less correlated. If the covariance matrix of the transformed data is diagonal, the data is completely uncorrelated. These are the properties of the unitary transformation.
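
A hedged numerical sketch of the decorrelating property (my own construction, using an AR(1)-style covariance model and the orthonormal DCT, one of the unitary transforms listed earlier) might look like this:

    import numpy as np

    def dct_matrix(N):
        # Orthonormal DCT-II matrix (rows indexed by k, columns by n)
        k = np.arange(N)[:, None]
        n = np.arange(N)[None, :]
        C = np.sqrt(2.0 / N) * np.cos(np.pi * k * (2 * n + 1) / (2 * N))
        C[0, :] = np.sqrt(1.0 / N)
        return C

    N, samples, rho = 8, 5000, 0.95
    cov = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))  # highly correlated data model
    f = np.random.multivariate_normal(np.zeros(N), cov, size=samples)

    T = dct_matrix(N)
    F = f @ T.T                          # transform every data vector

    def offdiag(C):
        return np.sum(np.abs(C - np.diag(np.diag(C))))

    print(offdiag(np.cov(f, rowvar=False)))  # large: input data is correlated
    print(offdiag(np.cov(F, rowvar=False)))  # much smaller: coefficients are nearly decorrelated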

(Refer Slide Time: 28:40)

In this example, the first part shows the image in the spatial domain: I show the pixel positions and the pixel values, and you can see that the pixels are correlated; the input is correlated data. After the transformation I get the transform coefficients, shown in the second part, and these are uncorrelated data. That is the importance of the transformation: my input data is correlated, but after the transformation I get uncorrelated data.

(Refer Slide Time: 29:14)

Now, let us consider the 2D transformation; for example, I can apply a 2D transformation to an image. First I consider the data matrix: f00, f01, ..., f0,N-1 is one row; f10, ..., f1,N-1 is another row; and so on.

Next I apply the transformation and obtain the transformed data Ft(k1, k2), the transform coefficients, where k1 and k2 are the row and column indices in the transform array, and t(n1, n2, k1, k2) is the transformation kernel.

If I want to reconstruct the original data after the transformation, then, because I am considering a unitary transformation, t inverse is equal to t complex conjugate transpose; using this inverse kernel I can reconstruct the original data.

(Refer Slide Time: 30:29)

Now, let us consider the separable property. I have already defined the transformation kernel t(n1, n2, k1, k2); if I can separate it into a product of two kernels, t1 and t2, where t1 is the kernel for the horizontal direction and t2 is the kernel for the vertical direction, the kernel is called separable.

One important point is that separable transforms are easy to implement, because the transformation can first be applied along the rows and then along the columns. And if the kernels along the rows and the columns are identical functions, then the kernel is also symmetric; that is the symmetric property.

If I consider the kernel corresponding to the 2D DFT, you can see that it is separable. So, based on the separable property, I can determine the 2D DFT by using the 1D DFT: I apply the 1D DFT along the rows and after this along the columns.
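
To illustrate the separable implementation (a sketch of my own, not the lecture's slide), a 2D DFT computed as 1D DFTs along the rows and then along the columns matches the direct 2D DFT:

    import numpy as np

    img = np.random.rand(8, 8)

    rows = np.fft.fft(img, axis=1)       # 1D DFT along each row
    sep2d = np.fft.fft(rows, axis=0)     # then a 1D DFT along each column

    print(np.allclose(sep2d, np.fft.fft2(img)))  # True: same as the direct 2D DFT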

(Refer Slide Time: 32:05)

So, I now use the separable and symmetric properties together with a unitary transformation. As already defined, f(n1, n2) is my input 2D data and T is the transformation kernel, and I apply the separable property and the symmetric property.

If I apply the symmetric and separable properties, I can write the transformation in matrix form as Ft = T f T. And what about the inverse transformation? The inverse transformation is f = T complex conjugate transpose Ft T complex conjugate transpose, because I am considering the unitary transformation.
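
As an illustrative check of this matrix form (my own sketch; the unitary DFT matrix is used because it happens to be both symmetric and unitary):

    import numpy as np

    N = 8
    n = np.arange(N)
    T = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)  # symmetric, unitary DFT matrix

    f = np.random.rand(N, N)
    Ft = T @ f @ T                          # forward: Ft = T f T
    f_rec = T.conj().T @ Ft @ T.conj().T    # inverse: f = T^H Ft T^H

    print(np.allclose(Ft, np.fft.fft2(f, norm='ortho')))  # matches the unitary 2D DFT
    print(np.allclose(f, f_rec))                          # perfect reconstruction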

(Refer Slide Time: 32:57)

So, for the 2D case I am considering a separable, symmetric, unitary transformation. The properties I have already explained, the energy preserving property, the distance preserving property, the energy compaction property, and the decorrelating property, are again applicable to the 2D transformation.

(Refer Slide Time: 33:26)

If I consider one image, an image is nothing but a 2D array of numbers, so again I can show the 2D transformation. I consider one image block x(n1, n2) of size N x N, and after the transformation I get the transform data X(k1, k2).

Here x(n1, n2) is my input image and this is my transformation kernel. What is the inverse transformation? The inverse transformation means obtaining the input image back from the transform coefficients X(k1, k2), using the kernel for the inverse transformation.

This can be represented in matrix form: if you consider this equation, there are N squared similar equations, one for each pixel element, and they can be collected and represented in matrix form as follows.

(Refer Slide Time: 34:28)

Writing this again: my input image consists of the pixels x(n1, n2), and H(k1, k2) is an N-by-N matrix defined for each pair of variables k1 and k2. The image block x can then be represented by a weighted summation of N squared images. I am repeating this: a particular image block x can be represented as a weighted summation of N squared images, each of size N x N; the weights of this linear combination are the transform coefficients X(k1, k2), and each matrix H(k1, k2) is called a basis image.
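
A small sketch of this idea (my own, assuming the separable unitary DFT kernel), reconstructing an image block as a weighted sum of its basis images:

    import numpy as np

    N = 4
    n = np.arange(N)
    T = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)  # unitary DFT matrix
    Th = T.conj().T                                            # inverse kernel

    x = np.random.rand(N, N)
    X = T @ x @ T                                              # transform coefficients

    x_rec = np.zeros((N, N), dtype=complex)
    for k1 in range(N):
        for k2 in range(N):
            H = np.outer(Th[:, k1], Th[k2, :])   # basis image for (k1, k2)
            x_rec += X[k1, k2] * H               # weighted sum of N*N basis images

    print(np.allclose(x, x_rec))                 # True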

(Refer Slide Time: 35:39)

Now, let us consider the 2D Fourier transform and how to define it. I take my input image f(x, y); if I compute its 2D Fourier transform I get F(u, v), the Fourier transform of the input image. In the continuous domain, F(u, v) is the double integral of f(x, y) e^(-j(xu + yv)) dx dy, and there is a corresponding inverse Fourier transform.

Here u and v are spatial frequencies in radians per unit length: u is the spatial frequency in the x-direction and v is the spatial frequency along the y-direction. So F(u, v) involves two frequencies, u and v. For the Fourier transform to exist, as you already know, f(x, y) should be absolutely integrable; this condition is important for the Fourier transformation.

In this example, I determine the Fourier transform of three images, and you can see what their Fourier transforms look like; consider the first image.

(Refer Slide Time: 37:12)

That is the first image; corresponding to it, the Fourier transform shows the spatial frequencies. For the second image, which contains horizontal lines, the Fourier transform shows these points. And for the third image as well, I show its 2D Fourier transform.

(Refer Slide Time: 37:35)

The Fourier transform F(u, v) can be represented in polar form: it has a magnitude part and a phase angle. So I have two pieces of information, the magnitude information and the phase information. F(u, v) has two components, a real part and an imaginary part; the magnitude gives the Fourier spectrum of the input signal f(x, y), and by using this formula I can determine the phase angle.

(Refer Slide Time: 38:08)

Now, in this example, I want to show the meaning of the phase information and of the magnitude information. The phase information represents the edge or boundary information of the objects present in the image, and for applications like medical image analysis this phase information is very important. And what is the meaning of the magnitude? The magnitude tells how much of a certain frequency component is present in the image, while the phase tells where that frequency component is present. In this example I have the input image f(x, y) and two plots: the first is the magnitude plot and the second is the phase plot.

For the magnitude plot I use the log transformation; I will explain it later, but briefly, the log transformation is used to compress the dynamic range of an image for better visualization. So I have two kinds of information, the magnitude and the phase angle, and I now reconstruct the original image from the 2D Fourier transform.

In the first case, I consider only the magnitude information and discard the phase information; the resulting reconstructed image shows that you cannot reconstruct the image from the magnitude alone. In the second case, I consider only the phase information and keep the magnitude constant; this gives the other reconstructed image.

So, in this example you can see that to reconstruct the original image I must consider both the magnitude information and the phase angle information. If I use only the magnitude or only the phase, perfect reconstruction is not possible; for perfect reconstruction I need both the magnitude and the phase information.
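
A hedged NumPy sketch of this experiment (my own; 'img' is just a placeholder array standing in for the lecture's image):

    import numpy as np

    img = np.random.rand(64, 64)            # placeholder for a grayscale image
    F = np.fft.fft2(img)
    mag, phase = np.abs(F), np.angle(F)

    mag_only = np.real(np.fft.ifft2(mag))                   # phase discarded
    phase_only = np.real(np.fft.ifft2(np.exp(1j * phase)))  # magnitude set to a constant

    both = np.real(np.fft.ifft2(mag * np.exp(1j * phase)))  # both kept
    print(np.allclose(both, img))   # True: magnitude plus phase give perfect reconstruction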

(Refer Slide Time: 40:14)

In this example, I have shown the Fourier transform of a simple rectangular bar. This is my input image, and corresponding to it I have its Fourier transform; for a rectangular function the Fourier transform is a sinc function. In the second case I consider the Fourier transform of a simple image with Gaussian-distributed intensity; since the Fourier transform of a Gaussian function is again a Gaussian, I get a Gaussian distribution in the transform as well.

In the third example I consider two identical objects placed at different spatial positions: one object is placed here and the other there, and corresponding to this I get exactly the same Fourier magnitude spectrum, because a spatial shift changes only the phase.

In the last example I consider the rotation property of the Fourier transform: if the object is rotated, the Fourier transform is rotated by the same angle.

Up to now I have discussed the concept of the Fourier transform. The Fourier transform is quite important for seeing the frequency components present in a signal; in an image we have both high-frequency information and low-frequency information. In my next class I will discuss another important transform, the Discrete Cosine Transform.

I may also discuss the fundamental concept of the KL transformation. In the case of the Fourier transform and the DCT the transformation kernel is fixed, but in the case of the KL transformation the kernel is not fixed; it depends on the statistics of the input data. So, in the next class I will discuss the DCT and the KL transformation. Let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M.K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, India
Lecture 12
Image Transforms
Welcome to the NPTEL MOOCs course on Computer Vision and Image Processing – Fundamentals and Applications. In my last class I discussed the concept of image transformation: the transformation converts spatial-domain information into frequency-domain information; it is mainly a mapping from the spatial domain into the frequency domain. The transformation does not change the information content present in a signal, and it is quite useful for compact representation of data.

I also discussed the concept of the orthogonal transformation. In the orthogonal transformation the transformation matrix satisfies T inverse equal to T transpose; in that case it is called an orthogonal matrix and the transformation is an orthogonal transformation. And if the transformation matrix T is a complex matrix with T inverse equal to T complex conjugate transpose, it is called a unitary matrix and the corresponding transformation is called a unitary transformation.

I also highlighted that the DFT, the discrete Fourier transform, is not unitary as usually written; I can make the DFT unitary by defining it with the proper scaling, giving the unitary DFT transformation. After this, I discussed the properties of the unitary transformation, and three main properties I want to highlight again. The first is the energy compaction property: if I transform from the spatial domain into the frequency domain, I get transform coefficients, and most of the energy is available in a few of those transform coefficients. The second is the energy preserving property: the energy in the data domain is equal to the energy in the frequency (transform) domain. The third important property is the decorrelating property.

My original data is highly correlated, and after the transformation the transformed data will be less correlated; for this I defined the covariance matrix, and based on its properties I defined the decorrelating property. Last class I also defined the unitary DFT matrix. Today I will discuss one transformation called the DCT, the discrete cosine transform, and after this another transformation, the KL transformation.

The DCT is an orthogonal transformation; orthogonal means that T inverse is equal to T transpose, so T is an orthogonal matrix and the transformation is an orthogonal transformation. Let us see what the DCT, the discrete cosine transform, is.

(Refer Slide Time: 03:47)

First I consider the 1D Discrete Cosine Transform. I take the data vector f = [f(0), f(1), ..., f(N-1)] transpose, a column vector, and then the DCT can be defined as follows.

The DCT of the input sequence f(n) is F_c(k) = alpha(k) times the sum from n = 0 to N-1 of f(n) cos(pi k (2n + 1) / 2N). In the DCT, cos(pi k (2n + 1) / 2N) is the transformation kernel, and the reconstructed data can be obtained by the inverse transformation, the inverse DCT.

From F_c(k) I can determine f(n), the original data vector, by the inverse discrete cosine transform. Here alpha(k) is equal to 1/root N for k = 0 and root(2/N) for k = 1, 2, ..., N-1. This is the definition of the 1D Discrete Cosine Transform.

(Refer Slide Time: 05:17)

This transformation can be written as F = T_C f, where F is the DCT (the transformed data), T_C is the transformation matrix corresponding to the discrete cosine transform, and f is the original data. The elements of the transformation matrix are t(k, l) = alpha(k) cos(pi k (2l + 1) / 2N).

The transformation matrix is real, because it contains only cosine terms, so the DCT is a real transform. In the case of the Fourier transform we have both cosine and sine components, a real and an imaginary component, which is why it is complex. But the DCT has only the cosine component cos(pi k (2l + 1) / 2N), so it is a real transformation, and it is also an orthogonal transformation: T_C inverse is equal to T_C transpose.

(Refer Slide Time: 06:44)

Now I want to show the relationship between the DCT and the DFT. In the first equation, the DCT is F_c(k) = alpha(k) times the sum from n = 0 to N-1 of f(n) cos(pi k (2n + 1) / 2N), where f(n) is the input data sequence.

I can rewrite this equation as follows. As already explained, the DCT has only a cosine term, no sine term, i.e. no imaginary component; that is why I take only the real part of the exponential function: the cosine term is the real part, and the imaginary (sine) term is neglected. After this I define another sequence f'(n), where f'(n) is equal to f(n) for n = 0 to N-1 and 0 otherwise.

With this, the DCT expression can be written as F_c(k) = alpha(k) Real{...} with the sum now running from n = 0 to 2N-1, because the zero padding extends the sequence up to 2N-1: f'(n) = f(n) for n = 0 to N-1 and 0 from N up to 2N-1. So what is the length of my sequence? It runs from 0 to 2N-1, that is, it is a 2N-point sequence, and for this new sequence f'(n) I can write the DCT expression in that form.

Now compare with the DFT. The N-point DFT is the sum from n = 0 to N-1 of f(n) e^(-j 2 pi n k / N). If you compare that with the term inside the DCT expression, it is nothing but a 2N-point DFT. From this expression you can see the relationship between the DCT and the DFT.

So how do we get the DCT from the DFT? In the block diagram, the input data sequence is f(0), f(1), ..., f(N-1). From it I form the new sequence f'(n) by zero padding, so the length of the new sequence becomes 2N, running from 0 to 2N-1. Then I compute the 2N-point DFT of this sequence, after which I multiply by alpha(k) e^(-j pi k / 2N), and finally I take the real part of the result, which gives the DCT of the sequence.

So you can obtain the DCT from the DFT by this process: first zero-pad so that the length of the sequence is 2N, then compute the 2N-point DFT, then multiply by alpha(k) e^(-j pi k / 2N), and finally take the real part to get the DCT. That means the DCT can be computed by using the DFT.
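
A minimal sketch of this zero-padding procedure (my own illustration), checked against the direct definition:

    import numpy as np

    def dct_via_dft(f):
        # 1) zero-pad to length 2N, 2) take the 2N-point DFT,
        # 3) multiply by alpha(k) * exp(-j*pi*k/(2N)), 4) keep the real part.
        N = len(f)
        G = np.fft.fft(np.concatenate([f, np.zeros(N)]))
        k = np.arange(N)
        alpha = np.full(N, np.sqrt(2.0 / N))
        alpha[0] = np.sqrt(1.0 / N)
        return alpha * np.real(np.exp(-1j * np.pi * k / (2 * N)) * G[:N])

    f = np.random.rand(8)
    N = len(f)
    n = np.arange(N)
    direct = np.array([(np.sqrt(1.0 / N) if k == 0 else np.sqrt(2.0 / N)) *
                       np.sum(f * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                       for k in range(N)])
    print(np.allclose(dct_via_dft(f), direct))   # True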

(Refer Slide Time: 11:20)

In this slide I show another interpretation of the relationship between the DCT and the DFT. Again I form a new sequence f'(n) from the original sequence f(n), but now by symmetric extension: f'(n) = f(n) for n = 0, 1, ..., N-1, and f'(n) = f(2N-1-n) for n = N, N+1, ..., 2N-1.

The length of this new sequence is again 2N points. I then determine the 2N-point DFT of this sequence; following the mathematics on the slide, the DCT of the original sequence comes out as F_c(k) equal to alpha(k) times the phase factor e^(-j pi k / 2N) times F(k), the DFT of the extended sequence (up to a constant scale factor).

This is a very important representation of how to get the DCT from the DFT: F_c(k), the DCT of the sequence, equals alpha(k) e^(-j pi k / 2N) F(k). So by this approach as well you can determine the DCT from the DFT, since you already know how to compute the DFT; there are efficient algorithms such as the FFT. You only have to define the new sequence f'(n), the 2N-point symmetric extension, and from its DFT you can determine the DCT.

(Refer Slide Time: 13:37)

Now look at the extension I have made: the new sequence is f'(n) = f(n) for n = 0, 1, ..., N-1 and f'(n) = f(2N-1-n) for n = N, N+1, ..., 2N-1.

The DFT implicitly corresponds to a periodic extension of the data, whereas what I have done here for the DCT is a symmetric extension of the data. So in the case of the DFT we have the periodic extension of the data, but in the case of the DCT we have the symmetric extension of the data.

(Refer Slide Time: 15:15)

From the previous discussion: the input sequence is an N-point sequence; from it I generate a new 2N-point sequence, then determine its 2N-point DFT, and from that 2N-point DFT I obtain the N-point DCT.

Now compare the energy compaction. In the DFT case, the extended 2N-point data is represented by a 2N-point DFT, whereas in the DCT case the same 2N-point (symmetrically extended) data is represented by only an N-point DCT. So the energy compaction of the DCT is better than that of the DFT: 2N points of data are represented by a 2N-point DFT, but by only an N-point DCT.

That is why I can say the DCT is better than the DFT with respect to energy compaction. And, as already explained, the DCT corresponds to a symmetric extension of the data around the point N - 1.

(Refer Slide Time: 17:39)

Here I again illustrate the DFT and the DCT: the DFT corresponds to a periodic extension of the data, while the DCT corresponds to a symmetric extension around the point N - 1. In the DFT case there is a discontinuity, a jump, at the boundary of the extension.

Because of this jump we get high-frequency distortion, and this high-frequency distortion is called the Gibbs phenomenon. In the DCT case the transition around the point N - 1 is smooth, so the high-frequency distortion is much less.

To summarize the slide: in the DCT the high-frequency distortion is small because the transition at the boundary, around the point N - 1, is smooth, whereas in the DFT the abrupt change of value at the boundary produces high-frequency distortion, the Gibbs phenomenon.

(Refer Slide Time: 19:24)

Here you can see the energy compaction property of the DFT and the DCT. In this example, for the input image shown, I first consider the DFT and plot its energy compaction in the graph, and then the DCT; energy compaction means that most of the energy is available in a few coefficients, and for the DCT it is better than for the DFT.

In the second example I show image reconstruction after performing the discrete cosine transform. I take an input image and determine its DCT, which gives the transform coefficients; these are the DCT coefficients shown. Since most of the energy is available in only a few coefficients, I keep only the high-value coefficients and neglect all the remaining ones, and from the retained coefficients I reconstruct the image. If you compare the input image and the reconstructed image visually, you cannot see a significant difference between them.
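
A hedged sketch of this reconstruction experiment (my own; a smooth synthetic array stands in for the lecture's image):

    import numpy as np

    def dct_matrix(N):
        # Orthonormal 1D DCT-II matrix
        k = np.arange(N)[:, None]
        n = np.arange(N)[None, :]
        C = np.sqrt(2.0 / N) * np.cos(np.pi * k * (2 * n + 1) / (2 * N))
        C[0, :] = np.sqrt(1.0 / N)
        return C

    t = np.linspace(0, 1, 64)
    img = np.outer(np.sin(2 * np.pi * t), np.cos(2 * np.pi * t))   # smooth stand-in image

    C = dct_matrix(64)
    coeffs = C @ img @ C.T                      # 2D DCT (rows, then columns)

    thresh = np.quantile(np.abs(coeffs), 0.90)  # keep only the 10% largest coefficients
    kept = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)

    img_rec = C.T @ kept @ C                    # inverse 2D DCT from retained coefficients
    print(np.max(np.abs(img - img_rec)))        # small error despite discarding 90% of coefficients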

So I have explained the DCT and the DFT. From the DFT you can determine the DCT by the process I showed: first define the new sequence, then calculate its DFT, and from the DFT determine the DCT. The DFT can be computed by popular algorithms such as the FFT, the fast Fourier transform, and from it you can easily determine the DCT.

If you compare the DCT and the DFT, the DCT has more advantages. The first advantage is that the DCT is a real transformation, while the DFT is not; the second is the energy compaction property, in which the DCT is better than the DFT.

I have also shown an example of how to reconstruct an image from the DCT coefficients: we keep only a few coefficients, chosen by their energy, and neglect the remaining ones, because most of the energy is available in only a few coefficients. So you can see the difference between the DCT and the DFT.

(Refer Slide Time: 22:21)

Now I show the 2D DCT, the two-dimensional Discrete Cosine Transform; it is an extension of the 1D DCT. F_c(k1, k2) uses k1 as the index along one direction (say x) and k2 along the y-direction; it involves alpha(k1), alpha(k2), the input f(n1, n2), and two cosine terms, cos(pi k1 (2 n1 + 1) / 2N) and cos(pi k2 (2 n2 + 1) / 2N).

From this expression I can also determine the inverse 2D DCT. Now, I have already explained two properties, the separable property of the kernel and the symmetric property; because of the separable property I can implement the 2D transformation first as a 1D transformation along one direction and then as a 1D transformation along the other direction.

That means a 2D DCT can be implemented by one-dimensional DCTs: I can perform the 1D DCT along the columns and then along the rows (or in the opposite order); this is because of the separable property of the kernel.

(Refer Slide Time: 23:54)

Here I have shown the procedure. My input data is f(n1, n2); first I apply the one-dimensional DCT along the rows, obtaining F(k1, n2), and after this I apply the 1D DCT along the columns. This gives the 2D DCT, the two-dimensional Discrete Cosine Transform.

(Refer Slide Time: 24:24)

So much for the DCT; next I want to discuss the concept of dimensionality reduction of data. Up to now I have discussed two transformations, the DFT and the DCT, and in both cases the transformation kernel is fixed; you already know the kernels for the DFT and the DCT.

Now I am going to discuss another transformation, the KL transformation. In the KL transformation the transformation kernel is not fixed; it depends on the statistics of the input data. That is the difference between the KL transformation and the DCT, the DFT, and also the discrete sine transform (which I have not discussed): for the DCT, the DST and the DFT the transformation kernel is fixed, whereas in the KL transformation it depends on the statistics of the input data. Now let us discuss the KL transformation.

(Refer Slide Time: 25:38)

This is the Karhunen-Loeve transformation, the KL transformation. What is the principle? First I consider a population of vectors x = [x1, x2, ..., xn] transpose. For this n-dimensional vector I can determine the mean, mu_x, and also the covariance matrix C_x, whose dimension is n x n; so I have the mean and the covariance. If I look at the elements of the covariance matrix, C_ii corresponds to the variance of x_i.

Another element, C_ij, is the covariance between x_i and x_j. One important point is that the covariance matrix C_x is a real and symmetric matrix; therefore it is always possible to find a set of n orthonormal eigenvectors. I denote the eigenvectors by e_i, and the corresponding eigenvalues by lambda_i.

So, from the covariance matrix I determine the eigenvectors and the corresponding eigenvalues, and I arrange the eigenvalues in descending order of magnitude: lambda_i is greater than or equal to lambda_(i+1) for i = 0, 1, ..., n - 2. Now, the next step is,

(Refer Slide Time: 28:18)

I want to determine the transformation matrix A. The first row of the transformation matrix is the eigenvector corresponding to the largest eigenvalue, say e1 transpose; continuing like this, the last row of the transformation matrix is en transpose, the eigenvector corresponding to the smallest eigenvalue, because, as already explained, the eigenvalues are arranged in descending order of magnitude.

Now I define the transformation y = A(x - mu_x). This transformation is called the KL transformation: A is the transformation matrix, x is the input vector, mu_x is the mean already determined, and y is the transformed data. Now I want to look at the properties of y, the transformed data.

The first property is that the mean of y is equal to 0: the output is zero-mean. I can also determine the covariance matrix of y, which is obtained from C_x and the transformation matrix A. Since the mean of y is 0, C_y is nothing but the expected value of y y transpose.

So I can write C_y = E[A(x - mu_x) (A(x - mu_x)) transpose] = A E[(x - mu_x)(x - mu_x) transpose] A transpose, and the middle term is nothing but C_x. So the covariance matrix of y is C_y = A C_x A transpose, and you can see that C_y is a diagonal matrix.

That means only the diagonal elements are non-zero and all the off-diagonal elements are 0; C_y is a diagonal matrix. What is the meaning of this? As already explained, the original data is highly correlated, but after the transformation the transformed data is uncorrelated, because the covariance matrix is diagonal. Another point is that the eigenvalues of C_y are the same as those of C_x (they appear along the diagonal of C_y), and the eigenvectors of C_y are the eigenvectors of C_x transformed by A; you can verify this.
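
A hedged numerical sketch of these properties (my own example with synthetic correlated data, not from the lecture):

    import numpy as np

    N, samples, rho = 6, 10000, 0.9
    C_model = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    x = np.random.multivariate_normal(np.ones(N), C_model, size=samples)   # rows are sample vectors

    mu = x.mean(axis=0)
    Cx = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Cx)
    order = np.argsort(eigvals)[::-1]            # descending eigenvalues
    A = eigvecs[:, order].T                      # rows of A are the eigenvectors of Cx

    y = (x - mu) @ A.T                           # KL transform y = A (x - mu) for every sample
    Cy = np.cov(y, rowvar=False)

    print(np.allclose(y.mean(axis=0), 0))         # zero-mean output
    print(np.allclose(Cy, np.diag(np.diag(Cy))))  # Cy is diagonal: transformed data is decorrelated
    print(np.allclose(np.diag(Cy), eigvals[order]))  # diagonal entries equal the eigenvalues of Cx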

(Refer Slide Time: 32:12)

Now let us consider the reconstruction of the original data; but before that, let me show an example. Suppose I take an image whose coordinate axes are labelled 0, 1, 2, 3, 4, 5 in one direction and 0, 1, 2, 3, 4 in the other, and in this image I have some pixels.

I am considering a 2D binary image: white means an object pixel is present, and black means no object, that is, 0. Corresponding to the white pixels I can build the population of vectors, one vector for each point where a pixel is present: (2, 2), (3, 1), (3, 2), (3, 3) and (4, 2). This is my population of vectors; from it I can determine the mean and also the covariance matrix.

After determining the mean and the covariance matrix, I determine the eigenvalues and eigenvectors of the covariance matrix, and from the eigenvectors I construct the transformation matrix A. After this I apply the transformation y = A(x - mu_x).

So, in this example I take one image, consider the white (object) pixels, and form the vectors x from them; from these I determine the mean, the covariance, and the transformation matrix, and then I apply the transformation. What do I get after the transformation? A new coordinate system.

The new coordinate system is aligned with the directions of the eigenvectors e1 and e2: applying the transformation gives a new coordinate system whose origin is at the centroid of the object pixels and whose axes are parallel to the directions of the eigenvectors.

Here the two axes are the eigenvectors e1 and e2. So I can say the KL transformation is essentially a rotation transformation: it aligns the data along the directions of the eigenvectors, and because of this alignment the different elements of y are uncorrelated; that is the meaning of the KL transformation.

After this, consider the reconstruction of the original data: from the transformation I can recover the reconstructed data x = A transpose y + mu_x.

Here A is an orthogonal matrix, A inverse equal to A transpose, so I can reconstruct the original data and perfect reconstruction is possible. Now let us consider a truncated transformation matrix A_k.

In this matrix I do not keep all the eigenvectors: in the first case I used all the eigenvectors to construct the transformation matrix, but in A_k I keep only k eigenvectors of C_x (the covariance matrix of the input data), namely the eigenvectors corresponding to the k largest eigenvalues.

How many rows does it have? Since I keep only k eigenvectors, the dimension of A_k is k x n. I then apply the KL transformation with the truncated matrix, y = A_k (x - mu_x), so the dimension of y is now k.

That means the dimension is reduced: the original dimension was n, and after the transformation the dimension is k, with n greater than k. This is the principle of dimensionality reduction. In this second case, exact reconstruction of the original data is not possible, because I am not using all the eigenvectors; reconstructing gives only an approximation, which is why it is written as x hat. Here A_k transpose is n x k and y has dimension k, so the reconstructed vector has dimension n. So in the second case perfect reconstruction is not possible, because not all the eigenvectors are used to construct the transformation matrix.

In this case I can determine the mean square error, since perfect reconstruction is not possible. What is the mean square error? The first term is the sum of lambda_j for j = 1 to n, that is, over all the eigenvalues, minus the sum of lambda_i for i = 1 to k, that is, over the k retained eigenvalues. Subtracting the two gives the sum of lambda_j for j = k + 1 to n, which is nothing but the sum of the neglected eigenvalues. So the mean square error equals the sum of the neglected eigenvalues.
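
A hedged sketch comparing the empirical reconstruction error of a truncated KL transform with the sum of the neglected eigenvalues (my own synthetic example):

    import numpy as np

    n_dim, k, samples, rho = 8, 3, 20000, 0.95
    C_model = rho ** np.abs(np.subtract.outer(np.arange(n_dim), np.arange(n_dim)))
    x = np.random.multivariate_normal(np.zeros(n_dim), C_model, size=samples)

    mu = x.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    A_k = eigvecs[:, :k].T             # truncated transformation matrix (k x n)
    y = (x - mu) @ A_k.T               # reduced k-dimensional representation
    x_hat = y @ A_k + mu               # approximate reconstruction (back to n dimensions)

    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    print(mse, eigvals[k:].sum())      # the two values agree: MSE = sum of neglected eigenvalues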

(Refer Slide Time: 39:38)

Now, the implementation of the KL transformation on an image. Consider an N x N image; its columns are x0, x1, x2, and so on, and every column can be treated as a vector. Treating each column x_i as a vector, I can determine the mean vector and the covariance matrix from these column vectors.

(Refer Slide Time: 40:20)

After this I determine the transformation matrix A. Here e0' is the eigenvector corresponding to the largest eigenvalue, e1' is the eigenvector corresponding to the second largest eigenvalue, and so on; I arrange all the eigenvalues in descending order of magnitude, so that lambda_0 corresponds to eigenvector e0, lambda_1 to e1, lambda_2 to e2, and so on, and from these eigenvectors I construct the transformation matrix A.

I can also form the truncated transformation matrix A_k, in which I keep only the first k eigenvectors, corresponding to the k largest eigenvalues. Having the truncated transformation matrix, what do I do next?

(Refer Slide Time: 41:36)

The transformation is shown here: y_i (k x 1) is obtained by multiplying A_k (k x N) with the mean-subtracted column vector (N x 1); after the transformation each transformed vector is k x 1. For the entire image, I collect all the y_i, one y_i for each column x_i.

If I determine y_i for every column x_i of the 2D image, that is, if I transform all the column vectors, I get N transformed vectors y_i, each of dimension k. The inverse transformation then gives the reconstructed value of x_i; it is an approximate reconstruction, because I used only the truncated transformation matrix. Collecting all the reconstructed x_i's gives the N x N reconstructed image.

(Refer Slide Time: 43:02)

In summary: in the KL transformation the eigenvalues are arranged in descending order of magnitude, and the transformation matrix is formed from the eigenvectors taken in the order of their eigenvalues. As already explained, the reconstructed value can be determined from the inverse expression, and since the transformation is orthogonal, A inverse is equal to A transpose.

If we want to retain only k transform coefficients, we use the transformation matrix formed by the k eigenvectors with the largest eigenvalues. Based on this I can define principal component analysis, which is nothing but representing the data as a linear combination of the largest (principal) eigenvectors. This is a summary of the KL transformation.

So, what is the summary of the KL transformation? The first from the population vector I can
determine the mean and the covariance and from the covariance matrix I can determine the
eigenvectors and the eigenvalues. The eigenvalues are arranged in descending order of
magnitude. And after this I can determine the transformation matrix, the transformation
matrix is obtained from the eigenvectors and after this, I can do the KL transformation.

409
And again, I can reconstruct the original data and also, I can determine the truncated
transformation matrix. In the truncated transformation matrix, I can only consider the k
number of largest eigenvectors, and based on this I can determine the truncated
transformation matrix.

Now, you can see this if I want to do compression of data, the compression depends on the
value k. So, how many eigenvectors I am considering for making the transformation matrix.
And for reconstruction what information I need? I need the information of A k and also, I
need the information of y i that is the transformed data that information I need. From this I
can determine x, x is the input data that I can determine.
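
As a rough illustration of the pipeline just summarised, here is a small NumPy sketch (not from the lecture): it assumes the population vectors are stacked as the columns of a matrix X, and the variable names and toy data are my own.

```python
import numpy as np

def kl_transform(X, k):
    """KL transform of the column vectors of X (shape N x M), keeping k components."""
    m = X.mean(axis=1, keepdims=True)          # N x 1 mean vector of the population
    C = np.cov(X)                              # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # arrange in descending order of magnitude
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    A_k = eigvecs[:, :k].T                     # truncated transformation matrix, k x N
    Y = A_k @ (X - m)                          # transformed coefficients, k x M
    X_rec = A_k.T @ Y + m                      # approximate reconstruction (A is orthogonal)
    return Y, X_rec, eigvals

# Toy example: 4-dimensional vectors, keep the 2 largest principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 100))
Y, X_rec, eigvals = kl_transform(X, k=2)
print(Y.shape, X_rec.shape)                    # (2, 100) (4, 100)
```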

So, this is about the KL transformation. The one problem of the KL transformation, here you see, is that the transformation depends on the statistics of the input data. Already I have mentioned that in other transformations the transformation kernel is fixed, but in this case, you can see, the transformation kernel is not fixed; it is derived from the input data. So, suppose we consider applications like real-time applications with non-stationary data.

Then in this case what will happen for each and every instance I have to determine the mean
of the data vector and after this, I have to determine the covariance matrix from the
covariance matrix, I have to determine the eigenvectors and eigenvalues and after this, I can
determine a transformation matrix. Since the data is non-stationary for each and every
instance, I have to do this. So, that is why the real-time implementation is very difficult it
cannot be applied.

So, that is why the KL transformation is not applied for image compression or video
compression because for non-stationary data, we have to compute all the parameters mainly
the mean covariance, and also, I have to determine the transformation matrix. So, that is why
real-time implementation is difficult. But, if I consider the decorrelating property, so it can
perfectly decorrelate the input data because I am having the diagonal covariance matrix. So,
that is why the perfect decorrelation is possible by considering the KL transformation.

(Refer Slide Time: 46:53)

Now, I want to show the principal component analysis. What is the principal component
analysis? Already, I have explained the principal component analysis means the linear
combination of largest principal eigenvectors. So, in the principal component analysis again I
am showing the same thing in the KL transformation.

So, I can reduce the dimensionality of the data set by finding a new set of variables, smaller than the original set of variables, that retains most of the sample’s information, and this is useful for compression and classification of data.

So, you can see I have the input vector x and after this, I am reducing the dimensionality, so
the dimension is reduced, because in this case, the K is less than N. So, PCA the principal
component analysis allows us to compute a linear transformation that maps data from a high
dimensional space to a lower-dimensional subspace.

(Refer Slide Time: 47:51)

And in this case, what is the high dimensional subspace you can see. So, this is my high
dimensional subspace and this is my low dimensional subspace. This v1, v2, v n you can see
here is a basis of the N-dimensional space and if I consider u1, u2, u k is the basis of the K-
dimensional space and suppose, if N is equal to K that means, in this case, the dimension is
not reduced. So, here I have shown this example, one in the high dimensional space another
one lower-dimensional space representation.

(Refer Slide Time: 48:29)

And the information loss, what is the information loss? Because of the dimensionality reduction, information will be lost. So, the goal of the PCA is to reduce dimensionality while preserving as much information as possible. This mainly means minimizing the error between the original data and its projection in the new, lower-dimensional space.

So, that means, in this case, I have to minimize this, one is the original data another one is the
reconstructed data, that I have to minimize. Now, how to determine the best lower-
dimensional subspace? The best low dimensional subspace can be determined by the best
eigenvector of the covariance matrix of x.

So, already this concept I have explained and these eigenvectors corresponding to the largest
eigenvalue, and these are called the Principal Components.

(Refer Slide Time: 49:20)

And here I have shown the principal components that is nothing but the eigenvectors. So,
orthogonal direction of the greatest variance in data. So, I have the first principal component
PC 1, the second principal component PC 2, and in this case, like this, I have the principal
components that is nothing but the direction of the eigenvectors, these are the eigenvectors.

(Refer Slide Time: 49:45)

So, already, I have explained because of the KL transformation, I will be getting a new
coordinate system. In the new coordinate system, my new axis will be the eigenvectors. So,
here you can see these are my eigenvectors e1, e2, these are eigenvectors. So, new axis are
orthogonal and represent the directions with maximum variability, and the principal
component analysis projects data along the direction where the data varies the most.

And these directions are determined by the eigenvectors of the covariance matrix
corresponding to the largest eigenvalue. So, this concept already I have explained. The
magnitude of the eigenvalues corresponds to the variance of the data along the direction of
the eigenvectors.

(Refer Slide Time: 50:36)

And here I have shown the dimensionality reduction: we can ignore the components of lesser significance. So, you can see I am considering the principal components PC1, PC2,
PC3, PC4 all the PC like this I am considering and you can see these principal components
have the maximum information. So, I can neglect the remaining principal components
because already I have explained the property, that property is called energy compaction
property.

So, most of the energy is available only in few coefficients. So, in this case, I am considering
the largest eigenvalues and the corresponding eigenvectors. So, that means the PC1, PC2,
PC3, PC4, PC5 I am considering and remaining eigenvectors I am not considering, the
principal components, that is I can select the first p eigenvectors based on the eigenvalues and
I can neglect the remaining eigenvectors.

(Refer Slide Time: 51:39)

So, what is the method again I am explaining this one. So, I have the input vector. So,
dimension is N cross 1, from N cross 1 I can determine the mean of this and after this from
the original data, I can subtract the mean and after this, I can determine the covariance matrix
I can determine, from the covariance matrix I can determine the eigenvalues I can determine
and from the eigenvalues I can determine the eigenvectors.

So, this principle already this concept I have already explained. So, from the input data how
to determine the mean and how to determine the covariance, and from that covariance, how
to determine the eigenvalues and eigenvectors.

(Refer Slide Time: 52:20)

After this, you can see, if C is symmetric then these eigenvectors u1, u2, u3, and so on form a basis, and in this case any vector x minus x bar (the mean) can be written as a linear combination of the eigenvectors, as you can see here. Also, if I consider the truncated transformation matrix, then, in this case, I can reduce the dimension in step 6.

So, this is the, I am only considering the K number of eigenvectors. In the first case, I
considered the N number of eigenvectors. So, in the step 6, I am considering the dimension
reduction step. So, you can see the dimension is reduced by this expression and how to
choose the principal component.

So, already I have explained, the principal components are selected based on the eigenvalues, the magnitude of the eigenvalues. And in this case, one condition you can consider. You can see here, this ratio should be greater than a particular threshold: in the numerator I am considering the sum of lambda_i over the K retained eigenvectors, that is, from i = 1 to K, and in the denominator I am considering the sum of lambda_i from i = 1 to N, that is, over all the eigenvalues.

So, if the ratio (lambda_1 + ... + lambda_K) / (lambda_1 + ... + lambda_N) is greater than a particular threshold, then based on this condition I can select the value of K. The threshold I have to choose; for example, I can consider 0.9 or maybe 0.95. This ratio I can determine, and from this I can select the value of K, that is, the number of eigenvectors.
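
A small sketch of this selection rule, assuming the eigenvalues are already sorted in descending order; the example values and the 0.95 threshold are only illustrative.

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    # Smallest K such that (lambda_1 + ... + lambda_K) / (lambda_1 + ... + lambda_N) exceeds the threshold.
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratio, threshold) + 1)

eigvals = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])   # assumed, sorted in descending order
k = choose_k(eigvals, threshold=0.95)
# The sum of the neglected eigenvalues is an indicator of the reconstruction error.
print(k, "components retained; neglected eigenvalue sum =", eigvals[k:].sum())
```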

(Refer Slide Time: 54:06)

And in this case, we have seen that the original vector x can be reconstructed using the principal components. So, this is my reconstructed vector. In this case, I add back the mean, but I only get an approximate x, because I am considering the truncated transformation matrix; that is why I am getting the approximate x.

And also, you can see, the error I can also determine; the low dimensional basis based on the principal components minimizes the reconstruction error. The reconstruction error can be determined like this: the error is equal to (1/2) times the sum of lambda_i taken over the neglected eigenvalues, that is, the sum of the neglected eigenvalues.

(Refer Slide Time: 55:01)

Now, I have shown one example, how to consider the PCA for face recognition? So, in this
example, I have shown in my first class, so this is the face recognition problem. So, we have
the database and we have the input images. So, in this case, I have to select whether this
particular face is available in a database or not. So, this is the task of face recognition. So, my
input images maybe like this, it may be occluded or maybe some illumination variation I may
consider and, in this case, I have to find that particular face in the database.

(Refer Slide Time: 55:37)

For this, I can apply the PCA for face recognition. So, this is my input image, and the image I can consider as an N² × 1 vector. And in this case, you can see, the original face is this; I determine the mean face, the average face, subtract it, and the result can be approximately represented like this, because I am only considering K basis vectors, that is, the face is expressed in a low dimensional space.

(Refer Slide Time: 56:11)

So, you can see here, each face minus the mean in the training set can be represented as a linear combination of K eigenvectors. Here you can see this representation: each face minus the mean is represented by a linear combination of the best eigenvectors. So, I have the eigenfaces like this; from the eigenvectors I can determine the eigenfaces, these are the eigenfaces. So, each face is represented as a linear combination of the K best eigenvectors.

(Refer Slide Time: 56:53)

So, here in this case, if you see this slide, these are my weights. So, these are my weight vectors. So, we are representing the faces on this basis.

(Refer Slide Time: 57:06)

And for the testing, what do I have to do? For a test unknown image, the same procedure is repeated. First I have to do the normalization, after this I have to do the projection onto the eigenspace, and these are my weights corresponding to the test image. And for face identification, what do I have to consider? I only have to compare the weights: the weights that I have already explained for the training, and the weights of the test image.
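
A rough sketch of the eigenface training and matching steps described above; the array shapes, function names, and toy data are assumptions for illustration, and a plain eigen-decomposition is used rather than the small-matrix trick that is normally applied to large images.

```python
import numpy as np

def train_eigenfaces(faces, k):
    """faces: (num_faces, N*N) array of vectorised training images."""
    mean_face = faces.mean(axis=0)
    A = faces - mean_face                          # subtract the mean face
    C = np.cov(A, rowvar=False)                    # (N*N) x (N*N) covariance (illustrative only)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigenfaces = eigvecs[:, order[:k]]             # k best eigenvectors, the "eigenfaces"
    weights = A @ eigenfaces                       # each training face as a k-dimensional weight vector
    return mean_face, eigenfaces, weights

def recognise(test_face, mean_face, eigenfaces, weights, threshold):
    w = (test_face - mean_face) @ eigenfaces       # project the test face onto the eigenspace
    dists = np.linalg.norm(weights - w, axis=1)    # compare weight vectors
    j = int(np.argmin(dists))
    return j if dists[j] < threshold else None     # identified face index, or "unknown"

rng = np.random.default_rng(1)
faces = rng.random((20, 64))                       # 20 toy "faces" of 64 pixels each (assumed data)
mean_face, eigenfaces, weights = train_eigenfaces(faces, k=5)
print(recognise(faces[3], mean_face, eigenfaces, weights, threshold=1.0))   # returns 3
```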

Because already I have the eigenfaces and if this error is less than a particular threshold, then
a particular face is recognized or identified. So, this is the face recognition by the principle of
PCA. So, in this class, I discuss the concept of the KL transformation, a very important
concept the KL transformation and after this, I discuss the concept of the PCA the principal
component analysis and after this, I discussed on application. So, how to recognize face by
using the PCA? So, let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M.K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, India
Lecture 13
Image Transform: Introduction to Wavelet Transform
Welcome to the NPTEL MOOC course in Computer Vision and Image Processing –
Fundamentals and Applications. In my last classes I discussed the concept of image
transformation; I discussed the concept of the Fourier transformation, discrete cosine
transformation, and the KL transformation. Fourier transform gives the frequency
information present in a signal, but there is a drawback.

The major drawback of the Fourier transform, it does not give the time information that
means, at what time a particular event took place, and that information is missing. So, to
consider that issue, we are now considering another transformation that transformation is the
Wavelet transformation.

So, today I am going to discuss the fundamental concept of Wavelet transformation. It is not possible to discuss all the mathematical concepts of the Wavelet transformation; that is why I will briefly discuss the fundamental concept of the Wavelet transformation. So, let us see what the Wavelet transformation is.

So, first I will discuss the concept of the Fourier series and after this, I will discuss the
Fourier transformation and I will highlight the disadvantages, the drawbacks of the Fourier
transformation and after this, I will discuss the STFT, the Short Time Fourier Transformation
and finally, I will discuss the discrete wavelet transformation.

(Refer Slide Time: 02:01)

So, the first one is the Fourier series. You can see here I am considering a periodic function x(t) with period 2*pi. Then this x(t) can be represented by the Fourier series, x(t) = a0 + sum over k = 1 to infinity of [ a(k) cos(kt) + b(k) sin(kt) ], so I have two components, one is the cosine component and another one is the sine component.

So, due to this Fourier series, I will be getting the fundamental component of the signal and also the harmonic components present in the signal. The coefficient a0 is given by this, and the coefficients a(k) and b(k) are given by these expressions. So, I think you already know about the Fourier series.

(Refer Slide Time: 02:48)

I can give you one example of the Fourier series: a square wave is composed of the fundamental and the harmonics. So, if I consider a square wave, it is composed of the fundamental and the harmonic components, and you can see the fundamental component and all the harmonic components present in the signal.
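
A small NumPy illustration of this idea: a square wave approximated by a partial Fourier-series sum of its odd harmonics (the 4/(pi*k) coefficients are the standard Fourier-series coefficients of a unit-amplitude square wave, not values from the slide).

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
square = np.sign(np.sin(t))                       # ideal square wave of period 2*pi

def square_wave_partial_sum(t, n_harmonics):
    # Only odd harmonics appear, with coefficients 4/(pi*k).
    x = np.zeros_like(t)
    for k in range(1, 2 * n_harmonics, 2):        # k = 1, 3, 5, ...
        x += (4.0 / (np.pi * k)) * np.sin(k * t)
    return x

for n in (1, 3, 10):
    approx = square_wave_partial_sum(t, n)
    print(n, "harmonics, max deviation from the ideal square wave:",
          float(np.max(np.abs(approx - square))))
```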

(Refer Slide Time: 03:08)

After this, the next development is the Fourier transformation. So, already I have discussed the discrete Fourier transformation (DFT); the DFT can be implemented by the FFT, and also, last class, I discussed the DCT, the discrete cosine transformation. The DFT gives the frequency information present in an image or a signal. So, here I have shown the image of Fourier and this is the 2D-DFT magnitude spectrum of the Fourier image.

So, this is the Fourier transform of the Fourier image, that is, the Fourier spectrum. This is the center point of the Fourier spectrum; the central portion corresponds to the low frequency part, and the outside part corresponds to the high frequency part. Now, before the Fourier transform, I have to multiply the image: if I consider one image f(x, y), the image is multiplied by (-1) to the power (x + y); that is the pre-processing I have to do.

The image is multiplied by (-1) to the power (x + y). Because of this, if the size of the image is, suppose, N × N, then the Fourier transform will be centred at the point (N/2, N/2). So, that means, the pre-processing is that the image is multiplied by (-1) to the power (x + y), and corresponding to this, the center of the Fourier transform spectrum will be (N/2, N/2).

And in this case, last class I have shown that log transformation is used for displaying the
Fourier transformation. The log transformation is used to compress the dynamic range of the
pixels for better visualization.
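
A small NumPy illustration of this pre-processing, assuming a random array as a stand-in for the image: multiplying by (-1)^(x+y) before the 2D DFT is equivalent, for even image sizes, to shifting the spectrum so that the zero frequency moves to the centre (which is what np.fft.fftshift does), and the log transformation is then applied for display.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((256, 256))                        # stand-in for an input image f(x, y)

# Pre-processing: multiply by (-1)^(x+y) so the spectrum is centred at (M/2, N/2).
x, y = np.indices(f.shape)
F_centred = np.fft.fft2(f * (-1.0) ** (x + y))

# Equivalent route: transform first, then shift the quadrants.
F_shifted = np.fft.fftshift(np.fft.fft2(f))
print(np.allclose(F_centred, F_shifted))          # True

# Log transformation to compress the dynamic range for display.
spectrum_for_display = np.log(1.0 + np.abs(F_centred))
```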

(Refer Slide Time: 05:00)

And in my last classes, I discussed the 2D Fourier transformation. So, if the input image is f(x, y), the corresponding Fourier transform is F(u, v). So, f(x, y) and F(u, v) form the 2D transformation pair, that is, the 2D-DFT: f(x, y) is transformed to F(u, v), where u is the spatial frequency along the x direction and v is the spatial frequency along the y direction. If I apply the Fourier transform, I get F(u, v), and from this I can determine the inverse Fourier transformation.

(Refer Slide Time: 05:36)

And in this example, I have considered three images, and corresponding to these images I have shown the Fourier spectra: the first image has vertical lines, then horizontal lines, and diagonal lines, and corresponding to these I am getting the Fourier spectra like this. So, u and v are the spatial frequencies in the horizontal and the vertical directions, in radians per unit length, respectively.

(Refer Slide Time: 06:01)

And already, I have mentioned that this Fourier transform can be represented in the Polar
form. So, I have the magnitude part and the phase angle part, the phase angle I can determine
like this. So, I have two components, one is the real part another one is the imaginary part of
the Fourier spectrum.

(Refer Slide Time: 06:19)

And this pre-processing I have shown here, which I mentioned earlier. So, the image is multiplied by (-1) to the power (x + y) and after this the Fourier transform is applied; then the Fourier transform will be centred at the point (M/2, N/2).

If the size of the image is, suppose, M × N, with M rows and N columns, then this will be the center of the Fourier transformation: the centre will be (M/2, N/2), the center of the Fourier transform spectrum. So, this pre-processing I have to do before the Fourier transformation.

(Refer Slide Time: 07:03)

And in my last class, I have shown the image reconstruction from the magnitude information and the phase information. So, you have seen here, the input image is f(x, y) and this is the magnitude spectrum; I am using the log transformation to compress the dynamic range. After this, I am considering the phase spectrum, the phase spectrum is this. In the first case, I am only considering the magnitude information, I am not considering the phase information, and this is my reconstructed image.

In the second case, I am considering the phase information, the magnitude information is not
considered because the magnitude is constant, then corresponding to this, this is the
reconstructed image. So, perfect reconstruction is not possible, if I only consider the phase
information or the magnitude information. So, for perfect reconstruction I need both phase
information and magnitude information.
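
A minimal sketch of this experiment, using a random array as a stand-in for the image; it reconstructs from the magnitude only, from the phase only, and from both.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((128, 128))                         # stand-in for the input image

F = np.fft.fft2(f)
magnitude, phase = np.abs(F), np.angle(F)

# Reconstruction from magnitude only (phase set to zero) ...
f_mag_only = np.real(np.fft.ifft2(magnitude))
# ... and from phase only (magnitude set to a constant).
f_phase_only = np.real(np.fft.ifft2(np.exp(1j * phase)))
# Using both magnitude and phase gives the original image back.
f_both = np.real(np.fft.ifft2(magnitude * np.exp(1j * phase)))
print(np.allclose(f_both, f))                      # True
```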

(Refer Slide Time: 08:01)

Now, in an image, what do you mean by the frequency? So, suppose if I consider one image,
suppose if I consider edges here, one edge is and this is one intensity portion and that is
another intensity portion. And, in this case, I am considering the edge. So, in this case, if I
draw the profile here, suppose intensity is something like this and the edge is present at the
location at this, at the edge there is a sudden change of the grayscale intensity value.

So, that means, if I consider the edges that corresponds to the high frequency information and
if I consider the constant intensity portion or the homogeneous portion, that portion
corresponds to the low frequency information. So, an edge means the high frequency
information because there is an abrupt change of grayscale intensity value.

And in the Fourier transform, if I apply the Fourier transform, so already I have explained.
So, the central portion corresponds to the low frequency information, so this is my low
frequency information and if I consider the outside portion that corresponds to the high
frequency information.

In this example, you can see this example here, I am considering one image and I am
applying the Fourier transformation and I am getting the Fourier spectrum, the spectrum is
this. Next what I am doing, I am considering the Fourier spectrum, and I am applying the
inverse Fourier transformation to reconstruct the original image. So, I am getting the
reconstructed image.

So, this is about the Fourier transformation and the inverse Fourier transformation, and the perfect reconstruction is possible because I am considering all the frequency components present in the signal; that means, all the high frequency components and the low frequency components are considered for the reconstruction.

(Refer Slide Time: 10:13)

Now, in this example, I have shown here, I am considering the inverse Fourier
transformation, but in this case, I am only considering the central portion of the Fourier
transformation. The central portion of the Fourier transformation or the spectrum corresponds
to the low frequency.

And in this case, if I want to reconstruct the original image, the perfect reconstruction is not
possible because I am only considering the low frequency information and corresponding to
this if I apply the inverse Fourier transformation, then I will be getting this image.

In the second case, I am not considering the central portion of the Fourier spectrum, that is
the low frequency information I am not considering, only I am considering the high
frequency information of the Fourier spectrum and corresponding to this I am reconstructing
the image by using the inverse Fourier transformation.

So, I am getting the reconstructed image that means, I am getting the high frequency
information. So, this is about the low frequency information and the high frequency
information present in the Fourier spectrum.
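
A minimal sketch of these two reconstructions, assuming an ideal circular mask of an arbitrary radius (the radius value is not from the lecture): the central (low-frequency) part and the outer (high-frequency) part of the spectrum are inverse-transformed separately.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((256, 256))                         # stand-in for the input image

F = np.fft.fftshift(np.fft.fft2(f))                # centred spectrum
rows, cols = f.shape
u, v = np.indices((rows, cols))
dist = np.sqrt((u - rows / 2) ** 2 + (v - cols / 2) ** 2)

radius = 30                                        # assumed cut-off radius
low_pass = dist <= radius                          # keep only the central (low-frequency) part
high_pass = ~low_pass                              # keep only the outer (high-frequency) part

f_low = np.real(np.fft.ifft2(np.fft.ifftshift(F * low_pass)))
f_high = np.real(np.fft.ifft2(np.fft.ifftshift(F * high_pass)))
print(np.allclose(f_low + f_high, f))              # the two parts add back to the original image
```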

(Refer Slide Time: 11:23)

So, that is why I can say that the central part of the Fourier transformation, that is, the low frequency components, are responsible for the general gray-level appearance of an image. That means this low frequency information gives the general appearance of an image, and if I consider the high frequency components of the Fourier transformation, these are responsible for the detail information of an image. So, the high frequency information gives the detail information of an image.

(Refer Slide Time: 12:00)

So, in this example, also I have shown here and the input image I have shown here and
corresponding to this I am considering the Fourier spectrum and I am using the log
magnitude spectrum, because I have to compress the dynamic range for better visualization.

So, you can see the lower frequency component present in the Fourier spectrum and the high
frequency that is the detail information present in the Fourier spectrum. So, central portion
corresponds to the low frequency information and if I consider the outside portion, the outer
portion of the spectrum that correspond to the high frequency or the detail information.

(Refer Slide Time: 12:40)

And in this case, what I am doing the reconstruction of the image from the Fourier spectrum.
So, in this case, I am applying the inverse Fourier transformation. So, you can see here, only
this portion that is this information is considered that means only the low frequency
information is considered and corresponding to this, this is my reconstructed image. In the
second case, thus, I am considering more information, low frequency information or maybe
some high frequency information and corresponding to this, this is my reconstructed image.

Again, I am considering more information, which may contain low frequency as well as some high frequency information, and corresponding to this, this is my reconstructed image. And similarly, if I consider this one, that is, this information,
then in this case, I have both low frequency information and also high frequency information
and corresponding to this, this is my reconstructed image. So, you can understand the concept
of the Fourier transformation.

(Refer Slide Time: 13:44)

So, Fourier transformation breaks down a signal into constituent sinusoids of different
frequencies. So, this is my Fourier transformation expression, but one drawback in
transforming to the frequency domain is that, the time information is lost. So, when looking
at the Fourier transformation of a signal, it is impossible to tell when a particular event took
place. So, that is the main drawback of the Fourier transformation.

So, up till now, I have discussed the concept of the Fourier transformation in an image, I have
explained the central portion of the Fourier spectrum, it gives the low frequency information
and if I consider the outer portion of the Fourier spectrum, it gives the high frequency
information. So, for perfect reconstruction, I need both low frequency information and the
high frequency information.

Now, after this I have considered the Fourier transformation expression. And already I have explained one main drawback of the Fourier transformation: the time information is not available, that is, at what time a particular event took place; that information is not available in the Fourier transformation. So, for this we have to consider another transformation; first I will discuss the STFT, and after this, we will discuss the DWT, the discrete wavelet transformation.

(Refer Slide Time: 15:12)

So, corresponding to the Fourier transformation I can give one example. If I consider this signal, it has two frequency components, one at 50 hertz and another at 100 hertz. Corresponding to this signal, I have the Fourier spectrum, which is this. So, you have seen that there are two frequency components, one is the 50 hertz frequency component and the other is the 100 hertz frequency component, corresponding to that signal.

(Refer Slide Time: 15:41)

In the second case, I am considering one non-stationary signal; the signal is represented by
this. So, it is f 2 t and I am considering the non-stationary signal and a signal will be
something like this and corresponding to this signal also I am getting the peaks corresponding
to 50 hertz and 100 hertz, so this is my spectrum.

So, if you compare the spectrum and the previous spectrum, the previous spectrum was this
for the stationary signal, then in this case I am getting identical spectrum one for the
stationary signal and the second one is for the non-stationary signal. So, that means, the time
information is not available in the Fourier transformation.

(Refer Slide Time: 16:26)

I can give another example you can see there are two signals one is the stationary signal. The
first one is the stationary signal and second one is the non-stationary signal and you can see
the frequency components present in the signal 2 hertz, 10 hertz, 20 hertz like this. In the
nonstationary also you can see the frequency components present in the signal and
corresponding to this, you can see the spectrum, the magnitude spectrum and if you see these
two spectrum, they are almost identical. So, that means the time information is not available
in the Fourier spectrum.
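
A small numerical illustration of this point (the signals, sampling rate, and frequencies below are assumptions, not the exact signals on the slide): a stationary and a non-stationary signal built from the same two frequencies have very similar magnitude spectra.

```python
import numpy as np

fs = 1000                                          # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / fs)

# Stationary signal: 50 Hz and 100 Hz present at all times.
x_stat = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 100 * t)

# Non-stationary signal: 50 Hz in the first half, 100 Hz in the second half.
x_nonstat = np.where(t < 0.5,
                     np.sin(2 * np.pi * 50 * t),
                     np.sin(2 * np.pi * 100 * t))

# Both magnitude spectra show peaks at 50 Hz and 100 Hz; the spectrum alone
# cannot tell us *when* each frequency was present.
freqs = np.fft.rfftfreq(len(t), 1.0 / fs)
print(freqs[np.argsort(np.abs(np.fft.rfft(x_stat)))[-2:]])     # peaks at 50 and 100 Hz
print(freqs[np.argsort(np.abs(np.fft.rfft(x_nonstat)))[-2:]])  # peaks near 50 and 100 Hz
```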

(Refer Slide Time: 17:03)

Again, I am considering the same explanation. You can see I am considering two signals, different in the time domain but the same in the frequency domain. That means, at what time a particular frequency component occurs, that information is not available in the spectrum.

(Refer Slide Time: 17:24)

So, for this we will consider another transformation. So, that transformation is the short time
Fourier transformation STFT. So, for this we consider the window, particular window I am
considering. So, suppose this window time window I am considering and corresponding to
this time window, I want to see what are the frequency components present in the signal.

So, that means, take the Fourier transformation of segmented consecutive pieces of a signal, and each Fourier transformation then provides the spectral content of that time segment only. So, corresponding to that time segment, if I consider this time segment or time window, I can see what frequency components are present in the signal; that is the short time Fourier transformation.

(Refer Slide Time: 18:17)

This short time Fourier transformation is also called the Gabor transform. For this I am considering one window function, the window function is centred at tau; so, corresponding to this I am considering this window, and corresponding to this window I want to see what frequency components are present in the signal. In other words, the STFT takes the Fourier transform of the signal multiplied by a window w(t - tau), evaluated for each window position tau.

So, this window is considered in the time domain. Corresponding to this time interval, what frequency components are present in the signal, that is what I want to see. That means, corresponding to this time window, I can see what frequency components are present in the signal, and like this, I can see the frequency components present in the signal. This is the short time Fourier transformation, and it is also called the Gabor transform.

(Refer Slide Time: 19:07)

So, in this example, I have shown the STFT. So, I am considering a non-stationary signal. So,
you can see the frequency components. So, from this point to this point, if you see from this
to this time window, only one frequency component is present that is available in the Fourier
spectrum and corresponding to this time interval two frequency components are present in the
signal. So, I am having the two frequency components in the Fourier spectrum.
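
This kind of analysis can be reproduced with SciPy's short-time Fourier transform; the test signal and the window length below are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.signal import stft

fs = 1000
t = np.arange(0, 2.0, 1.0 / fs)
# Non-stationary signal: 50 Hz in the first second, 50 Hz plus 100 Hz in the second.
x = np.sin(2 * np.pi * 50 * t) + np.where(t >= 1.0, np.sin(2 * np.pi * 100 * t), 0.0)

# STFT with a fixed-length window (nperseg samples per segment).
f, tau, Zxx = stft(x, fs=fs, nperseg=256)
print(Zxx.shape)                       # (frequency bins, time segments)

# For each time segment, the strongest frequency bin indicates which component dominates.
dominant = f[np.argmax(np.abs(Zxx), axis=0)]
print(np.round(dominant[:3]))          # close to 50 Hz in the first second
print(np.round(dominant[-3:]))         # close to 50 Hz and/or 100 Hz in the second second
```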

(Refer Slide Time: 19:38)

So, what is the STFT, take Fourier transform of segmented consecutive pieces of a signal?
Each Fourier transformation then provides the spectral content of their time segment only.
But one problem is how to select the time window? That is the one main problem of the
STFT. So, how to select the time window that is the problem?

Now, I want to consider this case that is the low frequency signal, better resolved in the
frequency domain and the high frequency signal better resolved in the time domain. This
concept I can explain in the next slide. What is the meaning of this? This is a very important
concept. You can see here.

(Refer Slide Time: 20:22)

Here, I have shown two signals, one is the low frequency signal. The first one is a low
frequency signal and another one is the high frequency signal. So, if I consider low frequency
signal, that signal can be better resolved in the frequency domain. That means, in the
frequency domain I can see the information present in the signal and in this case if I consider
a high frequency signal that can be better resolved in the time domain.

(Refer Slide Time: 20:50)

Based on this concept I can state the uncertainty theorem. The uncertainty theorem says we cannot calculate the frequency and the time information of a signal with absolute certainty. This is similar to the Heisenberg uncertainty principle involving the position and the momentum of a particle. In the Fourier transform, we use a basis which has infinite support and infinite energy.

In wavelet transformation, we have to localize both in time domain and in the frequency
domain. So, that means, in the time domain, we can do some translation of the basis function
and in frequency domain we can do scaling. So, this is the uncertainty theorem.

(Refer Slide Time: 21:35)

So, in this example, you can see, I am considering the signals in the time domain, and in the frequency domain you can see the signals; but one problem is that, at what time a particular event took place, at what time a particular frequency is present in the signal, that information is missing in these examples, that is uncertain.

(Refer Slide Time: 21:55)

So, in this example, I have shown the STFT of a signal, you can see the frequency
information, the time information, and the amplitude information that we can obtain by using
STFT.

(Refer Slide Time: 22:09)

Now, this uncertainty principle I can explain here. If I consider the narrow window, that gives poor frequency resolution, and if I consider the wide window, that gives poor time resolution. In the first example, I am considering the narrow window, and in the second case I am considering the wide window; in this case, this is very similar to the Heisenberg uncertainty principle: we cannot know what frequencies exist at what time intervals, that information is not available. So, in these two examples, I have shown the two cases, one for the narrow window and another one for the wide window.

(Refer Slide Time: 22:48)

The next I am considering the resolution of the time and the frequency. So, if you see these
windows, I am considering different windows here. So, if I consider this window, this gives
better frequency resolution, but it is poor time resolution. But if I consider this window, this
window gives better time resolution, but poor frequency resolution. So, you can see the
concept of the resolution, one is the time resolution another one is the frequency resolution.
And in this case, I have shown, this is the time axis and this is the frequency axis.

(Refer Slide Time: 23:30)

And, in this example, I have shown these cases. First, the time domain representation, that is, Shannon: in the time domain we have the time information and the amplitude information of a signal. And in the frequency domain, which we have explained with the Fourier transformation, we have the amplitude information and the frequency information, but the time information is not available.

And if I consider the STFT, that is the Short Time Fourier Transformation. So, we can see
that corresponding to a particular time window, I have the frequency information present in
the signal. In case of the wavelet analysis, we are considering variable size windows, I can
consider the narrow window or maybe something like the window something like this.

And in this case, I can see in a particular time interval, what are the frequency component
present in the signal. So here, I have the time information and scale information gives the
frequency information that is the wavelet analysis. So, you can see the distinction between
the signal representation in the time domain, frequency domain, and the STFT, and the
wavelet analysis.

(Refer Slide Time: 24:43)

So finally, you can see, for analysis of the high frequencies, we can consider narrow windows
for better time resolution. And for the analysis of the low frequency signal, we can use wider
windows for better frequency resolution. The function used to window the signal is called a
wavelet. So, this is the main concept of the wavelet transformation, the window size is not
fixed in the wavelet transformation. In STFT, the window size is fixed.

(Refer Slide Time: 25:15)

So, you can see this example, I am considering one nonstationary signal, A signal with three
frequency components at varying times. So, in this case, you can see the wide windows do
not provide good localization at high frequency. So, if I consider this window first this wide
window, this wide window do not provide good localization at high frequency.

(Refer Slide Time: 25:42)

And if I considered a narrow window that is good for the high frequency, use narrow
windows at high frequency. And corresponding to the low frequency component, if I consider
low frequency part, the narrow windows do not provide good localization at low frequencies.
So, for this we have to consider wide windows. So, like this we have to consider the wide
window. So, this is my wide window that I am considering. This is the concept of the variable
size windows.

(Refer Slide Time: 26:14)

So, what is a wavelet? Wavelets are functions that wave above and below the x axis. They have varying frequency, limited duration, and an average value of zero. So, here you can see, one example is a wavelet and another one is a sinusoid. In case of the wavelet, we have varying frequency, limited duration, and the average value is zero.

(Refer Slide Time: 26:41)

So, what is a mother wavelet? In wavelet we have a mother wavelet and in this case the
mother wavelet will be something like this. And mathematically, it can be shown like this,
this is the definition of the wavelet. So, I am considering the mother wavelet, the mother
wavelet is denoted by psi t, this is the mother wavelet.

And the other wavelets, as I have already defined, are given by psi_{a,b}(t) = (1/sqrt(a)) psi((t - b)/a). So, in this case, I have two parameters a and b, which are two arbitrary real numbers. Now, this a is the dilation parameter and b is the translation parameter. So, now, how to represent the mother wavelet?

The mother wavelet psi(t) is nothing but the case a = 1, b = 0: if I put a = 1 and b = 0, I will be getting back the mother wavelet. And if I consider a not equal to 1 and b = 0, then I get psi_{a,0}(t) = (1/sqrt(a)) psi(t/a); that means I am considering a not equal to 1 and b equal to 0.

So, what is the meaning of this? The meaning is that this is obtained from the mother wavelet, where the time is scaled by a and the amplitude is scaled by sqrt(a). The parameter a causes a contraction of the mother wavelet if a is less than 1, and an expansion or stretching of the mother wavelet if a is greater than 1. This parameter a is called the dilation parameter or the scaling parameter.

And what will happen if a is less than 0? It corresponds to time reversal with dilation. Now, the function psi_{a,b} is a shift of psi_{a,0} by an amount b: when b is greater than 0, there is a shift towards the right along the time axis by an amount b. On the other hand, the function is shifted to the left along the time axis by an amount b when b is less than 0.

So, that means shifting in the right direction and shifting in the left direction in the time axis,
if b is greater than 0 that means it is the shifting right along the time axis by an amount b and
in this case, if the b is less than 0, then in this case shifting in the left along the time axis by
an amount b. So, this is the interpretation of the parameters a and b and I have defined the
mother wavelet and from the mother wavelet, you can see I can get the other wavelets.

(Refer Slide Time: 31:00)

So, in this case, I had given some examples of the mother wavelet. So, first one is the
Daubechies wavelet, one is the Haar wavelet, and another one is the Shannon wavelet. I have
not mathematically defined, but you can see the shape of this wavelets, one is the Daubechies
wavelet, Haar wavelet, and the Shannon wavelet.

(Refer Slide Time: 31:20)

So, in this example, I can show you the mother wavelet and, in this case, I am considering a
is equal to 1, a is equal to 2, a is equal to 4, that means what is the a? a is the dilation
parameter and b is the translation parameter.

(Refer Slide Time: 31:36)

So, you can see I am doing the scaling because I am changing the parameter, the parameter is
a, so a is equal to 1, a is equal to 2, a is equal to 4, like this. So, I am doing the scaling.

(Refer Slide Time: 31:48)

And also, I can do the translation with the help of the parameter, the parameter is b. So, b is
the translation parameter. So, in this case, I am considering the shift, shift is equal to 0, shift
is equal to 100, shift is equal to 200. So, I can do the translation and the dilation.

(Refer Slide Time: 32:07)

In this case, I have shown here, I have the mother wavelet, that is the Haar wavelet, and I can
do the translation and also, I can do the scaling. So, in this case, instead of considering the
parameters a and b, I am considering the parameter s and the tau. So, the tau is for the
translation and s is for the scaling.

(Refer Slide Time: 32:26)

Now, let us consider the definition of the continuous wavelet transformation. C(tau, s) is the continuous wavelet transformation. In this case, I am considering the input signal f(t), and I am considering the mother wavelet psi((t - tau)/s), so that C(tau, s) = (1/sqrt(s)) * integral of f(t) psi((t - tau)/s) dt. And in this case, the mother wavelet can be translated and it can be scaled. So, what is the meaning of the scaling?

Scaling gives frequency information; the scale is roughly equal to 1/frequency. So, in this case, you can see these parameters: tau is the translation parameter, which is a measure of time, and s is the scaling parameter, which is nothing but a measure of the frequency. So, this is the definition of the forward continuous wavelet transformation of the signal f(t).

(Refer Slide Time: 33:26)

And you can see, this W(s, tau) is the continuous wavelet transformation of f(t), and psi_{s,tau} is the wavelet function I am considering; the transform is nothing but the inner product between f(t) and the wavelet function. From this you can reconstruct the original signal by the inverse CWT, the inverse continuous wavelet transformation, which is obtained like this.

And in this case, C_psi corresponds to the energy, and it is given by this expression; here psi(F) is the Fourier transform of the mother wavelet psi(t). So, this is the definition of the continuous wavelet transformation, and we can also determine the reconstructed signal, that is, the inverse continuous wavelet transformation.
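
A direct, deliberately brute-force numerical sketch of the forward CWT definition, using a Mexican-hat mother wavelet; in practice a library routine would be used, and the signal, scales, and translations below are arbitrary choices, not values from the lecture.

```python
import numpy as np

def mexican_hat(t):
    # A common mother wavelet (second derivative of a Gaussian); it has zero average.
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def cwt(signal, t, scales, translations):
    """Brute-force C(tau, s) = (1/sqrt(s)) * integral f(t) * psi((t - tau)/s) dt."""
    dt = t[1] - t[0]
    C = np.zeros((len(translations), len(scales)))
    for i, tau in enumerate(translations):
        for j, s in enumerate(scales):
            wavelet = mexican_hat((t - tau) / s) / np.sqrt(s)
            C[i, j] = np.sum(signal * wavelet) * dt        # Riemann-sum approximation of the integral
    return C

fs = 200.0
t = np.arange(0, 4.0, 1.0 / fs)
f_t = np.where(t < 2.0, np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 20 * t))

scales = np.linspace(0.01, 0.2, 40)        # small scale corresponds to high frequency
translations = t[::20]                     # coarse grid of window positions
C = cwt(f_t, t, scales, translations)
print(C.shape)                             # (number of translations, number of scales)
```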

(Refer Slide Time: 34:24)

So, for CWT what are the main steps? You can see, take a wavelet and compare it to the
section at the start of the original signal. So I am considering one wavelet, the wavelet is this.
And one section I am considering and wavelet is this. Calculate a number C that represents
how closely correlated the wavelet is with the section of the signal. The higher C is, the more
the similarity.

(Refer Slide Time: 34:58)

And after this, shift the wavelet to the right and repeat steps one and two, until you have
covered the whole signal. So, these steps, I have to repeat for the whole signal.

(Refer Slide Time: 35:12)

And after this, I have to do the scaling, the wavelet and repeat the steps 1 to 3. And finally,
repeat steps 1 to 4 for all the scales. That means, I want to find a similarity between the
wavelet and the signal at different scales. So this example I have given, corresponding to
CWT, the continuous wavelet transformation.

(Refer Slide Time: 35:33)

This CWT you can display like this. Already, I have defined the continuous wavelet
transformation and CWT you can display like this. So we have the time information and
CWT coefficients. You will be getting the CWT coefficients and the scale information is also
available, this should be CWT.

(Refer Slide Time: 35:53)

And in this case, it is a CWT of a stationary signal. So, if I consider this signal and this is the
CWT of the stationary signal, so you can see here I have this information scaling information,
the translation information, and the coefficients. Coefficients means, the continuous wavelet
transform coefficients. The scale gives the frequency information.

(Refer Slide Time: 36:17)

Similarly, I am considering the CWT of another signal and signal is this. And in this case,
you can see I have the scale information, the translation information, and the coefficients of
the CWT.

(Refer Slide Time: 36:17)

CWT of non-stationary signal, so I am considering one non-stationary signal and corresponding to this you can see the CWT. And in the MATLAB also, you can show the
CWT of the non-stationary signal or maybe the stationary signal.

(Refer Slide Time: 36:44)

So, you can see the difference between the Fourier transform and the wavelet transformation.
So, I have shown the signal, the signal is this and if I apply the Fourier transformation, that is
nothing but the signal is weighted by F u, that is the Fourier transform of the signal, the signal
is weighted by F u.

And in case of the wavelet transformation, I am considering this signal that means the signal
is weighted by C tau s that means I am considering the continuous wavelet transformation.
So, you can see the similarity between the Fourier transform the wavelet transformation. But
in this case, in the case of the continuous wavelet transformation, we are considering different
scales and positions, because I have two parameters, one is s another one is tau, one is the
scaling parameter, another one is the translation parameter.

(Refer Slide Time: 37:43)

And for the wavelet series expansion, I have already defined the wavelet like this; if I consider the scale s = s0^(-m) and the translation tau = n * tau0 * s0^(-m), then the scale and the translation take place in discrete steps. In this case, for the discrete wavelet transformation, I can write the expression like this.

Because I am considering the discrete wavelet transformation, where the scale and the translation take place in discrete steps, if I consider s0 = 2 and tau0 = 1, then corresponding to this I have the expression psi_{m,n}(t) = 2^(m/2) psi(2^m t - n), and this is called the dyadic wavelet. So, this is about the wavelet series expansion.

(Refer Slide Time: 38:40)

And in this case, the signal f(t) can be represented as a series combination of the wavelets. You can see the signal f(t) is represented as a series combination of the wavelets: I have the wavelet coefficients and this is my wavelet. The wavelet coefficient is nothing but the inner product between f(t), the signal, and psi_{m,n}(t), the wavelet I am considering.

Next, I am considering multi resolution analysis. In this case, I am giving one example of an image. An image may contain big objects or small objects, or maybe low contrast regions or high contrast (good contrast) regions. So, if I want to analyze a particular image and suppose small objects are present, then in this case I have to consider high resolution.

Because in the high resolution, I can see small objects and also the regions of the bad contrast
and if I want to consider the big objects or maybe the good contrast regions, then in this case
I can consider low resolutions, that means, an image can be analyzed in different resolutions,
maybe in the high resolution or maybe in the lower resolutions.

For different cases, I have given this example: for the big objects, I can consider low
resolutions. For the good contrast region, I can consider the low resolution and if I consider
small objects, then I have to consider the high resolution, if I consider low contrast region,
then in this case also I have to consider high resolution. So, that means, at different
resolution, I can analyze a particular image. And based on this concept, you can see here.

(Refer Slide Time: 40:42)

I am considering different resolutions for different image regions. So, you can see the N by N
image I am considering that corresponds to the highest resolution and after this if you see this
pyramid, the resolution is decreased. The next resolution is N by 2 cross N by 2 and finally, I
have that 1 by 1 resolution. So, this is the approximation pyramid I am considering. So, an
image I am considering or maybe the signal I am considering, and in this case, the signal or
the image can be analyzed at different resolutions.

(Refer Slide Time: 41:19)

So, that is the objective of the multi resolution analysis. So, analyze the signal at different
frequencies with different resolutions. Analyzing a signal both in time domain and frequency
domain is needed in many times. But resolution in both domains is limited by the uncertainty
principle. And already I have explained good time resolution and poor frequency resolution at
high frequencies and good frequency resolution and poor time resolution at low frequencies.

(Refer Slide Time: 41:51)

So, you can see this example here, I am considering one signal here and I am doing the
reconstruction of the signal. So, this is my original signal I am doing the reconstruction from
the low frequency component and the high frequency component. So, this is my low
frequency information of the signal, this low frequency information is combined with the
detail information.

The detail information is D1 and based on this I am doing the reconstruction. So, these are
reconstruction so far. After this, I am again considering the detail information that is the high
frequency information and this is the reconstruction I am doing. And after this, again I am
considering more detailed information, then after this I am getting the reconstructed signal.

So, that means, with the low frequency signal, I am considering the high frequency
information for reconstruction. So, a signal can be also decomposed into low frequency
information and the high frequency information. In this example, I have shown the synthesis
of the signal that is the reconstruction of the signal, but for analysis, I can decompose a
particular signal into low frequency signal and the high frequency components. So, in this
case I have shown the synthesis.

(Refer Slide Time: 43:08)

So, that means, a particular signal can be represented like this: the first one is the low frequency component, and after this, D1, D2, D3, and so on, that is, the detail information. That means the wavelet representation of a function consists of a coarse overall approximation, that is, the low frequency information, together with the detail coefficients that influence the function at various scales.

So, the meaning of this discussion is the signal is represented by this, that is I have the low
frequency information and after this we have the all the detail information of that signal that
is for the analysis. For synthesis, for the reconstruction, we have to use the low frequency
information and the detailed information we have to use.

(Refer Slide Time: 44:00)

In this case, you can see here the same thing I am showing here. This is I am showing the low
frequency information and the low frequency information is added with the detail
information. So, I am getting the reconstructed signal and after it is the next I am considering
the reconstructed signal and the detail information I am considering D2. So, I can do the
reconstruction.

After this, again I am considering the detail information D3 and this is the reconstruction I
am doing. So, already I have shown this one and that is the synthesis of the signal, based on
the low frequency signal and the detailed information the detailed information is nothing but
the high frequency information. So, I can give one example here. Suppose, how to consider
the image? Suppose, if I consider one input.

(Refer Slide Time: 44:47)

And I am considering an array of 8 data values: 1, 2, 3, 4, 5, 6, 7, 8. If I consider the level one approximation, then in this case I can determine the averages plus the detail information. So, the averages will be 3/2, 7/2, 11/2, 15/2, and the details will be -1/2, -1/2, -1/2, and -1/2. So, you can see how to get the average: between this sample and this, I am determining the average.

Considering 1 and 2, the average I am getting is 3/2; considering 3 and 4, the average is 7/2; considering 5 and 6, the average is 11/2; and considering 7 and 8, the average is 15/2. After this, I want to find the differences, that is, the details: (1 - 2)/2 gives minus half, (3 - 4)/2 gives minus half, and like this I am determining the detail information. So, this is level 1.

Similarly, at level 2 I can also determine the average component and the detail component: the representation is 5/2, 13/2, -1, -1, -1/2, -1/2, -1/2, -1/2. So, in level 2 also, I am determining the averages, the average between 3/2 and 7/2 and the average between 11/2 and 15/2, and also the differences, (3/2 - 7/2)/2 and (11/2 - 15/2)/2, so I am getting -1 and -1.
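
A tiny NumPy sketch of this averaging-and-differencing (Haar-style) decomposition, applied to the same array 1..8 and carried through all three levels (level 3 is described just below).

```python
import numpy as np

def average_detail(x):
    # One level of the scheme described above: pairwise averages and half-differences.
    x = np.asarray(x, dtype=float)
    avg = (x[0::2] + x[1::2]) / 2.0
    det = (x[0::2] - x[1::2]) / 2.0
    return avg, det

approx = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
for level in range(3):
    approx, detail = average_detail(approx)
    print(f"level {level + 1}: average = {approx}, detail = {detail}")
# level 1: average = [1.5 3.5 5.5 7.5], detail = [-0.5 -0.5 -0.5 -0.5]
# level 2: average = [2.5 6.5],         detail = [-1. -1.]
# level 3: average = [4.5],             detail = [-2.]
```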

And similarly, at level 3 I have the overall average 9/2 and the details -2, -1, -1, -1/2, -1/2, -1/2, -1/2. So, in this case you can see, I am determining the average value and the detail information.

In this class, I discussed the concept of the Fourier transformation, I highlighted the drawbacks of the Fourier transformation, and after this I discussed the concept of the STFT and also the concept of multi resolution analysis.

So, in my next class, I will continue the same discussion, the multi resolution analysis and
finally, I will discuss the discrete wavelet transformation. How to apply the DWT in an
image, that concept I will explain in my next class. So, that is all for today. Let me stop here
today. Thank you.

Computer Vision and Image Processing
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture: 14
Fundamentals and Applications
Welcome to NPTEL MOOCs course on Computer Vision and Image Processing: Fundamentals
and Applications.

In my last class, I discussed the basic concept of wavelet transformation. I highlighted one
important concept that is a signal should be analyzed both in time domain as well as in frequency
domain. Also, I discussed the concept of continuous wavelet transformation. So, for this, I
considered the mother wavelet and also, I considered two parameters, one is the translation
parameter and another one is the scaling parameter.

So, by considering these two parameters, I can get a number of wavelets for analysis of a
particular signal. After this, I discussed the concept of multi resolution analysis. A signal should
be analyzed in different resolutions, a high resolution and a low resolution. So that concept also I
discussed in my last class.

Today, I am going to discuss the concept of discrete wavelet transformation and how to
decompose a particular signal into different frequency bands. In my definition of the signal, I am
considering either 1D signal or 2D signal. The 2D signal means image. So that concept, I am
going to discuss, that is, how to decompose a particular signal by considering the discrete
wavelet transformation.

So I can decompose a particular image into number of frequency bands. That means, I can
consider a low frequency information, I can consider high frequency information. So that
concept I am going to discuss now.

(Refer Slide Time: 02:22)

So in my last class, you can see here, the main concept of the wavelet transform is, that is, I am
considering analysis windows of different lengths are considered, for different frequencies. So if
I consider high frequencies, we have to consider narrow windows. So that is, use narrow
windows for better time resolution. And if I consider the analysis of low frequencies, then in this
case we have to consider wider windows for better frequency resolution. And the function used
to window the signal is called the wavelet. So that concept, I have already explained, that is, the
analysis of the high frequency and also the analysis of low frequencies.

(Refer Slide Time: 03:10)

Here, in this example, you can see, I am considering one window. You can see the window here.
So this is the window. And in this case, I am considering one wide window. A wide window does not provide good localization at high frequency. So here you can see, so this portion corresponds
to the high frequency. So that is, you can see from this figure.

(Refer Slide Time: 03:35)

And that is why, if we consider the narrow window, that is better for high frequency. That is, use
narrow windows at high frequencies.

(Refer Slide Time: 03:47)

And after this, you can see, I am considering one narrow window. And if I consider this portion,
that portion is the low frequency portion of the signal. So if I consider narrow windows, they do not provide good localization at low frequencies. That is why, I have to consider a wide
window for analysis of the low frequency components. So that is why, I have to consider one
wide window for the analysis of the signal.

So here, in this example, you can see that narrow windows do not provide good localization at
low frequencies.

(Refer Slide Time: 04:24)

466
That is why, we are considering one wide window at low frequencies. So that is why the window
size is not fixed in case of the wavelet transformation. So I am considering variable size
windows. Narrow windows are good for high frequency analysis, and wide windows are good for low frequency signal analysis.

(Refer Slide Time: 04:52)

And also, I discussed the concept of the mother wavelet. Here you can see, I am considering one wavelet function. And in this case, you can see, I have two parameters. One is the scaling parameter and another one is the translation parameter. So ‘a’ is the scaling parameter and ‘b’ is the translation parameter. That concept also I have discussed in my last
class.

(Refer Slide Time: 05:16)

467
And after this, I discussed the concept of continuous wavelet transformation. So this is the definition of the continuous wavelet transformation, C(τ, s), for the signal f(t). I am considering a 1-dimensional signal. And I have two parameters, one is τ, another one is s. So τ is the translation parameter and s is the scaling parameter, which measures frequency. And the translation parameter is the measure of time.

And here you can see, this expression is for the continuous wavelet transformation of the signal f(t). And I am considering the kernel, that is, the mother wavelet Ψ((t − τ)/s). So, this is the definition of the continuous wavelet transformation.
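For completeness, the continuous wavelet transform discussed here can be written in its standard form as follows; the 1/√|s| normalization factor is the usual convention and is assumed, since it is not spelled out in the spoken description:

\[
C(\tau, s) = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} f(t)\, \Psi^{*}\!\left(\frac{t-\tau}{s}\right) dt
\]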

(Refer Slide Time: 06:11)

468
After this I discussed the concept of the multi resolution analysis. So that means, we have to
analyze the signal at different frequencies with different resolutions. So I have given the example
in case of the image. So suppose if I consider one low contrast region in an image, that I have to
analyze at the high resolution. And also, if I consider a small object in an image, that I have to
analyze at high resolution.

But if I consider a good contrast region, that I can analyze at low resolution. And if I consider a
big object in an image, that I can analyze in low resolution. That is the concept of multi
resolution analysis. That means analyze the signal at different frequencies with different
resolutions.

And also, it is important to analyze a signal, both in time domain and frequency domain. But the
problem is uncertainty principle, that already I have explained the concept of the uncertainty
principle.

The point is that, if I consider high frequency, that means the good time resolution and poor
frequency resolution at high frequency. On the other hand, good frequency resolution and poor
time resolution at low frequencies. So that concept already I have explained.

(Refer Slide Time: 07:41)

469
And in this case, you can see I am considering an Approximation Pyramid. So if I consider an N by N image, that is the highest resolution I am considering. So that is the image. If I go towards the top of the pyramid, the next level is N/2 by N/2, that is a lower resolution I am considering. And the lowest resolution is 1 by 1, that is 1 pixel, I am considering.

So I am considering different resolutions for multi resolution analysis. So that means, I want to
analyze a particular signal at different frequencies with different resolutions.

470
(Refer Slide Time: 08:22)

And also, I had explained this concept, that is, how to represent a particular signal by considering
low frequency information and the high frequency information. So here you can see, I am
considering one signal, the signal is this. And that is approximated by low frequency information
and the high frequency information. So L0 is the low frequency information. And for the high
frequency information, that is, the detailed information, I am considering D1, D2, D3.

And suppose if I combine L0 and D1, L0 is the low frequency information and D1 is the high
frequency information. So by using this, I want to reconstruct the signal, then in this case I can
reconstruct the signal like this, that is, by considering L0 and D1.

After this, I am considering the detail information D2. So if I consider the detail information D2,
so the reconstruction can be like this. So I can do the reconstruction like this. And after this, I am
considering the detail information D3, so by considering this I can reconstruct the signal.

That means, for reconstruction I am considering low frequency information, also the high
frequency information, that is the detail information, D1, D2, D3. That is the efficient
representation using the high frequency information, that is the detail information.

471
(Refer Slide Time: 09:53)

So, you can see, the signal is represented by this: L0 is the low frequency information, and after this, I have the detail information like D1, D2, D3. So, by using this, I can reconstruct a particular signal or I can represent a particular signal.

(Refer Slide Time: 10:14)

So, for wavelet representation of a function, a coarse overall estimation, that is the low frequency
information, and also, I have the detail coefficients that influence the function at various scales.
So that means, for wavelet representation of a particular function, so I have to consider coarse
overall approximation for the low frequency component. Also, I have to consider detail
coefficients that influence the function at various scale.

472
(Refer Slide Time: 10:44)

And you can see how to reconstruct the signal. So I am considering the low frequency
information. So in the signal you can see this is the low frequency information. And after this,
what I am considering. I am just adding the detail information with the low frequency
information, that is L0 + D1. And by using this, I want to reconstruct the signal. So this is the
reconstructed signal.

After this what I am considering, the detail information D2 is considered. So that means L1+ D2
I am considering. And like this, I am considering more and more detail information to
reconstruct the signal. So you can see, by considering the low frequency information and the
detail information D1, D2, D3, I can reconstruct a particular signal. That is called synthesis. In
case of the analysis, I can represent a particular signal by the low frequency information and the
high frequency information.

473
(Refer Slide Time: 11:45)

So in my last class also I had given one example. Same example I want to give here.

So suppose I am considering one input, then, suppose the input is an array of 8 data points I am
considering. So suppose 1, 2, 3, 4, 5, 6, 7, 8, so eight data points I am considering. And first, I
am considering Level 1 decomposition of the signal. That means, I want to decompose this
signal into low frequency component and the high frequency component.

So if you see these first two samples, from the first two samples I can determine the approximate
component. So the approximate component will be 1 + 2 divided by 2, so that is nothing but 3 /
2. After this, if I consider these two components, 3 and 4, so from this also I can determine the
approximate component. So 3 + 4 divided by 2, so it will be 7 / 2.

And after this I am considering next two components, 5 and 6. So 5 and 6, I can consider and I
am determining the approximate component. So, approximate component will be 11 / 2. And
after this I am considering 7 and 8, so from 7 and 8 I am considering the approximate
component. So approximate component will be 15 / 2.

So these are the approximate components. After this, I want to determine the detail components.

So from 1 and 2, that means from 1 and 2, if you see, what I can determine, the detail
component. So that is 1 - 2, that is the difference between these two divided by 2, so it will be –
1/ 2. Similarly, for 3 and 4, I can determine the detail information, that is – 1/ 2. So I will be
getting – 1/ 2, – 1/ 2, like this.

474
So in the Level 1 decomposition, you can see I am getting the low frequency information up to this point. And also, I am getting the detail information. Low frequency means it is the approximate information I am getting, and the rest is the detail information, the details. So this is Level 1 decomposition.

In Level 2 decomposition, what I am considering is this: my samples are now 3/2 and 7/2. So if I consider these two, I will be getting the approximate component, that is 5/2. And if I consider 11/2 and 15/2, I will be getting 13/2. So I will be getting 5/2 and 13/2.

And after this, I can determine the detail components. The details are -1 and -1. So from 3/2 and 7/2, I can determine the detail information, that is -1. And from 11/2 and 15/2, I can determine the detail information, the detail information is -1. And the previous values, -1/2, -1/2, -1/2 and -1/2, are retained. This is Level 2 decomposition.

And if I consider Level 3 decomposition, I will be getting 9/2, -2, -1, -1, -1/2, -1/2, -1/2 and -1/2. So in this case, the first value corresponds to the approximate component, and all the remaining values correspond to the detail information.

So that is the final array I am getting. And that is nothing but the Haar Transformation of the input data. So that means, this is the concept of the Haar Transformation of the input data. I considered an array of eight data points, and I have done Level 1 decomposition, Level 2 decomposition and Level 3 decomposition to get the final array. So here you can see, I have one approximate component and the remaining values are the detail components.

So this is one example.
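As a quick check of the numbers above, here is a minimal Python sketch of the level-by-level averaging and differencing described in this example; the function names are my own and purely illustrative:

def haar_step(data):
    # One level of Haar decomposition: pairwise averages, then pairwise differences
    averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
    details = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
    return averages, details

def haar_decompose(data):
    # Repeatedly decompose the approximation part, keeping all detail coefficients
    approx, all_details = list(data), []
    while len(approx) > 1:
        approx, details = haar_step(approx)
        all_details = details + all_details   # coarser details go in front
    return approx + all_details

print(haar_decompose([1, 2, 3, 4, 5, 6, 7, 8]))
# [4.5, -2.0, -1.0, -1.0, -0.5, -0.5, -0.5, -0.5], i.e. 9/2, -2, -1, -1, -1/2, -1/2, -1/2, -1/2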

475
(Refer Slide Time: 16:50)

I can give another example. Suppose I have one 1D image having 4 pixels. The pixels are, suppose, 9, 7, 3, 5. So I am considering a 1D image with a resolution of 4 pixels, and I apply the Haar Wavelet Transformation. Haar Transformation means, I have to determine the approximate component and the detail component.

So, corresponding to this 9, 7, 3, 5, I will be getting 6, 2, 1, -1. So, the 6 is the low frequency information, that is L0, and 2, 1, -1 are the detail coefficients D1, D2 and D3. And in this case, at each resolution I am determining the average, and also I am determining the detail coefficients.

So first, I am considering the 1D image with resolution of 4 pixels, that means, I am considering
9, 7, 3, 5. And I don’t have any detail coefficients in this case. And after this, I am considering the resolution of 2. So corresponding to this, I am determining the
average. Average will be 8 and 4. And what about the detail? The detail will be 1 and minus 1.

How to determine 8 and 4? 9 plus 7 divided by 2, so I will be getting 8. And how to determine 4?
So 3 plus 5, divided by 2, I will be getting 4. And how to determine the detail information, that
is, detail coefficients, that is, 9 minus 7, divided by 2 so it will be 1. And another one is 3 minus
5 divided by 2, that will be equal to minus 1. So like this, I am calculating the average
coefficients and the detail coefficients, that is the approximate coefficient and the detail
coefficient.

476
After this, if I consider resolution 1, then in this case also I can determine the average value. The
average is 6. And what about the detail coefficient? The detail coefficient will be 2.

So this is the decomposition of a particular signal. The signal is 9, 7, 3, 5, and I am decomposing that signal. And ultimately, what will I be getting? I will be getting 6, 2, 1, -1 after the decomposition. That is the Haar Transformation I am applying. So you can see, I am getting 6, 2, 1, -1. The first one is 6, after this, 2, and then 1, -1.

And also, we can reconstruct the original image to a resolution by adding or subtracting the
detail coefficients from the lower resolution versions. So that means, how to reconstruct the
image, because my coefficients are 6, 2, 1, - 1. These are my coefficients.

So, how to reconstruct? So I have to consider the detail information, the detail information is 2.
So, if I consider the detail information 2, so I can reconstruct this one, 8 and 4. After this, if I
consider detail information, suppose 1, another information is 1 and - 1. So I can reconstruct the
signal, the signal is 9, 7, 3, 5. So the signal is reconstructed. 9, 7, 3, 5. So this is the
reconstruction procedure.

Reconstruction means synthesis. And if you see the decomposition, the decomposition means
analysis. So this is analysis. And this is, reconstruction means synthesis, synthesis of the signal.

So I can reconstruct the signal by considering these detail coefficients. Because I have the
approximate information, that is, the average information is available. And by considering the
detail coefficients, I can reconstruct the signal. That is the concept of the signal decomposition
and the reconstruction.
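A minimal Python sketch of this synthesis step (the inverse of the averaging and differencing above; again, the names are only illustrative):

def haar_reconstruct(coeffs):
    # Rebuild the signal from [approximation, detail coefficients ...] by average +/- detail
    signal = coeffs[:1]                      # start from the single approximation value
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(signal)]
        # each average a with detail d expands to (a + d, a - d)
        signal = [v for a, d in zip(signal, details) for v in (a + d, a - d)]
        pos += len(details)
    return signal

print(haar_reconstruct([6, 2, 1, -1]))   # [9, 7, 3, 5]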

477
(Refer Slide Time: 21:54)

And now, I will discuss the concept of the Multiresolution Conditions, the Multiresolution
Analysis, how to analyze a particular signal in different resolutions.

So suppose if a set of functions can be represented by a weighted sum of this basis function, the
function is, psi 2 to the power j, t minus k. That means, I am considering a set of functions which
can be represented by a weighted sum of psi 2 to the power j t minus k. That, I am considering.

Then, a larger set including the original can be represented by a weighted sum of psi 2 to the
power j plus 1 into t minus k. So that means I am considering the higher resolution in this case.
So in this figure, you can see, I am considering j, it controls the resolution. So you can see the
low resolution signal and the high resolution signal.

So this concept is there. Suppose a set of functions which can be represented by a weighted sum
of the basis function. The basis function is, psi 2 to the power j into t minus k. Then a larger set,
including the original can be represented by a weighted sum of psi 2 to the power j plus 1 into t
minus k.

So you can see in this figure, I am considering a signal at low resolution and the high resolution.
And the resolution is controlled by the parameter, the parameters is j.

478
(Refer Slide Time: 23:31)

Again, I am repeating this. So if a set of functions can be represented by a weighted sum of psi 2
to the power j, t minus k, then a larger set including the original can be represented by a
weighted sum of psi 2 to the power j plus 1 into t minus k.

So what is this concept? You can see, suppose if I consider one set, that is the V j, so that is the
span of this function, the basis function is psi 2 to the power j into t minus k. And if I consider
another larger set, the another large set is V j plus 1. So larger set I am considering, that is V j
plus 1. That should be the span of this basis function, the basis function is psi 2 to the power j
plus 1 into t minus k.

So first, I am considering a small set, the small set is V j, and corresponding to this I have to
consider the basis function, the basis function is psi 2 to the power j into t minus k. But if I
consider one larger set, that is V j plus 1, then in this case, I have to consider the basis function.
The basis function is psi 2 to the power j plus 1 into t minus k. That I have to consider.

And in this case you can see that the set V j is a subset of V j plus 1, because the V j plus one is a
large set as compared to V j. And you can see, if I consider a particular function, f t, suppose,
that is represented by using the basis function, the basis function is psi j k. But if I consider, one
larger set, that is, if I consider a function f j plus 1, previously it is f j but now I am considering a
large set, that should be represented by or that can be represented by the basis function, the psi j
plus 1 into k.

479
That means I am considering the linear combination of the basis function, that is the weighted
sum of basis function, I am considering. So you can see, this function is represented by the
weighted sum of psi j plus 1 k. But if I consider f j, that is the small set, I am considering, that
can be represented by the weighted sum of the basis function, the basis function is psi j k.

That means, the meaning is this, that is if I consider a small set, suppose V j, that can be
represented by a weighted sum of psi 2 to the power j t minus k. But if consider a large set, the
large set is V j plus 1, for this I have to consider the basis function, the basis function is psi 2 to
the power j plus 1 into t minus k. So that I have to consider, because V j is a subset of V j plus 1.

(Refer Slide Time: 26:41)

So here, you can see I am considering a set, the set is V naught, that is the small set, I am
considering. So for this I am considering the basis function, the basis function is psi t minus k.

Next, I am considering one large set as compared to V naught. So V1 is larger than V naught. So
for V1, I have to consider the basis function, the basis function is psi 2 t minus k. Like this, if I
consider another set, that is, a large set V j, that should be represented by the basis function, the
basis function is psi 2 to the power j t minus k.

So that means, if f t is an element of the set, the set is the V j, then f t can be represented by this.
f t is equal to summation over k, summation over j, a j k, that is the coefficients, psi j k, t. That I
am considering.

480
So V j means the space spanned by the basis function, the basis function is psi 2 to the power j t
minus k. So that is the sub-space, the V j, that is the space spanned by the basis function psi 2 to
the power j t minus k, I am considering.

And in this figure, you can see I am considering the sub-space V naught, V1, V2, V3, like this.
So the first small subset is V naught, this is the V naught, and you can see the next one is, you
can see the subset, that is the V1 I am considering, the set is V1. So from this, you can see that V
j is a subset of V j plus 1, I am considering.

So that means, if f t is an element of V j then in this case f t will be also element of V j plus 1. V j
plus 1 means I am considering one large or maybe one big set as compared to V j. So you can see
the small set is V naught, the next big set is V1, and compared to V1, the next big set is V2,
compared to the set V2 the next big set is V3.

So that means, you can see, I have the nested spanned spaces. So V j is a subset of V j plus 1, I
am considering. This is called the nested spaces, that is spaces are V j. So I am considering V
naught, V1, V2, like this I am considering. And corresponding to V naught, I need the basis
function, the basis function is psi t minus k. Similarly for V1, the basis function is psi 2t minus k.
So like this, I am considering.

(Refer Slide Time: 29:34)

481
So that means the function is represented like this. f t is equal to the summation over k,
summation over j and a j k is the coefficient. And I am considering the basis function, the basis
function is psi j k.

Now, in this case you can see, I am considering the space, the space is V naught. Now, how to
get V1? That is the big set as compared to V naught. So for this, what you can see, if I can find
the difference between V1 and V naught and after this if I consider suppose V naught plus that
difference, then I will be getting V1. So that means the idea is, define a set of basis functions that
span the difference between V j.

So I am repeating this, suppose I have that information, that is the V naught is available, so how
to get V1? So if find the difference between V1 and V naught, after this, if I consider V naught
plus that difference, then in this case I will be getting V1. So with the help of this difference I
can represent V1.

(Refer Slide Time: 30:47)

So here you can see, I am considering the set, the set is V naught, V1, V2, V3, like this. And I
am considering W j be the orthogonal complement of V j in V j plus 1. So what do you mean by
orthogonal complement? The orthogonal complement is a subspace of vectors where all of the
vectors in it are orthogonal to all of the vectors in a particular subspace. So that is the meaning of
the orthogonal complement.

482
So I can repeat this one. So that means the orthogonal complement is a subspace of vectors
where all of the vectors in it are orthogonal to all of the vectors in a particular subspace. So that
is the meaning of the orthogonal complement.

So I can give one example. Suppose for example, if I consider a plane, suppose if I consider a
plane in R to the power 3 space, the 3 dimensional space, the 3D space, then the orthogonal
complement of that plane is the line that is normal to the plane and passes through the point, the
point is 0, 0, 0.

So that means if I consider a plane in 3 dimensional space, that is R to the power 3 space, then
the orthogonal complement of that plane is the line that is normal to the plane and passes through
the point, the point is 0, 0, 0. So that is the concept of the orthogonal complement.

So, in this case, how to determine V j plus 1? You can see, V j plus 1 is the bigger set as
compared to V j. So if I consider V j plus W j, W j is nothing but the difference I am considering.
Then in this case, if I consider adding of this, V j plus W j, then I will be getting V j plus 1. So
here, in this figure you can see, so if I consider V naught, here it is, the V naught, if I consider V
naught as a set. And if I consider the difference, the difference is W naught. If I add V naught
plus W naught, I will be getting V1. So I will be getting V1.

Similarly, if I consider V1 plus W1, then I will be getting V2. So this is V1, so if I consider, this
is V1, so this is V1. And if I consider V1 plus W1, then I will be getting V2. So this space is V2.
So with the help of the difference, I can consider this one, that is, the V j plus 1 is equal to V j
plus W j.

483
(Refer Slide Time: 33:36)

So, how to compute the difference? So, if f t is an element of V j plus 1, then f t can be represented using the basis function phi t from V j plus 1. So that concept I am showing here. That is, f t is represented like this: c k is the coefficient and phi 2 to the power j plus 1 into t minus k is the basis function. So the function is represented like this, using the basis function phi t from V j plus 1.

And alternatively, f t can be represented using two basis functions. One is phi t from the space,
the space is V j. And also, I am considering psi t from the space W j. What is W j? W j is the
difference between V j and V j plus 1. So, here you can see, the V j plus 1 is represented by V j
plus W j.

V j plus 1 is the larger set as compared to the set V j. So that means, the function f t can be
represented like this. I am considering two basis functions. One is phi t, I am considering, this
basis function, phi 2 to the power j into t minus k. And another one is psi, 2 to the power j t
minus k I am considering. And you can see the coefficients, the c k and the d j k are the
coefficients I am considering. Because, I am considering the linear combinations.

So f t, can be represented by using these two basis functions. So, for this what I am considering?
This component I can consider as approximate component and this difference I can consider as
the detail components, the details I can consider.

484
(Refer Slide Time: 35:29)

So, that means, you can see, this function f t, instead of considering only one basis function, I am considering two basis functions, one is phi and another one is psi. So, I am getting V j
and another one is W j. W j, the difference between V j plus 1 and V j. That difference I am
calculating.

So, the signal, the signal or the f t, the function is represented by using these two basis functions.
So, what is this component, the second component? The difference between V j and V j plus 1.
That is the difference.

(Refer Slide Time: 36:07)

485
So, here you can see, how to represent V j plus 1. V j plus 1 is nothing but V j plus W j. And
recursively I can determine V j plus 1. So V j plus 1 is nothing but V naught plus W naught plus
W1 plus W2 plus W3, up to W j I am considering. That means, V naught corresponds to the
approximate information and W naught W1, W2, all these are basically the difference, that is
nothing but the detail information.

So, here you can see, I am considering the V naught, and W naught is what? It is the difference between V1 and V naught. W1 is the difference between V2 and V1. So, by considering
these differences, I can represent a particular function.

So, you can see, the function is represented by this, f t is equal to summation over k, c k, phi t
minus k, that is for V naught. So this basis function I am considering for V naught. And another
one is summation over k, summation over j, that is the second component I am considering.

d j k, that is the coefficients, and psi 2 to the power j, t minus k. Then I am considering the detail
component, and this is for W naught, W1, W2, and I am considering the basis function psi. So,
that means I am considering two basis functions, one is phi, another one is psi, to represent the
function f t.

Since I am considering the vector space, I am considering this summation as direct sum. So, this
should be direct sum. Instead of the simple summation, this should be the direct sum because I
am considering the vector space.
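Putting the last few slides together, the nested-space relation and the resulting wavelet series expansion can be written as follows (using ⊕ for the direct sum mentioned above):

\[
V_{j+1} = V_j \oplus W_j = V_0 \oplus W_0 \oplus W_1 \oplus \cdots \oplus W_j
\]

\[
f(t) = \sum_{k} c_k\, \phi(t-k) + \sum_{j \ge 0} \sum_{k} d_{j,k}\, \psi(2^{j} t - k)
\]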

(Refer Slide Time: 38:01)

486
So, here you can see, for wavelet decompositions we consider two basis functions, one is phi t,
another one is psi t. And you can see, the two shapes are translated and scaled to produce a number of wavelets at different locations and on different scales. I can do the translation; I can do the scaling, so that I will be getting a number of wavelets.

So, you can see, I am doing the translation, also I am doing the scaling. So, here you can see, this
phi t is used to encode low resolution information and this psi t, it is used to encode detailed or
high-resolution information.

(Refer Slide Time: 38:43)

So, I am writing it again, the function f t is represented by these two functions, one is the scaling
function I am considering, and another one is the wavelet function. So, I have two basis
functions, one is phi t minus k and another one is psi 2 to the power j, t minus k. So, a function is
represented as a linear combination of these two basis functions.

In Fourier analysis, there are only two possible values of k. Either it may be 0 or pi by 2. The
values of j correspond to different scales. The scale means frequencies. So, you can see the
difference between the wavelet expansion and the Fourier analysis.

487
(Refer Slide Time: 39:28)

And this is the Haar wavelet I am showing. So, by considering the Haar wavelet, I can determine
the approximate component and the detail component. So, by using this wavelet, the Haar
wavelet, I can determine the approximate value, approximate means the average value I can
determine. And by using these wavelet function, I can determine the detail information. That is,
the difference I can determine.

So, the first one is for computing the average, the second one is for computing the details.
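For reference, the Haar scaling function (used for the averages) and the Haar wavelet (used for the details) have the standard textbook definitions below; I am writing them out here because the slide only shows their shapes:

\[
\phi(t) = \begin{cases} 1, & 0 \le t < 1 \\ 0, & \text{otherwise} \end{cases}
\qquad
\psi(t) = \begin{cases} 1, & 0 \le t < \tfrac{1}{2} \\ -1, & \tfrac{1}{2} \le t < 1 \\ 0, & \text{otherwise} \end{cases}
\]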

(Refer Slide Time: 39:58)

488
And in this case I am showing one example. I am considering 1D Haar wavelet decomposition,
and I want to determine the average and the detail information, as you can see. So first, I am calculating the approximate value, that is, the average value. The green one is the average value and the red one is the detail value. So, this is the same procedure as in my numerical example.

After this what you can do, you can rearrange these values like this. So, the greens you can
rearrange and the red you can rearrange like this. After this, again, I can determine the
approximate value and the detail value, the average value and the detail value. After this, again, I
can rearrange. After this, again, I can determine the approximate value and the detail value.

And finally, I will be getting the signal after decomposition. So, that means, I have the one
approximate component, you can see this is the approximate component. And if you see this red
one, all these red ones, that is nothing but the detail information.

So, like this I can do the decomposition by using the Haar wavelet. So, Haar wavelet can be used
for determining the approximate value and the detail coefficients.

(Refer Slide Time: 41:12)

So, here you can see, I am considering 1D Haar wavelet decomposition. So you can see, I will be
getting the final array like this. So, this is what I am getting after the decomposition. These are the approximate values, and the remaining red ones are the details I am getting.

489
(Refer Slide Time: 41:36)

So, this implementation can be done by using two filters, one is the low pass filter, another one is
the high pass filter. That means, the decomposition can be done by considering two filters, one is
the low pass filter. The low pass filter is h naught n, and high pass filter is h1 n, I am
considering.

So, for analysis I am considering this one. So, signal is f n. So, I am considering the low pass
filter to get the approximate component and high pass filter to get the detail information. And
after this, I can do the down sampling. The down sampling by 2. So, this is for the analysis of the
signal. So this is called Analysis Filter Bank. And for the reconstruction, for the synthesis of the
signal, what you can do, just opposite.

We have to do the up sampling of the signal, we are doing the up sampling. And after this, again
I am considering the synthesis filter bank, that is nothing but, again, the low pass filter and the
high pass filter. So g naught n is the low pass filter and g1 n is the high pass filter. So by
considering this, I can reconstruct the original signal.

So you can see, this is the approximate reconstructed signal, by considering low pass filter and
the high pass filter. And this filter is called the Quadrature Filter, the Quadrature Mirror Filter.
The concept of the Quadrature Mirror Filter, I can explain in my next slide.

490
(Refer Slide Time: 43:06)

What is a Quadrature Mirror Filter? Here, you can see I am showing the frequency response of
the low pass filter and the high pass filter. So, h naught n is the low pass filter and h1 n is the
high pass filter. So, how to get the high pass filter? That is nothing but h1 n is the mirror of h
naught n. That is, the high pass filter is the mirror of the low pass filter.

So that is why, if I consider the symmetry point, the symmetry point is pi by 2. So this is pi by 2.
So suppose I have only the low pass filter; the high pass filter can easily be obtained from it. Suppose h naught n, the low pass filter, is available; already we have designed the low pass filter. So, how to get the high pass filter?

The high pass filter h1 n in the time domain is nothing but minus 1 to the power n into h
naught n. And corresponding to this, you can see the transfer function. So h naught minus z, I am
considering. So, what is h naught z? That is the transfer function of the low pass filter. So, it is
the transfer function of the low pass filter, that is in the z domain. z transform I am considering.
And h1 z, that is the transfer function for the high pass filter.

So, this is the concept of the Quadrature Mirror Filter. If I know, the low pass filter, from the low
pass filter I can determine the high pass filter, because the center of symmetry is pi by 2.

491
(Refer Slide Time: 44:45)

And similarly, for the synthesis side also we need the low pass filter and the high pass filter. So
here you can see g naught n, g1 n. g naught n is the low pass filter, that is the synthesis filter.
And g1 n is the high pass filter. So from this, you can see, from h naught n, that is, in the time
domain, that is, the low pass filter, and h1 n is the high pass filter in the time domain.

You can see, you can determine G naught z, that is the synthesis filter, the low pass filter on the synthesis side, from H naught z, because H naught z is the low pass filter on the analysis side. For the analysis of the signal I am considering the low pass filter, and that filter is H naught z.

So, from H naught z you can determine G naught z. G naught z is nothing but the low pass filter for synthesis. And also, you can determine G1 z. G1 z is nothing but the high pass filter for the synthesis of the signal. So from H naught z, you can determine G1 z. And already you know what is H1 z. H1 z is nothing but H naught minus z. You know this. And also you know that in the time domain h1 n is equal to minus 1 to the power n into h naught n. You know this.

So that means the concept is, if you only know this one, h naught n, that is the low pass filter of
the analysis, the low pass filter for analysis then all other filters, all other filters means h1 n, g
naught n, g1 n, you can determine from h naught n. So you can determine h1 n from h naught.
You can determine g naught n from h naught, you can determine g1 from h naught. So all the
filters you can determine from h naught. So that is the concept.

492
(Refer Slide Time: 47:12)

And we are considering a class of perfect reconstruction filters needed for the filter bank
implementation. So I have to consider filter banks because I have to decompose the signal for
discrete wavelet transformation. For discrete wavelet transformation, we need the filter banks,
the low pass filter and the high pass filters.

So in this case, these filters satisfy this condition: h1 n is equal to minus 1 to the power n into h naught of N minus 1 minus n, where N is the tap length, which is required to be even. So it should be even.

And corresponding to this, we can determine the synthesis filters. So that means, for discrete
wavelet transformation I have to determine the approximate components, and also the detail
components. So for this, I need low pass filters and the high pass filters.

And for analysis we need the low pass filter banks and also the high pass filter banks. And
similarly, for the synthesis of the signal, we need filter banks. That is nothing but low pass filters
and the high pass filters.
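As an illustration of these relations, here is a small Python sketch that derives the other three filters from a given analysis low pass filter. The Haar filter taps are used as an assumed example, and the synthesis filters are formed by time reversal, as described on the next slide:

import numpy as np

# Assumed example: the (orthonormal) Haar analysis low pass filter h0[n]
h0 = np.array([1.0, 1.0]) / np.sqrt(2)
N = len(h0)                      # tap length (even)

# Analysis high pass: h1[n] = (-1)^n * h0[N-1-n]
h1 = np.array([(-1) ** n * h0[N - 1 - n] for n in range(N)])

# Synthesis filters obtained by time reversal of the analysis filters
g0 = h0[::-1]                    # low pass filter for synthesis
g1 = h1[::-1]                    # high pass filter for synthesis

print(h1, g0, g1)                # [ 0.707 -0.707] [0.707 0.707] [-0.707  0.707]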

493
(Refer Slide Time: 48:26)

So here, you can see, I am showing the analysis of the signal and also the reconstruction of the
signal. So here you can see I am considering the low pass filter h naught n, and high pass filter is
h1 n. And I am considering the Quadrature Mirror Filter. So from h naught n you can determine
h1 n. So the signal is decomposed by considering this low pass filter and the high pass filter.

Similarly, for the synthesis I am considering h naught minus n, that is the synthesis filter, that is
the low pass filter. And h1 minus n, that is the high pass filter for the synthesis, I am considering,
which can be obtained from h naught n. So the main filter is h naught n; from h naught n, you can determine h1 n, h naught minus n and also h1 minus n.

494
(Refer Slide Time: 49:22)

So this multiple level of decomposition we can do by using these filter banks. In this figure, you
can see the multiple levels of decompositions I can do. So first, I am considering the signal, the
signal is f k, that is of highest resolution. After this, I am considering the low pass filter and the
high pass filter. I am determining the approximate value and the detail value.

After this, I am down sampling. So I am having the detail information, that is Detail 1. And from
the approximate information, you can see we have the approximate information, from this
approximate information I can again do the decomposition. So for this, I am again considering
the low pass filter and the high pass filter.

So I will be getting the approximate component and the detail component. So detail information I
am getting, the Detail 2. And after this, I can do the down sampling to get the lowest resolution.
So like this, I can implement this decomposition. And this is nothing but the multilevel
decomposition.

495
(Refer Slide Time: 50:23)

And for the synthesis, for the reconstruction, the opposite is this. Because I have the approximate
information, I have the detail information, detail information 1, detail information 2. And you
can see, by considering again the filter banks, the low pass filter and the high pass filter, you can
reconstruct the signal.

And one thing is important, here I am doing the up sampling. In case of the analysis, we did the
down sampling but in this case I am considering the up sampling of the signal. And you can see,
I can reconstruct the signal by considering approximate information and the detail information.
This is about the reconstruction.

496
(Refer Slide Time: 51:03)

In this figure also I am considering the decomposition of a particular signal, that is the discrete
wavelet transformation I am considering. For decomposition of the signal, I am considering the
low pass filter and the high pass filter. So I will be getting the approximate component and the
detail component. So this is the decomposition of a particular signal.

(Refer Slide Time: 51:23)

In this case also I am showing one signal, the 1D signal. And I am considering one low pass
filter and one high pass filter to get the approximate and the detail components. So, any 1D
signal can be decomposed like this. Now, I am discussing the 1D signal that can be extended for
2D signals. Like, in the image we can do the decomposition.

497
(Refer Slide Time: 51:47)

And here you can see the Multiple-Level DWT, the multiple level discrete wavelet
transformation. How to implement this one? So the signal is this, I am considering the low pass
filter and the high pass filter. So I will be getting the approximate component and the detail
component I will be getting.

And you can see in the second figure, what I am doing? I am doing the multiple level
decompositions. So signal is S. I will be getting the approximate component c A1, detail
component c D1. From this approximate component, again I am getting the approximate
component c A2 and the detail component c D2.

And from the approximate component c A2, I am getting the approximate component c A3 and
the detail component, c D3. So like this, we can do multiple-level DWT, the discrete wavelet
transformation.

498
(Refer Slide Time: 52:38)

So, in this case I am considering one 1D signal, and you can see how to decompose this
particular signal. So corresponding to this signal I am getting the approximate component. The
approximate component is c A1 and detail component is c D1. And from the approximate
component c A1, I am doing the decomposition again, so I will be getting the approximate
component c A2 and detail component I will be getting c D2.

Again, from c A2, again I can do the decomposition. I can get the approximate component, the
approximate component is c A3 and the detail component is c D3. And from c A3, that is the
approximate component, that is this approximate component, I can again do the decomposition,
so I will be getting the approximate component and the detail component.

So like this I can do the multiple level decompositions. So finally, after decomposition what I
will be getting? I will be getting one approximate component, that means one approximate signal
I will be getting. And if I consider c D5, c D4, c D3, c D2 and c D1, all these are detail
information of the signal.
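If you want to try this multiple-level 1D decomposition yourself, the PyWavelets library provides it directly. A small sketch is given below; the signal is just an assumed toy example, and wavedec/waverec implement exactly this analysis/synthesis filter bank scheme:

import numpy as np
import pywt

# Toy 1D signal (assumed example)
t = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(256)

# 3-level Haar DWT: returns [cA3, cD3, cD2, cD1]
cA3, cD3, cD2, cD1 = pywt.wavedec(signal, 'haar', level=3)
print(len(cA3), len(cD3), len(cD2), len(cD1))            # 32 32 64 128

# Synthesis: reconstruct the signal from the coefficients
reconstructed = pywt.waverec([cA3, cD3, cD2, cD1], 'haar')
print(np.allclose(signal, reconstructed))                 # True (perfect reconstruction)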

499
(Refer Slide Time: 53:53)

And this is another example you can see. You can see the original signal. And you can see I am
getting the approximate signal and approximate level 1 signal, also the detail level 1 signal you
can get. This is detail level 1 signal, approximate level 1 signal, detail level 3, approximate level
3, approximate level 5, like this we can do multiple decompositions.

(Refer Slide Time: 54:17)

Similarly, again, I am considering another signal here. And corresponding to this, you can get the
approximate in level 3 decomposition, detail in level 3 decomposition, approximate level 3
decomposition and detail level 1 decomposition. So like this you can do the decomposition of a
particular signal.

500
(Refer Slide Time: 54:38)

So this can be implemented for 2D signals. The two dimensional signal like image. Because the
main property is the separable property of the kernel. So if I consider this kernel, which is separable, then based on this property I can implement it for the 2D signal, that is, for the image.

For the 2D signal, for the 2D image, what I have to consider. First, I have to apply the
transformation along the rows and after this, I have to apply the transformation along the
columns. So like this, I have to do.

In case of the DCT, the discrete cosine transformation, I did like this. So first, I applied the
transformation along the rows and after this I applied the transformation along the columns. So
similarly, I can apply the Haar transformation along the rows and after this, I can apply the Haar
transformation along the columns.

501
(Refer Slide Time: 55:32)

So, there are two techniques for decomposition of a particular signal, that is the image. If I
consider the 2D signal, one is the Standard Decomposition, another one is the Non Standard
decomposition. That is for 2D Haar wavelet transformation. So I will be explaining these two
techniques, one is the Standard Decomposition technique, another one is the Non Standard
Decomposition technique.

(Refer Slide Time: 55:57)

So in the case of the Standard Decomposition technique, first compute the 1D Haar wavelet decomposition of each row of the original pixel values. After this, compute the 1D Haar wavelet decomposition of each column of the row-transformed pixels.
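A minimal NumPy sketch of this standard decomposition, reusing the pairwise averaging and differencing from the earlier 1D example (function names are my own, and the 4 x 4 input is just an assumed toy image):

import numpy as np

def haar_1d(x):
    # Full 1D Haar decomposition of a length-2^k vector: [approximation, details ...]
    x = x.astype(float)
    out = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / 2
        det = (x[0::2] - x[1::2]) / 2
        out = [det] + out            # coarser details are placed in front
        x = avg
    return np.concatenate([x] + out)

def standard_haar_2d(img):
    # Standard decomposition: transform every row completely, then every column
    rows_done = np.apply_along_axis(haar_1d, 1, img)
    return np.apply_along_axis(haar_1d, 0, rows_done)

img = np.arange(16, dtype=float).reshape(4, 4)
print(standard_haar_2d(img))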

502
(Refer Slide Time: 56:13)

So you can see, I am showing here. So suppose this is the image I am considering. First I have to
apply the Haar decomposition row wise. Row wise Haar decomposition. So I will be getting the
approximate component, that is the green and detail means the red. So you can see, along the
rows I am getting the approximate value and the detail value. And after this, I am rearranging
terms.

(Refer Slide Time: 56:42)

And from the previous slide I am getting this one. So like this we have to do the decompositions.
And finally after rearranging I will be getting this, that is the row-transformed result I will be
getting. That is, row wise Haar decomposition I am considering.

503
(Refer Slide Time: 56:58)

After this, because we have the row-transformed result, this result is available. After this I have
to apply the decomposition along the columns. So if I apply the decomposition along the
columns, then in this case you will be getting only this as the low frequency component and if I
consider others, the red, these are the detail coefficients.

(Refer Slide Time: 57:21)

So in this example you can see here: first I am applying the row wise decomposition. After the row wise decomposition, what will you get? The row-transformed result I will be getting. So after this, I am rearranging this one. After rearranging, I will be getting this one. That means, this is the output corresponding to the row-transformed decomposition.

504

After this what I am getting? I am applying the transformation, that means the decomposition
along the columns, and you can see only I have one low frequency information. So that is
available here, this is the low frequency information. The rest is the high frequency information.
High frequency information means the edges and the boundaries. These are the high frequency
information. But if I consider the constant intensity portion, or also the homogenous portion of
the image, that corresponds to the low frequency information.

The edges and the boundaries are the high frequency information, because in case of the edges
and the boundaries there is an abrupt change of the grey scale intensity value. So that is why the
edges and the boundaries are the high frequency information, the high frequency pixels. And if I
consider the homogenous portion of the image, or maybe the constant intensity portion, that
corresponds to low frequency information.

So here, you can see, after this decomposition I am getting this one. So I am getting one low frequency band. If I consider the outer side, this is nothing but the high frequency information I will be getting.

(Refer Slide Time: 59.4)

The next one is the Non-standard Haar wavelet decomposition. So for this, perform one level
decomposition in each row, that is, one step horizontal pairwise averaging and differencing.

505
Number 2, perform one level decomposition in each column from Step number 1. After this,
repeat the process on the quadrant containing averages only in both the directions.

(Refer Slide Time: 59:35)

So that concept, I can show in the figure here. So, I am considering the input image. So first I
have to do one level horizontal decomposition. So in the horizontal direction, that is along the
rows, I am doing the decomposition, I am getting the average and the detail information.

After this, one level of vertical Haar decomposition I am doing. You can see, in the first case I am doing the decomposition horizontally, and in the second case I am doing it vertically. Like this. So, you can see, I am doing the averaging and differencing of the detail coefficients as well.

506
(Refer Slide Time: 60:15)

After this, we have to rearrange the terms. After this, I have to consider only the low frequency
information, that is the quadrant, and this quadrant I have to consider. That is, I have to consider
the green quadrant because it has the low frequency information. After this, again, I have to do
the decomposition of this one. The green quadrant, I have to do the decomposition. That is one
level horizontal and one level vertical, I have to do.

And finally, I will be getting this one. So here you can see, I have this low frequency information
and red means the detail information, the detail coefficients.

(Refer Slide Time: 60:55)

507
So in this case, you can see, first I am doing the decomposition along the rows, after this, the
columns. After this, rearranging the terms and corresponding to this, you can see, I have this low
frequency information, that is the average value and you can see, if you consider outside, that is
nothing but the detail information. That is the high frequency information.

(Refer Slide Time: 61:26)

After this, this green quadrant, considering the green quadrant I am doing the decomposition. So
if I do the decomposition, I will be getting this one, and you can see only I have this low
frequency information. So this low frequency information is available here. And if I consider the
rest of the portion, that corresponds to the high frequency information.

So this is the concept of the Non-standard decomposition. So I have two decomposition techniques, one is the standard decomposition technique, and another one is the non-standard decomposition technique.

508
(Refer Slide Time: 61:53)

So this concept, I am showing here. So f n is the input image. The two dimensional signal, that is
the image I am considering. And I am considering the low pass filter and the high pass filter to
get the average value and the detail value. And after this, I am doing the down sampling by 2.

And after this what I am considering, again, I can do the decomposition of the signal, you can
see. Again, I am applying the low pass filter and the high pass filter. And after this, I am doing
the down sapling by 2. So first component, I am getting LL component because I am applying
the low pass filter and the low pass filter here. So I will be getting the LL component. That is, the
low frequency low frequency information, I will be getting. LL means the low frequency low
frequency information.

And corresponding to the second case, if you see, I am applying the low pass filter and the high
pass filter. That means I am getting LH coefficients, that is the low frequency and the high
frequency coefficients, I will be getting.

Similarly, in this case I am applying the HP, that is the high pass filter; and after this, the low
pass filter. That means I will be getting HL coefficients, the high frequency and low frequency
coefficients.

And after this, I am applying the high pass filter and the high pass filter, that means I will be
getting the high high coefficients. The high frequency high frequency coefficients, I will be
getting. So this is represented like this.

509
So LL information, that is the low frequency information is available here. LH information is
available here, HL information is available here and very high frequency information, that is the
high high frequency information is available here.

So that means, after this decomposition I will be getting this transform image.
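This one-level 2D subband decomposition is also available directly in PyWavelets. A short sketch is shown below, using a random array as an assumed stand-in for a grayscale image; cH, cV and cD are the library's horizontal, vertical and diagonal detail bands, which correspond to the LH/HL/HH bands discussed here (the exact LH versus HL labeling depends on convention):

import numpy as np
import pywt

img = np.random.rand(256, 256)                    # assumed stand-in for a grayscale image

# One-level 2D Haar DWT: cA is the LL (approximation) band
cA, (cH, cV, cD) = pywt.dwt2(img, 'haar')
print(cA.shape, cH.shape, cV.shape, cD.shape)     # each band is 128 x 128

# Perfect reconstruction from the four subbands
rec = pywt.idwt2((cA, (cH, cV, cD)), 'haar')
print(np.allclose(img, rec))                      # True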

(Refer Slide Time: 63:41)

Now, I am considering this example. So suppose one image is decomposed and I am getting
these components. So this component is LL component, this is LH component, this is HL
component and this is the HH component, I will be getting.

So you can see, I have this component, LH component, HL component, HH component. So this
component is the HH component. This component is HL component. This component is LH
component. So after the decomposition, I will be getting this one. That is, the level 1
decomposition.

510
(Refer Slide Time: 64:19)

So you can see, I am considering two level decompositions. You can see the original image and
after this, I am doing the decomposition at the level 1. So this LL, this is LH, this is HL and this
is HH. And after this again, I am doing the decomposition of this image. The second level
decomposition. So I can do the decomposition like this.

(Refer Slide Time: 64:45)

And this is one example of the Haar transformation. So I am applying the Haar transformation
for the decomposition of the input image. So like this, we can do the multiple level
decompositions. This is briefly about the discrete wavelet transformation.

511
(Refer Slide Time: 65:01)

And already, I discussed different transformations. So what are the advantages and disadvantages
of these transformations? So first one is the KLT, that is the KL transformation. So the advantage
is, theoretically optimal. What are the disadvantages? It is data dependent and not fast, because
the KLT depends on the statistics of the input data. So that is why, it cannot be implemented in
real time.

Next one is the DFT. The DFT is very fast but what are the problems? The problem is, it
assumes the periodicity of data, and also there is high frequency distortion, that is nothing but the Gibbs phenomenon. That concept I have already explained.

In case of the DCT, what are the advantages? Less high frequency distortion as compared to
DFT. And also the high energy compaction. But the problem is the blocking artifacts.

In case of the DWT, there is high energy compaction and it is also scalable, because I can consider different scales for decomposition. But computationally, it is very complex.

So you can see the comparison between these.

512
(Refer Slide Time:: 66:23)

And if you see the previous figures, again I am showing the previous figures here. So suppose in
this case I am doing the decomposition. One application I can explain. Suppose image
compression, so for image compression I can do this decomposition. And what I can consider for
this component, that is the low frequency component, that is the LL component, I can allocate
more number of bits because it has more visual information as compared to other components,
that is the high frequency components.

So suppose corresponding to this component it has more information, so that is why I have to
allocate more number of bits for the LL component as compared to LH component, HL
component or HH components. So that means, for image compression I can neglect this
component, I can neglect this component, that is the redundant information I can neglect. That
means the maximum importance I can give to LL component as compared to other components.

So based on this principle, I can do image compression or video compression. That is, based on
DWT. So one compression standard is JPEG 2000. In JPEG 2000 this principle is used. In case of JPEG, we use the DCT. But for JPEG 2000, we use the DWT, the discrete wavelet
transformation. This is the concept of image compression by considering discrete wavelet
transformation.
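To make this idea concrete, here is a hedged Python sketch of the principle only (not the actual JPEG 2000 codec, which additionally uses quantization and entropy coding): decompose the image, discard the finest detail subbands, and reconstruct from what remains.

import numpy as np
import pywt

img = np.random.rand(256, 256)              # assumed stand-in for a grayscale image

# Multi-level 2D DWT: [cA3, (details level 3), (details level 2), (details level 1)]
coeffs = pywt.wavedec2(img, 'haar', level=3)
approx, *details = coeffs

# Keep the LL (approximation) band, zero out the finest-level detail bands
details[-1] = tuple(np.zeros_like(d) for d in details[-1])

reconstructed = pywt.waverec2([approx] + details, 'haar')
print(reconstructed.shape)                   # (256, 256)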

513
(Refer Slide Time: 67:55)

And finally, I want to discuss one point, important point. That is, that Fourier transform is not
suitable for efficiently representing piecewise smooth signal. And the wavelet transform is
optimum for representing point singularities due to the isotropic support, because we consider
dilation of the basis functions. So, that means the Fourier transform is not suitable for efficiently representing piecewise smooth signals. For this, we can consider the wavelet transformation.

But there is a problem in the wavelet transformation. What is the problem? Here you can see, the
isotropic support of wavelets makes it inefficient for representing anisotropic singularities such
as edges, corners, contours, lines, etc. That means, the wavelet is not good for representing
anisotropic singularities. I am repeating this, that is, wavelets are not good for representing
anisotropic singularities like edges, corners, contours, like this.

So that is why, to approximate the signals having anisotropic singularities, such as cartoon-like
images, we have to consider some other transformation. So for this, the analyzing elements
should consist of waveforms ranging over several scales. So we have to consider several scales.
Several locations we can consider, and orientations. And the elements should have the ability to
become very elongated.

So suppose if I consider the representation of a cartoon, suppose a cartoon image like this, that
means it has edges and the corners or the contours may be available. So for this, the wavelet is
not good. So for this, we have to consider some other transformation so that we can represent
this anisotropic singularities.

514
For isotropic support we can consider wavelets, but for anisotropic singularities the wavelet
representation has a problem. So for this, we have to consider some other transformation. Now,
here you can see, we have to consider different scales, different locations, and also different orientations.

(Refer Slide Time: 70:26)

So this requires a combination of an appropriate scaling operator to generate elements at different scales. So we have to consider the elements at these different scales, and also we have to consider a translation operator to displace these elements over the 2D plane. And also we have to consider an orthogonal operator to change their orientations.

So that means, I have to consider the cases like the scaling, the translation and the orientation
also we have to consider for representation of cartoon-like characters. Like, suppose if I want to
represent smooth edges or maybe the contours or maybe the edges or the lines, I have to consider
different scales and also we have to consider the translation parameter and also we have to
consider the different orientations.

So that means, in summary, I can say that wavelets are powerful tools in the representation of the
signal. The wavelets are good at detecting point discontinuities. However, they are not effective
in representing geometrical smoothness of the contours. The natural image consists of edges that
are smooth curves, which cannot be efficiently captured by the wavelet transformation.

515
So that is why we have to consider some other transformations. Transformations like the curvelets, ridgelets, contourlets and shearlets we can consider for representing anisotropic
information.

And in this case, I am showing one example: the approximation of a curve. So I am considering
one curve here; you can see, this is the curve. In the first case, I consider the approximation by
isotropic shape elements, and in the second case I consider anisotropic shape elements, which is
very efficient as compared to the first one.

In the first case, I am considering the isotropic shape elements, that is not efficient for
representing smooth contours or maybe the smooth edges or maybe the lines. But in the second
case, I am considering anisotropic shape elements which is very good for representation of
anisotropic singularities.

So, I am not going to explain the concepts of the curvelet, ridgelet, contourlet and shearlet
transformations here. For these transformations you may read books; in my book also I have
discussed the curvelet, ridgelet, contourlet and shearlet transformations.

So in this class, I discussed the concept of the discrete wavelet transformation. And also, I have
explained how to decompose a particular image into different bands. So I can get LL frequency
band, that is the low frequency band, LH band, HL band and HH bands. So I can decompose a
particular image by using the DWT. So for this, I have to consider the low pass filter and the
high pass filter.
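
As a rough illustration of this band decomposition, a one-level Haar analysis can be sketched with NumPy. The function name haar_dwt2, the simple pairwise average and difference filters, and the requirement of even image dimensions are assumptions made only for this sketch; the lecture does not fix a particular wavelet or implementation, and the naming of the LH and HL detail bands depends on the row/column convention used.

import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar DWT of an image with even height and width.
    Returns the four sub-bands LL, LH, HL, HH (LL keeps the coarse
    approximation; the other bands keep the detail information)."""
    img = img.astype(np.float64)
    # low-pass (pairwise average) and high-pass (pairwise difference) along rows
    lo = (img[:, 0::2] + img[:, 1::2]) / 2.0
    hi = (img[:, 0::2] - img[:, 1::2]) / 2.0
    # repeat the same filtering along columns
    LL = (lo[0::2, :] + lo[1::2, :]) / 2.0
    LH = (lo[0::2, :] - lo[1::2, :]) / 2.0
    HL = (hi[0::2, :] + hi[1::2, :]) / 2.0
    HH = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return LL, LH, HL, HH

# usage: four quarter-size bands from a synthetic 8 x 8 image
bands = haar_dwt2(np.arange(64).reshape(8, 8))
print([b.shape for b in bands])   # [(4, 4), (4, 4), (4, 4), (4, 4)]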

So it is not possible to discuss all the mathematical concepts behind the wavelet transformation
here. If you are interested, you may consult image processing books and study the wavelet
transformation yourself.

In my class, I have explained only the basic concepts of the wavelet transformation. So let me
stop here today. Thank you.

516
Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture: 15
Image Enhancement
Welcome to NPTEL MOOCs course on Computer Vision and Image Processing: Fundamentals
and Applications. In my last class I discussed the concept of image transformation. From now, I
will discuss some image processing concepts.

The first concept is Image Enhancement. The objective of image enhancement is to improve the
visual quality of an image for better visual perception or for better machine processing of the
image. Image enhancement, it is subjective. There is no general mathematical theory for the
image enhancement. It is application specific.

I can give some examples of image enhancement. The one example is the removal of the noise,
another one is to improve the contrast of the image, I can change the brightness of an image. I
can highlight the edges, highlight a particular region of interest. So these are some examples of
image enhancement.

The image enhancement can be implemented in spatial domain or maybe in the frequency
domain. In spatial domain, I can manipulate the pixel values directly. In the frequency domain,
what I can do, from the input image, I can apply the Fourier Transformation, I will be getting the
image in the frequency domain. And after this I can do the processing in the frequency domain,
and after this I will apply the inverse Fourier Transformation to reconstruct the processed image.

And in my last class, I discussed one concept. So before applying the Fourier Transformation, I
have to multiply the image by minus 1 to the power x plus y. That concept, already, I have
explained in my last classes. So image enhancement, already I have defined, that is important for
better visual perception.

Now, already I have explained this concept, the image is represented by f x y and I can change
the range of an image and also I can change the domain of an image. Range means the pixel
intensity values and domain means the spatial coordinates. In my next slide, I will show this
concept.

517
(Refer Slide Time: 02:44)

That is, the first one as you can see, I am changing the intensity value, the pixel value. And in the
second case, I am changing the spatial coordinates. So in the first case that is nothing but
changing the range of an image. In the second case, I am changing the domain of an image.

So suppose if I write like this, g x y, that is, the output image, g x y, equals to f x plus a and y
plus b, suppose. This operation, I am considering. g x y is the output image and f x y is the input
image. And I can also write another expression, g x y is equal to f, and I am applying some
transformation. This is the transformation for the x coordinate. t x means the transformation for
the x coordinate; and t y, it is the transformation for the y coordinate.

So I can consider this expression. Now, based on this expression, I can do these operations. One
is the scaling operation, I can do. I can do zooming operation. And also, I can do the translation.
The translation operation, I can do. The first, the g x y is equal to f x plus a y plus b, that is
nothing but the translation.

Also, I can do the scaling: scaling along the x direction and scaling along the y direction. Image
zooming either magnifies or minifies the input image.

So I can give one expression for the zooming: the transformation for the x coordinate is t x of x
comma y equals x divided by c, so the x coordinate is scaled by c; and the transformation for the
y coordinate is t y of x comma y equals y divided by d. So that means it is the expression for the

518
zooming. For the x coordinate, I am doing the scaling and for the y coordinate, I am doing the
scaling.
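
As a minimal sketch of these domain operations, translation by integer offsets and nearest-neighbour zooming could look as follows in NumPy; the zero padding for translation, the nearest-neighbour sampling, and the function names are assumptions made only for this illustration.

import numpy as np

def translate(f, a, b):
    """g(x, y) = f(x + a, y + b) for integer shifts a, b, with zero padding outside."""
    g = np.zeros_like(f)
    rows, cols = f.shape
    x0, x1 = max(0, -a), min(rows, rows - a)   # output rows whose source stays inside f
    y0, y1 = max(0, -b), min(cols, cols - b)
    g[x0:x1, y0:y1] = f[x0 + a:x1 + a, y0 + b:y1 + b]
    return g

def zoom(f, c, d):
    """Nearest-neighbour zoom: the x coordinate is scaled by c and y by d (values > 1 magnify)."""
    rows, cols = f.shape
    X, Y = np.meshgrid(np.arange(int(rows * c)), np.arange(int(cols * d)), indexing='ij')
    return f[np.minimum((X / c).astype(int), rows - 1),
             np.minimum((Y / d).astype(int), cols - 1)]

img = np.arange(16, dtype=np.uint8).reshape(4, 4)
print(translate(img, 1, 0).shape, zoom(img, 2, 2).shape)   # (4, 4) (8, 8)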

(Refer Slide Time: 05:25)

Now, let us consider what is image enhancement. Image enhancement means, to improve the
visual quality of an image for better human perception or for better interpretation by machines.
Now, in this case, you can see I am considering one input image and corresponding to this I have
the output image. So I am getting the better quality image in the output.

(Refer Slide Time: 05:47)

519
Now, in this block diagram you can see, I have the input image and another one is the better
quality image in the output, and I can apply image enhancement techniques. And already I have
explained that it is application specific. Image enhancement is application specific. There is no
general theory for the image enhancement. And image enhancement, I can implement it in the
spatial domain and also in the frequency domain.

In the spatial domain, I can manipulate the pixel intensity values directly, and in the frequency
domain I can modify the Fourier Transform of the image. So these two techniques I can apply.
One is the spatial domain technique, another one is the frequency domain technique.

(Refer Slide Time: 06:34)

Now, some of the operations for image enhancement, maybe something like the point operation,
spatial operation, frequency domain operations and the pseudo coloring.

What is the point operation? So in the point operation, suppose if I consider the histogram of an
image. So I will define the histogram later on. But for the time being just I am explaining like
this. So suppose this is the histogram of an image, so r is the input gray levels. It is from 0 to L
minus 1. And I am considering the number of pixels. So this, I can consider as the gray level
histogram. So later, I will define mathematically what is the gray level histogram.

So number of pixels corresponding to a particular gray level, I can see. So corresponding to this
particular gray level, I can see how many pixels are available in this particular gray level. And in
this case, if I apply this transformation, S is equal to T r, r is the input gray level. This is my

520
input. And S is the output gray value, so output is this. So the point function I am considering,
that operates on the gray level value, the gray level is r, the input value is r.

So this operation is called the point operation. That means, by using some transformation I am
considering the input, the input is r, and I am getting the output, the output is S. And T is the
transformation, the transformation I am considering. So this is the point operation. S is equal to T
r.

And in the spatial operation, what I can consider, I can consider suppose this image. And these
are pixels of the image. And in this case, I am getting the output image, the output image is g x y.
And in this case, I am applying some transformation, the transformation is T, and it is f x y, f x y
is the input image. T is the transformation, I am getting the output image, the output image is g x
y.

Then in this case, what I am considering, I may consider a particular window, one window I can
consider. A window, something, I can consider like this. And this window, suppose the 3 by 3
window, I am considering. Then in this case, I have to consider the neighborhood pixel. Suppose
this pixel I am considering, then I have to consider the neighborhood pixel, the neighborhood
pixel, these are the neighborhood pixels, I have to consider.

So I am getting the output image, the output image is g x y and my input image is f x y and I am
doing some transformation. Then in this case, I have to consider the neighborhood pixels. In this
case, I am considering a 3 by 3 mask. And corresponding to the center pixel, I am
considering the neighborhood pixels for this operation. The operation is the spatial operation.

In frequency domain operation, already I have explained that in this case first we have to
determine the Fourier Transform of the image. And for this, I have to multiply the image by
minus 1 to the power x plus y. And after this, I can do the processing. And finally, I can apply
the inverse Fourier Transformation to reconstruct the processed image.

And one example of this enhancement technique is the pseudo coloring. Pseudo means false
coloring; that concept I am going to explain in color image processing. False coloring means I
can convert a gray scale image into a color image. That is the false coloring, that is the pseudo
coloring.

521
(Refer Slide Time: 11:08)

In this case, I have already shown, I have already explained this concept. So I have the g x y, my
input image is f x y. And corresponding to this center pixel, the center pixel is this, I have the
neighborhood pixels. I can do some transformation, and corresponding image f x y, I am getting
the output image, the output image is g x y. That means this point operation is nothing but s is
equal to T r. So r is my input and s is my output.

Now, in this case, that is the definition of the point operation. This operation, that means, output
depends only on the value of the f at a particular point, x comma y, does not depend on the
position of the pixel in the image. So this output depends on the pixel value at a particular point,
but does not depend on the position of the pixel, that is, the position of the pixel is x comma y.
This is the spatial domain technique.

522
(Refer Slide Time: 12:13)

In this case, I have given one example. You can see the two images. I am doing some
transformation: the input image is f x y, and I am getting the output image, the output image is
g x y. That means this pixel is modified, and this modified pixel is here, based on the
transformation.

(Refer Slide Time: 12:40)

Now, I will discuss some intensity transformation functions.

523
(Refer Slide Time: 12:50)

So what is the intensity transformation function? You can see. The first function is, that is the
contrast stretching function. So if you see the first diagram, that is the contrast stretching
function. So in the contrast stretching function, what is the transformation? The transformation I
am considering s is equal to T r, that transformation I am considering. r is my input and s is my
output.

So this is, this axis is my input and this is my output. So r is my input and s is my output. Now, if
I consider this transformation T r, suppose I consider this point, the point K. Now, in this case,
you can see the values of r lower than K are compressed by the transformation function into a
narrow range of S towards black. So the dark side is this and the light side is this.

So what is the function of the transformation function? That is, the values of r, r means the input
values, lower than the threshold value, the value is K. The values of r lower than K are
compressed by the transformation function into a narrow range of s towards dark. And the
opposite is true for values of r higher than K. So this is the function of the contrast stretching.

In the second case, I am considering a threshold function. Again the same concept: in the x
direction my input is there, and in the y direction the output is there. If you see, for input gray
scale values from 0 up to the threshold T, my output is zero. And if the input value is greater
than T, my output will be L minus 1, because the input range is from 0 to L minus 1 and the
output range is also 0 to L minus 1.

524
So if I apply the thresholding function, then in this case I will be getting the binary image. I can
give one example in the next slide.

(Refer Slide Time: 15:15)

You can see, I am considering the thresholding function. Suppose this is the threshold, Th is the
threshold. My input is this and the output is S is equal to T r.

And this value, if you consider, this value is nothing but L minus 1, and this is 0. So if I consider
here, if the input image, I m n is greater than the threshold, then what will be the value of I m n?
The value of the output image will be 255. 255 means, L minus 1. Else, I m n is equal to 0.

This is the output of the thresholding operation, you can see. My input is this and output I am
getting the binary image, because I have only two levels, one is zero another one is L minus 1, so
L minus 1 is 255.
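
A minimal sketch of this thresholding operation for an 8-bit image, assuming NumPy, could be:

import numpy as np

def threshold(img, th):
    """Binary thresholding: output is 255 (L - 1 for 8 bits) where img > th, else 0."""
    return np.where(img > th, 255, 0).astype(np.uint8)

# usage on a small synthetic image
img = np.array([[10, 200], [130, 90]], dtype=np.uint8)
print(threshold(img, 128))   # [[  0 255], [255   0]]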

525
(Refer Slide Time: 16:13)

Next, I am considering another operation. That operation is the image negative. Intensity range
already I have considered, it is from 0 to L minus 1. This is my intensity range. And if I apply
this transformation, the transformation is S is equal to T r, if I apply this transformation, my
input is r, output is S. S is nothing but, I can write here also, S is equal to L minus 1 minus r. I
am considering this.

Then in this case, I will be getting the negative image. So this is my input image, and I am
getting the negative image. Why is the negative image important? To enhance white details
embedded in the dark regions of an image. That is why we have to consider the negative of an
image.
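
A small sketch of the image negative for an 8-bit image (L equal to 256 is an assumption for this example):

import numpy as np

def negative(img, L=256):
    """Image negative: s = (L - 1) - r."""
    return (L - 1 - img.astype(np.int32)).astype(np.uint8)

img = np.array([[0, 64], [128, 255]], dtype=np.uint8)
print(negative(img))   # [[255 191], [127   0]]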

526
(Refer Slide Time: 17:18)

Next important transformation. In some of my earlier classes, I discussed the log transformation,
which is used to compress the dynamic range. In this case, I am considering the transformation
s is equal to c log of (1 plus r). Here c is the scaling factor, r is my input gray scale value and s is
the output gray scale value, with r greater than or equal to 0.

And in this case, it is used to compress the dynamic range of an image. It is used to display the
2D Fourier Spectrum that already I have explained in one of my class.
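
A rough sketch of the log transformation, with the scaling factor c chosen here so that the maximum input maps to L minus 1 (that particular choice of c is an assumption for this sketch), together with a typical use on a Fourier magnitude spectrum:

import numpy as np

def log_transform(img, L=256):
    """s = c * log(1 + r), with c chosen so that the maximum input maps to L - 1."""
    r = img.astype(np.float64)
    c = (L - 1) / np.log(1.0 + r.max())       # assumes the image is not all zeros
    return (c * np.log(1.0 + r)).astype(np.uint8)

# typical use: compressing the huge dynamic range of a Fourier magnitude spectrum
spectrum = np.abs(np.fft.fftshift(np.fft.fft2(np.random.rand(64, 64))))
display = log_transform(spectrum)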

(Refer Slide Time: 18:00)

527
So, I can show this transformation, what is the log transformation here? This is the log
transformation. If I consider this curve, log transformation is nothing but, so this is L minus 1,
this is 0, this is L minus 1, this is my r, this is S is equal to T r. And if I consider this
transformation, that is the log transformation. This is the log transformation.

So what happens in the case of the log transformation? This is the dark side and this is the bright
side, and the transformation maps a narrow range of low intensity input values into a wider
range of output values.

The opposite is true for higher values of input gray scale values. That means expanding the
values of the dark pixels in the image and compressing the higher-level values. That is the
application of the log transformation. I am repeating this. Expanding the values of the dark pixels
in the image and compressing the higher-level values. That means, in this case if I consider a
high-level value, high level value is from this to this and low-level pixel values are this.

So for the low level pixel value, I am expanding. And for the high-level pixel values, I am doing
the compression because output is only this, if you see. But corresponding to dark level, the
output will be this.

So this transformation is used to display the Fourier Transform of an image. So if you see, here I
am considering some useful transformation. The one transformation, I have shown the log
transformation.

If you see this transformation, if I apply it there is no influence on the visual quality; there will
be no change in the visual quality. This is called the identity transformation, and sometimes it is
called the lazy man transformation, because applying it has no influence on the visual quality.

And if you see this other transformation, that is nothing but the image negative. So, I have
shown some transformations here: one is the log transformation, this one is the anti-log, that is,
the inverse log transformation, one is the image negative, and one is the identity.

528
(Refer Slide Time: 21:20)

I can show this example. If I consider this image, then in this case, suppose this is the Fourier
Transform of the image, then in this case, to visualize the Fourier Transform Spectrum, I have to
apply the log transformation, because it compresses the dynamic range.

(Refer Slide Time: 21:37)

So in this case you can see, first, you can see the Fourier Transform of the image that is, the
Fourier Spectrum, that is not clearly visible. If you see the second image, that is clearly visible
because I am applying the log transformation. So first one is without log transformation and the

529
second one is, I am applying the log transformation. In the log transformation, it compresses the
dynamic range.

(Refer Slide Time: 22:04)

Another very important transformation is the power law transformation. In this case I am
considering the transformation s is equal to c r to the power gamma; this is also called the
gamma transformation. Sometimes you can also write it as s is equal to c times (r plus epsilon)
to the power gamma, where epsilon is an offset. I am considering this offset so that I get a
measurable output when the input is zero.

Now, if I consider gamma is equal to 1 in this expression, then it is nothing but image scaling,
because gamma is equal to 1 means s is equal to c r. It has the same effect as adjusting the
camera exposure time.

530
(Refer Slide Time: 23:22)

And based on this gamma, I have this transformation. So you can see for gamma is equal to 1, I
have the identity. And if you see gamma is equal to 0.2, gamma is equal to 0.4, gamma is equal
to 0.6, 0.7, that is, I am considering the fractional value of gamma. And you can see, if I
consider, it is greater than 1. The gamma is 1.5, gamma is 2.5, gamma is 5, like this.

So what is the meaning of a fractional value of gamma? A fractional value of gamma means that
a narrow range of dark input values is converted into a wider range of output values.

You can see here, the transformation for gamma equal to 0.2 is very similar to the log
transformation. And if I consider gamma greater than 1, say gamma is 1.5 or 2.5, the outcome
will be the opposite.

So I have the number of transformation corresponding to different values of gamma.
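
A minimal sketch of the power law transformation, applied here to intensities normalized to the range 0 to 1 (that normalization and the sample gamma values are assumptions made only for this illustration):

import numpy as np

def gamma_transform(img, gamma, c=1.0, L=256):
    """Power-law (gamma) transformation s = c * r**gamma on intensities scaled to [0, 1]."""
    r = img.astype(np.float64) / (L - 1)
    s = c * np.power(r, gamma)
    return np.clip(s * (L - 1), 0, L - 1).astype(np.uint8)

img = np.linspace(0, 255, 256).astype(np.uint8)
dark_expanded   = gamma_transform(img, 0.4)   # fractional gamma brightens dark values
dark_compressed = gamma_transform(img, 2.5)   # gamma > 1 darkens the image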

531
(Refer Slide Time: 25:17)

Now, I can give one example of an application of the gamma transformation. In a CRT monitor,
the cathode ray tube monitor, what is happening? The voltage is converted into light intensity;
that is, I am getting the output in the monitor. This conversion is basically a non-linear
operation; it follows a non-linear power function.

So in this case what will happen, if I consider a simple monitor, and corresponding to this input,
if I consider this input, suppose my input is this, something like the ramp input and this is my
image, the first one is the image. Corresponding to this, what you can see in the output, because
it is a non-linear device. The monitor output will be something like this, and corresponding to
this output the gamma will be something like this, gamma will be 2.5.

That means, the display system would tend to produce images that are darker than the intended
input. So I am repeating this. That means if I consider this case, gamma is equal to 2.5, the
display systems will produce images that are darker than the intended input.

So that means, in this case if I give this input, corresponding to this input I am getting the output,
this output. And corresponding to this output, the gamma is 2.5. Corresponding to the input
image the gamma is 1. In this case, the gamma is equal to 1.

Now, in this case, how to avoid this condition?

532
(Refer Slide Time: 27:12)

So I have to do some correction, that is called gamma correction. Suppose my input is this
image, and corresponding to this image my gamma is equal to 1. Now, I have to do some
gamma correction. Generally, in a TV monitor, gamma lies between 1.8 and 2.5. If I want to do
the gamma correction, what do I have to do? I take gamma equal to 1 divided by 2.5, that is, I do
the processing S is equal to r to the power 1 by 2.5, which is nothing but r to the power 0.4.

So first, I am doing the gamma correction. If I do this correction, then in this case my input will
be this, input to the monitor will be this. And in this case, corresponding to this input, my output
will be this. So then monitor output will be the ramp output I am getting, that is the gamma will
be 1 in this case.

So that means, you can see what I am doing. So before applying to the monitor, I am doing the
gamma correction. So the second step, you can see, this step is nothing but the gamma corrected
step. So this input is actually inputted to the monitor. And corresponding to this I have the
output, the output is this output. And the output is nothing but the gamma is equal to 1,
corresponding to this.

And one thing is important, that different monitors have different settings, that means gamma
will be different for different monitors. So this is the concept of gamma correction. I can give
some examples of gamma corrections.
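
Before those image examples, the gamma correction step itself can be sketched roughly as a pre-compensation with exponent 1 divided by the display gamma; the display gamma value of 2.5 and the function name gamma_correct are assumptions for this sketch.

import numpy as np

def gamma_correct(img, display_gamma=2.5, L=256):
    """Pre-compensate an image with s = r**(1/display_gamma) so that a display
    with the given gamma reproduces the intended (gamma = 1) intensities."""
    r = img.astype(np.float64) / (L - 1)
    corrected = np.power(r, 1.0 / display_gamma)   # e.g. r**0.4 for a display gamma of 2.5
    return (corrected * (L - 1)).astype(np.uint8)

ramp = np.linspace(0, 255, 256).astype(np.uint8)
to_monitor = gamma_correct(ramp, display_gamma=2.5)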

533
(Refer Slide Time: 29:05)

You can see here the original image, for which gamma is equal to 1, and the gamma corrected
image; the correction gamma is 1.2 in this case, I think.

(Refer Slide Time: 29:23)

Similarly, I can give some examples, gamma is equal to 1, 2, 5. So for gamma is equal to 1,
gamma is equal to 2, gamma is equal to 5, I have these images. So this is about the gamma
corrections.

534
(Refer Slide Time: 29:38)

Next point I want to discuss, the piecewise linear intensity function.

(Refer Slide Time: 29:44)

So what is the meaning of the piecewise linear intensity function? I will show some
transformation. The first transformation is piecewise contrast stretching operation. In this case, I
am considering, the transformation is this. This transformation is S is equal to T r. My input is
this, so I can consider it is 0 to L minus 1. And also here I am considering 0 to L minus 1.

And suppose the slope of this is, the slope of this section. So I have multiple sections here, if you
see the transformation. Corresponding to this section, if I consider this section, my slope is alpha.

535
Corresponding to the second section my slope is beta. And corresponding to the third section, my
slope is gamma.

Then in this case, my input is on this side, r, and this side is S, that is, S is equal to T r. So
mathematically the transformation is: S is equal to alpha r, if r lies between 0 and a; S is equal to
beta times (r minus a) plus c1, if r lies between a and b; and S is equal to gamma times (r minus
b) plus c2, if r lies between b and L minus 1. So mathematically I can show this transformation
like this.

Now, in this case the slope alpha, beta and gamma determine the relative contrast stretch. So
based on the value of alpha, beta and gamma, I can adjust the contrast stretching. So this is the
piecewise contrast stretching operation. And corresponding to this example, I have the input
image and you can see the output image, the output image is this.
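
A small sketch of this piecewise linear stretching, parameterized here by the two breakpoints (a, c1) and (b, c2) so that the three segment slopes correspond to alpha, beta and gamma; the use of np.interp and the sample breakpoint values are assumptions made only for this illustration.

import numpy as np

def piecewise_stretch(img, a, b, c1, c2, L=256):
    """Piecewise linear contrast stretching through the breakpoints (a, c1) and (b, c2)."""
    r = img.astype(np.float64)
    # np.interp builds the piecewise linear map (0, 0) -> (a, c1) -> (b, c2) -> (L-1, L-1)
    s = np.interp(r, [0, a, b, L - 1], [0, c1, c2, L - 1])
    return s.astype(np.uint8)

img = np.linspace(0, 255, 256).astype(np.uint8)
stretched = piecewise_stretch(img, a=80, b=170, c1=30, c2=220)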

(Refer Slide Time: 32:21)

Again, I am showing another example of contrast stretching. The range 100 to 150 is enhanced,
and also the range 150 to 255. From this point to this point nothing is changed, because it is a
linear function at 45 degrees; if the angle is 45 degrees, then it is the identity.

But corresponding to this portion, I am doing the contrast stretching and corresponding to this
point to this point also I am doing the contrast stretching. So the slope is alpha, beta and gamma.

536
(Refer Slide Time: 33:02)

The next one is the gray-level slicing. In this case, I have shown two transformations. In the first
one, this side is r and this side is s, as I have already told you. Mathematically, S is equal to L
minus 1 if r lies between a and b, and otherwise it will be 0. That means, this is my a and this is
b.

So what will be the output of this transformation? The intensity range from a to b is highlighted;
that is, the transformation highlights the intensity range between a and b and reduces all other
intensities to some lower value, which may be 0. So this intensity range from a to b is
highlighted, and the rest of the intensity values will be 0.

In the second case, I am doing the same thing. In the first case you can see, if I have a image, a
particular image. And suppose I have one object here, particular range, intensity range. Suppose
this intensity range I am highlighting. Then in this case, if I consider the first transformation, the
rest of the things, the rest of the pixel values, that will be 0. Corresponding to the first
transformation.

In the second case, I am showing another transformation. In this case what I am considering,
from a to b, that range, that is highlighted but remaining intensity, they are not changed. So that

537
means, in this case I can write S is equal to L minus 1 corresponding to the range, corresponding
to this range the output I am having, the L minus 1, that portion is highlighted. But the remaining
is not changed, otherwise, S is equal to r.

So that means this background, I am not making zero. But if I consider this portion, the region of
this, so this is the region of interest, so that portion is highlighted. So that portion will be L minus
1 but the rest of the portion, I am not changing any pixel values.
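
Both slicing variants can be sketched in a few lines of NumPy; the function names are assumptions made only for this illustration.

import numpy as np

def slice_suppress_background(img, a, b, L=256):
    """Highlight the range [a, b] as L - 1 and set every other intensity to 0."""
    return np.where((img >= a) & (img <= b), L - 1, 0).astype(np.uint8)

def slice_preserve_background(img, a, b, L=256):
    """Highlight the range [a, b] as L - 1 and leave the remaining intensities unchanged."""
    return np.where((img >= a) & (img <= b), L - 1, img).astype(np.uint8)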

(Refer Slide Time: 36:11)

So I can show some of the slicing techniques. Result of slicing in the black and white regions. So
I am doing the slicing in the black and white regions. This is one example.

538
(Refer Slide Time: 36:25)

Another important slicing technique is the bit-plane slicing. In this case, a particular gray scale
value can be represented as k7 times 2 to the power 7, plus k6 times 2 to the power 6, plus k5
times 2 to the power 5, plus k4 times 2 to the power 4, plus k3 times 2 to the power 3, plus k2
times 2 to the power 2, plus k1 times 2 to the power 1, plus k0, because I am considering an 8
bit image; each pixel is quantized by 8 bits. That means I have 256 intensity levels.

So in this case, if I consider this representation, this corresponds to the MSB bit, the most
significant bit, and this corresponds to the LSB. One is the LSB, another one is the MSB. And
corresponding to this MSB, suppose I will be getting one plane, that is, one bit-plane I am getting
and that bit-plane I can consider as the MSB bit-plane. The most significant bit, MSB bit-plane.

So I will be getting 8 numbers of bit planes. And corresponding to this, this is nothing but the
LSB bit plane. So I will be getting the bit planes like this. This representation is quite important,
because in this case I can highlight the contribution of specific bit to image intensity. And in this
case, I can analyze relative importance of each bit. So which one is important, the bit 7 is
important or bit 2 is important or bit 1 is important, that I can see.

So one is to highlight the contribution of specific bits to image intensity, that I can see, and also I
can analyze relative importance of each bit. In this case, I have shown the bit planes, the LSB
bit-plane, MSB bit-planes, like this. And in this case, the MSB bit-plane can be obtained by
thresholding at 128.
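
A minimal sketch of bit-plane slicing for an 8-bit image, which also confirms that the MSB plane equals thresholding at 128:

import numpy as np

def bit_planes(img):
    """Return the 8 bit-planes of an 8-bit image, from LSB (plane 0) to MSB (plane 7)."""
    return [((img >> k) & 1).astype(np.uint8) for k in range(8)]

img = np.array([[200, 100], [50, 255]], dtype=np.uint8)
planes = bit_planes(img)
msb = planes[7]   # identical to thresholding the image at 128
print(np.array_equal(msb, (img >= 128).astype(np.uint8)))   # True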

539
(Refer Slide Time: 39:08)

So you can see here. This is my original image, and in this case MSB plane is obtained by
thresholding at 128. So I am having this MSB bit-plane I am having. This is the MSB bit-plane.

(Refer Slide Time: 39:20)

So in this case, corresponding to this image, input image, you can see, I am considering the 8 bit-
planes. The LSB, MSB, all the bit-planes I am getting. And I can see the importance of the bit.
Which bit is more important for visual information? So visually which bit-plane is important,
that I can analyze. So that is the importance of the bit-plane slicing.

540
(Refer Slide Time: 39:46)

After this the next important concept is the Histogram Processing. So first, I will discuss what is
the meaning of the image histogram. And after this I will discuss two important algorithms, one
is the histogram equalization technique, another one is the histogram specification technique.
And in this case, by histogram equalization technique, I can improve the contrast of an image.
And in case of the histogram specification, I can generate a particular image based on the
specified histogram.

So these two techniques are important, one is the histogram equalization, another one is the
histogram specification.

So first, going to this concept, first I have to define the histogram of an image. And based on the
histogram of an image, I can define the contrast of an image. Low contrast image, good contrast
image, bad contrast image, I can define. And also, the assumptions for histogram equalization
technique, I have to consider some assumptions and based on these assumptions, I have to apply
the histogram equalization technique.

541
(Refer Slide Time: 40:59)

So let us first define what is the histogram of an image. So in this case the histogram of an
image, I can consider, suppose, r k is the input intensity levels, that is the number of gray levels I
am considering, from 0 to L minus 1. Now, I am determining the probability of r k, what is the
probability of the r k? n k divided by n. What is n k? Number of pixels with gray level r k, and n
is the total number of pixels. The total number of pixels is nothing but the m cross n, the size of
the image.

So based on this, I can define the histogram of an image. And I am showing one program
segment. So by using this program segment, I can generate the histogram of an image.
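
The program segment shown on the slide is not reproduced here; a comparable sketch in Python with NumPy (the choice of language is an assumption) for the normalized gray-level histogram could be:

import numpy as np

def gray_histogram(img, L=256):
    """Normalized gray-level histogram: p(r_k) = n_k / n for k = 0 .. L-1."""
    n_k = np.bincount(img.ravel(), minlength=L)   # number of pixels at each gray level
    return n_k / img.size                         # divide by the total number of pixels n

img = np.array([[0, 0, 1], [2, 2, 2]], dtype=np.uint8)
p = gray_histogram(img)
print(p[:3])   # [0.333... 0.166... 0.5]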

And in this case, I can show some of the cases of the histogram. Suppose if I consider histogram,
is something like this. This is 0 and this is L minus 1, and I am considering the probability of r k,
that I am considering. Another histogram I can consider, suppose this is probability of r k, L
minus 1, my histogram is something like this. Another histogram, I am considering. And this is L
minus 1. And another histogram I am considering.

So if you see the first histogram, this is my dark side, this is the dark side and this is the bright
side. So corresponding to the first histogram, I will be getting the dark image. Dark image I will
be getting. Corresponding to the second histogram, I will be getting the bright image or maybe
the light image, I will be getting. Bright, or the light image I will be getting.

542
And in my first class I discussed about the contrast of an image. The contrast of an image
depends on the dynamic range. Dynamic range means the highest pixel value I have to consider
in an image and the lowest pixel value I have to consider. The difference between these two is
called the dynamic range. In the third case, the histogram is not uniformly distributed. Then in
this case, I will be getting a low contrast image. And in the final case, the fourth case, the
histogram is uniformly distributed. Then in this case I will be getting the good contrast image.

So you can see the histogram concept and based on the histogram I can define the dark image,
the bright image, low contrast image and the good contrast image.

(Refer Slide Time: 44:23)

In this case, I have shown the histogram of an image. So this is my input image, and I have
shown the histogram of the image. The histogram of the image, you can plot.

543
(Refer Slide Time: 44:33)

And again in this case, I have shown the histogram of the image. In the x-axis, I am considering r
k, that means the input pixel values, the pixel values of the image. And in the y-axis, I am
considering the probability of r k, that is nothing but n k divided by n. That means, what is n k?
Number of pixels with gray level r k. And n means the total number of pixels.

(Refer Slide Time: 45:02)

And in this example, I have shown, the one is the dark image, one is the bright image, another
one is the good contrast image. So that already, I have explained.

544
(Refer Slide Time: 45:11)

And you can see one example of the low contrast image and corresponding histogram, you can
see.

(Refer Slide Time: 45:18)

This is the improved contrast image. You can see the histogram of the image. This histogram is
uniformly distributed so the contrast is better as compared to the previous one.

545
(Refer Slide Time: 45:28)

Now, what is histogram equalization? I will explain now. I am considering the point
transformation s is equal to T r, which I have already explained. Here r is the input intensity and
s is the output intensity, and the input range is from 0 to L minus 1.

The first assumption is that s is equal to T r is a monotonically increasing function in the
interval r from 0 to L minus 1. That is, the transformation function T is a monotonically
increasing function. So what is the interpretation of this?

546
The interpretation is this, output intensity value will never be less than the corresponding input
intensity value. So this is the first case. That is, the transformation is monotonically increasing
function, the T is a monotonically increasing function. The interpretation is, the output intensity
value will never be less than the corresponding input intensity value.

The second condition is that T r lies between 0 and L minus 1 for input r lying between 0 and L
minus 1. The meaning of this is that the range of the output intensity is the same as that of the
input intensity.

And finally, I am considering another assumption. For the inverse mapping r is equal to T
inverse of s, since s is equal to T r, T r must be a strictly monotonically increasing function in
the interval r from 0 to L minus 1. The interpretation is that the mapping from s back to r will be
one to one.

So these three assumptions I am considering for histogram equalization.

Now, what is the meaning of a monotonically increasing function? I can show one example.
Suppose this axis is the input and this axis is the output, s is equal to T r, and both range from 0
to L minus 1. You can see, for a monotonically increasing function, multiple input values may
give a single output value, and a single input value also gives a single output value. That means
a monotonically increasing function means T of r2 is greater than or equal to T of r1, for r2
greater than r1. So this is the definition of the monotonically increasing function. And what is a
strictly monotonically increasing function, I can explain next.

The strictly monotonically function, I can explain here. So in the strictly monotonically
increasing function, so for the strictly monotonically increasing function, something like this. So
this is L minus 1 and this is L minus 1. This is my s and this is r. So in this case the single input

547
that will give the single output. Single input and single output. This is nothing but one to one
mapping. Single output. Single input, single output. That is one to one mapping.

That means T r, mathematically I can show you. The T r2 is greater than T r1, for r2 greater than
r1. So that is the definition of the strictly monotonically increasing function. So these
assumptions are very important.

(Refer Slide Time: 52:18)

The next one is the histogram equalization technique. In this case, r can be treated as a random
variable in the interval from 0 to 1. P r of r is nothing but the pdf, the probability density
function, of r, and I am considering P s of s, that is the pdf of the output, the output is s.

We have the transformation s is equal to T r. Then you can determine P s of s, the pdf of s, by
the expression P s of s equals P r of r times the magnitude of d r by d s. With s is equal to T r
given by this integration of P r, where u is the dummy variable, d s by d r can be worked out,
and corresponding to this I can determine the pdf of s. In this case I will be getting 1, that is, a
uniform distribution.

What is the meaning of this, pdf of the output variable? The output variable is s. That means I am
getting the uniform distribution. So I can show this one. My P r r maybe something like this.

548
The distribution may be something like this, over the range 0 to L minus 1. But after the
histogram equalization, I will be getting a uniform pdf of height 1 by L minus 1 over this range.

So you can see, after histogram equalization, I am getting the uniform pdf.

(Refer Slide Time: 54:47)

And for the discrete case also, you can implement the histogram equalization technique. This is
my r k, and I can define the probability of r k; after this, I am getting the output, g k, which I can
determine. So for the discrete case also you can apply the histogram equalization technique.
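
A compact sketch of the discrete histogram equalization described above, using the cumulative sum of p(r k) as the mapping; the rounding step and the synthetic low-contrast test image are assumptions made only for this illustration.

import numpy as np

def histogram_equalize(img, L=256):
    """Discrete histogram equalization: s_k = (L - 1) * sum of p(r_j) for j <= k."""
    p = np.bincount(img.ravel(), minlength=L) / img.size   # p(r_k) = n_k / n
    cdf = np.cumsum(p)                                     # cumulative distribution
    mapping = np.round((L - 1) * cdf).astype(np.uint8)     # s = T(r)
    return mapping[img]                                    # apply the point mapping

low_contrast = np.random.randint(100, 140, size=(64, 64), dtype=np.uint8)
equalized = histogram_equalize(low_contrast)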

(Refer Slide Time: 55:04)

549
And I can show some examples of the histogram equalized image. So here you can see, it is the
histogram equalized image. And you can see the corresponding histogram, that is, uniformly
distributed.

(Refer Slide Time: 55:18)

Here also I have shown one example, that is the histogram equalized image I am having. So first
one is the low contrast image, the second one is the histogram equalized image. So this is about
the histogram. So that means in this case what I have done, so s is equal to T r, r is the input
image. And in this case, from the input image I can determine this pdf. And from this, I have to
determine P s s, I have to determine. That is the concept of the histogram equalization technique.

So from the input image, I am determining the pdf of r, and from the pdf of r I am determining
the pdf of the variable s, that I am determining.

550
(Refer Slide Time: 56:08)

And in this case also I am showing the example, the histogram equalization techniques, I am
applying and you can see the first one is the low contrast image, the second one is the equalized
image.

(Refer Slide Time: 56:20)

The another thing is the histogram specification. So what is the histogram specification? Given
an image with a particular histogram, another image which has a specified histogram can be
generated and this process is called the histogram specification or histogram matching.

551
So what I am considering, just suppose I have the input image. Input image is suppose r. From
the input image, I can calculate this one, this pdf I can calculate. And suppose, I have the output,
image is z. This is my output image. Now, the specified histogram is given. The specified
histogram is this. This is my specified histogram. From the specified histogram, I have to get the
output image.

That means the specified pdf is given, and the output image should have this pdf. So I can
consider s is equal to T r, you know this; that is nothing but L minus 1 times the integral from 0
to r of P r of w, where w is the dummy variable for the integration. And in this case z is the
random variable, so G of z is equal to L minus 1 times the integral from 0 to z of P z of t, where
t is the dummy variable for the integration. Then in this case, I will be getting the equalized
image, s is the equalized image.

So the specified histogram is given and I have to generate the image. First I am considering s is
equal to T r. After this, I am determining the transformation function G of z. Setting G of z
equal to T r, from this you can determine z; z is the output image, and it is given by z is equal to
G inverse of T r. This you can determine.

So you can see the difference between histogram specification and the histogram equalization.

(Refer Slide Time: 59:30)

So in this algorithm, I have shown the algorithm for the histogram specification. So you can just
see the algorithm for the histogram specification, that already, I have applied.
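
The algorithm on the slide is not reproduced here, but the z equals G inverse of T r idea can be sketched roughly as follows; the numerical inversion of G by interpolation and the triangular target histogram are assumptions made only for this sketch.

import numpy as np

def histogram_specification(img, target_pdf, L=256):
    """Map img so that its histogram approximates target_pdf (length-L array summing to 1).
    Steps: s = T(r) from the input CDF, G(z) from the specified CDF, then z = G^{-1}(s)."""
    p_r = np.bincount(img.ravel(), minlength=L) / img.size
    T = (L - 1) * np.cumsum(p_r)          # equalization transform of the input
    G = (L - 1) * np.cumsum(target_pdf)   # transform defined by the specified histogram
    # invert G numerically: for each s = T(r), find the z whose G(z) is closest
    inverse_G = np.interp(T, G, np.arange(L))   # piecewise-linear approximation of G^{-1}
    mapping = np.round(inverse_G).astype(np.uint8)
    return mapping[img]

# example: match to a triangular target histogram
target = np.arange(256, dtype=np.float64); target /= target.sum()
img = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
matched = histogram_specification(img, target)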

552
So up till now, I have discussed these two important techniques: one is the histogram
equalization technique, another one is the histogram specification technique. In the histogram
equalization technique, we have considered the three assumptions that I have already explained,
and based on these three assumptions I am getting the equalized image; after equalization, I am
getting a uniform pdf. In the case of histogram specification, the specified histogram is given,
and from this I have to generate the output image whose histogram matches the specified
histogram.

So you can see the difference between the histogram equalization and the histogram
specification. The histogram equalization technique you can apply for the entire image, for the
whole image, or maybe you can apply in a particular region. So one is the local histogram,
another one is the global histogram. Global histogram means the histogram equalization
technique is applied for the whole image, and if I consider only the region of interest, in the
region if I apply the histogram equalization technique, that is called the local histogram. The
histogram equalization technique, the histogram specification technique and also the point
operations are very important operations for image enhancement.

So next class, I will continue the same concept. And I will discuss in the next class the concept
image filtering. So let me stop here today. Thank you.

553
Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering,
Indian Institute of Technology, Guwahati
Lecture: 16
Image Filtering
Welcome to NPTEL MOOCs course on Computer Vision and Image Processing: Fundamentals
and Applications. Last class I discussed the concept of image enhancement, the objective is to
improve the visual quality of an image. Today, I am going to discuss the concept of image
filtering to remove noises. Image filtering operations can be implemented in spatial domain or in
frequency domain. In spatial domain I can manipulate the pixel values directly, so for this I can
consider neighborhood operations. In case of a frequency domain I can modify the Fourier
Transform of the image.

So first I have to determine the Fourier Transform of the image and after this I can do processing
in the frequency domain. And after this I have to do Inverse Fourier Transformation to get the
processed image. So, before discussing the image filtering concept, I will first discuss some
image processing operations. You have already understood these operations; I am going to
explain some of them again, like zooming and some neighborhood operations.

(Refer Slide Time: 01:46)

554
So, first one is the image enhancement techniques that can be implemented in spatial domain or
in frequency domain. In spatial domain I can operate directly on pixels so that means I can
change the pixel values directly. And in the frequency domain I can modify the Fourier
transform of an image. So, this concept already I have explained.

(Refer Slide Time: 02:13)

And in this case I have given one example of image enhancement. I have the original image and
you can see the enhanced image.

555
(Refer Slide Time: 02:23)

And I have considered some geometric operations. If you see here, g x y is equal to f of x plus a
comma y plus b. More generally, g x y, that is the output image, is f of t x of x comma y and t y
of x comma y, where t x is the transformation for the x coordinate and t y is the transformation
for the y coordinate. That means I am changing the domain of an image.

So if you see, g x y is equal to f of t x of x comma y comma t y of x comma y; that is, I am
changing the domain of an image, the spatial coordinates. Based on this equation I can do the
scaling operation, I can do the translation operation, or I can do the zooming operation, which I
can also show.

Image zooming either magnifies or minifies the input image. If I consider the transformation t x
of x comma y where the x coordinate is divided by c, and t y of x comma y where the y
coordinate is divided by d, then by this transformation I can do zooming: zooming in and
zooming out.

556
(Refer Slide Time: 03:59)

And in this case I have shown some operations like rotation I can do and also I can do the
scaling. That is I am changing the domain of an image.

(Refer Slide Time: 04:10)

And in this case I am showing some neighborhood operations that is the spatial or neighborhood
operations. So, for this I am considering one window the window is given by W x y, that is the
window. And I have the output image the output image is g x y.

557
Then in this case I am considering the neighborhood pixels for the processing so if you see this
window, this is the window, and corresponding to this central pixel I am considering the
neighborhood pixel. And based on these neighborhood pixels I can do the processing. That is
called the neighborhood operations.

I can give one example of a neighborhood operation. Here the input image is f i j, and the sum of
f i j over the window W x y is divided by n, the number of neighborhood pixels, to give the
output image g x y. So this is one example of a neighborhood operation.
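
A rough sketch of this neighborhood averaging, assuming a square window and replicated border pixels (both are assumptions made only for this illustration):

import numpy as np

def neighborhood_average(f, size=3):
    """g(x, y) = (1 / n) * sum of f over the size x size window W centred at (x, y)."""
    pad = size // 2
    fp = np.pad(f.astype(np.float64), pad, mode='edge')   # replicate border pixels
    g = np.zeros(f.shape, dtype=np.float64)
    for dx in range(size):
        for dy in range(size):
            g += fp[dx:dx + f.shape[0], dy:dy + f.shape[1]]
    return (g / (size * size)).astype(f.dtype)

img = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)
smoothed = neighborhood_average(img, size=3)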

(Refer Slide Time: 05:14)

And I can do some operations between images. So, first you can see the addition of two images,
the subtraction of two images that is the g x y is subtracted from f x y, so this is the subtraction
operation. Multiplication of two images and you can see the division operation. So, in case of the
subtraction, that is nothing but the change detection, already I have explained in my second class.
So, it is nothing but the change detection.

If I want to detect the moving regions, then in this case I have to apply the subtraction operation.
That is called the change detection. And in this case you can see that I am doing the addition of
number of images. Now, because of these operations, the arithmetic operation can produce pixel

558
values outside the range, the range is already we are considering from 0 to 255. That is mainly
from 0 to L minus 1 number of levels I am considering.

So, arithmetic operation can produce pixel values outside the allowable range. Then in this case
what I have to do I have to convert the values back to the range, the range is from 0 to 255. So, I
have to do this. And in this case if I consider the multiplication of an image, suppose the
multiplication of f x y by constant, suppose the constant is c, that is nothing but the scaling, I can
do the scaling. Then in this case, by this operation, I can change the brightness of an image. The
brightness I can adjust.

If the scaling factor is greater than 1, that means in this case the brightness of the image will be
more and a factor less than 1 darkens the image. So, I can do the scaling of the image. And
similarly, in the division also, if I consider f x y divided by some constant d, by a factor of d,
that is very similar to change detection.

(Refer Slide Time: 07:44)

Now, in this case I am going to consider how to remove noise based on averaging. So, I am
considering one image, f x y, that is the noiseless image, and I am considering the noise n x y.
And suppose the g x y is equal to f x y plus n x y. And one consideration I am considering, that is
noise is uncorrelated and has 0 average value.

559
Then if I consider K number of images, I am taking the average of these K images, g x y, and I
am determining the mean value, that is, the average value. By this process I can remove noise.

(Refer Slide Time: 08:35)

So, I am explaining it again. So, you can see the g x y is the corrupted image, and in this case I
am adding noise, that noise is eta x y. And in this case I am considering the noiseless image the
noiseless image is f x y, that is the noiseless image. And we have considered the noise here has 0
mean value. So, that is why the expected value of zi is equal to 0.

And also, I am considering at every pair of coordinates the noise is uncorrelated. So, that means
the expected value of zi zj is equal to 0.

560
(Refer Slide Time: 09:13)

And the noise effect is reduced by averaging a set of K noisy images. So, I am considering K
number of images, from i equal to 1 to K, and I am taking the average of them. Now, the mean
value and the standard deviation of the new image show that the effect of noise is reduced.

(Refer Slide Time: 09:39)

561
So, first I am showing the new image, that is the mean image g bar x y, and I want to see its
expected value. If you see, it is the expected value of 1 by K times the summation of g i x y; in
place of g bar I am putting this expression. Since 1 by K is a constant, I am taking it out of the
summation.

After this, in place of g i x y I am putting f x y plus the noise, and the f x y term is separated
from the noise term. I have already considered that the noise has 0 mean, so the noise term will
be 0 and the remaining summation gives K times f x y. So, ultimately, I am getting f x y: if I
take the average of a number of noisy images, the expected value is the noiseless image f x y.

What is f x y? f x y is the noiseless image.

(Refer Slide Time: 10:43)

And I can also determine the standard deviation of the new image. Ultimately, if I do this
calculation, you can see it is 1 by root K times sigma eta x y. So what happens if I increase K,
the number of images? The variability of the pixel intensity decreases, and the average remains
close to the noiseless image value f x y. I am repeating this: if I increase K, the variability of the
pixel intensity decreases and the result remains close to the noiseless image value f x y. So if I
consider a number of images and take their average, the noise can be reduced.
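
A small numerical sketch of this averaging effect, using synthetic zero-mean Gaussian noise (the constant test image, the noise level and the values of K are assumptions made only for this illustration); the printed residual standard deviation shrinks roughly like sigma divided by the square root of K.

import numpy as np

rng = np.random.default_rng(0)
f = np.full((64, 64), 100.0)        # noiseless image f(x, y)
sigma = 20.0                        # noise standard deviation

def average_of_noisy_images(K):
    """Average K independent noisy observations g_i = f + eta_i (zero-mean noise)."""
    noisy = f + rng.normal(0.0, sigma, size=(K,) + f.shape)
    return noisy.mean(axis=0)

for K in (1, 5, 20, 100):
    g_bar = average_of_noisy_images(K)
    print(K, round(np.std(g_bar - f), 2))   # residual noise level for each K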

562
(Refer Slide Time: 11:36)

And in this example I have shown the results: the first one is the noisy image, the second one is
the average of 5 images, the next one is the average of 10 images, the next one is the average of
20 images, then the average of 50 noisy images, and this one is the average of 100 noisy images.
So, you can see that the noise is removed because of the averaging operation.

(Refer Slide Time: 12:10)

563
The next one is the image subtraction. So, already I have explained, because of the image
subtraction I can determine the moving regions. That is called the change detection. So, for
medical application that can be used.

Here I am giving one example. Suppose this is my input image, and suppose I want to determine
the locations of the arteries and the veins; for this, an iodine contrast medium is injected into the
blood stream. In this case, I do the subtraction between these two images, the first image and the
second image.

If I do the subtraction, that is f x y minus h x y, then I will be getting g x y, the subtracted
image. You can see the subtracted image, g x y; that means I can see the locations of the arteries
and the veins. And in the next image you can see that this portion is further enhanced by image
enhancement techniques.

So, you can see the application of image subtraction.
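
A minimal sketch of image subtraction for change detection, assuming the two images are already registered; the threshold value and the synthetic example are assumptions made only for this illustration.

import numpy as np

def change_detection(f, h, threshold=30):
    """Absolute difference of two registered images; large differences mark the change."""
    diff = np.abs(f.astype(np.int32) - h.astype(np.int32))
    mask = (diff > threshold).astype(np.uint8) * 255
    return diff.astype(np.uint8), mask

before = np.zeros((64, 64), dtype=np.uint8)
after = before.copy(); after[20:40, 20:40] = 200     # a region has changed
difference, change_mask = change_detection(before, after)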

(Refer Slide Time: 13:36)

Also, I can do some logical operations. Like I can do AND logical operation, OR operation,
XOR operation, NOT operations. These operations I can do for image processing. Then in

564
this case such operations are done on pairs of their corresponding pixels. And in this case, I have
to consider one real picture and another one is I have to consider is mask for this operation.

So, first I have to consider the real picture, which means the image, and also a machine generated mask. The mask is nothing but a binary image consisting only of pixel values 0 and 1. The logical operation is performed between the mask and the image. I can show some examples of these logical operations.

(Refer Slide Time: 14:34)

The first one is the AND operation. I am considering the image, and I am considering the mask, which is like this. If I do the AND operation, then the output will be like this: the portion outside the mask will be black.

Similarly, if I consider another mask, suppose this one, and do the OR operation, then I will be getting an output image something like this. So, I can apply these logical operations, the AND operation, the OR operation, and so on.
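The sketch below shows one simple way such mask-based logical operations can be realised on a grey-level image with a binary (0/1) mask; the exact convention used here for OR on grey values is an assumption for illustration.

import numpy as np

def mask_and(image, mask):
    # AND with a binary mask: pixels where the mask is 0 become black,
    # pixels where the mask is 1 keep their original grey value.
    return image * (mask > 0)

def mask_or(image, mask):
    # OR with a binary mask: pixels where the mask is 1 are forced to white.
    out = image.copy()
    out[mask > 0] = 255
    return out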

(Refer Slide Time: 15:14)

And here I am showing another example, that is, how to enhance a region of interest. It is a dental X-ray image, and I am considering two masks. In this case, what I am doing is the multiplication of two images: this is the first image and this is the second image. If I do the multiplication between these two images, you can see the result is this.

So, what am I getting? I am getting the region of interest, the teeth with fillings. That means I am doing the multiplication between the first image and the second image, where the second image is the region-of-interest mask for isolating the teeth with fillings. So, these operations I can apply.

(Refer Slide Time: 16:18)

Now, let us consider the spatial domain filtering operation. I want to explain what the spatial domain filtering operation is. In spatial domain filtering I have to consider neighborhood operations, and for this I have to consider a mask.

Suppose I want to modify a pixel value. Then I have to consider the neighborhood pixels and mainly I have to do a convolution operation. Based on these operations I can build a low pass filter, a high pass filter, or a high boost filter. Another important filter is the median filter.

So, first I will explain the concept of spatial filtering and the mask, that is, how to do the masking operation. After this I will explain the concept of the low pass filter, the high pass filter, and the median filter.

(Refer Slide Time: 17:22)

In the case of a digital image, frequency is a measure of the amount by which the grey value changes with distance. So, you already know the definition of frequency, and an image may have low frequency components and high frequency components. For these I have the high pass filter and the low pass filter.

If I want to select the low frequency components, then I have to apply the low pass filter; if I want to select the high frequency components, I have to apply the high pass filter. The high frequency components of an image are nothing but the edges and the boundaries, while the low frequency information corresponds to the constant intensity or the homogeneous portions of the image.

(Refer Slide Time: 18:07)

For the neighborhood operations I have to consider a filter mask, move the filter mask from point to point in the image, and perform an operation on the pixels inside the mask. Based on this I may have the following operations. One is the min filter: set the pixel value to the minimum in the neighborhood.

Another operation is the max filter: set the pixel value to the maximum in the neighborhood. And I can do the median operation. Suppose the neighborhood pixel values are 1, 7, 15, 18, 24; then what is the median value? The median value is 15, so the pixel under consideration is replaced by 15.

In this case, I am considering the neighborhood operation: I am considering a mask, and within this mask I am applying these operations, the min filter, the max filter, and the median filter.

(Refer Slide Time: 19:23)

Here I am showing the spatial domain approach. You can see f(x, y) is the input image, T is the transformation, and I am getting the output image g(x, y). T is an operator on the input image f(x, y), defined over a neighborhood of (x, y). For this I am considering a mask, in this example a 3 by 3 mask, centred on the pixel (x, y); that is, the central pixel is (x, y).

And let us see how to do this operation.

(Refer Slide Time: 20:06)

So, again I am showing this one I have the input image and I have the output image and I am
considering the neighborhood pixels. So, these are the pixels, the neighborhood pixels. And I am
considering the mask and I am getting the output image. The output image is g x y f x y is the
input image and T is the operation that is the neighborhood operation that I am doing.

(Refer Slide Time: 20:30)

So, the mask is similar to this; it is a symmetric mask. What is the central pixel of the mask? The central pixel of the mask is (x, y). I can consider this as a filter, so I have the filter values, the coefficient values, or what I can call the weights of the mask. This is one example of the mask or the filter.

(Refer Slide Time: 20:58)

In spatial domain filtering, what I have to do is put the mask over the image. Suppose I am considering one image with pixel values Z1, Z2, Z3, Z4, Z5, Z6, Z7, Z8, Z9, and I have the mask with weights w1, w2, and so on. I have to put the mask over the image, centred on the pixel under consideration; that means I am overlapping the mask with the image.

After this, corresponding to the pixel Z5, I am determining the response of the mask. The response of the mask is given by w1·Z1 + w2·Z2 + ... + w9·Z9; like this I have to determine the response of the mask for the image pixels Z1, Z2, Z3 up to Z9. In this case I am considering a 3 by 3 symmetric mask.

(Refer Slide Time: 22:26)

This operation I am showing here again. I am considering a 3 by 3 filter and my input image is f(x, y). This is the filter, and its weights, or coefficients, are r, s, t, u, v, w, x, y, z. The image pixel values are a, b, c, d, up to i; these are the original image pixels.

And I have already explained how to do the masking: I have to put the mask over the image and determine the response of the mask, which is nothing but v multiplied with e, plus r multiplied with a, and so on; like this I have to do the multiplication and addition. This process is repeated for all the pixels of the image.

Then in this case I will be getting the filtered image.

(Refer Slide Time: 23:26)

This concept is called linear filtering. In this expression g(m, n) is the output image, f(m, n) is the input image, and the mask is given by w(m′, n′); the summations are performed over the window, and the mask is a symmetric mask. You can see it is something like convolution, because the filter coefficients are multiplied with the pixel values and summed up. That multiplication and sum is the convolution.
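A minimal sketch of this linear filtering operation in Python/NumPy is given below; zero padding at the border is my own choice here, and for a symmetric mask the result is the same whether we call it correlation or convolution.

import numpy as np

def linear_filter(image, mask):
    # Response of the mask at each pixel: the weighted sum of the neighbourhood,
    # g(m, n) = sum over (m', n') of w(m', n') * f(m + m', n + n').
    image = image.astype(np.float64)
    k = mask.shape[0] // 2                      # mask assumed square with odd size
    padded = np.pad(image, k, mode="constant")  # zero padding at the boundary
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            region = padded[i:i + mask.shape[0], j:j + mask.shape[1]]
            out[i, j] = np.sum(region * mask)   # multiply and sum up
    return out

avg_mask = np.ones((3, 3)) / 9.0                # 3 x 3 averaging (low pass) mask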

(Refer Slide Time: 24:18)

Here I show the difference between linear filtering and non-linear filtering for spatial filtering methods. If I consider this operation, I am determining the response of the mask; a filtering method is linear when the output is a weighted sum of the input pixels.

So, in this case this is the linear filtering. But if I consider this operation max, the max operation,
I am considering 9 pixels because k is equal to 1 to 9. And out of 9 pixels I am determining the
maximum pixel, that is the maximum valued pixel. Then in this case this is not the linear
filtering. This is the example of the non-linear filtering. Because I am just determining the
maximum value out of 9 pixels.

(Refer Slide Time: 25:15)

And already I have explained how to do the spatial filtering. You can see here, I have the input
image and this is the mask. And in the mask mainly I have to consider the neighborhood pixels.
And in this case how to determine the output pixel? So, I have to place the mask over the image
and after this I have to do the spatial convolution.

So, here you can see this expression here is nothing but the spatial convolution. Multiplication
and the sum up I have to do. Like this I have to determine the filtered image or maybe output
image you can determine.

(Refer Slide Time: 25:57)

One thing to note is that near the image border the mask falls outside the image. In this case you can either ignore the boundary pixels, or you can do zero padding at the boundary so that the mask can still be placed on the boundary pixels.

There are two solutions: one is to neglect the boundary pixels, that is, ignore the edges of the image, in which case the resultant image will be smaller than the original; the other is zero padding, in which case I will be getting unwanted artifacts at the border.

(Refer Slide Time: 26:39)

Here I have shown these two cases: one is zero padding and the other is neglecting the boundary pixels.

(Refer Slide Time: 26:47)

Now, in the case of linear filtering methods we have two operations: one is the correlation operation and the other is the convolution operation. So, what is the correlation operation, and how is correlation related to convolution?

(Refer Slide Time: 27:01)

Here I have shown the same masking operation. I am considering the image f(i, j) and the mask, that is, the kernel matrix w(i, j), and corresponding to this I am getting the output image g(i, j).

What I have to do is put the mask over the image, multiply the pixel values by the weights of the mask, and then add all of them up; that is nothing but multiplication and sum up. Using this expression, I have to do this operation for all the pixels of the image, shifting the mask to the next pixel each time so that I cover all the pixels. Here w(s, t) is the mask.

(Refer Slide Time: 28:10)

This is the correlation operation. One application of correlation is pattern matching: suppose I consider this pattern and I want to detect whether this pattern is present in the input image, which is this. If I find the correlation between the pattern and the image, then based on the correlation I can detect whether the particular pattern is present in the image or not.

In the second example also I consider a pattern, and by using correlation I can detect whether this pattern is present in the image or not. That is called pattern matching.

(Refer Slide Time: 28:56)

And what is convolution? In convolution, the mask is first flipped both horizontally and vertically, as you can see here. After this the operation is very similar to correlation; the only difference is this initial flipping of the mask.

But if I consider a symmetric mask, something like this, then convolution is equivalent to correlation. So, for a symmetric mask, convolution is equivalent to correlation.
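The difference between the two operations can be seen in the following sketch, where convolution is implemented simply as correlation with a mask that has been flipped horizontally and vertically (border handling by zero padding is an assumption):

import numpy as np

def correlate2d(image, w):
    # Correlation: slide the mask over the image and take the weighted sum.
    image = image.astype(np.float64)
    a, b = w.shape[0] // 2, w.shape[1] // 2
    padded = np.pad(image, ((a, a), (b, b)), mode="constant")
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + w.shape[0], j:j + w.shape[1]] * w)
    return out

def convolve2d(image, w):
    # Convolution: flip the mask both horizontally and vertically, then correlate.
    return correlate2d(image, w[::-1, ::-1])

# For a symmetric mask (such as the 3 x 3 averaging mask) the flipped mask is
# identical to the original, so convolution and correlation give the same result.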

(Refer Slide Time: 29:48)

Here I want to give one illustration of spatial filtering, and I am considering the low pass filter. You can see I am considering the original image and the mask, that is, the 3 by 3 mask for the averaging operation. If you see the values, they are all 1, and the mask is divided by 9.

That is because I am averaging over 9 pixels. So, this is the mask corresponding to the averaging filter, that is, the low pass filter. For handling the boundary pixels I am doing zero padding, as you can see. After this, I have to put the mask over the image and do the convolution.

(Refer Slide Time: 30:39)

You can see in the next slide that I put the mask over the image and determine the response of the mask corresponding to the center pixel, which is 7 here. Corresponding to this center pixel, the response of the mask is 8.4.

(Refer Slide Time: 31:01)

After this, I have to move the mask to the next pixel. The center pixel is now 9, and corresponding to this center pixel I determine the response of the mask, as you can see. Like this, I have to do this operation for all the pixels of the image.

(Refer Slide Time: 31:34)

So, like this you can see, again the mask is shifted and I have to determine the response of the
mask.

(Refer Slide Time: 31:48)

And again you can see, I am moving the mask to the next pixel. And I am determining the
response of the mask.

(Refer Slide Time: 32:04)

And like this for all the pixels of the image I have to do this operation. So, in this demonstration
you can see how to do the spatial filtering. And in this example I am only considering the low
pass filter.

(Refer Slide Time: 32:20)

So, this operation is very similar to the spatial convolution.

(Refer Slide Time: 32:33)

So, for all the pixels of the image I have to determine the response of the mask.

(Refer Slide Time: 32:45)

And finally I am getting this image.

(Refer Slide Time: 32:51)

So, you can see I have the original image and this is the image after spatial averaging that I am
getting. So, you can see the result of the averaging filter.

(Refer Slide Time: 33:05)

So, the averaging low pass filter reduces the noise, and if I consider a larger filtering window, the blurring will be more. In my previous example I considered only a 3 by 3 mask; with a larger filtering window the blurring will be more.

(Refer Slide Time: 33:34)

And in this case I have shown one example of the averaging filter, original image is there and I
am considering the noisy image, and I am having the filtered image. And this is the mask
corresponding to the low pass filter, that is the averaging filter.

(Refer Slide Time: 33:51)

If I consider a 7 by 7 mask, the blurring will be more, because with a 7 by 7 mask I am dividing by 49, whereas with a 3 by 3 mask I am dividing by 9. That is why the 7 by 7 averaging mask gives more blurring.

(Refer Slide Time: 34:14)

Here I am giving another example. This is the original image, I am applying a 3 by 3 averaging filter, and you can see the output, the smoothened image.

Similarly, in another example I have the original image, I am considering a 5 by 5 mask, and I am getting the smoothened image. In this case you can see that the blurring is more compared to the previous example.

(Refer Slide Time: 34:45)

Now I am considering the extraction of the brightest object in an image. My input image is this; I apply a 15 by 15 averaging filter and then an image thresholding technique. By this technique I can extract the largest and brightest object in the image. This is one application of the averaging filter.
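A hedged sketch of this pipeline is shown below; I use scipy's uniform_filter as the 15 by 15 averaging mask, and the threshold value is an arbitrary illustrative choice rather than the one used on the slide.

import numpy as np
from scipy import ndimage

def extract_brightest_region(image, threshold=200):
    # 15 x 15 averaging (blurring) followed by thresholding; the surviving
    # white region corresponds to the largest, brightest object.
    blurred = ndimage.uniform_filter(image.astype(np.float64), size=15)
    return (blurred >= threshold).astype(np.uint8) * 255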

(Refer Slide Time: 35:17)

The next filter I am considering is a smoothing filter, the Gaussian filter. In the case of a Gaussian filter, the weights are samples of the Gaussian function. I am considering the Gaussian function here, and corresponding to this Gaussian function I have the Gaussian mask; here I am considering a 7 by 7 Gaussian mask.

And what about the weights of the mask? The weights are nothing but samples of the Gaussian function: I have the Gaussian function, and I only have to put in the values of x and y to get the weights of the mask. The mask size depends on sigma, because if I take the height equal to the width equal to 5 sigma, the mask covers 98.76 percent of the area under the Gaussian. So, the mask size depends on sigma.

(Refer Slide Time: 36:15)

Corresponding to sigma equal to 3, I am considering a 15 by 15 Gaussian mask. The parameter sigma controls the amount of smoothing, that is, the amount of blurring. So, this is one example of a 15 by 15 Gaussian mask.
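A small sketch of how such a Gaussian mask can be generated from sigma is shown below; the rule that the mask width is about 5 sigma follows the slide, while the exact rounding to an odd size is my own choice.

import numpy as np

def gaussian_mask(sigma):
    # Weights are samples of the Gaussian function; the mask size is tied to sigma.
    size = int(5 * sigma)
    if size % 2 == 0:
        size += 1                          # keep the mask size odd
    half = size // 2
    x, y = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    g = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()                     # normalise so the weights sum to 1

mask = gaussian_mask(sigma=3.0)            # gives a 15 x 15 mask, as on the slide
print(mask.shape)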

(Refer Slide Time: 36:36)

And corresponding to the Gaussian mask, you can see I have the input image, I am considering
small sigma, then in this case it is limited smoothing. And if I consider large sigma, you can see
more blurring, the strong smoothing. So the blurring depends on sigma. This is called the
Gaussian blurring.

(Refer Slide Time: 37:01)

Again, I am showing another example. I am considering the Gaussian mask corresponding to this Gaussian function, and my input image is this. I am considering two cases, one with a large sigma and one with a small sigma. For the large sigma the blurring is more compared to the small sigma.

(Refer Slide Time: 37:27)

And you can see the difference between Gaussian smoothing and averaging smoothing: one result uses averaging and the other uses Gaussian smoothing, which is called Gaussian blur.

(Refer Slide Time: 37:40)

The next filter I want to consider is the high pass filter, which is used for sharpening edges. How do I develop a high pass filter? I have the original image f(m, n), and if I subtract the blurred image, that is, the low pass filtered image, from the original image, then I will be getting the sharpened image, which is nothing but the high frequency components.

So, the sharpened image is represented by f_High(m, n); it contains the high frequency components corresponding to the edges and the boundaries. This is the definition of the high pass filter: subtract the blurred image, that is, the averaged version of the image, from the original image, and you get the sharpened image. This operation is important for edge enhancement, since it highlights the edges.
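The following sketch implements this subtraction-based sharpening; the 3 by 3 averaging used to produce the blurred image is an illustrative choice.

import numpy as np
from scipy import ndimage

def highpass_by_subtraction(f, blur_size=3):
    # f_high(m, n) = f(m, n) - f_avg(m, n): original minus its blurred version,
    # leaving only the high frequency components (edges and boundaries).
    f = f.astype(np.float64)
    f_avg = ndimage.uniform_filter(f, size=blur_size)
    return f - f_avg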

(Refer Slide Time: 38:42)

Here I have shown the mask corresponding to the high pass filter. If I apply the first mask, nothing will happen and I will be getting the same image; that is why it is called the identity mask.

The high pass filter mask is the identity mask minus the mask corresponding to low pass filtering, whose entries are all 1/9. So the central coefficient of the high pass mask is 8/9 and the remaining coefficients are -1/9, -1/9, -1/9, and so on. That is how to get the high pass filter mask.

(Refer Slide Time: 39:26)

Now I apply this high pass filter mask. You can see here I am considering one input image, and in the input image there is a region of constant pixel values, 10, 10, 10, and so on.

What will be the response of the mask corresponding to this constant portion of the image? If you apply the convolution operation and determine the response of the mask, you can see that for that portion the response is 0. But here the pixel value suddenly changes from 10 to 80, and that corresponds to edges or boundaries.

So, if I apply the mask at that location, you can see the response of the mask is this. That means the high pass filter mask responds only at the high frequency locations.

(Refer Slide Time: 40:32)

Here I am showing one example. I am considering the original mask, which is nothing but the identity mask, and this is the low pass filter mask. If I subtract the low pass filter mask from the identity mask, I get the mask corresponding to the high pass filter. And if I apply this filter to the input image, then I will be getting the edges. That means I can do edge sharpening by using the high pass filter, because I am considering only the high frequency components.

(Refer Slide Time: 41:08)

Here, the first row shows the frequency domain responses corresponding to the low pass filter, the high pass filter, and the band pass filter, and the second row shows the corresponding responses in the spatial domain.

In the spatial domain, this is the response for the low pass filter, this is the response for the high pass filter, and this is the response for the band pass filter. As you already know, the central coefficient of the high pass mask is 8/9, so there is a maximum value at the centre; for the low pass case it is the averaging operation.

So, you can see the concept of the low pass filter, the high pass filter, and the band pass filter in the frequency domain and in the spatial domain: the first row is the frequency domain and the second row is the spatial domain.

(Refer Slide Time: 42:00)

The next filter is the high boost filter. Here I consider subtracting an unsharp (blurred) version of the image from a scaled original image. I take the averaged version of the image, that is, the low pass filtered image f_avg(m, n), the original image f(m, n), and a factor A, with A greater than 1.

The high boost output is A·f(m, n) − f_avg(m, n), which I can rewrite as (A − 1)·f(m, n) + [f(m, n) − f_avg(m, n)], that is, (A − 1)·f(m, n) + f_High(m, n), where f_High(m, n) is the high frequency image. If I put A equal to 1, the first term becomes 0 and I am getting just the high pass filtered output.

So, if I consider A equal to 1, then I will be getting only the high frequency components. If A equals 1.3, then putting this value in the expression, I have the high frequency information as well as some of the low frequency information. So, in the first output I do not have the low frequency information, but in the second case, because A equals 1.3, I am keeping some low frequency information along with the high frequency information.

If I consider A equal to 1.5, then I am keeping even more of the low frequency information along with the high frequency information. So, by applying the high boost filter I can retain both the low frequency information and the high frequency information: I am sharpening the edges and at the same time keeping the low frequency components.
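A minimal sketch of high boost filtering along these lines is given below; the values of A and of the averaging window are illustrative assumptions.

import numpy as np
from scipy import ndimage

def high_boost(f, A=1.3, blur_size=3):
    # g = A * f - f_avg = (A - 1) * f + f_high, with A > 1.
    # A = 1 gives the plain high pass output; larger A keeps more of the
    # low frequency (background) information along with the sharpened edges.
    f = f.astype(np.float64)
    f_avg = ndimage.uniform_filter(f, size=blur_size)
    return A * f - f_avg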

(Refer Slide Time: 44:31)

Here I am showing the same thing, the high boost filtering: f(m, n) is the input image, I am considering the factor A, that is, A times f(m, n), and I am using the averaged version of the image; this is nothing but the high boost filter. You can see this is my original image, scaled by the factor, combined with the blurred version of the image, and I am getting the output, which is this. That is high boost filtering.

(Refer Slide Time: 45:17)

You can see the difference between the high pass filter and the high boost filter. So, I am
considering the original image is this, the first output is the high pass filter image. I am having
the high pass filtered image. And second one is the high boost filter. In the high boost filter, you
can see I have the background information along with the edge information. So, in the high boost
filter I have the high frequency information as well as the low frequency information. But in case
of the high pass filtered image I have only the high frequency information, that means I have the
edges.

(Refer Slide Time: 45:51)

Here I show two examples. To the first image I am applying the high boost filter with A equal to 1.4, and corresponding to this I get the output image shown: I am getting the background information and at the same time sharpening the edges.

In the second example, this is the input image and I am considering A equal to 1.9; corresponding to A equal to 1.9, this is the output image. So, I am sharpening the edges and I also have the low frequency information.

(Refer Slide Time: 46:30)

The next filter is the median filter. The median filter is used to remove salt and pepper noise, which is mainly impulse noise: some pixel values are corrupted by impulses, where the high value of the noise may be 255 and the low value may be 0. It is something like an ON and OFF noise; that is the impulse noise I am considering.

In the median filter operation I have to consider a mask, and within this mask I consider all the neighborhood pixels. The central pixel value is then replaced by the median value; that concept I am going to explain. One important thing is that the median is a non-linear filter.

Why is it a non-linear filter? Because median(A·f1 + B·f2) is not equal to A·median(f1) + B·median(f2). So, you can see that the median is a non-linear filter. How to apply the median filter I will explain in the next slide.

(Refer Slide Time: 48:24)

This is one example of salt and pepper noise, which is nothing but impulse noise. Unlike Gaussian noise, which affects all the pixels of the image, salt and pepper noise affects only some pixels of the image; it is an ON and OFF noise. So, my input image is this and I am adding salt and pepper noise.

(Refer Slide Time: 48:56)

Again, my input image is this and I am adding salt and pepper noise. In the first case I am applying the averaging operation and in the second case I am applying the median filter operation. You can see the difference between the two: the median filter is very effective at removing salt and pepper noise.

(Refer Slide Time: 49:23)

And here also I have shown one example. The original image with noise and in first case I am
applying the averaging filter that is the output of the averaging filter, in the second case I am
considering the median filter. You can see the distinction between these two images, one is the
averaging filter another one is the median filter. So, I think the median filter gives better result as
compared to averaging filter.

(Refer Slide Time: 49:49)

And how do I apply the median filter? In this example I am considering one image whose pixel values are shown, and a 3 by 3 mask, so I have to consider the neighborhood values 115, 119, 120, and so on. For these values I have to determine the median: I arrange them in ascending (or descending) order and then pick the middle value. In this case the median value is 124. Like this I have to determine the median value.

(Refer Slide Time: 50:27)

In this example, suppose I consider the 3 by 3 mask and this pixel is affected by salt and pepper noise, because its value is 255. Corresponding to this 3 by 3 window the pixel values are 15, 17, 18, 20, 20, 20, 20, 20, 255. I arrange them in ascending order and find the median value, which is 20, so the 255 pixel value is replaced by 20. This is the concept of the median filter.
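Here is a simple sketch of the median filter; edge replication at the border is my own choice for handling the boundary pixels.

import numpy as np

def median_filter(image, size=3):
    # Each pixel is replaced by the median of its size x size neighbourhood,
    # which removes salt and pepper (impulse) noise such as isolated 0/255 pixels.
    k = size // 2
    padded = np.pad(np.asarray(image), k, mode="edge")
    out = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.median(padded[i:i + size, j:j + size])
    return out

# Example from the slide: in the window [15, 17, 18, 20, 20, 20, 20, 20, 255]
# the median is 20, so the 255 impulse is replaced by 20.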

(Refer Slide Time: 51:07)

Here I am showing one example. The first one is the input image, and then I add salt and pepper noise. After this I apply a 3 by 3 median filter, and corresponding to this I get this output. Next, I apply a 5 by 5 median filter, and corresponding to this I have this output.

(Refer Slide Time: 51:43)

Here also I have shown one example: the original image with noise, and the outputs of an N = 3 median filter and an N = 5 median filter. One problem of the median filter is its computational complexity, which is of the order of n², where the small n is the number of pixels in the window.

The computational complexity is high because I have to find the median value, which takes on the order of n² comparisons per window.

And if I consider an N by N image, then to apply the median filter over the entire image the computational complexity is N² times n². In this case, since sorting is needed, algorithms like quicksort can be applied to reduce the computational cost.

(Refer Slide Time: 53:09)

And again, here I have shown the example of the median filter the first one is the input image,
second one I am considering the salt and pepper noise. So all the pixels are now affected by the
salt and pepper noise, and after this I am applying the median filter. So, this is the output of the
median filter image.

(Refer Slide Time: 53:28)

Finally, I want to discuss another filter, the bilateral filter. Before discussing it, let me recall the earlier filters. The mean filter blurs the image and removes simple noise, but no details are preserved. The Gaussian filter blurs the image and preserves details only for small values of sigma; I have already explained the Gaussian filter.

After this I discussed the median filter, which preserves some details and is very good for removing strong noise. Now we want a filter that not only smooths the regions but also preserves edges; that means I need to keep the edge information while doing the smoothing. For this I have to apply bilateral filtering.

(Refer Slide Time: 54:33)

Bilateral filtering smooths images while preserving edges; these are the two considerations, smoothing and edge preservation. For this I have to consider two things: one is the geometric closeness, that is, the domain of the image.

The other is the photometric similarity between the neighborhood pixels. So, I take both into account: the geometric closeness between the neighborhood pixels and the photometric similarity between them. That means I am operating on both the domain and the range of the image.

(Refer Slide Time: 55:33)

As I have already explained, I have to consider geometric closeness: two pixels can be close to one another, that is, they occupy nearby spatial locations, which corresponds to the domain. And I also have to consider photometric similarity: two pixels can be similar to one another, that is, they have nearby values, which corresponds to the range of pixel values.

(Refer Slide Time: 56:03)

The desired property is a convolution-type filter that smooths the image but preserves edges; that means I am handling the edges and also doing the smoothing of the image. For this I have to operate in both the domain and the range: for the range I consider the photometric similarity, and for the domain I consider the geometric closeness.

(Refer Slide Time: 56:32)

So, you can see the output, one is the Gaussian filter output another one is the bilateral filter. In
case of the bilateral filter you can see the edges, I am having the edges and also the image is also
smoothed. So, combined domain and range filtering I have to do in the bilateral filtering.

(Refer Slide Time: 56:53)

The bilateral filter is defined like this: g(i, j) is the filtered image, I is the original input image, and w(i, j, k, l) is the weight. The weight is assigned using the spatial closeness, that is, the geometric closeness, and the intensity difference, that is, the photometric similarity.

A pixel located at (i, j) is denoised using its neighboring pixels located at (k, l), and the weight is defined as shown, with two parts. What is the meaning of the first part and of the second part? I will show you in the next slide.

(Refer Slide Time: 57:57)

So, this is the weight. The first part is the domain kernel, which accounts for the geometric closeness; the second part is the range kernel, which accounts for the photometric similarity.

Here you can see that I am measuring the photometric similarity between the pixels and also the geometric closeness between them. So, in bilateral filtering I am considering both the domain kernel and the range kernel, and if I combine these two I get the weight.
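To make the combined weight concrete, here is a hedged Python sketch of a bilateral filter; the window half-width and the sigma_d, sigma_r values are illustrative parameter choices, not values prescribed by the lecture.

import numpy as np

def bilateral_filter(image, sigma_d=3.0, sigma_r=30.0, half=5):
    image = image.astype(np.float64)
    # Domain kernel: depends only on the spatial offsets (geometric closeness).
    x, y = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    domain = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma_d ** 2))
    padded = np.pad(image, half, mode="edge")
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            window = padded[i:i + 2 * half + 1, j:j + 2 * half + 1]
            # Range kernel: depends on the intensity difference (photometric similarity).
            range_k = np.exp(-((window - image[i, j]) ** 2) / (2.0 * sigma_r ** 2))
            w = domain * range_k
            out[i, j] = np.sum(w * window) / np.sum(w)
    return out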

(Refer Slide Time: 58:45)

Here you can see bilateral filtering with various domain and range parameter values. I am considering sigma_d, the domain parameter, and sigma_r, the range parameter; for different parameter values you can see the output images. That means both range and domain filtering are used in the bilateral filter.

So, in this class today I have discussed spatial filtering. First I discussed the concept of masking, then the low pass filter, the high pass filter, and the high boost filter. After this I discussed one non-linear filter, the median filter, and finally the concept of bilateral filtering.

In bilateral filtering we considered domain filtering and range filtering. So, this is about spatial filtering. In my next class I will discuss the concept of frequency domain filtering, in which I have to modify the Fourier Transform of the image. That concept I am going to discuss in the next class. So, let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering,
Indian Institute of Technology, Guwahati
Lecture – 17
Image Filtering
Welcome to NPTEL MOOCs course on Computer Vision and Image Processing: Fundamentals and Applications. In my last class I discussed the concept of spatial filtering: the low pass filter, the high pass filter, and the high boost filter. I also discussed one non-linear filter, the median filter, which can be used to remove impulse noise, also called salt and pepper noise.

Also I discussed one important concept the filter is bilateral filter. So, in the bilateral filtering we
consider range and the domain of an image. Today I am going to discuss the concept of image
filtering in frequency domain. In frequency domain what we have to do, we have to modify the
Fourier Transform of the image. So, for this we have filters like the low pass filter, high pass
filter, maybe the band pass filter and band stop filter. And this concept I am going to discuss
today, the concept of the frequency domain filtering.

So, for this the first step is we have to take the Fourier Transform of the image. So, before
determining the Fourier Transform of an images I have to do some preprocessing, that means I
have to multiply the image by minus 1 to the power x plus y. So, this step I am going to explain
now. So, what is the concept of the frequency domain image filtering?

(Refer Slide Time: 01:58)

So, I will discuss the concept of the frequency domain filtering.

(Refer Slide Time: 02:02)

I have already explained this concept. F(u, v) is the Fourier Transform of an image, and it has two parts: the real part R(u, v) and the imaginary part I(u, v).

This Fourier Transform can be represented in polar coordinates, with a magnitude component and a phase angle component. The magnitude is |F(u, v)| = sqrt(R(u, v)² + I(u, v)²), obtained from the real and imaginary components.

And I can also determine the phase angle, φ(u, v) = tan⁻¹[I(u, v) / R(u, v)], that is, the imaginary part divided by the real part. Here u and v are the spatial frequencies along the x direction and the y direction respectively.
see the Fourier Transform I am considering, and I am considering the magnitude of the Fourier
Transformation and the phase angle of the Fourier Transform.

(Refer Slide Time: 03:18)

For the preprocessing: the image is f(x, y) and its Fourier Transform is F(u, v), and the input image is multiplied by (−1)^(x+y). What is the objective of this preprocessing step?

It shifts the center of the Fourier Transform to the point (M/2, N/2); that is, the coordinates of the center of the Fourier Transform become (M/2, N/2).

So, for this shifting I have to multiply the image by (−1)^(x+y), and after this I determine the Fourier Transform of the image. This is called the preprocessing: first do the preprocessing and then determine the Fourier Transform of the image. And for displaying the Fourier Transform, as I have already explained, I have to apply the log transformation in order to compress the dynamic range of the image.

(Refer Slide Time: 04:39)

One important property of the Fourier Transform of a real image is the symmetry property: F(u, v) = F*(−u, −v), the conjugate symmetry property. Also, the magnitude of the Fourier Transform is symmetric. That is why the displayed Fourier spectrum looks symmetric like this: the magnitude is symmetric because the transform is conjugate symmetric.

(Refer Slide Time: 05:25)

And in this case, I have already explained this one in one of my classes. That is if I take the
Fourier Transform of the input image, my input image is this and I am determining the Fourier
Transform of the image, so corresponding to this I am getting the Fourier Transform of this. And
in this case, if I want to reconstruct the original image, I have to apply the inverse Fourier
Transformation.

So, in this case this is the Fourier Transform of the image, and after this I am applying the
inverse Fourier Transformation to get back the image, that is the reconstructed image. And in
this case the perfect reconstruction is possible, because I am considering all the frequency
information presented in the image.

In the Fourier spectrum, the central portion corresponds to the low frequency information, and the outside portion corresponds to the high frequency information.

Since I am considering all the frequency information for reconstruction, perfect reconstruction is possible in this case.

(Refer Slide Time: 06:48)

In this example you can see I am considering only the central portion of the Fourier
Transformation, that corresponds to the low frequency information. And if I apply the inverse
Fourier Transformation, then you can see the reconstructed image. So, this is the reconstructed
image.

And in the second example what I am considering, I am neglecting the low frequency
information, that means the central portion I am not considering, I am considering only the outer
portion of the Fourier Spectrum. That means I am considering the high frequency information.
And if I determine the inverse Fourier Transformation, then in this case this is my reconstructed
image.

So, in the first case, because I keep only the low frequency components, the result gives the general appearance of the image. In the second case I am keeping the high frequency information, which corresponds to the fine details of the image.

Fine details means things like the edges and boundaries, the fine information present in the image.

(Refer Slide Time: 08:05)

So, here I am explaining this one. The central part of the Fourier Transform, that is the low
frequency components are responsible for the general gray level appearance of the image. On the
other hand, the high frequency components of the Fourier Transform are responsible for the
detail information of an image. So, this concept already I have explained.

Now in case of the frequency domain filtering I can consider the filters like this, suppose if I
consider a low pass filter, for low pass filter I have to select this portion only. I have to neglect
the outer portion of the spectrum. Then in this case I will be getting the low pass filter image.
And if I consider high pass filter, then I have to neglect the central portion of the Fourier
Transformation, so this portion I am neglecting and I am considering the outer portion of the
Fourier Transformation.

So, that means I can consider the high pass filter and the low pass filter based on this concept.

(Refer Slide Time: 09:13)

Here, I am showing one example. I have shown one image, that is the image is this, input image.
And corresponding to this image I have the Fourier Transform, this is my Fourier Transform.
And if I consider this circle, I have shown a circle, the red circle. And if I consider this portion
only that means I am only considering the low frequency information, that gives general
appearance of the image.

And if I consider the outer portion, the outside portion like this, if I consider this portion or this
portion or this portion, it gives the fine details of the image. Like the edges boundaries or a fine
information of the image.

(Refer Slide Time: 09:58)

And in this case, again I am showing here what I am considering. I am considering the Fourier
Transformation and I have to reconstruct the image by using the inverse Fourier Transformation.
In the first case you can see only I am considering this portion of the Fourier Transform. That
means I am only considering the low frequency information and corresponding to this, this is my
reconstructed image.

After this if you see the second case, I am considering this portion of the Fourier Transformation,
that means I am increasing the size. That means I am only considering the low frequency
information or maybe some high frequency information also I am considering maybe. And
corresponding to this one my reconstructed image will be like this.

In the third case I am increasing this portion, if you see, that means I am considering low
frequency information and maybe some high frequency information. And considering this, if I
determine the inverse Fourier Transformation, then I will be getting the reconstructed image like
this. And in the final case, you can see I am considering the big portion here.

So, this portion I am considering that means I am considering the low frequency information and
the high frequency information. Not all the high frequency information but significant high
frequency information I am considering. And corresponding to this, this is my reconstructed
image. So, you can see the concept of the filtering, the frequency domain filtering.

So, based on the selected portion of the Fourier Transform (the Fourier spectrum) you can reconstruct the image: you can keep the low frequency information or the high frequency information, depending on your requirement.

(Refer Slide Time: 11:53)

In frequency domain filtering, edges and sharp transitions correspond mainly to the high frequency content of the Fourier Transform, while the low frequency content is responsible for the general appearance of the image. Blurring or smoothing is achieved by attenuating a range of high frequency components of the Fourier Transform; that is, if I keep only the low frequency information, I get blurring or smoothing.

(Refer Slide Time: 12:30)

This is the typical frequency domain filtering pipeline. First I take the input image and do the preprocessing, which is multiplying the image by (−1)^(x+y). Then I take the Fourier Transform of the image, F(u, v). After this I consider the filter transfer function H(u, v), which may be a low pass filter or a high pass filter.

The filter function H(u, v) is multiplied with F(u, v), the Fourier Transform of the image. After this, I take the inverse Fourier Transform to come back to the spatial domain, and in the post-processing I again multiply the result by (−1)^(x+y). Then I get the output image g(x, y), the enhanced image.
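A compact sketch of this pipeline with NumPy's FFT routines is shown below; it follows the slide's (−1)^(x+y) centring step, and H is assumed to be a centred filter array of the same shape as the image.

import numpy as np

def frequency_domain_filter(f, H):
    f = f.astype(np.float64)
    M, N = f.shape
    x, y = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    centering = (-1.0) ** (x + y)        # preprocessing: shifts F(0,0) to (M/2, N/2)
    F = np.fft.fft2(f * centering)       # centred Fourier Transform F(u, v)
    G = H * F                            # filtering = multiplication by H(u, v)
    g = np.real(np.fft.ifft2(G))         # inverse Fourier Transform
    return g * centering                 # post-processing: multiply by (-1)^(x+y)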

(Refer Slide Time: 13:46)

And this concept I am showing here. So, I have the input image, in the first diagram if you see in
the first figure, I have the input image. I am taking the Fourier Transform of the image,
corresponding to this I am getting the Fourier Spectrum. And I have shown the reconstructed
image by considering the inverse Fourier Transformation. So, here I am showing the inverse
Fourier Transformation.

And in the second figure if you see here, I am only considering the central portion of the Fourier
Transformation and I am reconstructing the image based on the central portion of the Fourier
Transformation. Then in this case I will be getting the blurred image. That is nothing but the low
pass filtered image.

In the next case, I am neglecting the central portion of the Fourier Transform and keeping only the outside portion, and then applying the inverse Fourier Transform for reconstruction. In this case I get the edges and boundaries, the fine details of the image. From this you can understand the concept of low pass filtering and high pass filtering.

(Refer Slide Time: 15:06)

Now, the convolution operation in the spatial domain can be represented like this: g(x, y) is the output image, f(x, y) is the input image, and I am taking the convolution of f(x, y) with h(x, y).

So, the convolution operation is g(x, y) = f(x, y) * h(x, y), where f(x, y) is the input image, g(x, y) is the output filtered image, and h(x, y) is the impulse response.
in spatial domain.

(Refer Slide Time: 15:56)

You already know the convolution theorem from signal processing: convolution in the spatial (or time) domain is equivalent to multiplication in the frequency domain.

That means in frequency domain filtering I do a multiplication instead of a convolution. Here F(u, v) is the Fourier Transform of the input image and H(u, v) is the filter transfer function, so I only have to multiply the two: G(u, v) = H(u, v) F(u, v).

(Refer Slide Time: 16:40)

Here I have shown an example of the Gaussian low pass filter in the frequency domain, and the corresponding function in the spatial domain. In the spatial domain it is something like averaging: the filter coefficients, or weights, of the averaging low pass filter are 1/9, 1/9, 1/9, and so on.

And this is the Gaussian high pass filter in the frequency domain, with the corresponding spatial domain function h(x) for the high pass filter. The central portion has a high value, because in the spatial domain the high pass filter mask has a central coefficient of 8/9 and the remaining coefficients are −1/9, −1/9, and so on. That is why there is a peak at the centre in the spatial domain. So, you can see the response of the filters in the frequency domain and in the spatial domain.

(Refer Slide Time: 18:11)

Now, first I am considering the ideal low pass filter. In my discussion I will consider the ideal filter, the Butterworth filter, and the Gaussian filter. The ideal filter gives sharp filtering.

The Gaussian filter, on the other hand, gives smooth filtering. In the case of the ideal filter the transition from the pass band to the stop band is very sharp, which is why it gives sharp filtering, whereas the Gaussian filter gives a smooth transition. In the Butterworth filter we have a parameter, the filter order.

For high order values the Butterworth filter approaches the ideal filter; for lower order values it is more like a Gaussian filter. So, the Butterworth filter may be viewed as providing a transition between two extremes, the ideal low pass filter and the Gaussian low pass filter. Now, corresponding to the ideal low pass filter in the frequency domain, I am defining the filter function H(u, v).

Already I have explained what u is: u is the spatial frequency in the x direction and v is the
spatial frequency in the y direction. And corresponding to this I am considering D naught, a
positive constant called the cutoff frequency. In this case, if D (u, v) is less than or equal to
D naught, the response will be 1; if D (u, v) is greater than D naught, the response will be 0.

Now, what is D (u, v)? D (u, v) is the distance between a point (u, v) in the frequency domain and
the center of the frequency rectangle; the center of the Fourier Transform is (M by 2, N by 2).
And D naught is the cutoff frequency, a positive constant. So, this is the definition of the ideal
low pass filter.
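
As an illustrative sketch (not from the lecture), the ideal low pass transfer function can be built on a centred frequency grid like this:

```python
import numpy as np

def distance_grid(M, N):
    """D(u, v): distance of each frequency sample from the centre (M/2, N/2)."""
    u = np.arange(M) - M / 2
    v = np.arange(N) - N / 2
    V, U = np.meshgrid(v, u)          # U varies along rows, V along columns
    return np.sqrt(U ** 2 + V ** 2)

def ideal_lowpass(M, N, D0):
    """H(u, v) = 1 inside the circle of radius D0, 0 outside."""
    return (distance_grid(M, N) <= D0).astype(float)
```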

(Refer Slide Time: 21:38)

Corresponding to this filter transfer function I am showing the ideal low pass filter. So, if you see
the H u v, that is the filter function, that is very similar to the box function. And in the second
case I am showing the filter as an image. The filter as an image I am showing. And in the final
case, if you see the last case, I am showing the filtered response, that is the low pass filter. And D
naught is the cutoff frequency. And already I have explained that D naught is a positive constant.

So, in this case I am considering this circle, and the radius is D naught here. All the frequencies
on or inside the circle of radius D naught are passed without attenuation, whereas all frequencies
outside the circle are completely attenuated. And in this case you can see the ideal low pass
filter is symmetric about the origin. So, this is the meaning of D naught.

So, if you see the previous expression for the filter function, you can see D u v is less than equal
to D naught, if this condition is satisfied then H u v will be 1. D u v is greater than D naught,
then H u v will be 0. That means all the frequencies on or inside the circle of radius D naught are
passed without attenuation. Whereas, all frequencies outside the circle are completely attenuated.
That is the meaning of this expression.

And in this case, if you see, corresponding to this rectangular function in the frequency domain,
I have the corresponding function in the spatial domain, which is a sinc function. So, this is the
concept of the ideal low pass filter in the frequency domain and in the spatial domain.

(Refer Slide Time: 23:45)

Now, in this case I am showing the example of the ideal low pass filter. And in this case you see
my input image is this, the first one is the input image. After this what I am considering? I am
considering the ideal low pass filtering with cutoff frequency D naught, for radius values of 5,
15, 30, 80, and 230. If the radius value is 5, that means I am considering a very small circle.

Corresponding to this, this is my output image. That means this is blurred image. Because I am
considering only the low frequency information. And suppose if I consider suppose the cutoff
frequency, the radius is 15, corresponding to 15 my reconstructed output image will be like this.

Corresponding to 30, that means my output will be something like this. That means I am
continuously selecting the low frequency information and also some high frequency information.
That means I am increasing the size of the circle. So, this is the circle here, I am increasing the
size of the circle like this. And in this case you can see, if I consider the radius is 5 only the
image will be completely blurred.

And if I consider, suppose, the radius is something like 80, then in this case the blurring will be
less. But if I consider, suppose, the radius is 30, you can see some ring-like effects; these are
called the ringing artifacts. You can see the ringing and the blurring. The blurring
depends on the size of the radius. That means it depends on the cutoff frequency. The cutoff
frequency is D naught.

(Refer Slide Time: 25:43)

Now, I will consider the concept of the Butterworth low pass filter. And corresponding to the
Butterworth low pass filter you can see the transfer function, H (u, v) = 1 / [1 + (D (u, v) / D
naught)^(2n)], where n is the filter order. And in this case I have shown the response for
different orders of the Butterworth filter: n equal to 1, 2, 3, 4, like this.

That means, as I increase the order of the Butterworth filter, it approaches the ideal low pass
filter. In the first case I have shown the Butterworth filter transfer function. So, this is
the Butterworth filter transfer function. The second one is the filter displayed as an image. So,
this is the Butterworth filter displayed as an image. And in this case I have shown the filter radial
cross sections of order 1 to 4.

And in case of the Butterworth filter, if I considered suppose low order, then in this case there is
no sharp discontinuity, no clear cutoff frequency for the low order Butterworth filter. That means
if I consider a high order Butterworth filter, it approaches the ideal low pass filter.
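
A corresponding sketch for the Butterworth low pass transfer function (illustrative only), reusing the distance_grid helper from the earlier ideal low pass sketch:

```python
import numpy as np

def butterworth_lowpass(M, N, D0, n):
    """H(u, v) = 1 / (1 + (D(u, v)/D0)^(2n)); a large order n approaches the ideal filter."""
    D = distance_grid(M, N)               # helper defined in the earlier sketch
    return 1.0 / (1.0 + (D / D0) ** (2 * n))
```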

(Refer Slide Time: 27:05)

So, in this case you can see I have shown the Butterworth low pass filter that is displayed as an
image. So in this case you can see, H u v is the filter transfer function of the Butterworth filter.
Now, corresponding to frequency domain, my response in the spatial domain will be sinc
function. So, in the case of the Butterworth low pass filter the spread of the sinc function is
inversely proportional to the radius of H u v.

That means, if I consider the radius D naught, the spread of the sinc function in the spatial
domain is inversely proportional to the radius of H (u, v). So, if D naught is high, then the sinc
function approaches the impulse function. So, this is my impulse.

I repeat this. The cross section of the low pass filter, the ideal low pass filter in the frequency
domain looks like a box filter. But in the spatial domain it would be sinc function. So, here I
have shown the sinc function. And in case of the filtering in the spatial domain, is done by
convolving h x y with the image. And in this case, each pixel in the image is like a discrete
impulse.

So, if I considered a image, each pixel in the image is like a discrete impulse and convolving a
sinc function with an impulse copies the sinc at that location of the impulse. I am repeating this,
that means convolving a sinc function with an impulse, impulse means the pixel value, copies the
sinc at that location of the impulse.

Now, in case of a sinc function, the central lobe of the sinc function is responsible for blurring.
So, I have a central lobe, this is a central lobe, that is responsible for blurring. And in this case if
I consider outer lobes, small lobes, they are responsible for ringing artifacts. So, this central
portion, central lobe is responsible for the blurring and the side lobes, the small lobes are
responsible for the ringing artifacts.

And in this case already I have explained the spread of the sinc function is inversely proportional
to the radius of H u v, that means the larger D naught becomes the more the spatial sinc function,
approaches an impulse. So, that means if I considered the high value of D naught, then the sinc
function approaches to the impulse, the impulse is this.

Then in this case if I considered the impulse in spatial domain, that means there will be no
blurring. Corresponding to this case I have the blurring because of the central lobe of the sinc
function. Also, I have the ringing artifacts.

(Refer Slide Time: 30:28)

So, in this case I have shown the example here. Original image I have shown, and I am
considering the Butterworth low pass filter. The order I am considering n is equal to 2, and I am
changing the cutoff frequency, the cutoff frequency is D naught I am changing. The first one is 5,
next one is 15, 30, 80, 230 like this. And in this case if you see, if a cutoff frequency is less the
blurring will be more. Because of the central lobe of the sinc function.

And in this case, if I consider a somewhat higher value of D naught, suppose D naught is 15,
corresponding to this I have the ringing artifacts because of the side lobes of the sinc function.
And if I consider a very high value, D naught equal to 230, then the sinc function is converted
into an impulse, so I do not have the blurring effect. The ringing will also be less because of the
smoother transition from the pass band to the stop band. So, you can understand the concept of
the ringing artifacts and also the concept of blurring.

(Refer Slide Time: 31:46)

And after this I will discuss the concept of the Gaussian low pass filter. And in case of this, I am
getting the smooth transition from the pass band to the stop band. And in this case, smooth
impulse response and the ringing artifacts will be less in case of the Gaussian filter. So, first one
is, I have shown the Gaussian low pass filter transfer function, the second one is the filter
displayed as an image. And after this I am showing the Gaussian low pass filter for different
values of D naught. D naught is equal to 10, 20, 30,40, 100 like this. For different values of D
naught I am showing the filter radial cross sections for values of D naught.
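
And a sketch for the Gaussian low pass transfer function (illustrative only), again reusing the distance_grid helper:

```python
import numpy as np

def gaussian_lowpass(M, N, D0):
    """H(u, v) = exp(-D(u, v)^2 / (2 D0^2)); no sharp cutoff, hence no ringing."""
    D = distance_grid(M, N)               # helper from the earlier sketch
    return np.exp(-(D ** 2) / (2.0 * D0 ** 2))
```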

(Refer Slide Time: 32:34)

And in this case I have shown the Gaussian low pass filter in the frequency domain, and after this
I have shown the corresponding spatial domain representation of the Gaussian low pass filter.

(Refer Slide Time: 32:45)

And corresponding to the Gaussian low pass filter I have the outputs. First I am considering the
original input image, and I am considering D naught values of 5, 15, 30, 85, and 230. You can
see the blurring and the ringing artifacts. In this case there is less ringing than with the
Butterworth low pass filter, but also less smoothing.

So, I have shown the comparison between the ideal low pass filter, the Butterworth filter and also
the Gaussian low pass filter. And in this case, one important concept that you have to understand
is the concept of ringing artifacts and blurring.

(Refer Slide Time: 33:35)

And in this case I have shown the ideal low pass filter response corresponding to D naught is
equal to 15, you can see the blurring and the ringing artifacts. And in the next case I am
considering the Butterworth low pass filter, the order is 2, n is equal to 2; and D naught is 15,
again 15. You can see the ringing artifacts will be less. And in this case of the Gaussian, D
naught is again 15, less blurring but ringing artifacts will be less. So, you can see the comparison
between the ideal low pass filter, the Butterworth low pass filter and also the Gaussian low pass
filter.

(Refer Slide Time: 34:19)

Next I am showing one example, one practical example. A low pass Gaussian filter is used to
connect broken text. So, in the first figure if you see, I have shown the broken text here and if I
apply the Gaussian filter, the low pass filter, then you can see the output that in this case we can
connect the broken text because of the blurring. Because the Gaussian filter produces the
blurring of the image. So, that is why the broken text is connected.

(Refer Slide Time: 34:57)

Again I am showing the concept of the low pass filtering. My input image is this and
corresponding to this my Fourier Transform is this. The original image and its Fourier
Transformation. In the next case I am showing the reconstructed image, in the second case what I
am considering, I am only considering the central portion of the Fourier Transformation. And
this is the reconstructed image. That is the filtered image and corresponding Fourier
Transformation. So, you can see the output image and the input image. And you can do the
comparison between the input image and the output image visually.

(Refer Slide Time: 35:36)

Next is the concept of the high pass filter. So, what is a high pass filter? The high pass filter is
nothing but 1 minus the low pass filter. In spatial domain filtering also I have explained how to
get the high pass filter. In this case also I have three cases: the ideal high pass filter, the
Butterworth high pass filter, and the Gaussian high pass filter.

And corresponding to the ideal high pass filter you can see the filter function: H (u, v) is equal to
1 if D (u, v) is greater than D naught, and equal to 0 if D (u, v) is less than or equal to D naught,
where D naught is the cutoff frequency, a positive constant. For the Butterworth filter also I have
to consider a parameter, the order of the filter n. And for the Gaussian high pass filter you can
see this expression for the transfer function H (u, v).
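
Since each high pass filter is just 1 minus the matching low pass filter, a short sketch follows directly from the earlier low pass functions (illustrative values shown in the comment are assumptions):

```python
def highpass_from_lowpass(H_lp):
    """Any high pass transfer function is 1 minus the matching low pass one."""
    return 1.0 - H_lp

# Example (512 x 512 grid, D0 = 30, order n = 2 are assumed values):
# H_hp = highpass_from_lowpass(butterworth_lowpass(512, 512, D0=30, n=2))
```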

(Refer Slide Time: 36:43)

So, for finding the fine details of the image, like edges and boundaries, I can consider the ideal
high pass filter; the high pass filter is nothing but 1 minus the low pass filter. So, by using the
high pass filter we can see the fine details of the image, like the edges and the boundaries.

(Refer Slide Time: 37:00)

So, like the ideal low pass filter, again I am considering the ideal high pass filter, and in this case
this is the condition for the filter transfer function; I have shown H (u, v). In the case of the high
pass filter we have only considered the outer portion of the Fourier Transformation; this black
portion, if you see it, is not considered. And corresponding to this you can see the response of
the high pass filter, and D naught is the cutoff frequency in this case.

(Refer Slide Time: 37:36)

So, in this case you can see the ringing artifacts corresponding to the ideal high pass filter as the
transition from the pass band to the stop band is very sharp. So, that is why I have the ringing
artifacts. And if you increase the cutoff frequency, the cutoff is D naught, then in this case you
can see the ringing artifacts will be less. This ringing artifact concept already I have explained.

(Refer Slide Time: 38:03)

And after this the concept of the Butterworth filter, so you can change the order of the
Butterworth filter, n is the order of the Butterworth filter and D naught is the cutoff distance as
explained earlier. So, corresponding to this Butterworth filter, you can see the filter function H u
v. The filter is displayed as an image and corresponding to this the response of the filter you can
see. So, this is the Butterworth high pass filter.

(Refer Slide Time: 38:32)

And corresponding to the Butterworth high pass filter, you can see the outputs. The first one is
the input image; the next one is the Butterworth filter of order 2 with D naught equal to 15.
Corresponding to D naught equal to 15 you can see ringing artifacts. But if I increase D naught
to 30 or to 80, then the ringing artifacts will be less.

But if I consider D naught very high, then in this case I will be getting a blurred-looking image,
because in that case I am considering the high frequency information as well as some low
frequency information. That is why the image appears blurred corresponding to D naught equal
to 80.

(Refer Slide Time: 39:20)

And finally I want to show the Gaussian high pass filter. So, the Gaussian high pass filter is
defined like this and D naught is the cutoff distance as discussed earlier. So, corresponding to the
Gaussian high pass filter I have the filter function and the filter is displayed as an image and I
have the response of the Gaussian high pass filter. Like this.

(Refer Slide Time: 39:43)

And corresponding to the Gaussian high pass filter you can see the outputs. The first one is the
input image, and for different values of D naught you can see the outputs. The order is, I am
considering, n is equal to 2 here. So, if you see corresponding to D naught is equal to 15, the first
is this. D naught is equal to 30, you can see the output. D naught is equal to 80, you can see the
output. If I increase D naught, what will happen, that means I am considering high frequency
information also some low frequency information. So, that is why the image will be blurred.

(Refer Slide Time: 40:24)

So, here I have shown one example of the Gaussian high pass filter. I have shown the original
image, and after this I am considering the Gaussian high pass filter. Corresponding to this filter
you can see the output: I am getting the edges, so you can see the fine details of the image. And
you can also see the corresponding Fourier Transformation of this image. So, this is the Gaussian
high pass filter.

(Refer Slide Time: 40:55)

And in this case I have given one example of high pass filtering. The original image is shown
first; the next one is the high pass filtering output. After this, the high frequency emphasis result
is shown, and after histogram equalization I will be getting the output something like this. So,
finally I am applying the histogram equalization technique to improve the contrast of the image.
This is one example of high pass filtering.

(Refer Slide Time: 41:26)

The next one is the Laplacian in the frequency domain. This Laplacian I will explain further when
I discuss the concept of edge detection. The Laplacian in the spatial domain is represented as the
second order derivative of the image f. The first order derivatives are del f by del x and del f by
del y, that is, the gradients of the image along the x direction and the y direction.

And what about the second order derivative? The second order derivatives are del squared f by
del x squared and del squared f by del y squared, and their sum is nothing but the Laplacian of
the image. This Laplacian I will explain later on. Corresponding to this Laplacian, if I take the
Fourier Transform of the Laplacian, I will be getting minus (u squared plus v squared) times
F (u, v), where F (u, v) is the Fourier Transform of the input image. Then H1 (u, v) is nothing
but minus (u squared plus v squared). So, you can see how to represent the Laplacian in the
spatial domain and in the frequency domain. So, this is the Laplacian operator.
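
Written compactly, the relations just described are:

```latex
\nabla^2 f(x,y) = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2},
\qquad
\mathcal{F}\{\nabla^2 f(x,y)\} = -(u^2 + v^2)\, F(u,v),
\qquad
H_1(u,v) = -(u^2 + v^2).
```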

(Refer Slide Time: 42:54)

And corresponding to the Laplacian operator, the first figure if you see, Laplacian in the
frequency domain, you can see this, the Laplacian will be something like this. So, by using the
Laplacian you can determine the location of the edge pixel. I will explain this concept later on
how to determine the location of the edge pixel by using the Laplacian.

And next figure if you see, the 2D image of Laplacian in the frequency domain. So, in the
frequency domain the Laplacian will be something like this. And the third figure if you see,
inverse DFT of the Laplacian in the frequency domain. In the spatial domain, corresponding to
the Laplacian operator, the mask is 0, 1, 0; 1, -4, 1; 0, 1, 0.

So, if you see the central position of the mask, the mask value is minus 4. Corresponding to this,
the spatial domain response will be something like this, because the central point of the
Laplacian mask is minus 4.

(Refer Slide Time: 44:16)

So, how to do image enhancement by using the Laplacian operator? In the spatial domain,
suppose I have the original image f (x, y); if I take the Laplacian of the image, that means I can
select the high frequency component, and this high frequency information is subtracted from the
original image.

And in the frequency domain this can be represented as G (u, v) = F (u, v) + (u squared plus v
squared) F (u, v), where the second term is the Laplacian I am considering in the frequency
domain. And I have the new operator H2 (u, v), that is nothing but 1 + (u squared plus v
squared), which is equal to 1 minus H1 (u, v), where H1 (u, v) is the Laplacian.
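
A small illustrative sketch of this sharpening done in the frequency domain follows; normalising the frequency grid to [-0.5, 0.5) is my own practical choice (so that the Laplacian term stays comparable to the image term), not something from the lecture:

```python
import numpy as np

def laplacian_sharpen(f):
    """Sharpen f via g = f - laplacian(f), i.e. G(u,v) = [1 + (u^2 + v^2)] F(u,v)."""
    M, N = f.shape
    u = (np.arange(M) - M / 2) / M        # normalised, centred frequency axes
    v = (np.arange(N) - N / 2) / N
    V, U = np.meshgrid(v, u)
    H2 = 1.0 + (U ** 2 + V ** 2)          # H2(u, v) = 1 - H1(u, v)
    F = np.fft.fftshift(np.fft.fft2(f))
    return np.real(np.fft.ifft2(np.fft.ifftshift(H2 * F)))
```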

(Refer Slide Time: 45:05)

So, in this case I am showing one example of the Laplacian, that is, how to apply the Laplacian
to an image. You can see the first one is the original image; the second one is the Laplacian
filtered image, that means I am determining the edges by using the Laplacian. The next one is
the scaled Laplacian image, and after this I am subtracting the Laplacian from the original image.

I am getting the enhanced image, so you can see the enhanced image here, this is enhanced
image. So, by using the Laplacian you can enhance the image. The Laplacian can be
implemented in frequency domain or in spatial domain.

(Refer Slide Time: 45:47)

And after this, I am considering the band stop filter and the band pass filter. In this case the
transfer function of the band stop filter can be represented like this; again we have the concept of
the cutoff frequency D naught, and W is the width of the band. And you can see the band pass
filter: the band pass filter is 1 minus the band stop filter.

And in this case, as I will explain later, if I take the difference between two Gaussians, then I
will be getting a band pass filter. The difference of two Gaussian functions gives the band pass
operation.
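
A short sketch of these filters (reusing the helpers from the earlier low pass sketches); the ideal band-stop form shown here is one common variant and may differ in detail from the expression on the slide:

```python
import numpy as np

def ideal_bandstop(M, N, D0, W):
    """Reject frequencies whose distance D from the centre lies within W/2 of D0."""
    D = distance_grid(M, N)               # helper from the earlier sketch
    return 1.0 - ((D >= D0 - W / 2) & (D <= D0 + W / 2)).astype(float)

def bandpass_from_bandstop(H_bs):
    """The band pass filter is simply 1 minus the band stop filter."""
    return 1.0 - H_bs

def dog_bandpass(M, N, D_low, D_high):
    """Difference of two Gaussian low pass filters (D_high > D_low) gives a band pass response."""
    return gaussian_lowpass(M, N, D_high) - gaussian_lowpass(M, N, D_low)
```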

(Refer Slide Time: 46:31)

After this I am discussing another filter, that is, homomorphic filtering. This filtering technique
can be used to improve the appearance of an image by simultaneous intensity range compression
and contrast enhancement. The main concept here is that the optical image consists of two
primary components: one is the lighting (illumination) component and the other is the reflectance
component.

So, I can consider these two components: one is the lighting component, that is nothing but the
illumination, and f (x, y) is the image. The other component is the reflectance. The reflectance is
nothing but the albedo, which I explained in my, I think, third or fourth class. So, f (x, y) is the
image and I am considering two components, one is the illumination component and the other is
the reflectance component.

The lighting component corresponds to the lighting condition of a scene; that means this
component changes with the lighting conditions. That is the incident illumination. The
reflectance component results from the way the objects in the image reflect light; that is the
albedo of the surface. It is an intrinsic property of the object itself, and normally it does not
change. In many image processing applications it is useful to enhance the reflectance component
while suppressing the contribution from the lighting component.

So, homomorphic filtering is a frequency domain filtering technique. In this filtering technique
what we have to do is compress the brightness, which comes from the lighting condition, while
enhancing the contrast, which comes from the surface reflectance property of the object.

So, in this expression you can see I have two components, one is the illumination component and
the other is the reflectance component: f (x, y) is equal to E (x, y) times rho (x, y). This E (x, y)
is mainly the low frequency component; the illumination component is a low frequency
component because the lighting condition varies slowly over the surface of the object.

This component is responsible for the overall range of the brightness of the image, that is, the
overall contrast. And if I consider the reflectance component, it is a high frequency component
because it varies quickly at surface boundaries and edges due to varying phase angles, and this
component is responsible for the local contrast. These assumptions are valid for many real
images. So, what I am doing in this case is taking the logarithm of this expression, so that the
multiplication is converted into addition.

And after this, I am applying the frequency domain enhancement technique. So, what is a
frequency domain enhancement technique? From the input image I am taking the log because,
why I am taking the log, the multiplication is converted into addition. After this I am doing the
frequency domain filtering, for this I am doing the DFT and H u v is the filter function. And after
this I am applying the inverse DFT, and after this exponential function I am considering, that is
nothing but the inverse log function I am considering. And after this I am getting the output
image, the output image is g x y I am getting. So, in the frequency domain I am doing this
operation. This Homomorphic filtering is a frequency domain filtering technique.
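
A rough sketch of this pipeline in Python (not the lecture's code); the Gaussian-shaped high-emphasis transfer function and the parameter values D0, gamma_low, gamma_high and c are illustrative assumptions, and the distance_grid helper from the earlier sketch is reused:

```python
import numpy as np

def homomorphic_filter(f, D0=30.0, gamma_low=0.5, gamma_high=2.0, c=1.0):
    """log -> DFT -> high-emphasis filter -> inverse DFT -> exp.
    The filter compresses low frequencies (illumination) towards gamma_low
    and boosts high frequencies (reflectance) towards gamma_high."""
    M, N = f.shape
    z = np.log1p(f.astype(float))                      # log (with +1 to avoid log 0)
    D = distance_grid(M, N)                            # helper from the earlier sketch
    H = (gamma_high - gamma_low) * (1.0 - np.exp(-c * (D ** 2) / (D0 ** 2))) + gamma_low
    Z = np.fft.fftshift(np.fft.fft2(z))
    s = np.real(np.fft.ifft2(np.fft.ifftshift(H * Z)))
    return np.expm1(s)                                 # inverse of the log step
```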

(Refer Slide Time: 50:38)

So, already I have explained this concept, the simultaneous dynamic range compression and
contrast enhancement. And the illumination component is a low frequency component, that is a
slow spatial variation, and a reflectance component is a high frequency information, it is
characterized by abrupt spatial variations because it varies quickly at the surface boundaries,
edges due to varying phase angles.

(Refer Slide Time: 51:06)

And in this case I can give one example of homomorphic filtering to remove multiplicative noise.
So, what is this noise model? You can see the multiplicative noise model: I have the signal and I
have the noise, and both are multiplied. After this I am taking the log, so that the multiplication
is converted into addition. After this I can apply low pass filtering to remove the noise.

(Refer Slide Time: 51:39)

In this case you can see I am considering the original image, and I am adding the multiplicative
noise. After this, what am I considering?

(Refer Slide Time: 51:48)

I am considering the homomorphic filtering; you can see the output of the homomorphic
filtering, and I am also considering the low pass filtering output.

(Refer Slide Time: 51:56)

Next, I am considering the Wiener filter for image restoration. This image restoration I have
already explained: restoration is objective, whereas enhancement is subjective. In the case of
image restoration we can develop a mathematical model, and based on this mathematical model
we can remove the noise or improve the visual quality of an image. The objective of image
restoration is to restore the degraded image to its original form. In this case I can consider some
examples like motion blur or optical blur.

So, in this case I can develop a mathematical model, and based on this mathematical model I can
improve the visual quality of an image. And in this case I am showing one example, an observed
image can be modelled as shown: g (x, y) is the observed image, f (x dash, y dash) is the original
image, h is nothing but the point spread function, the PSF of the imaging system, and n is the
additive noise.

Now, the objective of the image restoration in this case is to estimate the original image, the
original image is f, from the observed degraded image, the degraded image is g. So, that is the
objective of the image restoration. So, I am repeating this, the objective of the image restoration
is to estimate the original image f from the observed degraded image g. Now, this image
degradation can be modeled as a convolution of the input image with a linear shift invariant
filter, h x y. And in this case h x y may be considered as a Gaussian function for out of focus
blurring. That I can consider.

Then in this case what will be the g x y, g x y will be h x y convolution f x y that I can consider.

(Refer Slide Time: 53:56)

So, the definitions are like this: f (x, y) is the input before the degradation, that is, the original
image. What is g (x, y)? It is the image after the degradation, that is, the observed image.
h (x, y) is the degradation function. And I am considering f hat (x, y), that is, the estimate of
f (x, y) which is computed from g (x, y), so I can do the estimation of f (x, y); and n (x, y) is the
additive noise.

So, you can see the degradation model, that is, how to do the image restoration. The first one is
the input image; I am considering the degradation, and for this I am considering the degradation
filter h (x, y). I am considering the noise, and after this I am getting the degraded image
g (x, y). After applying the image restoration technique I am getting the restored image, that is, a
better quality image.

(Refer Slide Time: 54:57)

Now, in this case the same thing I am showing here, that is the degradation filter I have shown,
the noise I have shown, the input image I have shown, and finally I am getting the restored
image. Then in this case this model can be mathematically represented in the spatial domain, and
in the frequency domain as follows. In spatial domain I can represent like this the g x y is
nothing but h x y, that is the degradation filter, convolution with f x y plus the noise. That is the
additive noise.

In the frequency domain I can represent it like this, because convolution in the spatial domain is
equivalent to multiplication in the frequency domain: G (u, v) = H (u, v) F (u, v) + N (u, v).
And in this case the estimate F hat (u, v) is G (u, v) divided by H (u, v), where H (u, v) is the
Fourier Transform of the Gaussian.

(Refer Slide Time: 55:57)

The image restoration can also be implemented by a Wiener filter. So, what is the Wiener filter?
The restored image is obtained as F hat (u, v) = W (u, v) G (u, v), where W (u, v) is the Wiener
filter. Later I will show how to get this expression: W (u, v) = H* (u, v) / ( |H (u, v)|^2 +
P (u, v) ), where H* is the complex conjugate of H. What is P (u, v)? That is the ratio
S_eta (u, v) / S_f (u, v).

What is S_f (u, v)? That is the power spectral density of f (x, y). S_eta (u, v) is the power
spectral density of the additive noise. If P is equal to 0, then from this expression W (u, v) is
equal to 1 / H (u, v), which is nothing but the inverse filter. If P (u, v) is large compared to
|H (u, v)|^2, then the high frequency information will be attenuated.

And in this case the magnitudes of F (u, v) and N (u, v) are assumed to be known approximately;
F (u, v) is the Fourier Transform of the image and N (u, v) is the Fourier Transform of the noise.
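
As an illustrative sketch (not the lecture's code), one Wiener restoration step can be written as below; here the ratio P (u, v) is approximated by a small constant K, a common simplification when the true power spectra are not known:

```python
import numpy as np

def wiener_restore(g, h_psf, K=0.01):
    """Restore degraded image g given the PSF h; K approximates S_eta/S_f."""
    M, N = g.shape
    G = np.fft.fft2(g)
    H = np.fft.fft2(h_psf, s=(M, N))                   # PSF zero-padded to the image size
    W = np.conj(H) / (np.abs(H) ** 2 + K)              # Wiener transfer function
    return np.real(np.fft.ifft2(W * G))
```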

(Refer Slide Time: 57:30)

So, a Wiener filter minimizes the least square error. You can see I am computing the error here;
it is defined as the integral of [f (x, y) - f hat (x, y)]^2, where f (x, y) is the original image and
f hat (x, y) is the reconstructed image. In the frequency domain I can write the same error as an
integral over (u, v); the transformation from the spatial domain to the frequency domain is done
by considering Parseval's theorem.

That means the energy in the data domain is equal to the energy in the frequency domain. Then,
since F hat (u, v) = W (u, v) G (u, v), you can substitute this expression, form F (u, v) minus
F hat (u, v), and put this value in the previous equation; from this you can determine the least
square error.

(Refer Slide Time: 58:429)

And since f (x, y) and n (x, y), the input image and the noise, are uncorrelated, this error can be
represented as shown in the previous expression. In this case the integrand is the sum of two
squares. We need to minimize the integrand, that is, the integrand should be minimum for all
values of u and v, where u is the spatial frequency in the x direction and v is the spatial
frequency in the y direction.

So, we have to consider this condition, and the condition for the minimum integrand you can
determine from this expression. Finally, if you do all this mathematics, you will be getting
W = H* / ( |H|^2 + |N|^2 / |F|^2 ), where H* is the complex conjugate of H. That is nothing but
the Wiener filter.

(Refer Slide Time: 59:33)

So, finally I can show how to do the denoising by using the Wiener filter. First the input image is
this, after this I am considering the degradation filter, noise I am considering, that is the additive
noise I am considering. After this I am getting the degraded image, I am considering the Fourier
Transform of the degraded image. I am getting G u v, after this I am applying Wiener filter. So I
am determining the approximate value of F u v, and after this I am taking the inverse Fourier
Transform to get the restored image.

So, the steps are like this: compute the Fourier Transform of the blurred image, multiply the
Fourier Transform by the Wiener filter (which I have already explained), and finally take the
inverse Fourier Transformation to get the restored image. This is a brief discussion of the Wiener
filter.

(Refer Slide Time: 01:00:33)

The next concept I want to discuss is image quality measurement, that is, how to measure image
quality. For this I can calculate the mean square error: f (x, y) is the original image and
f hat (x, y) is the reconstructed image, and the difference between them gives the error. The
image size I am considering is M cross N, that is, M rows and N columns.

So, from this you can determine the mean square error between the input image and the
reconstructed image; that is one measure for image quality. And from the MSE you can
determine the root mean square error.
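
Restating the definitions just described in equation form:

```latex
\mathrm{MSE} = \frac{1}{MN}\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}
\bigl[f(x,y) - \hat{f}(x,y)\bigr]^{2},
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}.
```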

(Refer Slide Time: 01:01:24)

Another measure is the signal to noise ratio, which you can determine by this expression. My
input image is f (x, y), so I am getting the signal power here, and the difference between the two
images is nothing but the noise. So, I am taking the signal power divided by the noise power, and
that gives the signal to noise ratio.

(Refer Slide Time: 01:01:48)

And another one is the peak signal to noise ratio, PSNR. That also you can determine: I am
considering the signal power, the peak signal power and the noise power, and the maximum
pixel value I am considering is 255, so you can determine the PSNR. PSNR is measured with
respect to the peak signal power, whereas SNR is measured with respect to the actual signal
power; that is the difference between the PSNR and the SNR. So, by using these measures, one
being the mean square error, another the signal to noise ratio, and another the PSNR, we can
judge the quality of the reconstructed image.
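
A small sketch of these three quality measures in Python (assuming 8-bit images, so the PSNR peak value is 255):

```python
import numpy as np

def mse(f, f_hat):
    """Mean square error between the original and the reconstructed image."""
    return np.mean((f.astype(float) - f_hat.astype(float)) ** 2)

def snr_db(f, f_hat):
    """SNR in dB: actual signal power over noise (reconstruction error) power."""
    return 10.0 * np.log10(np.mean(f.astype(float) ** 2) / mse(f, f_hat))

def psnr_db(f, f_hat, peak=255.0):
    """PSNR in dB: peak signal power over noise power."""
    return 10.0 * np.log10(peak ** 2 / mse(f, f_hat))
```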

In this class I discussed the concept of the frequency domain filtering. For frequency domain
filtering the concept is, I have to modify the Fourier Transform of that image. And I have
discussed the concept of the low pass filter. For this I discussed about the concept of the ideal
low pass filter, Butterworth low pass filter and the Gaussian low pass filter. And also two
important properties, one is the blurring effect another one is the ringing artifacts.

So, how to consider these two cases, and it depends on the cutoff frequency, the cutoff frequency
is D naught. After this I explained the concept of the high pass filter, again I am considering the
ideal high pass filter, Butterworth high pass filter and the Gaussian high pass filter. After this I
have discussed the concept of the Laplacian. So, by using the Laplacian how to improve the
visual quality of an image. So, by Laplacian you can detect the edges.

And after this I discussed the concept of homomorphic filtering. In this case I have two
components, one is the illumination component and the other is the reflectance component. The
illumination component is the low frequency component, and the reflectance component, that is,
the albedo, is the high frequency component. I have explained how to do this filtering in the
frequency domain.

After this finally I have discussed the concept of the image restoration by Wiener filter. And
after this I discussed the image quality measurement, like this one is the mean square error, one
is the signal to noise ratio, and another one is the peak signal to noise ratio. This is about the
frequency domain filtering. So, let me stop here today. Thank you.

Computer Vision and Image Processing- Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics & Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture - 18
Color Image Processing

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing: Fundamentals
and Applications. In my last class I discussed the concept of image filtering in the spatial domain
and in the frequency domain. Today I am going to discuss one or two applications of image
filtering. One application is background estimation: by using low pass filtering and high pass
filtering I can roughly estimate the background of an image.

Another application I am going to discuss is the approximation of the reflectance model. That
concept I have already discussed in my discussion of homomorphic filtering. In homomorphic
filtering, I can separate the illumination component and the reflectance component, and for the
illumination component I can do intensity range compression.

And for the reflectance component I can do image enhancement. So, that concept again I am
going to explain, the homomorphic filtering. And finally, I am discussing the concept of template
matching, so how to do template matching in an image. So, let us see these applications first,
and after this I will discuss the concept of color image processing.

(Refer Slide Time: 1:51)

The first application I can show you is background estimation. In background estimation,
suppose I have the image; I can apply a low pass filter or a high pass filter. If I apply the low
pass filter, then I will be getting the approximate background. And from the high frequency
component, that is, the high pass filter, I will be getting the approximate objects, like the edges.
So, you can see this is a very simple technique.
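
A minimal sketch of this idea (illustrative only; it uses SciPy's Gaussian smoothing as the low pass step, and sigma is an assumed value, not one from the lecture):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def rough_background_and_detail(f, sigma=15.0):
    """A heavily smoothed (low pass) copy approximates the background;
    the residual (a high pass result) keeps the edges / object detail."""
    background = gaussian_filter(f.astype(float), sigma=sigma)
    detail = f.astype(float) - background
    return background, detail
```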

So, from the input image I can apply the low pass filter and the high pass filter, and I can get the
approximate background and also the approximate objects, like the edges. This is a very simple
application. Another application I can show you relates to what we have done in homomorphic
filtering: simultaneously we are doing intensity range compression plus contrast enhancement,
with contrast enhancement for the reflectance component and intensity range compression for the
illumination component, that is, the irradiance.

So, the second application I want to show you is the approximation of the reflectance model.
Already you know the image irradiance at the image point (x dash, y dash); this is proportional to
the scene irradiance, that is, the illumination, say E (x, y).

Then, at this (x, y) point, that is, the world point, I also have to consider the reflectance
component; the reflectance I can denote rho (x, y), which is mainly the albedo.

For the scene irradiance I have to consider the sum of contributions from all illumination
sources, and what is the reflectance? Reflectance means the portion of the irradiance which is
reflected towards the observer or the camera. So, mathematically I can write this expression: the
gray value at the image point (x dash, y dash) is equal to K E (x, y) rho (x, y), where K is a
constant.

Now, as I already discussed in homomorphic filtering, this component E changes very slowly
over the scene, that is, the irradiance changes slowly over a scene. And if I consider the
reflectance component, it changes very quickly over edges, mainly due to varying phase angles.

So, I have two components, one is the irradiance and the other is the reflectance. And I can apply
the log operation: log f (x dash, y dash) = log K + log E (x, y) + log rho (x, y). The objective is
to convert the multiplication into addition. If you see this, log K is nothing but a dc component,
since K is constant. Log E (x, y) is the low frequency component. And log rho (x, y) is the
reflectance property of the object in the world.

So, that is the high frequency component: log rho (x, y) is actually the characteristics of the
object, that is, the surface characteristics.

So, in this case you can see that by applying the filtering technique I can remove the dc
component and also the low frequency component, so that the high frequency component that
remains is the reflectance component, that is, the reflectance property of the object in the world;
that is the characteristics of the object.

In the case of image analysis, image analysis is nothing but recognizing the reflectance
component under various illumination conditions.

So, you can see the application of the filtering: by using a high pass filter (in the log domain), I
can remove the dc component and also the E (x, y) component, that is, the irradiance, and my
reflectance component is the high frequency component that I can observe.

(Refer Slide Time: 11:11)

The third application I can show you, that is the template matching. So, concept I am going to
explain. So, suppose I have one image, the image is f x suppose, and I have some objects in the
image, this is my image. And suppose I have a template, I have a template image, the template
image is suppose t x. So, the template is suppose like this, so this is a template.

So, what is the objective of the template matching? Detecting a particular feature in the image.
That means, in this case, in the template I have shown one object, I want to detect whether that
particular object is present in the image or not. In the image I have many objects, but in that
template I am considering the object, one object I am considering. So, whether that particular
object is available in the image or not, I want to determine. And that is the template matching.

So, for this what I have to do, template is moved over the image, and in this case I have to do the
overlapping. At a particular point, suppose this template is matching with the particular object,
then in this case maximum overlap I have to find. So, for this I have to do the shifting of the
template.

So, the template is shifted by a vector, suppose v. After this, I determine one measure, a distance
measure: the summation over all image points of [f (x) - t (x - v)] squared; so I am finding the
distance.

The distance will be minimum when the template exactly matches the object present in the
image; that is why I am considering this distance and doing the summation over all the image
points. This principle is used in motion estimation: if I want to determine, suppose, how one
object is moving across frames in a video, I can estimate the motion by considering this
technique, that is, the template matching technique.

So, this expression can be expanded as the summation of f squared (x) plus t squared (x - v)
minus twice f (x) t (x - v). The first two terms can be treated as constant terms, and the last term
is nothing but the cross correlation.

So, the cross correlation between the image and the template can be written as the summation
over all image points of f (x) t (x - v). This is the cross correlation, and it will be maximum when
the template matches the object in the image.

In the continuous domain the summation is replaced by integration. Now, in the Fourier domain,
suppose T (u, v) is the 2D Fourier Transform of the template, and F (u, v) is the 2D Fourier
Transform of the image.

Then I can express the cross correlation in the frequency domain as a multiplication: the
multiplication between F (u, v), the 2D Fourier Transform of the image, and the shifted template
transform T (-u, -v). So, the cross correlation in the frequency domain is the multiplication
between F (u, v) and T (-u, -v).

Now, what will be the block diagram for this template matching? My input image is f (x, y); I
determine its 2D Fourier Transform F (u, v). This F (u, v) is multiplied with the template
transform T (-u, -v).

From this I can determine the cross correlation R_ft (v). And in this case I have to find the
maximum value of R_ft, that is, determine at which point R_ft is maximum. So, this is the
fundamental concept of template matching.
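
A minimal sketch of this frequency-domain cross correlation (not the lecture's code); for a real-valued template, T(-u, -v) equals the complex conjugate of T(u, v), so the product F(u, v) T*(u, v) is used. In practice a normalised cross correlation is often preferred, but the plain version below follows the derivation above:

```python
import numpy as np

def cross_correlation_map(f, t):
    """R_ft computed in the frequency domain: IFFT( F(u,v) * conj(T(u,v)) )."""
    M, N = f.shape
    F = np.fft.fft2(f)
    T = np.fft.fft2(t, s=(M, N))          # template zero-padded to the image size
    return np.real(np.fft.ifft2(F * np.conj(T)))

# The best match location is where the correlation map is maximum, e.g.:
# R = cross_correlation_map(f, t)
# y, x = np.unravel_index(np.argmax(R), R.shape)
```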

So, up till now, I discussed about the concept of image filtering and the applications. One is the
background estimation, one is the approximation of the reflectance component and one is the
template matching. Now, I will discuss the concept of color image processing.

So, in case of the color image processing, I have two types of processing. So, one is full color
processing and another one is pseudo color processing. In case of the full color processing, I can
process a color image just like a grayscale image, in color image I can consider a pixel, a pixel is
a vector pixel, because I have to consider the R value, G value, and blue value, these are the
primary colors.

So, vector pixel I can consider and I can do the processing. So, I can give some examples, I can
do the image enhancement for the color image, I can remove noises in the color image. So, like
this I can do all the processing in the color image.

The second one is the pseudo color processing; pseudo means the false color. So, for some
applications for better visualization, I can convert the grayscale image into color images. I can
apply some transformation, so that the monochromatic images can be converted into color
image. Now, in the case of color image processing, I can consider two types of processing: one
is marginal processing and the other is vector processing.

In case of the marginal processing, I can consider the R channel, G channel, blue channel
separately, and I can do the processing for the R component, G component and the blue
component separately, this is called the marginal processing. But in another case that is a vector
processing I can consider RGB together, that is the vector pixel and I can do the processing for
this pixel, that is the RGB pixel. So, one is marginal processing, another one is vector
processing.

So, today I am going to discuss these concepts, one being full color processing and the other
pseudo color processing, and after this I will discuss some color models, like the RGB color
model and the CMY color model.

(Refer Slide Time: 20:29)

So, you can see here I have two types of processing, one is the full color processing. So, already
I have explained about the full color processing. And pseudo color means the false color
processing, that is, assigning a color to a particular monochrome intensity or range of intensities,
in order to enhance visual discrimination.

(Refer Slide Time: 20:55)

And in this case I have shown the electromagnetic spectrum and corresponding to the visible
light you can see the wavelength is from 400 nanometers to 700 nanometers, 400 nanometer
corresponds to blue color and 700 nanometer corresponds to red color. So, you can see the
visible spectrum, that is the visible range of the spectrum, that is from 400 to 700 nanometers.

A particular color corresponds to at least one wavelength in this band. So, if I consider, suppose,
the blue color, I have to consider a particular wavelength corresponding to blue. And one
important point is that pure or monochromatic colors hardly exist in nature; it is very difficult to
get a pure color corresponding to only one particular wavelength.

(Refer Slide Time: 21:51)

In this case also I am showing the visible spectrum, the spectrum is from 400 to 700 nanometers.
And you can see all the colors, you can see all the colors in the spectrum. And in this case, if I
consider a particular color, that means the frequency or mix of frequencies of the light
determines a particular color. So, that is you can see here in the spectrum.

(Refer Slide Time: 22:19)

And let us see the structure of the human eye. If you see the retina in the human eye, the retina
has two types of photoreceptors that convert light photons into electrical signals: one type is
called rods and the other is called cones. Rods are non-color-sensing receptors and cones are
color-sensing receptors; that means the cones are responsible for color vision and the rods are
responsible for monochromatic vision.

And if you see the distribution of these cones and rods: for the cones there is a high
concentration at the center, whereas for the rods there are none at the center and their density
increases to a maximum at about 18 degrees; you can see the distribution of the rods and the
cones. The color information is limited at the periphery. So, this is roughly the structure of the
human eye; you can see the lens, but here I am only discussing the concept of the rods and the
cones, mainly the retina.

(Refer Slide Time: 23:35)

So, among the cones we have three types of photoreceptors, R, G and B, that is, red, green and
blue; these are the primary colors. So, cones are responsible for color vision, and in the human
eye we have three types of cones, R, G and B. And what about other colors? If I consider
another color, the sensation of that color is produced due to the mixed response of these primary
cones in a certain proportion. What are the primary cones? The primary cones are R, G and B.

In the human eye we have 6 to 7 million cones, and they are divided into red, green and blue
cones for red vision, green vision and blue vision. And you can see 65 percent of the cones are
sensitive to red, 33 percent are responsible for green, and only 2 percent for blue. That is why the
blue cones are the most sensitive.

(Refer Slide Times: 24:39)

And in this figure you can see the absorption of light by the red, green and blue cones in the
human eye as a function of wavelength. Corresponding to blue the peak is around 445
nanometers, corresponding to green around 535 nanometers, and corresponding to red around
575 nanometers. The standard wavelength for the red color is 700 nanometers, for the blue color
the standard wavelength is 435.8 nanometers, and for green the wavelength is 546.1 nanometers.

These are the standard wavelengths corresponding to the R component, G component and B
component. But in this figure I am showing only the response of the cones, the red cones, blue
cones and the green cones.

(Refer Slide Time: 25:50)

So, according to the International Commission on Illumination, the wavelength corresponding to
the blue is 435.8 nanometers, green is 546.1 nanometers and red is 700 nanometers. And the
primary colors can be added in certain proportion to produce different colors of light.

(Refer Slide Time: 26:14)

And in this case, the color produced by mixing R, G and B is not a natural color, because for a
natural color we have only a single wavelength, lambda. But the same color sensation can be
produced artificially by combining weighted R, G and B components, each having a different
wavelength. That is, any color can be produced artificially by combining the R component, G
component and B component, yet the color produced by mixing R, G and B is not a natural
color, because a natural color has only a single wavelength.

(Refer Slide Time: 26:53)

And in this case, I am showing these cases. So, mixing two primary colors in equal proportion
produce secondary colors. So, you can see I am combining R plus B I am getting magenta, in
equal proportion I am combining G plus B I am getting cyan, I am combining R plus G in equal
proportion, so I am getting yellow. So, this magenta, cyan and the yellow, these are the
secondary colors. And if I combine suppose in equal proportion R is equal to G is equal to B is
equal to 1, that corresponds to the white color, the white light.

So, in these figures you can see the primary colors and the secondary colors. Corresponding to
white, I am taking R equal to G equal to B in equal proportion. And you can see the black color
also in the second figure; so, in these two figures I have shown the primary colors and the
secondary colors.

(Refer Slide Time: 27:58)

Now, in this case, in case of the color, color has mainly three components, one is the brightness,
another one is hue, another one is saturation. So, brightness means, it is the intensity, the light
intensity, that is the logarithmic function of light intensity. And hue means a particular color, that
means it is associated with the dominant wavelength in a mixture of light waves. And saturation
means the relative purity or the amount of white light mixed with a particular color that is the
saturation.

So, you can see the light I can consider as one property is the brightness, one is the hue, hue
means the color corresponding to a particular wavelength, and saturation means the purity of the
color, that is the relative purity or the amount of white light mixed with a particular color.

(Refer Slide Time: 28:57)

So, that means I can consider color as like this. So, brightness is one component, another one is
the chromaticity. Brightness means the intensity. And chromaticity has two components, one is
hue, another one is saturation. Hue means the color and saturation means the purity of the color.
So, color is defined like this, one is the brightness, that is the intensity, brightness means
intensity and the chromaticity, so this is the definition of color.

(Refer Slide Time: 29:35)

Now, I will discuss some color models. The first color model is the X, Y, Z color model. You can see this matrix; it was found experimentally by the CIE, the International Commission on Illumination. In this case, the entire color gamut can be produced by three primaries.

That is, I am considering R, G and B; corresponding to the wavelength I am considering R(lambda), G(lambda) and B(lambda), and from these you get X, Y and Z, which are called the tri-stimulus values. So, from this equation you can determine the X value, the Y value and the Z value, and by using X, Y, Z you can represent a particular color.

(Refer Slide Time: 30:19)

These X, Y, Z values can be normalized to get the tri-chromatic coefficients, which are small x, small y and small z. That means capital X, capital Y and capital Z are normalized and I am getting x, y and z, and by using these coefficients I can represent colors.

And in this case, after normalization x plus y plus z is equal to 1. So, I have this expression that
is corresponding to X, Y, Z color system. And for any wavelength of light in the visible spectrum
these values can be obtained directly from the curves or tables that is compiled from
experimental results.
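To write this normalization compactly (this is just the relation described above, restated in symbols):

\[
x = \frac{X}{X+Y+Z}, \qquad y = \frac{Y}{X+Y+Z}, \qquad z = \frac{Z}{X+Y+Z}, \qquad x + y + z = 1 .
\]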

(Refer Slide Time: 31:07)

And in this case, based on this x, y, z system, I can draw a chromaticity diagram. If I consider z as the dependent variable, I can write z = 1 - (x + y), so in terms of x and y I can determine z. Based on this information, I can draw the chromaticity diagram in the x-y plane.

So, in this case, you can see I have the axis extent from 0 to 1. This origin corresponds to the
blue. I have two extreme points, one is the red, red is the extreme point, and another one is the
green. And corresponding to this point, the x is equal to y is equal to 1 by 3 and that corresponds
to white color, this point.

So, you can see how to get the chromaticity diagram from the x, y, z color model, the x, y, z
system. So, you can see the origin corresponds to the blue color. I have two extremes, one is the
red color another one is the green color, the white point is x is equal to y is equal to 1 by 3. So,
that means, by using x y and z information I can represents color.

(Refer Slide Time: 32:35)

And in this case, this is the actual chromaticity diagram you can see here, so you can see the blue
color, this is the blue point, another point is the red point and this point is the green point. And in
this case, you can see that violet corresponds to 380 nanometers to red is 700 nanometers. And in
this case, if you see the boundary of this chromaticity diagram, these boundary colors are pure
colors.

So, if you see all these boundary colors, these are the pure colors. Pure means the saturation is 1,
saturation is defined like this, the purity of the color. And in this case if I consider inside these
colors, so inside point represents the mixture of the spectrum colors. So my spectrum colors are
RGB, that is the primary colors, and by mixing the RGB I can get other colors and you can see
this chromaticity diagram. So, in the boundary I have the pure colors, so that corresponds to 100
percent saturation.

(Refer Slide Time: 33:40)

And so, you can see any color inside this the chromaticity diagram can be achieved through the
linear combination of the pure spectral colors and a straight line joining any two points shows all
the different colors that may be produced by mixing the two colors corresponding to the two
points.

(Refer Slide Time: 33:40)

You can see the straight line here, and you can see how to produce different colors by mixing the two colors at its endpoints.

(Refer Slide Time: 33:40)

The straight line connecting red and blue is called a line of purple. So, this line, the red and the
blue, this is called the line of purple, this line.

(Refer Slide Time: 34:18)

So, here again I am showing the chromaticity diagram and in between you can see this is the
white point you can see, and all other points you can see. And that is the RGB primaries form a
triangular color gamut, and the white color falls in the center of the diagram.

Because corresponding to white, in X, Y, Z color system what we can consider, that is x is equal
to y is equal to 1 by 3. So, corresponding to this I have the white color. But if I consider the RGB
color model, then in this case R is equal to G is equal to B is equal to 1, that corresponds to
white.

(Refer Slide Time: 34:58)

Another model is the RGB color model. It is sometimes called the hardware color model, because it is used in hardware devices like computer monitors, and even in cameras this model is used. In the RGB color model you can see the RGB color cube here.

So, I have mentioned all the points here, you can see this origin is the black and this point is the
white point. And you can see the line joining the black and white, that is the intensity axis. So,
the grayscale I have shown. And you can see all the colors, the primary colors and the secondary
colors in the cube, the RGB cube, you can see this is the red point, the green point and the blue
point.

So, corresponding to the red axis, the red point is (1, 0, 0); corresponding to green it is (0, 1, 0); and corresponding to blue, since the order is RGB, it is (0, 0, 1). Those coordinates you can see in the RGB cube. And this is the RGB 24-bit color cube. Why is it a 24-bit color cube? Because for the R component I need 8 bits, for the G component I need 8 bits and for the B component I need 8 bits. That is why it is the RGB 24-bit color cube, and all the colors, the primary colors and the secondary colors, you can see here. So, this is the RGB color model, sometimes called the hardware color model.

(Refer Slide Time: 36:29)

And in this case, you can see the 3 components of the colors, that is vector pixel I have shown,
corresponding to a particular pixel I have the red value, the green value and the blue value. So, I
have three components corresponding to a particular pixel, and that pixel is called the vector
pixel.

(Refer Slide Time: 36:55)

The next models are the CMY and the CMYK color models. That means I can get the secondary colors from the primary colors. How do I get the cyan color? You can see, (C, M, Y) is nothing but (1, 1, 1) minus (R, G, B); so C = 1 - R gives the cyan value, M = 1 - G gives the magenta value, and Y = 1 - B gives the yellow value.

So, by using this expression you can get the secondary colors, which are cyan, magenta and yellow. Now, if I consider C = M = Y = 1, it should produce the black color, but practically a perfect black is not obtained by mixing them in equal proportion. That is why an extra pigment is considered, and that is K, the black pigment.

So, in inkjet printers or laser printers there are four toners: one toner for the cyan color, one for the magenta, one for the yellow, and K for the black, which gives a perfect black color. This is called the CMYK model; C, M and Y are the secondary colors and K is the black component.
So, this CMYK model is used in the printing devices.
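As a small illustration, here is a minimal sketch in Python with NumPy of the RGB-to-CMY relation described above, assuming the R, G, B values are normalized to the range [0, 1]; the function names are my own:

    import numpy as np

    def rgb_to_cmy(rgb):
        # rgb: array of shape (..., 3) with values in [0, 1]
        # C = 1 - R, M = 1 - G, Y = 1 - B
        return 1.0 - np.asarray(rgb, dtype=float)

    def cmy_to_rgb(cmy):
        # the same relation works in reverse
        return 1.0 - np.asarray(cmy, dtype=float)

    # example: pure red (1, 0, 0) gives cyan = 0, magenta = 1, yellow = 1
    print(rgb_to_cmy([1.0, 0.0, 0.0]))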

(Refer Slide Time: 38:41)

After this I will discuss one important concept, and that is decoupling the color components from
the intensity information. Because the color has two components, one is the intensity component
another one is the chromaticity component. Intensity means the brightness and chromaticity
means it has two components, one is the hue and other one is the saturation. I can decouple the
color information from the intensity information.

So, there are some advantages, the first advantage is you can see. The human eye is more
sensitive to intensity variation as compared to color variation. Suppose, if I want to do image
compression, the color image compression, then in this case I have to give maximum weightage
or maximum preference to the intensity component as compared to color components.

So, that means, in the bit allocation strategy I have to allocate more bits for the intensity component than for the color components; this is for color image compression.

And if I consider black and white TV transmission, then I can use only the intensity component and neglect the color components. And for image processing, the RGB can be converted into some other model like the HSI color model; I will discuss later on what this color model is. So, I can decouple the color information from the intensity information and process the image considering only the intensity component.

Suppose, if I want to apply histogram equalization technique then only I can only consider the
intensity component, I can neglect the color information. Similarly, suppose I want to remove the
noises in the image, then in this case I can only consider intensity component of the color image,
I can neglect the color information.

So, that is why the decoupling of color information from the intensity information that is quite
important. And based on this I have some color models like HSI color model, Y-Cb-Cr color
model, YIQ color models. So next class I am going to discuss about these color models.

So, let me stop here today. Thank you.

Computer Vision and Image Processing- Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics & Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture-19
Color Image Processing

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing: Fundamentals
and Application. In my last class I discussed the concept of color image processing. So, there are
two techniques, one is that marginal processing and another one is the vector processing. In case
of the marginal processing, the RGB components we can process separately, the R component I
can process separately, G component I can process independently and B component I can
process independently, that is called marginal processing.

In case of vector processing I can consider each pixel as a vector pixel, because in a vector pixel I have
three components, R component, G component and B component. So, all these components, the
three components, I can consider together for processing. So, one is marginal processing, another
one is the vector processing. In my last class also I discussed the concept of the full color image
processing and one is the pseudo color processing.

In full color processing, I can process a color image just like a grayscale image, I can do image
enhancement, I can remove noise of the color images. In case of a pseudo color processing that
means the false color so, I can convert that grayscale image into the color image. So, that is the
concept of full color processing and a pseudo color processing.

After this I discussed about some color models. So, one color model is X, Y, Z color model.
After this I discussed the RGB color model, the CMY color model and CMYK color model.
Now, based on this X, Y, Z color model, I have defined the chromaticity diagram. So, all the
colors I can represent in the chromaticity diagram. That is about the X, Y, Z color model.

Now, I can consider the color has mainly two components, one is the brightness and another one
is the chromaticity. So, one is the luminance and another one is the chrominance that is the color.
So, chromaticity means I have to consider two components, one is the hue, another one is the
saturation. Hue means the color corresponding to a particular wavelength. And what is the
saturation? Saturation means the purity of color, that concept I discussed in my last class.

Today I will discuss some other color models, and the main concept is mainly the decoupling of
color information from the intensity information, that is the intensity means the brightness
information. So, I can do the separation of the color component from the intensity component
and there are many advantages of decoupling.

So, one important application is in color image processing and that color image compression.
Human eye is more sensitive to intensity variation as compared to color variation. So, that is why
if I want to consider the color image compression, then in this case I have to allocate more
number of bits for the intensity component as compared to the color component. And if I
consider suppose the black and white TV transmission, then in this case only I have to consider
the intensity component, I need not consider the color component.

Also, suppose if I want to do the color image processing, then in this case what I have to
consider, only the intensity component I have to consider and maybe the color component I can
process separately. So, for example, suppose if I want to improve the visual quality of an image,
that is the image enhancement. So what I can do, I can process the intensity component only
without affecting the color components.

So, these are the advantages of decoupling the intensity information from the color information
in the color image. So, first I will discuss about this concept, the decoupling of intensity
information or decoupling of color information from the intensity information. After this I will
discuss some color models like HSI color models, YIQ color model, YCbCr color model. So,
these concepts I am going to discuss today.

(Refer Slide Time: 5:06)

So, first one is the decoupling the color components from intensity. So, already I have explained
this concept. So I can separate the intensity from the color components. And one important thing
is, the human eyes are more sensitive to the intensity than the color, color means hue. So, based
on this, suppose one application is the color image compression. So for this what I can do, I can
allocate a more number of bits for the intensity component as compared to the hue component.

And the second example already I have explained, that is for a black and white TV. For a black
and white TV I can only consider the intensity component without considering the color
components. And for color image processing what I can do, I can only do the processing for the
intensity component and maybe sometimes I can do the processing for the intensity component
and the color component separately. So, this is the advantage of decoupling the color
components from intensity.

(Refer Slide Time: 6:03)

Now, based on this I have one color model, that is the HSI color system: hue, saturation and intensity. In this diagram I am representing hue, saturation and intensity. You can see the intensity axis, that is the brightness. The reference axis is the red axis, and the angle measured from this red axis corresponds to hue.

And the length of the vector from the intensity axis out to the color point corresponds to saturation. Saturation is 1 for a pure color and less than 1 for an impure color. So here I have shown three components: the intensity component, which is nothing but the brightness; hue, which means the color; and saturation, which means the purity of the color.

(Refer Slide Time: 6:56)

And in this figure what I have shown, the relationship between RGB and the HSI color model.
So, first one is the RGB color cube, the second one is the HSI model I am showing like this. So,
what you can see, in the RGB model you can see all colors, the primary colors, the red color and
the green color and the blue color, these are the primary colors, also you have seen the secondary
colors cyan, magenta and yellow. And corresponding to this RGB cube, you can see the
intensity axis that is connecting from the black point to the white point, that is the intensity axis.

And in the second figure you can see that is the HSI model. In this case also I am showing the
intensity axis connecting the black point and the white point. So, if you see here, this is the
intensity axis. And after this I am considering the hue and saturation with respect to the reference
axis, the reference axis is the red axis, this is the reference axis, red.

So, hue means the angle with respect to the red axis, and saturation means the length of the vector. The saturation of a color increases as a function of the distance from the intensity axis, and the saturation of points on the intensity axis is 0. So, this is about the relationship between the RGB and the HSI color models.

(Refer Slide Time: 8:45)

And in this case I have shown some models to represent hue and saturation on color planes. The
first one is the hexagon color model, the second one is the circular model I am considering and
the third one is the triangle model. So, the concept is same, and axis, the reference axis is red and
I am considering the angle with respect to the red axis, that is the hue and saturation is nothing
but the length of the vector.

So, if you see this diagram, the saturation is a distance to the point, that is the length of the
vector. Hue is an angle from that red axis, so I am considering this. So I can consider any one of
these model, the hexagon I can consider, maybe circle I can consider or the triangle I can
consider.

(Refer Slide Time: 9:40)

And in this case, I have shown here, again, the same thing the triangle model I am considering.
And in this case if you see here, I have shown the intensity axis, that is connecting. This is
intensity axis, connecting the black point and the white point. In the second case also, in the
circle model I have shown the intensity axis connecting the black point and the white point, that
is the intensity axis.

And corresponding to this, if you see the hue, what is hue? Hue is the angle with respect to the
red axis. So, in both cases you can see the hue, this is the hue. And saturation is nothing but the
length of the vector, if it is 1 then saturation will be 1. That means, I am getting the pure color.
For the points on the intensity axis the saturation will be 0; that means there is no color purity at all, only a gray level. So, this is about the representation of the HSI color model.

(Refer Slide Time: 10:41)

Now, some conversion formulas. You can see that HSI can be converted into RGB and RGB can be converted into HSI. Here I am considering the conversion from RGB to HSI. From the RGB values you can determine H, the hue, which is actually represented by the angle theta; the saturation is represented as shown; and the intensity is nothing but (R + G + B)/3. You need not remember all these formulas; the most important one is the last, that the intensity is (R + G + B)/3. So, this is the conversion formula from RGB to HSI.
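To make these formulas concrete, here is a rough sketch in Python with NumPy of the standard RGB-to-HSI conversion for a single pixel, with R, G, B normalized to [0, 1]; the function name and the small epsilon guarding against division by zero are my own choices:

    import numpy as np

    def rgb_to_hsi(r, g, b, eps=1e-8):
        # intensity: average of the three channels
        i = (r + g + b) / 3.0
        # saturation: 1 minus the scaled minimum component
        s = 1.0 - 3.0 * min(r, g, b) / (r + g + b + eps)
        # hue: angle in degrees measured from the red axis
        num = 0.5 * ((r - g) + (r - b))
        den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + eps
        theta = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
        h = theta if b <= g else 360.0 - theta
        return h, s, i

    # example: pure green gives a hue of about 120 degrees
    print(rgb_to_hsi(0.0, 1.0, 0.0))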

(Refer Slide Time: 11:28)

Similarly, I can convert the HSI to RGB. So, you can see all these formulas. And corresponding
to RG sector, that is H lies between 0 to 120 degree, corresponding to this you can see the R
value, G value and the B value. Similarly, corresponding to the GB sector, that is H lies between
120 degree to 240 degree, and corresponding to this you can see the values, the R value, G value
and the B value. And similarly, for the BR sector you can get the values, R value, G value and B
value.

Again, I think you need not remember all these equations. For exam what you can do, for exam
these equations will be provided, only you have to find the values, you can convert the HSI into
RGB or RGB into HSI.

(Refer Slide Time: 12:25)

In this example, I have shown the HSI components of RGB color. The first one you can see, the
RGB image I have shown here, the first figure is the RGB image I am showing here. The second
I am showing the hue component of the image, the second is the hue component. The third one is
the saturation component.

You can see that this portion appears black; black here means maximum white light, so the saturation will be 0. Saturation means the purity of the color, that is, the amount of white light mixed with a particular color, and that is why the saturation is 0 corresponding to the white portion. You can also see the intensity values here. So, I am showing the HSI components of the RGB image.

(Refer Slide Time 13:14)

The next model is YIQ model, that is mainly used in NTSC standard analog video transmission.
In this case, Y stands for intensity component and I is the in phase component and Q is the
quadrature component. And I have shown that conversion formula from RGB to YIQ. So, by
using this equation you can convert the RGB value into YIQ value. And in this case, the same
concept, because Y stands for intensity.

So, that means the importance of YIQ model is there, luminance component of an image can be
processed without affecting the color components. So, mainly I can separate the brightness or the
intensity from the color information. The color information mean, in this case, I and Q, the in
phase component and the quadrature component. That is I am separating the luminance from
chrominance.
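For reference, the commonly quoted NTSC conversion matrix looks like the following; the exact coefficients differ slightly from one reference to another, so treat these values as indicative rather than as the unique standard:

\[
\begin{bmatrix} Y \\ I \\ Q \end{bmatrix}
=
\begin{bmatrix}
0.299 & 0.587 & 0.114 \\
0.596 & -0.274 & -0.322 \\
0.211 & -0.523 & 0.312
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\]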

(Refer Slide Time: 14:18)

Another important model is the Y-Cb-Cr color model, and this is mainly used in color image compression. Y is the intensity component, as in the YIQ model, and Cb and Cr are the color components.

Why it is used? Because the spectral redundancy between Cb and Cr, that means, coefficients are
less correlated, the Cb and Cr are less correlated. Because of this, this model, the Y-Cb-Cr model
is used for color image compression. And here I have shown the conversion formula. So, you can
convert the RGB values into Y Cb Cr, and by using this conversion formula you can do this.
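As an illustration, here is a minimal Python/NumPy sketch of one common RGB-to-YCbCr conversion, the full-range JPEG convention for 8-bit channels; broadcast standards use slightly different offsets and scalings, so this is only one possible choice and not necessarily the exact formula on the slide:

    import numpy as np

    def rgb_to_ycbcr(rgb):
        # rgb: array of shape (..., 3) with 8-bit values
        rgb = np.asarray(rgb, dtype=float)
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y  =  0.299    * r + 0.587    * g + 0.114    * b
        cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0
        cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0
        return np.stack([y, cb, cr], axis=-1)

    # example: a pure white pixel maps to Y = 255, Cb = Cr = 128
    print(rgb_to_ycbcr([255, 255, 255]))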

(Refer Slide Time: 15:12)

So, here I have shown one example of color image compression. So, I have the original image
and I am applying to JPEG2000. So, in this case, the color model is used Y Cb Cr. So, this color
model is used for this compression and you can see the image after the compression.

(Refer Slide Time: 15:38)

And finally, I want to show another important color model, that is the CIE Lab color model. In many computer vision applications this model is used. You already know the conversion formula, that is, how to convert RGB into the X, Y, Z system, and based on this you can calculate the L value, the a value and the b value by using the conversion formulas shown here. You can implement these, so the RGB image can be converted into the Lab color model.
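In practice one rarely implements this conversion by hand. As a sketch, assuming the scikit-image library is available, the conversion can be done as follows; the file name is only a placeholder:

    from skimage import io, color

    # read an RGB image (placeholder file name) and convert it to CIE Lab
    rgb = io.imread("example.png")[..., :3] / 255.0
    lab = color.rgb2lab(rgb)

    # L is the lightness, a and b are the chrominance channels
    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    print(L.min(), L.max())  # L lies roughly in the range [0, 100]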

(Refer Slide Time: 16:14)

One important property is that the Lab color space nicely characterizes the human perception
of colors, and that is why this color space is superior to other color spaces. So, in many computer
vision applications like skin color segmentation, background modeling, in video surveillance, so
this color model, that is the Lab color model is applied.

(Refer Slide Time: 16:40)

And already I have explained the concept of the color image processing. One is the marginal
processing, another one is the vector processing. So, first figure if you see that is I am
considering the grayscale image. The second one is I am considering the color image. So,
corresponding to the color image I have to consider the vector pixel, corresponding to a
particular pixel I have three components, one is the R component, G component and the blue
component.

So, in the marginal processing what I can do, I can process the RGB components separately. But
in case of the vector processing, I can consider the vector pixel as a whole, that means, I can
consider RGB as a whole, as a vector pixel, and I can process all these components together, that
is the vector processing. So, in the next slide, I can show the difference between the marginal
processing and the vector processing.

(Refer Slide Time: 17:41)

Here you can see I have shown that grayscale image and here I have applied the spatial masking
operation. In the second case, I am showing the RGB color model and again in this case I am
applying the spatial mask. So, in case of the marginal processing, what I have to do, that is
process the RGB components separately. So that is the marginal processing.

(Refer Slide Time: 18:17)

In case of vector processing, each pixel is treated as a vector to be processed. In this figure, however, I have shown the concept of marginal processing. The channels are the red channel, the green channel and the blue channel; red is processed separately, green is processed separately and blue is processed separately. That is called marginal processing.

(Refer Slide Time: 18:37)

In the second example, this example I have shown the vector processing. So, I can consider RGB
as a single unit, that is the vector pixel. And I can do the processing together, that is the
processing for the R component, processing for the green component and processing for the blue
component together and after the processing I am getting the output, output is the R, G and B.
This is called vector processing.

(Refer Slide Time: 19:06)

Now, in this example, I have shown the difference between the marginal processing and the
vector processing. So, one example I am considering that is the marginal median, the median
value I am considering and second one I am considering the vector median.

So, in this case you can see in case of the marginal median, if I want to determine the median of
the pixels. So what will happen? You can see the color, the yellow color that color is appearing
here, that is not originally present in the image. In the original image this color was not present.
That means, the color distortions take place. But in the vector median I am getting the original
colors.

Let me give one example. Suppose, I have the R value, G value and B value, and I am
considering suppose three pixel, the first pixel is number 1 pixel, the values are suppose 1, 5, 6.
Second pixel I am considering, the second pixel is suppose 2, 4, 3. Third pixel I am considering
suppose 3, 7, 1. So, I have three pixels, for the first pixel RGB values are 1, 5, 6, for the second
pixel the RGB values are 2, 4, 3, and for the third pixel the RGB values are 3, 7, 1.

So, if I want to determine the marginal median for these pixels, I have to take the median channel by channel. For the red channel the median will be 2, for the green channel the median value will be 5, and for the blue channel the median value will be 3. That is the marginal median.

Corresponding to this, what do I get? I get the value (2, 5, 3), that is R = 2, G = 5 and B = 3. But this pixel (2, 5, 3) is not available in the original image, and that is why I am getting some other color in the output image; the yellow color appears because of the marginal median. So you can understand the concept of the marginal median and what the problem with it is: it produces false colors, that is, color distortions take place.

(Refer Slide Time: 22:03)

I am showing the same example again. So, I am considering three pixels, the first pixel is this,
the value is R 100, 0, 0; the second is 0, 100, 0; the third one is 100, 100, 100. So, I can
determine the marginal median, so marginal median will be 100, 100, 0, but vector median I can
apply some algorithm. So, vector median may be something like this. So I will explain how to
determine the vector median. So, in this case you can see this 100, 100, 0, that is not available in
the original image that pixel values, that is nothing but the false color, the color distortions.

(Refer Slide Time: 22:45)

So, for color image processing what I can do, you can see here. I have the image, so I have the
vector pixel RGB. So, I can apply transformation, so T is the transformation. And after doing this
transformation I can do the processing, and after this I can do the inverse transformation to get
the RGB value.

So, what is this transformation? Suppose, the transformation may be the RGB values I can
convert into HSI, the hue, saturation and the intensity. And after this I can process only the
intensity component or maybe I can process intensity component and the color components
separately, that is the processing.

After this I have to do inverse transformation, that means the process HSI value is converted into
RGB, that is the processing technique.

(Refer Slide Time: 23:41)

Here I am showing the same thing: the RGB values are converted into HSI. After this I am doing the processing, maybe I can only select the intensity component
or maybe I can do the processing separately, I component and saturation component and the hue
component. After processing I am getting HSI and after this we have to do inverse
transformation, that is HSI is converted back to RGB. That is the processing technique.

(Refer Slide Time: 24:17)

In this case, again, I am showing the same example. In the first figure if you see, in the first
figure what I am doing, corresponding to this input image I have the RGB components. And in
this case, without converting RGB into HSI, I can do the processing for R component, G
component and the blue component. So, after processing I am getting R dash, G dash and B
dash, so I am getting the output image like this.

But in the second case, if you see the second case what I am doing, from the input image I have
the RGB components. So RGB components and that I am converting into HSI components, and
after this I can do the processing, so I am getting H dash, maybe the S dash or the I dash I will be
getting. And after this, I can convert the HSI into R component, G and B corresponding to the
output image.

(Refer Slide Time: 25:18)

Here I have shown some examples, the contrast enhancement. So first, you can see the input
image, input image is this. And what I am doing, first I am considering the contrast enhancement
and for this I am only considering the intensity component I am considering, and I am applying
the histogram equalization technique. So, corresponding to this I have the second image.

In the second case, I am processing each of the RGB components separately, that is, I am doing the histogram equalization for the R component, the G component and the B component, and corresponding to this I have the third image. You can see the
difference between these two images, the image number two and the image number three. In the
image number two, what I am considering, only I am considering the intensity component of the
image, that is the color value. And in the second case what I am considering, I am considering R
value, G value and the blue value and processing separately.

(Refer Slide Time: 26:26)

In this case, I am showing the contrast enhancement. The first case if you see, a is the original
image, b is the contrast enhancement by processing only the intensity component, so in the this
case I am considering only the intensity component. The third case is the processing of each
RGB component for contrast enhancement.

The second example if you see, I am doing the spatial filtering. So d is the original image, e is
the filtering on each of the RGB components, that means e is the filtering on each of the RGB
components, that is separately I am doing. And last one is filtering on the intensity component
only, that I am considering. You can see the difference between these two cases.

(Refer Slide Time: 27:18)

Again, I am considering another example, the spatial filtering. So, in this case what I am
considering? I am considering the input image, the first one is the input image. And if you see
the second image, what is the second image? I am doing the filtering. And in this case what I am
considering, filtering on each of the RGB components separately, that means I am considering
the R component, G component and the blue component separately.

In the second case, I am considering only the intensity component and then applying the spatial filtering. Spatial filtering I have already explained: if I apply the 3 by 3 averaging filter, the mask is nothing but all ones, 1, 1, 1, 1, 1, 1, 1, 1, 1, scaled by 1/9. This mask I can apply only to the intensity component, and you can see the difference between the second image and the third image.
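A rough sketch in Python of the two alternatives, using NumPy and SciPy; the image array img is assumed to be an RGB image of shape (H, W, 3) with float values, and scaling the channels by the ratio of smoothed to original intensity is just one simple way of smoothing the intensity while roughly preserving the chromaticity, not the exact HSI procedure:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def smooth_per_channel(img, size=3):
        # filter R, G and B separately with a size x size averaging mask
        out = np.empty_like(img)
        for c in range(3):
            out[..., c] = uniform_filter(img[..., c], size=size)
        return out

    def smooth_intensity_only(img, size=3):
        # smooth only the intensity I = (R + G + B) / 3 and rescale the channels
        i = img.mean(axis=-1)
        i_smooth = uniform_filter(i, size=size)
        scale = i_smooth / (i + 1e-8)
        return img * scale[..., None]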

(Refer Slide Time: 28:25)

The next example I am showing that is I am considering the salt and pepper noise. And after this
what I am considering, the filtering on each of the RGB components I am doing so, I am
applying the median filters, you can see the median filters here. So, I am applying the median
filters to remove the salt and pepper noise.

And in the second case what I am considering, I am considering the filtering only the intensity
component, that means, I am applying the median filter only for the intensity component and you
can see the output, the output is this. In the first case what I am considering, the filtering on each
of the RGB components and I am applying the median filter to remove salt and pepper noise.

(Refer Slide Time: 29:10)

And this edge detection, I am going to discuss in my next classes the edge detection, but here I
have shown one example. So, my input image is this and what I am considering, in the first case
I am only considering the intensity component for edge detection and corresponding to this my
output is this. In the second case what I am considering, edge detection on each of the RGB
components and corresponding to this my output is this. So, I have shown these examples to
explain the concept of color image processing.

(Refer Slide Time: 29:49)

The next point is the pseudo color image processing, that is the false color processing. That is
what is the false color? That means I can convert that grayscale value into the color value. So, for
this I have to define some transformation and based on these transformation functions, I can
convert the grayscale value into the color value.

Now, one point is important, that these transformation, these are the transformation on the gray
level values of an image, but they are not functions of positions. And this method produces a
composite image whose content is modulated by the nature of the transformation functions.

So, I have to consider some transformation functions and what is the method? So, this method
produces a composite image whose content is modulated by the nature of the transformation
functions and one point is important. So, these are the transformation on the gray level values of
an image, but they are not a function of positions.

(Refer Slide Time: 30:57)

And here you can see, I have to define some transformation for the R component, G component
and the blue component. And that means I have to assign the red color, green color, blue color
corresponding to gray level pixel intensity values. So, this is nothing but the pseudo coloring,
pseudo color means the false coloring.

(Refer Slide Time: 31:18)

So here you can see the example. So I have the grayscale image, and I am converting into the
color image. So, here you can see the procedure is something like intensity slicing I am doing. So
assigning a yellow color to pixels with value 255 and a blue color to all other pixels.

(Refer Slide Time: 31:40)

Now, in case of the pseudo coloring, so already I have explained about the transformation. So
you can see the input is the grayscale image f(x, y), and I am considering a color coordinate transformation. After the transformation, I am getting the R value, the G value and the B value, and these I can display as a color image. So, I
have to define the transformation, so how to define the transformation, in my next slide you can
see.

(Refer Slide Time: 32:09)

So, I have the grayscale image value, grayscale image. And I am considering the transformation
for the red component, transformation for the green component, transformation for the blue
component and corresponding to this I have the red component, a green component and a blue
component. So, I have these components red, green and blue components. So, let us see what
type of transformation we have to use.

(Refer Slide Time: 32:37)

So, one example you can see, the X-ray image of a bag suppose, here I am considering, and we
have to assign colors. In this case, how to do this? So, if you see, the first one is I am considering
the grayscale image and I am assigning the colors, the color coded image I am getting.

So for this I am considering some transformation, you can see the transformation, the
transformation for the red component, green component and that blue component. And if you see
these sinusoid, the frequency and the phase of the sinusoid are not same, then in this case
corresponding to particular suppose this portion, I can assign a particular color based on the R
value, based on the G value and based on the B value.

So, corresponding to this I can assign a particular color. Similarly, corresponding to this portion,
suppose if I consider this portion, I have the R value, G value and the blue value, and
corresponding to this I can define a particular color. So, you can see I am considering the
transformation, that means I am considering sinusoids whose frequency and phase are different for the red component, the green component and the blue component. So, changing the phase and the frequency of the sinusoids can emphasize different gray level ranges with different colors in the grayscale image.

Suppose all these transformations have the same phase and also the same frequency; then the output will be monochromatic, because I will not be getting a color image. I am repeating this: if all these transformations have the same phase and the same frequency, then the output will be monochromatic. So, you can understand the concept of this transformation, and by using such transformations I can convert the grayscale image into a color image.
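A minimal sketch of this kind of pseudo coloring in Python with NumPy; the particular frequencies and phases below are arbitrary choices made only to illustrate the idea:

    import numpy as np

    def pseudo_color(gray, freqs=(0.02, 0.03, 0.04), phases=(0.0, 1.0, 2.0)):
        # gray: 2-D array of gray levels in [0, 255]
        g = np.asarray(gray, dtype=float)
        channels = []
        for f, p in zip(freqs, phases):
            # sinusoidal transformation of the gray level for one channel
            channels.append(0.5 * (1.0 + np.sin(2.0 * np.pi * f * g + p)))
        # stack the three transformed channels into an RGB image in [0, 1]
        return np.stack(channels, axis=-1)

    # identical frequencies and phases for all three channels would give a
    # monochromatic result, since R = G = B at every pixel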

(Refer Slide Time: 34:37)

Next important point is the color balancing. So, what is the color balancing? Generally if I take
the image by a color camera, the digital camera, so the RGB output should be such that when
mixed equally it should produce white color. That means, suppose if I consider R is equal to G is
equal to B is equal to 1, and corresponding to this what I should get? I should get the white color.
But actually, because of the sensor imperfections, I am not getting the actual color, that is called
the color imperfection.

So, for this I have to do color balancing. Any one of the components, the R component, the G component or the B component, may become weak, and that is why a color mismatch will occur. That is the problem of color distortion, and for this I have to do color balancing.

One example: in a digital camera, generally the blue channel is noisy compared to the R and G channels. So, what is the procedure of color balancing? Select a known gray level, say white; for a white point in an image I know that R = G = B = 1 should hold.

And in this case, I have to find a transformation to make R is equal to G is equal to B because in
the real case it is not equal but actually it should be equal for the white pixel, R should be equal
to G, G should be equal to B for a perfect image. But in this case really it is not true. So, that is
why I have to find a transformation to make R is equal to G is equal to B.

After this what I have to consider keep any one of these component fix and match the other
components to it and by this we can define a transformation for each of the variable components.
So, I can make R fix and I have to find the transformation for that green component,
transformation for that blue component. So that R is equal to G is equal to B. And after this I
have to apply this transformation to all the pixels of the image to balance the entire image. So,
this is the procedure.

So, for a known pixel, that is suppose the known pixel is the white pixel, I know this condition
that R is equal to G is equal to B is equal to 1, but actually because of the color imperfection, it is
not true. So, what I have to do? I have to find a transformation to make R equal to G is equal to
B. After this I can make one component fix, and what I have to do? I have to match the other
components to it and for this I have to define a transformation. And I have to apply these
transformation to all the pixels of the image to balance the entire image.
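A minimal sketch of this procedure in Python with NumPy, assuming the simplest possible transformation, a per-channel gain fixed from a known white (or neutral) patch; the mask and the [0, 1] float image are assumptions of the example:

    import numpy as np

    def white_balance(img, ref_mask):
        # img: float RGB image of shape (H, W, 3) with values in [0, 1]
        # ref_mask: boolean mask marking a patch known to be white/neutral
        ref = img[ref_mask].mean(axis=0)       # average R, G, B in the patch
        # keep the red channel fixed and scale G and B so that R = G = B there
        gains = ref[0] / (ref + 1e-8)
        gains[0] = 1.0
        return np.clip(img * gains, 0.0, 1.0)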

(Refer Slide Time: 37:49)

So, here I can show the example. So, corresponding to this, the known portion here, you can see
the eyeball here. This portion is supposed white, corresponding to this pixel what I can do the R
is equal to G is equal to B is equal to 1 that I know. And after this I am applying the color
balancing technique and corresponding to this I am getting the output image. So, my output
image is this, that is the color balance image.

(Refer Slide Time: 38:14)

After this I can define the histogram of a color image. So, in this case what I can do, the
histogram of luminance and chrominance components separately I can do, or otherwise that color
histograms for the R component, G component and the B components also I can do.

And this color histogram is quite important, this color histogram I can consider as a feature, the
image feature. So, for this what I can do, I can determine the histogram for the luminance
component and the chrominance component. Chrominance components means the hue and
saturation, like in this image.

(Refer Slide Time: 38:48)

And in this case, I have given one example of the contrast enhancement by histogram
equalization technique. So, here you can see I am applying the histogram equalization technique.
So, first RGB is converted into HSI color space, this is the first step, after this I am applying the
histogram equalization technique for the I component and maybe I have to do saturation
correction, correct the saturation if needed and after this the HSI is converted back to RGB.

So, you can see I have the original image, after this I am applying the histogram equalization
technique only for the I component, so that is my result. And after this I have to do saturation
correction, so after saturation correction, this is my output image.
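A rough practical sketch of this pipeline in Python, assuming OpenCV is available; for convenience it uses OpenCV's HSV space, a close cousin of HSI in which the V channel plays the role of the intensity, rather than an exact HSI implementation, and it omits the saturation correction step. The file names are placeholders:

    import cv2

    # read a BGR image (placeholder file name)
    bgr = cv2.imread("input.jpg")

    # convert to HSV and equalize only the value (intensity-like) channel
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    v_eq = cv2.equalizeHist(v)

    # merge back and convert to BGR; hue and saturation are left untouched
    out = cv2.cvtColor(cv2.merge([h, s, v_eq]), cv2.COLOR_HSV2BGR)
    cv2.imwrite("output.jpg", out)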

(Refer Slide Time: 39:36)

Next is color image smoothing. One approach is the per-color-plane method: for the RGB or CMY color models, smooth each color plane using a moving average and combine the planes back to RGB. So, I am applying the moving average filter to the R component, the G component and the B component for averaging.

In the second case, again, I am showing here, the second case what I am considering, I am only
considering the intensity component for smoothing, that is the averaging but without considering
the hue component and the saturation component. That means, I am only considering the
intensity component, I am not considering the hue and the saturation component.

(Refer Slide Time: 40:27)

And in this case, you can see the example, the color image, the input is the color image. And I
have shown that red image, the red color component, the green and the blue components of the
image.

(Refer Slide Time: 40:41)

And in this case, you can see the color image and I have shown the hue component, the saturation component and the intensity component, which you can display because RGB can be converted into HSI by using the transformation. I have given the equations, so by using these equations you can convert the color image into hue, saturation and intensity.

(Refer Slide Time: 41:04)

And in this case, I have shown the example of the color image smoothing, smooth all RGB
components, the first case. The second case, only I am considering the intensity component and
smooth only the I component of the HSI color model, you can see.

(Refer Slide Time: 41:22)

And this is the difference between these two images. One is the first image, the first image you
can see, the first image is this that is I am considering the RGB components, the second image is
only I am considering the intensity component. And the difference between these two I am
getting this one, you can see the difference between these two.

I can also apply a high-pass filter, that is, color image sharpening. Again I can do it like this: I can consider the RGB components, or I may consider only the intensity component of the HSI model, as I have already explained, and this is the difference between the results from the two methods in the previous slide.

(Refer Slide Time: 42:07)

After this I will discuss the concept of the median filter, already I have explained the two
techniques of color image processing, one is the marginal processing and another one is vector
processing. And I have explained what is the problem with marginal median. So, if I determine
the marginal median, then the color distortions take place.

Marginal median means, if I consider R, G and B separately and determine the median for the R component, the G component and the B component independently, then color distortions take place. That is why I cannot apply the marginal median filter, that is, the scalar median filter.

So, for this I have to consider the another technique, that is called the vector median filter. So for
vector median filter we have to consider one window, and in this window I have to apply the
vector median filter.

(Refer Slide Time: 43:12)

In this case you can see one example. I have shown the original image and after applying the
scalar median filter, that is the marginal median filter, that means I am applying the median filter
for the R component, G component and the B component separately. Then in this case the color
distortions take place, then in this case you can see the output image corresponding to marginal
median filter, the scalar median filter.

(Refer Slide Time: 43:43)

So, that is why we have to consider the concept of vector median filter. So, how to develop the
vector median filter? So, what is the concept behind the vector median filter?

(Refer Slide Time: 43:56)

So, for vector median filter suppose, I want to find a distance between two pixels. So, what is
this vector median, suppose the distance between two pixels in terms of RGB value, and I am
considering one window and in this window I am finding the distances.

So, in this case I am considering the one pixel, the pixel is I x, y and corresponding to this pixel
the RGB value is RGB value is R1, G1, B1 and I am considering, suppose the median filter, the
vector median filter. In this case I am considering the median pixel, corresponding to a median
pixel Im x, y, the R value, G value and the B value will be Rm, Gm and the Bm.

So, if you see this expression, what is the meaning of the median? So, if I find a distance
between the pixel I x, y and the median pixel, that is the meaning of this, and summation over the
window, so I have to consider all the distances and after this I have to take the summation
corresponding to that particular window. And also I can determine the distance between two
pixels, distance means, the distance in terms of RGB values and the same window I am
considering.

Now, the key property is that the sum of distances from the median pixel to all the pixels in the window is less than or equal to the sum of distances obtained from any other pixel of the window. How do we calculate the distance? The distance is in terms of the RGB values, and here I am considering the Euclidean distance. So, the summed distance between the pixels and the median pixel is less than or equal to the summed distance computed from any other pixel. That is the motivation behind the vector median filter.

Pictorially, I can show it like this. Suppose this is the median pixel; I can find the distances between the median and the other pixels and add them up. If I consider any other pixel and find its distances to the other pixels within this window, then the sum of distances for the median pixel will be less than or equal to the sum of distances for that pixel. So, pictorially I have shown this one.

Now, so how to calculate vector median? So, for calculating the vector median, I can consider
one window, maybe the symmetric window I am considering, the 3 by 3 window I can consider.
Suppose I am considering a 3 by 3 window, so this 3 by 3 window is considered and I am
calculating the distances d1, d2, d3, d4, d5, d6, d7, d8, d9.

Now, in this case, corresponding to this 3 by 3 window, my center pixel is this, this is my center
pixel. So, if I consider this image suppose, corresponding to this 3 by 3 window my pixel value
are z1, z2, z3, z4, z5, z6, z7, z8 and z9 that is the image, this is the image I am considering. Now,
corresponding to the center pixel, the center pixel is z5 I want to determine the median value, the
vector median value. So, for this what I am considering, the 3 by 3 window I am considering and
I am finding the distances d1, d2, d3, all the distances I am calculating.

So how do we calculate the distances? What is the distance d1? For d1, I find the distance between this pixel and all the other pixels of the window, computing each distance in terms of the RGB values, and after this I take the summation.

So, I can write that d1 is nothing but the sum of the distances from this pixel to all the pixels of the window, with each distance measured in terms of the RGB values; that is the meaning of d1. Similarly, I can calculate d2, d3 and so on, so all the distance sums I can calculate.

Now, out of all the distances I have to find the minimum. I have already calculated all the distances d1, d2, d3, d4, d5, d6, d7, d8 and d9, and out of these I have to determine which one is the minimum. Suppose, in this example, d2 is the minimum. Then, corresponding to d2, what is the corresponding pixel? The corresponding pixel value is z2.

So that means, the value z5 will be replaced by the pixel value z2, because the distance d2 is
minimum and corresponding to this d2 what is the pixel value? The pixel value is z2, and I am
determining the median value corresponding to the center pixel, the center pixel is z5. So, that
means the z5 will be replaced by z2, z2 is the pixel value, the pixel value that is the RGB, RGB
value I am considering, that is the vector pixel.

So, this is the concept of the vector median. So, by this algorithm, we can determine the vector
median. So, we have to determine the distances, all the distances I have to determine d1, d2, d3.
And after this is what I have to consider, I have to find a minimum distance. And from the
minimum distance I can identify the pixel. And that center pixel value will be replaced by that
pixels value. So, this is the algorithm for vector median filter.
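Here is a minimal sketch of this algorithm in Python with NumPy for a single window of vector pixels; the function name is my own, and a complete filter would slide such a window over the whole image with suitable border handling:

    import numpy as np

    def vector_median(window):
        # window: array of shape (N, 3), the RGB vector pixels of one window
        window = np.asarray(window, dtype=float)
        # pairwise Euclidean distances between all vector pixels
        diffs = window[:, None, :] - window[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        # d_i = sum of distances from pixel i to every pixel in the window
        d = dists.sum(axis=1)
        # the vector median is the pixel with the smallest distance sum
        return window[np.argmin(d)]

    # with the earlier three-pixel example, the result is one of the inputs
    print(vector_median([[1, 5, 6], [2, 4, 3], [3, 7, 1]]))  # -> [2. 4. 3.]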

(Refer Slide Time: 52:53)

So, in this case, here already I have explained the concept of the vector median filter and I am
considering one window. And in this window the pixels are x1, x2, like this, these are the pixels,
these are the vector pixels.

(Refer Slide Time: 53:08)

After this, the next step, as already explained, is to find the distances and arrange all the distances in ascending order. In this case, I have to find the minimum distance, and corresponding to the minimum distance I have to find the vector pixel. Suppose, corresponding to the minimum, the vector pixel is this one, x1.

(Refer Slide Time: 53:43)

And after this, what will be the vector median? The vector median will be x1, because corresponding to x1 the distance sum was minimum; so the vector median will be x1, that is the vector pixel. And as I have explained, I am considering the Euclidean distance, that is, the square root of the sum of squared differences.

So, for this I have to consider the Euclidean distance; I have already explained how to determine the distance in terms of RGB values. Here you can see an example: this is my input image, this is the image corrupted by impulse noise, that is, salt and pepper noise, and after the vector median filter I get this image.

(Refer Slide Time: 54:33)

And in summary, this vector median filter algorithm I can write like this. The concept is the same: mainly I have to find the minimum distance sum and corresponding to this minimum
distance I have to find the vector median. And for this I have to consider one symmetric mask,
the mask is W, that mask I am considering.

(Refer Slide Time: 54: 54)

In this example I am showing salt and pepper noise, 15 percent salt and pepper noise. And this is the resulting image after applying the vector median filter.

(Refer Slide Time: 55:11)

So, up till now I discussed about the concept of the vector median filter. The edge detection and
the segmentation, image segmentation I will discuss later on. But for the color image also I can
apply the same algorithm, the edge detection algorithm, and the image segmentation algorithm I
can apply for the color image also.

In a color image what I can do is convert the RGB into HSI, and for the I, that is the intensity
component, I can apply these algorithms, that is, the edge detection algorithm. So,
when I will discuss the concept of the edge detection and the color image segmentation or the
image segmentation, then you can understand this concept. Mainly the concept is very similar to
the grayscale image, for the intensity component I can apply the edge detection technique. So,
this is about the edge detection and the color image segmentation.

(Refer Slide Time: 56:06)

Now, I have highlighted the problems with processing of the color image. So, one problem
already I have explained, that is the problem with the marginal processing. Marginal processing
means the RGB components are processed separately, then the color distortions take place. And
one example I have given, that is the scalar median filter. And for this I have discussed the
concept of the vector median.

Another point is that the colors recorded by a camera are heavily dependent on the lighting
conditions. So, here in this case you can see the different lighting conditions, for different
lighting conditions the color will be different. So, in this case my objective is to determine the
actual surface color from the image color, because in the camera I will be getting the images, but
actual surface color that depends on the lighting conditions. So, actual surface color maybe
different from the image color, because image color depends on the lighting conditions. So, what
is the actual surface color, that I have to determine.

(Refer Slide Time: 57:16)

In this case you can see so, different lighting conditions and images are taken, first one is by
flash, the second one is by the tungsten lamp. So, these two cases I have shown. And in this case
you can see that the color is different, because different lighting conditions I am considering.

(Refer Slide Time: 57:37)

And again I am showing another effect of different lighting conditions and different images you
can see.

(Refer Slide Time: 57:44)

So, what we have to consider is that for scientific work the camera and the lighting should be
calibrated. For multimedia applications, this is more difficult to organize, so for multimedia
applications algorithms exist for estimating the illumination color.
So, that means, my objective is basically how to detect the actual surface color from the image
color that we have to consider. So, for this we have algorithms, so that we can estimate the
illumination color.

(Refer Slide Time: 58:23)

For this there is an algorithm, this algorithm is called Color Constancy Algorithm. So, what is
the color constancy algorithm, then in this case, it is very important to determine the actual
surface color of an object from the image color, because the image color is affected by the
lighting conditions. And in this case, what is the objective of this algorithm? That is the objective
is to correct the effect of the illuminant color.

Now, for this, what we can consider is to determine some color invariant features, or to transform
the input image such that the effect of the color of the light source can be eliminated. So, this is
one approach: either I consider color invariant features, or I consider some suitable
transformation.

So, that the effect of the color of the light source can be eliminated. So, color constancy is the
ability to recognize colors of object independent of the color of the light source, that is the
objective of that color constancy algorithm. So, how to determine actual surface color from the
image color.

(Refer Slide Time: 59:40)

So, in this case I have shown the response of the camera, that is we have the sensors like R
sensor, G sensor and the blue sensors. In a digital camera, we have the three types of sensors, one
is for the red component, one is for the green component and one is for the blue component. So,
RGB response you can see here. And in this case, I am considering the E lambda, E lambda is
the spectral power distribution of the light source, and S lambda is the reflectance function, that
is the albedo.

Albedo, in my earlier class I discussed about the albedo, and the symbol I have shown like this,
albedo is mainly rho lambda. But in this case I am denoting the albedo by S lambda, the
reflectance function of the surface. Lambda is the wavelength of the light, and I am considering
the spectral sensitivity of the RGB camera sensors.

So, for R component, R lambda, G lambda and B lambda that is the spectral sensitivity of the R
component, G component and the blue component of the camera sensors, RGB sensors. And in
this case I am considering the wavelength lambda, so that is why I am doing the integration over
the visible spectrum, that is all the lambdas I am taking.
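
Written out, the image formation model just described takes the standard form below (a reconstruction in the notation used above, with omega denoting the visible spectrum):

\rho_R = \int_{\omega} E(\lambda)\, S(\lambda)\, R(\lambda)\, d\lambda, \qquad
\rho_G = \int_{\omega} E(\lambda)\, S(\lambda)\, G(\lambda)\, d\lambda, \qquad
\rho_B = \int_{\omega} E(\lambda)\, S(\lambda)\, B(\lambda)\, d\lambda

Here rho_R, rho_G and rho_B are the recorded camera responses, E(lambda) is the spectral power distribution of the light source, S(lambda) is the surface reflectance (albedo), and R(lambda), G(lambda), B(lambda) are the spectral sensitivities of the camera sensors.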

(Refer Slide Time: 61:03)

So, what is the concept of the color constancy? Here you can see I have r1, r2, r3 like this, these
are the vectors. So, vector ri is the i-th illuminant-dependent RGB camera response triplet which
can be measured in a camera image. So, this r1, r2, all this is illuminant dependent RGB camera
response.

And after processing, that is the processing means this algorithm, the color constancy algorithm I
am getting i1, i2, i3 like this I am getting, that is the illuminant independent output I am getting.
That is, it is independent of the light source, that is the objective of the color constancy
algorithm.

But in this case, you can see that E lambda, R lambda, G lambda and B lambda are in general
unknown. That is why this problem is an under-constrained problem: E lambda is not available,
R lambda is not available, G lambda is not available and B lambda is not available, so it is an
under-constrained problem.

(Refer Slide Time: 62:15)

So, for color constancy, in the first approach the objective is to replace an image by suitable
features which are invariant with respect to the light source. That means, I have to extract some
color invariant features, that is invariant with respect to the light source. This is the first
approach.

In the first approach it is not important to determine the direction of the light source or maybe the
characteristics of the light source, it is not important, but in this case what is the important, the
important is I have to extract suitable features which are invariant with respect to the light
source.

In the second approach, the objective is to correct images for deviations from a canonical light
source. So, that means, in this case the light source characteristics I have to understand and I
have to consider the canonical light source and corresponding to this I have to correct images for
deviations. So, color corrections I can do corresponding to this canonical light source.

So, based on these principles, I have some algorithms, the popular algorithms: one is the
Retinex-based (path-based) algorithm, the Gray World algorithm that is also very popular, the Gamut Mapping
algorithm, Gray edge algorithm. So, there are many algorithms, I am not going to discuss all
these algorithms, but briefly I can explain the concept of Retinex based algorithm, otherwise,
these algorithms you can read from the research papers.
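
As one small illustration, a minimal sketch of the Gray World idea in Python is given below (only a sketch, not the Retinex method: it assumes NumPy and a float RGB image in [0, 1], and it relies on the Gray World assumption that the average reflectance of the scene is achromatic, so each channel is rescaled to have the same mean).

import numpy as np

def gray_world(image):
    """Gray World colour constancy: rescale each channel so that the three
    channel means become equal, compensating for the colour of the illuminant.

    image: float array of shape (H, W, 3) with values in [0, 1].
    """
    channel_means = image.reshape(-1, 3).mean(axis=0)    # mean of R, G and B
    gray = channel_means.mean()                          # target gray level
    gains = gray / (channel_means + 1e-8)                # per-channel (diagonal) gains
    return np.clip(image * gains, 0.0, 1.0)              # corrected image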

(Refer Slide Time: 63:53)

So, here I have shown the surface reflectance meets illuminant. I have shown the surface
reflectance, if you see this one surface reflectance, surface reflectance I am considering that is a
profile I am considering. And I am considering two light sources, one is the sunlight, that is
illuminant one, and the other is the skylight, illuminant two.

After this if I want to determine the luminance, the luminance is the L x, y, it is the luminance
and what is E x, y, E x, y is the illuminant, that is the irradiance. And R x, y I am considering
that is the surface reflectance, that is nothing but the albedo, albedo of the surface rho x, y.

So that means if I multiply E and R I will be getting L, the luminance I will be getting. So in this
case, if I multiply these two, I will be getting this one. So, you can see the effect of the light
source here: initially you can see the surface profile, the same surface profile, but I am
considering two light sources, one is the sunlight and the other is the skylight. After this you can
see the effect of the light source, and that effect of the light source I have to compensate.

(Refer Slide Time: 65:11)

So, what is the Retinex theory? So, this L x, y is the luminance, E x y is the illuminant and R x, y
is the surface reflectance, that is the albedo of the surface. So, what is the objective? The
objective is to recover the surface reflectance or the albedo. So, I have to extract surface
reflectance or the albedo. So, how to extract the surface reflectance?

(Refer Slide Time: 65:41)

In this case I am showing one example. So, suppose the true reflectance I am showing here, this
is the profile of the true reflectance. And I am considering the illumination, the illumination is
something like this the profile. And if I multiply Rx into Ex, then I will be getting Lx. So, this is
Lx I am getting. Now, in this case, if I take the log of the Lx, then I will be getting this one the
log of Lx.

And after this, from this if I take the derivative of log Lx, then I will be getting this one. And after
this if I take the integration of this, then I will be getting this one. That means, I can compute the
reflectance. This is my true reflectance, the reflectance I can compute it from Lx, you can see
from Lx I can compute the reflectance. Here, first I am taking the log, after this I am taking the
derivative and after this if I take the integration of this I can compute Rx.

(Refer Slide Time: 66:46)

So, this concept again I am going to explain. If I consider the camera response at a particular
point, at a point suppose x, what is the camera response? C x is equal to k times I x times rho x,
where k is a constant because I am considering a linear camera, I x is the illumination and rho x
is the albedo. So, if I take the log of this, the multiplication can be converted into a summation:
log C x is equal to log k plus log I x plus log rho x.

So, in this case log K is nothing but the dc component, that we can neglect. This is the
illumination, illumination changes slowly over space and this is the reflectance that is the albedo,
the albedo changes quickly over space. So, illumination changes slowly over space and the
albedo changes quickly over space because of varying phase angles, like in the edges the albedo
changes very quickly.

Now in this case, I can show pictorially how can you recover the reflectance, because the
objective is mainly to recover the reflectance. So, suppose this is my log rho, that is reflectance I
am considering, this is the profile I am showing the reflectance profile with respect to some
distance I am considering. And also I am considering the illumination suppose, the illumination,
the log of illumination something like this log I.

So, if I combine these two, that is, if I add log rho and log I (addition, not multiplication), then I
will be getting log C. So, log C will be something like this; this is log C, because I have to add
these two, and if I add these two, then I will be getting this one.

And if I take the differentiation of this, that is d log rho divided by dx, what will I be getting?
Corresponding to the first jump I get a spike in the derivative, corresponding to the next jump
another spike, and for the downward jump I get a spike on the negative side; so I will be getting
this one, that is d log rho divided by dx.

And corresponding to this if I take the differentiation of this one, d log I divided by dx, then I
will be getting something like this. And in this case, if I add these two because, as per the
equation I have to add these two. So, if I add these two, then what I will be getting? I will be
getting this one. That means, in this case I will be getting this one is d log c divided by dx.

And from this, if I do the thresholding of d log c divided by dx, then after thresholding I will be
getting this one; the small values coming from the slowly varying illumination are removed.
After this, if I do the integration, I will be getting this one, which is nothing but log rho plus an
integration constant c; that is, the lightness is recovered, you can see. So you can see here that I
am just recovering this one.

So, this is recovered, so that means, the lightness is recovered that means, I am recovering the
reflectance that is the concept of this theory. So, I am not explaining in details, but if you want to
see the solution of this problem, then you can refer to a book or some research papers on how to
solve this problem. But the main concept is this, that is, how to determine the
surface reflectance.
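
A minimal 1D sketch of this log-differentiate-threshold-integrate pipeline in Python is given below (a sketch only, assuming NumPy; the threshold value and the cumulative-sum integration are illustrative choices, and the recovered reflectance is only determined up to a scale factor).

import numpy as np

def recover_reflectance_1d(luminance, threshold=0.1):
    """1D lightness recovery: log, differentiate, threshold, integrate.

    luminance: strictly positive 1D array L(x) = I(x) * R(x).
    Small derivatives of log L are attributed to the slowly varying
    illumination and removed; the rest is integrated back to give the
    reflectance up to an unknown scale factor.
    """
    log_l = np.log(luminance)
    d = np.diff(log_l)                               # derivative of log L
    d[np.abs(d) < threshold] = 0.0                   # suppress slow illumination changes
    log_r = np.concatenate(([0.0], np.cumsum(d)))    # integrate back
    return np.exp(log_r)                             # reflectance, up to a scale factor

# Tiny example: a step reflectance under a slowly varying illumination.
x = np.linspace(0.0, 1.0, 200)
reflectance = np.where(x < 0.5, 0.3, 0.8)
illumination = 1.0 + 0.5 * x
estimate = recover_reflectance_1d(reflectance * illumination)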

(Refer Slide Time: 72:58)

So, Retinex theory for 2D color also we can extend, like this one algorithm is by Land and
McCann. So, multiple 1D path they have considered. Another one is by Horn, so 2D analysis
based on Laplacian like this you will get many algorithms. But in this case we have to consider
some assumptions, the assumption is the reflectance changes abruptly and illumination changes
very slowly, that is already I have explained.

And also, I am considering the Lambertian reflectance characteristics. So, these are assumptions
for applying the Retinex theory for 2D color analysis. And another algorithm I can explain
briefly, how can you correct the color, the color is affected by the lighting source.

(Refer Slide Time: 73:54)

So, I have one image suppose, and suppose this region has some color suppose v1, this region
has another color v2, this region has v3 color, v4 like this I have these different colors
corresponding to this image and in this case suppose color distance lookup table I am
considering, that is nothing but the color knowledge base, color knowledge base.

So, in this case I am considering some transformation, the transformation is suppose T1, V1, T2,
V2, like this I am doing some transformation, T3, V3 I am considering this type of
transformation, these are the transformation. And from this I can determine expected color,
that is, the expected surface colors v1 dash, v2 dash, v3 dash, like this. So, here I am giving one
example of how to compensate for the effect of the light source, because due to the light source
the colors of the images will be different.

So, corresponding to this input image, suppose v1 is a particular color, v2 is another color, v3 is
another color like this, but that is different from the actual surface color. So, I know the actual
surface color that means I have to do some transformation to get the actual surface color. So, T1,
T2, T3 are the transformation. So, by using this transformation I can get the actual color.

So, v1 is not the actual color because of the light source, so I am doing some transformation T1,
V1, and corresponding to this transformation, I am getting the actual surface color. So, like this, I
have to apply some transformation. And after this, I am making a table, the lookup table I am
making that is nothing but the color knowledge base. And based on this I can do some
predictions. So what will be the expected color surfaces, I can determine.

So, suppose unknown color is coming, suppose I have one surface, so for this I have to see the
color in the lookup table and I can do the prediction. So, what will be the expected color of the
surface that I can determine from the color knowledge base.
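
A minimal sketch of this lookup idea follows (purely illustrative: the colour knowledge base, the observed colours and the per-channel gains used as the transformation T are all hypothetical values, not taken from the slides; the correction here is a simple diagonal scaling followed by a nearest-neighbour lookup).

import numpy as np

# Hypothetical colour knowledge base: known surface colours under a
# canonical (reference) light source.
knowledge_base = {
    "skin":  np.array([224.0, 172.0, 105.0]),
    "grass": np.array([ 60.0, 140.0,  60.0]),
    "sky":   np.array([135.0, 206.0, 235.0]),
}

def expected_surface_colour(observed_rgb, illuminant_gain):
    """Correct an observed colour for the illuminant (diagonal scaling T)
    and look up the closest known surface colour in the knowledge base."""
    corrected = np.asarray(observed_rgb, float) / np.asarray(illuminant_gain, float)
    name, colour = min(knowledge_base.items(),
                       key=lambda kv: np.linalg.norm(kv[1] - corrected))
    return name, colour

# Example: a colour observed under a slightly reddish light source.
print(expected_surface_colour([250, 150, 90], illuminant_gain=[1.2, 1.0, 0.95]))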

So, in this class I discussed some color models. These color models are based on the concept of
decoupling the color information from the intensity information; they are the HSI color model,
the Y Cb Cr color model and the YIQ color model. After this I discussed the concept of color
balancing, how to do the color balancing. After this, I discussed the concept of pseudo coloring,
that is, how to convert a monochromatic image into a color image.

After this I discussed the concept of the vector median filter, so what are the problems of the
marginal median filter and how to apply the vector median filter that concept I have explained.
So this is about today's class. Let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics & Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture-20
Image Segmentation

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing- Fundamentals
and Applications. Today, I am going to discuss the concept of image segmentation. Image
segmentation means the partitioning of an image into connected homogeneous region.
Homogeneity may be defined in terms of gray value, color, texture, shape, motion. So, based on
this I can do image segmentation.

So, today I am going to discuss some very fundamental image segmentation concepts and
algorithms, like the thresholding technique, the split and merge technique, the region growing
technique, active contours, the watershed algorithm and K-means clustering. The advanced
segmentation algorithms I will be discussing when I discuss the applications of computer vision.

Image segmentation is a preprocessing step of a computer vision system, so after image
segmentation I can extract image features for image classification, image recognition, let us see
what is image segmentation.

(Refer Slide Time: 1:41)

In this case, I have shown one computer vision system, the image analysis system, and you can
see I have the input image; after this I have to do some preprocessing and we have to do the
segmentation. Segmentation means the partitioning of an image into connected homogeneous
region. After this we can extract features from the images and this is the feature vector, you can
see. And after this we can do classification of the images or maybe we can go for object
recognition. So, this is a typical block diagram of a computer vision system, the image analysis
system.

(Refer Slide Time: 2:24)

So, what is image segmentation? It is the partitioning of an image into connected homogeneous
regions. Now, homogeneity may be defined in terms of gray value, color, texture, shape and motion.
Suppose, if I consider color, so suppose in this portion the color is almost same, that means it is
homogeneous. Similarly, if I consider that this region, corresponding to this region the color is
also almost same, that means it is homogeneous. So, that means, based on this that gray value,
color, texture, shape and the motion information, I can do image segmentation.

(Refer Slide Time: 3:02)

Here I have shown one example of image segmentation. You can see the first one is the input
image and the second one is the segmented image, the segmented output.

(Refer Slide Time: 3:12)

So, for segmentation, how do we define it mathematically, that is, what is the mathematical
definition of image segmentation? Mathematically, the segmentation problem is to partition the
image I into regions; the regions are R1, R2, up to Rn. So, I is equal to R1 union R2 union R3
and so on.

So, that will be my image and in this case, if I consider Ri intersection of Rj two different
regions, then in this case it will be phi, the null set, i is not equal to j. So, this region is defined
like this, that is if I consider all the regions then I will be getting the image and if I consider two
different regions and, that means in this case, Ri intersection with Rj will be phi.

For determining the homogeneity, I am considering one measure, that measure is called a
predicate. So, predicate of a region, the predicate of a region is true that I have considered for the
homogeneous region. And in this case, if I consider the adjacent regions, that is two regions, Ri
and Rj, the predicate of Ri union Rj will be false, because it is not homogeneous. So, based on
the homogeneity, I am defining this term, the term is the predicate.

So, for the homogeneous region the predicate will be true and for the non-homogeneous region
the predicate is false. So, this is the mathematical definition of image segmentation and based on
the predicate I can decide whether a particular region is homogeneous or not.

(Refer Slide Time: 5:01)

So, in this class I will discuss these techniques: the thresholding technique; a region based
technique, that is the region growing technique; the split and merge technique; for the edge based
approach, the active contour based technique; and for the topology based approach, the watershed
algorithm. Finally, I will discuss the K-means clustering algorithm for image segmentation. The
advanced segmentation principles I will discuss when I discuss the computer vision applications.

(Refer Slide Time: 5:36)

So, first one is thresholding, so here you can see the image histogram I have shown. So, this side
is the frequency and this side is the gray levels, all the gray levels and based on this algorithm
you can see if I x, y is less than threshold, then object is character and else the background is
paper. So, in this case what I am considering, suppose one paper is there, so the background is a
white background suppose, and I have the character here, I am writing some character in the
paper, that is the black.

So, in the histogram you can see this portion is the black portion and you can see the white is this
side. So, in this case, you can see number of pixels in the black that will be less than the number
of pixels of the white region, because white is basically the background, the background is white,
and if I consider the characters that is the black.

So, based on this algorithm, if you select the threshold at this point, the valley portion can be
used to find a threshold. So, suppose based on this if I select the threshold, then what I can do, I
can separate the objects that is the character from the paper, the paper is the background. And
corresponding to this case, this is my histogram corresponding to this case.

(Refer Slide Time: 6:57)

Suppose that an image f x, y is composed of light objects on a dark background. So, here you
can see this side is the dark and this side is the light. So, I am showing the histogram. So, the
image is composed of light objects on a dark background and corresponding to this, this is the
histogram.

Then, if I consider the threshold at the valley, the T is the threshold suppose, then objects can be
extracted by comparing pixel values with a threshold, the threshold is T. So, this is a principle of
thresholding.
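
A minimal Python sketch of this thresholding rule is given below (only a sketch, assuming NumPy and a grayscale image; here the threshold T is chosen by hand, for example from the valley of the histogram).

import numpy as np

def threshold_segment(image, T):
    """Global thresholding for light objects on a dark background:
    pixels with f(x, y) > T are labelled object (1), the rest background (0)."""
    return (image > T).astype(np.uint8)

# Usage with a hand-picked threshold taken from the valley of the histogram.
img = np.array([[ 20,  30, 200],
                [ 25, 210, 220],
                [ 15,  18, 205]], dtype=np.uint8)
mask = threshold_segment(img, T=100)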

(Refer Slide Time: 7:35)

And corresponding to this principle you can see, this is the input image, original image and after
this I am applying the thresholding principle. And you can see thresholded image you can get as
an output image.

(Refer Slide Time: 7:51)

And in this case, this is another example of the thresholding. You can see the original image and
corresponding to this original image, you can see the image histogram and already I have
mentioned this is the dark side and this is the light side, this is the light side that is the
background. And object is the foreground, that is the dark. So, you can see the number of pixels
in the light portion is greater than the number of pixels in the dark regions. And if I apply the
thresholding principle, then corresponding to this you can get the segmented output like this. So,
this is a segmented output.

(Refer Slide Time: 8:33)

And this is another example of thresholding, that is the fingerprint segmentation. So,
corresponding to this input image, this is my histogram and I can select the threshold, suppose
here. And based on the threshold, I can do the segmentation, the segmentation of the fingerprint
from the background that I can do.

(Refer Slide Time: 8:53)

But the problem with thresholding is that non-uniform illumination may change the histogram;
that is one problem. That is why it is very difficult to segment the image using only one global
threshold, and for this we may have to select local thresholds. So, that is the problem with
thresholding: non-uniform illumination can change the histogram of the image, and in that case it
is difficult to segment the image with the help of a single global threshold, so we have to consider
local thresholds.

(Refer Slide Time: 9:31)

So, in this case, you can see the gray level histogram I have shown and you can see in the first
figure I have shown only one threshold. In the second figure I have shown two threshold, T1 and
T2 for segmentation. So, this principle can be extended for color image also, because the color
image has three channels, already I have defined, R, G and B three color channels.

So, suppose even one object which is not separated in a single channel, suppose it is not
separated in a single channel, suppose in the R channel it is not separated, that might be
separable in other channels, suppose corresponding to the G channel if I consider that G channel,
the separation is possible, the segmentation is possible.

In case of the R channel, the objects cannot be separated in a single channel. So, that is why we
have to consider all the channels. One application of this is detecting and tracking faces based on
color, the skin color. So, for this I have to consider the RGB channels, the red channel,
green channel and the blue channels for the color image.

(Refer Slide Time: 10:41)

Another technique is the optimal thresholding. In this case, I can approximate the histogram by a
weighted sum of two Gaussians. The valley portion represents the threshold, so the threshold can
be selected from the valley portion, and this is called optimal thresholding.
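
A closely related and widely used way to pick such a threshold automatically is Otsu's method, which chooses T to maximize the between-class variance of the two pixel groups; the sketch below (assuming NumPy and an 8-bit grayscale image) is not the two-Gaussian fit itself, but it usually lands near the same valley.

import numpy as np

def otsu_threshold(image):
    """Choose a threshold by maximising the between-class variance
    (Otsu's criterion) computed from the image histogram."""
    hist, _ = np.histogram(image, bins=256, range=(0, 256))
    p = hist.astype(float) / hist.sum()
    best_t, best_score = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()             # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0       # class means
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        score = w0 * w1 * (mu0 - mu1) ** 2            # between-class variance
        if score > best_score:
            best_t, best_score = t, score
    return best_t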

(Refer Slide Time: 10:58)

Now, I will discuss another technique, region based technique, that is a region growing
technique. So, for this what I have to consider a region satisfies, certain homogeneity measure.

So, already I have defined a homogeneity in terms of gray value, maybe in terms of color value,
maybe in terms of texture value like this. So, this is based on these parameters, these attributes
that is based on the color information based on the gray value, based on the texture information, I
can define the homogeneity.

And in this case the user is required to select a point within a particular feature, that means within
a particular region, for example a point inside a region of a particular color. The region is then
expanded by adding pixels that are similar in brightness or color.

So, the concept is like this, suppose let us consider one image, this is the image and suppose
corresponding to this portion of the image we have some regions. Corresponding to this portion
of the image I am fixing a point, I am selecting a point. After this what I am considering, I have
to compare the nearby pixel with this pixel and if the difference in the gray value or maybe the
difference in the color value is less than a particular threshold, then in this case the second point I
can merge with the first point and I can grow the region.

And like this if I consider another neighborhood I can compare that pixel value with the other
pixel value, if this value is less than a particular threshold then I have to merge that one,
otherwise I have to neglect that one. So, like this, based on this point I can grow the region, in a
particular image I can consider a number of such points. So, these points are called the seed
points.

So, based on these seed points I have to compare the seed point with the neighborhood pixels, if
it is within a particular threshold, then in this case I have to consider that point, the neighborhood
point and that is mainly I am determining the homogeneity, if the difference in the gray value or
the color value of the seed point with the neighborhood pixel is less than a particular threshold,
then in this case I have to consider that point and I have to grow the region. So, this you can see
in the next slide what I have to consider.

(Refer Slide Time: 13:27)

So, for this image, I have to select some seed points S1, S2, S3, and the predicate will be defined
in terms of the seed points. What will be my predicate? The difference in intensity with respect to
the seed point is less than five; this is one example I am considering. So, first we start with
single-point regions consisting of the seed points. Already I have explained this one.

So, I have to select some seed points like this, and then add to each seed point's region the pixels
for which the predicate is true. That is, for a seed point, say S1, I want to compare that seed point
with the neighborhood pixels, and if the predicate is true, then I have to add those pixels to the
region of that seed point. I repeat this until all the pixels are segmented. In the case of color
images, these comparisons I have to make in terms of the RGB values.

So, how to compare? Suppose, one pixel, suppose the seed point is I x, y, and corresponding to I
x, y the R value is R1, G1 and B1, that is corresponding to the seed point. And I have to compare
the seed point with the neighborhood pixel, the neighborhood pixel is suppose I x dash, y dash,
and this value is R1 dash, G dash, B1 dash. And how to compare? So, for comparison I can apply
the Euclidean distance, so R1 minus R1 dash whole square plus G1 minus G1 dash whole square
plus B1 minus B1 dash whole square.

So, like this I can find the similarity between two pixels, and that is the predicate I am
considering. This predicate is used to determine the homogeneity condition: the pixels for which
the predicate is true are added to the seed point's region, and this procedure I have to repeat until
all the pixels are segmented out. This is the concept of region growing.
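
A minimal sketch of region growing from one seed point is given below (only a sketch, assuming NumPy, a grayscale image and the intensity-difference predicate with threshold 5 used in the example above; for a color image the same code applies if the absolute difference is replaced by the Euclidean distance between RGB vectors).

import numpy as np
from collections import deque

def region_grow(image, seed, threshold=5):
    """Grow a region from `seed` (row, col): a 4-neighbour is added when its
    intensity differs from the seed intensity by less than `threshold`."""
    h, w = image.shape
    seed_value = float(image[seed])
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not region[nr, nc]:
                if abs(float(image[nr, nc]) - seed_value) < threshold:   # predicate
                    region[nr, nc] = True
                    queue.append((nr, nc))
    return region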

(Refer Slide Time: 15:40)

So, what are the advantages and disadvantages of this technique? It is a very simple technique,
adaptive to gradual changes and tolerant to some noise. But the main problem is how to select the
initial seed points; that is the main difficulty. The initial seed points may be obtained by human
intervention, or seed point selection from the modes of the histogram can also be done: I have the
histogram, and from the modes of the histogram I can select the seed points.

(Refer Slide Time: 16:13)

So, in this case you can see this is the histogram of the image. So, what I am considering, I am
selecting the seed point S1, S2, S3. So, selecting the seed point S1, S2, S3 from the modes of the
histogram, that is one technique.

(Refer Slide Time: 16:30)

And this is the one example of a region growing segmentation technique. So, I have shown the
original image and you can see the region growing results, for this I have to select the seed
points, maybe some points like this we can select some seed points after this I have to grow the
regions for segmentation and finally, I am getting the segmented region like this in the final
image. So, in this case manually I have to select the seed points in the image or maybe I can
consider the other technique, that is, from the modes of the histogram I can select the seed points.
After this I have to grow that region and to do the segmentation.

(Refer Slide Time: 17:08)

In this case, I have shown one example, I have shown iteration 5, iteration 10, like this 20, 40,
iteration 70, iteration 90. And just growing the region and finally, I am getting the segmented
output after iteration 90, that is one example of that region growing technique.

(Refer Slide Time: 17:27)

Another very important technique is the region splitting
technique. The opposite approach to region growing is region shrinking, that is the splitting. It is
a top down approach and it starts with the assumption that entire image is homogeneous.

So, the assumption is the entire region is homogeneous and that this approach is opposite to the
region growing technique. If this is not true, the image is split into four sub images. This splitting
procedure is repeated recursively until we split the image into homogeneous regions. So this
concept I am going to explain in the next slide, what is the meaning of this.

(Refer Slide Time: 18:12)

So, for this first I am assuming that the entire image is homogeneous. So, in this case, I am
showing that entire image is homogeneous. If it is not true, the image is divided into four
regions. So, here you can see this is my first region, this is my second region, this is my third
region and this is the fourth region.

After this you can see in the first region, that is homogeneous, so no need to do splitting. Second
region is also homogeneous, so no need to do the splitting. The third region is also
homogeneous, so no need to do the splitting. And if I consider the fourth region, that is not
homogeneous, so what I have to do, I have to do the splitting. So, I have to do the splitting like
this into four regions.

And if you consider these two regions, they are homogeneous, so these two can be merged
together. And if you see these two regions, these are homogeneous that can be merged. So,
corresponding to this procedure, you can see the quadtree; its root is the region corresponding to
the entire image, and this is the node R naught.

R naught is split into four regions, the regions are R1, R2, R3, R4. Now, R2 is
homogeneous so no need for splitting, R3 is homogeneous so no need to do splitting, R4 is
homogeneous so no need to do splitting. But R1 is not homogeneous, so that is why it is divided
into four regions. The regions are R11, R12, R13, R14, so this procedure I have to repeat and
also we have to do the merging, the splitting and merging we have to do.

(Refer Slide Time: 20:08)

So, the algorithm will be like this. If a region R is not homogeneous, that is, if the predicate of R
is false, then split R into four sub-regions. After this, merge two adjacent regions Ri and Rj if
they are homogeneous, that is, if the predicate of Ri union Rj is true. Stop when no further
splitting or merging is possible. So, I have to repeat this, and finally I will be getting the
segmented image.
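
A minimal sketch of the splitting step is given below (only a sketch, assuming NumPy; the predicate used here, the intensity range of a block being below a small tolerance, is one simple choice of homogeneity measure, and the merging of adjacent homogeneous blocks is left out for brevity).

import numpy as np

def split(image, r0, c0, h, w, tol=10, regions=None):
    """Recursively split the block image[r0:r0+h, c0:c0+w] into four sub-blocks
    until every block satisfies the homogeneity predicate. Returns the list of
    homogeneous blocks as (r0, c0, h, w) tuples."""
    if regions is None:
        regions = []
    block = image[r0:r0 + h, c0:c0 + w]
    homogeneous = (int(block.max()) - int(block.min())) <= tol   # predicate P(R)
    if homogeneous or h <= 1 or w <= 1:
        regions.append((r0, c0, h, w))
    else:
        h2, w2 = h // 2, w // 2
        split(image, r0,      c0,      h2,     w2,     tol, regions)   # R1
        split(image, r0,      c0 + w2, h2,     w - w2, tol, regions)   # R2
        split(image, r0 + h2, c0,      h - h2, w2,     tol, regions)   # R3
        split(image, r0 + h2, c0 + w2, h - h2, w - w2, tol, regions)   # R4
    return regions

# Usage: blocks = split(img, 0, 0, img.shape[0], img.shape[1])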

(Refer Slide Time: 21:00)

So, here I have given one example. You can see I am considering one image, the sample image.
Since it is not homogeneous, so I am dividing it into four regions, you can see four regions here,
the first region, second region, third region and the fourth region I am doing the splitting.

(Refer Slide Time: 21:19)

Next slide you can see the first region what I have considered in my previous slide, that is not
homogeneous, so that is why I am splitting into four, the four is the first region, second region,
third region, fourth region like this. And similarly, other regions also that is not homogeneous, so
that is why I am splitting it again. So, this region, this region, this region and this region.

And similarly, for other region also, since it is not homogeneous, so I am splitting into four
regions, this region, this region, this region and this region. And like this, finally, if I consider
this last region, that is also not homogeneous, so it is split into four regions.

In the third split what I am considering, so this is homogeneous, this portion is homogeneous. So
then in this case no need for splitting. But this region is not homogeneous. So, that is why I am
doing the splitting, the splitting is done. So, this region I am doing the splitting.

This region is homogeneous, so that is why no need to split this region is homogeneous, no need
to split. But this region is not homogeneous, so that is why I am doing the splitting like this I am
doing, I am considering all the regions and I am doing the splitting, so this is the third split.

(Refer Slide Time: 22:43)

And after this I am doing the merging. So if you consider, this region can be merged with this,
this can be merged with this because this is homogeneous and also this can be merged with this.
But if you consider this portion, these two, two I cannot merge. Similarly, if I consider this, all
the 0, 0, 0, 0, that can be merged. Similarly, if I consider this 1, 1, 1, 1, 1 this can be merge, 3
will be isolated that I cannot merge, like this I have to do the merging.

So, in the final region what you will be getting, two will be isolated, three is isolated that cannot
be merged, and you can see the merged pixel. So, you can see 1, 1, 1, all these are merged, so
these are merged, all these are merged, zeros all these are merged, ones this merged, this is
merged. So, I am just doing the merging, like the 6 I am doing the merging, the 5 is different. So,
I can do the merging of the pixels, so I am getting the final results.

(Refer Slide Time: 23:58)

So, this is the technique of the splitting and merging. And corresponding to this image, original
image I am applying this technique and you can see the output.

(Refer Slide Time: 24:08)

The next important technique I am going to discuss is the active contour based technique.
Suppose I want to do the segmentation of an object: one image is there and I have one object in
it. What I can consider is one contour around this object; this is called the contour around the
object, and this is my object.

And this contour, or the snake, it is also called a snake. The snake has the internal energy that
corresponds to curve bending and the continuity. The snake has the external energy, that is the
image energy. And I am also considering some constraint energies, some constraints I am
considering, that is measure of external constraints either from higher level shape information or
user applied energy that I am considering.

So, these energies I am considering, one is the internal energy, external energy and the constraint
energy. And in this case, what I have to do, I have to minimize this energy corresponding to the
contour. When the energy is minimum the contour touches the boundary of the objects.

So, my problem is I have to minimize this energy. So, the total energy has three components,
internal energy, external energy and the constraint energy. So, what I have to do, I have to
minimize this energy and corresponding to this minimum energy, the contour touches the
boundary of the object; that is the active contour. So, how do we define these energies, one is the internal
energy, external energy and the constraint energy?

(Refer Slide Time: 25:53)

So, the contour is defined in the x, y plane of an image as a parametric curve. So, this is the
definition of the contour: v(s) = (x(s), y(s)). And corresponding to this snake, or
the contour I am considering these energies, one is the internal energy, one is the external energy
and also I am considering the constraints. So, we have three energy terms.

Now, in this case the energy terms are defined cleverly in such a way that the final position of
the contour will have the minimum energy. So, whenever the contour touches the boundary of
the object, the energy will be minimum. So, the problem is an energy minimization problem.

(Refer Slide Time: 26:41)

So, how to define the internal energy? So, internal energy depends on the intrinsic property of
the curve, that is the sum of the elastic energy and the bending energy. So, what is the elastic
energy? That means the internal energy has two components, one is the elastic energy, another
one is the bending energy.

So, what is the elastic energy? The curve is treated as an elastic rubber band possessing elastic
potential energy. It discourages stretching by introducing tension. The elastic energy can be
defined like this, with a weight alpha s that allows us to control the elastic energy along different
parts of the contour.
contour.

And for many applications, this alpha s is considered as constant. And this elastic energy is
responsible for shrinking of the contour. So, first I am defining the elastic energy and you can
see, so how to determine v s? v s is nothing but I am just taking the gradient, dv s divided by ds,
that gradient I am considering. So, v s is defined like this. So, first term is the elastic energy I am
considering.

(Refer Slide Time: 28:08)

The second term is the bending energy. What is the bending energy? The snake is also considered
to behave like a thin metal strip, giving rise to bending energy. This bending energy is defined as
the sum of the squared curvature of the contour. And beta s is very similar to alpha s; in many
applications beta s can be considered as a constant.

Bending energy is minimum for a circle, that is obvious. So, bending energy for a circle will be
minimum. So, what will be my total energy? For total energy I have to consider these two terms,
one is the elastic energy, another one is the bending energy.

(Refer Slide Time: 28:52)

Next, how do we define the external energy of the contour? The external energy of the contour is
derived from the image. You can see the definition of the external energy: E external is equal to
the integral, taken over the variable s, of E image evaluated at v s, ds.

So, what is E image? E image is nothing but the energy obtained from the image. We define the
function E image so that it takes on smaller values at the features of interest, such as boundaries.

So, this is the definition of the external energy, and it is derived from the image. For example, if I
take the gradient of the image and square it, I can determine E image. I can also consider another
expression: the image is convolved with a Gaussian to blur the image, and after this I take the
gradient; from this you can determine the image energy.
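
Putting the terms together, the functional that the snake minimizes can be written in the standard form (up to constant weighting factors, and omitting the constraint term as in the discussion above):

E_{\text{snake}} = \int_{0}^{1} \Big[ \alpha(s)\,\lvert v'(s)\rvert^{2} + \beta(s)\,\lvert v''(s)\rvert^{2} + E_{\text{image}}\big(v(s)\big) \Big]\, ds,
\qquad
E_{\text{image}}(x, y) = -\lvert \nabla I(x, y)\rvert^{2} \quad \text{or} \quad -\lvert \nabla \big(G_{\sigma} * I\big)(x, y)\rvert^{2}.

Here alpha(s) weights the elastic term, beta(s) weights the bending term, and E image is strongly negative where the image gradient is strong, that is, at object boundaries.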

(Refer Slide Time: 30:16)

So, finally, the problem at the hand is to find the contour v s, that minimizes the energy
functional. So, the problem is we have to minimize the energy. So, energy here I am considering
the internal energy and the external energy. The external energy is derived from the image itself.
And in this case, we have to apply the Euler Lagrange differential equation.

So, this is Euler Lagrange differential equation I have to apply and I will be getting this the
energy minimization problem. So, this equation can be interpreted as a force balance equation,
each term corresponds to a force produced by the respective energy terms. The contour deforms
under the action of these forces, and whenever the contour touches the boundary of the objects,
the energy will be minimum.

So, the problem is the energy minimization problem, and for this we have considered the internal
energy and the external energy; in this expression I have not considered the constraints. So, I
have shown only two energies, one is the internal energy and the other is the external energy.

(Refer Slide Time: 31:25)

So, what are the advantages of this technique? It works well for ambiguous boundary regions, but
the problem is how to define the initial curve; that is also a difficulty of this method, along with
the trade-off between noise reduction and detailed boundaries.

(Refer Slide Time: 31:47)

So, I can give some examples of the active contours. You can see a is the object and initially I
have to define the contour. So, b is the contour here, if you see the dotted line that is the contour.

And after this what I have to consider I have to consider the energy of the contour, that is the
internal energy and the external energy. And after this, with each and every iteration, I have to
minimize the energy that is the snake energy minimization I have to do.

And whenever the contour touches the boundary of the object, the energy will be minimum. And
that corresponds to the final contour, d is the final contour, and that is the segmented object I will
be getting. So, the segmented object I will be getting when the contour touches the boundary of
the object. This is the boundary detection using active contour model.

(Refer Slide Time: 32:32)

And here I have given another example, you can see the initial contour and after this I am doing
the energy minimization, energy is nothing but the internal energy and the external energy and
whenever the contour touches the boundary of the object the energy will be minimum.

(Refer Slide Time: 32:49)

The another example I can show, this is the segmentation of the mammogram and I am defining
the contour. And you can see that with each and every iteration the contour just touches the
boundary of the objects. So, I can do the segmentation by using active contour.

(Refer Slide Time: 33:08)

Another popular image segmentation technique is the topography based segmentation technique,
that is, the watershed algorithm. In this case the image is modeled as a topographic surface. Here
in the first figure you can see I am showing the topographic surface, and corresponding to this
topographic surface you can see that two catchment basins are available.

And corresponding to these two catchment basins, I have the minima, this is one minimum
corresponding to the first catchment basin and you have another minima corresponding to the
second catchment basin and between these two catchment basins, I have the watershed lines, you
can see the watershed lines, these are the red watershed lines.

So, the main concepts are: what is a catchment basin, what is a local minimum and what are the
watershed lines; and I will also discuss the dam construction, how to construct the dam. The main
idea is that the image is modeled as a topographic surface.

So, corresponding to the figure one you can see two minima and I have considered two
catchment basins and corresponding watershed lines. So, corresponding to this image, grayscale
image, you can see the catchment basins, maybe the catchment basins here, and in this case, I am
considering a topographic surface, this is a topographic surface. And corresponding to this
topographic surface, you can see two minima corresponding to two catchment basins. And again
you can see that watershed lines.

(Refer Slide Time: 34:51)

So, in case of the watershed segmentation, the segmentation is performed by labeling connected
components-catchment basins within an image. So, in this case what I have to consider suppose
corresponding to this the first catchment basin, this is the minimum corresponding the second
catchment basin this is the minimum. Now, the objective is to find the watershed lines between
the catchment basins. So, that concept I will explain in my next slide.

(Refer Slide Time: 35:20)

Here you can see, we visualize an image in 3D: the spatial coordinates and the gray levels I have
to consider. In such a topographic interpretation there are three types of points. The first type is
points belonging to a regional minimum; here you can see this point is a regional minimum, and
in the second figure also you can see we have the regional minima.

The second type is points at which a drop of water would fall to a single minimum. Suppose I put
some water here; the drop of water would fall to a single minimum, say it falls into this
minimum. The set of points at which a drop of water would fall to a single minimum is the
catchment basin, or the watershed, of that minimum. So, corresponding to this minimum, if I put
some water at this point, it would fall to this single minimum.

The third type is points at which a drop of water would be equally likely to fall into more than
one minimum. Suppose I put a drop of water here; this drop of water would be equally likely to
fall to more than one minimum, so it may fall to this minimum or it may fall to that minimum.

So, you can see, we have three types of points, the points belonging to a regional minima, that
already I have shown the regional minima. Point at which a drop of water would fall to a single
minima, so I am considering this point. And points at which the drop of water would equally
likely to fall to more than one minima, so that I am considering this one.

(Refer Slide Time: 37:05)

So, the idea is how to determine the watershed lines; the objective is to find the watershed lines.
The idea is very simple. In this figure I have shown the brightness profile with respect to the
distance. Now, suppose in this brightness profile of the image a hole is punched in each of the
regional minima; so this is one regional minimum, this is another regional minimum, and so on.

So, a hole is punched here, maybe here, maybe here; a hole is punched in each regional
minimum, and the entire topography is flooded from below. It is flooded from below by letting
water rise through the holes at a uniform rate. So, I am just doing the flooding: for this I am
considering the holes like this, this is one hole, another hole, like this, and the water is rising now.

So, as the water rises, what will happen is that at a particular time this water is about to merge
with that water. When rising water in distinct catchment basins is about to merge, how do we
prevent this? A dam is built to prevent the merging. That is why I am making a dam here, so that
water cannot flow from one catchment basin to another catchment basin, so that the water cannot
merge from one region to another region; these dam boundaries correspond to the watershed lines.

Similarly, corresponding to that these two regions, this is one region and another region is this,
the flooding is going on what I am doing flooding from below by letting water rise through the
holes at uniform rate. So, what will happen, to prevent merging of water from this both the
regions, I have to construct a dam. So, dam construction is going on like this, these dam
boundaries corresponds to the watershed lines.

So, that means, the objective is to find the watershed lines. So, that is called a dam construction,
dam construction is done so that the merging is not possible, that is water is rising from the
bottom and I want to prevent water coming from one region to another region. So, for this a dam
is constructed so that water cannot merge from one region to another region. And I have to do
the dam construction like this and this dam boundaries corresponds to the watershed lines.

(Refer Slide Time: 39:57)

So, this concept again I am showing here. I have shown in this figure catchment basins I have
shown, and also I have shown the watershed lines. So, objective is to determine the watershed
lines, the watershed transform compute catchment basins and the ridgelines. These ridgelines are
also known as watershed lines, where the catchment basins corresponds to the image regions and
the ridgelines corresponds to the region boundaries.

So, that means I have to identify the watershed lines based on this principle. And in this case the
catchment basin corresponds to the image region. So, this is the image region, like this, this is the
image region. And if I consider the ridgeline that is the watershed line, that corresponds to the
region boundaries. So, if you consider this one this is the region boundaries.

(Refer Slide Time: 40:52)

So, in this example, I have shown how to construct the dam. So, you can see the flooding is
going on, result of the flooding. And for this the water is about to merge from two catchment
basins, so the water is about to merge, you can see. For this a short dam is constructed, so that
the flooding is not possible from one region to another region. So, I am constructing a dam, so
that water cannot merge from one region to another region.

And finally you can see I am constructing a longer dam, that is the watershed lines, so, that water
cannot merge from one region to another region, water cannot flow from one region to another
region, that is I am determining the watershed lines. So, you can see the final results I have
shown, the longer dams are constructed and that is the final watershed segmentation lines.

(Refer Slide Time: 41:47)

So, the algorithm will be like this, as I have explained. Start with all pixels having the lowest
possible value; these form the basis for the initial watersheds. Now, for each intensity level k: if a
group of pixels of intensity k is adjacent to exactly one existing region, add these pixels to that
region; else, if it is adjacent to more than one existing region, mark it as a boundary, that means I
am determining the watershed lines; else, start a new region. That is the concept; that means I
have to find the watershed lines.
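
A much simplified, per-pixel version of this flooding procedure can be sketched as follows (assuming NumPy; a proper implementation processes connected groups of pixels at each level and builds explicit dams, so this is only an illustration of the level-by-level flooding idea).

import numpy as np

def simple_watershed(image):
    """Very simplified watershed flooding on a grayscale image.

    Pixels are visited in order of increasing intensity. A pixel adjacent to
    exactly one labelled region joins it, a pixel adjacent to more than one
    region becomes a watershed (boundary) pixel, and an isolated pixel starts
    a new region."""
    h, w = image.shape
    labels = np.zeros((h, w), dtype=int)      # 0 = unvisited, -1 = watershed line
    next_label = 1
    for idx in np.argsort(image, axis=None):  # flood from the lowest gray level
        r, c = divmod(int(idx), w)
        neighbours = set()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and labels[nr, nc] > 0:
                neighbours.add(labels[nr, nc])
        if len(neighbours) == 1:
            labels[r, c] = neighbours.pop()    # join the existing catchment basin
        elif len(neighbours) > 1:
            labels[r, c] = -1                  # dam / watershed line
        else:
            labels[r, c] = next_label          # start a new catchment basin
            next_label += 1
    return labels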

787
(Refer Slide Times: 42:22)

So, what are the advantages of the watershed segmentation? Closed boundary is obtained and
correct boundary achievable, and one main problem is over segmentation problem. So, what is
over segmentation problem? Suppose, if I want to do the segmentation of these objects, this
object suppose what I have to determine I have separated the objects from the background, so
this is my object and I have to separate it from the background.

But what is the over segmentation? Because of the over segmentation, what happens objects
being segmented from the background are themselves segmented into sub components, so that
means this will be segmented out, so that means it is again segmented into sub components, that
is not desired, so over segmentation is a big problem in watershed algorithm. The over
segmentation means, again I am repeating, the objects being segmented from the background are
themselves segmented into sub components, that is the over segmentation.

788
(Refer Slide Time: 43:26)

So, this over segmentation is a problem. What is the approach? One solution is to do filtering: by
filtering I can remove the weak edges, and that is one method to reduce the over segmentation in
the watershed algorithm.

(Refer Slide Time: 43:45)

789
And finally, I want to discuss the K-means clustering algorithm. What is K-means clustering?
Partition the data points into K clusters randomly. Find the centroid of each cluster. For each data
point, calculate the distance from the data point to each cluster centroid and assign the data point
to the closest cluster. Recompute the centroid of each cluster. Repeat steps two and three until
there is no further change in the assignment of the data points or in the centroids.

Now, I am going to write out the K-means clustering procedure. Begin: first I initialize n data
points, or samples, and c clusters; corresponding to the c clusters I have the centroids mu1, mu2,
up to mu c, that is, c centroids. Then, do: classify the n samples according to the nearest mu i, and
recompute the centroids mu i, until there is no change in any mu i. The return value is the set of
centroids mu1, mu2, up to mu c. End.

So this is the algorithm for the K-means clustering, I have to initialize a c number of centroids,
so I am considering c number of classes suppose, so I have to initialize c number of centroids, n
means the number of samples. Do classify n samples according to nearest of mu i. After this I
have to recompute mu i, until no change of the centroid, the centroid is mu i.

And finally I will be getting the centroids of the cluster, mu1, mu2, mu c like this. That means
this procedure is the clustering of the data points. So, this algorithm already I have written. So I
am considering K number of clusters randomly. And for each data point what I have to consider,
calculate the distance from the data point to each cluster I have to consider.

Assign the data point to the closest cluster and recompute the centroid of each cluster; I keep repeating this iteration until there is no change of mu i. And finally I will be getting the centroids mu1, mu2, ..., mu c. This concept I can explain in my next slide.
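A minimal C sketch of this K-means procedure on a handful of 2-D points is given below; the data values, the initial centroids and the overall program structure are illustrative assumptions, and a practical implementation would add random initialisation and an iteration limit.

#include <stdio.h>
#include <math.h>

#define N_POINTS 6          /* n: number of samples (toy values) */
#define N_CLUST  2          /* c: number of clusters             */

typedef struct { double x, y; } Point;

/* Squared Euclidean distance between a sample and a centroid */
static double dist2(Point a, Point b)
{
    return (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y);
}

int main(void)
{
    Point p[N_POINTS] = {{1,1},{1.5,2},{1,1.5},{8,8},{9,8.5},{8.5,9}};
    Point mu[N_CLUST] = {{1,1},{8,8}};   /* initial centroids (chosen randomly in practice) */
    int assign[N_POINTS];

    int changed = 1;
    while (changed) {                    /* repeat until no centroid changes */
        changed = 0;
        /* step 1: classify every sample according to the nearest centroid mu_i */
        for (int i = 0; i < N_POINTS; i++) {
            int best = 0;
            for (int k = 1; k < N_CLUST; k++)
                if (dist2(p[i], mu[k]) < dist2(p[i], mu[best])) best = k;
            assign[i] = best;
        }
        /* step 2: recompute each centroid as the mean of its assigned samples */
        for (int k = 0; k < N_CLUST; k++) {
            double sx = 0, sy = 0; int cnt = 0;
            for (int i = 0; i < N_POINTS; i++)
                if (assign[i] == k) { sx += p[i].x; sy += p[i].y; cnt++; }
            if (cnt == 0) continue;      /* keep an empty cluster's centroid as it is */
            Point nmu = { sx / cnt, sy / cnt };
            if (fabs(nmu.x - mu[k].x) > 1e-9 || fabs(nmu.y - mu[k].y) > 1e-9) changed = 1;
            mu[k] = nmu;
        }
    }
    for (int k = 0; k < N_CLUST; k++)
        printf("mu%d = (%.2f, %.2f)\n", k + 1, mu[k].x, mu[k].y);
    return 0;
}

For image segmentation, the same loop would run on pixel colour values (for example RGB triples) instead of 2-D points, with each pixel finally labelled by the cluster of its nearest centroid.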

(Refer Slide Time: 46:45)

So you can see I am considering suppose the cluster are like this, and randomly I am selecting
two clusters centroid, one is the red another one is the green. Two cluster centroids I am
selecting. After this, what I am considering, I am considering the distance between the sample
points, and the cluster centers. That means the cluster center means the mean.

So in this case, by finding the distance between the sample points you can see at this point is
assigned to this cluster. Similarly, this point is assigned to this cluster, this point is assigned to
this cluster like this, based on the minimum distance I can determine, whose sample point is
closer to a particular mean.

Similarly, corresponding to this point, this point is close to this, close to this mean, this point is
close to this mean, and like this I can decide whether these sample points belong to a particular
cluster, particular centroid that I can decide. Similarly, this point is closer to this, as compared to
the red centroid. So based on this, I can assign the sample points to the centroid, red centroid and
the green centroid.

(Refer Slide Time: 47:58)

After this I have to recompute the centroid. You can see the previous centroid is
here and after this I am recomputing the centroid, so next centroid will be this. This procedure I
have to repeat again and again you can see, again, I can see whether these sample points belong
to this, or this I have to see, that is based on the distance I can do this. And again, I can see
whether this is close to this, or this, I have to find like this.

And based on this you can see these two points are now assigned to the red centroid. And again,
I am recomputing the centroids, I am recomputing the centroid and after this again like this and
finally I am getting the centroid like this and I am getting the two clusters. So this is one cluster,
corresponding to the centroid, centroid is the red centroid and corresponding to green centroid I
have another cluster, so this is another cluster.

(Refer Slide Time: 48:54)

This K-means algorithm, the procedure I have shown here. So initially, I have to randomly
define the cluster centroids, that is the mean. After this I have to assign the sample points to a
particular centroid based on the minimum distance, I can consider the Euclidean distance
between the sample point and the centroid, and based on this a particular sample point can be assigned to a particular centroid, like this I can consider.

And finally, if there is no change of the centroid after all this, then I have to stop the iteration,
and I will be getting the clusters corresponding to all the centroids, I have to do like this. So you
can see here in this figure, you can see all the iterations and finally I will be having the centroid
like this. So that centroid will be like this after all the iterations, because I have to update the
centroids after each and every iteration.

(Refer Slide Time: 49:51)

So this is one example of the clustering, that you can see, I have the input images, and you can
see I am applying the K-means clustering algorithms for image segmentation. So in this case I
have only two clusters. One is this cluster and other one is this cluster and similarly you can see
this is the result of the K-means clustering.

(Refer Slide Time: 50:10)

I can give another example, this is the original image and in this case I am considering K is equal
to 5, that means 5 centroids are considered and in the second case I am considering K is equal to
11. That means 11 centroids are considered, that means, 11 clusters I am considering. K equal to
5 means the 5 clusters I am considering, and the corresponding segmented images you can see in
the results. This is about the K means clustering.

(Refer Slide Time: 50:40)

And also I can use the motion for image segmentation. So just I am giving one simple example.
Suppose I am considering one video sequence and suppose this is one frame and this is another
frame of the sequence. And you can see one moving car is there, you can see, one moving car is
there and if I take the difference between a reference image and the subsequent image, then what
I will be getting? I can get the stationary elements and also I can determine the non stationary
elements.

So in this case, one moving car is available and in this case, this corresponds to the stationary
elements, the stationary background I will be getting. So this is a concept of the motion in
segmentation, just taking the difference between the reference image and the subsequent image
of the video sequence to determine that stationary elements and the non stationary image
components that I can determine.
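A minimal C sketch of this frame-differencing idea is given below, assuming two grayscale frames of the same size stored as row-major arrays; the threshold value and the function name are illustrative assumptions.

#include <stdlib.h>

/* Mark a pixel as moving (1) when it differs from the reference frame by more
 * than a threshold, otherwise as stationary background (0).                  */
void frame_difference(const unsigned char *reference,
                      const unsigned char *current,
                      unsigned char *motion_mask,
                      int width, int height, int threshold)
{
    for (int i = 0; i < width * height; i++) {
        int diff = abs((int)current[i] - (int)reference[i]);
        motion_mask[i] = (diff >= threshold) ? 1 : 0;
    }
}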

(Refer Slide Times: 51:39)

And also in stereo imaging, I have the left image and the right image and we can determine the
disparity map. And from the disparity map I can do the image segmentation. So, here you can see
the first one is the left image, and the right image is available and I am calculating the disparity
map here, c is the disparity map and based on the color I can do the segmentation. So d is the
result of color based segmentation and e is the result that is obtained from disparity map based
image segmentation. So I can also use disparity information for image segmentation.

(Refer Slide Times: 52:22)

So, briefly I have outlined the segmentation techniques; the edge detection based approaches and the statistical segmentation methods are not considered here in this discussion. And one point is that there is no single segmentation method that is effective for all applications.

So in this class I have discussed the image segmentation concepts and I have discussed some
image segmentation algorithms. One is the thresholding technique, one is the split and merge
technique, one is the region growing technique that is also very important, for region growing
technique I have to consider initial seed points.

And after this I discussed the concept of active contours, so how to select the active contours and
what are the energies, internal energy, external energy, and also the constraints. So the problem
is nothing but the energy minimization problem. When the contour touches the boundary of the
objects, then the energy will be minimum, that point I am considering in case of the active
contour.

After this I discussed about watershed image segmentation algorithms and finally I discussed
about the K-means based image segmentation technique. So let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M.K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture 21
Image Features and Edge Detection

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing - Fundamentals
and Applications. Up till now I have discussed about some image processing concepts, so now I
will discuss about image features. I can give some examples of image features; colour, textures,
edges or the boundaries are the examples of image features. So, how to extract these features, so
I am going to discuss and these features can be extracted in spatial domain or in frequency
domain. So, today first I will discuss about colour features and after this I will discuss the
concept of edge detection. So, let us see what are the colour features.

(Refer Slide Time: 1:16)

So, here you can see, I have shown the input image and I have shown a block that it is a feature
extraction, so I can extract features like may be the colour is a feature or may be the texture I can
consider or may be the edges or the boundaries I can consider or may be the shape of an object I
can consider as image features. And after extracting image features, I can go for decision making
that is pattern classification. So, I can classify different images based on these features. So, first
let us discuss about the colour features.

(Refer Slide Time: 2:03)

So, the first feature is the colour histogram. In my lecture on image enhancement, I discussed how to determine the histogram of an image; similarly, I can determine the colour histogram of an image. It defines the image colour distribution, and mainly the colour histogram characterizes the global distribution of colours in an image. I can compare two images by their histograms. In this case I can consider colour models like the HSI colour model, the YCbCr colour model or maybe the Lab colour model, because they give better results as compared to the RGB colour space.

So, this colour histogram characterizes the global distribution of colours in an image, and already I told you that it can be used as an image feature. To extract a colour histogram from an image, the colour space can be quantized into a finite number of discrete levels, and each of these quantization levels becomes a bin of the histogram. We already know how to determine the histogram of an image, and after this we can compare two images based on their colour histograms.

So, here you can see I am doing the comparison between the two images by considering colour
histogram. So, I am considering two images, one image is M and another image is N. And I am
defining the histogram for the first image and the histogram for the second image I am defining
like this. Then in this case, I can find the dissimilarity or maybe the similarity between two
histograms and based on this I can discriminate or maybe I can compare two images. So, this is
about the colour histogram.
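A minimal C sketch of this idea is given below: each colour channel is quantised to a few bins, a normalised colour histogram is built, and two histograms are compared with a simple L1 (sum of absolute differences) dissimilarity. The bin count, the interleaved RGB layout, the function names and the choice of distance are assumptions of the sketch.

#include <math.h>

#define BINS_PER_CH 4                       /* quantisation levels per channel (assumption) */
#define HIST_SIZE   (BINS_PER_CH * BINS_PER_CH * BINS_PER_CH)

/* Build a normalised colour histogram from interleaved 8-bit RGB pixels. */
void colour_histogram(const unsigned char *rgb, int num_pixels, double hist[HIST_SIZE])
{
    for (int k = 0; k < HIST_SIZE; k++) hist[k] = 0.0;
    for (int i = 0; i < num_pixels; i++) {
        int r = rgb[3 * i]     * BINS_PER_CH / 256;   /* quantise each channel */
        int g = rgb[3 * i + 1] * BINS_PER_CH / 256;
        int b = rgb[3 * i + 2] * BINS_PER_CH / 256;
        hist[(r * BINS_PER_CH + g) * BINS_PER_CH + b] += 1.0;
    }
    for (int k = 0; k < HIST_SIZE; k++) hist[k] /= num_pixels;  /* normalise */
}

/* L1 dissimilarity between two normalised histograms: 0 means identical. */
double histogram_distance(const double h1[HIST_SIZE], const double h2[HIST_SIZE])
{
    double d = 0.0;
    for (int k = 0; k < HIST_SIZE; k++) d += fabs(h1[k] - h2[k]);
    return d;
}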

(Refer Slide Time: 4:10)

Another feature is the colour correlogram. In the case of the colour histogram, the spatial information is not available; that is why we consider this feature, the colour correlogram. It captures the spatial correlation of pairs of colours and how this correlation changes with distance, which is exactly the information that the colour histogram does not provide.

So, in the colour correlogram we consider that information, the spatial correlation of pairs of colours as it changes with distance; it is essentially a histogram, but one that takes the spatial information into account. Another colour feature is the colour moments: I can determine the first order moment (the mean), the second order moment, which is nothing but the standard deviation, and also the third order moment, which is nothing but the skewness of a colour.

So, in this case, what is P_ij^c in these expressions? It is the value of the c-th colour component of the colour pixel in the i-th row and j-th column of the image. So, from these colour moments I can extract the first order moment, the second order moment and the third order moment; the third one is nothing but the skewness of a colour.
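A minimal C sketch of these three colour moments for a single colour channel is given below; the function name is an assumption, and in practice the moments are computed for each channel of the chosen colour space.

#include <math.h>

/* First three colour moments of one channel: mean, standard deviation and
 * skewness (taken here as the signed cube root of the third central moment). */
void colour_moments(const unsigned char *channel, int num_pixels,
                    double *mean, double *std_dev, double *skewness)
{
    double m = 0.0, v = 0.0, s = 0.0;
    for (int i = 0; i < num_pixels; i++) m += channel[i];
    m /= num_pixels;
    for (int i = 0; i < num_pixels; i++) {
        double d = channel[i] - m;
        v += d * d;
        s += d * d * d;
    }
    v /= num_pixels;
    s /= num_pixels;
    *mean     = m;
    *std_dev  = sqrt(v);
    *skewness = cbrt(s);          /* cube root keeps the sign of the skew */
}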

(Refer Slide Time: 6:32)

Another descriptor is the MPEG-7 colour descriptor, from the Moving Picture Experts Group-7 standard. It mainly comprises histogram descriptors, a dominant colour descriptor and a Colour Layout Descriptor. For the dominant colour descriptor, a set of dominant colours in the image or in a particular region of interest is selected. And in this case, for the colour descriptors, a short description of the most widely used colour space model is defined.

Generally, in this case the YCbCr model is considered. The CLD, that is, the Colour Layout Descriptor, is a compact MPEG-7 visual descriptor designed to represent the spatial distribution of colour in the YCbCr colour space. For this, the given image or a region of interest of the image is divided into 8 by 8, that is, 64 blocks.

And the average colour of each block is calculated as its representative colour. Finally, the discrete cosine transform is performed on the series of average colours and a few low-frequency coefficients are selected; the CLD is formed after quantization of these coefficients. This is about the Colour Layout Descriptor. I am not explaining it in detail; if you want to read the details you can see the books or the research papers on the MPEG-7 colour descriptors.

Another one is the scalable colour descriptor. It is mainly a Haar-transform-based encoding scheme that is used to measure the colour distribution over an image. For this, the HSI colour space is used and it is quantized uniformly to 256 bins, and finally the histograms are encoded using a Haar transformation. Using the Haar transformation means that we can consider scalability, and scalability is important.

Again, I am not explaining this in detail, but for an understanding of these concepts you can see the books and the research papers, one topic being the MPEG-7 colour descriptors and another the scalable colour descriptor. So, this is about the colour features: we have considered the colour histogram, the colour correlogram and the colour moments, and regarding the descriptors we have considered the MPEG-7 colour descriptors and the scalable colour descriptor.

(Refer Slide Time: 10:01)

The next topic is the concept of the edge detection. So, edge detection is a very important image
processing step that how to detect the edges and the boundaries. But it is a very difficult image
processing task because mainly it depends on many cases like I can give one example, in

802
different illumination conditions, the number of edges in an image will be different. Suppose, if I
consider a low light condition and if I consider a very good light condition, the number of edges,
the number of edge pixels will be different in both the cases.

So, that is why the edge detection is a very difficult task in image processing. Another point is,
suppose for image segmentation, the edge detection is quite important. Image segmentation
means the partitioning of an image into connected homogeneous region. So, I can determine the
edges, I can determine the boundaries and based on this I can do the image segmentation. And
mainly suppose if I want to consider the separation of the foreground and the background, so
based on the edge detection principles, I can do this.

I can determine the edges and I can determine the boundaries and after this I can do the
segmentation. So, edge detection is a very important image processing step and edges I can
consider as a feature. Now, let us see how to determine the edges, how to detect the edges. So,
before explaining this concept, I can show the concept of the edges and the boundaries.

(Refer Slide Time: 11:40)

So, here you can see, the edge detection is one of the important and difficult operations in image
processing. And in this case already I have explained this one that is for image segmentation is
nothing but the partitioning of the image into connected homogeneous region. So, for this also
the edge detection is quite important. An edge indicates a boundary between objects and
background.

(Refer Slide Time: 12:10)

So, you can see here in this example, the boundaries of objects that is marked by many users,
you can see. So, it is very difficult to find the accurate boundary.

(Refer Slide Time: 12:22)

The next example I can show you, the boundaries of objects from edges. So, in this case, I can
determine edges of an image by using some gradient principle, that is the brightness gradient I
can determine. This concept I am going to explain after some time, so based on the gradient
information, I can determine the edges and from this I can determine the boundary. But the main
problem is missing edge continuity, that is one problem and many unwanted edges I am having
because of this, the selection of the gradient, how to select the gradient and what are the
thresholds, because in this gradient methods I have to select the threshold. So, based on this I
may get many unwanted edges, the edge pixels.

(Refer Slide Time: 13:14)

And here I am showing one example that is the boundary of the objects from edges, then in this
case I am using the multi-scale brightness gradient, that I am going to discuss, but in this case,
you can see low strength edges may be very important. So, in this example I have shown the
boundaries of the objects from the edges and for this I am considering the multi-scale brightness
gradient. But in this case you can see that minor or low-magnitude edges are missing because of the selection of the threshold. These may also be important: small, low-magnitude edges are missing in this output.

(Refer Slide Time: 14:02)

And sometimes it is very difficult to determine the boundaries. Here you can see I am considering one ultrasound image, and you can see that it is affected by speckle noise; in this case it is very difficult to find the edges and the boundaries.

(Refer Slide Time: 14:21)

And in this case for these images, you can see this type of images, it is really very difficult to
find the edges and the boundaries, even for humans. Now, I will discuss the concept of the edge
detection, so how to determine the edges in an image. So, let us see the principle.

(Refer Slide Time: 14:42)

And here in this example I can show you, corresponding to this input image, you can see the
edges of the image. So, for this I am applying the gradient information, so how to determine the
gradient information and based on this gradient information how to determine the location of the
edge pixels that concept I am going to explain now.

(Refer Slide Time: 15:08)

Suppose I am considering one image: on one side there is one grayscale value and on the other side there is another grayscale value. So, this is the edge; an edge means an abrupt change of intensity value. You can see the location of the edge corresponding to the point x naught. If I draw the 1D profile of this, the profile will be something like this: call it f(x). On the left side of the edge you can see one grayscale value, and on the right side you can see another grayscale value.

So, edge means abrupt change of the intensity value. So, if you see this 1D profile, so how to
determine the location of the edge? So, if I take the differentiation of this, the first order
differentiation, that means I am determining the gradient, so that means corresponding to x
naught this point, you will be getting the maximum corresponding to this point x naught. If I
consider a second order derivative, so corresponding to this x naught, you will be getting a zero
crossing.

So, you will be getting a zero crossing, so that means, so how to determine the location of the
edge, the principle is you can see the magnitude of the first derivative, the first order derivative,
is maximum. So, based on this information the magnitude of the first order derivative is
maximum, based on this information you can determine the location of the edge pixel. The next
one is, if I consider a second order derivative, the second derivative crosses 0 at the edge point.

So, you can see by determining the first order derivative, you can determine the location of the
edge pixel, the location of the edge and also by determining the second order derivative, you can
determine the location of the edge pixel. In case of the first order derivative, we have to see the
magnitude, because maximum magnitude I have to see and for the second order derivative, I
have to see the zero crossing. So, that means by using, by determining the gradient, we can
determine the location of the edge pixel. So, that means I have to determine the gradient, the
gradient of the image.

So, suppose the image is f(x, y) and I have to determine the gradient: one component is the gradient along the x direction, g1, and the other is the gradient along the y direction, g2. From these I can determine the gradient magnitude, which is the square root of g1 squared plus g2 squared; approximately, the gradient magnitude is equal to |g1| + |g2|, the sum of the magnitudes of the gradients along the x and y directions. That we can determine.

And also, I can determine one angle, the angle is suppose, theta g that is tan inverse g2 divided
by g1, that I can determine. That is nothing but the direction of the normal to the edge, that I can
determine. A direction of the normal to the edge, I can determine. So, you can see from the
gradient I can determine the gradient magnitude, so for this I have to determine the gradient
along the x direction, gradient along the y direction and also I can determine the angle, angle is
nothing but the direction of the normal to the edge.

So, how to determine the edge? So, suppose I have image f (x, y) and suppose I am considering
one operator, operator is suppose h1 and another operator I am considering, operator is h 2. This
h1 operator I am considering for determining the gradient along the x direction. So by using h 1, I
can determine the gradient along the x direction. And by using h 2, I am determining the gradient
along the y direction. And after this, from this two information the gradient along the x direction
and the gradient along the y direction, I can determine gradient magnitude.

The gradient magnitude I can determine, and after determining the gradient magnitude I am getting the gradient image. I can compare this with a threshold, and after thresholding we can determine the edge pixels. If the gradient magnitude is greater than a particular threshold, then that pixel is an edge pixel, otherwise it is not an edge pixel. That means if the gradient magnitude is greater than or equal to a particular threshold, then edge will be 1, else edge will be 0.

So, that means after edge detection, I will be getting a binary image. Edge means 1 and no edge
means 0, e is equals to 1 and e is equals to 0. How to select the threshold? I can give one
example how to select the threshold. So, there are different methods but I can give one example
like this, suppose the histogram of the gradient image has a valley, so we can select the threshold
at this valley, so based on this we can select the threshold.

And in the case the procedure you can see, so first I have to determine the gradient along the x
direction, gradient along the y direction. And after this, I can determine the gradient magnitude
and after this the gradient magnitude is compared with a threshold and after this based on this we
can determine whether particular pixel is the edge pixel or not the edge pixel. This is about the
gradient based method, that is the first order gradient. A second order derivative I can write like
this, that is nothing but a Laplacian, the second order derivative I can determine like this, this is
the Laplacian.

And in this case, I have to see zero crossing to determine the location of the edge pixel. In case
of the gradient magnitude, I have to see the maximum value, but in case of the second order
derivative I have to see the 0 crossing. But one problem is here because of the second derivative,
the noise will increase and that concept I can show you later on. This differentiation can be
implemented by finite difference operations.

(Refer Slide Time: 24:50)

So, what are the finite difference operations? I can give one example: the forward difference. What is the forward difference? The forward difference is f(x + 1) − f(x); this is one finite difference operation. The backward difference I can also calculate, that is f(x) − f(x − 1). I can also determine the central difference, that is (f(x + 1) − f(x − 1)) / 2. So, these are the finite difference operations.

So, by using these operations I can determine gradient, a gradient along the x direction, gradient
along the y direction. And for this implementation, I can consider some kernels, some common
kernels I can consider or maybe the mask I can consider, are something like the Roberts, another
is the Prewitt. Roberts, Prewitt, Sobel, so these are some common operators, the common
kernels. So, by using these kernels we can determine the gradient of an image. So, I can give one
example how to determine the gradient of an image.

Suppose, I am considering one image, the pixels I am considering suppose z 1, z2, z3, z4, z5, z6, z7,
z8, z9. So, these are the pixels and for determining gradient image, I am considering two mask,
one is for the horizontal gradient, another one is for a vertical gradient. So, the first mask I am
considering that is suppose h1, so the mask is suppose - 1, - 2, - 1, 0, 0, 0, 1, 2, 1 and another
mask I am considering that is for determining the vertical gradient, I can consider. So, these two
masks I am considering, actually these are Sobel mask. So, one is for the horizontal gradient,
another one is for the vertical gradient.

So, by using these masks I can determine the horizontal gradient and the vertical gradient. If I first consider the mask h1, I can determine the horizontal gradient Gx: I have to overlap the mask on the image and compute the weighted sum of the pixel values. So, the gradient is Gx = (z7 + 2z8 + z9) − (z1 + 2z2 + z3). This is the horizontal gradient; here you see 1 is multiplied with z7, 2 is multiplied with z8 and 1 is multiplied with z9, and then I subtract the other terms, where z1 is multiplied with −1, z2 is multiplied with −2 and z3 is multiplied with −1.

Like this I can determine Gx, the gradient along the x direction, and similarly I can determine the gradient along the y direction: Gy = (z3 + 2z6 + z9) − (z1 + 2z4 + z7). So, you can see I can determine the gradient along the x direction and the gradient along the y direction. And finally you can also determine the gradient magnitude: if the gradient magnitude is G, then approximately G = |Gx| + |Gy|, obtained from the gradient along the x direction and the gradient along the y direction.

And based on this gradient, what I can do? I can determine the location of the edge pixel, that
means what is the algorithm, find, so algorithm will be like this, so find G x and Gy, that is the
gradient along the x direction and the gradient along the y direction. And after this I can calculate
the gradient magnitude. I can determine the gradient magnitude and if the gradient magnitude is
greater than a particular threshold, threshold is this, greater than equal to particular threshold,
then edge will be 1, else edge will be 0.

So, by using this algorithm I can determine the edge. So, that means, so by using these two
masks I can determine the gradient along the x direction and the gradient along the y direction
and after this determine the gradient magnitude and after this I can determine the edge after the
comparison with a threshold.
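A minimal C sketch of this find-Gx-and-Gy-then-threshold algorithm with the Sobel masks is given below; the row-major 8-bit image layout, the handling of border pixels and the function name are assumptions of the sketch.

#include <stdlib.h>

/* Sobel edge detection following the algorithm in the text:
 * find Gx and Gy, take |G| ~ |Gx| + |Gy| and compare with a threshold.
 * edge[] receives 1 for edge pixels and 0 otherwise; borders are skipped. */
void sobel_edges(const unsigned char *img, unsigned char *edge,
                 int width, int height, int threshold)
{
    for (int i = 0; i < width * height; i++) edge[i] = 0;

    for (int y = 1; y < height - 1; y++)
        for (int x = 1; x < width - 1; x++) {
            /* 3x3 neighbourhood z1..z9 around the current pixel z5 */
            int z1 = img[(y-1)*width + (x-1)], z2 = img[(y-1)*width + x], z3 = img[(y-1)*width + (x+1)];
            int z4 = img[ y   *width + (x-1)],                            z6 = img[ y   *width + (x+1)];
            int z7 = img[(y+1)*width + (x-1)], z8 = img[(y+1)*width + x], z9 = img[(y+1)*width + (x+1)];

            int gx = (z7 + 2*z8 + z9) - (z1 + 2*z2 + z3);   /* mask h1 */
            int gy = (z3 + 2*z6 + z9) - (z1 + 2*z4 + z7);   /* mask h2 */

            int mag = abs(gx) + abs(gy);                    /* approximate |G| */
            edge[y*width + x] = (mag >= threshold) ? 1 : 0;
        }
}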

(Refer Slide Time: 31:37)

So, this concept, which I have explained already, I have shown here: it is the 1D profile, and I have shown the edge location corresponding to the point x0. That means, corresponding to the edge point there is an abrupt jump or step; that is the 1D profile I am considering.

(Refer Slide Time: 32:02)

You can see here, as I have already explained, that f(x) is the 1D profile, and you can see the locations of the edge pixels corresponding to x0 and x1. In this case I can determine the first order derivative f'(x), and you can see the maximum values corresponding to the locations of the edge pixels.

And also I can determine the second order derivative, that is f''(x), and you can see the zero crossings corresponding to the locations of the edge pixels, that is, at x0 and x1. So, by determining the first order derivative and the second order derivative, I can determine the location of the edge pixels.

(Refer Slide Time: 32:55)

A same thing I am showing here again, I am showing the 1D profile corresponding to the
location of the edge at the point x naught. And if I take the first order derivative, you can see the
maximum value will give the location of the edge pixels. And if I consider the second order
derivative, then in this case I have to see the zero crossing. For the first order derivative, I have
to see the maximum value.

(Refer Slide Time: 33:24)

And after this I have explained how to calculate the gradient along the x direction and the gradient along the y direction. So, here you can see ∂f/∂x, that is the gradient along the x direction, and ∂f/∂y, that is the gradient along the y direction. So, this is the gradient operation I am considering; the gradient of the image I am determining.

(Refer Slide Time: 33:48)

And after this I can determine the gradient magnitude: it is the square root of the gradient along the x direction squared plus the gradient along the y direction squared. Approximately, I can represent the gradient magnitude as the sum of the magnitudes of the gradient along the x direction and the gradient along the y direction, |gx| + |gy|.

And also, I can determine the direction of the edge normal, the direction of the normal to the edge: it is tan inverse of (gy / gx), that also I can determine. And already I have defined the Laplacian; the Laplacian is nothing but the second order derivative, so based on the second order derivative I have to see the zero crossings.

(Refer Slide Time: 34:41)

Here I have shown the concept of the edge normal: you can see the edge and the edge normal. This angle I have already determined; θg is nothing but tan inverse of (gy / gx). So, that is the direction of the edge normal, the direction of the normal to the edge, which I can also determine.

(Refer Slide Time: 35:10)

In this example I have shown the input image and I am determining the gradient along the x
direction, so corresponding to this, this is the image and I can also determine the gradient along
the y direction and corresponding to this, this is my output image.

(Refer Slide Time: 35:28)

And this block diagram already I have explained, so my input image is f (x, y) and I am
considering two operators, one is H x, another one is Hy, Hx is mainly to determine the gradient
along the x direction and Hy to determine the gradient along the y direction. So, I can determine
the gradient along the x direction and the gradient along the y direction I can determine.

And from this, I can determine the gradient magnitude I can determine and also, I can determine
the angle, that is the direction of the normal to the edge, that angle also I can determine. And
based on the gradient magnitude I can determine the location of the edge pixel. If the gradient
magnitude is greater than a particular threshold, the edge will be 1, else edge will be 0. So, after
edge detection I will be getting a binary image.

(Refer Slide Time: 36:28)

This concept I am showing here, so my input image is this, I am determining the gradient along
the x direction, gradient along the y direction and after this I am determining the gradient
magnitude image I am determining. And you can see the location of the edge pixels, you can see
the edges of the image.

(Refer Slide Time: 36:48)

Now, note that differentiation is highly prone to high frequency noise: if I do the differentiation, the noise will increase, and for a second order derivative the noise will increase even more. So, that is why we have to do low pass filtering before edge detection, and in this case the differentiation can be implemented as a finite difference operation.

So, already I have discussed the finite difference operations: one is the forward difference, one is the backward difference and one is the central difference. This point is important: because of the differentiation, the noise will increase. I can give one example. Suppose my function is a 1D function f(x) = A sin(ω0 x). If I determine the first order derivative, that is f'(x) = A ω0 cos(ω0 x); so here you see the magnitude of the noise is increasing.

And similarly, if I determine the second order derivative, it will be f''(x) = −A ω0² sin(ω0 x); so you see the noise is increasing further. Previously the magnitude is A, after the first differentiation it is A ω0, and after the second order derivative it is A ω0²; that means, for high frequency noise (large ω0), the magnitude of the noise increases because of the differentiation. So, that is why we have to do low pass filtering before edge detection.

(Refer Slide Time: 38:40)

And this the operations, the finite difference operations, I have mentioned, one is the forward
difference operation, the backward difference operation, another one is the central difference
operation. And some common kernels are Roberts, the Sobel and the Prewitt, these are the
operators. So, by using these operators, I can determine the horizontal gradient and the vertical
gradient I can determine.

(Refer Slide Time: 39:09)

So, what is the finite difference in 1D? I am determining df/dx, which is approximately (f(x + dx) − f(x)) / dx. If I shift the second point to x − dx, then the dx in the denominator changes to 2dx, which gives the central difference (f(x + dx) − f(x − dx)) / (2dx). Corresponding to the second order differentiation, I can also represent it in the form of a finite difference operation: d²f/dx² ≈ (f(x + dx) − 2 f(x) + f(x − dx)) / dx².

(Refer Slide Time: 39:50)

I can show one program segment in C for determining the forward difference and the backward difference. Here I am showing the forward difference: you can see the difference equation p[i + 1] − p[i], which I compute to determine the finite difference.
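The exact program segment from the slide is not reproduced in the text; a minimal sketch of the same idea, using the array name p mentioned above, could look like this (the function name and the zero padding at the ends are assumptions):

/* Forward and backward differences of a 1-D signal p[] of length n.
 * fwd[i] = p[i+1] - p[i]; bwd[i] = p[i] - p[i-1]; end samples are set to 0. */
void finite_differences(const double *p, double *fwd, double *bwd, int n)
{
    for (int i = 0; i < n - 1; i++) fwd[i] = p[i + 1] - p[i];
    fwd[n - 1] = 0.0;

    bwd[0] = 0.0;
    for (int i = 1; i < n; i++) bwd[i] = p[i] - p[i - 1];
}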

(Refer Slide Time: 40:15)

And if I consider a 2D finite difference, I am determining the finite difference for the x component and for the y component. So, first I am determining ∂f/∂x, that is the gradient along the x direction, and ∂f/∂y, that is the gradient along the y direction. If I consider the gradient along the x direction, it is equal to (f(x + dx, y) − f(x − dx, y)) / (2dx). Because here, you can see, I am changing this point: the point was f(x, y), but now I am considering f(x − dx, y). And similarly, I can determine the gradient along the y direction.

(Refer Slide Time: 41:12)

And corresponding to this, you can see the C program segment for the finite differences; you can determine the finite difference for the x component and the finite difference for the y component.
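Again, the slide's program is not reproduced here, but a minimal sketch of central differences for the x and y gradient components of a row-major 8-bit grayscale image, under the same assumptions, is:

/* Central-difference gradient components of a grayscale image (dx = 1 pixel):
 * gx = (f(x+1, y) - f(x-1, y)) / 2,  gy = (f(x, y+1) - f(x, y-1)) / 2.
 * Border pixels are set to 0 for simplicity.                                 */
void gradient_xy(const unsigned char *img, double *gx, double *gy,
                 int width, int height)
{
    for (int i = 0; i < width * height; i++) { gx[i] = 0.0; gy[i] = 0.0; }

    for (int y = 1; y < height - 1; y++)
        for (int x = 1; x < width - 1; x++) {
            int i = y * width + x;
            gx[i] = (img[i + 1]     - img[i - 1])     / 2.0;
            gy[i] = (img[i + width] - img[i - width]) / 2.0;
        }
}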

(Refer Slide Time: 41:26)

So, first operator, the edge operator I am considering, the Roberts edge operator. So, what is the
Roberts edge operator? So, first operator, first mask I am showing here and that is the h 1, that is
to compute the horizontal gradient and another mask I am considering, that is to compute the
vertical gradient. And if I consider this is the central pixel suppose, so this is the central pixel, so
I can consider, I can determine the gradient, the gradient magnitude I can determine.

So, first one f (x, y) - f (x + 1, y +1). That is the gradient along the x direction, that means I am
considering this one. So, this pixel, what is this pixel? Because I am considering this pixel, this is
nothing but f (x, y). What is this pixel? I am considering this pixel position is f (x + 1, y+ 1), I
am considering. So, similarly I can determine the gradient along the y direction, so f (x, y + 1) - f
(x + 1, y), so I can determine the gradient along the y direction.

So, that means I am considering this pixel and this pixel I am considering. So, like this I can
determine the gradient magnitude, but one disadvantage of the Roberts operator is, it is highly
sensitive to noise. So, that is why I can consider another operator that is the Prewitt operator.

(Refer Slide Time: 43:20)

So, you can see the Prewitt operator, and the first mask I am considering, that is the h 1 to
determine the horizontal gradient and h2 I am considering to determine the vertical gradient. And
in this case, if you see this is the 3 by 3 mask, so that means, it does some averaging operation to
reduce the effect of noise. And in this case, it may be considered as the forward difference
operations in all 2-pixel blocks in a 3 by 3 window.

So, by using the Prewitt edge operators you can determine the horizontal gradient and the
vertical gradient, that is h1 and h2 I am considering and based on this I can determine the g x for
this and also, I can determine g y and after this I can determine the gradient magnitude I can
determine. So, one advantage of this operator, the mask is, because the averaging operation is
done, so that means it reduces noise.

(Refer Slide Time: 44:26)

The next one is the, the Sobel edge operator. So, already I have explained about the Sobel edge
operator and again I am considering two mask, one is the horizontal mask, another one is the
vertical mask. So, by using these mask we can determine horizontal gradient and the vertical
gradient. And in this case it is similar to Prewitt operator, because in this case also averaging
operation is done to reduce the effect of noise. And in this case, we may consider the forward
difference operations in all 2 by 2 blocks in a 3 by 3 window. That we can consider, because this
mask is the 3 by 3 mask.

(Refer Slide Time: 45:09)

So, here I am showing the DFT; the simple DFT means I am determining the DFT of the mask [−1 0 1]. If I consider the Prewitt mask, this is the corresponding DFT, and corresponding to the Sobel mask this is the DFT; you can determine the DFT of each mask. For the simple case [−1 0 1], you can see the frequency response of the mask over the different portions shown.

(Refer Slide Time: 45:47)

And in this case I am considering, vertical edges obtained by simple thresholding, so I can apply
the thresholding operation, so T is equal to 10, the threshold is 10, so corresponding to this the
output image is a. And if I consider b that is the Prewitt operator, I am considering, so threshold
is 30 I am considering. And the third one is I am considering the Sobel operator, so for this I am
considering the threshold value is equal to 40. So, you can see, so how to detect the vertical
edges.

(Refer Slide Time: 46:24)

So, the procedure I have already explained: to determine the edges, we have to determine fx, shown here as fx but actually it is gx; gx means the gradient along the x direction, and the gradient along the y direction also has to be determined. After this I have to determine the gradient magnitude, which is obtained from the gradient along the x direction and the gradient along the y direction. Then the gradient magnitude is compared with a threshold, and based on this I can determine whether a particular pixel is an edge pixel or not. After edge detection I will be getting a binary image.

(Refer Slide Time: 47:10)

Here I am showing one example, input image I am considering and you can see the outputs
corresponding to the mask, first I am considering the Roberts mask, the next I am considering the
Prewitt mask and the third one is the Sobel mask. You can see the comparisons between these
three masks.

(Refer Slide Time: 47:31)

Now, I am considering the case of the Laplacian that is the second order derivative I am
considering. In this example I am showing mainly the gradient operation that is I am finding the
gradient along the x direction, gradient along the y direction. And after this I can determine the
gradient magnitude, I can determine like this. So, this is the gradient image.

(Refer Slide Time: 47:56)

Now, I am considering the second order derivative, that is the Laplacian. So, Laplacian I am
considering, so for Laplacian we have to find a 0 crossing. So, how to determine the Laplacian
by considering the finite difference operation, that I want to explain. So, we have to determine ∂f/∂x, the gradient along the x direction, and similarly the gradient along the y direction; and after this I can again differentiate to get the second order derivative.

So, I can show how to determine the second order derivative. Suppose I am applying the forward difference equation, so ∂f/∂x is nothing but f(x + 1, y) − f(x, y); this is the first order differentiation, that is, the forward difference I am considering. After this I consider ∂²f/∂x², the second order derivative, which I determine as ∂/∂x [f(x + 1, y) − f(x, y)]. It is f(x + 2, y) − f(x + 1, y) − f(x + 1, y) + f(x, y). This is the differentiation with respect to the point x + 1.

So, if I want to determine the differentiation with respect to x, then in this case I have to subtract 1. I am repeating this: the expression above is the differentiation with respect to the point x + 1, and if I want the differentiation with respect to x, I have to subtract 1. That means ∂²f/∂x² will be f(x + 1, y) − f(x, y) − f(x, y) + f(x − 1, y); so finally I will be getting f(x + 1, y) + f(x − 1, y) − 2 f(x, y). So, this is ∂²f/∂x².

Similarly, I can also determine ∂²f/∂y², the second order derivative along y; it will be f(x, y + 1) + f(x, y − 1) − 2 f(x, y). From ∂²f/∂x² and ∂²f/∂y² I can finally determine the Laplacian: the Laplacian will be f(x + 1, y) + f(x − 1, y) + f(x, y + 1) + f(x, y − 1) − 4 f(x, y). So, that is the second order derivative.

So, corresponding to this, what will be the mask? If I consider the coefficients: the central pixel is (x, y), and corresponding to the central pixel the coefficient of the mask will be −4, because the term −4 f(x, y) appears. Corresponding to the point (x + 1, y) the coefficient will be 1, and corresponding to (x − 1, y) the coefficient will also be 1.

Corresponding to f(x, y + 1) the coefficient will be 1, and finally, corresponding to f(x, y − 1), it will be 1. So, my mask for this Laplacian will have −4 at the centre and 1 at the four horizontal and vertical neighbours, with the remaining corner coefficients equal to 0; that is, the mask is 0, 1, 0; 1, −4, 1; 0, 1, 0. If I also consider the diagonal pixels in this computation, then all eight neighbours get the coefficient 1 and the centre becomes −8; that is the mask corresponding to the Laplacian if I consider the diagonal pixels. So, I have defined how to determine the mask corresponding to the Laplacian operation.

(Refer Slide Time: 54:57)

So, this derivation already I have shown, so how to determine the Laplacian, so you can
determine the Laplacian here.

(Refer Slide Time: 55:08)

And based on this Laplacian you can determine the mask, that is, the Laplacian mask, and if I consider the diagonal pixels I get the other Laplacian mask. So, for this, how do we determine the edge pixels? The method is: apply the Laplacian mask to the image; this is the first step. Then detect the zero crossings, as a zero crossing corresponds to the situation where the Laplacian values of neighbouring pixels differ in sign. If p and q are two neighbouring pixels, a zero crossing lies between them when their Laplacian values have opposite signs, and based on the zero crossings I can determine whether a particular pixel is an edge pixel or not.

One advantage of the Laplacian operator is that no thresholding is required: in the case of the gradient operation we considered a thresholding operation, but here we only have to find the zero crossings, so no thresholding is required.
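A minimal C sketch of this method is given below, assuming the 4-neighbour Laplacian mask derived earlier and a simple sign-change test between a pixel and its right and lower neighbours; the function name and the border handling are assumptions, and in practice the image is usually smoothed first because the second derivative amplifies noise.

/* Apply the 4-neighbour Laplacian mask and mark zero crossings as edges. */
void laplacian_zero_crossings(const unsigned char *img, unsigned char *edge,
                              double *lap, int width, int height)
{
    for (int i = 0; i < width * height; i++) { lap[i] = 0.0; edge[i] = 0; }

    /* Laplacian: f(x+1,y) + f(x-1,y) + f(x,y+1) + f(x,y-1) - 4 f(x,y) */
    for (int y = 1; y < height - 1; y++)
        for (int x = 1; x < width - 1; x++) {
            int i = y * width + x;
            lap[i] = img[i + 1] + img[i - 1] + img[i + width] + img[i - width]
                     - 4.0 * img[i];
        }

    /* A pixel is an edge pixel if the Laplacian changes sign between it and
     * its right or lower neighbour (a zero crossing).                       */
    for (int y = 1; y < height - 2; y++)
        for (int x = 1; x < width - 2; x++) {
            int i = y * width + x;
            if (lap[i] * lap[i + 1] < 0.0 || lap[i] * lap[i + width] < 0.0)
                edge[i] = 1;
        }
}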

And it is symmetric, if you see the mask it is a symmetric operation. But disadvantages, because
of the second order derivative, the noise is amplified. And it does not give the information about
the edge orientation. In case of the gradient operation, we can determine theta g that is the
direction of the edge normal, the normal to the edge I can determine. But in this case, it is not
possible to get that information, the information about the edge orientation.

(Refer Slide Time: 56:53)

After this I am considering edge sharpening with a Laplacian filter. Here you can see the original image f(x, y), and f(x, y) is convolved with H, where H is nothing but the Laplacian. That means I am applying Laplacian filtering, and I am considering some weight w. So, the Laplacian filter is first applied to the image f(x, y), and then a fraction of the result, determined by the weight w, is subtracted from the original image f(x, y); that is, the sharpened image is f(x, y) − w (f(x, y) * H).

So, if I apply this operation, you can see the edges. So, I am showing one example here, so this is
my input image, I am applying the Laplacian for edge sharpening. And the second case I am
showing the edge detection by using the Sobel operator, you can see the difference between
these, one is the edge sharpening by using the Laplacian and another one is the edge detection by
Sobel operator. So, this operation is very simple, so for this what I have to consider, first I have
to apply the Laplacian filter to an image and after this I have consider a fraction of this, so for
this I am considering the weight w, the result of this is subtracted from the original image. The
original image is f (x, y).

(Refer Slide Time: 58:28)

And after this, I am considering another operation, that is, unsharp masking. What I am considering is: first, subtract a smoothed version of the image from the original image; this operation enhances the edges. So, here my original image is f(x, y), and the image is convolved with a filter H to obtain the smoothed version, which is then subtracted from the original image.

Subsequently, the unsharp-masked version of the image is obtained by adding a fraction of the resultant mask to the original image f(x, y); for this I am considering the weight w, so only a fraction of the mask, determined by w, is added.

So, this is the concept of unsharp masking: subtract a smoothed version of the image from the original image, and this step enhances the edges. In the first step I determine the mask M(x, y) = f(x, y) − f_smoothed(x, y), where f_smoothed(x, y) is the smoothed (approximate) version of the image. After this, the unsharp-masked image is obtained by adding a fraction of the resultant mask M to the original image, f(x, y) + w M(x, y). So, this is called unsharp masking.
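A minimal C sketch of this unsharp masking step is given below; the use of a 3 x 3 box average as the smoothing filter, the floating-point image representation and the function name are assumptions of the sketch.

/* Unsharp masking: g = f + w * (f - smoothed(f)), with a 3x3 box blur as the
 * smoothing filter. Border pixels are copied unchanged for simplicity.      */
void unsharp_mask(const double *f, double *g, int width, int height, double w)
{
    for (int i = 0; i < width * height; i++) g[i] = f[i];

    for (int y = 1; y < height - 1; y++)
        for (int x = 1; x < width - 1; x++) {
            int i = y * width + x;
            double smooth = 0.0;
            for (int dy = -1; dy <= 1; dy++)          /* 3x3 box average */
                for (int dx = -1; dx <= 1; dx++)
                    smooth += f[(y + dy) * width + (x + dx)];
            smooth /= 9.0;
            double mask = f[i] - smooth;              /* M(x, y) = f - f_smoothed */
            g[i] = f[i] + w * mask;                   /* add a fraction of the mask */
        }
}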

(Refer Slide Time: 60:23)

And after this I am considering some operations called the compass operations. By using these operations we can determine the different lines present in an image. Suppose some lines are oriented in directions such as North, North-West, West, South-West, South and so on; lines in all these directions may be present, and these lines can be determined by using the compass masks.

So, what are the compass masks? I can explain this now. By using the compass masks, you can determine lines oriented in different directions; by directions I mean, for example, the 0-degree, 45-degree, 90-degree or 135-degree directions. So, I can detect a line oriented in any of these directions, and for this I can apply the compass masks.

(Refer Slide Time: 61:39)

So, you can see these masks, so by using these masks I can determine a particular line oriented in
a particular direction, so first one is the North, next one is the West, next one is the South, next
one is the East, North West, South West, like this I can determine the lines in these directions.

(Refer Slide Time: 62:02)

So, for this I am considering these masks: H0, H1, H2, H3, H4, H5, H6, H7. If I compare the masks H0 and H4, what is the difference between these two masks? Only the sign is reversed; H0 and H4 are very similar, only with the sign reversed.

Similarly, H1 and H5 are almost the same, only with the sign reversed (you can see the coefficients +2, 1, 1 against −1, −1, −2). H2 and H6 are very similar with the sign reversed, and H3 and H7 are also similar, with only the sign reversed. So, in this case I can write H4 = −H0, and likewise for the other pairs.

And in this case I can apply convolution operations to determine the direction of a line; I can detect a line oriented in the directions already mentioned, 0 degree, 45 degree, 90 degree or 135 degree. So, suppose this is the image I am considering and a line is present in one of these directions, say at 0 degrees, 45 degrees or 90 degrees; such lines I can consider.

So, a line like this I can also consider; I can determine all these lines. For this I have to do the convolution of the image with the masks. Suppose I do the convolution of the image f(x, y) with the mask H4. Because H4 is just the negative of H0, this is equal to f(x, y) convolved with −H0; between the masks H0 and H4 only the sign is reversed, so I can use this relation.

That means it is equal to minus the convolution of f(x, y) with H0; only the sign is reversed between the masks H0 and H4, and similarly for H1 and H5, for H2 and H6, and for H3 and H7. So, I have to do the convolution between the image and the masks: convolving f(x, y) with H0 I get a value D0, and convolving the image with the next mask H1 I get another value D1.

I get another value D2 by convolving the image with H2, and another value D3 by convolving the image with H3. So, I only have to do the convolution for the masks H0, H1, H2 and H3. For the remaining ones, how to get D4? D4 is nothing but −D0; D5 is nothing but −D1; D6 is nothing but −D2; and D7 is nothing but −D3. So, only for D0, D1, D2 and D3 do I have to compute the convolutions.

And from D0, D1, D2, D3 I can determine D4, D5, D6 and D7. Then I take the absolute values of D0, D1, ..., D7 and determine the maximum; the maximum will give the direction of a particular line. Suppose, for example, that the response of one particular mask is maximum; then corresponding to it I will be getting a line in the vertical direction, that is, the 90-degree direction.

Because if I consider D0 and its mask, you can see the response of the mask here. So, by using this information, namely the maximum among D0, D1, ..., D7, I can determine a particular line. So, the detection of a line in the 90-degree, 0-degree, 45-degree or 135-degree direction I can perform by using this compass operator.
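A minimal C sketch of this compass idea at a single pixel is given below. The exact coefficients of the masks from the slides are not reproduced here, so the four masks H0..H3 are passed in by the caller, and the relation D4..D7 = −(D0..D3) is used for the remaining responses; the function name is an assumption.

#include <math.h>

/* Compass line detection at one pixel (x, y): convolve with the four masks
 * H0..H3 supplied by the caller; the remaining responses follow from
 * D4..D7 = -(D0..D3). The index of the largest |D| gives the orientation of
 * the line (note that Dm and Dm+4 always have the same magnitude, so the
 * result effectively selects one of four orientations).                     */
int compass_direction(const unsigned char *img, int width,
                      int x, int y, const int masks[4][3][3])
{
    double d[8];
    for (int m = 0; m < 4; m++) {
        double sum = 0.0;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++)
                sum += masks[m][dy + 1][dx + 1] * img[(y + dy) * width + (x + dx)];
        d[m]     = sum;
        d[m + 4] = -sum;                 /* H(m+4) = -H(m)  =>  D(m+4) = -D(m) */
    }
    int best = 0;
    for (int m = 1; m < 8; m++)
        if (fabs(d[m]) > fabs(d[best])) best = m;
    return best;
}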

Up till now I have discussed the concept of edge detection. I have discussed how to determine the edges
by using gradient operations; I can determine the first-order derivative or the second-order derivative.
The maximum of the first-order derivative gives the location of the edge pixels, and for the second-order
derivative I have to look for the zero crossings. After this I defined some important masks, namely the
Roberts, Sobel and Prewitt masks.

After this I discussed the Laplacian operator; by using the Laplacian operator we can determine the
second-order derivative, and we have to look for the zero crossings. Finally, I discussed the concept of
the compass operators, with which we can detect a line oriented along the 0-degree, 45-degree, 90-degree
or 135-degree directions.

After this I will discuss another technique of edge detection, that is the model-based technique,
which is based on the mammalian visual system that is the human visual system. So, next class I
am going to discuss about the model-based edge detection techniques, so let me stop here today.
Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture 22
Edge Detection

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing - Fundamentals
and Applications. In my last class I discussed the concept of edge detection, edges and the
boundaries are the features of an image. And in that class I discussed how to detect the edges by
considering the gradient information. So, I can determine gradient of an image. So, gradient
along the x direction and the gradient along the y direction I can determine. And after this I can
determine gradient magnitude.

And based on this gradient magnitude, I can decide whether a particular pixel is an edge pixel or
not an edge pixel that I can determine. Also I can determine the direction of the edge normal that
is the normal to the edge that direction also I can determine that is the angle theta I can
determine, that is about the gradient operation. And also I discussed about the second order
derivative that is the Laplacian.

So, by using the Laplacian also I can determine the location of the edge pixels. So, for the
Laplacian what I have to consider, I have to consider the zero crossings. So, based on the zero
crossings, I can decide whether a particular pixel is the edge pixel or not the edge pixel. So, this
in last class I discussed about these two techniques.

Today I am going to discuss the edge detection technique and another technique that is the model
based edge detection technique. David Marr studied the characteristics of the human visual
system that is the mammalian visual system and based on his observations, this model based
techniques were developed. So, one technique is the log operation that is the Laplacian of
Gaussian that I am going to discuss.

For this model-based technique, we first have to blur the image with a Gaussian function, which is
called Gaussian blurring. After this I can apply the gradient information; I may consider the first-order
differentiation, in which case the maximum gives the location of the edge pixel, or I may consider the
second-order differentiation, in which case I have to look for the zero crossings. So, today I am going
to discuss these model-based edge detection techniques. As I have already mentioned, they are mainly
based on the observations made by David Marr on the mammalian visual system, that is, biological vision.
So, let us see what the model-based techniques are.

(Refer Slide Time: 3:26)

So, in this case, David Marr studied the mammalian visual system, that is, biological vision. His first
observation is that in natural images, features of interest occur at a variety of scales, which is why no
single operator can function at all of these scales. The second observation is that in a natural image we
do not expect diffraction patterns or wave-like effects.

That means some form of local averaging must take place; that is, we have to do some smoothing. That
smoothing I can do by Gaussian blurring: the image is convolved with a Gaussian to smooth it. So, in this
case you can see that for smoothing I can use the Gaussian function, that is, Gaussian blurring.

(Refer Slide Time: 4:33)

So, his main observations are like this and in this case, based on these observations, you can see,
I can apply the gradient operation that is the first order derivative I can consider. And in this
case, the maximum will give the location of the edge pixel or maybe I can consider zero crossing
also I can consider that is the second order derivative I can consider. And in this case I have to
see the zero crossings to find the location of the edge pixels.

So, in the case of the model-based techniques, I first convolve the image with a two-dimensional Gaussian
function, mainly to blur the image and remove noise; this is called Gaussian smoothing. So, blurring the
image with a Gaussian is the first step. After this I compute the Laplacian of the convolved image, and
based on the edge detection principle I have to find the zero crossings.

The zero crossings in the second derivative I have to find, that will give the location of the edge
pixels. So, that means in the model based techniques, what I have to consider, first I have to go
for Gaussian blurring, the image is convolved with a Gaussian function for blurring and after this
I can consider the Laplacian operator and for this I have to find a zero crossings that is the model
based techniques. So, first I am explaining the concept of the one model based technique that is
the log operation, the Laplacian of the Gaussian. So, what is log operation I will explain you.

(Refer Slide Time: 6:26)

So, the first one is the LoG operator; LoG means the Laplacian of Gaussian. What is the LoG operator? The
Gaussian function Gσ(x, y) is used to blur the image, so the image is first convolved with the Gaussian.

After this I take the Laplacian, that is, the second-order derivative, of the blurred image:
∇²(Gσ(x, y) ∗ f(x, y)). This is equivalent to (∇²Gσ(x, y)) ∗ f(x, y). If you consider only the term
∇²Gσ(x, y), it is nothing but the LoG, the Laplacian of Gaussian. So, I take the Gaussian function,
compute its Laplacian to get the LoG, and after this I convolve the image with the LoG operator.

So, I consider the Laplacian of Gaussian, and the image is convolved with this LoG operator. What
actually is the LoG operator? Take the Gaussian function Gσ(x, y) = e^(−(x² + y²)/(2σ²)), where I neglect
the normalizing constant of the Gaussian function. Taking the first-order derivative along the x
direction,

∂Gσ(x, y)/∂x = −(x/σ²) e^(−(x² + y²)/(2σ²)).

To compute the second-order derivative, I differentiate once more with respect to x:

∂²Gσ(x, y)/∂x² = ((x² − σ²)/σ⁴) e^(−(x² + y²)/(2σ²)).

Similarly, I can determine the second-order derivative of the Gaussian function with respect to the
variable y:

∂²Gσ(x, y)/∂y² = ((y² − σ²)/σ⁴) e^(−(x² + y²)/(2σ²)).

Ultimately, the LoG function, the Laplacian of Gaussian, is obtained by adding the x component and the y
component:

∇²Gσ(x, y) = ∂²Gσ/∂x² + ∂²Gσ/∂y² = ((x² + y² − 2σ²)/σ⁴) e^(−(x² + y²)/(2σ²)).

So, this is my LoG operator. It has the parameter σ, and this σ controls the extent of blurring. You can
see that the LoG operator is nothing but the Laplacian of Gaussian: I consider a Gaussian function and
after this I take its Laplacian. If I plot this LoG function, its shape will be very similar to the one
shown here.

This is the LoG function, the LoG operator. Its shape is something like the Mexican hat shape; this is
the Laplacian of Gaussian. One important point is that the frequency response of the human retina looks
like this. The parameter σ controls the extent of blurring. Corresponding to this LoG operator, I can
consider a mask.

So, what will be the mask corresponding to the LoG operator? The mask has weights w1, w2 and so on, and
these weights I can determine from the LoG function; the mask is nothing but a discrete approximation of
the continuous function. So, corresponding to this LoG function I determine the mask, and I can determine
all the weights w1, w2 and so on.

(Refer Slide Time: 15:56)

So, if you see, suppose the 2D Gaussian function is something like h(m, n) = K e^(−(m² + n²)/(2σ²));
corresponding to this Gaussian function you can develop a mask. What is the mask corresponding to the
Gaussian function? Suppose I consider the weights w1, w2, w3, ..., w9 of a 3 by 3 mask.

So, what is the weight? The weight w1 is nothing but h(−1, −1), and w2 is h(−1, 0). From this you can
calculate the values w1, w2 and so on. So, this is the Gaussian mask. Similarly, I can also show the mask
corresponding to the LoG operator. So, let us consider one 5 by 5 mask corresponding to the LoG function,
the Laplacian of Gaussian. What will be the mask corresponding to the LoG function, that is, the
Laplacian of Gaussian?

So, the mask may be something like this, with the central coefficient equal to 16 and the remaining
values 0, −1 or −2:

 0   0  -1   0   0
 0  -1  -2  -1   0
-1  -2  16  -2  -1
 0  -1  -2  -1   0
 0   0  -1   0   0

So, this is the mask corresponding to the LoG operator. Here I am considering a 5 by 5 mask, which is a
discrete approximation of the continuous LoG function ((x² + y² − 2σ²)/σ⁴) e^(−(x² + y²)/(2σ²)).

So, this is the LoG mask, the 5 by 5 Laplacian of Gaussian mask. For determining the edges by using the
LoG operation, how do we do this? The first step is to convolve the image with the LoG operator. Next, at
each pixel we observe whether there is a transition from one sign to another in any direction, and based
on this we decide whether it is an edge pixel.

That means I am actually detecting the zero crossings to determine the location of the edge pixels. So,
the first step is to convolve the image with the LoG operator, the Laplacian of Gaussian, and after this
we detect the zero crossings to determine the location of the edge pixels. By using this procedure, you
can determine the location of the edge pixels.
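As a rough illustration of these two steps (not taken from the lecture), here is a minimal Python sketch that convolves an image with the 5 by 5 LoG mask given above and marks simple horizontal and vertical sign changes as zero crossings; the small threshold eps is an assumption used to suppress near-zero responses.

```python
import numpy as np
from scipy.signal import convolve2d

# 5x5 discrete approximation of the Laplacian of Gaussian (as given above)
LOG_MASK = np.array([[ 0,  0, -1,  0,  0],
                     [ 0, -1, -2, -1,  0],
                     [-1, -2, 16, -2, -1],
                     [ 0, -1, -2, -1,  0],
                     [ 0,  0, -1,  0,  0]])

def log_edges(f, eps=1e-3):
    """Convolve with the LoG mask and mark zero crossings as edge pixels."""
    r = convolve2d(f.astype(float), LOG_MASK, mode='same')
    edges = np.zeros_like(r, dtype=bool)
    sign = np.sign(r)
    # A zero crossing: the response changes sign between horizontally or
    # vertically adjacent pixels (eps suppresses near-zero noise).
    edges[:, :-1] |= (sign[:, :-1] * sign[:, 1:] < 0) & (np.abs(r[:, :-1]) > eps)
    edges[:-1, :] |= (sign[:-1, :] * sign[1:, :] < 0) & (np.abs(r[:-1, :]) > eps)
    return edges
```

In practice one would also check the diagonal neighbours and possibly require the local contrast of the response to exceed a threshold before accepting a zero crossing.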

(Refer Slide Time: 20:26)

So, here I have shown the LoG operation. I convolve the image with the Gaussian and apply the Laplacian
operator; if I combine the Gaussian and the Laplacian, that is nothing but the Laplacian of Gaussian. So,
I convolve the image with the LoG operator and after this I have to look at the zero crossings.

I have already explained how to obtain the LoG function, the LoG operator, and corresponding to this
operator you can get a mask something like this for σ = 1.4, where σ is the parameter that controls the
extent of blurring.

You can also see the shape of the LoG function; the shape is something like a Mexican hat, shown here for
σ = 1. So, we have to follow these two steps: convolve the image with the LoG operator, and then detect
the zero crossings to determine the location of the edge pixels. So, this is about the LoG operation.

(Refer Slide Time: 21:39)

Another operator I can explain is the difference of Gaussian operator, that is, the DoG operator. Here
you can see I am considering two Gaussian functions, a first Gaussian and a second Gaussian, and I take
the difference of the two Gaussians. I have shown the two parameters, one is σ1 and the other is σ2.

The procedure is very similar to the LoG operation: convolve the image with the DoG operator, the
difference of Gaussians, and again detect the zero crossings to determine the location of the edge
pixels. The concept is very similar to the LoG operator, but here I am considering the difference of two
Gaussians, which for a suitable choice of σ1 and σ2 closely approximates the LoG; a small sketch is shown
below. So, up till now I have discussed the concept of the Laplacian of Gaussian and the difference of
Gaussian. The next important edge detection technique is the Canny edge detection technique.
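As a rough sketch of the DoG construction just described (not from the lecture), two sampled Gaussians are subtracted to form the kernel; the choice σ2 ≈ 1.6 σ1 and the kernel size are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(sigma, size):
    """Sample a normalized 2D Gaussian of the given (odd) size."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.sum()

def dog_response(f, sigma1=1.0, sigma2=1.6, size=9):
    """Convolve the image with a difference-of-Gaussians kernel."""
    dog = gaussian_kernel(sigma1, size) - gaussian_kernel(sigma2, size)
    return convolve2d(f.astype(float), dog, mode='same')

# Edge pixels are then taken at the zero crossings of this response,
# exactly as in the LoG case sketched earlier.
```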

(Refer Slide Time: 22:40)

So, what is the Canny edge detection technique? The main points of the Canny edge detector are as
follows. The first point is that we have to minimize the error of detection; that means an edge should be
detected only when it is present, so the false alarms should be few. Another point is that the edge
should be localized. What is the meaning of this? The edge should be detected where it actually is in the
image. The next concept is singleness: multiple responses corresponding to a single edge point should be
avoided.

So, we have to follow these principles for the Canny edge detection, one is the edge should be
detected only when it is present, that is the false alarm should be less. And a second point is edge
should be localized, so, that means that we have to detect the edges where it is present in the
image. So, edge should be detected where it is present in the image. And the singleness, the
singleness means the multiple peaks corresponding to single edge point that should be avoided.
So, now, I will discuss the main steps of the Canny edge detection technique.

(Refer Slide Time: 24:07)

So, step number 1: first I have to blur the image with a Gaussian. How to do this? My input image is
f(x, y), and I convolve it with a Gaussian function h(x, y, σ) to get the output image g(x, y), where σ
is the parameter that controls the extent of blurring. So, I have to select σ. This is step number 1:
blur the image with a Gaussian to remove noise.

After this I have to find the gradient image that is find the gradient magnitude, I have to
determine. So, for this I can consider a 3 by 3 window or maybe I can consider 2 by 2 window.
Suppose, if I consider this type of window also I can consider to determine the horizontal
gradient and also the vertical gradient I can consider. So, this type of mask also I can consider or
maybe the 3 by 3 mask that also I can consider to determine the gradient along the x direction
and the gradient along the y direction and based on this I can determine gradient magnitude.

The gradient magnitude is m(x, y) = √(gx² + gy²), where gx and gy are the gradients along the x and y
directions, and I can also determine the angle α(x, y) = tan⁻¹(gy/gx); this angle is nothing but the
direction of the normal to the edge. So, for all the pixels of the image, I can determine m(x, y), the
gradient magnitude, and also the angle α(x, y).

So, one is the gradient magnitude m(x, y), and α(x, y) is the direction of the normal to the edge. This
is step number 2. Step number 3, which I want to explain now, is a very important step: non-maximum
suppression.

So, what do we have to consider? In a window I have to examine whether the gradient value of the pixel is
greater than that of the neighbouring pixels; if it is true, then we retain the pixel as an edge pixel,
otherwise I have to discard the pixel. That concept I am going to explain. So, first I have to consider a
window, and in that window I examine whether the gradient value of the pixel is greater than that of the
neighbouring pixels.

So, suppose I am considering one window here, this is a 3 by 3 window and suppose the pixels
are P1, P2, like this P3, P4, P5, P6, P7, P8, P9; these are the pixels. And in this case, suppose if I
consider a horizontal edge passing through the point P5. So, I am considering the horizontal edge
like this passing through the point P5 that I am considering. So, corresponding to this edge, I
have to find a neighborhood pixel. So, corresponding to this edge, this is the normal to the edge
one normal.

And again I am drawing the another figure, the same thing that I am drawing. So, P1, P2, P3, P4,
P5, P6, P7, P8, P9. So, the edge, this is the edge passing through the point P5 and I am
considering the central pixel, the central pixel is P5. And corresponding to this edge, this is the
edge, so this is the edge, another neighborhood pixel will be P2 that is the normal to the edge is
this is another normal. So, this normal I am considering and also these normal I am considering.
So, based on these normals corresponding to the pixel P5, I have two neighborhood pixels, one is
P8, P8 is the neighborhood pixel and another pixel is P2.

So, corresponding to the pixel P5 my neighborhood pixels are one is P8 and another one is P2;
P8 and the P2. So, what is the edge normal I can again give one example, suppose, I have the
edge, suppose this is my edge, this is the edge. So, corresponding to this edge this is the normal
to the edge, so this the normal to the edge and in this case, I am considering this angle that is
with respect to the reference axis. So, my reference axis, this is the reference axis.

So, this angle is α(x, y). Here, in a window I have to find the neighbouring pixels. You can see the edge
passes through the points P4, P5, P6, and since I am considering the pixel P5 as the central pixel,
corresponding to P5 my neighbouring pixels will be P2 and P8; you can see the edge normals, I have shown
two normals to the edge.

And based on this I can select the neighborhood pixels. So, corresponding to the pixel P5, I have
two neighborhood pixels one is P2 and other one is P8. Now, in this case what is the non-
maximum suppression? So, in a window I have to examine if the gradient value of the pixel is
greater than the neighborhood. So, in this case what I have to consider, I have to compare the
gradient magnitude value of the pixel P5 with P8 and the P2.

If it is greater than the gradient magnitudes of P8 and P2, then I retain it as an edge pixel; otherwise,
I have to discard P5. So, I am repeating this: first I have to find the neighbouring pixels, and after
finding them I compare the gradient magnitude of the pixel P5 with those of the neighbouring pixels P8
and P2.

If the gradient magnitude of the pixel P5 is greater than P2 and the P8, then in this case I have to
retain the pixel P5 as an edge pixel. If it is not true that means it is less than either P2 or P8, then
in this case I have to discard it, that is not the edge pixel. So, this is the non-maximum
suppression. So, in this case to find the neighborhood pixels and how to take the decisions, I will
explain, now it is not possible to see all the directions to find the neighborhood pixels. So, there
is a procedure to do this. So, maybe I can consider only 4 directions to see the neighborhood
pixels.

(Refer Slide Time: 34:04)

And for taking the decisions, let us consider this example. In this example I am only considering 4
directions. Suppose the marked angles are 0 degrees, 22.5 degrees, 67.5 degrees, 112.5 degrees and 157.5
degrees (the sketch is not to scale), and on the other side −157.5 degrees, −112.5 degrees, −67.5 degrees
and −22.5 degrees.

After this, suppose I am considering one edge, like this, and corresponding to this edge I draw the edge
normal, the normal to the edge. So, this is my edge and this is the normal to the edge; the angle that
the normal makes is α(x, y).

So, this is my horizontal edge and this is the vertical edge and this if I consider this portion, this
is 45 degree edge I am considering like this, 45 degree edge. So, you can see here I am only
considering 4 directions by defining range. That means, quantization of all possible edge
directions into 4 directions by defining the range, I am not considering all the directions to find
the neighborhood pixels. Here I am defining the range the range is from 0 to 22.5 degree, that is
one range and from 22.5 degree to 67.5 degree that I am considering, so, that is another range.

From 67.5 to 112.5 degrees is another range, and from 112.5 to 157.5 degrees is another range. So, I am
just defining the ranges (the sketch is not to scale). That is how I do the quantization of all possible
edge directions into 4 directions by defining the ranges. For this I consider the directions d1, d2, d3
and d4; 4 directions, and I consider a 3 by 3 region.

Now, how to apply the non-maximum suppression? I think you understand this concept, because
what is the concept behind this, I am not considering all the directions to find the neighborhood
pixels. So, what I am considering the quantization of all possible edge directions into 4 by
defining range. So, in this case, I am considering only 4 directions d1, d2, d3, d4 and for this I
am defining the range. So, range is from 0 to 22.5, 22.5 to 67.5, 67.5 to 112.5, 112.5 to 157.5.

And in this figure you can see which one is the vertical edge and which one is the horizontal edge.
Corresponding to the edge you can see the normal to the edge and the corresponding angle α(x, y). I have
already explained that for each and every pixel of the image you have to determine the gradient magnitude
m(x, y) and also the angle α(x, y).

Now, based on this concept, since I have only 4 directions, how do we do non-maximum suppression? The
first step is to find the direction dk, among the four directions d1, d2, d3, d4, that is closest to the
angle α(x, y). For all the pixels of the image we have to do this. After this, based on step number 1, we
can find the neighbouring pixels along that direction.

If the value of the gradient magnitude m(x, y) is less than that of at least one of the neighbours along
the direction dk, then I set gN(x, y) = 0, where gN is the non-maxima suppressed image; that means I am
suppressing that pixel. Otherwise, I retain the pixel, that is, gN(x, y) = m(x, y).
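A minimal sketch of this suppression rule, assuming the angle α(x, y) has been quantized into the four directions described above; the mapping of the diagonal bins to neighbour offsets depends on the image coordinate convention and is an assumption here.

```python
import numpy as np

def non_maximum_suppression(m, alpha):
    """Keep m(x, y) only if it is not smaller than its two neighbours
    along the direction closest to the edge normal alpha(x, y)."""
    gN = np.zeros_like(m)
    # Offsets of the two neighbours for each of the 4 quantized directions
    # (rows increase downwards; diagonal pairing is approximate).
    offsets = {0: ((0, 1), (0, -1)),     # normal ~ 0 deg   -> compare left/right
               1: ((-1, 1), (1, -1)),    # normal ~ 45 deg
               2: ((-1, 0), (1, 0)),     # normal ~ 90 deg  -> compare up/down
               3: ((-1, -1), (1, 1))}    # normal ~ 135 deg
    deg = (np.degrees(alpha) + 180.0) % 180.0
    d = ((deg + 22.5) // 45).astype(int) % 4   # quantize into 4 direction bins
    rows, cols = m.shape
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            (di1, dj1), (di2, dj2) = offsets[d[i, j]]
            if m[i, j] >= m[i + di1, j + dj1] and m[i, j] >= m[i + di2, j + dj2]:
                gN[i, j] = m[i, j]
    return gN
```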

So, if the value of m(x, y), the gradient magnitude, is less than that of at least one of the neighbours
along the direction dk, then I apply the suppression, gN(x, y) = 0; otherwise I retain that pixel,
gN(x, y) = m(x, y). What is gN(x, y) here? It is the non-maxima suppressed image. So, this is a very
important step of the Canny edge detection technique, how to apply non-maxima suppression. The next step
is step number 4.

(Refer Slide Time: 42:18)

The step number 4 is thresholding with hysteresis. Now in this case because I have to select the
edge pixels, so for this I have to consider a threshold for comparison, that means we have to
reduce the false edge points. So, for this I have to consider the thresholding operation. So, if I
consider only single threshold for this comparison that means whether the pixel is the edge pixel
or the not edge pixel, if this value is greater than a particular threshold, based on this I can
consider that pixel is the edge or the not the edge pixel.

So, if I consider only the single threshold, so there is a problem. The problem is, if the threshold
is too low, the threshold if I select that is the too low threshold value, so what will happen? I will
be getting false edges. That is, I can consider as false positive, that is corresponding to the low
threshold. And if I select suppose very high threshold, if the threshold is very high, again there
will be a problem. So, what is the problem? Actual valid edge pixels will be eliminated.

That is nothing but a false negative. So, that is the problem with a single threshold. The Canny edge
detection method therefore uses two thresholds, a low threshold TL and a high threshold TH. Based on
these, the decision is taken as follows. First I determine gNH(x, y) by comparing the non-maxima
suppressed image gN(x, y) with the high threshold TH.

I can also compare gN(x, y) with the low threshold TL to get gNL(x, y). In gNH(x, y) I have fewer
non-zero pixels than in gNL(x, y); that is obvious, because gNH is obtained by comparison with the high
threshold.

So, you can see how to get these two images, gNH and gNL: I compare the non-maxima suppressed image with
the two thresholds, the high threshold and the low threshold. For initialization, both gNH and gNL are
initialized to 0, and after this I determine gNH and gNL.

Another step I consider is gNL(x, y) = gNL(x, y) − gNH(x, y). What is the meaning of this? All the
non-zero pixels in gNH(x, y) are contained in gNL(x, y), so by this subtraction I eliminate those pixels
from gNL.

So, I eliminate those pixels by using this operation. After doing all these operations I get two images:
gNH(x, y), which corresponds to the strong edge pixels, and gNL(x, y), which corresponds to the weak edge
pixels. The thresholding operation I can draw like this; what I am actually doing is thresholding with
hysteresis.

This threshold is the low threshold and this one is the high threshold; below it the output is black
(dark) and beyond this point it is white, because after edge detection I get a binary image, the pixel is
either present or not. So, this is how the thresholding with hysteresis is done.
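A minimal sketch of this double thresholding; the threshold values are illustrative assumptions, and gNH and gNL follow the definitions above.

```python
import numpy as np

def double_threshold(gN, t_low=0.05, t_high=0.15):
    """Split the non-maxima suppressed image into strong and weak edge maps."""
    gNH = np.where(gN >= t_high, gN, 0.0)   # strong edge pixels
    gNL = np.where(gN >= t_low, gN, 0.0)    # all candidate edge pixels
    gNL = gNL - gNH                          # keep only the weak pixels
    return gNH, gNL

# Edge linking then keeps a weak pixel only if it is connected
# (for example, 8-connected) to some strong pixel in gNH.
```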

So, after thresholding with hysteresis I get two images, gNH(x, y), which corresponds to the strong edge
pixels, and gNL(x, y), which corresponds to the weak edge pixels. Finally, since I now have the edge
pixels, to get the boundaries I have to do edge linking. The final step, step number 5, is edge linking:
I have to connect the pixels.

The edge pixels I have to connect to get the boundary. So, I have the edge pixels like this and I
have to connect the edge pixels based on some conditions and to get the edges I have to do this.
Because by considering these steps, the step number 1, step number 2, step number 3, step
number 4 I can get the strong edge pixels and the weak edge pixels. After this I have to do the
edge linking. So, this is about the Canny edge detection technique, the edge linking procedure I
will explain.

(Refer Slide Time: 51:53)

So, what is the Canny edge detection? First I have to do the Gaussian blurring, which I have already
explained: the image is convolved with a Gaussian to remove noise; this is the first step. Second, I have
to do the gradient operation, so I determine the gradient magnitude image m(x, y) and also the angle
α(x, y), the direction of the edge normal.

After this I apply the non-maximum suppression technique and I get the image gN(x, y), the non-maxima
suppressed image; this procedure I have already explained, and for it I have to compare the neighbouring
pixels. Finally, I consider two thresholds, a low threshold and a high threshold, and based on these I
get two images, gNH(x, y), which corresponds to the strong edge pixels, and gNL(x, y), which corresponds
to the weak edge pixels. These are the steps, and after doing this, step number 5 is the edge linking.

(Refer Slide Time: 53:29)

So, you can see, how to do the edge linking you can see. That means we have to give the
boundary, so suppose if I consider a 4 connected neighborhood, suppose if I consider one
window, 3 by 3 window. And corresponding to this central pixel, the central pixel is suppose P,
if I consider the 4 connected neighborhood, so this is my neighborhood pixels corresponding to
the 4 connected neighborhood. So, in this case to get the boundaries, what I have to consider?

Corresponding to each edge point, we search for similar points in its connected neighbourhood. I am
repeating this because I have to find the object boundary; I have to connect the edge pixels. So,
corresponding to each edge point we search for similar points in the connected neighbourhood. In this
case I am considering two points, one point is (x1, y1) and the other is (x2, y2).

I determine the gradient magnitude at the point (x1, y1) and also at the point (x2, y2). If the
difference between the two magnitudes is less than a particular threshold T1, I take that as the first
condition. For the second condition, I determine the direction, that is, the angle, at (x1, y1) and at
(x2, y2); the angle, as already defined, is α(x, y), the direction of the edge normal, and the difference
between the two angles should also be less than an angle threshold.

If these two conditions are satisfied, then based on them I can join the edge pixels: this pixel and that
pixel I can join. Similarly, if the conditions are satisfied between two other points, I join those
pixels as well, to get the boundary, the continuous edge. This is the edge linking procedure.
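A rough sketch of this similarity test between two neighbouring edge pixels; the threshold names and values T1 and T2 are illustrative assumptions.

```python
import numpy as np

def similar_edge_pixels(m, alpha, p1, p2, T1=25.0, T2=np.radians(15)):
    """Decide whether two neighbouring edge pixels p1 and p2 can be linked,
    based on closeness of gradient magnitude and of edge-normal direction."""
    (x1, y1), (x2, y2) = p1, p2
    mag_close = abs(m[x1, y1] - m[x2, y2]) <= T1
    # Wrap the angle difference into [-pi, pi] before comparing.
    d_alpha = np.arctan2(np.sin(alpha[x1, y1] - alpha[x2, y2]),
                         np.cos(alpha[x1, y1] - alpha[x2, y2]))
    return mag_close and abs(d_alpha) <= T2
```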

(Refer Slide Time: 55:49)

So, here I am showing one example of edge detection by the Canny edge detection technique; you can see a
second example as well. We have applied the Canny edge detection technique to determine the edges and the
boundaries. Up till now, I have discussed the concept of edge detection using the model-based techniques.
First I discussed the LoG operation, the Laplacian of Gaussian.

For this, the image is convolved with a Gaussian and after this we take the Laplacian; that is, the image
is convolved with the LoG operator, the Laplacian of Gaussian. After this I have to find the zero
crossings, and based on the zero crossings I can decide whether a particular pixel is an edge pixel or
not. That is about the LoG operator.

Similarly, I have also discussed the DoG operator, the difference of Gaussians. After this I considered
the Canny edge detection, which has five steps. The first step is to blur the image: convolve the image
with a Gaussian to remove noise. Step number 2 is to find the gradient magnitude m(x, y) in the image,
and also the angle α(x, y), the direction of the normal to the edge.

The third step is very important, that is the non-maximum suppression that I have to consider and
for this I have to consider neighborhood pixels. So, instead of considering all the directions I can
consider only few directions, the quantization of all the directions into the few directions, in my
example I have considered only 4 directions and I can find the neighborhood pixels. And after
this I can determine the non-maxima suppressed image, the non-maxima suppressed image I can
determine because I have to consider the neighborhood pixels.

And finally, I have to apply the thresholding with hysteresis, that principle I have to apply. So, in
the Canny edge detection technique, I have two thresholds, one is the low threshold, another one
is the high threshold. And corresponding to this I will be getting two images, one is the g NL and
another one is the gNH.

So, I will be getting the weak edge pixels and the strong edge pixels I will be getting. And after
this I have to join the edge pixels, that is called the edge linking to get the boundary, to get the
continuous edges. So, this is about the Canny edge detection technique and the model-based
technique. So, let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture 23
Hough Transform

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing - Fundamentals
and Applications. In my last classes I discussed about the concept of edge detection. So, how to
detect the edges in an image, many edges in an image can be approximated by lines. So, I can
detect lines present in an image. So, today I am going to discuss how to detect the lines or maybe
the circles in an image.

Line detection can be done by a transformation to a parameter space, called the Hough transform. These
lines, edges and points are features of the image, so line detection is quite important: the extraction
or detection of a line can be considered as an image feature.

This Hough transform is quite useful to detect the lines present in an image, also I can detect
circles present in an image. So, today I am going to discuss about this Hough transform. So,
before going to the Hough transform how to detect the lines by using the mask, the masking
operations.

(Refer Slide Time: 1:47)

So, I can explain this concept. Suppose I have a 3 by 3 mask whose weights are w1, w2, ..., w9, and
suppose I consider a 3 by 3 image region whose pixels are z1, z2, ..., z9.

Now, how do we detect a particular line, or a particular point? What is the masking operation? I place
the mask over the image region, and corresponding to this I determine the response of the mask, which is
the sum R = w1·z1 + w2·z2 + ... + w9·z9.

Suppose I consider the Laplacian mask; if I convolve the image with this mask, you can see the centre
coefficient is 4. That means corresponding to any isolated point the response of the mask will be
maximum. So, the output g(x, y) will be 1 if the response of the mask is greater than a particular
threshold, and 0 otherwise.

So, that means corresponding to isolated points I will be getting g (x, y) is equal to 1 and
otherwise it will be 0. So, for all these points isolated points and my response will be 1,
otherwise it will be 0. So, that means, I can detect points in an image by using this mask, the
mask is the Laplacian mask and already I have discussed about the compass operators.

Suppose I consider this mask, or another mask. If you see the first mask, the response will be maximum
along a horizontal line; that means this mask can detect horizontal lines, and the other mask can detect
vertical lines.

If I use the compass operators, I can detect the lines, suppose it is in 0-degree direction, 45-
degree directions, 90-degree directions and also 135-degree directions. So, these lines I can
detect, the lines oriented in this particular direction that is 0-degree direction, 45-degree
direction, 90-degree direction and 135-degree directions.

For this I can use the compass operator, but for other lines I cannot use such masks; it is not possible
to detect all the lines oriented in arbitrary directions, not just 0 degrees or 45 degrees, and in that
case it is very difficult to detect the lines present in an image. That is why I consider a
transformation, called the Hough transformation.

(Refer Slide Time: 5:49)

So, before going to the Hough transformation, suppose I have the edge pixels, and I have already
explained that many edges can be approximated by straight lines. Mainly I am considering how to detect
straight lines. Suppose I consider n edge pixels; corresponding to these, how many possible lines are
there, since I have to group the points to make lines? There are n(n − 1)/2 possible lines.

So, I have n edge pixels, and from these I have to find how many lines can be produced; that means I can
get n(n − 1)/2 possible lines. This is nothing but the line grouping problem. In this case I am
considering one candidate line through the edge pixels, and I consider these pixels one by one.

I check whether a pixel is close to the particular line I am considering; that is essentially the line
grouping technique, and by this I can detect a line. But the main problem is that if I consider n edge
pixels, the number of possible lines from this grouping process will be n(n − 1)/2. That is very
difficult to do; it is computationally complex. That is why I have to consider a transformation, the
Hough transform.

(Refer Slide Time: 7:48)

So, this concept I have explained: many edges can be approximated by straight lines, and here you can see
that these edges are approximated as straight lines. The second point is that if I consider n edge
pixels, the number of possible lines will be n(n − 1)/2.

Also, to find whether a point is close to a line we have to perform comparisons against each of the
n(n − 1)/2 candidate lines, so you can see the order of the complexity. We have to do many comparisons to
find a particular line; that is the problem of line grouping, and that is why we consider the
transformation, the Hough transform.

(Refer Slide Time: 8:57)

The Hough transform is a very important technique to detect lines or circles present in an image. In the
Hough transform the edges need not be connected, and the complete object need not be visible. The key
idea is voting: we vote to find the possible models.

(Refer Slide Time: 9:31)

In this case I am considering one line, whose equation is y = mx + c. The parameters of the line are the
gradient m and the intercept c. In these two figures I have shown two spaces: one is the image space, the
x-y space, where I show the x and y coordinates, and the other is the parametric space, where I show the
m and c values, m being the gradient and c the intercept.

Suppose I consider one point (xi, yi). Corresponding to this point, what will I get in the (m, c) space,
which is called the parametric space? In the parametric space I will get a line, since c = −xi m + yi.
Suppose I consider another point that is collinear with the previous one; corresponding to this second
point I will also get another line in the parametric space.

These lines will intersect at a particular point, the point is this point and that is the corresponding
point, the point value is basically the m, c value of the line I have shown here in the figure 1. For
corresponding to the line, I have shown in the figure, I have the m, c value that is a gradient and
the intercept value. And in the second figure, if I consider the parametric space, corresponding to
all the points which are collinear I will be getting lines and they will intersect at a particular point
and that point is nothing but the m, c value of the line.

(Refer Slide Time: 11:33)

So, in this case you can see, so, how to consider the parametric representation of a straight line
for line detection. So, here I have shown the x, y space and the parametric space and you can see
corresponding to this line, particular line in the x, y space, I am getting the m, c value; m, c value
in the parametric space, the parametric space is m and c. Because corresponding to this line, I
have the fix value, the fix value is m is the fix and c is fix corresponding to that particular line.
So, you can see the mapping between this x, y space and the m, c space. So, just you see the
mapping of this.

(Refer Slide Time: 12:20)

Now, in this case you can see I am considering two points one is (x 1, y1) another one is suppose
x, y that is in the x, y space. So, two points I am considering one is the x, y point, this point and
another point is suppose (x1, y1). So, corresponding to this (x, y) point the first point suppose,
corresponding to this first point, I will be getting a line in the parametric space. So, parameter
space is m, c already I have explained. So, corresponding to the first point, the point is (x, y), I
will be getting a line that is l1 in the parametric space.

And suppose the (x1, y1)that is a collinear point with respect to the first point, that is a collinear
point, then in this case corresponding to the second point, the second point is (x1, y1), I will be
getting another line the line is l2 in the parametric space, these two lines will intersect at the
point, the point is P, that gives the m, c value corresponding to the line, that line I can join
between these two points corresponding to this line I have the m and c value.

So, in this case, if you see here, these two lines l 1 and l2, they will intersect at the point P and the
value of the point P is the m and c value of the line joining these two points (x, y) and (x 1, y1).
So, that is already I have explained the straight-line map of another point collinear with these
two points will also intersect at point P. So, if I consider another point that is collinear with the
points (x , y) and (x1, y1) that is another point I am considering, (x2, y2).

So, corresponding to this point also, I will be getting one line in the parametric space and in this
case, it will intersect at the point, the point is P. P is nothing but the m, c value of the lines in the
edge image plane. So, this is the concept of the Hough transformation. So, that means, in this
case what I have to do I have to count how many times it is intersecting. So, in suppose I have
considered three points, then in this case, three lines I can consider, that means how many times
it is intersecting at a particular point? Three times it is intersecting.

So, that means I have to count that is the voting, this is called the voting. So, how many times I
am getting the vote corresponding to these points, which are collinear, so, that means, in this
case, I will be getting three votes, if I consider three points, that is which are collinear if I
consider four points, then in this case I will be getting four votes. So, that means, I have to count
how many intersections in the (m, c) space that I have to consider.

(Refer Slide Time: 15:28)

So, for the Hough transformation I have to count how many intersections occur in the (m, c) space. To
count these I consider an accumulator A(m, c); the parameter space (m, c) is quantized, and I initialize
the accumulator to 0. Corresponding to each image edge point (xi, yi), I increment the accumulator cells
along the line c = −xi m + yi; that means I am counting how many times the intersections occur, that is,
I am counting the votes.

Like this I count for all the edge pixels, and then I find the local maxima in the accumulator array; for
example, two votes in a cell mean intersections corresponding to two points. After this, I group the
edges that belong to each line by traversing each line, so that I can find the line. This is nothing but
the voting.

(Refer Slide Time: 17:11)

So, we have considered the parametric space as the (m, c) space, that is, the gradient and the intercept
of a line. But in principle these are unbounded, so they cannot handle all situations; for example, for a
vertical line the gradient is infinite, and then it is very difficult to apply this algorithm because m
and c are unbounded for some of the cases.

So, instead of considering the (m, c) space, the gradient and the intercept, I can consider the two
parameters ρ and θ of the normal form of a straight line, x cos θ + y sin θ = ρ. So, instead of
y = mx + c, I consider this equation.

Corresponding to this equation I have two parameters, ρ and θ; θ varies between −90 degrees and +90
degrees, and ρ varies from 0 to √(M² + N²), where the size of the image is M × N. So, because m and c are
unbounded for some of the cases, we consider the parametric space (ρ, θ) instead of (m, c).

I have shown the x-y image space, and here you can see θ and ρ, the parametric space. In the parametric
space I will get a sinusoid, because the equation is x cos θ + y sin θ = ρ.
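A minimal sketch of this voting procedure in the (ρ, θ) space, assuming a binary edge image as input; here ρ is allowed to be negative, which is a common implementation convention.

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Accumulate votes in (rho, theta) space for a binary edge image."""
    rows, cols = edges.shape
    diag = int(np.ceil(np.hypot(rows, cols)))
    thetas = np.deg2rad(np.arange(-90, 90, 180.0 / n_theta))
    acc = np.zeros((2 * diag + 1, len(thetas)), dtype=int)  # rho in [-diag, diag]
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        rho = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rho + diag, np.arange(len(thetas))] += 1         # one vote per theta bin
    return acc, thetas

# Peaks of acc give the (rho, theta) values of the detected lines.
```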

(Refer Slide Time: 19:23)

So, here you can see corresponding to this point what I will be getting in the parametric space.
So, that means, the Hough space is the sinusoid. So, I will be getting a sinusoid here in the
parametric space. So, Hough space is sinusoid.

(Refer Slide Time: 19:40)

In this case you can see I am considering these points, which are collinear and form a line. All these
points will give the maximum vote at one cell, while the rest of the vote array appears almost dark. Why
is it dark? Most cells in the vote array are very dark because they get only one vote.

The maximum vote is obtained because these points are collinear; for the other points I get sinusoids
that hardly overlap, but the maximum vote is obtained at the (ρ, θ) value corresponding to the line
joining these points.

(Refer Slide Time: 20:35)

In this case, you can see I am considering the image space, you can see the parametric space, the
parametric space is the rho and theta space. And you can see the votes corresponding to suppose
if I consider this line, I will be getting one vote here, corresponding to these points, I will be
getting one line, so, I will be getting another vote. Corresponding to this line, I will be getting
another vote. So, like this, I am getting the votes and you can see the sinusoids here, you can
these are the sinusoids in the rho theta space, this is the parametric space.

(Refer Slide Time: 21:09)

Here I have shown one example. I am considering 4 points which are collinear, and also one isolated
point. Corresponding to the isolated point I get one sinusoid, and corresponding to the 4 collinear
points I get 4 sinusoids, which intersect at one point that corresponds to the (ρ, θ) value of the line
joining them.

For the isolated point there is no common intersection, which is obvious. That means I get the maximum of
4 votes corresponding to the 4 collinear points, and no accumulated votes corresponding to the isolated
one. Based on this I can conclude that these 4 points are collinear, and I can obtain a line from them.

(Refer Slide Time: 22:11)

In this case what I am considering the lot of noise we can see, the noisy points and in this case, it
is very difficult to find the lines because you can see I am getting votes for different noisy points,
the votes are like this. So, in this case noisy points are available. And in this case, it is very
difficult to find a maximum value in the accumulator, the accumulator array.

(Refer Slide Time: 22:39)

And in this case, you can see some real world examples, I have shown the original image and
you can see that the edge detection technique I am applying like I can apply the Sobel operator or
maybe the Prewitt operator or maybe I can apply the log operation that Laplacian of Gaussian.
So, this is the edge detection technique I have applied and in the second case, I am applying the
Hough transform to find the lines. So, you can see I can determine the lines present in an image
and in this case, I have shown that parametric space.

(Refer Slide Time: 23:18)

Now, this Hough transform can be extended to the detection of circles present in an image. In this case I
consider the equation of the circle, (xi − a)² + (yi − b)² = r². If the radius is known, I can consider a
two-dimensional Hough space. I have shown the circle here, and corresponding to the circle the radius is
assumed to be known.

And the parameters, I have two parameters one is a and another one is b. So, what is the equation
of the line? The equation of the line is this is the equation of the line

( x ¿¿ i−a) +( y ¿¿ i−b) =r ¿ ¿. And corresponding to this if the radius is known in my


2 2 2

accumulator, I have only two parameters one is a and another one is b. So, in this case you can
see I am considering a circle in the second figure, this figure and I have shown the points of the
circle.

So, corresponding to this point, what will I get in the parametric space? The parametric space is now a and b, because I have two parameters of the circle. Corresponding to the point xi, yi in the x-y space, I get a circle in the a-b parametric space.

Similarly, corresponding to another point on the circle, I get another circle in the parametric space. Like this, corresponding to all these points I get circles, and these circles intersect at a particular point. That intersection corresponds to the a and b values of the circle shown in the figure.

So, in the parametric space all these circles intersect at one point, and that gives the values of a and b. Here also I have to count the number of intersections, so again I need an accumulator for the voting, that is, to count how many times each cell is hit.

(Refer Slide Time: 26:03)

In this case, you can see the equation of the circle. If the radius is known, the accumulator A has only the two parameters a and b; if the radius is not known, the accumulator is three-dimensional, over a, b and r.
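A minimal sketch of this voting scheme is given below, assuming the radius is known; the angular sampling is my own illustrative choice. When the radius is unknown, the same loop is simply repeated over a range of candidate radii, giving the three-dimensional (a, b, r) accumulator mentioned above.

```python
import numpy as np

def hough_circle_known_radius(edge_map, radius, n_angles=360):
    """Each edge pixel votes for all centres (a, b) lying at the given radius."""
    h, w = edge_map.shape
    accumulator = np.zeros((h, w), dtype=np.int32)
    angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)

    ys, xs = np.nonzero(edge_map)
    for x, y in zip(xs, ys):
        # candidate centres: a = x - r cos(t), b = y - r sin(t)
        a = np.round(x - radius * np.cos(angles)).astype(int)
        b = np.round(y - radius * np.sin(angles)).astype(int)
        valid = (a >= 0) & (a < w) & (b >= 0) & (b < h)
        np.add.at(accumulator, (b[valid], a[valid]), 1)

    # the centre of a true circle collects votes from all of its edge pixels
    b_best, a_best = np.unravel_index(np.argmax(accumulator), accumulator.shape)
    return (a_best, b_best), accumulator
```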

(Refer Slide Time: 26:30)

In this case, you can see the detection of coins by using the Hough transform, and I have also shown the results of the edge detection step. In the edge detection output you can see many small, broken edges, but if I apply the Hough transform, then I obtain the circles corresponding to all the coins.

(Refer Slide Time: 26:59)

Now, the next point I want to discuss is the generalized Hough transform. What is the generalized Hough transform? It handles model shapes that are not described by an equation. I have already discussed how to detect a straight line, using the equation y = m x + b or x cos θ + y sin θ = ρ, and how to detect circles present in an image; for a circle also I have the equation (xi − a)² + (yi − b)² = r².

But suppose I consider this object; I cannot represent it by an equation, that is, the model shape is not described by an equation. For this I can apply the generalized Hough transform. So, what is the method? Consider this edge pixel (x, y); I have also shown the edge normal, that is, the normal to the edge at that pixel.

I am also considering one reference point in the object; this reference point is (Xc, Yc). From the edge point I draw a vector to the reference point, and the length of this vector is r.

Now look at the angles. The angle between the reference axis and the line joining the edge pixel to the reference point is alpha; this is my alpha angle.

I also consider the angle phi: the direction of the normal to the edge with respect to the reference axis. So, the parameters are: r, the distance between the edge point (x, y) and the reference point (Xc, Yc), which is the centroid; alpha, the angle between the reference axis and the line joining the edge pixel to the reference point; and phi, the direction of the edge normal. These parameters r, α, φ are used to represent a particular edge pixel.

In the second picture I show the same thing with two edge pixels. The reference point, the centroid of the object, is (Xc, Yc), and you can see the parameters: r1 is the distance between the first edge point and the reference point (Xc, Yc).

Similarly, corresponding to the second edge point, the distance to the reference point is r2; that is the second vector. I also have the angles already defined, α1 and α2, and the directions of the edge normals, φ1 and φ2. So, you can see the parameters are r, α and φ.

(Refer Slide Time: 31:43)

So, for this technique, the generalized Hough transform, where the model shape is not described by an equation, I have to build a table, called the phi table, for the generalized Hough transform. In this table I consider the different edge directions φ1, φ2, and so on, and against each edge direction I store the vector r, which has two components: r, the distance I have already defined, and the angle α.

Corresponding to φ1 I may have r1, r2, r3, because for the same edge direction φ1 several such vectors can occur in the model.

Each vector r has the two components r and α. Similarly, corresponding to another edge direction I may get r1 and r2, and so on. In this way, for every edge direction I store the corresponding vectors r; this is the phi table for the generalized Hough transform.

(Refer Slide Time: 33:16)

So, what is the algorithm to detect an object that is not described by a model equation? I want to find the object centre, that is, the centroid (xc, yc). The given data are the edge points (xi, yi, φi). Again I have to do voting, so I create an accumulator over the possible (xc, yc) values and initialize it to zero.

After this, for each edge point (xi, yi, φi), I look up φi in the phi table and retrieve the corresponding vectors rk. From each rk I compute a candidate (xc, yc) using the given equations and increment the accumulator at that cell. Finally, as described earlier, I find the local maxima in the accumulator.

That means, corresponding to the direction of the edge normal, I determine the vectors rk from the phi table, compute the corresponding values of xc and yc, and increment the accumulator. So, the process is very simple. The only important preparation, as you can see in my previous slide, is to build the phi table indexed by these angles.

The angles are φ1, φ2, and so on, that is, the directions of the edge normal. Corresponding to a particular angle there may be two or three vectors, say r1 and r2, because for the same angle I can have r1 or r2, and each vector r has the two components r and α.
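The sketch below puts these steps together. The table construction from a template, the use of the gradient direction as φ, and the vote equations xc = x + r·cos(α), yc = y + r·sin(α) follow the usual textbook formulation; since the slide equations are not reproduced in the transcript, treat the exact sign conventions here as assumptions.

```python
import numpy as np
from collections import defaultdict

def build_phi_table(template_edges, template_normals, reference_point):
    """Phi table: edge-normal direction (quantized to degrees) -> list of (r, alpha)."""
    xc, yc = reference_point
    table = defaultdict(list)
    for (x, y), phi in zip(template_edges, template_normals):
        r = np.hypot(xc - x, yc - y)             # distance edge point -> reference point
        alpha = np.arctan2(yc - y, xc - x)       # angle of that vector w.r.t. the x-axis
        table[int(np.degrees(phi)) % 360].append((r, alpha))
    return table

def ght_detect(edges, normals, table, image_shape):
    """Each edge point votes for candidate reference points (xc, yc)."""
    acc = np.zeros(image_shape, dtype=np.int32)
    h, w = image_shape
    for (x, y), phi in zip(edges, normals):
        for r, alpha in table.get(int(np.degrees(phi)) % 360, []):
            xc = int(round(x + r * np.cos(alpha)))
            yc = int(round(y + r * np.sin(alpha)))
            if 0 <= xc < w and 0 <= yc < h:
                acc[yc, xc] += 1
    return np.unravel_index(np.argmax(acc), acc.shape), acc
```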

(Refer Slide Time: 35:18)

Now suppose I consider scaling and rotation of the object. In that case I have to extend the accumulator: so far it had only the xc and yc parameters, and now I also allow a scaling by a factor S and a rotation by an angle θ. So, the object is scaled and rotated; how do I detect it?

Corresponding to this I use the modified equations, because now I have two more parameters, the scaling parameter S and the rotation parameter θ. I compute xc and yc from these equations, increment the accumulator accordingly, and then find the maximum in the accumulator. So, this works for any arbitrary shape.
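In the usual formulation the vote update simply rescales and rotates the stored vector before adding it to the edge point; the exact expressions on the slide are not reproduced in the transcript, so the lines below are a hedged reconstruction of that step.

```python
import numpy as np

def ght_votes_scaled_rotated(x, y, table_entries, scales, thetas):
    """Candidate centres for one edge point under scaling s and rotation theta."""
    votes = []
    for r, alpha in table_entries:               # entries looked up from the phi table
        for si, s in enumerate(scales):
            for ti, theta in enumerate(thetas):
                xc = x + s * r * np.cos(alpha + theta)
                yc = y + s * r * np.sin(alpha + theta)
                votes.append((int(round(xc)), int(round(yc)), si, ti))
    return votes   # each vote increments a 4-D accumulator over (xc, yc, s, theta)
```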

(Refer Slide Time: 36:12)

Now some comments on the Hough transform. The Hough transform works on disconnected edges, as you can see in this example; in the coin example also it worked on disconnected edges. It is relatively insensitive to occlusion, and it is effective for simple shapes: I can detect lines, circles, and so on. In the Hough transform we have the image space and the parametric space, and the method is a mapping from the image space to the parametric space.

It also handles inaccurate edge locations. If I apply an edge detection technique, some of the detected edge pixels may not be correct, but by applying the Hough transform I can still recover the lines or circles; the noisy edge pixels are effectively neglected.

These are the advantages of the Hough transform. Today I have discussed line detection and circle detection using the Hough transform, and also the case of objects which cannot be described by model equations, for which the generalized Hough transform has to be applied. This is about the Hough transform. Let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture-24
Image Texture Analysis - I
Welcome to NPTEL MOOCs course on Computer Vision and Image Processing
Fundamentals and Applications. Today I am going to discuss image texture analysis.
So, what is image texture? Image texture means the spatial distribution of gray level intensity values, that is, how the gray level intensity values are distributed spatially. The basic element of a texture is called a texel: a group of pixels having a homogeneous property.

A texel is repeated spatially, and we get a particular texture. Texture is a very important image feature; humans can distinguish different surfaces based on texture. So, in computer vision also, texture is a very important image feature. The texture feature can be combined with other features like colour, motion and shape features for object recognition and image classification. For texture analysis there are many approaches, and we will discuss them.

There are mainly 4 research directions. One direction is texture classification: we extract texture features and, based on these features, classify the different types of textures present in an image. Another direction is texture segmentation: based on texture information we can do image segmentation, that is, partitioning of an image into connected homogeneous regions.

Another direction is texture synthesis: the input is a small texture sample, and from that sample we want to generate a big texture image. Another direction, which I already explained in my first class, is shape from texture: the shape of a surface can be estimated from texture variations, since texture variations give cues to estimate the shape of a particular surface.

So, these are the main research directions: texture classification, texture segmentation, texture synthesis, and shape from texture. Now I will discuss the concept of texture analysis, and I will also discuss texture synthesis.

(Refer Slide Time: 3:34)

Here you see the definition of a texture: a repeating pattern of local variations in image intensity, that is, spatial variation in image intensity. It is characterized by the spatial distribution of intensity levels in a neighbourhood, and a texture cannot be defined for a single point. It is a feature used to partition an image into regions of interest and to classify those regions. Texture provides information on the spatial arrangement of colours or intensities in an image. So, this is the definition of a texture.

(Refer Slide Time: 4:15)

In this figure you can see that a texture is a repeating pattern of local variations in image intensity; you can see the texture and also the local variations in image intensity. Looking at this portion of the texture, you can see the corresponding local variations in image intensity.

In the second figure I have also shown a particular texture. From this you can see that the shape of a particular surface can be estimated from the information of texture variations; that concept I have already explained in my first class.

(Refer Slide Time: 5:01)

In this example, an image has a 50 percent black and 50 percent white distribution of pixels. Corresponding to this, I may have three different images with the same intensity distribution but different textures. You can see the first, second and third textures: all have the 50 percent black and 50 percent white distribution, yet they are different images.

(Refer Slide Time: 5:35)

A texture consists of texture primitives, called texels. What is the definition of a texel? A group of pixels having a homogeneous property is called a texel. If I take a particular texel and distribute it spatially, I obtain a texture.

A texture can be described as fine, coarse, grainy or smooth. Such descriptions are based on the tone and the structure of the texture.

What is the definition of tone? Tone is based on the pixel intensity properties in the texel. And what is the meaning of structure? Structure represents the spatial relationship between the texels; here I have shown one such spatial relationship, and that is the structural property of the texture. If the texels are small and the tonal differences between texels are large, a fine texture results.

And if the texels are large and consist of several pixels, a coarse texture results. So, based on the texels, I can define the different types of textures: fine, coarse, grainy and smooth.

(Refer Slide Time: 7:17)

As already defined, there are 4 primary issues in texture analysis: texture classification, texture segmentation, texture synthesis, and shape from texture. We can consider these as research directions in computer vision. Here you can see this photo: this pattern is repeated, and I get the textured image.

(Refer Slide Time: 7:49)

So, what is texture classification? Texture classification is concerned with identifying a given textured region from a given set of texture classes. I have texture classes, and I have to identify which class a given textured region belongs to. For this I have to extract texture features; for example, I can extract statistical features from the GLCM, the gray level co-occurrence matrix, such as contrast, entropy and homogeneity, and based on these features I can do the texture classification.
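As a toy illustration of this idea, the sketch below classifies a texture by comparing its feature vector (for example contrast, homogeneity and entropy) against per-class mean vectors; the prototype values are made up purely for illustration, and a real system would train a proper classifier on many labelled samples.

```python
import numpy as np

def classify_texture(feature_vector, class_prototypes):
    """Nearest-prototype classification of a texture feature vector.

    class_prototypes: dict mapping class name -> mean feature vector of that class.
    """
    distances = {name: np.linalg.norm(np.asarray(feature_vector) - np.asarray(proto))
                 for name, proto in class_prototypes.items()}
    return min(distances, key=distances.get)

# hypothetical prototypes built from training images of each texture class
prototypes = {"smooth": [0.05, 0.90, 2.1], "coarse": [0.60, 0.30, 5.4]}
print(classify_texture([0.50, 0.35, 5.0], prototypes))   # -> "coarse"
```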

Another one is texture segmentation, which is the partitioning of the image into different regions where the texture is homogeneous. Suppose I have an image containing different types of textures; I partition the image so that each region contains one kind of texture.

That is, I partition the image into regions of homogeneous texture; in other words, I do image segmentation based on texture. The next one is texture synthesis. Texture synthesis means constructing a large digital image from a small digital sample image; that is, I generate a large image from a small texture.

The goal is to synthesize other samples from that particular texture, or to generate a large digital image from a small texture pattern; that is called texture synthesis. And shape from texture means that texture pattern variations give a cue to estimate the shape of a surface. In these examples you can see the texture pattern variations, and these variations indirectly give information to estimate the shape of a surface.

So, this is called shape from texture. I will discuss all these concepts one by one: texture classification, for which I have to extract texture features; texture segmentation, for which I also have to extract features; texture synthesis, for which I will show how to construct a large digital image from a small texture; and shape from texture.

(Refer Slide Time: 11:08)

This is one example of image segmentation based on texture. On the left you can see the input image and on the right the segmented image, obtained using texture information.

(Refer Slide Time: 11:23)

Similarly, you can see these two images. Based on the textures, we can segment out different regions of an image: the texture is homogeneous within this portion and within that portion, so I can partition the image accordingly.

In the second image also, the texture is homogeneous within this portion, so based on the texture information I can do the image segmentation.

(Refer Slide Time: 12:02)

The next one is texture synthesis. Here I have a small input image, and from this small input image I want to generate a large image based on its texture pattern. You can see the small input texture pattern, and from it, by synthesis, I can artificially generate a big image. The goal is to synthesize other samples from the same texture: the input texture is given, and from it I generate other samples of that texture, so I can generate a big image. I will now give some examples of texture synthesis.

(Refer Slide Time: 12:51)

Here you can see the examples. I am considering a small texture pattern, and from it I generate a big image; the texture pattern is repetitive, and from it I obtain the bigger images. These are examples of texture synthesis. Here is another example: from a small texture pattern I can generate a big image.

(Refer Slide Time: 13:23)

These are some more examples of texture synthesis; this one is also an example of texture synthesis.

(Refer Slide Time: 13:31)

These are again other examples of texture synthesis.

(Refer Slide Time: 13:35)

Texture synthesis is sometimes used in image forgery. You can see that this first texture pattern is duplicated, that is, repeated, and the second image is obtained. So, texture synthesis can be used for image forgery.

(Refer Slide Time: 14:09)

Another research direction is texture transfer. Here I simply transfer a texture to another image: take the texture from one image and paint it onto another object. That is the concept of texture transfer. Here you can see that this texture is transferred onto the face image.

(Refer Slide Time: 14:38)

In this case, if I consider the first texture and the second texture, what texture can I generate from these two? Similarly, if I consider the first image and the second image, I can transfer the texture from one to the other.

(Refer Slide Time: 15:01)

You can see in this example how to do the texture transfer. The first image shows the object and the second image shows the texture, and I transfer the texture onto the first object; that is texture transfer.

(Refer Slide Time: 15:17)

Here also I am transferring textures to other images: this is the texture, and it is transferred onto this object. This is called texture transfer.

(Refer Slide Time: 15:34)

In this case also I am considering texture transfer: a face image is taken together with a texture sample, and that texture sample is transferred onto the face image.

(Refer Slide Time: 15:51)

Similarly, I can give another example. This is my input image and I am considering a
particular texture and that texture, I am transferring to the face image.

(Refer Slide Time: 16:01)

Here also I am showing some texture transfer examples: this texture is transferred onto this object, and similarly in the second case I am doing the texture transfer.

(Refer Slide Time: 16:20)

And this is another example of texture transfer: I am simply transferring that texture onto the face image.

(Refer Slide Time: 16:27)

So, this is about texture transfer. Now, let us discuss how to describe a particular texture. There are mainly three approaches. One is the structural approach: a set of primitive texels in a particular spatial relationship is considered, and a structural description of the texture is a description of the texels together with a specification of their spatial relationship, since a texture is a set of primitive texels in some regular or repeated relationship.

So, a structural description of a texture consists of a description of the texels and a specification of the spatial relationship, that is, how the texels are distributed spatially; that information is used to represent the texture.

Another important representation is the statistical representation, which is very important. Here we extract some statistical parameters, statistical quantities, to represent a particular texture. I will explain which parameters we can extract from a texture pattern; based on these quantities I can recognize a particular texture, classify different types of textures, or even do texture segmentation.

In this case I extract a feature vector for the particular texture, and in the feature vector I put the statistical parameters extracted from the texture. Finally, another approach is the spectral approach, in which I apply the Fourier transform for texture representation. I will discuss the statistical technique and the spectral technique, because they are the important ones. So, first let us discuss the statistical techniques.

(Refer Slide Time: 19:03)

In this figure I have shown three types of textures: a smooth texture corresponding to this portion of the image, a coarse texture, and a regular texture corresponding to that portion of the image. Now, how do we describe a particular texture mathematically?

(Refer Slide Time: 19:33)

The first approach here is the statistical approach. I am not going to discuss the structural approach, because nowadays it is not of much use; mainly the statistical techniques and the frequency domain techniques are used for texture representation.

For the statistical approach I can extract statistical moments, which are determined from the image histogram. Here p(z) is the image histogram, that is, the PDF of z, where z is the intensity, treated as a random variable.

From this we can determine the moments. The first one is the mean of the gray levels. I can also determine the second moment, that is, the variance, which gives a measure of smoothness. The third moment is a measure of skewness, and the fourth moment is a measure of uniformity.

So, I can extract these parameters: the second order moment, the third order moment, and the fourth order moment. The second order moment is a measure of the smoothness of the texture, the third order moment is a measure of skewness, and the fourth order moment is a measure of uniformity. In addition, from the variance I can determine one factor called the roughness factor.

The variance is the second order moment, and from it I determine the roughness factor R. Based on this roughness factor I can distinguish two types of textures: R close to 0 corresponds to a smooth texture, and R approaching 1 corresponds to a coarse texture. That means the roughness factor gives information about the smoothness or coarseness of the texture.

For a constant intensity region the roughness factor is 0, and it approaches 1 for large values of the variance; that is the concept of the roughness factor. The skewness parameter, that is, the third moment, gives information about the asymmetry of the intensity distribution. I can also determine the entropy; entropy measures randomness, and the average entropy can be computed by the given formula.

Now, one problem with this representation: suppose I have two images in which the spatial distributions of the texels are different, but the image histograms are the same. In that case, using these histogram-based parameters, I cannot distinguish the two images. You can see the first image and the second image; if I compute the image histogram, it is the same for both. That is the problem with this technique.

(Refer Slide Time: 23:34)

Based on these parameters, the mean, standard deviation, roughness factor, third moment, uniformity and entropy, I can distinguish different types of textures: the smooth texture, the coarse texture and the regular texture. But, as already explained, two different images may have the same histogram, in which case these parameters cannot distinguish them.

(Refer Slide Time: 24:08)

For this there is another technique: I can compute a matrix called the gray level co-occurrence matrix. The statistical measures described so far are easy to calculate, but they do not provide any information about the repeating nature of the texture. That is why I consider this matrix, the gray level co-occurrence matrix, which contains information about the positions of pixels having similar gray level values.

That information, the relative positions of pixels with similar gray levels, is available in the GLCM. Now I will explain how to determine the GLCM, the gray level co-occurrence matrix.

(Refer Slide Time: 25:10)

Let us consider two pixels, one pixel i and another pixel j, separated by dx along the x direction and dy along the y direction. So, first I define the displacement vector d = (dx, dy) between the pixels i and j. After this I define an array Pd(i, j) corresponding to that particular displacement.

Pd(i, j) counts the number of occurrences of the intensity pair (i, j), that is, the number of pixel pairs with gray levels i and j separated by the displacement d. So, I consider pairs of pixels, one with value i and one with value j, at displacement d, and from these counts I build the array Pd(i, j); that is the co-occurrence matrix.

Suppose L is the number of intensity levels present in the image. Then I have to define an L × L matrix. Let us consider one example: a 5 × 5 image with pixel values

2 1 2 0 1
0 2 1 1 2
0 1 2 2 0
1 2 2 0 1
2 0 1 0 1

Now, what will be the size of the co-occurrence matrix?

In this case the gray levels are 0, 1 and 2, that is, three gray values, so the co-occurrence matrix Pd(i, j) is a 3 × 3 matrix, since it is an L × L matrix and here L = 3. Along one side we have i = 0, 1, 2 and along the other side j = 0, 1, 2.

Now look at the first element, i = 0 and j = 0: I have to check whether this pair occurs in the image at the chosen displacement. The (0, 0) pair does not occur, so this entry is 0.

The next element is i = 0 and j = 1, for the displacement already defined: pixel j lies dx along the x direction and dy along the y direction with respect to pixel i. So, at that displacement I check how many times the (0, 1) pair occurs in the image. You can see the first (0, 1) pair here,

and another (0, 1) pair here, only two pairs, so this entry is 2. The next element is i = 0 and j = 2; at that displacement the (0, 2) pair occurs twice in the image, so this entry is also 2.

Proceeding like this for all values of i and j, the remaining entries are 2, 1, 2, 2, 3, 2, and the array Pd(i, j) is complete. After this I have to do a normalization. Let n be the total number of point pairs in the image that satisfy the displacement d; in this example n = 16. Then divide each element of Pd by n to get the matrix N(i, j); that is the final matrix.

So, here I divide each element of Pd(i, j) by 16, that is, by n; all the elements are divided by 16. That is the normalization: divide each element of the array Pd(i, j) by n to get the matrix N(i, j). This N(i, j) is nothing but an estimate of the joint probability distribution. Let me repeat the procedure.

Corresponding to a particular displacement, I define an array Pd(i, j): the number of occurrences of the intensity pair (i, j) for two pixels separated by the displacement vector d. The matrix has size L × L, where L is the number of intensity levels, and from the counts I build the matrix Pd(i, j), the co-occurrence matrix.

So, I count the number of occurrences of each intensity pair for pixels separated by d, and then I normalize: n is the total number of point pairs in the image that satisfy the displacement d, and dividing each element of Pd by n gives the matrix N(i, j), which is an estimate of the joint probability distribution.

So, this is the procedure on how to get the GLCM, gray level co-occurrence matrix. And
from the gray level co-occurrence matrix I can determine some parameters to describe a
particular texture. So, I will explain what are the parameters we can extract from the GLCM.
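Before moving to those parameters, a small sketch of the construction itself is given below. With the displacement d = (1, 1) interpreted as one pixel to the right and one pixel down, it reproduces the matrix worked out above for the 5 × 5 example, namely Pd = [[0, 2, 2], [2, 1, 2], [2, 3, 2]] with n = 16; note that a different sign convention for the displacement would give a different (transposed or shifted) result.

```python
import numpy as np

def glcm(image, dx, dy, levels):
    """Gray level co-occurrence matrix P_d(i, j) and its normalized version N(i, j).

    The pair is (image[r, c], image[r + dy, c + dx]): dx to the right, dy down.
    Pixel values are assumed to be already quantized to 0 .. levels-1.
    """
    image = np.asarray(image)
    h, w = image.shape
    P = np.zeros((levels, levels), dtype=np.int32)
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < h and 0 <= c2 < w:
                P[image[r, c], image[r2, c2]] += 1
    n = P.sum()                      # total number of valid point pairs
    return P, P / n                  # P_d and the normalized GLCM N

img = np.array([[2, 1, 2, 0, 1],
                [0, 2, 1, 1, 2],
                [0, 1, 2, 2, 0],
                [1, 2, 2, 0, 1],
                [2, 0, 1, 0, 1]])
P, N = glcm(img, dx=1, dy=1, levels=3)
print(P)         # [[0 2 2] [2 1 2] [2 3 2]], and P.sum() == 16
```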

(Refer Slide Time: 36:57)

So, a co-occurrence matrix is a two-dimensional array P in which both the rows and the columns represent the set of possible image values, as I have already explained. For the GLCM I first define a displacement vector d, with dx along the x direction and dy along the y direction, and then count all pairs of pixels separated by d having gray levels i and j. I can determine different GLCMs corresponding to different displacement vectors.

In my previous example I considered a displacement vector where pixel 2 lies at a distance dx in the x direction and dy in the y direction from pixel 1; that is the displacement vector.

(Refer Slide Time: 38:07)

So, the GLCM is defined by Pd(i, j), where I count the number of occurrences of the pixel value pair (i, j) lying at displacement d in the image. The co-occurrence matrix Pd has dimension n × n; in my example I wrote L × L, but here it is shown as n × n, where n is the number of gray levels in the image.

(Refer Slide Time: 38:42)

Here you can see the same example: corresponding to this particular displacement I obtain the array Pd(i, j), and this is the co-occurrence matrix.

(Refer Slide Time: 38:57)

The algorithm is: count all pairs of pixels in which the first pixel has the value i and its matching pair, displaced from the first pixel by d, has the value j. This count is entered in the ith row and jth column of the matrix Pd(i, j), the co-occurrence matrix. One important point is that Pd(i, j) is not a symmetric matrix: the number of pixel pairs having gray levels (i, j) is not necessarily equal to the number of pairs having gray levels (j, i), and that is why the array Pd(i, j) is not symmetric.

(Refer Slide Time: 39:45)

After this, as I mentioned earlier, we do the normalization. The elements of Pd(i, j) are normalized by dividing each entry by the total number of point pairs n in the image that satisfy the displacement d; dividing each element of Pd(i, j) by n gives the matrix N(i, j), the normalized GLCM. This normalizes the co-occurrence values to lie between 0 and 1.

N(i, j) is nothing but an estimate of the joint probability distribution. So, that is the procedure to determine the normalized GLCM.

(Refer Slide Time: 40:50)

From the normalized GLCM we can extract some features to represent a particular texture. The features I can compute are the maximum probability, different moments, the contrast, the uniformity and the homogeneity, and one important parameter, the entropy. So, the procedure is first to determine the GLCM, and after determining the normalized GLCM I can compute all these parameters.

(Refer Slide Time: 41:32)

Now, one problem with deriving texture measures from the co-occurrence matrix is how to select the displacement vector d; the choice of the displacement vector is an important parameter in the definition of the GLCM. What is usually done is that the GLCM is computed for several values of d, and the one that maximizes a statistical measure computed from P(i, j) is used.

So, instead of considering only one displacement vector, I can consider a number of displacement vectors. For example, one displacement vector may be (1, 1): pixel 2 is located one pixel to the right and one pixel below pixel 1. For the different displacements I compute the corresponding GLCMs, as sketched below.
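The snippet below illustrates that selection step; it reuses the glcm() function sketched earlier, and both the candidate displacement list and the use of contrast as the selection criterion are my own illustrative choices.

```python
import numpy as np

# candidate displacements (dx, dy); glcm() is the function defined in the earlier sketch
candidates = [(1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]

def contrast(N):
    """Contrast of a normalized GLCM: sum of (i - j)^2 N(i, j)."""
    i, j = np.indices(N.shape)
    return np.sum((i - j) ** 2 * N)

def best_displacement(image, levels):
    scores = {}
    for dx, dy in candidates:
        _, N = glcm(image, dx, dy, levels)
        scores[(dx, dy)] = contrast(N)
    return max(scores, key=scores.get), scores   # displacement maximizing the measure
```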

(Refer Slide Time: 42:46)

The first parameter I can determine is the maximum probability: from the normalized GLCM N(i, j), I take the maximum over i and j. This is simply the largest entry in the matrix and corresponds to the strongest response of Pd(i, j); it could be the maximum in any one of the matrices or the maximum overall.

(Refer Slide Time: 43:38)

I can also determine moments: the element difference moment of order k, which is the sum of (i − j)^k N(i, j) over all i and j, where N(i, j) is the normalized GLCM. This descriptor has a small value when the largest elements of N lie along the principal diagonal, which you can verify. The opposite effect can be achieved using the inverse element difference moment, which can also be computed from the corresponding expression.

(Refer Slide Time: 44:33)

Next, I can determine the contrast. Contrast is a measure of the local variations present in an image, and it can be computed from the GLCM using the given expression; it measures the intensity contrast between a pixel and its neighbour over the entire image. If there is a large amount of variation in the image, the mass of N(i, j) is concentrated away from the main diagonal, and the contrast is high.

You can verify this: take a simple textured image and determine the contrast from its GLCM. For an image with a large amount of variation, N(i, j) is concentrated away from the main diagonal and the contrast is very high.

(Refer Slide Time: 45:38)

Another parameter is homogeneity. A homogeneous image results in a co-occurrence matrix with a combination of high and low values of N(i, j), and the homogeneity is determined by the given expression. When the range of gray levels is small, N(i, j) tends to be clustered around the main diagonal, which you can verify, while a heterogeneous image results in an even spread of the N(i, j) values. So, homogeneous and heterogeneous images can be distinguished by this parameter, the homogeneity.

(Refer Slide Time: 46:32)

Another parameter is the uniformity, a measure in the range 0 to 1. Uniformity is 1 for a constant image, where all the mass of N(i, j) falls in a single entry, and it is lowest when the N(i, j) values are all equal. From the expression it is evident that it measures the uniformity of an image.

(Refer Slide Time: 47:06)

Finally, another important parameter is the entropy. Entropy is a measure of information content, and it measures the randomness of the intensity distribution; it can be determined from the GLCM using the given expression. Entropy is highest when all entries in N(i, j) are of similar magnitude, which corresponds to an image in which there are no preferred gray level pairs for the displacement vector d, and it is small when the entries in N(i, j) are very unequal.

(Refer Slide Time: 48:09)

Another parameter you can determine from the GLCM is the correlation, a measure of image linearity. Using the given expressions you first determine the means and variances, and from these you determine the correlation. The correlation is high if the image contains a considerable amount of linear structure. So, all these parameters can be determined from the GLCM.

This completes the statistical method. In the first case, the image moments are determined from the histogram of the image: the first order moment, the variance, the third order moment, the fourth order moment, and the roughness factor.

But the problem, as I mentioned, is that spatial information is not available in the histogram, and that is why we considered the gray level co-occurrence matrix. We explained how to determine the GLCM for different displacements, and from it we can extract quantities like the maximum probability, the element difference moment of order k, the uniformity and the entropy.
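The function below collects these descriptors in one place, computed from a normalized GLCM N. The formulas are the standard textbook ones (contrast Σ(i−j)²N, homogeneity ΣN/(1+|i−j|), uniformity ΣN², entropy −ΣN log₂N, and a correlation normalized by the row and column standard deviations), so take them as a reasonable reconstruction of the slide expressions rather than a verbatim copy.

```python
import numpy as np

def glcm_features(N, eps=1e-12):
    """Texture descriptors from a normalized GLCM N (entries sum to 1)."""
    i, j = np.indices(N.shape)

    max_probability = N.max()
    contrast = np.sum((i - j) ** 2 * N)                  # local intensity variation
    homogeneity = np.sum(N / (1.0 + np.abs(i - j)))      # mass near the main diagonal
    uniformity = np.sum(N ** 2)                          # 1 for a constant image
    entropy = -np.sum(N * np.log2(N + eps))              # randomness of the pairs

    # correlation: linear dependence between the paired gray levels
    mu_i = np.sum(i * N)
    mu_j = np.sum(j * N)
    sigma_i = np.sqrt(np.sum((i - mu_i) ** 2 * N))
    sigma_j = np.sqrt(np.sum((j - mu_j) ** 2 * N))
    correlation = np.sum((i - mu_i) * (j - mu_j) * N) / (sigma_i * sigma_j + eps)

    return {"max_probability": max_probability, "contrast": contrast,
            "homogeneity": homogeneity, "uniformity": uniformity,
            "entropy": entropy, "correlation": correlation}
```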

(Refer Slide Time: 49:45)

The next technique is the Fourier approach for texture representation. In this block diagram I have shown the Fourier method: starting from the original image, I determine its 2-D Fourier transform, but before applying the 2-D Fourier transform I do a preprocessing step, multiplying the image by (−1)^(x+y) so that the spectrum is centred. After this I compute the 2-D Fourier transform. This is called the spectral approach.

The idea is that when a texel is repeated, the texture shows regularity, which is nothing but periodicity. Therefore the Fourier spectrum of the image should exhibit strong components representing the periodicity of the texture elements.

So, I first determine the Fourier transform of the image. Suppose the spectrum has prominent peaks like this. The prominent peaks in the spectrum give the principal directions of the texture pattern.

Also, the locations of the peaks in the frequency plane give the fundamental spatial period of the patterns, and we can eliminate the periodic components by filtering. What then remains are the non-periodic elements, which can be described by statistical methods.

To summarize: the prominent peaks in the Fourier spectrum give the principal directions of the texture pattern, the locations of the peaks in the frequency plane give the fundamental spatial period of the patterns, and by filtering out the periodic components I am left with the non-periodic image elements, which can be described by statistical methods. Let S(u, v) denote the Fourier spectrum of the image; in polar form I can write it as S(r, θ).

In the polar form I can consider two functions. One is S_θ(r), where θ is fixed and r is the variable; it shows the behaviour of the spectrum along a radial direction from the origin. The other is S_r(θ), where r is fixed and θ is the variable; it shows the behaviour along a circle centred on the origin.

I will show this in the figure. First I take the original image, do the preprocessing, that is, multiply the image by (−1)^(x+y), and then compute the Fourier transform. Then I consider the two cases. In the first case I determine S(r): the functions S_θ(r) are summed over the directions θ, which amounts to summing the spectrum values in each region of the plot. This shows the behaviour of the spectrum along the radial direction from the origin, and it gives the parameter S(r) for r from 1 to the maximum radius R0.

In the second case I determine S(θ): for each direction θ, the functions S_r(θ) are summed over the radii r from 1 to R0. This shows the behaviour along circles centred on the origin, and summing the spectrum values in each such region gives S(θ).

So, I obtain these two parameters, S(r) and S(θ): in one the sum is taken over the directions, and in the other over the radii. The procedure is: first determine the Fourier transform of the image, then convert to polar coordinates (r, θ), and consider the two cases, one with θ fixed and one with r fixed.

In one case I observe the behaviour of the spectrum along the radial direction from the origin, and in the other case the behaviour along circles centred on the origin. That is the objective of the Fourier transform approach for texture representation.
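A compact sketch of these two descriptors is given below; the preprocessing is done implicitly with an FFT shift rather than by multiplying by (−1)^(x+y), and the radial and angular bin counts are illustrative choices.

```python
import numpy as np

def spectral_texture_descriptors(image, n_radii=64, n_angles=180):
    """S(r): spectrum summed over directions; S(theta): spectrum summed over radii."""
    F = np.fft.fftshift(np.fft.fft2(image))             # centred 2-D spectrum
    S = np.abs(F)

    h, w = S.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices(S.shape)
    r = np.hypot(x - cx, y - cy)                         # radius of each frequency sample
    theta = np.mod(np.arctan2(y - cy, x - cx), np.pi)    # direction folded into [0, pi)

    r_bins = np.minimum((r / r.max() * n_radii).astype(int), n_radii - 1)
    t_bins = np.minimum((theta / np.pi * n_angles).astype(int), n_angles - 1)

    S_r = np.bincount(r_bins.ravel(), weights=S.ravel(), minlength=n_radii)
    S_theta = np.bincount(t_bins.ravel(), weights=S.ravel(), minlength=n_angles)
    return S_r, S_theta
```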

(Refer Slide Time: 56:30)

In this example, you can see the original image, its 2-D Fourier transform, and the two parameters S(r) and S(θ). For another image, you can see the corresponding spectrum S(θ). Using S(r) and S(θ) we can describe a particular texture.

Up till now I have discussed the concept of texture, defined what a texture is, and described the 4 research directions: texture classification, texture segmentation, texture synthesis, and shape from texture. After this I discussed the statistical method for texture representation, for which I considered one very good tool, the GLCM, the gray level co-occurrence matrix, from which different parameters can be extracted.

I then discussed the spectral approach, that is, the Fourier transform approach for representing a particular texture. In my next class I will discuss other techniques like the Gabor filter and local binary patterns. Let me stop here today, thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical engineering
Indian Institute of Technology, Guwahati
Lecture-25
Image Texture Analysis - II
Welcome to NPTEL MOOCs course on Computer Vision And Image Processing -
Fundamentals And Applications. In my last class I discussed the concept of image texture analysis. So, what is texture? Texture means the spatial distribution of gray level intensities. I discussed 4 research directions: texture classification, texture synthesis, texture segmentation, and shape from texture. A particular texture can be represented or described by the 3 techniques I highlighted: the structural method, the statistical method, and the spectral method.

In case of the structural method, I have to find the placement rule for the texels, texel means
the basic unit of a texture, that is the primitive. Texel means a group of pixels having
homogeneous property and based on this texel I have to find the placement rule for the
structural representation of a particular texture. The next one is the statistical method. So, I
have to extract some statistical parameters and based on these parameters, I can represent a
particular texture.

And finally, I discussed the technique is the spectral method. In spectral method, I have to
determine the Fourier transform of the image. And based on the Fourier transform of the
image, I can represent a particular texture. Last class I discuss the statistical methods for
texture representation. So, for this I can extract some statistical parameters like mean,
standard deviation, third order moment like this I can extract and based on this I can represent
a particular texture.

After this I discussed the concept of GLCM or gray level co-occurrence matrix. So, today I
am going to discuss some other texture descriptors. So, one is the autocorrelation function
that I am going to discuss today. Another one is very important, that is the Gabor filter based
texture representation. So, I can extract the texture features by Gabor filter and finally, I will
discuss the concept of local binary pattern, LBP. So, how to represent a particular texture by
using LBP. So, let us see the concept of the GLCM that I discussed in my last class.

(Refer Slide Time: 3:08)

So, you can see, in my last class I discussed the concept of the GLCM. From the input image I can determine the GLCM, that is, the gray level co-occurrence matrix. In that case I am considering 3 gray levels, so P(i, j) will be a 3 by 3 matrix, and for the GLCM I have to define the displacement vector; there the displacement is (1, 1). Suppose I consider another example of a GLCM, with the input image

1 1 0 0
1 1 0 0
0 0 2 2
0 0 2 2

This is my input image, and in this case I will again get a 3 by 3 gray level co-occurrence matrix. Suppose my displacement is d(1, 0), that is, 1 in the x direction and 0 in the y direction. Corresponding to this d(1, 0), the array P(i, j) will be (you can verify this)

4 0 2
2 2 0
0 0 2

Also, if I consider the displacement d(0, 1), the corresponding gray level co-occurrence matrix P(i, j) will be

4 2 0
0 2 0
2 0 2

And suppose I consider the displacement d(1, 1), that is, 1 in the x direction and 1 in the y direction. Corresponding to this displacement, P(i, j) will be

3 1 1
1 1 0
1 0 1

You can verify this. So, what is the method? For all the different displacements I have to find the GLCM; that is, the GLCM is computed for several values of d, and the one which maximizes a statistical measure computed from P(i, j) is finally used. That is the concept of the GLCM, and in my last class I also discussed the normalization procedure.

So, we have to normalize the gray level co-occurrence matrix. In the example on the slide, each element is divided by 16, because 16 pairs of pixels in the image satisfy the spatial separation condition; so 0 is divided by 16, 2 is divided by 16, and so on, for that co-occurrence matrix. So, I have to do the normalization.
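
As a small illustration, one possible way to compute such a co-occurrence matrix in Python is sketched below. Displacement conventions differ between references; with a (row, column) offset of (0, 1), i.e. one pixel to the right, this reproduces the matrix quoted above for d(1, 0), and the normalization simply divides by the total number of contributing pairs.

```python
import numpy as np

def glcm(img, dr, dc, levels):
    """Grey-level co-occurrence matrix for a (row, column) displacement (dr, dc)."""
    P = np.zeros((levels, levels), dtype=int)
    h, w = img.shape
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < h and 0 <= c2 < w:
                P[img[r, c], img[r2, c2]] += 1     # pair (reference pixel, neighbour)
    return P

img = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 2, 2],
                [0, 0, 2, 2]])
P = glcm(img, dr=0, dc=1, levels=3)   # one pixel to the right
print(P)                              # [[4 0 2] [2 2 0] [0 0 2]]
print(P / P.sum())                    # normalised GLCM
```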

(Refer Slide Time: 6:24)

And from the GLCM I can extract some statistical features. The parameters are: the maximum probability, the moments, the contrast, the uniformity, the homogeneity, and finally one important parameter, the entropy.
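
A brief sketch of how these descriptors can be computed from a normalised GLCM; the formulas follow the usual textbook definitions and are given here only for illustration.

```python
import numpy as np

def glcm_features(p):
    """A few statistical descriptors from a normalised GLCM p (entries sum to 1)."""
    i, j = np.indices(p.shape)
    return {
        "max_probability": p.max(),
        "contrast":        np.sum((i - j) ** 2 * p),
        "uniformity":      np.sum(p ** 2),                       # also called energy
        "homogeneity":     np.sum(p / (1.0 + np.abs(i - j))),
        "entropy":         -np.sum(p[p > 0] * np.log2(p[p > 0])),
    }
```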

(Refer Slide Time: 6:50)

After this, in my last class I discussed the concept of the spectral representation of a texture. A texture is nothing but repetitions of texels, so there is regularity in the texture. For this I can determine the Fourier transform of the image; suppose S(u, v) is the Fourier spectrum of the image. The prominent peaks in the Fourier spectrum give the principal directions of the texture pattern.

That is the first point: the prominent peaks in the Fourier spectrum give the principal directions of the texture pattern, and the locations of the peaks in the frequency plane give the fundamental spatial period of the patterns. This spectrum S(u, v), the 2D Fourier spectrum of the image, I can represent in the polar form S(r, theta) by converting to polar coordinates.

Last class I discussed this: from S(r, theta) I can consider two functions, S_theta(r) and S_r(theta). In the first case theta is fixed, and I want to find the behavior of the spectrum along a radial direction from the origin. In the second case r is fixed, and I can see the behavior along a circle centered on the origin, because there theta is the variable while r is fixed.

Based on this I can compute the global descriptors: summing all pixels in each area, that is, summing S_theta(r) over the angles, I determine S(r), and summing S_r(theta) over the radii I determine S(theta).

(Refer Slide Time: 9:15)

And you can see this example: corresponding to the original image I determine the 2D Fourier transform, and you can see the two spectra, S(r) and S(theta), for this image. If I consider another textured image, you can see its spectrum S(theta) is different; the two spectra are quite different. So, based on S(r) and S(theta) I can represent different types of textures. This is about the Fourier approach for texture representation.

(Refer Slide Time: 10:02)

Now, the next one is the autocorrelation function. One important property of textures is their repetitive nature; the repetition is due to the placement of the texture elements, the texels, in the image. The autocorrelation function is used to identify the amount of regularity of the pixels of the image, and it can be used to estimate the fineness and the coarseness of the texture; that is one important point.

This autocorrelation function can be used to estimate the fineness and the coarseness of the texture, it can be used as a measure of the periodicity of the texture elements, and it is directly related to the power spectrum of the Fourier transform. For an image f, the autocorrelation function is defined as

rho(x, y) = [ sum over (k, l) of f(k, l) f(k + x, l + y) ] / [ sum over (k, l) of f^2(k, l) ],

that is, the shifted product normalized by the total energy, the sum of f^2(k, l).

The autocorrelation function drops off rapidly for fine textures, while for regular textures it has peaks and valleys; that is why it can serve as a measure of the periodicity of the texture elements. The two cases are shown here: for a fine texture it drops off rapidly, and for a regular texture it has peaks and valleys.
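
A compact sketch of this definition, computed here with circular shifts via the FFT, which matches the normalization by the sum of f^2; the boundary handling is an assumption of this illustration, not something fixed by the lecture.

```python
import numpy as np

def autocorrelation(f):
    """Normalised autocorrelation rho(x, y) of an image f (circular, via the FFT)."""
    F = np.fft.fft2(f.astype(float))
    acf = np.real(np.fft.ifft2(np.abs(F) ** 2))   # inverse transform of the power spectrum
    return acf / acf[0, 0]                        # rho(0, 0) = 1 after normalisation
```

For a fine texture the values fall off quickly away from (0, 0); for a regular texture the result shows periodic peaks and valleys.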

(Refer Slide Time: 13:04)

Now, suppose F(u, v) is the 2D Fourier transform of the image; |F(u, v)|^2 is called the power spectrum. The spectral features can be extracted by dividing the frequency domain into rings and wedges. In the picture you can see the rings, and I can also consider a wedge.

The rings are used to extract frequency information, while the wedges are used to extract orientation information of the texture pattern. After this, energy features can be computed in the annular regions, and they indicate the coarseness or fineness of a texture pattern. Similarly, I can determine the energy features in each wedge, and these features indicate the directionality of a texture pattern. So, I can determine energy features both in the annular regions and in the wedges.

(Refer Slide Time: 15:34)

In this slide you can see I am considering the power spectrum. In the first case I am calculating f_{r1,r2}: I consider all the angles, from 0 to 2 pi, over an annular ring whose radius runs from r1 to r2. So, for the annular ring I determine the feature f_{r1,r2}.

In the second case I am considering f_{theta1,theta2}: here the radial distance runs from 0 to infinity, but the angle is restricted between theta1 and theta2, that means I am considering the wedge between these two directions. Corresponding to this wedge I determine the feature f_{theta1,theta2} from the power spectrum |F(u, v)|^2.

So, from the power spectrum I determine these two features, one is f_{r1,r2} and the other is f_{theta1,theta2}. That means we can see the fineness or coarseness of the texture and also the directionality of the texture pattern from these two features. This is about the autocorrelation function and the closely related power-spectrum features for representing a particular texture.
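
A small sketch of these two energy features; the exact binning and angle conventions are my own illustration choices.

```python
import numpy as np

def ring_wedge_energy(img, r1, r2, theta1, theta2):
    """Energy of the power spectrum inside an annular ring [r1, r2] (all angles)
    and inside a wedge [theta1, theta2] in radians (all radii)."""
    P = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    v, u = np.mgrid[0:h, 0:w]
    u, v = u - w // 2, v - h // 2
    r = np.hypot(u, v)
    theta = np.arctan2(v, u) % (2 * np.pi)
    f_ring  = P[(r >= r1) & (r < r2)].sum()                  # f_{r1,r2}
    f_wedge = P[(theta >= theta1) & (theta < theta2)].sum()  # f_{theta1,theta2}
    return f_ring, f_wedge
```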

(Refer Slide Time: 17:31)

The next one is very important, and that is the Gabor filter-based texture representation; that is, I can extract texture features by using a Gabor filter. So, what is a Gabor filter? A 2D Gabor function consists of a sinusoidal plane wave of a certain frequency and orientation modulated by a Gaussian envelope. That means I am considering a 2D sinusoid: S_{omega,theta}(x, y) is the sinusoidal plane wave, with frequency omega and orientation theta, and it is modulated by a Gaussian function G_theta(x, y).

If I combine these two functions, the sinusoidal function and the Gaussian function, then I get the 2D Gabor function. In the 2D Gaussian function G_theta(x, y), sigma_x^2 is the variance along the x direction and sigma_y^2 is the variance along the y direction; these represent the spread of the Gaussian function.

The sinusoidal plane wave S_{omega,theta}(x, y) is a complex function, so it has two components, a real part and an imaginary part, and it is a plane wave of a certain frequency omega and orientation theta. This sinusoidal function is modulated by the Gaussian.

The 2D Gaussian function is used to control the spatial spread of the Gabor filter: the variance parameters of the Gaussian determine the spread along the x and y directions, and we also have the orientation parameter. In general, for the Gaussian we take sigma_x = sigma_y = sigma, that is, the variance along the x direction equals the variance along the y direction, a constant.

In this case the rotation parameter theta does not control the spread, as the spread will be circular when the variances are equal. So, this is the definition of a Gabor filter: a sinusoidal plane wave of a certain frequency and orientation modulated by a Gaussian envelope.

(Refer Slide Time: 20:57)

The same thing I am showing on this slide: I define the 2D Gabor function in terms of the sinusoidal function and the Gaussian function. Here G_theta(x, y) is the 2D Gaussian function and S_{omega,theta}(x, y) is the 2D sinusoidal function, with frequency parameter omega and orientation parameter theta. (x, y) corresponds to a spatial location of the image, and the Gabor filter is centered at this location.

Omega corresponds to the frequency of the 2D sinusoid, and theta, as already mentioned, is the orientation parameter. We also have the variance along the x direction and the variance along the y direction; if sigma_x and sigma_y are equal, then the rotation parameter does not control the spread, as the spread will be circular when the variances are equal.

(Refer Slide Time: 22:16)

The two sinusoidal components of the Gabor filter are generated by a 2D complex sinusoid. The local spatial frequency content of the image intensity variation, that is, the texture in an image, can be extracted by applying these two sinusoids. There are real and imaginary components of the complex sinusoid, and the two components are phase shifted by pi/2: as on the previous slide, the sinusoid is a complex number with a cosine component and a sine component.

The phase difference between these is pi/2, and the Gaussian and the sinusoidal function are multiplied to get the complex Gabor filter. Here I consider the real part of the sinusoidal function: it is modulated by the 2D Gaussian function G_theta(x, y), and that gives the real part of the Gabor filter.

In the second case I consider the imaginary part of the sinusoidal function; it is modulated by the Gaussian, and that gives the imaginary part of the Gabor function. The real and the imaginary parts of the Gabor filter are applied separately at a particular location (x, y) of an image f(x, y) to extract Gabor features. So here you see what I am doing: the image f(x, y) is convolved with the real part of the Gabor filter.

In the second part, the image f(x, y) is convolved with the imaginary part of the Gabor filter, and from the real part and the imaginary part I can determine the Gabor coefficients, the Gabor features; C_psi(x, y) is nothing but the Gabor feature. So, corresponding to a particular texture, I can determine the Gabor features.

The parameters are omega, the frequency, and theta, the orientation; other parameters are sigma_x and sigma_y, so I will be getting Gabor features for different orientations. You can see how we extract Gabor features with this Gabor filter: I take the real part of the sinusoidal function to get the real part of the Gabor function, and the imaginary part of the sinusoidal function to get the imaginary part of the Gabor function. And, as already mentioned, the parameters are the variance along the x direction, the variance along the y direction, the orientation parameter theta and the frequency parameter omega.
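
A hedged sketch of this construction is given below, assuming the common isotropic case sigma_x = sigma_y = sigma mentioned above; the exact normalization on the slide may differ, and taking the magnitude of the complex response is just one common way to combine the real and imaginary outputs into a feature map.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size, sigma, omega, theta):
    """Complex 2D Gabor kernel: an isotropic Gaussian times a complex sinusoid
    of frequency omega oriented at angle theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    gauss = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    phase = omega * (x * np.cos(theta) + y * np.sin(theta))
    return gauss * np.exp(1j * phase)          # real part: cosine, imaginary part: sine

def gabor_feature_map(img, kernel):
    """Convolve with the real and imaginary parts separately and combine them
    into a magnitude response (one possible Gabor feature map)."""
    real = convolve2d(img, kernel.real, mode="same", boundary="symm")
    imag = convolve2d(img, kernel.imag, mode="same", boundary="symm")
    return np.hypot(real, imag)
```

A filter bank is obtained by repeating this for several (sigma, omega, theta) combinations, as described on the next slides.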

(Refer Slide Time: 25:24)

So, in this case I have shown a typical Gaussian filter with sigma equal to 30, where sigma_x = sigma_y = sigma. In the other case you can see a typical Gabor filter with sigma equal to 30, omega equal to 3.14 and an orientation of 45 degrees.

(Refer Slide Time: 25:47)

And here I have shown the 2D Gabor filter and the expression for it; sigma is the spatial spread, omega is the frequency and theta is the orientation parameter. In this case we use a bank of Gabor filters at multiple scales and orientations to obtain the filtered images, that is, Gabor filters with different combinations of spatial width, frequency omega and orientation theta.

You can see these examples, 1, 2, 3, 4: Gabor filters with different combinations of spatial width, frequency omega and orientation theta. So, this is about the Gabor filter; by using the Gabor filter, we can extract texture features.

(Refer Slide Time: 26:38)

Another important technique is the MPEG-7 homogeneous texture descriptor, called HTD. The HTD describes the directionality, the coarseness and the regularity of a texture pattern; it characterizes a particular texture. For this, the mean energy and the energy deviation from a set of frequency channels are considered.

I am repeating this: for the HTD, the homogeneous texture descriptor, the mean energy and the energy deviation from a set of frequency channels are considered. The 2D frequency plane is partitioned into 30 frequency channels, and the mean energy and its deviation are computed in each of these 30 frequency channels in the frequency domain.

(Refer Slide Time: 27:48)

Here I have shown the pictorial representation of the HTD, the homogeneous texture descriptor, and you can see we have considered 30 frequency channels. So, what is the HTD? Its first components are f_DC and f_SD, which are the mean and the standard deviation of the image, respectively.

Then I am considering e_1, e_2, ..., e_30, because we have 30 frequency channels, and also the deviations d_1, d_2, ..., d_30 for the 30 channels. Here e_i and d_i are the non-linearly scaled and quantized mean energy and energy deviation of the corresponding i-th channel; that means e_i corresponds to the mean energy and d_i corresponds to the energy deviation of a particular channel.

(Refer Slide Time: 29:09)

The individual channels are modeled by using Gabor functions. Suppose the channels are indexed by (s, r), where s is the radial index and r is the angular index; I have 30 channels, and each (s, r) channel is modeled in the frequency domain as shown on the slide.

So, G_{s,r}(omega, theta) is the product of two exponential (Gaussian-shaped) factors applied to P(omega, theta). On the basis of this frequency layout and the Gabor functions, the energy e_i of the i-th feature channel is defined as the log-scale sum of the squares of the Gabor-filtered Fourier transform coefficients of the image; that is, e_i is computed from the quantity p_i, and p_i (not the constant pi) is determined from the expression on the slide.

(Refer Slide Time: 30:46)

So, this is p_i. What is P(omega, theta) in this expression? P(omega, theta) is the Fourier transform of the image represented in the polar frequency domain, that is, P(omega, theta) = F(omega cos theta, omega sin theta), where F(u, v) is the 2D Fourier transform of the image in the Cartesian coordinate system. I also need to determine the energy deviation.

On the previous slide I determined the mean energy; now I have to determine the energy deviation d_i for a particular channel. It is defined as the log-scale standard deviation of the square of the Gabor-filtered Fourier transform coefficients of the image. I can compute d_i by the expression on the slide, where q_i is obtained from the corresponding expression. You can read the details in the papers on the MPEG-7 homogeneous texture descriptor. So, with the HTD I can represent a particular texture; it is nothing but a texture descriptor.

(Refer Slide Time: 32:05)

And finally, I want to discuss another descriptor for texture, the LBP, the local binary pattern. LBP is a well-known texture descriptor that is widely used in many computer vision applications, for example face recognition and facial expression recognition. LBP is computationally simple and easy to apply, and it is a non-parametric descriptor.

One important point is that it can handle monotonic illumination variations. So, how do we compute the LBP? In this image I am considering the basic local binary pattern; LBP has many variants, but here I consider only the basic LBP. In the basic LBP, each pixel of an image is compared with its 8 neighborhood pixels in a 3 by 3 block.

Here in this image I am considering a 3 by 3 block. The center pixel is 42, and the neighborhood pixels, read clockwise from the top-left corner, are 202, 12, 24, 0, 42, 64, 33 and 81; that means we have 8 neighbors, which I can write as i_n with n = 1, 2, ..., 8, and the center pixel is i_c. After this I have to determine the LBP.

How do we determine the LBP? If the neighborhood pixel i_n is greater than or equal to the center pixel i_c, the corresponding bit is set to 1, otherwise it is set to 0. Based on this, 42 is first compared with 202; since 202 is greater than 42, I get a 1 here.

Next, 12 is compared with 42; since 12 is less than 42, the bit is 0. Then 24 is compared with 42; since 24 is less than 42, again the bit is 0. Continuing like this I get all the bits, and the binary number is 10001101. That is the binarization: it is done by concatenating the bits obtained from the i_n, starting at the top-left corner and moving in the clockwise direction.

So, I am repeating this: the binarization is done by concatenating the bits from the top-left corner in the clockwise direction, and in this example the binary number is 10001101. Subsequently, the corresponding decimal equivalent has to be determined; here the decimal equivalent is 141, and this 141 is assigned to the center pixel.

So, this 141 assigned to the central pixel i_c is known as the LBP code, and this step is repeated for all the pixels of the image. Finally, the histogram of the LBP codes is considered as the local texture feature of the original image.

Here, corresponding to the center pixel I get the LBP code 141. This is done for all the pixels of the image, and finally the histogram of the LBP codes is taken as the local texture feature of the original image. So, I have to determine the histogram of the LBP codes.

Formally, the LBP code of the central pixel, for P neighbors on a circle of radius R, can be written as

LBP_{P,R} = sum over n from 0 to P-1 of F(i_n - i_c) * 2^n,

where F(x) = 1 if x >= 0 and F(x) = 0 if x < 0. That means I compare the center pixel with each neighborhood pixel to get the value of F(x), the resulting bits form a binary number, and its decimal equivalent is nothing but the LBP code.

There are several variants of the LBP; this is the basic LBP. One example is the extended LBP, and another very popular one is the uniform LBP. You can see these concepts in the research papers, but the basic concept of the local binary pattern is this.
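
A small sketch of the basic LBP described above. The 3 by 3 layout of the example block is my reconstruction from the clockwise listing of the neighbors (202, 12, 24, 0, 42, 64, 33, 81); the computed code 141 matches the worked example.

```python
import numpy as np

def lbp_code(block):
    """Basic LBP code of the centre of a 3x3 block, reading the neighbours
    clockwise from the top-left corner."""
    c = block[1, 1]
    clockwise = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for r, cc in clockwise:
        code = (code << 1) | int(block[r, cc] >= c)   # first neighbour becomes the MSB
    return code

block = np.array([[202, 12, 24],
                  [ 81, 42,  0],
                  [ 33, 64, 42]])
print(lbp_code(block))                                # 141

def lbp_histogram(img):
    """Normalised histogram of LBP codes over the image (the texture feature)."""
    h, w = img.shape
    codes = [lbp_code(img[r - 1:r + 2, c - 1:c + 2])
             for r in range(1, h - 1) for c in range(1, w - 1)]
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)
```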

(Refer Slide Time: 38:58)

So, here also I have shown one example of the local binary pattern. Corresponding to this
image you can see I am following the same procedure, just comparing the central pixel with
the neighborhood pixels, just I am doing the comparisons like this. And based on this I am
getting the binary code, the binary code will be 11111100 and the decimal equivalent will be
252, that is the LBP code.

(Refer Slide Time: 39:32)

And finally, there is another technique, which I am not going to discuss in detail: the local derivative pattern, called LDP. This is also very popular; if you want to see the concept, you can read the research paper on the local derivative pattern versus the local binary pattern. The LBP encodes the binary result of the first-order derivative among local neighbors by using a simple threshold function, so LBP is incapable of describing more detailed information.

That is why we consider the LDP: the LDP considers (n-1)-th order derivative direction variations based on binary coding functions. The LDP encodes higher-order derivative information, which contains more detailed discriminative features that the first-order pattern (LBP) cannot obtain from an image, because in the case of LBP we consider only the first-order derivative.

In LBP we also consider a simple threshold function, since the center pixel is compared with the neighborhood pixels. In the LDP, by contrast, we consider higher-order derivatives along several directions, such as 0 degrees, 45 degrees and so on. Because LDP encodes higher-order derivative information, it contains more detailed discriminative features compared to LBP.

For more details, you can see the paper on the LDP, the local derivative pattern. So, in this class I discussed some other texture features: the autocorrelation function, then one important representation, the Gabor filter, and how to extract texture features with it, which is very important, and finally the concept of the LBP, the local binary pattern. So, this is about texture analysis. Let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture-26
Object Boundary and Shape Representations - I
Welcome to NPTEL MOOCs course on Computer Vision And Image Processing -
Fundamentals And Applications. I have been discussing about image features, I have already
discussed about edges, textures and color, which are image features. Today I am going to
discuss about object boundary and object shape information that I can consider as image
features. So, what is object boundary and how to represent object boundary I will discuss
now.

(Refer Slide Time: 1:07)

So, in this case you can see the image representation and the descriptions. So, how to
represent a particular image, that is to represent and describe information embedded in an
image in other forms that are more suitable than the image itself. So, that is the main concept.
So, how to describe and how to represent information embedded in an image. So, instead of
storing the entire image in the memory, I can store only the descriptors corresponding to that
particular image.

So, what are the advantages? The descriptors are easier to understand, require less memory, and are faster to process. That is, instead of storing or processing the entire image, I can consider only the descriptors corresponding to a particular image: a descriptor for the texture, for the color, for the boundary, for the shape. These I can store in the memory and process.

And what other information present in an image. So, other information maybe the boundary,
the shape of the object I can consider, region information, how many regions are present in an
image that I can consider, texture information I can consider, color information I can
consider, the relationship between the regions present in an image. So, these are the
information present in an image. So, already I have discussed about the texture and the color
and also I have discussed the concept of the edge detection. So, now, I will discuss about how
to describe boundary and the shape.

(Refer Slide Time: 2:57)

So, first one is the 2D and the 3D structure representation. So, what is the 3D structure? 3D is
a representation of a set of the 2D representation from various angles. So, if I consider the 2D
representation from various angles, then in this case I can get the 3D representations. This 2D
structure can be represented by considering the boundary and also by considering a region.
So, that means, I am considering 2 image features, one is the boundary another one is shape.

(Refer Slide Time: 3:32)

For boundary descriptors, I can consider information like the length of the boundary, or the size of the smallest circle or box that totally encloses the object. I can also consider descriptors like the shape number; the Fourier descriptor, which is a very good descriptor for representing a particular boundary; statistical moments; and the B-spline representation of a boundary, which is also very important and which I am going to discuss.

Another one is the MPEG-7 contour shape descriptor. So, by using these descriptors, the shape number, the Fourier descriptor, the B-spline and the MPEG-7 contour shape descriptor, I can represent a particular boundary present in an image.

(Refer Slide Time: 4:35)

The first one is shape representation by using chain codes. So, what is a chain code? The chain code represents an object boundary by a connected sequence of straight-line segments of specified length and direction. In the first figure I am considering 4 directions, 0, 1, 2, 3, and from this I can determine the 4-directional chain code.

In the second figure I am considering the 8-directional chain code, with 8 directions 0, 1, 2, 3, 4, 5, 6, 7. Now, suppose I have a motion trajectory; this trajectory I can represent by using the chain code, because for each segment of the trajectory I can read off a direction from the chain code, using the 8-directional code.

For example, this direction I can consider as 1, and from this segment to this segment the direction is 7 in the 8-directional chain code. Similarly, for the next segment I can find a particular direction from the 8-directional chain code, or I can use the 4-directional chain code. So, this is one application of the chain code: representation of a motion trajectory.

So, for example, suppose, a gesture I am doing, gesture recognition and hand is moving in the
space and in this case, I am getting the trajectory of the hand movements. So, this trajectory I
can represent by using the chain code.

(Refer Slide Time: 6:45)

In this case I have shown how to determine the chain code. Here I am considering the 8-directional chain code, with 8 directions 0, 1, 2, 3, 4, 5, 6, 7, and one arbitrary boundary. What I have to do is start at any boundary pixel, suppose the pixel A, and consider the first segment.

Corresponding to the first segment the direction is 7; for the next segment, if I look up its direction in the left figure, the code is 6; for the next segment the direction is 0, then 1, and like this I consider all the segments. So, for this particular boundary the directions are 7, 6, 0, 1, 0, 6, 5, 5, 4, 3, 2, 4, 2, 1.

After this, each number can be represented in binary form: starting from the point A, 7 becomes 111, 6 becomes 110, and so on; that binary sequence is the chain code. So, you can understand the concept of the chain code: I can use the 4-directional chain code, but if I want a more accurate boundary representation I have to consider the 8-directional chain code, or a chain code with even more directions.

(Refer Slide Time: 8:41)

Here I am considering a generalized chain coding. In this case I consider the angle theta and the arc length t along the boundary perimeter, and I encode the pair (theta, t). For different angles I can determine (theta, t), and the curve shows theta versus t, where t is the distance along the perimeter and theta is the angle. So, this is the generalized chain coding: I am plotting theta versus t.

(Refer Slide Time: 9:16)

Here I give one example of how to determine the chain code for a boundary. I am considering one object boundary, and I am resampling it; after resampling I get the boundary vertices shown in the figure. After this, going segment by segment, I can read off the directions for a 4-directional chain code.

The directions are 0, 0, 3, 3, 3 and so on, then 2, 3, 2, 2, 1 and so on; like this I get all the boundary directions. If I consider the 8-directional chain code, the directions are 0, 7, 6, 6, 6, 6 and 5, 5, 3, 3 and so on. Based on this I get the chain code, and the chain code can be converted into binary numbers.
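
A minimal sketch of computing an 8-directional chain code from such resampled vertices; the axis convention (x to the right, y upwards) and the assumption that consecutive vertices are exactly one grid step apart are mine, matching the usual direction diagram rather than any specific code from the lecture.

```python
import numpy as np

# Code k corresponds to a step at angle k * 45 degrees (x right, y up).
DIR8 = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
        (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code_8(vertices):
    """8-directional chain code of a closed boundary given as an ordered list of
    resampled grid vertices, each one grid step from the previous one."""
    closed = vertices + vertices[:1]
    code = []
    for (x0, y0), (x1, y1) in zip(closed, closed[1:]):
        step = (int(np.sign(x1 - x0)), int(np.sign(y1 - y0)))
        code.append(DIR8[step])
    return code
```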

(Refer Slide Time: 10:35)

Now, the problem of a chain code: the chain code sequence depends on the starting point. In my last example I considered the starting point A, and a different starting point would give a different sequence. That is why I can treat the chain code as a circular sequence and redefine the starting point so that the resulting sequence of numbers forms an integer of minimum magnitude.

For this I have to consider a quantity, the first difference of a chain code. Here I am considering the 4-directional chain code, and the first difference is determined by counting the number of direction changes, in the counterclockwise direction, between two adjacent elements of the code.

Suppose in this example I consider the transition from 0 to 1. Looking in the counterclockwise direction, from 0 to 1 the first difference is 1, because there is only one direction change. From 0 to 2 there are two direction changes, first 0 to 1 and then 1 to 2, so the first difference is 2. From 0 to 3, counting counterclockwise through 0 to 1, 1 to 2 and 2 to 3, the first difference is 3. Similarly, from 2 to 3 the first difference is 1, from 2 to 0 it is 2, and from 2 to 1 it is 3.

Now consider the chain code 10103322. From this I determine the first difference like this: from 1 to 0, the number of counterclockwise direction changes is 3.

Next, from 0 to 1 the change is 1; then from 1 to 0 it is again 3; from 0 to 3 the first difference is 3; from 3 to 3 the change is 0; from 3 to 2 the change is 3; and from 2 to 2 the first difference is 0. Like this I calculate the first difference.

After this, the chain code is treated as a circular sequence: from the last element, 2, back to the first element, 1, I again count the number of counterclockwise direction changes, which is 3, and that value is included as well. So, for the chain code 10103322 the first difference is 3 1 3 3 0 3 0 3. That is the first difference of the chain code, and the first difference is rotation invariant.

(Refer Slide Time: 14:53)

After this there is another definition, the shape number. How do we determine the shape number of a boundary? It is the first difference of smallest magnitude. So, first I have to determine the first difference, and then I have to take the circular rotation of it with the smallest magnitude. In this case I am considering the chain code 0321, corresponding to a boundary of order 4.

Corresponding to this I can determine the first difference, which is 3333, and the smallest magnitude is the same, 3333. In the second case, for order 6, the code is 003221, corresponding to that boundary. From it I determine the first difference, and the rotation of smallest magnitude is 033033; that is the shape number. So, in this case I have explained how to determine the shape number corresponding to a particular boundary.
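
A small sketch that reproduces the numbers above; the first-difference convention lists the change from each element to the next, with the circular wrap included last, and the shape number is taken as the smallest circular rotation.

```python
def first_difference(code, directions=4):
    """Counter-clockwise direction changes between adjacent elements,
    treating the chain code as a circular sequence."""
    n = len(code)
    return [(code[(i + 1) % n] - code[i]) % directions for i in range(n)]

def shape_number(code, directions=4):
    """First difference of smallest magnitude (minimum over circular rotations)."""
    d = first_difference(code, directions)
    return min(d[i:] + d[:i] for i in range(len(d)))

print(first_difference([1, 0, 1, 0, 3, 3, 2, 2]))   # [3, 1, 3, 3, 0, 3, 0, 3]
print(shape_number([0, 3, 2, 1]))                   # [3, 3, 3, 3]
print(shape_number([0, 0, 3, 2, 2, 1]))             # [0, 3, 3, 0, 3, 3]
```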

(Refer Slide Time: 16:43)

Here I have given another example. So, already I have shown this example that is the order 4
and that boundary is considered and from this I can determine the Shape number. Similarly,
corresponding to order 6 I can also determine the Shape number. And if you see here
corresponding to this boundary my chain code is, this is my chain code 00 33 22 11 and from
this I can determine the first difference and after this I can determine the Shape number.

Similarly, in the next example, you can see the chain code, you can see the order is 8 and you
can see the first difference I can consider and after this I can determine the Shape number.
So, for the last boundary also I can determine the chain code, there is the order is 8 and also I
can determine the first difference and from the first difference I can determine the shape
number.

(Refer Slide Time: 17:44)

Suppose I consider one arbitrary boundary; how do we determine its shape number? First I take the original boundary. After this I find the smallest rectangle that fits the shape. Then I create a grid: in the third figure, for n equal to 18, the subdivision of the basic rectangle is 3 by 8, that is, 3 columns and 8 rows, and the chain code directions are aligned with the directions of the grid.

So, I can determine the chain code from the grid directions, using the 4-directional chain code 0, 1, 2, 3. Corresponding to this case I get the chain code of the approximated boundary shown in the figure, and after this it is very easy to determine the first difference and, from the first difference, the shape number. So, by using the shape number I can represent a particular boundary; that is boundary representation by the shape number.

(Refer Slide Time: 19:36)

There is another method we can use for the representation of a particular boundary: polygon approximation, that is, representing an object boundary by a polygon. You can see the object boundary, and I approximate it by a polygon as shown. How do we represent the boundary by a polygon? The minimum-perimeter polygon consists of the line segments that minimize the distance between the boundary pixels.

So, in this case the boundary is represented by the polygon. For any two boundary points in the figure, the minimum distance between them is the straight line segment joining them, and the minimum-perimeter polygon is built from such line segments that minimize the distance between the boundary pixels. That is the object boundary representation by a polygon.

(Refer Slide Time: 21:29)

Here you can see how to represent the object boundary by polygon approximation. I am considering one arbitrary object boundary. First, find the line joining two extreme points, that is, the line AB connecting points A and B. After this, find the point of the boundary that is furthest from this line; that distance I can determine. By using this information, splitting the boundary at such points, the boundary is approximated by a polygon. Like this I can represent a particular boundary.
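
OpenCV's Douglas-Peucker routine applies essentially this split-at-the-furthest-point idea recursively, so a quick, hedged sketch follows; the tolerance value is arbitrary, and the two-value return of findContours assumes OpenCV 4.

```python
import cv2
import numpy as np

def approximate_boundary(mask, tolerance=2.0):
    """Polygonal approximation of the largest object boundary in a binary mask;
    tolerance is the maximum allowed distance between boundary and polygon."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)
    poly = cv2.approxPolyDP(contour, epsilon=tolerance, closed=True)
    return poly.reshape(-1, 2)            # (N, 2) array of polygon vertices
```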

(Refer Slide Time: 22:42)

Another one is the distance-versus-angle signature: a 2D boundary is represented in terms of a 1D function, the radial distance as a function of the angle theta. For the first boundary, the circle, the radial distance is always constant, equal to A, for all angles 0, pi/4, pi/2, 3 pi/4, pi and so on.

For all angles the radial distance is the same constant A for the boundary of the circle. If I consider a square and plot the radial distance with respect to theta, I get a different 1D function: along the four diagonal directions the distance is root 2 times A, while along the axis directions it is only A.

So, the signature of the square has 4 peaks of height root 2 A, one for each of the 4 corners, and it comes down to A in between. That means, by considering the 1D function of radial distance with respect to theta, I can represent a particular boundary.
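
A sketch of computing such a signature from a list of boundary points; measuring from the centroid and keeping one representative distance per angular bin are my own choices for this illustration.

```python
import numpy as np

def radial_signature(boundary, n_angles=360):
    """Distance-versus-angle signature r(theta) of a closed boundary.
    boundary: (N, 2) array of (x, y) points."""
    centre = boundary.mean(axis=0)
    d = boundary - centre
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    radius = np.hypot(d[:, 0], d[:, 1])
    bins = np.minimum((theta / (2 * np.pi) * n_angles).astype(int), n_angles - 1)
    sig = np.zeros(n_angles)
    for b, r in zip(bins, radius):
        sig[b] = max(sig[b], r)          # one representative distance per angle bin
    return sig
```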

(Refer Slide Time: 24:38)

Another method: by using statistical moments I can represent a particular boundary. The slide gives the definition of the n-th moment, mu_n = sum over i of (v_i - m)^n p(v_i), where m is the mean; the first moment is the mean, and the second central moment is the variance. Suppose a boundary segment is given. The boundary segment can be converted into a 1D graph, and that graph can be treated as a PDF, a probability density function.

From the PDF I can determine all the moments: the first moment, that is the mean, the second moment, the third moment, and so on. By using these moments I can represent the boundary. So, I am repeating how to represent a boundary by statistical moments: the boundary segment is considered as a 1D graph, the graph is treated as a PDF, the statistical moments are computed from that PDF, and these moments represent the particular boundary.
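
A minimal sketch of this idea, assuming the 1D signature has already been extracted from the boundary segment.

```python
import numpy as np

def signature_moments(g, orders=(2, 3, 4)):
    """Treat a 1-D boundary signature g as a histogram, normalise it into a PDF,
    and return its mean together with the requested central moments."""
    p = np.asarray(g, dtype=float)
    p = p / p.sum()
    v = np.arange(len(p))
    mean = np.sum(v * p)
    central = {n: np.sum((v - mean) ** n * p) for n in orders}
    return mean, central
```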

(Refer Slide Time: 26:07)

This is also a very important technique: I can represent a boundary by means of a convex hull. First, what is a convex hull? The convex hull H of an arbitrary set S is the smallest convex set containing S; that is the definition.

In this case you can see the set S and the object boundary, and the gray color shows the convex hull H of the set S, the smallest convex set containing it; within H the set S is contained. After this, H minus S, the convex hull minus S, is called the convex deficiency, which I denote by D.

Now, for partitioning the boundary, the procedure is this: follow the contour of S and mark the points at which a transition is made into or out of a component of the convex deficiency. So, I have to mark those transition points, and that is how the boundary is partitioned; based on these points I can represent a particular boundary.

One practical example: in the human eye, suppose the cornea has a shape something like this. How do we represent this cornea boundary? By using the convex hull: I identify the transition points with respect to the convex deficiency and thereby partition the boundary. Like this, in practical applications we can use the convex hull to partition a particular boundary.
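
One hedged way to find such transition points with OpenCV is via convexity defects, which report where the contour leaves and re-enters the hull; the exact indices returned depend on how the contour was extracted.

```python
import cv2
import numpy as np

def deficiency_transition_points(contour):
    """Boundary points where the contour enters or leaves a component of the
    convex deficiency D = H - S. contour: (N, 1, 2) int32 array, e.g. one
    element of the list returned by cv2.findContours."""
    hull_idx = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull_idx)
    points = []
    if defects is not None:
        for start, end, far, depth in defects[:, 0]:
            points.append(tuple(contour[start, 0]))   # transition into the deficiency
            points.append(tuple(contour[end, 0]))     # transition out of it
    return points
```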

(Refer Slide Time: 30:50)

The next important point is the Fourier descriptors; by using the Fourier descriptors we can represent a particular boundary. The idea is to view the coordinate (x, y) of a boundary point as a complex number, with x as the real part and y as the imaginary part. So, the boundary is described as s(k) = x(k) + j y(k), a complex number, where s(k) is the coordinate of the k-th boundary point.

I am considering the x coordinate and the y coordinate, with x as the real part and y as the imaginary part, so I get a complex number. In this figure you can see the boundary points with their x and y coordinates; the x coordinate is the real part and the y coordinate is the imaginary part of the complex number. That is how the boundary is represented by the x and y coordinates.

Corresponding to this point, the x coordinate is x1 and the y coordinate is y1; x1 is the real part and y1 is the imaginary part of the complex number. For this complex sequence I can determine the Fourier transform, and that is called the Fourier descriptor. So, what is the Fourier descriptor? I take the DFT of s(k):

a(u) = sum over k from 0 to K-1 of s(k) e^{-j 2 pi u k / K}, for u = 0, 1, ..., K-1.

By this I can represent the boundary s(k): corresponding to this boundary I have the Fourier descriptors a(u), so the boundary is represented by the Fourier coefficients. Instead of storing the entire boundary, I have to keep only the Fourier coefficients. Also, by using the inverse Fourier transform I can reconstruct the original boundary; that is the reconstruction formula.

So, you can see s(k) is equal to 1 by K times the summation over u from 0 to K minus 1 of a(u) e to the power j 2 pi u k divided by K, where a(u) are the Fourier coefficients and K is the number of boundary points. That is the reconstruction formula.
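Written out compactly, one consistent way of stating the definitions just described is the following; a truncated reconstruction simply keeps only P of the K coefficients in the synthesis sum and sets the rest to zero, which is what the example below does.

```latex
s(k) = x(k) + j\,y(k), \qquad k = 0, 1, \ldots, K-1
\]
\[
a(u) = \sum_{k=0}^{K-1} s(k)\, e^{-j 2\pi u k / K}, \qquad u = 0, 1, \ldots, K-1
\]
\[
s(k) = \frac{1}{K} \sum_{u=0}^{K-1} a(u)\, e^{\,j 2\pi u k / K}
```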

(Refer Slide Time: 33:33)

So, in this example I have shown one boundary; you can see the original boundary I am considering is a square boundary, and I want to reconstruct this boundary from the Fourier coefficients. The original boundary has 64 points, K is equal to 64. So, 64 points are available, and by applying the inverse Fourier transform I want to reconstruct the original boundary. First I am considering only 2 coefficients, P is equal to 2; in the reconstruction formula I keep only P coefficients.

So, by considering only 2 coefficients I reconstruct the boundary, and this is my reconstructed boundary. Again, if I consider P is equal to 4, that will be my reconstructed boundary. Next I consider P is equal to 8, then P is equal to 16, then P is equal to 24, and in each case I get the corresponding reconstructed boundary. And if I consider P is equal to 61, that is my reconstructed boundary.

And if I consider P equal to 62, then you can see I can perfectly reconstruct my original boundary. So, you can see how to reconstruct the original boundary from the Fourier coefficients. This is about the Fourier descriptors. The main concept is that instead of storing the boundary pixels, I can store only the Fourier coefficients to represent a particular boundary. We need not store all the pixels corresponding to the boundary; I can determine the Fourier descriptors, and what I have to store in the memory is only the Fourier coefficients.
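A minimal numpy sketch of this idea is given below: the boundary is treated as a complex sequence, its DFT gives the descriptors, and a reconstruction that keeps only a few coefficients (the rest set to zero) gives a progressively better approximation of the boundary, in the spirit of the P = 2, 4, 8, ... experiment described above. The way the kept coefficients are chosen here (the P lowest frequencies) is an illustrative assumption.

```python
import numpy as np

def fourier_descriptors(x, y):
    """DFT of the boundary written as a complex sequence s(k) = x(k) + j y(k)."""
    s = np.asarray(x, dtype=float) + 1j * np.asarray(y, dtype=float)
    return np.fft.fft(s)

def reconstruct(a, P):
    """Inverse DFT after keeping only the P coefficients of lowest |frequency|."""
    K = len(a)
    freqs = np.abs(np.fft.fftfreq(K))
    keep = np.argsort(freqs, kind="stable")[:P]   # indices of the P lowest frequencies
    a_kept = np.zeros_like(a)
    a_kept[keep] = a[keep]
    s_hat = np.fft.ifft(a_kept)
    return s_hat.real, s_hat.imag

# Toy example: a square boundary sampled at 64 points (16 points per side).
p = np.linspace(0, 1, 17)[:-1]
x = np.concatenate([p, np.ones(16), 1 - p, np.zeros(16)])
y = np.concatenate([np.zeros(16), p, np.ones(16), 1 - p])

a = fourier_descriptors(x, y)
for P in (2, 8, 62):
    xr, yr = reconstruct(a, P)
    print(P, "coefficients -> max boundary error:",
          np.max(np.hypot(xr - x, yr - y)).round(3))
```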

(Refer Slide Time: 35:33)

And you can see some of the properties of the Fourier descriptors. Corresponding to the boundary s(k), my Fourier descriptor is a(u), and the first property I am considering is rotation. Rotation of the boundary by an angle theta causes a constant phase shift of theta in the Fourier descriptors; that means the Fourier descriptor is multiplied by e to the power j theta.

Next, I am considering the translation, that is, translation by an amount delta x y, where delta x y is nothing but a translation of x naught in the x direction and y naught in the y direction. If the boundary is translated by delta x y, then the new Fourier descriptors remain the same except at the point u is equal to 0, because in this case the Dirac delta function appears: it is one at u equal to 0 and zero for all other values of u.

So, that is why, if the boundary is translated by x naught in the x direction and y naught in the y direction, the new Fourier descriptors remain the same except at u equal to 0. The next one I am considering is scaling of the boundary. This is nothing but shrinking or expanding the boundary, that is, replacing s(k) by alpha times s(k); that is what I consider as scaling.

So, what happens if I do the scaling? It is nothing but a scaling of the Fourier descriptors, because each Fourier descriptor is multiplied by alpha. And next I consider changing the starting point in tracing the boundary; what will happen in that case?

What happens is that it corresponds to a modulation of the Fourier descriptors, because a(u) is multiplied by an exponential function, and that is nothing but the modulation of a(u). So, these are the properties I am considering. One is rotation: if I rotate the boundary by an angle theta, that corresponds to a constant phase shift of theta in the Fourier descriptors. Another is translation of the boundary.

In that case, except at the point u equal to 0, the descriptors will not change. We have also considered the concept of scaling: if I scale the boundary, the Fourier descriptors are multiplied by the scale factor; and if I change the starting point, then it is nothing but a modulation. So, these are the properties of the Fourier descriptors.
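For reference, the four properties just described are usually written compactly as follows, with s(k) the boundary, a(u) its descriptors and k0 the shift of the starting point; this is the standard way these relations are stated, and it should match the slide.

```latex
\text{Rotation by } \theta:\; s_r(k) = s(k)\,e^{j\theta} \;\Rightarrow\; a_r(u) = a(u)\,e^{j\theta}
\]
\[
\text{Translation by } \Delta_{xy}:\; s_t(k) = s(k) + \Delta_{xy} \;\Rightarrow\; a_t(u) = a(u) \text{ for } u \neq 0 \;\text{(only the } u=0 \text{ term changes)}
\]
\[
\text{Scaling by } \alpha:\; s_s(k) = \alpha\,s(k) \;\Rightarrow\; a_s(u) = \alpha\,a(u)
\]
\[
\text{Starting point shift by } k_0:\; s_p(k) = s(k - k_0) \;\Rightarrow\; a_p(u) = a(u)\,e^{-j 2\pi u k_0 / K}
```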

(Refer Slide Time: 39:28)

Now, I will discuss the boundary representation by the B spline curve. In figure a I have shown one boundary. The B splines are piecewise polynomial functions that can provide a local approximation of contours of shapes using a small number of parameters. So, because of this, the B spline representation results in compression of the boundary data: instead of storing the entire set of boundary pixels, I can consider only a few parameters.

And by using these parameters I can represent a boundary; that is why the B spline representation results in compression of the boundary data. Figure a shows a B spline curve of degree 3, and it is defined by 8 contour points. So, I have shown the 8 contour points here, 1, 2, 3, 4, 5, 6, 7, 8; these 8 contour points are considered to represent the boundary. These little dots divide the B spline curve into a number of curve segments, so you can see this is one segment, this is another segment, and like this I have the segments.

The subdivision of the curve can also be modified; that is why the B spline curves have a higher degree of freedom for curve design. So, based on the contour points, I can represent the object boundary. And as seen in figure B, to design a B spline curve we need a set of control points: the control points are P0, P1, P2, P3, P4, P5, P6.

So, for the B spline representation we need the set of control points and also a set of knots. Here I have shown the knots T1, T2, T3, T4 and so on; these are the knots. Number three, we also need a set of coefficients, one for each control point. So, I need this information: the control points, a set of knots, and a set of coefficients, one for each control point.

And all the curve segments, which you can see in the figure, are joined together satisfying certain continuity conditions. So, I have to define some continuity conditions, and these segments are joined by satisfying those conditions. The degree of the B spline curve, that is, the degree of the B spline polynomial, can be adjusted to preserve the smoothness of the curve to be approximated.

So, I can consider the degree of the B spline polynomial to be, say, second degree or third degree. The curve is approximated by these segments connected between the control points, and the degree of the B spline polynomial I can adjust to preserve the smoothness of the curve to be approximated. The B splines also allow local control over the shape of a B spline curve. So, based on these parameters, the control points, the set of knots, the set of coefficients, and the degree of the B spline polynomial, I can represent a particular boundary.

So, this concept I am going to discuss in my next class, the concept of boundary representation by the B spline curve. In this class I discussed the boundary representation concepts. First I discussed the concept of the chain code; the chain code depends on the starting point, and that is why I considered the first difference of the chain code. After this I defined the shape number, and by using the shape number I can represent a particular boundary.

After this I discussed one important representation, that is the Fourier descriptors. By using the Fourier descriptors I can represent a particular boundary: instead of storing all the pixels of the boundary, I can store only the Fourier descriptors, that is, the Fourier coefficients. By using these Fourier coefficients I can represent the boundary, and I can reconstruct the original boundary from these Fourier descriptors. So, this is the advantage of the Fourier descriptors.

And also I discussed about the properties of the Fourier descriptors. Next, I have introduced
the concept of the B spline curve. So, by using the B spline curve, I can represent a particular
boundary. So, how to represent a boundary by using the B spline, that concept I will discuss
in my next class. So, let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture-27
Object Boundary and Shape Representations - II
Welcome to NPTEL MOOCs course in Computer Vision And Image Processing -
Fundamentals and Applications. In my last class I discussed about the shape and the
boundary representations, I discussed about chain code and the Fourier descriptors and also I
highlighted the concept of B spline for boundary representation. The B spline is nothing but
the piecewise function that can be used for approximating a particular boundary.

In B spline representation, I need some control points, some knot points and also I need the B
spline function. So, by using this I can approximate a particular boundary. So, instead of
storing the entire boundary pixels, I can only consider the B spline function and the control
points for the representation of that boundary.

So, I can consider only the B spline function and the control points for representation of the
boundary. So, now, I will be discussing about the B spline representation, so how to represent
a particular boundary by using the B spline function. So, let us see what is the B spline
function.

(Refer Slide Time: 2:00)

So, this is the boundary representation by B spline curves. The B spline curves are piecewise polynomial functions that can provide a local approximation of contours of shapes using a small number of parameters; that is, the B spline representation results in compression of the boundary data. In the figure you can see I have 2 figures, figure A and figure B, and in figure A you can see the B spline curve of degree 3 I am considering, with 8 control points.

So, in the figure you can see I have 8 control points, and the boundary is represented by the B spline curve using these 8 control points. These control points divide the B spline curve into a number of curve segments, and this subdivision of the curve can also be modified. B spline curves having a higher degree of freedom can also be considered for curve design.

So, if you see the right figure, to design a B spline curve we need a set of control points; here you can see the control points P0, P1, P2, P3, P4, P5, P6. I also need a set of knots, and here I am showing the knots T3, T4, T5, T6; these are the knot points. I also need a set of coefficients, one for each control point. After this, all the curve segments are joined together satisfying certain continuity conditions, and the degree of the B spline polynomial can be adjusted to preserve the smoothness of the curve to be approximated.

So, this is the concept of the B spline curve. And in the B spline, the B spline allow local
control over the shape of a spline curve. So, this approximation I can do by using the B spline
curve and this is nothing but the compact representation of a boundary. So, instead of storing
all the boundary pixels, I can store only the control points, the knots and also the B spline
function. Now, what is the B spline function in the next slide you can see.

(Refer Slide Time: 4:52)

Now, the B spline curve is represented by P(t); the B spline curve is P(t) equal to the summation over i from 0 to N of Pi times Ni,k(t). In this case the boundary point is nothing but the x coordinate and the y coordinate, so x(t) and y(t) together are represented by P(t); that is the boundary point. I also have the control points, the control points Pi, where each Pi has the two coordinate components P1i and P2i.

In this case k is the order of the polynomial segments of the B spline curve; so k is the order of the B spline curve. The normalized B spline blending function of order k is denoted by Ni,k, and it can be defined like this.

And in this case I need a non decreasing sequence of real numbers. What are these real numbers? They are ti, for i equal to 0, 1 up to N plus k; this sequence ti is called the knot sequence. The knots are the locations where the spline functions are tied together. So, I am defining the control points and also the knots, and I have shown the B spline of order k.

The B spline of order 1 I can also get here; this is the B spline of order 1. If the knot sequence of a B spline is uniform, then it is quite easy to calculate these B spline functions. So, suppose I consider the knots t0, t1 and so on up to t N plus k, that is, the knot points with indices 0, 1 up to N plus k. If this knot sequence is uniform, then it is quite easy to calculate the B spline functions. So, in this case I have defined the B spline curve.
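The blending functions Ni,k(t) referred to above are usually computed with the Cox-de Boor recursion, which starts from the order-1 (piecewise constant) functions and builds up the higher orders. A small recursive sketch is given below; the uniform integer knot vector and the particular control points in the example are only assumptions for illustration.

```python
def bspline_basis(i, k, t, knots):
    """Normalized B-spline blending function N_{i,k}(t) via the Cox-de Boor recursion."""
    if k == 1:
        # Order 1: piecewise constant, 1 on [t_i, t_{i+1}), 0 elsewhere.
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + k - 1] - knots[i]
    right_den = knots[i + k] - knots[i + 1]
    left = 0.0 if left_den == 0 else (t - knots[i]) / left_den * bspline_basis(i, k - 1, t, knots)
    right = 0.0 if right_den == 0 else (knots[i + k] - t) / right_den * bspline_basis(i + 1, k - 1, t, knots)
    return left + right

def bspline_point(t, control_points, k, knots):
    """Point P(t) = sum_i P_i N_{i,k}(t) of the curve defined by the control points."""
    x = sum(px * bspline_basis(i, k, t, knots) for i, (px, _) in enumerate(control_points))
    y = sum(py * bspline_basis(i, k, t, knots) for i, (_, py) in enumerate(control_points))
    return x, y

# Example with a uniform knot sequence (an illustrative choice):
ctrl = [(0, 0), (1, 2), (3, 3), (5, 1), (6, 0)]   # N + 1 = 5 control points
order = 3                                         # quadratic segments
knots = list(range(len(ctrl) + order))            # N + k + 1 = 8 knots: 0, 1, ..., 7
print(bspline_point(3.5, ctrl, order, knots))
```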

(Refer Slide Time: 8:45)

And in this figure I am first showing the B spline of order 1, that is N0,1, and after this I have shown N1,1, N2,1 and N3,1. In this case how many knots am I considering? Five knots, which you can see at 0, 1, 2, 3, 4. So, I am considering 5 knots, and I am plotting the functions N0,1, N1,1, N2,1 and N3,1, that is, the B spline functions I am considering. The circle at the end of a line indicates that the function value is 0 at that point.

So, in this case you can see the circle at the end of the line, and it indicates that the function value is 0 at that point. All these functions have support in a particular interval; support means the region where the function is nonzero. For example, Ni,1, the B spline of order 1, has support on the interval from i to i plus 1. And in this case Ni,1 is nothing but N0,1 evaluated at t minus i.

That means I am considering a shifting: from N0,1 I can determine Ni,1 by shifting. So, this is about the functions N0,1, N1,1, N2,1 and N3,1, the B spline functions, and for this I am considering 5 knot points. These are the normalized B splines of order 1, because in this case I am considering order k equal to 1, and you can see the starting points of N0,1, N1,1, N2,1 and N3,1 are 0, 1, 2 and 3 respectively.

(Refer Slide Time: 11:03)

Now, in this case I am showing the normalized B spline of order 2, that means I am
considering k is equal to 2. The normalized B spline of order 2 and in this case I am
considering N 02 I am considering that is the B spline of order 2 and after this I am
considering N 1 2, after this I am considering the B spline of order 3 that is N 03 and after
this I am considering N 1 3 that is I am doing the shifting. So, this N 02 can be written as a
weighted sum of N 01 and N 1 1.

(Refer Slide Time: 11:40)

So, here you can see in this equation that N0,2, the B spline of order 2, can be written as a weighted sum of N0,1, the B spline of order 1, and N1,1, its shifted version. This is about N0,2, the B spline of order 2.

(Refer Slide Time: 12:01)

Next I am considering N1,2, which I can represent in terms of N1,1 and N2,1; it is again a B spline of order 2. N1,2 is a shifted version of N0,2, and in this case you can see I have plotted N0,2 and N1,2. This curve is a piecewise linear curve and the support is from 0 to 2.

So, this is piecewise linear with support in the interval 0 to 2, and it is commonly known as the hat function. Hat functions are used as blending functions during linear interpolation. So, if you see this one, this is the hat function, and this other one is also a hat function.

(Refer Slide Time: 13:43)

And in this case I can show the N 0 3 I can determine that is the N 0 3 is nothing but the B
spline of order 3 and N 1 3 that is the shifted version I can determine. Now, if you see these
functions N 03 and the N 1 3 in the next slide.

(Refer Slide Time: 14:04)

One is N0,3 and another one is N1,3, corresponding to the previous equation; these functions are piecewise quadratic curves. And what is N1,3? N1,3 is simply a shifted version of N0,3.

And if I consider k equal to 4, that is the B spline of order 4, then the blending functions Ni,4 that I get are piecewise cubic functions.

So, this is about the B spline functions. I have defined the B splines of order 1, order 2, order 3 and order 4. Now I will draw all these B spline functions again.

(Refer Slide Time: 15:36)

So, the first one you can see is the B spline of order 1. For this I am considering 2 knot points, ti and ti plus 1, and the function is constant; that is the B spline of order 1. If I consider the B spline of order 2, for this I need 3 knot points, ti, ti plus 1 and ti plus 2, and as already defined this function is linear. Then for the B spline of order 3, what will it be? It is quadratic.

So, for order 3 I need the knots ti, ti plus 1, ti plus 2 and ti plus 3, and this value here is 1 by 2; the function is quadratic. And if I consider the B spline of order 4, that is Ni,4, I need 5 knot points, ti, ti plus 1, ti plus 2, ti plus 3 and ti plus 4, and I get a B spline function like this, which is called the cubic B spline. So, these are the piecewise polynomial functions.

So, generally for computer graphics we consider the B splines of order 3 and order 4, the higher order B splines. This is about the piecewise polynomial functions. Now consider the control points: how many control points do I need for the representation of a boundary? The number of control points necessary to reproduce a given boundary accurately is usually much less than the number of points needed to trace a smooth curve.

So, that means I need only a few control points to represent a particular boundary. In this case, suppose the control point is Pi, considered as a vector; it can be translated by a vector x0, where x0 is nothing but the vector with components x0 and y0. So, I can translate the control points, I can scale the control points, and I can also rotate the control points.

The rotation transformation I have already explained: the rotation matrix is cos theta, minus sine theta; sine theta, cos theta. Now suppose we are given the boundary points at the parameter values t equal to s0, s1 and so on up to sN, and corresponding to this we have to find the control points. So, the problem is: given the boundary points at the parameter values t equal to s0, s1 and so on, find the control points.

So, I can write each boundary point as P(sj) equal to the summation over i from 0 to N of Pi times Ni,k(sj), where Pi are the control points, Ni,k is the B spline blending function of order k, and j is equal to 0, 1, 2 and so on. In matrix form I can write this as Nk times P equal to p, where Nk is the matrix of B spline function values, P collects the control points and p collects the given boundary points. So, from this I can determine the control points as P equal to Nk inverse times p.

So, that means, when the boundary points are given at the parameter values t equal to s0, s1 and so on, I can find the control points. Now, how do I actually represent the boundary? Suppose I consider one boundary, something like this; this boundary is represented by using the control points and the B spline functions. That means I am approximating the boundary by considering the control points and the B spline functions.

So, I am considering the control points and the B spline functions. The smoothness I can adjust by choosing the order of the B spline; if I consider a higher order B spline, the smoothness will be more. In this case, for representing a particular boundary, what information do I need to store in the memory? The information I need to store is the order of the B spline and the control points.

So, by using these 2 pieces of information, the order of the B spline and the control points, I can represent a particular boundary. And to change the shape of a B spline curve, one can modify one or more of these parameters. So, if I want to change the shape of the B spline curve, the parameters I can adjust are the positions of the control points, the positions of the knots, and the degree of the B spline curve.

So, that means, to change the shape of the B spline curve I can modify the parameters: the positions of the control points, the positions of the knots, and the degree of the curve. I can change the shape of the B spline curve based on these parameters.

So, in this discussion you can see that to represent a boundary I need to store mainly 2 pieces of information, one is the order of the B spline curve and the other is the control points. And you have seen how to define the B spline curves of order 1, order 2, order 3 and order 4. So, this is about the B spline curve representation.
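In practice one rarely inverts that matrix by hand; a library fit does both steps at once. The hedged sketch below uses scipy's splprep/splev to fit a cubic (order 4, degree 3) B spline to a sampled boundary and then resample a smooth approximation from the stored knots and coefficients; the synthetic noisy circle and the smoothing factor s are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import splprep, splev

# A noisy closed boundary sampled at 100 points (synthetic data for illustration).
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
x = np.cos(theta) + 0.02 * np.random.randn(100)
y = np.sin(theta) + 0.02 * np.random.randn(100)

# Close the curve and fit a periodic cubic B-spline: tck holds the knots, the
# spline coefficients (which play the role of the control points) and the degree.
x = np.r_[x, x[:1]]
y = np.r_[y, y[:1]]
tck, u = splprep([x, y], s=0.05, per=True)

# Only tck needs to be stored to regenerate the boundary at any resolution.
u_fine = np.linspace(0, 1, 400)
x_fit, y_fit = splev(u_fine, tck)
print("knots:", len(tck[0]), "coefficients per coordinate:", len(tck[1][0]))
```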

(Refer Slide Time: 24:56)

After this I will discuss the regional descriptors, that is, how to describe a particular region or area. Some of the regional descriptors I am going to define here. The first one I can consider is the area of the region, or the length of the boundary of the region, that is, the perimeter. I can also determine the compactness. The compactness is nothing but A(R) divided by P squared of R, where A(R) is the area and P(R) is the perimeter of the region R.

So, by using these 2 parameters, the area and the perimeter of the region, I can define the compactness. If I consider a circle, the compactness will be C equal to 1 by 4 pi. This is a very simple regional descriptor. After this, the other descriptors are the topological descriptors, the texture descriptors, the moments of 2D functions, and also the MPEG 7 ART shape descriptors. So, I am going to discuss these regional descriptors.
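As a quick illustration of these simple descriptors, the sketch below uses skimage's regionprops to get the area and the perimeter of each labelled region in a binary image and then computes the compactness A/P squared as defined above; the synthetic disk image is an assumption for illustration only.

```python
import numpy as np
from skimage.measure import label, regionprops

# Synthetic binary image containing a filled disk (illustrative data).
yy, xx = np.mgrid[0:200, 0:200]
binary = ((xx - 100) ** 2 + (yy - 100) ** 2) < 60 ** 2

for region in regionprops(label(binary)):
    area = region.area                 # A(R): number of pixels in the region
    perimeter = region.perimeter       # P(R): estimated boundary length
    compactness = area / perimeter ** 2
    print(f"area={area}, perimeter={perimeter:.1f}, "
          f"compactness={compactness:.4f}, circle value 1/(4*pi)={1/(4*np.pi):.4f}")
```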

(Refer Slide Time: 26:23)

The first one you can see is a simple regional descriptor. In this case the white pixels represent the lights of the cities, as you can see in the images here. I am considering the input image, which is a binary image: the black pixels are the background and the white pixels represent the lights of the cities. For the regional descriptor I am considering the percentage of white pixels in each region compared to the total number of white pixels.

So, for region 1 it is 20.4 percent, for region 2 it is 60.4 percent, and like this I have the percentages. By using this concept I can represent a particular region. This is the simplest regional descriptor.

(Refer Slide Time: 27:20)

Another class of descriptors is the topological descriptors; here I can consider the Euler number. How do we define the Euler number? In these 2 figures, the first figure shows a region with 2 holes and the second figure shows a region with 3 connected components. From these quantities you can define the Euler number. So, what is the Euler number?

The Euler number is E equal to C minus H, where C is the number of connected components and H is the number of holes. So, by using these 2 parameters, the number of connected components and the number of holes, we can determine the Euler number.

(Refer Slide Time: 28:20)

In the next slide I show how to calculate the Euler number. Corresponding to image a, the Euler number will be 0: the number of connected components is 1 and there is only 1 hole, so 1 minus 1 gives 0. In the second case, the number of connected components is again 1, but I have 2 holes, this one and that one. So, C minus H, the Euler number, will be minus 1.

The Euler number can also be computed like this, if you see the third figure: V minus Q plus F is equal to C minus H, which is the Euler number E. Here V is the number of vertices, Q is the number of edges and F is the number of faces, while C is the number of connected components and H is the number of holes. In this example, if you count the vertices, 1, 2, 3, 4, 5, 6, 7, the number of vertices is 7 in this figure.

And how many edges are there? Counting 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, the number of edges is 11, and the number of faces is 2. So, if I substitute these values, V minus Q plus F is 7 minus 11 plus 2, which is minus 2. So, the Euler number is E equal to minus 2 corresponding to this figure, that is, corresponding to this region. Like this I can calculate the Euler number.
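Recent versions of scikit-image provide this topological descriptor directly; a hedged sketch, where the small test array (one connected region containing two rectangular holes) is made up for illustration:

```python
import numpy as np
from skimage.measure import euler_number  # available in recent scikit-image versions

# A single connected region (C = 1) with two holes (H = 2) -> E = C - H = -1.
img = np.ones((9, 13), dtype=bool)
img[2:4, 2:5] = False          # first hole
img[5:7, 7:11] = False         # second hole

# connectivity=1 uses 4-connectivity for the foreground (an illustrative choice).
print(euler_number(img, connectivity=1))   # expected: -1
```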

(Refer Slide Time: 30:22)

So, in this example I have shown how to calculate the Euler number. I am showing the original image, that is, the gray scale image, and after intensity thresholding I get 1591 connected components. I can calculate how many connected components there are and how many holes there are, in this case 39 holes, and from these I can determine the Euler number. That means this image is represented by the Euler number.

And in this case, I am considering the largest connected area I am considering, that
corresponds to 8479 pixels and after this the morphological thinning I am doing, so I am
getting this one. So, that means the main concept is that a particular region is represented by
the Euler number.

(Refer Slide Time: 31:19)

Next, I am going to discuss the run length representation of a region. In this case I am considering a binary image. Suppose I consider the pixel (1, 1), that means this pixel: how many times is it repeated? The run starting at (1, 1) is repeated only 1 time. Next I look for another run; the next run starts at pixel (1, 3). This is the x direction and this is the y direction.

So, the x coordinate is 1 and the y coordinate is 3, and the run starting at (1, 3) is repeated twice. If I see the next row, it starts from the pixel (2, 0), and how many times is it repeated? It is repeated 4 times, first, second, third and fourth. That means the run starting at pixel (2, 0) has length four, and similarly the run starting at pixel (3, 1) has length 2.

So, I get the run length code, and by using the run length code I can represent that binary image. This is the run length representation.
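A small sketch of this run length encoding is shown below: it scans a binary image row by row and records, for each run of 1s, the (row, starting column, length) triple, which is exactly the kind of code read out above. The tiny test array is only a guess chosen so that its runs match the ones mentioned; the full image on the slide is not reproduced here.

```python
import numpy as np

def run_length_encode(binary):
    """Return (row, start_col, length) triples for every run of 1s, row by row."""
    runs = []
    for r, row in enumerate(np.asarray(binary)):
        c = 0
        while c < len(row):
            if row[c] == 1:
                start = c
                while c < len(row) and row[c] == 1:
                    c += 1
                runs.append((r, start, c - start))
            else:
                c += 1
    return runs

# Illustrative binary image; expected output:
# [(1, 1, 1), (1, 3, 2), (2, 0, 4), (3, 1, 2)]
img = np.array([[0, 0, 0, 0, 0],
                [0, 1, 0, 1, 1],
                [1, 1, 1, 1, 0],
                [0, 1, 1, 0, 0]])
print(run_length_encode(img))
```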

(Refer Slide Time: 32:56)

Next I am considering the quad tree; by using the quad tree I can also represent a particular region. This concept I have already discussed in my class on image segmentation, in the split and merge technique. In this case, if the region is not homogeneous, that is, if the image region is not homogeneous, I have to divide it into 4 regions.

So, here you can see the region is not homogeneous, and that is why I am dividing it into 4 regions like this; this is my root node. If a region is homogeneous, then there is no need to do the splitting, and I can merge as per the split and merge segmentation procedure. Region A is homogeneous, so there is no need for further splitting; but this other region is not homogeneous, so I divide it into four regions, that is, I split it into four regions.

And if I consider this region, it is homogeneous, so there is no need to do the splitting. Similarly, for this one you can see the region is not homogeneous, so I divide it into four regions; that means I am doing the splitting. Based on this I can represent the image as a quad tree: the root is the image, and the code will be something like this. If I write G, G means gray, that is, the block contains both white and black.

If I consider white, white means 0, and if I consider black, black is 1. The original image is labelled gray, G, because it contains both white and black. After this I split the original image and get the four regions. The first region, region A, is black; the second region has both black and white, so I label it gray.

And within this first gray node, I have a black, a white, another white, and then again a black; up to this I get that part of the code. After this, this region is completely white, so I label it white. The next node is gray, because it contains both black and white; inside this gray node, first I have a black, then a white, then a white, and then a node that again contains both black and white, so I write it as gray. Inside that inner gray node I have a black, then a white, then a white, and then a black. Like this I get a code corresponding to this image, that is, corresponding to this region. This is called the quad tree representation, and by using the quad tree representation I can represent a particular region.
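A compact recursive sketch of this quad tree coding is given below: a block that is all 0s is coded W, a block that is all 1s is coded B, and a mixed (gray) block is coded G followed by the codes of its four quadrants. The G/B/W string and the 4x4 toy region are illustrative choices, not the exact code from the slide.

```python
import numpy as np

def quadtree_code(block):
    """Recursively encode a square binary block as a G/B/W string."""
    if np.all(block == 1):
        return "B"                       # homogeneous black block
    if np.all(block == 0):
        return "W"                       # homogeneous white block
    h, w = block.shape
    h2, w2 = h // 2, w // 2
    # Gray (mixed) block: split into four quadrants and recurse.
    return ("G"
            + quadtree_code(block[:h2, :w2]) + quadtree_code(block[:h2, w2:])
            + quadtree_code(block[h2:, :w2]) + quadtree_code(block[h2:, w2:]))

# 4x4 toy region (illustrative): top-left quadrant black, the rest mixed/white.
region = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1]])
print(quadtree_code(region))   # e.g. "GBGWWWBWB"
```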

(Refer Slide Time: 37:01)

I have already discussed the texture descriptors, so I am not going to discuss them again in detail. You can see I have considered the smooth textures, the coarse textures and the regular textures. For these I can determine some statistical parameters, which I have already discussed, and I can also determine the GLCM, the gray level co-occurrence matrix, and from it derive some quantities to represent the textures.

(Refer Slide Time: 37:34)

So, you can see I can determine the statistical parameters like the mean, standard deviation,
roughness factor, third moment, uniformity, entropy, and by using this I can represent
different types of textures, that I can represent the smooth textures, the coarse textures and
the regular textures.

(Refer Slide Time: 37:55)

I also discussed the Fourier approach for texture description. For this I have to determine the Fourier transform of the image and then convert the Fourier spectrum into a polar representation. After this I have to determine 2 spectra, one is S(r) and the other is S(theta). In the case of S(theta), theta is fixed and r is the variable of summation, and in the case of S(r), r is fixed and theta is the variable. So, I get these 2 spectra, S(r) and S(theta), and by using them I can represent a particular texture.

(Refer Slide Time: 38:40)

So, by using the image moments I can represent a particular region. The image moments corresponding to the image f(x, y) I can determine like this as m pq, and I can also determine the central moments of order p plus q. I can determine the central moment of order p plus q for the image f(x, y) by using this expression, and from it I can determine the moments mu 00, mu 01, mu 11, mu 20, mu 02, mu 21, mu 12, mu 30 and mu 03.

So, that means, corresponding to a particular region f(x, y), I can compute the central moments of order p plus q.

(Refer Slide Time: 39:33)

And I can determine invariant moments, that is, moments that are invariant to these geometric transformations. I can determine 7 moment invariants, phi 1, phi 2, phi 3, phi 4, phi 5, phi 6 and phi 7, which are independent of rotation, translation, scaling and reflection. So, from the image f(x, y) I can determine these 7 invariant moments.
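OpenCV exposes both the raw/central moments and the seven Hu invariant moments directly; a hedged sketch is given below. The synthetic rotated-rectangle shape is made up, and the log scaling at the end is a common convention for comparing the widely different magnitudes of the invariants, not something from the lecture.

```python
import cv2
import numpy as np

# Synthetic binary shape: a filled rotated rectangle (illustrative data only).
img = np.zeros((200, 200), dtype=np.uint8)
box = cv2.boxPoints(((100, 100), (120, 60), 30))      # centre, size, angle
cv2.fillPoly(img, [np.int32(box)], 255)

m = cv2.moments(img, binaryImage=True)    # raw, central and normalized central moments
hu = cv2.HuMoments(m).flatten()           # the seven invariants phi_1 ... phi_7

print("m00 (area):", m["m00"])
print("mu20, mu02:", m["mu20"], m["mu02"])
# Log-scaled Hu moments are easier to compare across shapes.
print("Hu moments:", -np.sign(hu) * np.log10(np.abs(hu) + 1e-30))
```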

(Refer Slide Time: 40:16)

In this case I have shown one example. You can see the original image; the second one is the translated image; the third one is the scaled image, that is, the half size image; the fourth one is the mirrored image; the fifth one is rotated by an angle of 45 degrees; and the sixth one is rotated by an angle of 90 degrees.

(Refer Slide Time: 40:50)

So, corresponding to this I can determine the moment invariants, the 7 moment invariants phi 1, phi 2, phi 3, phi 4, phi 5, phi 6 and phi 7. You can see that for the original image I have these values of phi 1 up to phi 7, and for the other cases, the translated image, the scaled image, the mirrored image and the rotated images, I can determine the same values phi 1 up to phi 7. Corresponding to all these transformations you can see that the value is almost the same; that is, it is invariant to these transformations.

Similarly, if I consider the second moment phi 2, that is also almost constant for all these cases. Only for the mirrored image does one of the values change sign: it is positive in the other cases and negative for the mirrored image. So, from this you can see that I have the 7 invariant moments for representing a particular image, and these invariant moments are independent of rotation, translation, scaling and reflection.

(Refer Slide Time: 42:17)

After this, I will just briefly discuss about the principal component for image description. So,
in my image transformation class I have discussed about the KL transformation, I discussed
about the PCA, the principal component analysis, the same concept is used for representing
images, that is the principal component for image description. And in this case, the purpose is
to reduce the dimensionality of a vector image while maintaining information as much as
possible.

So, in this case, in this example, I am considering different images captured in different
spectral bands. So, spectral band 1, spectral band 2, spectral band 3 like this, so, different
spectral bands I am considering. And if I consider a pixel suppose, these pixels from all the
spectral bands I am considering and that can be considered as a vector, the vector is x.

Suppose I consider one pixel across all the spectral bands; in this example I am considering 6 spectral bands. So, I form the vector x, which is the input vector. From this input vector I can determine the mean vector and also the covariance matrix. This concept I have already explained in the KL transform.

(Refer Slide Time: 43:51)

After this I am considering the Hotelling transformation; the Hotelling transformation is nothing but the KL transformation. The transformation is y equal to A times (x minus mx); that is the KL transformation, that is the Hotelling transformation. In this case A is the transformation matrix. How do we construct that transformation matrix? The matrix A is constructed from the eigenvectors of the covariance matrix.

The matrix A that is a transformation matrix is constructed from the Eigen vectors of the
covariance matrix. So, the row number 1 contains the first Eigen vector with the largest Eigen
value. So, this is the first row of the transformation matrix. What about the second row of the
transformation matrix? Row 2 contains the second Eigen vector with the second largest Eigen
value. So, like this I can determine transformation matrix.

So, the rows of A are e1 transpose, the first eigenvector, then the second eigenvector, and so on; like this I can determine the transformation matrix. What happens after the transformation? You can see that the mean of y, that is, the mean of the transformed data, will be 0. And I can determine the covariance matrix of y, the transformed data, in terms of the covariance matrix of x, the input data, that is, Cx.

And you can see that the covariance matrix of y, the transformed data, is a diagonal covariance matrix; the off-diagonal elements are all 0. That means the transformed data are uncorrelated, that is, the elements of y are uncorrelated. The component of y with the largest lambda, where lambda is the eigenvalue, is called the principal component.

So, this is how, by using the Hotelling transformation, the KL transformation, or PCA, the principal component analysis, we can represent an image.
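A numpy sketch of the Hotelling transform on multispectral pixel vectors is given below: each pixel across the 6 bands is a 6-dimensional vector x, the rows of A are the eigenvectors of the covariance matrix ordered by decreasing eigenvalue, and y = A(x - mx) gives the principal-component images. The random data stands in for the real spectral bands used in the lecture example.

```python
import numpy as np

# Stack of 6 spectral bands, each H x W (random data stands in for real bands).
H, W, bands = 64, 64, 6
stack = np.random.rand(bands, H, W)

# Each pixel is a 6-dimensional vector x; collect them as rows of X.
X = stack.reshape(bands, -1).T                 # shape (H*W, 6)
mx = X.mean(axis=0)
Cx = np.cov(X, rowvar=False)                   # 6 x 6 covariance matrix

# Eigenvectors of Cx, ordered by decreasing eigenvalue, form the rows of A.
eigvals, eigvecs = np.linalg.eigh(Cx)
order = np.argsort(eigvals)[::-1]
A = eigvecs[:, order].T

# Hotelling / KL transform: y = A (x - mx) for every pixel.
Y = (X - mx) @ A.T                             # shape (H*W, 6)
components = Y.T.reshape(bands, H, W)          # principal-component "images"

Cy = np.cov(Y, rowvar=False)
print("eigenvalues:", np.round(eigvals[order], 3))
print("largest off-diagonal of Cy (should be ~0):",
      np.abs(Cy - np.diag(np.diag(Cy))).max())
# Keeping only components[0] and components[1] gives the dimensionality
# reduction described above.
```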

(Refer Slide Time: 46:19)

Here I am considering 6 images corresponding to 6 spectral bands, channel number 1, channel number 2, and so on up to channel number 6. So, you can see I have the images corresponding to all the spectral bands, channel 1 to channel 6; I have 6 images.

(Refer Slide Time: 46:49)

From this I can apply the Hotelling transformation, that is, the KL transformation. In this case I am considering the principal components; you can see the eigenvalues, 3210, 931, 118 and so on, and the corresponding components are represented as images. If I consider the first eigenvalue, the principal component corresponding to it gives me component 1 as an image. If I consider component 2, that gives another image.

Similarly, component 3 corresponding to its eigenvalue gives an image, and component 4, component 5 and component 6 are like this. In this case, component 1, corresponding to the largest eigenvalue, has the maximum visual information compared to component 3, component 4, component 5 and component 6.

So, for the representation of the images, I can select only component 1 and component 2 and discard component 3, component 4, component 5 and component 6. That means, by using only 2 principal components I can represent the image, and that is nothing but dimensionality reduction.

(Refer Slide Time: 48:09)

So, here I have shown the original images, channel number 1, channel number 2, channel number 3, channel number 4, channel number 5 and channel number 6, and I have shown the components after the KL transformation: the first one is the principal component, component 1, and then component 2, component 3, component 4, component 5 and component 6. As already explained, component 1 and
the component 2 contain maximum visual information.

So, I can consider only these 2 components, component 1 and component 2 and I can neglect
component 3, component 4, component 5 and component 6. This is the method to represent a
particular image by using the Hotelling transformation.

(Refer Slide Time: 48:58)

Finally, I want to discuss the MPEG 7 angular radial transformation. MPEG 7 is called the multimedia content description interface; it is not used as a compression standard. I am considering some binary images, and the question is how to represent these binary images by using the MPEG 7 angular radial transformation.

That is the shape representation by MPEG 7 ART, where ART means the angular radial transformation descriptor. For details you can see the book by Professor Manjunath, Introduction to MPEG 7: Multimedia Content Description Interface. In my class I am only giving the highlights of this technique, the MPEG 7 ART shape descriptors.

(Refer Slide Time: 49:52)

So, in this case based on the shape similarity, I can do the clustering. So, in this case, so if I
consider these shapes, these are almost similar, so this will be one cluster. And similarly, if I
consider these shapes these are almost similar. So, I can do the clustering like this based on
the shape. And also for content based image retrieval, I can use this technique. So, I will
explain how we can use the MPEG 7 ART shape descriptors for shape retrieval, the image
retrieval.

(Refer Slide Time: 50:25)

So, what is the definition of the MPEG 7 ART descriptor? Here you see the ART coefficients I can get by using this expression. In this case f(rho, theta) is nothing but the image function represented in polar coordinates, and V nm is the ART basis function. That is, I am considering the inner product between V nm, the ART basis function, and the image f(rho, theta). ART is an orthonormal unitary transformation defined on a unit disk that consists of complete orthonormal sinusoidal basis functions in polar coordinates.

In this case F nm is the ART coefficient, that is, the angular radial transformation coefficient of order n and m. The ART basis function is separable along the angular and the radial directions, so it is the product of an angular function and a radial function. From this you can easily calculate the ART coefficients.
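For reference, the standard MPEG-7 statement of these definitions is given below (the expression on the slide should match it): the coefficient F nm is the inner product of the basis function and the image on the unit disk, and the basis function separates into an angular and a radial factor.

```latex
F_{nm} = \langle V_{nm}(\rho,\theta),\, f(\rho,\theta)\rangle
       = \int_{0}^{2\pi}\!\!\int_{0}^{1} V_{nm}^{*}(\rho,\theta)\, f(\rho,\theta)\, \rho\, d\rho\, d\theta
\]
\[
V_{nm}(\rho,\theta) = A_m(\theta)\, R_n(\rho), \qquad
A_m(\theta) = \frac{1}{2\pi}\, e^{jm\theta}, \qquad
R_n(\rho) = \begin{cases} 1, & n = 0\\ 2\cos(\pi n \rho), & n \neq 0 \end{cases}
```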

(Refer Slide Time: 52:04)

In this case I have shown the real parts of the ART basis functions: along one direction I am showing the different radial orders n, that is n equal to 0, 1, 2 and so on, and along the other direction the angular orders m, that is m equal to 0, 1 and so on. These are the real parts of the ART basis functions.

(Refer Slide Time: 52:24)

And similarly, I have the imaginary parts of the basis functions. Both the real part and the imaginary part you can determine from the previous expressions.

(Refer Slide Time: 52:41)

The ART descriptor is a set of normalized magnitudes of these complex ART coefficients, and 12 angular and 3 radial functions are used; that means I am considering n less than 3 and m less than 12. The ART coefficient of order n equal to 0 and m equal to 0 is not included in the descriptor, because it is used for normalization.

So, if I neglect that coefficient, I will have only 35 normalized ART coefficients; that means I get an array of 35 normalized and quantized magnitudes of the ART coefficients.

(Refer Slide Time: 53:34)

And in this case, if I want to compare two shapes, suppose one shape is this and another shape is that, the dissimilarity between these 2 shapes can be determined based on the ART coefficients. When I calculate the dissimilarity between 2 shapes, Md and Mq are the arrays of ART coefficients for the two shapes. So, I have to determine the ART coefficients for both shapes, and after this I can compare the 2 shapes by using the MPEG 7 ART shape descriptors; that is, I can find the dissimilarity.
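The comparison itself is commonly taken as an L1 distance between the two descriptor arrays; a tiny sketch is shown below, where the two random arrays stand in for the real ART descriptors of a database shape and a query shape.

```python
import numpy as np

def art_dissimilarity(m_d, m_q):
    """Dissimilarity between two shapes as the L1 distance of their ART descriptor arrays."""
    return float(np.sum(np.abs(np.asarray(m_d) - np.asarray(m_q))))

# Placeholder descriptors: 35 normalized ART coefficient magnitudes per shape.
m_database = np.random.rand(35)
m_query = np.random.rand(35)
print(art_dissimilarity(m_database, m_query))
```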

(Refer Slide Time: 54:21)

One application I have already mentioned is content based image retrieval. Suppose in my database I have all these images, and suppose I want to retrieve a particular image from the database. Then, corresponding to the query shape, I have to determine the ART coefficients, and for all the other shapes I have already stored the ART coefficients corresponding to each shape.

So, we need not store the images themselves; we only have to store the ART coefficients corresponding to each shape. For the comparison I consider the ART coefficients of the 2 shapes, and based on the dissimilarity measure I can select a particular image from the database. That is content based image retrieval by considering the MPEG 7 ART shape descriptors.

So here you can see that corresponding to all the images in the database I can determine the ART shape descriptors, that is, the ART coefficients. Corresponding to the query image, suppose this is my query image, I can also determine the ART coefficients. After this I have to find the similarity between the shapes based on the ART coefficients, and based on this I can retrieve a particular shape from the database; that is content based image retrieval.

(Refer Slide Time: 55:51)

And similarly, I can show some binary images like these, and corresponding to all the binary images I can determine the ART shape descriptors. So, I need not store the images; I only have to store 35 ART coefficients, because each shape is represented by 35 coefficients, which are invariant to affine transformation. This is about the MPEG 7 ART shape descriptors.

In this class I discussed 2 important descriptors, one is the ART shape descriptors that is the
MPEG 7 ART shape descriptors. And also I discussed about the B spline representation of a
curve. So, in the B spline representation of the curve, to represent a particular curve or to
represent a particular boundary, I have to consider the B spline curve and for this I have to
store the control points and the order of the B spline. By considering this information I can
represent a particular boundary. This is about the B spline.

And regarding this MPEG 7, that is called a multimedia content description interface, I can
represent a particular shape by using ART, that is the angular radial transformation shape
descriptors. This is about the shape and the boundary representation by descriptors. So, I can
consider boundary descriptors, I can consider the region based descriptors. So, let me stop
here today. Thank you.

Computer Vision and Image Processing Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture 28
Interest Point Detection
Welcome to NPTEL MOOCs course on Computer Vision and Image Processing Fundamentals
and Applications, I have been discussing about image features, today I am going to discuss about
interest point detection. So, with the help of interest point I can do image matching that means
the feature matching can be done with the help of interest point.

So, one example of interest point is the corner points. And with the help of this interest point I
can do image matching, so for example in case of the stereo image matching I have to do
matching between the left image and the right image, for stereo correspondence, so for this I
have to consider some interest points and based on this interest point I can do matching. Another
example I can give, suppose I have two images of a particular scene, so with these two images I
can make a big image that is called the panoramic view.

So, I can join these two images to get a bigger image. So, for this I have to match these two
images and after the matching I can join these two images, this is called the image stitching, so
image stitching is important for panoramic view, for this also I have to do feature matching with
the help of interest points. So, there are many applications of interest points like image alignment
also I can use the interest points, for 3D reconstruction also I can use interest points.

So, I will be explaining all these concepts and one example already I have mentioned the corner
points I can consider as interest points and also the interest points should be robust to affine
transformation and also photometric variations. So, let us see the concept of the interest points
and how to detect the interest points, so I will show with the help of interest point how we can
match two images to find a correspondence between the two images, let us see the concept of the
interest point.

(Refer Slide Time: 02:56)

So, in this case, what is the definition of the interest point? A point in an image which has well-
defined position and can be robustly detected that is the definition of the interest point. And in
the interest point there is a significant change of one or more image properties simultaneously, so
I can give one example suppose, intensity and colour may change in the interest point or colour
and the texture may change in the interest point and already I have mentioned that the corner
point is one example of interest point.

But the interest points could be more generic than the corner points. In this figure you can see I have shown some interest points, and as already mentioned, at an interest point there is a significant change of one or more image properties simultaneously.

(Refer Slide Time: 03:58)

Now, if I consider the corner points as interest points, you can see I have shown all the corner points present in the image.

(Refer Slide Time: 04:12)

As already mentioned, the corner is a special case of an interest point; however, interest points could be more generic than corner points. One important property of an interest point is that it should be invariant to affine transformations like rotation, scaling and translation, and it should also be robust to photometric variations.

(Refer Slide Time: 04:43)

So, I can give these two examples of why interest points are important for image matching. Here I am considering two images, one is the left image and the other is the right image, for finding the stereo correspondence; you can see I have some interest points, and based on these interest points I can find the correspondence between the left image and the right image, that is, for stereo matching.

In the second example I am considering image stitching. I have two images of a particular scene and I have considered some interest points; based on these interest points I can find the correspondence between the two images, and based on this I can join them together for a panoramic view. So, you can see that from these two images I am making this single image. This is the example of image stitching.

(Refer Slide Time: 05:44)

So, like this we have panoramic stitching, that is, image stitching: we have two images, and how do we combine them? As already mentioned, based on the interest points I can find the correspondence between the two images, and after this I can join the two images for a panoramic view.

So, first I have to extract features, these are the interest points; after this I have to do the matching of the features; and finally I have to align the images. This is the method of image stitching: extract the features, match the features, and align the images.
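A hedged OpenCV sketch of the first two steps (detect interest points and match their descriptors) is shown below, using ORB features and a brute-force matcher as one possible choice; the alignment step would then estimate a homography from the matched points, for example with cv2.findHomography, before warping and blending. The image file names are placeholders.

```python
import cv2

# Two overlapping views of the same scene (placeholder file names).
img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# Step 1: detect interest points and compute descriptors (ORB as one option).
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Step 2: match descriptors between the two images.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print("number of matches:", len(matches))

# Step 3 (alignment) would use the matched keypoint coordinates, e.g. with
# cv2.findHomography, before warping and blending into a single panorama.
```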

(Refer Slide Time: 06:31)

Now, if I want to find the correspondence between these two images, one is the left image and the other is the right image, then corresponding to this point in the left image I am considering a feature and its feature descriptor, and corresponding to the right image I am also considering an interest point and its feature descriptor. So, first I have to extract feature descriptors from patches of the images, and after this, based on these features, I can find the correspondence between the two images; that is nothing but feature matching.

(Refer Slide Time: 07:14)

So, what are the characteristics of good features? The first one is repeatability: the same feature should be found in several images despite affine transformations and photometric variations. That is an important point, so we have to consider both the geometric (affine) transformations and the photometric variations, and the same feature should still be detected.

The next one is saliency: each feature should be distinctive. Another characteristic is compactness and efficiency: we have only a few features as compared to the number of image pixels, and by using these features we can represent the entire image.

So, instead of considering all the image pixels, I can consider only a few image features and still represent the image. Another characteristic is locality, which means the features should be robust to clutter and occlusion. The features should also be computationally efficient, so that they are suitable for real-time implementation. Finally, there is the notion of covariance, which is also very important; I will discuss what covariance means in the next slides.

(Refer Slide Time: 08:56)

So, interest point detection should be covariant. The meaning is that the features should be detected at corresponding locations despite geometric and photometric changes. In this example I consider interest points under a geometric transformation and under photometric variations.

Despite these variations, you can see that the features are detected at the corresponding locations. For instance, if you look at this portion of the figure, the same features appear here in the transformed image; that is, the features are detected at corresponding locations irrespective of the geometric and photometric variations.

(Refer Slide Time: 09:54)

Interest point descriptors, on the other hand, should be invariant: as I have already mentioned, the descriptor computed at an interest point should remain similar despite geometric and photometric transformations. So detection should be covariant, while the descriptor should be invariant.

(Refer Slide Time: 10:13)

I have already mentioned the applications of interest points. One application is image alignment: I determine the interest points and, based on them, align the images. Another example is 3D reconstruction. For motion tracking also I have to extract interest points. For indexing and database retrieval, that is, content-based image retrieval, I can again use interest points.

And finally, for object recognition I can extract image features, that is, the interest points, and based on these interest points I can recognize objects. So, these are some applications of interest points.

(Refer Slide Time: 11:12)

As already mentioned, corner points can be considered as interest points, although interest points are more generic than corner points. This figure illustrates finding corner points. How do we determine them? For this I have to compute the image gradient along the x direction and along the y direction, and based on these gradients I can determine the corner points. Corners are repeatable and distinctive. I will explain how to determine corner points in the next slides.

(Refer Slide Time: 11:53)

The first method I am going to discuss is the SUSAN corner detector. It is a method for edge and corner detection, and in this case I need not compute image derivatives; it is also insensitive to noise. The SUSAN detector is implemented using circular masks, which I can consider as windows or kernels, to get an isotropic response.

So, for the SUSAN detector I need a circular mask, considered as a window or kernel, to get an isotropic response. Isotropic response means that the response does not vary in magnitude with the direction in which it is measured. Let us now see the concept of the SUSAN corner detector.

(Refer Slide Time: 12:57)

In the SUSAN corner detector, we check whether the pixels lying within the mask have brightness similar to that of the nucleus. The figure shows the SUSAN mask; the central pixel of the mask is called the nucleus.

The set of pixels within the mask that have brightness similar to the nucleus is called the USAN, the Univalue Segment Assimilating Nucleus. So the USAN is the portion of the template whose intensity lies within a threshold of the intensity of the nucleus; that is the concept of the USAN.

Now, in the figure I consider three cases. Suppose the mask is placed at position one, which lies in a homogeneous region. In a homogeneous region the USAN area is maximum, so corresponding to point one the USAN area will be the largest.

Next, suppose I place the mask on an edge; comparing against the nucleus, you can see that for an edge pixel the USAN area is smaller than the USAN area in the homogeneous region: at an edge the USAN area is about half of the mask. And if I place the mask at a corner point, the USAN area is even smaller than at an edge. So, based on the USAN area, I can determine the corner points. In the next slide you can see how to treat this problem mathematically.

(Refer Slide Time: 15:12)

So, in flat regions, that is, in homogeneous regions, the USAN has an area similar to that of the template; that is case number one. At edges, the USAN area is about half of the template area: if I put the mask here with this nucleus, the USAN area is roughly half the template. And at corners, the USAN area is smaller than half of the template area.

I am repeating this: at corners the USAN area is smaller than half the template area. And what is SUSAN? SUSAN means the Smallest USAN. Based on this concept I can determine the corner points. Mathematically, how do we express this concept?

(Refer Slide Time: 16:11)

You can see here that I am considering a circular mask C having 37 pixels. The nucleus is the centre pixel r0. I define a comparison function u(r, r0) which is equal to 1 if the difference |I(r) - I(r0)| is less than a particular threshold, and 0 otherwise.

What is the meaning of this? It identifies the portion of the template whose intensity lies within a threshold of the nucleus intensity. After computing u(r, r0), I sum it over all pixels r of the mask; this summation gives the total area A(r0), which is nothing but the USAN area.

So I find the USAN area from this count; C denotes the 37-pixel mask. For edge detection I compare the USAN area against one geometric threshold, and for corner detection I use a smaller threshold of about half the mask area, C/2. In a flat region the USAN has an area similar to the template, as already explained, and at edges the USAN area is about half the template area; that is the concept of the USAN.

At corners the USAN area is smaller than half the template area. And, as I have already explained, SUSAN means the Smallest USAN, where USAN stands for Univalue Segment Assimilating Nucleus. This is the implementation of the SUSAN corner detector.
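
The USAN test is easy to prototype. Below is a minimal NumPy sketch of this idea (not the full optimized SUSAN of the original paper): it assumes a grayscale float image, a hard brightness threshold t (27 is a commonly quoted default for 8-bit images, used here as an assumption), and the geometric threshold g = n_max / 2 for corners.

```python
import numpy as np

def susan_corner_response(img, t=27.0):
    """Rough SUSAN-style corner response: a small USAN area suggests a corner."""
    radius = 3
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    mask = (ys ** 2 + xs ** 2) <= (radius + 0.4) ** 2   # approximately circular 37-pixel mask
    offsets = np.argwhere(mask) - radius                # (dy, dx) offsets from the nucleus
    n_max = len(offsets)                                # 37
    g = n_max / 2.0                                     # geometric threshold for corners

    H, W = img.shape
    response = np.zeros_like(img, dtype=float)
    for y in range(radius, H - radius):
        for x in range(radius, W - radius):
            nucleus = img[y, x]
            vals = img[y + offsets[:, 0], x + offsets[:, 1]]
            usan = np.sum(np.abs(vals - nucleus) <= t)  # USAN area at this position
            # Response is large when the USAN area falls below the geometric threshold.
            response[y, x] = g - usan if usan < g else 0.0
    return response
```

Corners would then be taken as local maxima of this response; the loops are kept explicit for clarity rather than speed.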

(Refer Slide Time: 18:32)

In this example I consider one binary image and find the USAN area for each image position. You can see that for a corner point the USAN area is minimum, while this larger area corresponds to the flat portion of the image. So, based on the USAN area I can determine the corner points.

(Refer Slide Time: 19:00)

After this I will discuss some other techniques for determining corner points. Here I consider one window, which I can shift in any direction, and I observe the change in intensity produced by the shift. In this image, this is the corner position. If I move the window in any direction within a flat region, there is no change in intensity in any direction.

So, I can move the window in this direction, in that direction, in any direction, and for the homogeneous, flat region there is no change in intensity. For the edge, as shown in the second figure, there is no change along the edge direction (the x direction here): if I move the mask along the edge, the intensity does not change.

But for the corner point, if I move the window in any direction, I observe a significant change in intensity in all directions. So, at a corner point there is a significant change of intensity in every direction. That is the concept of corner point detection with a shifting window.

(Refer Slide Time: 20:57)

The first corner detector I am discussing is the Moravec corner detector, which is the earliest corner detector. What is the idea? It measures the grey-value differences between a window and windows shifted in eight principal directions.

That means the window is shifted in eight directions: if this is the window and this is its central pixel, I can shift the window in all eight directions. The intensity variation for a given shift is calculated by taking the sum of squared intensity differences of corresponding pixels in the two windows.

So I determine the sum of squared intensity differences for each of the eight shifts. After this, if the minimum of these differences is greater than a particular threshold, the point is taken as an interest point, that is, a corner point.

(Refer Slide Time: 22:35)

Mathematically, you can see this concept here: I consider patch A and patch B and determine the SSD, the sum of squared differences, between them, for each of the eight possible shift directions. If I consider a patch size of 5 x 5, the SSD between patch A and patch B can be computed as shown.

(Refer Slide Time: 23:07)

So, the SSD, that is E(u, v), is computed for the 8 possible shifts, and the response is the minimum of E(u, v) over these shifts, with the window W centred at the point (xc, yc). If the response is greater than a particular threshold, the point is declared a corner.

This is the concept of the Moravec corner detector: I compute the SSD, which indicates the intensity change, for each of the 8 shift directions, take the minimum as the response R, and if R is greater than a particular threshold the point corresponds to a corner.
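
A minimal sketch of this procedure is given below, assuming a grayscale float array, a square window of half-size w, and the eight unit shifts; the response at a pixel is the minimum SSD over the shifts.

```python
import numpy as np

def moravec_response(img, w=2):
    """Moravec corner response: minimum SSD between a window and its
    copies shifted in the 8 principal directions."""
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]
    H, W = img.shape
    R = np.zeros_like(img, dtype=float)
    for y in range(w + 1, H - w - 1):
        for x in range(w + 1, W - w - 1):
            patch = img[y - w:y + w + 1, x - w:x + w + 1]
            ssds = []
            for dy, dx in shifts:
                shifted = img[y + dy - w:y + dy + w + 1,
                              x + dx - w:x + dx + w + 1]
                ssds.append(np.sum((patch - shifted) ** 2))
            R[y, x] = min(ssds)   # declare a corner where this exceeds a threshold
    return R
```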

(Refer Slide Time: 24:02)

It is a reasonable corner detector, but it produces false positives: edges may be detected as corners. That is the problem with the Moravec corner detector. To address this we can consider another corner detector, the Harris corner detector.

(Refer Slide Time: 24:22)

So, now I will discuss the concept of the Harris corner detector.

(Refer Slide Time: 24:26)

I have already defined corners: corner points are features that are more stable over changes of viewpoint. How do we detect them? The concept has already been explained: if I consider a window around a corner point, I observe a large variation of intensity when the window is shifted, and based on this I can determine the location of the corner points. In this figure I have marked some of the corner points, and based on such corner points we can do image matching.

(Refer Slide Time: 25:03)

The basic idea is this: we should be able to easily recognize the point by looking through a small window. I take a small window, with this point at its centre, and I shift the window in different directions, observing the change in intensity.

At a corner point we see a large change in intensity for every shift direction, and that is the key property: based on it we can determine the corner points.

(Refer Slide Time: 25:50)

I am explaining this concept again: in a flat region there is no change of intensity in any direction; for an edge there is no change along the edge direction (the x direction in the figure); and at a corner point there is a significant change in all directions. Based on this principle we can develop the mathematics behind the Harris corner detector.

(Refer Slide Time: 26:23)

Mathematically, how do we formulate the Harris corner detector? We observe the change of intensity for a particular shift (u, v), where u is the spatial shift in the x direction and v is the spatial shift in the y direction, and we compute the change of intensity E(u, v). We also introduce a window function w(x, y).

The window function is 1 inside the window and 0 outside; alternatively, a Gaussian function can be used as the window. What is the meaning of this expression? I am determining the change of intensity between image patches: I shift the window by (u, v) and measure the change of intensity produced by that shift.
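
Written out, the quantity just described is (all symbols as defined above, with I denoting the image):

```latex
E(u,v) \;=\; \sum_{x,y} w(x,y)\,\bigl[\,I(x+u,\,y+v) \;-\; I(x,y)\,\bigr]^{2}
```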

(Refer Slide Time: 27:37)

For nearly constant patches this change of intensity will be almost 0, and for very distinctive patches it will be large. So, for corner points E(u, v) will be large, and that is why we look for patches where E(u, v) is large: the change of intensity for a shift is large at a corner point.

(Refer Slide Time: 28:11)

For this I use the Taylor series approximation. If I expand f(x + u, y + v) as a Taylor series, I get f(x, y) plus u fx(x, y), where fx is the gradient (first derivative) along the x direction, plus v fy(x, y), where fy is the gradient along the y direction, plus the second-order partial derivative terms and the higher-order terms of the expansion.

I can approximate this by neglecting the second-order and all higher-order terms, so f(x + u, y + v) is approximately f(x, y) + u fx(x, y) + v fy(x, y). This is the first-order Taylor series approximation for a 2D function.
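
In symbols, the expansion and its first-order truncation used here are:

```latex
f(x+u,\,y+v) \;=\; f(x,y) + u\,f_x + v\,f_y
  + \tfrac{1}{2}\!\left(u^{2} f_{xx} + 2uv\,f_{xy} + v^{2} f_{yy}\right) + \cdots
\;\;\approx\;\; f(x,y) + u\,f_x(x,y) + v\,f_y(x,y)
```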

(Refer Slide Time: 29:34)

Based on this approximation, the change in intensity between the two patches, using the first-order expansion, is (I(x, y) + u Ix + v Iy - I(x, y))^2; the I(x, y) terms cancel, and what remains is u^2 Ix^2 + 2uv Ix Iy + v^2 Iy^2.

I can write this expression as a matrix equation: the row vector [u v], times the matrix with entries Ix^2, Ix Iy on the first row and Ix Iy, Iy^2 on the second row, times the column vector [u v]. So this is the expression I get for the change in intensity.

(Refer Slide Time: 30:48)

By considering that approximation, I can write E(u, v) as approximately [u v] M [u v] transposed, where M is the matrix I defined on the last slide. The matrix M has entries Ix^2, Ix Iy on the first row and Ix Iy, Iy^2 on the second row, where Ix is the gradient along the x direction and Iy is the gradient along the y direction.

So the matrix M is computed from the image gradients; the window function w also appears in the sum, and in the simplest case w = 1.

I then calculate E(u, v): for nearly constant patches E(u, v) is approximately 0, and for distinctive patches E(u, v) is large. That is the meaning of E(u, v), and the key quantity to determine is the matrix M, which is computed from the image gradients.
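
Collecting the terms, the quadratic form and the matrix M (built from the gradients Ix, Iy, weighted by the window w) are:

```latex
E(u,v) \;\approx\;
\begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix},
\qquad
M \;=\; \sum_{x,y} w(x,y)
\begin{bmatrix} I_x^{2} & I_x I_y \\ I_x I_y & I_y^{2} \end{bmatrix}
```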

(Refer Slide Time: 32:22)

This concept has already been explained: I have to compute the image gradients along the x direction and along the y direction, and then I consider the scatter of these gradient values and fit an ellipse to it. What this scatter matrix is and how to fit the ellipse, I explain in the next slides.

(Refer Slide Time: 32:48)

The concept is like this: I consider three image patches, the first containing a linear edge, the second a flat (almost homogeneous) region, and the third a corner point. For the linear edge you can see its x derivative and its y derivative.

For the flat region I can likewise determine the x derivative and the y derivative, and for the corner point I also show the x derivative (first) and the y derivative (second). Based on these derivatives I have to determine the corner points.

(Refer Slide Time: 33:39)

For the flat region, the gradients along the x direction and the y direction are both very small, so the scatter of gradient values is centred around the origin. For the linear edge (a vertical edge in this example), the gradient along the x direction is large while the gradient along the y direction is small.

For the corner point, both gradients, along the x direction and along the y direction, are large.

So, based on this gradient information I can determine the corner points: for corner points both gradients are large, for the flat region both gradients are small, and for the linear edge considered here, the vertical edge, the x gradient is large and the y gradient is small.

(Refer Slide Time: 34:52)

Now, to each set of gradient points I can fit an ellipse; in the first case it is essentially a circle. I fit an ellipse to the scatter of points formed by the gradients along the x and y directions. For the flat region the major axis and the minor axis of the fitted ellipse are almost the same, and both are small, because the gradients along x and y are small; therefore lambda 1 and lambda 2 are both small.

What are lambda 1 and lambda 2? They are the eigenvalues, which I will explain shortly. In the scatter plot for the flat region the x and y gradients are small, and the fitted ellipse is correspondingly small. For the corner point, lambda 1 and lambda 2 are both large. For the linear edge, the fitted ellipse has lambda 1 large, that is, a long major axis, while the minor axis is small.

In terms of these eigenvalues, the major axis corresponds to the eigenvalue lambda 1 and the minor axis corresponds to the eigenvalue lambda 2. For the flat region, lambda 1 and lambda 2 are both small; for the linear (vertical) edge, lambda 1, corresponding to the major axis, is large while lambda 2, corresponding to the minor axis, is small; and for the corner point both lambda 1 and lambda 2 are large.

(Refer Slide Time: 36:59)

Based on this concept I can classify image points using the eigenvalues. If I consider the fitted ellipse, its major axis corresponds to the eigenvalue lambda 1 and its minor axis to the eigenvalue lambda 2. For the flat region, lambda 1 and lambda 2 are small and E is almost constant in all directions.

For a vertical edge, lambda 1 is greater than lambda 2 (here lambda 1 is plotted along the x axis and lambda 2 along the y axis), and for a horizontal edge lambda 2 is greater than lambda 1.

For a corner point, both lambda 1 and lambda 2 are large, which means E increases in all directions. So, based on the eigenvalues I can identify vertical edges, horizontal edges and flat regions, but mainly I can determine the corner points.

(Refer Slide Time: 38:17)

For this I can define a corner response measure. The corner response is R = det(M) - k (trace M)^2, where M is the matrix computed from the image gradients and k is a predefined constant.

The determinant of M is lambda 1 times lambda 2, and the trace of M is lambda 1 plus lambda 2. The value of k is determined empirically; it typically lies between 0.04 and 0.06. Based on this corner response measure R, I can detect the corner points.
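
The whole pipeline fits in a few lines; here is a compact NumPy sketch assuming a grayscale float image, simple finite-difference gradients, a box window for the local sums, and k = 0.05 (OpenCV users can obtain a very similar map with cv2.cornerHarris).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def harris_response(img, k=0.05, win=5):
    """Harris corner response R = det(M) - k * trace(M)^2 at every pixel."""
    Iy, Ix = np.gradient(img.astype(float))     # image gradients (central differences)
    # Entries of M, summed (here: averaged) over a local window.
    Ixx = uniform_filter(Ix * Ix, size=win)
    Iyy = uniform_filter(Iy * Iy, size=win)
    Ixy = uniform_filter(Ix * Iy, size=win)
    det_M = Ixx * Iyy - Ixy ** 2
    trace_M = Ixx + Iyy
    return det_M - k * trace_M ** 2

# Usage sketch: corners = harris_response(img) > threshold,
# usually followed by non-maximum suppression.
```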

(Refer Slide Time: 39:16)

Here you can see the corner response map: lambda 1 is plotted along this axis and lambda 2 along that axis, and the response computed from the determinant and the trace of M is shown for each case (values such as 0 and 28 are indicated), giving the corner response map.

(Refer Slide Time: 39:41)

The response R depends on the eigenvalues of M, because the determinant of M is lambda 1 times lambda 2 and the trace of M is lambda 1 plus lambda 2. R is large and positive for a corner: the values R = 142 or R = 104 shown here correspond to corner points.

R is negative with large magnitude for an edge, as you can see for the edge cases, and R is small in magnitude for a flat, homogeneous region. So, based on this corner response I can determine the locations of the corner points.

(Refer Slide Time: 40:41)

In this example I show the input image, apply a threshold of, say, 10,000 to the response, and based on this I determine the corner points with the help of the Harris corner detector.

(Refer Slide Time: 40:55)

Here I consider two images related by an affine transformation and by photometric variations. In this case also I have to determine the corner points, because the detection should not be affected by the affine transformation and the photometric variations.

(Refer Slide Time: 41:14)

Here, based on the corner response R, I determine the corner points for both images.

(Refer Slide Time: 41:23)

Also, if I keep only the points whose response is greater than a particular threshold, I obtain the corresponding corner points.

(Refer Slide Time: 41:32)

And if I take only the points that are local maxima of R, I obtain the final corner points for both images. This is how the corner points are determined.

(Refer Slide Time: 41:47)

To summarize the Harris corner detector: the key step is to determine the matrix M, which is computed from the image gradients. From this matrix I compute the corner response R = lambda 1 lambda 2 - k (lambda 1 + lambda 2)^2, where lambda 1 and lambda 2 are the eigenvalues of M.

Pictorially, lambda 1 is the eigenvalue corresponding to the major axis of the fitted ellipse and lambda 2 the eigenvalue corresponding to the minor axis. From the matrix M I determine the eigenvalues lambda 1 and lambda 2, and based on the corner response I determine the corner points.

(Refer Slide Time: 42:53)

Some properties of the Harris detector should be explained. The first property is rotation invariance: if a corner is rotated, the fitted ellipse rotates but its shape remains the same, which means the eigenvalues remain the same.

Therefore the corner response R is invariant to image rotation for the Harris corner detector. In this example the corner point is rotated, but the eigenvalues remain the same in both cases, and that is why R is invariant to image rotation.

(Refer Slide Time: 43:39)

I have already explained the meaning of covariance and invariance. What is invariance? If the image is transformed and the corner detection decision does not change, that is invariance. Covariance means that if we have two transformed versions of the same image, that is, if an affine transformation is applied to the image, the features should be detected at the corresponding locations.

So the features should appear at the corresponding locations: here I determine the corner or interest points under an affine transformation, which may include rotation, translation and scaling, and also under photometric variations. This concept is quite important.

(Refer Slide Time: 44:30)

Here I show the case of an affine intensity change between two images. If I is the intensity of the original image, the transformed intensity is a I + b, that is, an intensity scaling by a and an offset b.

For the first image, with intensity I, I apply the gradient operation and look for the maxima of the gradient response; you can see the maximum obtained here.

In the second case the intensity is scaled, and again I apply the gradient operation. I can still locate the maximum, but you can see that one false positive also appears. That is why we say the detector is only partially invariant to affine intensity changes.

(Refer Slide Time: 46:00)

Now consider the translation operation: I take a corner point and translate the image. The detector uses derivatives and a window function, and both the gradients and the window function are shift invariant; therefore the corner location is covariant with respect to translation.

(Refer Slide Time: 46:37)

For rotation, the concept has already been explained: when the corner is rotated from the first image to the second, the ellipse rotates but its shape, and hence the eigenvalues, remain the same. Therefore the corner location is covariant with respect to rotation. We have discussed translation and rotation; let us now look at scaling.

(Refer Slide Time: 47:05)

I consider one corner point and scale the image. If I now move the window function over the scaled corner and measure the change in intensity, then, following the Harris corner detector criterion, all the points along the enlarged corner will be classified as edges.

That means the corner location is not covariant to scaling: after scaling, the window sees only locally edge-like structure, so the Harris detector no longer reports a corner there.

(Refer Slide Time: 47:59)

Finally, I want to discuss another detector, the Hessian detector. The concept is very similar: it searches for image locations which have strong changes of gradient along both orthogonal directions. For this I also have to determine image derivatives, and from them the Hessian matrix of the image I(x, y).

First I determine Ixx, the second partial derivative in the x direction; Iyy, the second partial derivative in the y direction; and Ixy, the mixed second partial derivative in the x and y directions.

From these I form the Hessian matrix and compute its determinant, which is Ixx Iyy - Ixy^2. If the determinant is greater than a particular threshold, a corner point is detected. This is the main concept of the Hessian corner detector.
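
A minimal sketch of this determinant-of-Hessian test, assuming a grayscale float image and second derivatives obtained by repeated finite differencing:

```python
import numpy as np

def hessian_response(img):
    """Determinant of the Hessian at each pixel: Ixx*Iyy - Ixy^2."""
    img = img.astype(float)
    Iy, Ix = np.gradient(img)       # first derivatives
    Ixy, Ixx = np.gradient(Ix)      # d(Ix)/dy and d(Ix)/dx
    Iyy, _ = np.gradient(Iy)        # d(Iy)/dy
    return Ixx * Iyy - Ixy ** 2

# Candidate corners: pixels where the response exceeds a threshold,
# kept only if they are the maximum of their 3x3 neighbourhood.
```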

(Refer Slide Time: 49:34)

So, the detector searches for image locations that have strong changes of gradient along both orthogonal directions. We also have to perform non-maximum suppression using a 3 x 3 window, that is, keep only points whose value is higher than that of their 8 neighbours; based on this we can eliminate false corner points. This is the Hessian detector.

(Refer Slide Time: 50:08)

You can see that the corner points can be determined with this algorithm, the Hessian detector; all the corner points are shown. So, in this class I discussed the concept of interest points: interest points should be robust to affine transformations and photometric variations, and with interest points I can do image matching. I gave some examples, one being stereo correspondence, where I can find the correspondence between the left image and the right image.

I also gave another application, image stitching, for which I again have to do image matching with the help of interest points. And I discussed one important corner detector, the Harris corner detector. So, let me stop here today. Thank you.

Computer Vision and Image Processing Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture 29
Image Feature – HOG and SIFT
Welcome to the NPTEL MOOCs course on Computer Vision and Image Processing: Fundamentals and Applications. In my last class I discussed the concept of interest point detection; today I am going to discuss two important image feature descriptors, one is HOG, the histogram of oriented gradients, and the other is SIFT, the scale invariant feature transform. In HOG we determine the gradient of an image, and the orientations of the gradients carry information about the image that we can use as an image feature.

In SIFT the objective is to extract image features which are invariant to scale and rotation; the features should also be robust to viewpoint changes. These are the objectives of SIFT features. So, these two concepts, HOG and SIFT, are what I am going to discuss in this class. What is HOG? Let us see.

(Refer Slide Time: 01:43)

So, the histogram of oriented gradients.

(Refer Slide Time: 01:46)

HOG is a feature descriptor used in computer vision applications, one of which is object detection. The idea is to count the occurrences of gradient orientations in localized portions of an image: the local appearance and shape of an object within an image can be described by the distribution of intensity gradients or edge directions.

So the main concept is that I determine the gradient orientations in localized portions of the image and count their occurrences. In other words, we look at the distribution of intensity gradients or edge directions and use it to represent the image.

(Refer Slide Time: 02:42)

Gradient-based feature descriptors were developed for people detection in the paper by Dalal and Triggs. It is a global descriptor for the complete human body, and it is very high-dimensional, typically around 4000 dimensions. It gave very promising results on challenging data sets. Here I show some results of people detection obtained with this gradient-based descriptor.

(Refer Slide Time: 03:20)

The procedure is as follows. First, create a normalized training image data set, that is, a set of cropped images containing pedestrians in normal environments. Then encode the images into a feature space; here a global descriptor is used rather than local features. After this, learn a binary classifier, for example a support vector machine, and use it to classify windows as object or non-object, which is essentially object detection. This is the overall procedure.

(Refer Slide Time: 04:08)

In HOG, the procedure is as follows. We can consider a simple derivative mask, something like the Sobel mask or any other edge-detection mask, and with this mask we determine the gradient magnitude: we compute the gradient along the x direction and the gradient along the y direction, and from these the gradient magnitude.

For this image, you can see the gradient magnitude I have determined. I can also determine the orientation, that is, the direction of the edge normal, as theta = tan inverse (Gy / Gx). This is the first step.
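
This first step can be sketched in NumPy as follows (assuming a grayscale float image; cv2.Sobel could equally well replace the hand-written differences):

```python
import numpy as np

def gradient_magnitude_orientation(img):
    """Horizontal/vertical gradients, their magnitude, and the unsigned
    orientation in [0, 180) degrees used for HOG binning."""
    img = img.astype(float)
    gy, gx = np.gradient(img)                      # vertical and horizontal gradients
    magnitude = np.hypot(gx, gy)                   # sqrt(gx^2 + gy^2)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    return magnitude, orientation
```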

(Refer Slide Time: 04:59)

The next step is orientation binning. For a 64 x 128 image, divide the image into 16 x 16 blocks with 50 percent overlap; this gives 105 blocks in total, and each block consists of 2 x 2 cells of size 8 x 8. Then quantize the gradient orientations into 9 bins, that is, 9 orientation ranges.

The vote is the gradient magnitude, and the votes are interpolated bilinearly between the neighbouring bin centres. This is the concept of orientation binning, which I will explain again in the next slides.

(Refer Slide Time: 06:11)

Here I show the blocks and the cells for this input image, with 50 percent overlap; you can see that blocks 1 and 2 overlap by 50 percent, so the image yields 105 blocks in total. Each block consists of 2 x 2 cells of size 8 x 8, as shown.

(Refer Slide Time: 06:44)

So, I have 9 gradient orientation bins covering unsigned gradient orientations from 0 to 180 degrees, and each bin is voted with the gradient magnitude. In the figure you can see the cells of the input image and the corresponding histograms of oriented gradients; each bin is voted with the gradient magnitudes.

(Refer Slide Time: 07:19)

Here, for this image, I show the blocks and their histograms of oriented gradients. These are the orientations I consider, and for each orientation I vote: the value shown for a given orientation is the accumulated gradient magnitude, which becomes the height of that histogram bin. So, the vote is the gradient magnitude.

(Refer Slide Time: 07:53)

That means I consider the gradient orientations from 0 to 180 degrees and divide them into 9 bins: the boundaries at 20, 40, 60, 80, 100, 120, 140, 160 and 180 degrees split the range into 9 bins of 20 degrees each.

Each block has 2 x 2 cells of size 8 x 8, and the gradient orientation is quantized into these 9 bins over 0 to 180 degrees. The vote is the gradient magnitude, and the votes are interpolated linearly between the neighbouring bin centres. For example, for theta = 75 degrees, the distance to the bin centre at 70 degrees is 5 degrees and the distance to the bin centre at 90 degrees is 15 degrees, and the corresponding ratios are 5/20 = 1/4 and 15/20 = 3/4.

The vote is split in inverse proportion to these distances: the nearer centre at 70 degrees receives 3/4 of the magnitude and the farther centre at 90 degrees receives 1/4. That is what is meant by interpolating the votes linearly between the neighbouring bin centres of 75 degrees, namely 70 and 90. In figure (a) you can see the cell histograms.

In figure (b) I show the orientation binning: there are 9 bins, which you can count in the histogram as 1 through 9. This is how the orientation binning and the histogram interpolation are done.
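
The vote splitting for one pixel can be sketched as below, assuming bin centres at 10, 30, ..., 170 degrees (consistent with the 70/90 degree example above); the cell histogram is then just the accumulation of these weighted votes over an 8 x 8 cell.

```python
import numpy as np

N_BINS, BIN_WIDTH = 9, 20.0                             # unsigned orientations, 0-180 degrees
BIN_CENTRES = 10.0 + BIN_WIDTH * np.arange(N_BINS)      # 10, 30, ..., 170

def vote(theta, magnitude):
    """Split one gradient vote linearly between the two nearest bin centres.
    Example: theta=75 -> 3/4 of the magnitude to the 70-degree bin, 1/4 to 90."""
    hist = np.zeros(N_BINS)
    lo = int(np.floor((theta - 10.0) / BIN_WIDTH)) % N_BINS   # bin centre just below theta
    hi = (lo + 1) % N_BINS                                    # next bin centre (wraps at 180)
    d = (theta - BIN_CENTRES[lo]) % 180.0 / BIN_WIDTH         # fractional distance to lower centre
    hist[lo] += magnitude * (1.0 - d)
    hist[hi] += magnitude * d
    return hist
```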

(Refer Slide Time: 10:38)

After this we determine the feature vector by concatenating the descriptor blocks: the cell histograms are concatenated to form a feature vector. The histograms obtained from the overlapping blocks of 2 x 2 cells are concatenated into a 1-D vector; with 105 blocks, 2 x 2 cells per block and 9 bins per cell, the dimension of the feature vector is 105 x 4 x 9 = 3780.

(Refer Slide Time: 11:13)

After this we do block normalization. In the paper by Dalal and Triggs, different methods for block normalization were explored; we may consider the L2 norm, the L1 norm or the L1-sqrt norm. Let v be the non-normalized vector containing all the histograms of a given block; then we normalize it using one of these schemes.

In addition, the scheme called L2-Hys can be computed: first take the L2 norm, then clip the result, and then renormalize. This gives the L2-Hys normalization. That is block normalization.
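
A short sketch of the L2-Hys scheme just described, assuming a block vector v and the clipping value 0.2 reported by Dalal and Triggs:

```python
import numpy as np

def l2_hys(v, clip=0.2, eps=1e-6):
    """L2-Hys block normalization: L2-normalize, clip, then renormalize."""
    v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)     # L2 normalization
    v = np.minimum(v, clip)                        # clip large values
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)  # renormalize
```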

(Refer Slide Time: 12:14)

What is the importance of block normalization? Block normalization makes the descriptor invariant to illumination and photometric variations, and it improves performance. The descriptors should be invariant to illumination and photometric variations, and that is why we perform block normalization.

After this, the gradient magnitudes are weighted by a Gaussian spatial window: we place a Gaussian window over the block and weight the gradient magnitudes by it, so that distant gradients contribute less to the histogram. The idea is that gradients near the boundary of the block should be down-weighted.

That is why distant gradients contribute less to the histogram; this is the purpose of the weighted gradients. In the figure I show the Gaussian spatial window by which the gradient magnitudes are weighted.

(Refer Slide Time: 13:32)

Finally, I obtain the HOG feature as the concatenation of the block histograms; this is the resulting feature vector. For visualization, you can see the histogram of oriented gradients corresponding to the first input image, to the second image, and to the third image.

(Refer Slide Time: 13:58)

Similarly, here the input is the cameraman image and the second image shows its HOG visualization. So, using HOG I can represent the input image.

(Refer Slide Time: 14:11)

For HOG there were many engineering efforts; this is often called feature engineering. The parameters that had to be tested include the size of the cells, the size of the blocks, the number of cells in a block and the amount of overlap. The normalization schemes, L1 norm, L2 norm and so on, as well as gamma correction and pixel intensity normalization, were also tested.

An extensive evaluation of these different choices was performed when the descriptor was proposed: many tests were done to assess the performance of the HOG descriptor, and based on them the parameters, such as the cell size, block size and number of cells per block, were fixed. So, it is not only the idea but also the engineering effort.

(Refer Slide Time: 15:12)

The training set consisted of more than 2000 positive images and 2000 negative training images; all the images were aligned and resized, and a wide variety of backgrounds was considered. You can see some of the positive and negative training images with different types of backgrounds.

(Refer Slide Time: 15:41)

After this, a support vector machine was trained on top of the HOG features. Slightly better results can be achieved with a support vector machine with a Gaussian kernel, but it increases the computational complexity. This is the model learning stage.
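
As a rough sketch of that training stage (not the authors' exact setup), a linear SVM from scikit-learn could be fit on the HOG vectors; the arrays below are stand-in data, with rows playing the role of 3780-dimensional descriptors and labels 1 for person, 0 for background.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3780))        # placeholder HOG descriptors
y = rng.integers(0, 2, size=200)        # placeholder labels

clf = LinearSVC(C=0.01, dual=False)     # linear SVM on top of the HOG features
clf.fit(X, y)
scores = clf.decision_function(X)       # threshold these scores to declare detections
```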

(Refer Slide Time: 15:58)

Some results on the INRIA database are shown here: these output images show the people detections obtained for the corresponding inputs.

(Refer Slide Time: 16:12)

What are the important steps of the HOG descriptor? To summarize the procedure of HOG feature extraction: first, compute the centred horizontal and vertical gradients with no smoothing, and from them the gradient magnitudes and orientations. For a colour image, pick, at each pixel, the colour channel with the highest gradient magnitude.

For a 64 x 128 image, divide the image into 16 x 16 blocks with 50 percent overlap, which gives 105 blocks in total, each block consisting of 2 x 2 cells of size 8 x 8. Then quantize the gradient orientation into 9 bins, as already explained; the vote is the gradient magnitude, and the votes are interpolated linearly between the neighbouring bin centres, a concept I have also explained.

The votes can additionally be weighted with a Gaussian to down-weight the pixels near the edges or boundaries of the block. After concatenating the histograms, the dimension of the feature vector is 3780 for the 64 x 128 image, and this gives the HOG features. This is the fundamental concept of the histogram of oriented gradients.
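
In practice this whole pipeline is available off the shelf; for instance, scikit-image exposes it as skimage.feature.hog. The parameter names below follow that library, and the image used is just a bundled sample cropped to a 128 x 64 window for illustration.

```python
from skimage.feature import hog
from skimage import data, color

img = color.rgb2gray(data.astronaut())[:128, :64]   # a 128x64 grayscale crop
features = hog(img,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm='L2-Hys')
print(features.shape)        # (3780,) for a 64x128 detection window
```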

(Refer Slide Time: 18:05)

Now I will discuss SIFT, the scale invariant feature transform. This concept was developed by David Lowe of the University of British Columbia.

(Refer Slide Time: 18:16)

You know that the Harris operator is not invariant to scale, so one objective of SIFT, the scale invariant feature transform, is to develop an interest point operator that is invariant to scale and rotation; the features should be invariant to scale and rotation. Another objective is to create a descriptor that is robust to the variations of typical viewing conditions.

(Refer Slide Time: 18:48)

Here I show one image and consider image features which are invariant to translation, rotation, scale and other imaging parameters. From this input image I extract these six features, and they should be invariant to translation, rotation, scale and the other imaging parameters. So, for this image I consider some features, or key points, that are invariant to these changes; these can be considered as SIFT features or descriptors.

(Refer Slide Time: 19:29)

SIFT provides features characterizing a salient point that remain invariant to changes in scale or rotation; that is the main objective of SIFT. Here you see that, for these input images, the extracted SIFT features are invariant to changes in scale and rotation.

(Refer Slide Time: 20:01)

Now I will highlight the steps of the SIFT algorithm. The first step is to determine the approximate location and scale of the salient feature points, which are called key points. The second step is to refine the location and scale of these key points.

The third step is to determine an orientation for each of the key points. Finally, the fourth step is to compute a descriptor for each of the key points. These are the main steps of the SIFT algorithm.

(Refer Slide Time: 20:50)

The overall procedure at a high level is as follows. First, scale-space extrema detection: search over multiple scales and different image locations to find candidate key points. The next step is key point localization: fit a model to determine the location and scale of each key point, and select key points based on a measure of stability.

The third step is orientation assignment: compute the best orientation for each key point region. Finally, step 4 is the key point description: use the local image gradients at the selected scale and rotation to describe each key point region.

So, to repeat, the steps of the SIFT algorithm are: search over multiple scales and image locations, find and localize the key points, assign the best orientation to each key point region, and finally compute the descriptor for each key point.
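
These four stages are implemented end to end in common libraries; for example, OpenCV (version 4.4 and later) exposes SIFT, and a minimal detect-and-describe call looks like the sketch below (the image path is a placeholder).

```python
import cv2

img = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder path
sift = cv2.SIFT_create()
# Keypoints carry location, scale and orientation; descriptors are 128-D vectors.
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)               # e.g. N and (N, 128)
```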

(Refer Slide Time: 22:36)

The block diagram shows the SIFT algorithm. First I construct the scale space; then I compute the difference of Gaussian (DoG) and locate the DoG extrema, that is, the maxima and minima in the DoG images. Based on these extrema I find the candidate key points.

So I locate the potential key points. After this, we filter out the low-contrast responses, which may not be reliable key points, and we also reject edge pixels that have been detected as key points, because edge pixels are not good key points even though some of them respond strongly.

Having eliminated these, we assign orientations to the key points and then compute the descriptors for the key points, which are nothing but the SIFT features. This is the block diagram of the SIFT algorithm.

(Refer Slide Time: 24:11)

So, first I will discuss about image pyramids. So, here I have shown one image pyramid, so here
you can see the bottom level is the original image, so the first one is the original image, second
level is derived from the original image according to some functions, so I am getting the second
level that is obtained from the first image, the original image. And third level is obtained from
the second level according to some functions and like this I will be getting all these levels and
that corresponds to the image pyramid.

(Refer Slide Time: 24:49)

Now, let us discuss about mean pyramid. So, here you can see the bottom level is the original
image, at second level each pixel is the mean of 4 pixels in the original image. So, that means I
am considering 4 pixels like this and this pixel is nothing but the mean of 4 pixels of the original
image. Similarly, at third level each pixel is the mean of 4 pixels in the second level and like this
we have to go on and we will be getting the mean pyramid like this. So, this is one example of
the image pyramid. In the next slide also I am showing another pyramid that is Gaussian
pyramid.

(Refer Slide Time: 25:31)

So, that is the Gaussian pyramid. So, bottom level is the original image, after this we can apply a
Gaussian filter, so if I apply Gaussian filter the image will be smooth, the image will be blurred
and after this at a second level each pixel is the result of applying a Gaussian mask to the first
level and then resampling to reduce the size. So, that means this image is resampled and I will be
getting the second level and like this I will be getting all the levels of the Gaussian pyramid.

(Refer Slide Time: 26:08)

In this example I am showing subsampling with Gaussian pre-filtering. The image is first convolved
with a Gaussian, and the Gaussian-smoothed image is then downsampled by a factor of 2 in each
direction, producing an image of one-fourth the size; like this I have to keep doing the
downsampling. So, this is the concept of subsampling with Gaussian pre-filtering.
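
A small sketch of this pyramid construction with OpenCV (the file name is a placeholder;
cv2.pyrDown performs exactly the Gaussian pre-filtering followed by subsampling described above):

```python
# Build a Gaussian pyramid: blur with a Gaussian, then downsample by 2 at every level.
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

pyramid = [img]
for level in range(4):
    # cv2.pyrDown applies Gaussian pre-filtering and then subsamples by a factor of 2,
    # so each new level has one-fourth the number of pixels of the previous one.
    pyramid.append(cv2.pyrDown(pyramid[-1]))

for level, im in enumerate(pyramid):
    print("level", level, "size", im.shape)
```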

(Refer Slide Time: 26:39)

With the difference of Gaussian we can construct the scale space. In one of my classes I discussed
the Laplacian of Gaussian, and here you see this is nothing but the Laplacian of Gaussian that we
can determine. So, what is the Laplacian of Gaussian? It is the Laplacian applied to a Gaussian
function G(x, y), which is convolved with the image f(x, y). So, this is the Laplacian of Gaussian,
the LoG, and it is convolved with f(x, y), where f(x, y) is the image.

So, for calculating the Laplacian of Gaussian, we can determine the second-order derivative of the
Gaussian function with respect to x, which is equal to ((x^2 − σ^2)/σ^4) exp(−(x^2 + y^2)/(2σ^2)).
Similarly, we can determine the second-order derivative of the Gaussian function G_σ(x, y) with
respect to y, which is equal to ((y^2 − σ^2)/σ^4) exp(−(x^2 + y^2)/(2σ^2)). And finally, from these
two second-order derivatives I can determine the LoG.

So, the Laplacian of Gaussian is equal to ((x^2 + y^2 − 2σ^2)/σ^4) exp(−(x^2 + y^2)/(2σ^2)). So, I
can determine the Laplacian of Gaussian, and here you can see it can be approximated by the
difference between two Gaussians. In this case the parameter k controls the scale, because σ of the
Gaussian function corresponds to the scale: in one Gaussian the scale is kσ and in the other it is
σ. So, based on k I can change the scale.

So, corresponding to this Laplacian of Gaussian I will be getting this expression, and here you can
see the Gaussian function is convolved with the image; if I do the convolution the image will be
smoothed, and corresponding to this I can determine the difference of Gaussian, that is
D(x, y, σ) = L(x, y, kσ) − L(x, y, σ). So, σ controls the scale.

So, this is the concept of the difference of Gaussian. With the difference of Gaussian I can
construct the scale space, because in one term the scale is σ and in the other it is kσ; if I put
k = 2 I will be getting one scale, if I put k = 3 I will be getting a different scale, and like
this I will be getting a number of scales. So, I can determine the difference of Gaussian.
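
Written out, the derivatives and the DoG relation described above take the following form (the
Gaussian normalization constant is dropped, as in the spoken derivation; the final approximation is
the one given in Lowe's SIFT paper):

```latex
\frac{\partial^2 G_\sigma}{\partial x^2}
  = \frac{x^2-\sigma^2}{\sigma^4}\, e^{-\frac{x^2+y^2}{2\sigma^2}},
\qquad
\frac{\partial^2 G_\sigma}{\partial y^2}
  = \frac{y^2-\sigma^2}{\sigma^4}\, e^{-\frac{x^2+y^2}{2\sigma^2}},
\qquad
\nabla^2 G_\sigma
  = \frac{x^2+y^2-2\sigma^2}{\sigma^4}\, e^{-\frac{x^2+y^2}{2\sigma^2}}

D(x,y,\sigma) = \big(G(x,y,k\sigma)-G(x,y,\sigma)\big)*f(x,y)
              = L(x,y,k\sigma)-L(x,y,\sigma)
              \approx (k-1)\,\sigma^{2}\,\nabla^{2}G_\sigma * f(x,y)
```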

(Refer Slide Time: 31:11)

So, the first step of the SIFT algorithm is the approximate key point location: we look for
intensity changes using the difference of Gaussian at two nearby scales. Here I am showing the
two-dimensional Gaussian function; after this the image I(x, y) is convolved with the Gaussian, and
because of this the image will be smoothed, and after this we can determine the difference of
Gaussian. So, the difference of Gaussian will be something like this, and the scale corresponds to
the σ of the Gaussian.

(Refer Slide Time: 31:57)

Here I am showing the pyramid scheme corresponding to the SIFT algorithm. The scale space is
separated into octaves; here you can see the first octave and the second octave, where the first
octave uses scale σ and the second octave uses scale 2σ. In each octave the initial image, that is
the input image, is repeatedly convolved with Gaussians to produce a set of scale space images.

So, corresponding to the first octave I am doing the convolution of the image with Gaussians to
produce a set of scale space images. After this, adjacent Gaussian images are subtracted to produce
the DoG; here you can see I am considering these two adjacent Gaussian images, and from them I can
determine the difference of Gaussian. Similarly, if I consider these two, corresponding to them I
can determine another difference of Gaussian.

After this, what do I do? The Gaussian image is downsampled by a factor of 2 to produce an image of
one-fourth the size, to start the next level. That means I will be getting the next octave; for
this I have to do the downsampling of the image. So, I am doing the downsampling and I will be
getting the second octave, the next octave, and corresponding to the next octave also I have to
determine the difference of Gaussian.

(Refer Slide Time: 33:40)

The same thing I am showing here also: you can see I am showing two octaves, the first octave, and
after downsampling I will be getting the next octave, and here you can see I am finding the
difference of Gaussian D(x, y, σ). That means the image is repeatedly convolved with Gaussians to
produce a set of scale space images.

So, I will be getting all these images, a set of scale space images. After this, adjacent Gaussian
images are subtracted to produce the difference of Gaussian, so I will be getting the difference of
Gaussian images. After this the Gaussian image is downsampled by a factor of 2, to produce an image
of one-fourth the size of the original image, and I will be getting the next octave. So, that is
the concept of the scale space.

(Refer Slide Time: 34:51)

So, again I am showing the pyramid; you can see two octaves, the first octave and the second
octave, and how to compute the difference of Gaussian you can also see here. In the first octave I
have s + 3 images including the original; suppose s = 2, that means I will be getting 5 images in
the first octave.

And how many difference of Gaussian images will I be getting? s + 2, that means 2 + 2 = 4, so I
will be getting 4 difference of Gaussian images. So, this parameter s determines the number of
images per octave. And here you can see how the images are obtained: the original image is
repeatedly convolved with Gaussians to produce a set of scale space images.

That concept I have already explained. So, corresponding to this parameter s you can see the number
of images in the octave: in the first octave, if s = 2, then I will be getting 5 scale space images
and corresponding to this I will be getting 4 difference of Gaussian images. After this I have to
do the downsampling; after the downsampling I will be getting the second octave, and similarly I
have to determine the difference of Gaussian there. So, this is the pyramid scheme.
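
As a rough sketch of this construction (not the lecture's exact code; sigma0 = 1.6 and the
downsampling scheme follow Lowe's paper, and OpenCV plus NumPy are assumed):

```python
# Build the Gaussian scale space and the DoG images, octave by octave: s + 3 Gaussian
# images per octave, s + 2 DoG images per octave, then downsample by 2 for the next octave.
import cv2
import numpy as np

def build_dog_pyramid(img, num_octaves=4, s=2, sigma0=1.6):
    k = 2.0 ** (1.0 / s)
    gaussians, dogs = [], []
    base = img.astype(np.float32)
    for _ in range(num_octaves):
        octave = [cv2.GaussianBlur(base, (0, 0), sigma0 * k ** i) for i in range(s + 3)]
        gaussians.append(octave)
        # difference of adjacent Gaussian-blurred images -> DoG
        dogs.append([octave[i + 1] - octave[i] for i in range(s + 2)])
        # downsample by a factor of 2 to start the next octave
        base = cv2.resize(octave[s], (base.shape[1] // 2, base.shape[0] // 2),
                          interpolation=cv2.INTER_NEAREST)
    return gaussians, dogs
```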

(Refer Slide Time: 36:31)

So, here I am showing the images corresponding to scale is equal to 0, scale is equal to 1, scale is
equal to 4, scale is equal to 16, 64, scale 256. So, that means I am observing an image at different
scales, that is the concept of the scale space representation and from this I want to determine the
key points.

(Refer Slide Time: 36:54)

After this, the next step is key point localization. I am getting the difference of Gaussian
images, and we have to detect the maxima and minima of the difference of Gaussian in scale space.
Here I have shown the scale space and also the difference of Gaussian images; these are the DoG
images, the difference of Gaussian images.

And for each maximum or minimum we have to determine the location and the scale of the key points.
So, which one will be a key point? For this we have to do the comparison. Each point, suppose this
point, is compared to its 8 neighbouring pixels in the current image; here you can see the 8
neighbours, 1, 2, 3, 4, 5, 6, 7, 8. It is also compared to the 9 neighbours in the scale above and
the 9 neighbours in the scale below.

So, if I consider the 9 neighbours below and the 9 neighbours above, how many comparisons are
there? There will be 26 comparisons: 8 comparisons with the 8 neighbours in the current image, and
after this 18 comparisons with the images above and below the current image. That means I am doing
26 comparisons, and in the next slide you can see these 26 comparisons.

(Refer Slide Time: 38:36)

So, here I am showing the difference of Gaussian images, the DoG images. We look at all the
neighbouring points across the scales and we have to identify the minimum and the maximum. That
means, corresponding to this point, each point is compared to its 8 neighbours in the current image
and to the 9 neighbours in the scales above and below; that means I have to do the 26 comparisons.

So, there are comparisons with all these points, and based on this I can determine the key points;
that is nothing but the maximum or minimum points I have to determine from the difference of
Gaussian images.

So, corresponding to this input image you can see I am determining the locations of the key points.
For a particular scale I can get the key points; that means I will be getting both the location as
well as the scale of the key points by this process. So, this is step number one: I have to find
the approximate location of the key points.
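
A brute-force sketch of this 26-neighbour test might look as follows (illustrative only; NumPy is
assumed, and no refinement or thresholding is done at this stage):

```python
# A DoG sample is kept as a candidate key point if it is larger (or smaller) than its
# 8 neighbours in the same DoG image and the 9 neighbours in the DoG images just above
# and just below in scale: 26 comparisons in total.
import numpy as np

def is_extremum(dog_below, dog_current, dog_above, r, c):
    value = dog_current[r, c]
    cube = np.stack([dog_below[r-1:r+2, c-1:c+2],
                     dog_current[r-1:r+2, c-1:c+2],
                     dog_above[r-1:r+2, c-1:c+2]])
    return value >= cube.max() or value <= cube.min()

def find_candidates(dog_octave):
    # dog_octave: list of DoG images of one octave, all of the same size
    candidates = []
    for i in range(1, len(dog_octave) - 1):
        below, cur, above = dog_octave[i-1], dog_octave[i], dog_octave[i+1]
        for r in range(1, cur.shape[0] - 1):
            for c in range(1, cur.shape[1] - 1):
                if is_extremum(below, cur, above, r, c):
                    candidates.append((r, c, i))   # location plus scale index
    return candidates
```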

(Refer Slide Time: 40:06)

The same thing here again I am showing the same concept already I have explained, so I will be
considering the scale space and we can determine the difference of Gaussian and we have to do
the down sampling like this, so I will be getting the next octave and after this from the difference
of Gaussian, I can determine the location of the key points.

(Refer Slide Time: 40:28)

So, in this image I have shown the initial detection of the key points. So, these are the key points
you can see.

(Refer Slide Time: 40:36)

The next one is refining the key point location. We have determined the approximate location of the
key points, and now we have to perform a detailed fit to nearby data to determine the location, the
scale and the ratio of the principal curvatures. Some of the key points that we determined in the
first step may not be actual key points, so we have to do some refinement; for this we perform a
detailed fit to nearby data to determine the location, scale and ratio of the principal curvatures.

In the initial work, key points were found at the location and scale of the central sample point,
but in the later work a 3D quadratic function is fitted to improve the interpolation accuracy. That
concept I am going to explain in the next slide. Also, the Hessian matrix is used to eliminate edge
responses.

Because sometimes edge pixels may be detected as key points, we have to discard these points, and
we discard them based on the concept of principal curvatures. So, I have to determine the principal
curvatures, and based on this I can reject the edge pixels which are detected as key points.

(Refer Slide Time: 42:04)

So, how to do the interpolation? Here you can see we have detected the extrema by the 26
comparisons, but the true extrema are the red points; these are the true extrema. So, we localize
the extrema by fitting a quadratic: the sub-pixel, sub-scale interpolation is done using a Taylor
expansion, and you can see the difference of Gaussian is approximated by the Taylor series
expansion.

After this we take the derivative and set it to 0, and corresponding to this I will be getting the
refined location of the key point, x̂. Here you can see I am getting the location x and y and also
the information about the scale; that information is available in the key point definition, so I
will be getting the coordinates as well as the scale. By using this Taylor series interpolation I
can get the actual location of the key points.
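
In Lowe's paper this sub-pixel, sub-scale refinement is written as follows, with x = (x, y, σ)ᵀ
measured as an offset from the sample point:

```latex
D(\mathbf{x}) \approx D
  + \frac{\partial D}{\partial \mathbf{x}}^{T}\mathbf{x}
  + \tfrac{1}{2}\,\mathbf{x}^{T}\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\mathbf{x},
\qquad
\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}
                    \frac{\partial D}{\partial \mathbf{x}},
\qquad
D(\hat{\mathbf{x}}) = D + \tfrac{1}{2}\,
                      \frac{\partial D}{\partial \mathbf{x}}^{T}\hat{\mathbf{x}}
```

The value D(x̂) is what the next slide compares against the 0.03 contrast threshold.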

(Refer Slide Time: 43:17)

After this we have to discard the low contrast points. The low contrast points are discarded by
this condition: if the magnitude of the difference of Gaussian at the refined point, |D(x̂)|, is
less than a particular threshold, then we discard that point as a low contrast point. Generally a
threshold of 0.03 is used to discard the low contrast points. So, that is the first step: we
discard the low contrast key points based on this condition.

(Refer Slide Time: 43:52)

So, here you can see I am showing the example, so in the first image I am getting all the key
points, after this I am removing the low contrast key points, so I will be getting the second image
that is the low contrast key points are removed.

(Refer Slide Time: 44:07)

After this I have to consider the edge pixels which are detected as key points. We have to reject
these key points; that means the edge pixels which are detected as key points should be discarded.
For this we consider one technique: we determine the curvature response. Here you can see the
difference of Gaussian gives a high response along an edge; that is why a poorly defined peak in
the difference of Gaussian exhibits a high curvature across the edge and a low value in the
perpendicular direction.

So, corresponding to an edge I will be getting a high curvature across the edge and a low value in
the perpendicular direction. This principal curvature is computed by considering the Hessian
matrix, so based on the principal curvatures I can determine whether a pixel is an edge pixel or
not, because I have to discard the edge pixels based on the principal curvatures.

(Refer Slide Time: 45:25)

The problem is that edge key points are poorly determined: among the key points we have already
found, some edge pixels may be present, and those we have to discard based on the principal
curvatures. Here you can see that a detected point can move along the edge; I am showing the edge
pixels and also the corner points. These are the corner points and these are the edge pixels.

(Refer Slide Time: 45:56)

So, we have to check the cornerness of the key points. What is the meaning of this? High cornerness
means a corner point, where there is no dominant principal curvature component. But if I consider
an edge pixel, the curvature is high in one direction and low in the perpendicular direction.

So, corresponding to a corner point there is no dominant principal curvature component, but
corresponding to an edge pixel the curvature is prominent in one direction and low in the
perpendicular direction. By considering the Hessian matrix we can determine these curvature
components, and based on this principle we can discard the edge pixels which are detected as key
points.

(Refer Slide Time: 47:02)

So, corresponding to an edge point the curvature is high in one direction and low in the other.
Based on this we can identify the edge pixels, and for the determination of the principal
curvatures we can consider the Hessian matrix: we compute the principal curvatures from the
eigenvalues of the 2 by 2 Hessian matrix. This is the Hessian matrix we can obtain from Dxx, Dxy,
Dxy, Dyy, as I already explained in my corner detection class.

(Refer Slide Time: 47:42)

Now, I have to reject the edges, so we have to eliminate the edge response. I am considering the
Hessian matrix; let alpha be the eigenvalue with the larger magnitude and beta the smaller
eigenvalue. Corresponding to this Hessian matrix we can determine the trace of H and also the
determinant of H in terms of alpha and beta, the eigenvalues.

After this we can determine the ratio trace(H) squared divided by determinant(H), which is nothing
but (r + 1) squared divided by r. What is r? r is nothing but alpha divided by beta, so that
alpha = r times beta. The quantity (r + 1) squared divided by r is at a minimum when the two
eigenvalues are equal, and that is not true for edge pixels: for an edge pixel the curvature is
high in one direction and low in the other. If the ratio is greater than (r + 1) squared divided by
r, then based on this condition I can reject the key points corresponding to the edge pixels; that
means I am eliminating the edge response. In the SIFT algorithm r = 10 was considered. So, by
considering the Hessian matrix I am considering the principal curvatures, and by this condition we
can eliminate the edge response. That is the concept.
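
Compactly, the edge-rejection test described above is:

```latex
\mathbf{H}=\begin{bmatrix} D_{xx} & D_{xy}\\ D_{xy} & D_{yy}\end{bmatrix},\qquad
\operatorname{Tr}(\mathbf{H})=\alpha+\beta,\qquad
\operatorname{Det}(\mathbf{H})=\alpha\beta,

\frac{\operatorname{Tr}(\mathbf{H})^{2}}{\operatorname{Det}(\mathbf{H})}
  =\frac{(\alpha+\beta)^{2}}{\alpha\beta}
  =\frac{(r+1)^{2}}{r},\qquad r=\frac{\alpha}{\beta}
```

A key point is rejected as an edge response if Tr(H)² / Det(H) > (r + 1)² / r, with r = 10 in the
SIFT paper.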

(Refer slide time: 49:29)

So, here you can see the removal of high contrast key points residing on edges; that means the edge
pixels which were detected as key points are rejected by the principal curvature condition: by
considering the Hessian matrix we determine the eigenvalues, and based on these we reject the key
points corresponding to the edge pixels. So, here you can see the removal of high contrast key
points residing on edges.

(Refer slide Time: 50:04)

In this example I have shown the image: (a) is the original image, of size 233 by 189. In (b) I am
considering the DoG extrema, so I get 832 key points based on the DoG extrema. In figure (c) I get
729 key points after thresholding the low contrast points; the low contrast points are eliminated,
and by this I get 729 key points.

After this, in figure (d), I get 536 key points after testing the principal curvatures, that means
after rejecting the key points corresponding to the edge pixels. Finally, I am getting 536 key
points in figure (d).

(Refer slide Time: 51:00)

In step number three, assigning orientations: we create a histogram of local gradient directions at
the selected scale and assign the canonical orientation at the peak of the smoothed histogram. Each
key point now specifies stable 2D coordinates; that means the key point has the coordinates x and
y, the scale is available and the orientation is available. The peaks in the histogram correspond
to the dominant orientations of the patch; I am showing the histogram here, and the peaks
correspond to the orientations of the patches. For the same scale and location there could be
multiple key points with different orientations. So, that means I am assigning orientations.

(Refer Slide Time: 51:58)

So, we compute the gradient magnitude and orientation in a small window around the key point at the
appropriate scale. Here you can see, first I am doing the convolution of the image with a Gaussian
so that the image is smoothed; after this I am determining the gradient magnitude and also the
orientation. By using these equations I can determine the gradient magnitude and the orientation.
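
The equations referred to here are the standard ones from Lowe's paper, computed on the
Gaussian-smoothed image L at the key point's scale:

```latex
m(x,y)=\sqrt{\big(L(x+1,y)-L(x-1,y)\big)^{2}+\big(L(x,y+1)-L(x,y-1)\big)^{2}},
\qquad
\theta(x,y)=\tan^{-1}\!\frac{L(x,y+1)-L(x,y-1)}{L(x+1,y)-L(x-1,y)}
```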

(Refer Slide Time: 52:28)

After this we create the gradient histogram. For this we consider 36 bins, weighted by the gradient
magnitude and a Gaussian window whose sigma is 1.5 times the scale of the key point. So, I am
considering 36 bins and I am creating the gradient histogram. Any histogram peak within 80 percent
of the highest peak is assigned to the key point, which means multiple assignments are possible.

So, suppose this is the highest peak of the histogram, and suppose another peak is within 80
percent of it; then we also assign that orientation to the key point. So, any histogram peak within
80 percent of the highest peak is assigned to the key point; that means multiple assignments are
possible.
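
A small sketch of this orientation assignment, under the assumption that the magnitude, orientation
and Gaussian-weight patches around the key point have already been computed:

```python
# Build a 36-bin histogram of gradient orientations around the key point, weighted by
# gradient magnitude and a Gaussian window; keep every bin whose value is within 80
# percent of the highest peak (so multiple orientations, i.e. key points, are possible).
import numpy as np

def assign_orientations(magnitude, orientation_deg, gauss_weight):
    # magnitude, orientation_deg, gauss_weight: patches of the same size around the key point
    hist, _ = np.histogram(orientation_deg, bins=36, range=(0.0, 360.0),
                           weights=magnitude * gauss_weight)
    peak = hist.max()
    # every bin within 80% of the highest peak gives one key point orientation (bin centre)
    return [(b + 0.5) * 10.0 for b in range(36) if hist[b] >= 0.8 * peak]
```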

(Refer Slide Time: 53:20)

So, finally, in step number four, we have to find the descriptor for each of the key points. The
key points are represented by the x and y coordinates, after this the scale (the scale is nothing
but the spread, σ, of the Gaussian function), and after this theta, the orientation.

And we have to compute a descriptor for the local image region about each key point that is highly
distinctive and also as invariant as possible to variations such as changes in viewpoint and
illumination. So, these are the points we are considering: as invariant as possible to variations
such as changes in viewpoint and illumination.

(Refer Slide Time: 54:07)

So, the descriptor for each key point: consider a small region around the key point and divide it
into n × n cells (usually n = 2), each cell of size 4 × 4. After this, build a gradient orientation
histogram in each cell; each histogram entry is weighted by the gradient magnitude and by a
Gaussian weighting function with sigma equal to 0.5 times the window width. So, I am building a
gradient orientation histogram.

After this, sort the gradient orientation histogram bearing in mind the dominant orientation of the
key point. Here you can see I am showing the image gradients, and after this I am considering the
gradient orientation histogram in each cell; that means I am building a gradient orientation
histogram in each cell and after this I am sorting the gradient orientation histograms. If I do the
sorting then the descriptor will be invariant to rotation; that is the objective of the sorting.

(Refer Slide Time: 55:22)

So, here you can see we now have a descriptor of size r · n², that means r bins in the orientation
histogram for each of the n² cells. In the SIFT paper r = 8 and n = 4, so the length of the SIFT
descriptor is 128, and the descriptor is invariant to rotations due to the sorting. So, I will be
getting the key point descriptors; these are the cells and I am showing the orientations.

(Refer Slide Time: 55:53)

For scale invariance, the size of the window should be adjusted as per the scale of the key point;
a larger scale corresponds to a larger window. Here in the figure you can see the original image
and the key points. Suppose this point corresponds to a large scale; a large scale corresponds to a
larger window, so I am considering this larger window.

Similarly, if I consider a small scale, only a small window we have to consider. Because of this
the key point descriptor will be scale invariant; that is the objective, because we have to make
the descriptor invariant to rotation and scale. So, I am getting the key point descriptors like
this.

(Refer Slide Time: 56:41)

The SIFT descriptor so far is not illumination invariant, because the histogram entries are
weighted by the gradient magnitude. That is why the descriptor vector is normalized to unit
magnitude; this normalizes out scalar multiplicative intensity changes. Scalar additive changes do
not matter, because gradients are invariant to a constant offset. So, the scalar additive change is
not an important issue, but we have to take care of the multiplicative intensity change.
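
A sketch of this normalization step (the 0.2 clipping value is the one used in Lowe's paper; NumPy
is assumed):

```python
# Normalize the 128-D descriptor to unit length to cancel multiplicative (contrast)
# changes; Lowe's paper additionally clips entries at 0.2 and renormalizes to reduce
# the influence of a few very large gradient magnitudes.
import numpy as np

def normalize_descriptor(desc, clip=0.2):
    desc = desc / (np.linalg.norm(desc) + 1e-12)
    desc = np.minimum(desc, clip)
    return desc / (np.linalg.norm(desc) + 1e-12)
```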

(Refer Slide Time: 57:16)

So, this is about the SIFT descriptors, and what are the uses of SIFT? One use is image alignment:
if I consider, suppose, stereo imaging, we can do the alignment of the right image and the left
image, we can find the correspondence between the images, we can find the homography, we can
determine the fundamental matrix; for image alignment we can do the feature matching.

So, by using SIFT we can do this matching and finally we can do the image alignment. For 3D
reconstruction also we can extract the SIFT features, for motion tracking we can consider SIFT
features, object recognition is one important application in which the SIFT features can be used,
and indexing and database retrieval, that is nothing but content-based image retrieval, is another.
For robot navigation also we can use the SIFT features, and for many other computer vision
applications we can use the SIFT features.

(Refer Slide Time: 58:20)

So, one example I have already explained, that is content-based image retrieval. For this, suppose
the database images are available; we can compute the SIFT features and save the descriptors to the
database, and for a query image again we can determine the SIFT features, after which we have to
compare these features. For each descriptor we find the closest descriptors in the database. For
this we can consider the L2 distance, and based on this we can determine the matching between the
test image, that is the query image, and the images available in the database.
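
A minimal sketch of this matching step with OpenCV (the file names are placeholders; in a real
retrieval system the database descriptors would be precomputed and stored rather than recomputed):

```python
# Compare SIFT descriptors of a query image against one database image with the L2
# distance, keeping only distinctive matches via Lowe's ratio test.
import cv2

sift = cv2.SIFT_create()
query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
db_img = cv2.imread("db_image.jpg", cv2.IMREAD_GRAYSCALE)

_, query_desc = sift.detectAndCompute(query, None)
_, db_desc = sift.detectAndCompute(db_img, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(query_desc, db_desc, k=2)

# Ratio test: the closest match must be clearly better than the second closest.
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])
print("number of good matches:", len(good))
```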

(Refer Slide Time: 59:01)

So, here you can see, for object recognition also we can apply the SIFT features: corresponding to
all these objects we can determine the SIFT features, and in this input image I can find the
corresponding objects based on the matching. Again, corresponding to these input images I have the
SIFT descriptors, and for object recognition we do the matching; based on this matching I can
recognize or detect a particular object in an image.

(Refer Slide Time: 59:36)

So, like this, in object detection, corresponding to these images we have the SIFT descriptors and
we can determine the objects present in an image by using the SIFT descriptors.

(Refer Slide Time: 59:48)

Similarly, I can show these are the input images and for this we have the SIFT descriptors and
we can find the objects present in an image.

(Refer Slide Time: 59:58)

Like this, this is another example, so we can find objects present in the image.

(Refer Slide Time: 60:04)

Again, I am showing another example. These are the input images, for which we have the SIFT
descriptors, and based on this we can find the objects present in the images. So, this is the
concept of the SIFT descriptors.

(Refer Slide Time: 60:14)

And some of the variations of SIFT: one is SURF, which is nothing but Speeded-Up Robust Features.
This is also very important. In the case of SURF the fundamental concept is very similar to the
scale-invariant feature transformation; SURF is based on a Hessian/Laplacian operator, and SURF
shows very similar performance to that of SIFT.

But one advantage is that SURF is faster than SIFT. In the case of SURF, an integral image
approximation is considered to perform fast computation of the Hessian matrix, and it is also used
during the scale space analysis. Also, the difference of Gaussian is used in place of the Laplacian
of Gaussian for assessing scales, and sums of Haar wavelet responses are used in place of the
gradient histograms.

This reduces the dimensionality of the descriptor, which is half that of SIFT. So, these are the
differences between SIFT and SURF; one important point is that in SURF the integral image approach
is considered to perform fast computation of the Hessian matrix. I am not going to explain the
concept of SURF in detail.

But the concept is very similar to SIFT, and the differences with SIFT I have explained briefly:
one is the integral image approach, one is that the difference of Gaussian is used in place of the
Laplacian of Gaussian, and another is that sums of Haar wavelet responses are used in place of the
gradient histograms. Other variations are PCA-SIFT and FREAK; these you can read about from the
research papers.

(Refer Slide Time: 62:25)

So, in the case of SIFT, what are the advantages? Resistance to affine transformations of limited
extent; it works better for planar objects than full 3D objects. Resistance to a range of
illumination changes is another advantage, and so is resistance to occlusion in object recognition,
since SIFT descriptors are local. So, these are the main advantages of SIFT.

(Refer Slide Time: 62:48)

And what are the disadvantages? The resistance to affine transformation is empirical; no hard-core
theory is provided. And we have several parameters in the algorithm, like the descriptor size, the
size of the region and various thresholds, and the theoretical treatment for their specification is
not clear. So, these are the disadvantages of SIFT.

(Refer Slide Time: 63:11)

After this I will briefly explain the concept of saliency. The concept of saliency is employed to
extract robust and relevant features. Some regions of an image can be simultaneously unpredictable
in some feature and scale space, and these regions may be considered as salient. Saliency is
defined as the most prominent part of an image, and a saliency model indicates what actually
attracts the attention.

So, that is important: what actually attracts the attention, that is the concept of saliency, which
I will explain in the next slide. The output of such models is called the saliency map. So, from
the input image we can determine the saliency map; the saliency map depends on the attention. A
saliency map refers to visually dominant locations, and these pieces of information are
topographically represented in the input image. So, in my next slide you can see what the salient
regions or the saliency map are.

(Refer Slide Time: 64:21)

So, corresponding to this input image you can see I am determining the salient regions. Suppose
this region is a salient region, and similarly, if I consider this other region, this is also a
salient region. So, I will be getting the saliency map; that means it indicates what actually
attracts the attention, that is, the visually dominant locations. If I consider these regions,
these are visually dominant locations, and based on this we can determine the saliency map.

(Refer Slide Time: 64:58)

A saliency map shows the unique quality of each and every pixel of an image. An image can have more
than one salient area, and one region may be more salient than other regions. Here you can see in
this example I am considering the input image and I am determining the saliency map; that means it
indicates what actually attracts the attention.

So, that means the visually dominant locations. If you see, this is a visually dominant location,
and I am showing the heat map; similarly, this is another visually dominant location. The rest of
the information is not so important. Based on this we can determine the saliency map, so it
indicates what actually attracts the attention.

(Refer Slide Time: 65:52)

And you can see we can develop some algorithms for saliency prediction, that is, produce a
computational model of visual attention and predict where humans will look. Corresponding to this
input image you can see the visually dominant location; like this, corresponding to this other
image also I can determine the visually dominant locations, and from this I can determine the
saliency map.

(Refer Slide Time: 66:17)

The human eye generally detects saliency based on movement, contrast, colour, intensity, et cetera.
So, maybe we can employ statistical techniques to determine unpredictability or rarity, or maybe we
can employ an entropy measure to determine rarity. So, the entropy measure is one way to determine
the saliency map. The entropy can be determined within image patches at the scale of interest, and
the saliency can be represented as a weighted summation of where the entropy peaks.

So, based on the entropy we can determine the saliency map. The estimation needs to be invariant to
rotation, translation, non-uniform scaling and uniform intensity variation, and additionally such
measures should be robust to small changes in the viewpoint. So, briefly, this is the concept of
saliency; again I am repeating, saliency indicates what actually attracts the attention, and mainly
it is the visually dominant locations we have to determine.
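
As one simple illustration of the entropy idea (this is not the lecture's exact algorithm, only a
hedged sketch assuming a uint8 grayscale image and a hypothetical window size):

```python
# Compute a saliency-like map from local entropy: the grey-level entropy inside a
# sliding window; regions with high local entropy are treated as more salient.
import cv2
import numpy as np

def entropy_saliency(gray, window=16):
    h, w = gray.shape
    saliency = np.zeros((h, w), dtype=np.float32)
    half = window // 2
    for r in range(half, h - half, half):
        for c in range(half, w - half, half):
            patch = gray[r - half:r + half, c - half:c + half]
            hist = np.bincount(patch.ravel(), minlength=256).astype(np.float32)
            p = hist / hist.sum()
            p = p[p > 0]
            # Shannon entropy of the patch assigned to the whole block
            saliency[r - half:r + half, c - half:c + half] = -(p * np.log2(p)).sum()
    return cv2.normalize(saliency, None, 0, 1, cv2.NORM_MINMAX)
```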

(Refer Slide Time: 67:25)

And in this example I have shown the original image and the salient points; corresponding to this
input image I am determining the salient points. So, briefly I have explained the concept of
saliency. In this class I discussed the concept of the HOG features, the histogram of oriented
gradients, and after this I discussed the concept of SIFT, the scale-invariant feature
transformation. For more detail you should see the research papers on HOG, the histogram of
oriented gradients, and also on the scale-invariant feature transformation. So, let me stop here
today. Thank you.

Computer Vision and Image Processing- Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati India
Lecture 30
Introduction to Machine Learning
Welcome to NPTEL MOOCS course on Computer Vision and Image Processing -
Fundamentals and Applications. I have been discussing image features; after image
features, the next step is image classification and image recognition. So, for this I have to apply
pattern classification or pattern recognition algorithms, the machine learning algorithms, for
object recognition and object classification.

So, now I will discuss some fundamental concepts of machine learning and pattern
classification. It is not possible to discuss all the machine learning concepts in this
computer vision course; that is why I will briefly discuss some important concepts of
machine learning, and after this I will discuss some of the applications of computer vision.

So, for pattern classification there are mainly three approaches: one is the statistical approach,
one is the structural approach and one is the soft computing-based approach. In the statistical
approach, we consider some statistical methods and based on this we can do pattern
classification and pattern recognition. Another one is the structural method, which is nothing but
the description of a particular pattern. And finally, the soft computing-based pattern recognition:
in this case we can use fuzzy logic, artificial neural networks, genetic algorithms and genetic
programming.

So, I will mainly discuss only the statistical pattern recognition techniques and also the soft
computing-based pattern recognition techniques, but all the concepts I will be discussing in
brief. So, let us see what pattern recognition is.

(Refer Slide Time: 02:22)

So, in my first class I discussed this block diagram; this is a typical computer vision
system. The first step is image acquisition by a camera, after this we have to do some pre-
processing, and the final steps are feature extraction and pattern classification, that is nothing
but decision making.

For this we use pattern recognition and artificial intelligence algorithms. I have already
discussed image features, different features like colour features, texture features, shape and
boundary features. So, now I will discuss some pattern recognition concepts which are mainly used
in computer vision applications.

(Refer Slide Time: 03:06)

So, this block diagram also I have shown in my second class; that is the image analysis
system. You can see the input images are available, and after this we have to do some pre-
processing; after pre-processing we have to do feature extraction. So, here you can see, I
have shown the feature vector x1, x2, ..., xn; these are the elements of the feature vector, and
after this we have to go for classification, that is, object classification and object recognition.
(Refer Slide Time: 03:39)

And you can see the domain of the machine learning. The first one is the artificial
intelligence, the programs with the ability to learn and reason like humans and that is the
artificial intelligence and you can see the machine learning is a subset of artificial
intelligence. So, algorithms with the ability to learn without being explicitly programmed,
that is machine learning and if you see the deep learning, the deep learning is a subset of
machine learning and in this case, we use the artificial neural networks and also the deep
networks. So, this is about machine learning and the deep learning.

(Refer Slide Time: 04:20)

And in this case, there are some applications of pattern recognition, machine learning and in
this case, I have shown that machine perception for speech recognition, fingerprint
identification, optical character recognition, a DNA sequence identification, biomedical
image processing and biomedical signal processing. So, build a machine that can recognise
different patterns. So, these are the examples.

(Refer Slide Time: 04:48)

So, I can give one example of human perception. We can recognise all the alphabets of the English
language. For this we train ourselves to recognise the different alphabets, and whenever a new
alphabet appears, we can recognise it based on this learning; in this case, we have the memory and
the intelligence for recognising the alphabets. This is one example of human perception.

(Refer Slide Time: 05:19)

The next one is machine perception: how about providing such capabilities to machines to recognise
alphabets? In this case also we have to train the machine, and after this the machine can recognise
different alphabets. So, the first thing is the training and after this the testing.

(Refer Slide Time: 05:43)

So, for pattern recognition, I have to define some terms. The first one is the pattern; I can give
one example: a pattern may be a signal or may be an object. If you see the definition, a pattern is
an object, process or event that can be given a name. After this, the next one is a pattern class.

A pattern class is a set of patterns sharing common attributes and usually originating from the
same source. So, suppose the fingerprint is a pattern, and if I consider the fingerprints of the
employees of a particular school, then that is the pattern class. So, my pattern is the
fingerprint, and the pattern class is the fingerprints of the employees of that particular school.

After this is recognition. During recognition or classification, given objects are assigned to
prescribed classes, and for this we consider a classifier. So, this is about pattern
classification: first I have to understand what a pattern is, after this the pattern classes, and
after this what recognition and classification are and what a classifier is.

(Refer Slide Time: 07:05)

Some of the pattern classification examples are like optical character recognitions, even in the
biometrics and the face recognition, fingerprint recognition, speech recognition, machine
learning is used and for medical diagnoses, the X ray imaging, ECG analysis there are many
applications of machine learning, pattern recognition and also applications in military.

(Refer Slide Time: 07:35)

So, the approaches are: first, statistical pattern recognition, which is based on an underlying
statistical model of the patterns and the pattern classes. Statistical techniques are used for
statistical pattern recognition systems, and mainly I will discuss statistical pattern recognition
systems. The next approach is the structural approach; it is nothing but the description of a
pattern.

So, suppose I consider the pattern A. If you see, A has three primitives, and by using these
structural components I can represent the pattern A. This is about the structural pattern
recognition system; nowadays it is not used much. That is why we will discuss statistical pattern
recognition and also soft computing-based pattern recognition. The next approach is the soft
computing-based approach; here I have shown neural networks, but in general it is soft
computing-based pattern recognition.

For this we can use fuzzy logic, artificial neural networks, genetic algorithms and genetic
programming. These are the approaches of pattern recognition. Now, I will show one typical block
diagram of a pattern recognition system.

(Refer Slide Time: 09:10)

So, for this you can see I have, suppose, patterns; after this we have to do some measurement and
we have the measured values. After this we have to extract features; after feature extraction, we
will be getting the feature values. All the features are not important, so that is why we are
considering feature selection, and after feature selection we will be getting the feature vector.

So, the feature vector is, suppose, x. After getting the feature vector, I can consider a
classifier. The input to the classifier is the feature vector, and suppose we have a database and
suppose some rules are available; then what is the output? The output of the classifier is nothing
but the decision making.

So, if you see this typical block diagram of the pattern recognition system: we have the input
patterns, after this we do some measurements and I get the measured values, and after this comes
the feature extraction, so I get the feature values. Since all the features are not important, we
select some of the important features, that is, the discriminative features; that is the feature
selection. After this we get the feature vector x, which is the input to the classifier. You can
see the classifier, and we have the database, suppose the training samples are available and some
rules are available, and based on this we can do pattern classification, we can do decision making,
and these decisions may be hard decisions or soft decisions.

In the case of the hard decision, we use classical set theory, and in the case of the soft decision
we consider fuzzification, that is, fuzzy set theory is used. So, one is the hard decision, another
one is the soft decision: in the case of the hard decision we use classical set theory, and in the
case of the soft decision we consider fuzzification, that is nothing but fuzzy logic.

So, this is a typical block diagram of a pattern recognition system, and in this case, what is the
hard decision and what is the soft decision? Suppose I consider two classes; these are the samples
of one class and these are the samples of another class, and suppose this is a decision boundary
between the classes.

So, I am considering class 1 and class 2, and between class 1 and class 2 I am considering the
decision boundary. Now, in the case of the hard decision, you can see I have a clear separation
between the classes, class 1 and class 2; but if I consider a fuzzy decision boundary, the decision
boundary will be something like this, it is not rigid.

So, what happens? There is a possibility that a particular sample, suppose this sample, belongs to
another class. In the case of the hard decision, with the rigid decision boundary already shown,
there is no possibility that a particular sample belongs to another class; but in the case of the
soft decision, there is a possibility that a particular sample may belong to another class, and
this depends on the fuzzy membership grade.

So, if you read fuzzy logic, then you can understand one concept, and that concept is the fuzzy
membership grade. Based on the fuzzy membership grade, there is a possibility that a particular
sample may belong to another class. So, one is the hard decision: there is no such possibility,
because the decision boundary is very rigid and you get a clean separation between the two classes,
class 1 and class 2.

But in the case of fuzzification, in the case of the soft decision, based on the membership grade
there is a possibility that a particular sample may belong to another class. That is about the hard
decision and the soft decision. Now, how do we represent the pattern classification problem? So, I
can show you what pattern classification is.

(Refer Slide Time: 15:36)

So, for this, what I am considering is a space, and this space is the class membership space C; in
this case I am considering the classes, suppose class wi, another class wj and another class wk.
So, I am considering three classes, suppose. That is the class membership space, and I also have
the pattern space. The pattern space is something like this: suppose I have the patterns P1, P2, P3
and P4.

These I am considering as the pattern space, and after this I am considering another space, that is
the measurement space. Corresponding to this I have three measurements, suppose m1, m2 and m3. Each
class wi generates a subset of patterns in the pattern space; if you see the pattern space, P1, P2,
P3 and P4 are overlapping, and the meaning is that patterns of different classes may share some
common attributes.

So, in this case I consider a mapping: corresponding to the class wi, I have two patterns, P1 and
P4; corresponding to the class wj my pattern is P2; and corresponding to the class wk, my pattern
is P3. This is the mapping I am considering. So, here you see, corresponding to the class wi I have
two patterns, P1 and P4; corresponding to the class wj I have the pattern P2; corresponding to the
class wk I have the pattern P3; and you can see here that the patterns of different classes may
overlap. Why are they overlapping?

Because the patterns of different classes may share common attributes; that is why they are
overlapping. And corresponding to these patterns I consider a mapping from the pattern space to the
measurement space. Corresponding to the pattern P1, the measurement is suppose m1; corresponding to
the pattern P2 the measurement is m2; corresponding to the pattern P3, the measurement is suppose
m1; and corresponding to the pattern P4 the measurement is suppose m3.

So, this is a mapping from the pattern space to the measurement space. Now, the problem of pattern
classification is: I have the measurements, and from the measurements I have to determine the
corresponding class. Suppose the measurement is m1; I have to determine the corresponding class.
But in this case it is not a one-to-one mapping; if you see, it is not one-to-one, which means that
going from the measurement to the class, that is, the inverse mapping, is not unique.

So, I have done the measurements and from the measurements I have to determine the corresponding
class, but it is not a one-to-one mapping. Had it been a one-to-one mapping, then the pattern
classification problem would have been very easy. Since it is not one-to-one, we have to apply some
techniques, like machine learning techniques.

That means, from the measurement, how to determine the particular class; and this is nothing but
the inverse mapping, the mapping from the measurement space to the class membership space. Like
this I can define the concept of pattern classification. Statistically, I can represent this
problem as the probability of wj given x, P(wj | x). What is the meaning of this? I have to find
the probability of obtaining a particular class given the feature vector x, where x is nothing but
the measurement.

So, from the measurement, I have to determine the particular class; that is the concept of pattern
classification, and I have to determine this probability. In this case there are two learning
techniques. One is supervised: in the case of a supervised learning technique, we know the class
labels and the corresponding training samples are available.

So, suppose I have the class wi and corresponding to wi I have the training samples; that
information is available in the case of supervised learning, and after learning, after training, we
do the classification based on the trained model. That is supervised classification. In the case of
unsupervised classification, we do not know the class labels.

So, what is available? Only the feature vectors. We have to group the feature vectors based on some
similarity; that is nothing but clustering. So, what we have to consider is grouping the feature
vectors based on some similarity; that is the unsupervised approach. Later on I will discuss
supervised and unsupervised learning in more detail; the main point is that in the supervised case
we know the class labels and we also have the training samples corresponding to a particular class.

In the case of unsupervised learning, we do not know the class labels, but the feature vectors are
available, so we can group the feature vectors based on some similarity. That is the unsupervised
technique. And for classification, one mathematical representation is that we can define one
function, and that function is called the discriminant function.

This function is used for taking classification decisions. The discriminant function is represented
by gi(x), and it is used to partition the R^d space, the d-dimensional feature space; in this case
i = 1, 2, ..., C, where C is the number of classes.

So, for class 1 I have the discriminant function g1(x), for class 2 I have the discriminant
function g2(x), and so on. I have to determine the discriminant functions, and after this I have to
find the maximum discriminant function, which corresponds to the assigned class. So, I have to
determine the discriminant function for class 1, for class 2, up to class C, and the maximum
discriminant function corresponds to that particular class.

So, the decision rule will be something like this, what is the decision rule, the decision rule
will be like this the feature vector x which is assigned to this particular class, the class is
suppose wm, when the gm (x) is greater than gi (x) for I is equal to 1, 2 up to C and i is not
equal to m. So, based on this condition if gm x is greater than gi (x), then in this case based on
this I can assign the feature vector to the class that classes w m and in this case you can see the
discriminant function is used to partition R to the power d space that is the d dimensional
space that is the feature space.

Suppose I consider a two-dimensional feature space with features x1 and x2. You can see the decision boundary I am considering: corresponding to the class w1 the region is R1, corresponding to the class w2 the region is R2, and in between you can see the decision boundary.

The decision boundary is nothing but the locus where g1(x) is equal to g2(x). So, in this case, I have to determine the discriminant function corresponding to class 1, whose region is R1, and the discriminant function corresponding to class 2, whose region is R2. Whichever discriminant function is maximum, g1 or g2, decides the classification, and the equation of the decision boundary is g1(x) = g2(x). So, the two important concepts here are the decision boundary and the discriminant function.
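To make this concrete, here is a minimal sketch (my own, not from the lecture) of maximum-discriminant classification in Python, assuming linear discriminant functions of the form gi(x) = wi · x + bi; the weights, offsets, and feature values are purely illustrative.

```python
import numpy as np

# Hypothetical linear discriminant functions g_i(x) = w_i . x + b_i for C = 3 classes
W = np.array([[1.0, 2.0],    # weights for class 1
              [-0.5, 1.5],   # weights for class 2
              [0.3, -1.0]])  # weights for class 3
b = np.array([0.1, -0.2, 0.5])

def classify(x):
    """Assign x to the class with the maximum discriminant value g_i(x)."""
    g = W @ x + b             # evaluate all C discriminant functions
    return int(np.argmax(g))  # index of the maximum discriminant

x = np.array([0.8, 1.2])      # a 2-D feature vector (the measurement)
print("assigned class:", classify(x) + 1)
```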

(Refer Slide Time: 25:54)

Here I have shown the components of a pattern recognition system. As already defined, we have the patterns; then we have some sensors for measurement, that is, for data acquisition; after this we do some pre-processing; then we extract features; after extracting the features, we can select some important features, which is called feature selection; and after this we go for classification.

In this case, we have also shown the learning algorithms. In supervised learning, the training samples are available for all the classes and we know the class labels, and based on this information we can train the classifier, we can train the system. So, this is about the components of the pattern recognition system.

(Refer Slide Time: 26:53)

And in this case, I have shown one example of pattern classification. Here the problem is jockey versus hoopster recognition, and for this I am considering two classes, H and J, in a two-dimensional feature space with two features: one is the height and the other one is the width.

So, x is the feature vector, where x1 is the height and x2 is the width, measured by the sensor. The feature vector thus has two components, x1 and x2; these are the elements of the feature vector. After this, I am considering the function q(x) = w · x + b, that is, the dot product between the weight vector w and the input feature vector x, plus an offset b.

If w · x + b is greater than or equal to 0, then the corresponding class will be H, and if w · x + b is less than 0, then the corresponding class will be J. So, based on this condition, I am determining the particular class; that is the decision making I am doing based on this equation. In the figure, I am showing two classes: the red points are the class J and the blue points are the class H, and in between you can see the decision boundary.

So, what will be the equation of the decision boundary? The equation of the decision boundary is w · x + b = 0, and the training samples (x1, y1), (x2, y2), and so on are available to train the system. After training, we can find the decision boundary between these two classes, and based on the decision function, if w · x + b is greater than or equal to 0 I decide the class H, and if w · x + b is less than 0 the class will be J. So, this is one example I am considering.
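As an illustration (my own sketch, not the lecturer's code), a linear decision rule of this form can be written as follows in Python; the weight vector, offset, and feature values are hypothetical.

```python
import numpy as np

# Hypothetical trained parameters of the linear decision rule q(x) = w . x + b
w = np.array([0.9, -1.4])   # weights for the two features (height, width)
b = -0.2                    # offset

def decide(x):
    """Return 'H' if w . x + b >= 0, else 'J'."""
    return 'H' if np.dot(w, x) + b >= 0 else 'J'

x = np.array([1.85, 0.45])  # a measured (height, width) feature vector
print(decide(x))
```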

(Refer Slide Time: 29:11)

And for feature extraction, you can see in these two figures I am considering good features and bad features. In the first figure, I am considering two classes, one red and the other blue. In case of good features, I can easily draw a decision boundary between these two classes, and it is a linear decision boundary. In case of bad features, as in the second figure, it is very difficult to find a decision boundary between the classes. These are examples of bad features, and that is why feature selection is important.

So, suppose I have a feature vector of dimension n, with components x1, x2, ..., xn. All the features may not be important, so I have to reduce the dimension of the feature vector. I can reduce the dimension by some methods; one method I have already discussed is PCA, the Principal Component Analysis, which maps the n-dimensional vector to an m-dimensional one with n greater than m. That is feature selection.

One issue is this: suppose D is the dimension of the feature vector and I plot the accuracy against D. If I increase the dimension of the feature vector, initially the accuracy will increase, but if I keep increasing the dimension, the accuracy will drop.

This is because, if only a limited number of training samples is available, increasing the dimension of the feature vector may not increase the accuracy; the accuracy may actually decrease. This is called the curse of dimensionality, and it is why feature selection is important: we have to reduce the dimension of the feature vector.
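As a small illustration (my own sketch, not part of the lecture), dimensionality reduction with PCA can be done as follows; the data and the choice of m = 2 components are hypothetical.

```python
import numpy as np

# Hypothetical data: 100 samples with n = 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Centre the data, then take the top m eigenvectors of the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
m = 2
top = eigvecs[:, ::-1][:, :m]            # the m principal directions
X_reduced = Xc @ top                     # project from n = 5 down to m = 2 dimensions
print(X_reduced.shape)                   # (100, 2)
```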

(Refer Slide Time: 31:44)

And in this case, I have shown the classifiers and the regions X1, X2, X3 corresponding to different classes: the class w1 corresponds to the region X1, the class w2 corresponds to the region X2, and so on. In the first figure, I am getting a linear decision boundary between the different classes.

In the second figure, I am not getting a linear decision boundary between the classes; again you can see the regions X1, X2, X3. So, in the first case I have a linear decision boundary, and in the second case the boundary is nonlinear. Here, X is nothing but the union of X1, X2, and so on, which is the total space, that is, the feature space.

(Refer Slide Time: 32:49)

And this concept I have already explained. Here the input is the feature vector, and corresponding to this feature vector I am determining the discriminant functions g1(x), g2(x), g3(x), and so on for all C classes. I have to determine the maximum discriminant function; suppose g2 is maximum, then the corresponding class will be class 2. So, based on the discriminant functions, I can do the classification.

(Refer Slide Time: 33:25)

And later on, I will discuss the Bayesian decision theory, which is based mainly on Bayes' law. In this case, I will be considering two measures: one is the probability of error, which I can determine, and the other one is the risk, which I can also determine. In Bayesian decision theory, based on these two measures, the probability of error and the risk, I can take a classification decision.

So, what is risk? It is the risk of taking a particular action. I have to take some actions and determine the risk of taking these actions; I have to determine the conditional risk, and based on the conditional risk I can decide a particular action and also take the classification decision. So, by using these two measures, the probability of error and the risk, I can do the pattern classification.

(Refer Slide Time: 34:37)

And in case of Bayes' law, these are the terminologies. The classes are represented as w1, w2, and so on; in this case, I am considering two classes, one is sea bass and the other one is salmon. I am considering the prior probabilities P(w1) and P(w2), and we have the evidence P(x). By Bayes' law, the posterior probability is equal to the prior probability times the likelihood, divided by the evidence. The evidence has no significance in classification; it is nothing but a normalising factor. So, based on the prior and the likelihood, we can take the classification decision.

(Refer Slide Time: 35:43)

So, here I am showing the Bayes rule: the posterior probability is equal to the likelihood times the prior, divided by the evidence, where the evidence is nothing but a scale factor used for normalisation; in classification it has no significance.

Based on the likelihood and the prior, we can take the classification decision. So, let us consider these decisions: decide the class w1 if the probability of w1 given x is greater than the probability of w2 given x; otherwise, decide the class w2. This is based on the posterior probability, the posterior density.

Similarly, working with the likelihood and the prior, I can decide a particular class: if the probability of x given w1 times the probability of w1 is greater than the probability of x given w2 times the probability of w2, then I decide the class w1; otherwise, I decide the class w2.

And in this case, we can define a term called the likelihood ratio. If the likelihood ratio is greater than a particular threshold, then the corresponding class will be w1; otherwise, we decide the class w2. So, based on Bayes' law, I can do the classification.

(Refer Slide Time: 37:28)

In case of the basic pattern recognition framework, we need the training samples, the testing samples, and an algorithm for recognising unknown test samples. We are considering supervised learning; that means we know the class labels and we have training samples corresponding to all the classes.

(Refer Slide Time: 37:55)

And in this case, one problem is, suppose, alphabet recognition; 26 uppercase alphabets are available. For this, we have to collect samples of each of the 26 alphabets and train using an algorithm. So, first we collect all the training samples of the alphabets, and we have 26 classes because we are considering the 26 uppercase alphabets. After training, we do the testing using unknown samples, that is, unknown alphabets, and after this we can determine the accuracy. So, this is a typical supervised pattern recognition problem.

(Refer Slide Time: 38:42)

And what are the patterns? As I have already defined, a pattern may be a signal, a fingerprint image, a handwritten word, a human face, a speech signal, a DNA sequence, an alphabet, and so on; these are examples of patterns.

(Refer Slide Time: 38:57)

And in this case, I have shown one example of handwriting recognition. For this also we have to extract the features, and based on these features we can recognise the handwriting.

(Refer Slide Time: 39:08)

And this is another example, face recognition. If I consider the first row, it is the same face but under different lighting conditions and different poses; in this case also we have to recognise the face. For this, we have to extract the features and, based on these features, recognise that particular face. In the second row also, I have shown faces with different poses, different facial expressions, and different lighting conditions, and for these also we have to recognise the face.

(Refer Slide Time: 39:48)

And this is another example, fingerprint recognition. So, in this class I discussed the concept of pattern classification, and I have shown one mapping diagram. In the mapping diagram, you can see that from the measurement I have to determine the corresponding class.

So, the feature vector is available and I have to determine the corresponding class; statistically, that is represented by the probability of wj given x. That is the definition of pattern classification. In my next class, I will continue the same discussion. So, let me stop here today. Thank you.

Computer Vision and Image Processing- Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati India
Lecture 31
Introduction to Machine Learning
Welcome to NPTEL MOOCs course on Computer Vision and Image Processing - Fundamentals and Applications. In my last class, I discussed the concept of pattern classification and briefly explained the concept of supervised learning. Today I am going to continue the same discussion: first I will explain the concept of deep learning and how it differs from a traditional pattern recognition system, and after this I will discuss the concepts of supervised learning and unsupervised learning. So, what is deep learning?

(Refer Slide Time: 01:11)

Here you can see, in my next slide, the definition of deep learning, which I have already discussed: deep learning is nothing but a subset of machine learning, and it is mainly an extension of the artificial neural network. In this figure, I am considering one image and I want to recognise whether it is a car or not.

In case of deep networks, the model is nothing but a cascade of nonlinear transformations, and I am considering a hierarchical model; that means I want to extract more and more information from the input signal, the input image. By considering the cascade of nonlinear transformations and end-to-end learning, we want to extract more and more information from the image: the image has low-frequency information, high-frequency information, maybe some diagonal edges, and so on.

So, all this information I want to extract from the image. I will explain the deep learning techniques later on, but mainly a deep network is a cascade of nonlinear transformations with end-to-end learning; we want to extract more and more information from the input image, the input signal, and based on this we can do the classification. So, the output is the car; that means it is identified that a car is present in the image. Next, consider the space of machine learning methods; I will show some popular machine learning methods in the next slides.

(Refer Slide Time: 02:42)

So, first you can see there are many algorithms: recurrent neural networks, convolutional neural networks, neural networks, autoencoders, support vector machines (SVM), the Gaussian Mixture Model (GMM), the restricted Boltzmann machine, sparse coding, and the perceptron. These are the popular techniques.

(Refer Slide Time: 03:06)

And you can see the left side shows the deep structures and the right side the shallow structures. If I consider the support vector machine, that is a shallow structure, but if I consider the convolutional neural network or the recurrent neural network, those are deep learning techniques. Again, I am classifying further: some techniques are supervised and some techniques are unsupervised.

So, I can consider supervised neural networks, and unsupervised neural networks are also available, like the competitive neural networks. You can see that the Gaussian mixture model is unsupervised and the restricted Boltzmann machine is also unsupervised, while the support vector machine is on the shallow learning side, and the convolutional neural networks and the recurrent neural networks are deep learning techniques.

And some are probabilistic methods: if I consider Bayes NP, deep belief networks, restricted Boltzmann machines, GMM, all these are probabilistic methods. So, in this diagram, I have shown the popular machine learning algorithms.

(Refer Slide Time: 04:24)

And in this case, you can see I am considering the machine learning techniques; some techniques are supervised. You can refer to the machine learning books for details because, as I already told you, it is not possible to discuss all the machine learning concepts in this computer vision course; that is why I am giving only a brief discussion of these techniques.

One group is the supervised techniques: for regression, you can consider decision trees and random forests, and for classification, the k-nearest neighbour technique, trees, logistic regression, naive Bayes, and support vector machines. In case of unsupervised clustering, I can consider the singular value decomposition, PCA (the Principal Component Analysis), and the k-means clustering. Also, we have reinforcement learning, which I will explain later on. So, these are some machine learning techniques.

(Refer Slide Time: 05:21)

And in case of the machine learning techniques, first I have to do the training, and after the training we have to go for testing. Here I have the labelled data, and we are considering the machine learning algorithms mainly for the training. In case of the supervised technique, as I have already explained, we know the class labels and the corresponding training samples are available.

So, based on these training samples, we can do the training of the system, the training of the algorithm. After this, when labelled data is available and we already have the learned model, we can use the learned model to do the prediction, to do the decision making. This is about the machine learning basics.

(Refer Slide Time: 06:11)

And here I have shown the concepts of supervised, unsupervised, and reinforcement learning; reinforcement learning I will explain later on. The first one is a classification example: I have shown two classes, class A and class B. In case of regression, it is nothing but the fitting of a line through the sample points; that concept also I will explain later on. And what is unsupervised learning? It is nothing but clustering: the class labels are not available, only the feature vectors are available, and based on some similarity we do the grouping of the feature vectors; that is clustering.

(Refer Slide Time: 06:54)

And in case of a traditional pattern recognition system, I am considering an input image; I can extract features like SIFT or maybe the HOG features, and after this we can apply some pattern classification technique, either unsupervised or supervised, for classification, for object recognition. This is the traditional pattern recognition system: from the input image, from the input signal, I have to extract the features, and based on these features we do the classification.

(Refer Slide Time: 07:32)

And you can see the difference between deep learning and machine learning. The first one is machine learning: I have the input image, first I have to extract the features, and after this I apply the classification algorithms and, based on this, I do the classification.

In case of deep learning, separate feature extraction is not needed; I am considering one network, say an artificial neural network, which can directly extract important features from the input image and simultaneously do the classification. That is deep learning: from the input image we directly extract some important features, and based on these features we go for classification. But in case of machine learning, first we have to extract the features and after this we go for classification.

(Refer Slide Time: 08:29)

So, here you can see what happens in case of deep learning. Suppose this is my input image. From the input image, I can extract some information: low-level information, mid-level information, high-level information, low-frequency information, high-frequency information, all this information I can extract.

So, first I extract the low-level parts, after this the mid-level parts, and finally the high-level parts. That is why it is called deep learning, because I want to extract more and more information from the input image, and it is nothing but a cascade of nonlinear transformations with end-to-end learning. So, this is the deep learning technique, and I have shown an interpretation of it.

(Refer Slide Time: 09:26)

So, what exactly is deep learning, and why is it generally better than other methods on images, speech, and certain other types of data? Deep learning means we are considering artificial neural networks having several layers of nodes between input and output, and the series of layers between input and output does feature identification and processing in a series of stages, just as our brain seems to do.

(Refer Slide Time: 09:56)

Multi-layer neural networks have been around for 25 years. So, what is new? Consider the difference between the artificial neural networks and the deep networks. In case of the artificial neural networks, good algorithms are available for learning the weights in networks with one hidden layer.

But if I consider more and more hidden layers, the existing algorithms used for the artificial neural networks may not be sufficient to train the network. That is why we consider deep networks, for which good learning algorithms are available so that we can train the hidden layers.

(Refer Slide Time: 10:45)

And if I want to compare deep learning versus machine learning, I can show some comparisons like this. One is the data requirement: deep learning requires a large amount of data, whereas machine learning can be trained on less data. Regarding accuracy, deep learning provides high accuracy, but machine learning generally gives lower accuracy. The training time is another parameter: for deep learning we need a longer training time, but machine learning takes less time to train. If I consider hardware dependency, deep learning requires a GPU to train properly, but machine learning can be trained on a CPU. Also, for hyper-parameter tuning, deep learning can be tuned in various different ways, but machine learning has limited tuning capabilities.

(Refer Slide Time: 11:45)

And I want to show the concept of overfitting. In the first figure, I am placing the decision boundary between the classes, and you can see it is not a good placement; if I consider this decision boundary, there will be misclassification. The second figure is a good compromise; I can consider a decision boundary like this. But in the third case, I am considering overfitting, overfitting in the training.

If you see the diagram of error versus these different cases, one is underfitting, one is a good compromise, and one is overfitting. Underfitting corresponds to high bias: the error is maximum during training and also maximum during testing. If I consider the good compromise, during training the error is less than in the underfitting case, and in testing also the error is less than in the underfitting case.

In case of overfitting, which corresponds to high variance, during training the error is very small and I am getting a decision boundary that fits the training data, but in testing the error is very high; that is overfitting. So, underfitting is not desirable and overfitting is also not desirable; that is why we have to consider the middle case, which is a good compromise. So, that is the concept of overfitting.

(Refer Slide Time: 13:38)

And in this case, I have shown these cases for regression: underfitting, the just-right compromise, and overfitting. The second one is classification; for classification also I have shown underfitting, which means high training error and high bias, with the training error close to the test error; then the just-right compromise, where the training error is slightly lower than the testing error; and then the overfitting case, with low training error, the training error much lower than the test error, which corresponds to high variance. After this, I have shown the same for deep learning: the training error and the testing error in all the cases, underfitting, just right, and overfitting. So, this is about underfitting, the just-right compromise, and overfitting.

(Refer Slide Time: 14:52)

Here also I have shown all the cases, one for regression and one for classification: underfitting, overfitting, and the just-right case in the middle. Regression is nothing but the fitting of a line through the sample points.

(Refer Slide Time: 15:12)

Pattern recognition algorithms are a bag of algorithms that can be used to provide some intelligence to a machine, and these algorithms have a solid probabilistic framework. For this, we define the classes and we need the features; the algorithms work on certain characteristics defining a class, referred to as features.

(Refer Slide Time: 15:39)

And what is a feature? Suppose I want to do the classification between these two alphabets, the uppercase I and the lowercase i; this is the pattern I am considering. So, what will be the feature for this classification between the uppercase and the lowercase letter?

(Refer Slide Time: 16:02)

The presence of a dot in the lowercase i can distinguish the small i from the capital I, and it is a feature. Feature values can be discrete or continuous in nature, and in most cases a single feature may not be sufficient for discrimination; I may need more features for discrimination between the classes.

(Refer Slide Time: 16:25)

So, in this case, I am giving one example, the classification of a fish. I am considering two classes: one is sea bass and the other one is salmon. So, how do we approach this classification problem?

(Refer Slide Time: 16:40)

So, we can consider the input images and extract some features; the features may be the length of the fish, the lightness of the fish, the weight of the fish, and so on, extracted from the input image.

(Refer Slide Time: 16:56)

So, after getting the image, we have to do some pre-processing, and we also have to do the image segmentation; the segmentation is important to isolate the fish from the background. In the image we have the background and the foreground, and the foreground is nothing but the fishes. That means we have to use a segmentation algorithm to isolate the fishes from one another and from the background, and after this we have to extract features.

(Refer Slide Time: 17:26)

So, this is the pipeline: the input images are available, first we do the pre-processing, after this we extract features, and after this we do the classification; one class is the salmon and the other one is the sea bass.

(Refer Slide Time: 17:41)

And in this case, I am considering only one feature, the length of the fish, and based on this I am counting the samples. Here you can see I can recognise salmon and sea bass, but there are some misclassifications; the misclassification is shown by the red one.

That means a sea bass is recognised as salmon, shown by the red one, and if I consider the black one, a salmon is recognised as sea bass. You can see the decision boundary between the two fishes, the salmon and the sea bass, but misclassification is happening: sea bass is recognised as salmon and salmon is recognised as sea bass. This is by considering only one feature.

(Refer Slide Time: 18:41)

In the next figure, I am considering another feature, the lightness of the fish, and again I am counting, which essentially gives the distribution. In this case also there are some misclassifications: the red one is the sea bass and the black one is the salmon, and the sea bass is recognised as salmon while the salmon is recognised as sea bass. So, misclassification is taking place because I am considering only one feature.

(Refer Slide Time: 19:16)

So, that is why we can consider two features, or maybe more features, for this classification problem. We have to minimise the misclassification, we have to increase the accuracy; in particular, we have to reduce the number of sea bass classified as salmon. For this, we have to consider decision theory. So, what decision theory can we consider?
consider.

(Refer Slide Time: 19:42)

You can see here, I am now considering two features: one is the lightness of the fish and the other one is the width of the fish. That is the feature vector I am considering; x is the feature vector.

(Refer Slide Time: 19:54)

And if I consider these two features, you can see it is a two-dimensional feature space: one feature is the lightness and the other feature is the width, and you can see the decision boundary between the two classes, one is the salmon and the other one is the sea bass.

In this case, the classification accuracy increases because of considering two features, but there are still some misclassifications happening. Even so, because of these two features the accuracy increases as compared to the previous cases, and for optimal performance we have to select the best decision boundary between the classes; if I consider a nonlinear decision boundary, the classification error between these two classes will be less.

(Refer Slide Time: 20:51)

The objective is dealing with novel data. First, we have to do the training and look at the performance on the training data, and after this the performance on new, novel data; the performance on both the training data and the testing data should be good.

(Refer Slide Time: 21:13)

So, in case of pattern recognition systems, what are the main steps? The first one is data acquisition, that is, sensing; in case of image classification, I have to consider image acquisition, or we can consider sensors for signal acquisition. After this we do the pre-processing, and after this the segmentation, that is, grouping. The patterns should be well separated and should not overlap; that we have to consider.

(Refer Slide Time: 21:49)

Here you can see the block diagram of a pattern recognition system: the first step is sensing, after this we do the segmentation, the next step is feature extraction, and since all the features may not be discriminative, we go for feature selection, and after this the classification and the post-processing. The post-processing is nothing but the system evaluation: we can determine the accuracy, the false positive rate, the true positive rate, and so on, and based on this we can adjust the parameters of the system; that is nothing but the feedback.

(Refer Slide Time: 22:31)

We can see here the next step is feature extraction, and we have to consider discriminative features. One important point is that the features should be invariant to geometric transformations such as translation, rotation, and scale.

After extracting the discriminative features, we can go for classification, and after this the post-processing, which I have already explained. To improve the performance, we can give feedback to all the steps, like feature extraction and feature selection; this feedback from the output to the inputs is important.

(Refer Slide Time: 23:13)

So, the design cycle is: first data collection, next feature selection, after this model selection, then we do the training, and after this the evaluation of the pattern recognition system. One important point is the computational complexity; we also have to determine the computational complexity of the system, which is important.

(Refer Slide Time: 23:37)

So, the same thing I am showing here: first the collection of data, after this the selection of the features and the selection of the model, after this we train the classifier, and after this we evaluate the classifier. Here you can see the feedback; I am considering the feedback so that we can improve the performance of the pattern recognition system.

(Refer Slide Time: 24:00)

And what is clustering? Clustering is unsupervised learning, and in this case we have to do the grouping. One important point is that we have to consider high intra-class similarity and low inter-class similarity; in grouping, these two properties are very important, and based on them we can do the clustering.

(Refer Slide Time: 24:31)

So, in this case, I am considering these images, and based on them we can do the grouping, but the clustering is subjective. If you see the second figure: if I consider the family, the cluster will be like this; if I consider the school employees, the cluster will be like that; if I consider only females, the grouping will be different; and if I consider only males, the grouping changes again. That is why clustering is subjective.

(Refer Slide Time: 25:03)

And in this case, as I have already explained, the feature vectors are available and, based on some similarity, we have to group these feature vectors. Here I am giving one example: in the first case I am considering these two images and I want to find the similarity between them.

So, maybe some Euclidean distance measure can be used for finding the similarity between these two images, and I get a similarity score of 0.23. Similarly, in the second case the similarity score is 3, and in the third case I am considering the similarity between two fingerprints and the similarity score is 342.7; these values can be normalised. So, by using a distance measure, I can find the similarity between the patterns.

(Refer Slide Time: 25:57)

And for this clustering, one popular algorithm is the k-means clustering, which I have already discussed in one of my classes, the image segmentation class. For this, I have to consider k cluster centres, which I can select randomly, and after this decide the class membership of the N objects by assigning them to the nearest cluster centre.

So, by measuring the distance between the sample points and the cluster centres, I can decide whether a particular sample point belongs to a particular cluster, based on the minimum distance. After this, I have to re-estimate the k cluster centres, and this process I have to repeat again and again; finally, the cluster centres will not move too much.

They will be fixed after some iterations, and corresponding to this I will be getting the cluster centres for the k (or c) classes. Pictorially, I can show this algorithm like this.

(Refer Slide Time: 27:16)

So, you can see I am considering some sample points, and I am randomly selecting the centroids k1, k2, k3. After this, I am assigning the sample points to a particular centroid based on a distance measure; I can consider the Euclidean distance.

Corresponding to this, I am getting three groups: one is green, another one is red, and another one is orange. After this assignment, the cluster centres move, because I have to recompute each cluster centre, that is, the centroid. After recomputing, you can see k1 is moving, k2 is moving, k3 is moving; the cluster centres have to be recomputed.

After the recomputation, I get the new positions of k1, k2, and k3, and after this I again have to decide which sample points belong to which centroid. For example, a point that initially belonged to the cluster centre k1 may now be closer to k2: I find its distance to k1, to k2, and to k3, and take the minimum distance.

Now, in this case the minimum distance is with k2, which means that this point should be assigned to the cluster centre k2. So, initially it was a green point, and after finding the distances it is assigned to the cluster centre k2. After this, I again have to recompute the centroids k1, k2, k3, which will shift again, and this process I have to do iteratively.

And finally, if there is no significant change in the values of k1, k2, k3, I stop the iteration, and those are the final positions of the centroids. So, in this case I will be getting the clusters: one cluster around k1, another around k2, and another around k3.
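As a compact illustration of this procedure (my own sketch, not the lecturer's code; the data and the choice of k = 3 are hypothetical), a basic k-means loop in Python looks like this:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means: assign points to the nearest centroid, then re-estimate the centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]        # random initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # centroids no longer move: stop
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data drawn around three centres
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (3, 3), (0, 3)]])
labels, centroids = kmeans(X, k=3)
print(centroids)
```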

(Refer Slide Time: 29:40)

And in this case, I have shown the concepts of supervised learning and unsupervised learning. In supervised learning, we know the class labels and we have the training samples for all the classes; you can see the labels like duck, duck, not duck, not duck, and so on, and we go for supervised learning and can do the prediction; that is the supervised learning algorithm. In case of unsupervised learning, we do the clustering: this is one cluster, this is another cluster, and so on, grouped based on some similarity.

(Refer Slide Time: 30:22)

Again, I am showing this: how to do the clustering. We have the training data set, and based on some similarity we do the clustering; you can see, this is one cluster, and like this we have the clusters, the groups. This is about unsupervised learning.

(Refer Slide Time: 30:45)

Now, I will discuss the concept of reinforcement learning. The idea is this: in some applications, the output of the system is a sequence of actions, and in such a case a single action is not important. What is important is the policy, that is, the sequence of correct actions to reach the goal, and there is no such thing as the best action in any intermediate state; in any intermediate step, I cannot say that this is the best action.

An action is good if it is a part of a good policy, and a machine learning programme should be able to assess the goodness of policies and learn from past good action sequences so as to be able to generate a policy. This is the concept of reinforcement learning. Briefly, I will explain with examples.

Consider the playing of chess, where a single move by itself is not that important; it is the sequence of right moves that is important, and a move is good if it is part of a good game policy. Another example is robot navigation.

A robot can move in one of a number of directions, and after a number of trial runs it should learn the correct sequence of actions to reach the goal state from an initial state, without hitting any of the obstacles. So, I have given these two examples, and for these we can apply reinforcement learning: one action is not important, the group of actions, the policy, is what matters. That is the main concept of reinforcement learning.

(Refer Slide Time: 32:48)

After this, the next concept is semi-supervised learning. Semi-supervised learning is a class of supervised learning tasks and techniques that also makes use of unlabelled data for training, typically a small amount of labelled data with a large amount of unlabelled data. So, in case of semi-supervised learning, we have a small amount of labelled data and a large amount of unlabelled data; it sits in between supervised learning and unsupervised learning.

(Refer Slide Time: 33:27)

The next concept is regression; it is similar to a curve-fitting problem over a set of sample points. Here I am considering some sample points and I am fitting a curve, as you can see in the first figure and the second figure.

Briefly, linear regression is a statistical method that allows us to model the relationship between a dependent variable and one or more independent variables, and this can be done by fitting a linear equation to the observed data. That is the definition of regression.

Suppose I give one example: we want a system that can predict the price of a used car. What are the inputs? The inputs are the car attributes. What are the attributes? I can consider the brand, the year of manufacturing, the engine capacity, the mileage, or maybe some other information.

So, based on these attributes, I can determine the price of the used car; the output is the price of the car. This is a regression problem. So, briefly, regression is a statistical method that allows us to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data; here you can see I am considering some observed data and I am fitting a curve.
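To make the idea concrete, here is a minimal least-squares linear regression sketch (my own illustration, not from the lecture); the car-age attribute, the prices, and all numbers are hypothetical.

```python
import numpy as np

# Hypothetical data: car age in years (independent variable) vs. resale price (dependent variable)
age = np.array([1, 2, 3, 5, 7, 9], dtype=float)
price = np.array([9.0, 8.1, 7.3, 5.8, 4.4, 3.2])

# Fit price = a * age + b by ordinary least squares
A = np.vstack([age, np.ones_like(age)]).T
(a, b), *_ = np.linalg.lstsq(A, price, rcond=None)

print(f"fitted line: price = {a:.2f} * age + {b:.2f}")
print("predicted price for a 4-year-old car:", a * 4 + b)
```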

(Refer Slide Time: 36:18)

Next, I am considering a classifier. In this example, I am considering a two-dimensional feature space with features x1 and x2, and two classes, one shown as red plus marks and the other one as green plus marks, and you can see the decision boundary between the classes. So, this is one example of a classifier, which I already explained in my last class.

(Refer Slide Time: 36:49)

And one concept is empirical risk minimization. Based on this principle, one quantity is the probability of error, which I can determine; in my next class I will define what the probability of error is. The other one is the risk. These are two measures by using which we can take a particular classification decision.

So, we can determine the probability of error and also the risk. Risk means that if I consider a particular action, there is a risk associated with that action; I can determine this conditional risk, and based on the conditional risk I can take a classification decision. That is empirical risk minimization: I have to minimise the risk.

(Refer Slide Time: 37:48)

And one is the no free lunch theorem. What is its meaning? It is impossible to get something for nothing: in view of the no free lunch theorem, one cannot hope for a classifier that would perform best on all possible problems that one could imagine. That means, if you consider one classification algorithm, it may be suitable for a particular application but may not be suitable for many other applications; it cannot be generalised. That is the concept of the no free lunch theorem.

(Refer Slide Time: 38:30)

And now the classifier taxonomy. There are two types of classifiers: one is the generative classifiers and the other one is the discriminative classifiers. In case of generative classifiers, we have two categories: one is the parametric classifier and the other one is the nonparametric classifier. So, let us see what a generative classifier is.

(Refer Slide Time: 38:58)

So, in case of the generative classifier, the training samples of a class are assumed to come from a probability density function, the class-conditional pdf. If you remember Bayes' theorem, the posterior probability of wj given x, which I want to determine, is equal to P(x | wj) times the prior P(wj), divided by the evidence.

The evidence has no role in classification; it is simply a normalising factor. Now, consider the class-conditional density, the class-conditional pdf. Suppose the density form is known; that means the density form may be the uniform density, the Gaussian density, and so on, and in this case, if I know the density form, I only have to determine its parameters.

For example, in case of the Gaussian distribution I have two parameters, the mean and the variance, which I can determine if I know the density form. Based on this class-conditional density, I can take a classification decision, because to determine the probability of wj given x I have to know the probability of x given wj and also the prior probability.

So, by using Bayes' law we get this, and if the density form, that is, the form of the likelihood, is known, then it is called a generative classifier. Sometimes the density form may not be available, and in that case we have to estimate the density. So, there are two cases: in the first case, the density form is known and we have to estimate the parameters.

In the second case, the density form is not known, not available, and then we have to estimate the density itself. In the first case, where the density form is known, the parameters may be the mean and the variance or, in the high-dimensional case, the mean vector and the covariance matrix. In the second case, since the density form is not available, we have to estimate the density.

So, first I will consider the parametric classifier. What is the parametric classifier? It means the density form is known and we have to estimate the parameters. In case of the nonparametric classifier, the density form is not known, and we have to estimate the density; that is called a nonparametric classifier.

(Refer Slide Time: 42:28)

In case of the parametric classifier, we know the pdf, the probability of x given w1, that is, the likelihood, and similarly the probability of x given w2 is also available. From these, we can take the classification decision. This is the parametric classifier, where the density form is known.
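As a small illustration (my own sketch, not the lecturer's code; the training samples are hypothetical), when the density form is assumed to be Gaussian, the parameters can be estimated from the training samples of each class and then used as the likelihoods:

```python
import numpy as np

# Hypothetical 1-D training samples for the two classes w1 and w2
samples_w1 = np.array([1.8, 2.1, 2.4, 1.9, 2.2])
samples_w2 = np.array([4.7, 5.2, 5.0, 4.9, 5.4])

# Estimate the Gaussian parameters (mean and variance) for each class
mu1, var1 = samples_w1.mean(), samples_w1.var()
mu2, var2 = samples_w2.mean(), samples_w2.var()

def likelihood(x, mu, var):
    """Gaussian class-conditional density p(x | w) with the estimated parameters."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

x = 3.0
print("p(x|w1) =", likelihood(x, mu1, var1), " p(x|w2) =", likelihood(x, mu2, var2))
```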

(Refer Slide Time: 42:57)

So, in this case, I am considering one generative classifier with two classes, one is the salmon and the other one is the sea bass, and two features, the lightness of the fish and the width of the fish, and you can see the decision boundary between the classes.

So, this is my decision boundary in case of the generative classifier. As I have already mentioned, one approach is parametric and the other one is nonparametric: in the parametric approach, the density form is available but we have to estimate the parameters, and in the nonparametric approach, the density form is not known and we have to estimate the density.

(Refer Slide Time: 43:43)

So, you can see here the nonparametric approach: from the training samples, we have to estimate the density itself; that is what we have to determine. This is about the generative classifiers. Now consider the discriminative classifier.

Here, there is no assumption of the data being drawn from an underlying pdf; the class-conditional density is not important. What is important is that the classifier models the decision boundary directly, by adopting the gradient descent technique; in this way, we can find the decision boundary.

(Refer Slide Time: 44:40)

So, what is the discriminative classifier? Start with initial weights that define the decision surface, and after this update the weights based on some optimization criterion. In this case, there is no need to model the distribution of the samples of a given class; the class-conditional density is not important for the discriminative classifier.

(Refer Slide Time: 45:18)

So, some examples of discriminative classifiers are neural networks, like the multi-layer perceptron and the single-layer perceptron, and support vector machines.

(Refer Slide Time: 45:34)

So, first I am considering some initial weights, and after this I am adjusting the weights to find the best possible decision boundary between the classes. In this example, I am showing two classes, class 1 and class 0, and I want to find the decision boundary between them. First, I consider one initial decision boundary with its corresponding weights, and then I adjust the weights so that I can find the best decision boundary between the classes.
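A minimal sketch of this idea (my own illustration, using the classic perceptron update rule as one simple instance of such weight adjustment; the data are hypothetical):

```python
import numpy as np

# Hypothetical linearly separable 2-D data with labels 1 and 0
X = np.array([[2.0, 1.0], [1.5, 2.0], [3.0, 3.0], [-1.0, -0.5], [-2.0, -1.5], [-0.5, -2.0]])
y = np.array([1, 1, 1, 0, 0, 0])

w = np.zeros(2)   # initial weights defining the decision surface
b = 0.0           # initial bias
lr = 0.1          # learning rate

for epoch in range(20):
    for xi, yi in zip(X, y):
        pred = 1 if np.dot(w, xi) + b >= 0 else 0
        # adjust the weights whenever the prediction is wrong
        w += lr * (yi - pred) * xi
        b += lr * (yi - pred)

print("learned weights:", w, "bias:", b)
```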

(Refer Slide Time: 46:09)

And in case of a discriminative classifier, we have linearly separable data and also non-linearly separable data. In this example, I have two classes, one is the plus and the other one is the minus, and I can easily draw the decision boundary between them; similarly, in the second figure also I have two classes and I can easily draw the decision boundary. That is why this is linearly separable data.

(Refer Slide Time: 46:45)

And corresponding to this linearly separable data, consider the expression w1 x1 + w2 x2 + b, where b is the bias, w1 and w2 are the weights, and x1 and x2 are the features; I am considering the two-dimensional feature space, and you can see the decision boundary.

So, the equation of the separating line is w1 x1 + w2 x2 + b = 0. If I consider the green class, the corresponding inequality is w1 x1 + w2 x2 + b < 0, and similarly, for the red class, the inequality is w1 x1 + w2 x2 + b > 0. Based on this formulation, I can distinguish the two classes, the red class and the green class, and you can also see the equation of the separating line between the classes.

(Refer Slide Time: 48:05)

If you see this example, again I am considering two classes, the blue class and the red class, that is, the plus and the minus, and in this case it is very difficult to draw a linear decision boundary between the classes. That is why I am considering one boundary like this and another boundary like that; by using these two boundaries, I am separating the two classes. That is why this is an example of non-linearly separable data; previously I showed linearly separable data.

(Refer Slide Time: 48:44)

And one important theorem is Cover's theorem. What is Cover's theorem? The theorem states that given a set of training data that is not linearly separable, one can transform it into a training set that is linearly separable by mapping it into a possibly higher-dimensional space via some nonlinear transformation. I am explaining this concept in my next slide: what is the importance of Cover's theorem?

(Refer Slide Time: 49:16)

So, here you can see in the first figure, figure (a), I am considering some sample points of the original data that is in 2D, in two dimensions, which is not linearly separable. So, what can we do? I can apply some nonlinear transformation, so that this original data can be mapped into a higher dimensional space.

So, the original data is in a two-dimensional space, but after this transformation I am considering a three-dimensional space, that is, the higher dimensional case. In the higher dimensional space, these samples will be linearly separable. So, that means in 2D it was not linearly separable, but in 3D it is linearly separable. That is the concept of Cover's theorem.
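
To illustrate the idea behind Cover's theorem, here is a small hedged Python sketch (my own toy example, not from the slides): 2D points inside and outside a circle are not linearly separable, but after the nonlinear map (x1, x2) -> (x1, x2, x1^2 + x2^2) they become separable by a plane of the form z = constant.

# Hypothetical 2D data: class 0 inside a circle of radius 1, class 1 outside.
data = [((0.2, 0.1), 0), ((-0.5, 0.3), 0), ((1.5, 0.2), 1), ((-1.2, -1.0), 1)]

def lift(x1, x2):
    # Nonlinear transformation into 3D, as suggested by Cover's theorem.
    return (x1, x2, x1 * x1 + x2 * x2)

for (x1, x2), label in data:
    z = lift(x1, x2)[2]
    # In 3D the plane z = 1 separates the two classes for this toy data.
    predicted = 0 if z < 1.0 else 1
    print((x1, x2), "lifted z =", round(z, 2), "label", label, "predicted", predicted)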

(Refer Slide Time: 50:17)

For a pattern classification problem, I have to consider some evaluation metrics. One is like this: suppose we have the predicted class and we have the actual class. If the actual class is yes and the predicted class is yes, that means it is a true positive. If the actual class is no and the predicted class is yes, that is a false positive. If the actual class is yes and the predicted class is no, then it is a false negative, and similarly the true negative I can determine. So, one is the actual class and the other is the predicted class, and these parameters, the true positive, the false positive, the false negative and the true negative, I can determine from the actual class and the predicted class.

(Refer Slide Time: 51:11)

And these parameters are generally used in a pattern classification problem: the true positive, the false negative, the false positive and the true negative. The true positive rate you can also determine; that is nothing but TPR = TP / (TP + FN). The false positive rate you can also determine; that is FPR = FP / (FP + TN).

Accuracy you can also determine; that is (TP + TN) / (TP + FN + FP + TN). And these are some important parameters: one is the precision, precision is nothing but TP / (TP + FP), and you can also determine recall, recall = TP / (TP + FN), and the specificity, specificity = TN / (TN + FP). So, these are the parameters.
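
For concreteness, a small Python sketch (the counts below are made up, not from the lecture) that computes these metrics from TP, FP, FN and TN:

# Hypothetical counts from a two-class experiment.
TP, FP, FN, TN = 90, 10, 5, 95

tpr         = TP / (TP + FN)                    # true positive rate (= recall)
fpr         = FP / (FP + TN)                    # false positive rate
accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
specificity = TN / (TN + FP)

print(tpr, fpr, accuracy, precision, recall, specificity)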

(Refer Slide Time: 52:06)

And one thing is that you can determine the ROC, the receiver operating characteristic curve. You can see the ROC curve showing TPR against FPR for a particular pattern matching algorithm: I am plotting the TPR, that is the true positive rate, against the FPR, and you can see the ROC curve. In the second figure, what I am showing is the distribution of positive and negative matches, that means the true positives and the true negatives, shown as a function of the inter-feature distance d. You can see I am also getting the FP and FN, that is, the false positives and false negatives, and you can see a threshold is considered. So, based on this threshold, you can determine the true positives and the true negatives.

(Refer Slide Time: 53:09)

And finally, I want to show another important matrix, that is called the confusion matrix. This is used to show the performance of a pattern classification algorithm. In this example, you can see I am considering the actual class labels, and also the number of test patterns assigned to the different classes. So, what is the meaning of 137? It means the pattern 1 is recognised as 1 how many times? 137 times. The pattern 1 is recognised as 2, 13 times; the pattern 1 is recognised as 3, 3 times.

So, like this, I can determine the confusion matrix, and from this you can determine the accuracy, the misclassification rate, and also the rejection rate. Rejection rate means how many times a pattern is not recognised as any of 1, 2, 3, 4, 5, 6, 7, 8, 9. Suppose I am drawing a character like this; it is not recognised as 1, 2, 3, 4, 5, 6, 7, 8 or 9, so it is rejected. From this I can determine the rejection rate.

Similarly, what is 55? That means 2 is recognised as 2 how many times? 55 times. The 2 is recognised as 1 how many times? 1 time; that is a misclassification. And 2 is recognised as 3 how many times? 1 time; that is also a misclassification. And you can see, from this information I can determine the accuracy.
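
As a hedged illustration (the numbers below are made up, not the ones on the slide), accuracy and the misclassification rate can be computed from a confusion matrix like this:

# Rows: actual class, columns: predicted class (toy 3-class confusion matrix).
confusion = [
    [137, 13, 3],
    [  1, 55, 1],
    [  2,  4, 60],
]

total    = sum(sum(row) for row in confusion)
correct  = sum(confusion[i][i] for i in range(len(confusion)))   # diagonal entries
accuracy = correct / total
misclassification_rate = 1.0 - accuracy

print("accuracy =", accuracy, "misclassification rate =", misclassification_rate)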

So, in the confusion matrix, the correct classifications lie along the diagonal; if I get high values along the diagonal, that corresponds to high accuracy of the pattern classification algorithm. In this class, I discussed the concept of pattern classification. I discussed the concept of supervised and unsupervised learning. After this, I briefly highlighted the concept of semi-supervised learning and reinforcement learning. After this, I discussed the concept of the generative classifier and the discriminative classifier. So, let me stop here today. Thank you.

Computer Vision and Image Processing – Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati, India
Lecture – 32
Introduction to Machine Learning

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing- Fundamentals
and Applications. I have been discussing the concept of Machine Learning. Today I am going to continue the same topic, that is, the concept of Machine Learning. So, first I will discuss
the concept of Regression and after this I will discuss the concept of Bayesian Decision Theory.

(Refer Slide Time: 01:01)

So, I will now explain the fundamental concept of regression. Linear regression is a statistical method that allows us to model the relationship between a scalar response, that is the dependent variable, and one or more independent variables. That means I want to find the relationship between the scalar response, the dependent variable, and one or more independent variables.

This is done by fitting a linear equation to the observed data. The dependent variable is also called the response or outcome, and the independent variable is called the predictor or the regressor. Suppose we have only one independent variable; this is called Simple Linear Regression. And suppose I consider two or more independent variables; then in this case it is Multiple Regression.

So, this method looks for the statistical relationship between the dependent variable and the independent variable. For example, given a temperature in degrees Celsius, we can find the exact value in Fahrenheit. Let us now consider the simplest case, that is, Simple Linear Regression, where we have only one dependent and one independent variable, and simple linear regression boils down to the problem of line fitting in the 2D x-y plane.

So, suppose we are given a set of points in the x-y plane; linear regression attempts to find a line in 2D which best fits the points. That is the concept of linear regression. The most popular method of fitting a line is the method of least squares; as the name suggests, this method minimizes the sum of the squares of the vertical distances from each data point to the line.

Now, the question is how to find the best line, and that is the objective of regression. Suppose that the slope and the intercept of the required line are a and b; I am considering the slope a and the intercept b. Then the predicted value on the line will be ŷ_i = a x_i + b, and the error between the actual point and the line can be determined. In these two figures I am showing the concept of polynomial fitting.

So, you can see the observed data, the sample points, and I am fitting a curve through these observed data points; the same concept I am showing in the second figure also, that is polynomial fitting. Mainly we want to reduce the error: these are my observed data points, this is the curve I am fitting through them, and the objective is to minimize the error.

So, that is the objective of regression. You can see that I can determine the error like this: e_i = y_i − ŷ_i.

(Refer Slide Time: 04:29)

And the concept of regression is that I have to minimize the error. So, you can see I can compute the error like this: E = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)². For n sample points, I am determining the average squared error, and this is the expression for E. The objective is to find the slope a and the intercept b which give the minimum error; already I have defined a line with slope a and intercept b.

So, the objective is to minimize the error, and corresponding to this we have to determine the slope a and the intercept b. To find the required values of a and b, we have to consider the partial derivatives of E with respect to a and b, that is, ∂E/∂a = 0 and ∂E/∂b = 0. So, the objective is to find the slope a and the intercept b which give minimum error E.

(Refer Slide Time: 05:45)

And after this you can see, if I consider the equation ∂E/∂a = 0, then based on this I can get one equation; you can do the differentiation, and after doing the differentiation I will be getting this first equation. Similarly, if I set ∂E/∂b = 0, then I will be getting a second equation.

So, I will be getting these two equations; if I call them equation number 1 and equation number 2, these two equations are linear equations in two variables. Hence, they can be easily solved to find the values of a and b, where a is the slope and b is the intercept. So, by solving these two equations, I can determine the values of a and b, the slope of the line and the intercept.
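
A minimal Python sketch of this least squares solution (assuming the usual closed-form normal equations; the sample points below are made up):

# Simple linear regression y = a*x + b by least squares (toy data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n   = len(xs)
sx  = sum(xs)
sy  = sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Solving the two normal equations obtained from dE/da = 0 and dE/db = 0.
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
b = (sy - a * sx) / n                           # intercept

print("slope a =", a, "intercept b =", b)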

So, in this case, I have shown only the simple case of linear regression. If you want to see polynomial fitting, how to fit a polynomial through the observed data points, you can refer to the books. In my discussion I have considered only the simplest regression model, in which I have only one dependent variable and one independent variable. So, this is the fundamental concept of regression; after this I will consider

(Refer Slide Time: 07:16)

the Bayesian decision theory. Already I have explained the concept of the Bayes theorem. Here you can see the posterior density: the posterior is equal to the likelihood times the prior divided by the evidence, P(wi|x) = P(x|wi) P(wi) / P(x), and if I consider two classes, then P(x) = P(x|w1) P(w1) + P(x|w2) P(w2). In Bayesian decision theory, this P(x) is the normalizing factor, the evidence; it has no role in classification, so we can neglect it.

So, since the evidence has no role in classification, we only have to consider the likelihood and the prior. For a particular feature vector x, we have to determine the class, that is, the probability of obtaining a particular class given the feature vector x. That is the objective of Bayes decision making.

(Refer Slide Time: 08:20)

And you can see the decision rule will be like this. Suppose x is the feature vector and I consider two classes, w1 and w2. If P(w1|x) > P(w2|x), then in this case I have to decide the class w1, and similarly, if P(w1|x) < P(w2|x), then the corresponding class will be w2.

So, based on this principle I can do the classification. Suppose I have the condition P(w1|x) = P(w2|x); then in this case we have to look at the prior probabilities P(w1) and P(w2), and based on these we can take a classification decision.

So, how to do the classification now? Suppose I have the feature vector x, that is the input, and I can determine the values P(w1) P(x|w1), P(w2) P(x|w2), ..., P(wc) P(x|wc). From these I will be getting P(w1|x), P(w2|x), ..., P(wc|x), and I have to pick the largest one.

So, out of these, I have to pick the largest, and based on this I can do the classification. My input is x, and I have to determine the values P(w1|x), P(w2|x) and so on; out of these I pick the largest, and based on this I take the classification decision.
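
A small hedged Python sketch of this rule (the priors and likelihood values below are invented for illustration): compute P(wi) P(x|wi) for each class and pick the largest.

# Hypothetical priors and class conditional likelihood values for one feature vector x.
priors      = {"w1": 0.5, "w2": 0.3, "w3": 0.2}
likelihoods = {"w1": 0.10, "w2": 0.40, "w3": 0.05}   # P(x | wi) evaluated at the given x

scores = {c: priors[c] * likelihoods[c] for c in priors}   # proportional to P(wi | x)
decided_class = max(scores, key=scores.get)                # pick the largest

print(scores, "-> decide", decided_class)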

And also I can determine the probability of error for a given feature vector. So, what is the probability of error? P(error|x) = P(w1|x) if we decide the class w2, and P(error|x) = P(w2|x) if we decide the class w1. Like this I can define the probability of error.

(Refer Slide Time: 12:22)

So, we have to minimize the probability of error. How to minimize this? Decide the class w1 if P(w1|x) > P(w2|x); otherwise decide the class w2. Based on this principle, the probability of error is minimized. So, I have to select the appropriate class, and that class is selected from the posterior probabilities P(w1|x) and P(w2|x).

(Refer Slide Time: 12:58)

And you can see the average probability of error I can determine like this: averaging P(error|x) over all feature vectors, and we have to minimize this error. How to minimize this error? The probability of error is P(w1|x) if x is assigned to the class w2, and it is P(w2|x) if x is assigned to the class w1.

So, we have to minimize this probability of error; we want P(error|x) to be as small as possible for every value of x, so that the error over all feature vectors is minimized. I can show this here. Suppose this is my x, the feature vector, and I am plotting the class conditional densities, P(x|w1) for the class w1 and P(x|w2) for the class w2. The overlapping portion of these two curves is the area corresponding to the probability of error.

This is the area of the probability of error. So, this is the concept of the probability of error, and based on the probability of error, we can take a classification decision.

(Refer Slide Time: 15:15)

And suppose I have C number of classes, w1, w2, ..., wc, and assume that we have a d-dimensional feature vector. If you consider the Bayes rule, P(wj|x) = P(x|wj) P(wj) / P(x), where x is the d-dimensional feature vector, P(x|wj) is the likelihood, P(wj) is the prior probability and P(x) is the evidence.

So, we have to compute the posterior probability corresponding to the feature vector x, and for decision making we assign a pattern to the class for which the posterior probability is the greatest.

(Refer Slide Time: 16:12)

So, you can see the same thing I am showing here: w_test = argmax_j P(wj|x), and the posterior probability is given by the Bayes rule. The evidence, as I have already explained, is nothing but the normalization factor; it is the same for all the classes, so it has no role in classification. So, how to do the classification? w_test is the class for which the posterior probability is the highest.

So, based on this we can do the classification; that means the pattern is assigned to that particular class. We determine the posterior probabilities, find the highest posterior probability, and based on this we take the classification decision: the pattern is assigned to that class.

(Refer Slide Time: 17:05)

After this, another technique of decision making is by considering the risks. By considering the risks, we can also take a classification decision. Suppose I have C number of classes, w1, w2, ..., wc, and I am considering some actions, α1, α2 and so on; I have a number of actions, and based on this I can consider the definition of loss.

So, what is the loss? I will explain. Suppose x is the feature vector, and x may belong to a particular class, say w1 or w2, where the true class is the true state of nature. The feature vector x may be assigned to the class w1 or maybe w2, and corresponding to this I am taking some actions.

So, what are my actions? I take the action αi, and corresponding to this I can determine the loss. What is my loss? The loss is λ(αi|wj), that is, the loss incurred when the action αi is taken and the true class is wj. So, the loss is defined like this.

So, the loss is nothing but λij, which means the action αi is taken when the true class is wj. I can show this again: suppose I have the classes w1, w2, ..., wk, so K number of classes, and I am taking actions like α0, α1, ..., αk'. So, these are my classes and these are my actions.

So, corresponding to w1 I can take the action α0, and similarly, corresponding to w1 I can also take the action αk'. Corresponding to this, the loss will be λ(αk'|w1); that is the loss I can determine. Suppose one of the actions is the reject option. In pattern classification the reject option is very important; it means a particular feature vector may not be assigned to any one of the classes, and then I can consider the reject option.

Suppose, in alphabet recognition, the alphabets are A, B, C, D and so on, and I am writing something that is not an English alphabet. Then I have to consider the reject option, because it does not belong to any of the classes, that is, the alphabets. So, this is the concept of the loss: the action αi is taken for the class wj and corresponding to this the loss is λij.

(Refer Slide Time: 20:55)

And from this you can determine the expected loss, and also the conditional risk: taking the action αi for a feature vector x, the conditional risk is R(αi|x) = Σ_j λ(αi|wj) P(wj|x), where λ(αi|wj) is nothing but λij.

At every x a decision is made, and we have to minimize the expected loss. The final goal is to minimize the total risk over all feature vectors; that is the objective of risk minimization. So, by considering this, we can do the classification.
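
A minimal sketch of risk-based decision making (the loss values and posteriors below are hypothetical): compute R(αi|x) = Σ_j λij P(wj|x) for each action and pick the action with minimum conditional risk.

# Rows: actions (alpha_i), columns: true classes (w_j); loss[i][j] is lambda_ij.
loss = [
    [0.0, 2.0],   # losses for action alpha_1
    [1.0, 0.0],   # losses for action alpha_2
]
posteriors = [0.3, 0.7]   # hypothetical P(w1|x), P(w2|x)

risks = [sum(loss[i][j] * posteriors[j] for j in range(len(posteriors)))
         for i in range(len(loss))]
best_action = min(range(len(risks)), key=lambda i: risks[i])

print("conditional risks:", risks, "-> take action alpha_%d" % (best_action + 1))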

(Refer Slide Time: 21:58)

So, for two-category classification, I am considering the action α1 corresponding to the class w1 and the action α2 corresponding to the class w2, and based on this I can define the loss function λij, which means the action αi is taken when the true class is wj. From this we can determine the conditional risk.

So, pictorially I can show it like this: w1 and w2 are the classes, and α1 and α2 are the actions. What is λ11? λ11 is nothing but λ(α1|w1), the loss of taking the action α1 when the class is w1. Similarly, I can consider λ12, λ21 and λ22. So, I can determine these, and after this we can determine the conditional risks.

(Refer Slide Time: 23:10)

After this we have to minimize the risk. So, what will be our decision rule? If the conditional risk R(α1|x) is less than R(α2|x), then in this case we have to take the action α1, that means we decide the class w1. This is equivalent to the condition (λ21 − λ11) P(w1|x) > (λ12 − λ22) P(w2|x), because if you just substitute the expressions for the conditional risks, you will get this.

So, we decide the class w1 if this condition is satisfied; otherwise, we decide the class w2. That is my decision rule.

(Refer Slide Time: 24:08)

So, the previous rule is equivalent to the following rule: from the previous equation you get the ratio P(x|w1) / P(x|w2), and that ratio we have to determine. If it is greater than the threshold (λ12 − λ22) P(w2) / ((λ21 − λ11) P(w1)), then based on this we can take a classification decision.

If this condition is satisfied, then we take the action α1, and the corresponding class is w1. Otherwise we take the action α2, and the corresponding class is w2. So, based on this decision rule, we can do the classification; we select either w1 or w2.

(Refer Slide Time: 25:04)

And this ratio is called the likelihood ratio. From the previous slide you can see, I am considering the threshold θk, and if the likelihood ratio is greater than this threshold θk, then based on this we can take a classification decision. So, if the likelihood ratio is greater than this particular threshold, then we have to consider w1; otherwise, we have to consider w2.

And you can see this threshold θk is independent of the feature vector x. So, we determine the likelihood ratio, compare it with the threshold, and based on this we take a classification decision. So, you can see that based on the risks, we can take a classification decision.
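
As a hedged sketch of this likelihood ratio test (the likelihood values, losses and priors below are invented):

# Two-class likelihood ratio test with a risk-based threshold (toy numbers).
p_x_w1, p_x_w2 = 0.12, 0.04               # likelihoods P(x|w1), P(x|w2) at the observed x
P_w1, P_w2     = 0.4, 0.6                 # priors
l11, l12, l21, l22 = 0.0, 3.0, 1.0, 0.0   # losses lambda_ij

ratio     = p_x_w1 / p_x_w2
threshold = ((l12 - l22) * P_w2) / ((l21 - l11) * P_w1)   # theta_k

decision = "w1 (action alpha_1)" if ratio > threshold else "w2 (action alpha_2)"
print("likelihood ratio =", ratio, "threshold =", threshold, "->", decision)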

(Refer Slide Time: 26:10)

Now, I am defining one function that is called the zero-one loss function. Suppose the actions are decisions on classes, that is, we are taking some actions for decision making. If the action αi is taken and the true state of nature is wj, then the decision is correct if i is equal to j and not correct if i is not equal to j. We consider this zero-one loss function, and based on this we can take a classification decision. The objective is to minimize the probability of error.

(Refer Slide Time: 26:47)

Now, I am defining the zero-one loss function: λ(αi|wj) = 0 if i = j, that means the loss will be 0 if i is equal to j; otherwise, if i is not equal to j, the zero-one loss is 1. After this I am considering the conditional risk; the formula for the conditional risk you already know, R(αi|x) = Σ_j λ(αi|wj) P(wj|x), and from this I am getting R(αi|x) = Σ_{j≠i} P(wj|x), because only when j is not equal to i is the value of the zero-one loss function 1.

So, the loss is 1 whenever j is not equal to i, and corresponding to this the conditional risk is nothing but 1 − P(wi|x). That is the meaning of this. To minimize the risk, I have to select the maximum posterior probability P(wi|x); that means the decision rule will be like this.

Decide wi if P(wi|x) > P(wj|x) for all j not equal to i. So, this is my decision rule, and this classification technique is called minimum error rate classification, because I have to minimize the error.

(Refer Slide Time: 29:08)

So, for minimization of the risk, we have to maximize this probability, since the conditional risk is equal to 1 − P(wi|x), and I have to minimize the error. So, what is my classification rule? Decide wi if P(wi|x) > P(wj|x) for all j ≠ i. That is my classification rule, and this is the concept of minimum error rate classification.

(Refer Slide Time: 29:35)

And based on the discriminant function, I can obtain different types of decision surfaces. This concept I am going to explain: what is the discriminant function and from this how to determine the decision boundaries, the decision surfaces. Let us consider the multi-category case, that is, multiple classes, and for this I am considering a discriminant function gi(x); for each and every class, I have to determine a discriminant function.

So, I have to determine g1(x), g2(x), g3(x) and so on for all the classes. The classifier assigns a feature vector x to a particular class, say wi, if gi(x) > gj(x) for all j not equal to i. So, based on the discriminant functions, I can take a classification decision; that means, if gi(x) is greater than gj(x) for all j ≠ i, then I decide the class wi, that is, the feature vector x is assigned to the class wi.

(Refer Slide Time: 30:56)

And here you can see I am considering the input feature vector x, which is a D-dimensional feature vector, that means x1, x2, ..., xD. After this, you can see I am determining the discriminant function for all the classes.

So, I am considering C number of classes, and I am determining g1(x), g2(x) and so on. For classification, I have to find the maximum discriminant function: out of g1(x), g2(x), ..., gc(x), I have to determine which one is the maximum. So, you can see I take the maximum discriminant function, and based on this I take the classification decision.

So, what is the function of the discriminant functions? They divide the feature space into C decision regions R1, R2, ..., Rc, and if gi(x) is greater than gj(x) for all j not equal to i, then x is in the region Ri, which means x is assigned to the class wi. That is the decision rule. And what is the decision boundary?

The decision boundary is nothing but gi(x) = gj(x); that is the equation of the decision boundary. So, for C number of classes we have to determine C number of discriminant functions.

(Refer Slide Time: 33:04)

So, you can see, the maximum value of the discriminant function corresponds to the minimum risk. That means, for minimum error rate, gi(x) can be taken as P(wi|x), the posterior probability; so the maximum discriminant function corresponds to the maximum posterior probability.

I am repeating this: the maximum discriminant function corresponds to the maximum posterior probability. And what is the posterior probability? It is nothing but P(x|wi) P(wi) / P(x), and I can write gi(x) = P(x|wi) P(wi), because the evidence has no role in classification. After this I can take the natural logarithm, because the multiplication is converted into addition by the logarithm. So, I have the discriminant function gi(x) = ln P(x|wi) + ln P(wi).

(Refer Slide Time: 34:26)

So, the decision does not change when the discriminant function is scaled by some positive constant, and the decision is also not affected when a constant is added to all the discriminant functions. That is a property of the discriminant functions. Suppose I consider two classes; for the first class the discriminant function is g1(x).

For the second class, suppose the discriminant function is g2(x). Then what will be my classification rule? If g1(x) is greater than g2(x), then x will be assigned to the class w1. And what will be my decision boundary? The decision boundary will be g1(x) = g2(x); that is the equation of the decision boundary.

So, this equation, g1(x) − g2(x) = 0, is the equation of the decision boundary; I can write it as g(x) = 0, where g(x) = g1(x) − g2(x). That is the equation of the decision curve. If I consider these two classes, then in the feature space I have two regions, R1 and R2. And what will be the equation of the decision boundary? The equation of the decision boundary is g(x) = 0, and that is the equation of the curve, or in this case the line, that I am considering as the decision boundary. I am considering two regions R1 and R2; in the region R1, g(x) is greater than 0, and in the region R2, g(x) is less than 0. So, g(x) greater than 0 means I am considering the class w1, and g(x) less than 0 means I am considering the class w2.

(Refer Slide Time: 36:43)

The feature space is divided into c decision regions, and if gi(x) is greater than gj(x) for all j not equal to i, then x is in Ri; that means the feature vector x will be assigned to the class wi. For the two-category case, that is, for two classes, I have to determine g1(x) and g2(x), and g(x) = g1(x) − g2(x); we can take a classification decision based on this: decide w1 if g(x) is greater than 0, otherwise decide w2.

(Refer Slide Time: 37:35)

And for gi(x) I already have the equation of the discriminant function, and corresponding to this you can see that g(x) is nothing but P(w1|x) − P(w2|x); that is, g(x) = g1(x) − g2(x). From this you can substitute the values, because P(w1|x) is nothing but P(x|w1) P(w1) divided by the evidence P(x). So, if I put this value in the equation, then you will get the first term.

Similarly, we also have the other term, P(w2|x) = P(x|w2) P(w2) / P(x). So, if I put these two values, you will get g(x), which is nothing but g1(x) − g2(x).

(Refer Slide Time: 38:54)

Now, I will discuss the concept of the normal distribution. Already you know what a normal distribution is. Here you see I am showing the density function corresponding to the normal distribution, P(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)). That is the normal distribution, and corresponding to this you can see I have the bell-shaped distribution.

So, this is the bell-shaped distribution corresponding to that PDF, P(x), and this is the normal distribution with mean 0. From this you can determine the expected value or the mean of x, E[x], and also the variance of x.

Suppose one Gaussian has variance σ₁², and I consider another Gaussian function with variance σ₂², where σ₁² > σ₂². You can see that the variance determines the spread of the Gaussian function. So, this is about the normal or Gaussian distribution.
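
A tiny Python sketch of this density (the mean and the two variances are chosen arbitrarily), which also shows that the larger variance gives a wider, flatter curve:

import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    # Univariate normal density with mean mu and variance sigma2.
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, round(normal_pdf(x, 0.0, 1.0), 4), round(normal_pdf(x, 0.0, 4.0), 4))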

(Refer Slide Time: 40:38)

Now, let us consider the multivariate Gaussian distribution. I am considering a D-dimensional vector x = (x1, x2, ..., xD), and corresponding to this I can define the density function P(x) for the multivariate Gaussian distribution. In this case I have two parameters, one is the mean vector and another one is the covariance matrix.

So, I have two parameters, the mean vector and the covariance matrix. I can determine the mean vector from the input vector x: the mean vector is μ = (μ1, μ2, μ3, ...), and since I am considering a D-dimensional vector, each component of the mean vector is the expected value of the corresponding component, μi = E[xi].

So, the mean vector I can determine, and also the covariance matrix: the covariance matrix is Σ = E[(x − μ)(x − μ)ᵀ]. You can see I can determine the covariance matrix, and if the dimension of the feature vector is 1, then in this case I will be getting the univariate density.

So, what is the univariate density? That I have already defined: P(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)), which I get corresponding to D equal to 1. In this case I have two parameters, the mean and the variance. So, for the univariate case I have two parameters, the mean and the variance, while in the multivariate case I am considering the D-dimensional feature vector, and from this I determine the mean vector and the D-by-D covariance matrix.

(Refer Slide Time: 43:30)

So, in this slide you can see how to determine the covariance matrix. The (i, j)-th element of the covariance matrix is σij = E[(xi − μi)(xj − μj)], that is, the covariance between xi and xj. So, this is my covariance matrix, and the diagonal elements σii are the variances of the respective xi.

So, σii = σi² corresponds to the variance of the respective xi, and the off-diagonal elements σij are the covariances between xi and xj. And suppose xi and xj are statistically independent; what is the meaning of this?

That means σij = 0, the covariance is equal to 0 for i ≠ j. Then, corresponding to this, the P(x) that I have already defined reduces to a product of univariate normal densities, one for each component. So, if xi and xj are statistically independent, that corresponds to σij = 0, and P(x) becomes a product of univariate normal densities.

Now, I can define a distance. The distance I am considering is r² = (x − μ)ᵀ Σ⁻¹ (x − μ); in the density expression, |Σ| corresponds to the determinant and Σ⁻¹ to the inverse of the covariance matrix.

So, I am defining the distance r² = (x − μ)ᵀ Σ⁻¹ (x − μ), and this is a very important distance; it is called the squared Mahalanobis distance. You already know what the Euclidean distance is.

The Euclidean distance is nothing but the distance between the vector x and μi, ‖x − μi‖; this is the Euclidean norm. And I have shown that the other one is the Mahalanobis distance. In the case of the multivariate density, if I consider a cluster of sample points, the center of the cluster is determined by the mean vector and the shape of the cluster is determined by the covariance matrix.

So, that is the importance of the mean and the covariance: the center of the cluster is determined by the mean vector and the shape of the cluster is determined by the covariance matrix. Now, I am discussing the concept of Bayesian classification for the normal distribution.
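
A minimal numpy sketch (the mean vector and covariance matrix below are made-up values) computing the squared Mahalanobis distance and, for comparison, the squared Euclidean distance:

import numpy as np

x   = np.array([2.0, 1.0])
mu  = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])        # hypothetical covariance matrix

diff = x - mu
mahalanobis_sq = diff @ np.linalg.inv(cov) @ diff   # (x - mu)^T Sigma^-1 (x - mu)
euclidean_sq   = diff @ diff                        # ||x - mu||^2

print("squared Mahalanobis distance:", mahalanobis_sq)
print("squared Euclidean distance:  ", euclidean_sq)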

(Refer Slide Time: 48:33)

So, let us discuss the concept of Bayesian classification for the normal distribution. In the case of the Bayes classifier, I have to determine P(wj|x) = P(x|wj) P(wj) / P(x); this is the Bayes rule. So, for determining the probability of wj given x, I have to consider P(x|wj), that is, the likelihood or the class conditional density.

Suppose the class conditional density follows the normal distribution; that means the likelihood function of wi with respect to x in an l-dimensional feature space follows the general multivariate normal density. So, the probability of x given wi is a multivariate normal distribution, and I am considering an l-dimensional feature vector.

So, I am considering an l-dimensional feature vector, and the multivariate normal density is P(x|wi) = 1 / ((2π)^(l/2) |Σi|^(1/2)) exp(−½ (x − μi)ᵀ Σi⁻¹ (x − μi)). I am considering c number of classes, and Σi is the covariance matrix for the particular class i.

So, I will be getting an l-by-l covariance matrix, and also I can determine the mean vector μi, which is nothing but the expected value of x for that class. Now, you know the discriminant function gi(x) = ln P(x|wi) + ln P(wi); here I write wi in place of wj. So, this is the discriminant function.

So, if I consider this multivariate normal distribution, I just have to substitute it here. I will be getting gi(x) = −½ (x − μi)ᵀ Σi⁻¹ (x − μi) + ln P(wi) + ci; suppose this is equation number 1. Here ci is a constant; what is the value of this constant? ci = −(l/2) ln 2π − ½ ln |Σi|. So, I can expand gi(x), the discriminant function.

If I expand gi(x), it will be something like gi(x) = −½ xᵀ Σi⁻¹ x + μiᵀ Σi⁻¹ x − ½ μiᵀ Σi⁻¹ μi + ln P(wi) + ci; suppose this is equation number 2. So, I am getting the expression for the discriminant function. In this case, I am considering that the probability of x given wi follows the multivariate normal distribution, and based on this I am calculating the discriminant function gi(x).

What will be my decision boundary? The decision boundaries or decision curves will be gi(x) − gj(x) = 0; that is the decision boundary. The decision boundary may be a quadric decision boundary, because the discriminant is a quadratic function of x; that is why this is called a quadratic classifier.

The Bayesian classifier for Gaussian densities is also called a quadratic classifier because the discriminant is a quadratic equation in x. So, the decision boundaries will be quadrics: they may be ellipsoids, parabolas, hyperbolas, or pairs of lines. Equation number 2 is a quadratic equation.

So, that is why the Bayesian classifier with Gaussian class conditional densities is also called the quadratic classifier. If I consider the 2-dimensional case, the decision boundaries may be something like this: this will be one class and this will be another class, or the decision boundary may be nonlinear, like this, with the classes w1 and w2 plotted against the features x1 and x2.

So, I am considering a 2-dimensional feature vector, and for this I may have these types of decision boundaries. If I consider the high dimensional case, then I will be getting hyperellipsoids, hyperparaboloids and so on.
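
A hedged numpy sketch of this quadratic (Bayesian) discriminant for Gaussian class conditional densities (the means, covariances and priors below are invented for illustration):

import numpy as np

def g(x, mu, cov, prior):
    # gi(x) = -0.5 (x-mu)^T Sigma^-1 (x-mu) - 0.5 ln|Sigma| + ln P(wi); the -(l/2) ln 2*pi
    # term is dropped since it is the same for all classes.
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

classes = {
    "w1": (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    "w2": (np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.5]]), 0.5),
}

x = np.array([2.0, 2.5])
scores = {name: g(x, mu, cov, p) for name, (mu, cov, p) in classes.items()}
print(scores, "-> decide", max(scores, key=scores.get))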

(Refer Slide Time: 58:19)

Now, what about the decision hyperplanes? If I consider equation number 2, you can see that it contains the term −½ xᵀ Σi⁻¹ x. If the covariance matrix is the same for all the classes, this quadratic term is the same for all the discriminant functions; that means it has no role in classification, so I can neglect it.

So, suppose the covariance matrix is the same for all the classes, Σi = Σ, but otherwise arbitrary. Since the quadratic term is the same for all the classes and can be neglected, I can write the discriminant function as gi(x) = wiᵀ x + wi0.

Here wi is called the weight vector, and wi0 I can consider as a bias or threshold. What is the weight vector? wi = Σ⁻¹ μi. And what is the bias? The bias is nothing but wi0 = ln P(wi) − ½ μiᵀ Σ⁻¹ μi.

So, if you see this expression, the discriminant function gi(x) is a linear function of x, and in this case the decision surfaces will be hyperplanes. I am repeating this: the discriminant function gi(x) is a linear function of the input feature vector x, and corresponding to this the decision surfaces will be hyperplanes. Now, let us consider two cases. Case number 1: the covariance matrix is the same for all the classes and it is a diagonal covariance matrix.

So, I am considering a diagonal covariance matrix with equal elements, Σ = σ²I. What is the meaning of this? The features are mutually uncorrelated and of the same variance. Here I is the identity matrix, the l-dimensional identity matrix.

So, I is the l-dimensional identity matrix, and suppose this is equation number 3. From equation number 3, gi(x) will be equal to (1/σ²) μiᵀ x + wi0, where wi0 is the bias. And in this case, what is Σ⁻¹? The inverse of the covariance matrix is (1/σ²) I, where I is the identity matrix.

So, what will be the decision hyperplanes? The decision hyperplane will be gij(x) = gi(x) − gj(x) = wᵀ(x − x0) = 0. This you can verify like this: suppose wᵀx1 + w0 = wᵀx2 + w0 for two points x1 and x2 lying on the hyperplane; corresponding to this I will be getting wᵀ(x1 − x2) = 0, which shows that w is orthogonal to any vector lying in the hyperplane.

So, what is the equation of the decision hyperplane? gij(x) = wᵀ(x − x0) = 0, where w is the weight vector. In this case the weight vector is nothing but w = μi − μj, and x0 = ½(μi + μj) − σ² ln(P(wi)/P(wj)) (μi − μj)/‖μi − μj‖². So, that is x0.

So, the decision surface is a hyperplane which passes through the point x0. I am repeating this: in this case the decision surface is a hyperplane passing through the point x0. And suppose the probability of wi is equal to the probability of wj; then corresponding to this, the point x0 will be ½(μi + μj).

So, that is the decision surface, and the point will be x0 = ½(μi + μj); that means the hyperplane passes through the midpoint of μi and μj. So, if the prior probability of the class wi is equal to the prior probability of the class wj, with i not equal to j, then x0 = ½(μi + μj), and the hyperplane will pass through the midpoint of μi and μj.

And also, the hyperplane will be orthogonal to the vector w = μi − μj; that I will show pictorially. And suppose the probability of wi is less than the probability of wj; then what will happen? The hyperplane will be located closer to μi, the mean vector μi. And if the probability of wi is greater than the probability of wj, then the hyperplane will be located closer to μj.

So, this is the case. And if the variance σ² is small with respect to the difference in the means, that is, the Euclidean distance between the two means, then the location of the hyperplane is rather insensitive to the values of P(wi) and P(wj). I am repeating this: if the variance is small with respect to the difference in the means, the location of the hyperplane is insensitive to the values of the probabilities P(wi) and P(wj). So, this condition I will be showing pictorially; what will be my decision boundary?

(Refer Slide Time: 01:09:37)

So, I can show the decision boundary like this. Suppose I consider a 2-dimensional feature space with x1 and x2, and suppose this is the mean vector μi and this is μj. I can find the vector μi − μj, which is nothing but the weight vector, and I can also determine the point x0.

So, from the equations you can determine the point x0; the hyperplane will pass through the point x0, and if the probability of wi is equal to the probability of wj, then x0 = ½(μi + μj), as I have already calculated. That means the hyperplane will pass through the midpoint of μi and μj, and also the hyperplane is orthogonal to the vector w.

So, suppose this is my hyperplane; maybe I can use another colour so that you can understand. This hyperplane is orthogonal to the vector w, that is, orthogonal to μi − μj.

And in this case, I am considering Σ = σ²I, that means a diagonal covariance matrix with equal elements. If I consider two cases, one with high variance and another with low variance, what will happen? You can see, this is the mean of one cluster, μi, and I am considering another cluster.

So, this is another cluster with mean μj, and from this, as I have already shown, you can determine the vector μi − μj, which is nothing but the weight vector, and after this you can draw the decision boundary. This decision boundary is orthogonal to the vector w and it passes through the point x0.

So, in this case I am considering the compact case, that is, the samples lie with high probability close to the means. I can consider another case, the non-compact case. In the non-compact case also I can draw the decision boundary: again, like the previous case, I can draw the mean vectors μi and μj, but now the cluster around μj is spread out, something like this.

So, this is μj, and you can determine μi − μj, and in this case also you can determine the decision boundary; the decision boundary will be something like this. This is the non-compact case, which means σ² is large, whereas in the previous case σ² was small with respect to ‖μi − μj‖. That means, in the second case, the location of the decision hyperplane is much more critical as compared to the first, compact case.

So, in the compact case you can easily draw the decision boundary between these two clusters. But in the non-compact case, which I am showing in the second figure, it is very difficult to place the decision boundary; the location of the decision hyperplane is much more critical as compared to the first case. This is about the representation of the decision boundary.
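
As a hedged numerical sketch of case 1 (Σ = σ²I; all numbers below are invented), the weight vector and the point x0 through which the hyperplane passes can be computed as follows:

import numpy as np

mu_i, mu_j = np.array([1.0, 2.0]), np.array([4.0, 0.0])
sigma2     = 0.5                      # common variance (Sigma = sigma2 * I)
P_i, P_j   = 0.6, 0.4                 # priors

w    = mu_i - mu_j                    # normal direction of the decision hyperplane
diff = mu_i - mu_j
x0   = 0.5 * (mu_i + mu_j) - sigma2 * np.log(P_i / P_j) * diff / (diff @ diff)

print("w  =", w)
print("x0 =", x0)                     # hyperplane: w^T (x - x0) = 0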

(Refer Slide Time: 01:14:52)

Now, I am considering case number 2: the non-diagonal covariance matrix, with Σi = Σ, that is, the covariance matrix is the same for all the classes. In case number 1 I considered a diagonal covariance matrix, but in this case I am considering a non-diagonal covariance matrix.

Corresponding to this, the decision boundary will be gij(x) = wᵀ(x − x0) = 0. What will be my weight vector? The weight vector will be w = Σ⁻¹(μi − μj), where Σ⁻¹ is the inverse of the covariance matrix, and the decision boundary will pass through the point x0 = ½(μi + μj) − ln(P(wi)/P(wj)) (μi − μj) / ((μi − μj)ᵀ Σ⁻¹ (μi − μj)). So, the decision boundary I can find like this.

So, I can calculate the weight vector and also determine the point x0. The decision boundary will pass through the point x0, but in this case the hyperplane is no longer orthogonal to the vector μi − μj; that is the case for a non-diagonal covariance matrix. Now, let us consider the concept of the minimum distance classifier. So, what is the minimum distance classifier?

From equation number 1 you have seen that, for equiprobable classes and ignoring the constant terms, the discriminant function is gi(x) = −½ (x − μi)ᵀ Σ⁻¹ (x − μi), where Σ⁻¹ is the inverse of the covariance matrix. Now suppose the case Σ = σ²I, where I is the identity matrix.

In this case I have to determine the maximum discriminant function: for c number of classes I have c discriminant functions, and I take the decision by finding the maximum one. Maximizing the discriminant function then means minimizing the Euclidean distance to the respective mean points.

So, what is the Euclidean distance? The Euclidean distance dE = ‖x − μi‖, the distance between the vector x and the mean μi, is what I have to find. The maximum discriminant function corresponds to the minimum Euclidean distance, so the classification decision is: assign x to the class whose mean is at the minimum Euclidean distance.

And suppose I set the Euclidean distance equal to a constant; then I will be getting curves which are circles in the 2-dimensional case, and hyperspheres in the high dimensional case. That is the Euclidean distance contour, and corresponding to this I can draw it.

So, these are my contours, contours of equal Euclidean distance, for the two classes, and you can see this vector is the weight vector. Suppose this is class 1 and this is class 2; I can draw the decision boundary between these two classes.

So, in the 2-dimensional case I will be getting curves which are circles, and in the high dimensional case I will be getting hyperspheres; these are nothing but the contours of equal Euclidean distance dE. The second case I am considering is the non-diagonal covariance matrix.

So, in this case also I have to maximize the discriminant function. Maximizing the discriminant function is now equivalent to minimizing the Σ⁻¹ norm of (x − μi), that is, the Mahalanobis distance. So, for this I determine the Mahalanobis distance.

Already I have defined the Mahalanobis distance, dm = ((x − μi)ᵀ Σ⁻¹ (x − μi))^(1/2), so I can determine it. The minimum Mahalanobis distance corresponds to the maximum discriminant function, and if I set dm equal to a constant c, then I will be getting the contours: curves which are ellipses in the 2-dimensional case, and hyperellipsoids in the high dimensional case.

In the previous case I, I have the curves of the circles. Now, I will be getting the curves of the
ellipse, the ellipse will be like this, this is for the clusters 1 and similarly, for the cluster 2 also I
will be getting the curves of the ellipse and this is the vector μi −μ j because I have to determine
that this mu i and mu j. So, this is μi and μ jand in this case already I have defined that your
decision boundary, the decision boundary will not be orthogonal it will not be orthogonal to this
vector, this vector is nothing but μi −μ j, this vector is μi −μ j .
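As a rough illustration of the minimum-Mahalanobis-distance rule (again a Python/NumPy sketch with made-up means and a shared non-diagonal covariance, not the lecture's own code):

```python
import numpy as np

# Illustrative parameters: two class means and one shared, non-diagonal covariance.
means = np.array([[0.0, 0.0],
                  [3.0, 3.0]])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
sigma_inv = np.linalg.inv(sigma)

def mahalanobis_sq(x, mu, sigma_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ sigma_inv @ d)

def classify_mahalanobis(x):
    """Assign x to the class with the smallest Mahalanobis distance,
    i.e. the largest discriminant function when the priors are equal."""
    dists = [mahalanobis_sq(x, mu, sigma_inv) for mu in means]
    return int(np.argmin(dists))

x = np.array([2.0, 1.0])
print("assigned class:", classify_mahalanobis(x))
```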

So, you can see, I have considered the minimum distance classifier based on these two distances
one is the Euclidean distance another one is the Mahalanobis distance. In this class I discussed
the concept of the Bayesian classification first I discussed the concept of Probability of Error and
after this I discussed the Concept of Risks. So, by considering the probability of error, I can take
a classification decision.

Similarly, by considering their risks, I can also take a classification decision after this I discussed
the concept of Discriminant Function. So, for c number of classes, I have c number of

discriminant functions and based on these discriminant function, I can take a classification
decision. After this I consider the Normal Distribution one is the Univariate Distribution another
one is the Multivariate Normal Distribution.

After this I determine the discriminant function, the discriminant function is gi (x) and after this I
consider two cases one is the Diagonal Covariance Matrix and another one is the Non Diagonal
Covariance Matrix, for this I determined the decision boundary in one case the decision
boundary will be orthogonal to the vector w that is a Weight vector.

And in the second case, the decision boundary is not orthogonal to the vector $\mu_i - \mu_j$, because the weight vector is $w = \Sigma^{-1}(\mu_i - \mu_j)$. And after this I discussed the concept of the Minimum Distance Classifier. So, let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati
Lecture 33
Introduction to Machine Learning - IV

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing -


Fundamentals and Applications. In my last class I discussed the concept of Bayesian decision
theory. In Bayesian decision theory, I have to estimate the probability of wj given x that I
have to determine. So for this I need two information one is the probability of x given wj that
is called likelihood and also the probability of wj that is called a prior.

So, this information I need the probability of x given wj that is the class conditional density
or the likelihood and suppose, this information is available that means the density of the
likelihood function or the class conditional density is available then this is called the
parametric method. That means the density form the likelihood function or the class
conditional density is available, but the parameters are not available.

So, suppose if I consider a Gaussian density, in Gaussian density there are two parameters
one is the mean and other one is the variance and if I consider high dimensional case, then it
is the mean vector and the covariance matrix. So, in case of the parametric method this
density form is available I know the density of probability of x given wj. So that information
is available but I do not know about the parameters.

So I have to estimate the parameters. So there are two methods, very popular methods; one is
the maximum likelihood estimation, another one is the Bayesian estimation. So by using
these two techniques I can determine the parameters, the parameters are mean and the
covariance. Another case is suppose, the density form is not available, the density form class
conditional density is not available that is the probability of x given wj. So that is not
available we have to estimate the density.

So there are two popular techniques one is called the Parzen-Window technique and another
one is called a k nearest neighbor technique. So by using these two techniques I can
determine the density, the density of probability of x given wj that is the likelihood. So in this
class I will discuss the parametric methods first I will discuss the maximum likelihood
estimation and after this I will discuss the Bayesian estimation. After this I will discuss the

non-parametric methods one is the Parzen-Window technique and another one is the k nearest
neighbor technique. So let us discuss about this parametric and non-parametric methods.

(Refer Slide Time: 3:19)

So the first one is parameter estimation. I told you what the parametric method is in the case of Bayesian decision theory. I have to determine the posterior probability, $P(\omega_j \mid x) = \dfrac{p(x \mid \omega_j)\, P(\omega_j)}{\sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)}$, where $p(x \mid \omega_j)$ is the likelihood and $P(\omega_j)$ is the prior. So in this case what information is available?

So in this case if I want to determine the probability of wj given x that is the posterior
probability. So suppose the density form of the class conditional density is available. So
suppose the density form available. So density form is suppose, it is the parametric form in
case of the normal density I have two parameters one is the mean vector another one is the
covariance matrix.

So this parametric form is available. If you see this formula, so for calculating probability of
wj given x, so what information I need? I need the information of probability of x given wj
that input information I need, that is nothing but the class conditional density. And the
parameters are suppose the mean vector and the covariance matrix. So this information also I
need, the mean vector and the covariance matrix also I need.

So that means I have two parameters, the parameters are θi 1 and θi 2 suppose. So two
parameters I need and also I need the information of how many classes. So classes is 1 to c
number of classes. So here you can see this is j, j is equal to 1 to c. So c number of classes, so

this information I need for calculating probability of wj given x. So one is the class
conditional density that is the probability of x given wj that information I need.

And I have two parameters the one is the mean vector, another one is the covariance matrix
and also I need one another information that is number of classes j is equal to 1 to c. So that
information I need to calculate the probability of wj given x. So this information is not
available directly. So for this actually we have the training samples, training samples for all
the classes that is nothing but the supervised training.

So for each and every classes we have the training samples. And after these from these
training samples, we have to do the estimation, estimation is nothing but we have to estimate
the parameters. So parameters are the mean and the covariance. So we have to estimate the
parameters. And in this case in case of the parametric approach the PDF form is known that
means, I know the PDF of the class conditional density in case of the parametric form.

So I can write this in parametric approach that is the PDF form known. But the parameters I
have to determine the PDF form is known but the parameters I have to determine I have to
estimate. That is a parametric approach. What is the non-parametric approach? In case of a
non-parametric approach, the PDF form is not available. That means I have to estimate the
density that is nothing but the density estimation.

So in case of the non-parametric method approach the density form is not available. So we
have to estimate the density. So for this we need the training samples, the training samples for
all the classes. So if I consider the training sample suppose xn that belongs to a particular
class, suppose class is wj, that is available. That is the training samples are available
corresponding to a particular class and that is nothing but the supervised learning.

So for estimating the probability of wj given x that is the posterior probability, I need the
information of the prior, the priori probability, probability of wi and also I need the
information of the probability of x given wi that is the class conditional density. So I need
this information and also I need the information of number of classes. So we have c number
of classes. So based on this I have explained two approaches one is the parametric approach
another one is the non-parametric approach.
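Before looking at those two approaches, here is a small hedged sketch (Python/NumPy; the Gaussian class-conditional densities, priors, and test point are invented for illustration) that turns assumed likelihoods and priors into posterior probabilities via the Bayes rule discussed above:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density, written out explicitly with NumPy."""
    d = len(mean)
    diff = x - mean
    cov_inv = np.linalg.inv(cov)
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ cov_inv @ diff) / norm)

# Assumed (illustrative) class-conditional densities p(x | w_j) and priors P(w_j).
class_params = [
    {"mean": np.array([0.0, 0.0]), "cov": np.eye(2), "prior": 0.6},  # class w1
    {"mean": np.array([3.0, 3.0]), "cov": np.eye(2), "prior": 0.4},  # class w2
]

def posteriors(x):
    """Bayes rule: P(w_j | x) = p(x | w_j) P(w_j) / sum_k p(x | w_k) P(w_k)."""
    joint = np.array([gaussian_pdf(x, p["mean"], p["cov"]) * p["prior"]
                      for p in class_params])
    return joint / joint.sum()   # divide by the evidence

x = np.array([1.0, 1.5])
print("P(w_j | x) =", posteriors(x))
```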

(Refer Slide Time: 9:13)

So in this case what I am considering suppose the density form is known that is the class
conditional density. So density form is known but I have to estimate the parameters. So in
this case there are two parameters one is the mean, another one is the covariance. So one is a
mean vector, another one is the covariance. So suppose this is θ1 and this is θ2.

So for this I can apply two popular techniques one is the maximum likelihood estimation
another one is the Bayesian estimation. In case of a non-parametric method the density form
is not available but I have to estimate the density.

(Refer Slide Time: 9:46)

So you can see so for estimation techniques that is the parameter estimation techniques, I may
consider this popular method that is the maximum likelihood estimation. And also I may

consider the Bayesian estimation. So in both the cases results are nearly identical but the
approaches are different. In case of the Bayesian estimation, the computational complexity is
more as compared to maximum likelihood estimation.

In the case of Bayesian estimation I have to compute a multi-dimensional integration; I will explain this later. In the case of maximum likelihood estimation I only have to perform differentiation. That is why, if I consider the computational complexity, Bayesian estimation is more computationally complex than maximum likelihood estimation.

(Refer Slide Time: 10:43)

So, in the case of maximum likelihood estimation, the parameters are fixed but not known. What I have to do is maximize the probability of obtaining the given training set, which I denote D; that is, I maximize $P(D \mid \theta)$, where $\theta$ is fixed.

So, parameter estimation here means maximizing a likelihood function: maximize the probability of obtaining the given training set, that is, maximize $P(D \mid \theta)$, in the case of maximum likelihood estimation. And in this case $\theta$ is fixed; $\theta$ is the parameter vector. Now let us see the mathematics behind maximum likelihood estimation. So what is maximum likelihood estimation?

(Refer Slide Time: 11:49)

So, in the case of Bayesian decision theory, we have to estimate the probability of $\omega_j$ given x. For this I need the class-conditional density $p(x \mid \omega_j)$ and the prior $P(\omega_j)$; the evidence in the denominator is $\sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)$.

So in this case the probability of x given wj so this parametric form, the parametric form that
is available. That is the suppose if I considered a normal distribution that information is
available that is the parametric form available. But the parameters are not available, but the
parameters we have to estimate. So in case of a normal distribution I have two parameters so
one is the mean, another one is the covariance.

So if you consider this one the θ j that is the parameter vector. So I have two parameters one
is the mean vector and another one is the covariance matrix that I have to determine. So if I
want to show the maximum likelihood estimation. So suppose we have the training samples
the training sets D1, D2, … DC that means we have the training sets and we have the training
algorithm is available.

And we have to estimate the parameters, we have to estimate the parameter vector and what
information is available? I know the density form that is the class conditional densities
available. This dependence on the training set I can write like this, the dependence on the
training set I can write like this the probability of x given wj θ j that is the dependence on
theta j. That is, I am considering the class conditional density and I am showing the
dependence on θ j.

So the problem is to determine unknown parameter vectors. So that means I have to
determine θ1 ,θ 2 from the information of that training dataset. Because we have the training
data set for all the classes and from this training dataset, I have to determine the parameters.
So that is the parameter vector I have to estimate. And in this case I am considering the
independent data set.

So what do you mean by independent data set? So suppose I am considering the data set D1
that is for the class w1. Similarly, if I consider another data set that is D2 and that is for the
class w2. Like this if I consider another class suppose, so if I consider the data set suppose the
training data set is Di and suppose I have the samples, the samples are x1, x2 like this. These
are samples and corresponding to these the class is wi.

And suppose another class is wj. The training data set Di is for the class wi; it is not for the class wj. That is the concept of supervised training. So here x1, x2, and so on are the samples of the data set Di, and the samples are drawn independently.

So I have the training data sets D1, D2, ..., Dc. Then what is the probability of D given theta? Since the samples x1, x2, ..., xn are drawn independently, it is the product $P(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$. I have to maximize this probability of D given theta. That is the concept of maximum likelihood estimation.

(Refer Slide Time: 17:22)

So again, I have to maximize the probability of D given theta, the probability of obtaining the given training set for a parameter vector theta. Now I define a likelihood function; in fact, I consider the log-likelihood function $l(\theta) = \ln P(D \mid \theta)$.

After this, I differentiate this log-likelihood function and set the derivative to zero, $\nabla_\theta\, l(\theta) = 0$, because I have to find the maximum, and this gives the estimated value $\hat{\theta}$. What is this operator? It is the vector of partial derivatives with respect to $\theta_1$, $\theta_2$, and so on: $\nabla_\theta = \left[\frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_p}\right]^T$.

So suppose I have the parameters $\theta_1, \theta_2, \ldots, \theta_p$; I have to find the global maximum in the parameter space. From this equation I can determine the parameters of the parameter vector; I can determine $\theta_1$, I can determine $\theta_2$, and so on.

So, in the case of a normal distribution, I can determine the mean vector and also the covariance matrix. What is theta? Theta is the parameter vector $\theta = [\theta_1, \theta_2, \ldots, \theta_p]^T$. By considering this equation I can determine the parameters. For example, the maximum likelihood estimate of the mean turns out to be $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$, which is nothing but the arithmetic mean of the samples.

I can also consider maximum a posteriori estimation, that is, MAP. In MAP I consider the likelihood together with the prior $p(\theta)$: I have to maximize $p(\theta \mid x) = \dfrac{p(x \mid \theta)\, p(\theta)}{p(x)}$. For a flat prior, the maximum likelihood estimate will be the same as the MAP estimate.

So this is the basic concept of maximum likelihood estimation. I am not explaining how to
determine the parameters but based on this equation, so if you see this equation I can
determine the parameters all the parameters. So this is the fundamental concept of maximum
likelihood estimation.
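As a hedged illustration (a Python/NumPy sketch with synthetic data, not the lecture's own material), the maximum likelihood estimates of a Gaussian's parameters reduce to the sample mean and the (1/n) sample covariance of the training data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training samples drawn from a Gaussian whose parameters the
# estimator does not know.
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[1.0, 0.3],
                     [0.3, 2.0]])
D = rng.multivariate_normal(true_mean, true_cov, size=500)

# Maximum likelihood estimates for a multivariate normal:
# the arithmetic mean of the samples and the biased (1/n) sample covariance.
mu_hat = D.mean(axis=0)
Sigma_hat = (D - mu_hat).T @ (D - mu_hat) / D.shape[0]

print("estimated mean:", mu_hat)
print("estimated covariance:\n", Sigma_hat)
```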

(Refer Slide Time: 22:27)

The next one is the Bayesian estimation. In case of the Bayesian estimation we can consider
the parameters as random variable having some known distribution. In case of the Bayesian
learning I am repeating this, what I am considering the parameters are random variable with
some known a priori distribution that is available that the information is available. These
training samples allows the conversion of the priori information into posterior density.

So the training samples allows conversion of priori information into a posterior information.
So in case of the Bayesian estimation, in case of the Bayesian learning what I have to
consider I have to maximize the probability of theta given D. So D is the training set and
theta is the parameter vector in this case the theta is random variable. So that means we have
to determine the density that approximate an impulse. So briefly I will explain the
mathematical formulation of the Bayesian learning. So what is Bayesian learning you can
see.

(Refer Slide Time: 23:46)

In the case of Bayesian learning, again I write the Bayes law of Bayesian decision theory, now showing the dependence on the training data set D: $P(\omega_i \mid x, D) = \dfrac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)}$, where $p(x \mid \omega_i, D)$ is the class-conditional density and I write D to show the dependence on the training data set.

But in this case the prior information is known, so I can write $P(\omega_i \mid x, D) = \dfrac{p(x \mid \omega_i, D)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j)}$. So, in the case of Bayesian estimation, what I have to determine is the probability of x given D, that is, $p(x \mid D)$.

So what information is available? The density $p(x)$ itself is unknown, but the parametric form is known; that means $p(x \mid \theta)$ is known, and the prior $p(\theta)$ is also known. The training data set converts this prior information into posterior information.

The prior $p(\theta)$ is available, and the training data set converts the prior information into the posterior $p(\theta \mid D)$. So I have to determine the probability of x given D, which involves $p(x \mid \theta)$ and $p(\theta \mid D)$: I obtain it as $p(x \mid D) = \int p(x, \theta \mid D)\, d\theta$.

Now, $p(x, \theta \mid D)$ is nothing but $p(x \mid \theta, D)\, p(\theta \mid D) = p(x \mid \theta)\, p(\theta \mid D)$, since x does not depend on D once theta is given. So from this I get the important equation $p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$, and what the training data should do is make $p(\theta \mid D)$ as peaked as possible.

If $p(\theta \mid D)$ peaks sharply, it behaves like a Dirac delta function centred at the estimated value $\hat{\theta}$, and then $p(x \mid D) \approx p(x \mid \hat{\theta})$. In general, this integration means I am taking the average of $p(x \mid \theta)$ weighted by $p(\theta \mid D)$, and it is a multi-dimensional integration, so it is very difficult and computationally complex to determine.

So something like the Monte Carlo simulation technique I can use to determine this
integration. And here you can see with the help of Bayesian estimation I can determine the
probability of x given D. So that I can determine and after this I can determine the
parameters, the parameters are theta, the parameter vector are theta. So this is the basic
concept of the Bayesian estimation. So both the methods the maximum likelihood estimation
and the Bayesian estimation they will give almost similar results.

So, briefly I discussed the concept of maximum likelihood estimation and Bayesian estimation. For more detail you can see the book Pattern Classification by Duda and Hart, which covers both maximum likelihood estimation and Bayesian estimation.
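For a feel of how the training data sharpens the prior, here is a hedged sketch (Python, my own illustration rather than anything from the lecture) of Bayesian estimation of a Gaussian mean with known variance and a Gaussian prior, using the standard closed-form posterior found in texts such as Duda and Hart:

```python
import numpy as np

rng = np.random.default_rng(1)

# Known likelihood variance and an assumed Gaussian prior on the unknown mean.
sigma2 = 4.0                 # known variance of p(x | mu)
mu0, sigma0_2 = 0.0, 10.0    # prior p(mu) = N(mu0, sigma0_2)

# Training samples (the "true" mean 3.0 is hidden from the estimator).
D = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=50)
n, xbar = len(D), D.mean()

# Posterior p(mu | D) is again Gaussian; its mean and variance are
#   mu_n      = (n*sigma0_2*xbar + sigma2*mu0) / (n*sigma0_2 + sigma2)
#   sigma_n^2 = sigma0_2*sigma2 / (n*sigma0_2 + sigma2)
mu_n = (n * sigma0_2 * xbar + sigma2 * mu0) / (n * sigma0_2 + sigma2)
sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)

print("posterior mean of mu:", mu_n)          # moves toward xbar as n grows
print("posterior variance of mu:", sigma_n2)  # shrinks toward 0 as n grows
```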

(Refer Slide Time: 32:06)

Now I will discuss the concept of the non-parametric methods. In case of non-parametric
methods that already I have explained that is the density form is not available but we have to
estimate the density. And there are two techniques one is the Parzen-Window technique and
another one is k nearest neighbor technique, these two techniques I can consider.

(Refer Slide Time: 32:28)

So, in the case of generative models, we assume that the data come from a probability density function. That means we have the information $p(x \mid \omega_j)$ for the generative models. But sometimes this density is not available, so we have to estimate it. How do we estimate the density? I will explain two techniques: one is the Parzen-Window technique and the other is the k nearest neighbor technique.

(Refer Slide Time: 33:03)

So, in the case of a non-parametric procedure, we have to estimate the class-conditional density $p(x \mid \omega_j)$. But in the case of the k nearest neighbor technique we can directly estimate the posterior probability instead, because the ultimate objective is to determine $P(\omega_j \mid x)$, that is, the posterior probability.

(Refer Slide Time: 33:31)

So, the basic idea in density estimation is that a vector x falls in a region R with a probability $P = \int_R p(x')\, dx'$. P is a smoothed, or averaged, version of the density function p(x).

(Refer Slide Time: 33:47)

Suppose n samples are drawn independently and identically distributed (i.i.d.) according to p(x). The probability that exactly k of these n samples fall in the region R is given by the binomial distribution $P_k = \binom{n}{k} P^k (1 - P)^{n-k}$.

From this I can determine the expected value of k, $E[k] = nP$. And if I maximize this binomial probability with respect to P, the maximum likelihood estimate is $\hat{P} = k/n$, where n is the total number of samples.

Therefore, with a large number of samples, the ratio k/n is a good estimate for the probability P, and hence for the density function p(x). Here n is the total number of samples and k is the number of samples falling within the particular region R.

(Refer Slide Time: 35:33)

And if p(x) is continuous and the region R is so small that p(x) does not vary significantly within it, then I can write $\int_R p(x')\, dx' \approx p(x) \cdot V$.

From this expression and the previous one, where the probability is $P \approx k/n$, I get the density estimate $p(x) \approx \dfrac{k/n}{V}$.

Here V is the volume enclosed by the region R. So, by using this expression I can determine the density p(x).

(Refer Slide Time: 36:51)

Now in this case however V cannot become arbitrarily small because we reach a point where
no samples are contained in V. And in this case we will not get the convergence the V cannot
be very, very small. Suppose if I consider the volume is very, very small, then in this case
what will happen it may not enclose any samples. Then in this case you cannot determine that
density.

So what a process I can consider V cannot be allowed to became small since the number of
samples is always limited. Because we have the limited number of samples. So V should not
be very, very small. Otherwise we cannot expect that, the samples will be available within
this particular volume, the volume is very small. And one another case we have to consider
the certain amount of variance in the ratio k divided by n.

So we can consider this that means the volume should not be very, very small because we
have the limited number of training samples. And also we have to consider the certain
amount of variance in the ratio, the ratio is k divided by n that we can consider.

(Refer Slide Time: 38:05)

So you know this expression the probability of x that is the density k divided by n divided by
V. So in case of the Parzen-Window technique fix the volume of the region V and count the
number of samples k out of n number of samples falling in V. So that means the volume is
fixed. And we have to count the number of samples within this particular volume the volume
is V. So k number of samples within this particular volume out of n number of samples.

So total number of samples is n and I am counting how many samples are within this
particular volume. So k number of samples within this particular volume. So from this
information you can see k divided by n divided by V from this information I can determine
the density that is called a Parzen-Window technique.

In the case of the k nearest neighbor technique the volume is not fixed; instead, the number of samples k is fixed. Suppose I fix k at 5 samples. Then I have to grow the volume until it encloses those 5 samples, that is, the k number of samples.

So, in the Parzen-Window technique the volume is fixed and I count the number of samples k falling within it, whereas in the k nearest neighbor technique the volume is grown until it encloses k samples. In both cases, from this information I can determine the density, $p(x) = \dfrac{k/n}{V}$, and by using this expression I can determine the density.
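A minimal hedged sketch of the fixed-volume idea (Python/NumPy, 1-D data, an interval window; the window width h is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)   # n training samples

def parzen_hypercube(x, samples, h=0.5):
    """Fixed-volume estimate p(x) ~ (k/n)/V with a 1-D window of width h
    centred at x; k is the number of samples falling inside the window."""
    n = len(samples)
    V = h                                      # "volume" of a 1-D cube of side h
    k = np.sum(np.abs(samples - x) <= h / 2)   # samples inside the window
    return (k / n) / V

for x in (-1.0, 0.0, 1.0):
    print(f"p({x:+.1f}) ~ {parzen_hypercube(x, samples):.3f}")
```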

(Refer Slide Time: 40:14)

So, to estimate the density at x, we form a sequence of regions R1, R2, and so on containing x: the first region is used with one sample, the second with two samples, and so on. Let Vn be the volume of the region Rn and kn the number of samples falling in the region Rn.

In this case I can estimate the density p(x). Let pn(x) be the nth estimate of p(x); then $p_n(x) = \dfrac{k_n / n}{V_n}$. And if I consider an unlimited number of samples, then pn(x) converges to p(x); with a large number of samples, pn(x) approaches p(x).

(Refer Slide Time: 41:30)

But in the case of the Parzen-Window technique we have to satisfy three conditions. The first condition is $\lim_{n \to \infty} V_n = 0$. What is the meaning of this? The volume may become very small because I am considering a large number of samples, n tending to infinity; then you may still expect some samples within this very small volume Vn.

The second condition is $\lim_{n \to \infty} k_n = \infty$: since we have a large number of samples, kn will also be very large.

The third condition is $\lim_{n \to \infty} k_n / n = 0$: since n grows much faster than kn, this ratio tends to 0.

And there are some mathematical derivations in the case of the Parzen-Window. So I can
determine the density, the density can be estimated by using this equation. So you can see the
book by Duda and Hart; The Pattern Classification by Duda and Hart and you can see the
derivation of this equation. So by this equation you can determine the density.
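The equation referred to on the slide is not reproduced in the transcript; for reference, the standard Parzen-window estimate as given in Duda and Hart has the form below, where $\varphi$ is a window (kernel) function, $h_n$ is the edge length of the region, and d is the dimension of x:

$$p_n(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right), \qquad V_n = h_n^{\,d}.$$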

(Refer Slide Time: 43:20)

Now let us consider the k nearest neighbor technique. In this case we estimate the density using the data points, and we let the cell volume be a function of the training data: we center a cell about x and let it grow until it captures kn samples.

That is, I grow the region, increasing the volume, until it encloses the kn number of samples. Here kn is a function of n, the total number of samples, and these kn samples are called the kn nearest neighbors of x.

(Refer Slide Time: 44:10)

And two possibilities can occur the density is high near x therefore the cell will be small
which provide good resolution. So density may be high near the feature vector, the feature
vector is x. And in this case therefore the cell will be small which provide good resolution. In
the second case the density maybe low therefore the cell will grow large and stop until higher
density regions are reached.

So that means I have to increase the volume so that it encloses kn number of samples. So
density is low, therefore the cell will grow large and stop until higher density regions are
obtained. So this is the two conditions of the k nearest neighbor technique.
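A rough 1-D sketch of this growing-cell idea (Python/NumPy; the rule of thumb k_n = sqrt(n) is a common heuristic of my own choosing, not something stated in the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)
samples = np.sort(rng.normal(size=1000))   # n training samples (1-D)

def knn_density(x, samples, k=None):
    """Grow an interval around x until it contains k samples, then
    estimate p(x) ~ (k/n)/V with V = 2 * (distance to the k-th neighbor)."""
    n = len(samples)
    if k is None:
        k = int(np.sqrt(n))                 # heuristic k_n ~ sqrt(n)
    dists = np.sort(np.abs(samples - x))
    V = 2.0 * dists[k - 1]                  # length of the smallest interval holding k samples
    return (k / n) / V

for x in (-1.0, 0.0, 2.0):
    print(f"p({x:+.1f}) ~ {knn_density(x, samples):.3f}")
```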

(Refer Slide Time: 44:59)

So, mathematically, we can directly estimate the posterior probability $P_n(\omega_i \mid x)$ from the n labeled samples; they are labeled, which means I am considering supervised training. Let us place a cell of volume V around x and capture k samples, and suppose ki samples amongst these k turn out to be labeled wi.

That means I am considering ki samples corresponding to the class wi out of the total of n samples. From this you can determine the joint density estimate $p_n(x, \omega_i) = \dfrac{k_i / n}{V}$, and from this I can determine the posterior density $P_n(\omega_i \mid x)$.

So, I have already determined $p_n(x, \omega_i) = \dfrac{k_i / n}{V}$. Taking the summation over the classes as the evidence, the posterior becomes $P_n(\omega_i \mid x) = \dfrac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \dfrac{k_i}{k}$. Here k is the number of samples captured by the cell, ki of which are labeled wi, and n is the total number of samples.

So, this ratio ki/k gives the posterior density. If I determine ki, the number of captured samples belonging to the class wi, and hence the ratio ki/k, then I can determine the posterior $P_n(\omega_i \mid x)$.

(Refer Slide Time: 47:12)

So, the k nearest neighbor classification algorithm I can show pictorially like this. Suppose I want to classify a new data point, and I have two classes, category A and category B; this is before applying k nearest neighbor. After this, based on the minimum distances, the new data point is assigned to class A, that is, category 1. So, based on this, you can see this new data point is assigned to category 1. This is the k nearest neighbor algorithm.

(Refer Slide Time: 47:51)

Again I can show what is the k nearest neighbor algorithm. Here you can see I am
considering three classes one is the yellow, one is the green and one is the orange. And I am
considering one new data point so this is the new data point I am considering. And I am
finding the distance between this data point, the new data point and the other samples
corresponding to different classes.

So one class is the yellow class, another class is the green class, another class is the orange
class or a red class. So you can see I am finding the distance, the distance is 2.1, 2.4 like this I
am determining the distance. So corresponding to this yellow one, you can see the distance is
2.1 that is the first nearest neighbor a distance is minimum. And again corresponding to the
second one, the second data point that is the yellow one, the distance is 2.4 that is the second
nearest neighbor I am getting.

And again, if you see this green one, the distance is 3.1; that is the third nearest neighbor. And the distance between the grey point and the orange or red point is 4.5; that is the fourth nearest neighbor. In this case we have to determine the number of votes.

So corresponding to this yellow how many votes, because two times it is the neighbor. So 2
votes I am getting, corresponding to the green I am getting 1 vote, corresponding to the
orange I am getting 1 vote. So I can count the number of votes and based on this the new data
point can be assigned to a particular cluster.
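This distance-and-vote procedure can be sketched as follows (Python/NumPy; the sample points, labels, and k = 4 mirror the spirit of the slide but are made up for illustration):

```python
import numpy as np
from collections import Counter

# Illustrative labeled training points (2-D) and their class labels.
X_train = np.array([[1.0, 1.0], [1.5, 2.0],      # class "yellow"
                    [4.0, 4.0],                  # class "green"
                    [6.0, 1.0]])                 # class "orange"
y_train = np.array(["yellow", "yellow", "green", "orange"])

def knn_classify(x_new, X, y, k=4):
    """Find the k nearest training points and take a majority vote;
    the vote fractions k_i / k also estimate the posteriors P(w_i | x)."""
    dists = np.linalg.norm(X - x_new, axis=1)
    nearest = y[np.argsort(dists)[:k]]
    votes = Counter(nearest)
    return votes.most_common(1)[0][0], {c: v / k for c, v in votes.items()}

label, posteriors = knn_classify(np.array([2.0, 2.0]), X_train, y_train)
print("assigned class:", label)
print("vote fractions (posterior estimates):", posteriors)
```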

(Refer Slide Time: 49:35)

Again, I am showing this: I have to classify this new data point, and for this I have two classes, class A and class B. Based on the minimum distance, this new data point can be assigned to one of the classes, class A or class B. So, in this class I discussed the concept of Bayesian decision theory, and after this I discussed the concept of parameter estimation.

In the parameter estimation I discuss two algorithms very popular algorithms; one is the
maximum likelihood estimation, another one is the Bayesian estimation. So briefly I
explained these two concepts, after this I considered the case of the non-parametric
estimation. In case of the non-parametric estimation I have to estimate the density. So for this
I considered two algorithms one is the Parzen-Window technique another one is the k nearest
neighbor technique.

So briefly I explained the basic concept of the maximum likelihood estimation, Bayesian
estimation, the Parzen-Window and the k nearest neighbor technique. For more detail you
can see the book Pattern Classification by Duda and Hart that you can see. So let me stop
here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture No. 34
Introduction to Machine Learning - V
Welcome to NPTEL MOOC's course on Computer vision and Image Processing- Fundamental
and Applications. In my image transformation class, I discussed the concept of PCA, the
Principle Component Analysis. So, how to reduce the dimension of the input vector. If I consider
the feature vector, suppose X; I can reduce the dimension of the feature vector, by neglecting the
redundant information by PCA; the Principal Component Analysis. PCA finds the greatest
variance of data.

But one problem with PCA is that it does not consider the class information. Suppose I have a number of classes; the discrimination between the classes is not considered by PCA. So, for this, I will consider another method, called Linear Discriminant Analysis. In the case of Linear Discriminant Analysis, I can reduce the dimension of the input vector, and I can also maximize the separation, that is, the discrimination, between the classes. So, that concept I am going to discuss today.

And also, I discussed the concept of Bayesian decision making, the Bayesian classifier. That is nothing but a generative model. What is the concept of a generative model? It means I have the information of the class-conditional density, the probability of X given $\omega_j$. With that information I can do the classification: I can determine the posterior density, the probability of $\omega_j$ given X.

There is another classifier that is called discriminative classifier. So, in this case, the information
of the class conditional density is not important. So, I can find the best decision boundary
between the classes. Suppose, if I consider 2 classes, I can find the best decision boundary
between these 2 classes. So, for this I will discuss one algorithm that is the Support Vector
Machine.

So, today's class I will discuss these 2 concepts; one is the Linear Discriminant Analysis, and
another one is the Support Vector Machine. So first, let us consider the concept of the LDA.
And, what are the problems with the PCA? That concept I am going to explain.

(Refer Slide Time: 03:00)

So, in case of the PCA, if you know, that I discuss in the image transformation class; I can
reduce the dimension of the feature vector or the input vector. But in this case, I am not
considering that discrimination between the classes, that information is not available in the PCA.
So, in this figure, you can see I am considering n number of feature vectors. And, I am
considering the m-dimensional vector, that I am considering. So that means, the dataset matrix X
has a dimension of m × n.

And for PCA, the method is like this. First, I subtract the mean from the original data, so that I get a zero-mean dataset. After this, I compute the covariance matrix. From the covariance matrix, I determine the transformation matrix: I determine the eigenvectors and the corresponding eigenvalues, and the transformation matrix for principal component analysis is formed from the eigenvectors of the covariance matrix.

So, you can see, I determine the eigenvalues and eigenvectors of the covariance matrix, and from these I determine the transformation matrix, that is, the basis vectors. I keep the eigenvectors corresponding to the highest eigenvalues. That concept I have already explained in my PCA class, that is, in the image transformation lectures.
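Those steps can be summarised in a short hedged sketch (Python/NumPy; keeping two principal components is an arbitrary illustrative choice, and here the rows of the data array are the samples, unlike the m × n column convention on the slide):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))        # 200 samples, 5 features (rows = samples here)

# 1. Subtract the mean to get a zero-mean dataset.
X0 = X - X.mean(axis=0)

# 2. Compute the covariance matrix of the features.
C = np.cov(X0, rowvar=False)

# 3. Eigen-decomposition; sort eigenvectors by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]            # transformation matrix: top-2 eigenvectors

# 4. Project the data onto the principal components.
Y = X0 @ W
print("reduced data shape:", Y.shape)   # (200, 2)
```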

But one problem of the PCA, that already I have highlighted, that is the class discrimination
information is not available. That is only I can reduce the dimension of the input vector, the input
data, or the input feature vector.

(Refer Slide Time: 05:00)

So in case, of the LDA, the Linear Discriminant analysis, I can reduce the dimension of the data,
the input data. So, reduce the dimensionality of a data set, by finding a new set of variables,
smaller than the original set of variables. So, I can do this. And also, I can retain most of the
sample's information. So, redundant information I can neglect, but I can retain most of the
sample's information.

So, unlike PCA, LDA uses the class information. So, that information is available in case of the
Linear Discriminant Analysis. And, I have to find a set of vectors, that maximize the between-
class scatter, while minimizing the within-class scatter matrix. So, that concept I am going to
explain. Because, in case of the Linear Discriminant Analysis, I have the class information, and I
have to find a set of vectors, that maximize the between-class scatter, and I can also minimize the
within-class scatter. So, this mathematical concept I am going to explain in case of the LDA, the
Linear Discriminant Analysis.

(Refer Slide Time: 06:15)

So, LDA finds the new axis based on these 2 criteria. So, one is the maximize the distance
between the means of the classes. So, I can consider, suppose 2 classes. So, I can maximize the
distance between these classes. So, that means, I can find the maximum distance between the
means of the classes, and also the minimize the variation within the class; that I can consider. So,
one is the maximize the distance between means of the classes, and also, I can minimize the
variation within the class.

So, I can consider this one, that means I have to minimize the variation within the class, and also,
I have to maximize the distance between that means of the classes. So, this quantity I have to
maximize.

(Refer Slide Time: 07:07)

Now, the question is; is PCA a good criterion for classification? Now in case of the PCA, the
PCA finds direction of greatest variance. Data variation determines the projection direction, but
in case of the PCA, the class information is missing. We do not have the class information, but
how actually we consider. We consider the eigenvectors, that means we want to find the
directions of greatest variation, that means we can find the eigenvectors of the covariance matrix.
But the class information is missing in case of the PCA.

(Refer Slide Time: 07:46)

So, let us consider, what is the projection? In case of the PCA, we consider the eigenvectors, that
is the direction of the projection. Now let us consider, what is the good projection. Here, in this
figure, you can see I am considering 2 classes. You can see, one is the, this class, another one is
the, this class, the 2 classes. And I want to find the good projection. In case of the blue
projection, if I consider the blue projection line, there may be overlapping of the samples of
different classes. But, if I consider this projection, that means, these 2 classes will be well
separated.

In the first projection, if I consider the first projection, that is the, the blue projection, then in this
case, these 2 classes are not well separated. But, in the second case, if I consider, the second
projection, if I consider this projection, the 2 classes are separated. So, that means in case of the
LDA, I have to see this condition, that is the separation between the classes.

(Refer Slide Time: 08:48)

So, for this, this information is important; one is the between-class distance. So here, again I am
showing 2 classes, you can see. So, this is the centroid of suppose class i, and this is the centroid
of another class m j; that is the mean. Now, in this case the between-class distance should be
maximum. So, you can see between these 2 means; one is mi , another one is m j. So, one is mi ,
and another one is m j. These 2 means I am considering. For 2 classes, the distance between the
centroid of different classes should be maximum. That is between-class distance.

After this, I am considering another information, and that is within-class distance. So that means,
it is the accumulated distance of an instance to the centroid of its class. So, that means, if I
consider, this is a centroid, the centroid is mi and m j. And you can see, I am considering the
sample points, corresponding to the centroid mi . So, this within-class distance should be
minimum.

So, suppose distance between these samples and the centroid, I can determine, and I can
determine the accumulated distance. So, that should be minimum, that corresponds to within-
class distance. So, that means for the LDA, this is important; one is the between-class distance
that is important, distance between the centroid of different classes that should be maximum.
And within-class distance, that means accumulated distance of an instance to the centroid of its
class. So, that should be a minimum. So, these 2 conditions, one is the within-class distance,
another one is the between-class distance corresponding to LDA and that is very important.

(Refer Slide Time: 10:52)

So, in case of the Linear Discriminant Analysis; LDA finds most discriminant projection by
maximizing between-class distance and minimizing within-class distance. So, here I am showing
these 2 cases. First you can see, I am considering; this is the projection direction, that is the blue
is the projection direction. And you can see, the classes, the samples of the classes, 2 classes are
overlapping, that is the discrimination between these 2 classes is minimum. But, if I consider the
second projection direction, that is the yellow, the discrimination between these 2 classes is
maximum.

So, you can see, the discrimination between these 2 classes is maximum. So, I have to find that
direction, in which direction, the discrimination between these classes will be maximum. So, that
direction I have to estimate. So, for this I have to consider these 2 cases; one is the within-class
distance, another one is the between-class distance, I have to consider. So that means, I have to
maximize between-class distance, and I have to minimize within-class distance.

So, based on this, I have to find the projection direction. And based on this projection direction, I
can find maximum separability between these 2 classes. Now, I am considering 2 classes. It may
be applicable for more than 2 classes also. So, if I consider C number of classes. So this concept
is also applicable. But in this example, I am only considering 2 classes.

(Refer Slide Time: 12:26)

So, what is the mathematics behind LDA, that I want to explain. So, let us consider a pattern
classification problem, and for this, I am considering C number of classes, I am considering. So,
maybe the classes, maybe the fishes; like the, seabass, tuna, salmon, like this I can consider
number of classes, the C number of classes I can consider. And each class has N i samples. So,
m-dimensional samples are available. And how many samples are available? N i number of
samples for each of the classes. And we have a set of m-dimensional samples.

So, we have a set of m-dimensional samples. So, corresponding to the class, the class is w i . I
have the samples x 1, x 2 like this. So, we have Ni number of samples, and it is the m-dimensional
samples. And from this, I can get a matrix, the matrix is X. That is stacking these samples from

different classes into 1 matrix, that matrix is the X. And this is column of matrix represent 1
sample. So, I will be getting a matrix, the matrix is X, from all the samples of different classes.

Now, I want to find a transformation of X to Y; X is a input data vector, suppose. So, I want to
find a transformation of X to Y through projecting the samples in X onto a hyperplane with the
dimension C minus 1. So, I have to find the projection direction, that new data will be Y, after
the projection, and the objective is to get the maximum discrimination between the classes. So, I
have to find the best projection direction, I have to find.

(Refer Slide Time: 14:27)

For simplicity, I am now considering only 2 classes; this principle can be extended to C classes. So, we have m-dimensional samples, N samples in total: N1 samples belong to the first class w1, and N2 samples belong to the other class w2.

And, we seek to obtain a scalar y by projecting the samples x onto a line. Since we are considering 2 classes, C = 2, and C minus 1 is 1, so the dimension is reduced to 1. What I am getting is the projection $y = w^T x$, which is the dot product, if I consider the vector form, of the projection vector w and the input vector x; so, if I take the dot product between $w^T$ and x, I will be getting the scalar y.

In the figure, you can see, I am showing a projection direction, the direction is this, one
projection direction you can see. And in this case, also I am considering 2 classes, and here I am
considering 2-dimensional samples. Because I am considering x1 and x2. This is 2-dimensional
samples. So, corresponding to this you can see the separation between the classes is minimum.
Because there is an overlapping between the samples of the classes, the 2 classes.

But, if I consider, in the second figure, I am considering a projection direction; you can see, I am
getting the separation between the classes, between the samples of the 2 classes. So, that means
the second projection direction is better as compared to the first projection direction. So, I have
to find, which one is the best projection direction. So, that is the objective of the LDA.

(Refer Slide Time: 16:40)

So, in order to find a good projection vector, we need to define a measure of separation between
the projections. Because, I am getting the projection. The projection is nothing but y. That is
nothing but w T . x . So, I am getting the scalar y. So, first the mean vector of each class in x and y
feature space, I can determine. So, μi , I can determine; because I have Ni number of samples. So,
I can determine the mean of x corresponding to a particular class. And you can see I can also
determine the mean of the projected data. So, y is the projected data.

So, $\mu_i = \dfrac{1}{N_i}\sum_{x \in \omega_i} x$, that I can determine. I can also determine the mean of the projected data, $\tilde{\mu}_i = \dfrac{1}{N_i}\sum_{y \in \omega_i} y = \dfrac{1}{N_i}\sum_{x \in \omega_i} w^T x = w^T \mu_i$. Now, I am considering the objective function J(w), and the main goal is to find the maximum distance between the projected means, so that I will be getting maximum separation between the classes.

So, the objective function I could consider is $J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2|$, where $\tilde{\mu}_1$ is the projected mean for class 1 and $\tilde{\mu}_2$ is the projected mean for class 2. From the previous equation, you can determine this.

(Refer Slide Time: 18:32)

Now in this case, you can see the distance between the projected mean is not a very good
measure. Because, it does not take into account the standard deviation within the class. So, that
information is not considered in this case. Because we considered the distance between the
projected mean, and that may not be a good measure. Because in this case, we are not
considering the standard deviation within the class. So, that information we are not considering.

Pictorially, that concept I am showing here. In the first case, the axis has a larger distance between the projected means, but it does not give good separability; there may be some overlap between the classes. In the second case, the axis gives better class separability: if I consider that projection direction, I will be getting maximum separability between the classes, even though the first case gives the larger distance between the means.

(Refer Slide Time: 19:48)

So, that is the solution of the, this problem is given by Fisher. So that is why, this method is
called Fisher Linear Discriminant Analysis. So what is the solution of these problem? The
solution of this problem is to maximize a function that represents, the difference between the
means, normalized by a measure of within-class variability. So, that means, I am considering the
information or the measure of the within-class variability; and I can consider as a scatter. For
each class, we define the scatter and equivalent to the variance.

So, I can consider the scatter, and it is the equivalent of the variance: the sum of squared differences between the projected samples and their class mean. So, s̃_i² measures the variability within the class w_i after projecting it onto the y-space; y-space means the projected space.

So, that means if I consider s̃_1² + s̃_2², that gives the measure of the variability within the 2 classes after the projection, and it is called the within-class scatter of the projected samples.
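
Written out, the scatter of class w_i in the projected space is

s̃_i² = Σ_{y in class i} (y − μ̃_i)²,

and the within-class scatter of the projected samples is s̃_1² + s̃_2².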

(Refer Slide Time: 21:46)

So, in case of the Fisher Linear Discriminant, we define a linear function, the linear function is w^T x, that maximizes the criterion function. What is the criterion function? The
distance between the projected means normalized by the within-class scatter of the projected
samples. So, I am considering this, the criterion function J (w), I am considering. So, the
objective is to maximize the criterion function, I have to maximize this. That means the distance
between the projected means normalized by the within-class scatter of the projected samples, I
have to maximize.
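
In symbols, the Fisher criterion being maximized is

J(w) = (μ̃_1 − μ̃_2)² / (s̃_1² + s̃_2²),

that is, projected means far apart relative to the spread of each projected class.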

So, that is the criterion function, I am considering in case of the Fisher Linear Discriminant. That
means, in case of the LDA, what actually we are considering? We are looking for a projection,
where the samples of the same class are projected very close to each other, and at the same time
the projected means are as far apart as possible. So, that is what I am considering: one is the within-class distance, another one is the between-class distance. And based on this, I am determining the projection direction. So, this concept I am showing here again. That means, maximum separation between the classes, but samples from the same class are projected very close to each other. So, that I am also considering.

(Refer Slide Time: 23:15)

In order to find the optimum projection w star, we need to express J (w) as an explicit function of
w. So, I have to find the J (w), that is the criterion function. So, for this we are defining a
measure of the scatter in multivariate feature space x, which is denoted as scatter matrix. So, I
am considering S i. So, S i is the covariance matrix of class w i and S W is called within-class
scatter matrix. So, I can determine the within-class scatter matrix from S1 and S2. So, S1 is the
covariance matrix of the class 1, and S2 is the covariance matrix of the class 2. So, from this I
can determine S W , that is nothing but within-class scatter matrix. This is important for
considering this criterion function. So, J(w) I am considering.
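
Concretely, the scatter matrix of class w_i and the within-class scatter matrix are

S_i = Σ_{x in class i} (x − μ_i)(x − μ_i)^T,    S_W = S_1 + S_2.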

(Refer Slide Time: 24:17)

Now, the scatter of the projection y can be expressed as a function of the scatter matrix in feature space x. So, here you can see, I am considering the projected data. So, s̃_i², that I can determine. So, y is the projected data, and you know what μ̃_i is. So, from this you can determine s̃_1², and similarly you can also determine s̃_2². And s̃_1² + s̃_2² is nothing but S̃_W. So, S̃_W is the within-class scatter of the projected samples y. So, you can see the mathematics and this derivation; it is a very simple derivation.
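
The "very simple derivation" referred to here is, step by step,

s̃_i² = Σ_{x in class i} (w^T x − w^T μ_i)² = w^T [ Σ_{x in class i} (x − μ_i)(x − μ_i)^T ] w = w^T S_i w,

so s̃_1² + s̃_2² = w^T S_W w = S̃_W.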

(Refer Slide Time: 25: 07)

And based on this you can see, because I am considering the projected means, the projected means are μ̃_1 and μ̃_2. So, the separation between these 2 projected means should be maximum. So, just I am determining w^T μ_1, that corresponds to μ̃_1, and μ̃_2 is nothing but w^T μ_2. So, you know this expression, and from this you can see, I am getting S̃_B.

So, S_B is nothing but the between-class scatter matrix. So, you can see how to determine the within-class scatter matrix S_W, and also we can determine the between-class scatter matrix S_B. So, S_B is the between-class scatter of the original samples, and S̃_B is the between-class scatter of the projected samples y; that you can determine.
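
In the same way, for the projected means,

(μ̃_1 − μ̃_2)² = (w^T μ_1 − w^T μ_2)² = w^T (μ_1 − μ_2)(μ_1 − μ_2)^T w = w^T S_B w,

where S_B = (μ_1 − μ_2)(μ_1 − μ_2)^T is the between-class scatter matrix of the original samples and S̃_B = w^T S_B w is that of the projected samples.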

(Refer Slide Time: 26:01)

And after this, the Fisher criterion function J(w) can be expressed in terms of the between-class scatter matrix and the within-class scatter matrix. So, J(w) is nothing but (w^T S_B w) / (w^T S_W w). So, J(w) is a measure of the difference between the class means, which is encoded in the between-class scatter matrix, normalized by a measure of the within-class scatter, which we denote as S_W.

(Refer Slide Time: 26:46)

And I have to maximize this criterion function J (w), so that is why I am taking the
differentiation with respect to w. So, w is the projection vector. So, I have to maximize J (w),
with respect to w. So, that is why I am doing the differentiation. You can see this mathematics.
So, how to do the differentiation, by using the chain rule. So, you can do the differentiation by
using the chain rule, and since I have to find the maximum value; so that is why, I am equating it
to 0.

So, I have to find the maximum of J (w). After doing all this mathematics, I will be getting this
one. So, you see this mathematics, mainly just I am applying the differentiation, applying the
chain rule and just equating it to 0, because I have to find the maximum of J (w).
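
The chain-rule step referred to here gives, after dropping the common denominator,

d/dw [ (w^T S_B w) / (w^T S_W w) ] = 0  ⇒  (w^T S_W w) S_B w − (w^T S_B w) S_W w = 0  ⇒  S_B w = J(w) S_W w,

which is exactly the generalized eigenvalue problem discussed next.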

(Refer Slide Time: 27:38)

So, it is nothing but solving the generalized eigenvalue problem, S_W^{-1} S_B w = λ w. So, S_W^{-1} is the inverse of the within-class scatter matrix, S_B is the between-class scatter matrix, w is the projection vector, and λ is the eigenvalue, a scalar. So, corresponding to this eigenvalue problem, I can determine the vector w, that is, the projection vector. So, this w*, I can determine; that is nothing but w* = S_W^{-1} (μ_1 − μ_2). So, here you can see I am determining the best projection direction, the optimum projection direction w*. So, this is known as Fisher’s Linear Discriminant.

And if I consider the same notation as PCA, the solution will be the eigenvectors of S X, because
in case of the PCA also, we determine the eigenvectors of the covariance matrix. So, similarly
the solution will be the eigenvectors of S_X, which is nothing but S_W^{-1} S_B. So, this is very similar to the PCA, the Principal Component Analysis. In PCA, we consider the eigenvectors of the covariance matrix. In this case, you can see, I am considering S_W^{-1} S_B.
Also, one is the within-class scatter matrix, another one is the between-class scatter matrix.

(Refer Slide Time: 29:19)

So, I am considering one numerical example. So, how to apply the LDA for 2 classes. So, I am
considering samples of the class w1. So, I am considering 2 classes, w1 and w2, and I am
considering the samples for the class w1. So, these are the samples. The samples are 2-
dimensional. And, similarly I am considering the samples of the class w2. That is also 2-
dimensional, and I am showing the Matlab code for this. And, I am considering the samples X1
and X2 corresponding to the classes w1 and w2 respectively.

And you can plot the samples corresponding to these 2 classes: one set of samples is shown in green, and the other set in blue.

(Refer Slide Time: 30:08)

After this, the class means I can determine by using this expression. So, corresponding to
the class w1, I can determine the mean of the samples. And similarly, corresponding to the
second class also I can determine the class mean. So, that I can determine. And in the Matlab you
can write like this. You can determine the mean of X1, and also the mean of X2 you can
determine.

(Refer Slide Time: 30:31)

After this, the covariance matrix of the first class also, you can determine; that is nothing but S1. And in Matlab, you have to write simply the covariance of X1. So, you can write like this, and you can determine S1.

(Refer Slide Time: 30:50)

And similarly, you can determine the covariance matrix of the second class. So, S2 is the
covariance matrix of the second class, you can determine. And in Matlab, S2 is equal to the covariance of X2; that you can determine.

(Refer Slide Time: 31:03)

And from S1 and S2, you can determine the within-class scatter matrix S_W. So, you will be getting this.

(Refer Slide Time: 31:17)

And after this, you can also determine the between-class scatter matrix from these 2 means,
because already you have calculated μ1 and μ2. So, from μ1 and μ2 you can determine
between-class scatter matrix. So, you can see. So, I am computing the between-class scatter
matrix, and even in the Matlab also it is very simple. So, you can determine between-class scatter
matrix.

(Refer Slide Time: 31:41)

After this, the problem is the eigenvalue problem, that you have to solve. So, the eigenvalue
problem is this. So, lambda is the eigenvalue. So, this eigenvalue problem, you can solve like
this. And I will be getting 2 eigenvalues, λ1 and λ2. So, λ1 is 0 and λ2 is
12.2007. So, you will be getting 2 eigenvalues. This is nothing but the solution of the generalized
eigenvalue problem. So, you can solve this problem.

(Refer Slide Time: 32:12)

And after this, I can determine the vector w. So, you can determine w1 and w2; you can
determine. So, we can compute the LDA projection. So, in Matlab you can do it like this: you have to find the inverse of S_W, and I will be getting the projection vector; w
is the projection vector. So, I will be getting w1 and w2. And which one is the optimum
projection direction corresponding to LDA? It is w2, the optimum projection direction that I can determine, because it gives maximum J(w).

(Refer Slide Time: 33:03)

Or maybe directly, we can compute like this: the optimum projection direction, or vector, is w* = S_W^{-1} (μ1 − μ2), where S_W^{-1} is the inverse of the within-class scatter matrix. So, from this you can determine the optimum projection vector; that you can
determine. So, this is one example. This LDA you can apply for C number of classes. So, in the
book you can get this information, how to apply the LDA for C number of classes.

But, in my discussion I only considered 2 classes. So, how to apply the LDA for 2 classes. For
this, you have to determine S W and S B; one is the within-class scatter matrix, another one is the
between-class scatter matrix.
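
As a companion to the Matlab steps described above, here is a minimal NumPy sketch of the same 2-class LDA computation. The sample vectors below are stand-ins (the actual values are on the slide), and the variable names are my own.

import numpy as np

# Stand-in 2-D samples for classes w1 and w2 (replace with the values from the slide)
X1 = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
X2 = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)      # class means
S1 = np.cov(X1, rowvar=False)                    # covariance matrix of class w1
S2 = np.cov(X2, rowvar=False)                    # covariance matrix of class w2
SW = S1 + S2                                     # within-class scatter matrix
d = (mu1 - mu2).reshape(-1, 1)
SB = d @ d.T                                     # between-class scatter matrix

# Generalized eigenvalue problem: inv(SW) SB w = lambda w
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
w_best = eigvecs[:, np.argmax(eigvals.real)].real    # direction of the largest eigenvalue

# Direct 2-class solution: w* = inv(SW) (mu1 - mu2); same direction up to scale
w_star = np.linalg.solve(SW, mu1 - mu2)

The largest eigenvalue here plays the role of the maximum J(w) mentioned above.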

(Refer Slide Time: 33:52)

And here you can see, I am showing one projection direction corresponding to smallest
eigenvalue. So, smallest eigenvalue I am considering, and I am showing the projection direction.
And in this case, if I considered the PDF of the classes, they are not well separated. That means,
that there is no discrimination between the classes, corresponding to this projection direction. In
case of the PCA also, we considered the eigenvalues and the corresponding eigenvectors.

So, in case of the PCA, we consider the highest eigenvalue and corresponding eigenvectors, that
we considered. And if I consider, the smallest eigenvalue that corresponds to the redundant
information, or maybe the noise; that we can neglect in case of the PCA. Here in case of LDA,
what I am considering, the smallest eigenvalue, I am considering. And corresponding to this, I
can determine the projection direction.

And here you can see, corresponding to the smallest eigenvalue, the separation between the 2 classes is not maximum. The PDFs of the classes overlap; that is bad separability.

(Refer Slide Time: 35:03)

But if I consider, the highest eigenvalue; and corresponding to this, I can determine the
projection direction. In this case, you can get the good separability between the classes. So, you
can see, I am showing the PDF of the classes, and you can find the good separability between the
classes corresponding to the highest eigenvalue, you can see.

(Refer Slide Time: 35:25)

Now in case of the PCA, we have seen how to recognize a particular face. The face recognition
using PCA, that concept already I have explained. That means, any face can be expressed as a
linear combination of eigenfaces. So, you can see, I am considering the eigenfaces like this. So, in my class, I have explained how to determine the eigenfaces. And any face can be represented by a linear combination of the eigenfaces.

Similarly, in case of the LDA, any face can be represented by a linear combination of
Fisherfaces. In case of the PCA, we consider the eigenfaces. But in this LDA, we are considering
Fisherfaces. So, that means, any face can be represented by a linear combination of
Fisherfaces. So, that concept I am going to explain now.

(Refer Slide Time: 36:22)

So, how to recognize a particular face? Suppose, we have C number of classes. This μi , I can
determine; that is the mean vector of the class, I can determine. So, I have C number of classes.
So, also I am considering Mi number of samples for the classes, within a particular class. So,
from this, I can determine the total number of samples: M = Σ_i M_i. So, Mi is the number of samples within class i; that I
am considering.

And from this, I can determine total number of samples, I can determine. And I have already
explained, I can determine the within-class scatter matrix, and also I can determine the between-
class scatter matrix from the input samples.

(Refer Slide Time: 37:15)

After this, I am considering the criterion function; that function I am considering. So, what is the
condition? I have to maximize the between-class scatter, but I have to minimize the within-class
scatter, that is the condition. Because, I have to find the best projection direction, I have to find.
So for this, what I have to do? I have to maximize the between-class scatter, and also I have to
minimize the within-class scatter.

So, such a transformation should retain class separability, while reducing the variation due to sources other than identity; for example, illumination variation I may consider. But, the most important point is the class separability. So, we have to find the class
separability; that it is a maximum discrimination between the classes, I have to find. And
already, I have explained that is nothing but the eigenvalue problem. So, this solution is
something like this.

And, I will be getting the Fisherfaces if I consider the eigenvectors of S_W^{-1} S_B. So, I will be getting the eigenvectors; that is nothing but the Fisherfaces. That means, you
can see the projected data can be represented by a linear combination of the Fisherfaces. So here,
you can see, I am considering, the U is the transform matrix and how it is obtained. It is nothing
but the eigenvectors, I am considering. The eigenvectors of SW −1 into S B. So, eigenvectors I am
considering.

And based on these eigenvectors, I can construct the transformation matrix. So, x is the input data minus μ; that is, the mean is subtracted from the input data, which means the input data is normalized. And after this, I am considering the transformation b = U^T (x − μ). So, suppose in case of the KL transformation, what I have considered is Y = A (x − μ_x); I considered like this.

Similarly in this case, U is the transformation matrix, and I am considering U^T (x − μ) like this. So, this U is the transformation matrix, and this transformation matrix I can obtain from the eigenvectors of S_W^{-1} S_B; that is, the inverse of the within-class scatter matrix into S_B. So, I can get this U. That means any face can be represented by a linear combination of the Fisherfaces.

(Refer Slide Time: 39:57)

So, the procedure for the face recognition is very similar to the face recognition by PCA. First, I
have to do the normalization of the input data, that means from the original face the mean face is
subtracted. After this, I have to determine the Fisherface, I have to determine. That means, I am
considering the weights. The weights are w1, w2, w3, like this; that I have already explained in
the PCA.

So, that means any unknown face can be represented by a linear combination of Fisherfaces. And
suppose, a new face is coming. So, a new face can be also represented by a linear combination of
Fisherfaces. After this, what I have to do for recognition? Just I have to compare the weights: one set is the weights corresponding to the training faces, and another one is the weights corresponding to the input test face. And based on this comparison, if the difference is less than a particular
threshold; so based on this condition I can recognize a particular face.
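
A rough NumPy sketch of this weight-comparison step (the names U, mu, threshold and the helper functions are my own, not from the lecture):

import numpy as np

def project(face, U, mu):
    # weights of a vectorized face in Fisherface space: b = U^T (x - mu)
    return U.T @ (face - mu)

def recognize(test_face, train_faces, U, mu, threshold):
    train_w = np.array([project(f, U, mu) for f in train_faces])  # training weights
    test_w = project(test_face, U, mu)                            # test-face weights
    dists = np.linalg.norm(train_w - test_w, axis=1)              # compare the weights
    best = int(np.argmin(dists))
    return best if dists[best] < threshold else None              # None: not recognized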

So, this concept is very much similar to the face recognition by PCA. But in this case, what I am
considering? I am considering the Fisherfaces. Corresponding to this, I am
determining the transformation matrix. The transformation matrix is U, that is obtained from the
eigenvectors. The eigenvectors of SW −1 into S B.

In case of the PCA, we consider the transformation matrix A. The transformation matrix is
obtained from the eigenvectors of the covariance matrix of the input data. So, C X is the
covariance matrix of the input data. And I am determining the eigenvectors. And from the
eigenvectors, I can determine the transformation matrix.

In case of the LDA, what I am considering? I am again considering the eigenvectors of this, SW −1
into S B. And from this, I can determine the transformation matrix, the transformation matrix is
U. So, these 2 concepts are very similar. The face recognition by PCA and the face recognition
by LDA. Now next, I will discuss the concept of the Support Vector Machine.

(Refer Slide Time: 42:17)

Now, I will discuss the concept of Support Vector Machine. An introduction of Support Vector
Machine. So briefly, I will explain the concept of the Support Vector Machine. So, what is
Support Vector Machine?

(Refer Slide Time: 42:31)

So, Support Vector Machine is a classifier derived from statistical learning theory. And already, I
have explained, it is the discriminative model. Because in this case, I do not need the information
of class conditional density. So, I have to determine the best decision boundary between 2
classes. So, if I consider more number of classes, I have to find the best decision boundaries
between the classes. So, that is why, it is called a discriminative classifier; because, I do not need
the information of that class conditional density.

And, Support Vector Machine, we can consider different applications; like handwriting character
recognition, that is one application. And there are many other applications, like object detection,
and recognition, content-based image retrieval, text recognition, biometrics, speech recognition.
So, there are many applications of Support Vector Machine, which can be used for classification
and recognition.

So for this, we may consider hand-crafted features for classification. So, like this already I have
explained some hand-crafted features; like color feature, texture features, or maybe the HOG,
SIFT, I can consider. And based on these hand-crafted features, I can do the classification by
Support Vector Machine.

(Refer Slide Time: 43:52)

So, you know this condition, that is the discriminant function; already I have explained about the discriminant function. Now, this feature vector x can be assigned to a particular class w_i based on this condition: g_i(x) is greater than g_j(x) for all j not equal to i. So, corresponding to this, I can assign a feature vector x to the class w_i. This is based on the discriminant function. And if I consider 2 classes, that is the 2-category case.

So, I can determine g_1(x) and also I can determine g_2(x). So, g_1(x) − g_2(x), that is nothing but g(x). Suppose g(x) = 0; that corresponds to the decision boundary, as already I have explained. On the decision boundary, g_1(x) = g_2(x). So, x is the feature vector, and based on g(x) I can take a classification decision: I decide the class w1 if g(x) is greater than 0; otherwise, I have to consider the class w2.

And, for the Minimum-Error-Rate Classifier, that already I have explained, g(x) can be represented like this: g(x) = g_1(x) − g_2(x), where g_1(x) is the posterior probability P(w1 | x), and similarly g_2(x) is the posterior probability P(w2 | x). So, for each and every class I have to determine the discriminant function, and I have to find the maximum discriminant function, and based on this I can take a classification decision.

(Refer Slide Time: 45:57)

And here, I have shown some decision boundaries; you can see. First, I am considering the nearest neighbor classification. So, you may get a decision boundary like this; this is the decision boundary between 2 classes. And in case of the decision tree, this is nothing but a binary decision, either yes or no. So, that type of decision I can consider with the decision tree, and corresponding to this I may get a decision boundary like this; that is a binary-classification type of classifier.

And if I consider that g(x) is a linear function, so suppose g(x) = w^T x + b, where w is the weight vector, x is the input feature vector, and b is the bias. So, corresponding to this linear function, I will be getting a linear decision boundary, like this.

And also, I may get a nonlinear decision boundary between the classes. So, the last example is a nonlinear function, that is, a nonlinear decision boundary I can get. So, this is about the decision boundaries. So, I am now considering the discriminant function g(x) = w^T x + b.

(Refer Slide Time: 47:18)

So, now g(x) is a linear function, that is, the linear discriminant function I am considering: g(x) = w^T x + b. And I am considering a hyperplane, that is the decision boundary between 2 classes. You can see, here I am considering a 2-dimensional feature space, x1 and x2. And, you can see, this is the decision boundary; I am considering the decision boundary like this between the classes.

So, w^T x + b = 0 is the equation of the decision boundary. And suppose w^T x + b is greater than 0; that corresponds to the class w1. And if w^T x + b is less than 0, we consider the class w2. So, these 2 classes I can consider; one is w1, another one is w2. Note that this w, the weight vector, is different from w1 and w2, which here denote the 2 classes.

And the unit normal, that is, the unit-length normal vector of the hyperplane, also I can determine. So, if you see this vector, it is the unit vector, the unit-length normal vector of the hyperplane; that is nothing but n = w / ||w||. So, that unit vector also you can determine.

(Refer Slide Time: 48:50)

Now, how will you classify these sample points using a linear discriminant function, in order
to minimize the error rate? So, that is the concept. So, I am considering 2 classes. So, the first
class is denoted by plus 1, and second class is denoted by minus 1. And, it is a 2-dimensional
feature space, I am considering. So, infinite number of answers. Because I have to minimize the
error rate. And you can see, I am showing the decision boundary between these 2 classes.

Again I may consider another decision boundary between these 2 classes; or maybe, I may
consider another decision boundary between these 2 classes; or I may consider this decision
boundary between these classes. So, I may get a number of decision boundaries. But which one
is the best decision boundary; that I have to determine in the Support Vector Machine. So, which
one is the best decision boundary between these 2 classes; that I have to determine by
considering some optimization criterion, that I am going to explain in my next slide.

(Refer Slide Time: 49:56)

So, the linear discriminant function, or the classifier with the maximum margin is the best. So,
this is the definition of the best decision boundary. So, what is the definition of the margin, that
you can see here? So, margin is defined as the width that the boundary could be increased by
before hitting a data point. So, you can see, I am considering the boundary like this, and I am
increasing the width of the boundary so that it will just touch the sample points. You can see these are the sample points. So, before hitting the data points, I
can stop.

So based on this, I can define the margin. The margin of this, the hyperplane. So, this is the
definition of the margin. So, beyond this, I cannot increase the width, because it will touch the
data points. So, just before the hitting the data points, I have to stop. And corresponding to this,
if I consider, this is the width of the decision boundary; suppose that corresponds to the margin.
So, which is the best decision boundary, I want to determine?

Because in my previous slide, I have shown, I can draw number of decision boundaries between
these 2 classes. But, which one is the best, that I can determine based on the margin. If the
margin is more, then that will be good. That means, for a good decision boundary, the margin
should be more, or the margin should be high. So, based on this margin condition, I can find
the best decision boundary between the classes.

The best decision boundary means, it is robust to outliers, and thus strong generalization ability.
So here, you can see I am considering the safe zone. Because beyond this, I cannot increase the
margin, because it will touch the data points. So, that corresponds to the safe zone. So, this is the
safe zone, I am getting based on the margin.

(Refer Slide Time: 52:03)

So, how to determine the best decision boundary, that means how to maximize the margin
between the classes. So, suppose the large margin linear classifier, that I want to design; that
means, the margin should be the maximum. And already, I have mentioned. So, I have 2 classes;
one is the plus, another one is, plus 1; another one is minus 1. So, given the data points, I have
the data points (x_i, y_i), i = 1 to n. And for the class with y_i = +1, corresponding to this, w^T x_i + b should be greater than 0. And similarly for the second class, y_i = −1, w^T x_i + b should be less than 0. So, these are the conditions.

And after this, I can do some scale transformation, for both w and the b, and corresponding to
this, these 2 equations will be equivalent to this. So, yi is equal to plus 1, that corresponds to w T
xi plus b greater than equal to 1. And for y i is equal to minus 1 w T xi plus b should be less than
equal to minus 1. So, I will be getting these 2 new conditions. So, these 2 new conditions, I am
obtaining after the scale transformation on both w and b.

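For reference, the two scaled conditions just derived are usually written as the single constraint

y_i (w^T x_i + b) ≥ 1   for all i,

since multiplying by the label y_i = ±1 flips the second inequality.
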
(Refer Slide Time: 53:34)

So you can see in this figure, I am considering the data points. So, if you see in the next figure
here, you can see, I am considering x plus, x plus and x minus. So, this is the x plus, this is the x
plus, and this is the x minus. That means, the margin is about to touch the data points. So, in one
side, it is x plus; in another side, it is x minus. So, this x plus and the x minus, they are called the
support vectors. Because this margin is about to touch the data points. Beyond this, I cannot
increase the width of the margin.

And what is the objective? My objective is to get the large margin linear classifier. So, I have to
get the maximum width of the margin. So, that is my objective; that is my goal. And based on
this, I am defining the support vectors. The support vectors are x plus and the x minus. So, x
plus, corresponding to the class w1, suppose. And another one is x minus, corresponding to the
class w2. So, I have these support vectors. So, with the help of this support vectors, I can define
the margin. So here, you can see, these are the support vectors, x plus and the x minus. These are
the support vectors.

So, we know this condition, w^T x + b = 1, corresponding to the first class. So, I am now considering the support vectors. So, for the support vector x+ I am considering w^T x+ + b = 1, and again for the support vector x− I am considering w^T x− + b = −1. So, these 2 equations I am considering for the 2 classes; the classes are w1 and w2.

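The width computed in the next step follows directly from the two support-vector equations:

w^T x+ + b = 1 and w^T x− + b = −1  ⇒  w^T (x+ − x−) = 2,

so projecting (x+ − x−) onto the unit normal n = w / ||w|| gives the margin M = 2 / ||w||.
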
And now, I want to determine the width of the margin. So, the margin width can be determined
like this M is equal to x plus minus x minus. So, this is one support vector, and x minus is
another support vector. And I am considering the normal to the hyperplane. So, you get the, this
unit normal to the hyperplane, I am considering. So, this unit normal already I have determined.
So, this is the unit normal, that is the n. And this is nothing but it is equal to 2 divided by w
norm. So, I am just determining the norm of w. w is the weight vector.

(Refer Slide Time: 56:08)

Now, for the large margin linear classifier, what I am considering? I am considering the
formulation, that I have to maximize the margin. So, the width of the margin is 2 divided by w

norm. So, I have to maximize this. And, that is the goal of a large margin linear classifier in the
Support Vector Machine. And based on this, what are the conditions? The condition is y i is
T
equal to plus 1. For this condition, the condition is w x i plus b greater than equal to 1. And
what is another class? Another class is y i is equal to minus 1. That is another class; and this
condition is w T x i plus b less than equal to minus 1.

And again the formulation, I have to maximize 2 divided by w norm; that is equivalent to
minimizing 1 divided 2 w norm whole square. So, this is actually equivalent to this. The
maximizing 2 divided by w norm is equivalent to minimizing 1 by 2 w norm whole square. Such
that these 2 conditions, I have to consider. So, formulation is this. So, minimize 1 by 2 w norm
whole square, that I have to minimize and condition is this.

(Refer Slide Time: 57:31)

So, here I am considering this optimization problem. So, I have to minimize these, subject to the
condition; the condition is this. So, this a optimization problem. So, how to solve this
optimization problem? So, for solving the optimization problem, I can consider Lagrangian
function. So, in mathematical optimization, Lagrange's multiplier method is used, you know that
condition. So, you have to see the mathematics book. In mathematical optimization, Lagrange's
multiplier method is used. And it is used to find local maxima and the minima of a function,
subject to some constraints. That means, subject to the condition that one or more equation have
to be satisfied exactly by the chosen value of the variables.

So, I am repeating this. It is used to find the local maxima and the minima of a function, subject
to some constraints. That means, subject to the conditions, that one or more equations have to be
satisfied exactly by the chosen value of the variables. In order to find the maxima, or the minima
of the function. Suppose the function is f x. So, I want to find the maxima or the minima of the
function f x, subject to the equating constraint. So, constraint is suppose g x is equal to 0. So, this
constraint I am considering.

Then this Lagrangian function, I can write like this. This is a Lagrangian function, x, λ. So,
lambda is the Lagrange's multiplier, is equal to f x−λ g x . So, I can write like this. So, for more
details you have to see the mathematics books. So, how to do the optimization by considering the
Lagrange's function. So, this is the Lagrange's function, and I have to minimize this one. So, α i
is the Lagrange's multiplier. So, this is the Lagrange's multiplier.

(Refer Slide Time: 59:38)

So again, I am writing this. So, I have to minimize this function, the Lagrangian function and
subject to the condition, the condition is, α i should be greater than equal to 0. So, α i is the
Lagrange’s multiplier. And for this, I am just taking the derivative of L p with respect to the
weight vector, the weight vector is w. And similarly, the del L p divided by del b should be equal
to 0. So, based on these differentiations, I want to find the maximum or the minimum conditions.
So, based on this the partial derivative, I want to find the value of w and this.

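As a sketch of the standard formulation (not copied from the slide), the primal Lagrangian and the two derivative conditions described above are

L_p(w, b, α) = (1/2) ||w||² − Σ_i α_i [ y_i (w^T x_i + b) − 1 ],   with α_i ≥ 0,

∂L_p/∂w = 0 ⇒ w = Σ_i α_i y_i x_i,   ∂L_p/∂b = 0 ⇒ Σ_i α_i y_i = 0.
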
(Refer Slide Time: 60:23)

And this is nothing but the Lagrangian Dual Problem. That means you can see, because my
objective is to minimize this function the Lagrangian function, subject to condition. This
condition is important, that is equivalent to maximizing this. That is equivalent to maximizing
this subject to this, these 2 conditions. And this is called the Lagrangian Dual Problem. So, you
see that mathematics. Again, I am repeating this, you see the mathematics, how to solve the
optimization problem using the Lagrange’s function.

(Refer Slide Time: 60:54)

And now I am considering, the KKT condition. What is the KKT condition? So, I am
considering. So, from the KKT condition, the KKT condition is nothing but, the KKT means
Karush-Kuhn-Tucker condition, that also you have to see. This is called the KKT. It is used in
mathematical optimization. It is nothing but the first derivative test. It is also called the first order
necessary condition for a solution in a nonlinear programming to be optimal, provided that some
regularity conditions are satisfied.

So, this is the briefly, what is the KKT. So, it is used in mathematical optimization and it is
nothing but the first derivative test. And it is also called the first order necessary conditions for a
solution in a nonlinear programming to be optimal, provided that some regularity conditions
should be satisfied. So, by considering this KKT, I am considering this one, that is α i into yi and
after this, I am considering w T xi plus b minus 1 is equal to 0.

That I am considering; and thus, only for the support vector that is α iis not equal to 0, that is only
support vector have this condition. That is α iis not equal to 0. So, the solution will be like this.
So, I will be getting the solution of this. So, I will be getting the value of w and also, I have to
consider this one to get b from this. So, I can get b from this equation. And in this case xi is the
support vector. So, I am considering xi is the support vector.

And in the figure also, I have shown the support vectors, and you can see this is the margin I am
considering. So, x plus and the x minus, these are the support vectors. And based on this, you can
see, based on this Lagrangian’s multiplier function, Lagrange’s function, I am getting the
solution. And you can apply the KKT, this condition, and I will be getting the value of w, and
also, I can get the value of b, the b also I can determine from this.

(Refer Slide Time: 63:27)

And finally, the linear discriminant function I will be getting like this, w T into x plus b is nothing
but, I will be getting this one. And only I am considering the support vectors, i is mainly the
support vector, I am considering. So, alpha i is nothing but, it is the Lagrange’s multiplier. So, i
is nothing but the support vector. So, you can see in this expression, in the g x expression, that is
the discriminate function, the linear discriminate function. So, it is nothing but the dot product
between the test point. So, test point is x, and the support vectors are xi. So, first I have to
determine the dot product between the test point x and the support vectors xi, and based on this I
can determine g x.

So, suppose the new test vector is coming, suppose the new test vector is coming. So, for this one
I have to do, I have to find out the dot product between the test point x and the support vector xi.
And from this, I can determine the discriminate function, the discriminate function is g x. Also,
keep in mind that solving the optimization problem involve computing the dot product between x
T
i and x j between all pairs of the training points. So, for all pairs of the training points I have to
find the dot products, that is the condition. So, based on this, this discriminate function I can do
the classification by considering that support vectors.

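In practice this whole training and classification pipeline is usually called through a library. A small hedged example with scikit-learn's SVC on toy data of my own (C is the soft-margin parameter discussed shortly):

import numpy as np
from sklearn.svm import SVC

# Toy 2-D samples for the two classes labelled -1 and +1
X = np.array([[2.0, 2.0], [1.0, 3.0], [2.5, 1.5], [6.0, 6.0], [7.0, 5.0], [6.5, 7.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)     # large margin linear classifier
clf.fit(X, y)

print(clf.support_vectors_)                   # the support vectors x_i
print(clf.coef_, clf.intercept_)              # w and b of the separating hyperplane
print(clf.decision_function([[4.0, 4.0]]))    # g(x) = w^T x + b for a test point
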
(Refer Slide Time: 64:59)

And suppose if I consider, if the data is not linear, linearly separable. So, suppose if I consider
the noisy data points or the outliers. And again here, I am showing 2 classes, one is the plus
green, and another one is the white that is the minus, 2 classes I am considering. And I am
considering the noisy data points and outliers. So, for this what is the formulation? The
formulation is I have to consider slack variables. So, slack variable I am considering. So, it can
be added to allow misclassification of difficult or the noisy data points. These slack variables are
introduced to allow certain constraints to be violated.

So, slack variables are defined to transform an inequality expression into an equality expression,
that is the mathematics, you can see. That is the slack variables, we can define to transform an
inequality expression into an equality expression. So, that is the objective of the slack variables.
T
And you can see, I am showing this equations w x plus b is equal to 1 corresponding to this,
this is the line; w T x plus b is equal to 0, that is the decision boundary; w T
x plus b is equal to
minus 1, that is the line. And you can see the margin here. I am showing the margin here.

(Refer Slide Time: 66:25)

And corresponding to this slack variable problem, the formulation will be something like this, 1
by 2 w norm whole square and after this, plus C. So, C parameter I am considering and this is the
slack variable. So, ξ i is the slack variable. So, the conditions are like this. So, I have to consider
these 2 conditions, and this parameter C can be viewed as a way to control overfitting. That
means, the parameter C tells the optimization, how much you want to avoid misclassification by,
is training examples. So, that means I have to avoid the misclassification and this parameter C
controls the overfitting. So, that information I am giving with the help of the parameter, that
parameter is C. So, this is the concept of the large margin linear classifier.

So, briefly I have explained the large margin linear classifier, and also their nonlinear Support
Vector Machines. Suppose, if I consider many noisy points or the outliers. So, for this I can
consider nonlinear Support Vector Machine and, in this case, I can consider the projection of the
low dimensional data into high dimensional space. That means, I can do the projection of the low
dimensional space into high dimensional space, and I can consider the nonlinear Support Vector
Machine. I am not explaining the concept of the nonlinear Support Vector Machine. Briefly, I
have explained the concept of the Support Vector Machine, that is the large margin linear
classifier.

In this class, I discussed the basic concept of the LDA, and the Support Vector Machine. In case
of the LDA, I have to find the best direction of the projection. So, which one is the best

projection direction I have to find? And for this I have considered 1 criterion function, and for
this I considered between-class scatter matrix and the within-class scatter matrix. So, based on
this, I can find the best prediction direction; I can find.

In case of the Support Vector Machine, I determine the best decision boundary between 2
classes. So, that means I have to maximize the margin and between the 2 classes. And in this
case based on this, the margin, the width of the margin, I can define the support vectors. And
based on these support vectors, I can determine the discriminate function g x; and after this, I can
do the classification.

The Support Vector Machine is a discriminative classifier, because I do not need the information
of class conditional density. So, these concepts, 2 concepts; one is the LDA, another one is the
Support Vector Machine, briefly I have explained in this class. So, let me stop here today. Thank
you.

Computer Vision and Image Processing-Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati, India
Lecture 35
Artificial Neural Network for Pattern Classification
Welcome to NPTEL MOOC course on Computer Vision and Image Processing- Fundamentals
and Applications. In this week I will be discussing the concept of artificial neural networks. How
it can be use for pattern classification. There are two types of artificial neural networks one is
supervised artificial neural network and another one is unsupervised artificial neural networks.
After this I will discuss the concept of deep networks. And how it can use for object
classification, object recognition, image recognition that concept I will be discussing.

So, what is artificial neural network? So, artificial neural network the concept is very much
similar to biological nervous system. So, that is why I can say it is a biologically inspired system.
Now, let us see the concept of the artificial neural network.

(Refer Slide Time: 1:30)

In this figure you can see I am showing one biological nervous system you can see cell body,
dendrites, axon. So, that concept I will be explaining in my next slide. So, idea is to deal an
electrical system which is capable of performing tasks through reasoning just like human being.
So, this is the biological nervous system. Now, you can see how it is related to the electrical
system.

(Refer Slide Time: 2:02)

So, here you can see the concept of the biological nervous system, the biological neural system
and here you can see the axon and the dendrites. So, if you see if I consider these are the
dendrites. So, the dendrites collects signal from the cell body. So, suppose this is the cell body.
So, the dendrites collects the signal from the cell body and all the signals are sum up if the signal
value the sum value is greater than a particular threshold.

Then the signal will be transmitted via axon. So, you can see this axon and axon terminates
synapse. So, this is the concept of the biological neural system. These dendrites collect signal
from the cell body. And all the signals are sum up if the signal value is greater than a particular
threshold then the signal will be transmitted via axon and the axon terminates synapse. So,
corresponding to this biological neural system.

You can see what is actually going in the cell body that concept I am going to explain. So, that
means dendrites collects the signal from the cell body. And all the signals are sum up here, you
can see all the signals are collected. If the signal value is greater than a particular threshold then
the signal will be transmitted via axon. So, this is the concept of the biological neural system. So,
I am repeating these so dendrites collect the information from the neurons. And all the signals
are sum up and if the sum value is greater than a particular threshold then I will be getting a
signal in the axon. So, that is concept of the biological neural system.

(Refer Slide Time: 4:03)

And if I consider the equivalent, the elliptical system suppose that is nothing but the artificial
neural network. So, corresponding to that artificial neuron you can see here I have the signals the
input signals are S1, S2, S3, S4, S5. And I am considering the Weights, the Weights are W1, W2,
W3, W4, W5 these are the Weights. Now, in this case what I am doing I am just doing the
summation of the signals the input signal that means I am adding all the signals

The signals are added like this S1W1 a signal is multiplied with the weight; the weight is W1.
Another signal is S2 that is multiplied with W2 like this I will be multiplying and suppose if I
consider Sn and Wn so like this I am determining the sum. So, sum I am determining like this, so
this sum value is determine like this. and after this I am considering one thresholding function.
So, this is the thresholding function I am considering one thresholding function.

And this function is called as squashing function. And this is used to compare the sum value with
the threshold. A threshold is the squashing function. If the sum value is greater than a particular
threshold then this threshold then I will be getting the signal in the output. So, I will be getting
the signal S in the output that is the state. So, I am repeating this if the sum value is greater than a
particular threshold then I will be the signal S in the output. So, this is the model of an artificial
neuron.

(Refer Slide Time: 5:56)

So, this the squashing function, this squashing function is the sigmoid function as generally use

1
in the artificial neural network. So, σ ( x ) = . So, in the x axis you can see this is the sum I
1+e−x
am considering and in the y axis I am considering the sigmoid function. So, that is sigma x, so
sum value is compared with the sigmoid function and if the sum is greater than sum value then I
will be getting 1 in the output.

So, suppose corresponding to this one the sum value is greater than this particular threshold then
I will be getting one in the output. That means I will be getting the signal in the output terminal
of the artificial neural network. So, that is the function of the squashing function and here I am
considering the sigmoid function.

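A minimal Python sketch of the artificial neuron described above (the function and variable names are mine):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # squashing (activation) function

def neuron_output(signals, weights):
    s = np.dot(signals, weights)        # weighted sum S1*W1 + S2*W2 + ...
    return sigmoid(s)                   # close to 1 when the sum is large, close to 0 otherwise
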
(Refer Slide Time: 6:50)

In artificial neural networks already I have shown you if the Weights, the Weights if you see W1,
so W1 is the weights and another signal is suppose S2 and W2. So, W1, W2 another signal I am
considering S3 W3 is the weight. And this is the structure of the artificial neural network. So,
W1, W2, W3 these are the Weights of the artificial neural network. So, I have I may have fix
networks in case of the fix network the weights cannot be changed.

dw
So, that means the = 0. So, I cannot changed the weights of the network. But in case of the
dt

dw
adaptive network, I can changed the weights that means the is not equal to 0 that is the
dt
adaptive networks. So, mainly I will be considering adaptive networks because the knowledge of
the artificial neural network is stored in the weights. In case of the fix network the weights are
fixed I cannot change the weights.

But in case of the adaptive networks, I can change the weights and that is nothing but the training
of the artificial neural network. Because the knowledge of the artificial neural network is
available in the weights. So, I have to change the weights based on the training samples. And this
is called the learning of the artificial neural network.

(Refer Slide Time: 8:26)

So, every neural network has knowledge which is contained in the values of the connected
weights that I have already explain that term knowledge of the artificial neural network is
available in the weights. And that means I can modify the weights and, in this case, the
modifying the knowledge stored in the network as a function of experience implies a learning
rule. So, that means what is the training of the artificial neural network? That is nothing but the
changing the weights corresponding to the training samples. So, information is stored in the
weight matrix of the artificial neural network.

(Refer Slide Time: 9:10)

Now in case of the artificial neural networks we have to consider supervised learning and the
unsupervised learning. So, what is the supervised learning? In a particular artificial neural
network we know the desired output that information we have. So, corresponding to particular
training sample we know what is the desired output. And also, we can calculate the actual output.
The difference between the desired output and the actual output is called the error.

And the error is back propagated to the input so that we can changed the weights. And the
objective is to minimize the error. The error is the desired output minus actual output. So, I can
give one example suppose if I consider this is the artificial neural network and these are the
weights W1, W2 and I am considering the input S1, S2. So, in this case corresponding to the
input training samples I know what is the desired output.

That information is available so that is why it is called the supervised learning. So,
corresponding to a particular class the training samples are available. And corresponding to this
training samples I know what is my desired output that information I know. And from the input I
can calculate the actual output. So, that also I can determine the actual output I can determine.
And the difference between the desired output and the actual output is nothing but the error.

So, the objective is to minimize the error. So, that means what I can do to minimize the error. So,
for this I have to adjust the weights, so I can adjust the weight so that I can reduce the error. So,
because the signal output is if I consider a signal output what is the output, output is nothing but
S 1 W 1+S 2 W 2. So, that means I can change the weight so that I can reduce the error. So, that is
the concept of the supervised learning.

The error is back propagated to the input so that I can change the weights of the artificial neural
networks. And that is called the supervised learning of the artificial neural network. For this we
can use the concept like the least mean square convergence. Because I have to change the
weights to minimize the error.

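A hedged sketch of the kind of least-mean-square style weight adjustment described above, for a single linear unit (learning rate and names are my own choices):

import numpy as np

def lms_update(weights, x, desired, lr=0.1):
    actual = np.dot(weights, x)          # actual output of the unit
    error = desired - actual             # desired output minus actual output
    return weights + lr * error * x      # change the weights to reduce the error
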
(Refer Slide Time: 11:50)

And in case of the unsupervised learning what actually it is nothing but the clustering the
grouping of the training samples. So, in case of the machine learning algorithms I explain the
concept of the K-mean clustering. Similar, to the k-mean clustering in case of the unsupervised
artificial neural networks I can do clustering. I can do the grouping of the training samples. So,
that is the unsupervised learning.

(Refer Slide Time: 12:22)

Now, let us consider one artificial neural network here I am considering the weights W1, W2,
W3 and suppose my input is - 0.06, - 2.5, 1.4 and I am considering the function is the sigmoid

1
function. So, f ( x ) = . So, this neural network I am considering now.
1+e−x

(Refer Slide Time: 12:53)

And corresponding to this neural network if you see, if I want to calculate the value x that is the
sum value. So, I can determine the sum value like this. So, input that is the input value is
multiplied with the weight value the weight is 2.7 plus I am considering the second input the
second input is 2.5 and it is multiplied with 8.6 and 1.4 into 0.002 corresponding to this I can
determine the value x, x is the sum value.

(Refer Slide Time: 13:27)

Now, let us see how to do the training in case of the supervised artificial neural networks. So,
what we have to do? I have to determine the actual output and also we know the desired output.
The difference between these two is called the error. So, I have to minimize the error for
minimization of error I have to adjust the weights of the artificial neural network. So, in this
example I am considering one artificial neural network.

And you can see I have the input layer, suppose one layer is there, there is the input layer another
layer is there suppose these is the hidden layer. And another layer suppose I have the output
layer. So, in this case I am considering two classes that is pattern classification problem. So,
corresponding to the first class the output will be 0 0 and corresponding to the second-class
output will be 1 and I am giving the input data set.

The input data set is 1.4 2.7 1.9 and corresponding to this the class will be 0. So, that information
is available that is why it is called the supervised learning. Now how to do the training here you
can see so first I have to initialize the random weights. So, random weights I have to select in the
artificial neural networks. So, randomly I am selecting the weights and let us consider the first
input, the first input is 1.4, 2.7 and the 1.9.

And corresponding to this the output should be 0 corresponding to a particular class that class is
suppose W1. So, I am getting 0.8 corresponding to these inputs. So, input is 1.4, 2.7, 1.9 and
what I am doing randomly I am selecting the weights of the artificial neural network. So,
corresponding to this I am getting the output, output is 0.8 but my output should be 0. So, that is
why the actual output minus desired output is 0.8 that is the error.

So, I know what is the desired output corresponding to a particular training data. So, that is why
it is call the supervise network. I am repeating this so corresponding to a particular training data,
training samples I know what will be the output. So, output should be 0 but I am getting the
output 0.8. So, that is why the error will be 0.8. the error means the difference between the
desired output and actual output. So, I have to minimize this error so for this I have to adjust the
weights of the artificial neural network. So, that means I am adjusting the weights of the artificial
neural network.

(Refer Slide Time: 16:17)

After this you can see let us considered the second input the second input is suppose 6.4, 2.8, 1.7
and corresponding to this my output should be 1, corresponding to another class so suppose the
class is W2. And in this case if I give this inputs 6.4, 2.8 and 1.7 I will be getting the output,
output is 0.9 but desired output should be 1. So, that is why the error will be - 0.1. So, to
minimize the error I have to adjust the weights of artificial neural networks.

So, I am adjusting the weights of the artificial neural network. And this process I have to repeats
all the training samples and finally I will be getting the train artificial neural networks. Train
means the weights will be adjusted. So, this is the concept of the training of artificial neural
network.

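As an illustration of this training loop, here is a simplified single sigmoid neuron trained on the two samples quoted above (the network in the lecture has a hidden layer; this sketch only shows the repeated error-driven weight adjustment):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[1.4, 2.7, 1.9], [6.4, 2.8, 1.7]])   # training samples from the lecture
t = np.array([0.0, 1.0])                            # desired outputs (class 0 and class 1)

rng = np.random.default_rng(0)
w = rng.normal(size=3)                              # random initial weights
b = 0.0
for epoch in range(1000):                           # repeat over all training samples
    for x, target in zip(X, t):
        y = sigmoid(np.dot(w, x) + b)               # actual output
        err = target - y                            # error = desired - actual
        grad = err * y * (1.0 - y)                  # gradient through the sigmoid
        w += 0.1 * grad * x                         # adjust the weights
        b += 0.1 * grad
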
(Refer Slide Time: 17:16)

And you can see the concept of the decision boundary so in this case I am showing two classes
one is the yellow another one is the blue. And randomly I am initializing the weights and
corresponding to this you can see the decision boundary. Now, after this I have to adjust the
weights that means I am doing the training and you can see the moment of the decision
boundary. So, you can see the decision boundary it is moving because I am adjusting the weights
because I have to minimize the error in the output. After this again I am adjusting the weights
and you can see the position of the decision boundary. Like this I have to do the training.

(Refer Slide Time: 17:52)

And finally the decision boundary will be like this: I am getting a nonlinear decision boundary between these two classes, and it shows the separation between the blue class and the yellow class. This is the objective of artificial neural network training.

(Refer Slide Time: 18:22)

So, here you can see that the weight learning algorithms for neural networks are dumb: we have to make thousands and thousands of small adjustments, and ultimately we get the trained artificial neural network. The training procedure I have already explained.

(Refer Slide Time: 18:40)

Now, consider the function f(x) that I used in my previous slide, the sigmoid function. The sigmoid function is nonlinear, and a network with one hidden layer can then, in principle, learn any classification problem: a set of weights exists that can produce the target from the input; the problem is finding them. So, if f(x) is nonlinear, then a network with only one hidden layer will be sufficient for most classification problems.

(Refer Slide Time: 19:21)

And suppose f(x) is linear, that is, I am not considering the sigmoid function. A sigmoid function is also called an activation function. If the activation function is linear, then I can only draw straight lines, that is, straight decision boundaries between two classes. So, with a linear activation function f(x), the neural network can only draw straight decision boundaries between the classes.

(Refer Slide Time: 19:53)

And that is why we should consider a nonlinear activation function; the sigmoid function I have considered is a nonlinear activation function, and it lets us draw complex boundaries between two or more classes. So, that is the importance of the nonlinear activation function. Support vector machines, in contrast, only draw straight lines, but they transform the data first in a way that makes a straight line sufficient.

So, the support vector machine transforms the data, and after this transformation a linear decision boundary is sufficient to separate two or more classes. But in the case of the artificial neural network we consider a nonlinear activation function f(x), and that is why we can form complex decision boundaries between the classes. Now, let us consider how an artificial neural network can be used for pattern classification; I will give some simple examples.

(Refer Slide Time: 21:15)

Suppose I am considering the equation $w_0 + x_1 w_1 + x_2 w_2 = 0$. I want to do a simple pattern recognition task with a two-layer artificial neural network, and in this case I have to consider the separation of the input space into regions where the response of the network is positive and regions where the response of the network is negative.

So, suppose I consider two classes, one class here and another class there, and I want to find a decision boundary between these two. What should the response of the artificial neural network be? Corresponding to the first class the response should be positive, and corresponding to the second class the response should be negative, and I want to find the decision boundary between these classes.

So, for this I am considering the equation $w_0 + x_1 w_1 + x_2 w_2 = 0$; that is the equation of the separating line. In this case $w_2 \neq 0$, so I get the equation of a line, $x_2 = -\dfrac{w_1}{w_2} x_1 - \dfrac{w_0}{w_2}$.

That is the equation of the separating line between the two classes. Corresponding to this equation I can draw the artificial neural network: one input is the bias input 1, and the other inputs are $x_1$ and $x_2$. So, this is my artificial neural network corresponding to the equation $w_0 + x_1 w_1 + x_2 w_2 = 0$, and the separating line is $x_2 = -\dfrac{w_1}{w_2} x_1 - \dfrac{w_0}{w_2}$. This is very similar to $y = mx + c$, the equation of a straight line, and that is the decision boundary. Now, I have to get the positive response and the negative response. So, suppose $w_0 + x_1 w_1 + x_2 w_2 > 0$.

Then I get the positive response; otherwise I get the negative response. The values of $w_1$, $w_2$, and $w_0$ are determined during the training process.

Because I am considering two classes, the response of the network is positive for the first class and negative for the second class. In this network I am also considering one extra terminal, called the bias terminal: its input is 1 and the weight corresponding to this terminal is $w_0$. So, I am considering one bias terminal.

The importance of the bias terminal I will explain after some time. In this case I have shown the concept of pattern classification for a two-class problem: for one class the response of the network should be positive, and for the other class the response should be negative. In between I have to draw the decision boundary, and here the boundary is the equation of a straight line.

And that is the decision boundary between the classes. Now suppose I consider the equation $x_1 w_1 + x_2 w_2 = T$, where $T$ is a threshold, and $w_2 \neq 0$. Then I get the equation $x_2 = -\dfrac{w_1}{w_2} x_1 + \dfrac{T}{w_2}$. And in this case, consider the net input, that is, the network input.

The net input is the weighted sum of the inputs: if the net input $x_1 w_1 + x_2 w_2$ is greater than the threshold, I get the positive response; otherwise I get the negative response. So, in this case I am considering a threshold, and in this second case also I get a decision boundary corresponding to that equation.

That boundary is $x_2 = -\dfrac{w_1}{w_2} x_1 + \dfrac{T}{w_2}$. The values of $w_1$ and $w_2$ are determined during the training process, and the response will be negative when the input pattern is not a member of its class. So, in pattern classification the desired response of a particular output unit is positive if the input pattern is a member of its class, and negative when it is not.

The positive response can be represented by an output signal of 1 and the negative response by an output signal of -1. So, the response is negative when the input pattern is not a member of its class, and in a pattern classification problem the desired response of a particular output unit is positive when the input pattern is a member of its class.

That is the case I have considered. Now, what is the need of the bias in an artificial neural network? Suppose the input is $x$ with weight $w_0$, and the output is the sigmoid of the weighted input, $f(w_0 x)$, where the sigmoid function is $f(s) = \dfrac{1}{1 + e^{-s}}$. So, f is the sigmoid function, and in this first case I am not considering a bias. In the second case I consider a bias input: $x$ is the input with weight $w_0$, and the bias input is 1 with weight $w_1$; this extra terminal is nothing but the bias.

In the second case I get the output $f(w_0 x + w_1)$, the sigmoid of $w_0 x + w_1$. Now, what is the importance of the bias? Based on $w_1$, the activation function can be shifted towards the right or the left. Suppose we require the output of the network to be 0 for $x = 2$.

That means I need the output to be 0 at $x = 2$, so I have to shift the sigmoid function, that is, the activation function. If you plot the sigmoid with the input on the horizontal axis and the output on the vertical axis, the curve can be shifted along the input axis based on the value of $w_1$. Suppose $w_1 = -5$; then the sigmoid is shifted towards the right, and the output at $x = 2$ becomes close to 0. So, the activation function can be shifted towards the right or the left based on the value of $w_1$, and that is why we consider the bias terminal. Now, let us consider how we can do the pattern classification; I have already explained the decision boundary.
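Before moving on, a small sketch of this bias effect (the input weight $w_0 = 1$ is an assumed value; $w_1 = -5$ is the value used in the example above):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

w0 = 1.0          # weight on the input x (assumed for illustration)
x = 2.0

# without a bias, the output at x = 2 is fixed by w0 alone
print(sigmoid(w0 * x))              # ~0.88, far from the required 0

# with a bias input of 1 and bias weight w1 = -5, the activation is
# shifted to the right, so the output at x = 2 moves close to 0
w1 = -5.0
print(sigmoid(w0 * x + w1 * 1.0))   # sigmoid(2 - 5) ~ 0.047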

(Refer Slide Time: 31:17)

So, suppose the net sum of the network is $w_0 + \sum_{i=1}^{n} x_i w_i$. In this case the decision boundary between two regions can be obtained: I am considering two classes, and one region, say $R_1$, corresponds to the sum being greater than 0, while the other region, say $R_2$, corresponds to the sum being less than 0.

The decision boundary is $w_0 + \sum_{i=1}^{n} x_i w_i = 0$; that is the equation of the decision boundary. Depending on the number of input units in the network, this equation represents a line, a plane, or a hyperplane.

So, depending on the number of input units in the artificial neural network, this equation represents a line, a plane, or a hyperplane. The classification problem is linearly separable when all of the training input vectors for which the correct response is +1 lie on one side of the above decision boundary, and all of the training input vectors for which the correct response is -1 lie on the other side of the decision boundary.

I am repeating this: a classification problem is linearly separable when all of the training input vectors with correct response +1 lie on one side of the decision boundary and all of the training input vectors with correct response -1 lie on the other side. That is the concept of a linearly separable classification problem.

I also want to mention that a single-layer neural network can learn only linearly separable classification problems. This is one important aspect of artificial neural networks: for a nonlinearly separable classification problem I need more than one layer, that means I have to consider hidden layers in the artificial neural network. This concept I can explain with the AND logic, the OR logic, and the XOR logic; you know what AND, OR, and XOR are. Suppose my inputs are x1 and x2 and the output is y.

For the AND logic: if the input is 0 and 0 the output is 0; if it is 0 and 1 the output is 0; if it is 1 and 0 the output is 0; and if x1 is 1 and x2 is 1 the output is 1. Similarly, for the OR logic: 0 and 0 gives 0, 0 and 1 gives 1, 1 and 0 gives 1, and 1 and 1 gives 1. And for the XOR logic, again with inputs x1 and x2 and output y: 0 and 0 gives 0, 0 and 1 gives 1, 1 and 0 gives 1, and 1 and 1 gives 0.

So, XOR acts like a comparator: for similar inputs the output is 0, and for dissimilar inputs it is 1. I am considering the AND logic, the OR logic, and the XOR logic as classification problems. Suppose I consider the equation $w_0 + x_1 w_1 + x_2 w_2 = 0$; that is the equation of a decision boundary, and by using this decision boundary I can implement the AND logic.

For the AND logic, plot the inputs in the (x1, x2) plane: if x1 is 0 and x2 is 0 the output is 0, so the first point is (0, 0); next x1 is 0 and x2 is 1, the point (0, 1); next x1 is 1 and x2 is 0, the point (1, 0); and finally x1 is 1 and x2 is 1, the point (1, 1).

I am considering this as a two-class problem: if the output is 0 the class is w1, and if the output is 1 the class is w2. Corresponding to this I can draw the decision boundary, a straight line: the points (0, 0), (0, 1), and (1, 0), for which the output is 0, fall on one side in class w1, and the point (1, 1), for which the output is 1, falls on the other side in class w2. So, by using this equation I can represent the decision boundary.

So, by considering this linear decision boundary I can separate the two classes; this is the AND logic. For the implementation of this AND operation, take the weights $w_1 = w_2 = 1$ and the bias weight $w_0 = -1.5$; then the decision boundary becomes $-1.5 + x_1 + x_2 = 0$, that is, $x_1 + x_2 = 1.5$.

Corresponding to this, my artificial neural network will be: one terminal is the bias terminal with input 1 and weight -1.5, and the inputs x1 and x2 each have weight 1; the output is y. By considering this network I can implement the AND logic. So, this is the network for the AND logic: the bias input is 1 and the weight corresponding to the bias input is -1.5.
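As a quick check of these weight values, a minimal sketch of the single threshold unit (the positive-response-above-zero rule is the one described above):

# AND gate with a single threshold unit: weights w1 = w2 = 1, bias weight w0 = -1.5
def and_unit(x1, x2, w0=-1.5, w1=1.0, w2=1.0):
    s = w0 * 1 + w1 * x1 + w2 * x2      # bias input is 1
    return 1 if s > 0 else 0            # positive response -> output 1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', and_unit(x1, x2))
# the output is 1 only for the input (1, 1), matching the AND truth table

Changing only the bias weight to -0.5 turns the same unit into the OR gate, since the boundary then becomes $x_1 + x_2 = 0.5$.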

So, the equation $-1.5 + x_1 + x_2 = 0$ is the equation of the straight line that is the decision boundary; this is about the AND logic. If I consider the OR logic, what will the implementation be? Again I take the inputs x1 and x2 and plot the points: x1 = 0 and x2 = 0 gives the first point.

Then x1 = 0, x2 = 1 gives another point, x1 = 1, x2 = 0 another, and x1 = 1, x2 = 1 the last one. For the OR logic the output is 1 for the inputs (0, 1), (1, 0), and (1, 1), so I consider those as class w2, and for the input (0, 0) the output is 0, so that point belongs to class w1.

So, again I am considering a two-class problem with classes w1 and w2, and I can draw a straight decision boundary: the point (0, 0), for which the output is 0, lies on one side in class w1, and the three points for which the output is 1 lie on the other side in class w2. You can see the decision boundary for the OR logic. Finally, I want to show the XOR logic, which I have already mentioned.

(Refer Slide Time: 41:35)

Consider the inputs x1 and x2 and the output y. For the XOR logic: 0 0 gives 0, 0 1 gives 1, 1 0 gives 1, and 1 1 gives 0. If you plot these in the (x1, x2) plane, the four points are (0, 0), (0, 1), (1, 0), and (1, 1), and the output is 1 for the inputs (0, 1) and (1, 0).

So, suppose the points with output 1 belong to class w1, and the other two cases, (0, 0) and (1, 1), with output 0 belong to class w2. That is, for class w1 the output is 1 and the input is (0, 1) or (1, 0), and for class w2 the output is 0 and the input is (0, 0) or (1, 1).

Then how do we draw the decision boundary? In this case it is difficult, not like the AND logic or the OR logic, because no single straight line separates the two classes. The decision boundary has to consist of two lines: the points (0, 0) and (1, 1) give output 0, while the points (0, 1) and (1, 0) give output 1. So, you can see the decision boundary for the XOR logic. That means a two-layer network can classify samples into two classes which are separated by a hyperplane; however, a network having three layers is required when the problem is to classify samples into two decision regions,

where one class is convex and the other class is the complement of the first class. In the case of the XOR logic I get exactly this: one class is convex and the other is its complement. A convex set can be approximated by the intersection of a finite number of half-planes. The nodes in layer one can determine whether a particular sample lies in each of the half-planes corresponding to the convex region, and subsequently layer two of the network performs a logical AND to decide whether the pattern is in all of these half-planes simultaneously.

I am repeating this: a convex set can be approximated by the intersection of a finite number of half-planes, and the nodes in layer one determine whether a particular sample lies in each of the half-planes corresponding to the convex region.

This concept I am going to explain now for the implementation of the XOR logic. In the case of the XOR logic I consider $-0.5 + x_1 + x_2 > 0$ and, ANDed with it, $1.5 - x_1 - x_2 > 0$. With these two conditions I can implement the XOR logic; that means the XOR mapping is implemented as the intersection of two half-planes.

So, I am considering two half-planes, and the XOR mapping is implemented as their intersection: one half-plane for each of the inequalities above. Corresponding to these two inequalities I can draw the artificial neural network: one input is x1, another input is x2, and the bias input is 1; for the first hidden unit the bias weight is -0.5 and the weights from x1 and x2 are 1 each. So, I am drawing the network corresponding to the XOR pattern classification problem: it has two layers of weights, and the XOR mapping is implemented as the intersection of the two half-planes.
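A minimal sketch of this XOR construction is shown below. It follows the two half-planes just described; the hard-threshold units and the bias of the output AND unit (-1.5, as in the AND example earlier) are assumptions used to keep the example short.

def step(s):
    return 1 if s > 0 else 0

def xor_net(x1, x2):
    # hidden unit 1: fires when x1 + x2 > 0.5  (the first half-plane)
    h1 = step(-0.5 + x1 + x2)
    # hidden unit 2: fires when x1 + x2 < 1.5  (the second half-plane)
    h2 = step(1.5 - x1 - x2)
    # output unit: AND of the two hidden units, i.e. the intersection
    return step(-1.5 + h1 + h2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', xor_net(x1, x2))
# reproduces the XOR truth table: 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0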

And you can see the importance of the hidden layer: here I have the input layer, the output layer, and in between the hidden layer. For this problem we need the hidden layer, whereas for the AND logic and the OR logic the implementation needs only the input and output layers.

But for the XOR logic we need a hidden layer for the implementation, because there are two half-planes, given by the two inequalities above, whose intersection implements the XOR logic. Now I will show the concept of the nearest neighbor classifier, that is, how an artificial neural network can be used as a nearest neighbor classifier.

(Refer Slide Time: 49:19)

So, let us consider the concept of the nearest neighbor classifier and how to implement it. Suppose the feature vector is $x$, with elements $x = (x_1, x_2, \dots, x_j, \dots, x_d)^T$; I am considering a d-dimensional feature vector, and I am also considering C number of classes.

I am considering C classes with cluster centres, or centroids, $y_i$, for $i = 1, \dots, C$. So, I have C classes, and corresponding to each class I have a centroid. What is the principle of the minimum distance classifier?

The concept is this: to classify the input vector, I compute the distance between the vector and each of the centroids, and the input vector is assigned to the class which gives the minimum distance. That is why it is called the minimum distance classifier, or the nearest neighbor classifier. The distance between the input vector $x$ and the centroid $y_i$ can be determined as follows.

So, I am determining the distance between the feature vector $x$ and the centroid $y_i$, and this is nothing but the squared Euclidean distance, $d^2(x, y_i) = \sum_{j=1}^{d} (x_j - y_{ij})^2$, where $x$ is the input feature vector and $y_i$ is the centroid of class $i$.

So, for class one the centroid is $y_1$, for class two the centroid is $y_2$, and so on for the C classes. This equation I can expand as
$d^2(x, y_i) = (x_1^2 + x_2^2 + \dots + x_d^2) - 2(x_1 y_{i1} + x_2 y_{i2} + \dots + x_d y_{id}) + (y_{i1}^2 + y_{i2}^2 + \dots + y_{id}^2)$,
which is just expanding $(a - b)^2 = a^2 - 2ab + b^2$ term by term.

If I consider the first term, $(x_1^2 + x_2^2 + \dots + x_d^2)$, this term is common for all the classes, so it plays no role in the classification and can be dropped.

In this case I have to find the minimum distance $d^2(x, y_i)$, and the minimum distance corresponds to the maximum discriminant function, because based on the discriminant function I can take the classification decision. So, what is my discriminant function? It is $g_i(x) = x_1 y_{i1} + x_2 y_{i2} + \dots + x_d y_{id} - \dfrac{1}{2}\left(y_{i1}^2 + y_{i2}^2 + \dots + y_{id}^2\right)$, since I am considering the d-dimensional feature vector.

So, I am determining the discriminant function; the first term is neglected because it is the same for all the classes and has no role in the classification, and the minimum distance between $x$ and $y_i$ corresponds to the maximum discriminant function.

I have to compute the discriminant function $g_1(x)$ for class one, $g_2(x)$ for class two, and so on up to $g_C(x)$ for all the classes, and I have to find which one is the maximum; the maximum discriminant function corresponds to the desired class. Now, how do we implement this $g_i(x)$ in an artificial neural network?

Because I have to find the maximum discriminant function, I draw the network. My input is the d-dimensional feature vector $x$ with components $x_1, \dots, x_d$. For class one I consider the centroid $y_1$: the weights connecting the inputs to the first output unit are the components of $y_1$, and the bias weight is $-\dfrac{1}{2}\left(y_{11}^2 + \dots + y_{1d}^2\right)$, which is where the minus-half term appears.

In the same way I make the connections for class two, and so on for all C classes. Corresponding to this I can determine the discriminant function: after doing this I obtain $g_1(x)$ corresponding to class 1.

Corresponding to class two I can also determine a discriminant function, and so on up to $g_C(x)$ for class C. And here I am considering one comparator, because I have to find the maximum discriminant function, and corresponding to this my output will be the corresponding class.

So, I am considering the classes 1, 2, up to C, and for each class I have the centroid $y_1, y_2, \dots, y_C$. This is the structure of the artificial neural network corresponding to the nearest neighbor classifier: I have to compute $g_1(x), g_2(x), \dots, g_C(x)$, and after computing them I have to find the maximum discriminant function; that is why I consider the comparator, and based on it I can determine the maximum discriminant function.
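A small sketch of this computation (the centroids and the input vector are illustrative values, not from the slides):

import numpy as np

def nearest_neighbor_class(x, centroids):
    """Minimum distance classifier in discriminant-function form:
    g_i(x) = x . y_i - 0.5 * ||y_i||^2; the maximum g_i gives the class."""
    g = [np.dot(x, y) - 0.5 * np.dot(y, y) for y in centroids]
    return int(np.argmax(g))          # index of the winning class (the comparator)

# three classes in a 2-dimensional feature space (assumed centroids)
centroids = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([0.0, 5.0])]
x = np.array([4.2, 4.8])
print(nearest_neighbor_class(x, centroids))   # -> 1, the closest centroid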

The maximum corresponds to the class. So, this is the concept of the nearest neighbor classifier implemented with an artificial neural network. In this class I discussed the concept of artificial neural networks and how we can adjust the weights in a supervised artificial neural network. The training procedure is like this: I know the desired output and I can also compute the actual output.

The difference between the desired output and the actual output is the error, and I have to minimize this error; for the minimization I have to adjust the weights of the artificial neural network, and that is the training of the artificial neural network. After this I discussed the concept of the decision boundary, considering three cases: the AND logic, the OR logic, and the XOR logic.

For the AND logic and the OR logic it is very simple: one separating line can be drawn between the two classes. But for the XOR logic I need two half-planes, that means two lines, and I have shown the corresponding implementation. In my next class I will discuss different types of artificial neural networks, and I will also discuss supervised and unsupervised neural networks. So, let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications.
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture - 36
Artificial Neural Network for Pattern Classification
Welcome to NPTEL MOOCS course on Computer Vision and Image Processing Fundamentals and Applications. In my last class I discussed the basic concept of artificial neural networks. I also highlighted the concepts of supervised learning and unsupervised learning, and mentioned the importance of the hidden layer in an artificial neural network. For this I gave one example: the implementation of the AND and OR logic and also the XOR logic.

For AND and OR I do not need any hidden layers, but for the XOR logic I need one hidden layer between the input layer and the output layer. I also discussed the concept of the nearest neighbor classifier. Today I am going to discuss different types of artificial neural networks, and I will also discuss the concepts of supervised learning and unsupervised learning. Let us see what the different types of artificial neural networks are.

(Refer Slide Time: 1:53)

So, the training concept I will explain later on, that is, supervised learning; training is mainly to find appropriate weights, and the knowledge of the artificial neural network is stored in the form of the weights, as I explained in my last class. Some basic neural network structures I can mention: one is the MLP, the multilayer feedforward network, also called the multilayer perceptron. In the case of the MLP I have an input layer and an output layer, and in between I may have some hidden layers. After this, the next one is the feedback or recurrent network, which I will explain; others are the Hopfield network, the competitive network, and the self-organizing network.

(Refer Slide Time: 2:49)

So, in the case of the multilayer feedforward network, the MLP or multilayer perceptron, we have one input layer, one output layer, and possibly hidden layers in between. In the simplest form of a feedforward network there is no hidden layer between the input layer and the output layer. In this figure I have shown one MLP: one input layer, one output layer, and one hidden layer in between; that is the multilayer feedforward network.

(Refer Slide Time: 3:29)

In this case also I have shown one feedforward network: you can see the input layer, the hidden layer in between, and the output layer, with the interconnections between the input layer and the hidden layer and between the hidden layer and the output layer. You can also see the neurons of the input, hidden, and output layers. So, this is one example of a feedforward artificial neural network.
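A minimal sketch of one forward pass through such a network (the layer sizes, random weights, and sigmoid activation are assumptions chosen for illustration):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)
x = np.array([1.4, 2.7, 1.9])        # input layer: 3 features

W_ih = rng.normal(size=(4, 3))       # weights from input to hidden (4 hidden units)
b_h = rng.normal(size=4)             # hidden-layer bias weights
W_ho = rng.normal(size=(2, 4))       # weights from hidden to output (2 output units)
b_o = rng.normal(size=2)             # output-layer bias weights

h = sigmoid(W_ih @ x + b_h)          # hidden-layer activations
y = sigmoid(W_ho @ h + b_o)          # output-layer activations
print(h, y)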

(Refer Slide Time: 4:03)

So, I am giving one example of multi-category classification. Suppose I am considering 3 classes $w_1$, $w_2$, $w_3$, and I want to explain the importance of the hidden layer. Suppose the decision boundaries are as shown and the hidden units are $h_1$, $h_2$, and $h_3$. I can draw the neural network: the input is the D-dimensional feature vector $x$ with components $x_1, \dots, x_D$, the hidden units are $h_1$, $h_2$, and $h_3$, and I may also have one bias input.

So, I am considering the bias as well. You can see the hidden units $h_1$, $h_2$, $h_3$; I am not showing all the interconnections, but the outputs are $y_1$, $y_2$, and $y_3$, and in between there are interconnections which I have not drawn. So, $h_1$, $h_2$, $h_3$ are the hidden units, the input is the D-dimensional feature vector $x$, and the outputs are $y_1$, $y_2$, and $y_3$.

So, consider the hidden unit values $h_1$, $h_2$, and $h_3$. If $h_1 = 1$ and $h_2 = 1$, with $h_3$ a don't care, then the output $y_1$ should be 1, and that corresponds to the class $w_1$.

If $h_2 = 0$ and $h_3 = 1$, with $h_1$ a don't care, then $y_2 = 1$ and that corresponds to the class $w_2$. And for the class $w_3$, $h_1$ should be 0 and $h_3$ should be 0, with $h_2$ a don't care; corresponding to this $y_3 = 1$, and that corresponds to the class $w_3$.

So, you can see the importance of the hidden layer: based on the hidden unit responses I am doing the classification, which is nothing but multi-category classification. In this example I have shown the importance of the hidden layer.
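A compact way to read that mapping is sketched below; it simply encodes the combinations listed above, including the don't-care entries.

def output_class(h1, h2, h3):
    """Map hidden-unit responses to one of the three classes,
    following the combinations discussed above."""
    if h1 == 1 and h2 == 1:      # h3 is a don't care -> y1 = 1, class w1
        return 'w1'
    if h2 == 0 and h3 == 1:      # h1 is a don't care -> y2 = 1, class w2
        return 'w2'
    if h1 == 0 and h3 == 0:      # h2 is a don't care -> y3 = 1, class w3
        return 'w3'
    return None                  # combination not covered by the example

print(output_class(1, 1, 0))     # -> w1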

(Refer Slide Time: 8:37)

Now, the next one is the feedback network. What is a feedback network? In a feedback network the output is fed back to the input layer; it is also called a recurrent neural network. Feedback networks can have signals travelling in both directions, and this is implemented by introducing loops in the network. Feedback networks are generally dynamic, and their state changes continuously until they reach an equilibrium point.

A recurrent network is one example of a feedback network: a recurrent neural network uses feedback connections in a single-layer neural network architecture. So, that is briefly the concept of the feedback network: the output is fed back to the input layer. After this, the next one is the Hopfield network.

You can see the basic concept of the Hopfield network: every node is connected to every other node, but not to itself. In this figure node 1 is connected to node 2 and node 3, and so on, and the connection weights are symmetric, meaning the weight from node i to node j is the same as that from node j to node i. In the figure, the weight from node 1 to node 2 is 1 and from node 2 to node 1 it is also 1; those are symmetric weights.

Similarly, the weight from node 2 to node 3 is 1 and from node 3 to node 2 it is also 1, and the weight from node 1 to node 3 is -2 and from node 3 to node 1 it is -2. So, every node is connected to every other node and the weights are symmetric; that is the concept of the Hopfield network.

(Refer Slide Time: 11:10)

After this, the next one is the competitive network. The outputs of a feedforward network are fed to a competitive layer: the first part is the feedforward network and then there is one competitive layer. The competitive layer has the same number of input and output nodes, which is equal to the number of feedforward network outputs.

So, suppose $I_1$, $I_2$, $I_3$, and so on are the inputs to the competitive layer. In the competitive layer, the output node corresponding to the maximum input fires. Suppose the outputs from the competitive layer are $O_1$, $O_2$, $O_3$, and at a particular time $I_2$ is the maximum among $I_1$, $I_2$, $I_3$, and so on.

Corresponding to this, the output will be $O_2$, because $I_2$ is the maximum. So, the competitive layer has the same number of input and output nodes as the number of feedforward network outputs, and finally the output node corresponding to the maximum input fires. That concept I have explained.
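A one-line sketch of that winner-take-all behaviour (the input values are illustrative):

import numpy as np

I = np.array([0.2, 0.9, 0.4])      # inputs I1, I2, I3 to the competitive layer
O = (I == I.max()).astype(int)     # only the output node with the maximum input fires
print(O)                           # -> [0 1 0], i.e. O2 fires because I2 is maximum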

(Refer Slide Time: 12:59)

In this figure I am showing one architecture for the competitive layer. I am considering three inputs $D_1$, $D_2$, and $D_3$, and this structure can be used for determining the minimum among the three inputs using perceptrons. From the figure you can see that I want to determine the minimum of $D_1$, $D_2$, and $D_3$: for this I consider a first threshold equal to 0 and a second threshold equal to 3/2, and I determine the minimum. If the output $O_1$ equals 1, that corresponds to $D_1$ being the minimum; otherwise it will be 0.

(Refer Slide Time: 14:00)

After this, the next concept is the self-organizing network. This is a set of interconnected neurons which compete for the signal. In the figure I have shown the input vector: the self-organizing network defines a mapping from the n-dimensional input data space onto a 1- or 2-dimensional array of nodes in such a way that the topological relationships in the input space are maintained when mapped to the output array of nodes. That is, I take an n-dimensional data vector as the input and map it onto a 2-dimensional array of nodes.

In this mapping I maintain the topological relationships of the input space: for example, if $x_2$ is close to $x_1$ in the input space, that neighbourhood relationship is preserved after the mapping, so the node corresponding to $x_2$ lies in the neighbourhood of the node corresponding to $x_1$.

So, the topological relationships in the input space are maintained during the mapping; this is the idea of the self-organizing network. We will discuss one important self-organizing network, the Kohonen self-organizing network, later on; it is very important because a self-organizing network can be used for clustering of input data.

(Refer Slide Time: 15:55)

So, first I will discuss the concept of supervised learning, and after this I will discuss the concept of unsupervised learning in artificial neural networks. In supervised learning we know the desired output of the artificial neural network, and we can also determine the actual output from the network; the difference between these two is called the error. I have to minimize the error, and for this I back-propagate the error and adjust the weights of the artificial neural network. The objective is to reduce the error between the desired output and the actual output; that is supervised learning.

(Refer Slide Time: 16:44)

So, the concept is the Generalized Delta Rule, GDR. First I apply the inputs to the network, then I determine all the neuron outputs, and then I compare the outputs at the output layer with the desired outputs. Since we know the desired output corresponding to a particular training sample, this is supervised learning. After this I compute the error measure and propagate it backward through the network.

The objective is to minimize the error, so I have to adjust the weights: the error is minimized at each stage through unit weight adjustments. That is the procedure of supervised learning with the generalized delta rule, and it is called backpropagation training because the error is back-propagated towards the input.

(Refer Slide Time: 17:46)

Mathematically, how can I explain this? Suppose I have the inputs $x_1, x_2, \dots, x_n$, where n is the number of input nodes, and I consider $W_{ij}$, the weight connecting input node i and output node j. We also know the desired output at a particular output node j; that information is available, which is why this is called supervised learning.

We can determine the output error, which is nothing but the desired output minus the actual output; the actual output can be computed from the artificial neural network. I have to minimize this error, and for that I have to adjust the weights, so I differentiate the error with respect to the weights $W_{ij}$ and minimize it.

After the minimization I get the weight updation rule: the new weight is the old weight plus a correction proportional to the error between the desired output and the actual output, multiplied by the input $x_i$, that is, $W_{ij}^{new} = W_{ij}^{old} + \eta\,(d_j - y_j)\,x_i$. The parameter $\eta$ controls the learning rate and decides the rate of convergence.

If $\eta$ is small the convergence will be slow, and if $\eta$ is high the convergence will be fast. So, this is the weight updation rule for backpropagation training: the objective is to reduce the error, and for this I have to adjust the weights.

(Refer Slide Time: 20:07)

Here you can see one example of backpropagation training: I have the input layer, and I also consider the bias inputs; this is the output layer, this is my input layer, and we can have hidden layers between them. The concept is the same: we know the desired output, we can determine the actual output of the network, and we have to minimize the error, so I have to adjust the weights.

All the weights have to be adjusted so that the error is minimum. The error is back-propagated towards the input, which is why it is called backpropagation training, and based on this I can adjust the weights to minimize the error. This is the concept of backpropagation training, and I can explain it like this.

(Refer Slide Time: 21:12)

Suppose I consider a very simple network: my input vector is $x$, $W$ is the weight vector of the network, and the output is $y = x^T W$, the dot product between the input vector and the weight vector. In this case I can determine the error: the error is nothing but the desired output minus the actual output, and I have to minimize it.

For this I differentiate the squared error with respect to $W$; the gradient is $-2(y_d - y)\,x$. How do we get this expression? Suppose $y = W_1 x_1 + W_2 x_2$, with two inputs $x_1$ and $x_2$ and weights $W_1$ and $W_2$. Then the squared error is $E = (y_d - W_1 x_1 - W_2 x_2)^2$, the desired output minus the actual output, squared.

I have to minimize this error, so I take the derivative with respect to $W_1$: $\dfrac{\partial E}{\partial W_1} = -2\,(y_d - W_1 x_1 - W_2 x_2)\,x_1$. Similarly, the derivative with respect to $W_2$ is $\dfrac{\partial E}{\partial W_2} = -2\,(y_d - W_1 x_1 - W_2 x_2)\,x_2$.

From this I can adjust the weights: the new weight vector $W^*$ is the old weight vector minus $\dfrac{1}{2}\eta$ times this gradient, where $\eta$ controls the learning rate, that is, the rate of convergence, as I have already explained: if $\eta$ is small the convergence will be slow, and if $\eta$ is high the convergence will be fast.

This algorithm is also called the gradient descent algorithm, because to minimize the error I determine the gradient and move against it. So, you can see how to adjust the weights, how to get the new weight from the old weight, so that the error is minimized. This is the concept of supervised learning, and this is nothing but the backpropagation algorithm.
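A small sketch of this gradient descent update for the two-input linear unit derived above (the training data, learning rate, and number of passes are assumptions):

import numpy as np

# one linear unit y = w1*x1 + w2*x2 trained by gradient descent on the squared error
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 1.5]])   # illustrative inputs (x1, x2)
yd = np.array([5.0, 4.0, 3.5])                       # illustrative desired outputs

w = np.zeros(2)     # initial weights (W1, W2)
eta = 0.05          # learning rate: controls the rate of convergence

for epoch in range(500):
    for x, d in zip(X, yd):
        y = np.dot(w, x)                 # actual output
        # dE/dw = -2 (yd - y) x ; move against the gradient
        w = w + eta * (d - y) * x
print(w)            # approaches the weights that minimize the squared error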

(Refer Slide Time: 25:32)

The next type of learning is unsupervised learning. Unsupervised learning is used for grouping of input data, that is, clustering of the input data; this unsupervised learning process is also termed a self-organizing map. In unsupervised networks, in general, we do not use an external target output to adjust the weights: the network itself organizes the data presented to it.

Unsupervised networks look for regularities or similarities in the input data and make adaptations according to the function of the network. One important point: supervised learning is performed offline, while unsupervised learning is performed online. Now I will explain some of the unsupervised learning techniques.

(Refer Slide Time: 26:31)

One is competitive learning. Competitive networks cluster, encode, and classify input data streams; the objective is to classify an input pattern into one of M classes. The network learns to form classes of exemplars, sample patterns, according to some similarity of these patterns: I have to find the similarities of the patterns and, based on this, form the classes. Patterns in a cluster should have similar features, so based on similarity we can group the patterns.

The competitive learning method generates the weight vectors as follows: all the weights are randomly initialized and the training vectors are applied one by one. If the output node j fires for an input x, the corresponding weight vector is updated; that means I have to determine the winner, and only the winner's weight vector is updated, the new weight being obtained from the old weight by the updation rule.

So, the winner is updated; let me explain what competitive learning is. First, the winner is determined by computing the distance between the weight vectors of the connected neurons and the input vector: the winner is that neuron whose weight vector has the smallest distance to the input vector, and the square of the minimum Euclidean distance is used to select the winner.

So first I have to select the winner, and subsequently the weight vector $W_i$ of the winning neuron is moved towards the input vector $x$; the winner is updated by the updation rule I am going to explain now. The first step is to determine the winner by computing the distances between the weight vectors of the connected neurons and the input vector, using the square of the Euclidean distance. After determining the winner, the weight vector $W_i$ of the winning neuron is moved towards the input vector $x$: if the output node i fires for an input vector $x$, the corresponding weight vector is updated, and the updation is like this.

(Refer Slide Time: 29:29)

So, I am considering one competitive network: $x$ is the input feature vector, a D-dimensional feature vector, and I am considering C classes. The outputs $y_1, y_2, \dots, y_C$ correspond to the C clusters, whose cluster centres I am considering.

Corresponding to the first cluster centre $y_1$ I consider the weight vector $W_1$, and in general the output $y_i$ depends on the weight vector $W_i$; that is, each cluster centre is represented by its weight vector $W_i$. For competitive learning I have to determine the winner, and for this I have to determine the distance between the input feature vector and each centroid.

For that I consider the Euclidean distance: I determine the distance between the input vector $x$ and each of the centroids, and the minimum distance determines the winner. After determining the winner, we have to update the weight vectors.

Suppose the winner is the weight vector $W_i$, corresponding to the output $y_i$; I have to update this winner. The updation is $W_i^* = W_i + \epsilon\,(x - W_i)$, where $W_i^*$ is the new weight vector, $W_i$ is the old weight vector, and $\epsilon$ is a small fraction.

This is the weight updation rule: since $W_i$ is the winner, the rule moves $W_i$ nearer to $x$. What about the updation rule for the other neurons $W_j$, $j \neq i$, which are not the winner? For those losers the described rule moves $W_j$ away from $x$.

So, for the winner, $W_i$ is moved towards $x$ by a fraction of the distance between them, and I get the new weight vector $W_i^*$; for the losers, $W_j$ is pushed away from $x$.

So, the concept of competitive learning is this: first I determine the winner based on the minimum distance, and then I update the winner using the formula above, pushing $W_i$ nearer to $x$; the losers, in contrast, are not pulled towards the data.

What is the problem with this competitive learning? Only the winner is pulled towards $x$, while the losers move away from the data, so neurons that rarely win may never come to represent any input. This is called the problem of under-utilization: the winner is updated towards the data, but the losers are not. With this scheme I obtain the clusters in competitive learning.
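A minimal sketch of this winner-take-all learning rule is given below (the data, the number of cluster units, and the value of ε are assumptions; only the winner update is shown):

import numpy as np

rng = np.random.default_rng(2)
# two illustrative clusters of 2-D feature vectors
X = np.vstack([rng.normal([0, 0], 0.3, size=(100, 2)),
               rng.normal([3, 3], 0.3, size=(100, 2))])

W = rng.normal(size=(2, 2))   # one weight vector (cluster centre) per output unit
eps = 0.05                    # small learning fraction

for epoch in range(20):
    for x in X:
        d = np.sum((W - x) ** 2, axis=1)   # squared Euclidean distance to each unit
        i = int(np.argmin(d))              # the winner: the smallest distance
        W[i] = W[i] + eps * (x - W[i])     # push the winner towards x
print(W)                                   # the rows typically approach the two cluster centres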

(Refer Slide Time: 36:54)

Next I am considering another network, called frequency-sensitive competitive learning. In competitive learning the updation rule is $W_i^* = W_i + \epsilon\,(x - W_i)$, where $W_i$ is the old weight and $W_i^*$ is the new weight. In frequency-sensitive competitive learning the parameter $\epsilon$ is defined explicitly.

$\epsilon$ is defined as $\epsilon = 1/F_i$, where $F_i$ is the number of times $x$ has been mapped to $W_i$, that is, the number of times $W_i$ has been the winner. So, a frequency is defined, and based on this frequency I define the small fraction $\epsilon = 1/F_i$. That means I can write $W_i^* = \epsilon x + (1 - \epsilon) W_i$; this is the weight updation rule for frequency-sensitive competitive learning.

So, this is my new weight and this is the old weight, and since $\epsilon = 1/F_i$, the rule is nothing but $W_i^* = \dfrac{1}{F_i} x + \dfrac{F_i - 1}{F_i} W_i$. This is the weight updation rule. As an example, suppose $x_1$ is mapped to the weight vector $W_i$; that means $W_i$ has been the winner once.

Corresponding to this, the weight updation rule gives $W_i^* = x_1$, because $W_i$ has won only one time. Suppose the next feature vector $x_2$ is also mapped to $W_i$; then $W_i$ has been the winner twice, so the new weight will be $\dfrac{x_1 + x_2}{2}$. After this, suppose the next vector mapped to $W_i$ is $x_{20}$, with no mapping in between.

So, first mapping is x 1 is mapped into Wi, that means 1 time Wi is the winner. Next x 2 is
mapped into Wi that means 2 times Wi is the winner. In between there is there is no mapping
and after this x 20 is mapped into Wi that means, how many times Wi is the winner, the 3
times. So, that means the new weight will be x 1 divided by 3. So, like this I can determine
the Wi star, that is the new weight vector I can determine.

And in this case, it is nothing but the centroid you can see it is the centroid I am determining.
So, in this example what I am considering So, 3 times Wi is the winner, that means the
frequency I am defining, frequency Fi. So, in this example 3 times it is mapped into Wi. So,
that is why based on this I am determining the updated or the new weights I am determining
and that is nothing but I am determining the centroid. So, this is the concept of the frequency
sensitive competitive learning and that is one variant of the competitive learning.
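
A small sketch of this frequency-sensitive update (Python/NumPy, names illustrative); with ε = 1/F_i the weight W_i is exactly the running mean, that is, the centroid, of the samples it has won so far:

import numpy as np

def fscl_update(W, counts, x):
    """Frequency-sensitive competitive learning step.
    W : (k, d) weights, counts : (k,) number of wins so far, x : (d,) sample."""
    i = int(np.argmin(np.sum((W - x) ** 2, axis=1)))   # winner
    counts[i] += 1
    eps = 1.0 / counts[i]                              # epsilon = 1 / F_i
    W[i] = eps * x + (1.0 - eps) * W[i]                # running mean of the samples won
    return i

# Example from the lecture: if x1, x2 and x20 are the samples mapped to W_i,
# after the third win W_i equals their centroid (x1 + x2 + x20) / 3.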

(Refer Slide Time: 42:39)

The next competitive learning technique is the Kohonen neural network, which is also called a Self-Organizing Map (SOM). What is the concept of the Kohonen neural network? The weight updation rule for the winner is W_i* = W_i + ε(k, 0)(x − W_i). First I have to determine the winner based on the minimum distance; suppose W_i is the winner. So, for the winner, this is the weight updation rule.

And for the other neurons the weight updation rule is W_j* = W_j + ε(k, D)(x − W_j). In case of competitive learning the winner is updated, it moves towards the feature vector x, but the losers are not updated towards the data, they move away from x; that is the problem already explained, the problem of under-utilization. That problem is eliminated in the Kohonen neural network, that is, the self-organizing network. So, for the winner the rule is W_i* = W_i + ε(k, 0)(x − W_i).

And W_j* = W_j + ε(k, D)(x − W_j) for the other neurons. So, here you can see that ε is now a function of two parameters, one is k and the other one is D. What is k? k is nothing but the frequency, and D is the distance between W_i and W_j. In case of the winner the second argument is 0, because the distance between W_i and itself is 0; that is why I am considering ε(k, 0) there.

But if I consider the distance between W_i and W_j, that distance is D. In case of the Kohonen neural network, I also have to define a maximum distance, D_max. What I am actually doing here is considering a neighborhood around the winner: suppose the winner is W_i, the winner is updated, and ε is a function of k and D.

So, I am considering the neighborhood defined by D_max. Suppose the neighborhood contains other weight vectors, say W_h and W_m. Then the winner is updated, and the weight vectors in the neighborhood of W_i are also updated; so W_h will be updated and W_m will be updated. That means all the weight vectors near the winning weight vector W_i are also updated, and the amount of updating is controlled by ε, which is a function of the frequency and the distance.

Up to D_max, that is, within the maximum distance, all the weight vectors can be updated; beyond D_max there is no updating, which means the learning rate becomes 0 there. The weight vectors farther from W_i are updated less, and after D_max there is no updating at all. So, up to D_max I can update all the neighborhood weight vectors.

So, the winner is selected based on the minimum distance, and unlike the competitive network, where only the winner is updated, here the winner is updated and all the other weights in the neighborhood of W_i are also updated, based on the frequency and the distance. I am repeating this: unlike competitive learning, the weights associated with the other nodes within its topological neighborhood are also updated along with the winner node.

The Kohonen neural network, that is, the self-organizing map, is called a topology-preserving map, because it assumes a topological structure among the cluster units. The self-organizing map defines a mapping from the n-dimensional input data space onto a one- or two-dimensional array of nodes in such a way that the topological relationships in the input space are maintained. So, a self-organizing map is a structure of interconnected neurons which compete for a signal; that concept was already explained.

That is why self-organizing networks are also called topology-preserving maps: they assume a topological structure among the cluster units. Unlike competitive learning, the weights associated with the other nodes within the topological neighborhood are also updated along with the winner node; in competitive learning only the winner is updated and the losers are not.

In case of the Kohonen neural network, or self-organizing map, the winner is updated and its neighbors are updated as well. The learning rate monotonically decreases with increasing topological distance, and it also decreases as training progresses. The winning neuron is considered as the output of the network, and the strategy is winner-takes-all: first determine the winner, then update the winner and the neighborhood around it.

That is why it is called winner-takes-all. In the SOM, nodes compete with each other using the winner-takes-all strategy. I will show the lateral connections later on, but this is the concept of the Kohonen neural network: the problem of under-utilization that we had in competitive learning is eliminated, because the neighborhood around the winner is also updated. A small sketch of this neighborhood update is given below.
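
A minimal sketch of the neighborhood update (Python/NumPy; the names and the linearly decaying neighborhood factor are illustrative). Following the description above, the distance is measured between weight vectors and from the winner, which is exactly the point criticized next; many SOM implementations measure the neighborhood on the 2-D grid of node coordinates instead:

import numpy as np

def som_update(W, x, eps_k, D_max):
    """Kohonen-style step: the winner and its neighbours move towards x."""
    i = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # winner (minimum distance to x)
    for j in range(len(W)):
        D = np.linalg.norm(W[j] - W[i])                 # distance from the winner
        if D <= D_max:                                  # inside the neighbourhood
            h = 1.0 - D / D_max                         # decays with distance, 1 for the winner
            W[j] += eps_k * h * (x - W[j])
    return i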

Now, there is one problem here. Suppose W_i is the winning weight vector, x is the feature vector, and W_j and W_k are two other weight vectors. As described, the distance is computed from the winner, not from the feature vector. Suppose the distance between W_i and W_j is less than D_max, while the distance between W_i and W_k is greater than D_max.

That means W_j will be updated but W_k will not, because the distance is measured from the winner W_i; that is what winner-takes-all gives us here. However, the distance should really be measured from x, the feature vector. Suppose the distance from x to W_j is D_j and the distance from x to W_k is D_k, and in this case D_j is greater than D_k.

So, if you consider the distance from the feature vector, then W_k should be updated first, and only after this W_j, because the distance from x to W_k is smaller than the distance from x to W_j. But since here the distance is determined from the winner, W_j is updated first, and W_k may not be updated at all, because its distance from W_i is greater than D_max.

So, the problem is that we are not determining the distance from the feature vector x; that is the limitation of the Kohonen neural network. This problem can be eliminated by considering another competitive network, called the fuzzy Kohonen neural network.

(Refer Slide Time: 53:13)

In this slide, I am showing one structure of the self-organizing map, that is, the Kohonen neural network. You can see the input layer, the hidden layer and the output layer, and also the mapping from the input layer onto the 2D array of nodes. The input is an n-dimensional vector, it is mapped onto the 2D array, and the winning neuron is marked. In the SOM, nodes compete with each other using the winner-takes-all strategy.

I can also consider lateral connections; the lateral connections are employed to develop a competition between the neurons of the network. The neuron having the largest activation level among all the output layer neurons is considered as the winner, and the activity of all other network neurons is suppressed in the competition process. These lateral feedback connections produce excitatory or inhibitory effects depending on the distance from the winning neuron, and this is implemented by using a Mexican hat function. In the next slide I will show what the Mexican hat function is and how it produces excitatory or inhibitory effects.

(Refer Slide Time: 54:49)

So, let us see what the Mexican hat function is. By considering this Mexican hat function for the lateral connections, we can produce an excitatory effect or an inhibitory effect. The Mexican hat function describes the distribution of the weights between the neurons in the Kohonen layer, because we have to consider the neighborhood weights.

After D_max, the weights are not considered any more; that is why we get the inhibitory effect there. For the excitatory effect, we consider the neighborhood weights around the winning neuron. The remaining problem of the Kohonen neural network is that the distance is computed from the winner, whereas it should be computed from the feature vector. To overcome this problem, I can consider another competitive network, called the fuzzy Kohonen neural network.

(Refer Slide Time: 55:56)

That concept I am going to explain now; it is called the fuzzy Kohonen neural network. First, I want to briefly explain the concept of a fuzzy set as compared to a normal (crisp) set. Suppose A is a set, X is the domain, and x_i is an element of X. I am considering one function, the membership function.

μ_A(x_i), the membership function of the set A, is equal to 0 if x_i is not an element of A, and equal to 1 if x_i is an element of A. This is the case of the crisp set, so for the crisp set the membership is either 0 or 1. I can show this pictorially: plotting the membership grade against x, the crisp case is a step that is either 1 or 0, whereas for a fuzzy set the membership function has a smooth shape.

So, one curve is the crisp set and the other is a fuzzy set. You should read the concept of fuzzy logic in detail from a book on fuzzy logic; here I am explaining the crisp set and the fuzzy set only briefly. In case of the crisp set, the membership value is either 0 or 1, but in case of the fuzzy set, based on the membership function, the membership grade can take values like 0, 0.1, 0.2, 0.3, 0.4, 0.5 and so on; the membership grade lies between 0 and 1.

So, this concept you should read in more detail from a book on fuzzy logic; briefly, that is the difference between the fuzzy set and the crisp set. Now, what is the fuzzy Kohonen neural network? We have seen the problem of the Kohonen neural network, namely that the distance is measured from the winner. In case of the fuzzy Kohonen neural network, the weight updation rule is W_i* = W_i + [μ_i^m(x) / Σ_{x'} μ_i^m(x')] (x − W_i).

So, this is the weight updation rule for the fuzzy Kohonen neural network, where W_i* is the new weight and W_i is the old weight. In place of the ε of the competitive network, I am considering this membership-based factor: μ_i(x) is the membership grade, and the exponent m is defined as m = m_0 − k Δm.

Here m_0 is some initial high value, k is the iteration number, and Δm is the step size, defined as Δm = (m_0 − 1)/k_max, where k_max is the iteration limit, that is, the maximum number of iterations. So, Δm is the step size, and the membership grade is what enters the updation rule.

So, in the weight updation rule, in place of ε I am writing μ_i^m(x) divided by the summation of μ_i^m(x') over all the patterns x'; here x' runs over all the training patterns. Now, what is the importance of m? To see it, I am plotting the membership against the distance from the feature vector x.

Corresponding to m = 1, the membership drops immediately with distance, so I am not considering any neighborhood; for a medium value of m, the membership falls off more gradually. So, for m = 1 there is effectively no neighborhood, but if I consider m greater than 1, a neighborhood is taken into account, and if m is very high, I am considering a larger and larger neighborhood.

So, in this case, instead of the D_max that I defined for the Kohonen neural network, I am considering m: if m is greater than 1, and especially if it is very high, more and more of the neighborhood is considered, and if m is equal to 1, no neighborhood is considered. Importantly, here the distance is computed from x; it is not computed from the winner.

And if m is greater than 1, then based on this membership function I am considering more and more of the neighborhood, and that membership grade is exactly what enters the expression above. So, this is the concept of the fuzzy Kohonen neural network.
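
A rough sketch of one membership-weighted update epoch (Python/NumPy). The update W_i = W_i + Σ_x μ_i^m(x)(x − W_i) / Σ_x μ_i^m(x) follows the rule above; the particular membership formula used here is the standard fuzzy c-means choice and is an assumption, since the lecture does not specify it, and m > 1 is assumed:

import numpy as np

def fkcn_epoch(W, X, m):
    """One membership-weighted update of all weight vectors (fuzzy c-means style).
    W : (k, d) weights, X : (n, d) patterns, m > 1 controls the neighbourhood width."""
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2) + 1e-12   # (n, k) distances from x
    u = (1.0 / d ** 2) ** (1.0 / (m - 1.0))                             # assumed membership form
    u = u / u.sum(axis=1, keepdims=True)                                # each row sums to 1
    um = u ** m
    for i in range(len(W)):
        # W_i moves to the fuzzy centroid of all patterns, weighted by membership^m
        W[i] += (um[:, i, None] * (X - W[i])).sum(axis=0) / um[:, i].sum()
    return W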

(Refer Slide Time: 64:31)

And finally, I want to summarize the advantages of artificial neural networks. One important thing is parallel processing: the inherent parallel nature provides parallel processing. The second advantage is that the overall complicated processing is split into several smaller and simpler local computations at the neurons. After this, we have adaptive learning and self-organization based on information derived from the training data.

The concept of adaptive learning and self-organization I have already explained: by using self-organization we can do grouping, that is, given the input feature vectors, we can group them based on similarity. Another advantage is that neural networks are good for the realization of real-world problems that do not conform to ideal mathematical or statistical models; suppose we have a problem for which we cannot develop a mathematical or statistical model.

For such a problem we can apply artificial neural networks. These are the advantages of artificial neural networks. So, in this class, I discussed the concept of supervised learning and unsupervised learning. For supervised learning, I considered the backpropagation neural network and backpropagation training: we know the desired output, we can compute the actual output, and from these we can determine the error.

And the error is back propagated to the input, so that we can adjust the weights of the
artificial neural network, that is the concept of the backpropagation learning. After this I
considered some unsupervised learning techniques. The first one is the competitive learning.
So, for this I have to determine the winner based on the distance between the feature vector
and the centroid. And after this, I discussed one variant of the competitive learning, that is the
frequency sensitive competitive learning.

And finally, I discussed the Kohonen neural network, that is, the self-organizing map. For this we again have to determine the winning neuron, and then update the winner and also its neighborhood; the concept is winner-takes-all. Finally, I discussed the concept of the fuzzy Kohonen neural network. So, let me stop here today. Thank you.

Computer Vision and Image Processing - Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture - 37
Introduction to Deep Learning
Welcome to NPTEL MOOCS course on Computer Vision and Image Processing, Fundamentals and Applications. I have been discussing the concept of artificial neural networks. Today I am going to discuss deep learning techniques. It would not be possible to discuss all deep learning techniques in this computer vision course, so I will briefly introduce the concept of deep learning, mainly convolutional neural networks and also autoencoders. I will also explain how deep learning is different from the traditional pattern recognition system. So, let us see the difference between the traditional pattern classification system and deep learning techniques.

(Refer Slide Time: 01:25)

In this block diagram, I have shown one traditional pattern recognition system. The input is an image, and for classification or recognition of objects I have to extract features from the image, for example SIFT features or HOG features. After this, I can apply some unsupervised or supervised classification technique for object recognition, so that the object, the car present in the image, can be recognized. This is the traditional pattern recognition system.

(Refer Slide Time: 2:11)

I have already discussed image features: we can consider SIFT features, HOG features, Textons, or features like SURF, the local binary pattern, Colour-SIFT and colour histograms. All these features can be used for image classification and object recognition.

(Refer Slide Time: 2:36)

In this case also I have shown one traditional pattern recognition system. The first block is the input, the input is the image; after this we extract the features, and these are the handcrafted features, maybe low-level features such as edges and boundaries, or SIFT and HOG features. These features are used for image recognition, image classification and object recognition, and after this we apply pattern recognition algorithms like the support vector machine, or some other supervised or unsupervised technique, for object detection and object classification. So, this is a typical traditional pattern recognition system.

(Refer Slide Time: 3:29)

And now, this is the structure of deep learning. Deep learning is nothing but a cascade of nonlinear transformations, and it is end-to-end learning; we consider a hierarchical model. Why is it called deep? Because we want to extract more and more information from the input data.

Suppose I am considering one image. I want to extract more and more information from it: low-level information, mid-level information and high-level information, and these are the features we use for image recognition and image classification. That is why it is called a deep learning approach, because more and more information is extracted from the input image, and it is end-to-end learning.

(Refer Slide Time: 4:34)

Now compare the performance of deep learning algorithms and traditional machine learning algorithms. On one axis we have the size of the training data, and on the other the performance. If we have a large amount of training data, the performance of deep learning algorithms is much better than that of traditional machine learning algorithms; but if only a small number of training samples is available, traditional machine learning algorithms will be sufficient for a particular pattern classification or recognition problem. So, the plot shows performance versus sample size, where sample size means the number of training samples available for a particular classification problem.

(Refer Slide Time: 5:30)

So now, I am considering one supervised learning technique, a traditional pattern recognition model. The input image is given, after this we extract handcrafted features, and after this we consider classifiers, maybe something like the support vector machine or random forest. The output is object recognition: outdoor scene means yes, and if it is not an outdoor scene it is no; that is the image classification. So, this is a block diagram for a traditional pattern recognition system.

(Refer Slide Time: 6:10)

Now, I am considering the deep learning method. As already mentioned, it is nothing but a cascade of nonlinear transformations. In this figure, from the input image I extract some low-level features, then mid-level features, and after this high-level features. So, I want to extract the maximum information from the input image; that is why it is called deep learning, because more and more information is extracted from the input data. After this, we consider a classifier for object recognition or object classification, or image recognition or image classification. So, you can see the distinction between the traditional pattern recognition system and the deep learning system.

(Refer Slide Time: 6:55)

In deep learning techniques, as already mentioned, we extract low-level, mid-level and high-level features. In the case of images, the pixel values or the edges can be considered as low-level features; for mid-level features we can consider texture or motifs; and for high-level features we can consider parts of an object, or the object as a whole, which we use for image recognition and image classification.

In case of text recognition, the low-level features may be the characters and the words, the mid-level features may be word groups or clauses, and the high-level features may be sentences or the story. One important point is that this deep learning approach, the hierarchical approach, is very similar to the human cortex: deep learning transforms its input representation into a high-level one, and that is very similar to what the human cortex does.

(Refer Slide Time: 8:33)

Let us see the basic concept of deep learning and how it works. First, I will explain the simple artificial neural network, what its main disadvantages are, and why we need deep learning techniques. In case of the artificial neural network, as already explained, we have the inputs, that is, the input layer, we have the output y, and also the hidden layers.

You can see the nodes corresponding to the hidden layer, a_1^1, a_2^1, a_3^1, a_4^1, and in the output layer only one node is available. So, I am considering this artificial neural network: the inputs are x_1, x_2, x_3, x_4, the hidden layer nodes are a_1^1, a_2^1, a_3^1, a_4^1, and from a_1^2 I get the output. So, this is a simple artificial neural network.

(Refer Slide Time: 9:58)

The training procedure was already explained in the class on artificial neural networks, and it is supervised learning. First, we take labelled training data (that is why it is supervised learning); then we propagate each sample through the network and get the predictions, that is, we calculate the output of the artificial neural network.

And already we know the desired output. So, the difference between the desired output and
the actual output that is the error. So, errors should be back propagated to the input, so that
we can update the weights of the artificial neural networks. So, that back propagation
technique already I have explained in artificial neural network. So, this is the concept of the
training of artificial neural network.

So, we consider the labelled data, determine the output of the artificial neural network, and compute the error, which is nothing but the desired output minus the actual output. We have to minimize this error: the error is back-propagated towards the input, and based on it we update the weights of the artificial neural network, thereby reducing the error at the output. That is the supervised training procedure of the artificial neural network.

(Refer Slide Time: 11:20)

Now I am considering a simple neural network. There are 4 inputs x_1, x_2, x_3, x_4 and one node a_1^1 in the hidden layer, with connecting weights w_1, w_2, w_3, w_4. What will be the response of the neuron a_1^1? The response is obtained by multiplying each input with its weight, w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4, and passing this sum through the activation function; in the artificial neural network class I explained the sigmoid activation function.

Now, I am considering another activation function called ReLU, the rectified linear unit, which outputs the maximum of 0 and x, that is, f(x) = max(0, x); I will discuss this concept of ReLU later in this class. If I use this max function instead of the sigmoid function, the response of the neuron becomes a_1^1 = max(0, w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4), corresponding to the inputs x_1, x_2, x_3, x_4 and the connecting weights w_1, w_2, w_3, w_4.

Similarly, for all the other hidden nodes a_2^1, a_3^1, a_4^1 I can determine the response, and finally I can determine the response of the output neuron a_1^2. Finally, I can consider one more function, the softmax function. The softmax is an activation function that scales numbers into probabilities: the values lie between 0 and 1 and it is used for multi-class classification. So, I get a probability for each class; for example, for class 1 the probability is 0.2, for class 2 the probability is 0.3, and so on.
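
A tiny forward pass illustrating these pieces (Python/NumPy); the weight values, the input values and the use of 3 output classes are made-up assumptions for illustration:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # ReLU: element-wise max(0, x)

def softmax(z):
    e = np.exp(z - np.max(z))            # subtract the max for numerical stability
    return e / e.sum()                   # probabilities that sum to 1

x  = np.array([1.0, 2.0, 3.0, 4.0])      # inputs x1..x4 (made-up values)
W1 = np.random.randn(4, 4) * 0.1         # weights into the 4 hidden neurons a1..a4
W2 = np.random.randn(3, 4) * 0.1         # weights into 3 output neurons (3 classes)

h = relu(W1 @ x)                         # hidden responses, e.g. a1 = max(0, w . x)
p = softmax(W2 @ h)                      # class probabilities, e.g. [0.2, 0.3, 0.5]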

(Refer Slide Time: 14:10)

Now, in case of the artificial neural network, let us count the number of parameters. The inputs are x_1, x_2, x_3, x_4, so there are 4 input nodes, and there are 4 hidden nodes a_1^1, a_2^1, a_3^1, a_4^1 forming the second layer. Corresponding to this, there are 4 × 4 connections. The next layer is the single node a_1^2, with 4 connections into it, and finally the output has 1 connection. So, you can see how many connections there are even in this simple artificial neural network.

(Refer Slide Time: 15:20)

Now suppose the input is one image of dimension 400 × 400 × 3, where 3 means an RGB image, and let us count the number of parameters. 400 × 400 × 3 is nothing but 4,80,000 input nodes, so I am considering input nodes x_1, x_2, and so on up to x_480000. After this, suppose the hidden layer also has 4,80,000 nodes, and consider the number of parameters of this network.

If you look at the connections between the input layer and the first hidden layer, there are 4,80,000 × 4,80,000 of them, plus the connections from the hidden layer onwards to the output terminal; that is approximately 230 billion parameters. Even if I consider only 1000 nodes in the first hidden layer, we still have approximately 480 million parameters. So, you can see the number of parameters in such an artificial neural network.

This is far too high, and that is why it is very difficult to train such a network: we would have to handle that many connections and weights. So, training a fully connected artificial neural network of this kind for image classification or image recognition is impractical; the number of parameters is too high.

(Refer Slide Time: 17:35)

Again, I am considering another example with an input layer, a hidden layer and an output layer. In the input layer we have 3 nodes (the input nodes are not counted as neurons), in the hidden layer we have 4 nodes, and in the output layer we have 2 nodes, that is, 2 neurons. If σ is the activation function, the output of the hidden layer is h = σ(W_1 x + b_1), where b_1 is the bias term; similarly, the response of the output nodes again uses the sigmoid activation function σ.

That is, the output is σ(W_2 h + b_2), where W_2 is the weight matrix and b_2 is the bias. Now, how do we train this artificial neural network? Not counting the input nodes, there are 4 neurons in the hidden layer and 2 in the output layer, so 6 neurons in total. The interconnections are the weights: 3 × 4 weights between the input and the hidden layer, and 4 × 2 between the hidden layer and the output layer, so 20 weights in total.

If I also count the biases, there are 4 bias terms for the hidden layer and 2 for the output layer, that is, 6 biases. So, we have 20 + 6 = 26 learnable parameters in this small neural network, and for realistic image inputs this number becomes far too high. How do we resolve this problem? By considering a deep architecture, the deep networks. For this I will discuss the concept of the convolutional neural network: what it is, and how the number of learnable parameters can be reduced.
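
As a quick check of the count above, a trivial sketch in Python:

# Parameter count for the 3-4-2 network above (weights + biases)
n_in, n_hidden, n_out = 3, 4, 2
weights = n_in * n_hidden + n_hidden * n_out    # 12 + 8 = 20
biases  = n_hidden + n_out                      # 4 + 2  = 6
print(weights + biases)                         # 26 learnable parameters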

(Refer Slide Time: 19:37)

For this, I am considering one convolution layer. I take the input image and one filter, the convolution filter; here it is a Laplacian filter. Corresponding to this input image, I determine the convolution between the filter and the image.

For this, I have to move the filter over the image and determine the convolution output at each position. So, for each and every pixel, I place the mask, that is, the filter, over the image and compute the convolution; this is done for all the pixels of the image.

(Refer Slide Time: 20:34)

Suppose this is the image with its pixel values. I place the mask, that is, the filter, over the image, determine the convolution, and repeat this for all the pixels. I am showing it again: like this, I am doing the convolutions.

(Refer Slide Time: 20:58)

After this I obtain the convolved image. Since I am considering a high-pass filter, the Laplacian filter, I will be getting the edges and the boundaries in the convolved image. So, this is my convolved image, and in it the edges and the boundaries are highlighted.
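
A small sketch of this filtering step in Python/NumPy (the 3 × 3 Laplacian mask and the random test image are assumptions for illustration; strictly speaking the loop computes correlation, which equals convolution for this symmetric mask):

import numpy as np

def filter2d_valid(img, k):
    """Slide a small kernel over a 2-D image and sum the products (valid region only)."""
    kh, kw = k.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + kh, c:c + kw] * k)
    return out

laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)   # a common Laplacian (high-pass) mask

img = np.random.rand(8, 8)                        # stand-in for the input image
edges = filter2d_valid(img, laplacian)            # strong responses at edges and boundaries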

(Refer Slide Time: 21:17)

So, let us see what convolution is. Suppose the input is a 4 × 4 image with pixel values a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, and I am considering a 2 × 2 filter with weights w_1, w_2, w_3, w_4. After this, I determine the convolution: I place the mask over the image and determine the response, as shown here.

Corresponding to that filter, I multiply the pixel values with the mask values (the mask means the filter): a·w_1 + b·w_2 + e·w_3 + f·w_4, and I also apply the activation function f(·). Corresponding to this, I get the output h_1 = f(a w_1 + b w_2 + e w_3 + f w_4) of the convolved image. Similarly, if I move the mask to the next position, I can determine h_2 in the same way, and the remaining entries as the mask moves over the image.

Like this, I do the convolution and obtain the feature map, because I am extracting features from the image: this particular filter extracts some important feature, and that is why I get the feature map corresponding to it. In this case, the number of parameters for one feature map is only 4, because I am only considering the filter weights w_1, w_2, w_3, w_4; compare this with the simple artificial neural network discussed earlier.

So, by using the convolution operation, I can reduce the number of parameters; that is the principle behind convolution layers, namely reducing the number of learnable parameters. With 4 parameters per feature map, even 100 feature maps need only 4 × 100 parameters, which is far fewer than in the fully connected artificial neural networks considered earlier.

(Refer Slide Time: 24:13)

After this, I want to reduce the size of the feature map further, using the concept of pooling. There are many pooling techniques, but mainly two are used: one is max pooling and the other is average pooling. Suppose this is my input matrix; max pooling takes the maximum output within a rectangular neighbourhood.

Suppose I consider a 2 × 2 neighbourhood, that is, 2 rows and 2 columns. For max pooling, we take the maximum output within that rectangular neighbourhood; for average pooling, we take the average output of the rectangular neighbourhood.

Now, I am considering only max pooling. Corresponding to this input matrix, the output matrix will be 4, 5, 3, 4: in the first 2 × 2 neighbourhood the maximum value is 4, so I take 4; in the second neighbourhood the maximum value is 5, so I take 5; in the third 2 × 2 neighbourhood the maximum value is 3, so I take 3; and in the last 2 × 2 neighbourhood the maximum value is 4, so I take 4.

That is the output matrix after max pooling, which means I have reduced the size of the feature map. So, the feature map is obtained by the convolution, and after this we apply pooling, for example max pooling, so that the size of the feature map is reduced.
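
A minimal 2 × 2 max pooling sketch in Python/NumPy; the 4 × 4 input matrix is made up, but it is chosen so that the pooled output matches the 4, 5, 3, 4 values mentioned above:

import numpy as np

def max_pool_2x2(a):
    """2 x 2 max pooling with stride 2 (assumes even height and width)."""
    H, W = a.shape
    return a.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.array([[1, 4, 5, 2],
              [3, 2, 1, 4],
              [3, 1, 4, 2],
              [2, 0, 1, 3]])        # made-up 4 x 4 feature map
print(max_pool_2x2(a))              # [[4 5] [3 4]] -> max of each 2 x 2 neighbourhood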

(Refer Slide Time: 26:08)

Here I am showing one structure of a convolutional neural network, indicating only the convolution filters and the max pool layers. Corresponding to the input image, I am considering a number of filters in each convolution layer: first a convolution layer with 64 filters, then another convolution layer, then one max pool layer; after this a convolution layer with 128 filters, again another convolution layer, and again a max pooling layer. Like this, I consider a cascade of convolution layers and max pool layers.

Finally, the output of the last max pool layer is converted into a 1-D vector, and after this I consider a fully connected network for classification; after the fully connected layers I get the outputs, mainly the classification of the objects present in the image. So, I am considering a cascade of these layers, convolution and max pooling repeated, because, as already explained, deep learning means a cascade of nonlinear transformations and it is end-to-end learning.

For learning, we can use the backpropagation technique, so the parameters of these layers are adjusted by backpropagation. As described, the final max pool output is converted into a 1-D vector, and after this we have the fully connected layers. The output of the fully connected layers gives the classes, that is, the recognition result: living room, bedroom, kitchen, bathroom, outdoor, and so on, which is the classification of the input image. This is the concept of the convolutional neural network, and a sketch of such a cascade is given below.
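
A rough sketch of such a cascade using PyTorch (PyTorch is not used in the lecture; the exact layer sizes, the 3 × 3 kernels and the 5 scene classes are assumptions). In practice the final softmax is usually folded into the loss function rather than kept as a layer:

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                    # max-pool output converted into a 1-D vector
    nn.LazyLinear(5),                # fully connected layer -> 5 scene classes
    nn.Softmax(dim=1),               # class probabilities
)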

(Refer Slide Time: 28:31)

In this illustration, I am showing the concept of convolution again: I am considering the input matrix and a 3 × 3 convolution filter, and you can see the process of the convolution. After the convolution, I get the feature map shown here.

(Refer Slide Time: 28:59)

Next, I am again showing the concept of convolution. I am considering one filter; I place the filter over the image and determine the convolution, that is, the response of the filter. Corresponding to the centre pixel, I place the filter over the image and determine the convolution output.

(Refer Slide Time: 29:25)

And like this, I shift the filter over all the pixels of the image and perform the convolution at each position. This is the concept of convolution with a filter, and I obtain the feature map; you can see the feature map here, after the convolution.

(Refer Slide Time: 29:53)

And here, if I consider this particular filter and do the convolution with it, I will be getting the vertical edges, because it is nothing but a high-pass filter: you can see from its response that it detects vertical lines. So, this convolution filter extracts the vertical lines present in the image.

(Refer Slide Time: 30:32)

Like this, I am giving some examples of convolution filters used to extract important features from the input image: low-level, mid-level and high-level features can all be extracted by considering suitable convolution filters.

(Refer Slide Time: 30:50)

Here again, I am showing the concept of the convolution layer. Instead of only one filter, we can consider multiple filters. For example, for a 200 × 200 image we may consider 100 filters of size 10 × 10, which gives about 10k filter parameters, and we have to perform the corresponding convolutions.

(Refer Slide Time: 31:19)

After getting the feature map, we apply the max pooling technique, which I have already explained: corresponding to the first 2 × 2 neighbourhood the maximum value is 6, corresponding to the second neighbourhood the maximum value is 8, and like this the max pooling operation proceeds.

(Refer Slide Time: 31:40)

So, what is the importance of the pooling layer? In this example, let us assume that the filter is an eye detector: I want to detect the eye of the face present in the image. How can we make the detection robust to the exact location of the eye? First I do the convolution to extract the feature, and after this I apply max pooling. Because max pooling keeps only the maximum value in a neighbourhood, the response becomes insensitive to the exact spatial location of the feature within that neighbourhood.

(Refer Slide Time: 32:29)

Finally, consider one stage of this architecture: a convolution layer followed by a pooling layer. Corresponding to the input image, I apply a number of filters, that is, a filter bank, and perform the convolution, which gives the feature maps.

These are the feature maps; after this we apply the activation function, for example ReLU, and then the pooling operation, either max pooling or average pooling. After applying the pooling operation, I get the output of this stage. So, one stage consists of a convolution layer and a pooling layer.

(Refer Slide Time: 33:30)

A typical CNN architecture may have a number of such stages: convolution and pooling, again convolution and pooling, and so on, and finally the fully connected layers; the output is the class labels corresponding to the input image. So, this is a typical architecture of the convolutional neural network.

(Refer Slide Time: 33:59)

Now, let us consider how an image is learned. In this example, I am considering a beak detector, that is, I want to detect this portion of the bird in the image. The point is that some patterns are much smaller than the whole image, so a small region can be represented with few parameters. How do we represent that portion with few parameters?

Here you can see that the same beak pattern appears in the second image at a different position. So, if I consider an upper-left beak detector and a middle beak detector, these two can be compressed into the same parameters, because the same pattern simply appears in different places of the image.

(Refer Slide Time: 35:19)

That means we can use a single filter as the beak detector, and that is nothing but the convolution layer: we do the convolution and obtain the feature map.

Suppose I am considering a 6 × 6 image and a set of filters, filter 1, filter 2, filter 3, and so on, where each filter can detect a small pattern of size 3 × 3. So, the filter size is 3 × 3, the image size is 6 × 6, and we have to learn the parameters of the network, namely these filters. Let us see the convolution procedure.

(Refer Slide Time: 36:26)

If I apply the convolution between filter 1 and the image, I get outputs such as 3 and −1, and I am considering stride 1, which means the filter mask is moved over the image by only 1 pixel at a time: between the first position of the filter and the second position, the filter has moved by exactly 1 pixel, and that is why it is called stride 1.

And if I consider stride 2, the filter mask is moved by 2 pixels at a time. So, if I use a higher value of the stride, I reduce the size of the feature map; the relation between stride and output size is sketched below.
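
A one-line relation that captures this (a hedged sketch; no padding assumed):

def conv_output_size(n, f, stride):
    """Spatial size of the feature map for an n x n input and f x f filter, no padding."""
    return (n - f) // stride + 1

print(conv_output_size(6, 3, 1))   # 4 -> the 4 x 4 feature map obtained with stride 1
print(conv_output_size(6, 3, 2))   # 2 -> a smaller feature map with stride 2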

(Refer Slide Time: 37:21)

Corresponding to stride 1, I get a feature map like this. You can see that filter 1 detects diagonal lines, that is, the feature extracted by filter 1 is the diagonal line: the response is 3 wherever a diagonal line is present in the image.

So, by using filter 1 I can extract the information about the diagonal lines present in the image, and after the convolution I get a feature map something like this.

(Refer Slide Time: 38:15)

After this we consider filter 2, and we repeat the same procedure for each and every filter. Corresponding to filter 2 I also get a feature map, so altogether I get a 2 × 4 × 4 array of feature maps, obtained from filter 1 and filter 2 with stride 1; stride 2 can be used in the same way.

If I consider a colour image, it has 3 channels, the R channel, the G channel and the B channel. Corresponding to these 3 channels, we determine the feature maps using filter 1, filter 2, filter 3, and so on, by convolving the image with the filters: the convolution is done with the R channel, the G channel and the B channel, and we obtain the feature map.

(Refer Slide Time: 39:24)

In this case, you can see the difference between the convolution and the fully connected network: in the fully connected network of the ordinary artificial neural network, the number of parameters is too high, and by the process of convolution we can reduce it. Here I am applying the convolution between the image and the filters, and I am getting the feature maps.

But if I take this 6 × 6 image and use a fully connected neural network, then the input layer has 36 nodes, followed by the hidden layer, with interconnections between the input layer and the hidden layer; that is the fully connected neural network, and it needs many, many parameters. Because of the convolution operation, we can reduce the number of parameters.

(Refer Slide Time: 40:34)

Here, by applying the convolution I need only a few parameters compared to the fully connected network. For example, the value 3 in the feature map is nothing but the result of the convolution between one 3 × 3 portion of the image and filter 1; corresponding to that position, the output value is 3.

To get that 3, the output node is connected to only 9 inputs; it is not fully connected, so the number of connections and parameters is reduced. In addition, the weights are shared: the second output, which is −1, uses the same 9 filter weights, so by sharing the weights we reduce the number of parameters further, which is not possible in a fully connected neural network. That means convolution helps in reducing the number of parameters of the network, as the rough comparison below shows.
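
A rough comparison for this 6 × 6 example (the 16-unit hidden layer is chosen to match the 4 × 4 feature map, and biases are ignored; the numbers are illustrative):

# Fully connected vs. convolutional parameters for a 6 x 6 input
fully_connected = 36 * 16        # every input pixel connected to each of 16 hidden units
convolutional   = 3 * 3          # one shared 3 x 3 filter reused at all 16 positions
print(fully_connected, convolutional)   # 576 vs 9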

(Refer Slide Time: 42:12)

The entire CNN, then, looks like this: the input is the image, after this I do convolution, then max pooling, again convolution and again max pooling, and this convolution and max pooling pair can be repeated many times. After this, the output of the last max pooling layer is converted into a 1-D vector; this step is called flattening.

After flattening, we consider a fully connected feed-forward network, and from it I get the outputs, that is, the class scores: for example, whether the image contains a cat or a dog can be determined by this CNN, the convolutional neural network.

(Refer Slide Time: 43:09)

And what is the pooling operation actually, and why is it important? Because by using the pooling
operation, I can reduce the size of the feature map. So, that means I can reduce the number of
parameters. And one important point is that subsampling pixels will not change the object. You
can see in this image I am considering the bird, and after this I am doing the subsampling, but the
object remains the same. So, subsampling will not change the object; we can subsample the pixels
to make the image smaller. So, if I apply this max pooling operation, what will be the advantage?
The advantage will be fewer parameters to characterize the image. So, we will have only a few
parameters to represent a particular image.
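
A small NumPy sketch of 2 × 2 max pooling is given below; the feature-map values are arbitrary and only illustrate how each 2 × 2 block collapses to its maximum, halving the width and height.

```python
import numpy as np

# 2x2 max pooling on a 4x4 feature map (arbitrary values): each 2x2 block
# is replaced by its maximum, so the map shrinks from 4x4 to 2x2.
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 2, 3, 4]])

pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 2]
#  [2 7]]
```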

(Refer Slide Time: 44:04)

So, that means a CNN compresses a fully connected network in two ways: the first one is
reducing the number of connections, that is one important point, and the second is sharing the
weights on the edges. On top of that, we can also apply the max pooling operation. So, like this
we can consider.

(Refer Slide Time: 44:24)

So, in this figure I am showing the same concept again. I am considering the convolution, and
for the activation function I am considering the ReLU function, and after this you can see the
pooling operation. Again I have shown the convolution operation and the ReLU, again I am
considering the pooling operation, and finally we have done the flattening of the feature maps,
and after this we have the fully connected neural network, that is the feed forward network.

And corresponding to this you can see the outputs: dog 0.94, cat 0.03, bird 0.02 and boat 0.01,
that is, the probabilities. These probabilities I am obtaining by considering the softmax function,
which I have already explained. So this is an activation function that scales numbers into
probabilities. Since I am considering the softmax function, I am getting the probabilities: the
probability of dog is 0.94, the probability of cat is 0.03, and so on. So, this is one application of
CNN for image classification.
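
The softmax mapping from raw class scores to probabilities can be sketched in a few lines; the score values below are made up (chosen only so the outputs land near the values on the slide), not the actual network outputs.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, exponentiate, then normalize
    # so the outputs are positive and sum to 1 (i.e. class probabilities).
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

scores = np.array([4.0, 0.5, 0.1, -0.7])   # made-up scores for four classes
print(softmax(scores))                     # roughly [0.94, 0.03, 0.02, 0.01]
```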

(Refer Slide Time: 45:42)

And I want to again highlight the concept of the activation function, the sigmoid function, which
I have already explained in my class on artificial neural networks. So, you can see this is the
sigmoid function f(x) = 1/(1 + e^(-x)); it takes a real valued number and squashes it into a range
between 0 and 1. So, you can see the range is 0 and 1, and one problem of the sigmoid
function is that sigmoid neurons saturate and kill gradients.

That is the main problem of the sigmoid function: the saturation problem, which kills the
gradient. Because we have to train the network based on the stochastic gradient descent
algorithm, and the gradient descent algorithm I have already explained. I am not explaining the
stochastic gradient descent algorithm here. So, if I apply the sigmoid function, it will saturate and
kill the gradient during training, and if the initial weights are too large, then most of the neurons
will be saturated. So, that is the main problem of the sigmoid activation function.

(Refer Slide Time: 47:03)

Another activation function is tanh. It takes a real valued number and squashes it into the range
between minus 1 and 1. So, the output range is minus 1 and 1; that is the tanh function. But again
the problem is saturation. One advantage is that the output is zero centered. And you can see the
tanh can be represented in terms of a scaled sigmoid function; so tanh can be represented by the
sigmoid function, but the problem is again the saturation problem.
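
The relation between tanh and the sigmoid, and the saturation of both, can be checked numerically with a small sketch like the one below; the identity tanh(x) = 2·sigmoid(2x) − 1 is a standard fact, and the sample points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

# tanh is a scaled and shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True

# Both saturate: their derivatives shrink towards 0 for large |x|,
# which is what "kills" the gradient during training.
print(sigmoid(x) * (1 - sigmoid(x)))   # sigmoid'(x), about 0.0066 at |x| = 5
print(1 - np.tanh(x) ** 2)             # tanh'(x),    about 0.00018 at |x| = 5
```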

(Refer Slide Time: 47:45)

So, that is why in most of the deep networks, in convolutional neural networks, we use the
ReLU function, that is, f(x) = 0 for x < 0 and f(x) = x for x ≥ 0. So, it takes a real valued number
and thresholds it at 0: f(x) = max(0, x). You can see it is a piecewise linear function. In the case
of ReLU, that is the rectified linear unit, the network trains much faster; the training is faster and
it accelerates the convergence of the stochastic gradient descent.

So, that means the convergence will be fast, and due to its linear nature on the positive side, the
saturation problem will be eliminated; that means it prevents the gradient vanishing problem.
That concept I have not explained in detail.

So, if you are interested in the stochastic gradient descent algorithm, you can read about the
concept of stochastic gradient descent, that is SGD, and also the concept of saturation, that is the
gradient vanishing problem. Based on this principle, most of the deep networks use the ReLU
function. So, this is about the convolutional neural networks.

(Refer Slide Time: 49:25)

Now, I will discuss the concept of auto encoders. So, what are auto encoders? Auto encoders
are artificial neural networks, and an auto encoder neural network is an unsupervised machine
learning algorithm that applies backpropagation, setting the target values to be equal to the
inputs. Also, auto encoders are used to reduce the size of the inputs into smaller
representations. And suppose anyone needs the original data, they can reconstruct it from
the compressed data.

Here you can see, the auto encoders are artificial neural networks capable of learning an efficient
representation of the input data, called codings, without any supervision. So, that is why it is
called an unsupervised machine learning algorithm. The training set is unlabeled, and these
codings typically have a much smaller dimensionality than the input data, making auto encoders
useful for dimensionality reduction. So, it is used to reduce the dimension of the input data.

(Refer Slide Time: 50:37)

So, in my next slide you can see the structure of the auto encoder. So, I am showing one input
image, after this I am showing the latent space representation of the input image that is
nothing but efficient representation of the input image and that is compressed. So, the
dimension is reduced, you can see here. So, from the input image I am considering the latent
space representation. So, dimension of the input image is reduced and that is nothing but
efficient representation of the input image.

And after this, from that compressed image, that is, the latent space representation, I want to
reconstruct the original image. So, this is the reconstructed image. In the second figure, I am
considering one input image, maybe a noisy image, and after this I am considering one auto
encoder. This auto encoder compresses the input image; for this I am considering the encoder.
So, I am getting the latent space representation, and after this, from this representation, that is
from the compressed data, I can reconstruct the output image.

So, what is the function of the encoder? It is used for efficient representation of the input
data. That means, in an auto encoder, I have an input layer, I have a bottleneck layer that is
nothing but the hidden layer, and after this I have the output layer. So, mainly I am considering
3 layers: the first one is the input layer, after this the bottleneck layer, and after this the output
layer. So, it has an encoder and a decoder.

Now, in this case, since I am doing dimensionality reduction, what is the difference between an
auto encoder and PCA? Why are auto encoders preferred over PCA, the principal component
analysis? One point I can mention here: an autoencoder can learn a nonlinear transformation with
a nonlinear activation function and multiple layers; that is one important point. Also, it is more
efficient to learn several layers with an autoencoder rather than learn one huge transformation
with PCA, because if I consider PCA, I have to consider one huge transformation.

But in the case of the auto encoder, I can consider a number of layers for efficient representation
of data. Also, I can make use of pre-trained layers from another model to apply transfer learning
to enhance the encoder and the decoder; that is also another advantage of the auto encoder over
PCA. So, what is the encoder? The encoder is the part of the network that compresses the input
into a latent space representation.

So, in the figure you can see, I have shown one encoder. This part of the network compresses the
input into a latent space representation. The encoder layer encodes the input image as a
compressed representation in a reduced dimension. The compressed image is a distorted version
of the original image. So, I am doing the compression by considering the encoder. After this I am
getting the code; this part of the network represents the compressed input, which is fed into the
decoder.

So, I am getting the bottleneck layer, that is nothing but the latent space representation, and after
this I have the decoder. This layer decodes the encoded image back to the original dimension.
The decoded image is a lossy reconstruction of the original image, and it is reconstructed from
the latent space representation. I can mention one property of the auto encoder: auto encoders
will only be able to compress data similar to what they have been trained on, and the
decompressed output will be degraded compared to the original input. That is also one important
point. So, this is about the auto encoder.

(Refer Slide Time: 55:24)

Next, why do we use auto encoders? They are used for dimensionality reduction, so we can
reduce the dimensionality of the input image; that is why we are considering the auto encoder.
Also, with the help of autoencoders I can extract important features from the input image, and
they can be used for unsupervised pre-training of deep neural networks. That is also another
advantage. And one important point is that they are capable of randomly generating new data
that looks very similar to the training data; then they are called generative models. So, I can give
one example of a generative model in my next slide.

(Refer Slide Time: 56:17)

So, here you can see. For example, we could train an auto encoder on pixels of faces and it
would be able to generate new faces. So, that is the generative model. So, here you can see, I
am considering the auto encoder and we can train the auto encoder on pixels of faces. And
suppose, I am considering these images. So, this auto encoder will be able to generate new
faces that I have shown in this figure. So, that is one application of auto encoder.

(Refer Slide Time: 56:56)

Now, surprisingly, auto encoders work by simply learning to copy their inputs to their outputs.
Then, what is the importance of the auto encoder? The importance is that the auto encoder looks
at the inputs and converts them to an efficient internal representation of the input data, and then
spits out something that looks very close to the inputs. That means it can reconstruct the inputs
so that they look very close to the original inputs.

What is the meaning of the efficient internal representation? I can give one example. Consider
the sequence 40, 27, 25, 36, 81, 57, 10, 73, 90, 68; this is the first sequence I am considering.
And let us consider another sequence, the second sequence: 50, 25, 76, 38, 19, 58, 29, 88, 44, 22,
11, 34, 17, 52, 26, 13, 40, and 20. So, I am considering two sequences, the first sequence 40, 27,
and so on, and the second sequence 50, 25, 76, and so on.

The question is, which one is the easiest to memorize? The first sequence is the short sequence,
the second sequence is the long sequence. At first glance, it would seem that the first sequence
should be easier, since it is much shorter compared to the second sequence. But actually, it is not.
If I consider the second sequence, I can formulate two simple rules. The first rule is: even
numbers are followed by their half.

The second rule is: odd numbers are followed by their triple plus 1. So, I have these two rules to
represent the second sequence. This second sequence is a very famous sequence; it is called the
Hailstone sequence. If I follow these two rules, it is easy to memorize the second sequence, as we
only have to remember the first number and the length of the sequence. So, by using these two
rules, I can generate the second sequence.

But in the first sequence, there are no rules. That means one important point about the second
sequence is the existence of a pattern, and that is nothing but the efficient internal representation.
That is one example of why auto encoders can be used for efficient internal representation of
data.
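
As a small worked example (a sketch; the helper name is mine), the two rules above are enough to regenerate the whole second sequence from just its first number and its length:

```python
def hailstone(first, length):
    # Rule 1: an even number is followed by its half.
    # Rule 2: an odd number is followed by its triple plus 1.
    seq = [first]
    while len(seq) < length:
        n = seq[-1]
        seq.append(n // 2 if n % 2 == 0 else 3 * n + 1)
    return seq

# Remembering only the first number (50) and the length (18) reproduces
# the second sequence given above.
print(hailstone(50, 18))
# [50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20]
```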

(Refer Slide Time: 61:02)

So, here you can see, you can limit the size of the internal representations or you can add
noise to the inputs and train the network to recover the original inputs. So, we can limit the
size of the internal representation, that means, we can compress the input data more or we can
add noise to the inputs and train the network to recover the original inputs. That means, the
auto encoder can be used for removing noise. That means, the noise is eliminated and we can
reconstruct the original input image.

These constraints prevent the auto encoder from copying the input directly to the outputs,
which forces it to learn efficient ways of representing data. So, that concept already I have
mentioned, that is efficient data representation by auto encoder. In short, the codings are by
products of the auto encoders.

(Refer Slide Time: 62:04)

So, in this figure I have shown, in the first figure I am considering one noisy image and after
this I am considering the auto encoder, auto encoder you can see the encoder I am
considering after this I am considering the bottleneck layer, that is nothing but the hidden
layer. After this I am considering the decoder and I am considering the denoised image. So,
denoised image is reconstructed. The input is the noisy input image.

The input seen by the auto encoder is not the raw input but a stochastically corrupted version of
it. So, that means a denoising auto encoder is trained to reconstruct the original input from the
noisy version; that is the concept of the denoising auto encoder. The input is the noisy input,
after this the auto encoder can learn the efficient representation of the input image, and after
this we can reconstruct the image by removing the noise. In the second case also I am showing
the structure of the auto encoder.
So, you can see I have the input layer, hidden layer and the output layer.

(Refer Slide Time: 63:24)

So, the composition of the auto encoder: first, I need an encoder, or recognition network, that
converts the inputs to an internal representation; so the first component is an encoder. The next
component is a decoder, also called a generative network, that converts the internal
representation to the outputs. So, that means I have one encoder and one decoder in an auto
encoder. The encoder is for compressing the input data, and after this a decoder, or a generative
network, can convert the internal representation to the outputs.

(Refer Slide Time: 64:12)

So, here you can see I am showing one example of efficient data representation. I am
considering one input image, that is nothing but the chessboard I am showing. After this, the
auto encoder can create the internal representation of the input image, and after this it can
reconstruct the input image. So, the output image is approximately equal to the input. For this I
can give one example: suppose I want to remember the positions of the pieces on a chess board.

So, these are my pieces. Expert chess players can easily memorize the positions of all the pieces
in a game by looking at the board for just 5 to 6 seconds, but it is very difficult for most people;
that is a task that most people would find impossible. One important point is that the pieces are
placed in realistic positions from one actual game; they are not placed randomly.

So, what these expert chess players basically do is learn the patterns of the input image, or input
data. By looking at the chessboard, they can see the patterns, and based on these patterns they
can represent the input data most efficiently. After this, they can also reconstruct the original
patterns. A pattern means, I am considering the positions of the pieces on the chess board. So,
that is the concept of efficient data representation. You can see, I have the inputs, after this I am
considering the internal representation, and after this I can reconstruct back the original inputs.

(Refer Slide Time: 66:11)

And regarding the composition of auto encoders, auto encoders typically have the same
architecture as a multi-layer perceptron, except that the number of neurons in the output layer
must be equal to the number of inputs. So, in this figure you can see, I have 3 inputs x1, x2, x3
and, similarly, 3 outputs x1', x2' and x3'. That means the number of neurons in the output layer
must be equal to the number of inputs in a simple autoencoder.

(Refer Slide Time: 66:49)

And you can see there is just one hidden layer composed of 2 neurons, that is the encoder, and
one output layer composed of 3 neurons, that is the decoder. So, this is the encoder, and the
other one is the decoder.
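
The 3-input, 2-neuron-hidden, 3-output autoencoder described here can be sketched in PyTorch as follows; the plain linear layers are an assumed minimal choice, not necessarily the exact layers on the slide.

```python
import torch
import torch.nn as nn

# Minimal sketch of the autoencoder above: 3 inputs -> 2 codings -> 3 outputs.
encoder = nn.Linear(3, 2)              # hidden layer of 2 neurons (the codings)
decoder = nn.Linear(2, 3)              # output layer of 3 neurons (the reconstruction)
autoencoder = nn.Sequential(encoder, decoder)

x = torch.randn(5, 3)                  # 5 samples, 3 features each
x_reconstructed = autoencoder(x)       # same shape as the input: (5, 3)
print(x_reconstructed.shape)
```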

(Refer Slide Time: 67:07)

The outputs are called the reconstructions, since the auto encoder tries to reconstruct the
inputs, and the cost function contains a reconstruction loss that penalizes the model when the
reconstructions are different from the inputs. Because the decoder is used for the reconstruction
of the input data, for this I am considering a loss function. If the reconstructed output data is not
similar to, or not the same as, the inputs, then the loss will be large, and based on this loss I can
train the auto encoder.
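
A hedged sketch of this training step is shown below, using mean squared error between the reconstruction and the input as the reconstruction loss; the optimizer, learning rate, batch size and iteration count are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Continuing the 3-2-3 sketch: the target of the loss is the input itself.
autoencoder = nn.Sequential(nn.Linear(3, 2), nn.Linear(2, 3))
optimizer = torch.optim.SGD(autoencoder.parameters(), lr=0.1)  # assumed optimizer and rate
loss_fn = nn.MSELoss()                 # reconstruction loss

x = torch.randn(32, 3)                 # a batch of 32 unlabeled samples
for _ in range(100):
    x_rec = autoencoder(x)
    loss = loss_fn(x_rec, x)           # penalizes reconstructions that differ from the input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```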

(Refer Slide Time: 67:51)

And here you can see the internal representation has a lower dimensionality than the input.
So, that is why if I consider the bottleneck layer, that is the internal representation, it is 2D
instead of 3D, because, I am considering the dimensionality reduction in the bottleneck layer.
That is why auto encoder is said to be undercomplete, because I am reducing the
dimensionality of the input data. So, that is the concept of the encoder. So, it is forced to learn
the most important features in the input data and drop the unimportant ones. So, that is
nothing but efficient representation of the input data.

(Refer Slide Time: 68:40)

And there are many variants of the auto encoder. I am not going to discuss all of them: one is the
sparse auto encoder, another one is the stacked auto encoder, one is the variational auto encoder,
one is the denoising auto encoder and one is the contractive auto encoder. If you are interested,
you can read some research papers regarding these different types of auto encoders. So, I am not
going to discuss these auto encoders further, but the main concept of the auto encoder I have
already explained.

So, in this class I discussed the concept of the convolution neural networks, what are the
disadvantages of the artificial neural networks. So, how to reduce number of parameters by
considering convolution layer, that concept I have discussed. After this I discussed the
concept of the pooling layer. And after this I discussed the concept of the fully connected
layer. Also, I discussed the activation functions: one is the sigmoid function, another one is
the tanh function, and another one is the ReLU activation function.

After this briefly I discussed the concept of the auto encoder. I am not discussing in detail
because it is not possible to discuss all the deep learning techniques in this computer vision
course. So, I hope you can understand the basic concept of the deep learning and also the
convolution neural networks and the auto encoder. Let me stop here today. Thank you.

Computer Vision and Image Processing -Fundamentals and Application
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture 38
Gesture Recognition
Welcome to NPTEL MOOCs course in Computer Vision and Image Processing Fundamentals
and Applications. In this class I will discuss one important application of computer vision, that is
gesture recognition. Gesture recognition has many important applications. One important
application is human computer interactions. So, I will discuss all these applications and also what
are the major challenges of gesture recognition, that I am going to explain.

And mainly I will discuss the concept of hand gesture recognition. And in this discussion, I will
discuss two important algorithms. So, briefly I will highlight the concept of DTW, that is
Dynamic Time Warping. And also, the concept of the hidden Markov model. So, briefly these
two concepts I will be explaining, related to gesture recognition. So, let us see what is gesture
recognition in my next slide.

(Refer Slide Time: 01:32)

So, this is the definition of a gesture: a movement of a limb or the body as an expression of thought
or feeling. This is the definition from the Oxford Concise Dictionary, that is the definition of a
gesture.

(Refer Slide Time: 01:47)

So, there are many applications of gestures. If I consider only hand gestures, hand gestures alone
have many applications: one is to replace the mouse and the keyboard. That means, without
using the mouse and the keyboard, I can interact with the computer with the help of hand gestures.
Also, I may consider pointing gestures, maybe for robotic interactions, the human robot
interactions, or maybe for window menu activation.

So, I can select the window menus based on pointing gestures. We can also navigate in a virtual
environment, that is the VR application, the virtual reality application. There is no physical contact
with the computer, because I can interact with the computer by using hand gestures, and we can
communicate at a distance. So, these are some human computer interfaces using hand gestures.

(Refer Slide Time: 02:51)

So, in this case, I am showing one interaction system that is by considering the data gloves. So,
here you can see, I have the data gloves, in the data gloves I have optical or the magnetic sensors.
So, these sensors detect the movement of the fingers, the movement of the hand and the glove is
connected to the computer. So, computer will get the signals corresponding to the movement of
the fingers, the movement of the hand and based on this I can recognize different types of
gestures, this is the glove-based system.

(Refer Slide Time: 03:25)

The next system is the vision based system. So, here you can see, I have a camera. So, here in the
figure you can see I have a camera. The camera detects the movement of the hand and the
movements of the fingers, and after this, the computer can recognize the movement of the hands,
that is the gestures, or it can recognize the fingerspelling.

So, this is a pattern classification problem, because I can recognize different gestures and different
fingerspellings, and in this case, I may have one camera or maybe two or multiple cameras. In this
case, a glove is not needed. The camera will detect the movement of the hand and also the
movement of the fingers.

(Refer Slide Time: 04:10)

So, if I compare these two systems, one is the glove-based system and the other one is the camera-
based system, that is, the vision-based system. You can see the advantages and disadvantages if
I want to compare these two systems, one the glove-based system and the other the vision-based
system. They both have advantages and disadvantages. If you see the glove-based system, the
advantage is that the data is very accurate.

Because I am getting the signal from the glove. And mainly we have the optical sensors and the
magnetic sensors. So, that means I am getting a very accurate signal from the data gloves and not
much computing power required, in case of the glove-based system. But the disadvantages are,
because user has to wear gloves. That is the disadvantage, and this glove-based system is not
available for recognizing facial gestures.

So, these are the advantages and the disadvantages of the glove-based system. If I consider the
vision-based system, one main advantage is that the user does not have to wear anything; that is
one important advantage. The disadvantages are that we have to do complex computations,
because we have to do the image processing, and we have to consider different lighting conditions
and the cluttered background.

So, that means we have to do lots of image processing, and that is very difficult; and sometimes
there may be a problem of occlusions, that is, it is hard for the camera to see the fingers, and this is
nothing but the occlusion. So, this is the comparison between the glove-based system and the
camera-based system.
camera-based system.

(Refer Slide Time: 05:53)

So, I want to show the elements of the vision-based interface. VBI means the vision-based
interface. So, these are the research area in which, you can do research in computer vision. So,
like hand tracking, hand gestures, arm gestures, head tracking, gaze tracking, lip reading, face
recognition, facial expression recognition, body tracking, activity analysis. So, these are the
elements of vision-based interface. So, any one of these topic you can do research in computer
vision or maybe in machine learning.

(Refer Slide Time: 06:30)

Now, I will consider the major attributes of hand gestures. So, I may consider static gestures or
dynamic gestures. For static gestures, hand configuration is important that is the posture and for
this I need the information of hand shape. In case of dynamic gestures, finger movements,
without changing the position of the hands. Then in this case I can consider the hand shape
information, but if I consider hand movements in space, then I have to consider the hand
trajectory and the hand shape.

That means, in case of the static gesture, only we have to consider the hand shape information,
but in case of a dynamic gestures, it may have local motions or the global motions or maybe the
combination of the local motion and the global motion. In case of local motion only I have the
motion of the fingers, in case of the global motion I have the movement of the hand and
generally if I consider the sign language suppose, then in this case it is a combination of both
local motion and the global motions. So, these are the major attributes of hand gestures.

(Refer Slide Time: 07:40)

So, here you can see I am showing some static gestures of American Sign Language. So, in this
case, I need the information of the hand shape. So, with this information, that means the shape
information, I can recognize different static gestures.

(Refer Slide Time: 07:55)

And here you can see I am showing some of the dynamic gestures. Here you can see the
movement of the hand and also the finger movements. That means, in this case I have both
local motion and global motion. So, this is one example of dynamic gestures. In this video I
have shown sign language, you can see some American Sign Language, and in this case the
computer has to understand these signs of the American Sign Language.

(Refer Slide Time: 08:28)

For this gesture recognition system, that is the human computer interface system, it may be a
unimodal system or a multimodal system. So, for this we can consider the vision-based system,
that means the camera or maybe the audio-based system or maybe the sensor based system like
the data gloves.

In the case of the multimodal system, we can combine more than one modality. That means I can
combine the vision-based and the audio-based systems, or maybe the audio-based plus the
sensor-based systems. So, in the case of the human computer interface system, I may have a
unimodal system or a multimodal system.

(Refer Slide Time: 09:09)

And if you consider gestures, there are many types of gestures: one is body gestures, one is hand
gestures, one is head and/or face gestures. In my discussion I am mainly considering hand
gestures. As explained earlier, I may have static gestures or dynamic gestures, and I can consider
isolated gestures or continuous gestures. In the case of continuous gestures, I can perform all the
gestures continuously, like fluent fingerspelling.

But in the case of isolated gestures, I perform a particular gesture at a particular time. To
understand these concepts, mainly the constraints on vision-based gesture recognition for human
computer interaction, you may see this research paper, which is our research paper. If you read
this research paper, you can understand the concept of vision-based gesture recognition for
human computer interaction.

(Refer Slide Time: 10:16)

So, there are many applications of the vision-based gesture recognition system, VGR means the
vision-based gesture recognition system. So, one important application is sign language
recognition. Another one is robotic interactions. And also, for healthcare, there are many
applications like laparoscopic surgery, by using gestures, or maybe the window menu activation
by considering the hand gestures.

One is human computer interaction, the human computer intelligent interaction, and there are
some applications in augmented reality and virtual reality. So, here you can see the sign language
recognition, this application is in virtual reality, this is in the healthcare applications, and this is
the robotic interactions. So, there are many applications of VGR, that is, vision-based gesture
recognition.

(Refer Slide Time: 11:20)

Now I am going to explain the major challenges of gesture recognition. In the case of the
vision-based system, we will consider only the camera. The glove-based system has its
advantages, but the vision-based system has the flexibility that users need not wear the data
gloves. The camera will detect the movement of the hand or the movement of the fingers. But
there are many challenges in vision-based gesture recognition.

So, one is the segmentation of the hand from the background. So, in this case, we have to
consider different illumination conditions, random illumination conditions and also, we have to
consider the cluttered background. So, segmentation of the hand from the background, the
cluttered background and under different illumination conditions is one research problem. And
also, the second problem is the occlusion.

So, there may be self occlusion between the fingers, because the camera cannot see all the points
of the hand or all the position of the fingers. So, we have to consider the occlusion, the self
occlusion is one important aspect. So, that we have to consider. And sometimes the depth
information is also very important. So, if I consider only one camera, it is not possible to
determine the depth information.

But if I consider maybe the stereo vision system, then in this case we have to estimate the depth;
with the depth information, the recognition performance increases. So, how to determine that
depth information is also one important challenge of gesture recognition. And I have already
mentioned the illumination changes, because for different illuminations I have to do the
segmentation, the segmentation of the hand, and also the tracking of the hand, and the
background may be a cluttered background.

And also we have to consider the 3D translation and rotation variations; that means we have
to consider the translation variance and the rotational variance. So, we have to extract
features which should be invariant to translation and rotation. That means we have to extract
hand features which are invariant to RST, that is, rotation, scaling and translation.

Another problem is the spatial temporal variation. That means, if a particular gesture is
performed by different persons or different users, there will be spatial temporal variations.
Even if the same gesture is performed by the same user or the same person, there will be spatial
temporal variation. So, this spatial temporal variation we have to consider. And also, if I
consider a continuous gesture, then in this case we have to find the starting point and the
ending point of the gesture.

So, how to do the segmentation if I perform the gestures continuously, that is, the continuous
gesture? That is one important point, the segmentation of the continuous gesture. And one
important issue is co-articulation, which means the current gesture is affected by the preceding
and the following gestures; one gesture may be a part of another gesture. This problem also we
have to consider, that is the co-articulation problem. So, these are the major challenges of
gesture recognition.

(Refer Slide Time: 14:41)

Now, again, I am explaining the major challenges of dynamic gesture recognition. The first
problem I have already explained, that is the segmentation problem; for this we have to
consider the illumination variation and the complex background, and also the problem of
occlusion. In this case the self occlusion is very important, that means occlusion between the
fingers, or occlusion between the two hands if I consider two hands; that occlusion also we have
to consider.

Another one is the gesture spotting: we have to consider spatial temporal variations, and if I
consider a continuous gesture, we have to find the starting point and the ending point of that
particular gesture; that is called gesture spotting. And there are also the difficulties associated
with image processing techniques. The extracted features should be rotation, scaling and
translation invariant.

That means the hand features should be invariant to rotation, translation and scaling. Also,
processing a large amount of image data is time consuming, and so real time recognition is
really very difficult. These are the difficulties associated with image processing techniques.

(Refer Slide Time: 16:00)

And if I consider the major challenges of dynamic gesture recognition for continuous gestures:
one problem is the co-articulation problem, which means the current gesture is affected by the
preceding or the following gestures. Another important problem is the movement epenthesis,
that is, the unwanted movement that occurs while performing a gesture.

And another problem is the sub gesture problem, when a gesture is similar to a sub-part of a
longer gesture. These problems I am explaining in my next slide: one problem is the co-
articulation problem, one is the movement epenthesis problem and another one is the sub gesture
problem. One more important issue is the set of problems related to two-handed gesture
recognition.

It is very difficult, in case of a two-handed gesture recognition: one is the computational
complexity and one is the occlusion due to inter hand overlapping. That already I have
explained, that is if I considered the overlapping of two hands or maybe the overlapping of the
fingers then, that is nothing but the occlusion, the self occlusion. And also, simultaneously we
have to track both the hands.

So, we have to do the tracking and we have to develop the tracking algorithm, so, that it can
track both hands. So, these are the major challenges of dynamic gesture recognition. Now, I will
show in my next slide, what is the co articulation, what is movement epenthesis and what is sub
gesture problem.

(Refer Slide Time: 17:40)

So, here you can see, I am showing the gestures one, two, and so on, like 1, 2, 5, 7, 3. If I
perform one and after this I perform two, there is this extra movement between these two
gestures. Similarly, from 2 to 5 there is the extra movement between 2 and 5. So, this is called
movement epenthesis.

In figure b I have shown the problem of the co articulation that means, the current gesture is
affected by the preceding or the following gestures. So, here I have shown the example of the
coarticulation and finally, I want to show the sub gesture problem in c, in figure c. So, my
gesture is eight, but here you can see 5 is a sub gesture of gesture eight. So, if I want to draw
eight, so, you can see the five is a part of 8.

So, that is called the sub gesture, but my gesture is the eight, but by mistake the recognizer can
recognize that gesture as 5. So, this is called the sub gesture problem. So, these are the major
problems in case of the continuous gesture recognition. One is the co articulation. One is the
movement epenthesis and one is the sub gesture.

(Refer Slide Time: 19:01)

So, in summary, I am showing in this figure the problems of gesture recognition, the major
challenges of gesture recognition. The first one is gesture acquisition, regarding the camera
specification: color range, resolution, frame rate, lens characteristics, and, regarding 3D image
acquisition, depth accuracy. There is also the problem of segmentation, like illumination
variation, complex background and dynamic background.

And also, for the gesture detection, we can consider hand articulation and occlusion, that is
nothing but the self occlusion. After this, gesture representation and the feature extraction.
So, we may consider static gesture, dynamic gestures and we can consider the 2D modeling of
the gestures or maybe 3D modeling. But in case of the 3D modeling, it is computationally
complex.

And in case of the 2D modeling, it is object view dependence. And in this case, the problem of
the self occlusion and the hand articulation. So, 3D modeling is more accurate, but it is
computationally complex. And for dynamic gesture we have to consider, the spatial temporal
variations. So, what is the spatial temporal variation? Suppose, if I want to perform the gesture
like this, this is a gesture.

So, if I repeat this gesture, suppose like this, there will be a spatial temporal variation. That
means, variation in space and variation in time. If the same gesture is repeated by different users,
there will be spatial temporal variations, and even if the same gesture is performed by the same
user, or the same person, still there will be a spatial temporal variation.

And for the continuous gesture, we have to consider these cases: the movement epenthesis, the
co-articulation and the sub gesture problem. In the case of feature extraction, we have to
consider that the features should be invariant to rotation, scaling and translation, that is, RST
invariance, and also the object view dependence; and we have to reduce the dimensionality of
the feature vector, which is related to the curse of dimensionality.

After this we have to do the classification. So, which classifier is good? That we also have to
consider: maybe the size of the training data is important, the computational complexity is very
important, and the selection of optimum parameters for validation is also very important, as well
as the recognition of unknown gestures. So, these are the main challenges of gesture
recognition.

(Refer Slide Time: 21:50)

So, if you see the vision-based gesture recognition architecture, first the hand image or video is
captured by the camera, after this we have to do the segmentation, the segmentation of the hand,
and after this we have to do the tracking. As I have already explained, this problem is quite
difficult, because we have to consider a dynamic background, maybe a cluttered background,
and also illumination variations.

And if I consider two-handed gestures, I have to track both hands. So, tracking is also a
problem, and segmentation is also a very difficult problem, because I have to consider
illumination variation, the dynamic background and the cluttered background. After this I
have to consider gesture representation; for this I have to do the gesture modeling and I have to
extract features, the feature extraction. And finally, I have to select the classifier for
recognition. So, this is a typical gesture recognition system.

(Refer Slide Time: 22:50)

After this, we can consider the detection and the pre-processing. Detection means the detection of
the hand, that is the segmentation of the hand, and also I have to do the tracking. We may
consider these cases: skin segmentation techniques we can apply for images and videos. In the
case of an image, we have an uncontrolled static environment, because the background may be
cluttered and there may be illumination variations.

And in the case of video, we have to consider an uncontrolled dynamic environment: a dynamic
background may be there, random illumination variation will be there, a cluttered background
may be there. All these cases we have to consider. For images we may consider the region-based
technique, the edge-based technique, or maybe the skin color based technique for segmentation,
or maybe the Otsu thresholding technique we can also use.

And in the case of video, we can consider dynamic model adaptation techniques. For skin
color-based techniques, we can consider different color models, maybe the RGB color model,
the HSV color model, the YCbCr color model, the YIQ model; all these models you can consider
to determine the skin colors, and based on the skin color, we can segment out the hand from the
background.

And for a parametric model we can consider a single Gaussian model, the Gaussian mixture
model, or maybe the elliptical boundary model; that corresponds to the skin color based
segmentation. Or maybe the nonparametric models we can use, like the Bayes skin probability
map (SPM), the lookup table, the self organizing map (SOM), or maybe an ANN model for
skin color segmentation. So, based on the skin color, we can segment out the hand from the
background under different conditions.
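
A minimal sketch of skin-color thresholding in the YCbCr space with OpenCV is given below; the input file name and the threshold values are rough, commonly used assumptions and would need tuning for real lighting conditions and skin tones.

```python
import cv2
import numpy as np

# Read a frame and move to the YCrCb color space (OpenCV orders it as Y, Cr, Cb).
frame = cv2.imread("hand.jpg")                      # assumed input image
ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)

# Rough, assumed skin-color bounds on (Y, Cr, Cb); tune for your own setup.
lower = np.array([0, 133, 77], dtype=np.uint8)
upper = np.array([255, 173, 127], dtype=np.uint8)

mask = cv2.inRange(ycrcb, lower, upper)             # 255 where the pixel looks like skin
hand = cv2.bitwise_and(frame, frame, mask=mask)     # segmented hand region
```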

(Refer Slide Time: 24:53)

So, for vision-based methods, we can consider the model-based methods; for this we can
consider parameters of the hand, like the joint angles and the palm position, or maybe we can
consider the appearance-based model, for which we can consider parameters like the image
geometry parameters and the image motion parameters. And I have already explained the skin
color-based segmentation.

So, we can apply this technique. So, with the help of the skin color, we can do the segmentation
of the hand from the background or maybe for a tracking, we can use the Kalman filter or maybe
some filters like particle filters we can use for the tracking. So, there are many tracking
algorithms; maybe the mean shift algorithm we can also use. One popular tracking algorithm is
the Kalman tracker, or maybe we can use particle filter tracking; these techniques we can use for
hand tracking.

(Refer Slide Time: 25:51)

And for tracking, as I have already explained, the methods may be like this: using the pixel
level change, we can consider the background subtraction method, the inter-frame difference,
and the three-frame difference. These classical techniques we can also use. And maybe the
mean shift algorithm or the CAMShift algorithm, the Kalman filter, or the particle filter. So,
these techniques can be used for hand tracking.
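
As one example from this list, the inter-frame difference idea can be sketched with OpenCV as follows; the video file name and the binarization threshold are assumptions.

```python
import cv2

cap = cv2.VideoCapture("gesture.avi")       # assumed input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Pixels that changed between consecutive frames indicate hand motion.
    diff = cv2.absdiff(gray, prev_gray)
    _, motion_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)  # assumed threshold

    prev_gray = gray
cap.release()
```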

(Refer Slide Time: 26:21)

And for hand gesture modeling, as I have already explained, we can consider different types of
models. In figure a, I have considered the 3D textured volumetric model; that is the first model.
In figure b, the 3D wireframe volumetric model. In figure c, the 3D skeletal model. In figure d,
this is the binary silhouette. And in figure e, we can consider a contour model. So, these models
we can use for hand gesture modeling.

(Refer Slide Time: 27:01)

So, here again I am showing these models. So, that is the gesture representation feature
extraction. So, you can see the color models, the first I am showing that is the 2D model. The
silhouette geometric model, that is the second one. That is a 2D model, deformable model also
we can consider. And similarly, we can consider 3D models like 3D skin mesh model, 3D
geometric model, 3D skeleton model, 3D textured volumetric model.

So, these are the model-based techniques and if I consider an appearance based model, that
means, in this case we can consider shape, color silhouette based or maybe we can consider
template models or maybe we can consider motion based models. So, for this we can consider
optical flow, for motion representation or maybe we can consider motion templates, like motion
history image, motion energy image. So, by using this motion templates, the motion history
image and the motion energy image we can model a particular gesture or we can represent a
particular gesture.

(Refer Slide Time: 28:14)

And also, we should consider the depth based methods on RGB depth data, because if I consider
the depth information, the accuracy of recognition will increase. It will also solve, or at least
partially solve, the problem of self occlusion. So, the major problems in segmentation, like
illumination variation and occlusion, can be handled nicely with the help of the depth data.

And for this we can consider the Kinect based methods. So, we can consider the Kinect sensor
to get the depth data, or maybe other depth sensors, maybe the Leap Motion controller, or the
Senz3D by Creative; that also we can consider. So, different types of depth sensors are
available. From these we can get the RGB depth data, the RGB-D data. With the depth
information, some of the problems can be partially or nicely handled, like the problem of the
self occlusion and the problem of the illumination variation. So, these problems can be handled
with the depth information.

(Refer Slide Time: 29:26)

And after this, for gesture representation and feature extraction, we have to extract the
features, and the features should be invariant to rotation, scaling and translation. The features
may be something like this: geometric features or non-geometric features we can consider,
texture or the pixel values we can consider, 2D or 3D model based features we can consider,
and spatial features like position and motion velocity we can consider.

Regarding the spatial features, suppose the hand is moving; that means we can determine the
motion trajectory, and from the motion trajectory we can determine the dynamic features like
velocity and acceleration. All these features we can determine, and the position of the hand we
can also determine. So, this is the motion trajectory. We can also determine spatial temporal
interest points, and maybe the shape feature as well; that is also one important feature that can
be applied for hand gesture recognition.
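
Given the (x, y) points of a hand trajectory over the frames, these dynamic features can be sketched with simple finite differences; the trajectory values below are made up for illustration.

```python
import numpy as np

# Made-up hand trajectory: one (x, y) position per frame.
trajectory = np.array([[10, 12], [14, 15], [19, 17], [25, 18], [32, 18]], dtype=float)

velocity = np.diff(trajectory, axis=0)        # frame-to-frame displacement (vx, vy)
speed = np.linalg.norm(velocity, axis=1)      # magnitude of the velocity
acceleration = np.diff(velocity, axis=0)      # change of velocity between frames

print(speed)
print(acceleration)
```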

(Refer Slide Time: 30:31)

So, in this table I have shown different types of features that are commonly used in research
papers. For spatial domain 2D features, maybe the fingertip locations, finger detection, the
silhouette and maybe the motion chain code can be used; we have already discussed the chain
code. And for spatial domain 3D features, we can consider joint angles, hand location, surface
textures, surface illumination, and so on.

And also, we can consider features in the transform domain, like the Fourier descriptors, DCT
descriptors, wavelet descriptors. And also we can consider the moments, the geometric moments,
the seven moment invariants we can consider. And like the histogram based HOG features, the
SIFT features, the SURF features, or maybe combined features we can use. All these features
have advantages and disadvantages. You can see the advantages and disadvantages of these
features from the slide, or maybe from my research paper: what the advantages of these features
are and what their limitations are.

(Refer Slide Time: 31:42)

And finally, we can consider the classification model, that is, the recognition model. After
extracting the features, we have to recognize the gestures. We can apply supervised techniques
or maybe unsupervised techniques. And we can consider, like, the conventional methods on
RGB data, or maybe the depth-based methods on RGB depth data, or maybe the deep networks
that we can use for recognition.

So, the classical machine learning techniques, maybe the support vector machine or the k nearest
neighbor, can be used for gesture recognition, or maybe the deep networks like the CNN, the
convolutional neural network, can be employed for gesture recognition.

(Refer Slide Time: 32:40)

So, for static gesture recognition we can apply maybe the supervised classification techniques
like the k nearest neighbor, the support vector machine, or artificial neural networks; for the
unsupervised case, maybe we can consider the k means algorithm. And for dynamic gesture
recognition, we have to consider the motion parameters. That means we can consider the
motion trajectory, and from the motion trajectory we can determine the dynamic features like
the velocity, the acceleration and the trajectory curvature.

And based on these features, we can do the classification. So, maybe we can consider some
techniques like the dynamic programming, or maybe the dynamic time warping, this is a very
popular algorithm for gesture recognition, maybe we can consider the artificial neural networks
and the probabilistic framework, maybe we can consider the Bayes classifier also or maybe the
hidden Markov model or other statistical methods we can consider.

And maybe we can consider the CRF, that is the Conditional Random Field; that is also very
important, and you can consult research papers on conditional random fields. By using the CRF
also we can recognize gestures. So, in this class I will explain briefly only the concept of
dynamic time warping and the concept of the hidden Markov model.

(Refer Slide Time: 34:11)

Now, I will discuss the conventional dynamic gesture recognition techniques. So, here in the
figure you can see, I have shown the dynamic gesture recognition techniques, we may have
direct methods or maybe the indirect methods. In case of the direct methods, we have to extract
gesture features. And based on these features, we can do the classification, we can do the
recognition.

In case of the indirect methods, we may consider non probabilistic methods or the probabilistic
methods. For example, in case of the non probabilistic methods, we can use dynamic time
warping, DTW and also maybe the artificial neural networks. In case of the probabilistic
methods, popular methods are hidden Markov models, that is very important. And another one is
the conditional random field.

So, these are some examples of conventional dynamic gesture recognition techniques. So, in this
class mainly I will discussed these two algorithms, one is the DTW, Dynamic Time Warping,
and another one is the brief introduction of hidden Markov model. So, these two algorithms
briefly I will explain.

(Refer Slide Time: 35:32)

So, what is DTW? You can see in my next slide. So, this is a Dynamic Time Warping algorithm.
So, this algorithm is mainly the template matching. And it is used in isolated gesture recognition.
Suppose, in case of the gesture recognition, I have one trajectory that is the gesture trajectory.
Because the hand is moving. So, from the video I can determine the gesture trajectory. So, we
have the trajectories of all the gestures. These are called the template gestures or the template
trajectories.

So, when the new trajectory is coming, suppose a new trajectory is this. So, I have to do the
matching. So, the matching between the template trajectory and the input trajectory, that is the
test trajectory. And based on this matching, I can determine or I can recognize a particular
gesture. So, I am repeating this, that means corresponding to a particular gesture sequence,
gesture video, I can extract the gesture trajectory.

And I have a number of trajectories corresponding to different gestures. So, suppose, for
example, I write A like this or maybe B like this; for all of these we have the gesture
trajectories. And for recognition, for classification, we have to match the input trajectory, that
is the test trajectory, with the template trajectory. One algorithm is very popular for this, that is
the DTW, the Dynamic Time Warping algorithm.

So, it is a template matching algorithm. The main concept is: suppose I have two time series, one
is P and another one is Q, and I want to find the similarity between these two time series. For this
I can compute the distance between P and Q; you can see, I am finding the distance between P
and Q. I can consider the Euclidean distance or maybe the Manhattan distance.

So, any distance I can consider, and based on this distance, I can find the similarity between
these two time series, P and Q. For this I am considering the warping path W, and corresponding
to this I have the warping matrix, and here you can see the matrix element w_k. So, the warping
between P and Q is defined like this: W is nothing but w_1, w_2, ..., w_K. So, what do I have to
consider for this?

I have to find the DTW between P and Q; in this expression you can see I am finding DTW(P, Q),
that is nothing but the minimum cumulative distance between the time series P and Q over all
warping paths. Based on this, I can find the similarity between the time series P and Q, and based
on this I can recognize. The actual concept I am going to explain in my next slide.

(Refer Slide Time: 38:40)

So, for matching, what can we consider? We can consider a distance: maybe the Euclidean
distance, or any other distance, the city block distance we can also consider. But what properties
should a similarity distance have? We can consider a metric distance; the properties of a distance
measure should be like this. The distance between two points A and B, D(A, B), should be equal
to D(B, A); that is the symmetry property.

And the distance of a point to itself, D(A, A), should be equal to 0; that is the self similarity.
Also, the distance between the points A and B should be greater than or equal to 0; that is the
positivity. And this one you also know, that is the triangular inequality: the distance D(A, B)
should be less than or equal to the distance between A and C plus the distance between B and C,
D(A, B) ≤ D(A, C) + D(B, C). So, these are the properties of metric distances. Based on this, we
can consider the Euclidean distance.

(Refer Slide Time: 39:56)

So, in the next slide you can see I am considering the distance between suppose two points or
maybe the two sequences, I am considering X and Y sequences, and I am finding the distance
between these two. So, Lp is the distance and suppose the p is equal to 1, then it corresponds to
Manhattan distance. If p is equal to 2, that corresponds to Euclidean distance. So, this distance
measure I can consider to find a similarity between the time series P and the time series Q.
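
A two-line sketch of this L_p distance between two equal-length sequences is shown below; the sample values are arbitrary.

```python
import numpy as np

def lp_distance(x, y, p):
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

print(lp_distance([1, 2, 3], [2, 4, 6], p=1))   # 6.0
print(lp_distance([1, 2, 3], [2, 4, 6], p=2))   # about 3.742
```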

(Refer Slide Time: 40:28)

So, for this, the basic idea is: let us consider the time series A, that is a_1 up to a_n, and another
series B, that is b_1 up to b_n. We are allowed to extend each sequence by repeating elements;
that means we can extend the sequences. After this, we can calculate the Euclidean distance
between the two extended sequences, one is X' and another one is Y'.

So, X' means one extended sequence and Y' is the other extended sequence. We can find the
Euclidean distance between these two sequences. And in this case, we are getting a matrix M
whose elements are like this: m_ij is the element of the matrix M. What is the element m_ij?
That is nothing but the distance between X_i and Y_j.

(Refer Slide Time: 41:32)

Suppose I consider these two time series, one is the green and another one is the red. Here I want
to show the importance of dynamic time warping. Suppose I want to match these two sequences
point to point, so that the point i of the first sequence is matched with the point i of the second
sequence. In this case it gives a poor similarity score.

It gives a poor similarity score because what is actually happening is that the point i of the first
sequence corresponds to the point i plus two in the second sequence. So, we have to find a
matching between i and i plus two; that is why a nonlinear elastic alignment is important. In the
first case I am just matching the point i of the first sequence with the point i of the second
sequence, so it gives a poor similarity score.

But in the second case, what am I considering? I am considering a nonlinear elastic alignment, and
in this case I am getting a good similarity score, because the point i of the first sequence actually
corresponds to the point i plus two in the second sequence. That is the importance of dynamic
time warping; it means we need a nonlinear elastic alignment.

(Refer Slide Time: 43:08)

So, in this figure you can see I am considering the warping function; I am considering the time
series A and the time series B. Now, I want to find the best alignment between A and B, and for
this I need to find a path through the grid. That means I have to find a path represented by
P = p_1, p_2, p_3, and so on, where p_s is nothing but the pair (i_s, j_s).

So, what is the best path? The best path minimizes the total distance between the two sequences,
the time series A and the time series B. And P is called the warping function. In this figure you can
see I have shown the warping function p_1, p_2, p_3, and so on.

So, this is the warping function. In the case of gesture recognition, as I have already mentioned,
I have to compare the test trajectory and the template trajectory. Suppose I am considering the
DTW algorithm, and I am considering one trajectory, something like this; this is the gesture
trajectory A, and that is the template. So, I am considering this as the template gesture trajectory.

After this I am considering another trajectory, the test trajectory, something like this; this
trajectory is B, the test trajectory. By this DTW algorithm I can find the alignment between A and B,
that is, the correspondence between the template A and the test trajectory B. So, suppose this is
the alignment between A and B that we can find; that is the warping function.

So, the best alignment we can find based on the distance between the time series A and the time
series B, and based on this we can recognize a particular gesture. We have the template gestures,
and whenever a new gesture comes, we have to match the test gesture with the template gestures.
But it is time consuming, because for one gesture I have to compare against each and every template.

That is why it is computationally complex. Now suppose, in the template, I can identify some
key points; these are the key points in the template, and from them I can determine gesture
features. I can extract features like the orientation feature, the length feature, or maybe dynamic
features like velocity and acceleration. Similarly, for the test trajectory I can also determine the
key points, and these key points I can match based on the warping function.

By this process, I can determine the features corresponding to the test gesture trajectory. You can
see I am just doing the matching based on the key points, between the template trajectory and the
test trajectory. In the template trajectory I can find the key points; maybe something like the
MPEG-7 trajectory descriptors, the MPEG-7 trajectory representation, we can consider.

So, we can determine some of the key points like this. I have shown the key points, and I can find
the correspondence between the key points; after determining the key points in the test trajectory,
I can determine the gesture features. The static features and the dynamic features I can extract
from the key points, because from the key points of the template I can find the corresponding key
points of the test trajectory.

After this we can determine the features: static features like the length between the key points or
the orientations, or dynamic features like velocity and acceleration. Based on these I can recognize
a particular gesture. So, this is one example of how to recognize a particular gesture; a small
sketch of such key-point features is given below. We have been discussing the warping function,
and in the next slide I will explain the concept of the warping function further.
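As a rough illustration of the static features (segment lengths, orientations) and dynamic features (velocity, acceleration) mentioned above, here is a minimal Python sketch. The key-point coordinates and the time step dt are assumed inputs, and this is not the exact feature set used in the lecture.

```python
import numpy as np

def keypoint_features(points, dt=1.0):
    """Simple static and dynamic features from gesture-trajectory key points.
    `points` is an (N, 2) array of (x, y) key-point coordinates; `dt` is the
    assumed time step between consecutive key points."""
    pts = np.asarray(points, dtype=float)
    diffs = np.diff(pts, axis=0)                          # displacement between key points
    lengths = np.linalg.norm(diffs, axis=1)               # static: segment lengths
    orientations = np.arctan2(diffs[:, 1], diffs[:, 0])   # static: segment orientations
    velocities = lengths / dt                             # dynamic: speed per segment
    accelerations = np.diff(velocities) / dt              # dynamic: change of speed
    return lengths, orientations, velocities, accelerations
```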

(Refer Slide Time: 48:13)

So, here you can see I am finding the time-normalized distance between A and B, the time series A
and the time series B; that is the distance I am determining. D(A, B) is the time-normalized
distance, and here you can see the distance is divided by the sum of the coefficients, that is, the
weighting coefficients. So, what is d(p_s)? d(p_s) is the distance between i_s and j_s, that is,
between the corresponding samples of the time series A and the time series B.

For this I am finding the distance between i_s and j_s, and I am considering the weighting
coefficients w_s. That is the time-normalized distance between A and B, because the distance is
divided by the sum of the coefficients. So, what is the best alignment path between A and B? The
best alignment path is nothing but the one that minimizes this distance between the time series A
and the time series B. Based on this I can find the best alignment path between A and B.

(Refer Slide Time: 49:21)

So, in this case you can see the optimization of the DTW algorithm, because the number of possible
warping paths through the grid is exponentially large. Maybe I can consider this warping path, or
maybe that warping path; like this I can consider many, many warping paths. But I have to
determine which one is the best, and that is why I have to do the optimization of the DTW algorithm.

For this I have to consider some restrictions on the warping function: the monotonic condition,
the continuity condition, the boundary conditions, the warping window size, and the slope
constraint. I am going to explain them one by one. Based on these I can find the best warping path.

(Refer Slide Time: 50:15)

So, the first one is the monotonic condition: i_{s-1} should be less than or equal to i_s, and
similarly j_{s-1} should be less than or equal to j_s. The meaning is that the alignment path does
not go back in the time index; that is, the warping path, or the warping function, should be
monotonically non-decreasing. It guarantees that features are not repeated in the alignment.

That is why we are considering the monotonic condition. Next, I am considering the continuity
condition. Here you can see I am showing a discontinuous alignment path; the alignment path
should not jump in the time index. That is one important point: the alignment path does not jump
in the time index, so there should be continuity. In the figure I am showing the discontinuity.

So, what is the importance of continuity? It guarantees that the alignment does not omit important
features. If I consider a discontinuous alignment path, then I may miss some important features;
that is why the continuity condition is important. These two conditions, the monotonic condition
and the continuity condition, are very important. In the next slide you can see I am showing the
monotonic condition: the path is monotonically non-decreasing. I am also showing the continuity,
that is, the alignment path does not jump in the time index; it should be continuous.

(Refer Slide Time: 52:03)

Next, I am considering the boundary conditions. In this figure you can see this alignment path is
starting from one point and ending at another point, but it should not be like this. The condition is
that the alignment path starts at the bottom left and ends at the top right. That should be the
condition, but in the figure you can see an alignment path that starts and ends at the wrong points.

Actually, it should start at the bottom left and end at the top right. What is the importance of this?
It guarantees that the alignment does not consider only part of one of the sequences. That is why
we have to consider the starting point i_1 = 1 and j_1 = 1, and the ending point i_k = n and
j_k = m. Here you can see I am showing the actual alignment path based on the boundary conditions.

So, the second figure shows the alignment path that satisfies the boundary conditions. The next
point is the warping window. In this case I am defining the window size: |i_s − j_s| should be less
than or equal to r, with r greater than 0. That means I am considering the window length. A good
alignment path is unlikely to wander too far from the diagonal.

That means the best alignment path will be very close to the diagonal, so alignment paths should
be kept close to the diagonal. What is the importance of this? It guarantees that the alignment does
not try to skip different features and get stuck at similar features. That is why the window size is
important. Here in the figure you can see I am showing the window; r is the length of the window.
Based on this window, the alignment path stays close to the diagonal; it should not wander too far
from it. That is the concept of the warping window, and this condition also we need to consider.

(Refer Slide Time: 54:34)

And finally, the last constraint I am considering is the slope constraint. In this equation you can see
I am constraining the slope of the path. What is p? p is the number of steps in the y direction, and
q is the number of steps in the x direction; I am considering both directions, the x direction and
the y direction.

So, the condition is that the alignment path should not be too steep or too shallow. What is the
importance of this constraint? It prevents very short parts of one sequence from being matched to
very long parts of the other. That is the slope constraint. So, we have to consider all these
constraints for the warping function: the monotonic constraint, the continuity constraint, the
boundary conditions, the window size, and the slope constraint.

(Refer Slide Time: 55:45)

So, here I am considering the time-normalized distance between A and B. You can see the distance
between A and B, and this distance is normalized by the summation of the weighting coefficients.
Now, I am considering C equal to the summation of w_s, for s from 1 to k; w_s is nothing but the
weighting coefficient. This normalizing factor should be independent of the warping function.

So, I am getting C, which is nothing but the summation of w_s for s from 1 to k, and it should be
independent of the warping function. Since it is independent of the warping function, I can take it
out of the minimization: the distance becomes (1/C) times the minimum, over warping paths, of
the weighted accumulated distance. That means I am considering the distance between A and B,
and that is nothing but the time-normalized distance.

This distance D(A, B) I can determine, and it can be computed by dynamic programming. There
are two forms, one is the symmetric form and another one is the asymmetric form. In the
symmetric form the weighting coefficients are w_s = (i_s − i_{s-1}) + (j_s − j_{s-1}), and C is
defined as C = n + m. Similarly, for the asymmetric form, w_s is as shown on the slide, and C is
equal to n, or maybe C is equal to m. So, I may consider the symmetric form or the asymmetric
form when solving the dynamic programming problem.

(Refer Slide Time: 57:39)

So, here you can see I am considering the symmetric DTW algorithm, without the slope constraint.
You can see the warping window, which is defined between the two yellow lines. We also have to
consider the initial condition, that is, g(1, 1) = d(1, 1); that is the initial condition for the dynamic
programming. Then this DP equation, the dynamic programming equation, we can employ.

With this warping window defined, we can determine the time-normalized distance, and C is
nothing but n + m. So, this symmetric DTW algorithm we can employ; that means we are computing
the distance between the time series A and the time series B, and we have shown the warping window.

(Refer Slide Time: 58:40)

Similarly, we can consider the asymmetric DTW algorithm. Again, in this case I am defining the
warping window, and the slope constraint is not considered. This is the initial condition for the
asymmetric DTW algorithm, and we can apply the corresponding dynamic programming
equations. From these we can determine the time-normalized distance D(A, B), and in this case
we consider C equal to n. So, either the symmetric or the asymmetric DTW algorithm can be
employed to find the time-normalized distance; that means, I want to find the alignment between
the time series A and the time series B.

(Refer Slide Time: 59:31)

Also, we can consider the quasi-symmetric DTW algorithm. The concept is very similar, but the
initial condition is g(1, 1) = d(1, 1). We can consider the corresponding dynamic programming
equations for g(i, j), together with the warping window, and from these we can determine the
time-normalized distance. So, for DTW we can consider the symmetric, the asymmetric, and the
quasi-symmetric forms.

(Refer Slide Time: 60:08)

Now, let us see one example of how we can find the best alignment between the time series A and
the time series B. Here I am showing the time series A and the time series B, and I am also showing
the window between the two yellow lines. First let us start with the calculation. First we set
g(1, 1) = d(1, 1). After this we calculate the first row; you can see in the figure I am calculating
the first row, g(i, 1), by the DP equation.

After the first row, we calculate the first column by using the DP equations. Then we move to the
second row; that means, if you see the figure, for the second row we determine
g(i, 2) = min{g(i, 1), g(i − 1, 1), g(i − 1, 2)} + d(i, 2). After this, we bookkeep for each cell the
index of the neighbouring cell which contributes the minimum score.

That means we consider the minimum score, and that is shown by the red arrows; in the figure
you can see the red arrows. So, we bookkeep for each cell the index of the neighbouring cell which
contributes the minimum score. Then we carry on from left to right and from bottom to top over
the rest of the grid, and g(i, j) we determine by the DP equations.

Pictorially, again: first we calculate the first row, after this the first column, after this the second
row, and then we compute g(i, j) cell by cell using the DP equations. After this, we trace back the
best path through the grid, starting from g(n, m) and moving towards g(1, 1), following the red arrows.

That means I am determining the best path. How to find the best path? You can see it again: trace
back the best path through the grid starting from g(n, m) and moving towards g(1, 1) by following
the red arrows. That means I am getting the best alignment path between the time series A and the
time series B. This is the brief concept of the DTW algorithm. If you want to see the details of the
DTW algorithm, you can see a book; even in speech recognition books you will find this DTW algorithm.

The book by (())(63:10), or maybe other research papers, you can see for this algorithm. So,
briefly I have explained the concept of the DTW algorithm and how it can be used for gesture
recognition, as sketched below. But, as I have already explained, the main problem is the
computational complexity, because I have to compare the test trajectory with all the template
trajectories. So, it is computationally expensive. This is about the DTW algorithm.
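To make the grid computation concrete, here is a minimal, hedged Python sketch of DTW by dynamic programming. It uses unit weights for every move and a simple band of width `window` around the diagonal; it is not the exact symmetric or asymmetric weighting from the slides, and the example template and test sequences are made up.

```python
import numpy as np

def dtw_distance(A, B, window=None):
    """Simplified DTW between 1-D sequences A and B using dynamic programming.
    `window` restricts cells to |i - j| <= window (a band around the diagonal).
    Returns the accumulated cost divided by (n + m) as a rough normalisation."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    n, m = len(A), len(B)
    if window is None:
        window = max(n, m)                     # no band restriction
    g = np.full((n + 1, m + 1), np.inf)
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - window), min(m, i + window)
        for j in range(lo, hi + 1):
            d = abs(A[i - 1] - B[j - 1])       # local distance d(i, j)
            # monotonic, continuous moves: diagonal, vertical, horizontal
            g[i, j] = d + min(g[i - 1, j - 1], g[i - 1, j], g[i, j - 1])
    return g[n, m] / (n + m)

# Hypothetical template and test sequences (e.g. 1-D trajectory features)
template = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
test = [0.0, 1.0, 1.0, 2.0, 3.0, 2.0, 1.0]
print(dtw_distance(template, test, window=3))
```

For recognition, one would compute this distance between the test trajectory and every template and pick the smallest, which is exactly where the computational cost mentioned above comes from.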

(Refer Slide Time: 63:41)

Now, let us consider the concept of the hidden Markov model; briefly I will explain this concept.
A particular gesture trajectory is represented by a set of feature vectors, where each feature vector
describes the dynamics of the hand corresponding to a particular state of the gesture.

The number of such states depends on the nature and the complexity of the gesture. A global HMM
structure is formed by connecting in parallel the trained HMMs λ_1, λ_2, ..., λ_g, where g is the
number of gestures to be recognized. So, for each and every gesture I have a trained HMM.

(Refer Slide Time: 64:44)

Now I am showing the hidden Markov model. I have shown a Markov chain with a number of
states S1, S2, S3, S4, S5, and I have shown the transitions from one state to another: the transition
from S1 to S3, from S1 to S2, from S3 to S4, and so on, each with a probability; for example, the
transition from S1 to S3 has probability 0.5.

I have also shown the self-transition; in the case of the state S2 the self-transition takes place with
probability 0.7. So, the transitions from one state to another are shown here. A hidden Markov
model has a finite set of states, each of which is associated with a (possibly multi-dimensional)
probability distribution, and the transitions among the states are governed by a set of probabilities
called transition probabilities.

So, in the figure I have shown the transition probabilities, like 0.5 and 0.1. In a particular state,
an outcome or observation can be generated according to the associated probability distribution.
And why is it called hidden? I will explain: only the outcome, not the state, is visible to an external
observer, and therefore the states are hidden to the outside.

Hence the name hidden Markov model: I am considering a Markov chain, and the term hidden is
used because the states are not visible to an external observer. What is visible? Only the outcome
is visible. Therefore the states are hidden to the outside, and that is why the name is the hidden
Markov model. That is the brief concept of the hidden Markov model.

(Refer Slide Time: 67:09)

And what are the elements of the hidden Markov model? First, the clock: the clock is defined as
t = 1, 2, 3, and so on. Corresponding to this I have n states, q in {1, 2, 3, ..., n}; so n states are
available. Then there are m observation symbols; the observation symbols are v_1, v_2, v_3, and
so on.

The initial probabilities of the states are defined by π_j. The transition probabilities are a_ij, that
is, the probabilities of transition from one state to another. The observation probabilities are also
defined, that is, B, where b_j gives the observation probabilities corresponding to a particular state.

So, what are the main elements of the hidden Markov model? One is the matrix A, whose elements
are a_ij; a_ij is the transition probability. What is B? The set of observation probabilities. And what
is π? π is nothing but the vector of π_j values, that is, the initial probabilities. That means the
hidden Markov model is defined by λ.

So, λ = (A, B, π). In the case of gesture recognition, for each and every gesture I have one hidden
Markov model; suppose λ_1 corresponds to gesture one, λ_2 corresponds to gesture two, and like
this I have a number of hidden Markov models, one for each gesture. A hidden Markov model is
represented by these parameters: A, the matrix of a_ij values, which are the transition
probabilities; B, the observation probabilities; and π, the initial probabilities. The model is
represented by (A, B, π).
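As a small illustration of this parameterization, a discrete HMM λ = (A, B, π) can be written down directly as arrays. The numbers below are hypothetical, chosen only so that every row (and π itself) sums to 1; this is not a trained gesture model.

```python
import numpy as np

# A hypothetical 3-state, 4-symbol HMM, lambda = (A, B, pi)
A = np.array([[0.7, 0.2, 0.1],       # transition probabilities a_ij
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])
B = np.array([[0.5, 0.3, 0.1, 0.1],  # observation probabilities b_j(k)
              [0.1, 0.5, 0.3, 0.1],
              [0.1, 0.1, 0.3, 0.5]])
pi = np.array([0.6, 0.3, 0.1])       # initial state probabilities pi_j
```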

(Refer Slide Time: 69:35)

So, for the hidden Markov model: the state transition probability distribution A gives the
probability of transition from the current state to the next possible state; this is the transition
probability distribution A.

What about B? The observation symbol probability distribution B gives the probability of an
observation for the present state of the model. The initial state distribution π gives the probability
of a state being the initial state. That is the definition of A, B and π. So, in the case of the hidden
Markov model these parameters are important: A, B, and π. This is the hidden Markov model.

(Refer Slide Time: 70:37)

In the case of the hidden Markov model, we have three basic problems. The first problem is
evaluation: given the model λ = (A, B, π), we have to determine the probability of occurrence of a
particular observation sequence, that is, the gesture sequence O_1, O_2, O_3, and so on. We have
to determine the probability of O given λ, where λ is the model.

This is a classification or recognition problem. It is actually the determination of the probability
that a particular model will generate the observed gesture sequence, when there is a trained model
for each of the distinct classes. This can be obtained by the forward-backward algorithm. I am not
explaining these algorithms in detail; you can see the research papers on this.

So, what does the forward-backward algorithm solve? The problem is to determine the probability
of O given λ; that is the recognition problem. What is the decoding problem? The decoding
problem is the determination of the optimal state sequence that produces an observation sequence.
The algorithm for this is the Viterbi algorithm, which you can also look up; this problem is called
the decoding problem.

One more important problem is the learning problem, that is, the training of the hidden Markov
model: the determination of the model λ given a training set of observations. That means we have
to find λ such that the probability of O given λ is maximal; this is nothing but the training of the
hidden Markov model. We train and adjust the model to maximize the observation sequence
probability, so that the HMM will identify similar observation sequences in the future.

For this, you can see the Baum-Welch algorithm. So, these algorithms are very important: the
forward-backward algorithm for the evaluation problem, where we have to determine the
probability of O given λ; the Viterbi algorithm for the decoding problem; and, for the learning
problem, that is, the training of the hidden Markov model where we have to maximize the
probability of O given λ, the Baum-Welch algorithm.
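For the evaluation problem, the forward pass alone is enough to compute P(O | λ). Here is a minimal sketch, assuming the (A, B, π) arrays from the earlier example and an observation sequence of symbol indices; in practice scaling or log-probabilities would be needed for long sequences, and the gesture-model dictionary at the end is purely hypothetical.

```python
import numpy as np

def forward_probability(A, B, pi, observations):
    """Forward algorithm: P(O | lambda) for a discrete HMM lambda = (A, B, pi).
    `observations` is a sequence of symbol indices into the columns of B."""
    alpha = pi * B[:, observations[0]]        # initialisation
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction step
    return alpha.sum()                        # termination: P(O | lambda)

# Recognition sketch: pick the gesture model with the highest likelihood
# models = {"gesture_1": (A1, B1, pi1), "gesture_2": (A2, B2, pi2)}
# best = max(models, key=lambda g: forward_probability(*models[g], obs))
```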

(Refer Slide Time: 73:09)

So, in the case of the hidden Markov model, what are the main problems? We must know all the
possible states in advance, and we must know the possible state connections in advance; for all the
gestures we have to consider this. Also, an HMM cannot recognize gestures or things outside the
model, as I have already explained.

First we have to form the model and train it by using the Baum-Welch algorithm, and after this we
can recognize. But it cannot recognize things or gestures outside the model, and it must have some
estimate of the state emission probabilities and the state transition probabilities. In addition, the
hidden Markov model makes several assumptions.

These are the problems with the hidden Markov model. In the case of a hidden Markov model,
every gesture model has to be represented and trained separately, considering it as a new class,
independent of anything else already learned. The hidden Markov model also requires strict
independence assumptions across multivariate features, and conditional independence between
the observations.

This is one important requirement, and it is generally violated in continuous gesture recognition.
So, for continuous gesture recognition we have to modify the hidden Markov model; in the
research papers you can see that the hidden Markov model can be applied to continuous gestures
as well. The hidden Markov model is a generative model that defines a joint probability
distribution to solve a conditional problem. That is why we can consider another model, the
conditional random field.

This is very popular in gesture recognition. A CRF is a discriminative model that uses a single
model of the joint probability of the label sequence to find conditional densities from the given
observation sequence. This concept you can also see in the research papers; here I am just
mentioning the concepts of the hidden Markov model and the conditional random field.

(Refer Slide Time: 76:04)

And finally, most recently, deep networks are used for gesture recognition. We can use something
like convolutional neural networks, or recurrent neural networks with long short-term memory
units. These types of networks have recently been used for gesture recognition.

You can see that recently deep learning has erupted in the action and gesture recognition fields,
(())(76:12) giving outstanding results and outperforming non-deep state-of-the-art methods such
as the hidden Markov model and DTW. So, recently the deep learning techniques are used, and the
popular networks are CNNs and recurrent neural networks; these types of networks are used for
gesture recognition.

So, in this class I discussed the basic concept of gesture recognition and highlighted some of the
applications of gesture recognition. I discussed the concepts of static gesture recognition and
dynamic gesture recognition. For dynamic gesture recognition we can consider the hidden Markov
model or DTW, and I have briefly explained the concepts of the DTW algorithm and the hidden
Markov model.

So, for more detail, you can see the related research paper on gesture recognition. So, there are
many new techniques available for gesture recognition. So, you can see all these concepts in the
research papers. So, let me stop here today. Thank you.

Computer Vision and Image Processing – Fundamentals and Applications
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology, Guwahati
Lecture No. 39
Introduction to Video Surveillance
(Background Modelling and Motion Estimation)
Welcome to the NPTEL MOOC course on Computer Vision and Image Processing –
Fundamentals and Applications. In this class I will briefly explain one important application
of Computer Vision, that is, Video Surveillance. There are many applications of an
Automated Video Surveillance System, and I will explain these applications.

Also there are many research problems in developing an Automated Video Surveillance
System. And mainly I will explain two important concepts. One is the Background Modelling
and another one is Motion Estimation. In Motion Estimation I will explain one important
algorithm, that is, Optical Flow. Let us see what are the important applications of a Video
Surveillance System.

(Refer Slide Time: 01:24)

So, some of the important applications are: access control in special areas like airports,
railway stations, and maybe tunnels, which is one important application; anomaly detection and
alarming; and crowd flux statistics and congestion analysis. These are some applications of an
Automated Video Surveillance system.

(Refer Slide Time: 01:53)

So, in the figure you can see, I have shown one example that is the anomaly detection and
alarming. So, here I am considering, you can see the tracking that is people tracking. And
from this I want to detect anomalies, and accordingly we can give some alarms based on the
anomalies. So, this is anomaly detection and alarming.

(Refer Slide Time: 02:20)

And here I have shown one practical example, that is, the London bomb attack. In the figure you
can see some of the frames of the CCTV video sequence, and in this case you can see that the
terrorists were detected by CCTVs. So, this is related to the London bomb attack. I have taken the
pictures from this website, and this is one application, that is, anomaly detection.

(Refer Slide Time: 02:48)

Next also I have shown some frames of a video sequence that is from the CCTV, and this is
nothing but the anomaly detection.

(Refer Slide Time: 02:57)

And another application is the traffic information. So, I can track cars, I can track vehicles,
and accordingly I can get the traffic information. So, this is also one important application.

(Refer Slide Time: 03:12)

But there are problems in building an Automated Video Surveillance system. The first problem is
temporal variation and the dynamic environment. What is temporal variation? Suppose I
want to recognise activities, say a person doing some activity; in this case we
have to consider the spatio-temporal variations, as I mentioned for gesture recognition. In
gesture recognition we also have to consider spatio-temporal variation, and likewise
for human activity recognition, or human action recognition, we have to consider
spatio-temporal variations.

Also, whenever I want to detect a person or a particular object, we have to do segmentation,
the segmentation of the foreground from the background. But the problem is that the
background may be cluttered, the illumination may change, or the background may be
dynamic; in these cases it is very difficult to do background subtraction.

Under all these conditions, particularly illumination variations, cluttered backgrounds, and
dynamic backgrounds, it is very difficult to do the segmentation of the foreground objects from
the background. So, that is the problem of the dynamic environment and the temporal variations.

The next point is abrupt object or camera motion. There may be abrupt object or camera
motions, and in that case also we have to do the tracking and the background subtraction; all
these things we have to do while taking into account abrupt object or camera motion. Also,
we can employ multiple cameras for a video surveillance system; with multiple cameras the
field of view will increase compared to a single camera.

But the problem is that I have to find the correspondence between all these cameras, the multiple
cameras. That is one research problem: how to find the correspondence between all the
cameras available in the video surveillance system. Also, we need all the outputs in real
time, so one important issue is the computational complexity of the algorithm. These are the
main problems for an automated video surveillance system.

(Refer Slide Time: 05:48)

So, again I am showing some of the difficulties. In this case I am showing the concept of
tracking, but in the image you can see the people merge together; that means there may be
some occlusion, and because of this occlusion it is very difficult to track. Here you can
see that the people merge together; in this case also we have to do the tracking, but it is a
difficult situation.

The next one is that the people are occluded by a car, which is another example of
occlusion. For a particular time the people may be occluded by a car, and then also we have to
do the tracking. So, this is one difficulty. Also, if I consider the tracking of vehicles, then
the problem will be the shadow; even for tracking people, the problem may be the
shadow. The cast shadow, that means the shadow of the vehicle, we have to consider:
the vehicle will be moving at a particular speed and its shadow will also be moving.

So, in this case we have to do the segmentation, the segmentation of the vehicles from the
background and after this we have to do the tracking. So, tracking in the presence of shadow
is a big difficulty. So, we have to do this, we have to remove the shadow. So, that is also
another research problem in video surveillance.

(Refer Slide Time: 07:18)

So, let us consider one generic framework of a video surveillance system. The input is the
video. After this we have to detect the objects present in an image, and then we can
employ some machine learning algorithms, the pattern classification algorithms, for object
classification. After this we can do the tracking of the objects, and we can also
employ action recognition, that is, activity recognition.

Suppose people are walking or standing; for this we can recognise
different actions, similar to gait recognition, where gait means the walking style. That
is action recognition. The output is the output video and a semantic
description; so I will be getting the semantic description of the input video. This is the
generic framework of a video surveillance system.

(Refer Slide Time: 08:20)

And you can see the difference between object detection and object tracking.
Object detection means detecting a particular object in an image; here you can see I am
detecting these objects, and that is nothing but object detection. But what is object
tracking? To track an object, or maybe multiple objects, over a sequence of images, where
the sequence of images means the frames of a video.

In this case I have to find the correspondence between the frames of a video. Tracking an
object, maybe a single object or multiple objects, over a sequence of frames of a video is
object tracking.

(Refer Slide Time: 09:07)

And how to detect the object? That I am showing the simple flowchart here. So, suppose the
object is stationary, after this we are considering the image acquisition. After this we can
detect object, so object detection algorithms are available. After this again the image
acquisition from the video that is the frames I am collecting or the frames I am taking from
the video.

And after this again I am doing the object tracking and suppose the object is lost in a
particular frame, then again we have to detect the objects. And if the object is not lost then we
can continue the tracking. So, you can see first I have to detect the object present in an image
or present in a video, after this we have to do the tracking.

In some of the frames the object may get lost, then in this case again we have to acquire the
frames of the video. And we can detect the object, and after this you can continue the
tracking. Like this you can do, but the object, if it is not lost then we can continue tracking.
So, this is a simple flowchart to detect objects and also do the tracking.

(Refer Slide Time: 10:29)

So, here you can see I am showing one example, the detection of moving objects. This person
is moving, and we have to detect the moving objects based on some algorithms, but the
problem is the shadow, that is, the cast shadow, which we have to remove. The person is
moving and his shadow is also moving, so that is the problem in detecting the moving
objects. That is why we have to develop some algorithms to remove the shadow; that is the
problem of detecting moving objects in the presence of shadow.

(Refer Slide Time: 11:04)

After this object classification, so we can consider the classification of objects present in an
image or maybe the video. So, here you can see we are classifying the objects, the car, the
pedestrians, like this we can do the classification based on some algorithms. That is also very
important. So, we can employ some machine learning algorithms for object classification.

(Refer Slide Time: 11:29)

And tracking, as I have already explained, is nothing but finding the correspondence
between the frames of a video. In this case I have shown some of the frames of a video;
here you can see 563, that is the frame number, then 566, 567, and so on; these are the frames.
You can see that person 1 is moving and person 2 is also moving, and there may be
occlusion; in frame 567 there is an occlusion. In the presence of occlusion also we have to
track the persons, person 1 and person 2.

And you can see the output of this, the tracking algorithm. So, we can track person 1 and
person 2 in this video. So, tracking means finding the correspondence between the frames of
the video.

(Refer Slide Time: 12:21)

So, some of the traditional methods for all these tasks: for moving object detection we can
employ a background subtraction algorithm, and we also have to remove the shadow, so a
shadow removal algorithm has to be employed. For object classification we can simply
employ the shape information, that is, shape-based classification. And for
tracking of the objects we can employ model-based tracking and also feature-based
tracking.

In the case of model-based tracking, we consider a model of the object, maybe a
geometrical model, and based on this model we do the tracking. In the case
of feature-based tracking, we consider some features, maybe the colour feature or
the texture; all these features we can consider, and based on the features we do the
tracking. These are the traditional methods employed in video surveillance systems.

(Refer Slide Time: 13:19)

And the major problems of background estimation: one is random illumination
variation, which we have to consider, since the illumination is not uniform. If there is random
illumination variation, then background subtraction, that is, background estimation, becomes a
problem. The background may also be cluttered, and the background may be dynamic.

So, in this image you can see the background is very cluttered and one train is moving that
means the background is dynamic. And we want to track these persons, you can see I am
doing the tracking in the presence of the moving background and cluttered background, and
also the illumination is not constant, not uniform. Random illumination variations are
available in these images in the video. And also we have to do the initialization of the moving
object.

In this video or in these images or in these frames of the video, you can see, you can see we
have the moving objects, this person is moving, this person is moving. So, we have the
moving objects, and also some persons they are sitting here that means they are stationary.
So, we have to initialize the moving objects.

And also as explained earlier so we have to consider the shadow problem. And one difficult
problem is the removal of the cast shadow. Because the shadow is also moving and the
person is also moving, so based on the motion information I cannot do the separation between
the moving objects and the shadow. So, we have to employ some algorithms so that we can
remove the cast shadow.

(Refer Slide Time: 15:00)

Now I will discuss the concept of background subtraction. One very old algorithm is
temporal differencing. This algorithm can be applied to detect foreground objects by
making use of pixel-wise differences between two or three consecutive frames of the video; that
means I have to find the differences between two or three consecutive frames.

The pixels are marked as foreground if they satisfy the following condition: for a particular
pixel, if the absolute difference | I_{t+1} − I_t | between the pixel value at time t+1 and the pixel
value at time t is greater than a particular threshold, then we consider that pixel as a foreground
pixel; otherwise it is a background pixel. Like this we can determine
the moving pixels.
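A minimal OpenCV sketch of this thresholded frame differencing is given below; it assumes greyscale frames of the same size, and the threshold value 25 is an arbitrary illustrative choice.

```python
import cv2

def temporal_difference_mask(frame_prev, frame_curr, threshold=25):
    """Mark a pixel as foreground when |I_{t+1} - I_t| exceeds a threshold.
    Both frames are assumed to be greyscale images of the same size."""
    diff = cv2.absdiff(frame_curr, frame_prev)                      # |I_{t+1} - I_t|
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    return mask                                                     # 255 = moving pixel
```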

(Refer Slide Time: 16:00)

In this case I am showing one example of foreground extraction, that is, background
subtraction, which is nothing but a change detection algorithm. This is the current image, and
I am considering the background image. If I subtract the background image from the current
image, I will be getting the foreground pixels. This is nothing but the background
subtraction algorithm.

(Refer Slide Time: 16:22)

So, again I am showing you the same thing, the background subtraction algorithm. This is
frame one and the other one is frame ten. If I do the subtraction, I will be getting
the difference of the two frames, from which the moving objects can be determined.

(Refer Slide Time: 16:38)

Similarly, I am showing one example of absolute differences. I am considering two frames:
in the first frame the moving objects are present, and the second frame is nothing but the
background. If I subtract the second image from the first image, I will be getting the moving
objects; that is, the moving objects will be detected. That is nothing but the absolute difference.

(Refer Slide Time: 17:03)

So, in this example, I have shown one algorithm that is a simple algorithm for background
subtraction. So, background subtraction system flowchart. So, suppose I have the input video
and I am considering the frames of the video. First I am considering the N frames…the first
N number of frames I am considering and based on this I am determining the background
model. So, the background model is determined from the first N number of frames.

After this suppose one input frame is coming that is the input current frame. And I can
employ the background subtraction algorithm, so you can see I have the background image
here, and I have the input current frame so I can employ the background subtraction
algorithm and based on this I can determine the foreground pixels and the background pixels
I can determine based on the matching.

And based on the background pixels I can update the background model that I have already
developed from the first N frames. I also have to account for brightness changes; there may be
changes of the brightness, so that also we have to consider, and then I will be getting the
foreground.

After this we have to do some post-processing; maybe we can employ morphological image
processing operations like dilation and erosion, and a noise filter. Ultimately I will be getting the
output, that is, the foreground image. So, I can separate the foreground and the background by
considering this background subtraction algorithm.
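The flowchart above can be sketched in a few lines of OpenCV. This is only an illustrative approximation under stated assumptions: greyscale frames, a per-pixel median of the first N frames as the background model (the lecture's actual modelling step may differ), and an arbitrary threshold; the morphological opening and closing stand in for the dilation/erosion post-processing.

```python
import cv2
import numpy as np

def build_background(frames):
    """Background model from the first N frames (assumption: per-pixel median)."""
    return np.median(np.stack(frames, axis=0), axis=0).astype(np.uint8)

def foreground_mask(frame, background, threshold=30):
    """Background subtraction followed by morphological post-processing."""
    diff = cv2.absdiff(frame, background)
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove small noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    return mask
```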

(Refer Slide Time: 18:41)

This is also a very old technique, published in IEEE Transactions on Pattern Analysis and
Machine Intelligence. Briefly, the concept is this: the background is modelled by representing
each pixel x by three values. One is the maximum intensity value m(x), one is the minimum
intensity value n(x), and the third is the maximum intensity difference between two consecutive
training frames, d(x).

By using these three values, the maximum intensity, the minimum intensity, and the maximum
inter-frame intensity difference over the training frames, I can develop the background model.
How to do this? A pixel x from the current image I is considered a foreground pixel if the
following condition is true: the difference between the current pixel I(x) and the maximum
intensity m(x) is greater than λ d(x), or the difference between I(x) and the minimum intensity
n(x) is greater than λ d(x).

Based on this I can determine the foreground pixels. Briefly, that is the
algorithm; based on it I can determine the foreground pixels.
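A rough numpy sketch of this per-pixel (min, max, max inter-frame difference) model follows. It assumes at least two greyscale training frames and, as an assumption on my part, uses absolute differences in the foreground test; the original paper's exact test may differ slightly.

```python
import numpy as np

def train_minmax_model(frames):
    """Per-pixel background model from N training frames: minimum intensity n(x),
    maximum intensity m(x), and maximum inter-frame difference d(x)."""
    stack = np.stack(frames, axis=0).astype(float)        # shape (N, H, W)
    n = stack.min(axis=0)                                 # minimum intensity n(x)
    m = stack.max(axis=0)                                 # maximum intensity m(x)
    d = np.abs(np.diff(stack, axis=0)).max(axis=0)        # max inter-frame difference d(x)
    return n, m, d

def minmax_foreground(frame, n, m, d, lam=2.0):
    """Foreground if the pixel differs from m(x) or n(x) by more than lambda*d(x)
    (absolute differences assumed here)."""
    I = frame.astype(float)
    return (np.abs(I - m) > lam * d) | (np.abs(I - n) > lam * d)
```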

(Refer Slide Time: 20:20)

This is also a very popular technique for background estimation and background modelling:
we can employ a Single Gaussian model. In this case, each pixel in the background reference is
assumed to follow a single, separate Gaussian distribution. That is, for a particular pixel we
consider a Gaussian distribution characterized by its mean and standard deviation; each pixel in
the background model therefore has two parameters, estimated from N background images.

So, we consider N background images, and from these N background images we determine the
mean and the standard deviation; that is nothing but the background model of the image. Now
suppose I_t is the value of the pixel in the image at time t. The pixel will be classified as a
foreground pixel if it satisfies the following condition: | I_t − μ_t | > λ σ_t.

If this condition is satisfied, then I consider that particular pixel as a foreground pixel. λ is a
user-defined parameter, and I am considering the difference between the pixel value and the
mean: if | I_t − μ_t | is greater than λ σ_t, then we consider that particular pixel as a foreground
pixel.

(Refer Slide Time: 22:15)

So, λ is a user-defined parameter. After this, the parameters of the pixel should be updated:
the mean is updated as shown on the slide, and the variance is also updated accordingly. These
are the updating equations. Here α is a parameter that controls the learning rate, so I can
control the learning rate through the value of α; generally the
value of α is 0.9. This is the concept of the Single Gaussian model.
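A minimal sketch of this classify-then-update loop is given below, assuming running-average updates of the form μ ← αμ + (1−α)I_t applied only to background pixels; the exact update equations on the slide may differ slightly, and λ = 2.5 is an illustrative value.

```python
import numpy as np

def classify_and_update(I_t, mu, sigma, lam=2.5, alpha=0.9):
    """Single-Gaussian background model for one pixel (or an array of pixels).
    Foreground when |I_t - mu| > lam * sigma; background pixels update the
    running mean and variance with learning rate alpha (assumed form)."""
    foreground = np.abs(I_t - mu) > lam * sigma
    mu_new = np.where(foreground, mu, alpha * mu + (1 - alpha) * I_t)
    var_new = np.where(foreground, sigma ** 2,
                       alpha * sigma ** 2 + (1 - alpha) * (I_t - mu_new) ** 2)
    return foreground, mu_new, np.sqrt(var_new)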

(Refer Slide Time: 22:54)

Next, I can consider the mixture of Gaussians model, that is, the Mixture Gaussian Model.
What is the concept of this? Each pixel in the background is modelled as a
mixture of Gaussians. I consider a mixture of Gaussians because a single
Gaussian model may not be appropriate in many cases; suppose the background is moving or
the illumination is changing, then the single Gaussian model may not be
appropriate. For this we can consider a mixture of Gaussians.

The probability of observing the current pixel value X_t at time t is given by the expression on
the slide, where K is the number of Gaussian distributions, that is, how many Gaussian
components are employed to represent a particular pixel or the background. w_{i,t} is an estimate
of the weight of the i-th Gaussian component, and X_t is the pixel value; for each component I
am considering its mean and standard deviation.

(Refer Slide Time: 24:04)

After this we have to update the weights by using this equation, where β is a user-defined
parameter, that is, the learning rate, and M_{i,t} is 1 for the model which is matched and 0 for the
remaining models. The mean and the variance should then be updated as we did in the Single
Gaussian Model.

So, this is the concept of the Mixture Gaussian Model; briefly I have explained this
concept.
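In practice, a mixture-of-Gaussians background subtractor is available off the shelf in OpenCV (MOG2). The sketch below only shows typical usage, not the lecture's own implementation; the video filename is hypothetical, and the parameter values are the common defaults rather than recommendations.

```python
import cv2

# OpenCV's built-in mixture-of-Gaussians background subtractor (MOG2).
# history: number of frames used for the model; varThreshold: matching threshold;
# detectShadows: shadow pixels are marked with a separate grey value.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)

cap = cv2.VideoCapture("surveillance.mp4")   # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)        # per-frame foreground mask
cap.release()
```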

(Refer Slide Time: 24:39)

This is also a very popular method, Background Modelling by Codebook Construction; you can
see the research paper by Kim. It is one of the popular methods, although it is quite an old
method now. In this case the codebook algorithm is employed, that is, codebook construction,
and the concept of quantization, or clustering, is used to construct the background model.

Samples at each pixel are clustered into a set of codewords; that is, if I consider the samples of a
pixel from the training images, they are clustered into a set of codewords. The background is
encoded on a pixel-by-pixel basis, so for each pixel I have its codewords. A particular pixel can be
classified as a background pixel if it satisfies the following two conditions.

First we have to consider the colour distortion: based on the colour distortion we can decide
whether a particular pixel is a foreground pixel or a background pixel. We also have to consider
the brightness difference, that is, whether the brightness lies within the brightness range of the
codeword. Based on these we can separate the foreground and the background pixels. This is,
briefly, the concept of Background Modelling by Codebook Construction; for more details you
can read this paper.

(Refer Slide Time: 26:20)

These are some of our results of Background Modelling by the Codebook technique.
Corresponding to this image you can see I am extracting the foreground, and similarly for
this image. The second image is very difficult, because a dynamic background is present,
illumination variations are present, and the background is very complex. These are
some results of background estimation, that is, the separation of the foreground from the
background.

(Refer Slide Time: 26:54)

One important technique is the Lucas-Kanade technique, that is, the concept of
optical flow. We can determine the visual motion pattern of the moving objects, and that is
called the optical flow. Here you can see this person is moving, and corresponding to this
motion I can determine the visual motion pattern; that means I can
determine the motion vectors. Here you can see these are the motion vectors, which are nothing
but the optical flow.

Based on the optical flow determination, I can determine the moving objects; that is
nothing but motion estimation. So, the algorithm I am going to explain is motion
estimation by optical flow.

(Refer Slide Time: 27:37)

So, what is optical flow? This concept is very important. Optical flow describes how
quickly and in which direction a pixel is moving. That means the motion
pattern, the visual motion pattern, can be determined by the optical flow algorithm, and
we can determine the motion vectors, that is, the flow vectors, to detect moving
regions.

The optical flow approach can be combined with other background subtraction methods.
By combining the optical flow method and another background subtraction method we
can determine the moving regions, that is, we can do motion estimation, and we can separate the
foreground from the background. And I have to consider the following assumptions.

One is that the optical flow should not depend on illumination changes in the scene; that means
the illumination should be almost constant for optical flow determination. That is one
assumption. The second assumption is that the motion of unwanted objects, like the shadow,
should not affect the optical flow. That also we have to consider; that means we have to
avoid, or not consider, the unwanted objects, and if the shadow is present, that will
be a big problem for the optical flow algorithm.

So, that we are not considering the presence of the shadow or the presence of the unwanted
objects. So, these two assumptions we are considering for determination of the optical flow.
Now let us see how to determine the optical flow.

(Refer Slide Time: 29:13)

Here you can see, in this example the object is rotating in the counterclockwise direction, and corresponding to this I will be getting the optical flow, that is, the motion pattern. Ideally, the optical flow is equal to the motion field: because the object is moving, I will be getting the motion vectors, that is, the motion field, and also the optical flow.

(Refer Slide Time: 29:42)

But sometimes the optical flow may not be equal to the motion field. Here I am giving two examples. In example (a), you can see this object, a sphere, is rotating, and one light source is available. Corresponding to this patch, suppose I consider this patch, the illumination will be almost constant; and if the illumination is constant, then there will not be any optical flow.

Why? Because the optical flow is determined from the change of brightness: the object is moving, and based on this the brightness at different points should change. But in this case the brightness at this particular point remains almost constant.

In the second case there is no motion: the sphere is stationary, but the light source is moving. Corresponding to this patch the brightness will be changing, and whenever the brightness changes I will be getting optical flow. But there is no motion field, the motion field is zero; so in spite of the motion being zero, optical flow is present.

That means there is no motion field, but the shading changes, that is, the brightness of that particular patch changes because the source is moving. So, in the first case the motion field is there but there is no optical flow, and in the second case the motion field is not there but there is a change of brightness, and because of this I will be getting optical flow. That means the optical flow is not, in general, equal to the motion field.

(Refer Slide Time: 31:38)

Now, the problem definition of optical flow. Here I have shown two images, one is H and the other one is I, and I have shown some pixels, the red pixels, the green pixels, and so on. The pixels are moving, so I have to determine the motion of the pixels; that is nothing but motion estimation.

For this, you can see, we have to find the pixel correspondence: I have to find the correspondence between the pixels, and from it I can determine the optical flow, that is, the motions. Suppose this red pixel corresponds to this red pixel, and similarly this pixel corresponds to this pixel, and so on. So, I have to find the correspondence between the pixels.

For this I can consider some assumptions, like the color property, or the brightness property for a grayscale image. We also have to consider the small motion assumption: points do not move very far. So, we have to find the correspondence between the pixels of the two images. The actual concept of the optical flow I will explain in the next slide.

(Refer Slide Time: 32:51)

Here I have shown two frames of a video, one at time t and another at time t + δt. Let us consider a patch at the point (x, y). At time t + δt this patch moves to a second position. Now, what will be the position of the patch in the second image? The position of the patch in the second image is (x + uδt, y + vδt).

In this case, uδt is nothing but δx, and similarly vδt is nothing but δy. So, the position of the patch in the second frame is (x + δx, y + δy). Here I am considering the optical flow velocities u and v: u is the velocity along the x direction and v is the velocity along the y direction.

From u and v I can determine the displacement: δx is nothing but uδt and δy is nothing but vδt. Now, assume that the brightness of the patch remains the same in both images; that is true here. You can see this patch is moving to the second position, so the brightness of this patch remains the same in both images.

Only the position is changing, but the brightness of the patch remains the same. So, here E(x, y, t) is the brightness of the patch at the point (x, y) at time t. And since the same patch is moving to another position at time t + δt, there will not be any brightness change.

So, corresponding to this, what will be the brightness? The brightness will be E(x + uδt, y + vδt, t + δt), because that is the new position (x + δx, y + δy) and the time is t + δt. So, the brightness remains the same: E(x, y, t) = E(x + uδt, y + vδt, t + δt). I can show this with another diagram. Suppose I am considering one object and it is rotating; this is at time t and this is at time t + δt.

If I consider the brightness of this patch, it will be the same as that of this one, because this point is moving to this point. But if I consider the same image position, say position 1 and position 2 at the same location, the brightness there will be changing because of the rotation.

From this I can determine the optical flow. Corresponding to position 1 at time t and position 2 at time t + δt, the brightness will be different because of the rotation. Based on this concept I can determine the optical flow: the brightness observed at the same image point changes because of the motion, and that gives the optical flow. So, assuming the brightness of the patch remains the same in both images, I will be getting this equation.

This I can expand by considering the Taylor series expansion; neglecting the higher-order terms, I will be getting this equation, and the E(x, y, t) terms on both sides cancel out.

(Refer Slide Time: 37:04)

So, from this, what will I be getting? I will be getting this equation:

(∂E/∂x)(dx/dt) + (∂E/∂y)(dy/dt) + ∂E/∂t = 0.

That is, dividing by δt I obtain this equation, and it can be written as E_x u + E_y v + E_t = 0.

What is E_x in this case? E_x is nothing but ∂E/∂x, that is, the change of brightness with respect to the coordinate x; that is the spatial change of the brightness. Similarly, E_y is nothing but ∂E/∂y, which also corresponds to the spatial change of the brightness. And what is dx/dt? dx/dt is nothing but u, the velocity along the x direction, and dy/dt is nothing but v, the velocity along the y direction.

And what is ∂E/∂t? That is the time rate of change of brightness: the brightness is changing with respect to time. As I explained already, suppose the object is moving, at time t and at time t + δt; the brightness observed at the same image point will be changing. So, the change of brightness with respect to time is ∂E/∂t.

So, I will be getting this equation. This is the equation of a straight line, so the velocities u and v must lie on a straight line, and we can compute E_x, E_y and E_t using gradient operators. But the problem is that u and v cannot be found uniquely with this constraint; with this constraint alone we cannot determine the velocities u and v.

(Refer Slide Time: 39:28)

The same concept I am showing here: this patch is moving to this position, and corresponding to this I will be getting the optical flow. The optical flow depends on the change of brightness due to the motion. I already have this equation, and it can be written like this. So, what is the interpretation of this equation?

You can see I will be getting −E_t = ∇E · c. Here c is the velocity vector; it has two components, u and v, the velocity along the x direction and the velocity along the y direction.

And what is E_t? E_t is the time rate of change of brightness: with respect to time, the brightness changes because of the motion. And what is ∇E? ∇E is nothing but the spatial rate of change of brightness.

Our objective is to determine the velocity vector from this equation. This is the constraint equation, or the optical flow equation, and from it we want to determine the velocity vector c with its two components u and v.

(Refer Slide Time: 40:53)

For the solution of the optical flow equation there are many methods; I am showing only one. If you want the full derivation, you can see the original research paper by Horn, where this optical flow algorithm is described; here I explain the solution of the optical flow equation only briefly.

For this we have to consider the error in the optical flow constraint. We also have to consider another assumption, that the velocity vector changes very slowly in a given neighbourhood, that is, the motion field varies smoothly. This assumption is nothing but the smoothness constraint, and based on it I obtain this equation, which is the smoothness error term.

(Refer Slide Time: 41:51)

So, we already have this equation, the optical flow equation, and we can solve it by considering the Lagrange multiplier method: we take e = e_s + λ e_c, where λ is the Lagrange multiplier. We have already defined the smoothness error e_s and the flow (constraint) error e_c. For the solution we have to minimise this total error; after doing some mathematics, we obtain the following result.

(Refer Slide Time: 42:33)

After this, solving the minimisation, I finally obtain the velocity along the x direction, that is u, and the velocity along the y direction, that is v. Here ū and v̄ denote the mean (neighbourhood-averaged) velocities, and M and N are defined as shown on the slide. From these two equations I can determine u and v, that is, the velocity vector.

(Refer Slide Time: 42:59)

Here I show the simple optical flow algorithm from my book. The first step is to initialize the velocity vector. After this I consider the iterations, say k iterations: using the previous equations, I calculate the values of u and v. To stop the iterations we consider the error; when the error is less than a particular threshold we stop, otherwise we continue the iteration and keep updating u and v.

This algorithm can be employed on a video. A video has a number of frames, and the frames are read one by one, frame number one, frame number two, and so on. From consecutive frames of the video I can determine the velocity along the x direction and the velocity along the y direction. So, I have briefly explained the concept of the optical flow and how to determine it.
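Since the slide only sketches the final update equations, here is a small illustrative implementation of this kind of iterative flow estimation in Python/NumPy. It is a sketch, not the exact M and N expressions from the slide or the book; the derivative approximations, the weight alpha and the iteration count are assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def iterative_optical_flow(frame1, frame2, alpha=10.0, n_iter=100):
    """Return per-pixel flow (u, v) between two grayscale frames via smoothness-regularised iteration."""
    I1 = frame1.astype(np.float64)
    I2 = frame2.astype(np.float64)

    # Spatial and temporal brightness derivatives E_x, E_y, E_t.
    Ex = (np.gradient(I1, axis=1) + np.gradient(I2, axis=1)) / 2.0
    Ey = (np.gradient(I1, axis=0) + np.gradient(I2, axis=0)) / 2.0
    Et = I2 - I1

    # Kernel giving the neighbourhood mean velocities (u_bar, v_bar).
    avg = np.array([[1/12, 1/6, 1/12],
                    [1/6,  0.0, 1/6 ],
                    [1/12, 1/6, 1/12]])

    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        # Update balancing the flow-constraint error against the smoothness term.
        common = (Ex * u_bar + Ey * v_bar + Et) / (alpha**2 + Ex**2 + Ey**2)
        u = u_bar - Ex * common
        v = v_bar - Ey * common
    return u, v
```

In a video, this function would simply be called on each pair of consecutive frames.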

In this class, I briefly explained the concept of a video surveillance system and showed different applications of such a system. I explained two main concepts: one is background modelling and the other is motion estimation. For motion estimation I explained the concept of optical flow, which is very important. In my next class I will explain some algorithms for object tracking. Let me stop here today. Thank you.

Computer Vision and Image Processing – Fundamentals and Application
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, India
Lecture 40
Introduction to Video Surveillance
(Object Tracking)

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing - Fundamentals
and Applications. In my last class I discussed the concept of background modeling and the
concept of the motion estimation. For motion estimation, I discussed one important algorithm,
that is, optical flow based motion estimation. Today I am going to continue the same discussion: first I will explain the concept of object tracking, and after this I will briefly explain particle filter based object tracking. So, let us see what object tracking is.

(Refer Slide Time: 01:07)

In my last class I discussed the concept of a video surveillance system. In this block diagram I have shown the generic framework of a video surveillance system. You can see the input video, captured by a camera, is available; after this we have to detect the objects present in the image or in the video, and then do the object classification.

After this we can do the object tracking, and after this the action recognition; the output is the output video along with a semantic description. This is a generic framework of a video surveillance system, which I discussed last class.

(Refer Slide Time: 01:55)

One problem I have already mentioned is the problem of shadows; we have to remove the shadow. If I consider a cast shadow, suppose I want to track an object: the shadow is moving and the object is also moving, so based on motion information alone we cannot remove the shadow. For this, I have shown one classical approach from this paper.

In a shadow region the intensity changes greatly, but the change of chromaticity is negligible. Based on this condition, we can separate the shadow from the object. In this example you can see a cast shadow is present, and based on this property, that is, the intensity variation together with the nearly unchanged chromaticity, we can remove the shadow.

This is the classical approach for shadow removal. At present there are many newer techniques, including methods in the deep learning framework; you can look at those methods, but here I have only briefly explained the concept of shadow removal.
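As a rough sketch of this idea (not the exact method of the cited paper), a pixel already marked as foreground can be re-labelled as shadow when its brightness drops compared with the background model while its hue and saturation stay nearly the same. All thresholds below are illustrative assumptions.

```python
import cv2
import numpy as np

def shadow_mask(frame_bgr, background_bgr, fg_mask,
                v_low=0.4, v_high=0.95, h_tol=10, s_tol=40):
    """Mark foreground pixels whose brightness drops but whose chromaticity barely changes."""
    hsv_f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv_b = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)

    v_ratio = (hsv_f[..., 2] + 1) / (hsv_b[..., 2] + 1)     # brightness attenuation
    h_diff = np.abs(hsv_f[..., 0] - hsv_b[..., 0])
    h_diff = np.minimum(h_diff, 180 - h_diff)               # hue is circular (0..179 in OpenCV)
    s_diff = np.abs(hsv_f[..., 1] - hsv_b[..., 1])

    shadow = (v_ratio > v_low) & (v_ratio < v_high) & (h_diff < h_tol) & (s_diff < s_tol)
    return shadow & (fg_mask > 0)
```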

(Refer Slide Time: 03:14)

After this, the next point is object classification. For this we can consider shape features. Here I am again considering a very old paper, but the concept is very simple: based on shape information we can classify humans and vehicles. For this classification they considered one parameter, dispersedness, which is based on the perimeter and the area of the blob.

Based on this parameter they classified humans and vehicles: for humans the reported value of this parameter is 61.8, and for vehicles it is 41.0. The idea is that humans are generally smaller than vehicles but have a more complex shape. So, based on this parameter we can do the classification of humans and vehicles.

This is also a very old technique; there are many newer techniques, for example in deep learning, and other feature-based techniques that can be considered for object classification.
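A minimal sketch of this dispersedness test in Python, assuming dispersedness = perimeter²/area computed from a binary foreground mask; the decision threshold is an arbitrary illustrative value, not the one from the original paper.

```python
import cv2

def classify_blob(binary_mask, threshold=55.0):
    """Classify the largest blob in a binary mask as 'human' or 'vehicle' by dispersedness."""
    contours, _ = cv2.findContours(binary_mask.astype("uint8"),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    c = max(contours, key=cv2.contourArea)      # largest blob
    area = cv2.contourArea(c)
    perimeter = cv2.arcLength(c, True)
    if area == 0:
        return None
    dispersedness = perimeter ** 2 / area       # complex, elongated shapes score higher
    return "human" if dispersedness > threshold else "vehicle"
```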

(Refer Slide Time: 04:31)

This is another shape-based method: silhouette information can be used to detect pedestrians. In this paper they considered a hierarchy of templates; these are the templates, and based on them they detect the pedestrians.

(Refer Slide Time: 04:50)

After this we have to consider object tracking. Tracking means finding correspondence between the frames of a video. For this we can consider model-based tracking: if I want to track humans, for example, we can consider a cardboard human model. This model is used for object tracking, that is, for modelling the human body and predicting the locations of the human body parts: the head, torso, hands, legs and feet. So, this is the cardboard human model, and based on this model we can do object tracking.

(Refer Slide Time: 05:36)

We can also consider feature-based tracking. In this paper they considered color information and did the object tracking based on it. In this example they considered color histograms: a color histogram for the head portion, one for the upper body and one for the lower body. Based on this color information they did the tracking; that is feature-based tracking.

(Refer Slide Time: 06:06)

Here I am showing one example of feature-based tracking, the tracking of a ball. For this there are two popular algorithms, the Kalman filter and the particle filter. In the case of the Kalman filter, one assumption is linearity of the state model (a concept I will explain later on), and the noise is assumed to be Gaussian. Because of this, the Kalman filter cannot track properly when the motion of the object is very fast; in such cases it fails.

The more advanced version is the particle filter: it allows a nonlinear state equation, which I will explain later on, and the noise can be non-Gaussian. That is why the particle filter can track objects very nicely compared with the Kalman filter, as you can see in this example. I assume that you know the concept of the Kalman filter, because today I am going to explain only the particle filter; please read up on the Kalman filter.

(Refer Slide Time: 07:10)

One simple tracking algorithm is blob matching. After background modelling we get the foreground objects, and these are the blobs. First, we match the blobs of the current image with the persons detected in the previous image. In the second step we match the detected persons back with the blobs in the current image; that is nothing but two-way matching. Using this two-way matching we can do the tracking. In the next slide you can see this concept of two-way matching, that is, blob matching, where a blob is a foreground object detected after background modelling.

(Refer Slide Time: 08:06)

Here you can see I am considering the tracking of these pedestrians, and I am detecting the blobs; these are the blobs obtained after background modelling, that is, the foreground objects. First I do the matching from the blobs to the previously detected persons, and after this I do the reverse matching from the detected persons back to the blobs; that is nothing but the two-way matching. So, this is a very simple algorithm, the blob matching algorithm.
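A small sketch of this two-way (forward/backward) matching idea, representing blobs and previously detected persons simply by their centroids; the distance threshold is an assumption, and a real system would also use size, colour or other features.

```python
import numpy as np

def nearest(idx_from, points_from, points_to, max_dist):
    """Index of the nearest point in points_to, or None if it is too far away."""
    d = np.linalg.norm(points_to - points_from[idx_from], axis=1)
    j = int(np.argmin(d))
    return j if d[j] <= max_dist else None

def two_way_match(prev_persons, curr_blobs, max_dist=50.0):
    """Return (person_index, blob_index) pairs that agree in both matching directions."""
    prev_persons = np.asarray(prev_persons, dtype=float)
    curr_blobs = np.asarray(curr_blobs, dtype=float)
    if len(prev_persons) == 0 or len(curr_blobs) == 0:
        return []
    matches = []
    for i in range(len(curr_blobs)):
        p = nearest(i, curr_blobs, prev_persons, max_dist)   # blob -> previous person
        if p is None:
            continue
        b = nearest(p, prev_persons, curr_blobs, max_dist)   # person -> blob (reverse check)
        if b == i:
            matches.append((p, i))
    return matches
```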

(Refer Slide Time: 08:40)

After this I will consider another algorithm for object tracking, the mean shift algorithm. What is the objective of mean shift? It is an algorithm that iteratively shifts a data point to the average of the data points in its neighborhood. For this, what do we have to consider? We consider a search window and its initial location; then we compute the mean location of the data within the search window.

After this we centre the search window at that mean, that is, we reposition the search window at the new mean, and we repeat this until the convergence condition is met. This is the concept of the mean shift algorithm; pictorially, I explain this concept in my next slide.

(Refer Slide Time: 09:31)

Here I am showing some sample points; in the figure they are drawn as a distribution of billiard balls, but let us treat them as sample points. The blue circle corresponds to the window, that is, the region of interest. For this region of interest we can determine the center of mass of the sample points lying within the window. The mean shift vector is then the vector from the center of the window to this new centroid, the center of mass.

So, we get the mean shift vector, and I have to move the center of the window to the new center of mass. That is, this is the center of the window, the region of interest; we determine the center of mass, and then we move the center of the window to that center of mass.

After re-centering the window, I again calculate the center of mass and again shift the window to the new mean; for this step too we can determine the mean shift vector. We keep shifting the window to the current center of mass, each time obtaining a new mean shift vector and a new position of the center of mass. Finally, doing this process iteratively, we reach the point of convergence, where the center of mass coincides with the center of the search window.

This is the pictorial representation of the mean shift algorithm. As I have already said, mean shift is a nonparametric mode-seeking algorithm.

(Refer Slide Time: 11:56)

So I can say it is a nonparametric mode-seeking algorithm: I have to determine the modes of a density function. The mean shift algorithm iteratively shifts a data point to the average of the data points in its neighborhood, and this concept is very similar to clustering. Let us consider a set S of n data points x_i in d-dimensional Euclidean space, and let x be the point of interest.

Suppose K(x) is the kernel function; it indicates how much each point contributes to the estimation of the mean. Then the sample mean m(x) at x with kernel K can be determined as

m(x) = Σ_{i=1..n} K(x_i − x) x_i / Σ_{i=1..n} K(x_i − x),

where the sum runs over i = 1 to n because we are considering n data points. The difference m(x) − x is called the mean shift. The mean shift algorithm iteratively moves each data point to its mean, as I have already shown on the previous slide.

At each iteration the mean m(x) is assigned to x, and the algorithm stops when m(x) = x; that is the convergence condition. This means I will be getting the sequence x, m(x), m(m(x)), and so on.

These iterates are nothing but the trajectory of x. If the sample means are computed at multiple points, then at each iteration the update is done simultaneously for all of these points. So, this is the basic concept of the mean shift algorithm. For object tracking we can apply mean shift; that is nothing but kernel-based tracking. For this, what do we consider?

The appearance of an object can be characterized using histograms, and tracking can be done based on these histograms. One problem is that it is hard to specify an explicit parametric motion model to track non-rigid objects, such as a walking person. The appearance of such non-rigid objects can often be modelled with color distributions, so we may consider a color histogram for tracking non-rigid objects like a walking person.

So, briefly, what are the steps of the mean shift tracking algorithm? Step one: initialize the position of a fixed-size search window. Step two: find the average position of the data in the search window. Step three: put the center of the window at the average position estimated in step two. Finally, repeat steps two and three until the average position changes by less than a threshold.

In brief, that is the mean shift algorithm: initialize the fixed-size search window, find the average position within it, re-center the window at that average position, and repeat steps two and three until the change in average position is less than a threshold. With the mean shift algorithm you can do the tracking.
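A minimal sketch of this mean shift iteration in Python, using a flat (uniform) kernel of fixed radius as the search window; the example data, the radius and the tolerance are illustrative assumptions.

```python
import numpy as np

def mean_shift(points, start, radius=1.0, tol=1e-3, max_iter=100):
    """Shift 'start' towards a mode of the point density; returns the converged location."""
    points = np.asarray(points, dtype=float)
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        in_window = np.linalg.norm(points - x, axis=1) <= radius   # flat kernel / search window
        if not np.any(in_window):
            break
        m = points[in_window].mean(axis=0)        # centre of mass within the window
        if np.linalg.norm(m - x) < tol:           # convergence: the window stops moving
            x = m
            break
        x = m                                      # re-centre the window at the mean
    return x

# Example: two point clusters; starting near the right cluster converges to its mode.
pts = np.concatenate([np.random.randn(200, 2), np.random.randn(200, 2) + [5, 5]])
print(mean_shift(pts, start=[4.0, 4.0]))
```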

(Refer Slide Time: 20:21)

After this, I will briefly explain the concept of the recursive filter. As I have already explained, the particle filter can track objects more robustly than the Kalman filter, and I assume that you know the concept of the Kalman filter in this discussion.

(Refer Slide Time: 20:40)

In the case of the recursive filter, the first thing is the representation of dynamic systems. The state sequence is a Markov random process; that is the condition. The state sequence is represented by x_k = f_x(x_{k−1}, u_{k−1}), where x_k is the state vector at time k, f_x is the state transition function, and u_{k−1} is the process noise, of known distribution, at time instant k − 1. The state depends only on the previous state; that is the concept of a first-order Markov process, and it is expressed by p(x_k | x_{k−1}).

Similarly, we have the observation equation, represented by z_k = f_z(x_k, v_k), where z_k is the observation at time instant k, f_z is the observation function, and v_k is the observation noise with known distribution.

So, I am considering the process noise u_k and the observation noise v_k, both with known distributions. The form of the densities depends on the two functions f_x and f_z and on the distributions of the process noise and the observation noise. That is the representation of a dynamic system.

(Refer Slide Time: 22:46)

So, what are the meanings of these densities? The first one is p(x_k | z_{1:k}), the posterior density: what is the probability that the object is at location x_k, for all possible locations x_k, given the history of measurements z_{1:k}? That is the definition of the posterior density. Another one is the prior density, p(x_k | x_{k−1}), that is, the motion model: where will the object be at time instant k given that it was previously at x_{k−1}? That is nothing but the first-order Markov process, and it is the prior information. We also have the likelihood, that is, the conditional density p(z_k | x_k): the likelihood of making the observation z_k given that the object is at location x_k.

And what is the definition of tracking? Tracking is estimating the state of a system as it evolves over time from sequentially arriving, noisy observations. From these observations we have to estimate the states; that is nothing but tracking. We want the best possible estimate of the hidden variables: we have the noisy observations, and from them we have to estimate the state of the system as it evolves over time. That is the meaning of tracking.

(Refer Slide Time: 24:36)

In recursive filtering, filtering is the problem of sequentially estimating the states, that is, the parameters or hidden variables of a system, as a set of observations becomes available online. We have the observations, and from them we have to estimate the states of the system.

A recursive filter performs a sequential update of the previous estimate, whereas batch processing computes with all the data in one step; that is the difference between a recursive filter and batch processing. A recursive filter is not only faster but also allows online processing of data and rapid adaptation to changing signal characteristics. In a recursive filter there are two steps.

One is the prediction step: from p(x_{k−1} | z_{1:k−1}), given the available observations, we have to determine p(x_k | z_{1:k−1}); that is, we predict the current state from the previous state x_{k−1}. In other words, we predict the next state pdf from the current estimate. After this we have the update step: we update the prediction using the sequentially arriving new measurements, because we have the observations. So, we have two steps, the prediction step and the update step.

(Refer Slide Time: 26:39)

That is the concept of Bayesian filtering. The objective is to estimate the unknown state x_k based on a sequence of observations z_k: we have the observations and we have to find the posterior distribution. That is the objective of the Bayesian approach.

(Refer Slide Time: 27:00)

This concept is shown pictorially here. First we update, then we propagate; again we update based on the observations z_0, z_1, z_2, and so on. We predict and then update, predict and update; that is nothing but updating and propagating, repeatedly.

This is the concept of Bayesian estimation, that is, the recursive filter. In the case of the Kalman filter we assume a Gaussian noise process and a linear state-space model: the functions f_x and f_z are taken to be linear. So, in the Kalman filter the state is a linear function of the last state with a Gaussian noise signal, the sensor model is a linear function of the state with Gaussian noise, and the posterior density is also Gaussian. These are the assumptions of the Kalman filter.

Practically, these assumptions are often not true. If I want to track an object which moves very quickly, these assumptions generally fail; that is why the Kalman filter cannot track such objects properly. That is why we may consider the particle filter: in the particle filter these functions can be nonlinear, and the noise can be non-Gaussian.

(Refer Slide Time: 28:46)

So, what is a particle filter? In the particle filter we have the unknown state vector x_{0:t}, that is, the states x_0, x_1, ..., x_t, and we have the observation vector z_{1:t}. We have to find the posterior distribution p(x_{0:t} | z_{1:t}); determining this filtering distribution from the observations is the tracking problem.

The prior on the state distribution is given, the sensor model is given, and we assume a Markovian state-space model. So, this information is available.

(Refer Slide Time: 29:33)

In the particle filter, the posterior density is represented by a set of random particles. If I consider a large number of particles, say N particles, this set is equivalent to a functional description of the pdf, the posterior density, and as N tends to infinity the particle filter approaches the optimal Bayesian estimate.

(Refer Slide Time: 30:10)

Here I am showing the concept of the particle filter: a sample-based PDF representation. The PDF is represented by random particles; in regions of high density we place many particles, or particles with large weights, and you can see the partitioning is uneven. In a region of high density we use more particles, and the weight of this particle is large compared with that one. This is nothing but a discrete approximation of a continuous PDF.

(Refer Slide Time: 30:52)

What are the steps of the particle filter? We have to consider sequential importance sampling. The particle filtering steps, for m = 1 to M, are: first, generate the particles, that is, particle generation; then compute a weight for each particle, assigning a weight to every particle; then normalize the weights using this equation; and finally estimate the state, that is, the estimate computation. So, the steps are particle generation, weight computation, weight normalization and estimate computation.

(Refer Slide Time: 31:42)

But what is the problem? After a few iterations, all but one particle will have negligible weight; that is called the weight degeneracy problem. The solution is resampling: replicate particles in proportion to their weights, again by random sampling, and eliminate particles with small importance weight, that is, neglect the particles having negligible weight. That is called importance resampling. I am going to explain this concept in my next slide.

(Refer Slide Time: 32:22)

Here you can see this pdf is approximated by N random particles, and I am assigning weights to all these particles; these are the weights, and this particle has the maximum weight compared with that one. After this we have to propagate the particles, that is, move them, and before moving the particles we have to do the resampling.

After this we do the prediction. These are the stages of the particle filter algorithm: the pdf is approximated by random particles, we determine the importance weights, then we do the resampling, then we propagate (move) the particles, and then we predict x_t given the measurements z_{1:t−1}. Based on this we do the prediction.

(Refer Slide Time: 33:29)

Now the concept of resampling: after some iterations some of the particles will have negligible weight, and these we have to eliminate. The concept is shown here. The pdf is approximated by random particles and their associated weights; these are the weights, and you can see some of the particles have negligible weights, which we can neglect.

After this we can propagate the particles, that is, move them to the next step. In the next step, again, a particle with negligible weight can be eliminated, and like this we move the particles from iteration to iteration. We can also predict the states based on the measurements. This is the concept of resampling.

(Refer Slide Time: 34:18)

Briefly, this is the particle filter algorithm. First I randomly initialize the particles; that is nothing but a discrete approximation of the continuous posterior pdf by random particles. That is the particle generation. After particle generation we compute the weight of each particle using the equations given earlier.

After this we normalize the weights and then resample the particles. We keep receiving observations, and based on these observations we do the tracking: from the normalized weights we can compute the output estimate. But the problem is that after some iterations some of the particles will have negligible weight.

That is why resampling is important. So, we do the resampling, and then again particle generation, weight computation, weight normalization, and output estimation, and based on the observations we do the tracking. This is the particle filter algorithm. I have explained the concept of the particle filter only briefly; you should study the research papers to understand it in more detail.
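To make these steps concrete, here is a minimal SIR-style particle filter for a one-dimensional position, with an assumed random-walk motion model and Gaussian likelihood. It only illustrates the generate-weight-normalize-estimate-resample loop, not the tracker used in the videos.

```python
import numpy as np

def particle_filter(observations, n_particles=500,
                    motion_std=1.0, obs_std=2.0, init_range=(0.0, 100.0)):
    """Track a 1-D position from noisy observations with a simple SIR particle filter."""
    rng = np.random.default_rng(0)
    particles = rng.uniform(*init_range, size=n_particles)     # particle generation
    estimates = []
    for z in observations:
        # Propagate: random-walk motion model (prediction step).
        particles = particles + rng.normal(0.0, motion_std, size=n_particles)
        # Weight: Gaussian likelihood of the observation given each particle.
        weights = np.exp(-0.5 * ((z - particles) / obs_std) ** 2) + 1e-300
        weights /= weights.sum()                                # weight normalization
        # Estimate: weighted mean of the particles.
        estimates.append(np.sum(weights * particles))
        # Resample: replicate particles in proportion to their weights.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
    return estimates

# Example: noisy observations of an object moving steadily to the right.
true_pos = np.arange(0, 50, 1.0)
obs = true_pos + np.random.default_rng(1).normal(0, 2.0, size=true_pos.size)
print(particle_filter(obs)[:5])
```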

(Refer Slide Time: 35:39)

Here I am showing the tracking of this person based on the particle filter. First comes the prediction, then the measurement, then the resampling; the resampling concept I have already explained, because we have to discard the particles having negligible weights. After this we iterate, and based on this we can do the tracking. So, this is one example of tracking.

(Refer Slide Time: 36:13)

And this is another example of the tracking, as you can see.

(Refer Slide Time: 36:17)

I can also show some videos of tracking. We have developed this particle filter based algorithm for object tracking, and you can see these videos.

(Refer Slide Time: 36:27)

We have done this particle filter based tracking, and we have also proposed some background modelling algorithms; based on these we have done the tracking.

(Refer Slide Time: 36:41)

In this case you can see the background is cluttered and the illumination is also changing, and even then we are doing the tracking. In another case the person is occluded by a vehicle, and here also the tracking is successful.

(Refer Slide Time: 37:04)

This is another good example of tracking; here we are also finding the trajectory of the movement. This person is moving and we are finding the trajectory. So, you can apply the particle filter algorithm for object tracking.

(Refer Slide Time: 37:16)

Again I am showing an example of particle filter based tracking. Sometimes one person is occluded by another person, and in this case also the particle filter can track; the particle filter can handle partial occlusion.

(Refer Slide Time: 37:32)

These are some tracking results. In some cases, in this frame for example, the persons are occluded by this vehicle and are only partially visible; even then we can do the tracking. The tracking is successful with the help of the particle filter, which means the particle filter can handle partial occlusion.

(Refer Slide Time: 37:54)

These are some more tracking results; I have already shown the video.

(Refer Slide Time: 38:00)

I will also briefly explain the concept of multiple camera based tracking. What are the advantages of multiple camera based object tracking? One important advantage is that the field of view can be increased. In this case you can see I am considering one camera C1 and another camera C2; this person is visible to both cameras, and the field of view increases because of the use of two cameras.

With multiple cameras we have to find the correspondence between the cameras. In this example I am considering camera 1 and camera 2, and I am finding a transformation that gives the correspondence between these two cameras. This is the world plane, and if I want to track an object, we have to find the correspondence between camera 1 and camera 2.

For this we have to consider some transformation, for example a projective transformation. Another advantage is that, because the field of view increases, occlusion can be handled better in a multiple camera based system. In a single camera based object tracking system the main problem is occlusion; in a multiple camera based system, since the field of view increases, the problem of occlusion can be partially eliminated.
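To illustrate the projective-transformation idea, a ground-plane homography estimated from a few corresponding points can map a pixel from the camera-1 view to the camera-2 view. The four point pairs below are made-up calibration correspondences, purely for illustration.

```python
import numpy as np
import cv2

# Assumed corresponding ground-plane points seen in both camera views.
pts_cam1 = np.array([[100, 200], [400, 210], [390, 460], [110, 450]], dtype=np.float32)
pts_cam2 = np.array([[80, 180], [420, 190], [430, 500], [90, 480]], dtype=np.float32)

H, _ = cv2.findHomography(pts_cam1, pts_cam2)    # 3x3 plane-to-plane transformation

def cam1_to_cam2(point_xy):
    """Map one ground-plane pixel from the camera-1 view to the camera-2 view."""
    p = np.array([point_xy[0], point_xy[1], 1.0])
    q = H @ p
    return q[:2] / q[2]                          # back to inhomogeneous coordinates

print(cam1_to_cam2((250, 300)))
```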

(Refer Slide Time: 39:28)

In this example, you can see one person is completely occluded by a tree. This is one camera view and this is another camera view; in the other view the person may be visible. That is why the occlusion problem can be partially resolved by considering a multiple camera based system.

(Refer Slide Time: 39:45)

As I have already explained, we have to find the correspondence between two or more cameras. Here this is the camera 1 view and this is the camera 2 view, and I am finding the correspondence between them, that is, the corresponding pixel positions. Based on these corresponding pixel positions we can do the tracking. I am not explaining in detail how to find the correspondence between the two cameras, but the point is that we have to find the correspondence between all the cameras in a multiple camera based tracking system.

(Refer Slide Time: 40:18)

Here I have shown one multiple camera based tracking example; I am considering two cameras, camera 1 and camera 2, and finding the correspondence between them. You can see I am doing the tracking and also finding the trajectory of the motion; all these persons are tracked. This is one camera view and this is the other.

(Refer Slide Time: 40:45)

And again, I am showing the results of multiple camera based object tracking.

(Refer Slide Time: 40:50)

These are some examples of multiple camera based object tracking: tracking a single person who is partially occluded by a moving car, using two cameras. As I have already explained, if I use multiple cameras the problem of occlusion can be partially eliminated, because the field of view increases. In this class, I briefly explained the concept of object tracking and of particle filter based object tracking.

For more detail, you should see the research papers on object tracking and on particle filter based object tracking. There are many new techniques of object tracking; deep learning techniques are also used, both in single camera based and in multiple camera based systems. For these you have to see the research papers. Let me stop here today. Thank you.

Computer Vision and Image Processing – Fundamentals and Application
Professor M. K. Bhuyan
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati India
Lecture 41
Programming Examples

Welcome to NPTEL MOOCs course on Computer Vision and Image Processing Fundamentals
and Applications. This is my last class of this course. In this class I will give some programming
assignments; mainly I will discuss twenty programming problems. These programming assignments you can do either in MATLAB, in OpenCV-Python, or in any other programming language. One by one, I will discuss all these programming assignments now. So, let us start.

(Refer Slide Time: 01:01)

The first programming assignment is: consider an image containing one arbitrary object and apply affine transformations to show all of the following cases: rotation, translation, shearing, scaling, and the combined translation, rotation and scaling. This concept, the affine transformation, I have already explained in my classes.

(Refer Slide Time: 01:26)

Your output will be something like this: you can see the original image. The image can be rotated by 60 degrees, and it can also be translated; this is the translated image. The image can be sheared, that is, the scaling is not uniform throughout the image, which is called shearing. We can also do the scaling; you can see the scaled image.

And we can also do the combined translation, rotation and scaling. So, this is the concept of the affine transformation. For this you have to write a program in MATLAB, or you can use Python; OpenCV-Python can also be used.

(Refer Slide Time: 02:12)

In this case I am showing one MATLAB program. Here you can see, first I am reading the image, the Lena image, and after this I am resizing the image to make the computation fast; then I display the original image. After this I consider the rotation of the image by 60 degrees. First I do the rotation, and this is the transformation matrix for the rotation.

You know the transformation matrix for rotation is [cos θ, sin θ; −sin θ, cos θ]. We also have to check the minimum and the maximum coordinates to avoid negative coordinates; that is checked in this part of the code.

(Refer Slide Time: 02:56)

After this I do the rotation: I compute the rotated image based on the transformation matrix, and this rotated image can be displayed. Next I consider the translation. For the translation I consider the translation vector e = (10, 01), and based on this I translate the input image. I do the translation and show the result of the translation with imshow (result 2).

After this I consider the shearing; shearing means the scaling is not uniform throughout the object. For shearing I consider the scaling along the x direction and the scaling along the y direction, and from these I form the shearing matrix. After this I perform these computations.

(Refer Slide Time: 03:54)

Finally, you can see I show the result of shearing with imshow (result 3); that is the result of the shearing operation. After this I consider the scaling of the image: scaling along the x direction and scaling along the y direction. I do the scaling and then show the result of the scaling operation; you can see result 4, which is the result of the scaling operation.

After this I consider the combined affine transformation with translation, rotation and scaling. For this I define all the parameters, the scaling, translation and rotation parameters, and based on these I perform the combined operation.

(Refer Slide Time: 04:46)

Finally, you can see I show the result, result 7, which corresponds to the combined translation, rotation and scaling. So, this is the first assignment: the problem is to apply affine transformations to an image.
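For those attempting assignment 1 in OpenCV-Python rather than MATLAB, a minimal sketch might look like the following. The file name, angles, shifts and scale factors are arbitrary illustrative choices.

```python
import cv2
import numpy as np

img = cv2.imread("lena.png")                       # hypothetical input file name
h, w = img.shape[:2]

# Rotation by 60 degrees about the image centre.
R = cv2.getRotationMatrix2D((w / 2, h / 2), 60, 1.0)
rotated = cv2.warpAffine(img, R, (w, h))

# Translation by (tx, ty).
T = np.float32([[1, 0, 40], [0, 1, 25]])
translated = cv2.warpAffine(img, T, (w, h))

# Shearing: non-uniform scaling, x shifted as a function of y.
S = np.float32([[1, 0.3, 0], [0, 1, 0]])
sheared = cv2.warpAffine(img, S, (int(w + 0.3 * h), h))

# Combined translation + rotation + scaling in one 2x3 matrix.
C = cv2.getRotationMatrix2D((w / 2, h / 2), 30, 0.7)
C[:, 2] += (20, 10)                                # add a translation on top
combined = cv2.warpAffine(img, C, (w, h))
```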

(Refer Slide Time: 05:06)

The second problem is to write a MATLAB program to show perspective, weak perspective and orthographic projections of an object in an image; make suitable assumptions if required. I am showing the original 3D cube and the outputs: the first one is the perspective projection, the next one is the orthographic projection, and the last one is the weak perspective projection.

In the case of the perspective projection, I have already explained that distant objects appear smaller. The orthographic projection is nothing but projecting (x, y, z) onto the (x, y) coordinates. The weak perspective projection is nothing but a scaling of the x and y coordinates, where the scale factor is determined by the (average) z coordinate, that is, the depth.

So, based on some assumptions, you can show the perspective, orthographic and weak perspective projections. There is a spelling mistake on the slide; it should read "perspective". This is the second assignment, and you can do it in MATLAB.

(Refer Slide Time: 06:32)

The third problem is to write a MATLAB program to determine a depth map using the concept of photometric stereo. I have already explained this problem, the concept of shape from shading; the shape from shading problem can be solved using photometric stereo. For this I consider different light sources, and the shading information is used to determine the depth information.

I consider different illumination (lighting) conditions, and you can see I get the images sphere one, sphere two, and so on, because each image corresponds to a different lighting condition, that is, a different shading. From all these images I have to recover the depth information; that is the concept of photometric stereo: images of a sphere at different illuminations.

So, I consider different illuminations, and based on these I get the shading; from the shading information I have to get the depth information.

(Refer Slide Time: 07:34)

Here, based on the principle of photometric stereo, you can first determine the albedo and the surface normal map, and from the recovered surface orientation you can then determine the depth map of the object. The principle of photometric stereo I have already explained in my classes. So, this is the output of this assignment: you determine the albedo of the surface, and from the surface orientation you determine the depth map. This is the third assignment.
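A minimal sketch of the photometric-stereo computation for assignment 3, assuming a Lambertian surface and known light directions. It recovers the albedo and surface normals per pixel and leaves the depth-map integration step out.

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Recover albedo and unit surface normals from k images taken under k known lights.

    images: array of shape (k, H, W); light_dirs: k rows of unit light-direction vectors.
    """
    k, H, W = images.shape
    L = np.asarray(light_dirs, dtype=np.float64)          # (k, 3)
    I = images.reshape(k, -1).astype(np.float64)           # (k, H*W)

    # Least-squares solution of L @ g = I for every pixel, where g = albedo * normal.
    G, *_ = np.linalg.lstsq(L, I, rcond=None)              # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)                      # (H*W,)
    normals = G / np.maximum(albedo, 1e-8)                  # unit normals, (3, H*W)

    return albedo.reshape(H, W), normals.reshape(3, H, W)
```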

(Refer Slide Time: 08:10)

The next assignment is: for an arbitrary object in an image, determine the chain code and the shape number. Here I am showing the original image, and the boundary of the object has to be represented by the chain code. So, you can see I am showing the boundary of the object, that is, the boundary of the input image, which can be represented by the chain code.

The concept of the chain code I have already explained. After this, I can reconstruct the boundary from the chain code. So, the fourth image is the reconstruction of the boundary from the chain code. This problem you can try: first, from the boundary information, you determine the chain code, and then from the chain code you reconstruct the original boundary.
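
A small Python sketch of this assignment is given below: it computes the 8-directional Freeman chain code from an ordered list of boundary pixels (for example, a contour returned by cv2.findContours) and its first difference, from which the shape number is obtained.

import numpy as np

# Direction (dx, dy) -> Freeman code 0..7, counted counter-clockwise from east.
CODES = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
         (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(boundary):
    """boundary: (N, 2) array of consecutive (x, y) boundary pixels."""
    codes = []
    for p, q in zip(boundary, np.roll(boundary, -1, axis=0)):
        step = (int(np.sign(q[0] - p[0])), int(np.sign(q[1] - p[1])))
        codes.append(CODES[step])
    # First difference; its minimum circular rotation gives the shape number.
    diff = [(codes[i] - codes[i - 1]) % 8 for i in range(len(codes))]
    return codes, diff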

(Refer Slide Time: 09:10)

Question number five is: for an arbitrary object in an image, obtain Fourier descriptors to represent its boundary. That means we have to use the Fourier descriptors to represent a particular boundary. You can see this is the original image, and the boundary of the original image can be represented, and then reconstructed, by the Fourier descriptors.

After this, the boundary can be reconstructed by taking the inverse Fourier transform, that is, from the Fourier coefficients I can reconstruct the original boundary; that is the reconstructed boundary. Similarly, I am considering the rotated image, and corresponding to this I am reconstructing the boundary with the help of the Fourier descriptors. So, for the rotated image also we can determine the Fourier descriptors.

After this, by using these Fourier descriptors, we can reconstruct the boundary. All the properties of the Fourier descriptors can be verified, because there are several properties, such as their behaviour under scaling, rotation and translation, all of which I have discussed in my class. So, take one image and try to verify all the properties of the Fourier descriptors with it. This is question number five.
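
A minimal Python sketch of the Fourier descriptor computation is given below, where the boundary is treated as the complex sequence s(k) = x(k) + j y(k); the truncation parameter P is an assumption.

import numpy as np

def fourier_descriptors(boundary, P=32):
    """boundary: (N, 2) array of ordered boundary points of a closed curve."""
    s = boundary[:, 0] + 1j * boundary[:, 1]   # complex boundary signal
    a = np.fft.fft(s)                          # Fourier descriptors
    a_trunc = np.zeros_like(a)
    a_trunc[:P], a_trunc[-P:] = a[:P], a[-P:]  # keep only low-order terms
    recon = np.fft.ifft(a_trunc)               # approximate reconstruction
    return a, np.column_stack([recon.real, recon.imag])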

(Refer Slide Time: 10:40)

The next question is to write MATLAB code to represent the boundary of an object by a B-spline of order four and show the represented boundary. Here I am considering the input image, and you know that in the case of the B-spline I have to consider some control points. Another important point is the order of the B-spline; here I am considering order four. With these B-spline curves, I can represent the boundary with the help of the control points.

In the picture you can see I am showing the control points, and between these control points I am fitting the B-spline curve of order four. Like this, you can represent the boundary of the input image. So, this is the representation of an object boundary by a B-spline curve, which is one important problem.
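
A hedged sketch in Python using SciPy is given below: a closed B-spline of degree three (that is, order four) is fitted through the control points and then sampled densely to draw the boundary.

import numpy as np
from scipy.interpolate import splprep, splev

def bspline_boundary(points, n_samples=400):
    """points: (N, 2) control points of a closed boundary."""
    # k=3 gives a cubic (order-4) B-spline; per=True closes the curve.
    tck, _ = splprep([points[:, 0], points[:, 1]], k=3, per=True, s=0)
    u = np.linspace(0.0, 1.0, n_samples)
    x, y = splev(u, tck)
    return np.column_stack([x, y])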

(Refer Slide Time: 11:42)

Problem number seven is to represent a texture by the gray level co-occurrence matrix, that is, the GLCM, and determine all the relevant parameters. In my class I have discussed how to determine the GLCM. For a particular displacement vector, you can determine the gray level co-occurrence matrix, that is, the array P[i, j], and that can be displayed as an image.

So, for a particular displacement d, I can determine the co-occurrence matrix P[i, j], and that array is displayed as an image; that is the gray level co-occurrence matrix. From this I can determine the parameters: the maximum probability, the moments, the contrast, the entropy, the uniformity and the homogeneity.

So, all these parameters I can determine. The GLCM is computed for several values of d, and the one which maximizes a statistical measure computed from P[i, j] is ultimately used. I am repeating this: the GLCM is computed for several values of d, that is, the displacement vector, and the one which maximizes a statistical measure computed from P[i, j] is finally employed.

So, that is the concept of the GLCM. We can determine all the relevant parameters, such as the maximum probability, the moments and the entropy, from the GLCM. So, corresponding to this original image, I can determine the GLCM. This is problem seven.
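
As an illustration, a short Python sketch using scikit-image (the graycomatrix and graycoprops functions of recent versions) is given below; the chosen displacements and angles are assumptions.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray, distances=(1, 2, 4), angles=(0, np.pi / 4, np.pi / 2)):
    """gray: 2-D uint8 image; returns the GLCMs and a few texture measures."""
    glcm = graycomatrix(gray, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    feats = {p: graycoprops(glcm, p)
             for p in ("contrast", "homogeneity", "energy", "correlation")}
    feats["max_probability"] = glcm.max(axis=(0, 1))
    feats["entropy"] = -(glcm * np.log2(glcm + 1e-12)).sum(axis=(0, 1))
    return glcm, feats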

(Refer Slide Time: 13:52)

The next problem is to use the seven moment invariants to represent a shape; for this, write a MATLAB program for shape representation, and also show that the seven moments are invariant to translation, rotation, scaling and reflection. So, corresponding to this original image, you can determine the seven moment invariants to represent a particular shape.

After this, we have to show that the seven moments are invariant to translation, rotation, scaling and reflection. So, this image can be rotated; in this case I am rotating the image by 45 degrees and by 90 degrees, and I can also scale, translate and reflect the original image. For all these cases I can determine the seven moment invariants, and from them I can verify that the seven moments are invariant to these transformations, namely translation, rotation, scaling and reflection.

So, the problem is: corresponding to these images, determine the seven moment invariants and, with their help, show that the moments are invariant to translation, rotation, scaling and reflection. This is one important problem. You can write a MATLAB program or a Python program for this assignment. This is about question number eight.
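
A minimal Python/OpenCV sketch of the moment computation is given below; the same call can be repeated on the rotated, scaled, translated and reflected images to check the invariance (the log scaling only makes the seven values easier to compare).

import cv2
import numpy as np

def hu_moments(binary):
    """binary: single-channel image with the shape as the foreground."""
    m = cv2.moments(binary, binaryImage=True)
    hu = cv2.HuMoments(m).flatten()            # the seven moment invariants
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)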

(Refer Slide Time: 15:40)

The next one is question number nine: determine the Radon transform for tomographic imaging, show the sinogram and show the back projected images. Suppose I consider the original image; from the input image I can determine g(s, theta), which is nothing but the Radon transform. So, I have to determine g(s, theta), and g(s, theta) can be represented as an image, which is called the sinogram. So, how to determine the Radon transform?

We have to compute g(s, theta); there are two parameters, one is s and the other is theta. After determining g(s, theta), we can display it as an image, and that is the sinogram. After this, I am considering the reconstruction of the image, and in this case I am using the back projection technique. For different angles theta, I do the back projections, so I will be getting the back projected images.

So, from the set of back projected images, I can reconstruct the original image. For this I have to determine the ray sums, and these ray sums, that is, the g(s, theta) data, can be displayed as an image called the sinogram. Reconstruction then means that for different angles theta I determine the back projected images, and by combining all the back projected images I can reconstruct the original image.

So, this is the image reconstructed by the back projection technique. In my class I have explained two techniques: one is the back projection technique and the other is the Fourier transform technique. Here I am considering the back projection technique, but you can also apply the Fourier transform technique.
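
For reference, a compact Python sketch with scikit-image is given below: it computes the sinogram g(s, theta) and reconstructs the image both by plain back projection and by filtered back projection (the function names follow recent scikit-image versions).

import numpy as np
from skimage.transform import radon, iradon

def sinogram_and_reconstruction(image):
    theta = np.linspace(0.0, 180.0, max(image.shape), endpoint=False)
    sino = radon(image, theta=theta)                        # g(s, theta)
    fbp = iradon(sino, theta=theta, filter_name="ramp")     # filtered back projection
    bp = iradon(sino, theta=theta, filter_name=None)        # plain back projection
    return sino, bp, fbp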

(Refer Slide Time: 17:30)

Corresponding to this, I am showing the program. First I have to read the original image; this is the original image I am showing. After this, I am finding the ray sums in different directions, for example at 0 degrees, 90 degrees and so on over the range 0 to 180 degrees. So that means I am calculating the ray sums.

(Refer Slide Time: 17:57)

After calculating the ray sums, what am I considering? I am adding the back projections of the ray sums at 135 degrees, at 45 degrees and so on. So, I am showing the reconstructed image, because I am doing the back projection at different angles, and from these back projected images I can reconstruct the original image. That means I am considering the ray sums in different angles, different directions.

After this, I am reconstructing the original image from these back projected images; that is about the sinogram and the reconstructed image. For the sinogram, I have to determine g(s, theta).

(Refer Slide Time: 18:38)

The next program is to implement the K-L transform of an image, show the transformed image, represent the image in terms of eigenimages and show the results. So, what is the K-L transformation? It is y = A(x - mu_x), which I have already explained. Here A is the transformation matrix, and it is determined from the eigenvectors of the covariance matrix.

Here x is the input vector. From the input vectors I can determine the mean and the covariance, and from the covariance matrix I can determine the eigenvalues and the eigenvectors: e_i are the eigenvectors and lambda_i are the eigenvalues. From these I can form the transformation matrix A. If I apply the transformation y = A(x - mu_x), then I will get a transformed image something like this.

So, this is the transformed image. After this I can reconstruct the original image. If I consider all the eigenvectors and all the eigenvalues, then perfect reconstruction is possible. What is the reconstruction formula? It is x = A^T y + mu_x. By using this I can reconstruct the original image; as I have already explained, if I consider all the eigenvalues and eigenvectors, then I will get back the original image.

But if I consider a truncated transformation matrix, then I will not get the original image; I will get an approximated image. So, if I use all the eigenvalues and eigenvectors in the transformation matrix, perfect reconstruction is possible, but if I use only some of the eigenvectors, perfect reconstruction may not be possible. This is about image reconstruction in the case of the K-L transformation.
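
A hedged NumPy sketch of the K-L transform is given below, treating each row of the image as a sample vector x; keeping only k eigenvectors gives the approximate reconstruction discussed above.

import numpy as np

def kl_transform(img, k):
    X = img.astype(float)                    # rows are the sample vectors x
    mu = X.mean(axis=0)                      # mean vector mu_x
    C = np.cov(X, rowvar=False)              # covariance matrix
    vals, vecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    A = vecs[:, ::-1].T                      # rows = eigenvectors, largest first
    Y = (X - mu) @ A.T                       # forward transform y = A (x - mu)
    A_k = A[:k]                              # truncated transformation matrix
    X_hat = (X - mu) @ A_k.T @ A_k + mu      # reconstruction A_k^T y + mu
    return Y, X_hat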

(Refer Slide Time: 20:49)

The next problem is to write a MATLAB program to extract SIFT features of an image and show the extracted features. First I have to construct the scale space representation, after which I compute the difference of Gaussians, and from the difference of Gaussians I can roughly locate the key points, because I have to find the extrema, that is, the maxima or minima of the difference of Gaussians. From these I can determine the candidate key points.

After this, I have to discard the low contrast key points and also the key points lying on edges; by considering the Hessian matrix, I can remove the edge responses that were detected as key points. Then I have to do the orientation assignment, and finally I obtain the SIFT descriptors. Here you can see the key points corresponding to this input image; these key points are the SIFT features. So, I get the SIFT key points and their descriptors, and the size of each descriptor is generally 128.
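
All of these steps are wrapped up in OpenCV's SIFT implementation; a minimal sketch is shown below (cv2.SIFT_create is available in OpenCV 4.4 and later, or in the contrib builds, and the file name is a placeholder).

import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
vis = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", vis)
print(len(keypoints), descriptors.shape)     # descriptors are N x 128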

(Refer Slide Time: 22:06)

Question number twelve: write a MATLAB program to extract HOG features of an image and show the extracted features. The HOG principle, the histogram of oriented gradients, I have already explained in my class. You can get the feature vectors from the gradients: you determine the gradient of the image, and from the orientation of this gradient you determine the feature vector corresponding to the input image.

So, you take one input image and determine the HOG features for it. With the HOG features, you can detect objects present in an image; that is nothing but object detection. That is question number twelve.
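
A short Python sketch using scikit-image's hog function is given below; the cell and block sizes are common defaults and are assumptions here.

from skimage import color, io
from skimage.feature import hog

img = color.rgb2gray(io.imread("input.png"))
features, hog_image = hog(img, orientations=9, pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2), visualize=True)
print(features.shape)    # one long vector of gradient-orientation histograms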

(Refer Slide Time: 22:48)

Question number thirteen: write a program to detect the edges of an image by the Canny edge detector. So, this problem is edge detection by the Canny edge detector. Here I am showing the input image. For Canny edge detection, first I have to do the Gaussian blurring, that is, the image is convolved with a Gaussian so that the noise is removed. After this, I have to determine the gradient magnitude and also the orientation, that is, the direction of the edge normal.

Here you can see I am determining the gradient magnitude and also the orientations, that is, the direction of the edge normal. So, the first step is to convolve the image with a Gaussian. The second step is to determine the gradient magnitude and direction; you can see I am determining the gradient magnitude and the angle. After this, the next important step is non-maximum suppression.

That concept, non-maximum suppression, I have already explained. After non-maximum suppression, you have to apply thresholding with hysteresis; for this you consider a low threshold and a high threshold, and based on these you determine the edge pixels. Finally, you can see that the edge pixels are determined with the help of the Canny edge detector. So, this is the concept of the Canny edge detector.

(Refer Slide Time: 24:21)

The main steps I have already explained. First, I convolve the image with a Gaussian, that is, noise reduction using a Gaussian filter. After this, I determine the gradient magnitude: first the gradient along the x direction and the gradient along the y direction, the horizontal and vertical directions, and from these the gradient magnitude.

I can also determine the direction of the edge normal. After this, I apply the principle of non-maximum suppression; in this case I have to look at the neighbourhood, and based on the neighbourhood I can apply non-maximum suppression. The next step is thresholding with hysteresis.

That means I consider two thresholds, a low threshold and a high threshold, and I determine the strong and the weak edge pixels; that is thresholding with hysteresis. Finally, we do the edge linking. These are the steps of the Canny edge detector.

(Refer Slide Time: 25:30)

You can see the corresponding program. I am considering the input image and converting it from RGB to gray, that is, the RGB image is converted into grayscale. After this, I do the Gaussian blurring, that is, the image is convolved with the Gaussian, and then I determine the x gradient and the y gradient, that is, the gradient along the x direction and along the y direction.

I can also determine the gradient magnitude and the angle, and after this I can plot the gradient magnitude and the directions, that is, the directions of the edge normals. These plots correspond to the first and second steps.

(Refer Slide Time: 26:18)

After this, I consider the non-maximum suppression. That means I quantize all possible directions into four directions, because I have to examine the neighbourhood pixels; so I am doing the quantization of all the directions into four directions. Here you can see I am considering the directions like this, and based on this I am finding the neighbourhood pixels.

Then I apply the principle of non-maximum suppression; you can see I am applying it and getting the intermediate image. After this I consider the thresholding with hysteresis, for which I use two thresholds, a low threshold and a high threshold, and then apply the hysteresis thresholding principle.

(Refer Slide Time: 27:10)

Corresponding to this, I get the edge pixels, and these edge pixels of the input image can be displayed. So, this is edge detection with the help of the Canny edge detector, implemented as a MATLAB program.

(Refer Slide Time: 27:30)

Similarly, I can do the same thing, that is, the Canny edge detector, in Python; I am giving one example of Python programming here. First, I am considering import numpy as np. What is NumPy? NumPy is an open source numerical Python library. And what is available in NumPy?

NumPy provides multidimensional array and matrix data structures, and it can be used to perform a number of mathematical operations on arrays, such as trigonometric operations, statistical operations and algebraic routines; all these operations we can perform with the help of the NumPy library. Another one is pandas. Pandas is an open source Python package that is widely used for data science, data analysis and machine learning, and it is built on top of the NumPy package.

So, pandas is another library. You can also see import os; what is os? The os module in Python provides functions for interacting with the operating system, and it comes under Python's standard utility modules. After this, the next one is import cv2, which is nothing but OpenCV-Python.

OpenCV-Python is a Python library designed to solve computer vision problems. For example, if I consider the statement cv2.imread, it loads an image from the specified file, and if the image cannot be read, because of a missing file or an unsupported or invalid format, then this method returns an empty matrix.

So, import cv2 means we are using OpenCV-Python, which is used for computer vision problems. After this, I am considering the matplotlib import; its pyplot interface is a collection of functions that make matplotlib work like MATLAB, so we can create figures and do the plotting with its help. After importing these libraries, I implement the Canny edge detector in Python.

Here you can see the conversion of the image to grayscale; if an RGB image is available, we convert it into grayscale. After this, we do the Gaussian blurring; you can see I am doing the Gaussian blurring. Then I am considering the Sobel operator to determine the gradients: we determine the gradient along the x direction and the gradient along the y direction, and from these we determine the gradient magnitude.

(Refer Slide Time: 31:11)

After this, we consider the concept of non-maximum suppression. I am considering angle ranges around 0, 45, 90 and 135 degrees, that is, the quantization of all possible directions into four directions by defining ranges. So, I define the ranges and do the non-maximum suppression: I find the neighbourhood pixels along the gradient direction, do the comparisons, and from this I get the non-maximum suppressed image.

(Refer Slide Time: 31:43)

After this, I consider the thresholding with hysteresis. For this, I use a high (strong) threshold and a low (weak) threshold, and I obtain the strong edge pixels and the weak edge pixels. Finally, I get the edge pixels and can show the input and output images. So, this is the Python program for the Canny edge detector; like this, you can write either a MATLAB program or an OpenCV-Python program.

(Refer Slide Time: 32:24)

Corresponding to this, you can see the input image and the output image; the edges are detected in the output image. That is edge detection by the Canny edge detector.
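
For comparison, the whole pipeline can also be invoked through OpenCV's built-in function, as in the sketch below; cv2.Canny performs the gradient computation, non-maximum suppression and hysteresis internally, so only the Gaussian blurring and the two hysteresis thresholds are supplied (the threshold values are assumptions).

import cv2

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)   # noise reduction
edges = cv2.Canny(blurred, 50, 150)             # low and high thresholds
cv2.imwrite("canny_edges.png", edges)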

(Refer Slide Time: 32:37)

The next problem is to write a program to detect lines and circles by the Hough transform. Corresponding to this image, I am detecting the lines and the circles by the Hough transformation; this principle I have already explained in my class. For this we have to consider the parameter space, which for lines is the (rho, theta) space, and mainly I have to do the voting.

For this, I first initialize the accumulator and then do the voting over all the edge pixels. From this I can detect the lines and the circles. You can see this is the parameter space; I can show the outputs in the parameter space, and based on this voting I can detect the lines and the circles. This is question number fourteen.
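
A minimal OpenCV sketch for this question is given below; the accumulator thresholds and the circle radius range are illustrative assumptions.

import cv2
import numpy as np

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 50, 150)

# Lines in the (rho, theta) parameter space.
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=120)
# Circles via the gradient-based Hough method.
circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=30,
                           param1=150, param2=40, minRadius=10, maxRadius=80)
print(0 if lines is None else len(lines), "lines detected")
print(0 if circles is None else circles.shape[1], "circles detected")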

(Refer Slide Time: 33:38)

The next problem is to write a program to balance a color image. Here you can see I am considering an input image that is not color balanced; the colors are improper, so I have to balance them. The simple procedure uses a known region. Suppose I consider this known region; it should be white, but in this case I am not getting white.

Corresponding to this known white portion, we should have R = G = B = 1, but actually I am not getting the white color. Similarly, if I consider this region, it should be black, but I am not getting the black color because of the imperfection of the image capturing device. So, what do I do? I have to apply the color balancing principle.

Corresponding to the known white region, I require R = G = B = 1. Now, suppose I keep one component fixed, say R, and find a transformation for the other two components, G and B, so that R = G = B = 1 in that region. So, I have to keep one component fixed and find a transformation for the other two components such that R = G = B = 1.

Then I apply this transformation to all the pixels of the image, and I get the color balanced image as the output; that is called color balancing. You can write a MATLAB program or an OpenCV-Python program for this problem.
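
A hedged sketch of this procedure in Python is given below: the gains are computed from a region that is known to be white (its coordinates here are placeholders), keeping the red channel fixed, and then applied to every pixel.

import cv2
import numpy as np

img = cv2.imread("unbalanced.png").astype(np.float32)    # BGR image
patch = img[10:40, 10:40]                 # region known to be white (assumed)
b, g, r = patch.reshape(-1, 3).mean(axis=0)

gains = np.array([r / b, r / g, 1.0])     # keep R fixed, rescale B and G
balanced = np.clip(img * gains, 0, 255).astype(np.uint8)
cv2.imwrite("balanced.png", balanced)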

(Refer Slide Time: 35:25)

The next one is to write a program for a vector median filter and show the results for a color image. Here I am showing one input image, the Lena image, corrupted by salt and pepper noise, and I remove the salt and pepper noise by using the vector median filter. I cannot apply a scalar median filter to a color image; that concept I have also discussed in the class. So, for removing salt and pepper noise in a color image, I have to apply the vector median filter.

(Refer Slide Time: 36:03)

What is the program for this? You can see that first I take the original image, then I add the salt and pepper noise and display the original and the noisy images. After this, I mainly have to determine the distances: I consider one window, and within this window I compute the distances between the pixels in terms of their RGB values, that is, the Euclidean distances.

You can see the distances are computed from the component-wise R, G and B differences. I find the distances in terms of the RGB values and then find the pixel with the minimum total distance; that pixel is the vector median. After determining the median value, I display the output image, in which the salt and pepper noise has been removed. This is the program for the vector median filter.
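
A slow but straightforward Python sketch of a 3 x 3 vector median filter is given below: within each window, the pixel whose total Euclidean distance to the other colour vectors is smallest replaces the centre pixel.

import numpy as np

def vector_median_filter(img):
    """img: (H, W, 3) colour image; returns the 3x3 vector-median output."""
    H, W, _ = img.shape
    out = img.copy()
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            win = img[i - 1:i + 2, j - 1:j + 2].reshape(9, 3).astype(float)
            d = np.linalg.norm(win[:, None] - win[None, :], axis=2).sum(axis=1)
            out[i, j] = win[np.argmin(d)]   # vector median of the window
    return out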

(Refer Slide Time: 37:05)

Question number seventeen is to write a program for image segmentation by k-means clustering. I consider the input image, and you can see I am doing the image segmentation by k-means clustering. First I have to initialize the means, selected randomly; after this I update the means based on the minimum distance assignment, which is the k-means clustering algorithm.

(Refer Slide Time: 37:34)

You can see the program. First I initialize the means: mean 1, mean 2 and mean 3, so three means I am considering. After this, I find the distance between the sample points and the means and update the clusters, so I get the new cluster centers.

(Refer Slide Time: 37:51)

This process is repeated again and again until there is no change in the means, and finally I get the segmented output image. That is about the k-means clustering.
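
A compact sketch with OpenCV's k-means routine is given below: the RGB values of all pixels are clustered into K groups and every pixel is repainted with its cluster centre (K = 3 and the file name are assumptions).

import cv2
import numpy as np

img = cv2.imread("input.png")
Z = img.reshape(-1, 3).astype(np.float32)          # one sample per pixel

K = 3
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(Z, K, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)
segmented = centers[labels.flatten()].reshape(img.shape).astype(np.uint8)
cv2.imwrite("segmented.png", segmented)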

(Refer Slide Time: 38:05)

Problem number eighteen is to write a program for motion estimation by optical flow. Here I am showing two examples. In the first one I consider the frames of the input video, and here I am showing only two frames; similarly, for the second case I am showing input frames corresponding to another video. After this, I apply the optical flow principle.

So, I can determine the optical flow, and you can see the direction of the optical flow. Similarly, for the second example you can see the optical flow directions determined from the input frames. You have to apply the principle of optical flow; this also you can do in MATLAB or in OpenCV-Python. This is a very interesting program.
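
A minimal OpenCV sketch is given below, using the dense Farneback optical flow between two consecutive frames of a video; the video file name and the flow parameters are assumptions.

import cv2

cap = cv2.VideoCapture("input_video.mp4")
_, frame1 = cap.read()
_, frame2 = cap.read()
g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
g2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# Dense flow: one (dx, dy) motion vector per pixel.
flow = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)    # H x W x 2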

(Refer Slide Time: 38:51)

Next is question number nineteen: write a program to recognize faces by PCA and LDA. Any face database can be used. In this case I first have to determine the eigenfaces for PCA, and for LDA I have to determine the Fisher faces. Here you can see I am showing the eigenfaces. The principle is that any unknown face can be represented by a linear combination of eigenfaces.

These are the eigenfaces I determine. The eigenfaces are computed from the eigenvectors of the covariance matrix: from the input vectors I determine the covariance matrix, from the covariance matrix I determine the eigenvectors, and each eigenvector can be displayed as an image, which is nothing but an eigenface. Any face can then be represented by a linear combination of the eigenfaces.
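
A short NumPy sketch of the eigenface computation is given below: the training faces are stacked as rows, the mean face is subtracted, and the leading eigenvectors of the covariance are obtained through the SVD; each of them reshaped to the image size is an eigenface.

import numpy as np

def eigenfaces(faces, k):
    """faces: (N, H, W) aligned grayscale face images; returns k eigenfaces."""
    N, H, W = faces.shape
    X = faces.reshape(N, -1).astype(float)
    mean_face = X.mean(axis=0)
    # Right singular vectors of the centred data = eigenvectors of the covariance.
    _, _, Vt = np.linalg.svd(X - mean_face, full_matrices=False)
    return mean_face.reshape(H, W), Vt[:k].reshape(k, H, W)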

(Refer Slide Time: 39:42)

So, these are the eigenfaces.

(Refer Slide Time: 39:44)

Similarly, I can also determine the LDA faces: I apply the LDA principle and determine the Fisher faces. That means I have to determine the Fisher faces from the matrix S_W^{-1} S_B, where S_W is the within-class scatter matrix and S_B is the between-class scatter matrix. From this you can determine the Fisher faces; so, these are the Fisher faces.

(Refer Slide Time: 40:12)

The last problem is to read the frames of a video one by one and then convert them to a color video; that is nothing but pseudo coloring, or false coloring. Then segment out the different objects, vehicles, persons and so on from the input video and determine the trajectory showing the motion of two vehicles. In this input you can see I am considering the frames of a video, and I have to determine the moving objects.

You can simply apply a change detection algorithm to determine the moving objects, or you can apply the optical flow algorithm or any other algorithm, but the simple one is change detection. For the pseudo coloring, the concept can be used to convert a black and white image, or a grayscale image, into a color image.

(Refer Slide Time: 41:13)

So, I am applying the pseudo coloring principle and you can see the outputs. So, I am converting
the grayscale image into color images.

(Refer Slide Time: 41:21)

So, these transformations I apply for pseudo coloring. The first transformation is for the R component, these are the transformations for the G component, and this is the transformation for the B component. If you look at the transformations, there is a difference in the phase and the frequency, and because of this I am getting the false colors: each intensity range gets a different color.

So, corresponding to this portion I will be getting one color, corresponding to this portion I will
be getting another color. So, this is the concept of pseudo coloring.
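
A hedged Python sketch of this kind of pseudo coloring is given below: three sinusoidal transformations with different phases map each grey level to an (R, G, B) triple; the exact frequencies, phases and file names are illustrative assumptions.

import cv2
import numpy as np

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype(float) / 255.0
r = 255 * np.abs(np.sin(2 * np.pi * gray))                 # R transformation
g = 255 * np.abs(np.sin(2 * np.pi * gray + np.pi / 3))     # G, shifted in phase
b = 255 * np.abs(np.sin(2 * np.pi * gray + 2 * np.pi / 3)) # B, shifted further
color = cv2.merge([b, g, r]).astype(np.uint8)              # OpenCV uses BGR order
cv2.imwrite("pseudo_coloured.png", color)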

(Refer Slide Time: 41:54)

And you can see that, corresponding to this input, I am converting the input image into a color image. Also, by applying the change detection algorithm, I can determine the moving objects, and corresponding to these I can determine the trajectory; you can even apply the optical flow algorithm. In this class I have given twenty programming examples.

I feel you should do programming to understand the concepts of image processing, computer vision and also the machine learning algorithms. If you do programming in MATLAB or any other programming language, your concepts will become clear. Initially I thought that I could cover many more topics, particularly the applications of computer vision, but in a thirty-hour course it is not possible to cover all the applications.

That is why I have to stop here, because I have already covered about forty hours of lectures. Maybe next time I can discuss more applications of computer vision. I hope you have enjoyed the course. Thank you.
