CV Unit 1 Overview of Computer Vision and Application
Unit 1
Computer Vision Introduction
Light enters the eye through the cornea and is focused by the lens onto the retina, a light-sensitive
membrane at the back of the eye. The image formed there is inverted. Human vision does not rely on
the eye alone; the eye works together with the brain to function.
To identify an object in an image, we only need to see a certain number of general forms and patterns.
Computer vision is a technology that aspires to perform at the same level as human eyesight.
However, while the human eye struggles to make sense of rough or unclear views of things, for
computer vision this is not a significant concern. To attain a high level of visual understanding,
computer vision draws on numerous methodologies, algorithms, machine learning, and other types of
training. Computer vision can recognize images without the assistance of a person. Progress in deep
learning techniques has given new life to the field of computer vision.
Definition:
Computer vision can be defined as a scientific field that extracts information out of digital images.
The type of information gained from an image can range from identification to space measurements for
navigation or augmented reality applications.
What is vision?
● 1 : A sensing device has to capture as much detail of the scene as possible. In human vision this is
the eye; in computer vision it is typically a camera.
● 2 : The interpreting device has to process the information and extract meaning from it.
○ The human brain solves this in multiple steps in different regions of the brain.
○ Computer vision still lags behind human performance in this domain. It uses machine
learning and deep learning algorithms to extract information from image semantics.
Extracting information from images
● We can divide the information gained from images in computer vision into two categories:
measurements and semantic information.
● Vision as a measurement device
○ Robots navigating in an unknown location need to be able to scan their surroundings to
compute the best path.
○ Stereo cameras give depth information, like our two eyes, through triangulation. If we
increase the number of viewpoints to cover all the sides of an object, we can create a 3D
surface representing the object. An even more challenging idea is to reconstruct the 3D
model of a monument from all the results of a Google image search for it.
● A source of semantic information
○ On top of measurement information, an image contains a very dense amount of semantic
information. We can label objects in an image, label the whole scene, recognize people,
and recognize actions, gestures, and faces.
Applications
● Special effects Shape and motion capture are techniques used in movies like Avatar to animate
digital characters by recording the movements performed by a human actor. In order to do that, we
have to find the exact positions of markers on the actor’s face in 3D space and then recreate them
on the digital avatar.
● 3D urban modeling Pictures taken with a drone over a city can be used to render a 3D model of
the city. Computer vision is used to combine all the photos into a single 3D model.
● Scene recognition It is possible to recognize the location where a photo was taken. For
instance, a photo of a landmark can be compared to billions of photos on Google to find the
best matches.
● Face detection Face detection has been used for multiple years in cameras to take better
pictures and focus on the faces. Smile detection can allow a camera to take pictures
automatically when the subject is smiling.
● Face recognition is more difficult than face detection, but with the scale of today’s data,
companies like Facebook are able to get very good performance. Finally, we can also use
computer vision for biometrics, using unique iris pattern recognition or fingerprints.
● Optical Character Recognition One of the oldest successful applications of computer vision
is to recognize characters and numbers. This can be used to read zip codes, or license plates.
● Mobile visual search With computer vision, we can do a search on Google using an image
as the query.
● Self-driving cars Autonomous driving is one of the hottest applications of computer vision.
Companies like Tesla, Google or General Motors compete to be the first to build a fully
autonomous car.
● Automatic checkout Amazon Go is a new kind of store that has no checkout. With computer
vision, algorithms detect exactly which products you take and charge you as you walk out of
the store.
● Vision-based interaction Microsoft’s Kinect captures movement in real time and allows
players to interact directly with a game through moves.
● Augmented Reality AR is also a very hot field right now, and multiple companies are
competing to provide the best mobile AR platform. Apple released ARKit in June and there are
already impressive applications.
● Virtual Reality VR uses computer vision techniques similar to AR. The algorithm needs to
know the position of the user and the positions of all the objects around them. As the user moves
around, everything needs to be updated in a realistic and smooth way.
Image Geometry:
https://medium.com/@madali.nabil97/representing-and-manipulating-points-lines-and-conics-using-homogeneous-coordinates-46d907ba2f54
Points:
One of the basic primitives in geometry is the point. A point can be defined by one or more
coordinates depending on the space; for example, in 2D a point is defined in the Euclidean coordinate
system by a vector with two components:
x = (x, y)
Although this is the most widely used notation to define a point, it is rarely used in fields related to
computer vision, where the point is commonly defined in homogeneous coordinates. This can be done
simply by adding an additional dimension:
x̃ = (x, y, 1)
If you are given a point in homogeneous coordinates, you can do the reverse transformation by simply
dividing by the last term. The set of vectors
(wx, wy, w), with w ≠ 0,
defines all possible points in 3D space that represent the same physical point (x, y) in 2D. We can
therefore conclude that all the points of this set are equivalent.
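A minimal NumPy sketch (not part of the original notes) of the conversion between Cartesian and homogeneous 2D points; the point (3, 4) is just an illustrative value:

```python
import numpy as np

def to_homogeneous(p):
    # Append w = 1 to a 2D point (x, y) -> (x, y, 1)
    return np.append(np.asarray(p, dtype=float), 1.0)

def from_homogeneous(ph):
    # Divide by the last component to recover the 2D point
    return ph[:-1] / ph[-1]

p = np.array([3.0, 4.0])
ph = to_homogeneous(p)              # [3. 4. 1.]
print(from_homogeneous(2.5 * ph))   # [3. 4.]: any non-zero scale gives the same 2D point
```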
3D Points:
Point coordinates in 3D can be written using inhomogeneous coordinates x = (x, y, z) in R3 or
homogeneous coordinates x̃ = (x̃, ỹ, z̃, w̃) in P3. It is sometimes useful to denote a 3D point using the
augmented vector x̄ = (x, y, z, 1), with x̃ = w̃ x̄.
Lines:
Having defined a point, let us define a straight line. Recall that a straight line in 2D is defined by the
following equation:
ax + by + c = 0
A fairly simple way to interpret this equation is to consider that every point belonging to the straight
line parameterized by the vector ℓ = (a, b, c) has a null dot product with it; more formally,
x̃ · ℓ = 0, with x̃ = (x, y, 1)
We can consider the parameter vector ℓ as the homogeneous-coordinate representation of the straight
line in 2D. In the same way, any straight line passing through the origin of the homogeneous space
(excluding the origin itself) is the homogeneous representation of a physical point in 2D.
The intersection of two lines parameterized by ℓ1 and ℓ2 is given by their cross product:
x̃ = ℓ1 × ℓ2
We can also easily calculate the line between two points x̃1 and x̃2:
ℓ = x̃1 × x̃2
The first equation comes from the fact that x̃ belongs to both lines and therefore verifies
ℓ1 · x̃ = 0 and ℓ2 · x̃ = 0
The second equation comes from the fact that the two points belong to the same line and verify
x̃1 · ℓ = 0 and x̃2 · ℓ = 0
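The cross-product identities above are easy to verify numerically. The following NumPy sketch (the lines and points are illustrative, not from the original notes) intersects the lines y = x and y = 2 and then joins two points into a line:

```python
import numpy as np

# Lines in homogeneous form l = (a, b, c), i.e. ax + by + c = 0
l1 = np.array([1.0, -1.0, 0.0])   # x - y = 0, i.e. y = x
l2 = np.array([0.0, 1.0, -2.0])   # y - 2 = 0, i.e. y = 2

x = np.cross(l1, l2)              # intersection point in homogeneous coordinates
print(x[:2] / x[2])               # -> [2. 2.]

p1 = np.array([0.0, 0.0, 1.0])    # point (0, 0)
p2 = np.array([1.0, 1.0, 1.0])    # point (1, 1)
l = np.cross(p1, p2)              # line through both points
print(l @ p1, l @ p2)             # both dot products are 0
```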
3D Lines:
The equation for a line in 3D: 𝒗 = 〈𝑎, 𝑏, 𝑐〉 is a vector parallel to the line and 𝒓𝟎 = 〈𝑥0, 𝑦0, 𝑧0〉 is the
position vector of a point on the line; then all other points 𝒓 = 〈𝑥, 𝑦, 𝑧〉 on the line satisfy
𝒓 = 𝒓𝟎 + 𝑡𝒗, i.e. 〈𝑥, 𝑦, 𝑧〉 = 〈𝑥0 + 𝑡𝑎, 𝑦0 + 𝑡𝑏, 𝑧0 + 𝑡𝑐〉
Plane:
The equation for a plane in 3D: 𝒏 = 〈𝑎, 𝑏, 𝑐〉 is a vector orthogonal to the plane and 𝒓𝟎 = 〈𝑥0, 𝑦0, 𝑧0〉
is the position vector of a point on the plane; then all other points 𝒓 = 〈𝑥, 𝑦, 𝑧〉 on the plane satisfy
𝒏 ∙ (𝒓 − 𝒓𝟎) = 0
The above form (𝒏 ∙ (𝒓 − 𝒓𝟎) = 0) is called the vector form of the plane.
We also write this in standard form as 𝑎(𝑥 − 𝑥0) + 𝑏(𝑦 − 𝑦0) + 𝑐(𝑧 − 𝑧0) = 0, or
𝑎𝑥 + 𝑏𝑦 + 𝑐𝑧 + 𝑑 = 0.
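A small NumPy sketch (values chosen only for illustration) of the parametric 3D line r = r0 + t·v and the plane test n·r + d = 0:

```python
import numpy as np

# 3D line: r(t) = r0 + t * v
r0 = np.array([1.0, 2.0, 3.0])    # a point on the line
v  = np.array([0.0, 1.0, 1.0])    # direction vector
point_on_line = r0 + 2.0 * v      # t = 2 -> (1, 4, 5)

# Plane: n . (r - r0) = 0, equivalently ax + by + cz + d = 0 with d = -n . r0
n = np.array([0.0, 0.0, 1.0])     # normal vector (here: the plane z = 3)
d = -n @ r0

def on_plane(r):
    return np.isclose(n @ r + d, 0.0)

print(on_plane(np.array([5.0, -7.0, 3.0])))   # True  (z = 3)
print(on_plane(point_on_line))                # False (z = 5)
```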
Transformation
Linear Transformations
•Linear transformations are combinations of scale, rotation, and shear.
•Properties satisfied:
•Origin maps to origin
•Lines map to lines
•Parallel lines remain parallel
•Ratios are preserved
Affine Transformations
•Affine transformations are generalizations of Euclidean transformations. They are combinations of
•linear transformations (scale, rotation, shear) and
•translations and reflections.
•Properties:
•Origin does not necessarily map to origin
•Lines map to lines but circles become ellipses
•Parallel lines remain parallel
•Ratios are preserved
•Length and angle are not preserved
Projective Transformations
•Projective transformations are the most general linear transformations and require the use of
homogeneous coordinates.
•Properties:
TYPES OF TRANSFORMATION
There are two types of transformation in computer graphics.
1) 2D transformation
2) 3D transformation
Types of 2D and 3D transformation
1) Translation 2) Rotation 3) Scaling 4) Shearing 5) Reflection
2D transformation:
Translation
The straight-line movement of an object from one position to another is called translation. Here the
object is repositioned from one coordinate location to another.
Translation of a point:
To translate a point from coordinate position (x, y) to another (x1, y1), we algebraically add the
translation distances Tx and Ty to the original coordinates:
x1 = x + Tx
y1 = y + Ty
For translating a polygon, each vertex of the polygon is moved to a new position. Curved objects are
translated similarly. To change the position of a circle or ellipse, its center coordinates are
translated, and then the object is redrawn using the new coordinates.
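A short NumPy sketch (the example vertices and translation distances are made up) showing that translating a polygon simply adds (Tx, Ty) to every vertex:

```python
import numpy as np

tx, ty = 4.0, -2.0                        # translation distances Tx, Ty
polygon = np.array([[0.0, 0.0],
                    [3.0, 0.0],
                    [3.0, 2.0]])          # one vertex per row

translated = polygon + np.array([tx, ty]) # every vertex moves by the same amount
print(translated)
```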
Scaling
● Scaling is used to alter the size of an object. The change is done using scaling factors.
● There are two scaling factors: Sx in the x direction and Sy in the y direction.
● If the picture is enlarged to twice its original size, then Sx = Sy = 2. If Sx and Sy are not equal,
scaling still occurs but it elongates or distorts the picture.
● If the scaling factors are less than one, the size of the object is reduced.
● If the scaling factors are greater than one, the size of the object is enlarged.
● If Sx and Sy are equal, it is called uniform scaling; if they are not equal, it is called differential
scaling.
● Scaling factors less than one move the object closer to the coordinate origin, while values greater
than one move it farther from the origin.
● Enlargement:
○ If T1 is a scaling transformation with factors greater than one (for example Sx = Sy = 2) and
(x1, y1) is the original position, then (x2, y2) are the coordinates after scaling.
● Reduction:
○ If T1 is a scaling transformation with factors less than one and (x1, y1) is the original
position, then (x2, y2) are the coordinates after scaling.
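A brief NumPy sketch (the unit square and the scale factors are illustrative) contrasting uniform and differential scaling:

```python
import numpy as np

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)

S_uniform = np.diag([2.0, 2.0])    # Sx = Sy = 2 -> uniform scaling (enlargement)
S_diff    = np.diag([2.0, 0.5])    # Sx != Sy  -> differential scaling (distortion)

print(square @ S_uniform.T)        # doubled in both directions
print(square @ S_diff.T)           # stretched in x, reduced in y
```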
Rotation:
Rotation about an arbitrary point: If we want to rotate an object or point about an arbitrary point,
we first translate the pivot point (the point about which we want to rotate) to the origin, then rotate
the object about the origin, and finally translate it back to its original place. This gives rotation
about an arbitrary point.
Example: Rotate a line CD whose endpoints are (3, 4) and (12, 15) about the origin through 45° in the
anticlockwise direction.
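The example can be checked numerically. The sketch below (not from the original notes) implements the translate, rotate, translate-back recipe; with the pivot at the origin the translation steps are no-ops, and the rotated endpoints come out to roughly (-0.71, 4.95) and (-2.12, 19.09):

```python
import numpy as np

def rotate_about(points, angle_deg, pivot=(0.0, 0.0)):
    # Translate the pivot to the origin, rotate, then translate back
    a = np.radians(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    pivot = np.asarray(pivot, dtype=float)
    return (points - pivot) @ R.T + pivot

CD = np.array([[3.0, 4.0], [12.0, 15.0]])
print(rotate_about(CD, 45.0))   # endpoints after a 45 deg anticlockwise rotation
```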
Euler angles
Two classes of linear transformations: projective and affine. Affine transformations are a particular
case of the projective ones. Both transformations can be represented with the following 3 × 3 matrix:
[ a11 a12 b1 ]
[ a21 a22 b2 ]
[ c1  c2  1  ]
Where:
● A = (a11, a12; a21, a22) is a rotation (transformation) matrix. This matrix defines the kind of
transformation that will be performed: scaling, rotation, and so on.
● B = (b1, b2) is the translation vector. It simply moves the points.
● C = (c1, c2) is the projection vector. For affine transformations, all elements of this vector are
always equal to 0.
If x and y are the coordinates of a point, the transformation can be done by the simple
multiplication:
(x′, y′, w′) = matrix above × (x, y, 1)
Here, x' and y' are the coordinates of the transformed point (after dividing by w′ for projective
transformations).
This transformation allows creating perspective distortion. The affine transformation is used for
scaling, skewing and rotation.
For affine transformations, the first two elements of the last row should be zeros.
Affine transformation is a transformation of a triangle. Since the last row of the matrix is zeroed,
three points are enough to define it. The image below illustrates the difference.
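A NumPy sketch (matrix entries are illustrative) that applies a 3 × 3 transformation through homogeneous coordinates; note how the affine matrix keeps the last row (0, 0, 1) while the projective one does not:

```python
import numpy as np

def apply_h(H, pts):
    # Apply a 3x3 transform to Nx2 points using homogeneous coordinates
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

A = np.array([[1.0, 0.5, 2.0],     # affine: last row is (0, 0, 1)
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

P = np.array([[1.0, 0.0, 0.0],     # projective: non-zero projection terms in the
              [0.0, 1.0, 0.0],     # last row create perspective distortion
              [0.1, 0.2, 1.0]])

print(apply_h(A, pts))   # parallel lines stay parallel
print(apply_h(P, pts))   # parallel lines may now converge
```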
3D Transformation:
3D Translation:
3D Rotation:
X axis Rotation
Y axis Rotation
Z axis Rotation:
3D Scaling:
3D Reflection
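The 4 × 4 matrices for these 3D transformations appeared as figures in the original notes; below is a NumPy sketch of the standard textbook forms (a reconstruction, not the original figures):

```python
import numpy as np

def translation(tx, ty, tz):
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]          # translation column
    return T

def scaling(sx, sy, sz):
    return np.diag([sx, sy, sz, 1.0])

def rotation_x(a):                   # rotation about the x axis by angle a (radians)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]])

def rotation_y(a):                   # rotation about the y axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]])

def rotation_z(a):                   # rotation about the z axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])

p = np.array([1.0, 2.0, 3.0, 1.0])                   # homogeneous 3D point
print(translation(5, 0, 0) @ scaling(2, 2, 2) @ p)   # scale first, then translate
```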
Image Radiometry
In image formation, radiometry is concerned with the relation among the amounts of light energy
emitted from light sources, reflected from surfaces, and captured by sensors.
Radiometry Parameters
● When light arrives at a surface, two factors affect how it contributes to vision:
○ Strength: characterized by its irradiance (energy per unit time per unit area).
○ Distance: how much of the emitted energy actually gets to the object (assuming no
attenuation and no reflection along the way).
● When the light hits a surface, three major reactions might occur:
○ Some light is absorbed. This depends on a factor called ρ (albedo); a low ρ of the
surface means more light gets absorbed.
○ Some light is reflected (scattered) back from the surface; this is the light that a sensor
or eye can observe.
○ Some light may be transmitted through the surface.
● The Bidirectional Reflectance Distribution Function (BRDF) gives the measure of light scattered
by a medium from one direction into another.
● The scattering of the light can reveal the topography of the surface:
○ Smooth surfaces reflect almost entirely in the specular direction, while with increasing
roughness the light tends to scatter into all possible directions.
○ An object will appear equally bright throughout the outgoing hemisphere if its surface is
perfectly diffuse (i.e., Lambertian).
● Relative to some local coordinate frame on the surface, the BRDF is a four-dimensional function
that describes how much of each wavelength arriving at an incident direction v̂ i is emitted in a
reflected direction v̂ r.
● The function can be written in terms of the angles of the incident and reflected directions relative
to the surface frame as
fr(θi, φi, θr, φr; λ)
● Owing to this, the BRDF can give valuable information about the nature of the target sample.
● The BRDF is the fraction of incident light reflected from the incident direction into the viewing
direction, per unit surface area per unit viewing angle.
Typical BRDFs can often be split into their diffuse and specular components.
● The diffuse component (also known as Lambertian or matte reflection) scatters light
uniformly in all directions and is the phenomenon we most normally associate with shading.
● Diffuse reflection also often imparts a strong body color to the light, since it is caused by
selective absorption and re-emission of light inside the object’s material.
● Because the light is scattered uniformly in all directions, the diffuse BRDF is constant.
The second major component of a typical BRDF is specular (gloss or highlight) reflection,
which depends strongly on the direction of the outgoing light.
Consider light reflecting off a mirrored surface. Incident light rays are reflected in a direction that is
rotated by 180° around the surface normal n̂.
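The split into diffuse and specular components can be illustrated with a toy Phong-style shading function. This is a simplified stand-in, not a physically correct BRDF, and all parameter values below are made up:

```python
import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def shade(n, light_dir, view_dir, albedo=0.8, ks=0.5, shininess=32):
    # Toy diffuse + specular shading (Phong-style), for illustration only
    n, l, v = normalize(n), normalize(light_dir), normalize(view_dir)
    diffuse = albedo * max(n @ l, 0.0)            # Lambertian: independent of view direction
    r = 2.0 * (n @ l) * n - l                     # mirror reflection of the light direction
    specular = ks * max(r @ v, 0.0) ** shininess  # strong only near the mirror direction
    return diffuse + specular

n = np.array([0.0, 0.0, 1.0])
print(shade(n, light_dir=[1, 0, 1], view_dir=[0, 0, 1]))    # mostly diffuse
print(shade(n, light_dir=[1, 0, 1], view_dir=[-1, 0, 1]))   # near mirror direction: brighter
```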
Color
From a viewpoint of color, we know visible light is only a small portion of a large electromagnetic
spectrum.
Two factors matter when colored light arrives at a sensor:
Color of the light
Color of the surface
The Bayer grid/filter is an important development for capturing the color of light. In a camera, not
every sensor element captures all three components (RGB) of light. Inspired by the human visual
photoreceptors, Bayer proposed a grid with 50% green, 25% red, and 25% blue sensors.
3D to 2D Projection
● Projections transform points in n-space to m-space, where m < n.
● We need to specify how 3D primitives are projected onto the image plane. A linear 3D-to-2D
projection matrix can be used to do this.
● In 3D, we map points from 3-space to the projection plane (PP) along projectors emanating
from the center of projection (COP).
Parallel Projection:
● In parallel projection, the center of projection lies at infinity. The view of the object obtained at
the plane is less realistic, as there is no foreshortening, and the relative dimensions of the object
are preserved.
● Parallel projection is further divided into two categories:
○ a) Orthographic Projection
○ b) Oblique Projection
(a) Orthographic Projection: It is a kind of parallel projection where the projecting lines emerge
parallel from the object surface and are incident perpendicularly on the projecting plane.
(1) Top-view: In this projection, the rays that emerge from the top of the polygon
surface are observed.
(2) Side-view: In this projection, the rays that emerge from the side of the polygon
surface are observed.
(3) Front-view: In this orthographic projection, the front face view of the object is
observed.
(b) Oblique Projection: It is a kind of parallel projection where the projecting rays emerge parallel
from the surface of the polygon and are incident at an angle other than 90 degrees on the plane.
It is of two kinds:
1. Cavalier Projection: It is a kind of oblique projection where the projecting lines emerge
parallel from the object surface and are incident at 45° rather than 90° on the projecting plane.
In this projection, the length of the receding axis is larger than in the cabinet projection.
2. Cabinet Projection: It is similar to the cavalier projection, but here the length of the receding
axis is just half that of the cavalier projection, and the incident angle at the projecting plane is
63.4° rather than 45°.
Perspective Projection:
● In perspective projection, the distance of the projection plane from the center of projection is
finite. The apparent object size changes inversely with distance.
● In perspective projection, the projector lines come together at a single point. This single point
is also called the "projection reference point" or "center of projection."
● Characteristics of Perspective Projection:
○ The distance between the object and the projection center is finite.
○ In perspective projection, it is difficult to judge the actual size and shape of the
object.
○ Perspective projection has the concept of vanishing points.
○ Perspective projection is realistic but tough to implement.
● Vanishing Point: A vanishing point can be defined as a point in the image plane where all parallel
lines appear to meet. The vanishing point is also called the "directing point."
1.One Point:
● A One Point perspective contains only one vanishing point on the horizon line.
● It is easy to draw.
● Use of One Point- The One Point projection is mostly used to draw the images of roads,
railway tracks, and buildings.
2.Two Point:
● It is also called "Angular Perspective." A two-point perspective contains two vanishing
points on the horizon line.
● Use of Two Point- The main use of two-point projection is to draw scenes viewed from a
corner, such as two corner roads.
3.Three-Point:
● The Three-Point Perspective contains three vanishing points. Two points lie on the horizon
line, and one above or below the line.
● It is very difficult to draw.
● Use of Three-Point: It is mainly used to draw skyscrapers and other tall buildings.
The light originates from multiple light sources, gets reflected on multiple surfaces, and finally
enters the camera where the photons are converted into the (R, G, B) values that we see while
looking at a digital image.
In a camera, the light first passes through the lens (optics), followed by the aperture and shutter,
which can be specified or adjusted. The light then falls on the sensor, which can be CCD or CMOS;
the image is obtained in analog or digital form, and we get the raw image.
The image is sharpened if required, or other important processing algorithms are applied. After this,
white balancing and other digital signal processing tasks are done, and the image is finally
compressed to a suitable format and stored.
CCD vs CMOS
The camera sensor can be CCD or CMOS. In a charge-coupled device (CCD), a charge is generated
at each sensing element, and this photogenerated charge is moved from pixel to pixel and converted
into a voltage at the output node. An analog-to-digital converter (ADC) then converts the value of
each pixel to a digital value. In a CMOS sensor, by contrast, the charge is converted to a voltage
inside each pixel and read out directly.
Let us look at some properties that you may see while clicking a picture on a camera.
Sampling Pitch: the physical spacing between adjacent sensor cells on the imaging chip.
Fill Factor: the ratio of the active sensing area to the theoretically available sensing area (the
product of the horizontal and vertical sampling pitches).
Resolution (of the ADC): how many bits are specified for each pixel value.
Post-processing: digital image enhancement methods applied before compression and storage.
Pinhole Camera
● A simple camera system – a system that can record an image of an object or scene in the 3D
world.
● This camera system can be designed by placing a barrier with a small aperture between the
3D object and a photographic film or sensor.
● Each point on the 3D object emits multiple rays of light outwards. Without a barrier in
place, every point on the film will be influenced by light rays emitted from every point on
the 3D object.
● Due to the barrier, only one (or a few) of these rays of light passes through the aperture and
hits the film. The result is that the film gets exposed by an “image” of the 3D object.
● This simple model is known as the pinhole camera model.
● In a more formal construction of the pinhole camera, the film is commonly called the image
plane.
○ O : The aperture ( the pinhole O or center of the camera)
○ f : The distance between the image plane and the pinhole O is the focal length f .
○ Virtual Image Plane: the image plane is placed between O and the 3D object at a
distance f from O.
● Let P = [x y z]T be a point on some 3D object visible to the pinhole camera. P will be
mapped or projected onto the image plane Π′, resulting in the point P′ = [x′ y′]T.
● Similarly, the pinhole itself can be projected onto the image plane, giving a new point C′.
● Define a coordinate system [i j k] centered at the pinhole O such that the axis k is
perpendicular to the image plane and points toward it.
● This coordinate system is often known as the camera reference system or camera coordinate
system.
● The line defined by C′ and O is called the optical axis of the camera system.
● Recall that point P′ is derived from the projection of 3D point P onto the image plane Π′.
Notice that triangle P′C′O is similar to the triangle formed by P, O and (0, 0, z).
● If we derive the relationship between 3D point P = [x y z]T and image plane point
P′ = [x′ y′]T using these similar triangles, we obtain x′ = f x / z and y′ = f y / z (a short
numerical sketch follows this list).
● As the aperture size decreases, the image gets sharper but darker.
● As the aperture size increases, the number of light rays that pass through the barrier
increases, causing the image to blur.
● This conflict between sharpness and brightness is mitigated by using lenses, devices that can
focus light.
● If we replace the pinhole with a lens that is both properly placed and sized, then it satisfies
the following property:
○ all rays of light that are emitted by some point P are refracted by the lens such that
they converge to a single point P ′ in the image plane.
● Lenses have a specific distance for which objects are “in focus”. This property is related to the
photographic concept of depth of field.
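The similar-triangles relationship derived earlier (x′ = f x / z, y′ = f y / z) can be written as a tiny function; the point and focal length below are only illustrative:

```python
import numpy as np

def pinhole_project(P, f):
    # Project a 3D point P = (x, y, z), given in camera coordinates, onto the image plane
    x, y, z = P
    return np.array([f * x / z, f * y / z])

print(pinhole_project(np.array([2.0, 1.0, 10.0]), f=0.05))   # closer objects project larger
```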
The distance between the focal point and the center of the lens is commonly referred to as the focal
length f .
Because the paraxial refraction model is an approximation based on the thin lens assumption, a
number of aberrations can occur.
Radial distortion causes the image magnification to decrease or increase as a function of the distance
to the optical axis: it is called pincushion distortion when the magnification increases and barrel
distortion when the magnification decreases.
Projection matrix
There are three coordinate systems involved: the camera, image, and world coordinate systems.
This can be written as a linear mapping between homogeneous coordinates (the equation holds
only up to a scale factor):
x̃ ≃ P X̃
where x̃ = (u, v, 1) is the homogeneous image point, X̃ = (X, Y, Z, 1) is the homogeneous world
point, and P is a 3 × 4 projection matrix.
○ The camera calibration matrix K provides the transformation between an image point and a
ray in Euclidean 3-space.
○ There are four intrinsic parameters:
1. The scaling in the image x and y directions, α = f k and β = f l.
2. The principal point (cx, cy), which is the point where the optic axis
intersects the image plane.
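A small NumPy sketch (the intrinsic values are hypothetical, chosen only for illustration) that builds a calibration matrix K and maps a camera-frame 3D point to pixel coordinates:

```python
import numpy as np

# Hypothetical intrinsics: scaling alpha, beta (in pixels) and principal point (cx, cy)
alpha, beta, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([[alpha, 0.0,  cx],
              [0.0,   beta, cy],
              [0.0,   0.0,  1.0]])

P_cam = np.array([0.5, -0.2, 4.0])   # 3D point in camera coordinates
p = K @ P_cam                        # homogeneous image point
print(p[:2] / p[2])                  # pixel coordinates (u, v)
```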
Image Representation
After getting an image, it is important to devise ways to represent the image. There are various
ways by which an image can be represented. Let’s look at the most common ways to represent an
image.
Image as a matrix
The simplest way to represent the image is in the form of a matrix.
In the figure, a part of the image, i.e., the clock, has been represented as a matrix. A similar matrix
represents the rest of the image too. It is common to use one byte per pixel, meaning that values
between 0 and 255 represent the intensity of each pixel, where 0 is black and 255 is white. One such
matrix is generated for every color channel in the image. In practice, it is also common to normalize
the values to between 0 and 1 (as done in the example in the figure above).
Image as a function
An image can also be represented as a function. An image (grayscale) can be thought of as a
function that takes in a pixel coordinate and gives the intensity at that pixel.
It can be written as a function f: ℝ² → ℝ that outputs the intensity at any input point (x, y). The
intensity value can be between 0 and 255, or between 0 and 1 if the values are normalized.
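A minimal NumPy sketch (the 3 × 4 intensity matrix is made up) showing the two views of the same data: the image as a matrix and the image as a function f(x, y):

```python
import numpy as np

# A tiny 3x4 grayscale "image" as a matrix of intensities in [0, 255]
img = np.array([[  0,  64, 128, 255],
                [ 32,  96, 160, 224],
                [ 16,  80, 144, 208]], dtype=np.uint8)

def f(x, y):
    # Image as a function f: (x, y) -> intensity (row index = y, column index = x)
    return img[y, x]

print(img.shape)     # (rows, columns)
print(f(3, 0))       # 255, the top-right pixel
print(img / 255.0)   # the same image normalized to [0, 1]
```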
Image Digitization
In digital image processing, signals captured from the physical world need to be translated into
digital form by a “digitization” process. In order to become suitable for digital processing, an image
function f(x, y) must be digitized both spatially and in amplitude. This digitization involves two main
processes, called sampling and quantization.
Typically, a frame grabber or digitizer is used to sample and quantize the analogue video signal.
Sampling
● Since an analogue image is continuous not just in its coordinates (x axis) but also in its
amplitude (y axis), the part that deals with digitizing the coordinates is known as sampling.
● In digitizing, sampling is done on the independent variable. In the case of the equation
y = sin(x), it is done on the x variable.
● The more samples we take, the better the quality of the digitized image.
● However, sampling on the x axis alone does not convert the signal to digital form; you must
also sample the y axis, which is known as quantization.
● Sampling has a direct relationship with image pixels.
● The total number of pixels in an image can be calculated as pixels = total number of rows ×
total number of columns.
○ For example, if we have a total of 36 pixels, that means we have a square image of
6 × 6. As we know, in sampling, more samples eventually result in more pixels. So it
means that we have taken 36 samples of our continuous signal on the x axis, which
corresponds to the 36 pixels of this image. The number of samples is also directly
related to the number of sensors on the CCD array.
Here is an example for image sampling and how it can be represented using a graph.
Quantization
● Quantization digitizes the amplitude (intensity) values of the sampled signal into a finite
number of discrete levels.
● The number of quantization levels should be high enough for human perception of fine
shading details in the image.
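A short NumPy sketch (the gradient image and level counts are illustrative) of quantization, i.e., reducing the number of gray levels:

```python
import numpy as np

def quantize(img, levels):
    # Map intensities in [0, 255] to a smaller number of discrete levels
    step = 256 // levels
    return (img // step) * step + step // 2   # representative value of each bin

img = np.arange(0, 256, dtype=np.uint8).reshape(16, 16)   # smooth gradient
print(np.unique(quantize(img, 4)))    # only 4 gray levels remain: coarse shading
print(np.unique(quantize(img, 16)))   # 16 levels: finer shading detail
```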
Reference Frames
- Five reference frames are needed for general problems in 3D scene analysis: the object (model),
world, camera, real image, and pixel image coordinate frames.
● The camera matrix K contains some of the critical parameters that describe a camera’s
characteristics and its model, including the cx, cy, k, and l parameters.
● Two parameters are currently missing from this formulation: skewness and distortion. Most
cameras have zero skew, but some degree of skewness may occur because of sensor
manufacturing errors.
● Deriving the camera matrix that accounts for skewness is outside the scope of this class, and we
give it to you below (θ is the skew angle between the image axes, α = f k, β = f l):
K = [ α   −α·cot θ   cx
      0    β/sin θ   cy
      0       0       1 ]
These parameters R and T are known as the extrinsic parameters because they are external to and do
not depend on the camera.
This completes the mapping from a 3D point P in an arbitrary world reference system to the image
plane. To reiterate, we see that the full projection matrix M consists of the two types of parameters
introduced above: intrinsic and extrinsic parameters.
All parameters contained in the camera matrix K are the intrinsic parameters, which change as the
type of camera changes.
The extrinsic parameters include the rotation and translation, which do not depend on the camera’s
build.
Overall, we find that the 3 × 4 projection matrix M has 11 degrees of freedom: 5 from the intrinsic
camera matrix, 3 from extrinsic rotation, and 3 from extrinsic translation.
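To tie the pieces together, here is a NumPy sketch (all numeric values are hypothetical) that composes M = K [R | T] and projects a homogeneous world point to pixel coordinates; the K used here has zero skew, so it carries 4 of the 5 intrinsic degrees of freedom:

```python
import numpy as np

# Hypothetical intrinsics K and extrinsics (R, T), for illustration only
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                               # no rotation between world and camera
T = np.array([[0.1], [0.0], [2.0]])         # translation of the world origin

M = K @ np.hstack([R, T])                   # 3x4 projection matrix

P_world = np.array([0.5, -0.2, 4.0, 1.0])   # homogeneous world point
p = M @ P_world
print(p[:2] / p[2])                         # pixel coordinates (u, v)
```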