Computer Vision Notes: Confirmed Midterm Exam Guide (Kisi-Kisi UTS)
Point-based Processing
Rotation (Rotasi)
The following formula is used to rotate an image, where 𝛩 (theta) is the angle of rotation:
x' = x·cos𝛩 − y·sin𝛩
y' = x·sin𝛩 + y·cos𝛩
Easy way to remember rotation
Say you want to rotate this vector by 90 degrees, twice, counterclockwise:
After the first rotation, the coordinates of the vector become:
After the second rotation, the coordinates of the vector become:
Now, mathematically, we can perform this 90-degree rotation by multiplying the vector by some unknown 2x2 matrix, twice:
First multiplication:
The result would be:
Second multiplication:
The result would be:
The full result matrix, with entries a, b, c, d, is:
Since cos 90° = 0, sin 90° = 1, and −(sin 90°) = −1, we can match the entries and rewrite the full result matrix as:
[ cos𝛩  −sin𝛩 ]
[ sin𝛩   cos𝛩 ]
Hurray \(^ω^\)
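To make the derivation concrete, here is a minimal NumPy sketch (my own, not from the notes) that builds the 2x2 rotation matrix and applies the 90° rotation twice to the sample vector (1, 0):

```python
import numpy as np

def rotation_matrix(theta_deg):
    """2x2 counterclockwise rotation matrix for an angle in degrees."""
    t = np.radians(theta_deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

R = rotation_matrix(90)          # numerically ~ [[0, -1], [1, 0]]
v = np.array([1.0, 0.0])         # sample vector pointing along +x

print(np.round(R @ v))           # first rotation  -> [0, 1]
print(np.round(R @ R @ v))       # second rotation -> [-1, 0]
```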
Shearing (Shear)
Shearing (a.k.a. skewing) is an operation that displaces a line vertically or horizontally depending on the
shear matrix used.
There are two types of shearing:
● Vertical
This type of shearing displaces lines vertically, depending on the values of 𝛼 and 𝑥.
● Horizontal
This type of shearing displaces lines horizontally, depending on the values of 𝛼 and 𝑦.
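As a rough illustration (my own; the shear factor 0.5 and the sample point are arbitrary), the two shear matrices look like this in NumPy:

```python
import numpy as np

a = 0.5                               # shear factor (alpha)
vertical_shear   = np.array([[1, 0],  # y' = y + a*x, x unchanged
                             [a, 1]])
horizontal_shear = np.array([[1, a],  # x' = x + a*y, y unchanged
                             [0, 1]])

p = np.array([2.0, 1.0])              # sample point (x, y)
print(vertical_shear   @ p)           # [2.0, 2.0]  -> displaced vertically
print(horizontal_shear @ p)           # [2.5, 1.0]  -> displaced horizontally
```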
Scaling
A transformation that enlarges or shrinks the image by a certain scale (constant).
There are two kinds of scaling transformations:
● Uniform/Isotropic scaling (scaling with the same constant)
This type of scaling uses the same scale factor for the 𝑥 and 𝑦 components of the vector.
● Non-uniform/Anisotropic scaling (scaling with different constants)
This type of scaling uses different scale factors for the 𝑥 and 𝑦 components of the vector.
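A small sketch (mine) contrasting the two kinds of scaling on an arbitrary sample point:

```python
import numpy as np

p = np.array([2.0, 3.0])                 # sample point (x, y)

uniform     = np.diag([2.0, 2.0])        # same factor for x and y (isotropic)
non_uniform = np.diag([2.0, 0.5])        # different factors (anisotropic)

print(uniform @ p)        # [4.0, 6.0]  -> shape preserved
print(non_uniform @ p)    # [4.0, 1.5]  -> shape distorted
```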
Translation (Translasi)
A transformation that moves every point of the image by a given distance; it cannot be written as the product of a 2x2 matrix and a 2x1 vector.
● Homogeneous Coordinates (Koordinat Homogen)
To allow translation, the image must use homogeneous coordinates, where the 2D vector is represented as a 3D vector and 𝑧 acts as a scale factor for the 𝑥 and 𝑦 components.
● Translation with Homogeneous Coordinates (Translasi dalam Koordinat Homogen)
Translation can be written as the product of a 3x3 matrix with a homogeneous vector (with 𝑧 = 1):
[ 1  0  𝛼 ]   [ x ]   [ x + 𝛼 ]
[ 0  1  𝛽 ] · [ y ] = [ y + 𝛽 ]
[ 0  0  1 ]   [ 1 ]   [   1   ]
where 𝑥 is moved by 𝛼 units and 𝑦 by 𝛽 units.
Converting a 2x2 matrix to 3x3 for homogeneous coordinates (Konversi matriks 2x2
menjadi 3x3 untuk koordinat homogen)
The transformation matrix can be converted to a 3x3 matrix for use with homogeneous coordinates:
[ a  b ]        [ a  b  0 ]
[ c  d ]   ->   [ c  d  0 ]
                [ 0  0  1 ]
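A minimal sketch (my own) showing the 3x3 translation matrix, the embedding of a 2x2 matrix into homogeneous form, and how the two can be chained:

```python
import numpy as np

def to_homogeneous_3x3(m2x2):
    """Embed a 2x2 transformation matrix into a 3x3 homogeneous matrix."""
    h = np.eye(3)
    h[:2, :2] = m2x2
    return h

def translation(alpha, beta):
    """3x3 homogeneous translation: x moves by alpha, y by beta."""
    return np.array([[1.0, 0.0, alpha],
                     [0.0, 1.0, beta ],
                     [0.0, 0.0, 1.0  ]])

p = np.array([2.0, 3.0, 1.0])            # point (x, y) with z = 1
print(translation(5, -1) @ p)            # [7.0, 2.0, 1.0]

# chaining: rotate by 90 degrees, then translate, in a single pass
R = to_homogeneous_3x3(np.array([[0, -1], [1, 0]]))
print(translation(5, -1) @ R @ p)        # rotation and translation combined
```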
Histogram Equalization
To calculate the equalized histogram, use the CDF (Cumulative Distribution Function).
By calculating the CDF we can obtain fN = (L − 1) · CDF for every intensity; rounding each fN result gives the new intensity level.
L = number of intensity levels
fk = frequency of intensity k (the CDF column below is the cumulative sum of fk divided by the total number of pixels)
Example:
Intensity | fk | CDF   | fN = 7·CDF | New intensity | New fk
0         | 2  | 2/25  | 0.56       | 1             | 2
1         | 4  | 6/25  | 1.68       | 2             | 4
2         | 5  | 11/25 | 3.08       | 3             | 5
3         | 2  | 13/25 | 3.64       | 4             | ↓
4         | 3  | 16/25 | 4.48       | 4             | 2 + 3 = 5
5         | 3  | 19/25 | 5.32       | 5             | 3
6         | 3  | 22/25 | 6.16       | 6             | 3
7         | 3  | 25/25 | 7          | 7             | 3
Log Transformation
s = c · log(1 + r), where c is a constant (usually 1) and r ≥ 0 is the input intensity.
This type of transformation is suited for expanding the dark values in an image while compressing the high
intensity values.
We can see from Figure 3.3 that:
● The log function maps a narrow range of low input intensities to a wide range of output levels, and a wide range of high input intensities to a narrow range of output levels.
● The inverse log function does the opposite (low intensities -> narrow output range, high intensities -> wide output range).
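A short sketch (mine; it assumes an 8-bit grayscale image stored as a NumPy array) of the log transform s = c·log(1 + r), with c chosen so the output stays in the 0-255 range:

```python
import numpy as np

def log_transform(img_u8):
    """Apply s = c*log(1 + r) to an 8-bit image, scaled back to 0..255."""
    r = img_u8.astype(np.float64)
    c = 255.0 / np.log(1.0 + 255.0)      # choose c so that 255 maps to 255
    s = c * np.log(1.0 + r)
    return s.astype(np.uint8)

# dark values get stretched, bright values get compressed
print(log_transform(np.array([0, 10, 50, 200, 255], dtype=np.uint8)))
```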
Power Law (Gamma) Transformation
The power-law transformation is s = c·r^γ, where c and γ are positive constants. For γ < 1 it expands dark values (similar to the log transform); for γ > 1 it expands bright values.
The worked example below returns to histogram equalization. The discrete form of its transformation function (the CDF) is:
s_k = T(r_k) = ((L − 1) / (MN)) · Σ_{j=0..k} n_j
where
● r_k is an input pixel with intensity level k (0-255, or 0 to L−1 in general, where L is the number of intensity levels / bins given by the colour bit depth).
● n_j is the number of pixels that have intensity level j.
● M is the number of pixel rows and N is the number of pixel columns (for example, if the image resolution is 640 x 480 then MN = 307200).
● s_k (in the output image) is the intensity that every pixel with value r_k (in the input image) is mapped to.
Example:
Say there is a 3-bit image represented as a 5x5 matrix:
5 6 3 1 5
1 2 5 3 3
6 4 1 7 7
3 4 0 6 2
2 7 5 0 5
We can calculate the frequency of each intensity value:
Intensity:  0  1  2  3  4  5  6  7
Frequency:  2  3  3  4  2  5  3  3
Since a 3-bit image has 8 intensity levels, (L − 1) = (8 − 1) = 7.
MN = 5 × 5 = 25
The equation becomes s_k = (7/25) · Σ_{j=0..k} n_j.
Calculate each s_k from k = 0 to 7:
s0 = 7/25 × 2 = 0.56
s1 = 7/25 × (2 + 3) = 1.4
s2 = 7/25 × (2 + 3 + 3) = 2.24
s3 = 7/25 × (2 + 3 + 3 + 4) = 3.36
s4 = 7/25 × (2 + 3 + 3 + 4 + 2) = 3.92
s5 = 7/25 × (2 + 3 + 3 + 4 + 2 + 5) = 5.32
s6 = 7/25 × (2 + 3 + 3 + 4 + 2 + 5 + 3) = 6.16
s7 = 7/25 × (2 + 3 + 3 + 4 + 2 + 5 + 3 + 3) = 7
Round every fractional result, since pixel values cannot be fractions (IIRC PaoPao said to round down / floor):
s0 = 0 (no change)
s1 = 1 (no change)
s2 = 2 (no change)
s3 = 3 (no change)
s4 = 3 (changed)
s5 = 5 (no change)
s6 = 6 (no change)
s7 = 7 (no change)
Since only intensity 4 is mapped to a different value (3) in the output image, we replace every 4 with 3, and the output image matrix becomes (the two changed entries are the 3s in the second column of rows 3 and 4):
5 6 3 1 5
1 2 5 3 3
6 3 1 7 7
3 3 0 6 2
2 7 5 0 5
(This is not a great example, since the histogram is already fairly balanced to begin with.)
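For reference, a short NumPy sketch (mine, not part of the notes) that reproduces the worked example above, flooring the results as in the example:

```python
import numpy as np

img = np.array([[5, 6, 3, 1, 5],
                [1, 2, 5, 3, 3],
                [6, 4, 1, 7, 7],
                [3, 4, 0, 6, 2],
                [2, 7, 5, 0, 5]])

L = 8                                          # 3-bit image -> 8 intensity levels
hist = np.bincount(img.ravel(), minlength=L)   # [2, 3, 3, 4, 2, 5, 3, 3]
cdf = np.cumsum(hist)                          # cumulative counts
s = np.floor((L - 1) * cdf / img.size).astype(int)   # s_k, floored as in the example

print(s)              # mapping for intensities 0..7 -> [0 1 2 3 3 5 6 7]
print(s[img])         # equalized image: every 4 becomes a 3
```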
Spatial Transformation (Neighbourhood Operations)
Definition of filters
There are two kinds of filter:
● A low-pass filter passes low frequencies; the effect it produces is blurring/smoothing of an image (such filters are also called averaging filters).
● A high-pass filter passes high frequencies; the effect it produces is sharpening (if the result of the filter is added to the original image).
We can achieve these effects by using spatial filters (also called spatial masks). A spatial filter consists of:
1. A neighbourhood, typically a small rectangle.
2. A predefined operation that is performed on the image pixels encompassed by the neighbourhood.
Spatial filtering creates a new pixel (in the output image) whose coordinates equal the coordinates of the center of the neighbourhood. If the operation performed on the image is linear, the filter is called a linear spatial filter; otherwise the filter is nonlinear.
Spatial Correlation and Convolution
There are two methods of spatial filtering:
1. Correlation, the process of moving the filter mask over the image and computing the sum of products at each location.
2. Convolution, which is the same as correlation except that the filter mask is rotated 180 degrees before being applied.
Note that if the filter mask is symmetric, correlation and convolution lead to the same result.
(leftmost column is symmetric)
Here is a step-by-step video on how to convolve a mask with an image:
https://youtu.be/XuD4C8vJzEQ?t=185
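A minimal scipy sketch (my own) of correlation versus convolution; the only difference is the 180° rotation of the mask, so a symmetric mask gives identical results:

```python
import numpy as np
from scipy.ndimage import correlate, convolve

img  = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
mask = np.array([[0., 1., 0.],
                 [1., 2., 1.],
                 [0., 1., 0.]])                   # symmetric mask

corr = correlate(img, mask, mode='constant')      # sum of products at each location
conv = convolve(img,  mask, mode='constant')      # same, but mask rotated 180 degrees

# because this mask is symmetric, both operations give the same answer
print(np.allclose(corr, conv))                    # True

# with a non-symmetric mask the results differ
asym = np.array([[1., 0., 0.],
                 [0., 0., 0.],
                 [0., 0., -1.]])
print(np.allclose(correlate(img, asym, mode='constant'),
                  convolve(img,  asym, mode='constant')))   # False
```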
Smoothing Spatial Filters (averaging)
● Smoothing is analogous to integration.
● Smoothing filters are used for blurring (removal of small details in an image) and noise reduction.
Blurring occurs because replacing each pixel by the average intensity of its neighbourhood reduces sharp transitions in intensity between adjacent pixels, and this also leads to noise reduction. However, edges (which are also characterized by sharp intensity transitions) get blurred as well.
● The mask in Figure 3.32(a) is called a box filter because all the coefficients in the mask are the same.
● The mask in Figure 3.32(b) is called a weighted average filter; this terminology indicates that pixels are multiplied by different coefficients, giving more importance/weight to some pixels (in this case, the closer a pixel is to the center, the bigger its coefficient).
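A small sketch (mine) of the two smoothing masks described above, applied to an image containing a single bright "noise" pixel:

```python
import numpy as np
from scipy.ndimage import correlate

box = np.ones((3, 3)) / 9.0                       # box filter: all coefficients equal

weighted = np.array([[1., 2., 1.],                # weighted average filter:
                     [2., 4., 2.],                # the closer to the center,
                     [1., 2., 1.]]) / 16.0        # the larger the coefficient

noisy = np.zeros((7, 7))
noisy[3, 3] = 9.0                                 # a single bright "noise" pixel

print(correlate(noisy, box, mode='constant')[2:5, 2:5])       # spread evenly over 3x3
print(correlate(noisy, weighted, mode='constant')[2:5, 2:5])  # more weight kept at the center
```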
Edge Detection
How does a computer define an edge? It's the sudden change in colour or intensity of colour.
Mathematical definition: an edge is the zero-crossing point of the second derivative, as illustrated below.
All you have to understand is that the derivative is the gradient/slope at any point in the graph.
The first-derivative graph is obtained by calculating the gradient at every point of the colour-intensity graph.
The second-derivative graph is obtained by calculating the gradient at every point of the first-derivative graph.
First and second order derivative
The first-order derivative of a digital image is defined as the partial derivatives with respect to x and y:
∂f/∂x = f(x+1, y) − f(x, y)
∂f/∂y = f(x, y+1) − f(x, y)
And for the second-order derivative:
∂²f/∂x² = f(x+1, y) + f(x−1, y) − 2f(x, y)
∂²f/∂y² = f(x, y+1) + f(x, y−1) − 2f(x, y)
Laplacian Edge Detection
The second-order derivative in image processing is implemented using the Laplacian operator.
The Laplacian operator is defined as the sum of the second-order derivatives with respect to x and y:
∇²f = ∂²f/∂x² + ∂²f/∂y² = f(x+1, y) + f(x−1, y) + f(x, y+1) + f(x, y−1) − 4f(x, y)
The equation above can be implemented as a 3x3 filter mask:
 0   1   0          1   1   1
 1  −4   1          1  −8   1
 0   1   0          1   1   1
● The left mask does not take the diagonal pixels into account when computing the derivative, and it is invariant to 90-degree rotation.
● The right mask extends the original equation by also taking the diagonal pixels into account, and it is invariant to 90- and 45-degree rotations.
● A rotation-invariant mask is called an isotropic filter.
● We can sharpen an image by adding the result of filtering the image with the Laplacian mask to the original image.
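A sketch (my own) of the two Laplacian masks and of sharpening by adding the filter response back; note that with a negative center coefficient the response has to be subtracted:

```python
import numpy as np
from scipy.ndimage import correlate

# Laplacian masks: without (left) and with (right) the diagonal neighbours
lap4 = np.array([[0.,  1., 0.],
                 [1., -4., 1.],
                 [0.,  1., 0.]])
lap8 = np.array([[1.,  1., 1.],
                 [1., -8., 1.],
                 [1.,  1., 1.]])

img = np.random.default_rng(0).random((32, 32))   # stand-in grayscale image

edges  = correlate(img, lap4, mode='reflect')      # second-derivative response
edges8 = correlate(img, lap8, mode='reflect')      # also responds to diagonal transitions

# Sharpening: add the Laplacian response back to the image. Because the center
# coefficient here is negative, the response is subtracted (equivalently, use a
# mask with a positive center and add it).
sharpened = img - edges
```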
The first-order derivative in image processing is implemented using the Sobel masks, among others.
Sobel Edge Detection
● The mask in the first row, left column computes the derivative in the horizontal direction.
● The mask in the first row, right column computes the derivative in the vertical direction.
● The masks in the second row compute the derivative in the diagonal directions.
● Sobel also smooths the image while differentiating.
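A sketch (mine) of the two main Sobel masks and the resulting gradient magnitude on a synthetic step edge:

```python
import numpy as np
from scipy.ndimage import correlate

sobel_x = np.array([[-1., 0., 1.],      # responds to horizontal intensity changes
                    [-2., 0., 2.],      # (i.e. vertical edges)
                    [-1., 0., 1.]])
sobel_y = sobel_x.T                     # responds to vertical intensity changes

img = np.zeros((16, 16))
img[:, 8:] = 1.0                        # vertical step edge in the middle

gx = correlate(img, sobel_x, mode='reflect')
gy = correlate(img, sobel_y, mode='reflect')
magnitude = np.hypot(gx, gy)            # edge strength at every pixel

print(magnitude[8, 6:10])               # strongest response around column 8
```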
Harris Corner Detection
The Harris detector looks for points where shifting a small window in any direction produces a large change in intensity. This window operation is mathematically defined as:
E(u, v) = Σ_{x,y} w(x, y) · [ I(x + u, y + v) − I(x, y) ]²
where:
● E is the difference between the original and the moved window.
● u is the window's displacement in the x direction
● v is the window's displacement in the y direction
● w(x, y) is the window at position (x, y). This acts like a mask, ensuring that only the desired window is used.
● I is the intensity of the image at a position (x, y)
● I(x+u, y+v) is the intensity of the moved window
● I(x, y) is the intensity of the original
Let's ignore w(x, y) for now and focus on the squared difference:
Σ_{x,y} [ I(x + u, y + v) − I(x, y) ]²
We can approximate I(x + u, y + v) above using a first-order multivariate Taylor expansion:
I(x + u, y + v) ≈ I(x, y) + Ix·u + Iy·v
and the equation becomes:
Σ_{x,y} [ I(x, y) + Ix·u + Iy·v − I(x, y) ]²
In the above equation, I(x, y) cancels out, so expanding the square gives:
Σ_{x,y} ( Ix²·u² + 2·Ix·Iy·u·v + Iy²·v² )
This can be turned into a matrix-vector multiplication (since the summation runs only over x and y, the vector (u, v) and its transpose can be moved outside the summation):
E(u, v) ≈ [u  v] · ( Σ_{x,y} [ Ix², Ix·Iy ; Ix·Iy, Iy² ] ) · [u, v]ᵀ
Now we can extract the matrix in the parentheses, called the structure tensor / second moment matrix, into M (adding w(x, y) back, since it also sits inside the summation):
M = Σ_{x,y} w(x, y) · [ Ix², Ix·Iy ; Ix·Iy, Iy² ]
so that E(u, v) ≈ [u  v] · M · [u, v]ᵀ.
Step 3:
Compute the eigenvalues λ1 and λ2 of every M matrix (one M per x, y coordinate).
Then plug them into the response function:
R = λ1·λ2 − k·(λ1 + λ2)² = det(M) − k·trace(M)²
where k is a small empirically chosen constant (typically around 0.04-0.06); a large R indicates a corner.
Step 4:
Perform NMS on the list of R corner responses to find the best corners and eliminate unnecessary candidates that are not true local maxima (see APPENDIX A).
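Putting the steps together, here is a compact sketch (my own; the constant k, the Gaussian window sigma, and the threshold are illustrative choices) of Harris corner detection with scipy:

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter, maximum_filter

def harris_response(img, k=0.05, sigma=1.0):
    """Harris corner response R = det(M) - k*trace(M)^2 at every pixel."""
    ix = sobel(img, axis=1)                      # horizontal gradient I_x
    iy = sobel(img, axis=0)                      # vertical gradient I_y
    # entries of the second moment matrix M, windowed by a Gaussian w(x, y)
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy * ixy                  # = lambda1 * lambda2
    trace = ixx + iyy                            # = lambda1 + lambda2
    return det - k * trace ** 2

def harris_corners(img, threshold=0.1):
    """Keep responses that are local maxima (a simple non-maximum suppression)."""
    r = harris_response(img)
    local_max = (r == maximum_filter(r, size=3))
    return np.argwhere(local_max & (r > threshold * r.max()))

img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0                            # a white square; its corners should respond
print(harris_corners(img))
```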
Blob Detection
In computer vision, blob detection methods are aimed at detecting regions in a digital image that differ in
properties, such as brightness or color, compared to surrounding regions.
Methods:
● Laplacian of Gaussian (LoG)
● Difference of Gaussians (DoG)
● Determinant of Hessian (DoH)
Laplacian of Gaussian
Given an input image I, create a Gaussian-blurred version of it, G. Applying the Laplacian operator to G (taking the second derivative, which is the very definition of an edge, if you remember) gives you the LoG. (source:
http://fourier.eng.hmc.edu/e161/lectures/gradient/node8.html)
Difference of Gaussians
Given an input image I, create multiple Gaussian-blurred versions of it with different kernel sizes (sigmas) and take the differences between successive pairs; this approximates the Laplacian of Gaussian. (SIFT uses this.)
Determinant of Hessian
Simply put, the Hessian operator is a better version of the Laplacian operator. (SURF uses this.)
(source: https://en.wikipedia.org/wiki/Blob_detection#The_determinant_of_the_Hessian)
This is because the Hessian contains more information: it holds all the possible second-order partial derivatives, whereas the Laplacian only stores their sum. The Hessian matrix looks like this:
H = [ ∂²f/∂x²     ∂²f/∂x∂y ]
    [ ∂²f/∂y∂x    ∂²f/∂y²  ]
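A small sketch (mine) of LoG blob detection using scipy's gaussian_laplace; multiplying by σ² is the usual scale normalisation that makes responses at different scales comparable:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

img = np.zeros((64, 64))
img[28:36, 28:36] = 1.0                       # a bright blob, roughly 8 pixels wide

# LoG: Gaussian smoothing followed by the Laplacian, in a single operator.
# A bright blob on a dark background gives a negative LoG response at its
# center, hence the minus sign. The peak response occurs when sigma roughly
# matches the blob size.
for sigma in (1.0, 2.0, 3.0, 4.0):
    response = -gaussian_laplace(img, sigma) * sigma ** 2   # scale-normalised
    print(sigma, round(float(response.max()), 3))
```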
Hough Transform
The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital
image processing.[1]
The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.
The classical Hough transform was concerned with the identification of lines in the image, but later the Hough transform has been extended to identifying positions of arbitrary shapes, most commonly circles or ellipses.
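To make the voting procedure concrete, here is a minimal pure-NumPy accumulator sketch for line detection (my own illustration; in practice one would typically call cv2.HoughLines on an edge map):

```python
import numpy as np

def hough_lines(edge_img, n_theta=180):
    """Vote in (rho, theta) parameter space for every edge pixel."""
    h, w = edge_img.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))              # 0..179 degrees
    accumulator = np.zeros((2 * diag, n_theta), dtype=int)
    ys, xs = np.nonzero(edge_img)                        # edge pixel coordinates
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        accumulator[rhos + diag, np.arange(n_theta)] += 1   # one vote per theta
    return accumulator, diag

# a diagonal line of edge pixels: its cell should be a clear local maximum
edges = np.eye(50, dtype=bool)
acc, diag = hough_lines(edges)
rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
print(rho_idx - diag, theta_idx)     # rho ~ 0, theta ~ 135 degrees for the line y = x
```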
Image Descriptors
● Most features can be thought of as templates, histograms (counts), or combinations
● The ideal descriptor should be
○ Robust and Distinctive
○ Compact and Efficient
● Most available descriptors focus on edge/gradient information
○ Capture texture information
○ Color rarely used
Main Components
1. Detection: Identify the interest points
2. Description: Extract a feature descriptor vector around each interest point.
3. Matching: Determine correspondence between descriptors in two views
Scale Invariant Feature Transform (SIFT)
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
The scale-invariant feature transform (SIFT) is a feature detection algorithm in computer vision to detect and describe local features in images, published by David Lowe in 1999. SIFT can robustly identify objects even among clutter and under partial occlusion, because the SIFT feature descriptor is invariant to uniform scaling, orientation, and illumination changes, and partially invariant to affine distortion.
Affine distortion example: an image of a fern-like fractal that exhibits affine self-similarity.
SIFT keypoints of objects are first extracted from a set of reference images and stored
in a database. An object is recognized in a new image by individually comparing each feature from the new
image to this database and finding candidate matching features based on Euclidean distance of their
feature vectors.
Key locations are defined as maxima and minima of the result of the difference of Gaussians (DoG) function applied in scale space to a series of smoothed and resampled images.
SIFT uses a modified version of the k-d tree (a binary search tree that stores k-dimensional coordinates) called the best-bin-first search method[14] that can identify the nearest neighbors with high probability using only a
limited amount of computation. The BBF algorithm uses a modified search ordering for the k-d tree
algorithm so that bins in feature space are searched in the order of their closest distance from the query
location. This search order requires the use of a heap-based priority queue for efficient determination of the
search order. The best candidate match for each keypoint is found by identifying its nearest neighbor in the
database of keypoints from training images. The nearest neighbors are defined as the key points with
minimum Euclidean distance from the given descriptor vector. The probability that a match is correct can
be determined by taking the ratio of distance from the closest neighbor to the distance of the second
closest.
Lowe[3] rejected all matches in which the distance ratio is greater than 0.8, which eliminates 90% of the
false matches while discarding less than 5% of the correct matches. To further improve efficiency, the best-bin-first search was cut off after checking the first 200 nearest-neighbor candidates. For
a database of 100,000 keypoints, this provides a speedup over exact nearest neighbor search by about 2
orders of magnitude, yet results in less than a 5% loss in the number of correct matches.
SIFT uses Hough Transform to identify clusters of features with a consistent interpretation by using each
feature to vote for all object poses that are consistent with the feature. When clusters of features are found
to vote for the same pose of an object, the probability of the interpretation being correct is much higher
than for any single feature.
Finally, outliers can now be removed by checking for agreement between each image feature and the model, given the parameter solution. Given the linear least squares solution (linear regression), each match is required to agree within half the error range that was used for the parameters in the Hough transform bins. As outliers are discarded, the linear least squares solution is re-solved with the remaining points, and the process is iterated. If fewer than 3 points remain after discarding outliers, then the match is rejected. In
addition, a top-down matching phase is used to add any further matches that agree with the projected
model position, which may have been missed from the Hough transform bin due to the similarity transform
approximation or other errors.
The final decision to accept or reject a model hypothesis is based on a detailed probabilistic model.[15] This
method first computes the expected number of false matches to the model pose, given the projected size
of the model, the number of features within the region, and the accuracy of the fit. A Bayesian probability
analysis then gives the probability that the object is present based on the actual number of matching
features found. A model is accepted if the final probability for a correct interpretation is greater than 0.98.
Lowe's SIFT based object recognition gives excellent results except under wide illumination variations and
under non-rigid transformations.
SIFT consists of the following steps:
1. Scale-space extrema detection
Use the Difference of Gaussians (DoG) to identify potential interest points that are invariant to scale and orientation
2. Keypoint localization
Reject low contrast points and eliminate edge responses
3. Orientation assignment
Each keypoint is assigned one or more orientations based on local image gradient direction to
achieve invariance to rotation
4. Keypoint descriptor
Compute a descriptor vector for each keypoint such that the descriptor is highly distinctive and partially invariant to the remaining variations
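A minimal OpenCV sketch (mine; it needs the opencv-python package, and the two file names are placeholders for your own images) of SIFT detection, description, and matching with Lowe's ratio test:

```python
import cv2

# placeholder input files; any two photos of the same object will do
img1 = cv2.imread('reference.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('query.jpg', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)    # keypoints + 128-d descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# match each descriptor to its two nearest neighbours (Euclidean distance)
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep a match only if the closest neighbour is clearly
# better than the second closest (ratio threshold 0.8, as in the text above)
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
        good.append(pair[0])

print(len(good), 'matches kept out of', len(matches))
```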
APPENDIX A
Image gradient
Let's define the derivative with respect to x as gx and the derivative with respect to y as gy
(see the First and second order derivative section for an explanation of the derivation).
To find the edge strength and direction at location (x, y) in an image f, we need to compute the gradient:
∇f = [ gx  gy ]ᵀ = [ ∂f/∂x  ∂f/∂y ]ᵀ
The magnitude (length) of the gradient vector, denoted M(x, y), is its Euclidean norm:
M(x, y) = √(gx² + gy²)
The direction of the gradient vector at point (x, y) is given by its angle with respect to the x-axis:
α(x, y) = arctan(gy / gx)
We can use gradient operators to compute edge direction and strength (illustrated below):
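A small NumPy sketch (my own) that computes the gradient magnitude and direction using the forward-difference definitions from the First and second order derivative section:

```python
import numpy as np

def gradient(f):
    """Per-pixel gradient strength M(x, y) and direction alpha(x, y) in degrees."""
    gx = np.zeros_like(f, dtype=float)
    gy = np.zeros_like(f, dtype=float)
    gx[:-1, :] = f[1:, :] - f[:-1, :]          # df/dx = f(x+1, y) - f(x, y)
    gy[:, :-1] = f[:, 1:] - f[:, :-1]          # df/dy = f(x, y+1) - f(x, y)
    magnitude = np.hypot(gx, gy)               # Euclidean length of the gradient
    direction = np.degrees(np.arctan2(gy, gx)) # angle with respect to the x axis
    return magnitude, direction

f = np.zeros((5, 5))
f[:, 2:] = 1.0                                  # a simple step edge
m, a = gradient(f)
print(m[2], a[2])                               # strong response at the step
```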
Non-Maxima suppression
Example:
Let d1, d2, d3, and d4 denote the four basic edge directions for a 3x3 region: horizontal (0° and ±180°), −45°, vertical (+90° and −90°), and +45°, respectively. We can formulate the following non-maxima suppression scheme for a 3x3 region centered at point (x, y):
1. Compute the gradient magnitude M(x, y) and angle α(x, y).
2. Find the direction dk that is closest to α(x, y).
a. For example: if α(x, y) = 20°, then the closest direction to α (the edge normal) is horizontal, since (20 − 0) = 20 while (45 − 20) = 25.
b. Since the edge direction is perpendicular to the edge normal, the edge direction is 0 + 90 = +90° and 0 − 90 = −90° (the vertical direction).
3. If the value of M(x, y) is less than at least one of its two neighbors along that direction, let f(x, y) = 0 (suppression); otherwise, let f(x, y) = M(x, y).
a. Continuing the example in 2, the two neighbors along the vertical direction are (x, y+1) and (x, y−1).
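As a compact sketch (mine, written in the common Canny-style convention of comparing each pixel with its two neighbours along the quantized gradient direction):

```python
import numpy as np

def non_maxima_suppression(magnitude, angle_deg):
    """Canny-style NMS: compare each pixel with its two neighbours along the
    quantized gradient direction; suppress it if it is not a local maximum."""
    out = np.zeros_like(magnitude)
    # neighbour offsets for the four quantized gradient directions
    offsets = {0: (0, 1), 45: (1, 1), 90: (1, 0), 135: (1, -1)}
    # quantize each angle to 0, 45, 90 or 135 degrees
    q = (np.round(np.mod(angle_deg, 180) / 45.0).astype(int) % 4) * 45
    h, w = magnitude.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            di, dj = offsets[int(q[i, j])]
            neighbours = (magnitude[i + di, j + dj], magnitude[i - di, j - dj])
            out[i, j] = magnitude[i, j] if magnitude[i, j] >= max(neighbours) else 0.0
    return out
```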
Characteristics of a Good Feature Detector (may come up in the theory part)
● Repeatability
○ The same feature can be found in several images despite geometric and photometric
transformations
● Saliency
○ Each feature is distinctive
● Compactness and efficiency
○ Many fewer features than image pixels
● Locality
○ A feature occupies a relatively small area of the image; robust to clutter and occlusion
Criteria for Optimal Edge Detection (this too)
● Good detection
○ The optimal detector must minimize the probability of false positives (detecting spurious edges
caused by noise), as well as that of false negatives (missing real edges)
● Good localization
○ The edges detected must be as close as possible to the true edges
● Single response constraint
○ The detector must return one point only for each true edge point, that is, minimize the number of
local maxima around the true edge (created by noise)