Classical Computer Vision - Session 2
Texture features are repeated patterns of local variation in image intensity. They give us
information about the spatial arrangement of colors or intensities in an image. In other
words, texture analysis attempts to quantify intuitive qualities described by terms such
as rough, smooth, silky, or bumpy as a function of the spatial variation in pixel
intensities. This will be done using binary codes, as we will see on the next slides.
Note: A texture feature can't be defined for a single point (pixel).
Images can have the same intensity distribution but different textures, as shown below:
TEXTURE FEATURES
Applications of texture analysis
● Segmenting an image into regions with the same texture (image segmentation).
● Recognizing objects based on their textures.
● Edge detection based on changes in texture (the parts where the texture changes are
mostly edges).
The basic element of texture is called a texel. Texture generally has two components:
● Tone, which is the pixel intensity.
● Structure, which is the spatial relationship between the texels.
LOCAL BINARY PATTERNS (LBP)
Local binary patterns (LBPs) are texture descriptors, introduced in 2002, that work locally
on parts of an image. This local representation is constructed by comparing each pixel
with its surrounding neighbourhood of pixels.
Steps of constructing the LBP descriptor
● Convert the image to grayscale.
● Loop over each pixel and compare it with its 8 neighbours (2⁸ = 256 possibilities).
○ If the neighbour has a lower value, put 1.
○ If the neighbour has a higher value, put 0.
LOCAL BINARY PATTERNS (LBP)
● Convert the LBP code you got to a decimal value, reading the bits in a counter-clockwise manner.
● Set the value of the output image at this pixel location to that decimal value (23 in the
example on the slide), then keep looping over the other pixels and repeat the previous
steps (see the sketch below).
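To make the steps concrete, here is a minimal, unoptimized Python sketch of the basic 3x3 LBP; the function name and the exact bit ordering are illustrative choices, not a reference implementation.

```python
import numpy as np

def lbp_image(gray):
    """Minimal 3x3 LBP sketch (illustrative, not optimized).

    Follows the convention above: a neighbour that is darker than the
    centre pixel contributes a 1, otherwise a 0. The 8 bits are read in a
    fixed (counter-clockwise) order and converted to a decimal value.
    """
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.uint8)
    # Offsets of the 8 neighbours, starting at the top-left and going
    # counter-clockwise around the centre pixel.
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0),
               (1, 1), (0, 1), (-1, 1), (-1, 0)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            centre = gray[y, x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if gray[y + dy, x + dx] < centre:   # neighbour darker -> 1
                    code |= (1 << bit)
            out[y, x] = code
    return out
```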
LOCAL BINARY PATTERNS (LBP)
The output will look something like this, which is not very intuitive to us but makes sense
for computer vision tasks like segmentation.
LOCAL BINARY PATTERNS (LBP)
The problem with this original LBP descriptor is that it can't capture details at varying
scales (it can only sense the variability in a 3x3 neighbourhood), so two parameters are
introduced to account for variable neighbourhood sizes:
● The number of points (P) in a circular neighbourhood to consider.
● The radius (R), which allows us to deal with different scales.
How do we get the values of the points g1, g3, g5, and g7 in figure (a)? Bilinear interpolation!
LOCAL BINARY PATTERNS (LBP)
Bilinear interpolation is a popular method for two-dimensional interpolation on a
rectangle. That is, we assume that we know the values of some unknown function at
four points that form a rectangle. Using bilinear interpolation, we can estimate this
function's value at any point (x, y) inside this rectangle. We will denote this unknown
value by P.
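A small sketch of the idea, assuming the four corner values q11, q21, q12, q22 and the rectangle coordinates are known (all names here are illustrative):

```python
def bilinear(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    """Bilinear interpolation inside the rectangle (x1, y1)-(x2, y2).

    q11 = f(x1, y1), q21 = f(x2, y1), q12 = f(x1, y2), q22 = f(x2, y2).
    Returns the estimated value P = f(x, y) for a point inside the rectangle.
    """
    # Interpolate along x on the bottom and top edges, then along y.
    tx = (x - x1) / (x2 - x1)
    ty = (y - y1) / (y2 - y1)
    bottom = q11 + tx * (q21 - q11)   # value at (x, y1)
    top    = q12 + tx * (q22 - q12)   # value at (x, y2)
    return bottom + ty * (top - bottom)

# Example: estimating P at a non-integer position (1.3, 0.6) from its
# four grid neighbours (illustrative values).
P = bilinear(1.3, 0.6, 1, 0, 2, 1, q11=90, q21=110, q12=100, q22=120)
print(P)   # 102.0
```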
INTEREST POINTS
Look at this red box and ask yourself: can this part be considered an interest point? Is it
distinguishable and unique?
INTEREST POINTS (FOR IMAGE MATCHING)
INTEREST POINTS
To find points that tell us that one image is the same as another, we need features that
are unique and distinguishable, so that we can decide whether the two images match based
on them. These are the interest points: local features (subparts of the image) that are
invariant to transformations, as shown below:
INTEREST POINTS
One of the most used interest points, giving uniqueness and distinguishability, is the
corner. WHY?
INTEREST POINTS
What makes a corner a great interest point that we can match or distinguish based on it?
INTEREST POINTS
CORNER DETECTION (HARRIS)
How are corners detected mathematically?
We measure the change in intensity produced by shifting a small window by (u, v):
E(u, v) = Σ w(x, y) · [I(x + u, y + v) − I(x, y)]²
If the shift (u, v) is small, then we can approximate it using a first-order Taylor
expansion (removing the higher-order terms): I(x + u, y + v) ≈ I(x, y) + Ix·u + Iy·v
Note: Ix means the derivative of I in the x-direction, which responds to vertical edges.
It can be computed easily using a Sobel filter.
CORNER DETECTION (HARRIS)
This leads to the following expression for E(u, v):
E(u, v) ≈ Σ w(x, y) · (Ix·u + Iy·v)²
CORNER DETECTION (HARRIS)
CORNER DETECTION (HARRIS)
Finally we can reach our final equation:
E(u, v) ≈ [u v] · M · [u v]ᵀ, where M = Σ w(x, y) · [Ix² Ix·Iy; Ix·Iy Iy²] (a 2×2 matrix
built from the image derivatives).
Now, what we want to know is which directions give the largest or smallest values of
E(u, v). Eigenvalues can tell us this: let λ1 and λ2 be the two eigenvalues of M.
● If λ1 and λ2 are both small, then E(u, v) is small ⇒ flat region.
● If λ1 >> λ2 or vice versa, the change is in one direction only ⇒ edge.
● If λ1 and λ2 are both large and close to each other (λ1 ~ λ2), then E(u, v) is high in
all directions ⇒ corner.
CORNER DETECTION (HARRIS)
Instead of computing the eigenvalues explicitly, we can use the following equation to get
the corner strength (R):
R = det(M) − k · (trace(M))²
Notes:
● trace(matrix) = the sum of its diagonal elements.
● The k value is empirically between 0.04 and 0.06.
● det(M) = λ1 · λ2
● trace(M) = λ1 + λ2
How do we decide whether a pixel is a corner, an edge, or flat? (See the sketch below.)
● If R is above a threshold, it is a corner (interest point).
● If R is negative, it is an edge (contour).
● If R is small, around 0, it is flat (uniform).
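A hedged sketch using OpenCV's built-in Harris response, cv2.cornerHarris; the file name and the threshold fraction are illustrative choices:

```python
import cv2
import numpy as np

img = cv2.imread("chessboard.png")                       # illustrative file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)

# blockSize: window used to build M, ksize: Sobel aperture, k ~ 0.04-0.06.
R = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep only strong positive responses as corners (R >> 0);
# negative R would indicate edges, R near 0 flat regions.
corners = R > 0.01 * R.max()
img[corners] = (0, 0, 255)   # mark detected corners in red
```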
CORNER DETECTION (HARRIS)
R-Value where red means high and blue means low
CORNER DETECTION (HARRIS)
Thresholding if R > threshold value
CORNER DETECTION (HARRIS)
Now we can say that the Harris detector can find interest points in an image, which we can
use to define pixels or parts that are unique enough to identify an image.
Let us now dive deeper to understand each stage, assuming that we will be detecting faces
in our example images.
HAAR CASCADES
In Haar cascades, we have about 180,000 Haar features used to find suitable features
representing the objects we search for. But first, what do Haar features look like?
HAAR CASCADES
Haar features are broadly classified into three categories. The first set, two-rectangle
features, is responsible for finding edges in a horizontal or vertical direction (as shown
above). The second set, three-rectangle features, is responsible for finding a lighter
region surrounded by darker regions on either side, or vice versa. The third set,
four-rectangle features, is responsible for finding changes in pixel intensity across
diagonals. These are scaled to different sizes and aspect ratios to get our 180,000 Haar
features.
HAAR CASCADES
These Haar features are mainly applied in an iterative manner as a sliding window over the
image:
HAAR CASCADES
What actually happens is that when Haar features are applied to the image as shown in the
previous slide, each feature results in a single value, calculated by subtracting the
average of the pixels under the white rectangle from the average of the pixels under the
black rectangle.
HAAR CASCADES
The objective of this step is to find out whether the image has an edge separating dark
pixels on the right from light pixels on the left. We say that an edge is detected if the
Haar value is close to 1. In the example above, there is no edge, as the Haar value
(-0.02) is far from 1.
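A tiny sketch of how such a feature value could be computed, assuming pixel values are scaled so that 0 means light and 1 means dark (consistent with the slide's example, where -0.02 meant no edge); the function name and window sizes are illustrative:

```python
import numpy as np

def haar_two_rect_value(window):
    """Value of a two-rectangle (vertical edge) Haar feature.

    Assumes `window` is scaled so that 0 means light and 1 means dark, with
    the feature's black rectangle on the right half and the white rectangle
    on the left half. The value is the average under the black rectangle
    minus the average under the white rectangle, so an ideal
    light-left / dark-right edge gives a value close to 1.
    """
    h, w = window.shape
    white = window[:, : w // 2]      # left (light) rectangle
    black = window[:, w // 2 :]      # right (dark) rectangle
    return black.mean() - white.mean()

# Ideal edge: light on the left (0s), dark on the right (1s) -> value ~ 1.
edge = np.hstack([np.zeros((6, 3)), np.ones((6, 3))])
print(haar_two_rect_value(edge))              # 1.0

# Nearly uniform window -> value ~ 0, i.e. no edge detected.
flat = np.full((6, 6), 0.4)
print(round(haar_two_rect_value(flat), 2))    # 0.0
```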
This is just one representation of a particular Haar feature detecting a vertical edge.
There are other Haar features as well, which detect edges in other directions and other
image structures. To detect an edge anywhere in the image, the Haar feature needs to
traverse the whole image.
Now, looping the Haar features over an image involves a lot of mathematical calculations.
As we can see, a single rectangle on either side involves 18 pixel-value additions (for a
rectangle enclosing 18 pixels). Imagine doing this for the whole image with all sizes (to
be explained later) of the Haar features. This would be a heavy operation even for a
high-performance machine. This is the motivation for the second step.
HAAR CASCADES
● Choose the Haar feature with the lowest error rate and take it out of your pool of Haar
features (the remaining Haar features are now 180,000 − 1).
● Increase the weights of the misclassified image samples to stress correcting them in the
next iteration.
● Loop again with all the remaining Haar features on all images, repeat the error
computation, and choose the next Haar feature to retain, until reaching the number of Haar
features needed or the accuracy you are seeking (a minimal sketch of one boosting round
follows the note below).
Note:
● Don't forget that in AdaBoost, weights are given to each predictor (Haar feature) based
on its error rate (performance), so better Haar features have higher weights at inference
time.
● The weights of the data samples are different from the final weights of the predictors:
the sample weights affect the next weak classifier (decision stump), while the predictor
weights affect the final output at the inference stage.
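A minimal sketch of one boosting round under these rules; the array shapes, names, and the classic exponential weight update are assumptions for illustration, not the exact procedure from the slides:

```python
import numpy as np

def adaboost_round(feature_preds, labels, sample_weights):
    """One hedged AdaBoost round over a pool of weak Haar-feature classifiers.

    feature_preds:  (n_features, n_samples) array of +/-1 predictions,
                    one row per candidate Haar-feature classifier.
    labels:         (n_samples,) array of +/-1 ground-truth labels.
    sample_weights: (n_samples,) current sample weights (summing to 1).

    Returns the index of the best feature, its vote weight (alpha) and the
    updated, renormalized sample weights.
    """
    # Weighted error of every candidate weak classifier.
    errors = (feature_preds != labels) @ sample_weights
    best = int(np.argmin(errors))
    err = errors[best]

    # Vote weight of the chosen classifier: lower error -> larger alpha.
    alpha = 0.5 * np.log((1.0 - err) / (err + 1e-12))

    # Increase the weights of the samples the chosen classifier got wrong,
    # decrease the weights of the ones it got right, then renormalize.
    new_w = sample_weights * np.exp(-alpha * labels * feature_preds[best])
    return best, alpha, new_w / new_w.sum()
```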
HAAR CASCADES
Haar features are applied to the images in stages, in the following manner:
● The stages at the beginning contain simpler features, in comparison to the features in
the later stages, which are complex enough to find the nitty-gritty details of the face.
If the initial stage doesn't detect anything in the window, the window itself is discarded
from the remaining process, and we move on to the next window. This way a lot of
processing time is saved, as the irrelevant windows are not processed by the majority of
the stages.
● The second stage's processing starts only when the features of the first stage are
detected in the window. The process continues like this: if one stage passes, the window
is passed on to the next stage; if it fails, the window is discarded (see the sketch
below).
This is a simple visualization; in reality there are many more stages than that and many
more features in each stage.
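A minimal sketch of this early-rejection logic; the stage functions here are placeholders, not the trained classifiers themselves:

```python
def cascade_classify(window, stages):
    """Hedged sketch of cascade evaluation.

    `stages` is a list of functions, each returning True (pass) or
    False (reject) for a window. A window is reported as a face only if
    it passes every stage; it is discarded at the first stage that
    rejects it, which is where the speed-up comes from.
    """
    for stage in stages:
        if not stage(window):
            return False     # rejected early, later stages never run
    return True              # survived all stages -> positive detection
```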
HAAR CASCADES
In the paper, the authors proposed a total of 38 stages for around 6,000 features. The
numbers of features in the first five stages are 2, 10, 25, 25, and 50, and they increase
in the subsequent stages.
The initial stages, with simpler and fewer features, remove most of the windows that do
not contain any facial features while keeping almost all true faces (a low false negative
ratio), whereas the later stages, with more complex and more numerous features, focus on
rejecting the harder non-face windows, achieving a low false positive ratio.
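For completeness, a hedged usage sketch of OpenCV's pretrained Haar cascade face detector; the image file name and parameter values are illustrative:

```python
import cv2

# OpenCV ships pretrained Haar cascades; this loads the frontal-face one.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("people.jpg")                 # illustrative file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scaleFactor controls the image pyramid step, minNeighbors how many
# overlapping detections are required to keep a face.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```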
HAAR CASCADES
FEATURE DESCRIPTOR
A feature descriptor is a representation of an image that simplifies it by extracting only
the most useful information and throwing away everything that is not needed.
Let us take an example:
Can you tell me what you see in the two images below?
FEATURE DESCRIPTOR
Let us make it slightly harder. Now can you tell me what you see in the two images below?
FEATURE DESCRIPTOR
What is the difference between the first pair and the second pair of images?
The first pair carries much more information (colors, shapes, background, etc.), while the
second pair only carries the edges, corners, and shapes. Although the second pair carries
less information, we were still able to say what the object in the image is, because it
carries the most important information, which is sufficient to recognize the object. This
is what feature descriptors are made to do.
Famous feature descriptors:
● HoG (Histogram of Gradients)
● SIFT (Scale Invariant Feature Transform)
● SURF (Speeded-Up Robust Features)
SURF is not covered in this course, but it is a successor of SIFT that uses integral images
convolved with box filters to speed up the computation.
Let us now start with the Histogram of Gradients (HoG).
HISTOGRAM OF GRADIENTS (HOG)
HoG is a feature descriptor that mainly relies on the idea of gradients (magnitudes and
directions) to detect an object in an image. It focuses on representing edges in a better
way, as we will show now.
Steps:
● Preprocess the image.
● Calculate the gradients.
● Calculate the magnitude and the orientation of the gradients.
● Calculate the histogram of the gradients in nxn cells.
● Normalize the gradients in 2n x 2n cells.
● Generate the features for the whole image.
We calculate the gradients in both the x and y directions separately. The same process is
repeated for all the pixels in the image.
HISTOGRAM OF GRADIENTS (HOG)
Step 3: Calculate the magnitude and the orientation
To calculate the magnitude and orientation at each pixel, we use the Pythagorean theorem:
Gradient Magnitude = √(Gx² + Gy²) ⇒ √(11² + 8²) ≈ 13.6
Gradient Orientation = tan⁻¹(Gy / Gx) ⇒ tan⁻¹(8 / 11) ≈ 36°
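A short sketch of steps 2-3 with OpenCV and NumPy; the file name is illustrative, and the Sobel kernel size and the unsigned [0, 180) orientation range are common HoG choices rather than requirements:

```python
import cv2
import numpy as np

gray = cv2.imread("person.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=1)   # horizontal gradient Gx
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=1)   # vertical gradient Gy

magnitude = np.sqrt(gx ** 2 + gy ** 2)
# Unsigned orientation in [0, 180), matching the 9-bin HoG convention.
orientation = np.rad2deg(np.arctan2(gy, gx)) % 180
```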
Step 4: Calculate the histogram of the gradients in nxn cells (in the example below, 5x5)
The simplest method for generating the histogram is just counting the occurrences of each
orientation value, as shown below:
HISTOGRAM OF GRADIENTS (HOG)
The process is repeated for all the pixels' orientations and magnitudes, noting that here
the bin width of the histogram is 1 degree. Hence we get 180 different buckets, each
representing an orientation value. Another method is to create the histogram features with
larger bin widths. By using a bin width of 20 degrees, we get only 9 buckets, as shown
below:
This gives us a 9x1 matrix instead of the 180x1 we got before.
HISTOGRAM OF GRADIENTS (HOG)
As we can notice, the only value taken into consideration so far is the orientation; where
is the magnitude's contribution? We build what we call a weighted histogram.
Note: the higher contribution should go to the bin whose value is closer to the pixel's
orientation.
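A minimal sketch of such a weighted 9-bin histogram for one cell, splitting each magnitude vote between the two nearest bins; the function name and the exact splitting rule are illustrative assumptions:

```python
import numpy as np

def cell_histogram(mag, ang, n_bins=9, bin_width=20):
    """Weighted 9-bin orientation histogram for one cell (sketch).

    Each pixel votes with its gradient magnitude, and the vote is split
    between the two nearest bins (0, 20, ..., 160 degrees) in proportion
    to how close the orientation is to each bin value.
    """
    hist = np.zeros(n_bins)
    for m, a in zip(mag.ravel(), ang.ravel()):
        lo = int(a // bin_width) % n_bins          # lower bin index
        hi = (lo + 1) % n_bins                     # next bin (wraps 160 -> 0)
        frac = (a - lo * bin_width) / bin_width    # distance into the bin
        hist[lo] += m * (1.0 - frac)               # closer bin gets more weight
        hist[hi] += m * frac
    return hist
```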
HISTOGRAM OF GRADIENTS (HOG)
The histograms created in the HOG feature descriptor are not generated for the whole
image. Instead, the image is divided into 8×8 cells, and the histogram of oriented
gradients is computed for each cell. Why do you think this happens?
By doing so, we get the features (or histogram) for the smaller patches which in turn
represent the whole image. We can certainly change this value here from 8 x 8 to 16 x 16
or 32 x 32.
If we divide the image into 8×8 cells and generate the histograms,
we will get a 9 x 1 matrix for each grid (8x8 cell). The histograms are
weighted histograms as we have shown in the previous slide.
HISTOGRAM OF GRADIENTS (HOG)
Step 5: Normalize gradients in 2n x 2n cells
If the cell was 8x8, we normalize the gradients over 16x16 blocks.
Why do we do this step? The gradients of the image are sensitive to the overall lighting.
This means that for a particular picture, some portions of the image could be very bright
compared to other portions. We cannot completely eliminate this from the image, but we can
reduce this lighting variation by normalizing the gradients over 16×16 blocks.
HISTOGRAM OF GRADIENTS (HOG)
How do we normalize a vector of numbers?
Each 8×8 cell has a 9×1 histogram. So for a 16×16 block we have four 9×1 matrices, or a
single 36×1 vector. To normalize this vector, we divide each of its values by the square
root of the sum of the squares of the values.
Remember: the normalization factor is k = √(a₁² + a₂² + a₃² + … + a₃₆²), where aₙ is the
nth value of the 36×1 vector of the 16×16 block we normalize over.
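A small sketch of this normalization for one 16×16 block (names are illustrative):

```python
import numpy as np

def normalize_block(cell_hists, eps=1e-6):
    """L2-normalize one 16x16 block, i.e. four 9x1 cell histograms.

    `cell_hists` is a list/array of four 9-element histograms. They are
    concatenated into a single 36x1 vector and divided by
    k = sqrt(a1^2 + a2^2 + ... + a36^2), as described above.
    """
    v = np.concatenate(cell_hists)          # 36 x 1 block vector
    k = np.sqrt(np.sum(v ** 2)) + eps       # the normalization factor k
    return v / k
```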
HISTOGRAM OF GRADIENTS (HOG)
Step 6: Generate the features for the whole image
Can you guess the total number of features that we will have for the given image, noting
that the cells are 8x8 (hence we normalize over 16x16 blocks) and the size of the image is
64x128?
We have created features for the 16×16 blocks of the image. Now we combine all of these to
get the features for the final image. We have 105 (7x15) blocks of 16×16 for a single
64×128 image.
Each of these 105 blocks has a 36×1 vector of features.
The total number of features for the image is 105 x 36 x 1 = 3780 features.
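As a hedged sanity check, scikit-image's HoG implementation with the same settings produces exactly this number of features:

```python
import numpy as np
from skimage.feature import hog

# 9 orientations, 8x8 cells, 2x2 cells per block, L2 block normalization,
# applied to a random 64x128 image (height x width = 128 x 64).
image = np.random.rand(128, 64)
features = hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2")
print(features.shape)   # (3780,) = 105 blocks x 36 values each
```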
HISTOGRAM OF GRADIENTS (HOG)
What happens if the initial image size increases while the cell size stays the same?
The total number of features representing the image increases as well, which captures more
information from the image but also takes more time.
Now we can say that HoG describes all the edges and their orientations in the image in the
form of important features called feature descriptors.
SCALE INVARIANT FEATURE TRANSFORMATION (SIFT)
SIFT is our second feature descriptor, used widely in image search, object recognition,
and tracking. Where HoG stresses describing edges, SIFT stresses describing interest
points.
Can you tell me a common element in the following pictures?
SCALE INVARIANT FEATURE TRANSFORMATION (SIFT)
You probably said it is the Eiffel Tower. The keen-eyed among you will also have noticed
that each image has a different background, is captured from a different angle, and (in
some cases) has different objects in the foreground. It doesn't matter if the image is
rotated at a weird angle or zoomed in to show only half of the tower; we naturally
understand that the scale or angle of the image may change but the object remains the
same.
SIFT helps us locate these local features, the interest points (key points), in different
images, and we can use this descriptor as features for our image to detect objects. The
major advantage of SIFT features over edge or HoG features is that they are not affected
by the size or orientation of the image (invariant to scale, rotation, and illumination
changes, while also being robust to noise).
SCALE INVARIANT FEATURE TRANSFORMATION (SIFT)
Steps:
● Constructing scale space.
● Laplacian of Gaussian approximation (DoG).
● Finding key (interest) points.
● Eliminate edges and low contrast regions.
● Assign an orientation to the key points.
● Generate the SIFT features.
Actually, we don't apply the Gaussian blur just once but progressively, with increasing
sigmas and on different octaves.
Note: Octaves mean different image scalings. The first octave is the original image, the
second octave is half the size of the first, the third octave is half the size of the
second, and so on.
Why do we want to resize the image? To make our descriptor scale-invariant. This means we
will be searching for these features at multiple scales, by creating a 'scale space'. A
scale space is a collection of images having different scales (different sigmas),
generated from a single image.
How many times do we need to resize the image, and how many subsequent blurred images need
to be created for each resized image? The ideal number of octaves is four, and for each
octave, the number of blurred images is five.
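A hedged sketch of building this scale space and the DoG images with OpenCV; sigma0 = 1.6 and k = √2 are common choices, not values fixed by the slides:

```python
import cv2

def build_scale_space(gray, n_octaves=4, blurs_per_octave=5,
                      sigma0=1.6, k=2 ** 0.5):
    """Sketch of the scale space and DoG pyramid described above.

    For each of the 4 octaves we keep 5 progressively blurred images
    (sigma multiplied by k each time), halving the image size between
    octaves. Subtracting consecutive blurred images gives 4 DoG images
    per octave.
    """
    gray = gray.astype("float32")
    gaussians, dogs = [], []
    for _ in range(n_octaves):
        sigma = sigma0
        octave = []
        for _ in range(blurs_per_octave):
            octave.append(cv2.GaussianBlur(gray, (0, 0), sigma))
            sigma *= k
        gaussians.append(octave)
        # Difference of Gaussians: 5 blurred images -> 4 DoG images.
        dogs.append([octave[i + 1] - octave[i] for i in range(len(octave) - 1)])
        # Next octave: half the size of the current one.
        gray = cv2.resize(gray, (gray.shape[1] // 2, gray.shape[0] // 2))
    return gaussians, dogs
```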
Check the following slide
SCALE INVARIANT FEATURE TRANSFORMATION (SIFT)
TAKE CARE that we do this for all octaves, then take the union of the final key points
from all the included levels (the 2 middle levels of the 4 DoG levels) in all octaves, and
place the key point locations on the original image.
SCALE INVARIANT FEATURE TRANSFORMATION (SIFT)
This histogram will peak at some point. The bin at which we see the peak gives the
orientation of the keypoint. Also, any peak above 80% of the highest peak is converted
into a new keypoint. This new keypoint has the same location and scale as the original but
a different orientation. So orientation can split one keypoint into multiple keypoints.
SCALE INVARIANT FEATURE TRANSFORMATION (SIFT)
Within each 4×4 window, gradient magnitudes and orientations are calculated. These
orientations are put into an 8 bin histogram. Do this for all sixteen 4×4 regions. So you
end up with 16x8 = 128 numbers. Once you have all 128 numbers, you normalize them.
These 128 numbers form the “feature vector”. This keypoint is uniquely identified by this
feature vector. We will have a feature vector (descriptor) for each key point in the image.
Note:
The gradient magnitudes are typically highest around the key point (this is why it was
identified as a key point in the first place), and in the standard SIFT formulation a
Gaussian weighting centred on the key point is applied, so pixels in each 4x4 window that
are closer to the key point contribute more to the histogram (feature vector), which is
what we want.
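Finally, a hedged usage sketch with OpenCV's SIFT implementation (available in OpenCV 4.4+); the file name is illustrative:

```python
import cv2

gray = cv2.imread("eiffel.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# One row per key point; each row is the 128-number feature vector
# (16 sub-regions x 8 orientation bins) described above.
print(len(keypoints), descriptors.shape)   # e.g. N, (N, 128)
```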