
Lecture 7

Image Segmentation
Outline

❖ Fundamentals
❖ Isolated point detection
❖ Edge detection
❖ Segmentation based on thresholding
❖ Analytic element detection by Hough transform
Fundamentals

❖ Segmentation attempts to partition the pixels of an image into groups that strongly correlate with the objects in the image
❖ It is one of the most difficult tasks in image processing
❖ It is typically the first step in any automated computer vision application
Fundamentals—Segmentation Examples
Fundamentals

❖ There are three basic types of grey-level discontinuities that we tend to look for in digital images:
▪ Points
▪ Lines
▪ Edges
❖ We typically find discontinuities using masks and correlation
Isolated Point Detection
❖ Isolated points embedded in areas of constant or nearly constant intensity can be detected with the Laplacian operator
Isolated Point Detection—An Example

There is an isolated black point in the input image. How can we detect it?

[Figure: X‐ray image of a turbine blade]
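The idea can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: the common 3×3 Laplacian mask with −8 at the center, a hand-picked threshold, and a hypothetical helper name `detect_isolated_points`.

```python
# Assumed 3x3 Laplacian mask (8-neighbour variant); not taken from the slides.
LAPLACIAN = [[1, 1, 1],
             [1, -8, 1],
             [1, 1, 1]]

def detect_isolated_points(image, threshold):
    """Return (row, col) of interior pixels whose |Laplacian response| >= threshold."""
    rows, cols = len(image), len(image[0])
    points = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Correlate the mask with the 3x3 neighbourhood centred on (r, c).
            response = sum(
                LAPLACIAN[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3) for j in range(3)
            )
            if abs(response) >= threshold:
                points.append((r, c))
    return points

# Constant background of 100 with a single dark pixel at (2, 2).
image = [[100] * 5 for _ in range(5)]
image[2][2] = 10
points = detect_isolated_points(image, threshold=500)
print(points)
```

The response is large only where a pixel differs from all of its neighbours, so the threshold separates the isolated point from the weaker responses next to it.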
Outline
❖ Fundamentals
❖ Isolated point detection
❖ Edge detection
▪ General concepts
▪ Basic edge detection
▪ More advanced techniques
❖ Segmentation based on thresholding
❖ Analytic element detection by Hough transform
General concepts
❖ What are edges?
▪ Edges are pixels where the image function changes abruptly

❖ Why is edge detection useful?

▪ Neurological and psychophysical research suggests that locations in the image where the function value changes abruptly are important for image perception.
▪ Keeping only the edges significantly reduces the amount of image data, yet in many cases it does not undermine understanding (interpretation) of the image content.
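Implementation in Matlab
close all;
clear all;
im = imread('truong.bmp');        % edge() expects a 2-D grayscale image
%edgeResult = edge(im,'canny');   % alternative: Canny detector
edgeResult = edge(im,'Sobel');    % binary edge map from the Sobel detector
imwrite(edgeResult,'result.bmp');
figure;
imshow(edgeResult,[]);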
Implementation in Matlab
close all;
clear all;
I = rgb2gray(imread('truong.bmp'));
subplot(2, 4, 1), imshow(I); title("Gray Scale Image");

% Sobel Edge Detection
J = edge(I, 'Sobel');
subplot(2, 4, 2), imshow(J); title("Sobel");

% Prewitt Edge Detection
K = edge(I, 'Prewitt');
subplot(2, 4, 3), imshow(K); title("Prewitt");

% Roberts Edge Detection
L = edge(I, 'Roberts');
subplot(2, 4, 4), imshow(L); title("Roberts");

% LoG Edge Detection
M = edge(I, 'log');
subplot(2, 4, 5), imshow(M); title("Log");

% Zerocross Edge Detection
M = edge(I, 'zerocross');
subplot(2, 4, 6), imshow(M); title("Zerocross");

% Canny Edge Detection
N = edge(I, 'Canny');
subplot(2, 4, 7), imshow(N); title("Canny");
Main causes of edges
❖ Depth discontinuity
▪ one surface occludes another
❖ Surface orientation discontinuity
▪ the edge of a block
❖ Reflectance discontinuity
▪ texture or color changes
❖ Illumination discontinuity
▪ shadows
General concepts

There are 3 fundamental steps in edge detection

▪ Image smoothing for noise reduction
▪ Detection of edge points. This is a local operation that extracts from an image all points that are potential candidates to become edge points
▪ Edge localization. This step selects from the candidate edge points only those that are true members of the set of points comprising an edge
Basic Edge Detection
❖ Edges can be regarded as the extrema of the first‐order derivative
❖ Look for points where the gradient magnitude is a maximum along the direction perpendicular to the edge
❖ The direction perpendicular to the edge can be estimated from the direction of the gradient
❖ In the discrete case, 2D gradients are usually computed with gradient operators (masks)
❖ The image gradient is obtained by filtering the image with the gradient operator
Basic Edge Detection—An Example (Sobel)
Implementation in Matlab
im = imread('hustgray.bmp');
im = double(im);
figure;
subplot(2,2,1); imshow(im,[]);
title('original input');

sobelKernelY = [1 2 1;0 0 0; -1 -2 -1];

sobelKernelX = [-1 0 1;-2 0 2; -1 0 1];

derivativeX = imfilter(im, sobelKernelX,'replicate');

subplot(2,2,2); imshow(derivativeX,[]);
title('partial derivative along X direction');

derivativeY = imfilter(im, sobelKernelY,'replicate');

subplot(2,2,3); imshow(derivativeY,[]);
title('partial derivative along Y direction');

gradientMagnitude = sqrt(derivativeX.^2 + derivativeY.^2);

subplot(2,2,4); imshow(gradientMagnitude,[]);
title('gradient magnitude');
Basic Edge Detection—An Example (Sobel)
[Figure: four panels showing the original input image, the partial derivative f_x along the X direction, the partial derivative f_y along the Y direction, and the gradient magnitude]
Basic Edge Detection
❖ Derivative with smoothing
▪ Consider a single row or column of the image: where is the edge?
❖ Finite difference filters respond strongly to noise
▪ Image noise results in pixels that look very different from their neighbors
▪ Generally, the larger the noise, the stronger the response
❖ What is to be done?
▪ Smoothing the image should help, by forcing pixels that differ from their neighbors (noise pixels?) to look more like their neighbors
❖ To find edges, look for peaks in d(f*g)/dx, where f*g denotes the signal f convolved with the smoothing kernel g
Marr‐Hildreth Edge Detector (Zero‐crossings of LoG)
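An example of LoG filter

close all;
clear all;
logFunction = fspecial('log',51,8);   % 51x51 LoG kernel with sigma = 8
figure;
surfl(logFunction);                   % 3-D surface view of the kernel
shading interp
colormap(gray);
figure;
imshow(logFunction,[]);               % the same kernel shown as an image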
Marr‐Hildreth Edge Detector (Zero‐crossings of LoG)

❖ The Marr‐Hildreth edge detection algorithm may be summarized as follows
1. Filter the input image with a Gaussian low‐pass filter
2. Compute the Laplacian of the result of step 1
3. Find the zero‐crossings of the image from step 2
(Steps 1 and 2 can be combined into a single filtering operation with the LoG operator)
Marr‐Hildreth Edge Detector (Zero‐crossings of LoG)
How to identify the zero‐crossing points?

▪ If p is a zero‐crossing point, the signs of at least two of its opposing neighboring pixels must differ
▪ Usually, we also require that the absolute value of their difference exceed a pre‐defined threshold before we call p a zero‐crossing pixel (see the example on the next slide)
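The zero-crossing test described above can be sketched as follows. This is an illustration, not the slide's implementation: the function name `zero_crossings` is hypothetical, the input is assumed to be an already computed LoG response, and the four opposing pairs checked are left/right, up/down and the two diagonals.

```python
def zero_crossings(response, threshold=0.0):
    """Mark interior pixels where some opposing neighbour pair changes sign
    and differs by more than `threshold` in absolute value."""
    rows, cols = len(response), len(response[0])
    # Offsets of the four opposing neighbour pairs around a pixel.
    pairs = [((0, -1), (0, 1)), ((-1, 0), (1, 0)),
             ((-1, -1), (1, 1)), ((-1, 1), (1, -1))]
    out = [[False] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            for (dr1, dc1), (dr2, dc2) in pairs:
                a = response[r + dr1][c + dc1]
                b = response[r + dr2][c + dc2]
                if a * b < 0 and abs(a - b) > threshold:
                    out[r][c] = True
                    break
    return out

# Toy LoG response that changes sign between the second and third columns.
response = [[-5, -4, 3, 6, 7] for _ in range(3)]
marks = zero_crossings(response, threshold=2.0)
print(marks[1])
```

Raising the threshold suppresses sign changes caused by small fluctuations around zero, which is the purpose of the second condition.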
Canny Edge Detection [1]
❖ Although the algorithm is more complex, the performance of the Canny edge detector is in general superior to that of the edge detectors discussed before
❖ It can be summarized in the following steps
1. Smooth the input image with a Gaussian filter
2. Compute the gradient magnitude and angle maps
3. Apply nonmaxima suppression to the gradient magnitude map
4. Use double thresholding and connectivity analysis to detect and link edges

[1] J. Canny, A Computational Approach To Edge Detection, IEEE Trans. Pattern Analysis and Machine Intelligence, 8:679‐714, 1986.
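Step 4 is the least intuitive of the four, so here is a small sketch of double thresholding with connectivity analysis. It assumes a gradient magnitude map that has already gone through steps 1 to 3; the function name `hysteresis`, the 8-connectivity and the threshold values are illustrative choices, not details taken from the slides.

```python
from collections import deque

def hysteresis(magnitude, low, high):
    """Keep pixels >= high, plus pixels >= low that are 8-connected to them."""
    rows, cols = len(magnitude), len(magnitude[0])
    edges = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the search with the strong pixels.
    for r in range(rows):
        for c in range(cols):
            if magnitude[r][c] >= high:
                edges[r][c] = True
                queue.append((r, c))
    # Grow edges from the seeds through weak pixels.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and not edges[rr][cc] and magnitude[rr][cc] >= low):
                    edges[rr][cc] = True
                    queue.append((rr, cc))
    return edges

magnitude = [
    [0, 0, 0, 0, 0],
    [9, 5, 5, 0, 5],
    [0, 0, 0, 0, 0],
]
edges = hysteresis(magnitude, low=4, high=8)
print(edges[1])
```

The weak pixel at the far right is discarded because it is not connected to any strong pixel, while the two weak pixels next to the strong one are kept.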
Implementation Tips
In Matlab, edge detection can be performed with the built‐in function “edge”

close all;
clear all;
im = imread('hustgray.bmp');
%edgeResult = edge(im,'Sobel');
edgeResult = edge(im,'canny');
imwrite(edgeResult,'canny.bmp');
figure;
imshow(edgeResult,[]);
Canny Edge Detection
Edge detection result by using “canny”
Outline

❖ Fundamentals
❖ Isolated point detection
❖ Edge detection
❖ Segmentation based on thresholding
▪ Foundation
▪ Basic global thresholding
▪ Otsu’s optimum global thresholding
▪ Variable thresholding
❖ Analytic element detection by Hough transform
Foundation
❖ Because of its intuitive properties, simplicity of implementation, and computational speed, image thresholding enjoys a central role in image segmentation
Foundation—Thresholding Example
❖ Imagine a poker-playing robot that needs to visually interpret the cards in its hand
❖ If you get the threshold wrong, the results can be disastrous
Foundation- the role of noise in thresholding

Without additional processing, we have little hope of finding a suitable threshold for segmenting the third image
Foundation—the role of illumination

Without additional processing, we have little hope of finding a suitable threshold for segmenting the third image
Foundation—the role of illumination
❖ The success of intensity thresholding is related to the width and depth of the valley(s) separating the histogram modes
❖ The key factors are
▪ The separation between peaks (the further apart the peaks are, the better the chances of separating the modes)
▪ The noise content in the image (the modes broaden as noise increases)
▪ The uniformity of the illumination
▪ The uniformity of the reflectance properties of the image
Basic Global Thresholding
❖ Partition the image histogram using a single global threshold
❖ The success of this technique depends strongly on how well the histogram can be partitioned
Basic Global Thresholding—An Example

Input image
im = (imread('fingerprint.jpg'));
figure; imhist(im);

figure;
subplot(1,2,1); imshow(im,[]);
title('input image');

thresh = mean2(im);             % initial estimate: global mean intensity

done = false;
while ~done
    g = im > thresh;            % split the pixels into two groups
    % new threshold: midpoint of the two group means
    newThresh = 0.5 * (mean(im(g)) + mean(im(~g)));
    done = abs(thresh - newThresh) < 0.5;
    thresh = newThresh;
end

result = im > thresh;

subplot(1,2,2); imshow(result,[]);
title('segmentation result');
Basic Global Thresholding—An Example
[Figure: histogram of the input image; horizontal axis: intensity from 0 to 250; vertical axis: pixel count from 0 to 3500]

Using the basic global thresholding algorithm, threshold = 125.4

[Figure: input image (left) and segmentation result (right)]
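The iterative scheme used in the Matlab listing can also be written in plain Python. This sketch works on a flat list of intensities; the function name `basic_global_threshold` and the guard for an empty group are additions made for the illustration.

```python
def basic_global_threshold(pixels, tol=0.5):
    """Iterate T <- midpoint of the two group means until T changes by less than tol."""
    t = sum(pixels) / len(pixels)          # initial estimate: global mean
    while True:
        above = [p for p in pixels if p > t]
        below = [p for p in pixels if p <= t]
        if not above or not below:         # degenerate case: only one group
            return t
        new_t = 0.5 * (sum(above) / len(above) + sum(below) / len(below))
        if abs(new_t - t) < tol:
            return new_t
        t = new_t

# Two well separated intensity groups.
pixels = [10, 12, 11, 13, 200, 205, 198, 202]
t = basic_global_threshold(pixels)
print(t)
```

With two well separated groups the threshold settles midway between the two group means after a single pass; less clearly bimodal histograms need more iterations and give less reliable results.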

Otsu’s Optimum Global Thresholding
❖ Otsu’s method chooses the threshold that is optimum in the sense that it maximizes the between‐class variance

Implementation Tips
In Matlab, “graythresh” computes Otsu’s threshold
Otsu’s Optimum Global Thresholding—An Example
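To make the "maximize the between-class variance" criterion concrete, here is a small exhaustive-search sketch in Python. It is not the Matlab `graythresh` implementation (which returns a normalized level in [0, 1]); the function name `otsu_threshold` is hypothetical and the result is an integer gray level t, with pixels <= t forming one class.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = 0        # number of pixels in class 0 (intensity <= t)
    sum0 = 0      # sum of intensities in class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant 1/total^2 dropped).
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t

pixels = [10, 12, 11, 13, 200, 205, 198, 202]
t = otsu_threshold(pixels)
print(t)
```

Every level between the two groups gives the same score here, so the search returns the first one; on real histograms the maximum is usually unique.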
Variable Thresholding

❖ Factors such as noise and nonuniform illumination play a major role in the performance of a thresholding algorithm
❖ In some cases (even ones that seem very simple), a global threshold does not work well
❖ That is why we want a variable thresholding scheme
Variable Thresholding based on Local Image Properties

❖ This basic approach uses the standard deviation and mean of the pixels in a neighborhood of every pixel in an image
❖ For each pixel, a distinct threshold is computed
Variable Thresholding based on Local Image Properties

Measure the differences among neighboring pixels

Measure the difference between two image regions using a sliding window
Variable Thresholding—Moving Average

❖ Moving average is a special case of local thresholding, based on computing a moving average along the scan lines of an image
❖ It is especially useful in document processing
❖ The scanning is carried out line by line in a zigzag pattern to reduce illumination bias
Variable Thresholding based on Local Image Properties

❖ The moving average at pixel k is formed by averaging the intensities of that pixel and its n-1 preceding neighbors
❖ Suppose we have a 5×5 image, n = 4
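The zigzag scan and the running average can be sketched in Python as follows. The function name `moving_average_threshold` and the toy image are illustrative; the start-up behaviour (the first n-1 averages still divide by n, as if the signal were preceded by zeros) is an assumption chosen to mimic a causal averaging filter with zero initial state.

```python
def moving_average_threshold(image, n, k):
    """Segment with T = k * (moving average of the last n pixels along a zigzag scan)."""
    rows, cols = len(image), len(image[0])
    # Zigzag scan: even rows left to right, odd rows right to left.
    scan = []
    for r, row in enumerate(image):
        scan.extend(row if r % 2 == 0 else row[::-1])
    result = []
    window_sum = 0.0
    for idx, z in enumerate(scan):
        window_sum += z
        if idx >= n:
            window_sum -= scan[idx - n]   # drop the pixel leaving the window
        result.append(z > k * (window_sum / n))
    # Undo the zigzag to get back to image layout.
    out = []
    for r in range(rows):
        row = result[r * cols:(r + 1) * cols]
        out.append(row if r % 2 == 0 else row[::-1])
    return out

# Bright background (100) with two dark pixels (20).
image = [[100, 100, 20, 100],
         [100, 20, 100, 100]]
mask = moving_average_threshold(image, n=4, k=0.5)
print(mask)
```

True marks pixels brighter than the local threshold, so the two dark pixels come out as False while the background stays True.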
Variable Thresholding—Moving Average

An example
▪ A handwritten text shaded by a spot intensity pattern;
this form of shading is typical of images obtained with
a photographic flash
%This function implements the moving average threshold algorithm.
%It was created by Gonzalez. Save it as movingThresh.m
function g = movingThresh(f,n,K)
%f is the input gray-scale image; n is the number of pixels involved in the
%averaging; K is the constant used for thresholding segmentation
f=single(f);
[M,N]=size(f);
%preliminaries
if (n<1)|| (rem(n,1)~=0)
    error('n must be an integer >=1.')
end
if K<0 || K>1
    error('K must be a fraction in the range [0, 1].')
end
%Flip every other row of f to produce the equivalent of a zig-zag scanning
%pattern. Convert image to a vector
f(2:2:end,:)=fliplr(f(2:2:end,:));
f=f';
f=f(:)';

%compute the moving average

maf=ones(1,n)/n;
ma=filter(maf,1,f);

%perform thresholding
g=f>K*ma;
%go back to the image format
g=reshape(g,N,M);
g = g';
%flip alternate rows back
g(2:2:end,:)=fliplr(g(2:2:end,:));

%This is the demo for the moving average threshold
close all;
clear all
im = rgb2gray(imread('text_ni.bmp'));
figure;
imshow(im,[]);
title('input image');

%Otsu global threshold

ostuThreshold = graythresh(im);
segmentationResultByOstu = im2bw(im, ostuThreshold);
figure;
imshow(segmentationResultByOstu);
title('segmentation result by Otsu global threshold');

%moving average method

numberOfPixelsInvolvedInAveraging = 20;
KForThesholding = 0.5;
movingAveResult = movingThresh(im,numberOfPixelsInvolvedInAveraging,KForThesholding);
figure;
imshow(movingAveResult);
title('segmentation result by moving average');
Variable Thresholding—Moving Average
Another example
❖ Text image corrupted by a sinusoidal intensity variation, typical of the variation that may occur when the power supply in a document scanner is not grounded properly
Voting schemes

❖ Let each feature vote for all the models that are compatible with it
❖ Hopefully the noise features will not vote consistently for any single model
❖ Missing data doesn’t matter as long as there are enough features remaining to agree on a good model
Hough transform
▪ An early type of voting scheme
The general idea of line detection in this algorithm is to create a mapping from the image space (A) to a new space (B) in which each line in space (A) corresponds to a point in space (B).
Parameter space representation
A line in the image corresponds to a point in
Hough space
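The voting idea can be sketched in a few lines of Python. The parameterization is an assumption: the slides do not say which one they use, and this sketch takes the normal form rho = x cos(theta) + y sin(theta) with one-degree angle bins and integer rho bins. The function name `hough_lines` is hypothetical.

```python
import math

def hough_lines(points, n_theta=180):
    """Let every point vote for all (rho, theta-bin) cells of lines passing through it."""
    votes = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, t)] = votes.get((rho, t), 0) + 1
    return votes

# Ten points on the horizontal line y = 3, plus one outlier.
points = [(x, 3) for x in range(0, 100, 10)] + [(7, 8)]
votes = hough_lines(points)
best = max(votes, key=votes.get)
print(best, votes[best])
```

All collinear points agree on one cell (rho = 3 at the 90 degree bin), while the outlier's votes are spread over cells that the other points do not consistently support.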
Analytic element detection—Hough transform
Algorithm outline
Basic illustration
A real example
Incorporating image gradients
Hough transform for finding lines in Matlab
Hough transform for circles
Hough transform – General Procedure
Hough transform—Circles detection
Hough transform: Discussion
Thanks for your attention!
Hanoi University of Science and Technology
Department of Automatic Control

Computer Vision
Lecture 10:
Machine learning background
(part 2)
Van-Truong Pham, PhD
School of Electrical Engineering
Hanoi University of Science and Technology
Site: https://see.hust.edu.vn/pvtruong
Contents

▪ Machine learning in computer vision

▪ Linear module
▪ Linear regression
▪ Gradient descent
▪ Linear classifier
▪ Logistic regression
▪ Softmax
▪ Python demo
Gradient descent (GD)
▪ A common problem in machine learning: finding the global minima of a function
▪ Solving the equation "derivative equals zero" directly can be complicated, or the equation can have infinitely many solutions.
▪ Gradient descent and its variants are a popular approach to solving such optimization problems
Choose a starting point, then move step by step toward the target with each iteration
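Gradient descent (GD)
▪ Update of the variable being sought: $x_{t+1} = x_t - \alpha f'(x_t)$

Note: the gradient of the function f(x) with respect to x is denoted $\nabla_x f(x)$, so the update can also be written

$$x_{t+1} = x_t - \alpha \nabla_x f(x_t), \qquad \nabla_x f(x) = f'(x) = \frac{df(x)}{dx}$$

$\alpha$ is a positive number called the learning rate

The minus sign indicates that $x_t$ must move against the derivative $f'(x_t)$ (hence gradient descent)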
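Gradient descent (GD)
▪ Gradient descent (GD) for the linear regression problem

$$\hat{L}(w) = \frac{1}{n}\sum_{i=1}^{n} (w^T x_i - y_i)^2 = \frac{1}{n}\lVert Xw - Y \rVert_2^2$$

Derivative of the loss with respect to $w$ (the loss gradient). The constant factor $\frac{1}{n}$ is dropped here because it only rescales the step size:

$$\nabla \hat{L}(w) = \nabla_w \lVert Xw - Y \rVert_2^2 = \nabla_w \left[(Xw - Y)^T (Xw - Y)\right]$$
$$= \nabla_w \left[w^T X^T X w - 2 w^T X^T Y + Y^T Y\right]$$
$$= 2X^T X w - 2X^T Y$$

Goal: find $w$ that minimizes $\hat{L}(w)$

• Set an initial estimate for $w$
• At each step, compute $\nabla \hat{L}(w)$ at the value of $w$ from the previous step
• Update with step size $\alpha$ (learning rate):

$$w \leftarrow w - \alpha \nabla \hat{L}(w), \quad \text{i.e.} \quad w_{new} = w_{old} - \alpha \nabla \hat{L}(w_{old})$$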
Gradient descent (GD)
Gradient descent for the linear regression problem
• Prediction function
• Parameters

• Loss function

• Optimization problem

• Gradient descent:
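As an illustration of gradient descent applied to linear regression, the sketch below fits y = w0 + w1·x to a tiny noise-free data set. It assumes the mean squared error loss, so the gradient carries the factor 2/n; the function names, the learning rate and the number of steps are illustrative choices rather than values from the slides.

```python
def gradient(w, xs, ys):
    """Gradient of (1/n) * sum_i (w.x_i - y_i)^2 with respect to w."""
    n = len(xs)
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += 2.0 * err * xj / n
    return g

def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Start from w = 0 and repeatedly step against the gradient."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        g = gradient(w, xs, ys)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

# Data generated from y = 1 + 2x; the constant feature 1 carries the bias w0.
xs = [[1.0, x] for x in (0.0, 1.0, 2.0, 3.0)]
ys = [1.0, 3.0, 5.0, 7.0]
w = gradient_descent(xs, ys)
print(w)
```

With this step size the iterates approach w0 = 1 and w1 = 2; a much larger learning rate would make the same loop diverge.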
Gradient descent (GD)
▪ Partial derivatives with respect to the parameters

• Partial derivative with respect to

Gradient descent (GD)
• Partial derivative with respect to

▪ Initialize w0, w1

▪ Repeat until convergence:

Gradient descent (GD)
Gradient descent with the L1 norm
• When the loss takes the form of an absolute difference:

• Partial derivative with respect to wj

• Repeat until convergence:

Gradient descent (GD)
Gradient descent with multiple variables
• Prediction function
• Parameters

• Prediction function

• Optimization problem

• Gradient descent:
Gradient descent (GD)
Gradient descent: convexity
▪ For a convex function: gradient descent reaches the global solution and, with a suitable learning rate, converges from any initial value

▪ For a non-convex function: whether gradient descent reaches a global or only a local solution depends on the initial value
Gradient descent (GD)
Learning rate
▪ The learning rate $\alpha$ (in neural networks often denoted $\eta$) affects whether GD converges to the solution

▪ If $\alpha$ is too small, the optimization needs very many iterations to converge to the solution

▪ If $\alpha$ is too large, the optimization does not converge (it diverges) and never reaches the minimum
Gradient descent (GD)
▪ Gradient descent
$$w \leftarrow w - \alpha \nabla \hat{L}(w)$$
• Each update of the parameter w uses all the data points $x_i$ (this is called batch gradient descent)

• This is a limitation when the dataset is large (very many points): computing the derivative over all points at once in every iteration costs memory and computation time.

• In online learning, batch GD would recompute the gradient of the loss over the entire dataset, which loses the online (real-time) property

➢ Solution: Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD)
▪ Stochastic gradient descent (SGD)
• Compute the derivative of the loss from only one (or a few) data points, then update the parameters using this derivative.
• After passing through all the data points, the algorithm repeats the process.
- Each parameter update is one iteration

- Each pass through the entire dataset is one epoch

• After each epoch, the order of the data must be shuffled to keep the updates random (stochastic)

▪ SGD update rule:

$$w \leftarrow w - \alpha \nabla_w \hat{L}(w; x_i, y_i)$$

$\hat{L}(w; x_i, y_i)$ is the loss computed from the i-th data pair alone
Mini-batch Gradient Descent
▪ Mini-batch gradient descent (mini-batch GD)
• Use 1 < k < N data points for the update in each iteration.

• Shuffle the data randomly, then split the whole dataset into mini-batches of k data points each.
- In each iteration, one mini-batch is taken to compute the gradient and update the parameters
- One pass through the entire dataset completes one epoch

• One epoch ≈ N/k iterations. The value k is called the batch size
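The shuffle, split and update loop can be sketched as follows for the squared-error loss of linear regression. The function name `minibatch_sgd`, the fixed random seed and the hyperparameter values are assumptions made for the illustration; setting `batch_size=1` gives plain SGD, and `batch_size=len(xs)` gives batch gradient descent.

```python
import random

def minibatch_sgd(xs, ys, batch_size=2, lr=0.05, epochs=500, seed=0):
    """Mini-batch gradient descent on the mean squared error of a linear model."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    w = [0.0] * len(xs[0])
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)               # new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            g = [0.0] * len(w)
            for i in batch:
                err = sum(wj * xj for wj, xj in zip(w, xs[i])) - ys[i]
                for j, xj in enumerate(xs[i]):
                    g[j] += 2.0 * err * xj / len(batch)
            w = [wj - lr * gj for wj, gj in zip(w, g)]   # one iteration
    return w

# Data generated from y = 1 + 2x; the constant feature 1 carries the bias.
xs = [[1.0, x] for x in (0.0, 1.0, 2.0, 3.0)]
ys = [1.0, 3.0, 5.0, 7.0]
w = minibatch_sgd(xs, ys)
print(w)
```

With N = 4 points and k = 2, each epoch consists of N/k = 2 iterations, matching the relation stated above.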
Classification
▪ Classification: predicting a discrete-valued target
• binary classification: predicting a binary-valued target

▪ Examples of binary classification

• predict whether a patient has a disease (COVID or non-COVID; benign or malignant tumor), given the presence or absence of various symptoms
• classify e-mails as spam or non-spam
• …
Classification examples
▪ Example: Disease class prediction
Supervised classification problem: predict the disease class (a discrete value) of a patient given existing medical data features (tumor size).

Is the tumor benign or malignant?

Supervised classification predicts benign.
Classification examples
▪ Examples of binary classification tasks:
Email: Spam (1) or not spam (0)
Online financial transaction: Fraudulent (1) or legitimate (0)

▪ From binary to multi-class classification:

Email: Spam (0), work (1), friends (2), family (3)
Medical diseases: Benign (0), malignant I (1), malignant II (2), malignant III (3)
Classification formulation

Slide credit: X. Bresson

Model representation
Linear classifier
▪ Linear classifier:
• Classify data into labels based on a linear combination of inputs (i.e., use a straight line to separate the groups of data)

• Common linear classifiers: perceptron, logistic regression, SVM, LDA, ...
Linear classifier
▪ As a neuron:

▪ Some activation functions (g):

Perceptron-neuron
▪ The perceptron resembles a biological neuron
Perceptron
▪ Perceptron: a mathematical model of a neuron
▪ Inputs (signals from other neurons): x1, x2, …, xD
▪ Weights (the dendrites): w1, w2, …, wD

▪ Linear activation function: a = w0 + w1x1 + w2x2 + … + wDxD is the output of the perceptron (with the bias input x0 = 1)
Perceptron
▪ Perceptron view
[Diagram: inputs x1, x2, x3, …, xD, each multiplied by its weight w1, w2, w3, …, wD and fed to a single unit. Output: sgn(wx + b)]
Perceptron
▪ Classification
• Given the training data $\{(x_i, y_i)\}$, $i = 1, \dots, n$, with $y_i \in \{-1, 1\}$

• Classification hypothesis: $f_w(x) = \mathrm{sgn}(w^T x)$

The prediction function usually includes an additional bias term b (which makes the model more flexible):

$$f_w(x) = \mathrm{sgn}(w^T x + b)$$

It can be written in the bias-free form above by setting $w' = [w; b]$ and $x' = [x; 1]$
Perceptron
• Loss function (0-1 loss):

$$\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}[\mathrm{sgn}(w^T x_i) \neq y_i]$$

This counts the misclassified samples (as a fraction of n); the goal is to minimize that number
$\mathbb{I}(\cdot)$ or $\mathbf{1}(\cdot)$: indicator function, which takes the value 1 if its argument is true and 0 otherwise

• Hard to optimize (the 0-1 loss is a poor choice for optimization)
Perceptron learning algorithm
❑ Perceptron Algorithm
• Initialize the weights randomly

• For each training sample $(x_i, y_i)$:

If the current prediction $\mathrm{sgn}(w^T x_i)$ does not match $y_i$, i.e. $y_i \neq \mathrm{sgn}(w^T x_i)$, update the weights:
$$w \leftarrow w + \eta\, y_i x_i$$

$\eta$ is the learning rate

In the perceptron learning algorithm, $\eta = 1$ is the usual choice
Perceptron Algorithm
• Prediction function for the perceptron output:
$$\mathrm{label}(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}^T \mathbf{x})$$

▪ The perceptron learning algorithm is only suitable for classification problems in which the data are perfectly linearly separable
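The learning rule can be sketched in Python on a small linearly separable data set. Two choices here deviate from the slides and are made only to keep the example deterministic: the weights start at zero rather than at random values, and sgn(0) is taken to be +1. The function names and the epoch cap are also illustrative.

```python
def sgn(a):
    """Sign function with the convention sgn(0) = +1."""
    return 1 if a >= 0 else -1

def predict(w, x):
    return sgn(sum(wj * xj for wj, xj in zip(w, x)))

def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Update w <- w + eta * y * x on every misclassified sample until none remain."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            if predict(w, x) != y:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:      # a full pass without errors: converged
            break
    return w

# Logical AND with labels in {-1, +1}; the trailing 1 is the bias input.
samples = [([0, 0, 1], -1), ([0, 1, 1], -1), ([1, 0, 1], -1), ([1, 1, 1], 1)]
w = train_perceptron(samples)
print(w)
```

The loop stops after the first pass with no mistakes; on data that is not linearly separable it would simply run until the epoch cap.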
Perceptron vs Linear regression
[Diagrams: perceptron, perceptron (compact form), and linear regression (compact form), each drawn with an input layer and an output layer]

Activation function:

Perceptron: $y = \mathrm{sgn}(z)$
Linear regression: $y = z$
Classification
▪ Last time: binary classification, perceptron algorithm
• Limitations of the perceptron
✓ no guarantees if the data aren't linearly separable
✓ how to generalize to multiple classes?
✓ linear model: no obvious generalization to multilayer neural networks
[Figure: data that are not linearly separable]
▪ Design choices so far
• Task: regression, binary classification, multiclass classification
• Model/Architecture: linear, log-linear
• Loss function: squared error, 0-1 loss, cross-entropy, hinge loss (self-study: support vector machine)
• Optimization algorithm: direct solution, gradient descent, perceptron
Linear classifier
[Diagram: inputs x1, x2, x3, …, xD with weights w1, w2, w3, …, wD feeding a single unit. Perceptron output: sgn(wx + b). Logistic regression output: sigmoid(wx + b)]
Replace the sgn function with the sigmoid function
Logistic regression
Sigmoid function
• The prediction function is bounded in the interval [0, 1]

• Smooth

$$\sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$
Logistic regression
❑ Properties of the sigmoid function
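Logistic regression
▪ Suppose the probability that a data point x falls into the first class is $\sigma(w^T x)$, and the probability that it falls into the other class is $1 - \sigma(w^T x)$

$$P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$
$$P_w(y = -1 \mid x) = 1 - \sigma(w^T x) = \frac{1 + \exp(-w^T x) - 1}{1 + \exp(-w^T x)} = \frac{\exp(-w^T x)}{1 + \exp(-w^T x)} = \frac{1}{\exp(w^T x) + 1} = \sigma(-w^T x)$$

▪ The sigmoid function is symmetric: $1 - \sigma(a) = \sigma(-a)$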
Logistic regression
❑ The sigmoid function for classification: logistic regression

▪ Replace the output (activation function) of the linear classifier with the sigmoid function $\sigma(w^T x)$

▪ Problem: find the model parameters w that minimize the loss function

A loss function of this form has a drawback: gradient descent is not guaranteed to converge to the global solution (it is sensitive to the initial value)

➢ Build the loss function from maximum likelihood estimation instead
Logistic regression
❑ Maximum likelihood estimation (MLE)
▪ Given the training data $\{(x_i, y_i)\}$, $i = 1, \dots, n$
▪ Let $P_\theta(y \mid x)$, $\theta \in \Theta$, be a probability distribution with parameter $\theta$
▪ Apply maximum (conditional) likelihood estimation: the best parameter value is the one that maximizes the likelihood of the whole dataset
$$\theta_{ML} = \arg\max_\theta \prod_i P_\theta(y_i \mid x_i)$$

▪ Without changing the original objective, we can maximize the log of the likelihood
$$\theta_{ML} = \arg\max_\theta \sum_i \log P_\theta(y_i \mid x_i)$$

▪ Optimization problems are usually stated as minimizations. Maximizing the log-likelihood above is equivalent to minimizing the negative log-likelihood (NLL):
$$\theta_{ML} = \arg\min_\theta \left\{ -\sum_i \log P_\theta(y_i \mid x_i) \right\}$$

Note: the log-likelihood turns products into sums
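Logistic regression
❑ Minimizing the negative log-likelihood (NLL):
$$\theta_{ML} = \arg\min_\theta \left\{ -\sum_i \log P_\theta(y_i \mid x_i) \right\}$$

❖ Assume a normal distribution $P_\theta(y \mid x) = \mathrm{Normal}(y; f_\theta(x), \sigma^2)$

$$\log P_\theta(y_i \mid x_i) = \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right)\right]$$
$$= -\frac{1}{2\sigma^2}(y_i - f_\theta(x_i))^2 - \log\sigma - \frac{1}{2}\log(2\pi)$$
Since $\sigma$ is a constant, we drop the factor $\frac{1}{2\sigma^2}$ and the constant terms $-\log\sigma - \frac{1}{2}\log(2\pi)$
The optimal parameters are obtained by minimizing the following objective:
$$\theta_{ML} = \arg\min_\theta \sum_i (y_i - f_\theta(x_i))^2$$

➢ This is exactly the objective considered in linear regression (the parameters $\theta$ are w and b)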
Logistic regression
❑ Building the loss for logistic regression
▪ Given the training data $\{(x_i, y_i)\}$, $i = 1, \dots, n$, drawn from the distribution D
▪ Find w that minimizes the function:

$$\hat{L}(w) = -\frac{1}{n}\sum_{i=1}^{n} \log P_w(y_i \mid x_i)$$

❖ This applies negative log-likelihood (NLL) minimization with the sigmoid function:
$$\theta_{ML} = \arg\min_\theta \left\{ -\sum_i \log P_\theta(y_i \mid x_i) \right\}$$
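Logistic regression
▪ Known: $P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$ and $P_w(y = -1 \mid x) = 1 - \sigma(w^T x)$

▪ Loss function of logistic regression (from here on the negative class is labeled $y_i = 0$ instead of $-1$):

$$\hat{L}(w) = -\frac{1}{n}\sum_{i=1}^{n} \log P_w(y_i \mid x_i) = -\frac{1}{n}\sum_{y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n}\sum_{y_i = 0} \log\left(1 - \sigma(w^T x_i)\right)$$

▪ It can be written in another (more compact) form:

$$\hat{L}(w) = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log \sigma(w^T x_i) + (1 - y_i)\log\left(1 - \sigma(w^T x_i)\right) \right]$$

This is called the binary cross-entropy (BCE) loss, or log loss.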
Logistic regression: Gradient descent
• Prediction function
• Parameters

• Loss function

• Optimization problem

• Gradient descent:
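Since the formulas on these slides did not survive extraction, here is a hedged Python sketch of gradient descent for logistic regression with the binary cross-entropy loss and labels in {0, 1}. It relies on the standard result that the gradient of that loss is the average of (σ(wᵀxᵢ) − yᵢ)·xᵢ; the function names, the learning rate, the step count and the small constant inside the logarithms are illustrative assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def bce_loss(w, xs, ys, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(dot(w, x))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)

def train_logistic(xs, ys, lr=0.5, steps=500):
    """Batch gradient descent using grad = (1/n) * sum_i (sigmoid(w.x_i) - y_i) * x_i."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(xs)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# One feature plus a constant bias input; small values are class 0, large ones class 1.
xs = [[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 3.5]]
ys = [0, 0, 1, 1]
w = train_logistic(xs, ys)
print(bce_loss([0.0, 0.0], xs, ys), bce_loss(w, xs, ys))
```

The loss starts at log 2 for w = 0 and decreases during training; thresholding σ(wᵀx) at 0.5 then gives the predicted class.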
Model with regularization
• To avoid overfitting, a regularization term is usually added (to limit the values of the parameters)

• Gradient descent:
Logistic regression/classification
▪ Decision boundary
Logistic regression/classification
▪ Decision boundary in higher dimensional spaces (d=2 features):
Code Test

Multiclass classification
▪ Classification tasks with more than two categories
Softmax
▪ The softmax expression:
$$y = \mathrm{softmax}(z_1, \dots, z_C) = \left( \frac{\exp(z_1)}{\sum_j \exp(z_j)}, \dots, \frac{\exp(z_C)}{\sum_j \exp(z_j)} \right)$$

The vector of components $z_1, \dots, z_C$ is mapped to a vector of the corresponding probabilities

▪ For classification with C classes we can use softmax. The activation $a_i$ we are looking for then has the form:
$$a_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
$$a_i = P_W(y_i \mid x_i) = \frac{\exp(w_{y_i}^T x_i)}{\sum_j \exp(w_j^T x_i)}$$
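A direct Python transcription of the softmax expression is short. The one addition in this sketch, which is not on the slide, is subtracting the largest score before exponentiating; this leaves the result unchanged but avoids overflow for large scores.

```python
import math

def softmax(z):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(z)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

Larger scores receive larger probabilities, and equal scores receive equal probabilities even when the raw values are too large to exponentiate directly.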
Softmax regression
▪ Cross-entropy loss
• As in logistic regression, we take the negative log-likelihood loss of the softmax:
$$l(W, x_i, y_i) = -\log P_W(y_i \mid x_i) = -\log \frac{\exp(w_{y_i}^T x_i)}{\sum_j \exp(w_j^T x_i)}$$
• This can be viewed as the cross-entropy between the true ("empirical") distribution $\hat{P}(k \mid x_i) = \mathbb{I}[k = y_i]$ and the predicted ("estimated") distribution $P_W(k \mid x_i)$:
$$-\sum_k \hat{P}(k \mid x_i) \log P_W(k \mid x_i)$$

Related: the cross-entropy between two discrete probability vectors p and q is $H(p, q) = -\sum_k p_k \log q_k$
Cross-entropy loss
Another way to arrive at the cross-entropy formula:

▪ Recall the binary cross-entropy (BCE) loss for logistic regression:

$$\text{binary-cross-entropy} = -\frac{1}{N}\sum_{i=1}^{N} \left[ t_i \log p_i + (1 - t_i)\log(1 - p_i) \right]$$
with $p_i = \sigma(w^T x_i)$
▪ For k classes (multiclass):

with $y_i = \mathrm{softmax}(w^T x_i)$
Softmax regression
▪ SGD for the cross-entropy loss
• Minimize the loss with stochastic gradient descent (SGD)

$$l(W, x_i, y_i) = -\log P_W(y_i \mid x_i) = -\log \frac{\exp(w_{y_i}^T x_i)}{\sum_j \exp(w_j^T x_i)} = -w_{y_i}^T x_i + \log \sum_j \exp(w_j^T x_i)$$
• Gradient w.r.t. $w_{y_i}$:
$$-x_i + \frac{\exp(w_{y_i}^T x_i)\, x_i}{\sum_j \exp(w_j^T x_i)} = \left(P_W(y_i \mid x_i) - 1\right) x_i$$
• Gradient w.r.t. $w_k$, $k \neq y_i$:
$$\frac{\exp(w_k^T x_i)\, x_i}{\sum_j \exp(w_j^T x_i)} = P_W(k \mid x_i)\, x_i$$
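Softmax regression
▪ SGD for the cross-entropy loss

• Gradient w.r.t. $w_{y_i}$: $\left(P_W(y_i \mid x_i) - 1\right) x_i$

• Gradient w.r.t. $w_k$, $k \neq y_i$: $P_W(k \mid x_i)\, x_i$

• Update rule:
• For $y_i$:
$$w_{y_i} \leftarrow w_{y_i} + \eta \left(1 - P_W(y_i \mid x_i)\right) x_i$$
• For $k \neq y_i$:
$$w_k \leftarrow w_k - \eta\, P_W(k \mid x_i)\, x_i$$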
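One SGD step with these update rules can be sketched as follows. The representation of W as a list of per-class weight vectors, the function names and the value of the learning rate are assumptions made for the illustration.

```python
import math

def softmax(z):
    m = max(z)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(W, x):
    """P_W(k | x) for every class k, with W given as one weight vector per class."""
    return softmax([sum(wj * xj for wj, xj in zip(w, x)) for w in W])

def loss(W, x, y):
    """Negative log-likelihood of the true class y."""
    return -math.log(class_probs(W, x)[y])

def sgd_step(W, x, y, eta=0.1):
    """Apply w_y += eta*(1 - P(y|x))*x and w_k -= eta*P(k|x)*x for k != y."""
    p = class_probs(W, x)
    new_W = []
    for k, w in enumerate(W):
        coeff = (1.0 - p[k]) if k == y else -p[k]
        new_W.append([wj + eta * coeff * xj for wj, xj in zip(w, x)])
    return new_W

W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # 3 classes, 2 features
x, y = [1.0, 2.0], 2
new_W = sgd_step(W, x, y)
print(loss(W, x, y), loss(new_W, x, y))
```

Starting from zero weights all three classes have probability 1/3, so the true class moves by η·(2/3)·x and the other two by −η·(1/3)·x, which lowers the loss on this sample.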
Softmax regression
▪ The softmax regression model drawn as a neural network:

Softmax regression
▪ Example of classification with softmax

▪ After the model has been trained:

• The label of each new data point is the position of the largest component of the score vector

$$\mathbf{z} = \mathbf{W}^T \mathbf{x}$$
Softmax regression
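Illustration: computing the softmax loss
• The softmax regression loss:
$$L_i = l(W, x_i, y_i) = -\log P_W(y_i \mid x_i) = -\log \frac{\exp(w_{y_i}^T x_i)}{\sum_j \exp(w_j^T x_i)}$$
• Let $s_{y_i} = w_{y_i}^T x_i$ and $s_j = w_j^T x_i$, so that $L_i = -\log \frac{\exp(s_{y_i})}{\sum_j \exp(s_j)}$

• Suppose the score vector is already available:

$$\mathbf{z} = \mathbf{W}^T \mathbf{x}$$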
Contents

▪ Machine learning in computer vision

▪ Linear module
▪ Linear regression
▪ Gradient descent
▪ Linear classifier
▪ Logistic regression
▪ Softmax
▪ Python demo
Demo
Demo 1: Lec3_Lab01_LinearRegression_np.ipynb

Demo 2: Lec3_Lab02_Logistic_Classification.ipynb
Contents

▪ Machine learning in computer vision

▪ Linear module
▪ Linear regression
▪ Gradient descent
▪ Linear classifier
▪ Logistic regression
▪ Softmax
▪ Python demo
▪ Autograd (next lecture)
Hanoi University of Science and Technology
Department of Automatic Control

Computer Vision
Lecture 11:
Backpropagation and
Automatic Differentiation
Van-Truong Pham, PhD
School of Electrical Engineering
Hanoi University of Science and Technology
Site: https://see.hust.edu.vn/pvtruong
Contents

▪ Chain rule
▪ Computational graph
▪ Backpropagation
▪ Automatic differentiation
▪ Autograd
▪ Python demo
Chain rule
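Given: g(f) and f(x)

• Scalar chain rule
$$(g \circ f)'(x) = g'(f(x))\, f'(x)$$
where g ∘ f is a function composition

• Vector chain rule
$$\frac{\partial (g \circ f)}{\partial \mathbf{x}} = \frac{\partial g}{\partial \mathbf{f}}\, \frac{\partial \mathbf{f}}{\partial \mathbf{x}}$$
where we define the gradient as a row vector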

Chain rule
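• Define the variable y as the output of f(x) and the variable z as the output of g(y); then we can write, in Leibniz notation,
$$\frac{dz}{dx} = \frac{dz}{dy}\, \frac{dy}{dx}$$
More precisely,
$$\left.\frac{dz}{dx}\right|_{x} = \left.\frac{dz}{dy}\right|_{y = f(x)} \left.\frac{dy}{dx}\right|_{x}$$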
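A quick numerical check of the scalar chain rule, using an arbitrary pair of functions chosen only for illustration (f(x) = x² and g(y) = sin y are assumptions, not taken from the slides):

```python
import math

def f(x):
    return x ** 2

def g(y):
    return math.sin(y)

def df(x):            # f'(x)
    return 2 * x

def dg(y):            # g'(y)
    return math.cos(y)

def chain_rule_derivative(x):
    """(g o f)'(x) = g'(f(x)) * f'(x)."""
    return dg(f(x)) * df(x)

def numeric_derivative(x, h=1e-6):
    """Central finite difference of g(f(x))."""
    return (g(f(x + h)) - g(f(x - h))) / (2 * h)

x0 = 1.3
analytic = chain_rule_derivative(x0)
numeric = numeric_derivative(x0)
```

The outer derivative is evaluated at y = f(x), not at x; this is the point the evaluated-at form of the Leibniz notation makes explicit.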
Computational graph
▪ A computational graph is a directed graph where the nodes correspond to
operations or variables.
▪ The values that are fed into the nodes and come out of the nodes are called tensors
Computational graph
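▪ An example of a computational graph:
f(x) = x1*x2 + x3*x4
z1 = x1*x2
z2 = x3*x4
f(x) = z1 + z2
[Figure: inputs x1 and x2 feed the node z1 = x1*x2; inputs x3 and x4 feed the node z2 = x3*x4; z1 and z2 feed the output node f = z1 + z2.]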
Computational graph
▪ Forward propagation:
Given: f(x) = x1*x2 + x3*x4, with z1 = x1*x2, z2 = x3*x4, f(x) = z1 + z2
Inputs: x1 = 2, x2 = 3, x3 = 1, x4 = 4
[Figure: the computational graph with the input values 2, 3, 1, 4 on the edges leaving x1, x2, x3, x4.]
Computational graph
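▪ Forward propagation:
Given: f(x) = x1*x2 + x3*x4, with z1 = x1*x2, z2 = x3*x4, f(x) = z1 + z2
Inputs: x1 = 2, x2 = 3, x3 = 1, x4 = 4
z1 = 2*3 = 6
z2 = 1*4 = 4
f = z1 + z2 = 10
[Figure: the computational graph annotated with the forward values: 2, 3, 1, 4 at the inputs, 6 at z1, 4 at z2, and 10 at f.]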
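The forward pass can be sketched as a generic evaluation of a small graph structure. The dictionary encoding below is a hypothetical choice made for illustration, and f is taken as z1 + z2 = x1*x2 + x3*x4, as the graph's nodes indicate.

```python
import operator

# Hypothetical encoding: each computed node maps to (operation, input node names).
GRAPH = {
    "z1": (operator.mul, ("x1", "x2")),
    "z2": (operator.mul, ("x3", "x4")),
    "f": (operator.add, ("z1", "z2")),
}

def forward(graph, inputs, output):
    """Evaluate `output`, computing each node once its inputs are known."""
    values = dict(inputs)

    def value_of(name):
        if name not in values:
            op, parents = graph[name]
            values[name] = op(*(value_of(p) for p in parents))
        return values[name]

    value_of(output)
    return values

values = forward(GRAPH, {"x1": 2, "x2": 3, "x3": 1, "x4": 4}, "f")
print(values["z1"], values["z2"], values["f"])  # 6 4 10
```

Keeping every intermediate value in `values` matters later: the backward pass reuses these stored values when it applies the chain rule.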
Computational graph
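▪ Backward propagation: what is the derivative of f with respect to x?
Given: f(x) = x1*x2 + x3*x4, with z1 = x1*x2, z2 = x3*x4, f(x) = z1 + z2
Forward values: x1 = 2, x2 = 3, x3 = 1, x4 = 4, z1 = 2*3 = 6, z2 = 1*4 = 4, f = z1 + z2 = 10
Local derivatives at the output node: $\frac{\partial f}{\partial z_1} = 1$, $\frac{\partial f}{\partial z_2} = 1$
$$\frac{\partial f(\mathbf{x})}{\partial x_1} = \frac{\partial f}{\partial z_1} \frac{\partial z_1}{\partial x_1} = 1 \cdot x_2 = 3$$
$$\frac{\partial f(\mathbf{x})}{\partial x_2} = \frac{\partial f}{\partial z_1} \frac{\partial z_1}{\partial x_2} = 1 \cdot x_1 = 2$$
$$\frac{\partial f(\mathbf{x})}{\partial x_3} = \frac{\partial f}{\partial z_2} \frac{\partial z_2}{\partial x_3} = 1 \cdot x_4 = 4$$
$$\frac{\partial f(\mathbf{x})}{\partial x_4} = \frac{\partial f}{\partial z_2} \frac{\partial z_2}{\partial x_4} = 1 \cdot x_3 = 1$$
[Figure: the computational graph annotated with forward values (2, 3, 1, 4 at the inputs, 6 at z1, 4 at z2, 10 at f) and backward gradients (3, 2, 4, 1 at x1 to x4, and 1 on each edge into f).]
Green numbers: Forward propagation
Red numbers: Backward propagation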
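The forward and backward passes for this graph can be written out by hand. This is a minimal sketch of reverse-mode differentiation for this one graph, with f taken as z1 + z2 = x1*x2 + x3*x4; the function name `forward_backward` is hypothetical.

```python
def forward_backward(x1, x2, x3, x4):
    # Forward pass: compute and keep the intermediate values.
    z1 = x1 * x2
    z2 = x3 * x4
    f = z1 + z2

    # Backward pass: start from df/df = 1 and apply the chain rule node by node.
    df_df = 1.0
    df_dz1 = df_df * 1.0          # f = z1 + z2, so df/dz1 = 1
    df_dz2 = df_df * 1.0          # and df/dz2 = 1
    grads = {
        "x1": df_dz1 * x2,        # dz1/dx1 = x2
        "x2": df_dz1 * x1,        # dz1/dx2 = x1
        "x3": df_dz2 * x4,        # dz2/dx3 = x4
        "x4": df_dz2 * x3,        # dz2/dx4 = x3
    }
    return f, grads

f_value, grads = forward_backward(2.0, 3.0, 1.0, 4.0)
print(f_value, grads)  # 10.0 {'x1': 3.0, 'x2': 2.0, 'x3': 4.0, 'x4': 1.0}
```

Each gradient is the upstream gradient multiplied by a local derivative, which is the same pattern backpropagation applies to larger graphs.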
Backpropagation Example 1
▪ A simple example of backpropagation
Source: CS231n
Backpropagation

Source: CS231n
Backpropagation Example 2