Ch3-Machine Learning
2023/10/04
Jinyoung Yeo
Department of Artificial Intelligence
Yonsei University
2. Types of Learning
4. Clustering
5. Application Example
6. Linear Regression
9. Application Example
2
What is Machine Learning?
3
What is Machine Learning?
4
Traditional Programming
Data + Program → Computer → Output
Machine Learning
Data + Output → Computer → Program
5
Some more examples of tasks that are best solved by using a
learning algorithm
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant
• Prediction:
– Future stock prices or currency exchange rates
6
Sample Applications
• Web search
• Computational biology
• Finance
• E-commerce
• Robotics
• Information extraction
• Social networks
• Debugging software
• [Your favorite area]
7
Sample Applications
Image Classification
Document Categorization
9
Defining the Learning Task
Improve on task T, with respect to
performance metric P, based on experience E
T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself
11
Types of Learning
12
Supervised Learning: Regression
[Plot: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020]
13
Supervised Learning: Classification
[Plot: tumors labeled 1 (Malignant) or 0 (Benign) against Tumor Size]
14
Supervised Learning: Classification
[Plot: tumors labeled 1 (Malignant) or 0 (Benign) against Tumor Size]
15
Supervised Learning: Classification
[Plot: tumors labeled 1 (Malignant) or 0 (Benign) against Tumor Size]
16
Supervised Learning: Classification
• x can be multi-dimensional
– Each dimension corresponds to an attribute
- Clump Thickness
- Uniformity of Cell Size
- Uniformity of Cell Shape
- Age
- Tumor Size
- …
17
Supervised Learning: Classification
Supervised classification
[Figure: emails sorted into "Spam" and "Not spam"]
Goal: use emails seen so far to produce a good prediction rule for
future data.
18
Supervised Learning: Classification
Represent each message by features (e.g., keywords, spelling, etc.).
[Figure: labeled examples (+ / −) in feature space]
Reasonable RULES:
Predict SPAM if unknown AND (money OR pills)
Predict SPAM if 2·money + 3·pills − 5·known > 0
Linearly separable
19
Supervised Learning: Classification
20
Supervised Learning: Classification
• Weather prediction
• Medicine:
– diagnose a disease
• input: from symptoms, lab measurements, test results, DNA tests, …
• output: one of a set of possible diseases, or “none of the above”
• examples: audiology, thyroid cancer, diabetes, …
– or: response to chemo drug X
– or: will patient be re-admitted soon?
• Computational Economics:
– predict if a stock will rise or fall
– predict if a user will click on an ad or not
• in order to decide which ad to show
21
Supervised Learning: Regression
Stock market
Weather prediction
Temperature
72° F
22
Unsupervised Learning
• Given x_1, x_2, ..., x_n (without labels)
• Output hidden structure behind the x’s
– E.g., clustering
23
Unsupervised Learning
Genomics application: group individuals by genetic similarity
[Figure: heatmap of genes × individuals, grouped by genetic similarity]
24
Unsupervised Learning
26
The Agent-Environment Interface
[Figure: agent-environment loop: at each time step t the agent observes state s_t, takes action a_t, then receives reward r_{t+1} and the next state s_{t+1}, and so on]
27
Reinforcement Learning
https://www.youtube.com/watch?v=4cgWya-wjgY 28
Types of Learning
29
Framing a Learning Problem
30
Developing a Learning System
• Choose the training experience
• Choose exactly what is to be learned
– i.e. the target function
[Figure: Environment/Experience → Training data → Learner → Knowledge]
34
Various Search/Optimization Algorithms
• Gradient descent
– Perceptron
– Backpropagation
• Dynamic Programming
– HMM Learning
– PCFG Learning
• Divide and Conquer
– Decision tree induction
– Rule learning
• Evolutionary Computation
– Genetic Algorithms (GAs)
– Genetic Programming (GP)
– Neuro-evolution
35
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• etc.
36
Clustering
37
Example: Clusters & Outliers
[Scatter plot: points grouped into clusters, with one isolated point marked as an outlier]
38
The Problem of Clustering
• Given a set of points, with a notion of distance between points, group the points into
some number of clusters, so that
✓ Members of a cluster are close/similar to each other
✓ Members of different clusters are dissimilar
• Usually:
✓ Points are in a high-dimensional space
✓ Similarity is defined using a distance measure
▪ Euclidean, Cosine, Jaccard, edit distance, …
39
Clustering is a hard problem!
40
Why is it hard?
41
High Dimensional Data
42
Clustering Problem: Documents
Finding topics:
• Represent a document by a vector (x_1, x_2, …, x_k), where x_i = 1 iff the i-th word
(in some order) appears in the document
✓ It actually doesn’t matter if k is infinite; i.e., we don’t limit the set of words
• Documents with similar sets of words may be about the same topic
43
Cosine, Jaccard, and Euclidean
44
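A minimal sketch (not from the slides) of these three measures on toy binary word-presence vectors; the two example documents and the five-word vocabulary are made-up assumptions for illustration.

```python
import math

# Toy word-presence vectors: x_i = 1 iff the i-th vocabulary word appears.
doc_a = [1, 1, 0, 1, 0]
doc_b = [1, 0, 0, 1, 1]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm = math.sqrt(sum(ai * ai for ai in a)) * math.sqrt(sum(bi * bi for bi in b))
    return dot / norm

def jaccard(a, b):
    """Jaccard similarity on binary vectors (the Jaccard distance is 1 minus this)."""
    inter = sum(1 for ai, bi in zip(a, b) if ai and bi)
    union = sum(1 for ai, bi in zip(a, b) if ai or bi)
    return inter / union

print(euclidean(doc_a, doc_b), cosine(doc_a, doc_b), jaccard(doc_a, doc_b))
```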
Overview: Methods of Clustering
• Hierarchical:
✓ Agglomerative (bottom up):
▪ Initially, each point is a cluster
▪ Repeatedly combine the two
“nearest” clusters into one
✓ Divisive (top down):
▪ Start with one cluster and
recursively split it
• Point assignment:
✓ Maintain a set of clusters
✓ Points belong to “nearest” cluster
45
Hierarchical Clustering
• Key operation:
Repeatedly combine
two nearest clusters
46
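A naive sketch of this agglomerative (bottom-up) idea in the Euclidean case, assuming centroid-based cluster distance and a target number of clusters k as the stopping rule; the example points are taken from the worked example a couple of slides below.

```python
import math

def centroid(cluster):
    """Mean of the points in a cluster, coordinate by coordinate."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def agglomerative(points, k):
    """Naive O(N^3) agglomerative clustering: start with one cluster per point,
    then repeatedly merge the two clusters whose centroids are nearest."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Points from the worked example a couple of slides below.
pts = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
print(agglomerative(pts, 2))
```

As noted later in the slides, a careful implementation with a priority queue brings this down to O(N² log N).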
Hierarchical Clustering
47
Example: Hierarchical clustering
Data points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3)
Centroids (x): (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3)
[Figure: the data points and intermediate centroids, with the resulting dendrogram]
48
And in the Non-Euclidean Case?
• Approach 1:
✓ (1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
✓ (2) How do you determine the “nearness” of clusters? Treat clustroid as if it were
centroid, when computing inter-cluster distances
49
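A small sketch of picking a clustroid under one common reading of "closest" (smallest total distance to the other points); other definitions (smallest maximum distance, smallest sum of squared distances) work the same way. The example cluster is made up.

```python
import math

def clustroid(points, dist=math.dist):
    """Clustroid: the data point with the smallest total distance to the
    other points in the cluster (one common way to define "closest")."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

cluster = [(0, 0), (1, 1), (2, 1), (1, 2)]
print(clustroid(cluster))   # (1, 1): nearest, in total, to the others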
“Closest” Point?
[Figure: a clustroid is an actual data point, while a centroid is an artificial point]
51
Cohesion
• Approach 3.1: Use the diameter of the merged cluster = maximum distance between
points in the cluster
• Approach 3.2: Use the average distance between points in the cluster
• Approach 3.3: Use a density-based approach
✓ Take the diameter or avg. distance, e.g., and divide by the number of points in the
cluster
52
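Hedged sketches of the three cohesion measures above, assuming Euclidean points; the density variant shown (diameter divided by the number of points) is just one of the options the slide mentions.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Approach 3.1: maximum distance between any two points in the cluster."""
    return max(math.dist(p, q) for p, q in combinations(cluster, 2))

def avg_distance(cluster):
    """Approach 3.2: average distance between pairs of points in the cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def density(cluster):
    """Approach 3.3 (one variant): diameter divided by the number of points."""
    return diameter(cluster) / len(cluster)

c = [(0, 0), (1, 1), (2, 1), (1, 2)]
print(diameter(c), avg_distance(c), density(c))
```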
Implementation
• A careful implementation using a priority queue can reduce the time to O(N² log N)
✓ Still too expensive for really big datasets that do not fit in memory
53
K–means Algorithm(s)
54
Populating Clusters
• 1) For each point, place it in the cluster whose current centroid is nearest to it
• 2) After all points are assigned, update the locations of centroids of the k clusters
55
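A plain-Python sketch of the two-step loop above (assign each point to the nearest centroid, then update the centroids); the random initial centroids and the fixed iteration count are simplifying assumptions, real implementations also check for convergence.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: 1) assign each point to its nearest centroid,
    2) recompute each centroid as the mean of its points, and repeat."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 1: assignment
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):       # step 2: update centroids
            if cl:
                centroids[i] = tuple(sum(coord) / len(cl) for coord in zip(*cl))
    return centroids, clusters

pts = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
print(kmeans(pts, 2))
```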
Example: Assigning Clusters
[Scatter plot: data points (x) and centroids; clusters after round 1]
56
Example: Assigning Clusters
[Scatter plot: data points (x) and centroids; clusters after round 2]
57
Example: Assigning Clusters
[Scatter plot: data points (x) and centroids; clusters at the end]
58
Getting the k right
How to select k?
• Try different values of k, looking at the change in the average distance to centroid as k increases
• The average falls rapidly until the right k, then changes little (a small sketch follows this slide)
[Plot: average distance to centroid vs. k; the curve flattens at the best value of k]
59
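One way to produce the curve above, sketched with scikit-learn's KMeans (an assumption for brevity, the slides do not prescribe a library) on the small example point set from the hierarchical-clustering slide: run k-means for increasing k and print the average distance to the assigned centroid, then look for the elbow.

```python
import numpy as np
from sklearn.cluster import KMeans

pts = np.array([[0, 0], [1, 2], [2, 1], [4, 1], [5, 0], [5, 3]], dtype=float)

# Average distance from each point to its assigned centroid, for increasing k.
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    dists = np.linalg.norm(pts - km.cluster_centers_[km.labels_], axis=1)
    print(k, round(dists.mean(), 3))
```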
Example: Picking k
Too few clusters: many long distances to centroid.
[Scatter plot of the data with k too small]
60
Example: Picking k
Just right: distances to centroid are rather short.
[Scatter plot of the data with the right k]
61
Example: Picking k
Too many clusters: little improvement in average distance.
[Scatter plot of the data with k too large]
62
Application Example: Landmark Engine
63
Landmark Engine
64
Why photos?
• People tend to take many photos when they visit popular places.
• As user-generated content (UGC), photos carry rich metadata: the location where they were
taken (geo-coordinates), the time taken, the scene (visual features), tags, …
65
Why photos?
66
Google Data
• Meta-info: a photo is a tuple containing the unique photo ID, tagged GPS coordinates
in terms of latitude and longitude, text tag, and uploader id.
• World-scale: 2240 landmarks from 812 cities in 104 countries
• Large-scale: 21.4 million potential landmark images
67
Our Dataset
• Meta-info: a photo is a tuple containing the unique photo ID, tagged GPS coordinates
in terms of latitude and longitude, text tag, and uploader id.
• World-scale: 2240 landmarks from 812 cities in 104 countries
• Large-scale: 21.4 million potential landmark images
68
Our Dataset
69
Basic Method
• Pipeline
✓ Geographical clustering: Photos taken at the same landmark are likely to be
geographically close.
✓ Visual clustering: Photos taken at the same landmark are likely to be visually
similar, sharing a similar scene.
✓ Textual clustering or matching: Photos taken at the same landmark are likely to
share similar tags.
✓ You can change this pipeline or add other phases; this overview is simple
guidance to get you started.
[Pipeline: photos → Geographical clustering → Visual clustering → Textual clustering/matching → Landmark clusters]
70
Phase I: Geographical clustering
• Photos taken at the same landmark are likely to be geographically close.
• What are clustering algorithms suitable for this?
71
Phase I: Geographical clustering
• Photos taken at the same landmark are likely to be geographically close.
• What are clustering algorithms suitable for this?
72
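The slides leave the choice of algorithm open; a density-based method such as DBSCAN is one reasonable candidate, since the number of landmarks is not known in advance and isolated photos can be treated as noise. A hedged sketch using scikit-learn's DBSCAN with the haversine metric; the coordinates, the 200 m radius, and min_samples are illustrative assumptions, not part of the dataset spec.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# (lat, lon) of photos, in degrees; values are made up for illustration.
coords_deg = np.array([
    [48.8584, 2.2945], [48.8585, 2.2940], [48.8583, 2.2950],   # near one landmark
    [48.8606, 2.3376], [48.8607, 2.3375],                       # near another
])

# The haversine metric expects radians; eps of roughly 200 m on Earth.
earth_radius_m = 6371000.0
db = DBSCAN(eps=200.0 / earth_radius_m, min_samples=2, metric="haversine")
labels = db.fit_predict(np.radians(coords_deg))
print(labels)   # -1 would mark noise (photos not in any geo cluster)
```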
Phase I: Geographical clustering
73
Phase I: Geographical clustering
74
Phase I: Geographical clustering
75
Phase I: Geographical clustering
76
Phase II: Visual clustering
77
Phase II: Visual clustering
*Bundler : http://phototour.cs.washington.edu/bundler/
78
Phase II: Visual clustering
[Figure: 3D reconstruction of an object from photo 1, photo 2, and photo 3]
*SIFT : http://www.cs.ubc.ca/~lowe/keypoints/
79
Phase II: Visual clustering
[Figure: visually matched photos grouped into one object cluster (the Eiffel Tower?)]
80
Phase II: Visual clustering
82
Phase II: Visual clustering
83
Phase III: Textual clustering/matching
• Photos taken at the same landmark are likely to share similar tags
[Figure: each photo cluster is represented by the bag of words of its tags; clusters with similar bags are clustered or merged]
84
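A rough sketch of the merging idea: represent each candidate cluster by the bag of tags of its photos and greedily merge clusters whose bags overlap enough (Jaccard similarity). The photo dictionaries, the "tags" field, and the 0.5 threshold are assumptions for illustration.

```python
def tag_bag(photos):
    """Bag of words: the union of all tags attached to the photos in a cluster."""
    return set(tag for p in photos for tag in p["tags"])

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def merge_similar(clusters, threshold=0.5):
    """Greedily merge clusters whose tag bags overlap enough."""
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if jaccard(tag_bag(clusters[i]), tag_bag(clusters[j])) >= threshold:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

c1 = [{"tags": ["eiffel", "tower", "paris"]}]
c2 = [{"tags": ["eiffel", "paris", "night"]}]
print(len(merge_similar([c1, c2])))   # 1: the two clusters are merged
```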
Phase III: Textual clustering/matching
85
Phase III: Textual clustering/matching
86
Phase III: Textual clustering/matching
87
Phase III: Textual clustering/matching
88
Phase III: Textual clustering/matching
https://radimrehurek.com/gensim/models/word2vec.html
89
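A short usage sketch of the gensim Word2Vec model linked above, treating each photo's tag list as a "sentence" so that tags used at the same landmark end up with similar vectors; the tag lists and hyperparameters are illustrative assumptions, and the gensim 4.x parameter names are assumed (older versions use `size` instead of `vector_size`).

```python
from gensim.models import Word2Vec

# Each photo's tag list is treated as a "sentence"; the tags here are made up.
tag_lists = [
    ["eiffel", "tower", "paris", "night"],
    ["eiffel", "paris", "france"],
    ["louvre", "museum", "paris"],
]

model = Word2Vec(tag_lists, vector_size=50, window=3, min_count=1, epochs=20)
print(model.wv.similarity("eiffel", "tower"))
print(model.wv.most_similar("paris", topn=3))
```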
Linear Regression
90
Predicting exam score: regression
x (hours) y (score)
10 90
9 80
3 50
2 30
91
Regression (data)
x  y
1  1
2  2
3  3
[Plot: the three points on X-Y axes]
92
(Linear) Hypothesis
[Plot: the data points with candidate hypothesis lines on X-Y axes]
93
(Linear) Hypothesis
H(x) = Wx + b
[Plot: the hypothesis line H(x) = Wx + b over the data points]
94
Cost Function
[Plot: the data points and a hypothesis line; the cost measures how far H(x) is from y]
95
Cost Function and Optimization
96
Optimization
97
Optimization
x  y
1  1
2  2
3  3
• W = 1, cost(W) = 0
• W = 0, cost(W) = 4.67
• W = 2, cost(W) = 4.67
98
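A quick check of the numbers above, assuming the cost is the mean squared error of H(x) = Wx with the bias b fixed at 0; this is how the values 0 and 4.67 come out for this toy data.

```python
x = [1, 2, 3]
y = [1, 2, 3]

def cost(W):
    """Mean squared error of H(x) = W*x (bias fixed at 0) over the data."""
    return sum((W * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

for W in (0.0, 1.0, 2.0):
    print(W, round(cost(W), 2))   # 4.67 at W = 0 and W = 2, 0 at W = 1
```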
Optimization
99
Gradient descent algorithm
101
Gradient descent algorithm
102
Gradient descent algorithm
103
Gradient descent algorithm
104
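A minimal gradient-descent sketch for the same one-variable setup (H(x) = Wx, bias omitted); the starting point W = 5 and the learning rate 0.1 are arbitrary choices for illustration.

```python
x = [1, 2, 3]
y = [1, 2, 3]

W = 5.0        # start far from the optimum
alpha = 0.1    # learning rate
m = len(x)

for step in range(50):
    # d/dW of the mean squared error cost: (2/m) * sum((W*x_i - y_i) * x_i)
    grad = (2.0 / m) * sum((W * xi - yi) * xi for xi, yi in zip(x, y))
    W -= alpha * grad

print(W)       # converges towards 1.0
```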
Convex function
105
Multivariable Linear Regression
106
Predicting exam score:
regression using one input (x)
one-variable / one-feature
x (hours)  y (score)
10  90
9   80
3   50
2   60
11  40
107
Predicting exam score:
regression using three inputs (x1, x2, x3)
multi-variable/feature
x1 (quiz 1) x2 (quiz 2) x3 (midterm 1) Y (final)
73 80 75 152
93 88 93 185
89 91 90 180
96 98 100 196
73 66 70 142
Test Scores for General Psychology
108
Hypothesis and Cost Function
109
Multi-variable
110
Matrix Multiplication
111
Matrix Multiplication
x1 x2 x3 Y
73 80 75 152
93 88 93 185
89 91 90 180
96 98 100 196
73 66 70 142
Test Scores for General Psychology
112
Matrix Multiplication
x1 x2 x3 Y
73 80 75 152
93 88 93 185
89 91 90 180
96 98 100 196
73 66 70 142
Test Scores for General Psychology
113
Matrix Multiplication
• instances
114
Matrix Multiplication
• n instances, output
[n, 3] × [?, ?] = [n, 2]
115
Matrix Multiplication
• n instances: the weight matrix is [3, 2], so [n, 3] × [3, 2] = [n, 2]
116
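A small numpy check of the shape question above, using the Test Scores table as X; the weight values are random since only the shapes matter here: the missing shape is [3, 2], so that [n, 3] × [3, 2] = [n, 2].

```python
import numpy as np

X = np.array([[73, 80, 75],
              [93, 88, 93],
              [89, 91, 90],
              [96, 98, 100],
              [73, 66, 70]], dtype=float)   # [n, 3]: n instances, 3 features

W = np.random.randn(3, 2)                    # the "?" shape is [3, 2]
H = X @ W                                    # [n, 3] @ [3, 2] = [n, 2]
print(H.shape)                               # (5, 2)
```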
Logistic Regression and Multinomial Classification
117
Regression: [Plot: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year]
Classification: [Plot: malignant (1) vs. benign (0) against Tumor Size]
118
Sigmoid
Y = WᵀX
H(x) = 1 / (1 + e^(−Wᵀx))
119
Cost(H(x), y) = −log(H(x))       if y = 1
Cost(H(x), y) = −log(1 − H(x))   if y = 0
121
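A small sketch of the sigmoid hypothesis and the piecewise cost above (the cross-entropy loss named on the next slide); the tumor-size numbers and the weights, with the bias folded in as an extra column of ones, are made-up assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(W, X):
    """H(x) = 1 / (1 + exp(-W^T x)), applied row by row."""
    return sigmoid(X @ W)

def cross_entropy(h, y):
    """-log(H(x)) when y = 1, -log(1 - H(x)) when y = 0, averaged over examples."""
    return float(np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h)))

# Toy data: first column is a constant 1 (bias), second is tumor size.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 5.0], [1.0, 6.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])          # 0 = benign, 1 = malignant
W = np.array([-3.5, 1.0])                    # illustrative weights
print(cross_entropy(hypothesis(W, X), y))    # about 0.14
```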
Regression and Classification Loss Functions
Mean Squared Error (MSE)
Cross Entropy
122
Application Example
123
Thank you!
124