
Ver.

2023/10/04

Introduction to Machine Learning

Jinyoung Yeo
Department of Artificial Intelligence
Yonsei University

AAI2250: Introduction to Artificial Intelligence


Outline

1. What is Machine Learning?

2. Type of Learning

3. Framing a Learning Problem

4. Clustering

5. Application Example

6. Linear Regression

7. Multivariable Linear Regression

8. Logistic Regression and Multinomial Classification

9. Application Example

2
What is Machine Learning?

3
What is Machine Learning?

• “Learning is any process by which a system improves


performance from experience.”
- Herbert Simon

• Definition by Tom Mitchell (1998):


• Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
• A well-defined learning task is given by <P, T, E>.

4
Traditional Programming

Data + Program → Computer → Output

Machine Learning

Data + Output → Computer → Program

5
Some more examples of tasks that are best solved by using a
learning algorithm

• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant
• Prediction:
– Future stock prices or currency exchange rates

6
Sample Applications

• Web search
• Computational biology
• Finance
• E-commerce
• Robotics
• Information extraction
• Social networks
• Debugging software
• [Your favorite area]

7
Sample Applications

• Image Classification
• Document Categorization
• Speech Recognition
• Protein Classification
• Spam Detection
• Branch Prediction
• Fraud Detection
• Natural Language Processing
• Playing Games
• Computational Advertising
8
Machine Learning is Changing the World
“Machine learning is the hot new thing”
(John Hennessy, President, Stanford)

“A breakthrough in machine learning would be worth ten Microsofts”
(Bill Gates, Microsoft)

“Web rankings today are mostly a matter of machine learning”


(Prabhakar Raghavan, VP Engineering at Google)

9
Defining the Learning Task
Improve on task T, with respect to
performance metric P, based on experience E
T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself

T: Recognizing hand-written words


P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words

T: Driving on four-lane highways using vision sensors


P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while
observing a human driver.

T: Categorize email messages as spam or legitimate.


P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels
10
Type of Learning

11
Types of Learning

• Supervised (inductive) learning


– Given: training data + desired outputs (labels)
• Unsupervised learning
– Given: training data (without desired outputs)
• Semi-supervised learning
– Given: training data + a few desired outputs
• Reinforcement learning
– Rewards from sequence of actions

12
Supervised Learning: Regression

• Given (x1, y1), (x2, y2), ..., (xn, yn)

• Learn a function f(x) to predict y given x
– y is real-valued == regression

[Plot: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020]
13
Supervised Learning: Classification

• Given (x1, y1), (x2, y2), ..., (xn, yn)

• Learn a function f(x) to predict y given x
– y is categorical == classification

[Plot: Breast Cancer (Malignant / Benign): label 1 (Malignant) or 0 (Benign) vs. Tumor Size]

14
Supervised Learning: Classification

• x can be multi-dimensional
– Each dimension corresponds to an attribute

– e.g., Age, Tumor Size, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, …

17
Supervised Learning: Classification

• Decide which emails are spam and which are important.

Supervised classification: emails labeled “spam” / “not spam”

Goal: use emails seen so far to produce a good prediction rule for
future data.
18
Supervised Learning: Classification
Represent each message by features. (e.g., keywords, spelling, etc.)

Reasonable RULES:
– Predict SPAM if unknown AND (money OR pills)
– Predict SPAM if 2·money + 3·pills − 5·known > 0

[Scatter plot: labeled examples (+ spam, − not spam) that are linearly separable]
19
Supervised Learning: Classification

Handwritten digit recognition


(convert hand-written digits to
characters 0..9)

Face Detection and Recognition

20
Supervised Learning: Classification
• Weather prediction

• Medicine:
– diagnose a disease
• input: from symptoms, lab measurements, test results, DNA tests, …
• output: one of a set of possible diseases, or “none of the above”
• examples: audiology, thyroid cancer, diabetes, …
– or: response to chemo drug X
– or: will patient be re-admitted soon?

• Computational Economics:
– predict if a stock will rise or fall
– predict if a user will click on an ad or not
• in order to decide which ad to show
21
Supervised Learning: Regression

Stock market

Weather prediction: predict the temperature (e.g., 72° F) at any given location

22
Unsupervised Learning
• Given x1, x2, ..., xn (without labels)
• Output hidden structure behind the x’s
– E.g., clustering

23
Unsupervised Learning
Genomics application: group individuals by genetic similarity
[Figure: gene-expression heatmap, Genes × Individuals, grouped by genetic similarity]
24
Unsupervised Learning

Clustering: discovering structure in data (only unlabeled data)


• E.g., cluster users of social networks by interest (community detection).
[Figures: Facebook network, Twitter network]

• Other examples: social network analysis, market segmentation, astronomical data analysis


25
Reinforcement Learning

• Given a sequence of states and actions with


(delayed) rewards, output a policy
– Policy is a mapping from states → actions that tells
you what to do in a given state
• Examples:
– Credit assignment problem
– Game playing
– Robot in a maze
– Balance a pole on your hand

26
The Agent-Environment Interface

Agent and environment interact at discrete time steps: t = 0, 1, 2, ...

Agent observes state at step t:   s_t ∈ S
produces action at step t:        a_t ∈ A(s_t)
gets resulting reward:            r_{t+1} ∈ ℝ
and resulting next state:         s_{t+1}

... s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → ...
27
Reinforcement Learning

https://www.youtube.com/watch?v=4cgWya-wjgY 28
Types of Learning

• Supervised learning
– Decision tree
– Linear regression
– Logistic regression
– Support vector machines & kernel methods
– Model ensembles
– Neural networks & deep learning

• Unsupervised learning
– Clustering
– Dimensionality reduction

• Reinforcement learning
– Q learning
29
Framing a Learning Problem

30
Developing a Learning System
• Choose the training experience
• Choose exactly what is to be learned
– i.e., the target function
• Choose how to represent the target function
• Choose a learning algorithm to infer the target function from the
experience

[Diagram: Environment/Experience → Training data → Learner → Knowledge → Performance Element; Testing data is fed to the Performance Element]

• We generally assume that the training and test examples


are independently drawn from the same overall
distribution of data
– We call this “i.i.d” which stands for “independent and identically
distributed” 31
Developing a Learning System

• Understand domain, prior knowledge, and goals


• Data integration, selection, cleaning, pre-processing, etc.
• Learn models
• Interpret results
• Consolidate and deploy discovered knowledge 32
Developing a Learning System

• Every ML algorithm has three components:


– Representation
– Optimization
– Evaluation
33
Various Function Representations
• Numerical functions
– Linear regression
– Neural networks
– Support vector machines
• Symbolic functions
– Decision trees
– Rules in propositional logic
– Rules in first-order predicate logic
• Instance-based functions
– Nearest-neighbor
– Case-based
• Probabilistic Graphical Models
– Naïve Bayes
– Bayesian networks
– Hidden-Markov Models (HMMs)
– Probabilistic Context Free Grammars (PCFGs)
– Markov networks

34
Various Search/Optimization Algorithms

• Gradient descent
– Perceptron
– Backpropagation
• Dynamic Programming
– HMM Learning
– PCFG Learning
• Divide and Conquer
– Decision tree induction
– Rule learning
• Evolutionary Computation
– Genetic Algorithms (GAs)
– Genetic Programming (GP)
– Neuro-evolution

35
Evaluation

• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• etc.

36
Clustering

37
Example: Clusters & Outliers

[Scatter plot: points forming several groups (labeled Cluster), plus one isolated point labeled Outlier]

38
The Problem of Clustering

• Given a set of points, with a notion of distance between points, group the points into
some number of clusters, so that
✓ Members of a cluster are close/similar to each other
✓ Members of different clusters are dissimilar
• Usually:
✓ Points are in a high-dimensional space
✓ Similarity is defined using a distance measure
▪ Euclidean, Cosine, Jaccard, edit distance, …

39
Clustering is a hard problem!

40
Why is it hard?

• Clustering in two dimensions looks easy


• Clustering small amounts of data looks easy
• And in most cases, looks are not deceiving

• Many applications involve not 2, but 10 or 10,000 dimensions


• High-dimensional spaces look different: Almost all pairs of points are at about the
same distance

41
High Dimensional Data

• Given a cloud of data points we want to understand its structure

42
Clustering Problem: Documents

Finding topics:
• Represent a document by a vector
(x1, x2, …, xk), where xi = 1 iff the i-th word
(in some order) appears in the document
✓ It actually doesn’t matter if k is infinite; i.e., we don’t limit the set of words

• Documents with similar sets of words may be about the same topic

43
Cosine, Jaccard, and Euclidean

• We have a choice when we think of documents as sets of words:


✓ Sets as vectors: Measure similarity by the cosine distance
✓ Sets as sets: Measure similarity by the Jaccard distance
✓ Sets as points: Measure similarity by Euclidean distance

44
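As a rough illustration (not from the slides), all three measures can be computed directly on small bag-of-words representations; the toy documents below are made up:

```python
# Minimal sketch: cosine, Jaccard, and Euclidean on two toy "documents".
import numpy as np

def cosine_distance(a, b):
    # 1 - cos(angle) between the two count vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def jaccard_distance(set_a, set_b):
    # 1 - |intersection| / |union| of the two word sets
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

doc1 = {"machine", "learning", "clustering"}
doc2 = {"machine", "learning", "regression"}
vocab = sorted(doc1 | doc2)
v1 = np.array([1.0 if w in doc1 else 0.0 for w in vocab])
v2 = np.array([1.0 if w in doc2 else 0.0 for w in vocab])

print(cosine_distance(v1, v2), jaccard_distance(doc1, doc2), euclidean_distance(v1, v2))
```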
Overview: Methods of Clustering

• Hierarchical:
✓ Agglomerative (bottom up):
▪ Initially, each point is a cluster
▪ Repeatedly combine the two
“nearest” clusters into one
✓ Divisive (top down):
▪ Start with one cluster and
recursively split it

• Point assignment:
✓ Maintain a set of clusters
✓ Points belong to “nearest” cluster

45
Hierarchical Clustering

• Key operation:
Repeatedly combine
two nearest clusters

• Three important questions:


1) How do you represent a cluster of more than one point?
2) How do you determine the “nearness” of clusters?
3) When to stop combining clusters?

46
Hierarchical Clustering

• Key operation: Repeatedly combine two nearest clusters


• (1) How to represent a cluster of many points?
✓ Key problem: As you merge clusters, how do you represent the “location” of each
cluster, to tell which pair of clusters is closest?
• Euclidean case: each cluster has a
centroid = average of its (data)points
• (2) How to determine “nearness” of clusters?
✓ Measure cluster distances by distances of centroids

47
Example: Hierarchical clustering

[Figure: data points (o) at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); centroids (x) at (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3); the merge order is shown as a dendrogram]
48
And in the Non-Euclidean Case?

What about the Non-Euclidean case?


• The only “locations” we can talk about are the points themselves
✓ i.e., there is no “average” of two points

• Approach 1:
✓ (1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
✓ (2) How do you determine the “nearness” of clusters? Treat clustroid as if it were
centroid, when computing inter-cluster distances

49
“Closest” Point?

• (1) How to represent a cluster of many points?


clustroid = point “closest” to other points
• Possible meanings of “closest”:
✓ Smallest maximum distance to other points
✓ Smallest average distance to other points
✓ Smallest sum of squares of distances to other points
▪ For distance metric d, the clustroid c of cluster C is:  argmin_c Σ_{x∈C} d(x, c)²

Centroid is the avg. of all (data)points in the cluster. This means the centroid is an “artificial” point.
Clustroid is an existing (data)point that is “closest” to all other points in the cluster.
50
Defining “Nearness” of Clusters

• (2) How do you determine the “nearness” of clusters?


✓ Approach 2:
Intercluster distance = minimum of the distances between any two points, one from
each cluster
✓ Approach 3:
Pick a notion of “cohesion” of clusters, e.g., maximum distance from the clustroid
▪ Merge clusters whose union is most cohesive

51
Cohesion

• Approach 3.1: Use the diameter of the merged cluster = maximum distance between
points in the cluster
• Approach 3.2: Use the average distance between points in the cluster
• Approach 3.3: Use a density-based approach
✓ Take the diameter or avg. distance, e.g., and divide by the number of points in the
cluster

52
Implementation

• Naïve implementation of hierarchical clustering:


✓ At each step, compute pairwise distances between all pairs of clusters, then merge
✓ O(N³)

• Careful implementation using a priority queue can reduce time to O(N² log N)
✓ Still too expensive for really big datasets that do not fit in memory

53
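For comparison, a minimal sketch using SciPy's standard agglomerative implementation (an assumption, the slides do not prescribe a library), run on the six points from the earlier hierarchical-clustering example:

```python
# Centroid-linkage hierarchical clustering on the example points.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0, 0], [1, 2], [2, 1], [4, 1], [5, 0], [5, 3]], dtype=float)

# 'centroid' linkage measures cluster distance by centroid distance (Euclidean case)
Z = linkage(points, method="centroid")

# Cut the dendrogram into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```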
K–means Algorithm(s)

• Assumes Euclidean space/distance

• Start by picking k, the number of clusters

• Initialize clusters by picking one point per cluster


✓ Example: Pick one point at random, then k-1 other points, each as far away as
possible from the previous points

54
Populating Clusters

• 1) For each point, place it in the cluster whose current centroid it is nearest

• 2) After all points are assigned, update the locations of centroids of the k clusters

• 3) Reassign all points to their closest centroid


✓ Sometimes moves points between clusters

• Repeat 2 and 3 until convergence


✓ Convergence: Points don’t move between clusters and centroids stabilize

55
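A minimal NumPy sketch of exactly this assign/update/repeat loop (assuming Euclidean distance and random initialization; this is not the lecture's own code):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k of the points at random as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # 1) assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2) update each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 3) stop when the centroids stabilize (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.random.default_rng(1).normal(size=(60, 2))
labels, centroids = kmeans(points, k=3)
print(labels)
```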
Example: Assigning Clusters

[Scatter plot: data points (x) and centroids; clusters after round 1]
56
Example: Assigning Clusters

[Scatter plot: data points (x) and centroids; clusters after round 2]
57
Example: Assigning Clusters

[Scatter plot: data points (x) and centroids; clusters at the end]
58
Getting the k right

How to select k?
• Try different k, looking at the change in the average distance to centroid as k
increases
• The average falls rapidly until the right k, then changes little

[Plot: average distance to centroid vs. k; the best value of k is at the elbow of the curve]

59
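A hedged sketch of this heuristic using scikit-learn's KMeans (an assumption; any k-means implementation would do), printing the average distance to the centroid for several k:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.random.default_rng(0).normal(size=(200, 2))

for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    # average distance from each point to its assigned centroid
    avg_dist = np.mean(np.linalg.norm(points - km.cluster_centers_[km.labels_], axis=1))
    print(k, round(avg_dist, 3))
# Pick the k where the curve stops falling rapidly (the "elbow").
```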
Example: Picking k

Too few; many long distances to centroid.

[Scatter plot: the point cloud partitioned into too few clusters]
60
Example: Picking k

Just right; distances rather short.

[Scatter plot: the point cloud partitioned into the right number of clusters]

61
Example: Picking k

Too many; little improvement in average distance.

[Scatter plot: the point cloud partitioned into too many clusters]

62
Application Example: Landmark Engine

63
Landmark Engine

• Google developed a landmark recognition engine that identifies specific landmarks by


clustering different images of the same landmark.

64
Why photos?

• People tend to take many photos when they visit popular places.
• As user-generated content (UGC), photos carry various metadata: location taken (geo-coordinates), time taken, scene (visual features), tags, …

[Figure: an example photo with its UGC metadata: mention, tag, taken time, location]

65
Why photos?

• Diverse information of venues can be extracted from photos.


• As about 1% of venues are always newly emerging places, we need automatic completion of the
knowledge base.

66
Google Data

• Meta-info: a photo is a tuple containing the unique photo ID, tagged GPS coordinates
in terms of latitude and longitude, text tag, and uploader id.
• World-scale: 2240 landmarks from 812 cities in 104 countries
• Large-scale: 21.4 million potential landmark images

67
Our Dataset

• Meta-info: a photo is a tuple containing the unique photo ID, tagged GPS coordinates
in terms of latitude and longitude, text tag, and uploader id.
• World-scale: 2240 landmarks from 812 cities in 104 countries
• Large-scale: 21.4 million potential landmark images

68
Our Dataset

• City-scale and intermediate-scale sets of photos that we can handle on a PC.


• We will use thousands of photos taken in Seattle!

69
Basic Method

• Pipeline
✓ Geographical clustering: The photos taken in the same landmark are likely to be
geographically close.
✓ Visual clustering: The photos taken in the same landmark are likely to be visually
similar, sharing a similar scene.
✓ Textual clustering or matching: The photos taken in the same landmark are likely to
share similar tags.
✓ You can change this pipeline or add other phases here. This overview is a simple
guide for your warm start.

Pipeline: photos → Geographical clustering → Visual clustering → Textual clustering/matching → Landmark clusters

70
Phase I: Geographical clustering

• The photos taken in the same landmark are likely to be geographically close.
• What are clustering algorithms suitable for this?

71
Phase I: Geographical clustering

• Recommended algorithm: Meanshift


• Does not require the number of clusters as input
• One of the density-based algorithms (e.g., Meanshift, DBSCAN)
• This algorithm tries to track the densest clusters

73
Phase I: Geographical clustering

• Recommended algorithm: Meanshift


• Scikit-learn (Python ML library) provides a tutorial and an implementation

74
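A minimal sketch along these lines, assuming scikit-learn's MeanShift; the coordinates and the bandwidth value are illustrative only, not project data:

```python
import numpy as np
from sklearn.cluster import MeanShift

# photo geo-coordinates: rows of (latitude, longitude)
coords = np.array([
    [47.6205, -122.3493],   # e.g., photos near the Space Needle
    [47.6206, -122.3491],
    [47.6089, -122.3401],   # e.g., photos near Pike Place Market
    [47.6091, -122.3403],
])

# Illustrative radius in degrees; sklearn's estimate_bandwidth(coords) is an alternative.
bandwidth = 0.005
ms = MeanShift(bandwidth=bandwidth).fit(coords)
print(ms.labels_)            # geo-cluster id per photo
print(ms.cluster_centers_)   # densest points tracked by the algorithm
```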
Phase I: Geographical clustering

• Why is K-means not recommended?


✓ If you want, it is also available.
✓ But you have to pre-define the number of clusters, which is a challenging issue.
✓ And note that you don't need to consider all of the outlier photos (maybe these photos
are not landmark photos).

75
Phase I: Geographical clustering

• Limitation of geographical clustering


✓ Despite their similar geo-coordinates, two photos may depict different landmarks
✓ Despite their distant geo-coordinates, three photos may depict the same landmark

(a) Landmark disambiguation is necessary (b) Landmark resolution is necessary

76
Phase II: Visual clustering

• We can group visually-similar photos as visual clusters in each of geo-clusters.


✓ It is computationally efficient compared to the visual clustering on the whole photo set.
✓ It is more accurate compared to the visual clustering on the whole photo set.

• How can we perform visual clustering? (guide)


✓ Compute pairwise visual similarities among photos (in one geo-cluster)
✓ Perform graph clustering (avoiding high-dimensional issue)

77
Phase II: Visual clustering

• How can we perform visual clustering? (guide)


✓ Compute pairwise visual similarities among photos (in one geo-cluster)

Identify shared objects between images, using Microsoft Bundler*


Construct an adjacency graph between images with the shared object.

*Bundler: http://phototour.cs.washington.edu/bundler/

78
Phase II: Visual clustering

• How can we perform visual clustering? (guide)


✓ Compute pairwise visual similarities among photos (in one geo-cluster)
For each image, SIFT* generates a set of key points that describe the image.
Bundler reconstructs 3D structure for the images using key points from SIFT.

[Figure: 3D reconstruction of an object from photos 1–3]

*SIFT: http://www.cs.ubc.ca/~lowe/keypoints/

79
Phase II: Visual clustering

• How can we perform visual clustering? (guide)


✓ Compute pairwise visual similarities among photos (in one geo-cluster)

object
(Eiffel tower??)

80
Phase II: Visual clustering

• How can we perform visual clustering? (guide)


✓ Compute pairwise visual similarities among photos (in one geo-cluster)
✓ Perform graph clustering (avoiding high-dimensional issue)

A set of images: |N|
→ Pairwise similarities (i.e., a graph): up to |N| × |N|
→ Cut-based graph partitioning (optional)
81
https://www.cs.cornell.edu/~snavely/bundler/bundler-v0.4-manual.html
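One possible sketch of the graph-clustering step, assuming a precomputed pairwise-similarity matrix and scikit-learn's SpectralClustering as the partitioning method (the project may use a different cut-based partitioner, and the matrix below is toy data):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# sim[i, j] = visual similarity between photo i and photo j in one geo-cluster
# (e.g., number of shared Bundler/SIFT matches, normalized to [0, 1])
sim = np.array([
    [1.0, 0.9, 0.8, 0.0, 0.1],
    [0.9, 1.0, 0.7, 0.1, 0.0],
    [0.8, 0.7, 1.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 1.0, 0.9],
    [0.1, 0.0, 0.0, 0.9, 1.0],
])

sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(sim)   # visual-cluster id per photo
print(labels)
```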
Phase II: Visual clustering

• One geo-cluster (before visual clustering)

82
Phase II: Visual clustering

• Visual clusters in one geo-cluster

83
Phase III: Textual clustering/matching

• The photos taken in the same landmark are likely to share similar tags

[Diagram: the bag of words of each visual cluster is compared, then clustered or merged]

84
Phase III: Textual clustering/matching

• Technique 1: Naïve similarities


✓ Jaccard similarity
✓ Cosine similarity

85
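A minimal sketch of both naive similarities on toy tag lists (the tags are illustrative, not project data):

```python
from collections import Counter
import math

def jaccard(tags_a, tags_b):
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b)

def cosine(tags_a, tags_b):
    ca, cb = Counter(tags_a), Counter(tags_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm

cluster1 = ["spaceneedle", "seattle", "tower", "night"]
cluster2 = ["spaceneedle", "seattle", "observation", "deck"]
print(jaccard(cluster1, cluster2), cosine(cluster1, cluster2))
```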
Phase III: Textual clustering/matching

• Technique 2: Stopword removal


✓ Tags that occur frequently in many clusters negatively affect similarity matching, e.g.:
✓ Travel
✓ People
✓ Seattle
✓…

86
Phase III: Textual clustering/matching

• Technique 3: TFIDF scoring


✓ A typical scoring method for search engines
✓ In our work, document = visual cluster

87
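A possible sketch using scikit-learn's TfidfVectorizer (an assumption), treating the concatenated tags of each visual cluster as one document, as the slide suggests:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# one "document" per visual cluster = all of its photo tags joined together (toy data)
cluster_tags = [
    "spaceneedle seattle tower night skyline",
    "spaceneedle seattle observation deck",
    "pikeplace seattle market fish",
]

vec = TfidfVectorizer()
X = vec.fit_transform(cluster_tags)   # one TF-IDF row per visual cluster
print(cosine_similarity(X))           # cluster-to-cluster similarity matrix
```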
Phase III: Textual clustering/matching

• Technique 4: Word embedding


✓ We can more accurately compute the semantic similarity between two tags

88
Phase III: Textual clustering/matching

• Technique 4: Word embedding


✓ We can more accurately compute the semantic similarity between two tags

https://radimrehurek.com/gensim/models/word2vec.html

89
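A minimal sketch with the gensim Word2Vec API linked above; the tag lists are toy data and a real model would need far more training text:

```python
from gensim.models import Word2Vec

tag_lists = [
    ["spaceneedle", "seattle", "tower", "night"],
    ["spaceneedle", "seattle", "observation", "deck"],
    ["pikeplace", "seattle", "market", "fish"],
] * 50   # repeat the toy corpus so training has something to iterate over

model = Word2Vec(sentences=tag_lists, vector_size=32, window=3, min_count=1, seed=0)
print(model.wv.similarity("spaceneedle", "tower"))   # semantic similarity of two tags
```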
Linear Regression

90
Predicting exam score: regression

x (hours)   y (score)
   10          90
    9          80
    3          50
    2          30
91
Regression (data)

x   y
1   1
2   2
3   3

[Plot: the three data points on X–Y axes]

92
(Linear) Hypothesis

[Plot: candidate hypothesis lines over the data points]

93
(Linear) Hypothesis

H(x) = Wx + b

[Plot: several lines H(x) = Wx + b over the data points]

Which hypothesis is better?

94
Cost Function

• How well does the line fit our (training) data? Measure the error H(x) − y at each point.

[Plot: vertical differences H(x) − y between the hypothesis line and the data points]

95
Cost Function and Optimization

96
Optimization

• H(W)* = argmin_{H(W)} cost(H(W), y)

✓ i.e., make the predictions ŷ close to y
• Update W → W + ΔW only if cost(W + ΔW) < cost(W)
• Finish when cost(W + ΔW) == cost(W)
• How can we find ΔW so that cost(W + ΔW) < cost(W)?
✓ Gradient Descent

97
Optimization

• What does cost(W) look like? (with b = 0, cost(W) = (1/m) Σᵢ (W·xᵢ − yᵢ)²)

x   y
1   1
2   2
3   3

• W = 1, cost(W) = 0
• W = 0, cost(W) = 4.67
• W = 2, cost(W) = 4.67
98
Optimization

99
Gradient descent algorithm

• Minimize cost function


✓ Gradient descent is used in many minimization problems
✓ For a given cost function, cost(W, b), it will find W, b to
minimize the cost
✓ It can be applied to more general functions: cost(w1, w2, …)

How would you find the lowest point? 100


Gradient descent algorithm

101
Gradient descent algorithm

102
Gradient descent algorithm

103
Gradient descent algorithm

104
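Putting the previous slides together, a minimal NumPy sketch of gradient descent on the toy data x = y = (1, 2, 3), assuming H(x) = Wx + b and the mean-squared-error cost; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

W, b = 0.0, 0.0
lr = 0.1
for _ in range(1000):
    pred = W * x + b
    # cost(W, b) = mean((H(x) - y)^2); its gradients w.r.t. W and b:
    grad_W = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    W -= lr * grad_W           # step downhill along the gradient
    b -= lr * grad_b

print(W, b)   # should approach W = 1, b = 0, where the cost is 0
```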
Convex function

105
Multivariable Linear Regression

106
Predicting exam score:
regression using one input (x)

one-variable / one-feature

x (hours)   y (score)
   10          90
    9          80
    3          50
    2          60
   11          40
107
Predicting exam score:
regression using three inputs (x1, x2, x3)

multi-variable / multi-feature

x1 (quiz 1)   x2 (quiz 2)   x3 (midterm 1)   Y (final)
    73            80             75             152
    93            88             93             185
    89            91             90             180
    96            98            100             196
    73            66             70             142

Test Scores for General Psychology

108
Hypothesis and Cost Function

109
Multi-variable

110
Matrix Multiplication

111
Matrix Multiplication

x1   x2   x3    Y
73   80   75   152
93   88   93   185
89   91   90   180
96   98  100   196
73   66   70   142
Test Scores for General Psychology

112
Matrix Multiplication

• 5 instances: [5, 3] × [3, 1] → [5, 1]

114
Matrix Multiplication

• n instances, 2 outputs: [n, 3] × [?, ?] → [n, 2]

115
Matrix Multiplication

• n instances, 2 outputs: [n, 3] × [3, 2] → [n, 2]

116
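A minimal NumPy sketch of the hypothesis in matrix form, H(X) = XW + b, on the test-score table; the weight values are illustrative, not learned here:

```python
import numpy as np

# [5, 3] data matrix: each row is (quiz 1, quiz 2, midterm 1)
X = np.array([
    [73, 80, 75],
    [93, 88, 93],
    [89, 91, 90],
    [96, 98, 100],
    [73, 66, 70],
], dtype=float)

W = np.array([[0.7], [0.6], [0.7]])   # [3, 1] weight matrix (one output column)
b = 0.0

H = X @ W + b                         # [5, 3] x [3, 1] -> [5, 1] predictions
print(H.shape, H.ravel())
```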
Logistic Regression and Multinomial Classification

117
Regression: [Plot: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020]

Classification: [Plot: Breast Cancer (Malignant / Benign): label 1 (Malignant) or 0 (Benign) vs. Tumor Size]
118
Sigmoid

y = WᵀX

H(x) = 1 / (1 + e^(−WᵀX))

119
Cost(H(x), y) = −log(H(x))        if y = 1
                −log(1 − H(x))    if y = 0

cost(H(x), y) = −y·log(H(x)) − (1 − y)·log(1 − H(x))
120
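A minimal NumPy sketch that just evaluates these two formulas (it does not train a classifier); the weight and the data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(h, y):
    # cost(H(x), y) = -y*log(H(x)) - (1 - y)*log(1 - H(x)), averaged over examples
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

W = np.array([1.5])            # illustrative weight
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])     # 0 = benign, 1 = malignant

h = sigmoid(X @ W)             # H(x) = 1 / (1 + exp(-W^T x))
print(binary_cross_entropy(h, y))
```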


Softmax

121
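The slide's formula is shown as an image; as a hedged sketch, the standard definition softmax(z)_i = exp(z_i) / Σ_j exp(z_j) can be computed as:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # one score per class
print(softmax(scores))               # class probabilities that sum to 1
```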
Regression and Classification Loss Functions
Mean Squared Error (MSE)

Cross Entropy

122
Application Example

123
Thank you!

124
