
UET

Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

INT3405E - Machine Learning


Lecture 4: Classification (P1)

Hanoi, 09/2024
Recap: Key Issues in Machine Learning
● What are good hypothesis spaces?
○ Which spaces have been useful in practical applications, and why?
● What algorithms can work with these spaces?
○ Are there general design principles for machine learning algorithms?
● How can we find the best hypothesis efficiently?
○ How to find the optimal solution efficiently (the "optimization" question)
● How can we optimize accuracy on future data?
○ Known as the "overfitting" problem (i.e., "generalization" theory)
● How can we have confidence in the results?
○ How much training data is required to find an accurate hypothesis? (the "statistical" question)
● Are some learning problems computationally intractable? (the "computational" question)
● How can we formulate application problems as machine learning problems? (the "engineering" question)
Recap: Model Representation
[Figure: Training Set → Learning Algorithm → Hypothesis h; given the size of a house x, the hypothesis outputs an estimated price y]

How do we represent h? With linear regression with one variable ("univariate linear regression"):

$h_\theta(x) = \theta_0 + \theta_1 x$

How do we choose the parameters $\theta_0$ and $\theta_1$?
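To make this concrete, here is a minimal NumPy sketch (ours, not the slides') of the hypothesis and its mean-squared-error cost; the toy sizes, prices, and parameter values are invented:

```python
import numpy as np

def h(x, theta0, theta1):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(x, y, theta0, theta1):
    """Mean squared error: J(theta) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(x)
    return np.sum((h(x, theta0, theta1) - y) ** 2) / (2 * m)

# Toy data: house size vs. price (made-up numbers).
size = np.array([10.0, 15.0, 20.0, 25.0])
price = np.array([200.0, 310.0, 400.0, 490.0])
print(cost(size, price, 10.0, 19.0))  # cost of one candidate parameter setting
```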


Recap: Gradient Descent for Optimization

Repeat until convergence, simultaneously for every parameter $\theta_j$:

$\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)$

where $\alpha$ is the learning rate.
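A minimal gradient-descent sketch for the univariate case (our own illustration; the learning rate, iteration count, and toy data are arbitrary choices):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.

    Repeats theta_j := theta_j - alpha * dJ/dtheta_j,
    updating both parameters simultaneously.
    """
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        err = (theta0 + theta1 * x) - y      # h(x_i) - y_i
        grad0 = err.sum() / m                # dJ/dtheta0
        grad1 = (err * x).sum() / m          # dJ/dtheta1
        theta0 -= alpha * grad0              # simultaneous update
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])           # roughly y = 2x
print(gradient_descent(x, y))
```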


Recap: Gradient Descent Example

[Figure: left, $h_\theta(x)$ for fixed $\theta_0, \theta_1$ (a function of x); right, the cost $J(\theta_0, \theta_1)$ (a function of the parameters)]

How fast does gradient descent converge to the global optimum?


Normal Equation (3)
● Matrix-vector formulation:

$J(\theta) = \frac{1}{2m} \, \| X\theta - y \|^2$

● Analytical solution:

$\theta = (X^T X)^{-1} X^T y$

Computing it takes $O(mn^2 + n^3)$ time for m examples with n features.
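As an illustration (ours), a NumPy sketch of the closed-form solution; solving the linear system is preferable to forming the inverse explicitly. The toy design matrix and targets are invented:

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy design matrix with a bias column of ones.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.1, 5.9])
print(normal_equation(X, y))  # approx [0.1, 1.95]
```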



Outline
● Bayesian Learning
○ Bayes Theorem
○ MAP learning vs. MLE learning
● Probabilistic Generative Models
○ Naïve Bayes Classifier
● Discriminative Models
○ Logistic Regression
○ Decision Tree
○ K-Nearest Neighbors
Bayes Theorem
● Bayes Theorem:

$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$

(posterior = likelihood × prior / evidence)

Thomas Bayes (1702–1761)

○ P(h) = prior probability of hypothesis h
○ P(D) = prior probability of training data D
○ P(h|D) = conditional probability of h given D (posterior)
○ P(D|h) = conditional probability of D given h (likelihood)
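A hypothetical numeric illustration (our numbers, not the slide's): suppose $P(h) = 0.01$, $P(D \mid h) = 0.9$, and $P(D \mid \neg h) = 0.1$. Then

$P(h \mid D) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.1 \times 0.99} = \frac{0.009}{0.108} \approx 0.083$

so even fairly strong evidence leaves the posterior small when the prior is small.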



Maximum A Posteriori Learning (MAP)
● Maximum a posteriori (MAP) learning
○ Find the most probable hypothesis given the training data by maximizing the posterior probability:

$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)$

(P(D) does not depend on h and can be dropped.) The prior P(h) encodes our knowledge/preference about hypotheses.
MAP Learning
● For each hypothesis h in H, calculate the posterior probability:

$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$

● Output the hypothesis h with the highest posterior probability, as sketched below.

● Comments:
○ Computationally intensive
○ Gives a standard for judging the performance of learning algorithms
○ Choosing P(h) reflects our prior knowledge about the learning task
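As a concrete sketch (ours, not the slides'), brute-force MAP learning over a small finite hypothesis space of coin biases; the hypotheses, prior, and flips are invented:

```python
import numpy as np

# Hypotheses: candidate heads probabilities; data: observed coin flips.
flips = np.array([1, 1, 0, 1, 1])           # 1 = heads
hypotheses = np.array([0.3, 0.5, 0.7])      # finite hypothesis space H
prior = np.array([0.1, 0.8, 0.1])           # P(h): encodes our preference

def likelihood(p, flips):
    """P(D|h) for i.i.d. Bernoulli flips with heads probability p."""
    heads = flips.sum()
    tails = len(flips) - heads
    return p**heads * (1 - p)**tails

like = np.array([likelihood(p, flips) for p in hypotheses])
posterior = like * prior / np.sum(like * prior)   # Bayes theorem
h_map = hypotheses[np.argmax(posterior)]          # MAP hypothesis
h_mle = hypotheses[np.argmax(like)]               # MLE = MAP with uniform prior
print(h_map, h_mle)                               # 0.5 0.7: the prior matters
```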



Maximum-Likelihood Estimation (MLE)

● Maximum Likelihood Estimation (MLE) learning
○ Assume each hypothesis is equally probable a priori: $P(h_i) = P(h_j)$ for all $h_i, h_j \in H$
○ Then MAP reduces to maximizing the likelihood of the training data:

$h_{MLE} = \arg\max_{h \in H} P(D \mid h)$



Relationship between MLE Learning
and Least-Squared Error Learning (1)
● Consider training examples $\{(x_i, y_i)\}_{i=1}^m$
● Assume the targets are generated with additive Gaussian noise:

$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$

● We want to learn a hypothesis h that approximates f(x)

● Linear regression minimizes the objective (cost function) of the mean squared error:

$J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x_i) - y_i \right)^2$



Relationship between MLE Learning
and Least-Squared Error Learning (2)
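The derivation itself did not survive extraction; the standard argument, sketched under the Gaussian-noise assumption above:

$h_{MLE} = \arg\max_h \prod_{i=1}^m p(y_i \mid x_i, h) = \arg\max_h \sum_{i=1}^m \left[ -\log\left(\sqrt{2\pi}\,\sigma\right) - \frac{(y_i - h(x_i))^2}{2\sigma^2} \right] = \arg\min_h \sum_{i=1}^m \left( y_i - h(x_i) \right)^2$

The first term is constant in h, so maximizing the Gaussian likelihood is exactly minimizing the squared error.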



Probabilistic Generative Models (1)
• Classify instance x into one of K classes using Bayes theorem:

$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_{j=1}^{K} p(x \mid C_j)\, p(C_j)}$

where $p(x \mid C_k)$ is the density function for class $C_k$ and $p(C_k)$ is the class prior.



Probabilistic Generative Models (2)
• Classification decision:

$\hat{y} = \arg\max_k \; p(C_k \mid x) = \arg\max_k \; p(x \mid C_k)\, p(C_k)$

• The key is to estimate the parameters of $p(x \mid C_k)$ and $p(C_k)$



Probabilistic Generative Models (3)
● Given training data $\{(x_i, y_i)\}_{i=1}^N$
● Maximum likelihood gives closed-form solutions, e.g., for Gaussian class densities:

$p(C_k) = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{i:\, y_i = k} x_i$
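The following NumPy sketch (ours) illustrates such closed-form estimation for a two-class Gaussian model with a shared covariance matrix, as assumed later in this lecture; the toy data is invented:

```python
import numpy as np

def fit_gaussian_generative(X, y):
    """Closed-form MLE: class priors, class means, shared covariance.

    X: (n, d) feature matrix; y: (n,) labels in {0, 1}.
    """
    n = len(y)
    priors, means = [], []
    Sigma = np.zeros((X.shape[1], X.shape[1]))
    for k in (0, 1):
        Xk = X[y == k]
        priors.append(len(Xk) / n)            # P(C_k) = N_k / N
        mu = Xk.mean(axis=0)                  # class mean mu_k
        means.append(mu)
        Sigma += (Xk - mu).T @ (Xk - mu)      # pooled scatter
    return priors, means, Sigma / n           # shared covariance

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
y = np.array([0, 0, 1, 1])
print(fit_gaussian_generative(X, y))
```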



Probabilistic Generative Models (4)

[Figure: class-conditional densities $p(x \mid C_k)$ and the resulting posterior probability $p(C_1 \mid x)$]



Curse of Dimensionality
● One challenge of learning with high-dimensional data is insufficient data samples
● Suppose 5 samples per axis direction are considered enough in 1D; keeping the same density in d dimensions requires $5^d$ points:
– 1D: 5 points
– 2D: 25 points
– 3D: 125 points
– 10D: 9,765,625 points
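The growth is just $5^d$; a one-liner (ours) to reproduce the numbers:

```python
# Points needed to keep 5 samples per axis in d dimensions: 5**d.
for d in (1, 2, 3, 10):
    print(d, 5 ** d)  # 5, 25, 125, 9765625
```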



Outline
● Bayesian Learning
○ Bayes Theorem
○ MAP learning vs. MLE learning
● Probabilistic Generative Models
○ Naïve Bayes Classifier
● Discriminative Models
○ Logistic Regression
○ Decision Tree
○ K-Nearest Neighbors
Naïve Bayes Classifier (1)
• $p(x \mid C_k)$ is hard to estimate for high-dimensional data x
• Conditional independence assumption:
• All attributes are conditionally independent given the class
• Naïve Bayes approximation:

$p(x \mid C_k) = \prod_{j=1}^{d} p(x_j \mid C_k)$

where each factor $p(x_j \mid C_k)$ is a one-dimensional distribution.



Naïve Bayes Classifier (2)
● Text categorization
○ x: the word histogram of a document
● Bag-of-words assumption:
○ Assume word position doesn't matter
● Conditional independence:

$p(x \mid C_k) = \prod_{w} p(w \mid C_k)^{tf(w,\, x)}$

where $tf(w, x)$ is the number of times word w occurs in document x.



Parameter Estimation
● Learning by maximum likelihood estimates
○ Simply count the frequencies in the data:

$p(C_k) = \frac{\#\text{docs in class } k}{\#\text{docs}}, \qquad p(w \mid C_k) = \frac{count(w, k)}{\sum_{w'} count(w', k)}$

○ Create a mega-document for topic k by concatenating all the docs in this topic
○ Compute the frequency of each word w in the mega-document



Problem with Maximum Likelihood
● What if a word in a test document (e.g., a novel word coined on the internet) never appears in the training data? Its estimated probability is zero, which zeroes out the whole product.

● Smoothing
○ Avoids zero probabilities, e.g., Laplace (add-one) smoothing:

$p(w \mid C_k) = \frac{count(w, k) + 1}{\sum_{w'} count(w', k) + |V|}$

where $|V|$ is the vocabulary size.
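Putting the counting and the smoothing together, a minimal multinomial Naïve Bayes sketch (ours, with an invented 3-word vocabulary):

```python
import numpy as np

def fit_nb(X, y, n_classes):
    """X: (n_docs, vocab) word counts; y: class ids. Returns log parameters."""
    log_prior = np.log(np.bincount(y, minlength=n_classes) / len(y))
    log_pw = np.zeros((n_classes, X.shape[1]))
    for k in range(n_classes):
        counts = X[y == k].sum(axis=0)               # "mega-document" counts
        # Laplace smoothing: +1 per word, +|V| in the denominator.
        log_pw[k] = np.log((counts + 1) / (counts.sum() + X.shape[1]))
    return log_prior, log_pw

def predict_nb(x, log_prior, log_pw):
    """argmax_k  log P(C_k) + sum_w tf(w, x) * log P(w | C_k)."""
    return np.argmax(log_prior + log_pw @ x)

# Toy corpus: 4 documents over a vocabulary of 3 words, two classes.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 1], [1, 3, 0]])
y = np.array([0, 0, 1, 1])
lp, lw = fit_nb(X, y, 2)
print(predict_nb(np.array([2, 0, 1]), lp, lw))       # -> 0
```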



Naïve Bayes Classifier (3)
• Even when the independence assumption is a bad approximation of the true distribution, Naïve Bayes often achieves good classification accuracy (e.g., text categorization on 20 Newsgroups).



Naïve Bayes Classifier (4)

Naïve Bayes Classifier:

$\hat{y} = \arg\max_k \; p(C_k) \prod_{j} p(x_j \mid C_k)$



Example: “Play Tennis” (1)
● Based on the examples in the table, classify the following datum x:
x = (Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong)
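For reference (the data table itself did not survive extraction), assuming the standard 14-example PlayTennis table from Mitchell's Machine Learning textbook, the counted estimates are:

$P(yes) = 9/14$, $P(no) = 5/14$; $P(Sunny \mid yes) = 2/9$, $P(Cool \mid yes) = 3/9$, $P(High \mid yes) = 3/9$, $P(Strong \mid yes) = 3/9$; $P(Sunny \mid no) = 3/5$, $P(Cool \mid no) = 1/5$, $P(High \mid no) = 4/5$, $P(Strong \mid no) = 3/5$

so $P(yes) \prod_j P(x_j \mid yes) \approx 0.0053$ and $P(no) \prod_j P(x_j \mid no) \approx 0.0206$, and Naïve Bayes predicts PlayTennis = no.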



The Independence Assumption
●Makes computation possible
●Yields optimal classifiers when satisfied
●Fairly good empirical results
● But it is seldom satisfied in practice, as attributes (variables) are often correlated
● Attempts to overcome this limitation:
○ Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes



Decision Boundary of Naïve Bayes (1)
● Consider text categorization with two classes
● The (log) ratio of the posteriors determines the decision:

$\log \frac{P(C_1 \mid x)}{P(C_2 \mid x)} = \sum_{w} tf(w, x) \log \frac{p(w \mid C_1)}{p(w \mid C_2)} + \log \frac{P(C_1)}{P(C_2)}$

This is linear in the word counts: a linear decision boundary.


Decision Boundary of Naïve Bayes (2)
● Consider two-class classification
● Gaussian density function for each class: $p(x \mid C_k) = \mathcal{N}(x;\, \mu_k, \Sigma)$
● Shared covariance matrix $\Sigma$

The result is again a linear decision boundary, as derived below.
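The standard derivation, sketched: because the covariance is shared, the quadratic terms $x^T \Sigma^{-1} x$ cancel in the log-odds,

$\log \frac{P(C_1 \mid x)}{P(C_2 \mid x)} = \log \frac{p(x \mid C_1)}{p(x \mid C_2)} + \log \frac{P(C_1)}{P(C_2)} = w^T x + b$

with $w = \Sigma^{-1}(\mu_1 - \mu_2)$ and $b = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \log \frac{P(C_1)}{P(C_2)}$, which is linear in x.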



Decision Boundary
• Generative models essentially create linear decision boundaries
• Why not directly model the linear decision boundary?


Outline
● Bayesian Learning
○ Bayes Theorem
○ MAP learning vs. MLE learning
● Probabilistic Generative Models
○ Naïve Bayes Classifier
● Discriminative Models
○ Logistic Regression
○ Decision Tree
○ K-Nearest Neighbors



Discriminative Models: Logistic Regression
• Generative models often lead to a linear decision boundary
• Linear discriminative model
• Directly model the linear decision boundary $w^T x = 0$

• w is the parameter vector to be learned



Logistic Regression

Model the class posterior directly as the sigmoid of a linear function (using labels y = ±1):

$p(y \mid x) = \sigma(y\, w^T x) = \frac{1}{1 + \exp(-y\, w^T x)}$


Logistic Sigmoid Function
● The logistic/sigmoid function:

$\sigma(a) = \frac{1}{1 + e^{-a}}$

It maps any real number into (0, 1), so its output can be read as a probability.



Logistic Regression
• Given training data $\{(x_i, y_i)\}_{i=1}^m$ with $y_i \in \{-1, +1\}$
• Log-likelihood function:

$\log L(w) = \sum_{i=1}^m \log \sigma(y_i\, w^T x_i) = -\sum_{i=1}^m \log\left(1 + \exp(-y_i\, w^T x_i)\right)$

• Learn the parameter w by maximum likelihood estimation (MLE)
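A minimal NumPy sketch (ours) of this MLE fit by gradient ascent on the log-likelihood; the learning rate, iteration count, and the linearly separable toy data are invented:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.1, iters=500):
    """Maximize sum_i log sigmoid(y_i * w^T x_i) by gradient ascent.

    Labels y are in {-1, +1}; no closed-form solution exists.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # (1 - sigmoid(margin)) weights each example by how wrong we are.
        grad = X.T @ (y * (1.0 - sigmoid(margins)))
        w += lr * grad / len(y)
    return w

X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, -2.5]])  # bias column
y = np.array([1, 1, -1, -1])
w = fit_logistic(X, y)
print(np.sign(X @ w))  # should match y
```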



Convex Objective Functions

[Figure: the convex logistic loss $\log(1 + \exp(-y\, w^T x))$ plotted as a function of $w^T x$, for y = 1 (left) and y = −1 (right)]



Logistic Regression
• The objective is convex, so gradient descent reaches the global optimum

• There is no closed-form solution
• Gradient-based update:

$w \leftarrow w + \eta \sum_{i=1}^m y_i\, x_i \left(1 - \sigma(y_i\, w^T x_i)\right)$

where $(1 - \sigma(y_i\, w^T x_i))$ acts as the classification error on example i.



Example: Heart Disease (1)
• Input feature x: age group id
1: 25-29, 2: 30-34, 3: 35-39, 4: 40-44, 5: 45-49, 6: 50-54, 7: 55-59, 8: 60-64

• Output y: whether the subject has heart disease

• y = 1: has heart disease
• y = −1: no heart disease



Example: Heart Disease (2)



Example: Text Categorization (1)
●Learn to classify text into two categories
● Input d: a document, represented by a word histogram
● Output y = ±1:
• +1 for a political document
• −1 for a non-political document



Example: Text Categorization (2)
• Training data



Example: Text Categorization (3)

• Dataset: Reuters-21578
• Classification accuracy:
• Naïve Bayes: 77%
• Logistic regression: 88%



Multi-class Logistic Regression
• How do we extend the logistic regression model to multi-class classification?



Conditional Exponential Model (1)
• Consider K classes, with one weight vector $w_k$ per class
• Define

$p(y = k \mid x) = \frac{1}{Z(x)} \exp(w_k^T x)$

• where Z(x) is the normalization factor (partition function):

$Z(x) = \sum_{j=1}^{K} \exp(w_j^T x)$

• Need to learn $w_1, \dots, w_K$
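A NumPy sketch (ours) of evaluating this model, often called softmax; the weights and input are invented:

```python
import numpy as np

def softmax_probs(W, x):
    """p(y = k | x) = exp(w_k^T x) / Z(x), Z(x) = sum_j exp(w_j^T x).

    W: (K, d) matrix with one weight vector per class; x: (d,) input.
    """
    scores = W @ x
    scores -= scores.max()          # subtract max for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()        # divide by the partition function Z(x)

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # K = 3 classes
x = np.array([0.2, 0.9])
p = softmax_probs(W, x)
print(p, p.sum())                   # class probabilities summing to 1
```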



Conditional Exponential Model (2)
• Learn the weights w by maximum likelihood estimation:

$\max_{w_1, \dots, w_K} \sum_{i=1}^m \log p(y_i \mid x_i)$

• Modified conditional exponential model



Logistic Regression versus Naïve Bayes
• Both produce linear decision boundaries
• Naïve Bayes: the weights come from the estimated log probability ratios
• Logistic regression: learns the weights by MLE

• Both can be viewed as modeling p(x|y)
• Naïve Bayes: the conditional independence assumption
• Logistic regression: assumes an exponential-family distribution for p(x|y) (a much broader assumption)



Discriminative versus Generative
Discriminative Models
● Model P(y|x) directly
● Pros:
○ Usually better performance
○ Robust to noisy data
● Cons:
○ Slow convergence (e.g., LR by gradient descent)
○ Expensive computation

Generative Models
● Model P(x|y) directly
● Pros:
○ Usually fast convergence (with small training data)
○ Cheap computation (easier to learn, e.g., NB)
● Cons:
○ Sensitive to noisy data
○ Usually performs worse
Summary
● Bayesian Learning
○ Bayes Theorem
○ MAP learning vs. MLE learning
● Probabilistic Generative Models
○ Naïve Bayes Classifier
● Discriminative Models
○ Logistic Regression
○ Decision Tree
○ K-Nearest Neighbors



UET
Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

Thank you
Email me
[email protected]
