
ECE 449 Machine Learning: Eric Ji

1 K-Nearest Neighbors (Non-linear)


• Find the top K nearest neighbors under metric d and return the most common label (classification) or average value (regression) among them
– d(x, z) = (Σ_j |x_j − z_j|^p)^(1/p)

– L1 (Manhattan) Distance: p = 1
– L2 (Euclidean) Distance: p = 2
• Determine K using a validation set
– Small K is sensitive to noise and will overfit
– Large K includes examples that are too far away and will underfit
• Simple to implement but has several issues
– Requires memory to store the entire dataset
– Computationally expensive at inference time
– Sensitive to outliers
– Curse of dimensionality: high-dimensional data points spread far apart from each other, giving low performance
• Nonparametric models place mild assumptions on the data distribution and are good for complex data, but require storage/computation over the entire dataset
• Parametric models place strong modeling assumptions and require fitting the model, achieving more efficient storage/computation
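A minimal sketch of the KNN prediction step described above, assuming a small NumPy dataset with integer class labels; the function and variable names are illustrative, not from the course:

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, p=2):
    # Minkowski distance from the query point to every training point
    dists = np.sum(np.abs(X_train - x_query) ** p, axis=1) ** (1.0 / p)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k nearest labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two clusters, predict the label of a point near the second one
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0]), k=3))  # expected: 1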

2 Perceptron (Linear)
• Only applies to linearly separable data
• Perceptrons are linear classifiers trying to learn a hyperplane
• A hyperplane in R^d is represented as w_0 + w^T x = 0 where w ∈ R^d
– w is orthogonal to the hyperplane and points into the positive half-space
• Predicted label: y(x) = sign(w^T x + b)
• The perceptron algorithm iterates through the data and updates w until all data is correctly labeled
– Update rule: w_new = w + y x whenever y (w^T x) ≤ 0
• Theorem: given w* that perfectly separates the data and margin γ = min_i |w*^T x^(i)| over all x^(i) in D, the perceptron algorithm makes at most 1/γ^2 mistakes before converging
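A minimal sketch of the perceptron algorithm above, assuming labels in {−1, +1} and a bias folded into w via a constant feature; names and the toy data are illustrative:

import numpy as np

def perceptron_train(X, y, max_epochs=100):
    # Fold the bias into w by appending a constant 1 feature
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(Xb, y):
            # Mistake (or on the boundary): update w toward the correct side
            if y_i * np.dot(w, x_i) <= 0:
                w += y_i * x_i
                mistakes += 1
        if mistakes == 0:  # converged: all points correctly classified
            break
    return w

# Toy linearly separable data in 2D
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))  # expected: [1, 1, -1, -1]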

3 Probability and Estimation
• Useful Probability Properties:
– Conditional Probability: P(A|B) = P(A, B) / P(B)
– Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)

• Useful Log Rules:


– log(A^B) = B · log(A)
– log(AB) = log(A) + log(B)
– log(A/B) = log(A) − log(B)

• Useful Derivative Rules:


– d/dx log(x) = 1/x

• From a dataset of joint probabilities P(X_1, X_2, ..., X_d, Y) we can calculate P(Y|X_1, X_2, ..., X_d)

– It is intuitive to learn P(Y|X) from the joint distribution, but this requires more data than may be attainable to produce an accurate model
• Estimate parameters from sparse data using Maximum Likelihood Estimation and Maximum A Posteriori Estimation

• MLE chooses the parameter θ that maximizes the probability of observing dataset D


– θ̂ = argmax_θ P(D|θ) where P(D|θ) is the likelihood function
• Steps for solving MLE:
– Take the log of the likelihood
– Take the derivative with respect to θ and set it equal to 0
– Solve for the θ that maximizes the likelihood
• MAP chooses parameter θ that is most probable given prior P (θ) and dataset D

– θ̂ = argmax_θ P(θ|D)
– θ̂ = argmax_θ P(D|θ) P(θ) / P(D) according to Bayes Rule
– θ̂ = argmax_θ P(D|θ) P(θ) as P(D) does not depend on θ
• MAP is better than MLE when the dataset has a small number of samples and the prior is accurate
• As the number of samples in the dataset approaches infinity, the prior becomes irrelevant and MAP converges to MLE (see the sketch below)
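A small worked sketch contrasting MLE and MAP for a Bernoulli (coin-flip) parameter, assuming a Beta(α, β) prior so the MAP estimate has a simple closed form; the prior and counts are illustrative:

# MLE for a Bernoulli parameter: theta_hat = (# heads) / (# flips)
# MAP with a Beta(alpha, beta) prior: theta_hat = (# heads + alpha - 1) / (# flips + alpha + beta - 2)

def bernoulli_mle(heads, flips):
    return heads / flips

def bernoulli_map(heads, flips, alpha=5, beta=5):
    return (heads + alpha - 1) / (flips + alpha + beta - 2)

# With few samples the prior pulls MAP toward 0.5; with many samples MAP approaches MLE
print(bernoulli_mle(3, 4), bernoulli_map(3, 4))              # 0.75 vs ~0.58
print(bernoulli_mle(3000, 4000), bernoulli_map(3000, 4000))  # both ~0.75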

4 Naive Bayes (Probabilistic)
• Aims to learn P(Y|X) through P(X|Y) and P(Y) using Bayes rule, with a conditional independence assumption to reduce the number of parameters to estimate
– P(Y|X_1, ..., X_d) = P(X_1, ..., X_d|Y) P(Y) / P(X_1, ..., X_d) ∝ P(X_1, ..., X_d|Y) P(Y), ignoring normalization

• Conditional Independence: P(X_1, ..., X_d|Y) = Π_j P(X_j|Y)

– Requires estimating 2(2^d − 1) + 1 parameters without assuming conditional independence


– Requires estimating 2d + 1 parameters when assuming conditional independence
• Utilize MLE and MAP to estimate the parameters to learn P (Y |X)
– MAP ensures that P(Y|X) won’t be 0 when one component of the product is 0

• If X is continuous, it is common to assume P(X|Y) follows a normal distribution


– Variance can be independent of class, feature, or both
• Gaussian Naive Bayes yields a linear classifier under additional assumptions about the data’s distributions (e.g., class-independent variances)
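A minimal Gaussian Naive Bayes sketch along the lines of this section: estimate P(Y) and a per-class, per-feature Gaussian for P(X_j|Y), then predict the class maximizing log P(Y) + Σ_j log P(X_j|Y); the variance floor, names, and toy data are illustrative choices:

import numpy as np

def gnb_fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),      # P(Y = c)
            "mean": Xc.mean(axis=0),        # per-feature mean of P(X_j | Y = c)
            "var": Xc.var(axis=0) + 1e-9,   # per-feature variance (small floor for stability)
        }
    return params

def gnb_predict(params, x):
    scores = {}
    for c, p in params.items():
        # log P(Y=c) + sum_j log N(x_j; mean_j, var_j), using conditional independence
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
        scores[c] = np.log(p["prior"]) + log_lik
    return max(scores, key=scores.get)

X = np.array([[1.0, 2.0], [1.2, 1.9], [4.0, 5.0], [3.8, 5.2]])
y = np.array([0, 0, 1, 1])
print(gnb_predict(gnb_fit(X, y), np.array([4.1, 4.9])))  # expected: 1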

5 Logistic Regression (Linear)


• Discriminative counterpart to Naive Bayes that directly learns P(Y|X)
– Discriminative models directly learn the weights
– Generative models first estimate all the probabilities/parameters and then derive the weights from them

• Learn a set of weights for each class


• P(Y|X) can be represented by the sigmoid function
– P(Y = c|X) = 1 / (1 + exp(w_0 + Σ_j w_j X_j))

• Calculate weights using MCLE (Maximum Conditional Likelihood Estimation)
– w_MCLE = argmax_w Π_i P(y^(i)|x^(i), w)

– The objective is concave but has no closed-form solution, so it requires optimization techniques

• Can apply MAP by placing a prior on the weights themselves


– w_MCAP = argmax_w P(w) Π_i P(y^(i)|x^(i), w)

• Logistic regression typically gives the better solution compared to Naive Bayes, especially when there is lots of data and the conditional independence assumption does not hold
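A minimal sketch of binary logistic regression prediction and the MCLE objective, using the common sigmoid(z) = 1/(1 + exp(−z)) sign convention with y ∈ {0, 1}; names and data are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, X):
    # P(Y = 1 | X) modeled directly with a linear score passed through the sigmoid
    return sigmoid(X @ w + b)

def conditional_log_likelihood(w, b, X, y):
    # MCLE objective: sum over samples of log P(y^(i) | x^(i), w), with y in {0, 1}
    p = predict_proba(w, b, X)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[0.5, 1.0], [2.0, 1.5], [-1.0, -0.5]])
y = np.array([1, 1, 0])
w, b = np.array([1.0, 1.0]), 0.0
print(predict_proba(w, b, X), conditional_log_likelihood(w, b, X, y))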

6 Optimization
• Gradient descent uses a first-order Taylor expansion to approximate the objective function l as linear around the current weights w
– l(w + s) ≈ l(w) + g(w)^T s where g(w) = ∇l(w)
• Gradient descent update rule: w_new = w − α g(w) to minimize l(w)

– Step size α should decrease at a constant rate across updates for good convergence
• Batch gradient descent computes the error over the entire training dataset and then updates w
• Stochastic gradient descent computes the error over a single sample and then updates w
• Newton’s Method uses 2nd order Taylor expansion approximation

– l(w + s) ≈ l(w) + g(w)^T s + (1/2) s^T H(w) s


• Newton’s update rule: w_new = w − H(w)^{-1} g(w)
– H(w) is the Hessian matrix of second partial derivatives of l(w) with respect to w

• Incorporating a prior for a MAP estimate results in a regularization term when updating the weights (see the sketch below)
– w_new = w − α g(w) − α λ w
– Helps reduce overfitting by keeping weights near 0
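A minimal gradient descent sketch with the regularization term above, applied to the logistic regression objective from Section 5; the step size, λ, iteration count, and toy data are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, lam=0.01, steps=500):
    # Minimize the negative conditional log-likelihood with an L2 (MAP) penalty
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient of the negative log-likelihood
        w = w - alpha * g - alpha * lam * w       # w_new = w - alpha*g(w) - alpha*lambda*w
    return w

# Toy data with a constant 1 column acting as the bias
X = np.array([[1.0, 2.0, 1.0], [1.5, 1.8, 1.0], [-1.0, -2.0, 1.0], [-1.2, -1.5, 1.0]])
y = np.array([1, 1, 0, 0])
w = gradient_descent(X, y)
print(np.round(sigmoid(X @ w)))  # expected: [1, 1, 0, 0]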

7 Linear Regression
• Used to learn a function that linearly maps X onto Y where Y is continuous
– First choose a parameterized form for P(Y|X, w)
– Then derive MLE or MAP and estimate w

• MLE produces the squared loss objective: l(w) = (1/N) Σ_i (y^(i) − w^T x^(i))^2

– The closed-form solution that minimizes l(w) gives w = (X^T X)^{-1} X^T y


• MAP produces the squared loss plus sum of squared weights objective: l(w) = (1/N) Σ_i (y^(i) − w^T x^(i))^2 + λ||w||_2^2

– The closed-form solution that minimizes l(w) gives w = (X^T X + λI)^{-1} X^T y
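A minimal sketch of the two closed-form solutions above (MLE/least squares and the MAP/ridge version); the toy data and λ are illustrative:

import numpy as np

def fit_mle(X, y):
    # w = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_map(X, y, lam=0.1):
    # w = (X^T X + lambda*I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data: y is approximately 2*x1 - x2 with a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.01 * rng.normal(size=50)
print(fit_mle(X, y), fit_map(X, y))  # both close to [2, -1]; MAP slightly shrunk toward 0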

8 Support Vector Machine
• Separates positive and negative samples with as wide a margin as possible
• Hard margin SVM is for linearly separable data and expects perfect separation
– Objective is to minimize (1/2)||w||_2^2 such that y^(i) (w^T x^(i) + b) ≥ 1
– Only the support vectors are needed for inference
∗ Support vectors satisfy y^(i) (w^T x^(i) + b) = 1
• Soft margin SVM allows for misclassified samples in non-linearly separable data
– Objective is to minimize (1/2)||w||_2^2 + C Σ_i ξ_i such that y^(i) (w^T x^(i) + b) ≥ 1 − ξ_i

– C is a trade-off parameter; C = ∞ recovers the hard margin


– ξ_i is the slack variable where ξ_i = max(0, 1 − y^(i) (w^T x^(i) + b))
• In both objectives ||w||_2^2 is the regularizer
• In the soft margin objective, Σ_i max(0, 1 − y^(i) (w^T x^(i) + b)) is the hinge loss

• Utilize Lagrange multipliers to minimize the quadratic objective without the explicit constraints

– Want to solve the dual problem: max_α min_{w,b} L(w, b, α)


• Hard margin Lagrangian: L(w, b, α) = (1/2)||w||_2^2 + Σ_i α_i (1 − y^(i) (w^T x^(i) + b))

– w* = Σ_i α_i* y^(i) x^(i)

• Can also be applied to soft margin
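A minimal sketch of the soft margin objective written as regularizer plus hinge loss and minimized with plain subgradient descent in the primal; this is not the dual/Lagrangian approach from the notes, and C, the step size, the names, and the data are illustrative:

import numpy as np

def soft_margin_svm(X, y, C=1.0, alpha=0.01, steps=1000):
    # Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w^T x_i + b)) by subgradient descent
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        viol = margins < 1                      # samples with nonzero hinge loss (slack > 0)
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = soft_margin_svm(X, y)
print(np.sign(X @ w + b))  # expected: [1, 1, -1, -1]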
