ECE 449 Notes
1 K-Nearest Neighbors (Nonparametric)
• Distance metric: the Lp (Minkowski) distance between points, with parameter p
– L1 (Manhattan) Distance: p = 1
– L2 (Euclidean) Distance: p = 2
• Determine K using a validation set
– Small K is sensitive to noise and will overfit
– Large K includes examples that are too far away and will underfit
• Simple to implement (see the sketch at the end of this section), but has several issues
– Requires memory to store the entire training dataset
– Computationally expensive at inference time
– Sensitive to outliers
– Curse of dimensionality: high-dimensional data points spread far apart from each other, giving low performance
• Nonparametric models place mild assumptions on the data distribution and are good for complex data, but require storage/computation over the entire dataset
• Parametric models place strong modeling assumptions and require fitting the model, which gives more efficient storage/computation
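A minimal k-NN prediction sketch in Python/NumPy, assuming a stored training set `X_train`, `y_train`; the function name `knn_predict`, the majority-vote tie handling, and the toy data are illustrative choices, not from the notes:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, p=2):
    """Predict the label of x_query by majority vote of its k nearest neighbors.

    p selects the Minkowski distance: p=1 is L1 (Manhattan), p=2 is L2 (Euclidean).
    """
    # Lp distance from the query point to every stored training point
    dists = np.sum(np.abs(X_train - x_query) ** p, axis=1) ** (1.0 / p)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny toy example
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```

Note that all computation happens at query time: the entire training set must be stored and scanned for every prediction, which is the storage/inference cost mentioned above.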
2 Perceptron (Linear)
• Only applies to linearly separable data
• Perceptrons are linear classifiers trying to learn a hyperplane
• A hyperplane in $\mathbb{R}^d$ is represented as $w_0 + w^T x = 0$ where $w \in \mathbb{R}^d$
– $w$ is orthogonal to the hyperplane and points toward the positive half-space
• Predicted label: $\hat{y}(x) = \mathrm{sign}(w^T x + b)$
• The perceptron algorithm iterates through the data and updates $w$ on each misclassified sample until all data is correctly labeled (see the sketch at the end of this section)
– Update rule: $w_{new} = w + y\,x$ whenever $y\,(w^T x) \le 0$
• Theorem: Given $w^*$ that perfectly separates the data and margin $\gamma = \min_i |w^{*T} x^{(i)}|$ over all $x^{(i)} \in D$, the perceptron algorithm takes at most $1/\gamma^2$ updates to converge
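A minimal sketch of the perceptron training loop in Python/NumPy, assuming labels in {-1, +1}; folding a bias update in alongside `w` and capping the number of passes with `max_epochs` are illustrative additions, not part of the convergence theorem:

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Train a perceptron on labels y in {-1, +1}; returns weights w and bias b."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # Misclassified (or on the boundary): y * (w^T x + b) <= 0
            if y[i] * (X[i] @ w + b) <= 0:
                w += y[i] * X[i]   # update rule: w_new = w + y x
                b += y[i]
                mistakes += 1
        if mistakes == 0:          # all samples correctly labeled
            break
    return w, b
```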
3 Probability and Estimation
• Useful Probability Properties:
– Conditional Probability: $P(A \mid B) = \frac{P(A, B)}{P(B)}$
– Bayes Rule: $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
• From a dataset giving the joint distribution $P(X_1, X_2, \ldots, X_d, Y)$ we can calculate $P(Y \mid X_1, X_2, \ldots, X_d)$
– It is intuitive to learn $P(Y \mid X)$ from the joint distribution, but accurately estimating the joint requires an amount of data that may not be attainable
• Estimate parameters from sparse data using Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) Estimation
– MLE: $\hat{\theta} = \arg\max_\theta P(D \mid \theta)$
– MAP: $\hat{\theta} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$ according to Bayes Rule
– $\hat{\theta} = \arg\max_\theta P(D \mid \theta)\, P(\theta)$ as $P(D)$ does not depend on $\theta$
• MAP is better than MLE when the dataset has a small number of samples and the prior is accurate (see the worked sketch below)
• As the number of samples approaches infinity, the prior becomes irrelevant and the MAP estimate converges to the MLE
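A small worked sketch contrasting MLE and MAP, assuming a Bernoulli (coin-flip) likelihood with a Beta(α, β) prior; the counts and prior values are made-up illustrative numbers:

```python
# Estimate the heads probability theta of a coin from sparse data.
n_heads, n_tails = 3, 1            # small dataset D

# MLE: argmax_theta P(D | theta) = n_heads / (n_heads + n_tails)
theta_mle = n_heads / (n_heads + n_tails)

# MAP with a Beta(alpha, beta) prior on theta:
# argmax_theta P(D | theta) P(theta) = (n_heads + alpha - 1) / (n + alpha + beta - 2)
alpha, beta = 2.0, 2.0             # prior pulls the estimate toward 0.5
theta_map = (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

print(theta_mle)   # 0.75
print(theta_map)   # 4/6 ~ 0.667 -- pulled toward the prior mean with few samples
```

With many more flips the counts dominate the fixed prior terms and the two estimates coincide, matching the limit stated above.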
4 Naive Bayes (Probabilistic)
• Aims to learn $P(Y \mid X)$ through $P(X \mid Y)$ and $P(Y)$ using Bayes rule, with a conditional independence assumption to reduce the number of parameters to estimate (see the sketch below)
– $P(Y \mid X_1, \ldots, X_d) = \frac{P(X_1, \ldots, X_d \mid Y)\, P(Y)}{P(X_1, \ldots, X_d)} \propto P(X_1, \ldots, X_d \mid Y)\, P(Y)$ ignoring normalization
– With conditional independence: $P(X_1, \ldots, X_d \mid Y) = \prod_{j=1}^{d} P(X_j \mid Y)$
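A minimal Bernoulli naive Bayes sketch for binary features in Python/NumPy, assuming the parameters $P(X_j \mid Y)$ and $P(Y)$ are estimated by counting; the add-one (Laplace) smoothing and the names `nb_fit`/`nb_predict` are illustrative choices:

```python
import numpy as np

def nb_fit(X, y):
    """Estimate P(Y) and P(X_j = 1 | Y) by counting, with add-one smoothing."""
    classes = np.unique(y)
    prior = np.array([np.mean(y == c) for c in classes])                 # P(Y = c)
    cond = np.array([(X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2)  # P(X_j = 1 | Y = c)
                     for c in classes])
    return classes, prior, cond

def nb_predict(x, classes, prior, cond):
    """Pick the class maximizing log P(Y) + sum_j log P(X_j | Y)."""
    log_post = np.log(prior) + (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1)
    return classes[np.argmax(log_post)]
```

Because of the conditional independence assumption, only one per-feature parameter per class is estimated rather than a full joint table over all feature combinations.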
5 Logistic Regression (Linear)
• Objective (the conditional log-likelihood) is concave, but does not have a closed form, so it requires iterative optimization techniques
• Logistic regression typically gives a better solution than naive Bayes, especially with lots of data and when conditional independence does not hold
6 Optimization
• Gradient Descent uses a first-order Taylor expansion to approximate the objective function $\ell$ as linear around the current weights $w$
– $\ell(w + s) \approx \ell(w) + g(w)^T s$ where $g(w) = \nabla \ell(w)$
• Gradient Descent update rule: $w_{new} = w - \alpha\, g(w)$ to minimize $\ell(w)$
– Step size $\alpha$ should decrease at a constant rate across updates for good convergence
• Batch gradient descent computes the gradient of the error over the entire training dataset, then updates $w$
• Stochastic gradient descent computes the gradient of the error on a single sample, then updates $w$
• Newton’s Method uses 2nd order Taylor expansion approximation
• Incorporating a prior for a MAP estimate results in a regularization term in the weight update (see the sketch at the end of this section)
– $w_{new} = w - \alpha\, g(w) - \alpha\lambda w$
– Helps reduce overfitting by keeping the weights near 0
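A minimal sketch of the regularized gradient-descent update applied to a simple least-squares objective; the particular objective, the step-size decay schedule, and the iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, lam=0.01, n_steps=100):
    """Minimize the mean squared error ||Xw - y||^2 / n with an L2 (MAP / Gaussian-prior) penalty."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_steps):
        g = 2.0 / n * X.T @ (X @ w - y)        # gradient of the data term
        w = w - alpha * g - alpha * lam * w    # w_new = w - alpha*g(w) - alpha*lambda*w
        alpha *= 0.99                          # slowly shrink the step size
    return w
```

Replacing the full-dataset gradient `g` with the gradient computed on a single randomly chosen sample gives the stochastic variant described above.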
7 Linear Regression
• Used to learn a function that linearly maps X onto Y, where Y is continuous (see the sketch below)
– First choose a parameterized form for $P(Y \mid X, w)$
– Then derive the MLE or MAP estimate of $w$
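A minimal sketch of the MLE estimate under the common Gaussian-noise assumption for $P(Y \mid X, w)$, which reduces to ordinary least squares with a closed-form solution; appending a bias column and using `np.linalg.lstsq` are illustrative details:

```python
import numpy as np

def linreg_mle(X, y):
    """MLE of w for P(y | x, w) = N(w^T x, sigma^2): ordinary least squares."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias feature
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)      # solves min ||Xb w - y||^2
    return w
```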
8 Support Vector Machine
• Separates positive and negative samples with as wide a margin as possible
• Hard margin SVM is for linearly separable data and expects perfect separation
– Objective is to minimize $\frac{1}{2}\|w\|_2^2$ such that $y^{(i)}(w^T x^{(i)} + b) \ge 1$ for all $i$
– Only the support vectors are needed for inference
∗ Support vectors satisfy $y^{(i)}(w^T x^{(i)} + b) = 1$
• Soft margin SVM allows for misclassified samples in non-linearly separable data (see the sketch at the end of this section)
– Objective is to minimize $\frac{1}{2}\|w\|_2^2 + C\sum_i \xi_i$ such that $y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$
– $w^* = \sum_i \alpha_i^* y^{(i)} x^{(i)}$
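A minimal sketch of soft-margin SVM training via subgradient descent on the equivalent unconstrained hinge-loss objective $\frac{1}{2}\|w\|_2^2 + C\sum_i \max(0,\, 1 - y^{(i)}(w^T x^{(i)} + b))$; the learning rate and iteration count are illustrative choices:

```python
import numpy as np

def svm_train(X, y, C=1.0, lr=0.01, n_steps=1000):
    """Soft-margin linear SVM via subgradient descent on the hinge-loss objective (labels in {-1, +1})."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        margins = y * (X @ w + b)
        viol = margins < 1                      # samples with xi_i > 0 (inside or beyond the margin)
        # Subgradient of 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Only margin-violating samples contribute to the subgradient, which mirrors the fact that the optimal $w^*$ is a combination of support vectors.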