Machine Learning Techniques
1 Week 1
1. Paradigms of Machine Learning
2 Week 2
1. Concerns in PCA
• Time Complexity: Finding the eigenvalues and eigenvectors takes about O(d^3), which is an issue when d is large.
• PCA finds linear combinations of the features; as such, non-linear relationships are not captured well.
2. Time complexity Issue
• Compute K = X^T X, compute its eigendecomposition, convert the eigenvectors according to the constraint, and finally obtain w_k = Xα_k.
3. Feature Transformation: Increase the dimensions so that non-linear relationships are captured, then apply PCA.
4. Kernel function
• To convert from quadratic to linear, we map the features to ϕ(x) = [1, f_1^2, f_2^2, f_1 f_2, f_1, f_2]
• Any function k : R^d × R^d → R that corresponds to a valid feature map is called a kernel function.
• A function k is a valid kernel function if there exists a map ϕ : R^d → R^D such that k(x_1, x_2) = ϕ(x_1)^T ϕ(x_2) for all x_1, x_2 ∈ R^d
• A kernel matrix is symmetric and positive semi-definite, i.e. all eigenvalues of K are non-negative.
• Polynomial Kernel: k(x, x′) = (x^T x′ + 1)^2
• Radial Basis Function kernel or Gaussian Kernel: k(x, x′) = exp(−||x − x′||^2 / (2σ^2))
5. Kernel PCA
• Compute Kernel matrix using a kernel function
• Center the kernel matrix using the formula: K^C = K − 1_n K − K 1_n + 1_n K 1_n = (I − 1_n) K (I − 1_n), where 1_n is the n × n matrix with all elements equal to 1/n
• Compute the eigenvalues {nλ_1, nλ_2, ...} and eigenvectors {β_1, β_2, ...} of the centered kernel matrix K^C and normalize the eigenvectors: α_u = β_u / √(nλ_u)
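A compact Kernel PCA sketch following the steps above; the kernel is passed in as a function of two vectors (e.g. the RBF kernel sketched earlier), and the small eigenvalue floor is an added numerical safeguard:

```python
import numpy as np

def kernel_pca(X, kernel, num_components=2):
    """X: (n, d) data matrix; kernel: function of two vectors; returns (n, num_components) projections."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    one_n = np.full((n, n), 1.0 / n)
    Kc = (np.eye(n) - one_n) @ K @ (np.eye(n) - one_n)        # centered kernel matrix K^C
    eigvals, eigvecs = np.linalg.eigh(Kc)                     # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:num_components]          # keep the top components
    # alpha_u = beta_u / sqrt(n * lambda_u); the eigenvalues of K^C are n*lambda_u
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas                                        # projections of the training points

# usage (illustrative): Z = kernel_pca(X, lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0), 2)
```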
3 Week 3
1. Intro to Clustering
• Goal: Partition the given data into k different clusters.
• Data points: {x1 , x2 , ...}
• Cluster Indicator: {z1 , z2 , ...}
• Performance Metric: F(z_1, ..., z_n) = Σ_{i=1}^n ||x_i − µ_{z_i}||_2^2, where µ_{z_i} is the mean/average of cluster z_i.
• Goal is to minimize the performance metric.
2. K-means Clustering
• Also known as Lloyd’s Algorithm
• Step 1, Initialization: We define some initial assignment to clusters z_1^0, z_2^0, ..., z_n^0
• Then, until convergence, we alternate the following two steps (see the sketch after this item):
• Step 2, Compute means: µ_k^t = Σ_i x_i 1(z_i^t = k) / Σ_i 1(z_i^t = k)
• Step 3, Reassignment: z_i^{t+1} = arg min_k ||x_i − µ_k^t||_2^2
• K-means cannot efficiently cluster data points that are not linearly separable; Kernel K-means or Spectral Clustering can be used for such data.
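A minimal NumPy sketch of Lloyd's algorithm as described in this item (random initial assignment; `k`, the iteration cap, and the empty-cluster guard are illustrative choices):

```python
import numpy as np

def kmeans(X, k, num_iters=100, seed=0):
    """X: (n, d) data; returns cluster assignments z and means mu."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, k, size=X.shape[0])          # Step 1: random initial assignment
    for _ in range(num_iters):
        # Step 2: compute the mean of each cluster (re-seed empty clusters with a random point)
        mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else X[rng.integers(X.shape[0])]
                       for j in range(k)])
        # Step 3: reassign each point to its nearest mean
        z_new = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
        if np.array_equal(z, z_new):                 # converged: assignments stopped changing
            break
        z = z_new
    return z, mu
```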
5. Initialization of centroids
• Pick the K means uniformly at random from the dataset.
• Means should be far apart.
• K-means++: Choose the first mean µ_1^0 uniformly at random from the dataset. For l = 2, 3, ..., K, choose µ_l^0 from the data points with probability proportional to its score.
• Score: S(x) = min_j ||x − µ_j^0||^2, the squared distance from x to the closest mean chosen so far.
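A short sketch of the K-means++ seeding rule above (the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Return k initial means chosen by the K-means++ rule."""
    rng = np.random.default_rng(seed)
    means = [X[rng.integers(X.shape[0])]]                     # first mean: uniform at random
    for _ in range(1, k):
        # score S(x): squared distance to the closest mean chosen so far
        d2 = np.min(((X[:, None, :] - np.array(means)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                                 # sample proportionally to the score
        means.append(X[rng.choice(X.shape[0], p=probs)])
    return np.array(means)
```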
6. Choice of K
• We want K to be neither too small nor too large, so we penalize large values of K.
• The chosen value of K is where Objective function + Penalty function is smallest.
• Akaike Information Criterion: 2K − 2 log(L(θ∗ ))
• Bayesian Information Criterion: K log(n) − 2 log(L(θ∗ ))
4 Week 4
1. Introduction to Estimation
• Estimation: There is some probabilistic mechanism that generates the data, about which we don't know "something".
• Goal: Observe data and ”Assume” a probabilistic model that generates the data.
• Assumption: Observations are Independent and Identically Distributed
2. Maximum Likelihood Estimation
• Fisher’s Principle of Maximum Likelihood: Write the likelihood function
L(p; {x_1, x_2, ..., x_n}) = P(x_1, x_2, ..., x_n; p)
= P(x_1; p) · P(x_2; p) ··· P(x_n; p) = Π_{i=1}^n p^{x_i} (1 − p)^{1−x_i}   [Independence]
• Estimator: p̂_ML = arg max_p Π_{i=1}^n p^{x_i} (1 − p)^{1−x_i}
We take the logarithm to simplify [log is monotonically increasing]:
= arg max_p Σ_{i=1}^n x_i log(p) + (1 − x_i) log(1 − p)
Taking the derivative and setting it to 0 we get
p̂_ML = (Σ_{i=1}^n x_i) / n
• Fisher's Proposal (continuous case): L(µ, σ^2; {x_1, x_2, ..., x_n}) = f_{x_1,...,x_n}(x_1, ..., x_n; µ, σ^2) = Π_i f_{x_i}(x_i; µ, σ^2)
Densities are used because for continuous distributions, probability is only defined over intervals; the probability of any single point is 0.
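A tiny numerical illustration of the ML estimates above, with made-up data; the Gaussian estimates (sample mean and mean squared deviation) are the standard results of maximizing Fisher's likelihood, though the notes do not write them out:

```python
import numpy as np

x_bern = np.array([1, 0, 1, 1, 0, 1])            # coin-flip style observations
p_ml = x_bern.mean()                             # p_hat_ML = (sum x_i) / n

x_gauss = np.array([2.1, 1.9, 2.4, 2.0, 1.8])    # continuous observations
mu_ml = x_gauss.mean()                           # mu_hat_ML = sample mean
var_ml = ((x_gauss - mu_ml) ** 2).mean()         # sigma^2_hat_ML = mean squared deviation

print(p_ml, mu_ml, var_ml)
```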
3. Bayesian Estimation
• Goal: Incorporate ”Hunch” about parameters into the estimation procedure.
• Approach: Think of the parameter to estimate as a ”random” variable.
• Hunch: codified using a probabilistic distribution over θ
• After looking at data, move to
Updated Hunch: Codified using another probabilistic distribution.
• P(θ | {x_1, x_2, ..., x_n}) = P({x_1, x_2, ..., x_n} | θ) P(θ) / P({x_1, x_2, ..., x_n})
• Beta prior: f(p; α, β) = p^{α−1} (1 − p)^{β−1} / Z
• Beta posterior: Beta(α + n_h, β + n_t), where n_h and n_t are the numbers of heads and tails observed
• One possible guess (the posterior mean): (α + n_h) / (α + β + n)
4. Gaussian Mixture Model
• Generate z_i with probability π_{z_i}, then generate x_i ∼ N(µ_{z_i}, σ_{z_i}^2)
• Observed: {x_1, x_2, ..., x_n}
Unobserved: {z_1, z_2, ..., z_n}
Parameters: π = [π_1, π_2, ..., π_K], and for each k, (µ_k, σ_k^2)
5. Likelihood of GMM
• L(Parameters; {x_1, x_2, ..., x_n}) = Π_{i=1}^n f(x_i; Parameters)
= Π_{i=1}^n [ Σ_{k=1}^K π_k f(x_i; µ_k, σ_k^2) ]
= Π_{i=1}^n [ Σ_{k=1}^K π_k · exp(−(x_i − µ_k)^2 / (2σ_k^2)) / (√(2π) σ_k) ]
• log L(Parameters) = Σ_{i=1}^n log( Σ_{k=1}^K π_k · exp(−(x_i − µ_k)^2 / (2σ_k^2)) / (√(2π) σ_k) )
• Jensen's inequality (for a convex function f): f(λ_1 a_1 + ... + λ_K a_K) ≤ λ_1 f(a_1) + ... + λ_K f(a_K),
i.e. f(Σ_k λ_k a_k) ≤ Σ_k λ_k f(a_k) whenever Σ_k λ_k = 1, λ_k ≥ 0
• Concave Functions: f((a + b)/2) ≥ (f(a) + f(b))/2 for all a, b
• Jensen's inequality (for a concave function f): f(λ_1 a_1 + ... + λ_K a_K) ≥ λ_1 f(a_1) + ... + λ_K f(a_K),
i.e. f(Σ_k λ_k a_k) ≥ Σ_k λ_k f(a_k), Σ_k λ_k = 1
• Linear Functions are both Concave and Convex.
• Log is a concave function
7. Estimating the parameters
• log L(Parameters) = Σ_{i=1}^n log( Σ_{k=1}^K λ_k^i · [ π_k exp(−(x_i − µ_k)^2 / (2σ_k^2)) / (λ_k^i √(2π) σ_k) ] )
Apply Jensen's inequality, fix λ, then take the derivative and set it to 0:
µ̂_k = Σ_{i=1}^n λ_k^i x_i / Σ_{i=1}^n λ_k^i,   σ̂_k^2 = Σ_{i=1}^n λ_k^i (x_i − µ̂_k)^2 / Σ_{i=1}^n λ_k^i,   π̂_k = (Σ_{i=1}^n λ_k^i) / n
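A minimal 1-D sketch of these updates. The weights λ_k^i are computed as posterior responsibilities (the usual EM choice, which these notes do not spell out); the initialization and iteration count are illustrative:

```python
import numpy as np

def gmm_em_1d(x, K, num_iters=50, seed=0):
    """x: (n,) 1-D data; returns (pi, mu, sigma2) after EM-style updates."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)        # initialize means at random data points
    sigma2 = np.full(K, x.var())
    for _ in range(num_iters):
        # lambda_k^i proportional to pi_k * N(x_i; mu_k, sigma_k^2)
        dens = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2[None, :])) \
               / np.sqrt(2 * np.pi * sigma2[None, :])
        lam = pi[None, :] * dens
        lam /= lam.sum(axis=1, keepdims=True)
        # the update equations from the notes
        Nk = lam.sum(axis=0)
        mu = (lam * x[:, None]).sum(axis=0) / Nk
        sigma2 = (lam * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
        pi = Nk / n
    return pi, mu, sigma2
```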
5 Week 5
1. Supervised Learning
• Input is features/attributes {x1 , ...xn } and labels {y1 , ..., yn }.
• If labels take only two values, the problem is called a binary classification problem. If labels take more than two discrete values, the problem is called multi-class classification. If labels can take any real value, the problem is called a regression problem.
2. Linear Regression
• Goal is to learn a function f which maps a given feature to its correct label.
• Error(f) = Σ_i (f(x_i) − y_i)^2
• Stacking the data points as the rows of a matrix X^T (so the columns of X are x_1, ..., x_n), we want X^T w ≈ y, i.e. we minimize ||X^T w − y||^2 = (X^T w − y)^T (X^T w − y).
• Simply take the gradient and set it to 0; we get (XX^T) w* = Xy.
• Since w* is the solution of an unconstrained optimization problem, we can apply gradient descent:
w^{t+1} = w^t − η^t ∇f(w^t), where η^t is the step size, i.e.
w^{t+1} = w^t − η^t · 2(XX^T w^t − Xy)
• In case the number of data points is large, we use
Stochastic Gradient Descent: for T iterations, sample a small batch of k data points uniformly at random from the set of all points, pretend this sample is the entire dataset, and take a gradient step with respect to it. After all rounds, we use w_SGD = (1/T) Σ_{t=1}^T w^t.
• Kernel Regression is similar to Kernel PCA: we take w* = Xα* and K = X^T X, with α* = K^{-1} y.
To make a prediction: w*^T ϕ(x_test) = Σ_{i=1}^n α_i* k(x_i, x_test)
• Probabilistic Regression: assume labels are generated as y_i = w^T x_i + ϵ, where ϵ is noise drawn from a Gaussian distribution. Using maximum likelihood we arrive at ŵ_ML = arg min_w Σ_i (w^T x_i − y_i)^2.
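A small sketch contrasting the closed-form solution with gradient descent on the least-squares objective above (the synthetic data, step size, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
A = rng.normal(size=(n, d))                      # rows are data points, i.e. A plays the role of X^T
w_true = np.array([1.0, -2.0, 0.5])
y = A @ w_true + 0.1 * rng.normal(size=n)        # labels with Gaussian noise

# Closed form: (X X^T) w* = X y  <=>  w* = (A^T A)^{-1} A^T y
w_closed = np.linalg.solve(A.T @ A, A.T @ y)

# Gradient descent on ||X^T w - y||^2 with gradient 2(X X^T w - X y)
w = np.zeros(d)
eta = 0.001
for _ in range(2000):
    w -= eta * 2 * (A.T @ (A @ w) - A.T @ y)

print(w_closed, w)
```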
6 Week 6
1. Goodness of Maximum Likelihood Estimator
• To understand how good ŵ_ML is at estimating w: E[||ŵ_ML − w||^2] = σ^2 · trace((XX^T)^{-1})
• trace((XX^T)^{-1}) = Σ_i 1/λ_i, where λ_i are the eigenvalues of XX^T
7 Week 7
1. Binary Classification
• Labels belong to the set {0, 1} or the set {−1, 1}
• Loss(h) = (1/n) Σ_i 1(h(x_i) ≠ y_i)
• h(x) = sign(wT x)
2. K Nearest Neighbours
• Given a test point xtest , find the closest point x′ to xtest in the training set. Predict ytest = y ′ .
• Can be affected by outliers; ask more neighbours (k of them) and predict the majority label.
• Problems: Choosing a distance function, Prediction is computationally expensive, No model is learnt.
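A minimal k-nearest-neighbours sketch of the procedure above, assuming squared Euclidean distance and a majority vote over k neighbours:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the majority label among the k training points closest to x_test."""
    dists = np.sum((X_train - x_test) ** 2, axis=1)     # squared Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority vote
```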
3. Decision Trees
• A question is a (feature, value) pair. Is feature ≤ value ?
• Need a measure of ”Impurity” for a set of labels to determine how good a question is.
• Entropy function = −(p log(p) + (1 − p) log(1 − p)), where p is the fraction of 1's, with the convention 0 log(0) = 0.
• Information Gain(feature, value) = Entropy(D) − [γ Entropy(D_yes) + (1 − γ) Entropy(D_no)], where γ = |D_yes| / |D|
• Discretize each feature in its [min, max] range. Pick the question that has the largest Information Gain. Repeat the procedure for D_yes and D_no.
• Can stop growing a tree if a node becomes ”Sufficiently” Pure. Depth of the tree is a hyperparameter.
There are alternate measures for ”goodness” of a question.
• Gini Index function is another popular function to measure impurity.
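A short sketch of the entropy and information-gain computations above for binary labels (function names are illustrative; base-2 logs are used, though any base works for comparing questions):

```python
import numpy as np

def entropy(y):
    """Binary-label entropy with the convention 0*log(0) = 0."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(X, y, feature, value):
    """Gain of the question 'Is X[:, feature] <= value?' on the labelled dataset (X, y)."""
    yes = X[:, feature] <= value
    gamma = yes.mean()                                   # |D_yes| / |D|
    return entropy(y) - (gamma * entropy(y[yes]) + (1 - gamma) * entropy(y[~yes]))
```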
4. Types of Modelling
• Generative Model: P (x, y)
• Discriminative Model: P (y|x)
8 Week 8
1. Generative Model based Algorithm
• Data: {(x_1, y_1), ..., (x_n, y_n)}, where x ∈ {0, 1}^d and y ∈ {0, 1}.
• Step 1: Decide the labels by tossing a coin with P (yi = 1) = p
• Step 2: Determine the features using the labels obtained in Step 1 through the conditional probability
P (xi |yi ).
• The parameters in generative modelling are defined as p̂ to decide the label, 2^d − 1 parameters for P(x|y = 1) and 2^d − 1 parameters for P(x|y = 0), where d is the number of features.
• Too many parameters, could lead to overfitting and the model may not be practically viable.
2. Alternate Generative Model
• Class conditional independence: This assumption states that the features of an object are conditionally
independent given its class label.
• Step 1 remains the same.
• Step 2: Determine the features for x given y using the following conditional probability:
P(x = [f_1, f_2, ..., f_d] | y) = Π_{i=1}^d (p_i^y)^{f_i} (1 − p_i^y)^{1−f_i}
• The parameters in generative modelling are defined as p̂ to decide the label, d parameters for P(x|y = 1) and d parameters for P(x|y = 0), where d is the number of features.
• Parameters are estimated using Maximum Likelihood Estimator.
3. Naive Bayes Algorithm
• The model is given by: P(x = [f_1, f_2, ..., f_d] | y) = Π_{i=1}^d (p_i^y)^{f_i} (1 − p_i^y)^{1−f_i}
• The parameters estimated are p, {p01 , p02 , ..., p0d }, and {p11 , p12 , ..., p1d }.
• The estimates are p̂ = (1/n) Σ_{i=1}^n y_i, and p̂_j^y = Σ_{i=1}^n 1(f_j^i = 1, y_i = y) / Σ_{i=1}^n 1(y_i = y)
• Given x_test ∈ {0, 1}^d, the prediction ŷ_test = 1 is made when P(ŷ_test = 1 | x_test) ≥ P(ŷ_test = 0 | x_test), and ŷ_test = 0 otherwise.
• We can express P(ŷ_test = t | x_test) = P(x_test | ŷ_test = t) P(ŷ_test = t) / P(x_test); since we are only comparing the two classes, there is no need to calculate P(x_test).
• One prominent issue with Naive Bayes is that if a feature value is never observed in the training set but is present in the test set, the prediction probabilities for both classes become zero.
Laplace smoothing: a popular remedy for this issue is to introduce two “pseudo” data points with labels 1 and 0, respectively, into the dataset, where all their features are set to 1.
• The decision function of Naive Bayes is linear, and the boundary is given by {x : P(y = 0|x) = P(y = 1|x)}
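A compact Bernoulli Naive Bayes sketch implementing the estimates and decision rule above; for simplicity it uses add-one smoothing of the counts rather than the two pseudo data points described in the notes, and it assumes both classes appear in the training data:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """X: (n, d) binary features, y: (n,) binary labels. Returns (p, p0, p1)."""
    p = y.mean()                                          # estimate of P(y = 1)
    # p_j^t = P(f_j = 1 | y = t), with add-one (Laplace-style) smoothing of the counts
    p1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    p0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return p, p0, p1

def predict(x, p, p0, p1):
    """Compare log P(x | y = t) + log P(y = t) for t = 0, 1."""
    log1 = np.log(p) + np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1))
    log0 = np.log(1 - p) + np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))
    return int(log1 >= log0)
```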
4. Gaussian Naive Bayes
• Assumes that features in the dataset follow a normal distribution and computes the likelihood of a
class for a given set of feature values by estimating the mean and variance of the feature values within
each class.
• P (x|y = 0) = N (µ0 , Σ) and P (x|y = 1) = N (µ1 , Σ)
• µ̂_t = Σ_i 1(y_i = t) x_i / Σ_i 1(y_i = t),   Σ̂ = (1/n) Σ_i (x_i − µ̂_{y_i})(x_i − µ̂_{y_i})^T
• If the covariance matrices are equal then the decision boundary is linear, if they are unequal then the
decision boundary is quadratic.
• For unequal covariance matrices: Σ̂_t = Σ_i 1(y_i = t)(x_i − µ̂_t)(x_i − µ̂_t)^T / Σ_i 1(y_i = t)
9 Week 9
1. Perceptron Learning Algorithm
• Widely employed for binary classification; it focuses on modelling the boundary between the classes.
• Objective function: Σ_i 1(h(x_i) ≠ y_i)
• Until convergence, select a pair (xi , yi ), if, sign(wT xi ) ̸= yi then update the weight vector
wt+1 = wt + xi yi .
• After l updates (mistakes), l^2 γ^2 ≤ ||w^{l+1}||^2 ≤ l R^2, where γ is the margin of the best separator and R = max_i ||x_i||.
• Upper bound on the number of mistakes: #mistakes ≤ R^2 / γ^2
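A minimal perceptron sketch following the update rule above (labels in {−1, +1}; the pass limit is an illustrative safeguard for non-separable data):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """X: (n, d) features, y: (n,) labels in {-1, +1}. Returns a weight vector w."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:        # misclassified (sign(0) counts as a mistake here)
                w += yi * xi                  # w^{t+1} = w^t + x_i y_i
                mistakes += 1
        if mistakes == 0:                     # converged: one full pass with no mistakes
            break
    return w
```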
2. Logistic Regression
• The objective is to estimate the probability that the dependent variable belongs to one of two possible values.
• Let z = w^T x_i. We define P(y_i = 1 | x_i) = g(z) = 1 / (1 + e^{−z}), where the function g(z) is called the sigmoid function.
• The objective is to maximize the log-likelihood Σ_i [y_i log(g(z_i)) + (1 − y_i) log(1 − g(z_i))], or equivalently minimize the negative log-likelihood.
• The per-example gradient of the log-likelihood is x_i (y_i − g(z_i)); gradient descent is applied to its negative.
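A brief gradient-descent sketch for logistic regression using the gradient above (labels in {0, 1}; the step size and iteration count are illustrative, and the gradient is averaged over examples for a stable step size):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, eta=0.1, num_iters=1000):
    """X: (n, d) features, y: (n,) labels in {0, 1}. Gradient descent on the negative log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        g = sigmoid(X @ w)                        # g(z_i) for every example
        w += eta * X.T @ (y - g) / len(y)         # ascend the log-likelihood (descend its negative)
    return w
```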
10 Week 10
1. Support Vector Machines
• A category of supervised learning algorithms designed for classification and regression analysis. SVMs aim to identify the optimal hyperplane that maximizes the margin between data points from different classes.
• Hard Margin SVMs: applicable only when the dataset is linearly separable
Direct or Kernelized Calculation of Q: Compute the matrix Q = X^T X directly or using a kernel, based on the dataset.
Gradient Descent: Employ the gradient of the dual objective, α^T 1 − (1/2) α^T Y^T Q Y α, in a gradient descent algorithm to iteratively find a satisfactory set of Lagrange multipliers α.
label(x_test) = sign(w^T x_test) = sign(Σ_i α_i y_i (x_i^T x_test))
label(x_test) = sign(w^T ϕ(x_test)) = sign(Σ_i α_i y_i k(x_i, x_test))
• Soft Margin SVMs: extends the standard SVM algorithm to accommodate some misclassifications in
the training data. This extension is particularly useful when dealing with non-linearly separable data.
It introduces a regularization parameter (C) to control the balance between maximizing the margin
and allowing for misclassifications.
min (1/2)||w||_2^2 + C Σ_i ϵ_i   such that   (w^T x_i) y_i + ϵ_i ≥ 1, ϵ_i ≥ 0
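A rough sketch of the dual-gradient approach described above: projected gradient ascent on α^T 1 − (1/2) α^T Y Q Y α with the box constraint 0 ≤ α_i ≤ C (soft margin; a very large C approximates the hard-margin case). The bias term and the constraint Σ_i α_i y_i = 0 are omitted to match the simplified dual written here; the kernel, step size, and iteration count are illustrative:

```python
import numpy as np

def svm_dual_train(X, y, kernel, C=1.0, eta=0.001, num_iters=1000):
    """X: (n, d), y: (n,) labels in {-1, +1}. Projected gradient ascent on the SVM dual."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    YQY = (y[:, None] * y[None, :]) * K              # (Y Q Y)_{ij} = y_i y_j k(x_i, x_j)
    alpha = np.zeros(n)
    for _ in range(num_iters):
        grad = 1.0 - YQY @ alpha                     # gradient of alpha^T 1 - (1/2) alpha^T YQY alpha
        alpha = np.clip(alpha + eta * grad, 0.0, C)  # project back onto the box [0, C]
    return alpha

def svm_predict(alpha, X, y, kernel, x_test):
    # label(x_test) = sign(sum_i alpha_i y_i k(x_i, x_test))
    return np.sign(sum(a * yi * kernel(xi, x_test) for a, yi, xi in zip(alpha, y, X)))
```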
11 Week 11
1. Bagging
• Simply distribute the dataset into m smaller datasets, then build m different models. For prediction, predict with each of the models, average out the predictions, and then apply the same final decision function (e.g. sign or majority vote) to the averaged prediction.
• Can also use feature bagging.
2. Boosting
• Input: S = {(x1 , y1 ), ..., (xn , yn )}
• Initialize D_0(i) = 1/n
• For t = 1 to T:
  h_t = output of the weak learner given S weighted by D_t
  D̃_{t+1}(i) = D_t(i) e^{α_t} if h_t(x_i) ≠ y_i, else D_t(i) e^{−α_t}
  D_{t+1}(i) = D̃_{t+1}(i) / Σ_j D̃_{t+1}(j)
• α_t = log( √((1 − error(h_t)) / error(h_t)) )
12 Week 12
1. Activation Functions
• Sigmoid function: 1 / (1 + e^{−z})
• Rectified Linear Unit (ReLU): max(0, z)
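A one-line NumPy version of each activation above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes z into (0, 1)

def relu(z):
    return np.maximum(0.0, z)         # max(0, z), applied elementwise
```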