
Machine Learning Techniques

Rishabh Indoria
December 23, 2023

1 Week 1
1. Paradigms of Machine Learning

• Broad Paradigms: Supervised Learning, Unsupervised Learning, Sequential Learning


• Foundations: Linear Algebra for Structure, Probability for Uncertainty, Optimization for Decision
2. Representation Learning
• Part of Unsupervised Learning
• Compression: the act of finding patterns in data.
Representation: some relation or basis shared across the data points.
Coefficients: the part specific to each data point which, together with the representation, lets us reconstruct the original data point.
• Choose Representation and Coefficient such that reconstruction error is minimized.
• n represents the number of data points and d represents the number of features.
• The projection of x onto the line w is ((x.w)/||w||) (w/||w||); if w is a unit vector, this is simply (x.w)w.
• The optimal value of w is the eigenvector corresponding to the maximum eigenvalue of the covariance matrix. If there are d features, the covariance matrix is a d × d matrix, given by C = (1/n) Σ_{i=1}^n xi xi^T.
• We want to maximize w^T C w (subject to ||w|| = 1), which is equivalent to minimizing the reconstruction error.
• All the residuals x − (x.w)w are perpendicular to the line w
• Centred data: the mean of the dataset is 0. To centre a dataset, simply subtract the mean of the dataset from every data point.
3. Principal Component Analysis
• w1 is the line which minimizes the reconstruction error of a set of points, and w2 is the line which
minimizes the reconstruction error of the residues. These lines are orthogonal to each other.
• We find the best-fit line w1, then compute the residuals, and then use the residuals to find the next best-fit line w2. We keep doing this for d iterations, after which the residues are guaranteed to become 0. The residual after k rounds is x − (x.w1)w1 − (x.w2)w2 − ... − (x.wk)wk.
• The vectors w1 , w2 , ..., wd are orthogonal and span or are the basis of Rd .
• The best line one can obtain at the kth round of the algorithm is the eigenvector corresponding to the kth largest eigenvalue of the covariance matrix (a sketch of the procedure follows this list).
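
A minimal sketch of this procedure in Python/NumPy (the function and variable names are illustrative, not from the notes): center the data, form the covariance matrix C = (1/n) Σ xi xi^T, and take the top eigenvectors as the principal directions.

import numpy as np

def pca(X, num_components):
    # X is an (n, d) array: n data points, d features.
    Xc = X - X.mean(axis=0)            # center the data (mean 0)
    n = X.shape[0]
    C = (Xc.T @ Xc) / n                # covariance matrix, d x d
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
    W = eigvecs[:, order[:num_components]] # top principal directions
    coeffs = Xc @ W                        # coefficients (projections) of each point
    return W, coeffs

# Usage (illustrative): W, coeffs = pca(np.random.randn(100, 5), num_components=2)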

2 Week 2
1. Concerns in PCA
• Time complexity: finding the eigenvalues and eigenvectors takes about O(d³), which is an issue when d is large.
• PCA finds linear combinations of features, so non-linear relationships are not captured well.
2. Time complexity Issue

• Large d (d ≫ n); the data matrix X is d × n, with one data point per column.


• Let wk be the eigenvector corresponding to the kth largest eigenvalue λk of C, i.e. Cwk = λk wk.
• Write wk = Xαk with the constraint αk^T X^T X αk = 1.
• The non-zero eigenvalues of XX^T and X^T X are exactly the same.
• Let βk be the eigenvector of X^T X corresponding to the eigenvalue nλk; convert it according to the constraint via αk = βk / √(nλk).
• Procedure: compute K = X^T X (an n × n matrix), compute its eigendecomposition, convert the eigenvectors according to the constraint, and finally set wk = Xαk.
3. Feature Transformation: Increase the dimensions, such that non-linear relationships are captured, then
apply PCA.
4. Kernel function
• To convert a quadratic relationship into a linear one, we map the features to φ(x) = [1, f1^2, f2^2, f1f2, f1, f2].
• A function k: R^d × R^d → R is a valid kernel function if there exists a map φ: R^d → R^D such that k(x1, x2) = φ(x1)^T φ(x2) for all x1, x2 ∈ R^d.
• A valid kernel matrix is symmetric and positive semi-definite; all of its eigenvalues are non-negative.
• Polynomial Kernel: k(x, x′ ) = (xT x′ + 1)2
• Radial Basis Function (Gaussian) kernel: k(x, x′) = exp(−||x − x′||² / (2σ²))

5. Kernel PCA
• Compute Kernel matrix using a kernel function
• Center the kernel using the formula: K^C = K − 1n K − K 1n + 1n K 1n = (I − 1n)K(I − 1n), where 1n is the n × n matrix with all entries equal to 1/n.
• Compute the eigenvalues {nλ1, nλ2, ...} and eigenvectors {β1, β2, ...} of K^C and normalize the eigenvectors: αu = βu / √(nλu).

• Compute the transformed data points: the uth coordinate of the ith point is φ(xi)^T wu = Σj αuj K^C_ij, so the transformed point is [Σj α1j K^C_ij, Σj α2j K^C_ij, ...].
• We cannot explicitly compute the eigenvectors of the covariance matrix, as that would require computing φ, which would defeat the whole purpose.
• But we can still compute a "compressed" representation (a sketch of the full procedure follows this list).
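
A minimal sketch of kernel PCA in Python/NumPy following the steps above, using an RBF kernel as one possible choice (names are illustrative):

import numpy as np

def rbf_kernel(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

def kernel_pca(X, num_components, sigma=1.0):
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    one_n = np.full((n, n), 1.0 / n)
    Kc = (np.eye(n) - one_n) @ K @ (np.eye(n) - one_n)   # centered kernel
    eigvals, betas = np.linalg.eigh(Kc)                  # eigenvalues are n*lambda
    order = np.argsort(eigvals)[::-1][:num_components]
    # alpha_u = beta_u / sqrt(n * lambda_u); small floor avoids dividing by ~0
    alphas = betas[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    # row i of the result is [sum_j alpha_{1j} Kc[i, j], sum_j alpha_{2j} Kc[i, j], ...]
    return Kc @ alphas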

3 Week 3
1. Intro to Clustering
• Goal: Partition the given data into k different clusters.
• Data points: {x1 , x2 , ...}
• Cluster Indicator: {z1 , z2 , ...}
• Performance metric: F(z1, ..., zn) = Σ_{i=1}^n ||xi − µ_zi||², where µ_zi is the mean of cluster zi.
• The goal is to minimize this metric.
2. K-means Clustering
• Also known as Lloyd’s Algorithm
• Step 1, Initialization: we define some assignment of points to clusters z1^0, z2^0, ..., zn^0.
• Then, until convergence, repeat:
• Step 2, Compute means: µk^t = (Σi xi 1(zi^t = k)) / (Σi 1(zi^t = k))
• Step 3, Reassignment: zi^{t+1} = arg mink ||xi − µk^t||²


• FACT: Lloyd's algorithm converges, but the converged solution may not be "optimal". It does produce "reasonable" clusters in practice (a sketch follows this item).
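
A minimal sketch of Lloyd's algorithm in Python/NumPy (the random initialization, empty-cluster handling, and names are illustrative):

import numpy as np

def lloyds(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    z = rng.integers(0, k, size=n)                 # Step 1: random initial assignment
    for _ in range(max_iters):
        # Step 2: mean of the points currently assigned to each cluster
        means = np.array([X[z == c].mean(axis=0) if np.any(z == c) else X[rng.integers(n)]
                          for c in range(k)])
        # Step 3: reassign each point to its nearest mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_z = dists.argmin(axis=1)
        if np.array_equal(new_z, z):               # converged: no assignment changed
            break
        z = new_z
    return z, means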
3. Convergence of K-means
• FACT 1: Let x1, x2, ..., xl be points and v* = arg minv Σi ||xi − v||². Then v* = (Σi xi) / l.
• In every iteration, the objective function strictly reduces, which implies that no partition repeats.
• There are only a FINITE number of partitions, so the algorithm must converge.
4. Nature of Clusters
• For a cluster with mean µ1, all x assigned to it satisfy x^T(µn − µ1) ≤ (||µn||² − ||µ1||²) / 2, for all n ≠ 1.
• VORONOI region: an intersection of half-spaces. Cluster regions are Voronoi regions.

• K-means can not efficiently cluster data points that are not linearly separable. Kernel K-means is
used to cluster data points that are not linearly separable. Spectral Clustering can also be used.

5. Initialization of centroids
• Pick the k means uniformly at random from the dataset.
• The means should be far apart.
• K-means++: choose the first mean µ1^0 uniformly at random from the dataset. For l = 2, 3, ..., k choose µl^0 probabilistically, proportional to its score (a sketch follows this item).
• Score: S(x) = minj ||x − µj^0||²
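
A minimal sketch of the K-means++ initialization above in Python/NumPy (names are illustrative):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    means = [X[rng.integers(n)]]                   # first mean: uniform at random
    for _ in range(1, k):
        # Score S(x) = min_j ||x - mu_j||^2 over the means chosen so far
        dists = np.min(((X[:, None, :] - np.array(means)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = dists / dists.sum()                # probability proportional to score
        means.append(X[rng.choice(n, p=probs)])
    return np.array(means)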
6. Choice of K
• We want K to be neither too small nor too large, so we penalize large values of K.
• The chosen K is the one where objective function + penalty function is smallest.
• Akaike Information Criterion: 2K − 2 log(L(θ∗ ))
• Bayesian Information Criterion: K log(n) − 2 log(L(θ∗ ))

4 Week 4
1. Introduction to Estimation
• Estimation: there is some probabilistic mechanism that generates the data, about which we do not know "something".
• Goal: Observe data and ”Assume” a probabilistic model that generates the data.
• Assumption: Observations are Independent and Identically Distributed
2. Maximum Likelihood Estimation
• Fisher's Principle of Maximum Likelihood: write the likelihood function
L(p; {x1, x2, ..., xn}) = P(x1, x2, ..., xn; p) = P(x1; p)·P(x2; p)···P(xn; p) = Π_{i=1}^n p^{xi}(1 − p)^{1−xi} [Independence]
• Estimator: p̂_ML = arg maxp Π_{i=1}^n p^{xi}(1 − p)^{1−xi}
We take the logarithm to simplify (log is monotonically increasing):
= arg maxp Σ_{i=1}^n [xi log(p) + (1 − xi) log(1 − p)]
Taking the derivative and setting it to 0 gives p̂_ML = (Σ_{i=1}^n xi) / n (a small illustration follows this item).
• Fisher's Proposal for continuous data: L(µ, σ²; {x1, ..., xn}) = f_{x1,...,xn}(x1, ..., xn; µ, σ²) = Π f_{xi}(xi; µ, σ²).
Densities are used because, for continuous random variables, probability only exists over intervals and the probability at any particular point is 0.
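
A minimal numeric illustration of this estimate in Python/NumPy, assuming Bernoulli (coin-toss) data as in the derivation above (the data values are made up):

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])   # observed Bernoulli samples (illustrative)
p_ml = x.sum() / len(x)            # p_hat_ML = (sum_i x_i) / n
print(p_ml)                        # 0.666...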
3. Bayesian Estimation
• Goal: Incorporate ”Hunch” about parameters into the estimation procedure.
• Approach: Think of the parameter to estimate as a ”random” variable.
• Hunch: codified using a probabilistic distribution over θ
• After looking at the data, we move to an updated hunch, codified using another probability distribution.
• P(θ | {x1, x2, ..., xn}) = P({x1, x2, ..., xn} | θ) P(θ) / P({x1, x2, ..., xn})
• BETA PRIOR: f(p; α, β) = p^{α−1}(1 − p)^{β−1} / z
• BETA POSTERIOR: Beta(α + nh, β + nt), where nh and nt are the numbers of observed heads and tails.
• One possible guess: (α + nh) / (α + β + n) (a small numeric illustration follows this item).
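
A minimal numeric illustration of the Beta-Bernoulli update above in Python (the prior and counts are made up):

alpha, beta = 2, 2                                 # prior hunch: p is probably near 0.5
n_h, n_t = 7, 3                                    # observed: 7 heads, 3 tails
alpha_post, beta_post = alpha + n_h, beta + n_t    # Beta posterior parameters
point_estimate = (alpha + n_h) / (alpha + beta + n_h + n_t)   # one possible guess
print(alpha_post, beta_post, point_estimate)       # 9 5 0.6428...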

4. Gaussian Mixture Models


• STEP 1: Pick which mixture component a data point comes from.
• STEP 2: Generate the data point from that mixture component.
• Generate a mixture component zi ∈ {1, ..., k} with P(zi = l) = πl
• Generate xi ∼ N(µ_zi, σ²_zi)
• Observed: {x1 , x2 , ..., xn }
Unobserved: {z1 , z2 , ..., zn }
Parameters: π = [π1 , π2 , ..., πk ], and for each k (µk , σk2 )
5. Likelihood of GMM
• L(parameters; {x1, x2, ..., xn}) = Π_{i=1}^n f(xi; parameters)
= Π_{i=1}^n [Σ_{k=1}^K πk f(xi; µk, σk²)]
= Π_{i=1}^n [Σ_{k=1}^K πk · (1/(√(2π) σk)) · exp(−(xi − µk)²/(2σk²))]
• log L(parameters) = Σ_{i=1}^n log(Σ_{k=1}^K πk · (1/(√(2π) σk)) · exp(−(xi − µk)²/(2σk²)))

6. Convex Functions and Jensen’s inequality


• Convex functions: f((a + b)/2) ≤ (f(a) + f(b))/2 for all a, b
• Jensen's inequality (convex): f(λ1 a1 + ... + λk ak) ≤ λ1 f(a1) + ... + λk f(ak), i.e. f(Σ λk ak) ≤ Σ λk f(ak) whenever Σ λk = 1
• Concave functions: f((a + b)/2) ≥ (f(a) + f(b))/2 for all a, b
• Jensen's inequality (concave): f(Σ λk ak) ≥ Σ λk f(ak) whenever Σ λk = 1
• Linear functions are both concave and convex.
• log is a concave function.
7. Estimating the parameters
• log L(parameters) = Σ_{i=1}^n log(Σ_{k=1}^K λk^i · (πk · (1/(√(2π) σk)) · exp(−(xi − µk)²/(2σk²))) / λk^i)
Apply Jensen's inequality, fix λ, then take the derivative and set it to 0:
• µ̂k = (Σ_{i=1}^n λk^i xi) / (Σ_{i=1}^n λk^i),  σ̂k² = (Σ_{i=1}^n λk^i (xi − µ̂k)²) / (Σ_{i=1}^n λk^i),  π̂k = (Σ_{i=1}^n λk^i) / n
• Fixing all parameters and maximizing with respect to λ:
λ̂k^i = πk · (1/(√(2π) σk)) · exp(−(xi − µk)²/(2σk²)) / (Σ_{l=1}^K πl · (1/(√(2π) σl)) · exp(−(xi − µl)²/(2σl²))) = P(xi | zi = k) P(k) / P(xi)
with Σ_{k=1}^K λk^i = 1 for all i.

8. Expectation Maximization Algorithm


• Initialize Parameters
• Then until convergence
Expectation Step: Calculate λt+1 using P arameterst
Maximization Step: Calculate P arameterst+1 using λt+1
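
A minimal sketch of this EM loop for a one-dimensional GMM in Python/NumPy (the initialization, iteration count, and names are illustrative):

import numpy as np

def em_gmm_1d(x, k, num_iters=100, seed=0):
    # x is a 1-D NumPy array of observations.
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(k, 1.0 / k)                       # initialize parameters
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    for _ in range(num_iters):
        # Expectation step: lambda_k^i proportional to pi_k * N(x_i; mu_k, var_k)
        dens = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * var[None, :])) \
               / np.sqrt(2 * np.pi * var[None, :])
        lam = pi[None, :] * dens
        lam /= lam.sum(axis=1, keepdims=True)
        # Maximization step: re-estimate mu, var, pi from the lambdas
        Nk = lam.sum(axis=0)
        mu = (lam * x[:, None]).sum(axis=0) / Nk
        var = (lam * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
        pi = Nk / n
    return pi, mu, var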

5 Week 5
1. Supervised Learning
• Input is features/attributes {x1 , ...xn } and labels {y1 , ..., yn }.
• If the labels take only two values, the problem is called a binary classification problem. If the labels take n values, it is called multi-class classification. If the labels take real values, it is called a regression problem.
2. Linear Regression

• Goal is to learn a function f which maps a given feature to its correct label.
• Error(f) = Σi (f(xi) − yi)²

• For linear regression we take f (x) = wT x, and minimize Error(f ).

• Stacking the data points as rows ( −−−x1−−− , −−−x2−−− , ..., −−−xn−−− ), we want x^T w ≈ y = (y1, ..., yn)^T; we minimize (x^T w − y)^T (x^T w − y), where x is the d × n data matrix whose columns are the data points.
• Simply take the gradient and set it to 0; we get (xx^T) w* = xy.
• Since w* is the solution of an unconstrained optimization problem, we can instead apply gradient descent:
w^{t+1} = w^t − η^t ∇f(w^t), where η is the step size, i.e.
w^{t+1} = w^t − η^t · 2(xx^T w^t − xy)
• In case the number of data points is large, we use
Stochastic Gradient Descent: for T iterations, sample a small batch of k data points uniformly at random from the set of all points, pretend this sample is the entire dataset, and take a gradient step with respect to it. After all rounds, we use w_SGD = (1/T) Σ_{t=1}^T w^t.
• Kernel Regression is similar to kernel PCA: we take w* = xα* and K = x^T x, with α* = K^{−1} y.
To make a prediction: w*^T φ(x_test) = Σ_{i=1}^n αi* K(xi, x_test)
• Probabilistic Regression: assume the labels are generated as w^T x + ε, where ε is noise drawn from a Gaussian distribution. Using maximum likelihood we arrive at ŵ_ML = arg minw Σi (w^T xi − yi)² (a sketch of the closed-form and gradient-descent solutions follows this item).
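
A minimal sketch of linear regression in Python/NumPy following the notation above, where the data matrix x is d × n with one data point per column (names, step size, and iteration count are illustrative):

import numpy as np

def linear_regression_closed_form(x, y):
    # Normal equations: (x x^T) w* = x y
    return np.linalg.solve(x @ x.T, x @ y)

def linear_regression_gd(x, y, eta=0.01, num_iters=1000):
    # Gradient descent: w <- w - eta * 2 (x x^T w - x y); eta may need tuning for the data scale.
    w = np.zeros(x.shape[0])
    for _ in range(num_iters):
        w = w - eta * 2 * (x @ (x.T @ w) - x @ y)
    return w

# Usage (illustrative): x is d x n, y has n labels.
# x = np.random.randn(3, 50); y = x.T @ np.array([1.0, -2.0, 0.5])
# w_star = linear_regression_closed_form(x, y)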

6 Week 6
1. Goodness of Maximum Likelihood Estimator
• To understand how good ŵ_ML is at estimating w: E[||ŵ_ML − w||²] = σ² trace((xx^T)^{−1})
• trace((xx^T)^{−1}) = Σi 1/λi, where λi are the eigenvalues of xx^T
• ŵ_new = (xx^T + λI)^{−1} xy, and trace((xx^T + λI)^{−1}) = Σi 1/(λi + λ)
• Existence Theorem: there exists some λ such that ŵ_new has lower mean squared error than ŵ_ML. We find this λ by cross-validation.
• Split the training set into train set and validation set. Train on the train set and check for error on
validation set. Pick λ that gives the least error.
• K-fold Cross Validation, Split dataset into K folds, train on K-1 folds and validate on the last fold.
Pick λ that gives the least average error.
• Leave One Out Cross Validation, train on n − 1 data points and validate on the last point.
2. Bayesian Modelling for Linear Regression
• Need a prior on w; one choice is N(0, γ²I).
• The posterior is proportional to P(dataset | w) P(w) = (Πi e^{−(yi − w^T xi)²/2}) · e^{−||w||²/(2γ²)}
• ŵ_MAP = arg minw (1/2) Σi (yi − w^T xi)² + (1/(2γ²)) ||w||²; taking the gradient and setting it to 0 gives
ŵ_MAP = (xx^T + (1/γ²) I)^{−1} xy


3. Ridge Regression
• ŵ_R = arg minw Σi (w^T xi − yi)² + λ||w||², where the added term is called the regularization term (a sketch follows this item).
• Ridge pushes weight values towards 0 but does not necessarily make them exactly 0.
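
A minimal sketch of ridge regression in Python/NumPy, using the closed form above and choosing λ on a held-out validation split (the candidate λ values and names are illustrative):

import numpy as np

def ridge(x, y, lam):
    # w_R = (x x^T + lambda I)^(-1) x y, with x a d x n data matrix
    d = x.shape[0]
    return np.linalg.solve(x @ x.T + lam * np.eye(d), x @ y)

def pick_lambda(x_train, y_train, x_val, y_val, candidates=(0.01, 0.1, 1.0, 10.0)):
    # Train on the train set, check error on the validation set, pick the best lambda.
    errors = []
    for lam in candidates:
        w = ridge(x_train, y_train, lam)
        errors.append(np.sum((x_val.T @ w - y_val) ** 2))
    return candidates[int(np.argmin(errors))]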
4. Lasso Regression
• An alternative way to regularize is to use the L1 norm (the sum of absolute values instead of squared values).
• It is much more likely to make some weight values exactly 0, but it does not have a closed-form solution.
• Subgradient methods are usually used to solve LASSO.

7 Week 7
1. Binary Classification
• Labels belong to the set {0, 1} or the set {−1, 1}
• Loss(h) = (1/n) Σi 1(h(xi) ≠ yi)
• h(x) = sign(wT x)

2. K Nearest Neighbours
• Given a test point xtest , find the closest point x′ to xtest in the training set. Predict ytest = y ′ .
• Can be affected by outliers; ask more neighbours (k of them) and predict the majority label.
• Problems: choosing a distance function, prediction is computationally expensive, and no model is learnt (a sketch follows this item).
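
A minimal sketch of k-nearest-neighbour prediction for binary labels in Python/NumPy (squared Euclidean distance is used as one possible choice of distance function; names are illustrative):

import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    # Distances from the test point to every training point.
    dists = np.sum((X_train - x_test) ** 2, axis=1)
    # Ask the k closest neighbours and predict the majority label (labels in {0, 1}).
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    return int(np.round(votes.mean()))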
3. Decision Trees
• A question is a (feature, value) pair. Is feature ≤ value ?
• Need a measure of ”Impurity” for a set of labels to determine how good a question is.
• Entropy function = −(p log(p) + (1 − p) log(1 − p)), where p is the fraction of 1’s, and log(0) is
considered 0.
• Information Gain(feature, value) = Entropy(D) − [γ·Entropy(D_yes) + (1 − γ)·Entropy(D_no)], where γ = |D_yes| / |D|
• Discretize each feature in its [min, max] range. Pick the question that has the largest Information Gain. Repeat the procedure for D_yes and D_no (a sketch of the entropy and gain computations follows this item).
• Can stop growing a tree if a node becomes ”Sufficiently” Pure. Depth of the tree is a hyperparameter.
There are alternate measures for ”goodness” of a question.
• Gini Index function is another popular function to measure impurity.
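
A minimal sketch of the entropy and information-gain computations above for binary labels in Python/NumPy (names are illustrative):

import numpy as np

def entropy(labels):
    p = np.mean(labels)                      # fraction of 1s
    if p == 0 or p == 1:
        return 0.0                           # log(0) is treated as 0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(feature_values, labels, threshold):
    # Question: "is feature <= threshold?"
    yes = labels[feature_values <= threshold]
    no = labels[feature_values > threshold]
    if len(yes) == 0 or len(no) == 0:
        return 0.0                           # question does not split the data
    gamma = len(yes) / len(labels)           # |D_yes| / |D|
    return entropy(labels) - (gamma * entropy(yes) + (1 - gamma) * entropy(no))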
4. Types of Modelling
• Generative Model: P (x, y)
• Discriminative Model: P (y|x)

8 Week 8
1. Generative Model based Algorithm
• Data: {(x1, y1), ..., (xn, yn)}, where x ∈ {0, 1}^d and y ∈ {0, 1}.
• Step 1: Decide the labels by tossing a coin with P (yi = 1) = p
• Step 2: Determine the features using the labels obtained in Step 1 through the conditional probability
P (xi |yi ).
• The parameters in this generative model are p̂ to decide the label, 2^d − 1 parameters for P(x | y = 1), and 2^d − 1 parameters for P(x | y = 0), where d is the number of features.
• Too many parameters could lead to overfitting, and the model may not be practically viable.
2. Alternate Generative Model
• Class conditional independence: This assumption states that the features of an object are conditionally
independent given its class label.
• Step 1 remains the same.
• Step 2: Determine the features for x given y using the following conditional probability,
P(x = [f1, f2, ..., fd] | y) = Πi (pi^y)^{fi} (1 − pi^y)^{1−fi}.

• The parameters in this generative model are p̂ to decide the label, d parameters for P(x | y = 1), and d parameters for P(x | y = 0), where d is the number of features.
• Parameters are estimated using Maximum Likelihood Estimator.
3. Naive Bayes Algorithm
• The model is given by: P(x = [f1, f2, ..., fd] | y) = Πi (pi^y)^{fi} (1 − pi^y)^{1−fi}.
• The parameters estimated are p, {p01 , p02 , ..., p0d }, and {p11 , p12 , ..., p1d }.
• The estimates are p̂ = (1/n) Σi yi, and p̂j^y = (Σ_{i=1}^n 1(fij = 1, yi = y)) / (Σ_{i=1}^n 1(yi = y)), where fij is the jth feature of xi.

• Given x_test ∈ {0, 1}^d, predict ŷ_test = 1 when P(ŷ_test = 1 | x_test) ≥ P(ŷ_test = 0 | x_test).
• We can express P(ŷ_test = t | x_test) = P(x_test | ŷ_test = t) P(ŷ_test = t) / P(x_test); since we are only comparing the two classes, there is no need to calculate P(x_test).

• One prominent issue with Naive Bayes is that if a feature value is never observed in the training set but is present in the test set, the prediction probabilities for both classes become zero.
Laplace smoothing: a popular remedy for this issue is to introduce two "pseudo" data points with labels 1 and 0, respectively, into the dataset, where all their features are set to 1.
• The decision function of Naive Bayes is linear, and the boundary is given by
{x : P(y = 0 | x) = P(y = 1 | x)} (a sketch of the estimates and prediction rule follows this item).
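
A minimal sketch of Naive Bayes for binary features in Python/NumPy, with the two Laplace-smoothing pseudo points described above (names are illustrative):

import numpy as np

def train_naive_bayes(X, y):
    # Laplace smoothing: add two pseudo points (all features 1) with labels 1 and 0.
    X = np.vstack([X, np.ones(X.shape[1]), np.ones(X.shape[1])])
    y = np.append(y, [1, 0])
    p = y.mean()                                   # P(y = 1)
    p1 = X[y == 1].mean(axis=0)                    # p_j^1 = P(f_j = 1 | y = 1)
    p0 = X[y == 0].mean(axis=0)                    # p_j^0 = P(f_j = 1 | y = 0)
    return p, p0, p1

def predict_naive_bayes(x_test, p, p0, p1):
    # Compare P(x | y = 1) P(y = 1) with P(x | y = 0) P(y = 0); P(x) cancels.
    like1 = np.prod(p1 ** x_test * (1 - p1) ** (1 - x_test)) * p
    like0 = np.prod(p0 ** x_test * (1 - p0) ** (1 - x_test)) * (1 - p)
    return int(like1 >= like0)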
4. Gaussian Naive Bayes
• Assumes that features in the dataset follow a normal distribution and computes the likelihood of a
class for a given set of feature values by estimating the mean and variance of the feature values within
each class.
• P (x|y = 0) = N (µ0 , Σ) and P (x|y = 1) = N (µ1 , Σ)
• µ̂t = (Σi 1(yi = t) xi) / (Σi 1(yi = t)),  Σ̂ = (1/n) Σi (xi − µ̂_yi)(xi − µ̂_yi)^T.
• If the covariance matrices are equal then the decision boundary is linear, if they are unequal then the
decision boundary is quadratic.
• For unequal covariance matrices: Σ̂t = (Σi 1(yi = t)(xi − µ̂t)(xi − µ̂t)^T) / (Σi 1(yi = t)).

9 Week 9
1. Perceptron Learning Algorithm
• Widely employed for binary classification; it focuses on modelling the boundary between the classes.
• Objective function: Σi 1(h(xi) ≠ yi)
• Until convergence: select a pair (xi, yi); if sign(w^T xi) ≠ yi, then update the weight vector
w^{t+1} = w^t + xi yi (a sketch follows this item).
• l²γ² ≤ ||w^{l+1}||² ≤ lR², where l is the number of mistakes made so far, γ is the margin, and R bounds ||xi||.
• Upper bound on the number of mistakes: #mistakes ≤ R²/γ²
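
A minimal sketch of the perceptron update above in Python/NumPy, for labels in {−1, +1} (the pass-based stopping rule and names are illustrative):

import numpy as np

def perceptron(X, y, max_passes=100):
    # X is (n, d); labels y are in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:      # misclassified (sign(0) counts as a mistake here)
                w = w + xi * yi            # update: w <- w + x_i y_i
                mistakes += 1
        if mistakes == 0:                  # converged: a full pass with no mistakes
            break
    return w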

2. Logistic Regression
• The objective is to estimate the probability that the dependent variable belongs to one of two possible values.
• Let z = w^T xi; we define P(y = 1 | x) = g(z) = 1/(1 + e^{−z}), where the function g(z) is called the sigmoid function.
• The objective is to maximize the log-likelihood, or minimize the negative log-likelihood:
Σi yi log(g(zi)) + (1 − yi) log(1 − g(zi))
• For gradient-based optimization, the gradient of the log-likelihood is Σi xi (yi − g(zi)) (a sketch follows this item).
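
A minimal sketch of logistic regression in Python/NumPy, taking steps in the direction of the log-likelihood gradient Σi xi(yi − g(zi)) above (step size and iteration count are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, eta=0.1, num_iters=1000):
    # X is (n, d); labels y are in {0, 1}.
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        g = sigmoid(X @ w)                 # P(y = 1 | x) for every point
        grad = X.T @ (y - g)               # gradient of the log-likelihood
        w = w + eta * grad                 # ascent step (maximize the log-likelihood)
    return w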

10 Week 10
1. Support Vector Machines
• A category of supervised learning algorithms designed for classification and regression analysis. SVMs aim to identify the optimal hyperplane that maximizes the margin between data points from different classes.
• Hard Margin SVMs: applicable only when the dataset is linearly separable
Direct or Kernelized Calculation of Q: compute the matrix Q = X^T X directly, or using a kernel, based on the dataset.
Gradient Descent: employ the gradient of the dual objective, α^T 1 − (1/2) α^T Y^T Q Y α, in a gradient descent algorithm to iteratively find a satisfactory set of Lagrange multipliers α.
label(x_test) = sign(w^T x_test) = sign(Σi αi yi (xi^T x_test))
label(x_test) = sign(w^T φ(x_test)) = sign(Σi αi yi k(xi, x_test))

• Soft Margin SVMs: extends the standard SVM algorithm to accommodate some misclassifications in
the training data. This extension is particularly useful when dealing with non-linearly separable data.
It introduces a regularization parameter (C) to control the balance between maximizing the margin
and allowing for misclassifications:
min_{w,ϵ} (1/2)||w||²_2 + C Σi ϵi such that (w^T xi) yi + ϵi ≥ 1, ϵi ≥ 0 (a sketch of the dual approach follows this item).
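
A minimal sketch of the dual approach above in Python/NumPy, using projected gradient ascent on α^T 1 − (1/2) α^T Y Q Y α with α clipped to [0, C]. This is a simplification: it ignores the equality constraint Σi αi yi = 0 that a full dual solver would enforce, and the step size is illustrative:

import numpy as np

def svm_dual_sketch(X, y, C=1.0, eta=0.001, num_iters=1000):
    # X is (n, d); labels y are in {-1, +1}; Q = X X^T (a kernel matrix could be used instead).
    n = X.shape[0]
    Q = X @ X.T
    YQY = (y[:, None] * Q) * y[None, :]            # Y Q Y with Y = diag(y)
    alpha = np.zeros(n)
    for _ in range(num_iters):
        grad = 1.0 - YQY @ alpha                   # gradient of alpha^T 1 - 0.5 alpha^T YQY alpha
        alpha = np.clip(alpha + eta * grad, 0.0, C)   # project onto the box [0, C]
    return alpha

def svm_predict(alpha, X, y, x_test):
    # label(x_test) = sign(sum_i alpha_i y_i (x_i^T x_test))
    return np.sign(np.sum(alpha * y * (X @ x_test)))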

11 Week 11
1. Bagging
• Simply distribute the dataset into m smaller datasets, then build m different models. For prediction, predict using each of the models, average the predictions, and then apply the same decision function (e.g. sign) to the averaged prediction.
• Can also use feature bagging.
2. Boosting
• Input: S = {(x1 , y1 ), ..., (xn , yn )}
• Initialize D0(i) = 1/n
• For t = 1 to T:
ht = the hypothesis returned by a weak learner given S
D̃_{t+1}(i) = Dt(i) e^{αt} if ht(xi) ≠ yi, else Dt(i) e^{−αt}
D_{t+1}(i) = D̃_{t+1}(i) / Σj D̃_{t+1}(j)
• αt = log(√((1 − error(ht)) / error(ht)))
• h*(x) = sign(Σt αt ht(x)) (a sketch follows this item)
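
A minimal sketch of this boosting loop in Python/NumPy. The weak_learner argument is a placeholder (an assumption, not specified in the notes) for any routine that takes the weighted sample and returns a hypothesis h(x) in {−1, +1}:

import numpy as np

def boost(X, y, weak_learner, T=10):
    # X is (n, d); labels y are in {-1, +1}; weak_learner(X, y, D) -> callable h.
    n = X.shape[0]
    D = np.full(n, 1.0 / n)                 # D_0(i) = 1/n
    hypotheses, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)           # fit a weak learner to the weighted data
        preds = np.array([h(x) for x in X])
        error = np.sum(D[preds != y])       # weighted error; assumes 0 < error < 1
        alpha = np.log(np.sqrt((1 - error) / error))
        # Up-weight mistakes, down-weight correct predictions, then renormalize.
        D = D * np.where(preds != y, np.exp(alpha), np.exp(-alpha))
        D = D / D.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    def h_star(x):
        return np.sign(sum(a * h(x) for a, h in zip(alphas, hypotheses)))
    return h_star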

12 Week 12
1. Activation Functions
• Sigmoid function: 1/(1 + e^{−z})
• Rectified Linear Unit (ReLU): max(0, z)
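
A minimal illustration of the two activation functions in Python/NumPy:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes z into (0, 1)

def relu(z):
    return np.maximum(0, z)           # max(0, z), elementwise

print(sigmoid(0.0), relu(-3.0), relu(2.5))   # 0.5 0.0 2.5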
