Machine Learning Techniques
1 Week 1
1. Paradigms of Machine Learning
2 Week 2
1. Concerns in PCA
• Time Complexity: Finding the eigenvalues and eigenvectors takes about O(d^3), which is an issue when d is large.
• PCA finds linear combinations of the features; as such, non-linear relationships are not captured well.
2. Time complexity Issue
• Compute K = X^T X, compute its eigendecomposition, convert the eigenvectors according to the constraint, and finally obtain w_k = Xα_k.
3. Feature Transformation: Increase the dimensions so that non-linear relationships are captured, then apply PCA.
4. Kernel function
• To convert from quadratic to linear, we map the features to ϕ(x) = [1, f_1^2, f_2^2, f_1 f_2, f_1, f_2]
• Any function k : R^d × R^d → R that corresponds to a valid feature map is called a kernel function.
• A function k is a valid kernel function if there exists a map ϕ : R^d → R^D such that k(x_1, x_2) = ϕ(x_1)^T ϕ(x_2) for all x_1, x_2 ∈ R^d
• A kernel matrix is symmetric and positive semi-definite, i.e. all eigenvalues of K are non-negative.
• Polynomial Kernel: k(x, x′) = (x^T x′ + 1)^2
• Radial Basis Function kernel or Gaussian Kernel: k(x, x′) = exp(−||x − x′||^2 / (2σ^2))
5. Kernel PCA
• Compute Kernel matrix using a kernel function
• Center the kernel matrix using the formula: K^C = K − 1_n K − K 1_n + 1_n K 1_n = (I − 1_n) K (I − 1_n), where 1_n is the n × n matrix with all elements equal to 1/n
• Compute the eigenvalues {nλ_1, nλ_2, ...} and eigenvectors {β_1, β_2, ...} of the centered kernel matrix K^C and normalize the eigenvectors: α_u = β_u / √(nλ_u)
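A compact Kernel PCA sketch following the steps above; the kernel is passed in as a function of two vectors (e.g. the RBF kernel sketched earlier), and the small eigenvalue floor is an added numerical safeguard:

```python
import numpy as np

def kernel_pca(X, kernel, num_components=2):
    """X: (n, d) data matrix; kernel: function of two vectors; returns (n, num_components) projections."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    one_n = np.full((n, n), 1.0 / n)
    Kc = (np.eye(n) - one_n) @ K @ (np.eye(n) - one_n)        # centered kernel matrix K^C
    eigvals, eigvecs = np.linalg.eigh(Kc)                     # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:num_components]          # keep the top components
    # alpha_u = beta_u / sqrt(n * lambda_u); the eigenvalues of K^C are n*lambda_u
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas                                        # projections of the training points

# usage (illustrative): Z = kernel_pca(X, lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0), 2)
```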
3 Week 3
1. Intro to Clustering
• Goal: Partition the given data into k different clusters.
• Data points: {x1 , x2 , ...}
• Cluster Indicator: {z1 , z2 , ...}
• Performance Metric: F(z_1, ..., z_n) = Σ_{i=1}^n ||x_i − µ_{z_i}||_2^2, where µ_{z_i} is the mean/average of cluster z_i.
• Goal is to minimize the performance metric.
2. K-means Clustering
• Also known as Lloyd’s Algorithm
• Step 1, Initialization: We define some initial assignment to clusters z_1^0, z_2^0, ..., z_n^0
• Then, until convergence, we alternate the following two steps (see the sketch after this item):
• Step 2, Compute means: µ_k^t = Σ_i x_i 1(z_i^t = k) / Σ_i 1(z_i^t = k)
• Step 3, Reassignment: z_i^{t+1} = arg min_k ||x_i − µ_k^t||_2^2
• K-means cannot efficiently cluster data points that are not linearly separable; Kernel K-means or Spectral Clustering can be used for such data.
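A minimal NumPy sketch of Lloyd's algorithm as described in this item (random initial assignment; `k`, the iteration cap, and the empty-cluster guard are illustrative choices):

```python
import numpy as np

def kmeans(X, k, num_iters=100, seed=0):
    """X: (n, d) data; returns cluster assignments z and means mu."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, k, size=X.shape[0])          # Step 1: random initial assignment
    for _ in range(num_iters):
        # Step 2: compute the mean of each cluster (re-seed empty clusters with a random point)
        mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else X[rng.integers(X.shape[0])]
                       for j in range(k)])
        # Step 3: reassign each point to its nearest mean
        z_new = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
        if np.array_equal(z, z_new):                 # converged: assignments stopped changing
            break
        z = z_new
    return z, mu
```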
5. Initialization of centroids
• Pick the K means uniformly at random from the dataset.
• Means should be far apart.
• K-means++: Choose the first mean µ_1^0 uniformly at random from the dataset. For l = 2, 3, ..., K, choose µ_l^0 from the data points with probability proportional to its score.
• Score: S(x) = min_j ||x − µ_j^0||^2, the squared distance from x to the closest mean chosen so far.
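A short sketch of the K-means++ seeding rule above (the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Return k initial means chosen by the K-means++ rule."""
    rng = np.random.default_rng(seed)
    means = [X[rng.integers(X.shape[0])]]                     # first mean: uniform at random
    for _ in range(1, k):
        # score S(x): squared distance to the closest mean chosen so far
        d2 = np.min(((X[:, None, :] - np.array(means)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                                 # sample proportionally to the score
        means.append(X[rng.choice(X.shape[0], p=probs)])
    return np.array(means)
```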
6. Choice of K
• We want K to be neither too small nor too large, so we penalize large values of K.
• The chosen value of K is where Objective function + Penalty function is smallest.
• Akaike Information Criterion: 2K − 2 log(L(θ∗ ))
• Bayesian Information Criterion: K log(n) − 2 log(L(θ∗ ))
4 Week 4
1. Introduction to Estimation
• Estimation: There is some probabilistic mechanism that generates the data, about which we don't know "something".
• Goal: Observe data and ”Assume” a probabilistic model that generates the data.
• Assumption: Observations are Independent and Identically Distributed
2. Maximum Likelihood Estimation
• Fisher’s Principle of Maximum Likelihood: Write the likelihood function
L(p; {x_1, x_2, ..., x_n}) = P(x_1, x_2, ..., x_n; p)
= P(x_1; p) · P(x_2; p) ··· P(x_n; p) = Π_{i=1}^n p^{x_i} (1 − p)^{1−x_i}   [Independence]
• Estimator: p̂_ML = arg max_p Π_{i=1}^n p^{x_i} (1 − p)^{1−x_i}
We take the logarithm to simplify [log is monotonically increasing]:
= arg max_p Σ_{i=1}^n x_i log(p) + (1 − x_i) log(1 − p)
Taking the derivative and setting it to 0 we get
p̂_ML = (Σ_{i=1}^n x_i) / n
• Fisher's Proposal (continuous case): L(µ, σ^2; {x_1, x_2, ..., x_n}) = f_{x_1,...,x_n}(x_1, ..., x_n; µ, σ^2) = Π_i f_{x_i}(x_i; µ, σ^2)
Densities are used because for continuous distributions, probability is only defined over intervals; the probability of any single point is 0.
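A tiny numerical illustration of the ML estimates above, with made-up data; the Gaussian estimates (sample mean and mean squared deviation) are the standard results of maximizing Fisher's likelihood, though the notes do not write them out:

```python
import numpy as np

x_bern = np.array([1, 0, 1, 1, 0, 1])            # coin-flip style observations
p_ml = x_bern.mean()                             # p_hat_ML = (sum x_i) / n

x_gauss = np.array([2.1, 1.9, 2.4, 2.0, 1.8])    # continuous observations
mu_ml = x_gauss.mean()                           # mu_hat_ML = sample mean
var_ml = ((x_gauss - mu_ml) ** 2).mean()         # sigma^2_hat_ML = mean squared deviation

print(p_ml, mu_ml, var_ml)
```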
3. Bayesian Estimation
• Goal: Incorporate ”Hunch” about parameters into the estimation procedure.
• Approach: Think of the parameter to estimate as a ”random” variable.
• Hunch: codified using a probabilistic distribution over θ
• After looking at data, move to
Updated Hunch: Codified using another probabilistic distribution.
• P(θ | {x_1, x_2, ..., x_n}) = P({x_1, x_2, ..., x_n} | θ) P(θ) / P({x_1, x_2, ..., x_n})
• Beta prior: f(p; α, β) = p^{α−1} (1 − p)^{β−1} / Z
• Beta posterior: Beta(α + n_h, β + n_t), where n_h and n_t are the numbers of heads and tails observed
• One possible guess (the posterior mean): (α + n_h) / (α + β + n)
4. Gaussian Mixture Model
• Generate z_i with probability π_{z_i}, then generate x_i ∼ N(µ_{z_i}, σ_{z_i}^2)
• Observed: {x_1, x_2, ..., x_n}
Unobserved: {z_1, z_2, ..., z_n}
Parameters: π = [π_1, π_2, ..., π_K], and for each k, (µ_k, σ_k^2)
5. Likelihood of GMM
• L(Parameters; {x_1, x_2, ..., x_n}) = Π_{i=1}^n f(x_i; Parameters)
= Π_{i=1}^n [ Σ_{k=1}^K π_k f(x_i; µ_k, σ_k^2) ]
= Π_{i=1}^n [ Σ_{k=1}^K π_k · exp(−(x_i − µ_k)^2 / (2σ_k^2)) / (√(2π) σ_k) ]
• log L(Parameters) = Σ_{i=1}^n log( Σ_{k=1}^K π_k · exp(−(x_i − µ_k)^2 / (2σ_k^2)) / (√(2π) σ_k) )
• Jensen's inequality (for a convex function f): f(λ_1 a_1 + ... + λ_K a_K) ≤ λ_1 f(a_1) + ... + λ_K f(a_K),
i.e. f(Σ_k λ_k a_k) ≤ Σ_k λ_k f(a_k) whenever Σ_k λ_k = 1, λ_k ≥ 0
• Concave Functions: f((a + b)/2) ≥ (f(a) + f(b))/2 for all a, b
• Jensen's inequality (for a concave function f): f(λ_1 a_1 + ... + λ_K a_K) ≥ λ_1 f(a_1) + ... + λ_K f(a_K),
i.e. f(Σ_k λ_k a_k) ≥ Σ_k λ_k f(a_k), Σ_k λ_k = 1
• Linear Functions are both Concave and Convex.
• Log is a concave function
7. Estimating the parameters
• log L(Parameters) = Σ_{i=1}^n log( Σ_{k=1}^K λ_k^i · [ π_k exp(−(x_i − µ_k)^2 / (2σ_k^2)) / (λ_k^i √(2π) σ_k) ] )
Apply Jensen's inequality, fix λ, then take the derivative and set it to 0:
µ̂_k = Σ_{i=1}^n λ_k^i x_i / Σ_{i=1}^n λ_k^i,   σ̂_k^2 = Σ_{i=1}^n λ_k^i (x_i − µ̂_k)^2 / Σ_{i=1}^n λ_k^i,   π̂_k = (Σ_{i=1}^n λ_k^i) / n
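A minimal 1-D sketch of these updates. The weights λ_k^i are computed as posterior responsibilities (the usual EM choice, which these notes do not spell out); the initialization and iteration count are illustrative:

```python
import numpy as np

def gmm_em_1d(x, K, num_iters=50, seed=0):
    """x: (n,) 1-D data; returns (pi, mu, sigma2) after EM-style updates."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)        # initialize means at random data points
    sigma2 = np.full(K, x.var())
    for _ in range(num_iters):
        # lambda_k^i proportional to pi_k * N(x_i; mu_k, sigma_k^2)
        dens = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2[None, :])) \
               / np.sqrt(2 * np.pi * sigma2[None, :])
        lam = pi[None, :] * dens
        lam /= lam.sum(axis=1, keepdims=True)
        # the update equations from the notes
        Nk = lam.sum(axis=0)
        mu = (lam * x[:, None]).sum(axis=0) / Nk
        sigma2 = (lam * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
        pi = Nk / n
    return pi, mu, sigma2
```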
5 Week 5
1. Supervised Learning
• Input is features/attributes {x1 , ...xn } and labels {y1 , ..., yn }.
• If labels take only two values, the problem is called a binary classification problem. If labels take more than two discrete values, the problem is called multi-class classification. If labels can take any real value, the problem is called a regression problem.
2. Linear Regression
• Goal is to learn a function f which maps a given feature to its correct label.
• Error(f) = Σ_i (f(x_i) − y_i)^2
• Stacking the data points as the rows of a matrix X^T (so the columns of X are x_1, ..., x_n), we want X^T w ≈ y, i.e. we minimize ||X^T w − y||^2 = (X^T w − y)^T (X^T w − y).
• Simply take the gradient and set it to 0; we get (XX^T) w* = Xy.
• Since w* is the solution of an unconstrained optimization problem, we can apply gradient descent:
w^{t+1} = w^t − η^t ∇f(w^t), where η^t is the step size, i.e.
w^{t+1} = w^t − η^t · 2(XX^T w^t − Xy)
• In case the number of data points is large, we use
Stochastic Gradient Descent: for T iterations, sample a small batch of k data points uniformly at random from the set of all points, pretend this sample is the entire dataset, and take a gradient step with respect to it. After all rounds, we use w_SGD = (1/T) Σ_{t=1}^T w^t.
• Kernel Regression is similar to Kernel PCA: we take w* = Xα* and K = X^T X, with α* = K^{-1} y.
To make a prediction: w*^T ϕ(x_test) = Σ_{i=1}^n α_i* k(x_i, x_test)
• Probabilistic Regression: assume labels are generated as y_i = w^T x_i + ϵ, where ϵ is noise drawn from a Gaussian distribution. Using maximum likelihood we arrive at ŵ_ML = arg min_w Σ_i (w^T x_i − y_i)^2.
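A small sketch contrasting the closed-form solution with gradient descent on the least-squares objective above (the synthetic data, step size, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
A = rng.normal(size=(n, d))                      # rows are data points, i.e. A plays the role of X^T
w_true = np.array([1.0, -2.0, 0.5])
y = A @ w_true + 0.1 * rng.normal(size=n)        # labels with Gaussian noise

# Closed form: (X X^T) w* = X y  <=>  w* = (A^T A)^{-1} A^T y
w_closed = np.linalg.solve(A.T @ A, A.T @ y)

# Gradient descent on ||X^T w - y||^2 with gradient 2(X X^T w - X y)
w = np.zeros(d)
eta = 0.001
for _ in range(2000):
    w -= eta * 2 * (A.T @ (A @ w) - A.T @ y)

print(w_closed, w)
```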
6 Week 6
1. Goodness of Maximum Likelihood Estimator
• To understand how good ŵ_ML is at estimating w: E[||ŵ_ML − w||^2] = σ^2 · trace((XX^T)^{-1})
• trace((XX^T)^{-1}) = Σ_i 1/λ_i, where λ_i are the eigenvalues of XX^T
7 Week 7
1. Binary Classification
• Labels belong to the set {0, 1} or the set {−1, 1}
• Loss(h) = (1/n) Σ_i 1(h(x_i) ≠ y_i)
• h(x) = sign(wT x)
2. K Nearest Neighbours
• Given a test point xtest , find the closest point x′ to xtest in the training set. Predict ytest = y ′ .
• Can be affected by outliers; ask more neighbours (k of them) and predict the majority label.
• Problems: Choosing a distance function, Prediction is computationally expensive, No model is learnt.
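A minimal k-nearest-neighbours sketch of the procedure above, assuming squared Euclidean distance and a majority vote over k neighbours:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the majority label among the k training points closest to x_test."""
    dists = np.sum((X_train - x_test) ** 2, axis=1)     # squared Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority vote
```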
3. Decision Trees
• A question is a (feature, value) pair. Is feature ≤ value ?
• Need a measure of ”Impurity” for a set of labels to determine how good a question is.
• Entropy function = −(p log(p) + (1 − p) log(1 − p)), where p is the fraction of 1's, with the convention 0 log(0) = 0.
• Information Gain(feature, value) = Entropy(D) − [γ Entropy(D_yes) + (1 − γ) Entropy(D_no)], where γ = |D_yes| / |D|
• Discretize each feature in its [min, max] range. Pick the question that has the largest Information Gain. Repeat the procedure for D_yes and D_no.
• Can stop growing a tree if a node becomes ”Sufficiently” Pure. Depth of the tree is a hyperparameter.
There are alternate measures for ”goodness” of a question.
• Gini Index function is another popular function to measure impurity.
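A short sketch of the entropy and information-gain computations above for binary labels (function names are illustrative; base-2 logs are used, though any base works for comparing questions):

```python
import numpy as np

def entropy(y):
    """Binary-label entropy with the convention 0*log(0) = 0."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(X, y, feature, value):
    """Gain of the question 'Is X[:, feature] <= value?' on the labelled dataset (X, y)."""
    yes = X[:, feature] <= value
    gamma = yes.mean()                                   # |D_yes| / |D|
    return entropy(y) - (gamma * entropy(y[yes]) + (1 - gamma) * entropy(y[~yes]))
```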
4. Types of Modelling
• Generative Model: P (x, y)
• Discriminative Model: P (y|x)
8 Week 8
1. Generative Model based Algorithm
• Data: {(x_1, y_1), ..., (x_n, y_n)}, where x ∈ {0, 1}^d and y ∈ {0, 1}.
• Step 1: Decide the labels by tossing a coin with P (yi = 1) = p
• Step 2: Determine the features using the labels obtained in Step 1 through the conditional probability
P (xi |yi ).
• The parameters in generative modelling are defined as p̂ to decide the label, 2^d − 1 parameters for P(x|y = 1) and 2^d − 1 parameters for P(x|y = 0), where d is the number of features.
• Too many parameters, could lead to overfitting and the model may not be practically viable.
2. Alternate Generative Model
• Class conditional independence: This assumption states that the features of an object are conditionally
independent given its class label.
• Step 1 remains the same.
• Step 2: Determine the features for x given y using the following conditional probability:
P(x = [f_1, f_2, ..., f_d] | y) = Π_{i=1}^d (p_i^y)^{f_i} (1 − p_i^y)^{1−f_i}
• The parameters in generative modelling are defined as p̂ to decide the label, d parameters for P(x|y = 1) and d parameters for P(x|y = 0), where d is the number of features.
• Parameters are estimated using Maximum Likelihood Estimator.
3. Naive Bayes Algorithm
• The model is given by: P(x = [f_1, f_2, ..., f_d] | y) = Π_{i=1}^d (p_i^y)^{f_i} (1 − p_i^y)^{1−f_i}
• The parameters estimated are p, {p01 , p02 , ..., p0d }, and {p11 , p12 , ..., p1d }.
• The estimates are p̂ = (1/n) Σ_{i=1}^n y_i, and p̂_j^y = Σ_{i=1}^n 1(f_j^i = 1, y_i = y) / Σ_{i=1}^n 1(y_i = y)
• Given x_test ∈ {0, 1}^d, the prediction ŷ_test = 1 is made when P(ŷ_test = 1 | x_test) ≥ P(ŷ_test = 0 | x_test), and ŷ_test = 0 otherwise.
• We can express P(ŷ_test = t | x_test) = P(x_test | ŷ_test = t) P(ŷ_test = t) / P(x_test); since we are only comparing the two classes, there is no need to calculate P(x_test).
• One prominent issue with Naive Bayes is that if a feature value is never observed in the training set but is present in the test set, the prediction probabilities for both classes become zero.
Laplace smoothing: a popular remedy for this issue is to introduce two “pseudo” data points with labels 1 and 0, respectively, into the dataset, where all their features are set to 1.
• The decision function of Naive Bayes is linear, and the boundary is given by {x : P(y = 0|x) = P(y = 1|x)}
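A compact Bernoulli Naive Bayes sketch implementing the estimates and decision rule above; for simplicity it uses add-one smoothing of the counts rather than the two pseudo data points described in the notes, and it assumes both classes appear in the training data:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """X: (n, d) binary features, y: (n,) binary labels. Returns (p, p0, p1)."""
    p = y.mean()                                          # estimate of P(y = 1)
    # p_j^t = P(f_j = 1 | y = t), with add-one (Laplace-style) smoothing of the counts
    p1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    p0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return p, p0, p1

def predict(x, p, p0, p1):
    """Compare log P(x | y = t) + log P(y = t) for t = 0, 1."""
    log1 = np.log(p) + np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1))
    log0 = np.log(1 - p) + np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))
    return int(log1 >= log0)
```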
4. Gaussian Naive Bayes
• Assumes that features in the dataset follow a normal distribution and computes the likelihood of a
class for a given set of feature values by estimating the mean and variance of the feature values within
each class.
• P (x|y = 0) = N (µ0 , Σ) and P (x|y = 1) = N (µ1 , Σ)
• µ̂_t = Σ_i 1(y_i = t) x_i / Σ_i 1(y_i = t),   Σ̂ = (1/n) Σ_i (x_i − µ̂_{y_i})(x_i − µ̂_{y_i})^T
• If the covariance matrices are equal then the decision boundary is linear, if they are unequal then the
decision boundary is quadratic.
• For unequal covariance matrices: Σ̂_t = Σ_i 1(y_i = t)(x_i − µ̂_t)(x_i − µ̂_t)^T / Σ_i 1(y_i = t)
9 Week 9
1. Perceptron Learning Algorithm
• Widely employed for binary classification; it focuses on modelling the boundary between the classes.
• Objective function: Σ_i 1(h(x_i) ≠ y_i)
• Until convergence, select a pair (xi , yi ), if, sign(wT xi ) ̸= yi then update the weight vector
wt+1 = wt + xi yi .
• After l updates (mistakes), l^2 γ^2 ≤ ||w^{l+1}||^2 ≤ l R^2, where γ is the margin of the best separator and R = max_i ||x_i||.
• Upper bound on the number of mistakes: #mistakes ≤ R^2 / γ^2
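A minimal perceptron sketch following the update rule above (labels in {−1, +1}; the pass limit is an illustrative safeguard for non-separable data):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """X: (n, d) features, y: (n,) labels in {-1, +1}. Returns a weight vector w."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:        # misclassified (sign(0) counts as a mistake here)
                w += yi * xi                  # w^{t+1} = w^t + x_i y_i
                mistakes += 1
        if mistakes == 0:                     # converged: one full pass with no mistakes
            break
    return w
```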
2. Logistic Regression
• The objective is to estimate the probability that the dependent variable belongs to one of two possible values.
• Let z = w^T x_i. We define P(y_i = 1 | x_i) = g(z) = 1 / (1 + e^{−z}), where the function g(z) is called the sigmoid function.
• The objective is to maximize the log-likelihood Σ_i [y_i log(g(z_i)) + (1 − y_i) log(1 − g(z_i))], or equivalently minimize the negative log-likelihood.
• The per-example gradient of the log-likelihood is x_i (y_i − g(z_i)); gradient descent is applied to its negative.
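A brief gradient-descent sketch for logistic regression using the gradient above (labels in {0, 1}; the step size and iteration count are illustrative, and the gradient is averaged over examples for a stable step size):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, eta=0.1, num_iters=1000):
    """X: (n, d) features, y: (n,) labels in {0, 1}. Gradient descent on the negative log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        g = sigmoid(X @ w)                        # g(z_i) for every example
        w += eta * X.T @ (y - g) / len(y)         # ascend the log-likelihood (descend its negative)
    return w
```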
10 Week 10
1. Support Vector Machines
• A category of supervised learning algorithms designed for classification and regression analysis. SVMs aim to identify the optimal hyperplane that maximizes the margin between data points from different classes.
• Hard Margin SVMs: applicable only when the dataset is linearly separable
Direct or Kernelized Calculation of Q: Compute the matrix Q = X^T X directly or using a kernel, based on the dataset.
Gradient Descent: Employ the gradient of the dual objective, α^T 1 − (1/2) α^T Y^T Q Y α, in a gradient descent algorithm to iteratively find a satisfactory set of Lagrange multipliers α.
label(x_test) = sign(w^T x_test) = sign(Σ_i α_i y_i (x_i^T x_test))
label(x_test) = sign(w^T ϕ(x_test)) = sign(Σ_i α_i y_i k(x_i, x_test))
• Soft Margin SVMs: extends the standard SVM algorithm to accommodate some misclassifications in
the training data. This extension is particularly useful when dealing with non-linearly separable data.
It introduces a regularization parameter (C) to control the balance between maximizing the margin
and allowing for misclassifications.
min (1/2)||w||_2^2 + C Σ_i ϵ_i   such that   (w^T x_i) y_i + ϵ_i ≥ 1, ϵ_i ≥ 0
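A rough sketch of the dual-gradient approach described above: projected gradient ascent on α^T 1 − (1/2) α^T Y Q Y α with the box constraint 0 ≤ α_i ≤ C (soft margin; a very large C approximates the hard-margin case). The bias term and the constraint Σ_i α_i y_i = 0 are omitted to match the simplified dual written here; the kernel, step size, and iteration count are illustrative:

```python
import numpy as np

def svm_dual_train(X, y, kernel, C=1.0, eta=0.001, num_iters=1000):
    """X: (n, d), y: (n,) labels in {-1, +1}. Projected gradient ascent on the SVM dual."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    YQY = (y[:, None] * y[None, :]) * K              # (Y Q Y)_{ij} = y_i y_j k(x_i, x_j)
    alpha = np.zeros(n)
    for _ in range(num_iters):
        grad = 1.0 - YQY @ alpha                     # gradient of alpha^T 1 - (1/2) alpha^T YQY alpha
        alpha = np.clip(alpha + eta * grad, 0.0, C)  # project back onto the box [0, C]
    return alpha

def svm_predict(alpha, X, y, kernel, x_test):
    # label(x_test) = sign(sum_i alpha_i y_i k(x_i, x_test))
    return np.sign(sum(a * yi * kernel(xi, x_test) for a, yi, xi in zip(alpha, y, X)))
```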
11 Week 11
1. Bagging
• Simply distribute the dataset into m smaller datasets, then build m different models. For prediction, predict with each of the models, average out the predictions, and then apply the same final decision function (e.g. sign or majority vote) to the averaged prediction.
• Can also use feature bagging.
2. Boosting
• Input: S = {(x1 , y1 ), ..., (xn , yn )}
• Initialize D_0(i) = 1/n
• For t = 1 to T:
  h_t = output of the weak learner given S weighted by D_t
  D̃_{t+1}(i) = D_t(i) e^{α_t} if h_t(x_i) ≠ y_i, else D_t(i) e^{−α_t}
  D_{t+1}(i) = D̃_{t+1}(i) / Σ_j D̃_{t+1}(j)
• α_t = log( √((1 − error(h_t)) / error(h_t)) )
12 Week 12
1. Activation Functions
• Sigmoid function: 1 / (1 + e^{−z})
• Rectified Linear Unit (ReLU): max(0, z)
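A one-line NumPy version of each activation above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes z into (0, 1)

def relu(z):
    return np.maximum(0.0, z)         # max(0, z), applied elementwise
```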