Unit-2 Machine Learning
Maximum likelihood estimation - Least squares, Robust linear regression, ridge
regression, Bayesian linear regression, Linear models for classification:
Discriminant function, Probabilistic generative models, Probabilistic
discriminative models, Laplace approximation, Bayesian logistic regression,
Kernel functions, using kernels in GLM, Kernel trick, SVMs.
Maximum Likelihood Estimation
• Maximum Likelihood Estimation (MLE) is a statistical method used to estimate the
parameters of a probability distribution that best describe a given dataset.
• To analyze the given data, we first need to identify the distribution from which the data
were generated.
• Next, we use the data to estimate the parameters of that distribution. A parameter is a numerical
characteristic of a distribution.
• Example distributions and their parameters:
• Normal distribution - mean (µ) & variance (σ²)
• Binomial distribution - number of trials (n) & probability of success (p)
• Gamma distribution - shape (k) & scale (θ)
• Exponential distribution - rate (λ), the inverse of the mean.
• These parameters are vital for understanding the size, shape, spread, and other properties of a
distribution.
• Since the data that we have is mostly randomly generated, we often don’t know the true values
of the parameters characterizing our distribution.
Maximum Likelihood Estimation
• An estimator is a function of the data that gives approximate values of the parameters.
• Ex: the sample-mean estimator - a simple and frequently used estimator.
• Since the numerical characteristics of the distribution vary with the value of the parameter, it is
not easy to estimate the parameter θ of the distribution directly.
• Maximum likelihood estimation is a process of estimation that gives an entire class of
estimators called maximum likelihood estimators, or MLEs.
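A compact way to state this, assuming the usual i.i.d. setup, is that the likelihood of a dataset
D = {x1, . . . , xn} under parameter θ is L(θ; D) = ∏ p(xi | θ), and the maximum likelihood estimator is

θ̂_MLE = argmax_θ L(θ; D) = argmax_θ Σ log p(xi | θ),

where maximizing the log-likelihood is equivalent because the logarithm is monotonic.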
MLE – for Linear Regression Model
In the context of linear regression, MLE is used to estimate the coefficients (parameters) of the regression model.
Likelihood Function
Maximizing Log-Likelihood
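A sketch of the standard formulation (assuming Gaussian noise, as is usual in this derivation): the
model is y = wᵀx + ε with ε ~ N(0, σ²), so the likelihood of the data is

L(w, σ²) = ∏_{i=1..n} N(yi | wᵀxi, σ²),

and the log-likelihood is

ln L(w, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1..n} (yi − wᵀxi)².

Maximizing this log-likelihood with respect to w is equivalent to minimizing the sum of squared
errors, so the MLE of the coefficients coincides with the least-squares solution.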
Example - Maximum likelihood estimation
Let's work through an example of using Maximum Likelihood Estimation (MLE) in linear regression.
• Example : We have a dataset with the following observations:
X Y
1 2
2 3
3 5
4 4
2 6
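A short sketch of how the MLE (least-squares) fit could be computed for this small dataset; the
closed-form solution below follows the log-likelihood above, and the use of NumPy is an assumption
(any numerical library would do).

import numpy as np

# Observed data from the example table
X = np.array([1.0, 2.0, 3.0, 4.0, 2.0])
Y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

# Design matrix with an intercept column: [1, x]
A = np.column_stack([np.ones_like(X), X])

# MLE of the coefficients = least-squares solution of the normal equations
w_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)

# MLE of the noise variance = mean squared residual
residuals = Y - A @ w_hat
sigma2_hat = np.mean(residuals ** 2)

print("intercept, slope:", w_hat)
print("noise variance estimate:", sigma2_hat)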
Robust Linear Regression
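A common formulation of robust linear regression (this is the standard heavy-tailed-noise treatment;
the exact variant covered in the slides may differ) replaces the Gaussian noise model with a Laplace
likelihood,

p(y | x, w, b) = Lap(y | wᵀx, b) ∝ exp(−|y − wᵀx| / b).

Maximizing this likelihood minimizes the sum of absolute residuals Σ |yi − wᵀxi| instead of the sum of
squared residuals, which makes the fit far less sensitive to outliers than ordinary least squares.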
Discriminant Functions
• Multiple classes: with K > 2 classes we use K linear discriminant functions, one for each class,
and assign each training data point x to the class whose discriminant value is largest.
Discriminant Functions
Least squares for classification
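A brief sketch of this approach (standard treatment, assumed to match the slides): encode the targets
with a 1-of-K coding scheme, model each class with its own linear function yk(x) = wkᵀx + wk0, and
choose the weights to minimize the sum-of-squares error between the model outputs and the target
vectors. This has a closed-form solution, but it is sensitive to outliers and can give poor class
boundaries, which motivates the alternatives that follow.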
Discriminant Functions
Fisher’s linear discriminant
Fisher’s idea is to maximize a function that gives a large separation between the projected class means while
also giving a small variance within each class, thereby minimizing the class overlap.
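A sketch of the criterion this describes (standard notation; the slide’s own equation may use different
symbols): project x onto a direction w via y = wᵀx, and choose w to maximize

J(w) = (m2 − m1)² / (s1² + s2²),

where mk is the mean of the projected points from class k and sk² is the within-class variance of the
projected points. Equivalently, J(w) = (wᵀS_B w) / (wᵀS_W w) with between-class and within-class
covariance matrices S_B and S_W, and the maximum is attained for w ∝ S_W⁻¹(m2 − m1), where m1 and
m2 are the original class mean vectors.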
Discriminant Functions
Perceptron convergence theorem: if there exists an exact solution (in other words, if the training data set is linearly
separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of
steps.
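For reference, a sketch of the learning rule the theorem refers to (standard perceptron algorithm):
with targets tn ∈ {−1, +1} and y(x) = f(wᵀφ(x)), where f is the step function, each misclassified
pattern triggers the update

w^(τ+1) = w^(τ) + η φ(xn) tn,

and the theorem guarantees that, for linearly separable data, this loop terminates with every training
pattern correctly classified.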
Probabilistic Generative Models
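A brief sketch of the standard two-class formulation (the slides’ own derivation may differ in detail):
model the class-conditional densities p(x|Ck) and the priors p(Ck), then obtain the posterior by Bayes’
theorem,

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2)) = σ(a),
where a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ]

and σ is the logistic sigmoid. With Gaussian class-conditional densities sharing a common covariance
matrix, a becomes a linear function of x, giving p(C1|x) = σ(wᵀx + w0).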
Probabilistic Discriminative Models
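By contrast, a sketch of the discriminative approach (standard logistic regression; assumed to match
the slides): model the posterior directly as

p(C1|φ) = y(φ) = σ(wᵀφ),   p(C2|φ) = 1 − y(φ),

and fit w by maximum likelihood, i.e. by minimizing the cross-entropy error
E(w) = −Σ [ tn ln yn + (1 − tn) ln(1 − yn) ], whose gradient is ∇E(w) = Σ (yn − tn) φn.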
Laplace Approximation
The Laplace Approximation aims to find a Gaussian approximation to a probability
density defined over a set of continuous variables.
Consider first the case of a single continuous variable z, and suppose the distribution
p(z) is defined by
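The defining equation referred to here (following the standard presentation of the Laplace
approximation) is

p(z) = (1/Z) f(z),   where Z = ∫ f(z) dz

is the normalization coefficient, assumed unknown. The approximation centres a Gaussian on a mode z0
of p(z), found from f′(z0) = 0, and uses a second-order Taylor expansion of ln f(z) around that mode:

ln f(z) ≃ ln f(z0) − (1/2) A (z − z0)²,   where A = − d²/dz² ln f(z) |_{z = z0}.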
Laplace Approximation…
Gaussian approximation will only be well defined if its precision A > 0, in other words the stationary
point z0 must be a local maximum, so that the second derivative of f(z) at the point z0 is negative.
Laplace Approximation…
and ∇ is the gradient operator. Taking the exponential of both sides we obtain
The distribution q(z) is proportional to f(z) and the appropriate normalization coefficient can be
found by inspection, using the standard result for a normalized multivariate Gaussian, giving
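A sketch of the multivariate result these lines refer to (standard presentation): for an M-dimensional
variable z, expanding around a stationary point z0 (where ∇f(z0) = 0) gives

ln f(z) ≃ ln f(z0) − (1/2)(z − z0)ᵀ A (z − z0),   with A = − ∇∇ ln f(z) |_{z = z0},

so that f(z) ≃ f(z0) exp{ −(1/2)(z − z0)ᵀ A (z − z0) }, and the normalized Gaussian approximation is

q(z) = ( |A|^(1/2) / (2π)^(M/2) ) exp{ −(1/2)(z − z0)ᵀ A (z − z0) } = N(z | z0, A⁻¹),

where A, the Hessian of −ln f at the mode, must be positive definite for q(z) to be well defined.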
Laplace Approximation…
One major weakness of the Laplace approximation is that, since it is based on a Gaussian
distribution, it is only directly applicable to real variables.
In other cases it may be possible to apply the Laplace approximation to a transformation of the
variable. For instance, if 0 < τ < ∞ then we can consider a Laplace approximation of ln τ.
The most serious limitation of the Laplace framework, however, is that it is based purely on the
aspects of the true distribution at a specific value of the variable, and so can fail to capture important
global properties.
Bayesian Logistic Regression
• Exact Bayesian inference for logistic regression is intractable.
• In particular, evaluation of the posterior distribution would require
normalization of the product of a prior distribution and a likelihood
function that itself comprises a product of logistic sigmoid functions,
one for every data point.
• Evaluation of the predictive distribution is similarly intractable.
• It is natural to begin with a Gaussian prior, which we write in the
general form shown below.
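A sketch of the standard treatment (Laplace approximation to the posterior; symbols follow the usual
notation): the Gaussian prior is

p(w) = N(w | m0, S0).

The posterior p(w|t) ∝ p(w) p(t|w) is then approximated by a Gaussian q(w) = N(w | w_MAP, S_N) centred
at the MAP solution, with

S_N⁻¹ = S0⁻¹ + Σ_{n=1..N} yn (1 − yn) φn φnᵀ,   where yn = σ(w_MAPᵀ φn).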
Bayesian Logistic Regression…
Predictive distribution
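A sketch of how the predictive distribution is obtained in the standard treatment (probit
approximation; assumed to match the slides): the predictive probability of class C1 for a new feature
vector φ is found by marginalizing over the approximate Gaussian posterior,

p(C1 | φ, t) = ∫ σ(wᵀφ) q(w) dw ≈ ∫ σ(a) N(a | μ_a, σ_a²) da,

where a = wᵀφ, μ_a = w_MAPᵀ φ and σ_a² = φᵀ S_N φ. Approximating the sigmoid by a probit function
gives the closed-form result

p(C1 | φ, t) ≈ σ( κ(σ_a²) μ_a ),   with κ(σ²) = (1 + πσ²/8)^(−1/2).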
Kernel functions, using kernels in GLM, Kernel trick
Kernel Trick
• SVM algorithms use a set of mathematical functions that are defined as the kernel. The
function of a kernel is to take data as input and transform it into the required form.
• Firstly, a kernel takes the data from its original space and implicitly maps it to a
higher-dimensional space. This is crucial when dealing with data that is not linearly separable
in its original form.
• Instead of performing computationally expensive high-dimensional calculations, the kernel
function calculates the relationships or similarities between pairs of data points as if they were
in this higher-dimensional space.
Kernel Trick
• The Kernel Trick allows us to operate in the original feature space without computing the
coordinates of the data in a higher-dimensional space.
• Example: let x and y be two data points in 3 dimensions, and suppose we need to map x and y to
a 9-dimensional space. Computing the mapped features explicitly and then combining them to get
the final result, which is just a scalar, has computational complexity O(n²).
• However, if we use the kernel function, denoted k(x, y), instead of doing the complicated
computations in the 9-dimensional space, we reach the same result within the 3-dimensional space
by calculating the dot product of x-transpose and y. The computational complexity in this case
is O(n).
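A small sketch illustrating this example; the specific choice below (the squared dot-product
polynomial kernel and its explicit 9-dimensional feature map of pairwise products) is an assumption
made to keep the demonstration concrete.

import numpy as np

def phi(v):
    """Explicit 9-dimensional feature map for a 3-d vector v: all pairwise products v_i * v_j."""
    return np.array([v[i] * v[j] for i in range(3) for j in range(3)])

def poly_kernel(x, y):
    """Kernel k(x, y) = (x . y)^2, computed directly in the original 3-d space."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Both routes give the same scalar, but the kernel never builds the 9-d vectors.
explicit = np.dot(phi(x), phi(y))
via_kernel = poly_kernel(x, y)
print(explicit, via_kernel)   # identical values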
• This is useful if the original data is already high dimensional, and if the original features are
individually informative,
• e.g., a bag-of-words representation where the vocabulary size is large, or the expression level of
many genes.
• In such a case, the decision boundary is likely to be representable as a linear combination of the
original features, so it is not necessary to work in some other feature space.
Kernel Functions
• Mercer Kernel:
• Let X = {x1, . . . , xn} be a finite set of n samples from the input space. The Gram matrix of X is
defined as the matrix of all pairwise kernel evaluations, and a Mercer kernel is one for which this
matrix is positive semi-definite for every such set X.
• A standard example is the polynomial kernel, where r > 0 (see the sketch below).
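A sketch of the definitions referred to above (standard textbook form): the Gram matrix is

K = [ κ(xi, xj) ]_{i,j = 1..n},

and κ is a Mercer kernel if K is positive semi-definite for every finite set {x1, . . . , xn}. The
polynomial kernel

κ(x, x′) = (γ xᵀx′ + r)^M,   with r > 0 and γ > 0,

is a Mercer kernel; its corresponding feature vector contains all monomials of the inputs up to
degree M.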
• Sigmoid Kernel:
• An example of a kernel that is not a Mercer kernel is the so-called sigmoid kernel, defined by
• This function is related to a two-layer perceptron model of a neural network, since the tanh
function it uses is a common activation function for artificial neurons.
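The kernel referred to here has the standard form

κ(x, x′) = tanh(γ xᵀx′ + r),

which is not positive definite for all choices of γ and r, and hence is not a Mercer kernel.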
Kernel Functions
• RBF Kernel:
• The below function is the Gaussian or RBF kernel
• Let d₁₂ be the distance between the two points X₁ and X₂, we can now represent d₁₂ as follows:
• The width of the “region of similarity” around each point changes as σ changes.
• Finding the right σ for a given dataset is therefore important.
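A sketch of the formulas this slide refers to (standard Gaussian/RBF kernel):

κ(X1, X2) = exp( − ||X1 − X2||² / (2σ²) ).

Writing d12 = ||X1 − X2|| for the distance between the two points, this becomes
κ(X1, X2) = exp( − d12² / (2σ²) ), so the similarity decays from 1 (identical points) towards 0 as the
distance grows, at a rate controlled by the bandwidth σ.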
Kernel Functions
• String Kernels:
• If we’re interested in matching all substrings (for example) instead of representing an object
as a bag of words, we can use a string kernel:
• Let A denote an alphabet, e.g., {a, ..., z}, and let A* = A ∪ A² ∪ · · · ∪ Aᵐ, where m is the length of
the longest string we would like to match. Then a basis function φ(x) maps a string x to a
vector of length |A*|, where element j is the number of times the j-th substring in A* occurs in
string x, for j = 1 : |A*|.
• The string kernel then measures the similarity of two strings x and x′ as a weighted sum over their
shared substrings, as sketched below.
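A sketch of the standard form (the weighting scheme is part of the usual definition and is assumed
here):

κ(x, x′) = Σ_{j=1..|A*|} wj φj(x) φj(x′),

where wj ≥ 0 is a user-chosen weight for substring j. Setting all wj = 1 simply counts shared
substrings, while other choices can emphasise longer or rarer substrings.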
Kernel Functions
• Kernels for comparing documents:
• When performing document classification or retrieval, it is useful to have a way of comparing
two documents, xi and xi’ .
• If we use a bag-of-words representation, where xij is the number of times word j occurs in
document i, we can use the cosine similarity, which is sketched at the end of this list.
• This quantity measures the cosine of the angle between xi and xi’ when interpreted as vectors.
• Since xi is a count vector (and hence non-negative), the cosine similarity is between 0 and 1,
where 0 means the vectors are orthogonal and therefore have no words in common.
• This simple method does not work very well, for two main reasons. First, if xi has any word in
common with xi’ , it is deemed similar, even though some popular words, such as “the” or
“and” occur in many documents, and are therefore not discriminative.
• Second, if a discriminative word occurs many times in a document, the similarity is artificially
boosted.
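The cosine similarity referred to above has the standard form

κ(xi, xi′) = xiᵀ xi′ / ( ||xi||₂ ||xi′||₂ ),

i.e., the inner product of the two count vectors divided by the product of their Euclidean norms,
which equals the cosine of the angle between them.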
Kernel Functions
• We can significantly improve performance using some simple preprocessing. The idea is to
replace the word count vector with a new feature vector called the TF-IDF representation,
which stands for “term frequency inverse document frequency”.
• First, the term frequency is defined as a log-transform of the count:
• This reduces the impact of words that occur many times within one document. Second, the
inverse document frequency is defined as
• where N is the total number of documents, and the denominator counts how many documents
contain term j. Finally, we define
• We then use this inside the cosine similarity measure. That is, our new kernel has the form shown
below, where φ(x) = tf-idf(x). This gives good results for information retrieval.
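A sketch of the standard definitions behind these steps (this follows the usual TF-IDF construction;
the exact constants in the slides may differ):

tf(xij) ≜ log(1 + xij)
idf(j) ≜ log( N / (1 + Σ_{i=1..N} 1(xij > 0)) )
tf-idf(xi) ≜ [ tf(xij) × idf(j) ]_{j}

and the new kernel is

κ(xi, xi′) = φ(xi)ᵀ φ(xi′) / ( ||φ(xi)||₂ ||φ(xi′)||₂ ),   where φ(x) = tf-idf(x).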
Using Kernels inside GLMs
• Kernel Machines:
• We define a kernel machine to be a GLM where the input feature vector has the kernelised form
sketched below, where μ1, . . . , μK are a set of K centroids in the input space. If κ is an RBF
kernel, this is called an RBF network.
• We will call this a kernelised feature vector.
• We can use the kernelised feature vector for logistic regression by defining
p(y|x, θ) = Ber(wᵀφ(x)).
• This provides a simple way to define a non-linear decision boundary.
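A sketch of the kernelised feature vector referred to above (standard form):

φ(x) = [ κ(x, μ1), κ(x, μ2), . . . , κ(x, μK) ],

so each input x is represented by its similarity to each of the K centroids, and the GLM is then
applied to this K-dimensional vector exactly as it would be to ordinary features.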
Using Kernels inside GLMs
• Example:
• Consider data coming from the exclusive-or (xor) function. This is a binary-valued
function of two binary inputs. Its truth table is shown in Figure 14.2(a). In Figure 14.2(b), we
show some data labeled by the xor function. We see that we cannot separate the data even
using a degree-10 polynomial.
• However, using an RBF kernel and just 4 prototypes easily solves the problem as shown in
Figure 14.2(c)
Using Kernels inside GLMs
• Example:
• We can also use the kernelized feature vector
inside a linear regression model by defining
p(y|x, θ) = N (wTφ(x), σ2).
• For example, Figure 14.3 shows a 1d data set fit
with K = 10 uniformly spaced RBF prototypes,
but with the bandwidth ranging from small to
large.
• Small values lead to very wiggly functions,
since the predicted function value will only be
non-zero for points x that are close to one of the
prototypes μk.
• If the bandwidth is very large, the design matrix
reduces to a constant matrix of 1’s, since each
point is equally close to every prototype; hence
the corresponding function is just a straight line.
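A small sketch of this kind of model; the grid of prototypes, the bandwidth values, and the toy data
below are assumptions chosen only for illustration.

import numpy as np

def rbf_features(x, prototypes, bandwidth):
    """Kernelised feature vector: one RBF similarity per prototype mu_k."""
    return np.exp(-(x[:, None] - prototypes[None, :]) ** 2 / (2 * bandwidth ** 2))

# Toy 1-d regression data
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

# K = 10 uniformly spaced RBF prototypes
mus = np.linspace(0, 1, 10)

for bw in (0.01, 0.1, 1.0):                        # small, medium, large bandwidth
    Phi = rbf_features(x, mus, bw)                  # design matrix, shape (50, 10)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # linear regression on RBF features
    mse = np.mean((Phi @ w - y) ** 2)
    print(f"bandwidth={bw}: training MSE={mse:.3f}")

With a very small bandwidth the fitted curve becomes wiggly, while a very large bandwidth makes all
the features nearly constant, so the fit degenerates towards a flat line, matching the behaviour
described above.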
Support Vector Machine
SVM Algorithm
• SVM stands for support vector machine, and although it can solve both classification
and regression problems, it is mainly used for classification problems in machine
learning (ML).
• Specifically, the data is transformed into a higher dimension, and a support vector
classifier is used as a threshold (or hyperplane) to separate the two classes with
minimum error.
SVM Algorithm
• Dimensions:
• In simple terms, a dimension of something is
a particular aspect of it. Examples: width,
depth and height are dimensions.
• A line is one-dimensional, a square
(considering its edges) is two-dimensional,
and a cube is three-dimensional.
• Planes and Hyperplane:
• In one dimension, a hyperplane is called a
point.
• In two dimensions, it is a line.
• In three dimensions, it is a plane and in more
dimensions we call it a hyperplane.
SVM Algorithm
• Terminologies in SVM:
• The points closest to the hyperplane are called the support vector points, and the distance of
these vectors from the hyperplane is called the margin.
• The basic intuition is that the farther the SV points lie from the hyperplane, the higher the
probability of correctly classifying the points in their respective regions or classes.
• SV points are very critical in determining the hyperplane, because if the position of these
vectors changes, the hyperplane’s position is altered.
• Technically, this hyperplane can also be called a margin-maximizing hyperplane.
SVM Algorithm
• Math Behind SVM
• Consider a binary classification problem with two
classes, labeled as +1 and -1. We have a training
dataset consisting of input feature vectors X and their
corresponding class labels Y.
• The equation for the linear hyperplane can be written
as:
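A sketch of this equation and the resulting decision rule (standard form):

wᵀx + b = 0,   with the classifier output given by the sign of wᵀx + b.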
• Here w is the vector of the plane’s parameters (the coefficients of x), x is the input data, and
b is the intercept of the hyperplane. The output y indicates whether a point is in the positive
class or the negative class.
• Anything above the decision boundary should have label +1.
• Similarly, anything below the decision boundary should have label −1.
• If a point lies exactly on the decision boundary, the output of the classifier is zero.
SVM Algorithm
• Why the output of the equation is either
positive or negative.
• Consider the problem where the decision boundary passes
through the origin and hence intercept is zero and its slope
is +1.
• A single data point on each side of the hyperplane
represents both the positive and negative classes.
• Substituting the values in the equation of the hyperplane:
• To find x1 − x2, w has to be moved to the equation's left-hand side, which gives 2 over w.
• It is already known that w is a vector, and vectors cannot be divided directly like a scalar value.
The equivalent is to divide both sides by the length of w, that is, the norm (magnitude) of w.
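A sketch of the substitution and the resulting margin (standard derivation): for the support vector
x1 on the positive side and x2 on the negative side,

wᵀx1 + b = +1   and   wᵀx2 + b = −1,

so subtracting gives wᵀ(x1 − x2) = 2, and dividing both sides by the norm of w yields the margin

(wᵀ / ||w||)(x1 − x2) = 2 / ||w||.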
SVM Algorithm
• Now that we have arrived at the expression for the margin, it is taken as the objective function
that needs to be maximized, using optimization algorithms such as gradient descent.
• Optimization algorithms are usually framed as minimization problems, so to ease the problem we
minimize the reciprocal of the margin, which is the norm of w over 2, subject to every point being
classified correctly, as sketched below.
• Hard Margin
• The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane that properly
separates the data points of different categories without any misclassifications.
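A sketch of the resulting hard-margin optimization problem (standard form):

minimize (1/2) ||w||²   subject to   yi (wᵀxi + b) ≥ 1 for all i,

which has a solution only when the training data are perfectly linearly separable.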
SVM Algorithm
• Soft Margin
• It is also possible to allow the SVM model some
percentage of error, that is, some misclassification
of data points.
• This has to be integrated into our optimization
function, as sketched below.
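A sketch of the standard soft-margin formulation: introduce a slack variable ξi ≥ 0 for each training
point and solve

minimize (1/2) ||w||² + C Σ_i ξi   subject to   yi (wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0,

where the hyperparameter C controls the trade-off between a wide margin and the amount of
misclassification tolerated.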