Unit-2 Machine Learning
Maximum likelihood estimation - Least squares, Robust linear regression, ridge
regression, Bayesian linear regression, Linear models for classification:
Discriminant function, Probabilistic generative models, Probabilistic
discriminative models, Laplace approximation, Bayesian logistic regression,
Kernel functions, using kernels in GLM, Kernel trick, SVMs.
Maximum Likelihood Estimation
• Maximum Likelihood Estimation (MLE) is a statistical method used to estimate the
parameters of a probability distribution that best describe a given dataset.
• To analyze the given data, we first need to identify the distribution from which the data
were generated.
• Next, we use the data to estimate the parameters of that distribution. A parameter is a numerical
characteristic of a distribution.
• Example distributions and their parameters:
• Normal distribution - mean (µ) & variance (σ²)
• Binomial distribution - number of trials (n) & probability of success (p)
• Gamma distribution - shape (k) & scale (θ)
• Exponential distribution - rate (λ), the inverse of the mean.
• These parameters are vital for understanding the size, shape, spread, and other properties of a
distribution.
• Since the data that we have is mostly randomly generated, we often don’t know the true values
of the parameters characterizing our distribution.
Maximum Likelihood Estimation
• An estimator is a function of the data that gives approximate values of the parameters.
• Ex: the sample-mean estimator - a simple and frequently used estimator.
• Since the numerical characteristics of the distribution vary with the value of the parameter, it is
not easy to estimate the parameter θ of the distribution directly.
• Maximum likelihood estimation is a process of estimation that gives an entire class of
estimators called maximum likelihood estimators, or MLEs.
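A compact way to state this, assuming the usual i.i.d. setup, is that the likelihood of a dataset
D = {x1, . . . , xn} under parameter θ is L(θ; D) = ∏ p(xi | θ), and the maximum likelihood estimator is

θ̂_MLE = argmax_θ L(θ; D) = argmax_θ Σ log p(xi | θ),

where maximizing the log-likelihood is equivalent because the logarithm is monotonic.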
MLE – for Linear Regression Model
In the context of linear regression, MLE is used to estimate the coefficients (parameters) of the regression model.
Likelihood Function
Maximizing Log-Likelihood
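A sketch of the standard formulation (assuming Gaussian noise, as is usual in this derivation): the
model is y = wᵀx + ε with ε ~ N(0, σ²), so the likelihood of the data is

L(w, σ²) = ∏_{i=1..n} N(yi | wᵀxi, σ²),

and the log-likelihood is

ln L(w, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1..n} (yi − wᵀxi)².

Maximizing this log-likelihood with respect to w is equivalent to minimizing the sum of squared
errors, so the MLE of the coefficients coincides with the least-squares solution.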
Example - Maximum likelihood estimation
Let's work through an example of using Maximum Likelihood Estimation (MLE) in linear regression.
• Example : We have a dataset with the following observations:
X Y
1 2
2 3
3 5
4 4
2 6
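A short sketch of how the MLE (least-squares) fit could be computed for this small dataset; the
closed-form solution below follows the log-likelihood above, and the use of NumPy is an assumption
(any numerical library would do).

import numpy as np

# Observed data from the example table
X = np.array([1.0, 2.0, 3.0, 4.0, 2.0])
Y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

# Design matrix with an intercept column: [1, x]
A = np.column_stack([np.ones_like(X), X])

# MLE of the coefficients = least-squares solution of the normal equations
w_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)

# MLE of the noise variance = mean squared residual
residuals = Y - A @ w_hat
sigma2_hat = np.mean(residuals ** 2)

print("intercept, slope:", w_hat)
print("noise variance estimate:", sigma2_hat)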
Robust Linear Regression
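A common formulation of robust linear regression (this is the standard heavy-tailed-noise treatment;
the exact variant covered in the slides may differ) replaces the Gaussian noise model with a Laplace
likelihood,

p(y | x, w, b) = Lap(y | wᵀx, b) ∝ exp(−|y − wᵀx| / b).

Maximizing this likelihood minimizes the sum of absolute residuals Σ |yi − wᵀxi| instead of the sum of
squared residuals, which makes the fit far less sensitive to outliers than ordinary least squares.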
Discriminant Functions
• Multiple classes: with K > 2 classes we use K linear discriminant functions, one for each class,
and assign each training data point x to the class whose discriminant value is largest.
Discriminant Functions
Least squares for classification
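A brief sketch of this approach (standard treatment, assumed to match the slides): encode the targets
with a 1-of-K coding scheme, model each class with its own linear function yk(x) = wkᵀx + wk0, and
choose the weights to minimize the sum-of-squares error between the model outputs and the target
vectors. This has a closed-form solution, but it is sensitive to outliers and can give poor class
boundaries, which motivates the alternatives that follow.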
Discriminant Functions
Fisher’s linear discriminant
Fisher’s idea is to maximize a function that gives a large separation between the projected class means while
also giving a small variance within each class, thereby minimizing the class overlap.
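A sketch of the criterion this describes (standard notation; the slide’s own equation may use different
symbols): project x onto a direction w via y = wᵀx, and choose w to maximize

J(w) = (m2 − m1)² / (s1² + s2²),

where mk is the mean of the projected points from class k and sk² is the within-class variance of the
projected points. Equivalently, J(w) = (wᵀS_B w) / (wᵀS_W w) with between-class and within-class
covariance matrices S_B and S_W, and the maximum is attained for w ∝ S_W⁻¹(m2 − m1), where m1 and
m2 are the original class mean vectors.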
Discriminant Functions
Perceptron convergence theorem: if there exists an exact solution (in other words, if the training data set is linearly
separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of
steps.
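For reference, a sketch of the learning rule the theorem refers to (standard perceptron algorithm):
with targets tn ∈ {−1, +1} and y(x) = f(wᵀφ(x)), where f is the step function, each misclassified
pattern triggers the update

w^(τ+1) = w^(τ) + η φ(xn) tn,

and the theorem guarantees that, for linearly separable data, this loop terminates with every training
pattern correctly classified.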
Probabilistic Generative Models
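A brief sketch of the standard two-class formulation (the slides’ own derivation may differ in detail):
model the class-conditional densities p(x|Ck) and the priors p(Ck), then obtain the posterior by Bayes’
theorem,

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2)) = σ(a),
where a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ]

and σ is the logistic sigmoid. With Gaussian class-conditional densities sharing a common covariance
matrix, a becomes a linear function of x, giving p(C1|x) = σ(wᵀx + w0).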
Probabilistic Discriminative Models
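By contrast, a sketch of the discriminative approach (standard logistic regression; assumed to match
the slides): model the posterior directly as

p(C1|φ) = y(φ) = σ(wᵀφ),   p(C2|φ) = 1 − y(φ),

and fit w by maximum likelihood, i.e. by minimizing the cross-entropy error
E(w) = −Σ [ tn ln yn + (1 − tn) ln(1 − yn) ], whose gradient is ∇E(w) = Σ (yn − tn) φn.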
Laplace Approximation
The Laplace Approximation aims to find a Gaussian approximation to a probability
density defined over a set of continuous variables.
Consider first the case of a single continuous variable z, and suppose the distribution
p(z) is defined by
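The defining equation referred to here (following the standard presentation of the Laplace
approximation) is

p(z) = (1/Z) f(z),   where Z = ∫ f(z) dz

is the normalization coefficient, assumed unknown. The approximation centres a Gaussian on a mode z0
of p(z), found from f′(z0) = 0, and uses a second-order Taylor expansion of ln f(z) around that mode:

ln f(z) ≃ ln f(z0) − (1/2) A (z − z0)²,   where A = − d²/dz² ln f(z) |_{z = z0}.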
Laplace Approximation…
Gaussian approximation will only be well defined if its precision A > 0, in other words the stationary
point z0 must be a local maximum, so that the second derivative of f(z) at the point z0 is negative.
Laplace Approximation…
and ∇ is the gradient operator. Taking the exponential of both sides we obtain
The distribution q(z) is proportional to f(z) and the appropriate normalization coefficient can be
found by inspection, using the standard result for a normalized multivariate Gaussian, giving
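A sketch of the multivariate result these lines refer to (standard presentation): for an M-dimensional
variable z, expanding around a stationary point z0 (where ∇f(z0) = 0) gives

ln f(z) ≃ ln f(z0) − (1/2)(z − z0)ᵀ A (z − z0),   with A = − ∇∇ ln f(z) |_{z = z0},

so that f(z) ≃ f(z0) exp{ −(1/2)(z − z0)ᵀ A (z − z0) }, and the normalized Gaussian approximation is

q(z) = ( |A|^(1/2) / (2π)^(M/2) ) exp{ −(1/2)(z − z0)ᵀ A (z − z0) } = N(z | z0, A⁻¹),

where A, the Hessian of −ln f at the mode, must be positive definite for q(z) to be well defined.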
Laplace Approximation…
One major weakness of the Laplace approximation is that, since it is based on a Gaussian
distribution, it is only directly applicable to real variables.
In other cases it may be possible to apply the Laplace approximation to a transformation of the
variable. For instance, if 0 < τ < ∞ then we can consider a Laplace approximation of ln τ.
The most serious limitation of the Laplace framework, however, is that it is based purely on the
aspects of the true distribution at a specific value of the variable, and so can fail to capture important
global properties.
Bayesian Logistic Regression
• Exact Bayesian inference for logistic regression is intractable.
• In particular, evaluation of the posterior distribution would require
normalization of the product of a prior distribution and a likelihood
function that itself comprises a product of logistic sigmoid functions,
one for every data point.
• Evaluation of the predictive distribution is similarly intractable.
• It is natural to begin with a Gaussian prior, which we write in the
general form shown below.
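A sketch of the standard treatment (Laplace approximation to the posterior; symbols follow the usual
notation): the Gaussian prior is

p(w) = N(w | m0, S0).

The posterior p(w|t) ∝ p(w) p(t|w) is then approximated by a Gaussian q(w) = N(w | w_MAP, S_N) centred
at the MAP solution, with

S_N⁻¹ = S0⁻¹ + Σ_{n=1..N} yn (1 − yn) φn φnᵀ,   where yn = σ(w_MAPᵀ φn).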
Bayesian Logistic Regression…
Predictive distribution
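A sketch of how the predictive distribution is obtained in the standard treatment (probit
approximation; assumed to match the slides): the predictive probability of class C1 for a new feature
vector φ is found by marginalizing over the approximate Gaussian posterior,

p(C1 | φ, t) = ∫ σ(wᵀφ) q(w) dw ≈ ∫ σ(a) N(a | μ_a, σ_a²) da,

where a = wᵀφ, μ_a = w_MAPᵀ φ and σ_a² = φᵀ S_N φ. Approximating the sigmoid by a probit function
gives the closed-form result

p(C1 | φ, t) ≈ σ( κ(σ_a²) μ_a ),   with κ(σ²) = (1 + πσ²/8)^(−1/2).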
Kernel functions, using kernels in GLM, Kernel trick
Kernel Trick
• SVM algorithms use a set of mathematical functions that are defined as the kernel. The
function of a kernel is to take data as input and transform it into the required form.
• Firstly, a kernel takes the data from its original space and implicitly maps it to a
higher-dimensional space. This is crucial when dealing with data that is not linearly separable
in its original form.
• Instead of performing computationally expensive high-dimensional calculations, the kernel
function calculates the relationships or similarities between pairs of data points as if they were
in this higher-dimensional space.
Kernel Trick
• The Kernel Trick allows us to operate in the original feature space without computing the
coordinates of the data in a higher-dimensional space.
• Example: let x and y be two data points in 3 dimensions, and suppose we need to map x and y to
a 9-dimensional space. Computing the mapped features explicitly and then combining them to get
the final result, which is just a scalar, has computational complexity O(n²).
• However, if we use the kernel function, denoted k(x, y), instead of doing the complicated
computations in the 9-dimensional space, we reach the same result within the 3-dimensional space
by calculating the dot product of x-transpose and y. The computational complexity in this case
is O(n).
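A small sketch illustrating this example; the specific choice below (the squared dot-product
polynomial kernel and its explicit 9-dimensional feature map of pairwise products) is an assumption
made to keep the demonstration concrete.

import numpy as np

def phi(v):
    """Explicit 9-dimensional feature map for a 3-d vector v: all pairwise products v_i * v_j."""
    return np.array([v[i] * v[j] for i in range(3) for j in range(3)])

def poly_kernel(x, y):
    """Kernel k(x, y) = (x . y)^2, computed directly in the original 3-d space."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Both routes give the same scalar, but the kernel never builds the 9-d vectors.
explicit = np.dot(phi(x), phi(y))
via_kernel = poly_kernel(x, y)
print(explicit, via_kernel)   # identical values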
• This is useful if the original data is already high dimensional, and if the original features are
individually informative,
• e.g., a bag-of-words representation where the vocabulary size is large, or the expression level of
many genes.
• In such a case, the decision boundary is likely to be representable as a linear combination of the
original features, so it is not necessary to work in some other feature space.
Kernel Functions
• Mercer Kernel:
• Let X = {x1, . . . , xn} be a finite set of n samples from the input space. The Gram matrix of X is
defined as the matrix of all pairwise kernel evaluations, and a Mercer kernel is one for which this
matrix is positive semi-definite for every such set X.
• A standard example is the polynomial kernel, where r > 0 (see the sketch below).
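A sketch of the definitions referred to above (standard textbook form): the Gram matrix is

K = [ κ(xi, xj) ]_{i,j = 1..n},

and κ is a Mercer kernel if K is positive semi-definite for every finite set {x1, . . . , xn}. The
polynomial kernel

κ(x, x′) = (γ xᵀx′ + r)^M,   with r > 0 and γ > 0,

is a Mercer kernel; its corresponding feature vector contains all monomials of the inputs up to
degree M.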
• Sigmoid Kernel:
• An example of a kernel that is not a Mercer kernel is the so-called sigmoid kernel, defined by
• This function is related to a two-layer perceptron model of a neural network, since the tanh
function it uses is a common activation function for artificial neurons.
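The kernel referred to here has the standard form

κ(x, x′) = tanh(γ xᵀx′ + r),

which is not positive definite for all choices of γ and r, and hence is not a Mercer kernel.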
Kernel Functions
• RBF Kernel:
• The below function is the Gaussian or RBF kernel
• Let d₁₂ be the distance between the two points X₁ and X₂, we can now represent d₁₂ as follows:
• The width of the “region of similarity” around each point changes as σ changes.
• Finding the right σ for a given dataset is therefore important.
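A sketch of the formulas this slide refers to (standard Gaussian/RBF kernel):

κ(X1, X2) = exp( − ||X1 − X2||² / (2σ²) ).

Writing d12 = ||X1 − X2|| for the distance between the two points, this becomes
κ(X1, X2) = exp( − d12² / (2σ²) ), so the similarity decays from 1 (identical points) towards 0 as the
distance grows, at a rate controlled by the bandwidth σ.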
Kernel Functions
• String Kernels:
• If we’re interested in matching all substrings (for example) instead of representing an object
as a bag of words, we can use a string kernel:
• Let A denote an alphabet, e.g., {a, ..., z}, and let A* = A ∪ A² ∪ · · · ∪ Aᵐ, where m is the length of
the longest string we would like to match. Then a basis function φ(x) maps a string x to a
vector of length |A*|, where element j is the number of times the j-th substring in A* occurs in
string x, for j = 1 : |A*|.
• The string kernel then measures the similarity of two strings x and x′ as a weighted sum over their
shared substrings, as sketched below.
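A sketch of the standard form (the weighting scheme is part of the usual definition and is assumed
here):

κ(x, x′) = Σ_{j=1..|A*|} wj φj(x) φj(x′),

where wj ≥ 0 is a user-chosen weight for substring j. Setting all wj = 1 simply counts shared
substrings, while other choices can emphasise longer or rarer substrings.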
Kernel Functions
• Kernels for comparing documents:
• When performing document classification or retrieval, it is useful to have a way of comparing
two documents, xi and xi’ .
• If we use a bag-of-words representation, where xij is the number of times word j occurs in
document i, we can use the cosine similarity, which is sketched at the end of this list.
• This quantity measures the cosine of the angle between xi and xi’ when interpreted as vectors.
• Since xi is a count vector (and hence non-negative), the cosine similarity is between 0 and 1,
where 0 means the vectors are orthogonal and therefore have no words in common.
• This simple method does not work very well, for two main reasons. First, if xi has any word in
common with xi’ , it is deemed similar, even though some popular words, such as “the” or
“and” occur in many documents, and are therefore not discriminative.
• Second, if a discriminative word occurs many times in a document, the similarity is artificially
boosted.
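The cosine similarity referred to above has the standard form

κ(xi, xi′) = xiᵀ xi′ / ( ||xi||₂ ||xi′||₂ ),

i.e., the inner product of the two count vectors divided by the product of their Euclidean norms,
which equals the cosine of the angle between them.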
Kernel Functions
• We can significantly improve performance using some simple preprocessing. The idea is to
replace the word count vector with a new feature vector called the TF-IDF representation,
which stands for “term frequency inverse document frequency”.
• First, the term frequency is defined as a log-transform of the count:
• This reduces the impact of words that occur many times within one document. Second, the
inverse document frequency is defined as
• where N is the total number of documents, and the denominator counts how many documents
contain term j. Finally, we define
• We then use this inside the cosine similarity measure. That is, our new kernel has the form shown
below, where φ(x) = tf-idf(x). This gives good results for information retrieval.
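A sketch of the standard definitions behind these steps (this follows the usual TF-IDF construction;
the exact constants in the slides may differ):

tf(xij) ≜ log(1 + xij)
idf(j) ≜ log( N / (1 + Σ_{i=1..N} 1(xij > 0)) )
tf-idf(xi) ≜ [ tf(xij) × idf(j) ]_{j}

and the new kernel is

κ(xi, xi′) = φ(xi)ᵀ φ(xi′) / ( ||φ(xi)||₂ ||φ(xi′)||₂ ),   where φ(x) = tf-idf(x).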
Using Kernels inside GLMs
• Kernel Machines:
• We define a kernel machine to be a GLM where the input feature vector has the kernelised form
sketched below, where μ1, . . . , μK are a set of K centroids in the input space. If κ is an RBF
kernel, this is called an RBF network.
• We will call this a kernelised feature vector.
• We can use the kernelised feature vector for logistic regression by defining
p(y|x, θ) = Ber(wᵀφ(x)).
• This provides a simple way to define a non-linear decision boundary.
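A sketch of the kernelised feature vector referred to above (standard form):

φ(x) = [ κ(x, μ1), κ(x, μ2), . . . , κ(x, μK) ],

so each input x is represented by its similarity to each of the K centroids, and the GLM is then
applied to this K-dimensional vector exactly as it would be to ordinary features.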
Using Kernels inside GLMs
• Example:
• Consider data coming from the exclusive-or (xor) function. This is a binary-valued
function of two binary inputs. Its truth table is shown in Figure 14.2(a). In Figure 14.2(b), we
show some data labeled by the xor function. We see that we cannot separate the data even
using a degree-10 polynomial.
• However, using an RBF kernel and just 4 prototypes easily solves the problem as shown in
Figure 14.2(c)
Using Kernels inside GLMs
• Example:
• We can also use the kernelized feature vector
inside a linear regression model by defining
p(y|x, θ) = N (wTφ(x), σ2).
• For example, Figure 14.3 shows a 1d data set fit
with K = 10 uniformly spaced RBF prototypes,
but with the bandwidth ranging from small to
large.
• Small values lead to very wiggly functions,
since the predicted function value will only be
non-zero for points x that are close to one of the
prototypes μk.
• If the bandwidth is very large, the design matrix
reduces to a constant matrix of 1’s, since each
point is equally close to every prototype; hence
the corresponding function is just a straight line.
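A small sketch of this kind of model; the grid of prototypes, the bandwidth values, and the toy data
below are assumptions chosen only for illustration.

import numpy as np

def rbf_features(x, prototypes, bandwidth):
    """Kernelised feature vector: one RBF similarity per prototype mu_k."""
    return np.exp(-(x[:, None] - prototypes[None, :]) ** 2 / (2 * bandwidth ** 2))

# Toy 1-d regression data
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

# K = 10 uniformly spaced RBF prototypes
mus = np.linspace(0, 1, 10)

for bw in (0.01, 0.1, 1.0):                        # small, medium, large bandwidth
    Phi = rbf_features(x, mus, bw)                  # design matrix, shape (50, 10)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # linear regression on RBF features
    mse = np.mean((Phi @ w - y) ** 2)
    print(f"bandwidth={bw}: training MSE={mse:.3f}")

With a very small bandwidth the fitted curve becomes wiggly, while a very large bandwidth makes all
the features nearly constant, so the fit degenerates towards a flat line, matching the behaviour
described above.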
Support Vector Machine
SVM Algorithm
• SVM stands for support vector machine, and although it can solve both classification
and regression problems, it is mainly used for classification problems in machine
learning (ML).
• Specifically, the data is transformed into a higher dimension, and a support vector
classifier is used as a threshold (or hyperplane) to separate the two classes with
minimum error.
SVM Algorithm
• Dimensions:
• In simple terms, a dimension of something is
a particular aspect of it. Examples: width,
depth and height are dimensions.
• A line is one-dimensional, a square
(considering its edges) is two-dimensional,
and a cube is three-dimensional.
• Planes and Hyperplane:
• In one dimension, a hyperplane is called a
point.
• In two dimensions, it is a line.
• In three dimensions, it is a plane and in more
dimensions we call it a hyperplane.
SVM Algorithm
• Terminologies in SVM:
• The points closest to the hyperplane are called the support vector points, and the distance of
these vectors from the hyperplane is called the margin.
• The basic intuition is that the farther the SV points lie from the hyperplane, the higher the
probability of correctly classifying the points in their respective regions or classes.
• SV points are very critical in determining the hyperplane, because if the position of these
vectors changes, the hyperplane’s position is altered.
• Technically, this hyperplane can also be called a margin-maximizing hyperplane.
SVM Algorithm
• Math Behind SVM
• Consider a binary classification problem with two
classes, labeled as +1 and -1. We have a training
dataset consisting of input feature vectors X and their
corresponding class labels Y.
• The equation for the linear hyperplane can be written
as:
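A sketch of this equation and the resulting decision rule (standard form):

wᵀx + b = 0,   with the classifier output given by the sign of wᵀx + b.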
• Here w is the vector of the plane’s parameters (the coefficients of x), x is the input data, and
b is the intercept of the hyperplane. The output y indicates whether a point is in the positive
class or the negative class.
• Anything above the decision boundary should have label +1.
• Similarly, anything below the decision boundary should have label −1.
• If a point lies exactly on the decision boundary, the output of the classifier is zero.
SVM Algorithm
• Why the output of the equation is either
positive or negative.
• Consider the problem where the decision boundary passes
through the origin and hence intercept is zero and its slope
is +1.
• A single data point on each side of the hyperplane
represents both the positive and negative classes.
• Substituting the values in the equation of the hyperplane:
• To find x1 − x2, w has to be moved to the equation's left-hand side, which gives 2 over w.
• It is already known that w is a vector, and vectors cannot be divided directly like a scalar value.
The equivalent is to divide both sides by the length of w, that is, the norm (magnitude) of w.
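A sketch of the substitution and the resulting margin (standard derivation): for the support vector
x1 on the positive side and x2 on the negative side,

wᵀx1 + b = +1   and   wᵀx2 + b = −1,

so subtracting gives wᵀ(x1 − x2) = 2, and dividing both sides by the norm of w yields the margin

(wᵀ / ||w||)(x1 − x2) = 2 / ||w||.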
SVM Algorithm
• Now that we have arrived at the expression for the margin, it is taken as the objective function
that needs to be maximized, using optimization algorithms such as gradient descent.
• Optimization algorithms are usually framed as minimization problems, so to ease the problem we
minimize the reciprocal of the margin, which is the norm of w over 2, subject to every point being
classified correctly, as sketched below.
• Hard Margin
• The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane that properly
separates the data points of different categories without any misclassifications.
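A sketch of the resulting hard-margin optimization problem (standard form):

minimize (1/2) ||w||²   subject to   yi (wᵀxi + b) ≥ 1 for all i,

which has a solution only when the training data are perfectly linearly separable.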
SVM Algorithm
• Soft Margin
• It is also possible to allow the SVM model some
percentage of error, that is, some misclassification
of data points.
• This has to be integrated into our optimization
function, as sketched below.
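A sketch of the standard soft-margin formulation: introduce a slack variable ξi ≥ 0 for each training
point and solve

minimize (1/2) ||w||² + C Σ_i ξi   subject to   yi (wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0,

where the hyperparameter C controls the trade-off between a wide margin and the amount of
misclassification tolerated.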