ML Unit 4
In machine learning, Linear Discriminants are used to solve classification problems by finding
a linear decision boundary that separates different classes. These methods are central to many
classification algorithms and form the foundation of techniques like Linear Discriminant
Analysis (LDA) and Fisher's Linear Discriminant.
A linear discriminant is a function that attempts to classify data by creating a decision rule
based on a linear combination of the features. In other words, it finds a hyperplane (a linear
boundary) that can divide the data into different classes.
Linear Discriminants for Classification:
Linear Discriminant Analysis (LDA) is a method used for classification that assumes different
classes generate data based on different Gaussian distributions, and it tries to find a linear
combination of features that best separate the classes.
1. Goal:
o The primary goal of LDA is to reduce the dimensionality of the data while preserving as
much of the class discriminatory information as possible. Essentially, it finds the "best"
axis or hyperplane that can separate multiple classes.
2. Assumptions:
o Each class is normally distributed with the same covariance matrix (homoscedasticity
assumption).
o The features are independent and linearly separable.
3. How it Works: LDA tries to maximize the ratio of the variance between the classes to
the variance within each class, ensuring the classes are as separated as possible.
o Between-class variance: Measures how far apart the class means are from the overall
mean of the data.
o Within-class variance: Measures the spread of the data points within each class.
This criterion ensures that the classes are as distinct as possible when projected onto a
lower-dimensional space.
4. Steps in LDA:
o Step 1: Compute the mean vector for each class.
o Step 2: Compute the overall mean of the data.
o Step 3: Compute the scatter matrices (within-class and between-class).
o Step 4: Compute the linear discriminants (eigenvectors) of the scatter matrices.
o Step 5: Choose the top eigenvectors that correspond to the largest eigenvalues to form a
transformation matrix.
o Step 6: Project the data onto the new subspace formed by these eigenvectors.
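A minimal NumPy sketch of these steps on a hypothetical two-class, two-feature dataset (the data values and the choice of a single discriminant direction are illustrative assumptions, not part of the original notes):

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are features.
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0],
              [9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Step 1: mean vector for each class; Step 2: overall mean of the data.
means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
overall_mean = X.mean(axis=0)

# Step 3: within-class (S_W) and between-class (S_B) scatter matrices.
d = X.shape[1]
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c, mean_c in means.items():
    X_c = X[y == c]
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (diff @ diff.T)

# Step 4: eigenvectors of inv(S_W) @ S_B are the linear discriminants.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# Step 5: keep the eigenvector(s) with the largest eigenvalues.
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:1]].real   # one discriminant suffices for two classes

# Step 6: project the data onto the new subspace.
X_lda = X @ W
print(X_lda.ravel())
```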
Example:
Suppose you have a dataset with features like height, weight, and age, and you are tasked with
classifying individuals as "Male" or "Female." LDA would:
Compute the mean height, weight, and age for males and females.
Maximize the separation between the "Male" and "Female" classes while minimizing the variance
within each class.
Project the data into a new 1D or 2D space where the two classes are as far apart as possible.
Then, classification can be performed using a decision rule based on the transformed data.
Advantages of LDA:
Efficient: LDA is computationally less expensive and less complex compared to some other
machine learning techniques, especially when dealing with high-dimensional data.
Good for smaller datasets: LDA can be effective on relatively small datasets, provided there
are enough samples per class to estimate the class means and covariance reliably.
Interpretability: Since LDA creates linear decision boundaries, it can be easy to understand and
interpret.
Limitations:
Assumption of normality: LDA assumes that the data follows a Gaussian (normal) distribution,
which may not always be true.
Assumption of equal covariance matrices: LDA assumes that all classes share the same
covariance matrix, which may not hold in real-world data.
Linear decision boundary: LDA can only create linear decision boundaries, which may not be
effective for highly non-linear datasets.
Applications of LDA:
Speech recognition
Face recognition
Medical diagnostics
Financial prediction, etc.
Perceptron Classifier:
Limitations:
The perceptron can only solve linearly separable problems, meaning that it can only find a
decision boundary for datasets where classes can be separated by a straight line or hyperplane.
It can't handle more complex problems, such as XOR (exclusive OR), which is non-linearly
separable.
Perceptron vs. Other Models:
Single Layer vs. Multi-layer: A perceptron is a single-layer neural network, which is quite
simple. More complex models like multi-layer perceptrons (MLPs) involve multiple layers of
neurons and can solve non-linear problems.
Linear vs. Non-Linear: The perceptron is a linear classifier, so it cannot model non-linear
decision boundaries without modifications like using a kernel trick (as in Support Vector
Machines).
A perceptron is a type of artificial neuron that mimics the way biological neurons work. It takes
multiple input values, applies weights, sums them up, and passes the result through an activation
function to produce an output.
Mathematical Representation
The perceptron computes a weighted sum of its inputs and passes it through a step (threshold)
activation function:
y = 1 if (w · x + b) > 0, otherwise y = 0
where x is the input vector, w is the weight vector, and b is the bias.
4. Advantages of the Perceptron Algorithm
✔ Simple to implement and computationally inexpensive
✔ Guaranteed to converge on linearly separable data
✔ Supports online learning, updating its weights one sample at a time
5. Limitations
❌ Only works for linearly separable data (fails on problems like XOR)
❌ Convergence depends on the learning rate
❌ Cannot model complex patterns (no hidden layers)
6. Variants of the Perceptron
Multi-Layer Perceptron (MLP): Uses multiple layers and nonlinear activation functions.
Adaline (Adaptive Linear Neuron): Uses a continuous activation function.
7. Conclusion
The Perceptron Learning Algorithm is one of the simplest binary classifiers in machine
learning. It works well for linearly separable problems, such as the AND gate, but fails on non-
linearly separable problems like XOR.
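A minimal sketch of the perceptron learning rule trained on the AND gate, assuming NumPy and an illustrative learning rate of 0.1:

```python
import numpy as np

# AND gate: linearly separable, so the perceptron rule converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate (illustrative value)

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
        error = target - pred
        w += lr * error * xi                        # perceptron update rule
        b += lr * error

print("weights:", w, "bias:", b)
print("predictions:", [(1 if np.dot(w, xi) + b > 0 else 0) for xi in X])
```

Running the same loop on the XOR truth table never converges, because no weight vector and bias can separate XOR with a single straight line.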
Support Vector Machines:
1. Introduction to SVM
Support Vector Machines (SVM) are a set of supervised learning algorithms used for
classification, regression, and outlier detection. SVM is particularly effective in high-
dimensional spaces and is widely used in tasks like image recognition, text categorization, and
medical diagnosis.
The main goal of SVM is to find the optimal hyperplane that best separates different classes in
the dataset while maximizing the margin between the nearest data points of different classes
(support vectors).
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
After adding the third dimension, the two classes become separable in the new
three-dimensional space, and SVM can divide the datasets into classes with a separating
hyperplane. Since we are in 3-D space, this hyperplane looks like a plane parallel to the
x-axis. Converted back to the original 2-D space (taking z = 1), the decision boundary becomes
a circle, the set of points with x² + y² = 1.
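A hedged sketch of this idea with NumPy and scikit-learn's LinearSVC (the concentric toy data and parameter values are illustrative assumptions): adding the feature z = x² + y² makes the two classes separable by a plane, so a linear SVM trained on (x, y, z) succeeds where it would fail on (x, y) alone.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy data: class 0 forms an inner cluster, class 1 an outer ring
# (not linearly separable in the original x, y space).
theta = rng.uniform(0, 2 * np.pi, 100)
r = np.concatenate([rng.uniform(0.0, 1.0, 50),    # inner radii, class 0
                    rng.uniform(2.0, 3.0, 50)])   # outer radii, class 1
X2d = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.concatenate([np.zeros(50), np.ones(50)])

# Add the third dimension z = x^2 + y^2.
z = X2d[:, 0] ** 2 + X2d[:, 1] ** 2
X3d = np.c_[X2d, z]

# In (x, y, z) the classes can be separated by a plane, so a linear SVM succeeds.
clf = LinearSVC(C=1.0).fit(X3d, y)
print("training accuracy:", clf.score(X3d, y))   # expected to be 1.0 or very close
```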
Linearly Non-Separable Case:
In machine learning, the linearly non-separable case occurs when classes in a dataset cannot be
divided by a single straight line (or hyperplane in higher dimensions). This problem arises
frequently in real-world data, where the patterns governing the data are more complex than a
simple linear boundary can capture.
To understand linearly non-separable data, it helps to first understand what it means for data to
be linearly separable:
Linearly Separable Data: This is when two classes in a dataset can be separated by a
single straight line (in 2D), a hyperplane (in higher dimensions), or a flat decision
boundary. For example, if you were plotting data points of two classes on a 2D plane, and
you could draw a straight line between the two classes without misclassifying any points,
then the data is linearly separable.
Example: imagine points of class A clustered in one region of the plane and points of class B
clustered in another, with a clear gap between them. A line could easily separate these points,
and a linear classifier (e.g., Logistic Regression, Linear SVM) could classify them correctly.
Linearly Non-Separable Data: When no straight line (or hyperplane) can separate the
classes without errors, the data is linearly non-separable. In such cases, a linear classifier
will not perform well because it will misclassify some points.
Example: the XOR pattern, where the two classes occupy opposite corners of a square. No single
straight line can separate these points into their respective classes, and this is a classic
example of non-separable data.
When dealing with linearly non-separable data, classifiers like Support Vector Machines
(SVMs) and Logistic Regression struggle to achieve perfect classification. The challenges are:
Overfitting: attempts to fit non-separable training data exactly (for example, by adding many
transformed features) can produce overly complex boundaries that generalize poorly to unseen data.
Misclassification: A linear classifier will misclassify some points, since it cannot create a
boundary that separates the data perfectly.
There are several ways to handle linearly non-separable data, depending on the classifier and the
problem at hand:
A. The Kernel Trick (SVMs with Kernels)
Support Vector Machines (SVMs) are one of the most powerful classifiers for handling non-
linearly separable data. The kernel trick is a method that maps the data into a higher-
dimensional space where it may become linearly separable. After the data is mapped, a
hyperplane can be used to separate the classes.
1. Transformation to Higher Dimensions: The kernel trick allows the transformation of the original
input space into a higher-dimensional space. In this new space, a linear hyperplane may exist
that separates the data.
2. Types of Kernels:
o Linear Kernel: Does not change the input space and works well for linearly separable
data.
o Polynomial Kernel: Maps the data to a higher-dimensional polynomial space, which can
capture non-linear relationships.
o Radial Basis Function (RBF) Kernel: Maps the data into an infinite-dimensional space
and can separate highly complex data patterns.
o Sigmoid Kernel: Based on the sigmoid function, it is used in some contexts but not as
widely.
Example: For the XOR problem, the points are not linearly separable in 2D. However, by using
a kernel (e.g., RBF), you can map the data into a higher-dimensional space, where it becomes
linearly separable.
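For illustration, a short scikit-learn sketch of this XOR example, assuming an RBF kernel with illustrative values of gamma and C:

```python
from sklearn.svm import SVC

# XOR problem: not linearly separable in the original 2-D space.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# An RBF kernel implicitly maps the points into a higher-dimensional space
# where they become separable by a hyperplane.
clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print(clf.predict(X))   # expected: [0 1 1 0]
```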
B. Nonlinear Classifiers
Some machine learning models are inherently nonlinear and can handle non-separable data
without the need for a kernel transformation:
1. Decision Trees:
o Decision trees build models by recursively splitting the data based on feature values.
They do not rely on linear boundaries but instead create piecewise constant decision
regions.
o A decision tree can handle non-linear relationships by making binary decisions at each
node, forming a complex boundary.
2. Neural Networks:
o Neural networks are composed of layers of neurons with nonlinear activation functions
(e.g., ReLU, Sigmoid, Tanh). These networks can model highly complex, non-linear
relationships in data.
o The layers and activation functions allow the network to learn intricate patterns that
linear models cannot capture.
C. Soft Margin
For SVMs, a soft margin allows some points to be misclassified while still maximizing the
margin between classes. This is useful in the non-separable case, where a perfect separation is
not possible.
C parameter: In the SVM, the C parameter controls the trade-off between maximizing the
margin and minimizing the classification error. A high value for C results in a stricter classifier
with fewer errors, while a low value allows for more misclassifications but a wider margin.
Instead of insisting on perfect classification, it allows some points to be on the wrong side of the
margin.
The classifier finds a balance between a large margin and a low number of misclassifications.
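A small scikit-learn sketch of this trade-off, using overlapping synthetic blobs and a few illustrative C values:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs: no perfect linear separation is possible.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=42)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates more margin violations (wider margin);
    # a large C penalizes misclassifications more heavily (narrower margin).
    print(f"C={C:<6} support vectors={len(clf.support_)} "
          f"training accuracy={clf.score(X, y):.2f}")
```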
D. Feature Engineering
Sometimes, the data is non-separable because the features are not well-suited for linear
separation. Feature engineering can help:
Nonlinear Transformations: You can apply nonlinear functions (e.g., sine, cosine,
logarithms) to the features to help reveal patterns that might not be visible in the original
feature space.
E. Regularization
When using models like Logistic Regression or Linear SVM, regularization can help avoid
overfitting by adding a penalty term to the loss function. This penalty prevents the model from
fitting the noise in the non-separable data.
L1 regularization (Lasso): Encourages sparsity in the model by forcing some feature coefficients
to be exactly zero.
L2 regularization (Ridge): Penalizes large coefficients but does not force them to be zero,
encouraging smoother decision boundaries.
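A brief scikit-learn sketch contrasting L1 and L2 penalties in Logistic Regression (the synthetic dataset and the regularization strength C are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# L1 (Lasso) regularization tends to drive some coefficients to exactly zero.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
# L2 (Ridge) regularization shrinks coefficients but rarely zeroes them out.
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

print("zero coefficients with L1:", (l1.coef_ == 0).sum())
print("zero coefficients with L2:", (l2.coef_ == 0).sum())
```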
F. Ensemble Methods
Ensemble methods combine multiple weak learners to create a strong model. These methods
work well for non-separable data:
1. Random Forests: An ensemble of decision trees that aggregates predictions from many
individual trees to create a final classification.
o Random Forests can handle complex data patterns by averaging the results of many
trees.
2. Gradient Boosting Machines (GBM): An ensemble method that builds decision trees
sequentially, with each tree trying to correct the mistakes made by the previous one.
GBMs are very effective for handling non-linear data.
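A short scikit-learn sketch of both ensemble methods on the non-linearly separable "two moons" toy dataset (dataset choice and hyperparameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# "Two moons" data is a classic linearly non-separable pattern.
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Random Forest test accuracy:    ", rf.score(X_test, y_test))
print("Gradient Boosting test accuracy:", gbm.score(X_test, y_test))
```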
Example (the XOR problem):
x1 x2 Class
0 0 0
0 1 1
1 0 1
1 1 0
In this case, no straight line can separate the two classes (0 and 1), but you can apply a kernel
trick (e.g., RBF kernel) to map the data into a higher-dimensional space where it becomes
separable. The decision boundary in the original space will be nonlinear, but in the transformed
space, it will be linear.
Conclusion
Linearly non-separable data is common in practice. It can be handled with kernel methods,
inherently nonlinear classifiers, soft margins, feature engineering, regularization, or ensemble
methods, depending on the problem at hand.
Non-linear SVM:
Support Vector Machines (SVMs) are powerful supervised learning algorithms used for classification and
regression tasks. While traditional linear SVMs aim to find a hyperplane that separates data linearly into
classes, non-linear SVMs are used when the data cannot be separated by a linear boundary. Non-linear
SVMs can effectively handle linearly non-separable data using a technique known as the kernel trick.
Before diving into non-linear SVMs, it’s essential to understand the basic workings of SVMs:
Objective of SVM: The goal of an SVM is to find the optimal hyperplane (or decision
boundary) that maximizes the margin between the two classes. In simple terms, SVM
tries to find a hyperplane that divides the classes in such a way that the distance between
the closest data points (from either class) to the hyperplane is as large as possible.
Linear Separation: In cases where the data is linearly separable, SVM can create a
linear boundary. For instance, in a 2D feature space, a line would separate the classes,
and the algorithm seeks the line that maximizes the margin.
When the data is non-linearly separable, a single straight line (in 2D) or hyperplane (in higher
dimensions) will not be sufficient to separate the classes. For example, in problems like the XOR
problem, where the points are arranged in a way that no linear boundary can separate the
classes, traditional linear SVM will fail.
Example of Non-Linearly Separable Data (XOR Problem):
x1 x2 Class
0 0 0
0 1 1
1 0 1
1 1 0
In this case, you cannot draw a straight line that divides the classes (0 and 1). A non-linear SVM
can solve this problem by transforming the input features into a higher-dimensional space.
The central concept behind non-linear SVM is the kernel trick. The kernel trick allows us to
map the data into a higher-dimensional space where a linear hyperplane might be able to separate
the classes.
1. Non-linear Mapping: Instead of directly trying to find a linear boundary in the original
feature space, SVM uses a non-linear transformation to map the input data into a
higher-dimensional space. In this new space, the data might become linearly separable.
2. Implicit Transformation: Instead of explicitly calculating the transformation (which
could be computationally expensive), SVM uses kernels that implicitly compute the
inner products between data points in the higher-dimensional space. This allows the SVM
to perform well even on very high-dimensional spaces without the need for
computationally expensive transformations.
5. How Non-Linear SVM Works
The input data is implicitly mapped into a higher-dimensional feature space using a kernel
function. A maximum-margin hyperplane is then found in that space, and the resulting decision
boundary, viewed in the original feature space, is non-linear. Classifying a new point only
requires evaluating the kernel between that point and the support vectors.
7. Disadvantages of Non-Linear SVM
Computational Cost: The kernel trick can be computationally expensive, especially for
large datasets. Calculating the kernel function for each pair of data points can lead to
significant memory and time requirements.
Choosing the Right Kernel: The choice of kernel and its parameters (e.g., σ in
the RBF kernel or the degree d in the polynomial kernel) is crucial. Incorrect choices can lead to
poor model performance.
Not Interpretable: Unlike decision trees, the decision boundaries in SVMs are hard to
interpret, making the model less transparent.
Kernel Trick:
The kernel trick is a powerful concept used in machine learning to allow algorithms, such as
Support Vector Machines (SVMs) and Principal Component Analysis (PCA), to operate in
higher-dimensional spaces without explicitly transforming the data into those higher dimensions.
It is especially useful for non-linear classification and regression tasks when the data is not
linearly separable in its original space.
The kernel trick allows you to apply kernel functions that compute the inner product (or dot
product) between data points in a higher-dimensional space without actually performing the
transformation into that space. This is computationally efficient and allows algorithms to handle
complex, non-linear relationships in data.
In essence, the kernel trick allows machine learning algorithms to implicitly map the input data
into a higher-dimensional feature space, where a linear separation may be possible, without the
need to compute the coordinates in this space explicitly.
2. Why is it Useful?
When we are working with non-linearly separable data, simple linear models (like linear
SVM) can't find a decision boundary that separates the classes properly. The kernel trick
enables us to use a linear classifier in a higher-dimensional feature space where the data might
become separable.
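A small worked example of this idea, assuming a degree-2 polynomial kernel K(a, b) = (a · b)² and its explicit feature map φ(x) = (x₁², x₂², √2·x₁x₂): the kernel value equals the inner product in the mapped space without ever constructing φ.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D point: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = v
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(a, b):
    """Degree-2 polynomial kernel computed directly in the original 2-D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Both quantities are identical: the kernel computes the inner product in the
# 3-D feature space without ever building phi(a) or phi(b).
print(np.dot(phi(a), phi(b)))   # 121.0
print(poly_kernel(a, b))        # 121.0
```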
Logistic Regression:
Logistic Regression is a widely used statistical model for binary classification tasks. Despite its
name, logistic regression is not actually a regression model, but a classification model. It is used
to predict the probability that a given input point belongs to one of two classes.
The core of logistic regression is the logistic function, also known as the sigmoid function:
σ(z) = 1 / (1 + e^(-z))
The sigmoid function maps any real-valued number into a range between 0 and 1, which makes it
ideal for modeling probabilities.
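A minimal sketch of the sigmoid function and of fitting a logistic regression model with scikit-learn (the hours-studied data is a hypothetical example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Maps any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # approx [0.018, 0.5, 0.982]

# Hypothetical binary task: predict pass/fail from hours studied.
hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [4.0], [5.0], [6.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
print(model.predict_proba([[2.5]]))   # class probabilities for 2.5 hours of study
```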
Linear Regression:
Linear Regression is one of the simplest and most widely used algorithms for regression tasks
in machine learning. Its goal is to model the relationship between a dependent (target) variable
and one or more independent (feature) variables by fitting a linear equation to observed data.
The linear model has the form:
y = β0 + β1·x1 + β2·x2 + … + βn·xn + ε
Where:
y is the dependent (target) variable, x1 … xn are the independent (feature) variables, β0 is the
intercept, β1 … βn are the coefficients, and ε is the error term.
2. Goal of Linear Regression
The goal of linear regression is to find the best-fit line (or hyperplane in higher dimensions) that
minimizes the difference between the predicted values and the actual values in the training data.
This difference is often measured by the residual sum of squares (RSS), or Mean Squared
Error (MSE).
To find the best-fitting line, we need to minimize the cost function, which represents the error
between the predicted values and the actual values. The Mean Squared Error (MSE) is
commonly used as the cost function:
MSE = (1/n) Σ (y_i - ŷ_i)²
where n is the number of training samples, y_i is the actual value, and ŷ_i is the predicted value.
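A minimal NumPy sketch that fits the best-fit line by ordinary least squares on synthetic data and reports the resulting MSE (the data and its true coefficients are illustrative assumptions):

```python
import numpy as np

# Hypothetical 1-D data: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.shape)

# Closed-form least-squares solution for [intercept, slope],
# obtained by minimizing the residual sum of squares.
A = np.c_[np.ones_like(x), x]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

y_pred = intercept + slope * x
mse = np.mean((y - y_pred) ** 2)
print(f"intercept={intercept:.2f} slope={slope:.2f} MSE={mse:.2f}")
```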
Multi-Layer Perceptrons (MLPs):
A Multi-Layer Perceptron (MLP) is a type of artificial neural network used for supervised
learning tasks, such as classification and regression. It is one of the most common types of deep
learning models. MLPs are composed of multiple layers of neurons (or nodes), which are
interconnected to process data and learn patterns from it.
1. Input Layer: This layer takes the input features (data) and passes them to the next layer.
Each neuron in this layer represents a feature from the dataset.
2. Hidden Layers: These are layers between the input and output layers. An MLP typically
has one or more hidden layers. Each hidden layer is made up of neurons that apply a
transformation (typically a linear transformation followed by a non-linear activation
function) to the inputs they receive.
3. Output Layer: The final layer produces the predicted output of the model. For a
classification task, the output layer typically has one neuron per class, using an activation
function like Softmax or Sigmoid. For regression tasks, it has a single neuron with no
activation function.
4. Neurons/Units: Each neuron in a layer is connected to the neurons in the previous and
next layers via weighted connections. The neuron receives input from the previous layer,
processes it, and passes the output to the next layer.
The connections between neurons have weights and biases associated with them, which are
learned during training.
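A short scikit-learn sketch of an MLP with two hidden layers on a non-linear toy dataset (the layer sizes, activation, and dataset are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Concentric circles: a non-linear pattern that a single linear unit cannot separate.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Input layer (2 features) -> two hidden layers of 16 ReLU units -> output layer.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    max_iter=2000, random_state=1)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```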
Backpropagation for Training an MLP:
Backpropagation (short for backward propagation of errors) is the primary algorithm used
for training a Multi-Layer Perceptron (MLP), which is a type of artificial neural network.
Backpropagation allows the network to adjust its weights based on the error it makes in its
predictions, minimizing the loss function over time.
Overview of Backpropagation
1. Forward Propagation: The input is passed through the network to obtain the predicted output.
2. Backward Propagation (Backpropagation): The error (difference between predicted output and
actual output) is propagated back through the network to adjust the weights and minimize the
error.
The backpropagation algorithm relies on gradient descent, which is used to optimize the
weights by minimizing the loss function. This is done by calculating the gradient of the loss
function with respect to each weight in the network, which tells us how to update each weight to
reduce the loss.
4. Repeat
This process of forward propagation, loss calculation, and backpropagation is repeated for
multiple epochs (iterations) until the model converges to a set of weights that minimize the loss.
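A minimal NumPy sketch of this loop: a tiny 2-4-1 network with sigmoid activations trained on XOR using forward propagation, backpropagation of the MSE gradient, and gradient-descent weight updates (the architecture, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data: requires a hidden layer to be learned.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Tiny 2 -> 4 -> 1 network with random initial weights.
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros((1, 1))
lr = 1.0   # illustrative learning rate

for epoch in range(20000):
    # 1. Forward propagation
    h = sigmoid(X @ W1 + b1)       # hidden layer activations
    out = sigmoid(h @ W2 + b2)     # network predictions

    # 2. Loss calculation (mean squared error)
    loss = np.mean((out - y) ** 2)

    # 3. Backpropagation: gradients via the chain rule
    d_out = 2 * (out - y) / len(X) * out * (1 - out)  # dLoss/d(output pre-activation)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)
    d_h = (d_out @ W2.T) * h * (1 - h)                # error propagated to hidden layer
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0, keepdims=True)

    # 4. Gradient-descent weight update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", round(float(loss), 4))
print("predictions:", out.round(2).ravel())   # typically close to [0, 1, 1, 0]
```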