
MACHINE LEARNING (UNIT 4)

Linear Discriminants for Machine Learning


Introduction to Linear Discriminants, Linear Discriminants for Classification, Perceptron
Classifier, Perceptron Learning Algorithm, Support Vector Machines, Linearly Non-Separable
Case, Non-linear SVM, Kernel Trick, Logistic Regression, Linear Regression, Multi-Layer
Perceptrons (MLPs), Backpropagation for Training an MLP

Introduction to Linear Discriminants:

In machine learning, Linear Discriminants are used to solve classification problems by finding
a linear decision boundary that separates different classes. These methods are central to many
classification algorithms and form the foundation of techniques like Linear Discriminant
Analysis (LDA) and Fisher's Linear Discriminant.

What is a Linear Discriminant?

A linear discriminant is a function that attempts to classify data by creating a decision rule
based on a linear combination of the features. In other words, it finds a hyperplane (a linear
boundary) that can divide the data into different classes.

The general form of a linear discriminant function for classification is:

g(x) = wᵀx + b = w₁x₁ + w₂x₂ + … + wₙxₙ + b

where w is the weight vector, x is the input feature vector, and b is the bias (threshold). A data point is assigned to one class when g(x) ≥ 0 and to the other class otherwise.

Linear Discriminants for Classification:

Linear Discriminants for Classification in Machine Learning

Linear Discriminant Analysis (LDA) is a method used for classification that assumes different
classes generate data based on different Gaussian distributions, and it tries to find a linear
combination of features that best separates the classes.

Key Concepts of LDA:

1. Goal:
o The primary goal of LDA is to reduce the dimensionality of the data while preserving as
much of the class discriminatory information as possible. Essentially, it finds the "best"
axis or hyperplane that can separate multiple classes.

2. Assumptions:
o Each class is normally distributed with the same covariance matrix (homoscedasticity
assumption).
o The features are statistically independent, and the classes are assumed to be linearly separable.

3. How it Works: LDA tries to maximize the ratio of the variance between the classes to
the variance within each class, ensuring the classes are as separated as possible.

o Between-class variance: Measures how far apart the class means are from the overall
mean of the data.
o Within-class variance: Measures the spread of the data points within each class.

The goal is to maximize the Fisher criterion which, writing S_B for the between-class scatter matrix and S_W for the within-class scatter matrix, is defined as:

J(w) = (wᵀ S_B w) / (wᵀ S_W w)

This criterion ensures that the classes are as distinct as possible when projected onto a lower-dimensional space.

4. Steps in LDA:
o Step 1: Compute the mean vector for each class.
o Step 2: Compute the overall mean of the data.
o Step 3: Compute the scatter matrices (within-class and between-class).
o Step 4: Compute the linear discriminants (eigenvectors) of the scatter matrices.
o Step 5: Choose the top eigenvectors that correspond to the largest eigenvalues to form a
transformation matrix.
o Step 6: Project the data onto the new subspace formed by these eigenvectors.

5. Dimensionality Reduction: LDA can be used as a dimensionality reduction technique, similar to Principal Component Analysis (PCA). However, unlike PCA, which seeks to explain the most variance in the data, LDA seeks to find the axes that maximize the separation between classes.
6. Classification: Once the data is projected into the lower-dimensional space, it is classified using a simple classification algorithm like k-Nearest Neighbors (k-NN) or even a Gaussian Naive Bayes classifier, which assumes the data follows a Gaussian distribution.

Example:

Suppose you have a dataset with features like height, weight, and age, and you are tasked with
classifying individuals as "Male" or "Female." LDA would:

 Compute the mean height, weight, and age for males and females.
 Maximize the separation between the "Male" and "Female" classes while minimizing the variance
within each class.
 Project the data into a new 1D or 2D space where the two classes are as far apart as possible.
 Then, classification can be performed using a decision rule based on the transformed data.
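
A minimal sketch of this example, assuming scikit-learn's LinearDiscriminantAnalysis; the height/weight/age values below are made up purely for illustration.

```python
# A minimal sketch of the example above, assuming scikit-learn is available.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical samples: [height_cm, weight_kg, age_years]
X = np.array([
    [178, 80, 30], [182, 85, 42], [175, 78, 25],   # labelled "Male"
    [162, 55, 28], [158, 52, 35], [165, 60, 24],   # labelled "Female"
], dtype=float)
y = np.array(["Male", "Male", "Male", "Female", "Female", "Female"])

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# With two classes, LDA projects the data onto a single discriminant axis (1-D).
X_projected = lda.transform(X)            # shape (6, 1)

# Classify a new individual with a decision rule in the transformed space.
print(lda.predict([[170.0, 68.0, 29.0]]))
```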

Advantages of LDA:

 Efficient: LDA is computationally less expensive and less complex compared to some other
machine learning techniques, especially when dealing with high-dimensional data.
 Good for smaller datasets: If you have a relatively small dataset with fewer samples than
features, LDA is an effective method for classification.

 Interpretability: Since LDA creates linear decision boundaries, it can be easy to understand and
interpret.

Limitations:

 Assumption of normality: LDA assumes that the data follows a Gaussian (normal) distribution,
which may not always be true.
 Assumption of equal covariance matrices: LDA assumes that all classes share the same
covariance matrix, which may not hold in real-world data.
 Linear decision boundary: LDA can only create linear decision boundaries, which may not be
effective for highly non-linear datasets.

In practice, LDA is widely used in applications like:

 Speech recognition
 Face recognition
 Medical diagnostics
 Financial prediction, etc.

Perceptron Classifier:

A Perceptron is a type of artificial neural network and a fundamental concept in machine learning, particularly in the field of supervised learning. It's one of the simplest types of neural networks and was introduced in the 1950s by Frank Rosenblatt. The perceptron is a linear classifier, meaning it is used to classify data into one of two classes by finding a linear decision boundary.

Key Concepts of the Perceptron:

 Inputs and weights: each input feature is multiplied by a corresponding weight.
 Weighted sum and bias: the weighted inputs are summed together with a bias term.
 Activation function: the sum is passed through a step function that outputs one of two classes.
 Learning rule: the weights are updated whenever the perceptron misclassifies a training example.

Limitations:

 The perceptron can only solve linearly separable problems, meaning that it can only find a
decision boundary for datasets where classes can be separated by a straight line or hyperplane.
 It can't handle more complex problems, such as XOR (exclusive OR), which is non-linearly
separable.

Perceptron vs. Other Models:

 Single Layer vs. Multi-layer: A perceptron is a single-layer neural network, which is quite
simple. More complex models like multi-layer perceptrons (MLPs) involve multiple layers of
neurons and can solve non-linear problems.
 Linear vs. Non-Linear: The perceptron is a linear classifier, so it cannot model non-linear
decision boundaries without modifications like using a kernel trick (as in Support Vector
Machines).

Perceptron Learning Algorithm:


The Perceptron Learning Algorithm is a fundamental supervised learning algorithm used for binary
classification. It is the foundation of artificial neural networks and operates as a linear classifier.

A perceptron is a type of artificial neuron that mimics the way biological neurons work. It takes
multiple input values, applies weights, sums them up, and passes the result through an activation
function to produce an output.

Mathematical Representation

A perceptron computes its output as:

y = f(w·x + b) = f(w₁x₁ + w₂x₂ + … + wₙxₙ + b)

where f is the step (threshold) activation function: f(z) = 1 if z ≥ 0, and f(z) = 0 otherwise.

4. Advantages of the Perceptron Algorithm

✔Simple & easy to implement


✔Guaranteed to converge for linearly separable data
✔Fast updates using a simple rule

5. Limitations

❌ Only works for linearly separable data (fails on problems like XOR)
❌ Convergence depends on the learning rate
❌ Cannot model complex patterns (no hidden layers)

6. Extensions of the Perceptron

 Multi-Layer Perceptron (MLP): Uses multiple layers and nonlinear activation functions.

 Adaline (Adaptive Linear Neuron): Uses a continuous activation function.

7. Conclusion

The Perceptron Learning Algorithm is one of the simplest binary classifiers in machine
learning. It works well for linearly separable problems, such as the AND gate, but fails on non-
linearly separable problems like XOR.
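
A minimal sketch of the perceptron learning rule on the AND gate mentioned above; the learning rate (0.1) and the number of epochs are assumptions made for illustration.

```python
# A minimal sketch of the perceptron learning rule trained on the AND gate.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
y = np.array([0, 0, 0, 1])                        # AND gate targets

w = np.zeros(2)    # weights
b = 0.0            # bias
eta = 0.1          # learning rate

for epoch in range(10):                              # a few passes suffice for AND
    for xi, target in zip(X, y):
        y_hat = 1 if np.dot(w, xi) + b >= 0 else 0   # step activation
        error = target - y_hat
        w += eta * error * xi                        # perceptron update rule
        b += eta * error

print(w, b)
print([1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])   # [0, 0, 0, 1]
```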

Support Vector Machines:

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. While it can handle regression problems, SVM is particularly well-suited for classification tasks. SVM aims to find the optimal hyperplane in an N-dimensional space to separate data points into different classes. The algorithm maximizes the margin between the closest points of different classes.

1. Introduction to SVM

Support Vector Machines (SVM) are a set of supervised learning algorithms used for
classification, regression, and outlier detection. SVM is particularly effective in high-
dimensional spaces and is widely used in tasks like image recognition, text categorization, and
medical diagnosis.

The main goal of SVM is to find the optimal hyperplane that best separates different classes in
the dataset while maximizing the margin between the nearest data points of different classes
(support vectors).

So to separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we add a third dimension z, calculated from the existing features as:

z = x² + y²

By adding this third dimension, the data is lifted into a 3-D space in which the two classes become separable. SVM can now divide the dataset with a hyperplane that, in the lifted space, is a flat plane at a constant value of z. Converting this boundary back to the original 2-D space (taking z = 1) gives a circle, so for this non-linear data the effective decision boundary is a circumference of radius 1.
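
A rough illustration of this lift, assuming NumPy and scikit-learn with synthetic data: adding z = x² + y² as a third feature lets a plain linear SVM separate points inside the unit circle from points outside it.

```python
# A rough sketch of the z = x^2 + y^2 lift, using synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)    # not linearly separable in 2-D

# Add the third dimension z = x^2 + y^2 and fit a *linear* SVM in 3-D.
X_lifted = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
clf = SVC(kernel="linear").fit(X_lifted, y)

print(clf.score(X_lifted, y))    # close to 1.0: separable after the lift
```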

Linearly Non-Separable Case:

Linearly Non-Separable Case in Machine Learning (Detailed Explanation)

In machine learning, the linearly non-separable case occurs when classes in a dataset cannot be
divided by a single straight line (or hyperplane in higher dimensions). This problem arises
frequently in real-world data, where the patterns governing the data are more complex than a
simple linear boundary can capture.

1. Understanding Linearly Separable vs. Non-Separable Data

To understand linearly non-separable data, it helps to first understand what it means for data to
be linearly separable:

 Linearly Separable Data: This is when two classes in a dataset can be separated by a
single straight line (in 2D), a hyperplane (in higher dimensions), or a flat decision
boundary. For example, if you were plotting data points of two classes on a 2D plane, and
you could draw a straight line between the two classes without misclassifying any points,
then the data is linearly separable.

Example:

Class 1: (1, 2), (3, 4), (5, 6)


Class 2: (7, 8), (9, 10)

A line could easily separate these points, and a linear classifier (e.g., Logistic Regression,
Linear SVM) could classify them correctly.

 Linearly Non-Separable Data: When no straight line (or hyperplane) can separate the
classes without errors, the data is linearly non-separable. In such cases, a linear classifier
will not perform well because it will misclassify some points.

Example:

Class 1: (0,0), (1,1)


Class 2: (0,1), (1,0)

No single straight line can separate these points into their respective classes, and this is a
classic example of non-separable data.

2. Challenges of Linearly Non-Separable Data

When dealing with linearly non-separable data, classifiers like Support Vector Machines
(SVMs) and Logistic Regression struggle to achieve perfect classification. The challenges are:

 Overfitting: A linear model might attempt to create overly complex boundaries to fit the
training data, which can lead to poor generalization to unseen data.
 Misclassification: A linear classifier will misclassify some points, since it cannot create a
boundary that separates the data perfectly.

3. Solutions to Handle Linearly Non-Separable Data

There are several ways to handle linearly non-separable data, depending on the classifier and the
problem at hand:

A. Kernel Trick (Support Vector Machines)

Support Vector Machines (SVMs) are one of the most powerful classifiers for handling non-
linearly separable data. The kernel trick is a method that maps the data into a higher-
dimensional space where it may become linearly separable. After the data is mapped, a
hyperplane can be used to separate the classes.

How the Kernel Trick Works:

1. Transformation to Higher Dimensions: The kernel trick allows the transformation of the original
input space into a higher-dimensional space. In this new space, a linear hyperplane may exist
that separates the data.
2. Types of Kernels:
o Linear Kernel: Does not change the input space and works well for linearly separable
data.
o Polynomial Kernel: Maps the data to a higher-dimensional polynomial space, which can
capture non-linear relationships.

o Radial Basis Function (RBF) Kernel: Maps the data into an infinite-dimensional space
and can separate highly complex data patterns.
o Sigmoid Kernel: Based on the sigmoid function; it is used in some contexts but less widely than the other kernels.

Example: For the XOR problem, the points are not linearly separable in 2D. However, by using
a kernel (e.g., RBF), you can map the data into a higher-dimensional space, where it becomes
linearly separable.

B. Nonlinear Classifiers

Some machine learning models are inherently nonlinear and can handle non-separable data
without the need for a kernel transformation:

1. Decision Trees:
o Decision trees build models by recursively splitting the data based on feature values.
They do not rely on linear boundaries but instead create piecewise constant decision
regions.
o A decision tree can handle non-linear relationships by making binary decisions at each
node, forming a complex boundary.

2. k-Nearest Neighbors (k-NN):


o k-NN is a non-parametric classifier that makes predictions based on the majority class of
the k nearest neighbors to a data point. The decision boundary created by k-NN is not
linear but highly flexible, allowing it to adapt to complex patterns in the data.

3. Neural Networks:
o Neural networks are composed of layers of neurons with nonlinear activation functions
(e.g., ReLU, Sigmoid, Tanh). These networks can model highly complex, non-linear
relationships in data.
o The layers and activation functions allow the network to learn intricate patterns that
linear models cannot capture.

C. Soft Margin Support Vector Machine

For SVMs, a soft margin allows for some points to be misclassified while still maximizing the
margin between classes. This is useful in the non-separable case where a perfect separation is not
possible.

 C parameter: In the SVM, the C parameter controls the trade-off between maximizing the
margin and minimizing the classification error. A high value for C results in a stricter classifier
with fewer errors, while a low value allows for more misclassifications but a wider margin.

Soft Margin SVM works as follows:

 Instead of insisting on perfect classification, it allows some points to be on the wrong side of the
margin.

 The classifier finds a balance between a large margin and a low number of misclassifications.
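
A small sketch of this trade-off, assuming scikit-learn's SVC on synthetic, overlapping blob data; the two C values are chosen purely for illustration.

```python
# Contrasting a small and a large C on the same noisy data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wider margin, more misclassifications tolerated.
    # Large C: stricter fit, fewer training errors, narrower margin.
    print(f"C={C}: training accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_)}")
```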

D. Feature Engineering and Transformation

Sometimes, the data is non-separable because the features are not well-suited for linear
separation. Feature engineering can help:

1. Polynomial Features: You can create new features by applying polynomial transformations to the original features, creating interaction terms and higher-degree features (as shown in the sketch below).
2. Nonlinear Transformations: You can apply nonlinear functions (e.g., sine, cosine, logarithms) to the features to help reveal patterns that might not be visible in the original feature space.
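
A minimal sketch of the polynomial-feature idea, assuming scikit-learn's PolynomialFeatures with two input features and degree 2.

```python
# Expand [x1, x2] into [1, x1, x2, x1^2, x1*x2, x2^2].
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

poly = PolynomialFeatures(degree=2)      # adds squared terms and interactions
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())      # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_poly)
```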

E. Regularization

When using models like Logistic Regression or Linear SVM, regularization can help avoid
overfitting by adding a penalty term to the loss function. This penalty prevents the model from
fitting the noise in the non-separable data.

 L1 regularization (Lasso): Encourages sparsity in the model by forcing some feature coefficients
to be exactly zero.
 L2 regularization (Ridge): Penalizes large coefficients but does not force them to be zero,
encouraging smoother decision boundaries.
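
A small sketch contrasting the two penalties, assuming scikit-learn's LogisticRegression on synthetic data; the value of C (the inverse regularization strength) is set to 0.1 only for illustration.

```python
# L1 drives some coefficients exactly to zero; L2 only shrinks them.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("zero coefficients with L1:", (l1.coef_ == 0).sum())
print("zero coefficients with L2:", (l2.coef_ == 0).sum())
```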

F. Ensemble Methods

Ensemble methods combine multiple weak learners to create a strong model. These methods
work well for non-separable data:

1. Random Forests: An ensemble of decision trees that aggregates predictions from many
individual trees to create a final classification.
o Random Forests can handle complex data patterns by averaging the results of many
trees.

2. Gradient Boosting Machines (GBM): An ensemble method that builds decision trees
sequentially, with each tree trying to correct the mistakes made by the previous one.
GBMs are very effective for handling non-linear data.

4. Illustrative Example: XOR Problem

Consider the XOR problem:

x1 x2 Class

0 0 0

0 1 1

1 0 1

1 1 0

In this case, no straight line can separate the two classes (0 and 1), but you can apply a kernel
trick (e.g., RBF kernel) to map the data into a higher-dimensional space where it becomes
separable. The decision boundary in the original space will be nonlinear, but in the transformed
space, it will be linear.
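
A minimal sketch of this XOR example, assuming scikit-learn's SVC; the gamma value for the RBF kernel is chosen only for illustration.

```python
# A linear SVM cannot fit XOR, but an SVM with the RBF kernel can.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                         # XOR labels

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # cannot reach 1.0
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))      # 1.0
```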

Conclusion

Linearly non-separable data is a common occurrence in machine learning, especially in complex real-world problems. To handle such data, we use kernel methods, nonlinear classifiers, regularization, and feature transformations. Each of these techniques allows us to capture the complex patterns in the data and build classifiers that perform well even when linear separation is not possible.

Non-linear SVM:

Support Vector Machines (SVMs) are powerful supervised learning algorithms used for classification and
regression tasks. While traditional linear SVMs aim to find a hyperplane that separates data linearly into
classes, non-linear SVMs are used when the data cannot be separated by a linear boundary. Non-linear
SVMs can effectively handle linearly non-separable data using a technique known as the kernel trick.

1. Basic Concept of SVM

Before diving into non-linear SVMs, it’s essential to understand the basic workings of SVMs:

 Objective of SVM: The goal of an SVM is to find the optimal hyperplane (or decision
boundary) that maximizes the margin between the two classes. In simple terms, SVM
tries to find a hyperplane that divides the classes in such a way that the distance between
the closest data points (from either class) to the hyperplane is as large as possible.
 Linear Separation: In cases where the data is linearly separable, SVM can create a
linear boundary. For instance, in a 2D feature space, a line would separate the classes,
and the algorithm seeks the line that maximizes the margin.

2. Challenges with Non-Linear Data

When the data is non-linearly separable, a single straight line (in 2D) or hyperplane (in higher
dimensions) will not be sufficient to separate the classes. For example, in problems like the XOR
problem, where the points are arranged in a way that no linear boundary can separate the

classes, a traditional linear SVM will fail.

Example of Non-Linearly Separable Data (XOR Problem):

x1 x2 Class

0 0 0

0 1 1

1 0 1

1 1 0

In this case, you cannot draw a straight line that divides the classes (0 and 1). A non-linear SVM
can solve this problem by transforming the input features into a higher-dimensional space.

3. Non-Linear SVM: The Kernel Trick

The central concept behind non-linear SVM is the kernel trick. The kernel trick allows us to
map the data into a higher-dimensional space where a linear hyperplane might be able to separate
the classes.

Kernel Trick Explanation:

1. Non-linear Mapping: Instead of directly trying to find a linear boundary in the original
feature space, SVM uses a non-linear transformation to map the input data into a
higher-dimensional space. In this new space, the data might become linearly separable.
2. Implicit Transformation: Instead of explicitly calculating the transformation (which
could be computationally expensive), SVM uses kernels that implicitly compute the
inner products between data points in the higher-dimensional space. This allows the SVM
to perform well even on very high-dimensional spaces without the need for
computationally expensive transformations.

5. How Non-Linear SVM Works

Here’s how a non-linear SVM works, step by step:

1. Kernel Mapping: The data is mapped to a higher-dimensional feature space using a kernel function. This transformation is implicitly done via the kernel function, so you don't need to worry about explicitly performing the transformation.
2. Finding the Hyperplane: In this new space, the SVM algorithm attempts to find the
optimal hyperplane that maximizes the margin between the two classes.
3. Decision Function: Once the hyperplane is found, the SVM uses it to make predictions.
For a new data point, the decision function will classify the point based on which side of
the hyperplane it lies on.
4. Regularization: The SVM can handle misclassifications (which are inevitable in some cases) by introducing a soft margin. The regularization parameter C controls the trade-off between having a wider margin and minimizing the misclassification of points.
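
A sketch of these steps, assuming scikit-learn's SVC on synthetic two-moons data; the dataset and the C and gamma values are assumptions for illustration. The RBF kernel performs the implicit mapping, fit() finds the maximum-margin hyperplane in that space, and decision_function() returns the signed score whose sign gives the predicted class.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # non-linear data

# Steps 1-2: implicit kernel mapping and maximum-margin hyperplane.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# Step 3: the decision function gives a signed score; its sign gives the class.
new_point = np.array([[0.5, 0.0]])
print(clf.decision_function(new_point), clf.predict(new_point))
```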

6. Advantages of Non-Linear SVM

 Effective in High-Dimensional Spaces: Non-linear SVMs, especially with the RBF kernel, are effective for problems where the feature space is very high-dimensional.
 Flexibility: By using different kernels, SVMs can model a wide range of data
distributions and non-linear relationships.
 Robust to Overfitting: The soft margin SVM allows some misclassifications, which
prevents the model from overfitting, especially when the data is noisy.

7. Disadvantages of Non-Linear SVM

 Computational Cost: The kernel trick can be computationally expensive, especially for
large datasets. Calculating the kernel function for each pair of data points can lead to
significant memory and time requirements.
 Choosing the Right Kernel: The choice of kernel and its parameters (e.g., σ in the RBF kernel or the degree d in the polynomial kernel) is crucial. Incorrect choices can lead to poor model performance.
 Not Interpretable: Unlike decision trees, the decision boundaries in SVMs are hard to
interpret, making the model less transparent.

Kernel Trick:

The Kernel Trick in Machine Learning

The kernel trick is a powerful concept used in machine learning to allow algorithms, such as
Support Vector Machines (SVMs) and Principal Component Analysis (PCA), to operate in
higher-dimensional spaces without explicitly transforming the data into those higher dimensions.
It is especially useful for non-linear classification and regression tasks when the data is not
linearly separable in its original space.

1. What is the Kernel Trick?

The kernel trick allows you to apply kernel functions that compute the inner product (or dot
product) between data points in a higher-dimensional space without actually performing the
transformation into that space. This is computationally efficient and allows algorithms to handle
complex, non-linear relationships in data.

In essence, the kernel trick allows machine learning algorithms to implicitly map the input data
into a higher-dimensional feature space, where a linear separation may be possible, without the
need to compute the coordinates in this space explicitly.

2. Why is it Useful?

When we are working with non-linearly separable data, simple linear models (like linear
SVM) can't find a decision boundary that separates the classes properly. The kernel trick
enables us to use a linear classifier in a higher-dimensional feature space where the data might
become separable.

 Non-linear separability: When data is not separable by a straight line or hyperplane in its original space, a linear classifier won't work well. A classic example is the XOR problem, where data points cannot be separated by a single straight line.
 Mapping to higher dimensions: The kernel trick allows data to be transformed into
higher-dimensional spaces where a linear decision boundary may exist, without explicitly
transforming the data, which could be computationally expensive.
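
A small numeric check of this idea in plain NumPy; the degree-2 polynomial kernel and the explicit feature map phi below are standard textbook choices, used here only for illustration. The kernel value equals the dot product in the higher-dimensional space without ever constructing that space.

```python
import numpy as np

def poly_kernel(x, z):
    return (np.dot(x, z) + 1) ** 2            # computed entirely in input space

def phi(x):
    # Explicit degree-2 feature map for 2-D input, written out only for comparison.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(poly_kernel(x, z))         # 144.0
print(np.dot(phi(x), phi(z)))    # 144.0 -- same value, higher-dim space never built
```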

Logistic Regression:

Logistic Regression in Machine Learning

Logistic Regression is a widely used statistical model for binary classification tasks. Despite its
name, logistic regression is not actually a regression model, but a classification model. It is used
to predict the probability that a given input point belongs to one of two classes.

1. Understanding the Logistic Function

The core of logistic regression is the logistic function, also known as the sigmoid function. The
sigmoid function maps any real-valued number into a range between 0 and 1. This makes it ideal
for modeling probabilities.

The logistic function is given by:

σ(z) = 1 / (1 + e^(−z))

where z = w·x + b is a linear combination of the input features. The output σ(z) lies between 0 and 1 and is interpreted as the probability that the input belongs to the positive class; thresholding this probability (typically at 0.5) gives the predicted class label.
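
A minimal sketch of logistic regression in practice, assuming scikit-learn and a synthetic binary dataset; predict_proba returns the sigmoid-mapped probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)

print(model.predict_proba(X[:3]))   # P(class 0) and P(class 1) for each sample
print(model.predict(X[:3]))         # labels obtained by thresholding at 0.5
```
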
Linear Regression:

Linear Regression in Machine Learning

Linear Regression is one of the simplest and most widely used algorithms for regression tasks
in machine learning. Its goal is to model the relationship between a dependent (target) variable
and one or more independent (feature) variables by fitting a linear equation to observed data.

For a single feature x, the fitted line has the form:

ŷ = w·x + b

Where:

 w is the weight (slope) of the line.
 b is the bias (intercept).

2. Goal of Linear Regression

The goal of linear regression is to find the best-fit line (or hyperplane in higher dimensions) that
minimizes the difference between the predicted values and the actual values in the training data.
This difference is often measured by the residual sum of squares (RSS), or Mean Squared
Error (MSE).

3. Cost Function in Linear Regression

To find the best-fitting line, we need to minimize the cost function, which represents the error
between the predicted values and the actual values. The Mean Squared Error (MSE) is
commonly used as the cost function:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

where n is the number of training examples, yᵢ is the actual value, and ŷᵢ is the value predicted by the model.
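
A minimal sketch of fitting such a line by minimizing the MSE, assuming scikit-learn's LinearRegression and made-up 1-D data (roughly y = 2x).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # single feature
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)    # w close to 2, b close to 0

y_pred = model.predict(X)
print(np.mean((y - y_pred) ** 2))       # the MSE that the fit minimizes
```
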
Multi-Layer Perceptrons (MLPs):

Multi-Layer Perceptrons (MLPs) in Machine Learning

A Multi-Layer Perceptron (MLP) is a type of artificial neural network used for supervised learning tasks, such as classification and regression. It is one of the most common types of deep learning models. MLPs are composed of multiple layers of neurons (or nodes), which are interconnected to process data and learn patterns from it.

1. Structure of a Multi-Layer Perceptron (MLP)

An MLP consists of the following layers:

1. Input Layer: This layer takes the input features (data) and passes them to the next layer.
Each neuron in this layer represents a feature from the dataset.
2. Hidden Layers: These are layers between the input and output layers. An MLP typically
has one or more hidden layers. Each hidden layer is made up of neurons that apply a
transformation (typically a linear transformation followed by a non-linear activation
function) to the inputs they receive.
3. Output Layer: The final layer produces the predicted output of the model. For a
classification task, the output layer typically has one neuron per class, using an activation
function like Softmax or Sigmoid. For regression tasks, it has a single neuron with no
activation function.
4. Neurons/Units: Each neuron in a layer is connected to the neurons in the previous and
next layers via weighted connections. The neuron receives input from the previous layer,
processes it, and passes the output to the next layer.

The connections between neurons have weights and biases associated with them, which are
learned during training.
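
A small sketch of this layer structure, assuming scikit-learn's MLPClassifier with two hidden layers of 16 and 8 ReLU units; the layer sizes and dataset are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(16, 8),   # two hidden layers
                    activation="relu",
                    max_iter=1000,
                    random_state=0)
mlp.fit(X, y)

print(mlp.score(X, y))                    # training accuracy
print([w.shape for w in mlp.coefs_])      # weight matrices: (4, 16), (16, 8), (8, 1)
```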

Backpropagation for Training an MLP:

Backpropagation for Training an MLP in Machine Learning

Backpropagation (short for backward propagation of errors) is the primary algorithm used
for training a Multi-Layer Perceptron (MLP), which is a type of artificial neural network.
Backpropagation allows the network to adjust its weights based on the error it makes in its
predictions, minimizing the loss function over time.

Overview of Backpropagation

Backpropagation is a supervised learning algorithm that consists of two main stages:

1. Forward Propagation: The input is passed through the network to obtain the predicted output.
2. Backward Propagation (Backpropagation): The error (difference between predicted output and
actual output) is propagated back through the network to adjust the weights and minimize the
error.

The backpropagation algorithm relies on gradient descent, which is used to optimize the
weights by minimizing the loss function. This is done by calculating the gradient of the loss
function with respect to each weight in the network, which tells us how to update each weight to
reduce the loss.
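
A compact sketch of forward propagation, backpropagation, and the gradient-descent weight update for a tiny one-hidden-layer MLP, assuming sigmoid activations and a mean squared error loss; the network size, learning rate, and epoch count are illustrative, and the final fit depends on the random initialization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)           # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))        # input  -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))        # hidden -> output
eta = 1.0                                                  # learning rate

for epoch in range(10000):
    # Forward propagation
    h = sigmoid(X @ W1 + b1)               # hidden activations
    y_hat = sigmoid(h @ W2 + b2)           # predicted outputs

    # Backward propagation: error terms for each layer (gradient of the MSE loss)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    # Gradient-descent weight updates
    W2 -= eta * h.T @ delta_out
    b2 -= eta * delta_out.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ delta_hid
    b1 -= eta * delta_hid.sum(axis=0, keepdims=True)

print(np.round(y_hat.ravel(), 2))   # should approach [0, 1, 1, 0] if training converged
```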

4. Repeat

This process of forward propagation, loss calculation, and backpropagation is repeated for
multiple epochs (iterations) until the model converges to a set of weights that minimize the loss.
