
UNIT III

Dimensionality reduction (Notes)


Dimensionality reduction refers to techniques that reduce the number of input
variables in a dataset while retaining as much information as possible. This is
particularly useful in high-dimensional datasets, where too many variables can
lead to overfitting or computational inefficiency.
1. Subset Selection
Subset selection identifies a subset of predictors (features) that are most relevant
for predicting the response variable.
Subset selection is the process of selecting a subset of relevant features
(variables, predictors) for use in model construction, for several reasons:
 Simplification of models to make them easier to interpret,
 Shorter training times,
 To avoid the curse of dimensionality,
 Enhanced generalization by reducing overfitting
Smaller subsets of features are chosen from the high-dimensional data to build
the model, using filter, wrapper, or embedded methods.
Subset selection is also known as variable selection, attribute selection, or feature selection.
Methods:
 Best Subset Selection: Evaluates all possible combinations of predictors
and selects the subset that minimizes prediction error.
 Forward Selection: Starts with no predictors and adds them one by one
based on their contribution to the model's performance.
 Backward Elimination: Starts with all predictors and removes them one
by one based on their statistical insignificance.

Fig. 01: Types of Subset Selection


Classification-Separating Hyperplanes
 Classification: It’s the process of dividing data into categories or groups (e.g., identifying if
an email is spam or not).

 Separating Hyperplanes: In classification, a hyperplane is a boundary that divides data
points into classes.

For example, in a 2D space, a hyperplane is a line; in 3D, it’s a plane.

The goal is to find a hyperplane that best separates data points of different classes (e.g.,
separating cats and dogs in a feature space).

Algorithms like SVM (Support Vector Machines) often focus on finding the optimal
hyperplane.

A separating hyperplane is a plane that separates two classes of data points in a multi-dimensional
space. The hyperplane separation theorem states that if two classes of data points are linearly
separable, then there exists a hyperplane that perfectly separates the two classes.
In a binary classification problem, given a linearly separable data set, the optimal separating
hyperplane is the one that correctly classifies all the data while being farthest away from the data
points. In this respect, it is said to be the hyperplane that maximizes the margin, defined as the
distance from the hyperplane to the closest data point.
The idea behind the optimality of this classifier can be illustrated as follows. New test points are
drawn according to the same distribution as the training data. Thus, if the separating hyperplane is far
away from the data points, previously unseen test points will most likely fall far away from the
hyperplane or in the margin. As a consequence, the larger the margin is, the less likely the points are
to fall on the wrong side of the hyperplane.
Finding the optimal separating hyperplane can be formulated as a convex quadratic
programming problem, which can be solved with well-known techniques.
The optimal separating hyperplane should not be confused with the optimal classifier known as
the Bayes classifier: the Bayes classifier is the best classifier for a given problem, independently of
the available data but unattainable in practice, whereas the optimal separating hyperplane is only the
best linear classifier one can produce given a particular data set.

The optimal separating hyperplane is one of the core ideas behind the support vector machines. In
particular, it gives rise to the so-called support vectors which are the data points lying on the margin
boundary of the hyperplane. These points support the hyperplane in the sense that they contain all the
required information to compute the hyperplane: removing other points does not change the optimal
separating hyperplane. Elaborating on this fact, one can actually add points to the data set without
influencing the hyperplane, as long as these points lie outside of the margin.
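As a hedged illustration of these ideas, the sketch below uses scikit-learn's SVC with a linear kernel to find a maximum-margin hyperplane for a small, made-up two-dimensional data set; the points and the large C value are assumptions for the example, not part of the notes.

    # Find the (approximately hard-margin) separating hyperplane with a linear SVM.
    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable classes in 2D (toy data).
    X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)          # a large C approximates the hard-margin case
    clf.fit(X, y)

    print("hyperplane normal w:", clf.coef_[0])
    print("intercept b:", clf.intercept_[0])
    print("support vectors:\n", clf.support_vectors_)   # points lying on the margin boundary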

The plot below shows the optimal separating hyperplane and its margin for a data set in two
dimensions. The support vectors are the highlighted points lying on the margin boundary.

Fig. Separating Hyperplanes

ANN
Elements of a Neural Network

 Input Layer: This layer accepts the input features. It provides information from the outside
world to the network; no computation is performed at this layer, and the nodes simply pass the
information (features) on to the hidden layer.
 Hidden Layer: Nodes of this layer are not exposed to the outer world; they are part of the
abstraction provided by any neural network. The hidden layer performs all sorts of
computation on the features entered through the input layer and transfers the result to the
output layer.
 Output Layer: This layer brings the information learned by the network to the outer
world.
Classification in Artificial Neural Networks
Artificial Neural Networks (ANNs) are used for classification by learning decision
boundaries (like hyperplanes) that divide classes based on the input data. They can handle
non-linear decision boundaries using hidden layers and activation functions.

Early Models of ANN


In machine learning, the perceptron (or McCulloch–Pitts neuron) is an algorithm
for supervised learning of binary classifiers. A binary classifier is a function which can decide
whether or not an input, represented by a vector of numbers, belongs to some specific class. It
is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on
a linear predictor function combining a set of weights with the feature vector.
1. Perceptron (1958):
 The simplest ANN model with a single neuron.
 Used for binary classification.
 Limited to solving linearly separable problems (e.g., cannot solve XOR
problem).
A perceptron is one of the earliest and simplest models of a neuron. A perceptron
model is a binary classifier, separating data into two classes. As a
linear model, it is one of the simplest examples of an artificial neural
network.

Fig. Perceptron
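As a concrete, hedged illustration, here is a minimal NumPy sketch of the perceptron learning rule; the logical AND data, learning rate, and number of epochs are assumptions chosen only for the example (AND is linearly separable, so the rule converges).

    # Minimal perceptron training loop with a step activation and 0/1 labels.
    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])                 # AND of the two inputs

    w = np.zeros(2)
    b = 0.0
    lr = 0.1

    for epoch in range(20):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
            error = target - pred
            w += lr * error * xi                        # perceptron update rule
            b += lr * error

    print(w, b)    # a weight vector and bias that separate the two classes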
Multilayer Perceptron artificial neural networks add complexity and density,
with the capacity for many hidden layers between the input and output layers. Each
individual node on a specific layer is connected to every node on the next layer.
This means Multilayer Perceptron models are fully connected networks and can
be leveraged for deep learning.
They are used for more complex problems and tasks such as complex
classification or voice recognition. Because of the model's depth and complexity,
processing and model maintenance can be resource- and time-consuming. A short
sketch after the next list shows a small multilayer network solving the XOR
problem, which a single perceptron cannot.
2. Multi-Layer Perceptron (MLP):
 Added hidden layers to overcome limitations of the perceptron.
 Capable of solving complex, non-linear problems.
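The sketch below is a hedged example of this point using scikit-learn's MLPClassifier; the hidden-layer size, activation, and solver are illustrative assumptions rather than prescribed settings.

    # A hidden layer lets the network solve XOR, which is not linearly separable.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])                 # XOR labels

    mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                        solver="lbfgs", max_iter=2000, random_state=0)
    mlp.fit(X, y)
    print(mlp.predict(X))                      # typically [0 1 1 0] once training converges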

The bias acts as an adjustable constant in the neuron, allowing the activation function to shift, which
helps the model better fit the data.

Role of Bias in ANNs

1. Improves Model Flexibility:

 Without bias, the output of the neuron is entirely dependent on the weighted sum of
inputs. This constrains the neuron to pass through the origin (0,0) for certain
activation functions.

 Bias allows the activation function to shift up or down, enabling the neuron to fit data
that doesn't pass through the origin.

2. Shifts Activation Function:

 The bias term adjusts the decision boundary by shifting the activation function.
 For instance, in a linear model y= w⋅x+b, the bias b shifts the line up or down,
providing more flexibility in separating data points.

3. Facilitates Learning of Complex Functions:

 Biases are essential in learning non-linear patterns when combined with activation
functions like ReLU, sigmoid, or tanh.
 Without bias, the neural network might struggle to approximate functions where the
outputs are not symmetrical around the origin.

Mathematical Representation

A neuron computes the output as:

y = f( Σ(i = 1 to n) wi · xi + b )

Here:

 wi: Weights for inputs xi

 xi: Inputs

 b: Bias

 f: Activation function

The bias b adjusts the input to the activation function, allowing the output y to take on values that fit
the data distribution better.

Comparison Without Bias

If b = 0, the neuron's output becomes:

y = f( Σ(i = 1 to n) wi · xi )

In this case:

 The model loses flexibility.

 Decision boundaries are restricted, making the model less capable of learning complex
relationships.
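A small, hedged NumPy sketch of this effect follows; the weights, bias value, and sigmoid activation are made up purely to show how the bias shifts the input to the activation function.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.0, 0.0])           # an input at the origin
    w = np.array([0.5, -0.3])

    y_without_bias = sigmoid(np.dot(w, x))          # always 0.5 at the origin
    y_with_bias    = sigmoid(np.dot(w, x) + 2.0)    # the bias shifts the output away from 0.5

    print(y_without_bias, y_with_bias)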

Back Propagation in ANN:


Backpropagation (short for "Backward Propagation of Errors") is a method used to train artificial
neural networks. Its goal is to reduce the difference between the model’s predicted output and the
actual output by adjusting the weights and biases in the network.

Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural
networks, particularly feed-forward networks. It works iteratively, minimizing the cost function by
adjusting weights and biases.
In each epoch, the model adapts these parameters, reducing loss by following the error gradient.
Backpropagation often utilizes optimization algorithms like gradient descent or stochastic gradient
descent. The algorithm computes the gradient using the chain rule from calculus, allowing it to
effectively navigate complex layers in the neural network to minimize the cost function.

Fig. A simple illustration of how the backpropagation works by adjustments of weights

Backpropagation plays a critical role in how neural networks improve over time. Here's why:

1. Efficient Weight Update: It computes the gradient of the loss function with respect to each
weight using the chain rule, making it possible to update weights efficiently.

2. Scalability: The backpropagation algorithm scales well to networks with multiple layers and
complex architectures, making deep learning feasible.

3. Automated Learning: With backpropagation, the learning process becomes automated, and
the model can adjust itself to optimize its performance.
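To make the chain-rule idea concrete, here is a minimal, hedged NumPy sketch of backpropagation for a tiny two-layer network trained on XOR; the layer sizes, learning rate, squared-error loss, and number of epochs are illustrative assumptions, not a prescribed architecture.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
    W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    lr = 0.5

    for epoch in range(5000):
        # Forward pass.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # Backward pass: chain rule applied layer by layer (squared-error loss).
        d_out = (out - y) * out * (1 - out)        # gradient at the output layer
        d_h = (d_out @ W2.T) * h * (1 - h)         # gradient propagated to the hidden layer

        # Gradient-descent updates of weights and biases.
        W2 -= lr * h.T @ d_out
        b2 -= lr * d_out.sum(axis=0, keepdims=True)
        W1 -= lr * X.T @ d_h
        b1 -= lr * d_h.sum(axis=0, keepdims=True)

    print(out.round(3))    # predictions typically approach [0, 1, 1, 0]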

Probability Distribution
In statistics, probability distribution functions depict the probability of the different outcomes of a
random variable. They can be divided into two types:

 Discrete Probability Distribution: In this probability distribution, the random variable may
take a discrete, distinct number of values, each with its respective probability.
For example, a die rolled once can take only 6 values, from 1 to 6, and each of these
outcomes has a probability of ⅙.

 Continuous Probability Distribution: In this probability distribution, the random variable can
take an infinite number of values, and the probability of any single exact value is essentially zero.
The probability is therefore given for a range of values.
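A short, hedged SciPy illustration of the two cases follows; the die and the standard normal distribution are just the examples used above, and SciPy is assumed to be available.

    from scipy import stats

    # Discrete: a fair six-sided die; each outcome has probability 1/6.
    die = stats.randint(1, 7)            # integers 1..6
    print(die.pmf(3))                    # 0.1666...

    # Continuous: a standard normal; single points have probability 0, so we use ranges.
    z = stats.norm(0, 1)
    print(z.cdf(1) - z.cdf(-1))          # P(-1 <= Z <= 1) is about 0.6827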
Parameter Estimation:
Parameter estimation involves finding the values of parameters in a statistical model that best
explain or fit a given dataset. Two common approaches are Maximum Likelihood Estimation
(MLE) and Bayesian Parameter Estimation.
MLE: In MLE, the objective is to maximize the likelihood of observing the data given a specific
probability distribution and its parameters. We estimate the parameters that maximize the
likelihood of observing the data.
Likelihood function
The objective is to maximize the probability of observing the data points under the joint
probability distribution, assuming a specific probability distribution. This is formally stated
as:
P(X | theta)
Here, theta is an unknown parameter. This may also be written as:
P(X ; theta)
P(x1, x2, x3, ..., xn ; theta)
This is the likelihood function and is commonly denoted with L:
L(X ; theta)
Since the aim is to find the parameters that maximize the likelihood function:
Maximum{ L(X ; theta) }
The joint probability is restated as a product of the conditional probability of every observation
given the distribution parameters:
L(X | theta) = Π(i = 1 to n) P(xi | theta)
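As a hedged, minimal example of MLE, the sketch below estimates the mean and standard deviation of a Gaussian from simulated data; the true parameter values and sample size are assumptions for illustration, and the closed-form estimators shown are the ones that maximize the Gaussian likelihood.

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(loc=5.0, scale=2.0, size=1000)   # theta = (mu, sigma) is unknown to the estimator

    # For a Gaussian, the likelihood-maximizing parameters have closed forms:
    mu_hat = X.mean()                                   # MLE of the mean
    sigma_hat = np.sqrt(((X - mu_hat) ** 2).mean())     # MLE of sigma (divides by n, so slightly biased)

    print(mu_hat, sigma_hat)             # close to the true values 5.0 and 2.0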

Bayesian Estimation
Bayes Theorem
Most of you might already be aware of Bayes' theorem. It was proposed by Thomas Bayes.
The theorem puts forth a formula for conditional probability, given as:
P(A|B) = P(B|A) · P(A) / P(B)
Here, we find the probability of event A given that B is true, and P(A) and P(B) are the marginal
probabilities of events A and B.
You may also come across these quantities described in purely statistical terminology:
P(A) = prior probability. This is the probability of the event before we take into
consideration any new piece of information.
P(B) is referred to as the evidence: how likely the observation B is overall, given our prior beliefs
about A.
P(B|A) is referred to as the likelihood function. It tells how likely each observation of B is for a
fixed A.
P(A|B) = posterior probability. This is the probability of A after B has been
observed.
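A tiny numeric illustration of the update from prior to posterior follows; the probabilities (a 1% prior, a 95% likelihood, and a 5% false-positive rate) are invented purely for the example.

    p_A = 0.01             # prior P(A), e.g. probability of having a rare condition
    p_B_given_A = 0.95     # likelihood P(B|A), e.g. test is positive given the condition
    p_B_given_notA = 0.05  # false-positive rate

    # Evidence P(B) by the law of total probability.
    p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

    # Posterior P(A|B) = P(B|A) * P(A) / P(B)
    p_A_given_B = p_B_given_A * p_A / p_B
    print(round(p_A_given_B, 3))   # about 0.161: the evidence updates the 1% prior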

Decision tree evaluation measures:


Decision trees are a non-parametric supervised learning method that can be used for both
classification and regression tasks. They work by learning simple decision rules from the data
features to predict the value of a target variable.
Decision tree evaluation measures are used to assess how well a decision tree splits data at
each node and how well the entire tree performs on a dataset. These measures can be
categorized into split criteria (used during tree construction) and performance metrics
(used to evaluate the final tree).
The best way to evaluate decision tree models is through a combination of metrics like
accuracy, precision, recall, F1-score, and ROC-AUC. Cross-validation techniques such as k-
fold or stratified cross-validation ensure robustness. Additionally, assessing the model's
performance on unseen data with techniques like holdout validation or bootstrapping
provides further validation. Regularization methods like pruning help prevent overfitting,
enhancing generalization to unseen data. Ultimately, a comprehensive evaluation strategy
ensures the effectiveness and reliability of decision tree models.
The most commonly used metrics and methods for evaluating decision tree models are therefore the
ones listed above: accuracy, precision, recall, F1-score, and ROC-AUC, combined with
cross-validation or holdout validation, and with pruning to control overfitting.
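A hedged scikit-learn sketch of this evaluation workflow follows; the breast-cancer dataset, the depth limit (used here as a simple stand-in for pruning), and the 70/30 split are assumptions for illustration only.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import classification_report

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    tree = DecisionTreeClassifier(max_depth=4, random_state=0)   # depth limit acts like simple pruning

    # k-fold cross-validation for robustness.
    print(cross_val_score(tree, X_tr, y_tr, cv=5, scoring="roc_auc").mean())

    # Holdout evaluation: accuracy, precision, recall and F1 on unseen data.
    tree.fit(X_tr, y_tr)
    print(classification_report(y_te, tree.predict(X_te)))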
Hypothesis Testing in Ensemble Methods
What is Hypothesis Testing?
Any data science project starts with exploring the data. When we perform an analysis on
a sample through exploratory data analysis and inferential statistics, we get information about
the sample. We then want to use this information to draw conclusions about the entire population.
In the context of ensemble methods, hypothesis testing uses statistical principles to combine
predictions from multiple models (the ensemble) and to evaluate whether the combined predictions
significantly improve performance or achieve better results compared to the individual models.
Fig. Types of Errors
Hypothesis testing is done to confirm our observation about the population using sample
data, within the desired error level. Through hypothesis testing, we can determine whether we
have enough statistical evidence to conclude if the hypothesis about the population is true or
not.
How to perform hypothesis testing in machine learning?
To trust a model and its predictions, we use hypothesis testing. When we use
sample data to train our model, we make assumptions about the population. By performing
hypothesis testing, we validate these assumptions at a desired significance level.

Let’s take the case of regression models: When we fit a straight line through a linear
regression model, we get the slope and intercept for the line. Hypothesis testing is used to
confirm if our beta coefficients are significant in a linear regression model. Every time we
run the linear regression model, we test if the line is significant or not by checking if the
coefficient is significant.
Key steps to perform hypothesis test are as follows:
1. Formulate a Hypothesis
2. Determine the significance level
3. Determine the type of test
4. Calculate the Test Statistic values and the p values
5. Make Decision
Types of Hypothesis Testing
Hypothesis tests are divided into two categories:
1) Parametric tests – used when the samples are assumed to follow a known distribution,
typically the normal distribution (for example, the standard normal distribution has a mean of 0
and a variance of 1).
2) Non-Parametric tests – used when the samples do not follow a normal distribution.
Two types of hypothesis test can also be distinguished depending on the number of samples to
be compared:
• One Sample – if there is only one sample that must be compared to a specific value, it is
called a one-sample test.
• Two Samples – if two or more samples are being compared. Correlation tests and tests of
differences between samples can be used in this situation. In both cases the samples can be
paired or unpaired: dependent samples are known as paired samples, while independent samples
are known as unpaired samples. Paired samples arise from natural or matched pairings.
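As a hedged, minimal example of such a test in a model-comparison setting, the sketch below runs a paired t-test on per-fold cross-validation scores of a single decision tree versus a random forest ensemble; the dataset, the models, and the conventional 5% significance level are assumptions for illustration, not requirements from the notes.

    from scipy import stats
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    scores_single = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    scores_ensemble = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)

    # H0: the mean accuracies are equal; H1: they differ.
    t_stat, p_value = stats.ttest_rel(scores_ensemble, scores_single)
    print(t_stat, p_value)
    if p_value < 0.05:
        print("Reject H0: the ensemble's improvement is statistically significant.")
    else:
        print("Fail to reject H0: no significant difference at the 5% level.")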

Ensemble Methods Overview


Ensemble methods combine predictions from multiple base models to improve performance,
stability, and robustness. Common ensemble techniques include:
1. Bagging (Bootstrap Aggregating):
Example: Random Forest.
Combines models trained on different subsets of the data using averaging (for
regression) or majority voting (for classification).
2. Boosting:
Example: AdaBoost, Gradient Boosting.
Sequentially builds models where each corrects the errors of the previous one.
3. Stacking:
Combines predictions from multiple models using a meta-model to learn how
to best combine them.
4. Voting:
A simple combination of predictions using majority (for classification) or
averaging (for regression).
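A hedged sketch of the voting approach follows, using scikit-learn's VotingClassifier; the iris dataset and the particular base models are arbitrary choices made only for the example.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("tree", DecisionTreeClassifier(random_state=0)),
            ("rf", RandomForestClassifier(random_state=0)),   # itself a bagging ensemble
        ],
        voting="hard",   # majority vote; voting="soft" averages predicted probabilities
    )
    print(cross_val_score(ensemble, X, y, cv=5).mean())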

What is graph-based clustering?


Graph-based clustering is a method for identifying groups of similar cells or samples. It
makes no prior assumptions about the clusters in the data. This means the number, size,
density, and shape of clusters do not need to be known or assumed prior to clustering.
Graphical model clustering involves using probabilistic graphical models to group similar
data points into clusters based on their underlying probability distributions. It combines the
principles of graph theory and statistical modeling, making it a powerful tool for
understanding relationships in complex datasets.
Key Concepts in Graphical Models
1. Graphical Models: Represent dependencies between variables using nodes
(representing variables) and edges (representing conditional dependencies).
 Bayesian Networks (Directed Acyclic Graphs): Represent directed
dependencies.
 Markov Random Fields (Undirected Graphs): Represent undirected
dependencies.
2. Clustering: The task of grouping data points such that points in the same group are
more similar to each other than to points in other groups.
3. Probabilistic Clustering: Assigns probabilities of membership to clusters rather than
hard assignments.
Advantages of Graphical Model Clustering
 Handles uncertainty and probabilistic relationships effectively.
 Models complex dependencies between variables.
 Offers flexibility for structured and unstructured data (e.g., images, text, or time-
series).

Graphical Model Clustering Process


1. Define the Graphical Model:
 Choose nodes to represent variables and edges to represent dependencies.
 Select a probabilistic framework (Bayesian Network, MRF, etc.).
2. Infer Cluster Membership:
Use methods like Expectation-Maximization (EM), Variational Inference, or Gibbs
Sampling to infer cluster assignments.
3. Parameter Estimation:
Estimate model parameters (e.g., mean and variance in GMMs) using Maximum
Likelihood Estimation or Bayesian methods.
4. Evaluate Clusters:
Use metrics like Adjusted Rand Index (ARI), Silhouette Score, or Log-Likelihood.

Applications
1. Text Clustering:
Topic modeling (e.g., LDA for news articles or reviews).
2. Image Segmentation:
Clustering pixels using MRFs or GMMs.
3. Biological Data:
Gene expression data clustering using Bayesian networks.
4. Social Networks:
Community detection in graph-based data.

A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data are
generated from a mixture of several Gaussian distributions with unknown parameters. It is
widely used in clustering, density estimation, and data modeling.
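A hedged sketch of GMM clustering follows, using scikit-learn's GaussianMixture (which estimates the parameters with the EM algorithm mentioned in the process above); the synthetic blobs and the choice of three components are assumptions for illustration.

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    X, true_labels = make_blobs(n_samples=500, centers=3, random_state=0)

    gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(X)            # EM estimates the means, covariances and weights

    print(gmm.means_)                      # estimated cluster means
    print(gmm.predict_proba(X[:3]))        # soft (probabilistic) cluster memberships
    print(adjusted_rand_score(true_labels, labels), silhouette_score(X, labels))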

Spectral Clustering: Ensemble Method and Learning Theory


Spectral clustering is a powerful algorithm that leverages the eigen structure of data similarity
matrices to group data points into clusters. When combined with ensemble methods, it
becomes a robust approach for solving clustering problems. Theoretical insights from
learning theory further enhance its understanding and application.
Spectral clustering is a graph-based clustering method that uses the eigenvalues and
eigenvectors of a similarity matrix derived from the data to group points into clusters. It
constructs a similarity graph, computes its Laplacian matrix, and transforms the data into a
lower-dimensional spectral space where clustering (e.g., using K-means) is performed.
Ensemble spectral clustering combines multiple spectral clustering results from diverse
similarity measures or graph constructions to enhance robustness and accuracy. Learning
theory supports its consistency, generalization, and robustness by analyzing spectral gaps and
perturbation effects.
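A minimal, hedged example of spectral clustering on data with non-linear cluster boundaries follows; the two-moons data, the nearest-neighbours affinity, and the K-means label assignment are illustrative assumptions.

    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    sc = SpectralClustering(
        n_clusters=2,
        affinity="nearest_neighbors",   # build the similarity graph from k-nearest neighbours
        assign_labels="kmeans",         # K-means in the low-dimensional spectral embedding
        random_state=0,
    )
    labels = sc.fit_predict(X)
    print(labels[:10])                  # the two interleaved half-moons are separated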
Applications and Use: Spectral clustering is widely used in applications like image
segmentation (grouping pixels into coherent regions), social network analysis (detecting
communities), bioinformatics (clustering genes or proteins), and document clustering
(organizing text data). Its ability to handle non-linear boundaries and adapt to diverse data
structures makes it valuable in scenarios where traditional clustering methods like K-means fail.
