Unit III
Subset Selection:
The goal is to find a hyperplane that best separates data points of different classes (e.g.,
separating cats and dogs in a feature space).
Algorithms like SVM (Support Vector Machines) often focus on finding the optimal
hyperplane.
A separating hyperplane is a plane that separates two classes of data points in a multi-dimensional
space. The hyperplane separation theorem states that if two classes of data points are linearly
separable, then there exists a hyperplane that perfectly separates the two classes.
In a binary classification problem, given a linearly separable data set, the optimal separating
hyperplane is the one that correctly classifies all the data while being farthest away from the data
points. In this respect, it is said to be the hyperplane that maximizes the margin, defined as the
distance from the hyperplane to the closest data point.
The idea behind the optimality of this classifier can be illustrated as follows. New test points are
drawn according to the same distribution as the training data. Thus, if the separating hyperplane is far
away from the data points, previously unseen test points will most likely fall far away from the
hyperplane or in the margin. As a consequence, the larger the margin is, the less likely the points are
to fall on the wrong side of the hyperplane.
Finding the optimal separating hyperplane can be formulated as a convex quadratic
programming problem, which can be solved with well-known techniques.
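As an illustration, the standard hard-margin statement of this problem (given here for reference, not taken from a particular source) is: for training points (xi, yi) with labels yi in {-1, +1}, minimise (1/2)||w||^2 over w and b, subject to yi (w·xi + b) >= 1 for every i. Under the margin definition used above (distance from the hyperplane to the closest point), the resulting margin equals 1/||w||, so minimising ||w|| is the same as maximising the margin.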
The optimal separating hyperplane should not be confused with the optimal classifier known as
the Bayes classifier: the Bayes classifier is the best classifier for a given problem, independently of
the available data but unattainable in practice, whereas the optimal separating hyperplane is only the
best linear classifier one can produce given a particular data set.
The optimal separating hyperplane is one of the core ideas behind the support vector machines. In
particular, it gives rise to the so-called support vectors which are the data points lying on the margin
boundary of the hyperplane. These points support the hyperplane in the sense that they contain all the
required information to compute the hyperplane: removing other points does not change the optimal
separating hyperplane. Elaborating on this fact, one can actually add points to the data set without
influencing the hyperplane, as long as these points lie outside of the margin.
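A minimal sketch of this idea using scikit-learn (assuming scikit-learn and NumPy are available; the toy data below are purely illustrative): a linear SVC with a very large C behaves like the hard-margin optimal separating hyperplane, and the fitted model exposes the support vectors directly.

import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in two dimensions (illustrative data)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin optimal separating hyperplane
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # hyperplane: w . x + b = 0
margin = 1.0 / np.linalg.norm(w)         # distance from hyperplane to the closest point
print("w =", w, ", b =", b, ", margin =", margin)
print("support vectors:")
print(clf.support_vectors_)              # the points lying on the margin boundary

Refitting after removing any point other than the support vectors leaves w and b unchanged, which is exactly the property described above.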
The plot below shows the optimal separating hyperplane and its margin for a data
set in 2 dimensions. The support vectors are the highlighted points lying on the margin boundary.
ANN
Elements of a Neural Network
Input Layer: This layer accepts the input features. It provides information from the outside
world to the network; no computation is performed at this layer, and the nodes simply pass the
information (features) on to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world; they are part of the
abstraction provided by any neural network. The hidden layer performs computation on the
features entered through the input layer and transfers the result to the output layer.
Output Layer: This layer brings the information learned by the network up to the outer
world.
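The flow of information through these three layers can be sketched in a few lines of NumPy (a minimal, hypothetical network with 3 inputs, 4 hidden nodes and 2 output nodes; the weights are random and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])                  # input layer: just passes the features on

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer parameters

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

h = sigmoid(W1 @ x + b1)                        # hidden layer: computes on the features
out = sigmoid(W2 @ h + b2)                      # output layer: exposes the learned result
print(out)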
Classification in Artificial Neural Networks
Artificial Neural Networks (ANNs) are used for classification by learning decision
boundaries (like hyperplanes) that divide classes based on the input data. They can handle
non-linear decision boundaries using hidden layers and activation functions.
Fig. Perceptron
Multilayer Perceptron artificial neural networks add complexity and density,
with the capacity for many hidden layers between the input and output layers. Each
node on a given layer is connected to every node on the next layer.
This means Multilayer Perceptron models are fully connected networks and can
be leveraged for deep learning.
They are used for more complex problems and tasks such as complex
classification or voice recognition. Because of the model's depth and complexity,
processing and model maintenance can be resource- and time-consuming.
2. Multi-Layer Perceptron (MLP):
Added hidden layers to overcome limitations of the perceptron.
Capable of solving complex, non-linear problems.
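A small sketch of such a fully connected network used for non-linear classification (assuming scikit-learn is available; the two-moons data set is chosen only because a single hyperplane cannot separate it):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A data set that no single hyperplane can separate
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer with a non-linear activation gives a non-linear decision boundary
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))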
The bias acts as an adjustable constant in the neuron, allowing the activation function to shift, which
helps the model better fit the data.
Without bias, the output of the neuron is entirely dependent on the weighted sum of
inputs. This constrains the neuron to pass through the origin (0,0) for certain
activation functions.
Bias allows the activation function to shift up or down, enabling the neuron to fit data
that doesn't pass through the origin.
The bias term adjusts the decision boundary by shifting the activation function.
For instance, in a linear model y = w·x + b, the bias b shifts the line up or down,
providing more flexibility in separating data points.
Biases are essential in learning non-linear patterns when combined with activation
functions like ReLU, sigmoid, or tanh.
Without bias, the neural network might struggle to approximate functions where the
outputs are not symmetrical around the origin.
Mathematical Representation
A single neuron computes its output as
y = f( Σi wi·xi + b )
Here:
xi: Inputs
wi: Weights
b: Bias
f: Activation function
The bias b adjusts the input to the activation function, allowing the output y to take on values that fit
the data distribution better.
Without the bias term, the decision boundaries are restricted (they must pass through the origin), making the model less capable of learning complex
relationships.
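A tiny illustration of this point (the numbers are hypothetical, chosen only to show the shift): without a bias, a linear neuron is forced to output 0 when all of its inputs are 0, so its decision boundary must pass through the origin; adding b moves the boundary away from it.

import numpy as np

w = np.array([1.0, 1.0])
x = np.array([0.0, 0.0])

print(w @ x)          # 0.0 – without bias the origin always lies on the boundary
print(w @ x + 2.0)    # 2.0 – the bias b = 2.0 shifts the boundary away from the origin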
Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural
networks, particularly feed-forward networks. It works iteratively, minimizing the cost function by
adjusting weights and biases.
In each epoch, the model adapts these parameters, reducing loss by following the error gradient.
Backpropagation often utilizes optimization algorithms like gradient descent or stochastic gradient
descent. The algorithm computes the gradient using the chain rule from calculus, allowing it to
effectively navigate complex layers in the neural network to minimize the cost function.
Backpropagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect to each
weight using the chain rule, making it possible to update weights efficiently.
2. Scalability: The backpropagation algorithm scales well to networks with multiple layers and
complex architectures, making deep learning feasible.
3. Automated Learning: With backpropagation, the learning process becomes automated, and
the model can adjust itself to optimize its performance.
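A compact sketch of one backpropagation step for a one-hidden-layer network (plain NumPy, sigmoid activations, mean squared error loss; the shapes and learning rate are illustrative assumptions, not a production recipe):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))          # 8 samples, 3 features
y = rng.integers(0, 2, size=(8, 1))  # binary targets

W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 0.1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass
h = sigmoid(X @ W1 + b1)
out = sigmoid(h @ W2 + b2)
loss = np.mean((out - y) ** 2)

# Backward pass: chain rule applied layer by layer
d_out = 2 * (out - y) / len(X) * out * (1 - out)   # dLoss / d(pre-activation of output)
dW2, db2 = h.T @ d_out, d_out.sum(axis=0, keepdims=True)
d_h = d_out @ W2.T * h * (1 - h)                   # propagate the error back to the hidden layer
dW1, db1 = X.T @ d_h, d_h.sum(axis=0, keepdims=True)

# Gradient-descent update of weights and biases
W1, b1 = W1 - lr * dW1, b1 - lr * db1
W2, b2 = W2 - lr * dW2, b2 - lr * db2
print("loss before update:", loss)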
Probability Distribution
In statistics, probability distribution functions depict the probability of the different outcomes of a
random variable. They can be divided into 2 types:
Discrete Probability Distribution – In this probability distribution, the random variable takes
a discrete, distinct set of values, each with its respective probability.
For example: a die rolled once can take only 6 values, from 1 to 6, and each of these
outcomes has a probability of ⅙.
Continuous Probability Distribution – In this probability distribution, the random variable can
take an infinite number of values, so the probability of any single exact value is essentially zero.
The probability is instead given for a range of values.
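A short illustration of the two cases using SciPy (assuming scipy is installed; the distributions chosen are just examples):

from scipy import stats

# Discrete: a fair six-sided die, P(X = k) = 1/6 for k = 1..6
die = stats.randint(1, 7)
print(die.pmf(3))            # 0.1666... – probability of one distinct outcome

# Continuous: a standard normal variable; a single exact value has probability 0,
# so probabilities are quoted for ranges of values.
z = stats.norm(0, 1)
print(z.cdf(1) - z.cdf(-1))  # probability that the value falls in the range [-1, 1]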
Parameter Estimation:
Parameter estimation involves finding the values of parameters in a statistical model that best
explain or fit a given dataset. Two common approaches are Maximum Likelihood Estimation
(MLE) and Bayesian Parameter Estimation.
MLE: In MLE, the objective is to maximize the likelihood of observing the data given a specific
probability distribution and its parameters. We estimate the parameters that maximize the
likelihood of observing the data.
Likelihood function
The objective is to maximise the probability of observing the data points under the joint
probability distribution for an assumed form of distribution. This is formally stated as
P(X | theta)
Here, theta is the unknown parameter. This may also be written as
P(X ; theta), i.e. P(x1, x2, x3, ..., xn ; theta)
This is the likelihood function and is commonly denoted by L:
L(X ; theta)
Since the aim is to find the parameters that maximise the likelihood function, we solve
max over theta of L(X ; theta)
Assuming the observations are independent and identically distributed, the joint probability is restated as a product of the probability of each observation
given the distribution parameters:
L(X ; theta) = Π (i = 1 to n) P(xi ; theta)
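As a concrete sketch (assuming NumPy and SciPy are available; a normal distribution is chosen only as an example), the MLE of a Gaussian's mean and standard deviation is the sample mean and the (biased) sample standard deviation, which is also what SciPy's fit routine maximises:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)   # illustrative sample

# Analytic maximisers of L(X ; theta) for a Gaussian
mu_mle = data.mean()
sigma_mle = data.std()          # note: divides by n, the MLE form

# SciPy maximises the same likelihood
mu_fit, sigma_fit = stats.norm.fit(data)
print(mu_mle, sigma_mle)
print(mu_fit, sigma_fit)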
Bayesian Estimation
Bayes' Theorem
Most of you might already be aware of Bayes' theorem; it was proposed by Thomas Bayes.
The theorem puts forth a formula for conditional probability, given as:
P(A | B) = P(B | A) · P(A) / P(B)
Here, we find the probability of event A given that B is true, and P(A) and P(B) are the marginal
(unconditional) probabilities of events A and B.
Or, you may come across websites referring to these terms in pure statistical terminology:
P(A) = Prior probability. This is the probability of the event before we take into
consideration any new piece of information.
P(B) = Evidence. How likely the observation B is overall, averaged over our prior beliefs
about A.
P(B | A) = Likelihood. It tells how likely the observation B is for a fixed A.
P(A | B) = Posterior probability. This is the updated probability of A after the evidence B has
been observed.
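A small numerical illustration of the formula (the numbers below are made up purely for the arithmetic: a rare condition A and a positive test result B):

# Hypothetical numbers, chosen only to illustrate Bayes' theorem
p_A = 0.01            # prior: P(A)
p_B_given_A = 0.90    # likelihood: P(B | A)
p_B_given_notA = 0.05 # P(B | not A)

# Evidence: total probability of observing B
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Posterior: P(A | B) = P(B | A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)    # about 0.154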
Hypothesis Testing in Ensemble Methods
What is Hypothesis Testing?
Any data science project starts with exploring the data. When we perform an analysis on
a sample through exploratory data analysis and inferential statistics, we get information about
the sample. We then want to use this information to draw conclusions about the entire population.
In ensemble methods, hypothesis testing involves using statistical principles to combine predictions from multiple models (the
ensemble) and to evaluate whether the combined predictions significantly improve performance,
or achieve better results, compared to the individual models.
Fig. Types of Errors
Hypothesis testing is done to confirm our observation about the population using sample
data, within the desired error level. Through hypothesis testing, we can determine whether we
have enough statistical evidence to conclude if the hypothesis about the population is true or
not.
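One way to test whether an ensemble significantly improves on an individual model (a sketch assuming scikit-learn and SciPy; the models and data set are illustrative choices): score both on the same cross-validation folds, then apply a paired test to the per-fold scores.

from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
ensemble = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)

# Paired t-test on the per-fold scores: H0 = "the ensemble is no better"
t_stat, p_value = stats.ttest_rel(ensemble, single)
print(ensemble.mean(), single.mean(), p_value)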
How to perform hypothesis testing in machine learning?
To trust your model and make predictions, we utilize hypothesis testing. When we use
sample data to train our model, we make assumptions about the population. By performing
hypothesis testing, we validate these assumptions for a desired significance level.
Let’s take the case of regression models: When we fit a straight line through a linear
regression model, we get the slope and intercept for the line. Hypothesis testing is used to
confirm if our beta coefficients are significant in a linear regression model. Every time we
run the linear regression model, we test if the line is significant or not by checking if the
coefficient is significant.
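A sketch of this check using statsmodels (assumed to be installed; the data are synthetic and only for illustration): the fitted model reports a p-value for each beta coefficient, which is the hypothesis test on whether that coefficient is significantly different from zero.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)   # true slope 2, intercept 1

X = sm.add_constant(x)          # adds the intercept term
model = sm.OLS(y, X).fit()
print(model.params)             # estimated intercept and slope
print(model.pvalues)            # p-values: are the coefficients significant?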
Key steps to perform hypothesis test are as follows:
1. Formulate a Hypothesis
2. Determine the significance level
3. Determine the type of test
4. Calculate the Test Statistic values and the p values
5. Make Decision
Types of Hypothesis Testing
Hypothesis tests are divided into two categories:
1) Parametric tests – are used when the samples follow an assumed distribution, typically the
normal distribution (a standard normal distribution has a mean of 0 and a variance of 1).
2) Non-Parametric tests – if the samples do not follow a normal distribution, non-parametric
tests are used.
Two types of Hypothesis Testing can be created depending on the number of samples to
be compared:
• One Sample – If there is only one sample that must be compared to a specific value, it is
called a single sample.
• Two Samples – if you're comparing two or more samples. Correlation and tests of the
difference between samples are two kinds of tests that could be used in this situation. In both
cases the samples can be paired or not: dependent samples are known as paired samples, while
independent samples are known as unpaired samples. Natural or matched pairings occur in
paired samples.
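A brief sketch of the one-sample and two-sample cases with SciPy (illustrative data; a parametric t-test is used, which presumes roughly normal samples):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One sample: compare a single sample against a specific value (here 5.0)
sample = rng.normal(loc=5.3, scale=1.0, size=30)
print(stats.ttest_1samp(sample, popmean=5.0))

# Two independent (unpaired) samples
a = rng.normal(loc=5.0, scale=1.0, size=30)
b = rng.normal(loc=5.8, scale=1.0, size=30)
print(stats.ttest_ind(a, b))

# Two dependent (paired) samples, e.g. the same subjects measured twice
before = rng.normal(loc=5.0, scale=1.0, size=30)
after = before + rng.normal(loc=0.4, scale=0.3, size=30)
print(stats.ttest_rel(before, after))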