DSP Unit - III
UNIT - III
Pattern:
Pattern is an abstraction, represented by a set of measurements describing a “physical”
object.
Patterns are everywhere in this digital world. A pattern can either be observed physically
or it can be observed mathematically by applying algorithms.
It gives the description of the object or the notion.
The description is given in the form of attributes of the object.
These are also called the features of the object.
Example:
The colors on clothes, speech patterns, etc. In computer science, a pattern is
represented using a vector of feature values.
Pattern Recognition
Pattern recognition is the process of recognizing patterns by using a Machine Learning
algorithm. Pattern recognition can be defined as the classification of data based on
knowledge already gained or on statistical information extracted from patterns and/or
their representation.
Pattern recognition is the process which can detect different categories and get
information about particular data.
Some of the applications of pattern recognition are voice recognition, weather forecasting,
object detection in images, etc.
Advantages:
DNA sequences can be interpreted
Extensively applied in the medical field and robotics.
Classification problems can be solved using pattern recognition.
Biometric detection.
Can recognize a particular object from different angles.
It is useful for cloth pattern recognition for visually impaired people.
Pattern recognition helps in forensic labs.
Disadvantages:
The syntactic pattern recognition approach is complex to implement and it is a very slow
process.
Sometimes to get better accuracy, a larger dataset is required.
It cannot explain why a particular object is recognized as such.
Example: why my face is recognized rather than my friend's face.
Radar signal analysis: Pattern recognition and Signal processing methods are used in
various applications of radar signal classifications like AP mine detection and
identification.
Speech recognition: The greatest success in speech recognition has been obtained using
pattern recognition paradigms. It is used in various speech recognition algorithms which
try to avoid the problems of using a phoneme-level description and instead treat larger
units, such as words, as patterns.
We need machines/computers to apply pattern recognition; in practice this is done using
Machine Learning algorithms.
Features:
Feature Vectors
Usually a single object can be represented using several features, e.g.:
o x1 = shape (e.g., number of sides)
o x2 = size (e.g., some numeric value)
o x3 = color (e.g., RGB values)
o ...
o xd = some other (numeric) feature.
REPRESENTATION OF PATTERN
Patterns can be represented in a number of ways.
All of these ways pertain to giving the values of the features used for that particular pattern.
For supervised learning, where a training set is given, each pattern in the training set will
also have the class of the pattern given.
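As a small, hedged illustration (not from the original notes), a pattern can be stored as a NumPy feature vector together with its class label; the feature names and values below are made up for demonstration.

import numpy as np

# Hypothetical pattern: [shape (number of sides), size, R, G, B]
x = np.array([4, 2.5, 255, 0, 0], dtype=float)   # feature vector of dimension d = 5
label = "red_square"                              # class label, given in supervised learning

print("feature vector:", x)
print("dimensionality d =", x.shape[0])
print("class label:", label)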
Curse of Dimensionality
Handling the high-dimensional data is very difficult in practice, commonly known as the curse
of dimensionality.
If the dimensionality of the input dataset increases, any machine learning algorithm and model
becomes more complex.
As the number of features increases, the number of samples needed to cover the feature space
also grows rapidly, and the chance of overfitting increases.
If a machine learning model is trained on high-dimensional data, it becomes overfitted and
results in poor performance.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
Dimensionality Reduction
The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.
A dataset contains a huge number of input features in various cases, which makes the
predictive modeling task more complicated.
Because it is very difficult to visualize or make predictions for a training dataset with a
high number of features, dimensionality reduction techniques are required in such cases.
Dimensionality reduction can be defined as "a way of converting a higher-dimensional
dataset into a lower-dimensional dataset while ensuring that it provides similar
information."
These techniques are widely used in machine learning for obtaining a better fit predictive
model while solving the classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Different methods can be used to reduce dimensionality:
• Feature extraction
• Feature selection
Feature extraction finds a set of new features (i.e., through some mapping f()) from
the existing features. The mapping f() could be linear or non-linear.
Feature selection is a process of choosing a subset of features from the original set of
features. It usually involves three ways:
Filter
Wrapper
Embedded
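As a hedged sketch of both ideas (not part of the original notes), the snippet below uses scikit-learn on the built-in Iris data: PCA as a linear feature-extraction mapping f(), and a filter-style SelectKBest for feature selection. The choice of Iris and of 2 retained features is an assumption for illustration only.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)            # 4 original features

# Feature extraction: map the 4 features to 2 new features via a linear mapping (PCA)
X_pca = PCA(n_components=2).fit_transform(X)

# Feature selection (filter method): keep the 2 original features most related to y
X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

print(X.shape, X_pca.shape, X_sel.shape)     # (150, 4) (150, 2) (150, 2)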
SUPERVISED LEARNING vs UNSUPERVISED LEARNING:
o Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
o A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
o A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
o Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
o Supervised learning is used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning is used where we have only input data and no corresponding output data.
o A supervised learning model generally produces an accurate result; an unsupervised learning model may give a less accurate result compared to supervised learning.
o Supervised learning is not close to true Artificial Intelligence, as we first train the model on each class of data and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns the way a child learns daily routine things from experience.
PERCEPTRON:
Perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers
decide whether an input, usually represented by a series of vectors, belongs to a specific class.
A perceptron is a single-layer neural network. It consists of four main parts: input
values, weights and bias, net sum, and an activation function.
The process begins by taking all the input values and multiplying them by their weights.
Then, all of these multiplied values are added together to create the weighted sum.
The weighted sum is then applied to the activation function, producing the perceptron's output.
The activation function plays the integral role of ensuring the output is mapped between required
values such as (0, 1) or (-1, 1).
It is important to note that the weight of an input is indicative of the strength of a node.
Similarly, an input's bias value gives the ability to shift the activation function curve up or down.
As a simplified form of a neural network, specifically a single-layer neural network, perceptrons
play an important role in binary classification.
This means the perceptron is used to classify data into two parts, hence binary. Sometimes,
perceptrons are also referred to as linear binary classifiers for this reason.
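As a rough sketch (not from the notes), the snippet below implements a single perceptron forward pass with NumPy: a weighted sum plus bias followed by a step activation that maps the output to 0 or 1. The input values, weights and bias are hypothetical.

import numpy as np

def perceptron(x, w, b):
    # Weighted (net) sum of inputs plus bias
    net = np.dot(w, x) + b
    # Step activation function: map the net sum to 0 or 1
    return 1 if net >= 0 else 0

x = np.array([1.0, 0.5])      # hypothetical input values
w = np.array([0.6, -0.4])     # hypothetical weights
b = -0.1                      # hypothetical bias

print(perceptron(x, w, b))    # prints 1, since 0.6*1.0 - 0.4*0.5 - 0.1 = 0.3 >= 0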
NEAREST-NEIGHBOUR CLASSIFIER:
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but
we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the similar
features of the new data set to the cats and dogs images and based on the most similar
features it will put it in either cat or dog category.
Firstly, we will choose the number of neighbors, so we will choose the k=5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. It can be calculated as:
d = √((x2 - x1)² + (y2 - y1)²)
By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:
As we can see, 3 of the 5 nearest neighbors are from category A, hence this new data point must
belong to category A.
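As a hedged illustration (not from the notes), the sketch below uses scikit-learn's KNeighborsClassifier with k = 5 and the Euclidean distance on made-up 2-D points; the data and labels are assumptions for demonstration only.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D data points and their categories (A = 0, B = 1)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X, y)

# Classify a new data point by the majority vote of its 5 nearest neighbours
print(knn.predict([[3, 4]]))   # -> [0], i.e. category A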
SUPPORT VECTOR MACHINE (SVM):
Support Vector Machine (SVM) is a supervised learning algorithm that classifies data by
finding the best separating hyperplane (decision boundary).
The dimensions of the hyperplane depend on the features present in the dataset: if there
are 2 features (as shown in the image), then the hyperplane will be a straight line.
And if there are 3 features, then the hyperplane will be a 2-dimensional plane. We always
create a hyperplane that has a maximum margin, which means the maximum distance between the
data points.
The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector.
Since these vectors support the hyperplane, hence called a Support vector.
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.
We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line.
So to separate these data points, we need to add one more dimension.
For linear data, we have used two dimensions x and y, so for non-linear data, we will add a
third dimension z.
It can be calculated as: z = x² + y²
SVM KERNELS:
The SVM kernel is a function that takes a low-dimensional input space and transforms it into
a higher-dimensional space, i.e., it converts non-separable problems into separable problems.
It is mostly useful in non-linear separation problems.
Simply put, the kernel does some extremely complex data transformations and then finds out
the process to separate the data based on the labels or outputs defined.
It makes SVM more powerful, flexible and accurate. The following are some of
the types of kernels used by SVM.
Linear Kernel:
It can be used as a dot product between any two observations. The formula of the
linear kernel is as below:
K(x, xi) = x · xi = sum(x * xi)
Polynomial Kernel:
It is a more generalized form of the linear kernel and can distinguish curved or non-linear input space.
Following is the formula for the polynomial kernel:
K(x, xi) = (1 + x · xi)^d
Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm.
Gaussian Radial Basis Function (RBF) Kernel:
It is a general-purpose kernel, used when there is no prior knowledge about the data. Its formula is:
K(x, xi) = exp(-gamma * ||x - xi||²)
Here, gamma ranges from 0 to 1. We need to manually specify it in the learning algorithm. A good
default value of gamma is 0.1.
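As a hedged sketch (not from the notes), the snippet below compares a linear kernel and an RBF kernel using scikit-learn's SVC on synthetic concentric-circle data, which is not linearly separable. The dataset and default kernel parameters are assumptions for illustration only.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic non-linearly separable data: two concentric circles
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel)       # default kernel parameters
    clf.fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
# The RBF kernel can learn a circular boundary; the linear kernel is limited to a straight line,
# so it typically performs much worse on this data.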
LOGISTIC REGRESSION:
Logistic regression is one of the most popular Machine Learning algorithms, and it comes
under the Supervised Learning technique. It is used for predicting a categorical dependent
variable using a given set of independent variables. It can be used for both binary and
multi-class classification.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value.
It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0
and 1, it gives the probabilistic values which lie between 0 and 1.
Logistic Regression is much similar to Linear Regression except in how they are used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
It is a probabilistic classification method: it models the probability of each class.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function or Sigmoid function, which predicts two maximum values (0 or 1).
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and values below the
threshold tend to 0.
Logistic Regression Equation:
The Logistic Regression equation can be obtained from the Linear Regression equation.
We know the equation of a straight line can be written as: y = b0 + b1x1 + b2x2 + ... + bnxn
In logistic regression the output must be a probability between 0 and 1, so y is passed
through the sigmoid function: p = 1 / (1 + e^(-y))
Since we need a quantity that ranges from -infinity to +infinity, we take the logarithm of
the odds, and the equation becomes: log(p / (1 - p)) = b0 + b1x1 + b2x2 + ... + bnxn
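As a small, hedged example (not from the notes), the sketch below fits scikit-learn's LogisticRegression on made-up "hours studied vs pass/fail" data; the data, feature and threshold behaviour shown are assumptions for demonstration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (input) vs pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# The model outputs probabilities between 0 and 1 (sigmoid output) ...
print(model.predict_proba([[4.5]]))
# ... and a class label using a 0.5 threshold
print(model.predict([[4.5]]))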
DECISION TREE:
o Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits
the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues this process until it reaches a leaf node of the tree. The complete
process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot further classify
the nodes; the final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node. Finally, the decision
node splits into two leaf nodes (Accepted offers and Declined offer). Consider the below
diagram:
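As a hedged sketch (not from the notes), the snippet below trains scikit-learn's DecisionTreeClassifier (a CART-style implementation) on the Iris data and prints the learned tree of decision and leaf nodes; the dataset, depth limit and entropy criterion are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# 'entropy' selects splits using an information-gain style measure (an ASM)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned tree: root node, decision nodes, and leaf nodes
print(export_text(tree, feature_names=load_iris().feature_names))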
NAÏVE BAYES CLASSIFIER:
The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be described
as:
o Naive: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape and taste, then a red, spherical and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple, without
depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.
P(A) is Prior Probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of the evidence.
Working of Naive Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day according to the
weather conditions. To solve this problem, we need to follow the below steps:
Problem: If the weather is sunny, then the Player should play or not?
S.NO Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Solution: To solve this, first consider the below dataset: Frequency table for the Weather Conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of the Weather Conditions:
Weather     No             Yes            P(Weather)
Overcast    0              5              5/14 = 0.35
Rainy       2              2              4/14 = 0.29
Sunny       2              3              5/14 = 0.35
All         4/14 = 0.29    10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3*0.71/0.35 = 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), on a Sunny day the player can play the game.
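As a minimal sketch (not from the notes), the Python snippet below reproduces the hand calculation above using the counts from the frequency table; the numbers are exactly those in the tables, not new data.

# Counts taken from the frequency table above
p_sunny_given_yes = 3 / 10
p_sunny_given_no = 2 / 4
p_yes, p_no = 10 / 14, 4 / 14
p_sunny = 5 / 14

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))  # 0.6 0.4 (0.41 above comes from rounding)
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")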
RANDOM FOREST:
Random Forest is a popular supervised learning algorithm. "It is a classifier that contains
a number of decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset."
Instead of relying on one decision tree, the random forest takes the prediction from each
tree and based on the majority votes of predictions, and it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Note: To better understand the Random Forest Algorithm, you should have knowledge of the
Decision Tree Algorithm.
Assumptions for Random Forest:
Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
forest classifier:
o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, even for the large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.
The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest classifier
predicts the final decision. Consider the below image:
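As a hedged sketch (not from the notes), the snippet below trains scikit-learn's RandomForestClassifier on the Iris data; the dataset, the number of trees (N = 100) and the train/test split are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# N = 100 decision trees, each trained on a random bootstrap subset of the data;
# the final prediction is the majority vote of all trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))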
2. Boosting:
Unlike bagging, boosting is an iterative ensemble learning technique where base models
are trained sequentially, and each subsequent model focuses on correcting the errors
made by the previous models.
In boosting, each base model is trained on the entire training set, but with different
weights assigned to the training examples. Examples that are misclassified by earlier
models are given higher weights to force subsequent models to pay more attention to
them.
Predictions are made by aggregating the weighted predictions of all base models, with
more weight given to the predictions of models that perform better on the training data.
Boosting helps reduce both bias and variance, leading to improved generalization
performance.
Gradient Boosting Machines (GBMs) and AdaBoost (Adaptive Boosting) are two popular
boosting algorithms widely used in practice.
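As a hedged sketch (not from the notes), the snippet below runs scikit-learn's AdaBoostClassifier, one of the boosting algorithms mentioned above, on the built-in breast-cancer data; the dataset, split and number of rounds are assumptions for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each boosting round re-weights the training examples: misclassified examples get
# higher weights, and better-performing base models get more say in the final vote.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)

print("test accuracy:", boost.score(X_test, y_test))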
CLUSTERING:
Clustering or cluster analysis is a machine learning technique which groups an unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a group
that has less or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size,
color, behavior, etc., and divides the data points according to the presence and absence of
those similar patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm,
and it deals with the unlabeled dataset.
After applying this clustering technique, each cluster or group is provided with a cluster-
ID. ML system can use this id to simplify the processing of large and complex datasets.
Example:
Let's understand the clustering technique with the real-world example of Mall: When we visit
any shopping mall, we can observe that the things with similar usage are grouped together. Such
as the t-shirts are grouped in one section, and trousers are at other sections, similarly, at
vegetable sections, apples, bananas, Mangoes, etc., are grouped in separate sections, so that we
can easily find out the things. The clustering technique also works in the same way.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc
Applications of Clustering:
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells
o In Search Engines
o In Customer Segmentation (grouping customers based on their choices and preferences)
o In Biology
o In Land Use
PARTITIONING CLUSTERING:
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method.
The most common example of partitioning clustering is the K-Means Clustering
algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the
number of pre-defined groups.
Cluster centers are created in such a way that the distance between the data points of
one cluster is minimum as compared to the distance to another cluster's centroid.
K-MEANS CLUSTERING:
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters.
Here K defines the number of pre-defined clusters that need to be created in the process,
as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the data
point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters,
and repeats the process until the best clusters are found. The value of k should be
predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
1. Determines the best value for K center points or centroids by an iterative process.
2. Assigns each data point to its closest k-center. Those data points which are near to
the particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Algorithm:
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassign each datapoint to the new closest centroid
of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
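As a hedged sketch (not from the notes), the snippet below runs scikit-learn's KMeans with K = 2 on made-up 2-D points; the data and the choice of K are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled 2-D data points
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# K = 2 predefined clusters; the algorithm iterates centroid update and reassignment
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster IDs:", labels)
print("centroids:", kmeans.cluster_centers_)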
HIERARCHICAL CLUSTERING :
Hierarchical clustering is another unsupervised machine learning algorithm, which is
used to group the unlabeled datasets into a cluster and also known as hierarchical
cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram.
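As a short, hedged illustration (not from the notes), the snippet below applies scikit-learn's AgglomerativeClustering, a bottom-up hierarchical method, to the same kind of made-up 2-D points and cuts the hierarchy at 2 clusters; the data and linkage choice are assumptions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Agglomerative (bottom-up) hierarchical clustering, cutting the dendrogram at 2 clusters
hc = AgglomerativeClustering(n_clusters=2, linkage="ward")
print(hc.fit_predict(X))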
REGRESSION:
The term regression is used when you try to find the relationship between variables.
In Machine Learning, and in statistical modeling, that relationship is used to predict the
outcome of future events.
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
More specifically, Regression analysis helps us to understand how the value of the
dependent variable is changing corresponding to an independent variable when other
independent variables are held fixed.
It predicts continuous/real values such as temperature, age, salary, price, etc.
It is a supervised technique.
In Regression, we plot a graph between the variables which best fits the given datapoints,
using this plot, the machine learning model can make predictions about the data.
In simple words, "Regression shows a line or curve that passes as close as possible to all the
datapoints on the target-predictor graph, in such a way that the vertical distance
between the datapoints and the regression line is minimum." The distance between the
datapoints and the line tells whether the model has captured a strong relationship or not.
Types of Regression:
1. Linear Regression
2. Polynomial Regression
3. Support Vector Regression
4. Decision Tree Regression
5. Random Forest Regression
1. Linear Regression:
Linear regression is a statistical regression method which is used for predictive analysis.
It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
It is used for solving the regression problem in machine learning.
Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.
If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
The relationship between variables in the linear regression model can be explained using
the below image. Here we are predicting the salary of an employee on the basis of the
year of experience.
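As a hedged sketch (not from the notes), the snippet below fits scikit-learn's LinearRegression to made-up "years of experience vs salary" data, matching the example described above; the numbers are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (X) vs salary (y)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([30000, 35000, 42000, 48000, 55000, 61000])

reg = LinearRegression().fit(X, y)
print("intercept b0:", reg.intercept_)
print("slope b1:", reg.coef_[0])
print("predicted salary for 7 years of experience:", reg.predict([[7]])[0])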
2. Polynomial Regression:
The equation for polynomial regression is also derived from the linear regression equation: the
linear regression equation Y = b0 + b1x is transformed into the polynomial regression
equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is
our independent/input variable.
The model is still considered linear because the coefficients b0, ..., bn enter the equation
linearly, even though the input terms (x², x³, ...) are non-linear.
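As a hedged sketch (not from the notes), the snippet below builds the polynomial terms with scikit-learn's PolynomialFeatures and then fits an ordinary linear regression on them, which is exactly why the model stays linear in the coefficients; the data and degree are assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data: y roughly follows a quadratic curve
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.9, 10.2, 17.1, 26.0])

# Transform x into [1, x, x^2] and fit an ordinary linear regression on these terms
X_poly = PolynomialFeatures(degree=2).fit_transform(x)
model = LinearRegression().fit(X_poly, y)

print("intercept b0:", model.intercept_, "coefficients b1, b2:", model.coef_[1:])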
3. Support vector regression:
Support Vector Machine is a supervised learning algorithm which can be used for
regression as well as classification problems. So if we use it for regression problems, then
it is termed as Support Vector Regression.
Support Vector Regression is a regression algorithm which works for continuous
variables. Below are some keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it
is a line which helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a
margin for datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane
and opposite class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints is covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must contain a
maximum number of datapoints. Consider the below image:
Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
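As a hedged sketch (not from the notes), the snippet below fits scikit-learn's SVR to made-up continuous data; the epsilon value sets the width of the margin between the boundary lines around the best-fit hyperplane. The data and hyperparameters are assumptions for illustration.

import numpy as np
from sklearn.svm import SVR

# Hypothetical continuous data
X = np.array([[1], [2], [3], [4], [5], [6], [7]])
y = np.array([1.2, 1.9, 3.2, 3.9, 5.1, 5.8, 7.2])

# epsilon defines the margin (boundary lines) around the best-fit line (hyperplane)
svr = SVR(kernel="rbf", C=100, epsilon=0.5)
svr.fit(X, y)
print(svr.predict([[4.5]]))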
COST FUNCTION:
A cost function is an important parameter that determines how well a machine learning
model performs for a given dataset.
It calculates the difference between the expected value and predicted value and represents
it as a single real number.
Cost function also plays a crucial role in understanding how well your model
estimates the relationship between the input and output parameters.
In simple, "Cost function is a measure of how wrong the model is in estimating the
relationship between X(input) and Y(output) Parameter."
A cost function is sometimes also referred to as Loss function, and it can be estimated by
iteratively running the model to compare estimated predictions against the known values
of Y.
The main aim of each ML model is to determine parameters or weights that can minimize
the cost function.
Why use Cost Function?
While there are different accuracy parameters, why do we need a cost function for the
machine learning model? We can understand this with an example of data classification.
Suppose we have a dataset that contains the height and weights of cats & dogs, and we need to
classify them accordingly. If we plot the records using these two features, we will get a scatter
plot as below:
In the above image, the green dots are cats, and the yellow dots are dogs. Below are the three
possible solutions for this classification problem.
In the above solutions, all three classifiers have high accuracy, but the third solution is the best
because it correctly classifies each datapoint. The reason behind the best classification is that it is
in mid between both the classes, not close or not far to any of them.
To get such results, we need a Cost function. It means for getting the optimal solution; we need a
Cost function.
The cost function calculates the difference between the actual values and the predicted values
and measures how wrong our model was in its prediction.
By minimizing the value of the cost function, we can get the optimal solution. Here comes the role of
Gradient descent.
“Gradient Descent is an optimization algorithm which is used for optimizing the cost function
or error in the model.”
It enables the model to follow the gradient, i.e. the direction that reduces the error, until the
least possible error is reached. Here, direction refers to how the model parameters should be
corrected to further reduce the cost function.
The error in your model can be different at different points, and you have to find the quickest way to
minimize it, to prevent resource wastage.
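As a minimal, hedged sketch (not from the notes), the loop below applies gradient descent to minimize an MSE cost for a one-variable linear model y = w*x + b on made-up data; the learning rate, iteration count and data are assumptions for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])          # underlying relationship: y = 2x + 1

w, b, lr = 0.0, 0.0, 0.01                    # initial parameters and learning rate
for _ in range(5000):
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the MSE cost with respect to w and b
    dw = 2 * np.mean(error * x)
    db = 2 * np.mean(error)
    # Move the parameters in the direction that reduces the cost
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2))              # approximately 2.0 and 1.0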
Types of the cost function:
There are many cost functions in machine learning and each has its use cases depending on whether it is a
regression problem or classification problem.
1. Regression cost Function
2. Binary Classification cost Functions
3. Multi-class Classification Cost Function.
1. Regression cost Function:
Regression models deal with predicting a continuous value for example salary of an employee, price of a
car, loan prediction, etc. A cost function used in the regression problem is called “Regression Cost
Function”. They are calculated on the distance-based error as follows:
Error = y-y’ (actual output- predicted output)
Where,
Y – Actual output
Y’ – Predicted output
The most used regression cost functions are the Mean Error (ME), Mean Squared Error (MSE) and Mean Absolute Error (MAE).
2. Binary Classification cost Function:
The binary cost function is a special case of categorical cross-entropy, where there is
only a single output variable (two classes). For example, classification between red and blue.
To better understand it, let's suppose there is only a single output variable Y.
Cross-entropy(D) = -[y*log(p) + (1 - y)*log(1 - p)],
which equals -log(p) when y = 1 and -log(1 - p) when y = 0.
The error in binary classification is calculated as the mean of cross-entropy for all N
training data. Which means:
Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N
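As a small, hedged sketch (not from the notes), the function below computes the binary cross-entropy averaged over N examples, exactly as described above; the labels and predicted probabilities are made-up values.

import numpy as np

def binary_cross_entropy(y_true, p_pred):
    p_pred = np.clip(p_pred, 1e-12, 1 - 1e-12)   # avoid log(0)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.8, 0.7, 0.6])          # hypothetical predicted probabilities of class 1
print(binary_cross_entropy(y_true, np.array([0.9, 0.2, 0.7, 0.6])))   # ≈ 0.30, lower is better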
3. Multi-class Classification Cost Function:
A multi-class classification cost function is used in the classification problems for
which instances are allocated to one of more than two classes.
Here also, similar to binary class classification cost function, cross-entropy or
categorical cross-entropy is commonly used cost function.
It is designed in a way that it can be used with multi-class classification, where the target
values come from the set of classes {0, 1, 2, ..., n}.
In a multi-class classification problem, cross-entropy generates a score that summarizes the
average difference between the actual and the predicted probability distributions. The score is
minimized, and a perfect cross-entropy value is 0.
TRAINING AND TESTING A CLASSIFIER:
In machine learning data preprocessing, we divide our dataset into a training dataset and a testing dataset.
The below image shows a complete dataset, which means the collection of data.
For example, a student dataset may include roll number, names, subjects, marks, percentage, etc. This is
called one dataset. That dataset can be divided into two parts: 1. Training Dataset 2. Testing Dataset.
The training dataset is used for training purposes (to train the model), and
the testing dataset is used for testing purposes (to evaluate the model).
1. Training Dataset:
Training dataset is provided as input to this phase.
Its attributes and class labels are used to train the machine learning algorithm and prepare the model.
A machine can learn when it observes relevant data.
From the data it finds relationships, detects patterns, understands complex problems and makes decisions.
Training error is the error obtained by applying the model to the same data on which it was trained.
2. Testing Dataset:
Testing dataset is provided as input to this phase.
A test dataset is a dataset for which the class label is unknown; it is tested using the model.
A test dataset is used for assessment of the finally chosen model.
The training and testing datasets are completely different.
Testing error is the error that occurs when the model is assessed on unknown data.
In simple terms, if the actual output of the testing data and the predicted output of the model
do not match, then a testing error has occurred.
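As a hedged sketch (not from the notes), the snippet below splits the Iris data into training and testing parts with scikit-learn's train_test_split and reports accuracy on both; the dataset, the 70/30 split and the classifier are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for testing; train on the remaining 70%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))   # training error = 1 - this
print("testing accuracy:", model.score(X_test, y_test))      # testing error = 1 - this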
Example: Training Vs Testing
K-Fold Cross-Validation:
On the 1st iteration, the first fold is reserved to test the model, and the rest are used to train the
model. On the 2nd iteration, the second fold is used to test the model, and the rest are used to
train the model. This process continues until each fold has been used once as the test fold.
Consider the below diagram:
For example, in a dataset of housing prices, the prices of some houses can be much higher than
those of other houses. To tackle such situations, a stratified k-fold cross-validation technique is useful.
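As a hedged sketch (not from the notes), the snippet below runs 5-fold cross-validation with scikit-learn's KFold and cross_val_score, so that each fold serves exactly once as the test fold; the dataset, classifier and number of folds are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is used exactly once as the test fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())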
6. Holdout Method:
This method is the simplest cross-validation technique among all. In this method, we
need to remove a subset of the training data and use it to get prediction results by training
it on the rest part of the dataset.
The error that occurs in this process tells how well our model will perform with the
unknown dataset. Although this approach is simple to perform, it still faces the issue of
high variance, and it also produces misleading results sometimes.
CLASS-IMBALANCE – WAYS OF HANDLING:
o Misclassification rate: It is also termed as Error rate, and it defines how often the model
gives wrong predictions. The error rate is calculated as the number of incorrect predictions
divided by the total number of predictions made by the classifier. The formula is given below:
Error rate = (False Positives + False Negatives) / Total number of predictions
o Precision: It can be defined as the number of correct positive outputs provided by the model,
i.e., out of all the classes the model predicted as positive, how many were actually positive.
It can be calculated using the below formula:
Precision = True Positives / (True Positives + False Positives)
o Recall: It is defined as, out of all the actual positive classes, how many our model predicted
correctly. The recall must be as high as possible.
Recall = True Positives / (True Positives + False Negatives)
o F-measure: If two models have low precision and high recall or vice versa, it is difficult to
compare them. So, for this purpose, we can use the F-score, which helps us to evaluate recall
and precision at the same time. The F-score is maximum when the recall is equal to the
precision. It can be calculated using the below formula:
F-measure = (2 * Precision * Recall) / (Precision + Recall)
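As a hedged sketch (not from the notes), the snippet below computes these metrics with scikit-learn on made-up actual and predicted labels; the label vectors are assumptions for illustration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical actual and predicted labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("error rate:", 1 - accuracy_score(y_true, y_pred))
print("precision :", precision_score(y_true, y_pred))
print("recall    :", recall_score(y_true, y_pred))
print("F-measure :", f1_score(y_true, y_pred))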
This value can be used to calculate the coefficient of determination (R²) using Formula 1:
R² = 1 - (RSS / TSS)
Where:
RSS = sum of squared residuals
TSS = total sum of squares
Example: Calculating R² using regression outputs. As part of performing a simple linear
regression that predicts students' exam scores (dependent variable) from their study time
(independent variable), you calculate the RSS and TSS.
These values can be used to calculate the coefficient of determination (R²) using the formula above.
LEAST SQUARE:
The least-squares method can be defined as a statistical method that is used to find the
equation of the line of best fit related to the given data.
This method is called so as it aims at reducing the sum of squares of deviations as much
as possible.
The line obtained from such a method is called a regression line.
The red points in the above plot represent the data points for the sample data available. Independent
variables are plotted as x-coordinates and dependent ones are plotted as y-coordinates. The equation
of the line of best fit obtained from the least squares method is plotted as the red line in the graph.
We can conclude from the above graph that how the least squares method helps us to find a
line that best fits the given data points and hence can be used to make further predictions about the
value of the dependent variable where it is not known initially.
Example: Consider the data points (1, 3), (2, 4), (4, 8), (6, 10) and (8, 15). Here the means are
X = 21/5 = 4.2 and Y = 40/5 = 8.
xi    yi    X - xi    Y - yi    (X - xi)*(Y - yi)    (X - xi)²
1     3      3.2        5            16.0               10.24
2     4      2.2        4             8.8                4.84
4     8      0.2        0             0.0                0.04
6     10    -1.8       -2             3.6                3.24
8     15    -3.8       -7            26.6               14.44
Sum                                  55.0               32.8
The slope of the line of best fit can be calculated from the formula as follows:
m = Σ(X - xi)*(Y - yi) / Σ(X - xi)² = 55 / 32.8 ≈ 1.68
c = Y - mX
c = 8 - 1.68*4.2 = 0.94
Thus, the equation of the line of best fit becomes, y = 1.68x + 0.94.
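As a minimal sketch (not from the notes), the snippet below verifies the worked example above with NumPy using the same five data points; the small difference in the intercept comes from rounding m to 1.68 before computing c in the hand calculation.

import numpy as np

x = np.array([1, 2, 4, 6, 8])
y = np.array([3, 4, 8, 10, 15])

x_mean, y_mean = x.mean(), y.mean()                         # 4.2 and 8.0
m = np.sum((x_mean - x) * (y_mean - y)) / np.sum((x_mean - x) ** 2)
c = y_mean - m * x_mean

print(round(m, 2), round(c, 2))   # ≈ 1.68 and ≈ 0.96 (0.94 above uses the rounded slope 1.68)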