DSP Unit - III

The document discusses patterns and pattern recognition, defining patterns as abstractions that describe physical objects through attributes. It covers the advantages and disadvantages of pattern recognition, its applications in various fields, and the importance of features in machine learning. Additionally, it explains supervised and unsupervised learning, dimensionality reduction techniques, and classification algorithms.


ASHOKA WOMEN’S ENGINEERING COLLEGE (AUTONOMOUS)

UNIT - III

Pattern:
o A pattern is an abstraction, represented by a set of measurements describing a "physical" object.
o Patterns are everywhere in this digital world. A pattern can either be seen physically or observed mathematically by applying algorithms.
o A pattern gives a description of an object or a notion.
o The description is given in the form of attributes of the object.
o These attributes are also called the features of the object.
Example:
The colors on clothes, speech patterns, etc. In computer science, a pattern is represented using a vector of feature values.


Pattern Recognition
o Pattern recognition is the process of recognizing patterns by using a Machine Learning algorithm. Pattern recognition can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation.
o Pattern recognition is the process which can detect different categories and extract information about particular data.
o Some of the applications of pattern recognition are voice recognition, weather forecasting, object detection in images, etc.

Advantages:
o DNA sequences can be interpreted.
o It is extensively applied in the medical field and in robotics.
o Classification problems can be solved using pattern recognition.
o Biometric detection.
o It can recognize a particular object from different angles.
o It is useful in cloth pattern recognition for visually impaired people.
o Pattern recognition helps in forensic labs.
Disadvantages:
o The syntactic pattern recognition approach is complex to implement and it is a very slow process.
o Sometimes a larger dataset is required to get better accuracy.
o It cannot explain why a particular object is recognized (for example, my face vs my friend's face).

Applications of Pattern Recognition

o Computer vision: Pattern recognition is used to extract meaningful features from given image/video samples and is used in computer vision for various applications such as biological and biomedical imaging.
o Image processing, segmentation and analysis: Pattern recognition is used to give machines the recognition intelligence needed for image processing. Pattern recognition is also used in terrorist detection, credit fraud detection and credit applications.
o Fingerprint identification: The fingerprint recognition technique is a dominant technology in the biometric market. A number of recognition methods have been used to perform fingerprint matching, of which pattern recognition approaches are widely used.


o Radar signal analysis: Pattern recognition and signal processing methods are used in various radar signal classification applications such as AP mine detection and identification.
o Speech recognition: The greatest success in speech recognition has been obtained using pattern recognition paradigms. It is used in various speech recognition algorithms which try to avoid the problems of using a phoneme-level description and treat larger units, such as words, as patterns.

Difference between Machine Learning and Pattern Recognition

Machine Learning | Pattern Recognition
Machine Learning is a method of data analysis that automates analytical model building. | Pattern recognition is the engineering application of various algorithms for the purpose of recognizing patterns in data.
Machine Learning is more on the practical side. | Pattern recognition is more on the theoretical side.
It can be a solution to a real-time problem. | It can itself be a real-time problem.
We need machines/computers to apply Machine Learning algorithms. | Pattern recognition may also be carried out outside the machine.

Features:

o Features may be represented as continuous, discrete, or discrete binary variables. A feature is a function of one or more measurements, computed so that it quantifies some significant characteristic of the object.
Example: Consider our face; then the eyes, ears, nose, etc. are features of the face.
A set of features taken together forms the feature vector.
Example: In the above example of a face, if all the features (eyes, ears, nose, etc.) are taken together, then the sequence is a feature vector ([eyes, ears, nose]). A feature vector is a sequence of features represented as a d-dimensional column vector. In the case of speech, MFCC (Mel-Frequency Cepstral Coefficients) is the spectral feature of the speech.
A pattern recognition system possesses the following characteristics:
o It should recognize familiar patterns quickly and accurately.
o It should recognize and classify unfamiliar objects.
o It should accurately recognize shapes and objects from different angles.
o It should identify patterns and objects even when they are partly hidden.
o It should recognize patterns quickly, with ease, and with automaticity.


Why are Features Important in Machine Learning?

o Features are very important in machine learning because they are the building blocks of datasets: the quality of the features in your dataset has a major impact on the quality of the insights you will gain when using the dataset for machine learning.

Feature Vectors
o Usually a single object can be represented using several features, e.g.
o x1 = shape (e.g. number of sides)
o x2 = size (e.g. some numeric value)
o x3 = color (e.g. RGB values)
o ...
o xd = some other (numeric) feature.

Real-time Examples and Explanations:

o A pattern is a physical object or an abstract notion.
o While talking about various types of balls, the description of a ball is a pattern.
o In the case of balls considered as patterns, the classes could be football, cricket ball, table tennis ball, etc.
o Given a new pattern, the class of the pattern is to be determined. The choice of attributes and the representation of patterns is a very important step in pattern classification.
o A good representation is one that makes use of discriminating attributes and also reduces the computational burden in pattern classification.
o An obvious representation of a pattern is a vector. Each element of the vector can represent one attribute of the pattern. The first element of the vector will contain the value of the first attribute for the pattern being considered.


REPRESENTATION OF PATTERN
o Patterns can be represented in a number of ways.
o All the ways pertain to giving the values of the features used for that particular pattern.
o For supervised learning, where a training set is given, each pattern in the training set will also have the class of the pattern given.

Representing patterns as vectors

o The most popular method of representing patterns is as vectors.
o Here, the training dataset may be represented as a matrix of size n x d, where each row corresponds to a pattern and each column represents a feature.
o Each attribute/feature/variable is associated with a domain. A domain is a set of numbers; each number pertains to a value of that attribute for a particular pattern.
o The class label is a dependent attribute which depends on the 'd' independent attributes.

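As a concrete illustration of the vector representation, the short NumPy sketch below stores a small training set as an n x d matrix together with a separate vector of class labels; the feature values and class names are invented for illustration.

import numpy as np

# 4 patterns (rows), 2 features (columns): e.g. diameter in cm and weight in g (illustrative values)
X = np.array([
    [7.2, 160.0],   # pattern 1
    [7.0, 156.0],   # pattern 2
    [22.0, 430.0],  # pattern 3
    [21.8, 425.0],  # pattern 4
])
y = np.array(["cricket ball", "cricket ball", "football", "football"])  # class label per pattern

n, d = X.shape            # n patterns, d features
print(n, d)               # -> 4 2
print(X[0], y[0])         # first pattern (feature vector) and its class label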
Representing patterns as strings

o Here each pattern is a string of characters from an alphabet.
o This is generally used to represent gene expressions.
o For example, DNA can be represented as GTGCATCTGACTCCT..., and the corresponding RNA is expressed as 5' GUGCAUCUGACUCCU....
o Each string of characters represents a pattern. Operations like pattern matching or finding the similarity between strings are carried out with these patterns.

Representing patterns by using logical operators


 Here each pattern is represented by a sentence (well formed formula) in a logic.
 An example would be
if (beak(x) = red) and (colour(x) = green) then parrot(x)


Curse of Dimensionality
o Handling high-dimensional data is very difficult in practice; this is commonly known as the curse of dimensionality.
o If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex.
o As the number of features increases, the number of samples needed to cover the feature space also increases proportionally, and the chance of overfitting increases.
o If a machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor performance.
o Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction.

Dimensionality Reduction
o The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction.
o In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated.
o Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
o A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information."
o These techniques are widely used in machine learning for obtaining a better-fitting predictive model while solving classification and regression problems.


 It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
 Different methods can be used to reduce dimensionality:
• Feature extraction
• Feature selection
 Feature extraction finds a set of new features (i.e., through some mapping f()) from
the existing features. The mapping f() could be linear or non-linear.

 Feature selection is a process of choosing a subset of features from the original set of
features. It usually involves three ways:
 Filter
 Wrapper
 Embedded

Methods of Dimensionality Reduction

o The various methods used for dimensionality reduction include:
o Principal Component Analysis (PCA)
o Linear Discriminant Analysis (LDA)
o Generalized Discriminant Analysis (GDA)
o Dimensionality reduction may be either linear or non-linear, depending upon the method used. The primary linear method is called Principal Component Analysis, or PCA.


Principal Component Analysis (PCA)

This method was introduced by Karl Pearson. It works on the condition that when data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximum.

It involves the following steps:

o Construct the covariance matrix of the data.
o Compute the eigenvectors of this matrix.
o The eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data.
o Hence, we are left with a smaller number of eigenvectors, and there may have been some data loss in the process. However, the most important variances should be retained by the remaining eigenvectors.
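The PCA steps listed above can be followed directly in NumPy. The sketch below is a minimal illustration on randomly generated data; the dataset and the choice of keeping two components are assumptions made only for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features (illustrative data)
X = X - X.mean(axis=0)                   # centre the data

cov = np.cov(X, rowvar=False)            # step 1: covariance matrix (5 x 5)
eigvals, eigvecs = np.linalg.eigh(cov)   # step 2: eigenvalues and eigenvectors

order = np.argsort(eigvals)[::-1]        # step 3: sort eigenvectors by largest eigenvalue
components = eigvecs[:, order[:2]]       # keep only the top 2 eigenvectors

X_reduced = X @ components               # project the data onto the reduced axes
print(X_reduced.shape)                   # -> (100, 2)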
Advantages of Dimensionality Reduction
o It helps in data compression, and hence reduces the required storage space.
o It reduces computation time.
o It also helps remove redundant features, if any.

Disadvantages of Dimensionality Reduction

o It may lead to some amount of data loss.
o PCA tends to find linear correlations between variables, which is sometimes undesirable.
o PCA fails in cases where mean and covariance are not enough to define the dataset.
o We may not know how many principal components to keep; in practice, some rules of thumb are applied.


Supervised and Unsupervised Learning:

Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
The supervised learning model takes direct feedback to check if it is predicting the correct output or not. | The unsupervised learning model does not take any feedback.
The supervised learning model predicts the output. | The unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems.
Supervised learning can be used for those cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
A supervised learning model produces an accurate result. | An unsupervised learning model may give a less accurate result as compared to supervised learning.
Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns in a way similar to how a child learns daily routine things from experience.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc. | It includes various algorithms such as Clustering, KNN, and the Apriori algorithm.


CLASSIFICATION—LINEAR AND NON-LINEAR


Classification algorithms can be divided into mainly two categories:
o Linear Models
     o Logistic Regression
     o Support Vector Machines
o Non-linear Models
     o K-Nearest Neighbours
     o Kernel SVM
     o Naïve Bayes
     o Decision Tree Classification
     o Random Forest Classification

PERCEPTRON:
 Perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers
decide whether an input, usually represented by a series of vectors, belongs to a specific class.
A perceptron is a single-layer neural network. It consists of four main parts: input values, weights and bias, a net sum, and an activation function.
 The process begins by taking all the input values and multiplying them by their weights.
 Then, all of these multiplied values are added together to create the weighted sum.
 The weighted sum is then applied to the activation function, producing the perceptron's output.
 The activation function plays the integral role of ensuring the output is mapped between required
values such as (0, 1) or (-1, 1).
 It is important to note that the weight of an input is indicative of the strength of a node.
Similarly, an input's bias value gives the ability to shift the activation function curve up or down.
 As a simplified form of a neural network, specifically a single-layer neural network, perceptrons
play an important role in binary classification.
 This means the perceptron is used to classify data into two parts, hence binary. Sometimes,
perceptrons are also referred to as linear binary classifiers for this reason.
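A minimal NumPy sketch of the weighted-sum-plus-activation process described above, trained with the classic perceptron update rule; the AND-gate data, the learning rate and the number of epochs are illustrative assumptions.

import numpy as np

def step(z):                      # activation function: maps the weighted sum to 0 or 1
    return np.where(z >= 0, 1, 0)

# Tiny linearly separable dataset: inputs and binary labels (an AND gate)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)                   # weights: the strength of each input
b = 0.0                           # bias: shifts the activation threshold up or down

for _ in range(10):               # a few training epochs (enough for this toy problem)
    for xi, target in zip(X, y):
        output = step(np.dot(w, xi) + b)      # weighted sum fed into the activation
        error = target - output
        w += 0.1 * error * xi                 # perceptron update rule
        b += 0.1 * error

print(step(X @ w + b))            # -> [0 0 0 1]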


NEAREST-NEIGHBOUR CLASSIFIER:
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o The K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on its similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
o The KNN algorithm, at the training phase, just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data that are similar to the cat and dog images, and based on the most similar features it will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


o Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie?
o To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we
can easily identify the category or class of a particular dataset.
o Consider the below diagram:


How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

Firstly, we will choose the number of neighbors, so we will choose k = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. Between two points (x1, y1) and (x2, y2) it can be calculated as: d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:

As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
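The K-NN steps above can be written out almost literally in NumPy. The sketch below uses a made-up two-feature dataset and K = 5, as in the example.

import numpy as np
from collections import Counter

# Made-up training data: two features per point and a class label (A or B)
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [6.0, 6.5], [7.0, 7.2],
                    [1.2, 2.5], [6.5, 6.0], [7.5, 7.8]])
y_train = np.array(["A", "A", "A", "B", "B", "A", "B", "B"])

def knn_predict(x_new, k=5):
    # Steps 2-3: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]          # indices of the K nearest neighbours
    # Steps 4-5: count the categories among the neighbours and take the majority
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([2.0, 2.0])))         # -> A (four of the five neighbours are in category A)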


Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o It always needs to determine the value of K, which may sometimes be complex.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.
SUPPORT VECTOR MACHINE:
o Support Vector Machine (SVM) is one of the most popular supervised machine learning algorithms, used for both classification and regression problems.
o Though it can handle regression problems as well, it is best suited for classification.
 The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future.
 This best decision boundary is called a hyperplane.
 There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but
we need to find out the best decision boundary that helps to classify the data points.
 This best decision boundary is known as the hyperplane of SVM.

 The dimensions of the hyperplane depend on the features present in the dataset, which means
if there are 2 features (as shown in image), then hyperplane will be a straight line.
And if there are 3 features, then the hyperplane will be a 2-dimensional plane. We always create a hyperplane that has a maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.
 The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector.
 Since these vectors support the hyperplane, hence called a Support vector.


SVM can be of two types:


1. Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Linear SVM:

 The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.
 We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.

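A minimal scikit-learn sketch of the linear SVM example described above; the coordinate values and the two tags are invented for illustration.

from sklearn.svm import SVC
import numpy as np

# Two features x1, x2 and two tags: 0 = "blue", 1 = "green" (illustrative data)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")      # linear SVM: the hyperplane is a straight line
clf.fit(X, y)

print(clf.support_vectors_)            # the data points closest to the hyperplane
print(clf.predict([[3, 2], [7, 6]]))   # -> [0 1]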
Non-Linear SVM:
 If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line.
 So to separate these data points, we need to add one more dimension.


For linear data, we have used two dimensions x and y, so for non-linear data we will add a third dimension z.
It can be calculated as: z = x^2 + y^2

SVM KERNELS:
o The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts non-separable problems into separable problems.
o It is mostly useful in non-linear separation problems.
o Simply put, the kernel does some extremely complex data transformations and then finds out the process to separate the data based on the labels or outputs defined.
o It makes SVM more powerful, flexible and accurate. The following are some of the types of kernels used by SVM.
Linear Kernel:
o It can be used as a dot product between any two observations. The formula of the linear kernel is as below:
K(x, xi) = sum(x * xi)

Polynomial Kernel:
o It is a more generalized form of the linear kernel and can distinguish curved or non-linear input spaces. Following is the formula for the polynomial kernel:
K(x, xi) = (1 + sum(x * xi))^d
o Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm.

Radial Basis Function (RBF) Kernel:

o The RBF kernel, mostly used in SVM classification, maps the input space into an infinite-dimensional space. The following formula explains it mathematically:
K(x, xi) = exp(-gamma * ||x - xi||^2)
o Here, gamma ranges from 0 to 1. We need to manually specify it in the learning algorithm. A good default value of gamma is 0.1.
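In most libraries the kernel is simply a parameter of the SVM. The scikit-learn sketch below tries the three kernel types above on a small non-linearly separable toy dataset (the make_circles data and the explicit gamma = 0.1 are assumptions for illustration) and prints the training accuracy obtained with each kernel.

from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Non-linearly separable toy data: one class inside a ring of the other
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    # degree is only used by the 'poly' kernel; gamma is used by 'poly' and 'rbf'
    clf = SVC(kernel=kernel, degree=3, gamma=0.1)
    clf.fit(X, y)
    print(kernel, clf.score(X, y))   # training accuracy for each kernel choice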
LOGISTIC REGRESSION:
o Logistic regression is one of the most popular Machine Learning algorithms, and it comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables. It is used for binary classification as well as multi-class classification.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value.
o It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except for how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o It is a probabilistic approach to classification.
o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, or sigmoid function, which predicts two maximum values (0 or 1).

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.


o The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
Logistic Regression Equation:
o The logistic regression equation can be obtained from the linear regression equation.
o We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In logistic regression, y can only be between 0 and 1. To obtain a range between -[infinity] and +[infinity], we take the logarithm of y/(1 - y), and the equation becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
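A minimal NumPy sketch of the sigmoid mapping and the threshold rule described above; the coefficients b0 and b1, the input values and the 0.5 threshold are illustrative assumptions.

import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Assume some already-fitted coefficients b0 (intercept) and b1 (slope)
b0, b1 = -4.0, 1.5
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

probability_pass = sigmoid(b0 + b1 * hours_studied)       # probabilistic values between 0 and 1
predicted_class = (probability_pass >= 0.5).astype(int)   # threshold: above 0.5 -> 1, below -> 0

print(np.round(probability_pass, 2))   # -> [0.08 0.27 0.62 0.88 0.97]
print(predicted_class)                 # -> [0 0 1 1 1]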

Types of Logistic Regression:

o On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep" (the categories are unordered).
3. Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "Medium", or "High" (the categories are ordered).
Example: Weather Prediction
o Weather predictions can be the result of logistic regression. Here, we analyse the data of previous weather reports and predict the possible outcome for a specific day. However, logistic regression would only predict categorical data, such as whether it is going to rain or not.
DECISION TREES:
o Decision Tree is a supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
o It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, which are the Decision Node and the Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.


o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Decision Tree Terminologies:
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.


Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide the S into subsets that contains possible values for the best attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in step 3. Continue this process until a stage is reached where the nodes cannot be classified any further; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node. Finally, the decision
node splits into two leaf nodes (Accepted offers and Declined offer). Consider the below
diagram:
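A minimal scikit-learn sketch of the job-offer example; the encoded feature values and labels are invented for illustration, and the tree uses the Gini index as its attribute selection measure.

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [salary_in_lakhs, distance_from_office_km, cab_facility (1 = yes, 0 = no)] (made-up values)
X = [[3, 20, 0], [9, 25, 0], [10, 5, 0], [12, 30, 1], [11, 28, 0], [4, 6, 1]]
y = ["Declined", "Declined", "Accepted", "Accepted", "Declined", "Declined"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0)  # CART-style classification tree
tree.fit(X, y)

print(export_text(tree, feature_names=["salary", "distance", "cab"]))  # the learned decision rules
print(tree.predict([[11, 4, 0]]))   # classify a new offer (high salary, close to the office)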


Advantages of the Decision Tree:


o It is simple to understand, as it follows the same process which a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree:
o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

NAIVE BAYES CLASSIFIER ALGORITHM:


o Naive Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability
of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.


Why is it called Naive Bayes?

The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:

o Naive: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:

P(A|B) = [P(B|A) * P(A)] / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naive Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.


Problem: If the weather is sunny, then the Player should play or not?
S.NO Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes

Solution: To solve this, first build the frequency table for the weather conditions from the dataset above:

Weather | Yes | No
Overcast | 5 | 0
Rainy | 2 | 2
Sunny | 3 | 2
Total | 10 | 4

Likelihood table for the weather conditions:

Weather | No | Yes | P(Weather)
Overcast | 0 | 5 | 5/14 = 0.35
Rainy | 2 | 2 | 4/14 = 0.29
Sunny | 2 | 3 | 5/14 = 0.35
All | 4/14 = 0.29 | 10/14 = 0.71 |

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71


So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|NO)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

So, as we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny); hence, on a Sunny day, the player can play the game.
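The same calculation can be reproduced in a few lines of plain Python; the sketch below simply recomputes the frequencies and probabilities from the table above.

# Outlook values and the corresponding Play labels from the dataset above
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
           "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
p_yes = play.count("Yes") / n        # P(Yes)   = 10/14 ~ 0.71
p_no = play.count("No") / n          # P(No)    = 4/14  ~ 0.29
p_sunny = outlook.count("Sunny") / n # P(Sunny) = 5/14  ~ 0.35

# Likelihoods: P(Sunny|Yes) = 3/10 and P(Sunny|No) = 2/4
p_sunny_given_yes = sum(o == "Sunny" and p == "Yes" for o, p in zip(outlook, play)) / play.count("Yes")
p_sunny_given_no = sum(o == "Sunny" and p == "No" for o, p in zip(outlook, play)) / play.count("No")

print(round(p_sunny_given_yes * p_yes / p_sunny, 2))   # P(Yes|Sunny) = 0.6
print(round(p_sunny_given_no * p_no / p_sunny, 2))     # P(No|Sunny)  = 0.4 (the notes get 0.41 because they round P(No) and P(Sunny) first)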

Advantages of Naive Bayes Classifier:


o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naive Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

Applications of Naive Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis

RANDOM FOREST ALGORITHM:

o Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.
o It can be used for both Classification and Regression problems in ML.
o It is based on the concept of ensemble learning (taking two or more models for the prediction, where the dataset used may be the same or different), which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
o As the name suggests, "Random Forest is a classifier (one that assigns class labels to input data by mapping each input to a specific category) that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."

o Instead of relying on one decision tree, the random forest takes the prediction from each tree, and based on the majority votes of the predictions, it predicts the final output.
o A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:

Note: To better understand the Random Forest Algorithm, you should have knowledge of the
Decision Tree Algorithm.
Assumptions for Random Forest:
Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
forest classifier:
o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy; even for a large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?


Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with the trees created in the first phase.


The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest classifier
predicts the final decision. Consider the below image:
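A minimal scikit-learn sketch of the idea: N decision trees trained on random subsets of the data, with the final class decided by majority vote (the Iris dataset and the choice of 100 trees are illustrative assumptions).

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # N = 100 decision trees
forest.fit(X_train, y_train)                                       # each tree sees a bootstrap subset

print(forest.score(X_test, y_test))   # accuracy of the majority-vote prediction on unseen data
print(forest.predict(X_test[:3]))     # predicted classes for three new data points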

Applications of Random Forest:


There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.


Advantages of Random Forest:


o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest
o Although random forest can be used for both classification and regression tasks, it is not as suitable for regression tasks.
BAGGING AND BOOSTING:
Bagging and boosting are two popular ensemble learning techniques used in machine learning to
improve the performance of predictive models by combining the predictions of multiple base
models.
1. Bagging (Bootstrap Aggregating):
 In bagging, multiple base models (often referred to as "weak learners") are trained
independently on different random subsets of the training data, with replacement. This
process is known as bootstrapping.
 Each base model is trained on a subset of the data, and predictions are made by averaging
(for regression) or voting (for classification) the predictions of all base models.
 Bagging helps reduce variance by averaging out the predictions of multiple models,
thereby improving the overall generalization performance of the ensemble model.
 Random Forest is a popular ensemble learning algorithm based on bagging, where the
base models are decision trees trained on different subsets of the data.


2. Boosting:

 Unlike bagging, boosting is an iterative ensemble learning technique where base models
are trained sequentially, and each subsequent model focuses on correcting the errors
made by the previous models.
 In boosting, each base model is trained on the entire training set, but with different
weights assigned to the training examples. Examples that are misclassified by earlier
models are given higher weights to force subsequent models to pay more attention to
them.
 Predictions are made by aggregating the weighted predictions of all base models, with
more weight given to the predictions of models that perform better on the training data.
 Boosting helps reduce both bias and variance, leading to improved generalization
performance.
 Gradient Boosting Machines (GBMs) and AdaBoost (Adaptive Boosting) are two popular
boosting algorithms widely used in practice.
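A short scikit-learn comparison of the two techniques; the synthetic dataset and the choice of 50 base learners are assumptions for illustration (both classes use decision trees as their default base learners).

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: 50 trees trained independently on bootstrap samples, combined by voting
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: 50 weak learners trained sequentially, each focusing on the previous errors
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))   # test accuracy of each ensemble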

CLUSTERING IN MACHINE LEARNING:

Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a group
that has less or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and divides the data according to the presence and absence of those similar patterns.
It is an unsupervised learning method; hence no supervision is provided to the algorithm, and it deals with an unlabeled dataset.
After applying this clustering technique, each cluster or group is given a cluster-ID. The ML system can use this ID to simplify the processing of large and complex datasets.


Example:
Let's understand the clustering technique with the real-world example of Mall: When we visit
any shopping mall, we can observe that the things with similar usage are grouped together. Such
as the t-shirts are grouped in one section, and trousers are at other sections, similarly, at
vegetable sections, apples, bananas, Mangoes, etc., are grouped in separate sections, so that we
can easily find out the things. The clustering technique also works in the same way.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc

Applications of Clustering:
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells
o In Search Engines
o In customer segmentation based on choices and preferences
o In Biology
o In Land Use
PARTITIONING CLUSTERING:
 It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method.
 The most common example of partitioning clustering is the K-Means Clustering
algorithm.


 In this type, the dataset is divided into a set of k groups, where K is used to define the
number of pre-defined groups.
The cluster centers are created in such a way that the distance between the data points within one cluster is minimal compared with their distance to another cluster centroid.

K-MEANS CLUSTERING:
 K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters.
 Here K defines the number of pre-defined clusters that need to be created in the process,
as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
 It is a centroid-based algorithm, where each cluster is associated with a centroid.
 The main aim of this algorithm is to minimize the sum of distances between the data
point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
 The k-means clustering algorithm mainly performs two tasks:
1. Determines the best value for K center points or centroids by an iterative process.
2. Assigns each data point to its closest k-center. Those data points which are near to
the particular k-center, create a cluster.

 Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.


 The below diagram explains the working of the K-means Clustering Algorithm:

Algorithm:
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (It can be other from the input dataset).
Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
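The steps above map almost one-to-one onto the short NumPy sketch below; the toy data and the choice K = 2 are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Two well-separated illustrative blobs of points
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(5, 0.5, size=(20, 2))])

K = 2                                                  # Step-1: choose the number of clusters
centroids = X[rng.choice(len(X), K, replace=False)]    # Step-2: pick K random points as centroids

for _ in range(10):                                    # repeat until the assignments stop changing
    # Step-3: assign each point to its closest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step-4: recompute each centroid as the mean of the points assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):          # Step-6: stop when nothing moves
        break
    centroids = new_centroids

print(centroids)      # final cluster centres (should end up near (0, 0) and (5, 5))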
HIERARCHICAL CLUSTERING :
 Hierarchical clustering is another unsupervised machine learning algorithm, which is
used to group the unlabeled datasets into a cluster and also known as hierarchical
cluster analysis or HCA.
 In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram.


Types of hierarchical clustering:

o Agglomerative (bottom-up) clustering: It builds the dendrogram (tree) from the bottom level; it repeatedly merges the most similar (or nearest) pair of clusters and stops when all the data points are merged into a single cluster (i.e., the root cluster).
o Divisive (top-down) clustering: It starts with all data points in one cluster, the root. It splits the root into a set of child clusters, and each child cluster is recursively divided further; it stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
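A minimal sketch of agglomerative (bottom-up) clustering with SciPy, which also builds the dendrogram linkage described above; the toy data points are an assumption for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six illustrative points forming two natural groups
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]])

Z = linkage(X, method="ward")                      # bottom-up merging of the nearest clusters
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters

print(Z)          # each row records one merge: which clusters were merged and at what distance
print(labels)     # e.g. [1 1 1 2 2 2]: the two natural groups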

REGRESSION:
 The term regression is used when you try to find the relationship between variables.
 In Machine Learning, and in statistical modeling, that relationship is used to predict the
outcome of future events.
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables.
 More specifically, Regression analysis helps us to understand how the value of the
dependent variable is changing corresponding to an independent variable when other
independent variables are held fixed.
 It predicts continuous/real values such as temperature, age, salary, price, etc.
 It is a supervised technique.
 In Regression, we plot a graph between the variables which best fits the given datapoints,
using this plot, the machine learning model can make predictions about the data.
 In simple words, "Regression shows a line or curve that passes through all the
datapoints on target-predictor graph in such a way that the vertical distance
between the datapoints and the regression line is minimum." The distance between
datapoints and line tells whether a model has captured a strong relationship or not.

Types of Regression:
1. Linear Regression
2. Polynomial Regression
3. Support Vector Regression
4. Decision Tree Regression
5. Random Forest Regression


1. Linear Regression:
 Linear regression is a statistical regression method which is used for predictive analysis.
 It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
 It is used for solving the regression problem in machine learning.
 Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.
 If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
 The relationship between variables in the linear regression model can be explained using
the below image. Here we are predicting the salary of an employee on the basis of the
year of experience.

 Below is the mathematical equation for Linear regression:


Y = aX + b
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable), and a and b are the linear coefficients.
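A minimal scikit-learn sketch of the salary-versus-experience example; the numbers are invented for illustration, and a and b in Y = aX + b correspond to the fitted coef_ and intercept_.

import numpy as np
from sklearn.linear_model import LinearRegression

years_experience = np.array([[1], [2], [3], [4], [5]])    # X (independent variable)
salary = np.array([30000, 35000, 41000, 45000, 50000])    # Y (dependent variable)

model = LinearRegression().fit(years_experience, salary)
print(model.coef_[0], model.intercept_)   # a (slope) and b (intercept) of the fitted line
print(model.predict([[6]]))               # predicted salary for 6 years of experience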
2. Polynomial Regression:
 Polynomial Regression is a type of regression which models the non-linear dataset using
a linear model.
It is similar to multiple linear regression, but it fits a non-linear curve between the values of x and the corresponding conditional values of y.
 Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover
such datapoints, we need Polynomial regression.
 In Polynomial regression, the original features are transformed into polynomial features
of given degree and then modeled using a linear model. This means the datapoints are
best fitted using a polynomial line.


o The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output and b0, b1, ..., bn are the regression coefficients; x is our independent/input variable.
o The model is still linear because it is linear in the coefficients, even though the features (x^2, x^3, ...) are non-linear.
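A minimal scikit-learn sketch of the idea described above: the original feature is expanded into polynomial features of a given degree and then fitted with an ordinary linear model (the data and degree = 2 are assumptions for illustration).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2     # a non-linear (quadratic) relationship

poly = PolynomialFeatures(degree=2)            # transform x into the columns [1, x, x^2]
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)      # still a linear model in the coefficients
print(model.coef_, model.intercept_)           # recovers roughly [0, 2, 3] and intercept 1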
3. Support vector regression:
 Support Vector Machine is a supervised learning algorithm which can be used for
regression as well as classification problems. So if we use it for regression problems, then
it is termed as Support Vector Regression.
 Support Vector Regression is a regression algorithm which works for continuous
variables. Below are some keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map lower-dimensional data into higher-dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it
is a line which helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a
margin for datapoints.
o Support vectors: Support vectors are the data points which are nearest to the hyperplane and the opposite class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints is covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must contain a
maximum number of datapoints. Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
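A minimal scikit-learn sketch of Support Vector Regression on a noisy non-linear signal; the data, the RBF kernel and the epsilon value are illustrative assumptions.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=40)   # continuous target with some noise

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)   # epsilon sets the width of the boundary-line margin
svr.fit(X, y)

print(len(svr.support_))              # number of support vectors (points on or outside the margin)
print(svr.predict([[1.0], [2.5]]))    # predicted continuous values for new inputs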


COST FUNCTION:
 A cost function is an important parameter that determines how well a machine learning
model performs for a given dataset.
 It calculates the difference between the expected value and predicted value and represents
it as a single real number.
The cost function also plays a crucial role in understanding how well your model estimates the relationship between the input and output parameters.

 In simple, "Cost function is a measure of how wrong the model is in estimating the
relationship between X(input) and Y(output) Parameter."
 A cost function is sometimes also referred to as Loss function, and it can be estimated by
iteratively running the model to compare estimated predictions against the known values
of Y.
 The main aim of each ML model is to determine parameters or weights that can minimize
the cost function.
Why use Cost Function?
If there are different accuracy parameters, why do we need a cost function for the machine learning model? We can understand this with an example of data classification.
Suppose we have a dataset that contains the height and weights of cats & dogs, and we need to
classify them accordingly. If we plot the records using these two features, we will get a scatter
plot as below:

In the above image, the green dots are cats, and the yellow dots are dogs. Below are the three
possible solutions for this classification problem.

In the above solutions, all three classifiers have high accuracy, but the third solution is the best because it correctly classifies each datapoint. It is the best classification because its boundary lies midway between the two classes, neither too close to nor too far from either of them.
 To get such results, we need a cost function; in other words, to obtain the optimal solution, we need a cost function.
 It calculates the difference between the actual values and the predicted values and measures how wrong our model was in its predictions.
 By minimizing the value of the cost function, we can get the optimal solution. Here comes the role of
Gradient descent.
 “Gradient Descent is an optimization algorithm which is used for optimizing the cost function
or error in the model.”
 It enables the model to follow the gradient, i.e., the direction that reduces the error, until the least possible error is reached. Here, direction refers to how the model parameters should be corrected to further reduce the cost function.
 The error in your model can be different at different points, and you have to find the quickest way to
minimize it, to prevent resource wastage.
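A minimal sketch of gradient descent for a simple linear model y = w·x + b with an MSE cost is given below (the data, learning rate and number of iterations are illustrative assumptions):

# Gradient descent sketch: repeatedly step w and b against the gradient of the MSE cost.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])     # true relationship: y = 2x + 1

w, b = 0.0, 0.0                              # initial parameters
lr = 0.01                                    # learning rate (step size)

for _ in range(5000):
    y_pred = w * x + b
    error = y_pred - y
    dw = (2 / len(x)) * np.sum(error * x)    # gradient of MSE with respect to w
    db = (2 / len(x)) * np.sum(error)        # gradient of MSE with respect to b
    w -= lr * dw                             # move in the direction that reduces the cost
    b -= lr * db

print(w, b)                                  # should approach w ≈ 2, b ≈ 1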
Types of the cost function:
There are many cost functions in machine learning and each has its use cases depending on whether it is a
regression problem or classification problem.
1. Regression cost Function
2. Binary Classification cost Functions
3. Multi-class Classification Cost Function.
1. Regression cost Function:
Regression models deal with predicting a continuous value for example salary of an employee, price of a
car, loan prediction, etc. A cost function used in the regression problem is called “Regression Cost
Function”. They are calculated on the distance-based error as follows:
Error = y-y’ (actual output- predicted output)
Where,
Y – Actual output
Y’ – Predicted output
The most used Regression cost functions are below,

1.1 Mean Error (ME):


 In this type of cost function, the error is calculated for each training data, and then the
mean of all error values is taken.
 It is one of the simplest ways possible.
 The errors that occurred from the training data can be either negative or positive.
 While taking the mean, positive and negative errors can cancel each other out and give a zero mean error even for a poor model, so it is not a recommended cost function.
 However, it provides a base for the other regression cost functions.
1.2 Mean Squared Error (MSE):
 Mean Squared Error is one of the most commonly used cost functions.
 It improves the drawbacks of the Mean error cost function, as it calculates the square of
the difference between the actual value and predicted value.
 The formula for calculating MSE is given below:

MSE = (sum of squared errors)/n


 It is also known as L2 loss.
 In MSE, each error is squared; squaring shrinks small deviations (less than 1) and magnifies large ones, so large errors are penalized more heavily than in MAE.
 But if the dataset has outliers that generate more prediction errors, then squaring of this
error will further increase the error multiple times. Hence, we can say MSE is less
robust to outliers.
1.3 Mean Absolute Error (MAE):
 Mean Absolute Error also overcomes the issue of the Mean Error cost function by taking the absolute difference between the actual value and the predicted value.
 The formula for calculating Mean Absolute Error is given below:

MAE = (sum of absolute errors)/n


 It is also known as L1 Loss.
 It is robust to outliers, so it gives better results even when the dataset contains noise or outliers.
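The three regression cost functions above can be compared with a few lines of Python (NumPy assumed; the actual and predicted values are made up):

# Mean Error, Mean Absolute Error and Mean Squared Error on illustrative values.
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])

error = y_actual - y_pred
me  = np.mean(error)             # Mean Error: positive and negative errors can cancel out
mae = np.mean(np.abs(error))     # Mean Absolute Error (L1 loss)
mse = np.mean(error ** 2)        # Mean Squared Error (L2 loss)

print(me, mae, mse)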
2. Binary Classification Cost Functions:
 Classification models are used to make predictions of categorical variables, such as
predictions for 0 or 1, Cat or dog, etc.
 The cost function used in the classification problem is known as the Classification cost
function. However, the classification cost function is different from the Regression cost
function.
 One of the commonly used loss functions for classification is cross-entropy loss.

 Binary cross-entropy is a special case of categorical cross-entropy where there are only two classes, so a single output probability is enough. For example, classification between red and blue.
 To better understand it, let's suppose there is only a single output variable Y:
Cross-entropy = − [y·log(p) + (1 − y)·log(1 − p)]
which reduces to −log(p) when y = 1 and to −log(1 − p) when y = 0.
 The error in binary classification is calculated as the mean of the cross-entropy over all N training samples, which means:
Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N
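A hedged sketch of this calculation (NumPy assumed; the labels and predicted probabilities are illustrative) is:

# Binary cross-entropy: -log(p) when y = 1, -log(1 - p) when y = 0, averaged over N samples.
import numpy as np

y = np.array([1, 0, 1, 1, 0])                # actual labels
p = np.array([0.9, 0.1, 0.8, 0.6, 0.3])      # predicted probability of class 1

losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
binary_cross_entropy = np.mean(losses)       # mean over the N training samples
print(binary_cross_entropy)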
3. Multi-class Classification Cost Function:
 A multi-class classification cost function is used in the classification problems for
which instances are allocated to one of more than two classes.
 Here also, similar to binary class classification cost function, cross-entropy or
categorical cross-entropy is commonly used cost function.
 It is designed so that it can be used for multi-class classification with target classes labelled 0, 1, 2, ..., n.
 In a multi-class classification problem, cross-entropy generates a score that summarizes the mean difference between the actual and predicted probability distributions. The score is minimized during training, and a perfect model has a cross-entropy of zero.
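A minimal sketch of categorical cross-entropy for a 3-class problem (NumPy assumed; the one-hot targets and predicted probabilities are made up) is:

# Categorical cross-entropy: -sum(y_true * log(y_prob)) per sample, averaged over all samples.
import numpy as np

y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])               # one-hot actual classes
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6]])         # predicted class probabilities

per_sample = -np.sum(y_true * np.log(y_prob), axis=1)
categorical_cross_entropy = np.mean(per_sample)
print(categorical_cross_entropy)             # zero only for a perfect model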
TRAINING AND TESTING A CLASSIFIER:
 In machine learning data preprocessing, we divide our dataset into a training dataset and a testing dataset. The image below shows a complete dataset, that is, a collection of data.
 For example, a student dataset may include roll number, names, subjects, marks, percentage, etc. This is called one dataset, and it can be divided into two parts: 1. Training Dataset 2. Testing Dataset.

 The training dataset is used for training purposes (to train the model), and the testing dataset is used for testing purposes (to evaluate the model).
1. Training Dataset:
 The training dataset is provided as input to this phase.
 In this phase, the attributes and class labels are used to train the machine learning algorithm and prepare the model.
 A machine can learn when it observes relevant data.
 From that data it finds relationships, detects patterns, understands complex problems and makes decisions.
 Training error is the error obtained by applying the model to the same data on which it was trained.

2. Testing Dataset:
 The testing dataset is provided as input to this phase.
 The test dataset is a dataset for which the class label is unknown; it is tested using the model.
 A test dataset is used for the assessment of the finally chosen model.
 The training and testing datasets are completely different.
 Testing error is the error that occurs when the model is evaluated on unknown data.
 In simple terms, when the actual output of the testing data does not match the predicted output of the model, a testing error occurs.
Example: Training Vs Testing
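As a hedged sketch of this split (scikit-learn assumed; the small student table is made up), the dataset is usually divided with train_test_split:

# Split one dataset into a training part (to train the model) and a testing part (to evaluate it).
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "marks":      [35, 48, 56, 61, 70, 75, 82, 90],
    "percentage": [43, 55, 62, 68, 74, 79, 85, 92],
    "passed":     [0,  1,  1,  1,  1,  1,  1,  1],
})

X = data[["marks", "percentage"]]   # attributes used for training
y = data["passed"]                  # class label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)   # 75% training, 25% testing

print(len(X_train), len(X_test))    # 6 training rows, 2 testing rows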

CROSS-VALIDATION IN MACHINE LEARNING:


 Cross-validation is a technique for validating the model's efficiency by training it on a subset of the input data and testing it on a previously unseen subset of the input data.
 In machine learning, there is always a need to test the stability of the model, and we cannot judge it using only the dataset it was trained on. For this purpose, we reserve a particular sample of the dataset which was not part of the training dataset. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is something different from the general train-test split.

 Hence the basic steps of cross-validations are:


o Reserve a subset of the dataset as a validation set.
o Provide the training to the model using the training dataset.
o Now, evaluate model performance using the validation set. If the model performs
well with the validation set, perform the further step, else check for the issues.

Methods used for Cross-Validation:


There are some common methods that are used for cross-validation. These methods are given below:
1. Validation Set Approach
2. Leave-p-out cross-validation (LpOCV)
3. Leave-one-out cross-validation (LOOCV)
4. K-fold cross-validation
5. Stratified k-fold cross-validation
6. Holdout Method
1. Validation Set Approach:
 In the validation set approach, we divide our input dataset into a training set and a test (validation) set. Each subset receives 50% of the dataset: 50% for training and 50% for testing.
 Its big disadvantage is that we use only 50% of the data to train the model, so the model may miss important information in the dataset. It also tends to give an underfitted model.
2. Leave-p-out cross-validation:
 In this approach, p data points are left out of the training data. If there are a total of n data points in the original input dataset, then n − p data points are used as the training set and the remaining p data points as the validation set.
 This complete process is repeated for all possible choices of the p held-out points, and the average error is calculated to know the effectiveness of the model.
 A disadvantage of this technique is that it can be computationally expensive for large p.

3. Leave one out cross-validation:


 This method is similar to leave-p-out cross-validation, but instead of leaving out p data points we leave out only one. For each learning set, only one datapoint is reserved, and the remaining dataset is used to train the model.
 This process repeats for each datapoint. Hence for n samples, we get n different training sets and n test sets. It has the following features:
o In this approach, the bias is minimum as all the data points are used.
o The process is executed for n times; hence execution time is high.
o This approach leads to high variation in testing the effectiveness of the model as we
iteratively check against one data point.
4. K-Fold Cross-Validation
 The k-fold cross-validation approach divides the input dataset into k groups of samples of equal size. These samples are called folds.
 For each learning set, the prediction function is trained on k − 1 folds, and the remaining fold is used as the test set.
 This approach is a very popular cross-validation approach because it is easy to understand, and the output is less biased than that of other methods.
 The steps for k-fold cross-validation are:
o Split the input dataset into K groups
o For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset
o Fit the model on the training set and evaluate the performance of the
model using the test set.
 Let's take an example of 5-fold cross-validation, so the dataset is grouped into 5 folds.
On the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it. On the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold.
 Consider the below diagram:
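In code, k-fold cross-validation can be sketched as follows (scikit-learn, the iris dataset and the classifier choice are assumptions for illustration):

# 5-fold cross-validation: each fold is used exactly once as the test set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)

print(scores)           # one accuracy value per fold
print(scores.mean())    # average performance across the 5 folds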

5. Stratified k-fold cross-validation:
 This technique is similar to k-fold cross-validation, with a few changes. It is based on the concept of stratification: the data is rearranged to ensure that each fold (group) is a good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance.
 It can be understood with an example of housing prices, where the price of some houses can be much higher than that of others. To tackle such situations, a stratified k-fold cross-validation technique is useful, as the sketch below shows.
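A minimal sketch (same assumptions as the k-fold example above) only swaps in StratifiedKFold, so that every fold keeps roughly the same class proportions as the complete dataset:

# Stratified 5-fold cross-validation: folds preserve the class distribution.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(model, X, y, cv=skf).mean())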
6. Holdout Method:
 This method is the simplest cross-validation technique of all. In this method, we remove a subset of the training data and use it to get prediction results from a model trained on the rest of the dataset.
 The error that occurs in this process tells how well our model will perform on an unknown dataset. Although this approach is simple to perform, it still suffers from high variance, and it sometimes produces misleading results.
CLASS-IMBALANCE – WAYS OF HANDLING:
Class imbalance occurs when the classes in a dataset are not represented equally, for example when most of the samples belong to one class and only a few to the other. A classifier trained on such data tends to favour the majority class, so the imbalance has to be handled explicitly.
Techniques for handling Imbalanced Data:
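Commonly used techniques include resampling the data (oversampling the minority class or undersampling the majority class), generating synthetic minority samples, assigning class weights so the algorithm pays more attention to the rare class, and preferring metrics such as precision, recall and F1-score over plain accuracy. Below is a minimal sketch of two of these options (scikit-learn assumed; the tiny dataset is made up for illustration):

# Handling class imbalance: class weighting and random oversampling of the minority class.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

data = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "label":   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],   # class 1 is the minority class
})

# Option 1: class weighting - the minority class automatically gets a larger weight.
weighted_model = LogisticRegression(class_weight="balanced")
weighted_model.fit(data[["feature"]], data["label"])

# Option 2: random oversampling - duplicate minority samples until the classes are balanced.
majority = data[data["label"] == 0]
minority = data[data["label"] == 1]
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())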

CONFUSION MATRIX IN MACHINE LEARNING:


 The confusion matrix is a matrix used to determine the performance of the classification
models for a given set of test data.
 It can only be determined if the true values for test data are known. The matrix itself can be
easily understood, but the related terminologies may be confusing.
 Since it shows the errors in the model performance in the form of a matrix, hence also known
as an error matrix. Some features of Confusion matrix are given below:
 For a classifier with 2 prediction classes, the matrix is a 2×2 table; for 3 classes, it is a 3×3 table, and so on.
 The matrix is divided into two dimensions that are predicted values and actual values along
with the total number of predictions.
 Predicted values are those values, which are predicted by the model, and actual values are the
true values for the given observations.
 It looks like the below table:

                    Predicted: No        Predicted: Yes
  Actual: No        True Negative        False Positive
  Actual: Yes       False Negative       True Positive
The above table has the following cases:


o True Negative: The model has predicted No, and the real or actual value was also No.
o True Positive: The model has predicted Yes, and the actual value was also Yes.
o False Negative: The model has predicted No, but the actual value was Yes; it is also called a Type-II error.
o False Positive: The model has predicted Yes, but the actual value was No; it is also called a Type-I error.
Need for Confusion Matrix in Machine learning:
o It evaluates the performance of the classification models, when they make predictions on
test data, and tells how good our classification model is.
o It not only tells the error made by the classifiers but also the type of errors such as it is
either type-I or type-II error.
o With the help of the confusion matrix, we can calculate the different parameters for the
model, such as accuracy, precision, recall etc.
Example: We can understand the confusion matrix using an example.
Suppose we are trying to create a model that predicts whether or not a person has a particular disease. The confusion matrix for this is given as:
From the above example, we can conclude that:


o The table is given for a two-class classifier, which has two predictions, "Yes" and "No." Here, Yes means the patient has the disease, and No means the patient does not have the disease.
o The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are true (correct) predictions, and 11 are incorrect predictions.
o The model has given the prediction "Yes" 32 times and "No" 68 times, whereas the actual "Yes" occurred 27 times and the actual "No" 73 times.

Calculations using Confusion Matrix (evaluation metrics.):


We can perform various calculations for the model, such as the model's accuracy, using this
matrix. These calculations are given below:
o Classification Accuracy: It is one of the important parameters to determine the accuracy of classification problems. It defines how often the model predicts the correct output. It can be calculated as the ratio of the number of correct predictions made by the classifier to the total number of predictions made by the classifier:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

o Misclassification rate: It is also termed the Error rate, and it defines how often the model gives wrong predictions. It can be calculated as the ratio of the number of incorrect predictions to the total number of predictions made by the classifier:
Error rate = (FP + FN) / (TP + TN + FP + FN)

o Precision: Out of all the instances that the model predicted as positive, precision measures how many were actually positive. It can be calculated using the formula:
Precision = TP / (TP + FP)

o Recall: Out of all the actual positive instances, recall measures how many the model predicted correctly. The recall should be as high as possible. It can be calculated using the formula:
Recall = TP / (TP + FN)

o F-measure: If two models have low precision and high recall or vice versa, it is difficult to compare them. For this purpose, we can use the F-score, which evaluates recall and precision at the same time. The F-score is maximum when recall is equal to precision. It can be calculated using the formula:
F-measure = (2 × Precision × Recall) / (Precision + Recall)
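As a hedged sketch, these metrics can be computed directly from the four cells of the confusion matrix; the cell counts used here are inferred from the totals stated in the disease example above (100 predictions, 89 correct, 32 predicted "Yes", 27 actual "Yes"):

# Evaluation metrics computed from TP, TN, FP, FN.
TP, TN, FP, FN = 24, 65, 8, 3
total = TP + TN + FP + FN

accuracy   = (TP + TN) / total                 # 0.89 - how often the model is correct
error_rate = (FP + FN) / total                 # 0.11 - misclassification rate
precision  = TP / (TP + FP)                    # 0.75 - predicted positives that are truly positive
recall     = TP / (TP + FN)                    # ~0.89 - actual positives that were found
f_measure  = 2 * precision * recall / (precision + recall)   # ~0.81

print(accuracy, error_rate, precision, recall, f_measure)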

Other important terms used in Confusion Matrix:


o Null Error rate: It defines how often our model would be incorrect if it always predicted
the majority class. As per the accuracy paradox, it is said that "the best classifier has a
higher error rate than the null error rate."
o ROC Curve (Receiver Operating Characteristics): The ROC is a graph displaying a
classifier's performance for all possible thresholds. The graph is plotted between the true
positive rate (on the Y-axis) and the false Positive rate (on the x-axis).
 ROC curve is used for visual comparison of classification models which shows the trade-
off between the true positive rate and the false positive rate.
 The area under the ROC curve (AUC) is a measure of the accuracy of the model. The closer a model's curve is to the diagonal, the less accurate it is; a model with perfect accuracy has an area of 1.0.
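A short sketch of obtaining the ROC curve points and the AUC (scikit-learn assumed; the labels and scores are illustrative):

# ROC curve points and AUC for a small set of true labels and predicted scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]   # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                   # area under the curve (1.0 = perfect)
print(auc)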

COEFFICIENT OF DETERMINATION DEFINITION (R – SQUARE):


 The coefficient of determination or R squared method is the proportion of the variance in the
dependent variable that is predicted from the independent variable. It indicates the level of variation
in the given data set.
 The coefficient of determination is the square of the correlation (r); thus it ranges from 0 to 1.
 With linear regression, the coefficient of determination is equal to the square of the correlation
between the x and y variables.
 If R² is equal to 0, then the dependent variable cannot be predicted from the independent variable.
 If R² is equal to 1, then the dependent variable can be predicted from the independent variable without any error.
 If R² is between 0 and 1, then it indicates the extent to which the dependent variable is predictable.
 An R² of 0.10 means that 10 percent of the variance in the y variable is predicted from the x variable.
 An R² of 0.20 means that 20 percent of the variance in the y variable is predicted from the x variable, and so on.

Calculating the coefficient of determination:


Formula 1: Using the correlation coefficient

R² = r²

Where r = Pearson correlation coefficient


Example: Calculating R² using the correlation coefficient
You are studying the relationship between heart rate and age in children, and you find that the two variables have a negative Pearson correlation:
This value can be used to calculate the coefficient of determination (R²) using Formula 1:

Formula 2: Using the regression outputs

R² = 1 − (RSS / TSS)

Where:
 RSS = sum of squared residuals
 TSS = total sum of squares
Example: Calculating R² using regression outputs
As part of performing a simple linear regression that predicts students’ exam scores (dependent variable) from their study time (independent variable), you calculate that:

These values can be used to calculate the coefficient of determination (R²) using Formula 2:
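As a hedged sketch, both formulas can be checked in Python on made-up (x, y) data (NumPy assumed; the numbers below are not the ones from the examples above):

# R² computed two ways: as the squared correlation and as 1 - RSS/TSS.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, c = np.polyfit(x, y, 1)                 # simple linear regression y = m*x + c
y_pred = m * x + c

r = np.corrcoef(x, y)[0, 1]                # Pearson correlation coefficient
r_squared_1 = r ** 2                       # Formula 1

rss = np.sum((y - y_pred) ** 2)            # sum of squared residuals
tss = np.sum((y - np.mean(y)) ** 2)        # total sum of squares
r_squared_2 = 1 - rss / tss                # Formula 2

print(r_squared_1, r_squared_2)            # the two values agree for linear regression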

LEAST SQUARE:

 The least-squares method can be defined as a statistical method that is used to find the
equation of the line of best fit related to the given data.
 This method is called so as it aims at reducing the sum of squares of deviations as much
as possible.
 The line obtained from such a method is called a regression line.

Formula of Least Square Method:


The formula used in the least squares method and the steps used in deriving the line of best fit from this method are discussed as follows:
 Step 1: Denote the independent variable values as xi and the dependent ones as yi.
 Step 2: Calculate the average values of xi and yi as X and Y.
 Step 3: Presume the equation of the line of best fit as y = mx + c, where m is the slope of the
line and c represents the intercept of the line on the Y-axis.
 Step 4: The slope m can be calculated from the following formula:
m = [Σ (X – xi) × (Y – yi)] / Σ (X – xi)²
 Step 5: The intercept c is calculated from the following formula:
c = Y – mX
Thus, we obtain the line of best fit as y = mx + c, where values of m and c can be calculated
from the formulae defined above.
Least Square Method Graph:
Let us have a look at how the data points and the line of best fit obtained from the least squares
method look when plotted on a graph.

The red points in the above plot represent the data points for the sample data available. Independent
variables are plotted as x-coordinates and dependent ones are plotted as y-coordinates. The equation
of the line of best fit obtained from the least squares method is plotted as the red line in the graph.
We can conclude from the above graph how the least squares method helps us to find a line that best fits the given data points, which can then be used to make further predictions about the value of the dependent variable where it is not known initially.

Least Square Method Solved Example:


Problem 1: Find the line of best fit for the following data points using the least squares
method: (x,y) = (1,3), (2,4), (4,8), (6,10), (8,15).
Solution:
Here, we have x as the independent variable and y as the dependent variable. First, we
calculate the means of x and y values denoted by X and Y respectively.
X = (1+2+4+6+8)/5 = 4.2
Y = (3+4+8+10+15)/5 = 8

   xi      yi      X – xi    Y – yi    (X – xi)*(Y – yi)    (X – xi)²
   1       3        3.2        5             16.0              10.24
   2       4        2.2        4              8.8               4.84
   4       8        0.2        0              0.0               0.04
   6       10      -1.8       -2              3.6               3.24
   8       15      -3.8       -7             26.6              14.44
 Sum (Σ)             0          0             55.0              32.80

The slope of the line of best fit can be calculated from the formula as follows:

m = (Σ (X – xi) × (Y – yi)) / Σ (X – xi)²


m = 55/32.8 = 1.68 (rounded to 2 decimal places)

Now, the intercept will be calculated from the formula as follows:

c = Y – mX

c = 8 – 1.68*4.2 = 0.94

Thus, the equation of the line of best fit becomes y = 1.68x + 0.94.
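The worked example can be verified with a short NumPy sketch (same five data points as Problem 1):

# Least squares slope and intercept for the Problem 1 data.
import numpy as np

x = np.array([1, 2, 4, 6, 8], dtype=float)
y = np.array([3, 4, 8, 10, 15], dtype=float)

X, Y = x.mean(), y.mean()                              # X = 4.2, Y = 8

m = np.sum((X - x) * (Y - y)) / np.sum((X - x) ** 2)   # 55 / 32.8 ≈ 1.68
c = Y - m * X                                          # ≈ 0.96 here (0.94 in the notes, where m is rounded before computing c)

print(f"y = {m:.2f}x + {c:.2f}")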
