PP&DS 4

Pattern recognition involves identifying and confirming patterns, which can include fingerprints, handwritten words, and images. It encompasses tasks such as classification, segmentation, and feature extraction, and is crucial in biometric systems for identifying individuals. The document also discusses the challenges of high-dimensional data, known as the 'Curse of Dimensionality', and the importance of dimensionality reduction techniques to improve model performance.

Pattern recognition

➢ Pattern recognition deals with identifying a pattern and confirming it
   again.
➢ In general, a pattern can be a fingerprint image, a handwritten cursive
   word, a human face, a speech signal, a bar code, or a web page on the
   Internet.
➢ The individual patterns are often grouped into various categories based
   on their properties.
➢ When patterns with the same properties are grouped together, the
   resultant group is also a pattern, which is often called a pattern class.
➢ Pattern recognition is the science of observing and distinguishing the
   patterns of interest, and making correct decisions about the patterns or
   pattern classes.
➢ Thus, a biometric system applies pattern recognition to identify and
   classify individuals by comparing their patterns with the stored templates.

Pattern Recognition

The pattern recognition technique conducts the following tasks −


• Classification − Identifying handwritten characters or CAPTCHAs (which
  distinguish humans from computers).
• Segmentation − Detecting text regions or face regions in images.
• Syntactic Pattern Recognition − Determining how a group of math symbols
  or operators are related, and how they form a meaningful expression.
The following table highlights the role of pattern recognition in biometrics −

Pattern Recognition Task                           Input                        Output
Character Recognition (Signature Recognition)      Optical signals or strokes   Name of the character
Speaker Recognition                                Voice                        Identity of the speaker
Fingerprint / Facial image / Hand geometry image   Image                        Identity of the user

Features
A feature is a function of one or more measurements, computed so that it
quantifies some significant characteristic of the object.
Example:
➢ Consider a human face: the eyes, ears, nose, etc. are features of the face.
➢ A set of features taken together forms a feature vector.

Example:
➢ In the above example of a face, if all the features (eyes, ears, nose, etc.)
   are taken together, the resulting sequence is the feature vector ([eyes, ears, nose]).
➢ A feature vector is a sequence of features represented as a d-
   dimensional column vector.

➢ In the case of speech, MFCCs (Mel-Frequency Cepstral Coefficients) are
   the spectral features of the speech signal.
➢ The sequence of the first 13 coefficients forms a feature vector.
Pattern recognition possesses the following features:
➢ A pattern recognition system should recognise familiar patterns quickly and
   accurately
➢ Recognize and classify unfamiliar objects
➢ Accurately recognize shapes and objects from different angles
➢ Identify patterns and objects even when partly hidden
➢ Recognize patterns quickly, with ease, and with automaticity.
Phases in Pattern Recognition System
A pattern recognition system can be divided into components, so its operation
can be represented as a sequence of phases.
➢ Phase 1: Converts images, sounds, or other inputs into signal data.
➢ Phase 2: Isolates the sensed objects from the background.
➢ Phase 3: Measures object properties that are useful for classification.
➢ Phase 4: Assigns the sensed object to a category.
➢ Phase 5: Takes other considerations into account to decide on an appropriate action.
The problems solved by these phases are as follows (a minimal sketch of the
pipeline is given after the list):
1. Sensing: It deals with problems that arise in the input, such as its bandwidth,
   resolution, sensitivity, distortion, signal-to-noise ratio, latency, etc.
2. Segmentation and Grouping: One of the deepest problems in pattern recognition,
   it deals with recognizing or grouping together the various parts of an object.
3. Feature Extraction: It deals with characterizing an object by measurements so
   that it can be recognized easily. Objects whose measured values are very
   similar are considered to be in the same category, while objects whose values
   are very different are placed in different categories.
4. Classification: It deals with assigning an object to its particular category by
   using the feature vector provided by the feature extractor and the values of
   all of the features for that particular input.
5. Post Processing: It deals with deciding on an action by using the output of
   the classifier, such as minimum-error-rate classification that minimizes the
   total expected cost.
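The handoff between these phases can be pictured as a simple pipeline. Below is a
minimal, illustrative Python sketch; the stage functions (sense, segment,
extract_features, classify, post_process) are hypothetical placeholders standing in
for real sensing, segmentation, feature-extraction and classification routines.

# Illustrative pipeline for the five phases above (placeholder logic only).

def sense(raw_input):
    """Phase 1: convert an image/sound into signal data (here, a list of numbers)."""
    return list(raw_input)

def segment(signal):
    """Phase 2: isolate the object of interest from the background."""
    return [s for s in signal if s != 0]            # toy rule: drop background zeros

def extract_features(obj):
    """Phase 3: measure properties useful for classification."""
    return [len(obj), sum(obj) / max(len(obj), 1)]  # e.g. size and mean intensity

def classify(features):
    """Phase 4: assign the object to a category from its feature vector."""
    return "class A" if features[1] > 5 else "class B"

def post_process(label):
    """Phase 5: decide on an action given the classifier output."""
    return f"action for {label}"

raw = [0, 7, 9, 0, 8, 6]
print(post_process(classify(extract_features(segment(sense(raw))))))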
What is Pattern Recognition?

➢ Pattern recognition is the process of recognizing patterns by using
   machine learning algorithms.
➢ Pattern recognition can be defined as the classification of data based on
   knowledge already gained or on statistical information extracted from
   patterns and/or their representation.
➢ One of the important aspects of pattern recognition is its application
   potential.
Patterns Representation
• Patterns can be represented in a number of ways.

Representing patterns as vectors

➢ The most popular method of representing patterns is as vectors.
➢ Here, the dataset may be represented as a matrix of size (n x d),
➢ where each row corresponds to a pattern and each column
   represents a feature.
➢ Each attribute/feature/variable is associated with a domain.
➢ A domain is a set of numbers; each number pertains to the value of
   an attribute for a particular pattern.
➢ The class label is a dependent attribute which depends on the 'd'
   independent attributes.

Example

The dataset could be as follows :

            f1  f2  f3  f4  f5  f6  Class label
Pattern 1:   1   4   3   6   4   7   1
Pattern 2:   4   7   5   7   4   2   2
Pattern 3:   6   9   7   5   3   1   3
Pattern 4:   7   4   6   2   8   6   1
Pattern 5:   4   7   5   8   2   6   2
Pattern 6:   5   3   7   9   5   3   3
Pattern 7:   8   1   9   4   2   8   3
• In this case, n = 7 and d = 6. As can be seen, each pattern has six
  attributes (or features).
• Each attribute in this case is a number between 1 and 9.
• The last number in each line gives the class of the pattern.
• In this case, the class of each pattern is either 1, 2 or 3.
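A minimal sketch of this (n x d) representation, assuming NumPy is available: the
matrix below is the same 7 x 6 dataset, with the class labels kept as a separate vector.

import numpy as np

# The 7 patterns (rows) with 6 features (columns) from the table above.
X = np.array([[1, 4, 3, 6, 4, 7],
              [4, 7, 5, 7, 4, 2],
              [6, 9, 7, 5, 3, 1],
              [7, 4, 6, 2, 8, 6],
              [4, 7, 5, 8, 2, 6],
              [5, 3, 7, 9, 5, 3],
              [8, 1, 9, 4, 2, 8]])

# Class label of each pattern (the dependent attribute).
y = np.array([1, 2, 3, 1, 2, 3, 3])

n, d = X.shape           # n = 7 patterns, d = 6 features
print(n, d, X[0], y[0])  # first pattern and its class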

Representing patterns as strings


• Here each pattern is a string of characters from an alphabet.
• This is generally used to represent gene expressions.
• For example, DNA can be represented as

GTGCATCTGACTCCT...

RNA is expressed as

GUGCAUCUGACUCCU....

• This can be translated into a protein, which would be of the form VHLTPEEK...
• Each string of characters represents a pattern. Operations like pattern
  matching or finding the similarity between strings are carried out with
  these patterns (a short similarity sketch follows below).
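As a small illustration of such string operations, the sketch below computes the
Hamming distance (the number of mismatching positions) between two equal-length
DNA fragments; real sequence analysis would typically use richer measures such as
edit distance or alignment scores.

def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s1, s2))

dna1 = "GTGCATCTGACTCCT"
dna2 = "GTGCATCTGACTGCT"
print(hamming_distance(dna1, dna2))   # 1 mismatching position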
Representing patterns by using logical operators

• Here each pattern is represented by a sentence (well-formed formula)
  in a logic.
• An example would be
  if (beak(x) = red) and (colour(x) = green) then parrot(x)
• Another example would be
  if (has-trunk(x)) and (colour(x) = black) and (size(x) = large) then
  elephant(x)
Curse Of Dimensionality
➢ Curse of Dimensionality refers to a set of problems that arise when
working with high-dimensional data.
➢ The dimension of a dataset corresponds to the number of
attributes/features that exist in a dataset.

➢ A dataset with a large number of attributes, generally of the order of a
   hundred or more, is referred to as high-dimensional data.
➢ Some of the difficulties that come with high-dimensional data manifest
   while analyzing or visualizing the data to identify patterns, and some
   manifest while training machine learning models.
➢ The difficulties related to training machine learning models on high-
   dimensional data are referred to as the 'Curse of Dimensionality'.

"As the number of features or dimensions grows, the amount of data we need
to generalize accurately grows exponentially."
Example: Fig. 1(a) below shows 10 data points in one dimension, i.e. there is
only one feature in the data set.
They can be easily represented on a line with only 10 values, x = 1, 2, 3, ..., 10.
But if we add one more feature, the same data must be represented in 2 dimensions
(Fig. 1(b)), causing the dimension space to increase to 10*10 = 100.
And if we add a 3rd feature, the dimension space increases to 10*10*10 = 1000.
As the number of dimensions grows, the dimension space increases exponentially:
10^1 = 10
10^2 = 100
10^3 = 1000, and so on...

This exponential growth in data causes high sparsity in the data set and
unnecessarily increases storage space and processing time for the particular
modelling algorithm.

EXAMPLE: if we are trying to predict a target that depends on two
attributes, gender and age group, we should ideally capture the targets for
all possible combinations of values of the two attributes, as shown in figure 1.

Figure 1. Combination of values of 2 attributes for generalizing a model


In the above example, we assume that the target value depends on gender
and age group only.

If the target depends on a third attribute, let’s say body type, the number of
training samples required to cover all the combinations increases
phenomenally.

The combinations are shown in figure 2. For two variables, we needed eight
training samples. For three variables, we need 24 samples.
Figure 2. Combination of values of 3 attributes for generalizing a model

The above examples show that, as the number of attributes or
dimensions increases, the number of training samples required to generalize
a model also increases phenomenally.

Dimensionality Reduction
What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given
dataset is known as dimensionality, and the process to reduce these features is
called dimensionality reduction.

A dataset contains a huge number of input features in various cases, which makes
the predictive modeling task more complicated.

Because it is very difficult to visualize or make predictions for a training dataset
with a high number of features, dimensionality reduction techniques are
required in such cases.

A dimensionality reduction technique can be defined as "a way of converting
the higher-dimensional dataset into a lower-dimensional dataset, ensuring that it
provides similar information."

These techniques are widely used in machine learning for obtaining a better-fitting
predictive model while solving classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such
as speech recognition, signal processing, bioinformatics, etc. It can also be used
for data visualization, noise reduction, cluster analysis, etc.

The Curse of Dimensionality

➢ Handling high-dimensional data is very difficult in practice, a problem commonly
   known as the curse of dimensionality.
➢ If the dimensionality of the input dataset increases, any machine learning
   algorithm and model becomes more complex.
➢ As the number of features increases, the number of samples required also
   increases proportionally, and the chance of overfitting also increases.
➢ If a machine learning model is trained on high-dimensional data, it
   becomes overfitted and results in poor performance.
➢ Hence, it is often required to reduce the number of features, which can be
   done with dimensionality reduction.
Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction techniques to a given
dataset are:

o By reducing the dimensions of the features, the space required to store
  the dataset also gets reduced.
o Less computation/training time is required for reduced dimensions of
  features.
o Reduced dimensions of the dataset's features help in visualizing the data
  quickly.
o It removes redundant features (if present).

Disadvantages of dimensionality Reduction

There are also some disadvantages of applying dimensionality reduction,
which are given below:

o Some data may be lost due to dimensionality reduction.
o In a dimensionality reduction technique, the number of principal
  components to consider is sometimes unknown.

Approaches of Dimension Reduction

There are two ways to apply the dimension reduction technique, which are given
below:

Feature Selection

• Feature selection is the process of selecting a subset of the relevant
  features and leaving out the irrelevant features present in a dataset to
  build a model of high accuracy.
• In other words, it is a way of selecting the optimal features from the input
  dataset.
• Three methods are used for feature selection:

1. Filter Methods

In this method, the dataset is filtered, and a subset that contains only the relevant
features is taken. Some common techniques of the filter method are (a minimal
example follows the list):

o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
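A minimal sketch of a filter-style selection, assuming scikit-learn is installed:
SelectKBest scores every feature with the chi-square test and keeps the k best
ones, without involving any learning model.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: 6 samples, 4 non-negative features, binary target.
X = np.array([[1, 9, 0, 3],
              [2, 8, 1, 3],
              [1, 9, 0, 2],
              [8, 1, 7, 3],
              [9, 2, 8, 2],
              [8, 1, 9, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

selector = SelectKBest(score_func=chi2, k=2)    # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                    # chi-square score per feature
print(selector.get_support(indices=True))  # indices of the retained features
print(X_selected.shape)                    # (6, 2)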

2. Wrappers Methods

• The wrapper method has the same goal as the filter method, but it uses a
  machine learning model for its evaluation.
• In this method, some features are fed to the ML model, and its
  performance is evaluated.
• The performance decides whether to add or remove those features to
  increase the accuracy of the model.
• This method is more accurate than the filter method but more complex to
  work with. Some common techniques of wrapper methods are:

o Forward Selection
o Backward Selection
o Bi-directional Elimination

3. Embedded Methods: Embedded methods check the different training
iterations of the machine learning model and evaluate the importance of each
feature. Some common techniques of embedded methods are (a minimal LASSO
example follows the list):

o LASSO
o Elastic Net
o Ridge Regression, etc.
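A minimal sketch of an embedded approach, assuming scikit-learn: fitting a LASSO
regression drives the coefficients of uninformative features toward zero, so the
non-zero coefficients indicate which features to keep.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# The target really depends only on features 0 and 2; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(lasso.coef_, 3))           # near-zero weights mark removable features
kept = np.flatnonzero(np.abs(lasso.coef_) > 1e-3)
print("selected features:", kept)         # expected: features 0 and 2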

Feature Extraction:

• Feature extraction is the process of transforming a space with
  many dimensions into a space with fewer dimensions.
• This approach is useful when we want to keep the whole of the information
  but use fewer resources while processing it.

Some common feature extraction techniques are:

a. Principal Component Analysis


b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction

a. Principal Component Analysis


b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder

Principal Component Analysis (PCA)

• Principal Component Analysis is a statistical process that converts the
  observations of correlated features into a set of linearly uncorrelated
  features with the help of an orthogonal transformation.
• These new transformed features are called the Principal Components.
• It is one of the popular tools used for exploratory data analysis and
  predictive modeling.
• PCA works by considering the variance of each attribute, because high
  variance indicates a good split between the classes, and hence it reduces
  the dimensionality.
• Some real-world applications of PCA are image processing, movie
  recommendation systems, and optimizing the power allocation in various
  communication channels.
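A minimal PCA sketch, assuming scikit-learn is available: the six-feature toy matrix
from the earlier vector-representation example is projected onto its first two
principal components.

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 4, 3, 6, 4, 7],
              [4, 7, 5, 7, 4, 2],
              [6, 9, 7, 5, 3, 1],
              [7, 4, 6, 2, 8, 6],
              [4, 7, 5, 8, 2, 6],
              [5, 3, 7, 9, 5, 3],
              [8, 1, 9, 4, 2, 8]], dtype=float)

pca = PCA(n_components=2)             # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)      # shape (7, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component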

Backward Feature Elimination


The backward feature elimination technique is mainly used while developing
Linear Regression or Logistic Regression model. Below steps are performed in this
technique to reduce the dimensionality or in feature selection:

o In this technique, firstly, all the n variables of the given dataset are taken
to train the model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1
features for n times, and will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the
performance of the model, and then we will drop that variable or features;
after that, we will be left with n-1 features.
o Repeat the complete process until no feature can be dropped.

In this technique, by selecting the optimum performance of the model and the
maximum tolerable error rate, we can define the optimal number of features
required for the machine learning algorithm.

Forward Feature Selection

Forward feature selection follows the inverse of the backward elimination
process. In this technique, we don't eliminate features; instead, we find the
best features that can produce the highest increase in the performance of the
model. The following steps are performed (a sketch covering both directions is
given after the list):

o We start with a single feature only, and progressively we will add each
feature at a time.
o Here we will train the model on each feature separately.
o The feature with the best performance is selected.
o The process will be repeated until we get a significant increase in the
performance of the model.
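Both backward elimination and forward selection can be sketched with scikit-learn's
SequentialFeatureSelector (available in recent versions); under that assumption,
the example below wraps a logistic regression model and searches forward for the
best two features.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(model,
                                n_features_to_select=2,
                                direction="forward")   # use "backward" for elimination
sfs.fit(X, y)

print(sfs.get_support())          # boolean mask of the selected features
print(sfs.transform(X).shape)     # (150, 2)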

Missing Value Ratio

• If a dataset has too many missing values, then we drop those variables as
they do not carry much useful information.
• To perform this, we can set a threshold level, and if a variable has missing
values more than that threshold, we will drop that variable.
• The higher the threshold value, the more efficient the reduction.
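A minimal pandas sketch of the missing-value-ratio rule (pandas assumed available;
the 40% threshold and the column names are arbitrary choices for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [50, np.nan, np.nan, np.nan, 75],   # 60% missing
    "city":   ["A", "B", "A", np.nan, "C"],
})

threshold = 0.4                               # drop columns with >40% missing values
missing_ratio = df.isnull().mean()            # fraction of missing values per column
keep = missing_ratio[missing_ratio <= threshold].index
df_reduced = df[keep]

print(missing_ratio)
print(df_reduced.columns.tolist())            # 'income' is dropped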

Low Variance Filter


• As same as missing value ratio technique, data columns with some changes
in the data have less information.
• Therefore, we need to calculate the variance of each variable, and all data
columns with variance lower than a given threshold are dropped because
low variance features will not affect the target variable.
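A minimal sketch using scikit-learn's VarianceThreshold (assumed available):
columns whose variance falls below the chosen threshold are removed.

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.1, 5.0],
              [0.0, 2.0, 1.0],
              [0.1, 2.1, 9.0],
              [0.0, 2.0, 3.0]])    # the first two columns barely change

selector = VarianceThreshold(threshold=0.5)   # keep columns with variance above 0.5
X_reduced = selector.fit_transform(X)

print(selector.variances_)         # variance of each original column
print(X_reduced)                   # only the third column survives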

High Correlation Filter

• High Correlation refers to the case when two variables carry approximately
similar information.
• Due to this factor, the performance of the model can be degraded.
• The correlation between independent numerical variables gives the
  calculated value of the correlation coefficient.
• If this value is higher than the threshold value, we can remove one of the
  two variables from the dataset, preferring to keep the variable that shows
  the higher correlation with the target variable.
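A minimal pandas sketch of a high-correlation filter (pandas assumed; the 0.9
cut-off is an arbitrary choice): for each pair of features whose absolute
correlation exceeds the threshold, one of the two is dropped.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),   # almost a duplicate of 'a'
    "c": rng.normal(size=200),                       # independent feature
})

threshold = 0.9
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print(to_drop)                                   # ['b'] - highly correlated with 'a'
print(df.drop(columns=to_drop).columns.tolist())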

Random Forest

• Random Forest is a popular and very useful feature selection algorithm in
  machine learning.
• This algorithm has an in-built feature importance measure, so we do
  not need to program it separately.
• In this technique, we generate a large set of trees against the
  target variable, and with the help of the usage statistics of each attribute,
  we find the subset of features.
• The random forest algorithm takes only numerical variables, so we need to
  convert the input data into numeric data using one-hot encoding.
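A minimal sketch of random-forest-based selection, assuming scikit-learn: after
fitting, feature_importances_ exposes the built-in usage statistics from which a
subset of features can be chosen.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

importances = forest.feature_importances_       # built-in importance scores
ranking = np.argsort(importances)[::-1]         # features from most to least important

print(np.round(importances, 3))
print("top 2 features:", ranking[:2])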

Factor Analysis

• Factor analysis is a technique in which each variable is kept within a group
  according to its correlation with other variables; variables within
  a group can have a high correlation among themselves, but they have a
  low correlation with the variables of other groups.
• We can understand it with an example: suppose we have two variables,
  Income and Spend.
• These two variables have a high correlation, which means people with high
  income spend more, and vice versa.
• So, such variables are put into a group, and that group is known as
the factor.
• The number of these factors will be reduced as compared to the original
dimension of the dataset.

Auto-encoders

One of the popular methods of dimensionality reduction is the auto-encoder, which
is a type of ANN (artificial neural network) whose main aim is to copy its inputs
to its outputs. In this, the input is compressed into a latent-space representation,
and the output is produced from this representation. It has two main parts:

o Encoder: The function of the encoder is to compress the input to form the
latent-space representation.
o Decoder: The function of the decoder is to recreate the output from the
latent-space representation.
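A minimal auto-encoder sketch, assuming TensorFlow/Keras is installed; an
8-dimensional input is squeezed through a 3-dimensional latent space (the encoder)
and reconstructed back (the decoder).

import numpy as np
import tensorflow as tf

input_dim, latent_dim = 8, 3

# Encoder: compress the input to the latent-space representation.
inputs = tf.keras.Input(shape=(input_dim,))
latent = tf.keras.layers.Dense(latent_dim, activation="relu")(inputs)
# Decoder: recreate the input from the latent representation.
outputs = tf.keras.layers.Dense(input_dim, activation="linear")(latent)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(500, input_dim).astype("float32")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)   # inputs are also the targets

encoder = tf.keras.Model(inputs, latent)      # reuse the trained encoder on its own
print(encoder.predict(X[:2], verbose=0))      # 3-dimensional codes for the first 2 rows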

Machine Learning
In the real world, we are surrounded by humans who can learn everything from
their experiences with their learning capability, and we have computers or
machines which work on our instructions.

But can a machine also learn from experiences or past data like a human does?
So here comes the role of Machine Learning.



• Machine Learning is said to be a subset of artificial intelligence that is mainly
  concerned with the development of algorithms which allow a computer to
  learn from data and past experiences on its own.
• The term machine learning was first introduced by Arthur Samuel in 1959.
• Machine learning enables a machine to automatically learn from data,
  improve performance from experience, and predict things without being
  explicitly programmed.
• With the help of sample historical data, known as training data,
  machine learning algorithms build a mathematical model that helps in
  making predictions or decisions without being explicitly programmed.
• Machine learning brings computer science and statistics together for
  creating predictive models.
• Machine learning constructs or uses algorithms that learn from
  historical data.
• The more information we provide, the higher the performance will be.
• A machine has the ability to learn if it can improve its performance by
  gaining more data.

How does Machine Learning work

• A Machine Learning system learns from historical data, builds the


prediction models, and whenever it receives new data, predicts the output
for it.
• The accuracy of predicted output depends upon the amount of data, as
the huge amount of data helps to build a better model which predicts the
output more accurately.
• Suppose we have a complex problem where we need to perform some
  predictions; instead of writing code for it, we just need to feed the
  data to generic algorithms, and with the help of these algorithms, the
  machine builds the logic as per the data and predicts the output.
• Machine learning has changed our way of thinking about such problems.

The below block diagram explains the working of Machine Learning algorithm:
Features of Machine Learning:

• Machine learning uses data to detect various patterns in a given dataset.


• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is very similar to data mining, as it also deals with huge
  amounts of data.

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1. Supervised Learning

• Supervised learning is the type of machine learning in which machines are
  trained using well-"labelled" training data (images, text files, videos, etc.),
  and on the basis of that data, machines predict the output.
• In supervised learning, the training data provided to the machines works as
  the supervisor that teaches the machines to predict the output correctly.
• It applies the same concept as a student learning under the supervision of a
  teacher.
• Supervised learning is a process of providing input data as well as correct
  output data to the machine learning model.
• The aim of a supervised learning algorithm is to find a mapping function to
  map the input variable (x) to the output variable (y).
• In the real world, supervised learning can be used for risk assessment,
  image classification, fraud detection, spam filtering, etc.

How Supervised Learning Works?

• In supervised learning, models are trained using a labelled dataset, where
  the model learns about each type of data.
• Once the training process is completed, the model is tested on the basis of
  test data (a held-out subset of the dataset), and then it predicts the output.

The working of supervised learning can be easily understood by the example
and diagram below:
Suppose we have a dataset of different types of shapes which includes square,
rectangle, triangle, and Polygon. Now the first step is that we need to train the
model for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.

Now, after training, we test our model using the test set, and the task of the
model is to identify the shape.

The machine is already trained on all types of shapes, and when it finds a new
shape, it classifies the shape on the basis of its number of sides, and predicts
the output.

Steps Involved in Supervised Learning:

o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training dataset, test dataset, and validation
  dataset.
o Determine the input features of the training dataset, which should carry
  enough knowledge for the model to accurately predict the output.
o Evaluate the accuracy of the model by providing the test set.
o If the model predicts the correct output, it means our model is
  accurate.
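These steps map directly onto a standard scikit-learn workflow; a minimal sketch
(scikit-learn assumed) with labelled data, a train/test split, training, and
accuracy evaluation is shown below.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labelled training data: features X and the correct outputs y.
X, y = load_iris(return_X_y=True)

# Split into training and test sets (a validation set could be split off similarly).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # learn from labels

y_pred = model.predict(X_test)                   # predict on unseen test data
print("accuracy:", accuracy_score(y_test, y_pred))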

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:


1. Regression

• Regression algorithms are used if there is a relationship between the input


variable and the output variable.
• It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc.

Below are some popular Regression algorithms which come under supervised
learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is categorical, which
means there are two or more classes such as Yes-No, Male-Female, True-False,
Spam-Not Spam, etc.

Below are some popular classification algorithms which come under supervised
learning:

o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
Advantages of Supervised learning:

o With the help of supervised learning, the model can predict the output on
the basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of
objects.
o Supervised learning model helps us to solve various real-world problems
such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling complex
  tasks.
o Supervised learning cannot predict the correct output if the test data is
  different from the training dataset.
o In supervised learning, we need enough knowledge about the classes of
  objects.

2) Unsupervised Learning

Unsupervised learning is a type of machine learning in which models are trained
using an unlabeled dataset and are allowed to act on that data without any
supervision.

The goal of unsupervised learning is to find the underlying structure of the
dataset, group the data according to similarities, and represent the dataset in a
compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset


containing images of different types of cats and dogs. The algorithm is never
trained upon the given dataset, which means it does not have any idea about the
features of the dataset. The task of the unsupervised learning algorithm is to
identify the image features on their own. Unsupervised learning algorithm will
perform this task by clustering the image dataset into the groups according to
similarities between images.



Working of Unsupervised Learning

Working of unsupervised learning can be understood by the below diagram:

Here, we have taken an unlabeled input data, which means it is not categorized
and corresponding outputs are also not given.

Now, this unlabeled input data is fed to the machine learning model in order to
train it.

Firstly, it will interpret the raw data to find the hidden patterns in the data,
and then it will apply suitable algorithms such as k-means clustering,
hierarchical clustering, etc.

Once it applies the suitable algorithm, the algorithm divides the data objects into
groups according to the similarities.

Types of Unsupervised Learning Algorithm:


The unsupervised learning algorithm can be further categorized into two types of
problems:

o Clustering: Clustering is a method of grouping the objects into clusters such


that objects with most similarities remains into a group and has less or no
similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which
  is used for finding the relationships between variables in a large
  database. It determines the set of items that occur together in the
  dataset. Association rules make marketing strategy more effective; for
  example, people who buy item X (say, bread) also tend to purchase item Y
  (butter/jam). A typical example of association rules is Market Basket
  Analysis.

Unsupervised Learning algorithms:

Below is a list of some popular unsupervised learning algorithms (a minimal
k-means sketch follows the list):

o K-means clustering
o KNN (k-nearest neighbours)
o Hierarchical clustering
o Anomaly detection
o Neural networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
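A minimal clustering sketch with scikit-learn's k-means (assumed available): the
model receives only unlabeled points and groups them purely by similarity.

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two blobs of points with no class information given.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)                   # discovered group centres
print(kmeans.labels_[:5], kmeans.labels_[-5:])   # cluster assigned to each point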
Advantages of Unsupervised Learning

o Unsupervised learning is used for more complex tasks as compared to


supervised learning because, in unsupervised learning, we don't have
labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.

Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised


learning as it does not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate
as input data is not labeled, and algorithms do not know the exact output
in advance.

3) Reinforcement Learning

o Reinforcement Learning is a feedback-based Machine learning technique


in which an agent learns to behave in an environment by performing the
actions and seeing the results of actions. For each good action, the agent
gets positive feedback, and for each bad action, the agent gets negative
feedback or penalty.
o In reinforcement learning, the agent learns automatically using feedback,
  without any labeled data, unlike supervised learning.
o Since there is no labelled data, the agent is bound to learn from its
  experience only.
o The agent interacts with the environment and explores it by itself.
o The primary goal of an agent in reinforcement learning is to improve its
  performance by getting the maximum positive rewards.
o The agent learns through a process of trial and error, and based on its
  experience, it learns to perform the task in a better way.
o Hence, we can say that "Reinforcement learning is a type of machine
  learning method where an intelligent agent (computer program) interacts
  with the environment and learns to act within it."
o How a robotic dog learns the movement of its limbs is an example of
  reinforcement learning.

Example: Suppose there is an AI agent present within a maze environment,
and its goal is to find the diamond. The agent interacts with the environment
by performing some actions; based on those actions, the state of the
agent changes, and it also receives a reward or penalty as feedback.

o The agent continues doing these three things (take an action, change
  state or remain in the same state, and get feedback), and by doing these
  actions, it learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards
  and which actions lead to negative feedback or penalty.
o As a positive reward, the agent gets a positive point, and as a penalty, it
  gets a negative point (the sketch below shows this reward-driven update).
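A minimal sketch of this reward-driven learning, using a tabular Q-learning update
(one common reinforcement learning method); the tiny corridor environment, its
states, and its rewards are invented purely for illustration.

import numpy as np

n_states, n_actions = 5, 2             # toy corridor: 5 cells, actions 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))    # expected return for each (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def step(state, action):
    """Hypothetical environment: the diamond sits in the right-most cell."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else -0.01   # positive / negative feedback
    return next_state, reward

for episode in range(200):
    state = 0
    for _ in range(50):                                     # cap the episode length
        if state == n_states - 1:                           # diamond reached
            break
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state, reward = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.round(Q, 2))   # the learned values prefer action 1 (move right) in every state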

Classification Algorithm

• The Classification algorithm is a supervised learning technique that is used
  to identify the category of new observations on the basis of training data.
• In classification, a program learns from the given dataset or observations
  and then classifies new observations into a number of classes or groups,
  such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
• Classes can also be called targets/labels or categories.
• The output variable of classification is a category, not a value, such as
  "Green or Blue", "fruit or animal", etc.
• Since the classification algorithm is a supervised learning technique, it
  takes labeled input data, which means it contains inputs with the
  corresponding outputs.
• In a classification algorithm, a discrete output function (y) is mapped to
  the input variable (x):

y=f(x), where y = categorical output

• The best example of an ML classification algorithm is Email Spam Detector.


• The main goal of the Classification algorithm is to identify the category of
a given dataset, and these algorithms are mainly used to predict the output
for the categorical data.
• Classification algorithms can be better understood using the below
diagram.

In the below diagram, there are two classes, class A and Class B. These classes
have features that are similar to each other and dissimilar to other classes.

The algorithm which implements the classification on a dataset is known as a


classifier. There are two types of Classifications:
o Binary Classifier:

  If the classification problem has only two possible outcomes, then it is
  called a Binary Classifier.
  Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG,
  etc.

o Multi-class Classifier:

  If a classification problem has more than two outcomes, then it is called a
  Multi-class Classifier.
  Examples: classification of types of crops, classification of types of music.

Learners in Classification Problems:

In the classification problems, there are two types of learners:

1. Lazy Learners:

   A lazy learner first stores the training dataset and waits until it receives the
test dataset. In the lazy learner case, classification is done on the basis of the most
related data stored in the training dataset. It takes less time in training but more
time for predictions.
Examples: K-NN algorithm, case-based reasoning.

2. Eager Learners:

   Eager learners develop a classification model based on a training dataset
before receiving a test dataset. Opposite to lazy learners, an eager learner takes
more time in learning and less time in prediction.

Examples: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms:

Classification algorithms can be further divided into mainly two categories:

o Linear Models
  o Logistic Regression
  o Support Vector Machines
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
  o Decision Tree Classification
  o Random Forest Classification

Logistic Regression
o Logistic regression is one of the most popular Machine Learning
algorithms, which comes under the Supervised Learning technique.
o It is used for predicting the categorical dependent variable using a given
set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value.
o It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving
the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1.
o In Logistic regression, instead of fitting a regression line, we fit an "S"
shaped logistic function, which predicts two maximum values (0 or 1).
o Logistic Regression is a significant machine learning algorithm because it
has the ability to provide probabilities and classify new data using
continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different
types of data and can easily determine the most effective variables used
for the classification.

The below image is showing the logistic function:


Note: Logistic regression uses the concept of predictive modeling as regression;
therefore, it is called logistic regression, but is used to classify samples; Therefore,
it falls under the classification algorithm.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot
go beyond this limit, so it forms a curve like the "S" form. The S-form curve
is called the Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which
  defines the probability of either 0 or 1: values above the threshold
  tend to 1, and values below the threshold tend to 0.

Assumptions for Logistic Regression:


o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are given
below:

o We know the equation of a straight line can be written as:

  y = b0 + b1x1 + b2x2 + ... + bnxn

o In logistic regression, y can be between 0 and 1 only, so we divide
  the above expression by (1 - y):

  y / (1 - y);   0 for y = 0, and infinity for y = 1

o But we need a range from -[infinity] to +[infinity]; taking the logarithm of
  the expression, it becomes:

  log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for logistic regression.
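A minimal sketch tying the sigmoid function to a fitted model, assuming
scikit-learn: predict_proba returns the probability between 0 and 1, and the 0.5
threshold turns it into a 0/1 class.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data: small x values belong to class 0, large x values to class 1.
X = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

z = model.intercept_ + model.coef_[0] * 5.0      # b0 + b1*x for x = 5
print(sigmoid(z))                                # same value as predict_proba below
print(model.predict_proba([[5.0]])[0, 1])        # P(y = 1 | x = 5)
print(model.predict([[5.0]]))                    # class after applying the 0.5 threshold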

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three
types:

o Binomial: In binomial Logistic regression, there can be only two possible


types of the dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dogs",
or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of dependent variables, such as "low", "Medium", or "High".

Support Vector Machine Algorithm


• Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression
problems.
• However, primarily, it is used for Classification problems in Machine
Learning.
• The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we
can easily put the new data point in the correct category in the future.
• This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the
  hyperplane.
• These extreme cases are called support vectors, and hence the algorithm is
  termed a Support Vector Machine.
• Consider the below diagram in which there are two different categories
that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with an example. Suppose we see a strange
cat that also has some features of dogs; if we want a model that can accurately
identify whether it is a cat or a dog, such a model can be created by using the
SVM algorithm. We first train our model with lots of images of cats and dogs
so that it can learn about the different features of cats and dogs, and then we
test it with this strange creature. The support vectors create a decision
boundary between the two classes of data (cat and dog) using the extreme cases
(support vectors) of cat and dog. On the basis of the support vectors, it will
classify it as a cat. Consider the below diagram:


SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:


o Linear SVM: Linear SVM is used for linearly separable data, which means
  if a dataset can be classified into two classes by using a single straight line,
  then such data is termed linearly separable data, and the classifier used
  is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data,
  which means if a dataset cannot be classified by using a straight line, then
  such data is termed non-linear data, and the classifier used is called a Non-
  linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the


classes in n-dimensional space, but we need to find out the best decision
boundary that helps to classify the data points. This best boundary is known as
the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset,
which means if there are 2 features (as shown in image), then hyperplane will be
a straight line. And if there are 3 features, then hyperplane will be a 3-dimension
plane.

We always create a hyperplane that has a maximum margin, which means the
maximum distance between the data points.

Support Vectors:

The data points or vectors that are the closest to the hyperplane and which affect
the position of the hyperplane are termed as Support Vector. Since these vectors
support the hyperplane, hence called a Support vector.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example.


Suppose we have a dataset that has two tags (green and blue), and the dataset
has two features x1 and x2. We want a classifier that can classify the pair(x1, x2)
of coordinates in either green or blue. Consider the below image:
So as it is 2-d space so by just using a straight line, we can easily separate these
two classes. But there can be multiple lines that can separate these classes.
Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this
best boundary or region is called as a hyperplane. SVM algorithm finds the
closest point of the lines from both the classes. These points are called support
vectors. The distance between the vectors and the hyperplane is called
as margin. And the goal of SVM is to maximize this margin. The hyperplane with
maximum margin is called the optimal hyperplane.

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but
for non-linear data, we cannot draw a single straight line. Consider the below
image:
So to separate these data points, we need to add one more dimension. For linear
data we have used the two dimensions x and y, so for non-linear data we will add
a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider
the below image:
Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis.
If we convert it into 2-d space with z = 1, then it becomes:

Hence we get a circumference of radius 1 in the case of non-linear data.
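A minimal sketch of this idea, assuming scikit-learn: points arranged in rings
cannot be split by a straight line, but an SVM with an RBF kernel (which implicitly
performs a mapping in the spirit of the z = x² + y² trick above) separates them.

import numpy as np
from sklearn.svm import SVC

# Two classes arranged in rings: inner points (class 0) and outer points (class 1).
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100),    # inner ring
                        rng.uniform(2.0, 3.0, 100)])   # outer ring
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)          # non-linear decision boundary

print("linear SVM accuracy:", linear_svm.score(X, y))   # poor: no straight line works
print("RBF SVM accuracy:  ", rbf_svm.score(X, y))       # close to 1.0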

K-Nearest Neighbor(KNN)
K-Nearest Neighbour is one of the simplest Machine Learning algorithms based
on Supervised Learning technique.

o The K-NN algorithm assumes the similarity between the new case/data and the
  available cases and puts the new case into the category that is most similar
  to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point
  based on similarity. This means that when new data appears, it can be
  easily classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as for classification, but
  mostly it is used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
  assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the
  training set immediately; instead it stores the dataset, and at the time of
  classification it performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it
  gets new data, it classifies that data into a category that is most
  similar to the new data.

Example: Suppose we have an image of a creature that looks similar to both a cat
and a dog, and we want to know whether it is a cat or a dog. For this
identification, we can use the KNN algorithm, as it works on a similarity measure.
Our KNN model will find the features of the new image most similar to those of
the cat and dog images, and based on the most similar features it will put it in
either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have
a new data point x1; in which of these categories will this data point lie?
To solve this type of problem, we need a K-NN algorithm. With the help of K-NN,
we can easily identify the category or class of a particular data point. Consider
the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of the data points in
each category.
o Step-5: Assign the new data points to that category for which the number
of the neighbor is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points.
The Euclidean distance is the distance between two points, which we have
already studied in geometry. It can be calculated as:

o By calculating the Euclidean distance we got the nearest neighbors, as


three nearest neighbors in category A and two nearest neighbors in
category B. Consider the below image:

o As we can see the 3 nearest neighbors are from category A, hence this new
data point must belong to category A.
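A minimal K-NN sketch, assuming scikit-learn and using k = 5 as in the example
above: the new point is assigned the majority class among its five nearest
neighbours (Euclidean distance by default).

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Category A points cluster around (1, 1); Category B points cluster around (5, 5).
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [1.5, 1.5],
              [5, 5], [5, 6], [6, 5], [6, 6], [5.5, 5.5]])
y = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

knn = KNeighborsClassifier(n_neighbors=5)   # Step 1: choose K = 5
knn.fit(X, y)                               # lazy learner: just stores the data

new_point = np.array([[2.5, 2.5]])
print(knn.predict(new_point))               # majority of the 5 nearest neighbours -> 'A'
print(knn.kneighbors(new_point))            # Euclidean distances and neighbour indices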

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN
algorithm:

o There is no particular way to determine the best value for "K", so we need
to try some values to find the best out of them. The most preferred value
for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the
effects of outliers in the model.
o Larger values for K are generally good, but very large values may cause
  difficulties.

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o It always needs the value of K to be determined, which may be complex at
  times.
o The computation cost is high because of calculating the distance between
  the new data point and all the training samples.

Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional
training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that
can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which
can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain
  feature is independent of the occurrence of other features.
o For example, if a fruit is identified on the basis of color, shape, and taste,
  then a red, spherical, and sweet fruit is recognized as an apple.
o Hence each feature individually contributes to identifying it as an apple,
  without depending on the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used
to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
o The formula for Bayes' theorem is given as:

  P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed


event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the
hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the


evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below
example:

Suppose we have a dataset of weather conditions and corresponding target


variable "Play".

Using this dataset, we need to decide whether we should play or not on a
particular day according to the weather conditions. To solve this problem, we
need to follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes
11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the weather conditions:

Weather     Yes   No
Overcast     5    0
Rainy        2    2
Sunny        3    2
Total       10    4

Likelihood table for the weather conditions:

Weather     No            Yes           P(Weather)
Overcast    0             5             5/14 = 0.35
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71


Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 0.35

P(Yes) = 0.71

So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 0.29

P(Sunny) = 0.35

So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
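The same calculation can be reproduced in a few lines of Python, working directly
from the 14-row table above (a minimal sketch; a library implementation such as
scikit-learn's CategoricalNB would automate these counts).

from collections import Counter

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
           "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
play_counts = Counter(play)                          # {'Yes': 10, 'No': 4}

def posterior(weather, label):
    """P(label | weather) = P(weather | label) * P(label) / P(weather)."""
    p_weather_given_label = sum(o == weather and p == label
                                for o, p in zip(outlook, play)) / play_counts[label]
    p_label = play_counts[label] / n
    p_weather = outlook.count(weather) / n
    return p_weather_given_label * p_label / p_weather

print(round(posterior("Sunny", "Yes"), 2))   # 0.60
print(round(posterior("Sunny", "No"), 2))    # 0.40 (0.41 above, which rounds intermediates)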

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it


cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.


o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
o It is used in Text classification such as Spam filtering and Sentiment
analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal


distribution. This means if predictors take continuous values instead of
discrete, then the model assumes that these values are sampled from the
Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data
  is multinomially distributed. It is primarily used for document classification
  problems, i.e. deciding which category a particular document belongs to,
  such as Sports, Politics, Education, etc.
  The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial
  classifier, but the predictor variables are independent Boolean variables,
  such as whether a particular word is present or not in a document. This
  model is also well known for document classification tasks.

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for
solving Classification problems.
o It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules and each leaf
node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node.
o Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain
any further branches.
o The decisions or the test are performed on the basis of features of the
given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like
structure.
o In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No),
  it further splits into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric
data.

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm
for the given dataset and problem is the main point to remember while creating
a machine learning model.

Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a


decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it
shows a tree-like structure.

Decision Tree Terminologies

▪ Root Node: Root node is from where the decision tree starts. It
represents the entire dataset, which further gets divided into two or
more homogeneous sets.
▪ Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
▪ Splitting: Splitting is the process of dividing the decision node/root node
into sub-nodes according to the given conditions.
▪ Branch/Sub Tree: A tree formed by splitting the tree.
▪ Pruning: Pruning is the process of removing the unwanted branches from
the tree.
▪ Parent/Child node: The root node of the tree is called the parent node,
and other nodes are called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm
starts from the root node of the tree.

This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps
to the next node.

For the next node, the algorithm again compares the attribute value with the
other sub-nodes and move further.

It continues the process until it reaches the leaf node of the tree.

The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete
  dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection
  Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best
  attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the
  dataset created in step 3. Continue this process until a stage is reached
  where you cannot classify the nodes further; the final node is then a
  leaf node.
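A minimal decision tree sketch, assuming scikit-learn: it grows a CART-style tree
(Gini impurity as the Attribute Selection Measure, or "entropy" for information
gain) and prints the root node, decision nodes, and leaves.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]

# scikit-learn uses an optimised CART algorithm; the criterion acts as the ASM.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=feature_names))   # root node, decision nodes, leaves
print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))             # class of a new sample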

Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not.

So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM).

The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels.

The next decision node further gets split into one decision node (Cab facility) and
one leaf node.

Finally, the decision node splits into two leaf nodes (Accepted offers and Declined
offer). Consider the below diagram:

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get
the optimal decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture
all the important features of the dataset. Therefore, a technique that decreases
the size of the learning tree without reducing accuracy is known as Pruning. There
are mainly two types of tree pruning techniques used:

o Cost Complexity Pruning


o Reduced Error Pruning.

Advantages of the Decision Tree

o It is simple to understand as it follows the same process which a human
follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.


o It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
o For more class labels, the computational complexity of the decision tree
may increase.

Random Forest Algorithm


▪ Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique.
▪ It can be used for both Classification and Regression problems in ML.
▪ It is based on the concept of ensemble learning, which is a process
of combining multiple classifiers to solve a complex problem and to improve
the performance of the model.
▪ As the name suggests, "Random Forest is a classifier that contains a
number of decision trees on various subsets of the given dataset and takes
the average to improve the predictive accuracy of that dataset."
▪ Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of the predictions,
predicts the final output.
▪ A greater number of trees in the forest generally leads to higher accuracy
and helps prevent the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:


Note: To better understand the Random Forest Algorithm, you should have
knowledge of the Decision Tree Algorithm.

Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct output,
while others may not.

But together, all the trees predict the correct output.

Therefore, below are two assumptions for a better Random forest classifier:

o There should be some actual values in the feature variable of the dataset
so that the classifier can predict accurate results rather than a guessed
result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?

o It takes less training time as compared to other algorithms.


o It predicts output with high accuracy, and even for a large dataset it runs
efficiently.
o It can also maintain accuracy when a large proportion of data is missing.
How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions for each tree
created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points
(Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign
the new data points to the category that wins the majority votes.
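Before the worked example, here is a minimal sketch of these two phases using scikit-learn's RandomForestClassifier on a synthetic dataset; the dataset and parameter values are illustrative assumptions, not part of the original example.

# A minimal sketch of the steps described above, using scikit-learn's
# RandomForestClassifier. Data here is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = N, the number of decision trees; each tree is trained on a
# bootstrap sample (random data points drawn with replacement).
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# For a new data point, every tree votes and the majority class wins.
print("Accuracy:", forest.score(X_test, y_test))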

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into
subsets and given to each decision tree. During the training phase, each decision
tree produces a prediction result, and when a new data point occurs, then based
on the majority of results, the Random Forest classifier predicts the final decision.

Consider the below image:


Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of
loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression
tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

o Although random forest can be used for both classification and regression
tasks, it is less suitable for regression tasks.
Bagging Vs Boosting
• We all use the decision tree technique in day-to-day life to make decisions.
• Organizations use supervised machine learning techniques like decision trees
to make better decisions and to generate more surplus and profit.
• Ensemble methods combine multiple decision trees to deliver better
predictive results than utilizing a single decision tree.
• The primary principle behind the ensemble model is that a group of weak
learners come together to form a strong learner.
• There are two techniques given below that are used to perform ensemble
decision tree.

Bagging

• Bagging is used when our objective is to reduce the variance of a decision
tree.
• Here the concept is to create a few subsets of data from the training
sample, which are chosen randomly with replacement.
• Each subset of data is then used to train its own decision tree; thus, we
end up with an ensemble of various models.
• The average of the predictions from the numerous trees is used, which is
more powerful than a single decision tree.
• Random Forest is an extension of bagging.

Boosting:

• Boosting is another ensemble procedure used to make a collection of
predictors.
• In other words, we fit consecutive trees, usually on random samples, and at
each step the objective is to reduce the net error from the prior trees.
• If a given input is misclassified by a hypothesis, its weight is increased
so that the next hypothesis is more likely to classify it correctly;
consolidating the entire set in this way converts weak learners into better
performing models.
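As a hedged illustration of the two techniques, the sketch below compares a bagging ensemble and a boosting ensemble of decision trees using scikit-learn; the synthetic data and the number of estimators are assumptions made only for the example.

# Illustrative comparison of bagging and boosting ensembles built on
# decision trees, using scikit-learn. Dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: each base learner (a decision tree by default) sees a bootstrap
# sample; predictions are combined by voting, which mainly reduces variance.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: learners are fitted sequentially; misclassified points get a
# higher weight so that later learners focus on the remaining errors.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print("Bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("Boosting:", cross_val_score(boosting, X, y, cv=5).mean())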

Clustering in Machine Learning


▪ Clustering or cluster analysis is a machine learning technique, which groups
the unlabelled dataset.
▪ It can be defined as "A way of grouping the data points into different
clusters, consisting of similar data points. The objects with the possible
similarities remain in a group that has less or no similarities with another
group."
▪ It finds similar patterns in the unlabelled dataset, such as shape, size,
color, and behavior, and divides the data points as per the presence or
absence of those patterns.
▪ It is an unsupervised learning method, hence no supervision is provided to
the algorithm, and it deals with the unlabeled dataset.
▪ After applying this clustering technique, each cluster or group is provided
with a cluster-ID.
▪ ML system can use this id to simplify the processing of large and complex
datasets.
▪ The clustering technique is commonly used for statistical data analysis.

Note: Clustering is somewhat similar to the classification algorithm, but the
difference is the type of dataset that we are using. In classification, we work
with a labeled dataset, whereas in clustering, we work with an unlabelled dataset.

Example: Let's understand the clustering technique with the real-world example
of a mall: when we visit a shopping mall, we can observe that things with
similar usage are grouped together. T-shirts are grouped in one section and
trousers in another; similarly, in the fruit and vegetable section, apples,
bananas, mangoes, etc., are grouped separately so that we can easily find
things. The clustering technique works in the same way.

The clustering technique can be widely used in various tasks. Some most common
uses of this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation
system to provide recommendations based on a user's past product searches.
Netflix also uses this technique to recommend movies and web series to its
users based on their watch history.

The below diagram explains the working of the clustering algorithm. We can see
the different fruits are divided into several groups with similar properties.

Types of Clustering Methods

The clustering methods are broadly divided into Hard clustering (a data point
belongs to only one group) and Soft clustering (a data point can belong to more
than one group). But various other approaches to clustering also exist. Below
are the main clustering methods used in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering

▪ It is a type of clustering that divides the data into non-hierarchical groups.


▪ It is also known as the centroid-based method.
▪ The most common example of partitioning clustering is the K-Means
Clustering algorithm.
▪ In this type, the dataset is divided into a set of k groups, where K is used to
define the number of pre-defined groups.
▪ The cluster centers are created in such a way that the distance between the
data points and their own cluster centroid is minimum as compared to the
distance to other cluster centroids.
Density-Based Clustering

▪ The density-based clustering method connects the highly dense areas into
clusters, and arbitrarily shaped distributions are formed as long as the
dense region can be connected.
▪ This algorithm does it by identifying different clusters in the dataset and
connects the areas of high densities into clusters.
▪ The dense areas in data space are divided from each other by sparser
areas.
▪ These algorithms can face difficulty in clustering the data points if the
dataset has varying densities and high dimensions.
Distribution Model-Based Clustering

▪ In the distribution model-based clustering method, the data is divided
based on the probability of how likely a data point is to belong to a
particular distribution.
▪ The grouping is done by assuming some distribution, most commonly the
Gaussian distribution.
▪ The example of this type is the Expectation-Maximization Clustering
algorithm that uses Gaussian Mixture Models (GMM).

Hierarchical Clustering

▪ Hierarchical clustering can be used as an alternative to partitioning
clustering as there is no requirement of pre-specifying the number of
clusters to be created.
▪ In this technique, the dataset is divided into clusters to create a tree-like
structure, which is also called a dendrogram.
▪ The observations or any number of clusters can be selected by cutting the
tree at the correct level.
▪ The most common example of this method is the Agglomerative
Hierarchical algorithm.
Fuzzy Clustering

▪ Fuzzy clustering is a type of soft method in which a data object may belong
to more than one group or cluster.
▪ Each data point has a set of membership coefficients, which depend on its
degree of membership in each cluster.
▪ Fuzzy C-means algorithm is the example of this type of clustering; it is
sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms

▪ The Clustering algorithms can be divided based on their models that are
explained above.
▪ There are different types of clustering algorithms published, but only a few
are commonly used.
▪ The choice of clustering algorithm depends on the kind of data that we are using.
▪ For example, some algorithms need the number of clusters to be specified in
advance, whereas others work by finding the minimum distance between the
observations of the dataset.

Popular Clustering algorithms that are widely used in machine learning:

1. K-Means algorithm
2. Mean-shift algorithm
3. DBSCAN Algorithm
4. Expectation-Maximization Clustering using GMM
5. Agglomerative Hierarchical algorithm
6. Affinity Propagation

K-Means Clustering Algorithm


▪ K-Means Clustering is an unsupervised learning algorithm that is used to
solve the clustering problems in machine learning or data science.
▪ In this topic, we will learn what is K-means clustering algorithm, how the
algorithm works, along with the Python implementation of k-means
clustering.
What is K-Means Algorithm?
▪ K-Means Clustering is an Unsupervised Learning algorithm, which groups
the unlabeled dataset into different clusters.
▪ Here K defines the number of pre-defined clusters that need to be created
in the process, as if K=2, there will be two clusters, and for K=3, there will
be three clusters, and so on.
▪ It is an iterative algorithm that divides the unlabeled dataset into k
different clusters in such a way that each data point belongs to only one
group with similar properties.
▪ It allows us to cluster the data into different groups and is a convenient
way to discover the categories of groups in the unlabeled dataset on its own,
without the need for any training.
▪ It is a centroid-based algorithm, where each cluster is associated with a
centroid.
▪ The main aim of this algorithm is to minimize the sum of distances between
the data points and their corresponding cluster centroids.
▪ The algorithm takes the unlabeled dataset as input, divides the dataset into
k clusters, and repeats the process until it finds the best clusters.
▪ The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for the K center points or centroids by an
iterative process.
o Assigns each data point to its closest k-center. The data points which are
near to a particular k-center create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from
other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They can be points other than
those from the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new
closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
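Before the visual walk-through, here is a minimal sketch of the same steps using scikit-learn's KMeans; the two variables and the six sample points are made up only for illustration.

# A minimal sketch of the steps above using scikit-learn's KMeans.
# The two variables are represented by a small, made-up 2-D dataset.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # one loose group
              [8.0, 8.5], [8.3, 8.0], [7.8, 9.0]])  # another loose group

# n_clusters = K; n_init controls how many times the centroids are re-seeded.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels :", labels)          # which cluster each point joined
print("Final centroids:\n", kmeans.cluster_centers_)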

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put
them into different clusters. It means here we will try to group these
datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster.
o These points can be either the points from the dataset or any other point.
o So, here we are selecting the below two points as k points, which are not
the part of our dataset.

Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point
or centroid.
o We will compute it by applying some mathematics that we have studied to
calculate the distance between two points.
o So, we will draw a median between both the centroids. Consider the below
image:

From the above image, it is clear that the points on the left side of the line
are near to the K1 or blue centroid, and the points to the right of the line
are close to the yellow centroid.

Let's color them as blue and yellow for clear visualization.

o As we need to find the closest cluster, we will repeat the process by
choosing new centroids.
o To choose the new centroids, we will compute the center of gravity of the
data points in each cluster, and will find the new centroids as below:

o Next, we will reassign each datapoint to the new centroid.


o For this, we will repeat the same process of finding a median line. The
median will be like the below image:

From the above image, we can see, one yellow point is on the left side of the line,
and two blue points are right to the line. So, these three points will be assigned
to new centroids.
As reassignment has taken place, so we will again go to the step-4, which is
finding new centroids or K-points.

o We will repeat the process by finding the center of gravity of the points in
each cluster, so the new centroids will be as shown in the below image:

o As we got the new centroids so again will draw the median line and
reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on
either side of the line, which means our model is formed. Consider the
below image:

As our model is ready, we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:
Hierarchical Clustering in Machine Learning
▪ Hierarchical clustering is another unsupervised machine learning
algorithm, which is used to group the unlabeled datasets into a cluster and
also known as hierarchical cluster analysis or HCA.
▪ In this algorithm, we develop the hierarchy of clusters in the form of a tree,
and this tree-shaped structure is known as the dendrogram.
▪ Sometimes the results of K-means clustering and hierarchical clustering
may look similar, but they both differ depending on how they work.
▪ There is no requirement to predetermine the number of clusters as we
did in the K-Means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the
algorithm starts by taking all data points as single clusters and merges
them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm
as it is a top-down approach.

Why hierarchical clustering?

o As we already have other clustering algorithms such as K-Means
Clustering, why do we need hierarchical clustering?
o As we have seen, K-means clustering has some challenges: it requires a
predetermined number of clusters, and it always tries to create clusters
of the same size.
o To solve these two challenges, we can opt for the hierarchical clustering
algorithm because, in this algorithm, we don't need to have knowledge
about the predefined number of clusters.
o In this topic, we will discuss the Agglomerative Hierarchical clustering
algorithm.

Agglomerative Hierarchical clustering

▪ The agglomerative hierarchical clustering algorithm is a popular example
of HCA.
▪ To group the datasets into clusters, it follows the bottom-up approach.
▪ It means this algorithm considers each data point as a single cluster at the
beginning and then starts combining the closest pairs of clusters.
▪ It does this until all the clusters are merged into a single cluster that
contains all the data points.
▪ This hierarchy of clusters is represented in the form of the dendrogram.

How the Agglomerative Hierarchical clustering Work?

The working of the AHC algorithm can be explained using the below steps:

o Step-1: Create each data point as a single cluster. Let's say there are N data
points, so the number of clusters will also be N.

o Step-2: Take two closest data points or clusters and merge them to form
one cluster. So, there will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to
form one cluster. There will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the
following clusters. Consider the below images:

o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
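A minimal sketch of these steps with scikit-learn's AgglomerativeClustering is shown below; the data points and the choice of three clusters are assumptions made only for illustration (the linkage option is discussed in a later section).

# A minimal sketch of agglomerative clustering with scikit-learn; the data
# points are hypothetical.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six hypothetical data points forming three loose pairs
X = np.array([[1.0, 1.0], [1.2, 1.1],
              [5.0, 5.0], [5.1, 4.9],
              [9.0, 9.0], [9.2, 8.8]])

# Merging of the closest clusters continues until the requested number of
# clusters remains; `linkage` controls how cluster-to-cluster distance is
# measured (single, complete, average, ward).
agg = AgglomerativeClustering(n_clusters=3, linkage="single")
print(agg.fit_predict(X))   # cluster label assigned to each data point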

Note: To better understand hierarchical clustering, it is advised to have a
look at k-means clustering.

Measure for the distance between two clusters

As we have seen, the closest distance between the two clusters is crucial for the
hierarchical clustering. There are various ways to calculate the distance between
two clusters, and these ways decide the rule for clustering. These measures are
called Linkage methods. Some of the popular linkage methods are given below:

1. Single Linkage: It is the Shortest Distance between the closest points of the
clusters. Consider the below image:

2. Complete Linkage: It is the farthest distance between two points of
two different clusters. It is one of the popular linkage methods as it forms
tighter clusters than single linkage.

3. Average Linkage: It is the linkage method in which the distances between
each pair of data points (one from each cluster) are added up and then
divided by the total number of pairs to calculate the average distance
between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between
the centroid of the clusters is calculated. Consider the below image:

From the above-given approaches, we can apply any of them according to the
type of problem or business requirement.

Working of Dendrogram in Hierarchical Clustering

▪ The dendrogram is a tree-like structure that is mainly used to record each
step that the HC algorithm performs.
▪ In the dendrogram plot, the Y-axis shows the Euclidean distances between
the data points, and the x-axis shows all the data points of the given
dataset.

The working of the dendrogram can be explained using the below diagram:

In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding
dendrogram.

o As we have discussed above, firstly, the data points P2 and P3 combine
together and form a cluster; correspondingly, a dendrogram is created,
which connects P2 and P3 with a rectangular shape. The height is decided
according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding
dendrogram is created. It is higher than the previous one, as the Euclidean
distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in
one dendrogram, and P4, P5, and P6, in another dendrogram.
o At last, the final dendrogram is created that combines all the data points
together.

We can cut the dendrogram tree structure at any level as per our requirement.
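The dendrogram itself can be built and cut programmatically; the sketch below uses SciPy's hierarchy module on six made-up points standing in for P1-P6 (plotting assumes matplotlib is installed).

# A small sketch of building and cutting a dendrogram with SciPy; points
# P1..P6 are hypothetical.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.array([[1.0, 1.0], [1.2, 1.1], [1.1, 1.3],
              [6.0, 6.0], [6.2, 6.1], [6.1, 6.4]])

Z = linkage(X, method="ward")   # records every merge and its height

dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.ylabel("Euclidean distance")
plt.show()

# "Cutting" the tree so that two clusters remain
print(fcluster(Z, t=2, criterion="maxclust"))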

Regression Analysis in Machine learning


▪ Regression analysis is a statistical method to model the relationship
between a dependent (target) variable and one or more independent
(predictor) variables.
▪ More specifically, Regression analysis helps us to understand how the
value of the dependent variable is changing corresponding to an
independent variable when other independent variables are held fixed.
▪ It predicts continuous/real values such as temperature, age, salary,
price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which runs various
advertisements every year and gets sales in return. The below list shows the
advertisements made by the company in the last 5 years and the corresponding
sales:
▪ Now, the company wants to do the advertisement of $200 in the year
2019 and wants to know the prediction about the sales for this year.
▪ So to solve such type of prediction problems in machine learning, we need
regression analysis.
▪ Regression is a supervised learning technique which helps in finding the
correlation between variables and enables us to predict the continuous
output variable based on the one or more predictor variables.
▪ It is mainly used for prediction, forecasting, time series modeling, and
determining the cause-and-effect relationship between variables.
▪ In Regression, we plot a graph between the variables which best fits the
given datapoints, using this plot, the machine learning model can make
predictions about the data.
▪ In simple words, "Regression shows a line or curve that fits the data points
on the target-predictor graph in such a way that the vertical distance
between the data points and the regression line is minimum."
▪ The distance between the data points and the line tells whether the model
has captured a strong relationship or not.

Some examples of regression are:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in regression analysis which we want
to predict or understand is called the dependent variable. It is also
called the target variable.
o Independent Variable: The factors which affect the dependent variable or
which are used to predict its values are called independent variables,
also called predictors.
o Outliers: An outlier is an observation which contains either a very low
value or a very high value in comparison to other observed values. An
outlier may hamper the result, so it should be avoided.
o Multicollinearity: If the independent variables are highly correlated with
each other, the condition is called multicollinearity. It should not be
present in the dataset, because it creates problems when ranking the most
influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training
dataset but not well with test dataset, then such problem is
called Overfitting. And if our algorithm does not perform well even with
training dataset, then such problem is called underfitting.

Types of Regression

▪ There are various types of regressions which are used in data science and
machine learning.
▪ Each type has its own importance on different scenarios, but at the core,
all the regression methods analyze the effect of the independent variable
on dependent variables.

Here we are discussing some important types of regression which are given
below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
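Of these, linear regression is the simplest to demonstrate. The sketch below fits a line to made-up advertisement/sales figures in the spirit of the example above and predicts sales for a $200 advertisement; the numbers are illustrative assumptions, not real data.

# A minimal linear regression sketch in the spirit of the advertisement/sales
# example above; the figures here are made up for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

advertisement = np.array([[90], [120], [150], [100], [130]])  # hypothetical spend
sales = np.array([1000, 1300, 1800, 1200, 1380])              # hypothetical sales

model = LinearRegression().fit(advertisement, sales)

# Predict the sales for a planned $200 advertisement budget
print(model.predict([[200]]))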
Cost Functions

• A function that determines how well a Machine Learning model performs
for a given set of data.
• After training your model, you need to see how well your model is
performing.
• A Cost Function is used to measure just how wrong the model is in
finding a relation between the input and output.
• It tells you how badly your model is behaving/predicting
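One common cost function for regression models is the mean squared error (MSE); the short sketch below is an illustrative implementation, not the only possible cost function.

# A minimal sketch of one common cost function, mean squared error (MSE),
# which measures how far the model's predictions are from the true values.
import numpy as np

def mse_cost(y_true, y_pred):
    """Average of squared differences; a larger value means a worse fit."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse_cost([3.0, 5.0, 7.0], [2.5, 5.5, 8.0]))  # 0.5 -> predictions are close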

Consider a robot trained to stack boxes in a factory.


• The robot might have to consider certain changeable parameters, called
Variables, which influence how it performs.
• Let’s say the robot comes across an obstacle, like a rock.
• The robot might bump into the rock and realize that it is not the correct
action.
• It will learn from this, and next time it will learn to avoid rocks.
• Hence, your machine uses variables to better fit the data.
• The outcome of all these obstacles will further optimize the robot and
help it perform better.
• It will generalize and learn to avoid obstacles in general, say like a fire
that might have broken out.
• The outcome acts as a cost function, which helps you optimize the
variable, to get the best variables and fit for the model.

Figure 1: Robot learning to avoid obstacles

Training and testing data


• Training data. This type of data builds up the machine learning
algorithm. The data scientist feeds the algorithm input data, which
corresponds to an expected output. The model evaluates the data
repeatedly to learn more about the data’s behaviour and then adjusts
itself to serve its intended purpose.
• Test data. After the model is built, Test data provides a final, real-world
check of an unseen dataset to confirm that the ML algorithm was
trained effectively.
Confusion Matrix
• The confusion matrix is a matrix used to determine the performance of
the classification models for a given set of test data.
• It can only be determined if the true values for test data are known.
• The matrix itself can be easily understood, but the related terminologies
may be confusing.
• Since it shows the errors in the model performance in the form of a
matrix, it is also known as an error matrix.

Some features of Confusion matrix are given below:


o For a classifier with 2 prediction classes, the matrix is a 2×2 table; for 3
classes, it is a 3×3 table, and so on.
o The matrix is divided into two dimensions, that are predicted
values and actual values along with the total number of predictions.
o Predicted values are those values, which are predicted by the model, and
actual values are the true values for the given observations.
o It looks like the below table:

The above table has the following cases:

o True Negative: Model has given prediction No, and the real or actual
value was also No.
o True Positive: The model has predicted Yes, and the actual value was also
Yes.
o False Negative: The model has predicted no, but the actual value was Yes,
it is also called as Type-II error.
o False Positive: The model has predicted Yes, but the actual value was No.
It is also called a Type-I error.

Need for Confusion Matrix in Machine learning

o It evaluates the performance of classification models when they
make predictions on test data, and tells how good our classification model
is.
o It tells not only the errors made by the classifier but also the type of
errors, i.e., whether they are Type-I or Type-II errors.
o With the help of the confusion matrix, we can calculate the different
parameters for the model, such as accuracy, precision, etc.

Example: We can understand the confusion matrix using an example.


Suppose we are trying to create a model that can predict whether a person has
a particular disease or not. The confusion matrix for this is given as:

From the above example, we can conclude that:

o The table is given for a two-class classifier, which has two predictions,
"Yes" and "No." Here, Yes means the patient has the disease, and No
means the patient does not have the disease.
o The classifier has made a total of 100 predictions. Out of 100
predictions, 89 are true predictions, and 11 are incorrect predictions.
o The model has given the prediction "Yes" 32 times and "No" 68 times,
whereas the actual "Yes" occurred 27 times and the actual "No" 73 times.
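The same kind of matrix can be computed with scikit-learn's confusion_matrix; the small set of actual and predicted labels below is made up purely for illustration.

# Computing a confusion matrix with scikit-learn on made-up labels.
from sklearn.metrics import confusion_matrix

y_actual    = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
y_predicted = ["Yes", "No", "No",  "No", "Yes", "Yes", "No", "Yes"]

# Rows = actual values, columns = predicted values (order set by `labels`)
cm = confusion_matrix(y_actual, y_predicted, labels=["Yes", "No"])
print(cm)
# With the label order ["Yes", "No"], the layout is:
# [[TP FN]
#  [FP TN]]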
Cross Validation in Machine Learning

• In machine learning, we couldn’t fit the model on the training data and
can’t say that the model will work accurately for the real data.
• For this, we must assure that our model got the correct patterns from the
data, and it is not getting up too much noise.
• For this purpose, we use the cross-validation technique.
Cross-Validation

Cross-validation is a technique in which we train our model using a subset of
the dataset and then evaluate it using the complementary subset of the dataset.
The steps involved in cross-validation are as follows:
• Reserve some portion of the dataset.
• Train the model using the rest of the dataset.
• Test the model using the reserved portion of the dataset.

K-Fold Cross Validation

• In this method, we split the dataset into k subsets (known as folds); we
then perform training on k-1 of the subsets and leave one subset out for
the evaluation of the trained model.
• We iterate k times, with a different subset reserved for testing each time.

Example

• The diagram below shows an example of the training subsets and
evaluation subsets generated in k-fold cross-validation.

• Here, we have total 25 instances.

• In the first iteration we use the first 20 percent of the data for
evaluation and the remaining 80 percent for training

• (instances [0-4] for testing and [5-24] for training),

• while in the second iteration we use the second subset of 20 percent for
evaluation and the remaining four subsets for training

• ([5-9] testing and [0-4 and 10-24] training), and so on.


Total instances: 25
Value of k : 5

Iteration   Training set observations                                        Testing set observations
1           [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]    [ 0  1  2  3  4]
2           [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]    [ 5  6  7  8  9]
3           [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24]    [10 11 12 13 14]
4           [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24]    [15 16 17 18 19]
5           [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]    [20 21 22 23 24]
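The split shown above can be reproduced with scikit-learn's KFold; the sketch below assumes 25 instances indexed 0-24 and no shuffling, so each fold is a consecutive block of five indices.

# Reproducing the 5-fold split on 25 instances with scikit-learn's KFold.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)        # 25 instances, indexed 0..24
kf = KFold(n_splits=5)   # k = 5, no shuffling

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Iteration {i}: testing set = {test_idx.tolist()}")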
Class imbalance
• Imbalanced classification is the problem of classification when there is an
unequal distribution of classes in the training dataset.

• The imbalance in the class distribution may vary, but a severe imbalance is more
challenging to model and may require specialized techniques.

• Many real-world classification problems have an imbalanced class distribution,
such as fraud detection, spam detection, and churn prediction.
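One simple, commonly used remedy is to weight the classes inversely to their frequency; the sketch below uses scikit-learn's compute_class_weight on a made-up 95/5 label distribution.

# Computing balanced class weights for an imbalanced (made-up) label set.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 950 + [1] * 50)   # 95% of class 0, 5% of class 1

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))    # the rare class receives a larger weight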

Evaluation Metrics

• Evaluation metrics are used to measure the quality of a statistical
or machine learning model.
• Evaluating machine learning models or algorithms is essential for any
project.
• There are many different types of evaluation metrics available to test a
model.
• These include classification accuracy, logarithmic loss, confusion matrix,
and others.
• Classification accuracy is the ratio of the number of correct predictions to
the total number of input samples, which is usually what we refer to
when we use the term accuracy.
• Logarithmic loss, also called log loss, works by penalizing the false
classifications.
• A confusion matrix gives us a matrix as output and describes the
complete performance of the model.
• There are other evaluation metrics that can be used that have not been
listed.
• Evaluation often involves using a combination of these individual
metrics to test a model or algorithm.
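As a small illustration, the sketch below computes classification accuracy, log loss, and a confusion matrix with scikit-learn on made-up predictions; the values are assumptions chosen only to show the metric calls.

# Computing a few common evaluation metrics with scikit-learn.
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss

y_true = [1, 0, 1, 1, 0, 1]              # actual labels (made up)
y_pred = [1, 0, 0, 1, 0, 1]              # hard class predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7]  # predicted probability of class 1

print("Accuracy        :", accuracy_score(y_true, y_pred))
print("Log loss        :", log_loss(y_true, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))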
