
University of Juba

School of Computer Science & Information Technology


Department: Computer Science
Fifth Year, Semester Nine, Degree Program
Course: Pattern Recognition
[email protected] +211 925 210 912
Pattern Recognition
Course Description: Pattern recognition is a fundamental area of study in the field of artificial
intelligence and machine learning. This course provides an introduction to the principles,
techniques, and applications of pattern recognition. Students will learn how to analyze data, extract
meaningful patterns, and make informed decisions based on pattern recognition algorithms. Topics
covered include statistical pattern recognition, machine learning algorithms, feature extraction,
classification, clustering, and applications in image processing, speech recognition, and
bioinformatics.

Course Objectives:
1. To understand the basic concepts and principles of pattern recognition.
2. To learn various techniques and algorithms used in pattern recognition.
3. To develop practical skills in data analysis, feature extraction, and classification.
4. To explore applications of pattern recognition in real-world problems.
5. To critically evaluate the performance of pattern recognition systems.

Course Outline:
1. Introduction to Pattern Recognition
 Definition and importance of pattern recognition
 History and applications
2. Statistical Pattern Recognition
 Probability theory and statistical decision theory
 Bayes decision theory
 Maximum likelihood estimation
3. Machine Learning Algorithms
 Supervised learning
 Unsupervised learning
 Neural networks
 Support vector machines
4. Feature Extraction
 Feature selection and dimensionality reduction
 Feature extraction techniques
5. Classification
 Linear classifiers (e.g., perceptron, logistic regression)
 Non-linear classifiers (e.g., decision trees, k-nearest neighbors)
 Ensemble methods (e.g., random forests, boosting)
6. Clustering
 K-means clustering
 Hierarchical clustering
 Density-based clustering
7. Applications of Pattern Recognition

 Image processing and computer vision
 Speech recognition and natural language processing
 Bioinformatics and biomedical signal processing
 Pattern recognition in finance and marketing
8. Performance Evaluation
 Metrics for evaluating classification and clustering algorithms
 Cross-validation and model selection
9. Ethical and Social Implications
 Privacy and security concerns
 Bias and fairness in pattern recognition systems

Teaching Methodology:
 Lectures
 Practical sessions using software tools (e.g., MATLAB, Python)
 Case studies and real-world examples
 Group discussions and presentations
Assessment:
 Assignments and quizzes
 Practical projects
 Final examinations
Prerequisites: Basic knowledge of mathematics, probability theory, and programming is
recommended.
References:
 Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern Classification (2nd ed.).
 Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction (2nd ed.).

LECTURE ONE (1)
Introduction to Pattern Recognition:
Pattern recognition is a data analysis method that uses machine learning algorithms to
automatically recognize patterns and regularities in data. This data can be anything from text and
images to sounds or other definable qualities. Pattern recognition systems can recognize familiar
patterns quickly and accurately.

Pattern recognition is the act of taking in raw data (objects) and classifying each object into one
of a number of categories or classes. Typically the categories are assumed to be known in
advance, although there are techniques (clustering) that learn the categories from the data itself.

Pattern recognition is the process of identifying patterns, regularities, or similarities in data or
information. It involves the extraction of meaningful features from raw data and the classification
or categorization of these patterns based on their similarities or differences. Pattern recognition is
a fundamental aspect of human cognition, as well as a key area of study in various fields, including
artificial intelligence, machine learning, computer vision, and data mining.

Pattern recognition can be defined as the classification of data based on knowledge already
gained or on statistical information extracted from patterns and/or their representation. One of the
important aspects of pattern recognition is its application potential.

A pattern is an abstract object, such as a set of measurements describing a physical object. The
objects could be images, signal waveforms or any type of measurements that need to be classified.
We will refer to these objects as patterns.

Pattern is everything around in this digital world. A pattern can either be seen physically or it can
be observed mathematically by applying algorithms.

Pattern recognition is an integral part of most machine intelligence systems built for decision
making. It studies how a machine can

 extract information by observing the environment
 learn to recognize patterns from examples
 make decisions based on the category of the patterns

Pattern recognition involves the classification and clustering of patterns.

 In classification, an appropriate class label is assigned to a pattern based on an abstraction that
is generated using a set of training patterns or domain knowledge. Classifying patterns requires
the capabilities of

 Supervised learning
 Unsupervised learning

 Pattern recognition problems range from those with known models (probability densities,
category labels, ...) to those where the model, the training patterns, and even the number of
classes are unknown.

 These problems can be categorized as:

a) Supervised classification

b) Unsupervised classification (Clustering)

a) Supervised classification
 We know exactly how many groups (classes) exist.
 We have data whose group membership is known - labeled samples coming from the groups.
 Based on this known information, we can build classification rules from the training samples to
classify new data points into one of the available groups.
 When a set of training data is available and the classifier is designed by exploiting this a priori
known information, the task is known as supervised pattern recognition.
b) Unsupervised classification (Cluster analysis)
 The data are not grouped.
 We do not know how many groups exist in the data.
 Labeled training data are not available - we are given a set of feature vectors x and the goal is
to unravel the underlying similarities and cluster (group) similar vectors together. This is
known as unsupervised pattern recognition or clustering.
 Such tasks arise in remote sensing, image segmentation, and image and speech coding.

 Clustering is “the process of organizing objects into groups whose members are similar in
some way”. Clustering generates a partition of the data, which in turn supports whatever
decision-making activity is of interest to us. Clustering is used in unsupervised learning.
A cluster is therefore a collection of objects that are “similar” to one another and
“dissimilar” to the objects belonging to other clusters. (A short clustering example follows
the lists below.)

Categorizing Clustering

1. Sequential
 The number of clusters is not known a priori.
2. Hierarchical
 The final clustering is achieved via a divisive or an agglomerative approach.
3. Iterative – based on cost function optimization
 Probabilistic - data are assumed to be drawn from a mixture of probability distributions.
 Boundary detection - the boundaries of the regions where clusters lie are adjusted iteratively.
 Hard clustering - a vector belongs exclusively to one specific cluster.

Why clustering?

 Simplifications
 Pattern detection
 Useful in data concept construction
 Unsupervised learning process
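
To make hard clustering concrete, here is a minimal sketch using scikit-learn's KMeans on synthetic
two-dimensional data; the data and the choice of two clusters are illustrative assumptions, not part
of the lecture material.

# A minimal k-means clustering sketch on synthetic data (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial groups of 2-D feature vectors
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Hard clustering: every vector is assigned to exactly one of the 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])        # cluster index of the first few vectors
print(kmeans.cluster_centers_)   # the learned cluster centres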

Pattern Processing Stages:

1. The objects to be classified are first sensed by a transducer (e.g., a camera) - the sensor converts
images, sounds or other physical inputs into signal data.
2. The segmentor isolates the sensed objects from the background or from other objects.
3. The feature extractor reduces the data size by measuring certain object properties that are useful
for classification.
4. The classifier evaluates the evidence presented and makes the final decision, assigning the
sensed object to a category.
5. The preprocessor adjusts the light level and thresholds the image to remove the background (for
example, a conveyor belt), while a post-processor takes the costs of errors into account and
decides on the appropriate action.
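
To make the flow of these stages concrete, the following toy Python sketch runs a simplified version
of the pipeline on a one-dimensional "signal"; every stage here is a deliberately simplified stand-in
for a real component (the sensing stage is omitted).

# A toy end-to-end pipeline on a 1-D "signal" (simplified illustrations of the
# roles described above, not a real system).
signal = [0, 0, 7, 8, 9, 0, 0, 3, 4, 0, 0, 6, 5, 7, 6, 0]

def preprocess(signal, threshold=2):
    # Thresholding: suppress background values at or below the threshold
    return [x if x > threshold else 0 for x in signal]

def segment(signal):
    # Isolate contiguous non-zero runs as separate "objects"
    objects, current = [], []
    for x in signal:
        if x:
            current.append(x)
        elif current:
            objects.append(current)
            current = []
    if current:
        objects.append(current)
    return objects

def extract_features(obj):
    # Reduce each object to a small set of measurements
    return {"length": len(obj), "mean": sum(obj) / len(obj)}

def classify(features):
    # A toy decision rule on the extracted features
    return "large" if features["length"] >= 3 else "small"

for obj in segment(preprocess(signal)):
    print(obj, "->", classify(extract_features(obj)))
# [7, 8, 9] -> large ; [3, 4] -> small ; [6, 5, 7, 6] -> large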

Features may be represented as continuous, discrete, or discrete binary variables. A feature is a
function of one or more measurements, computed so that it quantifies some significant
characteristics of the object.

Example: consider our face then eyes, ears, nose, etc are features of the face.
A set of features that are taken together, forms the features vector.

Example: In the above example of a face, if all the features (eyes, ears, nose, etc.) are taken
together then the sequence is a feature vector ([eyes, ears, nose]). A feature vector is a sequence
of features represented as a d-dimensional column vector. In the case of speech, the MFCCs
(Mel-Frequency Cepstral Coefficients) are the spectral features of the speech signal; the sequence
of the first 13 coefficients forms a feature vector.


Pattern recognition possesses the following features:

 A pattern recognition system should recognize familiar patterns quickly and accurately

 Recognize and classify unfamiliar objects

 Accurately recognize shapes and objects from different angles

 Identify patterns and objects even when partly hidden

 Recognize patterns quickly with ease, and with automaticity.

Training and Learning in Pattern Recognition


Learning is a phenomenon through which a system gets trained and becomes adaptable so that it
can give accurate results. Learning is the most important phase, because how well the system
performs on the data provided to it depends on which algorithms are applied to the data.

The entire dataset is divided into two parts: one used to train the model, i.e. the training set, and
the other used to test the model after training, i.e. the testing set.

Training set:
The training set is used to build a model. It consists of the set of images that are used to train the
system. Training rules and algorithms are used to give relevant information on how to associate
input data with output decisions. The system is trained by applying these algorithms to the
dataset, all the relevant information is extracted from the data, and results are obtained.
Generally, 80% of the data of the dataset is taken for training data.

Testing set:
Testing data is used to test the system. It is the set of data that is used to verify whether the
system produces the correct output after being trained. Generally, 20% of the data in the
dataset is used for testing. Testing data is used to measure the accuracy of the system. For
example, if a system that identifies which category a particular flower belongs to classifies
seven out of ten flowers correctly and the remaining three incorrectly, then its accuracy is
70%.
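
As a minimal illustration of the 80/20 split and accuracy measurement described above, here is a
hedged sketch using scikit-learn; the iris dataset and the decision tree classifier are only example
choices, not part of the lecture material.

# Minimal 80/20 train/test split and accuracy check (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 80% of the data for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Fraction of the test samples classified correctly (e.g. 7 of 10 correct -> 0.70)
print("Accuracy:", accuracy_score(y_test, y_pred))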

Imagine we have a dataset containing information about apples and oranges. The features of each
fruit are its color (red or yellow) and its shape (round or oval). We can represent each fruit using
a list of strings, e.g. [‘red’, ’round’] for a red, round fruit.

Our goal is to write a function that can predict whether a given fruit is an apple or an orange. To
do this, we will use a simple pattern recognition algorithm called k-nearest neighbors (k-NN).
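
A minimal sketch of such a function is given below; the tiny training set and the simple
count-of-mismatched-features distance are illustrative assumptions, not part of the lecture material.

# A toy k-nearest-neighbours classifier for the apples-vs-oranges example.
from collections import Counter

# (features, label) pairs; features are [colour, shape] as described in the text
training_data = [
    (["red", "round"], "apple"),
    (["red", "round"], "apple"),
    (["red", "oval"], "apple"),
    (["yellow", "oval"], "orange"),
    (["yellow", "oval"], "orange"),
    (["yellow", "round"], "orange"),
]

def distance(a, b):
    # Count how many feature values differ (a simple Hamming distance)
    return sum(x != y for x, y in zip(a, b))

def predict(features, k=3):
    # Keep the k training examples closest to the query
    neighbours = sorted(training_data, key=lambda item: distance(item[0], features))[:k]
    # Majority vote among the k nearest neighbours
    labels = [label for _, label in neighbours]
    return Counter(labels).most_common(1)[0][0]

print(predict(["red", "round"]))     # expected: apple
print(predict(["yellow", "oval"]))   # expected: orange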

Importance of Pattern Recognition:

Data Analysis: Pattern recognition enables the analysis of complex datasets by identifying
underlying structures, trends, and relationships. It helps uncover valuable insights and hidden
patterns that may not be apparent through manual examination.

Decision Making: Recognizing patterns allows for informed decision-making in various domains,
including business, finance, healthcare, and engineering. By identifying patterns in data, decision-
makers can anticipate trends, predict outcomes, and formulate effective strategies.

Automation: In artificial intelligence and machine learning, pattern recognition algorithms
automate the process of identifying patterns in data. These algorithms can learn from past data and
make predictions or classifications based on new observations, leading to more efficient and
accurate decision-making processes.

Computer Vision: Pattern recognition is essential in computer vision systems, where algorithms
analyze visual data to interpret and understand the content of images or videos. Applications
include object detection, facial recognition, image segmentation, and autonomous navigation.

Speech and Language Processing: Pattern recognition plays a crucial role in speech and language
processing tasks, such as speech recognition, natural language understanding, and machine
translation. Algorithms analyze audio signals or text data to recognize patterns in speech or
language patterns and convert them into meaningful information.

Biometrics: Pattern recognition is used in biometric systems for identifying individuals based on
unique physiological or behavioral characteristics, such as fingerprints, iris scans, or voice
patterns. These systems are employed for security and authentication purposes in various
applications, including access control and identity verification.

Disease Diagnosis: In healthcare, pattern recognition techniques are used for medical image
analysis, disease diagnosis, and prognosis prediction. Algorithms analyze medical imaging data,
such as MRI scans or X-rays, to detect abnormalities, classify diseases, and assist healthcare
professionals in making accurate diagnoses.

Overall, pattern recognition is essential for understanding and interpreting complex data,
facilitating decision-making processes, and developing intelligent systems capable of automated
analysis and interpretation. Its applications span across diverse fields and contribute to
advancements in technology, science, and society.

History of Pattern Recognition:

Early Developments: The roots of pattern recognition can be traced back to ancient civilizations,
where humans relied on visual and auditory cues to recognize patterns in nature, such as animal
tracks and sounds. However, formal studies in pattern recognition began in the early 20th century
with the development of statistical methods and signal processing techniques.

Statistical Pattern Recognition: In the mid-20th century, pioneers such as Norbert Wiener and
Claude Shannon laid the foundation for statistical pattern recognition, introducing concepts such
as Bayesian decision theory and information theory. These mathematical frameworks provided a
systematic approach to analyzing and classifying patterns in data.

Machine Learning: The advent of computers in the latter half of the 20th century enabled the
development of machine learning algorithms for pattern recognition. Early approaches, such as the
perceptron and nearest neighbor algorithms, paved the way for more sophisticated techniques like
neural networks, support vector machines, and deep learning.

Computer Vision: Pattern recognition found widespread applications in computer vision, where
algorithms analyze visual data to interpret and understand the content of images or videos.
Landmark developments include the creation of edge detection algorithms, object recognition
systems, and convolutional neural networks (CNNs).

Speech and Language Processing: Pattern recognition techniques have been applied to speech
and language processing tasks, such as speech recognition, natural language understanding, and
machine translation. Early systems relied on statistical models, while modern approaches leverage
deep learning architectures like recurrent neural networks (RNNs) and transformers.

Goal of Pattern Recognition

Pattern recognition is a scientific discipline whose goal is classification of data, objects or patterns
into categories or classes.

The main goal and approach in pattern recognition is to:

 Hypothesize the model describing each population class

 Process the sensed data to eliminate noise

 Choose, for each sensed pattern, the model that corresponds best and assign the pattern to the
class described by that model

Advantages:

 Pattern recognition solves classification problems

 Pattern recognition solves the problem of fake biometric detection.

 It is useful for cloth pattern recognition for visually impaired people.

 It helps in speaker diarization.

 We can recognize particular objects from different angles.

Disadvantages:

 The syntactic pattern recognition approach is complex to implement and it is a very slow
process.

 Sometimes to get better accuracy, a larger dataset is required.

 It cannot explain why a particular object is recognized.


Example: my face vs my friend’s face.

Applications of Pattern Recognition:

Image Processing: Pattern recognition is used in image processing applications for tasks such as
object detection, image segmentation, and scene understanding. It finds applications in fields like
medical imaging, satellite imagery analysis, surveillance systems, and autonomous vehicles.

Biometrics: Pattern recognition plays a crucial role in biometric systems for identifying
individuals based on unique physiological or behavioral characteristics. Biometric modalities
include fingerprint recognition, iris recognition, facial recognition, voice recognition, and gait
analysis.

Speech Recognition: Pattern recognition techniques are employed in speech recognition systems
to transcribe spoken language into text. These systems are used in virtual assistants, voice-
controlled devices, dictation software, and customer service automation.

Medical Diagnosis: Pattern recognition algorithms assist healthcare professionals in medical
image analysis, disease diagnosis, and prognosis prediction. They analyze medical imaging data
(e.g., MRI scans, X-rays) to detect abnormalities, classify diseases, and guide treatment decisions.

Document Analysis: Pattern recognition is applied to document analysis tasks such as optical
character recognition (OCR), handwriting recognition, and document classification. These systems
automate data entry, digitize historical documents, and assist in information retrieval.

Financial Forecasting: Pattern recognition techniques are used in financial markets for predicting
trends, identifying trading opportunities, and assessing risk. They analyze historical market data
to forecast stock prices, detect anomalies, and optimize investment strategies.

Security and Surveillance: Pattern recognition is utilized in security and surveillance systems for
tasks such as intrusion detection, face recognition, and behavior analysis. These systems enhance
public safety, protect critical infrastructure, and prevent criminal activities.

Overall, pattern recognition has revolutionized numerous fields by enabling automated analysis
and interpretation of complex data, leading to advancements in technology, science, and society.
Its applications continue to expand as new algorithms and technologies emerge, driving innovation
and addressing real-world challenges.

Probability theory and statistical decision theory are two fundamental concepts in the field of
statistics and decision-making. Let's explore each of these concepts:
Probability Theory:
Probability theory deals with the mathematical study of random events or uncertain outcomes. It
provides a framework for quantifying uncertainty and reasoning about uncertainty in a systematic
manner.
Key concepts in probability theory include:
Sample space: The set of all possible outcomes of a random experiment.
Event: Any subset of the sample space, representing a particular outcome or set of outcomes.
Probability measure: A function that assigns a numerical value between 0 and 1 to each event,
representing the likelihood of that event occurring.
Probability distribution: A mathematical function that describes the probabilities of different
outcomes of a random variable.
Probability theory is used in various fields, including statistics, mathematics, physics, finance, and
engineering. It is applied in areas such as risk assessment, modeling of random phenomena, and
decision-making under uncertainty.
Statistical Decision Theory:
Statistical decision theory is concerned with making decisions in the presence of uncertainty, based
on statistical data and decision criteria. It provides a framework for rational decision-making in
situations where outcomes are uncertain and probabilities are known or can be estimated.
Key concepts in statistical decision theory include:
Decision problem: A situation in which a decision-maker must choose among alternative courses
of action, each associated with uncertain outcomes.
Loss function: A function that quantifies the cost or loss associated with different decision
outcomes.
Decision rule: A rule or criterion for selecting the best course of action based on the available
information and the objectives of the decision-maker.
Bayes decision rule: A decision rule that minimizes the expected loss, taking into account both
the prior probabilities of different outcomes and the loss associated with each outcome.
Statistical decision theory is applied in various fields, including economics, engineering, medicine,
and business. It is used to optimize decision-making processes, develop decision support systems,
and analyze the trade-offs between different decision options.

In summary, probability theory provides the mathematical foundation for quantifying uncertainty,
while statistical decision theory provides a framework for making optimal decisions under
uncertainty based on available information and decision criteria. Together, these two concepts
form the basis for rational decision-making in a wide range of real-world situations.

LECTURE TWO:

2.1 Statistical Pattern Recognition
2.2 Probability theory and statistical decision theory
2.3 Bayes decision theory
2.4 Maximum likelihood estimation

Introduction:

2.1 STATISTICAL PATTERN RECOGNITION

Statistical pattern recognition: Statistical pattern recognition is the branch of statistics that deals
with the identification and classification of patterns in data. It is a type of supervised learning,
where the data is labeled with class labels that indicate which class a particular instance belongs
to. The goal of statistical pattern recognition is to learn a model that can accurately classify new
data instances based on their features.

The branch of machine learning known as statistical pattern recognition focuses on finding patterns
and regularities in data. It enables machines to gain knowledge from data, improve their
performance, and make choices based on what they have learned.

The goal of Statistical Pattern Recognition is to find relationships between variables that can be
used for prediction or classification tasks. These notes explore the various techniques used in
Statistical Pattern Recognition and how these methods are applied to solve real-world problems.

The importance of pattern recognition lies in its ability to detect complex relations among variables
without explicit programming instructions. By using statistical models, machines can identify
regularities in data that would otherwise require manual labor or trial-and-error experimentation
by humans. In addition, machines can generalize from existing knowledge bases to predict new
outcomes more accurately than before.

Statistical Pattern Recognition is becoming increasingly important within many industries due to
its ability to automate certain processes as well as to provide valuable insights into large datasets
that may otherwise remain hidden beneath the surface. These notes aim to provide an overview of
different techniques used for identifying patterns within data and to explain how they are
employed in solving practical problems effectively.

This field combines elements of statistics, mathematics, and computer science to develop
algorithms and models that can recognize regularities, structures, and anomalies in data sets.

What Is Statistical Pattern Recognition With Example?

Statistical pattern recognition (SPR) is a field of data analysis that uses mathematical models and
algorithms to identify patterns in large datasets. It can be used for various tasks, such as
handwriting or speech recognition, classification of objects in images, and natural language
processing. SPR employs several techniques including support vector machines, neural networks,
linear discriminants, Bayesian methods, and k-nearest neighbors, alongside feature extraction
algorithms.

In terms of applications, SPR has been successfully applied to problems like cursive handwriting
recognition and automated medical diagnosis. In the case of handwriting recognition, an algorithm
works by extracting features using a feature extraction algorithm then matching them with existing
model parameters. The same principle applies when solving more complex tasks such as image
classification where deep learning may be employed instead of traditional methods like
discriminant analysis. Similarly, machine vision systems use SPR techniques to identify objects
within an image and classify them according to specific criteria. Furthermore, modern robotics
also utilizes SPR concepts to enable robots to recognize their environment better, the iRobot
Roomba vacuum cleaner being one example.

By combining different types of data transformations alongside statistical modeling approaches
such as supervised learning algorithms and unsupervised clustering techniques, it is possible to
uncover new insights from data sets which could otherwise have gone unnoticed. With the help of
advancements in computing power and technologies like artificial intelligence (AI), this area
continues to grow rapidly, allowing us to investigate further into the world of big data.

What Is Statistical Pattern Recognition In Cognitive Psychology?

A method of cognitive psychology known as statistical pattern recognition employs learning
algorithms to detect patterns in data automatically. This technology can be used for shape
recognition, where features such as the size and orientation of the object are extracted from an
image using feature selection techniques and then converted into a feature vector input which
describes the identity of the object. The classification approach uses this information to classify
objects or events according to their properties, with Bayesian pattern classifiers being one of the
most common methods. These optimal classifiers use probabilities rather than hard limitations on
parameters when making decisions about how to group different classes together; they also allow
for prior knowledge about certain groups of objects or events to be taken into account when doing
so. By combining these two elements—feature extraction and classification—statistical pattern
recognition enables accurate automatic recognition processes.

Key Concepts

Features and Feature Extraction:

Features are measurable properties or characteristics of the data. Feature extraction involves
transforming raw data into a set of attributes that can be used for pattern recognition.

Classification and Clustering:

Classification: This is a supervised learning process where the goal is to assign input data into
predefined categories or classes based on the features. Common algorithms include decision trees,
support vector machines, and neural networks.

Clustering: This is an unsupervised learning process where the objective is to group similar data
points together without predefined labels. K-means and hierarchical clustering are popular
methods used for this purpose.

Model Training and Validation:

Model training involves using a set of data (training set) to build and optimize a pattern recognition
model. Validation involves testing the model on a separate set of data (validation set) to evaluate
its performance and generalizability.

Probability and Statistical Inference:

Probability theory underpins many pattern recognition techniques, allowing for the modeling of
uncertainty and the making of inferences about the data. Bayesian inference and maximum
likelihood estimation are commonly used methods.

Dimensionality Reduction:

Techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis
(LDA) are used to reduce the number of features while retaining the most significant information,
improving computational efficiency and model performance.
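
As a brief illustration of dimensionality reduction, the following sketch applies scikit-learn's PCA
to its built-in digits dataset; the dataset and the choice of two components are only example choices.

# Reducing 64-dimensional digit images to 2 principal components (illustrative).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)       # X has shape (1797, 64)

pca = PCA(n_components=2)                 # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (1797, 2)
print(pca.explained_variance_ratio_)      # fraction of variance each component retains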

Applications

Statistical Pattern Recognition has a wide range of applications across various fields:

Image and Speech Recognition: Identifying objects in images and understanding spoken language.

Medical Diagnosis: Classifying medical images or patient data to assist in diagnosing diseases.

Finance: Fraud detection, stock market prediction, and credit scoring.

Text Analysis: Spam detection, sentiment analysis, and topic modeling.

Challenges

High Dimensionality: Large datasets with many features can be computationally intensive and
may require dimensionality reduction techniques.

Overfitting: Models that perform well on training data but poorly on unseen data. Regularization
techniques and cross-validation are used to mitigate this.

Noise and Outliers: Real-world data often contain noise and outliers that can affect the accuracy
of pattern recognition models.

In conclusion, Statistical Pattern Recognition is a crucial area in the realm of data science and
artificial intelligence, providing powerful tools and methodologies for analyzing complex data and
making informed decisions based on statistical evidence.

2.2 Probability Theory and Statistical Decision Theory
a) Probability Theory:
 Definition: Probability theory is the branch of mathematics that deals with the analysis of
random events. It provides a framework for quantifying the uncertainty associated with various
phenomena and for making predictions about future events based on known probabilities.
 Applications: Used in diverse fields such as finance, insurance, science, engineering, and
everyday decision-making.
Statistical Decision Theory:
 Definition: Statistical decision theory involves making decisions under uncertainty. It
combines probability theory with decision-making processes to identify the best course of
action when outcomes are uncertain.
 Applications: Widely used in economics, medical decision-making, machine learning, and
artificial intelligence.
Fundamentals of Probability Theory
Basic Concepts:
 Random Experiment: An experiment or process for which the outcome cannot be
predicted with certainty.
 Sample Space (S): The set of all possible outcomes of a random experiment.
 Event: A subset of the sample space. An event occurs if the outcome of the experiment is
one of the elements in the subset.

Bayes’ Theorem:

Bayes’ Theorem is used to determine the conditional probability of an event. It is named after
the English statistician Thomas Bayes, whose formula was published posthumously in 1763. Bayes’
Theorem is a very important theorem in mathematics that laid the foundation of a unique statistical
inference approach called Bayesian inference. It is used to find the probability of an event based
on prior knowledge of conditions that might be related to that event.

P (A|B) = P(B|A)P(A) / P(B)

Where,
P(A) and P(B) are the probabilities of events A and B
P (A|B) is the probability of event A when event B happens
P(B|A) is the probability of event B when A happens

Bayes Theorem Statement


Bayes’ Theorem for a set of n events is defined as follows.
Let E1, E2,…, En be a set of events associated with the sample space S, in which all the events
E1, E2,…, En have a non-zero probability of occurrence and all the events E1, E2,…, En form a
partition of S. Let A be an event from the space S for which we have to find the probability; then,
according to Bayes’ theorem,

P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek),   for k = 1, 2, 3, …, n
Bayes Theorem Formula

For any two events A and B, the formula for Bayes’ theorem is given by

P(A|B) = P(B|A)P(A) / P(B)

Where,
P(A) and P(B) are the probabilities of events A and B, and P(B) is never equal to zero.
P(A|B) is the probability of event A when event B happens
P(B|A) is the probability of event B when A happens

Bayes Theorem Derivation

P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)


Bayes’ theorem is also known as the formula for the probability of “causes”. As we know, the
Ei‘s are a partition of the sample space S, and at any given time only one of the events
Ei occurs. Thus we conclude that the Bayes’ theorem formula gives the probability of a particular
Ei, given the event A has occurred.

Terms Related to Bayes Theorem

After learning about Bayes theorem in detail, let us understand some important terms related to
the concepts we covered in formula and derivation.

 Hypotheses: The events E1, E2,…, En in the sample space are called the hypotheses.

 Priori Probability: Priori Probability is the initial probability of an event occurring before any
new data is taken into account. P(Ei) is the priori probability of hypothesis Ei.

 Posterior Probability: Posterior Probability is the updated probability of an event after
considering new information. Probability P(Ei|A) is considered as the posterior probability of
hypothesis Ei.

Conditional Probability

The probability of an event A based on the occurrence of another event B is termed conditional
probability. It is denoted as P(A|B) and represents the probability of A when event B has already
happened.

Joint Probability

When the probability of two or more events occurring together at the same time is measured, it
is called the joint probability. For two events A and B, the joint probability is denoted by
P(A∩B).

Random Variables

Real-valued variables whose possible values are determined by random experiments are called
random variables. The probability of finding such variables is the experimental probability.

Bayes’ Theorem Applications

Bayesian inference is very important and has found application in various activities, including
medicine, science, philosophy, engineering, sports, law, etc., and Bayesian inference is directly
derived from Bayes’ theorem.

Example: Bayes’ theorem defines the accuracy of the medical test by taking into account how
likely a person is to have a disease and what is the overall accuracy of the test.
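
As a hedged numerical illustration (the 1% prevalence, 90% sensitivity and 5% false-positive rate
below are invented for the example, not taken from the lecture), Bayes’ theorem gives the
probability of having the disease given a positive test:

# Worked Bayes' theorem example with made-up numbers (illustrative only).
p_disease = 0.01              # prior P(D): 1% of people have the disease
p_pos_given_disease = 0.90    # P(+|D): sensitivity of the test
p_pos_given_healthy = 0.05    # P(+|not D): false-positive rate

# Total probability of a positive test: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D)P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.154, i.e. roughly 15%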

Difference between Conditional Probability and Bayes Theorem

The difference between conditional probability and Bayes’ theorem can be summarized as follows:

 Conditional probability, P(A|B), is the probability of event A given that event B has already
occurred; it is computed directly as P(A∩B) / P(B).
 Bayes’ theorem uses conditional probability in the reverse direction: it derives P(A|B) from
P(B|A) together with the prior probabilities P(A) and P(B), allowing a prior belief to be updated
into a posterior probability after observing evidence.
Theorem of Total Probability
Let E1, E2, . . ., En be mutually exclusive and exhaustive events associated with a random experiment,
and let E be an event that occurs with some Ei. Then

P(E) = ∑ P(Ei)P(E|Ei),   for i = 1, 2, …, n

Proof (sketch): Since E1, …, En partition the sample space, E = (E∩E1) ∪ … ∪ (E∩En) with the
pieces disjoint, so P(E) = ∑ P(E∩Ei) = ∑ P(Ei)P(E|Ei).

Implementation of Bayesian Regression

Linear regression is a popular regression approach in machine learning. Linear regression is based
on the assumption that the underlying data is normally distributed and that all relevant predictor
variables have a linear relationship with the outcome. In the real world these assumptions do not
always hold, and when the data does not follow them, Bayesian regression can be the better choice.

Bayesian regression employs prior belief or knowledge about the data to “learn” more about it and
create more accurate predictions. It also takes into account the data’s uncertainty and leverages
prior knowledge to provide more precise estimates of the data. As a result, it is an ideal choice
when the data is complex or ambiguous.

Bayesian regression uses a Bayes algorithm to estimate the parameters of a linear regression model
from data, including prior knowledge about the parameters. Because of its probabilistic character,
it can produce more accurate estimates for regression parameters than ordinary least squares (OLS)
linear regression, provide a measure of uncertainty in the estimation, and make stronger
conclusions than OLS. Bayesian regression can also be utilized for related regression analysis tasks
like model selection and outlier detection.

Bayesian Regression

Bayesian regression is a type of linear regression that uses Bayesian statistics to estimate the
unknown parameters of a model. It uses Bayes’ theorem to estimate the likelihood of a set of
parameters given observed data. The goal of Bayesian regression is to find the best estimate of the
parameters of a linear model that describes the relationship between the independent and the
dependent variables.

The main difference between traditional linear regression and Bayesian regression is the
underlying assumption regarding the data-generating process. Traditional linear regression
assumes that the data follow a Gaussian or normal distribution, while Bayesian regression makes
an additional modeling assumption by placing a prior probability distribution on the
parameters. Bayesian regression also enables more flexibility as it allows for additional parameters
or prior distributions, and can be used to construct an arbitrarily complex model that explicitly
expresses prior beliefs about the data. Additionally, Bayesian regression provides more accurate
predictive measures from fewer data points and is able to construct estimates for uncertainty
around the estimates. On the other hand, traditional linear regressions are easier to implement and
generally faster with simpler models and can provide good results when the assumptions about the
data are valid.

Bayesian Regression can be very useful when we have insufficient data in the dataset or the data
is poorly distributed. The output of a Bayesian Regression model is obtained from a probability
distribution, as compared to regular regression techniques where the output is just obtained from
a single value of each attribute.

Some Dependent Concepts for Bayesian Regression

The important concepts in Bayesian Regression are as follows:

Bayes Theorem

Bayes’ theorem gives the relationship between an event’s prior probability and its posterior
probability after evidence is taken into account. It states that the conditional probability of an event
is equal to the probability of the evidence given that event, multiplied by the prior probability of
the event, divided by the probability of the evidence:

P(A|B) = P(B|A)P(A) / P(B)

Where P(A|B) is the probability of event A occurring given that event B has already occurred,
P(B|A) is the probability of event B occurring given that event A has already occurred, P(A) is the
probability of event A occurring and P(B) is the probability of event B occurring.

Need for Bayesian Regression

There are several reasons why Bayesian regression is useful over other regression techniques.
Some of them are as follows:

1. Bayesian regression also uses prior belief about the parameters in the analysis, which makes
it useful when there is limited data available and the prior knowledge is relevant. By
combining prior knowledge with the observed data, Bayesian regression provides more
informed and potentially more accurate estimates of the regression parameters.

2. Bayesian regression provides a natural way to measure the uncertainty in the estimation of
regression parameters by generating the posterior distribution, which captures the uncertainty
in the parameter values, as opposed to the single point estimate that is produced by standard
regression techniques. This distribution offers a range of acceptable values for the parameters
and can be used to compute credible intervals (Bayesian confidence intervals).

3. In order to incorporate complicated correlations and non-linearities, Bayesian regression
provides flexibility by offering a framework for integrating various prior distributions, which
makes it capable of handling situations where the basic assumptions of standard regression
techniques, like linearity or homoscedasticity, may not be true. It enables the modeling of more
realistic and nuanced relationships between the predictors and the response variable.

4. Bayesian regression facilitates model selection and comparison by calculating the posterior
probabilities of different models.

5. Bayesian regression can handle outliers and influential observations more effectively
compared to classical regression methods. It provides a more robust approach to regression
analysis, as extreme or influential observations have a lesser impact on the estimation.

# Import the necessary libraries
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO, Predictive
from pyro.optim import Adam
import matplotlib.pyplot as plt
import seaborn as sns

# Generate some sample data
torch.manual_seed(0)
X = torch.linspace(0, 10, 100)
true_slope = 2
true_intercept = 1
Y = true_intercept + true_slope * X + torch.randn(100)

# Define the Bayesian regression model
def model(X, Y):
    # Priors for the parameters
    slope = pyro.sample("slope", dist.Normal(0, 10))
    intercept = pyro.sample("intercept", dist.Normal(0, 10))
    sigma = pyro.sample("sigma", dist.HalfNormal(1))

    # Expected value of the outcome
    mu = intercept + slope * X

    # Likelihood (sampling distribution) of the observations
    with pyro.plate("data", len(X)):
        pyro.sample("obs", dist.Normal(mu, sigma), obs=Y)

# Run Bayesian inference using SVI (Stochastic Variational Inference)
def guide(X, Y):
    # Approximate posterior distributions for the parameters
    slope_loc = pyro.param("slope_loc", torch.tensor(0.0))
    slope_scale = pyro.param("slope_scale", torch.tensor(1.0),
                             constraint=dist.constraints.positive)
    intercept_loc = pyro.param("intercept_loc", torch.tensor(0.0))
    intercept_scale = pyro.param("intercept_scale", torch.tensor(1.0),
                                 constraint=dist.constraints.positive)
    sigma_loc = pyro.param("sigma_loc", torch.tensor(1.0),
                           constraint=dist.constraints.positive)

    # Sample from the approximate posterior distributions
    slope = pyro.sample("slope", dist.Normal(slope_loc, slope_scale))
    intercept = pyro.sample("intercept", dist.Normal(intercept_loc, intercept_scale))
    sigma = pyro.sample("sigma", dist.HalfNormal(sigma_loc))

# Initialize the SVI and optimizer
optim = Adam({"lr": 0.01})
svi = SVI(model, guide, optim, loss=Trace_ELBO())

# Run the inference loop
num_iterations = 1000
for i in range(num_iterations):
    loss = svi.step(X, Y)
    if (i + 1) % 100 == 0:
        print(f"Iteration {i + 1}/{num_iterations} - Loss: {loss}")

# Obtain posterior samples using Predictive
predictive = Predictive(model, guide=guide, num_samples=1000)
posterior = predictive(X, Y)

# Extract the parameter samples
slope_samples = posterior["slope"]
intercept_samples = posterior["intercept"]
sigma_samples = posterior["sigma"]

# Compute the posterior means
slope_mean = slope_samples.mean()
intercept_mean = intercept_samples.mean()
sigma_mean = sigma_samples.mean()

# Print the estimated parameters
print("Estimated Slope:", slope_mean.item())
print("Estimated Intercept:", intercept_mean.item())
print("Estimated Sigma:", sigma_mean.item())

# Create subplots
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Plot the posterior distribution of the slope
sns.kdeplot(slope_samples, shade=True, ax=axs[0])
axs[0].set_title("Posterior Distribution of Slope")
axs[0].set_xlabel("Slope")
axs[0].set_ylabel("Density")

# Plot the posterior distribution of the intercept
sns.kdeplot(intercept_samples, shade=True, ax=axs[1])
axs[1].set_title("Posterior Distribution of Intercept")
axs[1].set_xlabel("Intercept")
axs[1].set_ylabel("Density")

# Plot the posterior distribution of sigma
sns.kdeplot(sigma_samples, shade=True, ax=axs[2])
axs[2].set_title("Posterior Distribution of Sigma")
axs[2].set_xlabel("Sigma")
axs[2].set_ylabel("Density")

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

2.3 Decision Making Under Uncertainty


In the context of pattern recognition, statistical decision theory plays a crucial role in making
informed decisions based on uncertain or incomplete data. The theory provides a formal
framework for selecting the best action or decision in the presence of uncertainty, using concepts
such as decision rules and loss functions.

Decision Rule:

 Definition: A decision rule is a function that maps observed data (features) to an action or
decision. In pattern recognition, this typically involves classifying an input pattern into one
of several predefined categories or classes.

 Example: In a handwritten digit recognition system, the decision rule would assign an
observed digit image (data) to a specific digit class (e.g., 0-9).

Loss Function (L(θ,a)):

 Definition: The loss function L(θ,a) quantifies the cost associated with making a decision
a when the true state of nature is θ. In pattern recognition, the true state of nature
corresponds to the actual class of the pattern, while the decision is the predicted class.
 Purpose: The loss function helps in evaluating the performance of a decision rule by
assigning a penalty for incorrect classifications. It allows for the assessment of the overall
risk or expected loss, guiding the selection of optimal decision rules.
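
To make the expected-loss idea concrete, here is a small sketch with invented posterior
probabilities and loss values: for each candidate action a we compute the expected loss
R(a|x) = ∑ L(θ, a) P(θ|x) over the classes θ and choose the action with the smallest value.

# Choosing the action with minimum expected loss (illustrative numbers only).
posterior = {"spam": 0.7, "not_spam": 0.3}   # P(theta | x) for the two classes

# Loss L(theta, a): penalty for taking action a when the true class is theta
loss = {
    ("spam", "delete"): 0.0, ("spam", "keep"): 1.0,
    ("not_spam", "delete"): 5.0, ("not_spam", "keep"): 0.0,
}

def expected_loss(action):
    return sum(loss[(theta, action)] * p for theta, p in posterior.items())

for action in ("delete", "keep"):
    print(action, expected_loss(action))
# delete: 0.0*0.7 + 5.0*0.3 = 1.5 ; keep: 1.0*0.7 + 0.0*0.3 = 0.7
best = min(("delete", "keep"), key=expected_loss)
print("Chosen action:", best)   # keep, since it has the smaller expected loss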

Conclusion
Probability theory and statistical decision theory provide powerful tools for dealing with
uncertainty and making informed decisions. By leveraging these theories, individuals and
organizations can improve their decision-making processes, optimize outcomes, and effectively
manage risks across various domains.
2.4 Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a fundamental method in statistical inference used to
estimate the parameters of a statistical model. The central idea of MLE is to find the parameter
values that maximize the likelihood function, thereby making the observed data most probable
under the assumed model. MLE is widely used due to its desirable properties and simplicity.

Key Concepts:

Likelihood Function:

 Definition: The likelihood function L(θ;x) measures the probability of the observed data
x given the parameters θ of the model. It is derived from the probability density function
(PDF) or probability mass function (PMF) of the data.

 Purpose: The likelihood function serves as the basis for finding the most probable
parameter values that explain the observed data.

Maximum Likelihood Estimation:

 Objective: To find the parameter values θ that maximize the likelihood function.
Formally, this can be expressed as:

θ̂ = arg max_θ L(θ; x)

 Log-Likelihood: Often, the logarithm of the likelihood function, called the log-
likelihood ℓ(θ; x) = ln L(θ; x), is used because it simplifies the mathematical operations
involved in maximization.

Properties of MLE:

 Consistency: As the sample size increases, the MLE converges to the true parameter value.

 Asymptotic Normality: For large sample sizes, the distribution of the MLE approaches a
normal distribution.

 Efficiency: MLE achieves the lowest possible variance among all unbiased estimators,
under certain regularity conditions.

Application Process:

1. Define the Statistical Model:

 Specify the probability distribution of the data with unknown parameters. For example,
assume the data follows a normal distribution with mean μ and variance σ2.

2. Construct the Likelihood Function:

 Based on the assumed model, write the likelihood function for the observed data. For a
normal distribution, the likelihood function for a sample x1, x2,…,xn is:

L(μ, σ2; x1,…,xn) = ∏ (1 / √(2πσ2)) exp(−(xi − μ)2 / (2σ2)),   for i = 1, 2, …, n

3. Compute the Log-Likelihood:

 Take the natural logarithm of the likelihood function to obtain the log-likelihood:

ln L(μ, σ2) = −(n/2) ln(2πσ2) − (1 / (2σ2)) ∑ (xi − μ)2

4. Maximize the Log-Likelihood:

 Differentiate the log-likelihood function with respect to the parameters and set the derivatives
to zero to find the maximum. Solve the resulting equations to obtain the MLE of the
parameters; for the normal example this gives μ̂ = (1/n) ∑ xi (the sample mean) and
σ̂2 = (1/n) ∑ (xi − μ̂)2.
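
A minimal numerical sketch of this procedure for normally distributed data (using NumPy and
SciPy on a synthetic sample; the true parameters below are invented) compares the closed-form
MLE with a direct numerical maximization of the log-likelihood:

# MLE for a normal distribution: closed form vs numerical optimization (illustrative).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)    # synthetic sample

# Closed-form MLEs obtained by setting the log-likelihood derivatives to zero
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()          # note: divides by n, not n-1

# Numerical check: minimize the negative log-likelihood directly
def neg_log_likelihood(params):
    mu, log_sigma = params                       # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma**2) + ((x - mu) ** 2).sum() / (2 * sigma**2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_num, sigma2_num = result.x[0], np.exp(result.x[1]) ** 2

print(mu_hat, sigma2_hat)      # closed-form estimates
print(mu_num, sigma2_num)      # should agree closely with the closed form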

Advantages of MLE:

 Simplicity: MLE provides a straightforward method for parameter estimation by focusing
solely on the observed data.
 Flexibility: Applicable to a wide range of statistical models and data types.
 Point Estimates: Provides specific parameter estimates without requiring prior distributions
or additional assumptions.

Limitations of MLE:

 No Prior Information: MLE does not incorporate prior knowledge or assumptions about the
parameters, which can be a limitation in some contexts.
 Sensitivity to Sample Size: With small sample sizes, MLE estimates can be biased or have
high variance.
 Computational Complexity: For complex models or large datasets, finding the MLE can be
computationally intensive.

In summary, Maximum Likelihood Estimation is a powerful and versatile method for estimating
the parameters of statistical models. By maximizing the likelihood function, MLE seeks to find
the parameter values that make the observed data most probable under the assumed model. Despite
its limitations, MLE remains a fundamental tool in statistical analysis and inference.

Maximum A Posteriori (MAP) Estimation

MAP estimation is a Bayesian approach that combines prior information with the likelihood
function to estimate the parameters. It involves finding the parameter values that maximize the
posterior distribution, which is obtained by applying Bayes’ theorem. In MAP estimation, a prior
distribution is specified for the parameters, representing prior beliefs or knowledge about their
values. The likelihood function is then multiplied by the prior distribution to obtain the joint
distribution, and the parameter values that maximize this joint distribution are selected as the MAP
estimates. MAP estimation provides point estimates of the parameters, similar to MLE, but
incorporates prior information.
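
As a hedged illustration of MAP versus MLE, consider estimating the success probability of a coin
from invented data with a Beta(2, 2) prior; for a Bernoulli likelihood with a Beta(a, b) prior the
MAP estimate has the well-known closed form (k + a − 1)/(n + a + b − 2), while the MLE is simply k/n.

# MLE vs MAP for the success probability of a coin (illustrative numbers).
k, n = 7, 10          # observed 7 heads in 10 flips
a, b = 2, 2           # Beta(2, 2) prior: a mild belief that the coin is roughly fair

p_mle = k / n                          # maximizes the likelihood alone
p_map = (k + a - 1) / (n + a + b - 2)  # maximizes likelihood * prior (posterior mode)

print("MLE estimate:", p_mle)          # 0.70
print("MAP estimate:", p_map)          # about 0.667, pulled toward the prior's 0.5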

LECTURE THREE (3):
3.1 Machine Learning Algorithms
3.2 Supervised learning
3.3 Unsupervised learning
3.4 Neural networks
3.5 Support vector machines
3.1 Machine Learning Algorithms

Machine learning algorithms are computational models that allow computers to understand
patterns and forecast or make judgments based on data without the need for explicit
programming. These algorithms form the foundation of modern artificial intelligence and are
used in a wide range of applications, including image and speech recognition, natural language
processing, recommendation systems, fraud detection, autonomous cars etc.

This machine learning algorithms section will cover the essential algorithms of machine
learning, such as support vector machines, decision trees, logistic regression, the naive Bayes
classifier, random forests, k-means clustering, reinforcement learning, hierarchical clustering,
XGBoost, AdaBoost, etc.

Types of Machine Learning Algorithms

There are three types of machine learning algorithms.

3.2 Supervised Learning

 Regression

Regression, a statistical approach, dissects the relationship between dependent and
independent variables, enabling predictions through various regression models.

This section delves into regression in machine learning, elucidating models, terminologies,
types, and practical applications.

What is Regression?

Regression is a statistical approach used to analyze the relationship between a dependent
variable (target variable) and one or more independent variables (predictor variables). The
objective is to determine the most suitable function that characterizes the connection
between these variables.

It seeks to find the best-fitting model, which can be utilized to make predictions or draw
conclusions.

Regression in Machine Learning

It is a supervised machine learning technique, used to predict the value of the dependent
variable for new, unseen data. It models the relationship between the input features and the
target variable, allowing for the estimation or prediction of numerical values.

A regression analysis problem applies when the output variable is a real or continuous value,
such as “salary” or “weight”. Many different models can be used; the simplest is linear
regression, which tries to fit the data with the best hyperplane passing through the points.

Terminologies Related to the Regression Analysis in Machine Learning

Terminologies Related to Regression Analysis:

 Response Variable: The primary factor to predict or understand in regression, also known
as the dependent variable or target variable.

 Predictor Variable: Factors influencing the response variable, used to predict its values;
also called independent variables.

 Outliers: Observations with significantly low or high values compared to others,
potentially impacting results and best avoided.

 Multicollinearity: High correlation among independent variables, which can complicate
the ranking of influential variables.

 Underfitting and Overfitting: Overfitting occurs when an algorithm performs well on
training but poorly on testing, while underfitting indicates poor performance on both
datasets.

Regression Types

The main types of regression are:

 Simple Regression

o Used to predict a continuous dependent variable based on a single independent
variable.

o Simple linear regression should be used when there is only a single independent
variable.

 Multiple Regression

o Used to predict a continuous dependent variable based on multiple independent
variables.

o Multiple linear regression should be used when there are multiple independent
variables.

 Nonlinear Regression

o Relationship between the dependent variable and independent variable(s) follows a
nonlinear pattern.

o Provides flexibility in modeling a wide range of functional forms.

Regression Algorithms

There are many different types of regression algorithms, but some of the most common
include:

 Linear Regression

o Linear regression is one of the simplest and most widely used statistical models. This
assumes that there is a linear relationship between the independent and dependent
variables. This means that the change in the dependent variable is proportional to the
change in the independent variables.

 Polynomial Regression

o Polynomial regression is used to model nonlinear relationships between the
dependent variable and the independent variables. It adds polynomial terms to the
linear regression model to capture more complex relationships.

 Support Vector Regression (SVR)

o Support vector regression (SVR) is a type of regression algorithm that is based on
the support vector machine (SVM) algorithm. SVM is primarily used for
classification tasks, but it can also be used for regression. SVR works by finding a
function that stays within a small margin (epsilon) of the observed target values,
penalizing only predictions that fall outside that margin.

 Decision Tree Regression

o Decision tree regression is a type of regression algorithm that builds a decision tree
to predict the target value. A decision tree is a tree-like structure that consists of nodes
and branches. Each node represents a decision, and each branch represents the
outcome of that decision. The goal of decision tree regression is to build a tree that
can accurately predict the target value for new data points.

 Random Forest Regression

o Random forest regression is an ensemble method that combines multiple decision
trees to predict the target value. Ensemble methods are a type of machine learning
algorithm that combines multiple models to improve the performance of the overall
model. Random forest regression works by building a large number of decision trees,
each of which is trained on a different subset of the training data. The final prediction
is made by averaging the predictions of all of the trees.

Regularized Linear Regression Techniques

 Ridge Regression

o Ridge regression is a type of linear regression that is used to prevent overfitting. It
does this by adding a penalty term proportional to the squared magnitudes of the
coefficients (an L2 penalty) to the loss function. Overfitting occurs when the model
learns the training data too well and is unable to generalize to new data.

 Lasso regression

o Lasso regression is another type of linear regression that is used to prevent
overfitting. It does this by adding a penalty term proportional to the absolute values
of the coefficients (an L1 penalty) to the loss function, which shrinks some weights
and sets others exactly to zero.
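
The following minimal sketch (scikit-learn on synthetic data; the data and the alpha values are
illustrative assumptions) shows that Lasso drives most irrelevant coefficients exactly to zero,
while Ridge only shrinks them:

# Ridge (L2) vs Lasso (L1) regularization on synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 features matter
y = X @ true_coef + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 2))   # all coefficients shrunk, but generally non-zero
print(np.round(lasso.coef_, 2))   # most irrelevant coefficients driven exactly to 0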

Characteristics of Regression

Here are the characteristics of the regression:

 Continuous Target Variable: Regression deals with predicting continuous target
variables that represent numerical values. Examples include predicting house prices,
forecasting sales figures, or estimating patient recovery times.

 Error Measurement: Regression models are evaluated based on their ability to minimize
the error between the predicted and actual values of the target variable. Common error
metrics include mean absolute error (MAE), mean squared error (MSE), and root mean
squared error (RMSE).

 Model Complexity: Regression models range from simple linear models to more complex
nonlinear models. The choice of model complexity depends on the complexity of the
relationship between the input features and the target variable.

 Overfitting and Underfitting: Regression models are susceptible to overfitting and
underfitting.

 Interpretability: The interpretability of regression models varies depending on the
algorithm used. Simple linear models are highly interpretable, while more complex models
may be more difficult to interpret.

Examples

Which of the following is a regression task?

 Predicting age of a person

 Predicting nationality of a person

 Predicting whether stock price of a company will increase tomorrow

 Predicting whether a document is related to sighting of UFOs?

Solution: Predicting the age of a person, because it is a real value. Predicting nationality is
categorical; whether the stock price will increase is a discrete yes/no answer; and predicting whether
a document is related to UFOs is again a discrete yes/no answer.

Regression Model Machine Learning

Let’s take an example of linear regression. We have a Housing data set and we want to
predict the price of the house. Following is the python code for it.

# Python code to illustrate
# regression using a data set
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
import pandas as pd

# Load CSV and columns
df = pd.read_csv("Housing.csv")

Y = df['price']
X = df['lotsize']

X = X.values.reshape(len(X), 1)
Y = Y.values.reshape(len(Y), 1)

# Split the data into training/testing sets
X_train = X[:-250]
X_test = X[-250:]

# Split the targets into training/testing sets
Y_train = Y[:-250]
Y_test = Y[-250:]

# Plot outputs
plt.scatter(X_test, Y_test, color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.xticks(())
plt.yticks(())

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, Y_train)

# Plot outputs
plt.plot(X_test, regr.predict(X_test), color='red', linewidth=3)
plt.show()

Output:

Here in this graph, we plot the test data. The red line indicates the best fit line for predicting
the price. To make an individual prediction using the linear regression model:

print(regr.predict([[5000]]))  # the input must be passed as a 2-D array

Regression Evaluation Metrics

Here are some most popular evaluation metrics for regression:

 Mean Absolute Error (MAE): The average absolute difference between the predicted and
actual values of the target variable.

 Mean Squared Error (MSE): The average squared difference between the predicted and
actual values of the target variable.

 Root Mean Squared Error (RMSE): The square root of the mean squared error.

 Huber Loss: A hybrid loss function that transitions from MAE to MSE for larger errors,
providing balance between robustness and MSE’s sensitivity to outliers.

 Root Mean Squared Logarithmic Error (RMSLE): The square root of the average squared
difference between the logarithms of the predicted and actual values; useful when the target
spans several orders of magnitude.

 R2 Score: Higher values indicate a better fit. It typically ranges from 0 to 1, although it
can be negative for models that fit worse than simply predicting the mean.
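As a small illustration of the metrics listed above, the sketch below (made-up predictions; scikit-learn assumed available) computes MAE, MSE, RMSE and the R2 score.

# A short sketch of the regression metrics above on made-up predictions
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")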

Applications of Regression

 Predicting prices: For example, a regression model could be used to predict the price of a
house based on its size, location, and other features.

 Forecasting trends: For example, a regression model could be used to forecast the sales
of a product based on historical sales data and economic indicators.

 Identifying risk factors: For example, a regression model could be used to identify risk
factors for heart disease based on patient data.

 Making decisions: For example, a regression model could be used to recommend which
investment to buy based on market data.

Advantages of Regression

 Easy to understand and interpret

 Some variants (e.g., tree-based or regularized regression) are robust to outliers, although
ordinary least squares itself is sensitive to them

 Can handle both linear and nonlinear relationships.

Disadvantages of Regression

 Assumes linearity

 Sensitive to multicollinearity

 May not be suitable for highly complex relationships

Conclusion

Regression, a vital facet of supervised machine learning, navigates the realm of continuous
predictions. Its diverse algorithms, from linear to ensemble methods, cater to a spectrum
of real-world applications, underscoring its significance in data-driven decision-making.

Classification

Classification vs Regression in Machine Learning

Classification and Regression are two major prediction problems that are usually dealt with in Data
Mining and Machine Learning. We are going to deal with both Classification and Regression and
we will also see differences between them in this article.

Classification Algorithms

Classification is the process of finding or discovering a model or function that helps in separating
the data into multiple categorical classes, i.e., discrete values. In classification, data is categorized
under different labels according to some parameters given in the input, and then the labels are
predicted for the data.

 In a classification task, we are supposed to predict discrete target variables (class labels) using
independent features.

 In the classification task, we are supposed to find a decision boundary that can separate the
different classes in the target variable.

The derived mapping function could be demonstrated in the form of “IF-THEN” rules. The
classification process deals with problems where the data can be divided into binary or multiple
discrete labels. Let’s take an example, suppose we want to predict the possibility of the winning
of a match by Team A on the basis of some parameters recorded earlier. Then there would be two
labels Yes and No.
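For illustration only, the sketch below trains a simple linear classifier on synthetic data to predict a discrete Yes/No style label; the dataset and model choice are assumptions for illustration, not part of the lecture example.

# A minimal classification sketch: a logistic regression learns a decision boundary
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # labels: 0 = "No", 1 = "Yes"
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Predicted labels:", clf.predict(X_test[:5]))
print("Test accuracy:", clf.score(X_test, y_test))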

3.3 Unsupervised Learning

Unsupervised learning is a branch of machine learning that deals with unlabeled data. Unlike
supervised learning, where the data is labeled with a specific category or outcome, unsupervised
learning algorithms are tasked with finding patterns and relationships within the data without any
prior knowledge of the data’s meaning. This makes unsupervised learning a powerful tool for
exploratory data analysis, where the goal is to understand the underlying structure of the data.

In artificial intelligence, machine learning that takes place in the absence of human
supervision is known as unsupervised machine learning. Unsupervised machine learning
models, in contrast to supervised learning, are given unlabeled data and are allowed to discover
patterns and insights on their own, without explicit direction or instruction.

Unsupervised machine learning analyzes and clusters unlabeled datasets using machine
learning algorithms. These algorithms find hidden patterns and data without any human
intervention, i.e., we don’t give output to our model. The training model has only input
parameter values and discovers the groups or patterns on its own.

How does unsupervised learning work?

Unsupervised learning works by analyzing unlabeled data to identify patterns and


relationships. The data is not labeled with any predefined categories or outcomes, so the
algorithm must find these patterns and relationships on its own. This can be a challenging
task, but it can also be very rewarding, as it can reveal insights into the data that would not
be apparent from a labeled dataset.

The dataset in Figure A is mall data that contains information about the clients that subscribe
to the mall. Once subscribed, they are provided a membership card, and the mall has complete
information about the customer and his/her every purchase. Now, using this data and
unsupervised learning techniques, the mall can easily group clients based on the parameters
we are feeding in.

The input to the unsupervised learning models is as follows:

 Unstructured data: May contain noisy (meaningless) data, missing values, or unknown data

 Unlabeled data: Data only contains a value for input parameters, there is no targeted value
(output). It is easy to collect as compared to the labeled one in the supervised approach.

Unsupervised Learning Algorithms

There are mainly three types of algorithms that are used for unsupervised
datasets:

a) Clustering

b) Association Rule Learning

c) Dimensionality Reduction

Clustering

Clustering in unsupervised machine learning is the process of grouping unlabeled data into clusters
based on their similarities. The goal of clustering is to identify patterns and relationships in the
data without any prior knowledge of the data’s meaning.

Broadly this technique is applied to group data based on different patterns, such as similarities or
differences, our machine model finds. These algorithms are used to process raw, unclassified data
objects into groups. For example, in the above figure, we have not given output parameter values,
so this technique will be used to group clients based on the input parameters provided by our data.

Clustering in Machine Learning

In real world, not every data we work upon has a target variable. This kind of data cannot be
analyzed using supervised learning algorithms. We need the help of unsupervised algorithms. One
of the most popular type of analysis under unsupervised learning is Cluster analysis. When the
goal is to group similar data points in a dataset, then we use cluster analysis. In practical situations,
we can use cluster analysis for customer segmentation for targeted advertisements, or in medical
imaging to find unknown or new infected areas and many more use cases that we will discuss
further in this article.

What is Clustering?

The task of grouping data points based on their similarity with each other is called Clustering or
Cluster Analysis. This method is defined under the branch of Unsupervised Learning, which aims
at gaining insights from unlabeled data points, that is, unlike supervised learning we don’t have a
target variable.

Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It
evaluates the similarity based on a metric like Euclidean distance, Cosine similarity, Manhattan
distance, etc., and then groups the points with the highest similarity score together.

For Example, In the graph given below, we can clearly see that there are 3 circular clusters forming
on the basis of distance.

Now it is not necessary that the clusters formed must be circular in shape. The shape of clusters
can be arbitrary. There are many algorithms that work well with detecting arbitrary shaped
clusters.

For example, In the below given graph we can see that the clusters formed are not circular in shape.

Types of Clustering

Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:

 Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or
it does not. For example, let's say there are 4 data points and we have to cluster them into 2 clusters.
So each data point will either belong to cluster 1 or cluster 2.

Data Points    Clusters
A              C1
B              C2
C              C2
D              C1

 Soft Clustering: In this type of clustering, instead of assigning each data point to a single
cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example,
let's say there are 4 data points and we have to cluster them into 2 clusters. So we will be
evaluating the probability of each data point belonging to both of the clusters. This probability is
calculated for all data points.

Data Points    Probability of C1    Probability of C2
A              0.91                 0.09
B              0.3                  0.7
C              0.17                 0.83
D              1                    0

Uses of Clustering

Now before we begin with types of clustering algorithms, we will go through the use cases of
Clustering algorithms. Clustering algorithms are majorly used for:

 Market Segmentation – Businesses use clustering to group their customers and use targeted
advertisements to attract more audience.

 Market Basket Analysis – Shop owners analyze their sales and figure out which items are
majorly bought together by the customers. For example, In USA, according to a study diapers
and beers were usually bought together by fathers.

 Social Network Analysis – Social media sites use your data to understand your browsing
behaviour and provide you with targeted friend recommendations or content
recommendations.

 Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic images like
X-rays.

 Anomaly Detection – To find outliers in a stream of real-time data or to detect fraudulent
transactions, we can use clustering to identify them.

 Simplify working with large datasets – Each cluster is given a cluster ID after clustering is
complete. Now, you may reduce an entire feature set to its cluster ID. Clustering
is effective when it can represent a complicated case with a straightforward cluster ID. Using
the same principle, clustering data can make complex datasets simpler.

There are many more use cases for clustering, but these are some of the major and common use
cases of clustering. Moving forward we will be discussing clustering algorithms that will help
you perform the above tasks.

Types of Clustering Algorithms

At the surface level, clustering helps in the analysis of unstructured data. Graphing, the shortest
distance, and the density of the data points are a few of the elements that influence cluster
formation. Clustering is the process of determining how related the objects are based on a metric

called the similarity measure. Similarity metrics are easier to locate in smaller sets of features. It
gets harder to create similarity measures as the number of features increases. Depending on the
type of clustering algorithm being utilized in data mining, several techniques are employed to
group the data from the datasets. In this part, the clustering techniques are described. Various types
of clustering algorithms are:

1. Centroid-based Clustering (Partitioning methods)

2. Density-based Clustering (Model-based methods)

3. Connectivity-based Clustering (Hierarchical clustering)

4. Distribution-based Clustering

We will be going through each of these types in brief.

1. Centroid-based Clustering (Partitioning methods)

Partitioning methods are the simplest clustering algorithms. They group data points on the basis
of their closeness. Generally, the similarity measure chosen for these algorithms is Euclidean
distance, Manhattan distance or Minkowski distance. The dataset is separated into a
predetermined number of clusters, and each cluster is referenced by a vector of values. Each
input data point is assigned to the cluster whose reference vector it is closest to.

The primary drawback of these algorithms is the requirement that we establish the number of
clusters, "k," either intuitively or scientifically (using the Elbow Method) before the clustering
machine learning system starts allocating the data points. Despite this, it is still the most popular
type of clustering. K-means and K-medoids clustering are examples of this type of clustering.
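A minimal K-means sketch is shown below; the blob dataset and the choice of k = 3 are illustrative assumptions rather than part of the notes.

# A minimal K-means (centroid-based clustering) sketch on synthetic blob data
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)            # cluster ID for every point

print("Cluster centers:\n", kmeans.cluster_centers_)
print("First ten labels:", labels[:10])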

2. Density-based Clustering (Model-based methods)

Density-based clustering, a model-based method, finds groups based on the density of data points.
Contrary to centroid-based clustering, which requires that the number of clusters be predefined
and is sensitive to initialization, density-based clustering determines the number of clusters
automatically and is less susceptible to beginning positions. They are great at handling clusters of
different sizes and forms, making them ideally suited for datasets with irregularly shaped or

overlapping clusters. These methods manage both dense and sparse data regions by focusing on
local density and can distinguish clusters with a variety of morphologies.

In contrast, centroid-based grouping, like k-means, has trouble finding arbitrary shaped clusters.
Due to its preset number of cluster requirements and extreme sensitivity to the initial positioning
of centroids, the outcomes can vary. Furthermore, the tendency of centroid-based approaches to
produce spherical or convex clusters restricts their capacity to handle complicated or irregularly
shaped clusters. In conclusion, density-based clustering overcomes the drawbacks of centroid-
based techniques by autonomously choosing cluster sizes, being resilient to initialization, and
successfully capturing clusters of various sizes and forms. The most popular density-based
clustering algorithm is DBSCAN.
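The following small sketch (illustrative eps and min_samples values) shows DBSCAN finding two moon-shaped clusters that a centroid-based method would struggle with, and marking noise points with the label -1.

# A small DBSCAN (density-based clustering) sketch on moon-shaped data
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                       # -1 marks points treated as noise

print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points:", list(labels).count(-1))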

3. Connectivity-based Clustering (Hierarchical clustering)

A method for assembling related data points into hierarchical clusters is called hierarchical
clustering. Each data point is initially taken into account as a separate cluster, which is
subsequently combined with the clusters that are the most similar to form one large cluster that
contains all of the data points.

Think about how you may arrange a collection of items based on how similar they are. Each object
begins as its own cluster at the base of the tree when using hierarchical clustering, which creates a
dendrogram, a tree-like structure. The closest pairings of clusters are then combined into larger
clusters after the algorithm examines how similar the objects are to one another. When every object
is in one cluster at the top of the tree, the merging process has finished. Exploring various
granularity levels is one of the fun things about hierarchical clustering. To obtain a given number
of clusters, you can select to cut the dendrogram at a particular height. The more similar two
objects are within a cluster, the closer they are. It’s comparable to classifying items according to
their family trees, where the nearest relatives are clustered together and the wider branches signify
more general connections. There are 2 approaches for Hierarchical clustering:

 Divisive Clustering: It follows a top-down approach; here we consider all data points to
be part of one big cluster, and then this cluster is divided into smaller groups.

 Agglomerative Clustering: It follows a bottom-up approach, here we consider all data
points to be part of individual clusters and then these clusters are clubbed together to make
one big cluster with all data points.
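A brief sketch of agglomerative (bottom-up) clustering is given below, using SciPy's linkage and fcluster utilities to build the dendrogram and then cut it into two clusters; the toy data is an assumption for illustration.

# A brief hierarchical (agglomerative) clustering sketch with SciPy
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

Z = linkage(X, method="ward")                      # merge the closest clusters step by step
labels = fcluster(Z, t=2, criterion="maxclust")    # "cut" the dendrogram into 2 clusters

print(labels)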

4. Distribution-based Clustering

Using distribution-based clustering, data points are generated and organized according to their
propensity to fall into the same probability distribution (such as a Gaussian, binomial, or other)
within the data. The data elements are grouped using a probability-based distribution that is based
on statistical distributions. Included are data objects that have a higher likelihood of being in the
cluster. A data point is less likely to be included in a cluster the further it is from the cluster’s
central point, which exists in every cluster.

A notable drawback of density and boundary-based approaches is the need to specify the clusters
a priori for some algorithms, and primarily the definition of the cluster form for the bulk of
algorithms. There must be at least one tuning or hyper-parameter selected, and while doing so
should be simple, getting it wrong could have unanticipated repercussions. Distribution-based
clustering has a definite advantage over proximity and centroid-based clustering approaches in
terms of flexibility, accuracy, and cluster structure. The key issue is that, in order to
avoid overfitting, many clustering methods only work with simulated or manufactured data, or
when the bulk of the data points certainly belong to a preset distribution. The most popular
distribution-based clustering algorithm is Gaussian Mixture Model.
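As a short illustration, the sketch below fits a Gaussian Mixture Model with scikit-learn; note that predict_proba returns the soft-clustering probabilities discussed earlier. The data and the number of components are illustrative choices.

# A minimal Gaussian Mixture Model (distribution-based clustering) sketch
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

gmm = GaussianMixture(n_components=3, random_state=7).fit(X)
hard_labels = gmm.predict(X)              # most likely component for each point
soft_probs = gmm.predict_proba(X[:3])     # probability of belonging to each component

print(hard_labels[:10])
print(soft_probs.round(2))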

Applications of Clustering in different fields:

1. Marketing: It can be used to characterize & discover customer segments for marketing
purposes.

2. Biology: It can be used for classification among different species of plants and animals.

3. Libraries: It is used in clustering different books on the basis of topics and information.

4. Insurance: It is used to understand customers and their policies and to identify fraud.

5. City Planning: It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.

6. Earthquake studies: By learning the earthquake-affected areas we can determine the
dangerous zones.

7. Image Processing: Clustering can be used to group similar images together, classify images
based on content, and identify patterns in image data.

8. Genetics: Clustering is used to group genes that have similar expression patterns and identify
gene networks that work together in biological processes.

9. Finance: Clustering is used to identify market segments based on customer behavior, identify
patterns in stock market data, and analyze risk in investment portfolios.

10. Customer Service: Clustering is used to group customer inquiries and complaints into
categories, identify common issues, and develop targeted solutions.

11. Manufacturing: Clustering is used to group similar products together, optimize production
processes, and identify defects in manufacturing processes.

12. Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases,
which helps in making accurate diagnoses and identifying effective treatments.

13. Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial
transactions, which can help in detecting fraud or other financial crimes.

14. Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak
hours, routes, and speeds, which can help in improving transportation planning and
infrastructure.

15. Social network analysis: Clustering is used to identify communities or groups within social
networks, which can help in understanding social behavior, influence, and trends.

16. Cybersecurity: Clustering is used to group similar patterns of network traffic or system
behavior, which can help in detecting and preventing cyberattacks.

17. Climate analysis: Clustering is used to group similar patterns of climate data, such as
temperature, precipitation, and wind, which can help in understanding climate change and its
impact on the environment.

18. Sports analysis: Clustering is used to group similar patterns of player or team performance
data, which can help in analyzing player or team strengths and weaknesses and making
strategic decisions.

19. Crime analysis: Clustering is used to group similar patterns of crime data, such as location,
time, and type, which can help in identifying crime hotspots, predicting future crime trends,
and improving crime prevention strategies.

Association Rule Learning

Association rule learning, also known as association rule mining, is a common technique used to
discover associations in unsupervised machine learning. This rule-based ML
technique finds very useful relations between the parameters of a large data set. This
technique is basically used for market basket analysis, which helps to better understand the
relationship between different products. For example, shopping stores use algorithms based on this
technique to find out the relationship between the sales of one product and the sales of another based
on customer behavior. For instance, if a customer buys milk, then he may also buy bread, eggs, or butter.
Once trained well, such models can be used to increase sales by planning different offers.

Association rule mining finds interesting associations and relationships among large sets of data
items. This rule shows how frequently an itemset occurs in a transaction. A typical example is
Market Basket Analysis. Market Basket Analysis is one of the key techniques used by large retailers
to show associations between items. It allows retailers to identify relationships between the items
that people buy together frequently. Given a set of transactions, we can find rules that will predict
the occurrence of an item based on the occurrences of other items in the transaction.

TID    Items
1      Bread, Milk
2      Bread, Diaper, Beer, Eggs
3      Milk, Diaper, Beer, Coke
4      Bread, Milk, Diaper, Beer
5      Bread, Milk, Diaper, Coke

Before we start defining the rule, let us first see the basic definitions.

Support Count (σ) – Frequency of occurrence of an itemset.

Here, σ({Milk, Bread, Diaper}) = 2

Frequent Itemset – An itemset whose support is greater than or equal to the minsup
threshold.

Association Rule – An implication expression of the form X -> Y, where X and Y are
any two itemsets.

Example: {Milk, Diaper} -> {Beer}

Rule Evaluation Metrics –

 Support(s) – The number of transactions that include items in both the {X} and {Y} parts of
the rule, as a percentage of the total number of transactions. It is a measure of how frequently
the collection of items occurs together as a percentage of all transactions.

Support(X -> Y) = σ(X ∪ Y) / |T| – It is interpreted as the fraction of transactions that contain both
X and Y.

 Confidence(c) – It is the ratio of the number of transactions that include all items in both {X}
and {Y} to the number of transactions that include all items in {X}.

Conf(X -> Y) = Supp(X ∪ Y) / Supp(X) – It measures how often the items in Y appear
in transactions that also contain the items in X.

 Lift(l) – The lift of the rule X -> Y is the confidence of the rule divided by the expected
confidence, assuming that the itemsets X and Y are independent of each other. The expected
confidence is simply the support of {Y}.

Lift(X -> Y) = Conf(X -> Y) / Supp(Y) – A lift value near 1 indicates that X and Y appear together
about as often as expected, greater than 1 means they appear together more than expected,
and less than 1 means they appear together less than expected. Greater lift values indicate stronger
association.

Example – From the above table, for the rule {Milk, Diaper} -> {Beer}:

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

l = Supp({Milk, Diaper, Beer}) / (Supp({Milk, Diaper}) × Supp({Beer})) = 0.4 / (0.6 × 0.6) ≈ 1.11
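The plain-Python sketch below re-computes these three numbers for the rule {Milk, Diaper} -> {Beer} directly from the five example transactions, so the arithmetic above can be checked.

# Checking support, confidence and lift for {Milk, Diaper} -> {Beer}
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support(X | Y)                        # 2/5 = 0.4
c = support(X | Y) / support(X)           # 2/3 ≈ 0.67
lift = c / support(Y)                     # 0.67 / 0.6 ≈ 1.11

print(f"support={s:.2f}  confidence={c:.2f}  lift={lift:.2f}")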

Association rules are very useful for analyzing datasets. The data is collected using bar-code
scanners in supermarkets. Such databases consist of a large number of transaction records which
list all items bought by a customer in a single purchase. The manager can then see whether certain
groups of items are consistently purchased together and use this data for adjusting store layouts,
cross-selling, and promotions based on these statistics. Popular algorithms for mining association
rules include:

 Apriori Algorithm: A Classic Method for Rule Induction

 FP-Growth Algorithm: An Efficient Alternative to Apriori

 Eclat Algorithm: A Depth-First, Vertical-Format Approach for Efficient Rule Mining

 Efficient Tree-based Algorithms: Handling Large Datasets with Scalability

Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of features in a dataset while
preserving as much information as possible. This technique is useful for improving the
performance of machine learning algorithms and for data visualization. Examples of
dimensionality reduction algorithms include:

 Principal Component Analysis (PCA): Linear Transformation for Reduced Dimensions

 Linear Discriminant Analysis (LDA): Dimensionality Reduction for Discrimination

 Non-negative Matrix Factorization (NMF): Decomposing Data into Non-negative


Components

 Locally Linear Embedding (LLE): Preserving Local Geometry in Reduced Dimensions

 Isomap: Capturing Global Relationships in Reduced Dimensions
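As a small example of the first technique, the sketch below uses scikit-learn's PCA to project the four-dimensional Iris data onto two principal components; the dataset choice is illustrative.

# A short PCA sketch: projecting 4-dimensional data onto 2 principal components
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples, 4 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)               # reduced to 2 dimensions

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_2d.shape)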

Challenges of Unsupervised Learning

Here are the key challenges of unsupervised learning

 Evaluation: Assessing the performance of unsupervised learning algorithms is difficult


without predefined labels or categories.

 Interpretability: Understanding the decision-making process of unsupervised learning


models is often challenging.

 Overfitting: Unsupervised learning algorithms can overfit to the specific dataset used for
training, limiting their ability to generalize to new data.

 Data quality: Unsupervised learning algorithms are sensitive to the quality of the input
data. Noisy or incomplete data can lead to misleading or inaccurate results.

 Computational complexity: Some unsupervised learning algorithms, particularly those


dealing with high-dimensional data or large datasets, can be computationally expensive.

Advantages of Unsupervised learning

 No labeled data required: Unlike supervised learning, unsupervised learning does not
require labeled data, which can be expensive and time-consuming to collect.

 Can uncover hidden patterns: Unsupervised learning algorithms can identify patterns
and relationships in data that may not be obvious to humans.

 Can be used for a variety of tasks: Unsupervised learning can be used for a variety of
tasks, such as clustering, dimensionality reduction, and anomaly detection.

 Can be used to explore new data: Unsupervised learning can be used to explore new data
and gain insights that may not be possible with other methods.

Disadvantages of Unsupervised learning

 Difficult to evaluate: It can be difficult to evaluate the performance of unsupervised


learning algorithms, as there are no predefined labels or categories against which to
compare results.

 Can be difficult to interpret: It can be difficult to understand the decision-making process


of unsupervised learning models.

 Can be sensitive to the quality of the data: Unsupervised learning algorithms can be
sensitive to the quality of the input data. Noisy or incomplete data can lead to misleading
or inaccurate results.

 Can be computationally expensive: Some unsupervised learning algorithms, particularly


those dealing with high-dimensional data or large datasets, can be computationally
expensive

Applications of Unsupervised learning

 Customer segmentation: Unsupervised learning can be used to segment customers into


groups based on their demographics, behavior, or preferences. This can help businesses to
better understand their customers and target them with more relevant marketing campaigns.

 Fraud detection: Unsupervised learning can be used to detect fraud in financial data by
identifying transactions that deviate from the expected patterns. This can help to prevent
fraud by flagging these transactions for further investigation.

 Recommendation systems: Unsupervised learning can be used to recommend items to


users based on their past behavior or preferences. For example, a recommendation system
might use unsupervised learning to identify users who have similar taste in movies, and
then recommend movies that those users have enjoyed.

 Natural language processing (NLP): Unsupervised learning is used in a variety of NLP


tasks, including topic modeling, document clustering, and part-of-speech tagging.

 Image analysis: Unsupervised learning is used in a variety of image analysis


tasks, including image segmentation, object detection, and image pattern recognition.

Semi-Supervised Learning in ML

Today’s Machine Learning algorithms can be broadly classified into three categories: Supervised
Learning, Unsupervised Learning, and Reinforcement Learning. Casting Reinforcement Learning
aside, the primary two categories of Machine Learning problems are Supervised and Unsupervised
Learning. The basic difference between the two is that Supervised Learning datasets have an output
label associated with each tuple while Unsupervised Learning datasets do not.

What is Semi-Supervised Learning?

Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labeled data and a large amount
of unlabeled data to train a model. The goal of semi-supervised learning is to learn a function that
can accurately predict the output variable based on the input variables, similar to supervised

learning. However, unlike supervised learning, the algorithm is trained on a dataset that contains
both labeled and unlabeled data.

Semi-supervised learning is particularly useful when there is a large amount of unlabeled data
available, but it’s too expensive or difficult to label all of it.

Semi-Supervised Learning Flow Chart

Intuitively, one may imagine the three types of learning algorithms as Supervised learning where
a student is under the supervision of a teacher at both home and school, Unsupervised learning
where a student has to figure out a concept himself and Semi-Supervised learning where a teacher
teaches a few concepts in class and gives questions as homework which are based on similar
concepts.

Examples of Semi-Supervised Learning

 Text classification: In text classification, the goal is to classify a given text into one or
more predefined categories. Semi-supervised learning can be used to train a text
classification model using a small amount of labeled data and a large amount of unlabeled
text data.

 Image classification: In image classification, the goal is to classify a given image into one
or more predefined categories. Semi-supervised learning can be used to train an image
classification model using a small amount of labeled data and a large amount of unlabeled
image data.

 Anomaly detection: In anomaly detection, the goal is to detect patterns or observations


that are unusual or different from the norm

Assumptions followed by Semi-Supervised Learning

A Semi-Supervised algorithm assumes the following about the data

1. Continuity Assumption: The algorithm assumes that the points which are closer to each
other are more likely to have the same output label.

2. Cluster Assumption: The data can be divided into discrete clusters and points in the same
cluster are more likely to share an output label.

3. Manifold Assumption: The data lie approximately on a manifold of a much lower


dimension than the input space. This assumption allows the use of distances and densities
which are defined on a manifold.

Applications of Semi-Supervised Learning

1. Speech Analysis: Since labeling audio files is a very intensive task, Semi-Supervised
learning is a very natural approach to solve this problem.

2. Internet Content Classification: Labeling each webpage is an impractical and unfeasible
process, and thus Semi-Supervised learning algorithms are used. Even the Google search
algorithm uses a variant of Semi-Supervised learning to rank the relevance of a webpage
for a given query.

3. Protein Sequence Classification: Since DNA strands are typically very large in size, the
rise of Semi-Supervised learning has been imminent in this field.

Disadvantages of Semi-Supervised Learning

The most basic disadvantage of any Supervised Learning algorithm is that the dataset has to be
hand-labeled either by a Machine Learning Engineer or a Data Scientist. This is a very costly
process, especially when dealing with large volumes of data. The most basic disadvantage of
any Unsupervised Learning is that its application spectrum is limited.

To counter these disadvantages, the concept of Semi-Supervised Learning was introduced. In


this type of learning, the algorithm is trained upon a combination of labeled and unlabelled data.
Typically, this combination will contain a very small amount of labeled data and a very large
amount of unlabelled data. The basic procedure involved is that first, the programmer will cluster
similar data using an unsupervised learning algorithm and then use the existing labeled data to
label the rest of the unlabelled data. The typical use cases of such type of algorithm have a common
property among them – The acquisition of unlabelled data is relatively cheap while labeling the
said data is very expensive.
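A hedged sketch of this idea is shown below using scikit-learn's SelfTrainingClassifier: most labels in a synthetic dataset are hidden (marked -1), and a base classifier iteratively pseudo-labels the unlabeled points. The dataset and the 90% unlabeled fraction are assumptions for illustration.

# A semi-supervised self-training sketch: -1 marks unlabeled samples
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9      # hide roughly 90% of the labels
y_partial[unlabeled] = -1                 # -1 means "unlabeled" for the estimator

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)                   # trains on labeled + pseudo-labeled points

print("Accuracy on all (true) labels:", model.score(X, y))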

3. Reinforcement Learning

Reinforcement learning is an area of Machine Learning. It is about taking suitable action to


maximize reward in a particular situation. It is employed by various software and machines to find
the best possible behavior or path it should take in a specific situation. Reinforcement learning
differs from supervised learning in a way that in supervised learning the training data has the
answer key with it so the model is trained with the correct answer itself whereas in reinforcement
learning, there is no answer but the reinforcement agent decides what to do to perform the given
task. In the absence of a training dataset, it is bound to learn from its experience.

Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal
behavior in an environment to obtain maximum reward. In RL, the data is accumulated from
machine learning systems that use a trial-and-error method. Data is not part of the input that we
would find in supervised or unsupervised machine learning.

Reinforcement learning uses algorithms that learn from outcomes and decide which action to take
next. After each action, the algorithm receives feedback that helps it determine whether the choice
it made was correct, neutral or incorrect. It is a good technique to use for automated systems that
have to make a lot of small decisions without human guidance.

Reinforcement learning is an autonomous, self-teaching system that essentially learns by trial and
error. It performs actions with the aim of maximizing rewards, or in other words, it is learning by
doing in order to achieve the best outcomes.

Example:

The problem is as follows: We have an agent and a reward, with many hurdles in between. The
agent is supposed to find the best possible path to reach the reward. The following problem
explains the problem more easily.

The above image shows the robot, the diamond, and the fire. The goal of the robot is to get the reward,
which is the diamond, and to avoid the hurdle, which is the fire. The robot learns by trying all the
possible paths and then choosing the path which gives it the reward with the fewest hurdles.
Each right step will give the robot a reward and each wrong step will subtract from the reward of the
robot. The total reward will be calculated when it reaches the final reward, that is, the diamond.

Main points in Reinforcement learning

 Input: The input should be an initial state from which the model will start

 Output: There are many possible outputs as there are a variety of solutions to a particular
problem

 Training: The training is based upon the input. The model will return a state, and the user
will decide to reward or punish the model based on its output.

 The model continues to learn.

 The best solution is decided based on the maximum reward.

Difference between Reinforcement learning and Supervised learning:

Reinforcement learning:

 Reinforcement learning is all about making decisions sequentially. In simple words, we can
say that the output depends on the state of the current input, and the next input depends on
the output of the previous input.

 In Reinforcement learning the decisions are dependent, so we give labels to sequences of
dependent decisions.

 Example: Chess game, text summarization.

Supervised learning:

 In Supervised learning, the decision is made on the initial input or the input given at the
start.

 In Supervised learning the decisions are independent of each other, so labels are given to
each decision.

 Example: Object recognition, spam detection.

Types of Reinforcement:

There are two types of Reinforcement:

1. Positive: Positive Reinforcement is defined as when an event that occurs due to a particular
behavior increases the strength and the frequency of that behavior. In other words, it has a
positive effect on behavior.

Advantages of positive reinforcement:

 Maximizes performance

 Sustains change for a long period of time

A drawback is that too much reinforcement can lead to an overload of states, which can diminish
the results.

2. Negative: Negative Reinforcement is defined as the strengthening of behavior because a
negative condition is stopped or avoided.

Advantages of negative reinforcement:

 Increases behavior

 Provides a minimum standard of performance

A drawback is that it only provides enough to meet the minimum behavior.

Elements of Reinforcement Learning

Reinforcement learning elements are as follows:

1. Policy

2. Reward function

3. Value function

4. Model of the environment

Policy: A policy defines the learning agent's way of behaving at a given time. It is a mapping from
perceived states of the environment to actions to be taken when in those states.

Reward function: The reward function is used to define the goal in a reinforcement learning problem.
It is a function that provides a numerical score based on the state of the environment.

Value function: Value functions specify what is good in the long run. The value of a state is the
total amount of reward an agent can expect to accumulate over the future, starting from that state.

Model of the environment: A model mimics the behavior of the environment and is used for planning,
i.e., predicting the next state and reward before an action is actually taken.

Credit assignment problem: Reinforcement learning algorithms learn to generate an
internal value for the intermediate states reflecting how good they are at leading to the goal. The
learning decision maker is called the agent. The agent interacts with the environment, which includes
everything outside the agent.

The agent has sensors to decide on its state in the environment and takes an action that modifies its
state.

The reinforcement learning problem is modeled as an agent continuously interacting with an
environment. The agent and the environment interact in a sequence of time steps. At each time
step t, the agent receives the state of the environment and a scalar numerical reward for the previous
action, and then selects an action.

Reinforcement learning is a technique for solving Markov decision problems.

Reinforcement learning uses a formal framework defining the interaction between a learning
agent and its environment in terms of states, actions, and rewards. This framework is intended to
be a simple way of representing essential features of the artificial intelligence problem.
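To make the agent-environment loop concrete, the sketch below implements tabular Q-learning for a toy one-dimensional corridor in which the reward (the "diamond") sits in the last state; the environment, learning rate, discount factor and exploration rate are all illustrative assumptions, not values from the notes.

# A minimal tabular Q-learning sketch for a toy 5-state corridor world
import numpy as np

n_states, n_actions = 5, 2                 # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))        # Q-table: expected return per (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(300):
    s = int(rng.integers(0, 4))            # start in a random non-goal state
    for _ in range(100):                   # cap the episode length
        # epsilon-greedy action selection (trial and error)
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0    # reward only when the goal state is reached
        # Q-learning update rule
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
        if s == 4:                         # episode ends at the goal
            break

print(Q.round(2))                          # "move right" should end up valued higher in every state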

Various Practical Applications of Reinforcement Learning –

 RL can be used in robotics for industrial automation.

 RL can be used in machine learning and data processing

 RL can be used to create training systems that provide custom instruction and materials
according to the requirement of students.

Application of Reinforcement Learnings

1. Robotics: Robots with pre-programmed behavior are useful in structured environments, such as
the assembly line of an automobile manufacturing plant, where the task is repetitive in nature.

2. A master chess player makes a move. The choice is informed both by planning, anticipating
possible replies and counter replies.

3. An adaptive controller adjusts parameters of a petroleum refinery’s operation in real time.

RL can be used in large environments in the following situations:

1. A model of the environment is known, but an analytic solution is not available;

2. Only a simulation model of the environment is given (the subject of simulation-based


optimization)

3. The only way to collect information about the environment is to interact with it.

Advantages and Disadvantages of Reinforcement Learning

Advantages of Reinforcement learning

1. Reinforcement learning can be used to solve very complex problems that cannot be solved by
conventional techniques.

2. The model can correct the errors that occurred during the training process.

3. In RL, training data is obtained via the direct interaction of the agent with the environment

4. Reinforcement learning can handle environments that are non-deterministic, meaning that the
outcomes of actions are not always predictable. This is useful in real-world applications where the
environment may change over time or is uncertain.

5. Reinforcement learning can be used to solve a wide range of problems, including those that
involve decision making, control, and optimization.

6. Reinforcement learning is a flexible approach that can be combined with other machine learning
techniques, such as deep learning, to improve performance.

Disadvantages of Reinforcement learning

1. Reinforcement learning is not preferable to use for solving simple problems.

2. Reinforcement learning needs a lot of data and a lot of computation

3. Reinforcement learning is highly dependent on the quality of the reward function. If the reward
function is poorly designed, the agent may not learn the desired behavior.

4. Reinforcement learning can be difficult to debug and interpret. It is not always clear why the
agent is behaving in a certain way, which can make it difficult to diagnose and fix problems.

3.4 Neural Networks

Neural Networks are computational models that mimic the complex functions of the human brain.
The neural networks consist of interconnected nodes or neurons that process and learn from data,
enabling tasks such as pattern recognition and decision making in machine learning. The article
explores more about neural networks, their working, architecture and more.

 Evolution of Neural Networks

 What are Neural Networks?

 How does Neural Networks work?

 Learning of a Neural Network

 Types of Neural Networks

Evolution of Neural Networks

Since the 1940s, there have been a number of noteworthy advancements in the field of neural
networks:

 1940s-1950s: Early Concepts
Neural networks began with the introduction of the first mathematical model of artificial
neurons by McCulloch and Pitts. But computational constraints made progress difficult.

 1960s-1970s: Perceptrons
This era is defined by the work of Rosenblatt on perceptrons. Perceptrons are single-layer
networks whose applicability was limited to problems that are linearly
separable.

 1980s: Backpropagation and Connectionism


Multi-layer network training was made possible by Rumelhart, Hinton, and Williams’
invention of the backpropagation method. With its emphasis on learning through
interconnected nodes, connectionism gained appeal.

 1990s: Boom and Winter


With applications in image identification, finance, and other fields, neural networks saw a
boom. Neural network research did, however, experience a “winter” due to exorbitant
computational costs and inflated expectations.

 2000s: Resurgence and Deep Learning


Larger datasets, innovative structures, and enhanced processing capability spurred a
comeback. Deep learning has shown amazing effectiveness in a number of disciplines by
utilizing numerous layers.

 2010s-Present: Deep Learning Dominance


Convolutional neural networks (CNNs) and recurrent neural networks (RNNs), two deep
learning architectures, dominated machine learning. Their power was demonstrated by
innovations in gaming, picture recognition, and natural language processing.

What are Neural Networks?

Neural networks extract identifying features from data, lacking pre-programmed understanding.
Network components include neurons, connections, weights, biases, propagation functions, and a
learning rule. Neurons receive inputs, governed by thresholds and activation functions.

Connections involve weights and biases regulating information transfer. Learning, adjusting
weights and biases, occurs in three stages: input computation, output generation, and iterative
refinement enhancing the network’s proficiency in diverse tasks.

These include:

1. The neural network is simulated by a new environment.

2. Then the free parameters of the neural network are changed as a result of this simulation.

3. The neural network then responds in a new way to the environment because of the changes
in its free parameters.

Importance of Neural Networks

The ability of neural networks to identify patterns, solve intricate puzzles, and adjust to changing
surroundings is essential. Their capacity to learn from data has far-reaching effects, ranging from
revolutionizing technology like natural language processing and self-driving automobiles to
automating decision-making processes and increasing efficiency in numerous industries. The
development of artificial intelligence is largely dependent on neural networks, which also drive
innovation and influence the direction of technology.

How does Neural Networks work?

Let’s understand with an example of how a neural network works:

Consider a neural network for email classification. The input layer takes features like email
content, sender information, and subject. These inputs, multiplied by adjusted weights, pass
through hidden layers. The network, through training, learns to recognize patterns indicating
whether an email is spam or not. The output layer, with a binary activation function, predicts
whether the email is spam (1) or not (0). As the network iteratively refines its weights through
backpropagation, it becomes adept at distinguishing between spam and legitimate emails,
showcasing the practicality of neural networks in real-world applications like email filtering.

Working of a Neural Network

Neural networks are complex systems that mimic some features of the functioning of the human
brain. It is composed of an input layer, one or more hidden layers, and an output layer made up of
layers of artificial neurons that are coupled. The two stages of the basic process are called
backpropagation and forward propagation.

Forward Propagation

 Input Layer: Each feature in the input layer is represented by a node on the network,
which receives input data.

 Weights and Connections: The weight of each neuronal connection indicates how strong
the connection is. Throughout training, these weights are changed.

 Hidden Layers: Each hidden layer neuron processes inputs by multiplying them by
weights, adding them up, and then passing them through an activation function. By doing
this, non-linearity is introduced, enabling the network to recognize intricate patterns.

 Output: The final result is produced by repeating the process until the output layer is
reached.

Backpropagation

 Loss Calculation: The network’s output is evaluated against the real goal values, and a
loss function is used to compute the difference. For a regression problem, the Mean
Squared Error (MSE) is commonly used as the cost function.

Loss Function (MSE): MSE = (1/n) Σ (y_i − ŷ_i)², where y_i is the actual value, ŷ_i is the
predicted value, and n is the number of samples.

 Gradient Descent: Gradient descent is then used by the network to reduce the loss. To
lower the inaccuracy, weights are changed based on the derivative of the loss with respect
to each weight.

 Adjusting weights: The weights are adjusted at each connection by applying this iterative
process, or backpropagation, backward across the network.

 Training: During training with different data samples, the entire process of forward
propagation, loss calculation, and backpropagation is done iteratively, enabling the
network to adapt and learn patterns from the data.

 Activation Functions: Model non-linearity is introduced by activation functions like


the rectified linear unit (ReLU) or sigmoid. Their decision on whether to “fire” a neuron is
based on the whole weighted input.
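The compact NumPy sketch below ties forward propagation and backpropagation together for a tiny 2-4-1 network trained on the XOR problem with a sigmoid activation and MSE loss; the network size, learning rate and epoch count are illustrative choices, not values from the notes.

# A compact forward-propagation / backpropagation sketch on XOR
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input layer  -> hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden layer -> output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0                                             # learning rate for gradient descent

for epoch in range(10000):
    # Forward propagation: weighted sums passed through the activation function
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    loss = np.mean((out - y) ** 2)                   # MSE loss

    # Backpropagation: error signals for the output and hidden layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent: adjust weights and biases to reduce the loss
    W2 -= lr * (h.T @ d_out) / len(X)
    b2 -= lr * d_out.mean(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h) / len(X)
    b1 -= lr * d_h.mean(axis=0, keepdims=True)

print("Final loss:", round(float(loss), 4))
print("Predictions:", out.round(2).ravel())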

Learning of a Neural Network

1. Learning with supervised learning

In supervised learning, the neural network is guided by a teacher who has access to both input-
output pairs. The network creates outputs based on inputs without taking into account the
surroundings. By comparing these outputs to the teacher-known desired outputs, an error signal is
generated. In order to reduce errors, the network’s parameters are changed iteratively and stop
when performance is at an acceptable level.

2. Learning with Unsupervised learning

Equivalent output variables are absent in unsupervised learning. Its main goal is to comprehend
incoming data’s (X) underlying structure. No instructor is present to offer advice. Modeling data
patterns and relationships is the intended outcome instead. Words like regression and classification
are related to supervised learning, whereas unsupervised learning is associated with clustering and
association.

3. Learning with Reinforcement Learning

Through interaction with the environment and feedback in the form of rewards or penalties, the
network gains knowledge. Finding a policy or strategy that optimizes cumulative rewards over
time is the goal for the network. This kind is frequently utilized in gaming and decision-making
applications.

Types of Neural Networks

Several types of neural networks are commonly used, including the following.

 Feedforward Networks: A feedforward neural network is a simple artificial neural
network architecture in which data moves from input to output in a single direction. It has
input, hidden, and output layers; feedback loops are absent. Its straightforward architecture
makes it appropriate for a number of applications, such as regression and pattern recognition.

 Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three or
more layers, including an input layer, one or more hidden layers, and an output layer. It uses
nonlinear activation functions.

 Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a


specialized artificial neural network designed for image processing. It employs convolutional
layers to automatically learn hierarchical features from input images, enabling effective
image recognition and classification. CNNs have revolutionized computer vision and are
pivotal in tasks like object detection and image analysis.

 Recurrent Neural Network (RNN): An artificial neural network type intended for
sequential data processing is called a Recurrent Neural Network (RNN). It is appropriate for
applications where contextual dependencies are critical, such as time series prediction and
natural language processing, since it makes use of feedback loops, which enable information
to survive within the network.

 Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome
the vanishing gradient problem in training RNNs. It uses memory cells and gates to
selectively read, write, and erase information.

Advantages of Neural Networks

Neural networks are widely used in many different applications because of their many benefits:

 Adaptability: Neural networks are useful for activities where the link between inputs and
outputs is complex or not well defined because they can adapt to new situations and learn
from data.

 Pattern Recognition: Their proficiency in pattern recognition renders them efficacious in


tasks like as audio and image identification, natural language processing, and other intricate
data patterns.

 Parallel Processing: Because neural networks are capable of parallel processing by nature,
they can process numerous jobs at once, which speeds up and improves the efficiency of
computations.

 Non-Linearity: Neural networks are able to model and comprehend complicated


relationships in data by virtue of the non-linear activation functions found in neurons,
which overcome the drawbacks of linear models.

Disadvantages of Neural Networks

Neural networks, while powerful, are not without drawbacks and difficulties:

 Computational Intensity: Large neural network training can be a laborious and


computationally demanding process that demands a lot of computing power.

 Black box Nature: As “black box” models, neural networks pose a problem in important
applications since it is difficult to understand how they make decisions.

 Overfitting: Overfitting is a phenomenon in which neural networks commit training


material to memory rather than identifying patterns in the data. Although regularization
approaches help to alleviate this, the problem still exists.

 Need for Large datasets: For efficient training, neural networks frequently need sizable,
labeled datasets; otherwise, their performance may suffer from incomplete or skewed data.

3.5 Support Vector Machine (SVM) Algorithm

Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or
nonlinear classification, regression, and even outlier detection tasks. SVMs can be used for a
variety of tasks, such as text classification, image classification, spam detection, handwriting
identification, gene expression analysis, face detection, and anomaly detection. SVMs are
adaptable and efficient in a variety of applications because they can manage high-dimensional data
and nonlinear relationships.

SVM algorithms are effective because they seek the maximum-margin hyperplane that separates the different classes of the target feature.

Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression; although it can handle regression problems, it is best suited to classification. The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional feature space that separates the data points of the different classes. The hyperplane is chosen so that the margin between the closest points of different classes is as large as possible. The dimension of the hyperplane depends on the number of features: if there are two input features the hyperplane is just a line, and if there are three input features it becomes a 2-D plane. It becomes difficult to visualize when the number of features exceeds three.

Let’s consider two independent variables x1, x2, and one dependent variable which is either a blue
circle or a red circle.

Linearly Separable Data points

From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line
because we are considering only two input features x1, x2) that segregate our data points or do a
classification between red and blue circles. So how do we choose the best line or in general the
best hyperplane that segregates our data points?

How does SVM work?

One reasonable choice as the best hyperplane is the one that represents the largest separation or
margin between the two classes.

Multiple hyperplanes separate the data from two classes

So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane/hard
margin. So from the above figure, we choose L2. Let’s consider a scenario like shown below

Selecting hyperplane for data with outlier

Here one blue ball lies within the boundary of the red balls. So how does SVM classify the data? The blue ball among the red ones is treated as an outlier of the blue class. The SVM algorithm can ignore such outliers and still find the hyperplane that maximizes the margin; in this sense SVM is robust to outliers.

Hyperplane which is the most optimized one

So for this type of data, SVM finds the maximum margin as it did with the previous datasets, but in addition it adds a penalty each time a point crosses the margin. The margins in such cases are called soft margins. When the data set requires a soft margin, the SVM tries to minimize (1/margin) + λ·(∑penalty). Hinge loss is a commonly used penalty: if there is no violation there is no hinge loss, and if there is a violation the hinge loss is proportional to the distance of the violation.

Till now, we were talking about linearly separable data(the group of blue balls and red balls are
separable by a straight line/linear line). What to do if data are not linearly separable?

Original 1D dataset for classification

Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel. For a point xi on the line, we create a new variable yi as a function of its distance from the origin o. If we plot this, we get something like the figure shown below.

Mapping 1D data to 2D to become able to separate the two classes

In this case, the new variable y is created as a function of distance from the origin. A non-linear
function that creates a new variable is referred to as a kernel.
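
As a minimal sketch of this kernel idea (illustrative values only, using NumPy), a 1-D dataset that is not linearly separable becomes separable once a new feature based on the distance from the origin is added:

import numpy as np

# 1-D points: the two classes are interleaved along the line (illustrative data)
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
labels = np.array([1, 1, 0, 0, 0, 1, 1])

# Kernel-style mapping: add a second coordinate equal to the squared distance from the origin
features_2d = np.column_stack([x, x ** 2])
print(features_2d)

# In the new (x, x^2) space the two classes can be separated by the horizontal line x^2 = 4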

Support Vector Machine Terminology

1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data points
of different classes in a feature space. In the case of linear classifications, it will be a linear
equation i.e. wx+b = 0.

2. Support Vectors: Support vectors are the data points closest to the hyperplane; they play a critical role in deciding the hyperplane and the margin.

3. Margin: Margin is the distance between the support vector and hyperplane. The main
objective of the support vector machine algorithm is to maximize the margin. The wider
margin indicates better classification performance.

4. Kernel: Kernel is the mathematical function, which is used in SVM to map the original
input data points into high-dimensional feature spaces, so, that the hyperplane can be easily
found out even if the data points are not linearly separable in the original input space. Some
of the common kernel functions are linear, polynomial, radial basis function(RBF), and
sigmoid.

5. Hard Margin: The maximum-margin hyperplane, or hard margin hyperplane, is a hyperplane that properly separates the data points of different categories without any misclassifications.

6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits
a soft margin technique. Each data point has a slack variable introduced by the soft-margin
SVM formulation, which softens the strict margin requirement and permits certain
misclassifications or violations. It discovers a compromise between increasing the margin
and reducing violations.

7. C: The regularisation parameter C in SVM balances margin maximisation against misclassification penalties. It determines the penalty for violating the margin or misclassifying a data point. A larger value of C imposes a stricter penalty, resulting in a smaller margin and possibly fewer misclassifications.

8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect
classifications or margin violations. The objective function in SVM is frequently formed
by combining it with the regularization term.

9. Dual Problem: SVM can be solved through the dual of its optimization problem, which involves finding the Lagrange multipliers associated with the support vectors. The dual formulation enables the use of kernel tricks and more efficient computation.

Mathematical intuition of Support Vector Machine

Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training
dataset consisting of input feature vectors X and their corresponding class labels Y.

The equation for the linear hyperplane can be written as:

w · x + b = 0

The vector w represents the normal vector to the hyperplane, i.e. the direction perpendicular to the hyperplane. The parameter b in the equation represents the offset or distance of the hyperplane from the origin along the normal vector w.

The distance between a data point xi and the decision boundary can be calculated as:

di = (w · xi + b) / ||w||

where ||w|| represents the Euclidean norm of the weight vector w.

For a linear SVM classifier, the prediction for a new point x is:

ŷ = 1 if w · x + b ≥ 0, and ŷ = 0 if w · x + b < 0
Optimization:

 For the hard margin linear SVM classifier:

minimize (1/2)||w||^2 with respect to w and b, subject to ti(w · xi + b) ≥ 1 for all i

The target variable or label for the ith training instance is denoted by ti, with ti = -1 for negative instances (when yi = 0) and ti = +1 for positive instances (when yi = 1). We therefore require a decision boundary that satisfies the constraint:

ti(w · xi + b) ≥ 1

 For the soft margin linear SVM classifier:

minimize (1/2)||w||^2 + C ∑ ζi subject to ti(w · xi + b) ≥ 1 − ζi and ζi ≥ 0 for all i

where ζi are slack variables that measure how far each point violates the margin.

 Dual Problem: SVM can also be solved through the dual of this optimisation problem, which involves finding the Lagrange multipliers associated with the support vectors. The optimal Lagrange multipliers αi maximize the following dual objective function:

maximize ∑ αi − (1/2) ∑i ∑j αi αj ti tj K(xi, xj) subject to 0 ≤ αi ≤ C and ∑ αi ti = 0

Where,

 αi is the Lagrange multiplier associated with the ith training sample.

 K(xi, xj) is the kernel function that computes the similarity between two samples xi and xj.
It allows SVM to handle nonlinear classification problems by implicitly mapping the
samples into a higher-dimensional feature space.

 The term ∑αi represents the sum of all Lagrange multipliers.

Once the dual problem has been solved and the optimal Lagrange multipliers have been found, the SVM decision boundary can be described in terms of these multipliers and the support vectors. The training samples with αi > 0 are the support vectors, and the decision function is given by:

f(x) = ∑ αi ti K(xi, x) + b, with the predicted class given by the sign of f(x).

Types of Support Vector Machine

Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided
into two main parts:

 Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are very
suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A hyperplane
that maximizes the margin between the classes is the decision boundary.

 Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel functions,
nonlinear SVMs can handle nonlinearly separable data. The original input data is
transformed by these kernel functions into a higher-dimensional feature space, where the
data points can be linearly separated. A linear SVM is used to locate a nonlinear decision
boundary in this modified space.

Popular kernel functions in SVM

The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts non-separable problems into separable problems. It is mostly useful in non-linear separation problems. Simply put, the kernel performs some fairly complex data transformations and then works out how to separate the data based on the labels or outputs defined.

Advantages of SVM

 Effective in high-dimensional cases.

 It is memory efficient because it uses only a subset of the training points, called support vectors, in the decision function.

 Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.
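
A minimal sketch of an SVM classifier in Python follows, assuming scikit-learn is available; the RBF kernel and the value of C are illustrative choices rather than recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, then fit an SVM with an RBF kernel and regularisation parameter C
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))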

LECTURE FOUR:
4.0 Feature Extraction
4.1 Feature selection and Dimensionality reduction
4.2 Feature Extraction Techniques
4.0 Feature Extraction

Feature extraction is a critical step in machine learning and data analysis, involving transforming
raw data into a set of features that can be used for model building. This process includes two main
components: feature selection and dimensionality reduction, and feature extraction techniques.

4.1 Feature Selection and Dimensionality Reduction

Feature Selection: Feature selection involves selecting a subset of relevant features (variables,
predictors) for use in model construction. The main goal is to improve the performance of the
model by eliminating irrelevant or redundant features. There are several methods for feature
selection:

1. Filter Methods: These techniques evaluate the relevance of features by looking at the
intrinsic properties of the data, without involving any machine learning algorithms.
Examples include:

o Correlation Coefficient

o Chi-square Test

o Mutual Information

2. Wrapper Methods: These methods evaluate the performance of a subset of features based
on the outcome of a specific machine learning algorithm. Examples include:

o Recursive Feature Elimination (RFE)

o Forward Selection

o Backward Elimination

3. Embedded Methods: These methods perform feature selection during the model training
process. Examples include:

o LASSO (Least Absolute Shrinkage and Selection Operator)

o Ridge Regression

o Tree-based methods (e.g., Random Forests)
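
The sketch below illustrates one filter method (mutual information with SelectKBest) and one wrapper method (recursive feature elimination) from the lists above. It assumes scikit-learn is available, and the choice of dataset and of keeping 10 features is purely illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 10 features with the highest mutual information with the label
filter_selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_filtered = filter_selector.fit_transform(X, y)

# Wrapper method: recursive feature elimination driven by a logistic regression model
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapped = rfe.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)  # both reduced to 10 columns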

Dimensionality Reduction: Dimensionality reduction techniques reduce the number of random variables under consideration by obtaining a set of principal variables. This is particularly useful when dealing with high-dimensional data. Key methods include:

1. Principal Component Analysis (PCA): A linear technique that transforms the data into a
new coordinate system such that the greatest variance by any projection of the data comes
to lie on the first coordinate (the first principal component), the second greatest variance
on the second coordinate, and so on.

2. Linear Discriminant Analysis (LDA): Primarily used for classification problems, LDA
aims to find a linear combination of features that best separate two or more classes of
objects or events.

3. t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.

4. Autoencoders: A type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). The network is trained to attempt to copy its input to its output, thereby learning the most important features.
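
As a short illustration of dimensionality reduction, the following sketch applies PCA with scikit-learn; the dataset and the choice of 10 components are illustrative only.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)           # 64-dimensional image features
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=10)                    # keep the 10 directions of greatest variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                        # (n_samples, 10)
print(pca.explained_variance_ratio_.sum())    # fraction of the variance retained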

4.2 Feature Extraction Techniques

Feature extraction involves creating new features from the existing raw data, aiming to reduce the
data's dimensionality while preserving its significant properties. Here are some common feature
extraction techniques:

1. Text Data:

a) Bag of Words (BoW): Represents text data by converting it into a frequency matrix of words.

b) Term Frequency-Inverse Document Frequency (TF-IDF): A statistical measure used to evaluate the importance of a word in a document relative to a corpus.

c) Word Embeddings (Word2Vec, GloVe): Converts words into continuous vectors in a high-dimensional space.

2. Image Data:

a) Histogram of Oriented Gradients (HOG): Captures edge orientations and is often used in object detection.

b) Scale-Invariant Feature Transform (SIFT): Detects and describes local features in images.

c) Convolutional Neural Networks (CNNs): Automatically extract hierarchical features from images.

3. Time Series Data:

a) Fourier Transform: Decomposes a time series into the frequencies that make it up.

b) Wavelet Transform: Provides a time-frequency representation of the signal.

c) Autoregressive Integrated Moving Average (ARIMA): A model used for understanding and predicting future points in the series.

4. Audio Data:

a) Mel-Frequency Cepstral Coefficients (MFCC): Represents the short-term power spectrum of sound.

b) Spectrograms: Visual representations of the spectrum of frequencies in a signal as it varies with time.
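
As one concrete example of the text-data techniques above, the following sketch builds a TF-IDF representation with scikit-learn; the tiny corpus is illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "pattern recognition extracts patterns from data",
    "feature extraction reduces the dimensionality of data",
    "speech recognition is an application of pattern recognition",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())         # vocabulary learned from the corpus
print(tfidf_matrix.shape)                         # (3 documents, vocabulary size)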

By effectively employing feature selection, dimensionality reduction, and feature extraction
techniques, one can improve model performance, reduce overfitting, and decrease computational
costs, ultimately leading to more accurate and efficient machine learning models.

LECTURE FIVE

5.0 Classification

5.1 Linear classifiers (e.g., perceptron, logistic regression)
5.2 Non-linear classifiers (e.g., decision trees, k-nearest neighbors)
5.3 Ensemble methods (e.g., random forests, boosting)
5.0 Classification
Classification is a fundamental task in machine learning, involving the assignment of input data
into predefined categories or classes. This process can be performed using various types of
classifiers, each with its own strengths and weaknesses. Below is an overview of different types
of classifiers, including linear classifiers, non-linear classifiers, and ensemble methods.

5.1 Linear Classifiers

5.1.1. Perceptron:

 Description: The perceptron is a simple, single-layer neural network used for binary
classification tasks. It works by finding a linear decision boundary to separate two classes.

 Advantages: Easy to implement and understand.

 Disadvantages: Limited to linearly separable data and cannot handle more complex
patterns.

5.1.2. Logistic Regression:

 Description: Logistic regression is a statistical model that uses a logistic function to model
the probability of a binary dependent variable. Despite its name, it is used for classification,
not regression.

 Advantages: Provides probability estimates, interpretable coefficients, and performs well for linearly separable data.

 Disadvantages: Assumes a linear relationship between input features and the log odds of
the output, which may not always hold true.
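
A minimal sketch of both linear classifiers above, assuming scikit-learn; the synthetic dataset and settings are illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

perceptron = Perceptron(random_state=0).fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Perceptron accuracy:", perceptron.score(X_test, y_test))
print("Logistic regression accuracy:", logreg.score(X_test, y_test))
print("Class-1 probabilities:", logreg.predict_proba(X_test[:3])[:, 1])  # probability estimates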

5.2 Non-linear Classifiers

5. 2. 1. Decision Trees:

 Description: Decision trees partition the data into subsets based on feature values, making
decisions at each node until a prediction is made at the leaf nodes.

 Advantages: Easy to interpret, handle both numerical and categorical data, and require
little data preprocessing.

 Disadvantages: Prone to overfitting, especially with deep trees, and can be unstable with
small changes in data.

5.2.2. k-Nearest Neighbors (k-NN):

 Description: k-NN is a simple, instance-based learning algorithm that classifies a data point based on the majority class among its k-nearest neighbors in the feature space.

 Advantages: Simple and intuitive, effective for small datasets with well-defined clusters.

 Disadvantages: Computationally expensive for large datasets, sensitive to irrelevant features and the choice of k.
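
A short sketch comparing the two non-linear classifiers above, assuming scikit-learn; the tree depth and the value of k are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("Decision tree accuracy:", tree.score(X_test, y_test))
print("k-NN accuracy:", knn.score(X_test, y_test))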

5.3 Ensemble Methods

5.3.1. Random Forests:

 Description: Random forests are an ensemble learning method that constructs multiple
decision trees during training and outputs the mode of the classes (classification) or mean
prediction (regression) of the individual trees.

 Advantages: Robust to overfitting, handles large datasets well, and can handle missing
values and maintain accuracy for a large proportion of data.

 Disadvantages: Can be computationally expensive and less interpretable than a single decision tree.

5.3.2. Boosting:

 Description: Boosting is an ensemble technique that combines weak learners to form a strong learner by focusing on the errors made by previous models. Common algorithms include AdaBoost, Gradient Boosting, and XGBoost.

 Advantages: Improves model accuracy, reduces bias, and can handle complex patterns.

 Disadvantages: Prone to overfitting if not carefully tuned, can be computationally intensive, and requires careful parameter tuning.
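
A brief sketch of the two ensemble methods above, assuming scikit-learn; the dataset and hyperparameters are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
boosting = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

print("Random forest accuracy:", forest.score(X_test, y_test))
print("Gradient boosting accuracy:", boosting.score(X_test, y_test))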

Summary of Classification Methods

 Linear Classifiers: Suitable for linearly separable data, with easy implementation and
interpretability. Examples include the perceptron and logistic regression.

 Non-linear Classifiers: Handle more complex relationships between features and outputs.
Examples include decision trees and k-nearest neighbors.

 Ensemble Methods: Combine multiple models to improve performance and robustness. Examples include random forests and boosting techniques.

By understanding and applying these different classification methods, one can choose the approach best suited to the data and the problem at hand, leading to more accurate and reliable predictions.
LECTURE SIX:

6.0 Clustering
6.1 K-means clustering
6.2 Hierarchical clustering
6.3 Density-based clustering
Clustering is an unsupervised learning technique used to group similar data points into clusters
based on their characteristics. This helps in understanding the underlying structure of the data.
Here are three widely used clustering methods: K-means clustering, hierarchical clustering, and
density-based clustering.

6.1 K-means Clustering

Description: K-means clustering is a partitioning method that divides a dataset into K distinct,
non-overlapping subsets (clusters). Each data point belongs to the cluster with the nearest mean.

Steps:

1. Initialization: Choose K initial centroids randomly or using methods like K-means++.

2. Assignment: Assign each data point to the nearest centroid, forming K clusters.

3. Update: Calculate the new centroids as the mean of the data points in each cluster.

4. Repeat: Repeat the assignment and update steps until the centroids no longer change or
change very little.

Advantages:

 Simple to implement and understand.

 Efficient for large datasets.

Disadvantages:

 Requires specifying the number of clusters (K) in advance.

 Sensitive to the initial placement of centroids.

 May converge to a local minimum, leading to suboptimal clusters.

 Assumes clusters are spherical and equally sized, which may not always be the case.
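
A minimal K-means sketch in Python, assuming scikit-learn; K = 3 and the synthetic blob data are illustrative choices.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # K must be chosen in advance
cluster_labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)    # the K centroids
print(cluster_labels[:10])        # cluster assignment of the first 10 points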

6.2 Hierarchical Clustering

Description: Hierarchical clustering builds a hierarchy of clusters either by progressively merging smaller clusters into larger ones (agglomerative) or by progressively splitting larger clusters into smaller ones (divisive).

Types:

1. Agglomerative (Bottom-Up):

o Start with each data point as a separate cluster.

o Merge the two closest clusters at each step.

o Repeat until all data points are in a single cluster.

2. Divisive (Top-Down):

o Start with all data points in a single cluster.

o Split the cluster into smaller clusters at each step.

o Repeat until each data point is its own cluster.

Advantages:

 Does not require specifying the number of clusters in advance.

 Produces a dendrogram, a tree-like diagram that illustrates the arrangements of the clusters.

Disadvantages:

 Computationally intensive for large datasets.

 Once a merge or split is done, it cannot be undone.

 Sensitive to noise and outliers.
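
A short agglomerative-clustering sketch, assuming scikit-learn and SciPy are available; the Ward linkage and the number of clusters are illustrative choices.

from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Agglomerative (bottom-up) clustering with Ward linkage
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
cluster_labels = agg.fit_predict(X)
print(cluster_labels)

# The linkage matrix can be passed to dendrogram() to draw the cluster tree
Z = linkage(X, method="ward")
# dendrogram(Z)  # uncomment when a plotting backend such as matplotlib is available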

6.3 Density-Based Clustering

Description: Density-based clustering identifies clusters as areas of high density separated by areas of low density. It is particularly useful for discovering clusters of arbitrary shape and for handling noise.

Example Algorithm: DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

1. Initialization: Define parameters ε (epsilon, the radius of the neighborhood) and MinPts
(minimum number of points required to form a dense region).

2. Core Points: Identify core points as those with at least MinPts within a radius ε.

3. Cluster Formation: Connect core points within ε distance to form clusters.

4. Border Points: Assign border points (points within ε of a core point but with fewer than
MinPts in their neighborhood) to the nearest core point's cluster.

5. Noise Points: Label points that are neither core nor border points as noise.

Advantages:

 Can find clusters of arbitrary shape.

 Does not require specifying the number of clusters in advance.

 Robust to noise and outliers.

Disadvantages:

 Performance depends on the selection of parameters ε and MinPts.

 Can struggle with clusters of varying densities.
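
A minimal DBSCAN sketch, assuming scikit-learn; the values of eps and min_samples are illustrative and would normally need tuning.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters that K-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)   # eps = neighbourhood radius, min_samples = MinPts
cluster_labels = db.fit_predict(X)

print(set(cluster_labels))   # cluster ids; -1 marks points labelled as noise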

Summary of Clustering Methods

 K-means Clustering: Efficient and simple but assumes spherical clusters and requires
specifying K in advance.

 Hierarchical Clustering: Produces a dendrogram and does not require specifying the
number of clusters, but is computationally intensive and sensitive to noise.

 Density-Based Clustering: Can find clusters of arbitrary shapes and is robust to noise, but
requires careful selection of parameters and may struggle with clusters of varying densities.

By understanding and utilizing these clustering methods, one can uncover the underlying patterns
and structures in the data, leading to valuable insights and better decision-making.

LECTURE SEVEN:

7.0 Applications of Pattern Recognition

7.1 Image processing and computer vision

7.2 Speech recognition and natural language processing

7.3 Bioinformatics and biomedical signal processing

7.4 Pattern recognition in finance and marketing

7.0 Applications of Pattern Recognition

Pattern recognition is a core aspect of machine learning that finds applications in many fields. Here
are detailed descriptions of its applications in various domains:

7.1 Image Processing and Computer Vision

Description: Pattern recognition in image processing and computer vision involves interpreting
visual data to automate tasks that typically require human vision. This includes the analysis,
understanding, and processing of images and videos to extract meaningful information.

Applications:

 Object Detection and Recognition: Systems that identify and classify objects within an image or video, such as facial recognition for security systems, vehicle detection in traffic management, and identifying animals in wildlife monitoring.

 Image Segmentation: Dividing an image into parts for easier analysis, commonly used in
medical imaging to highlight areas of interest, such as tumors in MRI scans.

 Image Enhancement: Techniques to improve the quality of images, including noise


reduction, sharpening, and color correction.

 Autonomous Vehicles: Using cameras and sensors to enable vehicles to perceive and
navigate their environment by recognizing roads, obstacles, and traffic signals.

 Augmented Reality (AR): Enhancing real-world environments with digital overlays for
applications in gaming, education, and navigation.

7.2 Speech Recognition and Natural Language Processing (NLP)

Description: Pattern recognition in speech recognition and NLP enables machines to understand
and interact with human language, whether spoken or written. This involves the analysis of
linguistic patterns to facilitate communication between humans and computers.

Applications:

 Voice Assistants: Technologies like Siri, Alexa, and Google Assistant that understand and
respond to spoken commands, providing hands-free interaction with devices.

 Speech-to-Text: Converting spoken language into written text, useful for transcription
services, voice typing, and accessibility features for the hearing impaired.

 Language Translation: Translating text or speech from one language to another, aiding
global communication and understanding through services like Google Translate.

 Sentiment Analysis: Analyzing text data to determine the sentiment or emotional tone,
valuable in customer feedback analysis, social media monitoring, and market research.

 Chatbots: Automated systems that engage with users through text or speech to provide
customer support, information, and personalized experiences.

7.3 Bioinformatics and Biomedical Signal Processing

Description: Pattern recognition in bioinformatics and biomedical signal processing involves


analyzing biological data and medical signals to advance healthcare and biological research.

Applications:

 Genomics: Studying DNA sequences to identify genes linked to diseases, enabling


personalized medicine and advances in genetic research.

 Protein Structure Prediction: Predicting the three-dimensional structure of proteins to
understand their function and interactions, crucial for drug development and understanding
biological processes.

 Medical Imaging: Analyzing medical images to diagnose and monitor diseases, such as
detecting cancerous cells in mammograms or identifying brain abnormalities in MRI scans.

 Electrocardiogram (ECG) Analysis: Interpreting ECG signals to monitor heart health,


detect arrhythmias, and guide treatment decisions.

 Wearable Health Devices: Using data from wearable sensors to monitor vital signs and
detect health issues in real-time, improving patient care and outcomes.

7.4 Pattern Recognition in Finance and Marketing

Description: Pattern recognition in finance and marketing involves analyzing large datasets to
identify trends, make predictions, and optimize business strategies.

Applications:

 Algorithmic Trading: Developing automated trading systems that analyze market data to
execute buy and sell orders based on predefined criteria, enhancing trading efficiency and
profitability.

 Fraud Detection: Identifying fraudulent transactions and activities by recognizing unusual


patterns in financial data, thereby protecting against financial losses.

 Customer Segmentation: Grouping customers based on purchasing behavior and


demographics to tailor marketing efforts and improve customer targeting.

 Credit Scoring: Assessing an individual's creditworthiness by analyzing financial history


and behavior, aiding in loan approvals and risk management.

 Market Basket Analysis: Analyzing purchase data to understand product relationships,


optimize inventory management, and enhance cross-selling and upselling strategies.

Summary

Pattern recognition is a versatile and essential tool across many fields. Its applications in image
processing, speech recognition, bioinformatics, and finance demonstrate its ability to extract
valuable insights from complex data, automate processes, and improve decision-making. By
leveraging pattern recognition, various industries can enhance their operations, offer personalized
experiences, and drive innovation.

LECTURE EIGHT:

8.0 Performance Evaluation


8.1 Metrics for evaluating classification and clustering algorithms
8.2 Cross-validation and model selection

8.0 Performance Evaluation

Performance evaluation is a critical step in developing and deploying machine learning models. It
ensures that the models are effective, reliable, and generalizable. This section covers the key
metrics for evaluating classification and clustering algorithms, as well as techniques for cross-
validation and model selection.

8.1 Metrics for Evaluating Classification and Clustering Algorithms

Classification Metrics

1. Accuracy:

o Description: The proportion of all instances that are classified correctly.

o Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

2. Precision:

o Description: The proportion of predicted positives that are actually positive.

o Formula: Precision = TP / (TP + FP)

3. Recall (Sensitivity):

o Description: The proportion of actual positives that are correctly identified.

o Formula: Recall = TP / (TP + FN)

4. F1 Score:

o Description: The harmonic mean of precision and recall, balancing the two.

o Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve):

o Description: Measures the model's ability to distinguish between classes across different thresholds.

o Use Case: Useful for evaluating performance over all classification thresholds.

6. Confusion Matrix:

o Description: A table that displays the true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN).
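
The sketch below computes several of these classification metrics with scikit-learn; the labels and scores are illustrative values only.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                           # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]                           # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]    # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))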

Clustering Metrics

1. Silhouette Score:

o Description: Measures how similar a point is to its own cluster compared to other
clusters.

o Range: -1 to 1 (higher values indicate better clustering)

o Formula: s(i) = (b(i) − a(i)) / max(a(i), b(i))

 a(i): Mean distance from point i to the other points in the same cluster.

 b(i): Mean distance from point i to the points in the nearest other cluster.

2. Davies-Bouldin Index:

o Description: Measures the average similarity ratio of each cluster with the cluster
that is most similar to it.

o Range: 0 to ∞ (lower values indicate better clustering)

3. Adjusted Rand Index (ARI):

o Description: Measures the similarity between two clusterings by considering all pairs of samples and counting pairs that are assigned to the same or different clusters in the predicted and true clusterings.

o Range: -1 to 1 (1 indicates a perfect match)

4. Homogeneity, Completeness, and V-Measure:

o Homogeneity: All clusters contain only data points that are members of a single
class.

o Completeness: All data points that are members of a given class are elements of
the same cluster.

o V-Measure: The harmonic mean of homogeneity and completeness.

8.2 Cross-Validation and Model Selection

Cross-Validation

1. k-Fold Cross-Validation:

o Description: The dataset is divided into k subsets (folds). The model is trained k
times, each time using a different subset as the validation set and the remaining k-
1 subsets as the training set.

o Advantages: Provides a good indication of model performance on unseen data.

2. Leave-One-Out Cross-Validation (LOOCV):

o Description: Each instance in the dataset is used once as a validation while the
remaining instances form the training set.

o Advantages: Uses the maximum amount of data for training, but can be computationally expensive.

3. Stratified k-Fold Cross-Validation:

o Description: Ensures that each fold is representative of the whole by preserving the percentage of samples for each class.

o Advantages: Provides better performance evaluation for imbalanced datasets.
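
A short sketch of k-fold and stratified k-fold cross-validation, assuming scikit-learn; five folds and the logistic regression model are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k-fold style evaluation: one score per fold (stratified by default for classifiers)
scores = cross_val_score(model, X, y, cv=5)
print("5-fold scores:", scores, "mean:", scores.mean())

# Explicit stratified k-fold, preserving class proportions in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("Stratified scores:", cross_val_score(model, X, y, cv=skf))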

Model Selection

1. Grid Search:

o Description: An exhaustive search through a specified subset of hyperparameters.

o Advantages: Ensures the best combination of hyperparameters is found, but can be computationally expensive.

2. Random Search:

o Description: A random search through a specified subset of hyperparameters.

o Advantages: Often faster than grid search and can find good hyperparameters with
fewer trials.

3. Bayesian Optimization:

o Description: A probabilistic model is used to find the optimal hyperparameters by balancing exploration and exploitation.

o Advantages: More efficient than grid and random search, especially with limited
computational resources.

4. Cross-Validation:

o Description: Often used in combination with the above techniques to evaluate model performance.
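
A minimal model-selection sketch that combines grid search with cross-validation, assuming scikit-learn; the parameter grid is illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Exhaustive search over a small hyperparameter grid, evaluated with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)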

Summary

Performance evaluation involves using various metrics to assess classification and clustering
algorithms, ensuring their effectiveness and reliability. Classification metrics include accuracy,
precision, recall, F1 score, ROC-AUC, and confusion matrix. Clustering metrics include the
silhouette score, Davies-Bouldin index, Adjusted Rand index, and V-measure. Cross-validation
methods, such as k-fold and LOOCV, along with model selection techniques like grid search and
Bayesian optimization, are crucial for developing robust models that generalize well to new data.

LECTURE NINE:

9.0 Ethical and Social Implications


9.1 Privacy and security concerns
9.2 Bias and fairness in pattern recognition systems

9.0 Ethical and Social Implications

As pattern recognition systems become increasingly integrated into various aspects of society, it
is crucial to consider their ethical and social implications. Two major areas of concern are privacy
and security, as well as bias and fairness.

9.1 Privacy and Security Concerns

Description: Pattern recognition systems often rely on large amounts of data, some of which may
be sensitive or personally identifiable. Ensuring the privacy and security of this data is paramount
to prevent misuse and protect individuals' rights.

Concerns:

1. Data Privacy:

o Issue: Pattern recognition systems can collect and process vast amounts of personal
data, raising concerns about how this data is stored, shared, and used.

o Implications: Unauthorized access, misuse, or breaches of personal data can lead


to identity theft, financial loss, and erosion of trust.

2. Data Security:

o Issue: Ensuring that data is protected from unauthorized access and cyber threats
is critical.

o Implications: Security breaches can result in the exposure of sensitive information,


compromising individuals' privacy and potentially causing significant harm.

3. Surveillance:

o Issue: The use of pattern recognition in surveillance systems can lead to constant
monitoring of individuals without their consent.

o Implications: This can create a sense of being watched, infringing on personal


freedoms and privacy.

4. Informed Consent:

o Issue: Individuals may not always be aware that their data is being collected and
used by pattern recognition systems.

o Implications: Lack of informed consent undermines trust and violates ethical


standards of autonomy and transparency.

Solutions:

 Data Anonymization: Removing personally identifiable information from datasets to


protect individuals' privacy.

 Robust Security Measures: Implementing strong encryption, access controls, and regular
security audits to safeguard data.

 Transparent Policies: Clearly communicating how data is collected, used, and stored, and
obtaining informed consent from individuals.

 Regulatory Compliance: Adhering to data protection regulations such as GDPR, HIPAA,


and CCPA.

9.2 Bias and Fairness in Pattern Recognition Systems

Description: Bias in pattern recognition systems arises when the data or algorithms used reflect
prejudices, leading to unfair or discriminatory outcomes. Ensuring fairness is essential to maintain
trust and promote equitable treatment.

Concerns:

1. Algorithmic Bias:

o Issue: Algorithms may inadvertently learn and perpetuate biases present in the
training data.

o Implications: This can lead to discriminatory practices, such as biased hiring


decisions, unfair loan approvals, and unequal treatment in the criminal justice
system.

2. Representation Bias:

o Issue: If the training data does not adequately represent all demographic groups,
the system may perform poorly for underrepresented populations.

o Implications: This can result in skewed predictions and decisions that


disproportionately affect certain groups.

3. Outcome Fairness:

o Issue: Ensuring that the outcomes of pattern recognition systems are fair and do
not disadvantage any group.

o Implications: Unfair outcomes can exacerbate social inequalities and lead to a lack
of trust in these systems.

4. Transparency and Accountability:

o Issue: The decision-making processes of pattern recognition systems are often


opaque, making it difficult to identify and correct biases.

o Implications: Lack of transparency can prevent accountability and hinder efforts


to address bias.

Solutions:

 Diverse Datasets: Ensuring that training datasets are representative of all demographic
groups to reduce bias.

 Bias Detection and Mitigation: Developing techniques to identify and mitigate biases in
algorithms and data.

 Fairness Metrics: Using metrics to evaluate and ensure fairness in pattern recognition
systems.

 Transparent Algorithms: Implementing explainable AI techniques to make the decision-


making process more transparent and understandable.

 Regular Audits: Conducting regular audits of systems to identify and address biases and
ensure compliance with fairness standards.

Summary

The ethical and social implications of pattern recognition systems are significant and multifaceted.
Privacy and security concerns necessitate robust measures to protect sensitive data and ensure
informed consent. Bias and fairness issues require careful attention to prevent discriminatory
outcomes and promote equitable treatment. Addressing these concerns is essential to foster trust,
ensure ethical use, and maximize the benefits of pattern recognition technologies for all members
of society.
