PR Lecture Note
Course Objectives:
1. To understand the basic concepts and principles of pattern recognition.
2. To learn various techniques and algorithms used in pattern recognition.
3. To develop practical skills in data analysis, feature extraction, and classification.
4. To explore applications of pattern recognition in real-world problems.
5. To critically evaluate the performance of pattern recognition systems.
Course Outline:
1. Introduction to Pattern Recognition
Definition and importance of pattern recognition
History and applications
2. Statistical Pattern Recognition
Probability theory and statistical decision theory
Bayes decision theory
Maximum likelihood estimation
3. Machine Learning Algorithms
Supervised learning
Unsupervised learning
Neural networks
Support vector machines
4. Feature Extraction
Feature selection and dimensionality reduction
Feature extraction techniques
5. Classification
Linear classifiers (e.g., perceptron, logistic regression)
Non-linear classifiers (e.g., decision trees, k-nearest neighbors)
Ensemble methods (e.g., random forests, boosting)
6. Clustering
K-means clustering
Hierarchical clustering
Density-based clustering
7. Applications of Pattern Recognition
Image processing and computer vision
Speech recognition and natural language processing
Bioinformatics and biomedical signal processing
Pattern recognition in finance and marketing
8. Performance Evaluation
Metrics for evaluating classification and clustering algorithms
Cross-validation and model selection
9. Ethical and Social Implications
Privacy and security concerns
Bias and fairness in pattern recognition systems
Teaching Methodology:
Lectures
Practical sessions using software tools (e.g., MATLAB, Python)
Case studies and real-world examples
Group discussions and presentations
Assessment:
Assignments and quizzes
Practical projects
Final examinations
Prerequisites: Basic knowledge of mathematics, probability theory, and programming is
recommended.
References:
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern Classification (2nd ed.).
Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction (2nd ed.).
LECTURE ONE (1)
Introduction to Pattern Recognition:
Pattern recognition is a data analysis method that uses machine learning algorithms to
automatically recognize patterns and regularities in data. This data can be anything from text and
images to sounds or other definable qualities. Pattern recognition systems can recognize familiar
patterns quickly and accurately.
Pattern Recognition is the act of taking in raw data (objects) and classifying them into a number
of categories or classes (clustering) called patterns. Typically the categories are assumed to be
known in advance, although there are techniques to learn the clustering.
Pattern recognition can be defined as the classification of data based on knowledge already
gained or on statistical information extracted from patterns and/or their representation. One of the
important aspects of pattern recognition is its application potential.
A pattern is an abstract object, such as a set of measurements describing a physical object. The
objects could be images, signal waveforms or any type of measurements that need to be classified.
We will refer to these objects as patterns.
Pattern is everything around in this digital world. A pattern can either be seen physically or it can
be observed mathematically by applying algorithms.
Pattern recognition is an integral part of most machine intelligence systems built for decision
making. It studies how machines can make decisions based on the category of the patterns:
Supervised learning
Unsupervised learning
a) Supervised classification
We know exactly how many groups exist.
We have data that are known to come from these groups.
Based on this known information (the training samples), we can build classification rules to classify
new data points into one of the available groups.
In the example of Figure 1 (above), we assumed that a set of training data were available,
and the classifier was designed by exploiting this a priori known information. This is known
as supervised pattern recognition.
b) Unsupervised classification (Cluster analysis)
The data are not grouped.
We do not know how many groups exist in the data.
The training data are not available - we are given a set of feature vectors x, and the goal is
to unravel the underlying similarities and cluster (group) similar vectors together. This is
known as unsupervised pattern recognition or clustering.
Such tasks arise in remote sensing, image segmentation, and image and speech coding.
Clustering is “the process of organizing objects into groups whose members are similar in
some way”. Clustering generates a partition of the data which supports decision making - the
specific decision-making activity of interest to us. Clustering is used in unsupervised learning.
A cluster is therefore a collection of objects which are “similar” to one another and
“dissimilar” to the objects belonging to other clusters.
Categorizing Clustering
1. Sequential
Number of clusters is not known a priori.
2. Hierarchical
Final clustering is achieved via a divisive or an agglomerative approach.
3. Iterative – based on cost function optimization
Probabilistic - data are assumed to be drawn from a mixture of probability distributions
Boundary detection - iteratively adjusts the boundaries of the regions where clusters lie
Hard clustering - a vector belongs exclusively to one specific cluster.
Why clustering?
Simplifications
Pattern detection
Useful in data concept construction
Unsupervised learning process
1. The objects to be classified are first sensed by a transducer (camera) - sensor converts images,
sounds or other physical inputs into signal data.
2. The segmentor isolates the sensed objects from the background or from other objects.
3. The feature extractor reduces the data size by measuring certain object properties that are useful
for classification.
4. The classifier evaluates the evidence presented and makes a final decision, assigning the
sensed object to a category.
5. The preprocessor adjusts the light level and thresholds the image to remove the background of the
conveyor belt, while a post-processing step weighs the costs of errors and decides on the appropriate action.
Example: consider our face; then the eyes, ears, nose, etc. are features of the face.
A set of features taken together forms the feature vector.
Example: In the above example of a face, if all the features (eyes, ears, nose, etc.) are taken
together, then the sequence is a feature vector ([eyes, ears, nose]). A feature vector is a sequence
of features represented as a d-dimensional column vector. In the case of speech, MFCCs (Mel-
frequency cepstral coefficients) are the spectral features of the speech.
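As a minimal sketch (assuming NumPy; the measurement values are made up for illustration), a pattern can be stored as a d-dimensional column feature vector:

import numpy as np

# Hypothetical measurements of one object, e.g. [length, width, intensity]
x = np.array([[5.1], [3.5], [0.8]])   # d-dimensional column vector, here d = 3

print(x.shape)   # (3, 1)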
A pattern recognition system should recognize familiar patterns quickly and accurately.
The entire dataset is divided into two categories, one which is used in training the model i.e.
Training set, and the other that is used in testing the model after training, i.e. testing set.
Training set:
The training set is used to build a model. It consists of the set of images that are used to train the
system. Training rules and algorithms are used to give relevant information on how to associate
input data with output decisions. The system is trained by applying these algorithms to the
dataset, all the relevant information is extracted from the data, and results are obtained.
Generally, 80% of the data of the dataset is taken for training data.
Testing set:
Testing data is used to test the system. It is the set of data that is used to verify whether the
system is producing the correct output after being trained or not. Generally, 20% of the data of
the dataset is used for testing. Testing data is used to measure the accuracy of the system. For
example, if a system that identifies which category a particular flower belongs to correctly identifies
seven out of ten flowers and misclassifies the rest, then its accuracy is 70%.
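A minimal sketch (assuming scikit-learn; the Iris flower data is used only as a convenient stand-in dataset) of an 80/20 train/test split followed by an accuracy measurement on the held-out testing set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)        # features and class labels

# 80% of the samples are used for training, 20% are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)    # train on the training set
accuracy = accuracy_score(y_test, model.predict(X_test))  # evaluate on the testing set
print("test accuracy:", accuracy)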
Imagine we have a dataset containing information about apples and oranges. The features of each
fruit are its color (red or yellow) and its shape (round or oval). We can represent each fruit using
a list of strings, e.g. [‘red’, ’round’] for a red, round fruit.
Our goal is to write a function that can predict whether a given fruit is an apple or an orange. To
do this, we will use a simple pattern recognition algorithm called k-nearest neighbors (k-NN).
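A minimal, self-contained sketch of such a k-NN rule for this fruit example (the training fruits and their labels below are made up; similarity is measured simply by counting matching features):

from collections import Counter

# Hypothetical training data: [color, shape] -> label
training_data = [
    (['red', 'round'], 'apple'),
    (['red', 'round'], 'apple'),
    (['yellow', 'round'], 'apple'),
    (['yellow', 'oval'], 'orange'),
    (['yellow', 'oval'], 'orange'),
]

def distance(a, b):
    # Number of features that differ between two fruits (Hamming distance)
    return sum(f1 != f2 for f1, f2 in zip(a, b))

def predict(fruit, k=3):
    # Take the k most similar training fruits and return the majority label
    neighbours = sorted(training_data, key=lambda item: distance(fruit, item[0]))[:k]
    labels = [label for _, label in neighbours]
    return Counter(labels).most_common(1)[0][0]

print(predict(['red', 'round']))   # -> 'apple'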
Importance of Pattern Recognition:
Data Analysis: Pattern recognition enables the analysis of complex datasets by identifying
underlying structures, trends, and relationships. It helps uncover valuable insights and hidden
patterns that may not be apparent through manual examination.
Decision Making: Recognizing patterns allows for informed decision-making in various domains,
including business, finance, healthcare, and engineering. By identifying patterns in data, decision-
makers can anticipate trends, predict outcomes, and formulate effective strategies.
Computer Vision: Pattern recognition is essential in computer vision systems, where algorithms
analyze visual data to interpret and understand the content of images or videos. Applications
include object detection, facial recognition, image segmentation, and autonomous navigation.
Speech and Language Processing: Pattern recognition plays a crucial role in speech and language
processing tasks, such as speech recognition, natural language understanding, and machine
translation. Algorithms analyze audio signals or text data to recognize patterns in speech or
language patterns and convert them into meaningful information.
Biometrics: Pattern recognition is used in biometric systems for identifying individuals based on
unique physiological or behavioral characteristics, such as fingerprints, iris scans, or voice
patterns. These systems are employed for security and authentication purposes in various
applications, including access control and identity verification.
Disease Diagnosis: In healthcare, pattern recognition techniques are used for medical image
analysis, disease diagnosis, and prognosis prediction. Algorithms analyze medical imaging data,
such as MRI scans or X-rays, to detect abnormalities, classify diseases, and assist healthcare
professionals in making accurate diagnoses.
Overall, pattern recognition is essential for understanding and interpreting complex data,
facilitating decision-making processes, and developing intelligent systems capable of automated
analysis and interpretation. Its applications span across diverse fields and contribute to
advancements in technology, science, and society.
Early Developments: The roots of pattern recognition can be traced back to ancient civilizations,
where humans relied on visual and auditory cues to recognize patterns in nature, such as animal
tracks and sounds. However, formal studies in pattern recognition began in the early 20th century
with the development of statistical methods and signal processing techniques.
Statistical Pattern Recognition: In the mid-20th century, pioneers such as Norbert Wiener and
Claude Shannon laid the foundation for statistical pattern recognition, introducing concepts such
as Bayesian decision theory and information theory. These mathematical frameworks provided a
systematic approach to analyzing and classifying patterns in data.
Machine Learning: The advent of computers in the latter half of the 20th century enabled the
development of machine learning algorithms for pattern recognition. Early approaches, such as the
perceptron and nearest neighbor algorithms, paved the way for more sophisticated techniques like
neural networks, support vector machines, and deep learning.
Computer Vision: Pattern recognition found widespread applications in computer vision, where
algorithms analyze visual data to interpret and understand the content of images or videos.
Landmark developments include the creation of edge detection algorithms, object recognition
systems, and convolutional neural networks (CNNs).
Speech and Language Processing: Pattern recognition techniques have been applied to speech
and language processing tasks, such as speech recognition, natural language understanding, and
machine translation. Early systems relied on statistical models, while modern approaches leverage
deep learning architectures like recurrent neural networks (RNNs) and transformers.
Goal of Pattern Recognition
Pattern recognition is a scientific discipline whose goal is classification of data, objects or patterns
into categories or classes.
The aim is to choose the model that best corresponds to any sensed pattern and to assign the pattern
to the class described by that model.
Advantages:
It is useful, for example, for cloth pattern recognition for visually impaired people.
Disadvantages:
The syntactic pattern recognition approach is complex to implement and is a very slow
process.
Applications of Pattern Recognition:
Image Processing: Pattern recognition is used in image processing applications for tasks such as
object detection, image segmentation, and scene understanding. It finds applications in fields like
medical imaging, satellite imagery analysis, surveillance systems, and autonomous vehicles.
Biometrics: Pattern recognition plays a crucial role in biometric systems for identifying
individuals based on unique physiological or behavioral characteristics. Biometric modalities
include fingerprint recognition, iris recognition, facial recognition, voice recognition, and gait
analysis.
Speech Recognition: Pattern recognition techniques are employed in speech recognition systems
to transcribe spoken language into text. These systems are used in virtual assistants, voice-
controlled devices, dictation software, and customer service automation.
Document Analysis: Pattern recognition is applied to document analysis tasks such as optical
character recognition (OCR), handwriting recognition, and document classification. These systems
automate data entry, digitize historical documents, and assist in information retrieval.
Financial Forecasting: Pattern recognition techniques are used in financial markets for predicting
trends, identifying trading opportunities, and assessing risk. They analyze historical market data
to forecast stock prices, detect anomalies, and optimize investment strategies.
Security and Surveillance: Pattern recognition is utilized in security and surveillance systems for
tasks such as intrusion detection, face recognition, and behavior analysis. These systems enhance
public safety, protect critical infrastructure, and prevent criminal activities.
Overall, pattern recognition has revolutionized numerous fields by enabling automated analysis
and interpretation of complex data, leading to advancements in technology, science, and society.
Its applications continue to expand as new algorithms and technologies emerge, driving innovation
and addressing real-world challenges.
Probability theory and statistical decision theory are two fundamental concepts in the field of
statistics and decision-making. Let's explore each of these concepts:
Probability Theory:
Probability theory deals with the mathematical study of random events or uncertain outcomes. It
provides a framework for quantifying uncertainty and reasoning about uncertainty in a systematic
manner.
Key concepts in probability theory include:
Sample space: The set of all possible outcomes of a random experiment.
Event: Any subset of the sample space, representing a particular outcome or set of outcomes.
Probability measure: A function that assigns a numerical value between 0 and 1 to each event,
representing the likelihood of that event occurring.
Probability distribution: A mathematical function that describes the probabilities of different
outcomes of a random variable.
Probability theory is used in various fields, including statistics, mathematics, physics, finance, and
engineering. It is applied in areas such as risk assessment, modeling of random phenomena, and
decision-making under uncertainty.
Statistical Decision Theory:
Statistical decision theory is concerned with making decisions in the presence of uncertainty, based
on statistical data and decision criteria. It provides a framework for rational decision-making in
situations where outcomes are uncertain and probabilities are known or can be estimated.
Key concepts in statistical decision theory include:
Decision problem: A situation in which a decision-maker must choose among alternative courses
of action, each associated with uncertain outcomes.
Loss function: A function that quantifies the cost or loss associated with different decision
outcomes.
Decision rule: A rule or criterion for selecting the best course of action based on the available
information and the objectives of the decision-maker.
Bayes decision rule: A decision rule that minimizes the expected loss, taking into account both
the prior probabilities of different outcomes and the loss associated with each outcome.
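As a small numerical sketch of the Bayes decision rule (all posteriors and losses below are made up): suppose a filter must decide whether to delete or keep an email whose posterior probability of being spam is 0.7, where deleting a legitimate email costs 1.0 and keeping a spam email costs 0.2. The rule picks the action with the smallest expected loss:

# Hypothetical posterior probabilities for the two states of nature
posteriors = {"spam": 0.7, "not_spam": 0.3}

# loss[action][state]: cost of taking `action` when the true state is `state`
loss = {
    "delete": {"spam": 0.0, "not_spam": 1.0},   # deleting a legitimate email is costly
    "keep":   {"spam": 0.2, "not_spam": 0.0},   # keeping a spam email is a minor nuisance
}

# Expected loss (risk) of each action, weighted by the posterior probabilities
risk = {a: sum(loss[a][s] * p for s, p in posteriors.items()) for a in loss}

print(risk)                      # {'delete': 0.3, 'keep': 0.14}
print(min(risk, key=risk.get))   # 'keep' minimizes the expected loss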
Statistical decision theory is applied in various fields, including economics, engineering, medicine,
and business. It is used to optimize decision-making processes, develop decision support systems,
and analyze the trade-offs between different decision options.
In summary, probability theory provides the mathematical foundation for quantifying uncertainty,
while statistical decision theory provides a framework for making optimal decisions under
uncertainty based on available information and decision criteria. Together, these two concepts
form the basis for rational decision-making in a wide range of real-world situations.
LECTURE TWO:
Statistical pattern recognition: Statistical pattern recognition is the branch of statistics that deals
with the identification and classification of patterns in data. It is a type of supervised learning,
where the data is labeled with class labels that indicate which class a particular instance belongs
to. The goal of statistical pattern recognition is to learn a model that can accurately classify new
data instances based on their features.
The topic of machine learning known as statistical pattern recognition focuses on finding patterns
and regularities in data. It enables machines to gain knowledge from data, enhance performance,
and make choices based on what they have discovered.
The goal of Statistical Pattern Recognition is to find relationships between variables that can be
used for prediction or classification tasks. This lecture explores the various techniques used in
Statistical Pattern Recognition and how these methods are applied to solve real-world problems.
The importance of pattern recognition lies in its ability to detect complex relations among variables
without explicit programming instructions. By using statistical models, machines can identify
regularities in data that would otherwise require manual labor or trial-and-error experimentation
by humans. In addition, machines can generalize from existing knowledge bases to predict new
outcomes more accurately than before.
Statistical Pattern Recognition is becoming increasingly important within many industries due to
its ability to automate certain processes as well as providing valuable insights into large datasets
that may otherwise remain hidden beneath the surface. This lecture aims to provide an
overview of different techniques used for identifying patterns within data and to explain how they
are employed in solving practical problems effectively.
This field combines elements of statistics, mathematics, and computer science to develop
algorithms and models that can recognize regularities, structures, and anomalies in data sets.
Statistical pattern recognition (SPR) is a field of data analysis that uses mathematical models and
algorithms to identify patterns from large datasets. It can be used for various tasks, such as
handwriting or speech recognition, classification of objects in images, and natural language
processing. SPR employs several techniques including support vector machines, neural networks, linear
discriminants, Bayesian methods, k-nearest neighbors, and various feature extraction algorithms.
In terms of applications, SPR has been successfully applied to problems like cursive handwriting
recognition and automated medical diagnosis. In the case of handwriting recognition, an algorithm
works by extracting features using a feature extraction algorithm then matching them with existing
model parameters. The same principle applies when solving more complex tasks such as image
classification where deep learning may be employed instead of traditional methods like
discriminant analysis. Similarly, machine vision systems use SPR techniques to identify objects
within an image and classify them according to specific criteria. Furthermore, modern robotics
also utilizes SPR concepts to enable robots to recognize their environment better, the iRobot
Roomba vacuum cleaner being one example.
A typical application is object recognition, where features such as the size and orientation of the object are extracted from an
image using feature selection techniques and then converted into a feature vector input which
describes the identity of the object. The classification approach uses this information to classify
objects or events according to their properties, with Bayesian pattern classifiers being one of the
most common methods. These optimal classifiers use probabilities rather than hard limitations on
parameters when making decisions about how to group different classes together; they also allow
for prior knowledge about certain groups of objects or events to be taken into account when doing
so. By combining these two elements—feature extraction and classification—statistical pattern
recognition enables accurate automatic recognition processes.
Key Concepts
Features are measurable properties or characteristics of the data. Feature extraction involves
transforming raw data into a set of attributes that can be used for pattern recognition.
Classification: This is a supervised learning process where the goal is to assign input data into
predefined categories or classes based on the features. Common algorithms include decision trees,
support vector machines, and neural networks.
Clustering: This is an unsupervised learning process where the objective is to group similar data
points together without predefined labels. K-means and hierarchical clustering are popular
methods used for this purpose.
Model training involves using a set of data (training set) to build and optimize a pattern recognition
model. Validation involves testing the model on a separate set of data (validation set) to evaluate
its performance and generalizability.
Probability and Statistical Inference:
Probability theory underpins many pattern recognition techniques, allowing for the modeling of
uncertainty and the making of inferences about the data. Bayesian inference and maximum
likelihood estimation are commonly used methods.
Dimensionality Reduction:
Techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis
(LDA) are used to reduce the number of features while retaining the most significant information,
improving computational efficiency and model performance.
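A minimal sketch (assuming scikit-learn; the Iris data serves only as a convenient stand-in) of reducing a 4-dimensional feature set to its 2 principal components with PCA:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 150 samples, 4 features

pca = PCA(n_components=2)                # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # fraction of variance retained by each component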
Applications
Statistical Pattern Recognition has a wide range of applications across various fields:
Image and Speech Recognition: Identifying objects in images and understanding spoken language.
Medical Diagnosis: Classifying medical images or patient data to assist in diagnosing diseases.
Challenges
High Dimensionality: Large datasets with many features can be computationally intensive and
may require dimensionality reduction techniques.
Overfitting: Models that perform well on training data but poorly on unseen data. Regularization
techniques and cross-validation are used to mitigate this.
Noise and Outliers: Real-world data often contain noise and outliers that can affect the accuracy
of pattern recognition models.
In conclusion, Statistical Pattern Recognition is a crucial area in the realm of data science and
artificial intelligence, providing powerful tools and methodologies for analyzing complex data and
making informed decisions based on statistical evidence.
2.2 Probability Theory and Statistical Decision Theory
a) Probability Theory:
Definition: Probability theory is the branch of mathematics that deals with the analysis of
random events. It provides a framework for quantifying the uncertainty associated with various
phenomena and for making predictions about future events based on known probabilities.
Applications: Used in diverse fields such as finance, insurance, science, engineering, and
everyday decision-making.
b) Statistical Decision Theory:
Definition: Statistical decision theory involves making decisions under uncertainty. It
combines probability theory with decision-making processes to identify the best course of
action when outcomes are uncertain.
Applications: Widely used in economics, medical decision-making, machine learning, and
artificial intelligence.
Fundamentals of Probability Theory
Basic Concepts:
Random Experiment: An experiment or process for which the outcome cannot be
predicted with certainty.
Sample Space (S): The set of all possible outcomes of a random experiment.
Event: A subset of the sample space. An event occurs if the outcome of the experiment is
one of the elements in the subset.
Bayes’ Theorem:
Bayes’ Theorem is used to determine the conditional probability of an event. It is named after
the English statistician Thomas Bayes, whose work on the formula was published posthumously in 1763.
Bayes’ Theorem is a very important theorem in mathematics that laid the foundation of a unique
statistical inference approach called Bayesian inference. It is used to find the probability of an
event based on prior knowledge of conditions that might be related to that event.
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A) and P(B) are the probabilities of events A and B, and P(B) is never equal to zero;
P(A|B) is the probability of event A when event B happens;
P(B|A) is the probability of event B when event A happens.
Bayes Theorem Derivation
After learning about Bayes’ theorem in detail, let us understand some important terms related to
the concepts covered in the formula and its derivation.
Hypotheses: The events E1, E2, …, En in the sample space are called the hypotheses.
Prior Probability: The prior (a priori) probability is the initial probability of an event occurring before any
new data is taken into account. P(Ei) is the prior probability of hypothesis Ei.
Conditional Probability
The probability of an event A based on the occurrence of another event B is termed conditional
Probability.
It is denoted as P(A|B) and represents the probability of A when event B has already happened.
Joint Probability
When the probability of two or more events occurring together is measured, it is called the
joint probability. For two events A and B, the joint probability is denoted as P(A ∩ B).
Random Variables
Real-valued variables whose possible values are determined by the outcome of a random experiment are
called random variables. The probabilities of such values are estimated from the experiment
(experimental probability).
Bayesian inference is very important and has found application in various activities, including
medicine, science, philosophy, engineering, sports, law, etc., and Bayesian inference is directly
derived from Bayes’ theorem.
Example: Bayes’ theorem can quantify the reliability of a medical test by taking into account how
likely a person is to have a disease and the overall accuracy of the test.
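A small worked illustration of this (all numbers are hypothetical): suppose 1% of a population has the disease, the test is positive 95% of the time when the disease is present, and positive 5% of the time when it is not. Bayes’ theorem then gives the probability of disease given a positive result:

# Hypothetical prior and test characteristics
p_disease = 0.01                  # P(D)
p_pos_given_disease = 0.95        # P(+|D), sensitivity
p_pos_given_healthy = 0.05        # P(+|not D), false-positive rate

# Total probability of a positive test: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D) P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))   # about 0.161, despite the seemingly accurate test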
The difference between conditional probability and Bayes’ theorem is that conditional probability
P(A|B) states the chance of A directly given that B has occurred, whereas Bayes’ theorem expresses this
same quantity in terms of the reverse conditional P(B|A), the prior P(A), and the evidence P(B).
Theorem of Total Probability
Let E1, E2, . . ., En be mutually exclusive and exhaustive events associated with a random experiment,
and let E be an event that occurs with some Ei. Then
P(E) = P(E1) P(E|E1) + P(E2) P(E|E2) + … + P(En) P(E|En).
Proof sketch: since E1, …, En are mutually exclusive and exhaustive, E = (E ∩ E1) ∪ … ∪ (E ∩ En) with
the pieces disjoint, so P(E) = Σ P(E ∩ Ei) = Σ P(Ei) P(E|Ei).
Bayesian regression employs prior belief or knowledge about the data to “learn” more about it and
create more accurate predictions. It also takes into account the data’s uncertainty and leverages
prior knowledge to provide more precise estimates of the data. As a result, it is an ideal choice
when the data is complex or ambiguous.
Bayesian regression uses a Bayes algorithm to estimate the parameters of a linear regression model
from data, including prior knowledge about the parameters. Because of its probabilistic character,
it can produce more accurate estimates for regression parameters than ordinary least squares (OLS)
linear regression, provide a measure of uncertainty in the estimation, and make stronger
conclusions than OLS. Bayesian regression can also be utilized for related regression analysis tasks
like model selection and outlier detection.
Bayesian Regression
Bayesian regression is a type of linear regression that uses Bayesian statistics to estimate the
unknown parameters of a model. It uses Bayes’ theorem to estimate the likelihood of a set of
parameters given observed data. The goal of Bayesian regression is to find the best estimate of the
parameters of a linear model that describes the relationship between the independent and the
dependent variables.
The main difference between traditional linear regression and Bayesian regression is the
underlying assumption regarding the data-generating process. Traditional linear regression
assumes that the errors follow a Gaussian or normal distribution, while Bayesian regression makes
stronger assumptions about the nature of the data by placing a prior probability distribution on the
parameters. Bayesian regression also enables more flexibility as it allows for additional parameters
or prior distributions, and can be used to construct an arbitrarily complex model that explicitly
expresses prior beliefs about the data. Additionally, Bayesian regression provides more accurate
predictive measures from fewer data points and is able to construct estimates for uncertainty
around the estimates. On the other hand, traditional linear regressions are easier to implement and
generally faster with simpler models and can provide good results when the assumptions about the
data are valid.
Bayesian Regression can be very useful when we have insufficient data in the dataset or the data
is poorly distributed. The output of a Bayesian Regression model is obtained from a probability
distribution, as compared to regular regression techniques where the output is just obtained from
a single value of each attribute.
Bayes Theorem
Bayes Theorem gives the relationship between an event’s prior probability and its posterior
probability after evidence is taken into account. It states that the conditional probability of an event
is equal to the probability of the event given certain conditions multiplied by the prior probability
of the event, divided by the probability of the conditions.
P(A|B) = P(B|A) P(A) / P(B)
where P(A|B) is the probability of event A occurring given that event B has already occurred,
P(B|A) is the probability of event B occurring given that event A has already occurred, P(A) is the
probability of event A occurring, and P(B) is the probability of event B occurring.
There are several reasons why Bayesian regression is useful over other regression techniques.
Some of them are as follows:
1. Bayesian regression also uses the prior belief about the parameters in the analysis, which
makes it useful when limited data are available and the prior knowledge is relevant. By
combining prior knowledge with the observed data, Bayesian regression provides more
informed and potentially more accurate estimates of the regression parameters.
2. Bayesian regression provides a natural way to measure the uncertainty in the estimation of
regression parameters by generating the posterior distribution, which captures the uncertainty
in the parameter values, as opposed to the single point estimate that is produced by standard
regression techniques. This distribution offers a range of acceptable values for the parameters
and can be used to compute trustworthy intervals or Bayesian confidence intervals.
4. Bayesian regression facilitates model selection and comparison by calculating the posterior
probabilities of different models.
5. Bayesian regression can handle outliers and influential observations more effectively
compared to classical regression methods. It provides a more robust approach to regression
analysis, as extreme or influential observations have a lesser impact on the estimation.
# Priors / sampling statements for the regression parameters in a Pyro model
# (slope_loc, slope_scale, intercept_loc, intercept_scale and sigma_loc are assumed to be
#  defined earlier in the full listing, e.g. as pyro.param values in a variational guide).
slope = pyro.sample("slope", dist.Normal(slope_loc, slope_scale))
intercept = pyro.sample("intercept", dist.Normal(intercept_loc, intercept_scale))
sigma = pyro.sample("sigma", dist.HalfNormal(sigma_loc))

# Create subplots: one panel per parameter posterior (slope, intercept, sigma)
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Labels for the third panel, the posterior of the noise scale sigma
axs[2].set_title("Posterior Distribution of Sigma")
axs[2].set_xlabel("Sigma")
axs[2].set_ylabel("Density")
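As a complementary, self-contained sketch (assuming scikit-learn; the data below are synthetic), Bayesian linear regression can also be run with BayesianRidge, which returns an uncertainty estimate alongside each prediction:

import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))                      # synthetic inputs
y = 2.5 * X.ravel() + 1.0 + rng.normal(0, 1.0, size=50)   # noisy linear target

model = BayesianRidge().fit(X, y)

# The prediction comes with a standard deviation, reflecting parameter and noise uncertainty
mean_pred, std_pred = model.predict([[4.0]], return_std=True)

print(model.coef_, model.intercept_)   # posterior mean of slope and intercept
print(mean_pred, std_pred)             # predictive mean and standard deviation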
Decision Rule:
Definition: A decision rule is a function that maps observed data (features) to an action or
decision. In pattern recognition, this typically involves classifying an input pattern into one
of several predefined categories or classes.
Example: In a handwritten digit recognition system, the decision rule would assign an
observed digit image (data) to a specific digit class (e.g., 0-9).
Loss Function:
Definition: The loss function L(θ, a) quantifies the cost associated with making a decision
a when the true state of nature is θ. In pattern recognition, the true state of nature
corresponds to the actual class of the pattern, while the decision is the predicted class.
Purpose: The loss function helps in evaluating the performance of a decision rule by
assigning a penalty for incorrect classifications. It allows for the assessment of the overall
risk or expected loss, guiding the selection of optimal decision rules.
Conclusion
Probability theory and statistical decision theory provide powerful tools for dealing with
uncertainty and making informed decisions. By leveraging these theories, individuals and
organizations can improve their decision-making processes, optimize outcomes, and effectively
manage risks across various domains.
2.4 Maximum Likelihood Estimation (MLE)
Key Concepts:
Likelihood Function:
Definition: The likelihood function L(θ; x) measures the probability of the observed data
x given the parameters θ of the model. It is derived from the probability density function
(PDF) or probability mass function (PMF) of the data.
Purpose: The likelihood function serves as the basis for finding the most probable
parameter values that explain the observed data.
Objective: To find the parameter values θ that maximize the likelihood function.
Formally, this can be expressed as: θ̂_MLE = arg max_θ L(θ; x).
Log-Likelihood: Often, the logarithm of the likelihood function, called the log-
likelihood, is used because it simplifies the mathematical operations involved in
maximization.
Properties of MLE:
Consistency: As the sample size increases, the MLE converges to the true parameter value.
Asymptotic Normality: For large sample sizes, the distribution of the MLE approaches a
normal distribution.
Efficiency: MLE achieves the lowest possible variance among all unbiased estimators,
under certain regularity conditions.
Application Process:
Specify the probability distribution of the data with unknown parameters. For example,
assume the data follow a normal distribution with mean μ and variance σ².
Based on the assumed model, write the likelihood function for the observed data. For a
normal distribution, the likelihood function for a sample x1, x2, …, xn is:
L(μ, σ²; x1, …, xn) = Π_{i=1..n} (1 / √(2πσ²)) exp(−(xi − μ)² / (2σ²)).
Take the natural logarithm of the likelihood function to obtain the log-likelihood:
ln L(μ, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1..n} (xi − μ)².
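A minimal numerical sketch of this procedure (assuming NumPy and SciPy; the sample is simulated), maximizing the normal log-likelihood and comparing the result with the closed-form MLEs, namely the sample mean and the (biased) sample standard deviation:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)    # simulated sample

def negative_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                    # parameterize via log(sigma) to keep sigma > 0
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

result = minimize(negative_log_likelihood, x0=[0.0, 0.0])   # minimize the negative log-likelihood
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(mu_hat, sigma_hat)      # numerical MLEs
print(x.mean(), x.std())      # closed-form MLEs: sample mean and biased sample std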
Advantages of MLE:
Limitations of MLE:
No Prior Information: MLE does not incorporate prior knowledge or assumptions about the
parameters, which can be a limitation in some contexts.
Sensitivity to Sample Size: With small sample sizes, MLE estimates can be biased or have
high variance.
Computational Complexity: For complex models or large datasets, finding the MLE can be
computationally intensive.
In summary, Maximum Likelihood Estimation is a powerful and versatile method for estimating
the parameters of statistical models. By maximizing the likelihood function, MLE seeks to find
the parameter values that make the observed data most probable under the assumed model. Despite
its limitations, MLE remains a fundamental tool in statistical analysis and inference.
Maximum A Posteriori (MAP) Estimation
MAP estimation is a Bayesian approach that combines prior information with the likelihood
function to estimate the parameters. It involves finding the parameter values that maximize the
posterior distribution, which is obtained by applying Bayes’ theorem. In MAP estimation, a prior
distribution is specified for the parameters, representing prior beliefs or knowledge about their
values. The likelihood function is then multiplied by the prior distribution to obtain the joint
distribution, and the parameter values that maximize this joint distribution are selected as the MAP
estimates. MAP estimation provides point estimates of the parameters, similar to MLE, but
incorporates prior information.
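A minimal sketch of MAP estimation (assuming NumPy; all numbers are illustrative) for a Gaussian mean with a known noise level and a conjugate Gaussian prior, where the MAP estimate has a closed form that blends the prior mean with the sample mean according to their precisions:

import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                                       # known data noise standard deviation
x = rng.normal(loc=5.0, scale=sigma, size=20)     # small simulated sample
n = len(x)

mu0, tau = 0.0, 1.0                               # prior: mu ~ Normal(mu0, tau^2)

# Closed-form MAP estimate (also the posterior mean for this conjugate model)
map_estimate = (mu0 / tau**2 + x.sum() / sigma**2) / (1 / tau**2 + n / sigma**2)

print("MLE (sample mean):", x.mean())
print("MAP estimate:     ", map_estimate)         # pulled slightly toward the prior mean mu0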
LECTURE THREE (3):
3.1 Machine Learning Algorithms
3.2 Supervised learning
3.3 Unsupervised learning
3.4 Neural networks
3.5 Support vector machines
3.1 Machine Learning Algorithms
Machine learning algorithms are computational models that allow computers to understand
patterns and forecast or make judgments based on data without the need for explicit
programming. These algorithms form the foundation of modern artificial intelligence and are
used in a wide range of applications, including image and speech recognition, natural language
processing, recommendation systems, fraud detection, autonomous cars etc.
This lecture will cover the essential algorithms of machine learning, such as support vector
machines, decision trees, logistic regression, the naive Bayes classifier, random forests, k-means
clustering, hierarchical clustering, reinforcement learning, XGBoost, and AdaBoost.
Regression
This section delves into regression in machine learning, elucidating models, terminologies,
types, and practical applications.
What is Regression?
Regression seeks to find the best-fitting model, which can be utilized to make predictions or draw
conclusions.
It is a supervised machine learning technique, used to predict the value of the dependent
variable for new, unseen data. It models the relationship between the input features and the
target variable, allowing for the estimation or prediction of numerical values.
A regression problem arises when the output variable is a real or continuous value,
such as “salary” or “weight”. Many different models can be used; the simplest is linear
regression, which tries to fit the data with the best hyperplane passing through the points.
Response Variable: The primary factor to predict or understand in regression, also known
as the dependent variable or target variable.
Predictor Variable: Factors influencing the response variable, used to predict its values;
also called independent variables.
Underfitting and Overfitting: Overfitting occurs when an algorithm performs well on
training but poorly on testing, while underfitting indicates poor performance on both
datasets.
Regression Types
Simple Regression
o Simple linear regression should be used when there is only a single independent
variable.
Multiple Regression
o Multiple linear regression should be used when there are multiple independent
variables.
Nonlinear Regression
Regression Algorithms
There are many different types of regression algorithms, but some of the most common
include:
Linear Regression
o Linear regression is one of the simplest and most widely used statistical models. This
assumes that there is a linear relationship between the independent and dependent
variables. This means that the change in the dependent variable is proportional to the
change in the independent variables.
Polynomial Regression
Decision Tree Regression
o Decision tree regression is a type of regression algorithm that builds a decision tree
to predict the target value. A decision tree is a tree-like structure that consists of nodes
and branches. Each node represents a decision, and each branch represents the
outcome of that decision. The goal of decision tree regression is to build a tree that
can accurately predict the target value for new data points.
Random Forest Regression
Ridge Regression
Lasso regression
Characteristics of Regression
Error Measurement: Regression models are evaluated based on their ability to minimize
the error between the predicted and actual values of the target variable. Common error
metrics include mean absolute error (MAE), mean squared error (MSE), and root mean
squared error (RMSE).
Model Complexity: Regression models range from simple linear models to more complex
nonlinear models. The choice of model complexity depends on the complexity of the
relationship between the input features and the target variable.
Examples
Let’s take an example of linear regression. We have a Housing data set and we want to
predict the price of the house. Following is the python code for it.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load CSV and columns
df = pd.read_csv("Housing.csv")
Y = df['price']
X = df['lotsize']
X = X.values.reshape(len(X), 1)
Y = Y.values.reshape(len(Y), 1)

# Split into training and test sets, then fit a linear regression model
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
regr = LinearRegression()
regr.fit(X_train, Y_train)

# Plot the test data
plt.scatter(X_test, Y_test, color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.xticks(())
plt.yticks(())

# Plot the fitted regression line over the test data
plt.plot(X_test, regr.predict(X_test), color='red', linewidth=3)
plt.show()
Output:
Here in this graph, we plot the test data. The red line indicates the best fit line for predicting
the price. To make an individual prediction using the linear regression model:
print(regr.predict([[5000]]))   # predicted price for a lot size of 5000
Common regression error metrics include:
Mean Absolute Error (MAE): The average absolute difference between the predicted and
actual values of the target variable.
Mean Squared Error (MSE): The average squared difference between the predicted and
actual values of the target variable.
Root Mean Squared Error (RMSE): The square root of the mean squared error.
Huber Loss: A hybrid loss function that transitions from MAE to MSE for larger errors,
providing balance between robustness and MSE’s sensitivity to outliers.
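A minimal sketch (assuming NumPy and scikit-learn; the true and predicted values are made up) computing MAE, MSE, and RMSE for a set of predictions, as defined above:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.0, 3.0, 8.0])   # hypothetical predictions

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # square root of the MSE

print(mae, mse, rmse)   # 0.5, 0.375, ~0.612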
Applications of Regression
Predicting prices: For example, a regression model could be used to predict the price of a
house based on its size, location, and other features.
Forecasting trends: For example, a regression model could be used to forecast the sales
of a product based on historical sales data and economic indicators.
Identifying risk factors: For example, a regression model could be used to identify risk
factors for heart disease based on patient data.
Making decisions: For example, a regression model could be used to recommend which
investment to buy based on market data.
Advantages of Regression
Robust to outliers
Disadvantages of Regression
Assumes linearity
Sensitive to multicollinearity
Conclusion
Regression, a vital facet of supervised machine learning, navigates the realm of continuous
predictions. Its diverse algorithms, from linear to ensemble methods, cater to a spectrum
of real-world applications, underscoring its significance in data-driven decision-making.
Classification
Classification and Regression are two major prediction problems that are usually dealt with in Data
Mining and Machine Learning. We are going to deal with both Classification and Regression and
we will also see differences between them in this article.
Classification Algorithms
Classification is the process of finding or discovering a model or function that helps in separating
the data into multiple categorical classes, i.e. discrete values. In classification, data are categorized
under different labels according to some parameters given in the input, and the labels are then
predicted for the data.
In a classification task, we are supposed to predict discrete target variables (class labels) using
independent features.
In the classification task, we are supposed to find a decision boundary that can separate the
different classes in the target variable.
The derived mapping function could be demonstrated in the form of “IF-THEN” rules. The
classification process deals with problems where the data can be divided into binary or multiple
discrete labels. Let’s take an example: suppose we want to predict whether Team A will win a
match on the basis of some parameters recorded earlier. Then there would be two labels, Yes and
No.
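A minimal sketch (assuming scikit-learn; the match features and outcomes below are entirely hypothetical) of a decision tree classifier that learns such IF-THEN style rules and predicts a discrete label:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features per match: [home_advantage (0/1), key_player_fit (0/1), opponent_rank]
X = [[1, 1, 10], [1, 0, 3], [0, 1, 15], [0, 0, 2], [1, 1, 5], [0, 0, 20]]
y = ['Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']   # did Team A win?

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned decision boundary can be printed as IF-THEN rules
print(export_text(clf, feature_names=['home', 'player_fit', 'opp_rank']))
print(clf.predict([[1, 0, 12]]))   # discrete label predicted for a new match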
Unsupervised learning is a branch of machine learning that deals with unlabeled data. Unlike
supervised learning, where the data is labeled with a specific category or outcome, unsupervised
learning algorithms are tasked with finding patterns and relationships within the data without any
prior knowledge of the data’s meaning. This makes unsupervised learning a powerful tool for
exploratory data analysis, where the goal is to understand the underlying structure of the data.
In artificial intelligence, machine learning that takes place in the absence of human
supervision is known as unsupervised machine learning. Unsupervised machine learning
models, in contrast to supervised learning, are given unlabeled data and are allowed to discover
patterns and insights on their own, without explicit direction or instruction.
Unsupervised machine learning analyzes and clusters unlabeled datasets using machine
learning algorithms. These algorithms find hidden patterns and data without any human
intervention, i.e., we don’t give output to our model. The training model has only input
parameter values and discovers the groups or patterns on its own.
The dataset in Figure A is mall data that contains information about the clients that subscribe
to the mall. Once subscribed, they are provided a membership card, and the mall has complete
information about each customer and his/her every purchase. Now, using this data and
unsupervised learning techniques, the mall can easily group clients based on the parameters
we are feeding in.
Unstructured data: May contain noisy(meaningless) data, missing values, or unknown data
Unlabeled data: Data only contains a value for input parameters, there is no targeted value
(output). It is easy to collect as compared to the labeled one in the supervised approach.
There are mainly three types of algorithms used for unsupervised
datasets:
a) Clustering
b) Association rule learning
c) Dimensionality Reduction
Clustering
Clustering in unsupervised machine learning is the process of grouping unlabeled data into clusters
based on their similarities. The goal of clustering is to identify patterns and relationships in the
data without any prior knowledge of the data’s meaning.
Broadly this technique is applied to group data based on different patterns, such as similarities or
differences, our machine model finds. These algorithms are used to process raw, unclassified data
objects into groups. For example, in the above figure, we have not given output parameter values,
so this technique will be used to group clients based on the input parameters provided by our data.
In the real world, not every dataset we work with has a target variable. This kind of data cannot be
analyzed using supervised learning algorithms; we need the help of unsupervised algorithms. One
of the most popular types of analysis under unsupervised learning is cluster analysis. When the
goal is to group similar data points in a dataset, we use cluster analysis. In practical situations,
we can use cluster analysis for customer segmentation for targeted advertisements, or in medical
imaging to find unknown or new infected areas, and many more use cases that we will discuss
further in this lecture.
What is Clustering?
The task of grouping data points based on their similarity with each other is called Clustering or
Cluster Analysis. This method is defined under the branch of Unsupervised Learning, which aims
at gaining insights from unlabeled data points, that is, unlike supervised learning we don’t have a
target variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It
evaluates similarity based on a metric like Euclidean distance, cosine similarity, Manhattan
distance, etc., and then groups the points with the highest similarity scores together.
For Example, In the graph given below, we can clearly see that there are 3 circular clusters forming
on the basis of distance.
Now it is not necessary that the clusters formed must be circular in shape. The shape of clusters
can be arbitrary. There are many algorithms that work well with detecting arbitrary shaped
clusters.
For example, In the below given graph we can see that the clusters formed are not circular in shape.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:
Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or
not. For example, let’s say there are 4 data points and we have to cluster them into 2 clusters.
So each data point will belong to either cluster 1 or cluster 2.
Data Point    Cluster
A             C1
B             C2
C             C2
D             C1
Soft Clustering: In this type of clustering, instead of assigning each data point to a separate
cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example,
let’s say there are 4 data points and we have to cluster them into 2 clusters. So we will be
evaluating the probability of each data point belonging to both clusters. This probability is
calculated for all data points, as shown in the table below.
Data Point    Probability of C1    Probability of C2
A             0.91                 0.09
B             0.3                  0.7
C             0.17                 0.83
D             1                    0
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through the use cases of
Clustering algorithms. Clustering algorithms are majorly used for:
Market Segmentation – Businesses use clustering to group their customers and use targeted
advertisements to attract more audience.
Market Basket Analysis – Shop owners analyze their sales and figure out which items are
most often bought together by customers. For example, in the USA, according to one study, diapers
and beer were often bought together by fathers.
Social Network Analysis – Social media sites use your data to understand your browsing
behaviour and provide you with targeted friend recommendations or content
recommendations.
Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic images like
X-rays.
Simplify working with large datasets – Each cluster is given a cluster ID after clustering is
complete. Now, a whole feature set can be reduced to its cluster ID. Clustering
is effective when it can represent a complicated case with a straightforward cluster ID. Using
the same principle, clustering data can make complex datasets simpler.
There are many more use cases for clustering, but these are some of the major and common use
cases. Moving forward we will be discussing clustering algorithms that will help
you perform the above tasks.
At the surface level, clustering helps in the analysis of unstructured data. Graphing, the shortest
distance, and the density of the data points are a few of the elements that influence cluster
formation. Clustering is the process of determining how related the objects are based on a metric
called the similarity measure. Similarity metrics are easier to locate in smaller sets of features. It
gets harder to create similarity measures as the number of features increases. Depending on the
type of clustering algorithm being utilized in data mining, several techniques are employed to
group the data from the datasets. In this part, the clustering techniques are described. Various types
of clustering algorithms are:
1. Partitioning (Centroid-based) Clustering
2. Density-based Clustering
3. Hierarchical Clustering
4. Distribution-based Clustering
1. Partitioning (Centroid-based) Clustering
Partitioning methods are the simplest clustering algorithms. They group data points on the basis
of their closeness. Generally, the similarity measures chosen for these algorithms are Euclidean
distance, Manhattan distance or Minkowski distance. The datasets are separated into a
predetermined number of clusters, and each cluster is referenced by a vector of values. Each input
data point is compared with these vectors and joins the cluster whose vector it is closest to.
The primary drawback of these algorithms is the requirement that we establish the number of
clusters, “k,” either intuitively or scientifically (using the Elbow Method) before any clustering
machine learning system starts allocating the data points. Despite this, it is still the most popular
type of clustering. K-means and K-medoids clustering are examples of this type of clustering, as
sketched below.
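A minimal k-means sketch (assuming scikit-learn; the blob data are synthetic), in which k is fixed in advance as described above:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # synthetic 2-D data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)      # k must be chosen beforehand
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # one reference vector (centroid) per cluster
print(labels[:10])               # hard cluster assignment of the first few points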
2. Density-based Clustering
Density-based clustering, a model-based method, finds groups based on the density of data points.
Contrary to centroid-based clustering, which requires that the number of clusters be predefined
and is sensitive to initialization, density-based clustering determines the number of clusters
automatically and is less susceptible to beginning positions. They are great at handling clusters of
different sizes and forms, making them ideally suited for datasets with irregularly shaped or
overlapping clusters. These methods manage both dense and sparse data regions by focusing on
local density and can distinguish clusters with a variety of morphologies.
In contrast, centroid-based grouping, like k-means, has trouble finding arbitrary shaped clusters.
Due to its preset number of cluster requirements and extreme sensitivity to the initial positioning
of centroids, the outcomes can vary. Furthermore, the tendency of centroid-based approaches to
produce spherical or convex clusters restricts their capacity to handle complicated or irregularly
shaped clusters. In conclusion, density-based clustering overcomes the drawbacks of centroid-
based techniques by autonomously choosing cluster sizes, being resilient to initialization, and
successfully capturing clusters of various sizes and forms. The most popular density-based
clustering algorithm is DBSCAN.
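A minimal DBSCAN sketch (assuming scikit-learn; the two-moons data are synthetic), showing that the number of clusters is not specified in advance and that arbitrarily shaped clusters can be found:

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two crescent-shaped clusters

db = DBSCAN(eps=0.2, min_samples=5)    # density parameters instead of a cluster count
labels = db.fit_predict(X)

print(set(labels))    # cluster ids discovered automatically; -1 marks noise points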
3. Hierarchical Clustering
A method for assembling related data points into hierarchical clusters is called hierarchical
clustering. Each data point is initially taken into account as a separate cluster; clusters are then
repeatedly combined with the clusters most similar to them, until one large cluster
contains all of the data points.
Think about how you may arrange a collection of items based on how similar they are. Each object
begins as its own cluster at the base of the tree when using hierarchical clustering, which creates a
dendrogram, a tree-like structure. The closest pairings of clusters are then combined into larger
clusters after the algorithm examines how similar the objects are to one another. When every object
is in one cluster at the top of the tree, the merging process has finished. Exploring various
granularity levels is one of the fun things about hierarchical clustering. To obtain a given number
of clusters, you can select to cut the dendrogram at a particular height. The more similar two
objects are within a cluster, the closer they are. It’s comparable to classifying items according to
their family trees, where the nearest relatives are clustered together and the wider branches signify
more general connections. There are 2 approaches for Hierarchical clustering:
Divisive Clustering: It follows a top-down approach; here we consider all data points to
be part of one big cluster, and then this cluster is divided into smaller groups.
Agglomerative Clustering: It follows a bottom-up approach; here we consider all data
points to be individual clusters, and then these clusters are merged together to make
one big cluster with all data points, as sketched below.
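A minimal agglomerative sketch (assuming SciPy; the points are made up) that builds the linkage structure behind the dendrogram and then cuts it into a chosen number of clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points: two tight groups and one outlier
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

Z = linkage(X, method='ward')                       # merge the closest clusters step by step
labels = fcluster(Z, t=3, criterion='maxclust')     # cut the dendrogram into 3 clusters

print(labels)   # cluster id assigned to each point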
4. Distribution-based Clustering
Using distribution-based clustering, data points are generated and organized according to their
propensity to fall into the same probability distribution (such as a Gaussian, binomial, or other)
within the data. The data elements are grouped using a probability-based distribution that is based
on statistical distributions. Included are data objects that have a higher likelihood of being in the
cluster. A data point is less likely to be included in a cluster the further it is from the cluster’s
central point, which exists in every cluster.
A notable drawback of density and boundary-based approaches is the need to specify the clusters
a priori for some algorithms, and primarily the definition of the cluster form for the bulk of
algorithms. There must be at least one tuning or hyper-parameter selected, and while doing so
should be simple, getting it wrong could have unanticipated repercussions. Distribution-based
clustering has a definite advantage over proximity and centroid-based clustering approaches in
terms of flexibility, accuracy, and cluster structure. The key issue is that, in order to
avoid overfitting, many clustering methods only work with simulated or manufactured data, or
when the bulk of the data points certainly belong to a preset distribution. The most popular
distribution-based clustering algorithm is Gaussian Mixture Model.
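A minimal sketch of distribution-based clustering with a Gaussian Mixture Model (assuming scikit-learn; the blob data and the choice of three components are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)            # hard assignment to the most likely Gaussian
probs = gmm.predict_proba(X)       # soft assignment: probability of each cluster
print(probs[:3].round(3))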
Applications of Clustering:
1. Marketing: It can be used to characterize & discover customer segments for marketing
purposes.
2. Biology: It can be used for classification among different species of plants and animals.
3. Libraries: It is used in clustering different books on the basis of topics and information.
4. Insurance: It is used to acknowledge the customers, their policies and identifying the frauds.
5. City Planning: It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.
6. Earthquake studies: By learning the earthquake-affected areas we can determine the
dangerous zones.
7. Image Processing: Clustering can be used to group similar images together, classify images
based on content, and identify patterns in image data.
8. Genetics: Clustering is used to group genes that have similar expression patterns and identify
gene networks that work together in biological processes.
9. Finance: Clustering is used to identify market segments based on customer behavior, identify
patterns in stock market data, and analyze risk in investment portfolios.
10. Customer Service: Clustering is used to group customer inquiries and complaints into
categories, identify common issues, and develop targeted solutions.
11. Manufacturing: Clustering is used to group similar products together, optimize production
processes, and identify defects in manufacturing processes.
12. Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases,
which helps in making accurate diagnoses and identifying effective treatments.
13. Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial
transactions, which can help in detecting fraud or other financial crimes.
14. Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak
hours, routes, and speeds, which can help in improving transportation planning and
infrastructure.
15. Social network analysis: Clustering is used to identify communities or groups within social
networks, which can help in understanding social behavior, influence, and trends.
16. Cybersecurity: Clustering is used to group similar patterns of network traffic or system
behavior, which can help in detecting and preventing cyberattacks.
17. Climate analysis: Clustering is used to group similar patterns of climate data, such as
temperature, precipitation, and wind, which can help in understanding climate change and its
impact on the environment.
18. Sports analysis: Clustering is used to group similar patterns of player or team performance
data, which can help in analyzing player or team strengths and weaknesses and making
strategic decisions.
19. Crime analysis: Clustering is used to group similar patterns of crime data, such as location,
time, and type, which can help in identifying crime hotspots, predicting future crime trends,
and improving crime prevention strategies.
Association rule learning, also known as association rule mining, is a common technique used to
discover associations in unsupervised machine learning. It is a rule-based ML technique that finds
useful relations between parameters of a large data set. This technique is widely used for market
basket analysis, which helps to better understand the relationship between different products. For
example, shopping stores use algorithms based on this technique to find the relationship between
the sales of one product and the sales of another based on customer behavior: if a customer buys
milk, then he may also buy bread, eggs, or butter. Once trained well, such models can be used to
increase sales by planning different offers.
Association rule mining finds interesting associations and relationships among large sets of data
items. A rule shows how frequently an itemset occurs in a set of transactions. A typical example is
Market Basket Analysis, one of the key techniques used by large retailers to show associations
between items. It allows retailers to identify relationships between the items that people frequently
buy together. Given a set of transactions, we can find rules that will predict the occurrence of an
item based on the occurrences of other items in the transaction.
TID | Items
1 | Bread, Milk
Before we start defining the rule, let us first see the basic definitions.
Support Count (σ) – The frequency of occurrence of an itemset.
Support (s) – The fraction of transactions that contain the items in both the {X} and {Y} parts of
the rule, i.e. the number of transactions containing X ∪ Y divided by the total number of
transactions. It measures how frequently the collection of items occurs together across all
transactions.
Confidence (c) – The ratio of the number of transactions that contain all items in both {X} and {Y}
to the number of transactions that contain all items in {X}: Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X).
Lift (l) – The lift of the rule X ⇒ Y is the confidence of the rule divided by the expected
confidence, assuming that the itemsets X and Y are independent of each other. The expected
confidence is simply the support of {Y}:
Lift(X ⇒ Y) = Conf(X ⇒ Y) / Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected, a value
greater than 1 means they appear together more often than expected, and a value less than 1 means
they appear together less often than expected. Greater lift values indicate a stronger association.
For the example rule from the transaction table above:
Support = 2/5 = 0.4
Confidence = 2/3 = 0.67
Lift = 0.4 / (0.6 × 0.6) = 1.11
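The three measures can be computed directly from a list of transactions. The sketch below uses plain Python with a small hypothetical transaction list (illustrative only, not the table from this note) purely to show the formulas in action:

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    return support(set(X) | set(Y), transactions) / support(X, transactions)

def lift(X, Y, transactions):
    return confidence(X, Y, transactions) / support(Y, transactions)

# Hypothetical transactions for illustration only
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]

print(support({"Milk", "Diaper", "Beer"}, transactions))        # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))   # about 0.67
print(lift({"Milk", "Diaper"}, {"Beer"}, transactions))         # about 1.11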
The Association rule is very useful in analyzing datasets. The data is collected using bar-code
scanners in supermarkets. Such databases consists of a large number of transaction records which
list all items bought by a customer on a single purchase. So the manager could know if certain
groups of items are consistently purchased together and use this data for adjusting store layouts,
cross-selling, promotions based on statistics.
FP-Growth Algorithm: An Efficient Alternative to Apriori
FP-Growth mines frequent itemsets by building a compact FP-tree and traversing it, avoiding the
repeated candidate generation and database scans that make Apriori expensive on large datasets.
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features in a dataset while
preserving as much information as possible. This technique is useful for improving the
performance of machine learning algorithms and for data visualization. Examples of
dimensionality reduction algorithms include Principal Component Analysis (PCA) and Linear
Discriminant Analysis (LDA), which are discussed further in Lecture Four.
Advantages of Unsupervised Learning
No labeled data required: Unlike supervised learning, unsupervised learning does not
require labeled data, which can be expensive and time-consuming to collect.
Can uncover hidden patterns: Unsupervised learning algorithms can identify patterns
and relationships in data that may not be obvious to humans.
Can be used for a variety of tasks: Unsupervised learning can be used for a variety of
tasks, such as clustering, dimensionality reduction, and anomaly detection.
Can be used to explore new data: Unsupervised learning can be used to explore new data
and gain insights that may not be possible with other methods.
Disadvantages of Unsupervised Learning
Overfitting: Unsupervised learning algorithms can overfit to the specific dataset used for
training, limiting their ability to generalize to new data.
Data quality: Unsupervised learning algorithms are sensitive to the quality of the input
data. Noisy or incomplete data can lead to misleading or inaccurate results.
Applications of Unsupervised learning
Fraud detection: Unsupervised learning can be used to detect fraud in financial data by
identifying transactions that deviate from the expected patterns. This can help to prevent
fraud by flagging these transactions for further investigation.
Semi-Supervised Learning in ML
Today's Machine Learning algorithms can be broadly classified into three categories: Supervised
Learning, Unsupervised Learning, and Reinforcement Learning. Setting Reinforcement Learning
aside, the primary two categories of Machine Learning problems are Supervised and Unsupervised
Learning. The basic difference between the two is that Supervised Learning datasets have an output
label associated with each tuple while Unsupervised Learning datasets do not.
Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labeled data and a large amount
of unlabeled data to train a model. The goal of semi-supervised learning is to learn a function that
can accurately predict the output variable based on the input variables, similar to supervised
learning. However, unlike supervised learning, the algorithm is trained on a dataset that contains
both labeled and unlabeled data.
Semi-supervised learning is particularly useful when there is a large amount of unlabeled data
available, but it’s too expensive or difficult to label all of it.
Intuitively, one may imagine the three types of learning algorithms as Supervised learning where
a student is under the supervision of a teacher at both home and school, Unsupervised learning
where a student has to figure out a concept himself and Semi-Supervised learning where a teacher
teaches a few concepts in class and gives questions as homework which are based on similar
concepts.
Examples of Semi-Supervised Learning
Text classification: In text classification, the goal is to classify a given text into one or
more predefined categories. Semi-supervised learning can be used to train a text
classification model using a small amount of labeled data and a large amount of unlabeled
text data.
Image classification: In image classification, the goal is to classify a given image into one
or more predefined categories. Semi-supervised learning can be used to train an image
classification model using a small amount of labeled data and a large amount of unlabeled
image data.
Assumptions of Semi-Supervised Learning
1. Continuity Assumption: The algorithm assumes that the points which are closer to each
other are more likely to have the same output label.
2. Cluster Assumption: The data can be divided into discrete clusters and points in the same
cluster are more likely to share an output label.
Real-World Applications of Semi-Supervised Learning
1. Speech Analysis: Since labeling audio files is a very intensive task, Semi-Supervised
learning is a very natural approach to solve this problem.
2. Web Search Ranking: A search engine's ranking algorithm can use a variant of
Semi-Supervised learning to rank the relevance of a webpage for a given query.
3. Protein Sequence Classification: Since DNA strands are typically very large in size, the
rise of Semi-Supervised learning has been imminent in this field.
The most basic disadvantage of any Supervised Learning algorithm is that the dataset has to be
hand-labeled either by a Machine Learning Engineer or a Data Scientist. This is a very costly
process, especially when dealing with large volumes of data. The most basic disadvantage of
any Unsupervised Learning is that its application spectrum is limited.
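A minimal sketch of semi-supervised learning with scikit-learn's LabelPropagation (an assumed choice for this example; scikit-learn marks unlabeled points with the label -1). The synthetic data and the 80% unlabeled fraction are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

rng = np.random.RandomState(0)
y_partial = y.copy()
unlabeled = rng.rand(len(y)) < 0.8     # hide 80% of the labels
y_partial[unlabeled] = -1              # -1 means "unlabeled" to scikit-learn

model = LabelPropagation().fit(X, y_partial)
print("accuracy on originally unlabeled points:",
      model.score(X[unlabeled], y[unlabeled]))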
3. Reinforcement Learning
Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal
behavior in an environment so as to obtain maximum reward. In RL, data is accumulated through
trial and error as the system interacts with its environment; it is not supplied upfront as part of the
input, as it would be in supervised or unsupervised machine learning.
Reinforcement learning uses algorithms that learn from outcomes and decide which action to take
next. After each action, the algorithm receives feedback that helps it determine whether the choice
it made was correct, neutral or incorrect. It is a good technique to use for automated systems that
have to make a lot of small decisions without human guidance.
Reinforcement learning is an autonomous, self-teaching system that essentially learns by trial and
error. It performs actions with the aim of maximizing rewards, or in other words, it is learning by
doing in order to achieve the best outcomes.
Example:
The problem is as follows: we have an agent and a reward, with many hurdles in between. The
agent is supposed to find the best possible path to reach the reward. The following example
illustrates this.
Imagine a grid containing a robot, a diamond, and patches of fire. The goal of the robot is to reach
the reward, the diamond, while avoiding the hurdles, the fire. The robot learns by trying all the
possible paths and then choosing the path that reaches the reward with the fewest hurdles.
Each correct step gives the robot a reward and each wrong step subtracts from its reward. The
total reward is calculated when it reaches the final reward, the diamond.
Input: The input should be an initial state from which the model will start
Output: There are many possible outputs as there are a variety of solutions to a particular
problem
Training: The training is based upon the input. The model returns a state, and the user
decides whether to reward or punish the model based on its output.
(Table: Reinforcement learning vs. supervised learning)
Types of Reinforcement:
1. Positive Reinforcement: strengthening behavior by adding a favorable stimulus. It maximizes
performance, but too much reinforcement can lead to an overload of states, which can diminish
the results.
2. Negative Reinforcement: strengthening behavior by removing an unfavorable condition. It
increases behavior, but it only provides enough to meet the minimum required behavior.
Elements of Reinforcement Learning:
1. Policy
2. Reward function
3. Value function
Policy: A policy defines the learning agent's behavior at a given time. It is a mapping from
perceived states of the environment to actions to be taken when in those states.
Reward function: The reward function defines the goal in a reinforcement learning problem. It
provides a numerical score based on the state of the environment.
Value function: Value functions specify what is good in the long run. The value of a state is the
total amount of reward an agent can expect to accumulate over the future, starting from that state.
The agent has sensors to decide on its state in the environment and takes action that modifies its
state.
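To make the policy, reward, and value ideas concrete, here is a minimal tabular Q-learning sketch in Python with NumPy on a hypothetical one-dimensional world (the agent starts at state 0 and the reward, the "diamond", sits at the last state); all parameter values are illustrative assumptions:

import numpy as np

n_states, n_actions = 5, 2                 # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))        # value estimates for every (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration rate

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if nxt == n_states - 1 else -0.01   # small cost per move
    return nxt, reward, nxt == n_states - 1

rng = np.random.default_rng(0)
for _ in range(500):                       # episodes of trial and error
    state, done = 0, False
    while not done:
        # epsilon-greedy policy: mostly exploit, sometimes explore
        action = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[state].argmax())
        nxt, reward, done = step(state, action)
        # Q-learning update: move toward reward plus discounted future value
        target = reward + (0.0 if done else gamma * Q[nxt].max())
        Q[state, action] += alpha * (target - Q[state, action])
        state = nxt

print(Q)   # learned values favor moving right, toward the reward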
Reinforcement learning is a technique for solving Markov decision problems.
Reinforcement learning uses a formal framework defining the interaction between a learning
agent and its environment in terms of states, actions, and rewards. This framework is intended to
be a simple way of representing essential features of the artificial intelligence problem.
RL can be used to create training systems that provide custom instruction and materials
according to the requirements of students. Other examples include:
1. Robotics: Robots with pre-programmed behavior are useful in structured environments, such as
the assembly line of an automobile manufacturing plant, where the task is repetitive in nature; RL
allows robots to learn behavior in less structured environments.
2. Game playing: A master chess player makes a move; the choice is informed by planning,
anticipating possible replies and counter-replies.
3. In general, the only way for an RL agent to collect information about the environment is to
interact with it.
Advantages of Reinforcement learning
1. Reinforcement learning can be used to solve very complex problems that cannot be solved by
conventional techniques.
2. The model can correct the errors that occurred during the training process.
3. In RL, training data is obtained via the direct interaction of the agent with the environment
4. Reinforcement learning can handle environments that are non-deterministic, meaning that the
outcomes of actions are not always predictable. This is useful in real-world applications where the
environment may change over time or is uncertain.
5. Reinforcement learning can be used to solve a wide range of problems, including those that
involve decision making, control, and optimization.
6. Reinforcement learning is a flexible approach that can be combined with other machine learning
techniques, such as deep learning, to improve performance.
Disadvantages of Reinforcement learning
1. Reinforcement learning is highly dependent on the quality of the reward function. If the reward
function is poorly designed, the agent may not learn the desired behavior.
2. Reinforcement learning can be difficult to debug and interpret. It is not always clear why the
agent is behaving in a certain way, which can make it difficult to diagnose and fix problems.
3.4 Neural Networks
Neural Networks are computational models that mimic the complex functions of the human brain.
They consist of interconnected nodes, or neurons, that process and learn from data, enabling tasks
such as pattern recognition and decision making in machine learning. This section explores neural
networks, how they work, and their architecture.
Since the 1940s, there have been a number of noteworthy advancements in the field of neural
networks:
1940s-1950s: Early Concepts
Neural networks began with the introduction of the first mathematical model of artificial
neurons by McCulloch and Pitts. But computational constraints made progress difficult.
1960s-1970s: Perceptrons
This era is defined by the work of Rosenblatt on perceptrons: single-layer networks whose
applicability was limited to problems that are linearly separable.
Neural networks extract identifying features from data, lacking pre-programmed understanding.
Network components include neurons, connections, weights, biases, propagation functions, and a
learning rule. Neurons receive inputs, governed by thresholds and activation functions.
Connections involve weights and biases regulating information transfer. Learning, i.e. adjusting
weights and biases, occurs in three stages: input computation, output generation, and iterative
refinement enhancing the network's proficiency in diverse tasks.
This learning process involves the following sequence of events:
1. The neural network is stimulated by its environment.
2. The free parameters of the neural network are changed as a result of this stimulation.
3. The neural network then responds in a new way to the environment because of the changes
in its free parameters.
Importance of Neural Networks
The ability of neural networks to identify patterns, solve intricate puzzles, and adjust to changing
surroundings is essential. Their capacity to learn from data has far-reaching effects, ranging from
revolutionizing technology like natural language processing and self-driving automobiles to
automating decision-making processes and increasing efficiency in numerous industries. The
development of artificial intelligence is largely dependent on neural networks, which also drive
innovation and influence the direction of technology.
Consider a neural network for email classification. The input layer takes features like email
content, sender information, and subject. These inputs, multiplied by adjusted weights, pass
through hidden layers. The network, through training, learns to recognize patterns indicating
whether an email is spam or not. The output layer, with a binary activation function, predicts
whether the email is spam (1) or not (0). As the network iteratively refines its weights through
backpropagation, it becomes adept at distinguishing between spam and legitimate emails,
showcasing the practicality of neural networks in real-world applications like email filtering.
Neural networks are complex systems that mimic some features of the functioning of the human
brain. A network is composed of an input layer, one or more hidden layers, and an output layer,
each made up of interconnected artificial neurons. The basic process has two stages: forward
propagation and backpropagation.
Forward Propagation
Input Layer: Each feature in the input layer is represented by a node on the network,
which receives input data.
Weights and Connections: The weight of each neuronal connection indicates how strong
the connection is. Throughout training, these weights are changed.
Hidden Layers: Each hidden layer neuron processes inputs by multiplying them by
weights, adding them up, and then passing them through an activation function. By doing
this, non-linearity is introduced, enabling the network to recognize intricate patterns.
Output: The final result is produced by repeating the process until the output layer is
reached.
Backpropagation
Loss Calculation: The network’s output is evaluated against the real goal values, and a
loss function is used to compute the difference. For a regression problem, the Mean
Squared Error (MSE) is commonly used as the cost function.
Loss Function: for regression, MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², where yᵢ is the true value and ŷᵢ is
the network's prediction.
Gradient Descent: Gradient descent is then used by the network to reduce the loss. To
lower the inaccuracy, weights are changed based on the derivative of the loss with respect
to each weight.
Adjusting weights: The weights are adjusted at each connection by applying this iterative
process, or backpropagation, backward across the network.
Training: During training with different data samples, the entire process of forward
propagation, loss calculation, and backpropagation is done iteratively, enabling the
network to adapt and learn patterns from the data.
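A minimal NumPy sketch of the forward propagation / backpropagation / gradient descent loop described above, on a hypothetical toy dataset (the three input features, network size, and learning rate are illustrative assumptions, not a prescribed architecture):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 3))                                  # toy inputs (e.g., 3 email features)
y = (X.sum(axis=1) > 1.5).astype(float).reshape(-1, 1)    # toy binary labels (spam / not spam)

W1, b1 = rng.normal(0.0, 0.5, (3, 8)), np.zeros(8)        # input layer  -> hidden layer
W2, b2 = rng.normal(0.0, 0.5, (8, 1)), np.zeros(1)        # hidden layer -> output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(2000):
    # forward propagation
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # backpropagation: gradients of the cross-entropy loss with respect to each weight
    d_out = (p - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)
    # gradient descent: adjust weights to reduce the loss
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("training accuracy:", float(((p > 0.5) == y).mean()))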
1. Learning with Supervised Learning
In supervised learning, the neural network is guided by a teacher who has access to both input-
output pairs. The network creates outputs based on inputs without taking into account the
surroundings. By comparing these outputs to the teacher-known desired outputs, an error signal is
generated. In order to reduce errors, the network’s parameters are changed iteratively and stop
when performance is at an acceptable level.
2. Learning with Unsupervised Learning
Equivalent output variables are absent in unsupervised learning. Its main goal is to comprehend
incoming data’s (X) underlying structure. No instructor is present to offer advice. Modeling data
patterns and relationships is the intended outcome instead. Words like regression and classification
are related to supervised learning, whereas unsupervised learning is associated with clustering and
association.
3. Learning with Reinforcement Learning
Through interaction with the environment and feedback in the form of rewards or penalties, the
network gains knowledge. Finding a policy or strategy that optimizes cumulative rewards over
time is the goal for the network. This kind is frequently utilized in gaming and decision-making
applications.
Common types of neural networks include:
Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three or
more layers, including an input layer, one or more hidden layers, and an output layer. It uses
nonlinear activation functions.
Recurrent Neural Network (RNN): An artificial neural network type intended for
sequential data processing is called a Recurrent Neural Network (RNN). It is appropriate for
applications where contextual dependencies are critical, such as time series prediction and
natural language processing, since it makes use of feedback loops, which enable information
to survive within the network.
Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome
the vanishing gradient problem in training RNNs. It uses memory cells and gates to
selectively read, write, and erase information.
Advantages of Neural Networks
Neural networks are widely used in many different applications because of their many benefits:
Adaptability: Neural networks are useful for activities where the link between inputs and
outputs is complex or not well defined because they can adapt to new situations and learn
from data.
Parallel Processing: Because neural networks are capable of parallel processing by nature,
they can process numerous jobs at once, which speeds up and improves the efficiency of
computations.
Neural networks, while powerful, are not without drawbacks and difficulties:
Black box Nature: As “black box” models, neural networks pose a problem in important
applications since it is difficult to understand how they make decisions.
Need for Large datasets: For efficient training, neural networks frequently need sizable,
labeled datasets; otherwise, their performance may suffer from incomplete or skewed data.
3.5 Support Vector Machine (SVM) Algorithm
Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or
nonlinear classification, regression, and even outlier detection tasks. SVMs can be used for a
variety of tasks, such as text classification, image classification, spam detection, handwriting
identification, gene expression analysis, face detection, and anomaly detection. SVMs are
adaptable and efficient in a variety of applications because they can manage high-dimensional data
and nonlinear relationships.
SVM algorithms are very effective as we try to find the maximum separating hyperplane between
the different classes available in the target feature.
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression. Although it can be applied to regression problems, it is best suited for
classification. The main objective of the SVM algorithm is to find the optimal hyperplane in an N-
dimensional space that can separate the data points of different classes in the feature space. The
hyperplane is chosen so that the margin between the closest points of different classes is as large
as possible. The dimension of the hyperplane depends upon the number of features. If
the number of input features is two, then the hyperplane is just a line. If the number of input
features is three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when
the number of features exceeds three.
Let’s consider two independent variables x1, x2, and one dependent variable which is either a blue
circle or a red circle.
(Figure: Linearly separable data points)
From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line
because we are considering only two input features x1, x2) that segregate our data points or do a
classification between red and blue circles. So how do we choose the best line or in general the
best hyperplane that segregates our data points?
One reasonable choice as the best hyperplane is the one that represents the largest separation or
margin between the two classes.
(Figure: Multiple hyperplanes separating the data from two classes)
So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane/hard
margin. So from the above figure, we choose L2. Let's consider a scenario as shown below.
(Figure: Selecting a hyperplane for data with an outlier)
Here we have one blue ball in the boundary of the red ball. So how does SVM classify the data?
It’s simple! The blue ball in the boundary of red ones is an outlier of blue balls. The SVM algorithm
has the characteristics to ignore the outlier and finds the best hyperplane that maximizes the
margin. SVM is robust to outliers.
So for this type of data, what SVM does is find the maximum margin as it did with the previous
data sets, and it adds a penalty each time a point crosses the margin. The margins in these cases are
called soft margins. When there is a soft margin, the SVM tries to minimize (1/margin + λ·Σ penalty).
Hinge loss is a commonly used penalty: if there is no violation there is no hinge loss, and if there is
a violation the hinge loss is proportional to the distance of the violation.
Till now, we were talking about linearly separable data(the group of blue balls and red balls are
separable by a straight line/linear line). What to do if data are not linearly separable?
(Figure: Original 1D dataset for classification)
Say our data is as shown in the figure above. SVM solves this by creating a new variable using
a kernel. We take a point xi on the line and create a new variable yi as a function of its distance
from the origin o. If we plot this, we get something like what is shown below. In this case, the new
variable y is created as a function of distance from the origin. A non-linear function that creates
such a new variable is referred to as a kernel.
Support Vector Machine Terminology
1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data points
of different classes in a feature space. In the case of linear classifications, it will be a linear
equation i.e. wx+b = 0.
2. Support Vectors: Support vectors are the closest data points to the hyperplane, which
makes a critical role in deciding the hyperplane and margin.
3. Margin: Margin is the distance between the support vector and hyperplane. The main
objective of the support vector machine algorithm is to maximize the margin. The wider
margin indicates better classification performance.
4. Kernel: Kernel is the mathematical function, which is used in SVM to map the original
input data points into high-dimensional feature spaces, so, that the hyperplane can be easily
found out even if the data points are not linearly separable in the original input space. Some
of the common kernel functions are linear, polynomial, radial basis function(RBF), and
sigmoid.
5. Hard Margin: The maximum-margin hyperplane used when the data are perfectly linearly
separable; it allows no misclassifications or margin violations.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits
a soft margin technique. Each data point has a slack variable introduced by the soft-margin
SVM formulation, which softens the strict margin requirement and permits certain
misclassifications or violations. It finds a compromise between increasing the margin
and reducing violations.
7. C: The regularization parameter that controls the trade-off between maximizing the margin
and minimizing the penalty for violations; a larger C penalizes violations more heavily.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect
classifications or margin violations. The objective function in SVM is frequently formed
by combining it with the regularization term.
9. Dual Problem: A dual Problem of the optimization problem that requires locating the
Lagrange multipliers related to the support vectors can be used to solve SVM. The dual
formulation enables the use of kernel tricks and more effective computing.
Consider a binary classification problem with two classes, labeled +1 and -1. We have a training
dataset consisting of input feature vectors X and their corresponding class labels Y.
The equation of the linear hyperplane is w·x + b = 0. The vector w is the normal vector to the
hyperplane, i.e. the direction perpendicular to the hyperplane, and the parameter b represents the
offset or distance of the hyperplane from the origin along the normal vector w.
The distance between a data point x_i and the decision boundary can be calculated as
d_i = (w·x_i + b) / ||w||
where ||w|| is the Euclidean norm of the weight vector w (the normal vector to the hyperplane).
Optimization:
The target variable or label for the i-th training instance is denoted t_i, with t_i = -1 for negative
instances (when y_i = 0) and t_i = +1 for positive instances (when y_i = 1). The decision boundary
is required to satisfy the constraint
t_i (w·x_i + b) ≥ 1 for all training points i.
Dual Problem: The SVM can be solved through the dual of the optimization problem, which
requires finding the Lagrange multipliers associated with the support vectors. The optimal
Lagrange multipliers α_i maximize the following dual objective function:
maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j t_i t_j K(x_i, x_j)
subject to α_i ≥ 0 and Σ_i α_i t_i = 0
Where,
K(xi, xj) is the kernel function that computes the similarity between two samples xi and xj.
It allows SVM to handle nonlinear classification problems by implicitly mapping the
samples into a higher-dimensional feature space.
Once the dual problem has been solved and the optimal Lagrange multipliers found, the SVM
decision boundary can be described in terms of these multipliers and the support vectors. The
training samples with α_i > 0 are the support vectors, and the decision boundary is given by
w = Σ_i α_i t_i x_i,  f(x) = sign( Σ_i α_i t_i K(x_i, x) + b ).
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided
into two main parts:
Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are very
suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A hyperplane
that maximizes the margin between the classes is the decision boundary.
Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel functions,
nonlinear SVMs can handle nonlinearly separable data. The original input data is
transformed by these kernel functions into a higher-dimensional feature space, where the
data points can be linearly separated. A linear SVM is used to locate a nonlinear decision
boundary in this modified space.
The SVM kernel is a function that takes a low-dimensional input space and transforms it into a
higher-dimensional space, i.e. it converts non-separable problems into separable problems. It is
mostly useful in non-linear separation problems. Simply put, the kernel performs some extremely
complex data transformations and then finds the process to separate the data based on the labels or
outputs defined.
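A minimal scikit-learn sketch of a soft-margin SVM with an RBF kernel on data that is not linearly separable (the two-moons dataset and the values of C and gamma are illustrative assumptions):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)      # not linearly separable
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)    # C controls the soft margin
print("test accuracy:", clf.score(X_te, y_te))
print("support vectors per class:", clf.n_support_)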
Advantages of SVM
Its memory is efficient as it uses a subset of training points in the decision function called
support vectors.
Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.
LECTURE FOUR:
4.0 Feature Extraction
4.1 Feature selection and Dimensionality reduction
4.2 Feature Extraction Techniques
4.0 Feature Extraction
Feature extraction is a critical step in machine learning and data analysis, involving transforming
raw data into a set of features that can be used for model building. This process includes three main
components: feature selection, dimensionality reduction, and feature extraction techniques.
Feature Selection: Feature selection involves selecting a subset of relevant features (variables,
predictors) for use in model construction. The main goal is to improve the performance of the
model by eliminating irrelevant or redundant features. There are several methods for feature
selection:
1. Filter Methods: These techniques evaluate the relevance of features by looking at the
intrinsic properties of the data, without involving any machine learning algorithms.
Examples include:
o Correlation Coefficient
o Chi-square Test
o Mutual Information
2. Wrapper Methods: These methods evaluate the performance of a subset of features based
on the outcome of a specific machine learning algorithm. Examples include:
o Forward Selection
o Backward Elimination
3. Embedded Methods: These methods perform feature selection during the model training
process. Examples include:
o Ridge Regression
Dimensionality Reduction: Dimensionality reduction transforms the data into a lower-dimensional
space while retaining as much of the important information as possible. Common techniques include:
1. Principal Component Analysis (PCA): A linear technique that transforms the data into a
new coordinate system such that the greatest variance by any projection of the data comes
to lie on the first coordinate (the first principal component), the second greatest variance
on the second coordinate, and so on.
2. Linear Discriminant Analysis (LDA): Primarily used for classification problems, LDA
aims to find a linear combination of features that best separate two or more classes of
objects or events.
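A minimal scikit-learn PCA sketch (the random 10-feature data and the choice of two components are illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # hypothetical dataset with 10 features

pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                   # project onto the first two principal components
print(X_2d.shape)                         # (100, 2)
print(pca.explained_variance_ratio_)      # variance captured by each component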
Feature Extraction Techniques:
Feature extraction involves creating new features from the existing raw data, aiming to reduce the
data's dimensionality while preserving its significant properties. Here are some common feature
extraction techniques:
1. Text Data:
a) Bag of Words (BoW): Represents text data by converting it into a frequency matrix
of words.
2. Image Data:
3. Time Series Data:
a) Fourier Transform: Decomposes a time series into the frequencies that make it up.
4. Audio Data:
By effectively employing feature selection, dimensionality reduction, and feature extraction
techniques, one can improve model performance, reduce overfitting, and decrease computational
costs, ultimately leading to more accurate and efficient machine learning models.
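As a concrete illustration of one of the techniques above, here is a minimal Bag-of-Words sketch with scikit-learn's CountVectorizer (the three short documents are hypothetical):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs make good pets"]          # hypothetical documents

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)               # document-term frequency matrix (sparse)
print(vectorizer.get_feature_names_out())        # the learned vocabulary
print(X.toarray())                               # word counts per document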
LECTURE FIVE
5.0 Classification
5.1 Linear Classifiers
5.1.1. Perceptron:
Description: The perceptron is a simple, single-layer neural network used for binary
classification tasks. It works by finding a linear decision boundary to separate two classes.
Disadvantages: Limited to linearly separable data and cannot handle more complex
patterns.
5.1.2. Logistic Regression:
Description: Logistic regression is a statistical model that uses a logistic function to model
the probability of a binary dependent variable. Despite its name, it is used for classification,
not regression.
Disadvantages: Assumes a linear relationship between input features and the log odds of
the output, which may not always hold true.
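A minimal scikit-learn logistic regression sketch on a built-in binary dataset (the breast-cancer dataset and the max_iter value are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
print("class probabilities for 3 samples:\n", clf.predict_proba(X_te[:3]))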
5.2 Non-linear Classifiers
5.2.1. Decision Trees:
Description: Decision trees partition the data into subsets based on feature values, making
decisions at each node until a prediction is made at the leaf nodes.
Advantages: Easy to interpret, handle both numerical and categorical data, and require
little data preprocessing.
Disadvantages: Prone to overfitting, especially with deep trees, and can be unstable with
small changes in data.
5.2.2. K-Nearest Neighbors (k-NN):
Description: k-NN classifies a new sample according to the majority class among its k closest
training examples.
Advantages: Simple and intuitive, effective for small datasets with well-defined clusters.
5.3 Ensemble Methods
5.3.1. Random Forests:
Description: Random forests are an ensemble learning method that constructs multiple
decision trees during training and outputs the mode of the classes (classification) or mean
prediction (regression) of the individual trees.
Advantages: Robust to overfitting, handles large datasets well, and can handle missing
values and maintain accuracy for a large proportion of data.
5.3.2. Boosting:
Description: Boosting builds an ensemble sequentially, with each new weak learner focusing on
the examples that previous learners misclassified (e.g., AdaBoost, gradient boosting).
Advantages: Improves model accuracy, reduces bias, and can handle complex patterns.
Linear Classifiers: Suitable for linearly separable data, with easy implementation and
interpretability. Examples include the perceptron and logistic regression.
Non-linear Classifiers: Handle more complex relationships between features and outputs.
Examples include decision trees and k-nearest neighbors.
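A minimal ensemble-method example using scikit-learn's random forest classifier (the iris dataset and 200 trees are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))     # each tree votes; the majority class wins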
LECTURE SIX:
6.0 Clustering
6.1 K-means clustering
6.2 Hierarchical clustering
6.3 Density-based clustering
Clustering is an unsupervised learning technique used to group similar data points into clusters
based on their characteristics. This helps in understanding the underlying structure of the data.
Here are three widely used clustering methods: K-means clustering, hierarchical clustering, and
density-based clustering.
6.1 K-Means Clustering
Description: K-means clustering is a partitioning method that divides a dataset into K distinct,
non-overlapping subsets (clusters). Each data point belongs to the cluster with the nearest mean.
Steps:
1. Initialization: Choose K and select K initial centroids (for example, K randomly chosen data
points).
2. Assignment: Assign each data point to the nearest centroid, forming K clusters.
3. Update: Calculate the new centroids as the mean of the data points in each cluster.
4. Repeat: Repeat the assignment and update steps until the centroids no longer change or
change very little.
Advantages:
Simple to understand and computationally efficient, even on large datasets.
Disadvantages:
Assumes clusters are spherical and equally sized, which may not always be the case.
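A minimal scikit-learn K-means sketch following the steps above (the blob data and K = 3 are illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init restarts reduce sensitivity to the initial centroid positions
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
print("first ten labels:", km.labels_[:10])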
6.2 Hierarchical Clustering
Types:
1. Agglomerative (Bottom-Up): Each point starts as its own cluster and the closest clusters are
merged step by step.
2. Divisive (Top-Down): All points start in one cluster, which is split repeatedly into smaller
clusters.
Advantages:
Produces a dendrogram, a tree-like diagram that illustrates the arrangements of the clusters.
Disadvantages:
Computationally intensive for large datasets and sensitive to noise and outliers.
6.3 Density-Based Clustering (DBSCAN)
Description: DBSCAN groups points that lie in dense regions into clusters and labels points in
low-density regions as noise.
Steps:
1. Initialization: Define parameters ε (epsilon, the radius of the neighborhood) and MinPts
(minimum number of points required to form a dense region).
2. Core Points: Identify core points as those with at least MinPts within a radius ε.
3. Cluster Formation: Connect core points that lie within ε of one another into the same cluster.
4. Border Points: Assign border points (points within ε of a core point but with fewer than
MinPts in their neighborhood) to the nearest core point's cluster.
5. Noise Points: Label points that are neither core nor border points as noise.
Advantages:
Can find clusters of arbitrary shape and is robust to noise and outliers.
Disadvantages:
Requires careful selection of ε and MinPts and may struggle with clusters of varying densities.
K-means Clustering: Efficient and simple but assumes spherical clusters and requires
specifying K in advance.
Hierarchical Clustering: Produces a dendrogram and does not require specifying the
number of clusters, but is computationally intensive and sensitive to noise.
Density-Based Clustering: Can find clusters of arbitrary shapes and is robust to noise, but
requires careful selection of parameters and may struggle with clusters of varying densities.
By understanding and utilizing these clustering methods, one can uncover the underlying patterns
and structures in the data, leading to valuable insights and better decision-making.
LECTURE SEVEN:
7.0 Applications of Pattern Recognition
Pattern recognition is a core aspect of machine learning that finds applications in many fields. Here
are detailed descriptions of its applications in various domains:
7.1 Image Processing and Computer Vision
Description: Pattern recognition in image processing and computer vision involves interpreting
visual data to automate tasks that typically require human vision. This includes the analysis,
understanding, and processing of images and videos to extract meaningful information.
Applications:
Object Detection and Recognition: Systems that identify and classify objects within an
image or video, such as facial recognition for security systems, vehicle detection in traffic
management, and identifying animals in wildlife monitoring.
Image Segmentation: Dividing an image into parts for easier analysis, commonly used in
medical imaging to highlight areas of interest, such as tumors in MRI scans.
Autonomous Vehicles: Using cameras and sensors to enable vehicles to perceive and
navigate their environment by recognizing roads, obstacles, and traffic signals.
Augmented Reality (AR): Enhancing real-world environments with digital overlays for
applications in gaming, education, and navigation.
7.2 Speech Recognition and Natural Language Processing (NLP)
Description: Pattern recognition in speech recognition and NLP enables machines to understand
and interact with human language, whether spoken or written. This involves the analysis of
linguistic patterns to facilitate communication between humans and computers.
Applications:
Voice Assistants: Technologies like Siri, Alexa, and Google Assistant that understand and
respond to spoken commands, providing hands-free interaction with devices.
Speech-to-Text: Converting spoken language into written text, useful for transcription
services, voice typing, and accessibility features for the hearing impaired.
Language Translation: Translating text or speech from one language to another, aiding
global communication and understanding through services like Google Translate.
Sentiment Analysis: Analyzing text data to determine the sentiment or emotional tone,
valuable in customer feedback analysis, social media monitoring, and market research.
Chatbots: Automated systems that engage with users through text or speech to provide
customer support, information, and personalized experiences.
7.3 Bioinformatics and Biomedical Signal Processing
Applications:
Protein Structure Prediction: Predicting the three-dimensional structure of proteins to
understand their function and interactions, crucial for drug development and understanding
biological processes.
Medical Imaging: Analyzing medical images to diagnose and monitor diseases, such as
detecting cancerous cells in mammograms or identifying brain abnormalities in MRI scans.
Wearable Health Devices: Using data from wearable sensors to monitor vital signs and
detect health issues in real-time, improving patient care and outcomes.
7.4 Finance and Marketing
Description: Pattern recognition in finance and marketing involves analyzing large datasets to
identify trends, make predictions, and optimize business strategies.
Applications:
Algorithmic Trading: Developing automated trading systems that analyze market data to
execute buy and sell orders based on predefined criteria, enhancing trading efficiency and
profitability.
Summary
Pattern recognition is a versatile and essential tool across many fields. Its applications in image
processing, speech recognition, bioinformatics, and finance demonstrate its ability to extract
valuable insights from complex data, automate processes, and improve decision-making. By
leveraging pattern recognition, various industries can enhance their operations, offer personalized
experiences, and drive innovation.
LECTURE EIGHT:
8.0 Performance Evaluation
Performance evaluation is a critical step in developing and deploying machine learning models. It
ensures that the models are effective, reliable, and generalizable. This section covers the key
metrics for evaluating classification and clustering algorithms, as well as techniques for cross-
validation and model selection.
Classification Metrics
1. Accuracy: the proportion of correct predictions, (TP + TN) / (TP + TN + FP + FN).
2. Precision: TP / (TP + FP), the fraction of predicted positives that are actually positive.
3. Recall (Sensitivity): TP / (TP + FN), the fraction of actual positives that the model correctly
identifies.
4. F1 Score: the harmonic mean of precision and recall, 2 × (Precision × Recall) / (Precision + Recall).
5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
o Use Case: Useful for evaluating performance over all classification thresholds.
6. Confusion Matrix:
o Description: A table that displays the true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN).
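A minimal sketch computing the classification metrics above with scikit-learn (the label vectors are hypothetical):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]     # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]     # hypothetical model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))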
Clustering Metrics
1. Silhouette Score:
o Description: Measures how similar a point is to its own cluster compared to other
clusters.
o For a point i, s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i
to the other points in its own cluster and b(i) is the mean distance to points in the nearest
other cluster.
2. Davies-Bouldin Index:
o Description: Measures the average similarity ratio of each cluster with the cluster
that is most similar to it.
3. V-measure:
o Homogeneity: All clusters contain only data points that are members of a single
class.
o Completeness: All data points that are members of a given class are elements of
the same cluster.
Cross-Validation
1. k-Fold Cross-Validation:
o Description: The dataset is divided into k subsets (folds). The model is trained k
times, each time using a different subset as the validation set and the remaining k-
1 subsets as the training set.
2. Leave-One-Out Cross-Validation (LOOCV):
o Description: Each instance in the dataset is used once as the validation set while the
remaining instances form the training set.
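A minimal k-fold cross-validation sketch with scikit-learn (the iris dataset, logistic regression model, and cv=5 are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)   # 5 folds
print("fold scores:", scores)
print("mean score :", scores.mean())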
Model Selection
1. Grid Search:
o Description: Exhaustively evaluates every combination of hyperparameter values in a
predefined grid.
2. Random Search:
o Description: Samples random combinations of hyperparameter values from specified
ranges.
o Advantages: Often faster than grid search and can find good hyperparameters with
fewer trials.
3. Bayesian Optimization:
o Advantages: More efficient than grid and random search, especially with limited
computational resources.
4. Cross-Validation: Used during model selection to estimate how well each candidate
configuration generalizes (see the cross-validation methods above).
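A minimal model-selection sketch combining grid search with cross-validation using scikit-learn's GridSearchCV (the SVM model and the parameter grid are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}     # hypothetical grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print("best parameters :", search.best_params_)
print("best CV accuracy:", search.best_score_)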
Summary
Performance evaluation involves using various metrics to assess classification and clustering
algorithms, ensuring their effectiveness and reliability. Classification metrics include accuracy,
precision, recall, F1 score, ROC-AUC, and confusion matrix. Clustering metrics include the
silhouette score, Davies-Bouldin index, Adjusted Rand index, and V-measure. Cross-validation
methods, such as k-fold and LOOCV, along with model selection techniques like grid search and
Bayesian optimization, are crucial for developing robust models that generalize well to new data.
LECTURE NINE:
9.0 Ethical and Social Implications
As pattern recognition systems become increasingly integrated into various aspects of society, it
is crucial to consider their ethical and social implications. Two major areas of concern are privacy
and security, as well as bias and fairness.
9.1 Privacy and Security Concerns
Description: Pattern recognition systems often rely on large amounts of data, some of which may
be sensitive or personally identifiable. Ensuring the privacy and security of this data is paramount
to prevent misuse and protect individuals' rights.
Concerns:
1. Data Privacy:
o Issue: Pattern recognition systems can collect and process vast amounts of personal
data, raising concerns about how this data is stored, shared, and used.
2. Data Security:
o Issue: Ensuring that data is protected from unauthorized access and cyber threats
is critical.
3. Surveillance:
o Issue: The use of pattern recognition in surveillance systems can lead to constant
monitoring of individuals without their consent.
4. Informed Consent:
o Issue: Individuals may not always be aware that their data is being collected and
used by pattern recognition systems.
Solutions:
Robust Security Measures: Implementing strong encryption, access controls, and regular
security audits to safeguard data.
Transparent Policies: Clearly communicating how data is collected, used, and stored, and
obtaining informed consent from individuals.
9.2 Bias and Fairness
Description: Bias in pattern recognition systems arises when the data or algorithms used reflect
prejudices, leading to unfair or discriminatory outcomes. Ensuring fairness is essential to maintain
trust and promote equitable treatment.
Concerns:
1. Algorithmic Bias:
o Issue: Algorithms may inadvertently learn and perpetuate biases present in the
training data.
2. Representation Bias:
o Issue: If the training data does not adequately represent all demographic groups,
the system may perform poorly for underrepresented populations.
3. Outcome Fairness:
o Issue: Ensuring that the outcomes of pattern recognition systems are fair and do
not disadvantage any group.
o Implications: Unfair outcomes can exacerbate social inequalities and lead to a lack
of trust in these systems.
Solutions:
Diverse Datasets: Ensuring that training datasets are representative of all demographic
groups to reduce bias.
Bias Detection and Mitigation: Developing techniques to identify and mitigate biases in
algorithms and data.
Fairness Metrics: Using metrics to evaluate and ensure fairness in pattern recognition
systems.
Regular Audits: Conducting regular audits of systems to identify and address biases and
ensure compliance with fairness standards.
Summary
The ethical and social implications of pattern recognition systems are significant and multifaceted.
Privacy and security concerns necessitate robust measures to protect sensitive data and ensure
informed consent. Bias and fairness issues require careful attention to prevent discriminatory
outcomes and promote equitable treatment. Addressing these concerns is essential to foster trust,
ensure ethical use, and maximize the benefits of pattern recognition technologies for all members
of society.