UNIT 1 Notes

The document outlines a syllabus for a machine learning course, covering topics such as supervised learning, PAC learning, and regression analysis. It highlights the importance of machine learning in various industries, including finance, healthcare, and retail, and discusses the requirements for creating effective machine learning systems. Additionally, it provides insights into model selection techniques and the applications of machine learning in real-world scenarios.

UNIT 1 Syllabus

What Is Machine Learning, Machine Learning Applications, Supervised Learning: Learning a
Class from Examples, Probably Approximately Correct (PAC) Learning, Learning Multiple
Classes, Regression, Model Selection and Generalization, Bayesian Decision Theory,
Dimensions of a Supervised Machine Learning Algorithm, Knowing What You Know: Testing
Machine Learning Algorithms, Guidelines for Machine Learning Experiments, Cross-
Validation and Resampling Methods, Measuring Classifier Performance, Hypothesis Testing,
Comparing Two Classification Algorithms

Compiled by: Dr. Jayashree Prasad

What Is Machine Learning


Why is machine learning important?
Resurging interest in machine learning is due to the same factors that have made data
mining and Bayesian analysis more popular than ever: growing volumes and varieties of
available data, computational processing that is cheaper and more powerful, and affordable
data storage.
All of these factors mean it is possible to quickly and automatically produce models that can
analyze bigger, more complex data and deliver faster, more accurate results – even on a very
large scale. By building precise models, an organization has a better chance of identifying
profitable opportunities – or avoiding unknown risks.

What's required to create good machine learning systems?


 Data preparation capabilities.
 Algorithms – basic and advanced.
 Automation and iterative processes.
 Scalability.
 Ensemble modeling.

Did you know?


 In machine learning, a target is called a label.
 In statistics, a target is called a dependent variable.
 A variable in statistics is called a feature in machine learning.
 A transformation in statistics is called feature creation in machine learning.
Machine learning in today's world
By using algorithms to build models that uncover connections, organizations can make better
decisions without human intervention.


Machine Learning Applications


Applying machine learning to IoT
Machine learning can be used to achieve higher levels of efficiency, particularly when applied
to the Internet of Things (IoT), where streams of sensor data must be analyzed continuously.
Who's using it?
Most industries working with large amounts of data have recognized the value of machine
learning technology. By gleaning insights from this data – often in real time –
organizations are able to work more efficiently or gain an advantage over competitors.
Financial services
Banks and other businesses in the financial industry use machine learning technology for two
key purposes: to identify important insights in data, and to prevent fraud. The insights can
identify investment opportunities, or help investors know when to trade. Data mining can also
identify clients with high-risk profiles, or use cybersurveillance to pinpoint warning signs of
fraud.
Government
Government agencies such as public safety and utilities have a particular need for machine
learning since they have multiple sources of data that can be mined for insights. Analyzing
sensor data, for example, identifies ways to increase efficiency and save money. Machine
learning can also help detect fraud and minimize identity theft.
Health care
Machine learning is a fast-growing trend in the health care industry, thanks to the advent of
wearable devices and sensors that can use data to assess a patient's health in real time. The
technology can also help medical experts analyze data to identify trends or red flags that may
lead to improved diagnoses and treatment.
Retail
Websites recommending items you might like based on previous purchases are using machine
learning to analyze your buying history. Retailers rely on machine learning to capture and
analyze data and use it to personalize the shopping experience, run marketing campaigns,
optimize prices, plan merchandise supply, and gain customer insights.
Oil and gas
Finding new energy sources. Analyzing minerals in the ground. Predicting refinery sensor
failure. Streamlining oil distribution to make it more efficient and cost-effective. The number
of machine learning use cases for this industry is vast – and still expanding.
Transportation
Analyzing data to identify patterns and trends is key to the transportation industry, which relies
on making routes more efficient and predicting potential problems to increase profitability. The
data analysis and modeling aspects of machine learning are important tools to delivery
companies, public transportation and other transportation organizations.

What is PAC Learning?


We are well aware of how important the size of the dataset is when training a machine learning
model. What is harder to answer is which concepts an algorithm can learn from the data, and
how well it can learn them.
In machine learning we have a framework that helps us answer what can be learnt efficiently
by an algorithm, and how large a sample is needed to obtain a good result. This framework is
called Probably Approximately Correct (PAC) learning.
PAC learning describes which concepts an algorithm can probably learn to within a given
accuracy; this depends on factors such as the sample size (sample complexity) and the time and
space complexity of the algorithm.

PAC Framework
Before getting into more detail, let us first look at the notation used to describe the PAC
framework:
c — Concept: a mapping X -> Y; since Y = {0, 1}, c : X -> {0, 1}
C — Concept class (the set of concepts to learn)
H — Hypothesis set (a set of candidate concepts, which may not coincide with C)
D — Data distribution (samples are assumed independent and identically distributed)
S — Sample of size N drawn from D
hS — Hypothesis returned by the algorithm for the sample S
ε — Accuracy parameter
δ — Confidence parameter
PAC Learning
A concept class C is said to be PAC learnable if there exists an algorithm A such that, for any
ε > 0 and δ > 0, the hypothesis hS returned by A after seeing N samples has generalization error
at most ε with probability at least 1 − δ, where N is polynomial in 1/ε and 1/δ. The combination
of "probably" (probability at least 1 − δ) and "approximately correct" (error at most ε) gives the
name PAC — Probably Approximately Correct:
* Pr [ R(hS) ≤ ε ] ≥ 1 − δ, where R(hS) denotes the generalization error of hS.
The assumptions made here are that ε, δ > 0 and that the hypothesis set H is finite. An
algorithm/classifier that is correct with probability at least 1 − δ is termed approximately
correct in learning the concepts.
If the algorithm A also runs in time polynomial in 1/ε and 1/δ, then C is said to be efficiently
PAC learnable. What we are looking for is generalised learning (with a small generalisation
error), not memorisation of the concepts by the algorithm.
Generalisation error — For a hypothesis h and a target concept c, the generalisation error (the
true error) is the probability that they disagree on a random instance drawn from D:
* R(h) = Pr [ h(x) != c(x) ]
Sample complexity — Using the PAC framework we can also find the number of samples that
gives a good hypothesis with high probability, assuming that C is PAC learnable. If we want a
hypothesis that is approximately correct with probability at least 1 − δ, the number of training
samples needed satisfies
* N ≥ 1/ε ( ln|H| + ln(1/δ) )
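As a quick illustration of the bound above, the following sketch computes the number of samples
suggested by N ≥ (1/ε)(ln|H| + ln(1/δ)). The hypothesis-set size |H| and the values of ε and δ are
made up purely for the example.

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Samples sufficient for a consistent learner over a finite hypothesis set:
    N >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / epsilon) * (math.log(hypothesis_count) + math.log(1.0 / delta)))

# Example: |H| = 2**20 hypotheses, 5% error tolerated with 99% confidence.
print(pac_sample_bound(hypothesis_count=2**20, epsilon=0.05, delta=0.01))  # -> 370
```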
What if the hypothesis set H is infinite?


The bound above holds for a finite hypothesis set. If the hypothesis set is infinite we need a
different measure of its capacity, based on how the hypotheses can split (label) sets of points;
realizing every possible labeling of a set of points is termed shattering.
To make this measurable there is a quantity called the VC dimension (VC dim), defined as the
size of the largest set of points that can be shattered by the hypothesis class.
In other words, a set of k points is shattered if the hypothesis class can realize all 2^k possible
labelings of those points.
If the VC dimension is d, then there exists some set of d points that can be shattered, but no set
of d + 1 points can be shattered. How? Let us look at an example using straight lines (linear
classifiers) in the plane.

Example of VC Dimension 1
Consider graph (a), where a set of points labeled T and F (d points) is split into two groups
using a line. Likewise, any other separating line could be selected.
Consider graph (b), where points labeled F and T (d points) are again split into two groups
using a line; again, any other separating line could be selected.

Example of VC Dimension 2

Consider graph (a), where there are 4 points labeled T, T, T, F (d + 1 points). How can we
shatter this set? In this configuration no single line can put the F point on one side and all three
T points on the other, so this labeling cannot be realized and the 4-point set cannot be shattered.
Consider graph (b), where there are 3 points labeled T, T, F (d points). This labeling can easily
be realized using a line.
In the above examples we saw that a set of at most 3 points can always be split by a line, but a
set of 4 or more points (d + 1 points) cannot always be split perfectly.
So the largest set of points that can be shattered by a line in the plane has size 3, i.e. the VC
dimension is 3 (VC dimension < 4).
In summary, PAC learning gives guarantees for finite hypothesis sets, while the VC dimension
extends the analysis to infinite hypothesis sets.
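The following sketch gives a rough empirical check of the 2-D example above. It uses a linear SVM
with a very large C as a stand-in for a hard-margin separator (an assumption of this illustration,
not part of the notes) and tests whether every labeling of a point set can be realized by a line: all
labelings of 3 non-collinear points succeed, while at least one labeling of 4 points in a square fails.

```python
from itertools import product

import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    """Return True if a linear classifier can realize every 0/1 labeling of `points`."""
    for labels in product([0, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # all-same labelings are trivially realizable
        clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:  # this labeling cannot be separated by a line
            return False
    return True

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # non-collinear
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # square; the XOR labeling fails

print(can_shatter(three_points))  # expected: True  -> 3 points can be shattered
print(can_shatter(four_points))   # expected: False -> VC dimension of lines in the plane is 3
```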

Supervised Learning
What are some popular machine learning methods?
Two of the most widely adopted machine learning methods are supervised
learning and unsupervised learning – but there are also other methods of machine learning.
Here's an overview of the most popular types.
Supervised learning algorithms are trained using labeled examples, that is, inputs for which
the desired output is known. For example, a piece of equipment could have data points labeled
either “F” (failed) or “R” (runs). The learning algorithm receives a set of inputs along with the
corresponding correct outputs, and the algorithm learns by comparing its actual output with
correct outputs to find errors. It then modifies the model accordingly. Through methods like
classification, regression, prediction and gradient boosting, supervised learning uses patterns
to predict the values of the label on additional unlabeled data. Supervised learning is commonly
used in applications where historical data predicts likely future events. For example, it can
anticipate when credit card transactions are likely to be fraudulent or which insurance customer
is likely to file a claim.
Unsupervised learning is used against data that has no historical labels. The system is not told
the "right answer." The algorithm must figure out what is being shown. The goal is to explore
the data and find some structure within. Unsupervised learning works well on transactional
data. For example, it can identify segments of customers with similar attributes who can then
be treated similarly in marketing campaigns. Or it can find the main attributes that separate
customer segments from each other. Popular techniques include self-organizing maps, nearest-
neighbor mapping, k-means clustering and singular value decomposition. These algorithms are
also used to segment text topics, recommend items and identify data outliers.
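As a small illustration of the two paradigms, the sketch below trains a supervised classifier on
labeled examples and, separately, runs k-means clustering on the same features without using the
labels. The use of scikit-learn and its built-in iris data is purely an assumption of this example;
the notes do not prescribe a particular library or dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: learn from labeled examples, then predict labels for unseen inputs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised learning: no labels are given; the algorithm looks for structure on its own.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("first ten cluster assignments:", clusters[:10])
```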
Regression
What is Regression Analysis?
Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modeling the future
relationship between them.
Regression analysis includes several variations, such as linear, multiple linear, and nonlinear.
The most common models are simple linear and multiple linear. Nonlinear regression analysis
is commonly used for more complicated data sets in which the dependent and independent
variables show a nonlinear relationship.
Regression analysis offers numerous applications in various disciplines, including finance.

Regression Analysis – Linear Model Assumptions


Linear regression analysis is based on six fundamental assumptions:
1. The dependent and independent variables show a linear relationship.
2. The independent variable is not random.
3. The mean of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations (homoscedasticity).
5. The residual (error) is not correlated across observations.
6. The residual (error) values follow the normal distribution.

Regression Analysis – Simple Linear Regression


Simple linear regression is a model that assesses the relationship between a dependent variable
and an independent variable. The simple linear model is expressed using the following
equation:
Y = a + bX + ϵ

Where:
 Y – Dependent variable
 X – Independent (explanatory) variable
 a – Intercept
 b – Slope
 ϵ – Residual (error)
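To make the equation concrete, here is a minimal sketch, with made-up data, that estimates the
intercept a and slope b of Y = a + bX + ϵ by ordinary least squares (numpy is an assumed tool here).

```python
import numpy as np

# Hypothetical observations of X and Y.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# np.polyfit with degree 1 returns the least-squares slope and intercept.
b, a = np.polyfit(X, Y, deg=1)
residuals = Y - (a + b * X)   # the epsilon term for each observation

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print("residuals:", np.round(residuals, 3))
```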

Regression Analysis – Multiple Linear Regression


Multiple linear regression analysis is essentially similar to the simple linear model, with the
exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:
Y = a + bX1 + cX2 + dX3 + ϵ

Where:
 Y – Dependent variable
 X1, X2, X3 – Independent (explanatory) variables
 a – Intercept
 b, c, d – Slopes
 ϵ – Residual (error)

Multiple linear regression follows the same conditions as the simple linear model. However,
since there are several independent variables in multiple linear analysis, there is another
mandatory condition for the model:
 Non-collinearity: Independent variables should show a minimum correlation with
each other. If the independent variables are highly correlated with each other, it will be
difficult to assess the true relationships between the dependent and independent
variables.
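A corresponding sketch for the multiple linear model is shown below, again with made-up data. It
fits Y = a + bX1 + cX2 + dX3 + ϵ with scikit-learn (an assumed choice) and also prints the
correlation matrix of the independent variables as a quick, informal check of the non-collinearity
condition.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: three independent variables and a noisy linear response.
X = rng.normal(size=(100, 3))   # columns: X1, X2, X3
Y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X, Y)
print("intercept a:", round(model.intercept_, 3))
print("slopes b, c, d:", np.round(model.coef_, 3))

# Informal non-collinearity check: off-diagonal correlations should be small.
print("correlation matrix of X:\n", np.round(np.corrcoef(X, rowvar=False), 3))
```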
Regression Analysis in Finance
Regression analysis comes with several applications in finance. For example, the statistical
method is fundamental to the Capital Asset Pricing Model (CAPM). Essentially, the CAPM
equation is a model that determines the relationship between the expected return of an asset
and the market risk premium.
The analysis is also used to forecast the returns of securities, based on different factors, or to
forecast the performance of a business.

1. Beta and CAPM


In finance, regression analysis is used to calculate the Beta (volatility of returns relative to the
overall market) for a stock. It can be done in Excel using the Slope function.
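The equivalent of Excel's SLOPE calculation might look like the following sketch in Python. The
weekly stock and market returns are invented for illustration; beta is simply the slope of the
regression of stock returns on market returns.

```python
import numpy as np

# Hypothetical weekly returns (in %) for a stock and the overall market index.
market_returns = np.array([0.5, -1.2, 0.8, 1.5, -0.3, 0.9, -0.7, 1.1])
stock_returns  = np.array([0.7, -1.5, 1.1, 1.9, -0.2, 1.2, -1.0, 1.4])

# Beta is the slope of stock returns regressed on market returns
# (equivalently cov(stock, market) / var(market)).
beta, alpha = np.polyfit(market_returns, stock_returns, deg=1)
print(f"beta  = {beta:.3f}")
print(f"alpha = {alpha:.3f}")
```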
2. Forecasting Revenues and Expenses
When forecasting financial statements for a company, it may be useful to do a multiple
regression analysis to determine how changes in certain assumptions or drivers of the business
will impact revenue or expenses in the future. For example, there may be a very high correlation
between the number of salespeople employed by a company, the number of stores they operate,
and the revenue the business generates.

In Excel, the Forecast function can be used in the same way, for example to estimate a company's
revenue based on the number of ads it runs.

Given easy-to-use machine learning libraries like scikit-learn and Keras, it is straightforward
to fit many different machine learning models on a given predictive modeling dataset.
The challenge of applied machine learning, therefore, becomes how to choose among a range
of different models that you can use for your problem.
This section provides a gentle introduction to model selection for machine learning. The key
points are:
 Model selection is the process of choosing one among many candidate models for a predictive
modeling problem.
 There may be many competing concerns when performing model selection beyond model
performance, such as complexity, maintainability, and available resources.
 The two main classes of model selection techniques are probabilistic measures and resampling
methods.

A Gentle Introduction to Model Selection for Machine Learning

Overview
This tutorial is divided into three parts; they are:
1. What Is Model Selection
2. Considerations for Model Selection
3. Model Selection Techniques
What Is Model Selection
Model selection is the process of selecting one final machine learning model from among a
collection of candidate machine learning models for a training dataset.
Model selection is a process that can be applied both across different types of models (e.g.
logistic regression, SVM, KNN, etc.) and across models of the same type configured with
different model hyperparameters (e.g. different kernels in an SVM).
When we have a variety of models of different complexity (e.g., linear or logistic regression
models with different degree polynomials, or KNN classifiers with different values of K), how
should we pick the right one?
— Page 22, Machine Learning: A Probabilistic Perspective, 2012.
For example, we may have a dataset for which we are interested in developing a classification
or regression predictive model. We do not know beforehand as to which model will perform
best on this problem, as it is unknowable. Therefore, we fit and evaluate a suite of different
models on the problem.
Model selection is the process of choosing one of the models as the final model that addresses
the problem.
Model selection is different from model assessment.
For example, we evaluate or assess candidate models in order to choose the best one, and this
is model selection. Whereas once a model is chosen, it can be evaluated in order to
communicate how well it is expected to perform in general; this is model assessment.
The process of evaluating a model’s performance is known as model assessment, whereas the
process of selecting the proper level of flexibility for a model is known as model selection.
— Page 175, An Introduction to Statistical Learning: with Applications in R, 2017.
Considerations for Model Selection
Fitting models is relatively straightforward, although selecting among them is the
true challenge of applied machine learning.
Firstly, we need to get over the idea of a “best” model.
All models have some predictive error, given the statistical noise in the data, the
incompleteness of the data sample, and the limitations of each different model type. Therefore,
the notion of a perfect or best model is not useful. Instead, we must seek a model that is “good
enough.”
What do we care about when choosing a final model?
The project stakeholders may have specific requirements, such as maintainability and limited
model complexity. As such, a model that has lower skill but is simpler and easier to understand
may be preferred.
Alternately, if model skill is prized above all other concerns, then the ability of the model to
perform well on out-of-sample data will be preferred regardless of the computational
complexity involved.
Therefore, a “good enough” model may refer to many things and is specific to your project,
such as:
 A model that meets the requirements and constraints of project stakeholders.
 A model that is sufficiently skillful given the time and resources available.
 A model that is skillful as compared to naive models.
 A model that is skillful relative to other tested models.
 A model that is skillful relative to the state-of-the-art.
Next, we must consider what is being selected.
For example, we are not selecting a fit model, as all models will be discarded. This is because
once we choose a model, we will fit a new final model on all available data and start using it
to make predictions.
Therefore, are we choosing among algorithms used to fit the models on the training dataset?
Some algorithms require specialized data preparation in order to best expose the structure of
the problem to the learning algorithm. Therefore, we must go one step further and
consider model selection as the process of selecting among model development pipelines.
Each pipeline may take in the same raw training dataset and output a model that can be
evaluated in the same manner, but may require different or overlapping computational steps,
such as:
 Data filtering.
 Data transformation.
 Feature selection.
 Feature engineering.
 And more…
The closer you look at the challenge of model selection, the more nuance you will discover.
Now that we are familiar with some considerations involved in model selection, let’s review
some common methods for selecting a model.
Model Selection Techniques
The best approach to model selection requires “sufficient” data, which may be nearly infinite
depending on the complexity of the problem.
In this ideal situation, we would split the data into training, validation, and test sets, then fit
candidate models on the training set, evaluate and select them on the validation set, and report
the performance of the final model on the test set.
If we are in a data-rich situation, the best approach […] is to randomly divide the dataset into
three parts: a training set, a validation set, and a test set. The training set is used to fit the
models; the validation set is used to estimate prediction error for model selection; the test set
is used for assessment of the generalization error of the final chosen model.
— Page 222, The Elements of Statistical Learning: Data Mining, Inference, and Prediction,
2017.
This is impractical on most predictive modeling problems given that we rarely have sufficient
data, or are able to even judge what would be sufficient.
In many applications, however, the supply of data for training and testing will be limited, and
in order to build good models, we wish to use as much of the available data as possible for
training. However, if the validation set is small, it will give a relatively noisy estimate of
predictive performance.
– Page 32, Pattern Recognition and Machine Learning, 2006.
Instead, there are two main classes of techniques to approximate the ideal case of model
selection; they are:
 Probabilistic Measures: Choose a model via in-sample error and complexity.
 Resampling Methods: Choose a model via estimated out-of-sample error.
Let’s take a closer look at each in turn.
Probabilistic Measures
Probabilistic measures involve analytically scoring a candidate model using both its
performance on the training dataset and the complexity of the model.
It is known that training error is optimistically biased, and therefore is not a good basis for
choosing a model. The performance can be penalized based on how optimistic the training error
is believed to be. This is typically achieved using algorithm-specific methods, often linear, that
penalize the score based on the complexity of the model.
Historically various ‘information criteria’ have been proposed that attempt to correct for the
bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting
of more complex models.
– Page 33, Pattern Recognition and Machine Learning, 2006.
A model with fewer parameters is less complex, and because of this, is preferred because it is
likely to generalize better on average.
Four commonly used probabilistic model selection measures include:
 Akaike Information Criterion (AIC).
 Bayesian Information Criterion (BIC).
 Minimum Description Length (MDL).
 Structural Risk Minimization (SRM).
Probabilistic measures are appropriate when using simpler linear models like linear regression
or logistic regression, where the calculation of the model complexity penalty (e.g. correcting
the in-sample bias) is known and tractable.
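As a hedged sketch of a probabilistic measure in practice, the snippet below fits two linear models
of different complexity on synthetic data with statsmodels (an assumed choice; the notes do not
mandate a library) and compares their AIC and BIC scores, where lower is better.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=200)   # the true relationship is linear

# Candidate 1: simple linear model.
X1 = sm.add_constant(np.column_stack([x]))
fit1 = sm.OLS(y, X1).fit()

# Candidate 2: cubic polynomial model (more parameters, more complexity).
X2 = sm.add_constant(np.column_stack([x, x**2, x**3]))
fit2 = sm.OLS(y, X2).fit()

# AIC/BIC penalize the extra parameters; the simpler model should usually win here.
print(f"linear model: AIC={fit1.aic:.1f}, BIC={fit1.bic:.1f}")
print(f"cubic model : AIC={fit2.aic:.1f}, BIC={fit2.bic:.1f}")
```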
Resampling Methods
Resampling methods seek to estimate the performance of a model (or more precisely, the model
development process) on out-of-sample data.
This is achieved by splitting the training dataset into sub train and test sets, fitting a model on
the sub train set, and evaluating it on the test set. This process may then be repeated multiple
times and the mean performance across each trial is reported.
It is a type of Monte Carlo estimate of model performance on out-of-sample data, although
each trial is not strictly independent as depending on the resampling method chosen, the same
data may appear multiple times in different training datasets, or test datasets.
Three common resampling model selection methods include:
 Random train/test splits.
 Cross-Validation (k-fold, LOOCV, etc.).
 Bootstrap.
Most of the time probabilistic measures (described in the previous section) are not available,
therefore resampling methods are used.
By far the most popular is the cross-validation family of methods that includes many subtypes.
Probably the simplest and most widely used method for estimating prediction error is cross-
validation.
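A minimal sketch of resampling-based model selection is shown below: two candidate modeling
pipelines are scored with k-fold cross-validation on the same synthetic classification data (the
dataset, the candidate models, and k = 5 are all assumptions of the example), and the one with the
better mean out-of-sample estimate would be selected and then refit on all available data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Estimate out-of-sample accuracy of each pipeline with 5-fold cross-validation.
for name, pipeline in candidates.items():
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")

# The chosen pipeline would then be refit on all of X, y before being used for predictions.
```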
Bayesian decision theory
Bayesian decision theory is the statistical approach that quantifies the tradeoffs among various
classification decisions using probability (Bayes' theorem) and the costs associated with those
decisions.
It is basically a classification technique that uses Bayes' theorem to find conditional
probabilities.
In statistical pattern recognition, we focus on the statistical properties of patterns, which are
generally expressed as probability densities (pdf's and pmf's); this will command most of our
attention here as we develop the fundamentals of Bayesian decision theory.
Prerequisites
Random Variable
A random variable is a function that maps the possible outcomes of a random process to
numerical values; for example, when tossing a coin we may map heads H to 1 and tails T to 0.
Bayes Theorem
The conditional probability of A given B, represented by P(A | B) is the chance of occurrence
of A given that B has occurred.
P(A | B) = P(A, B) / P(B)
By using the chain rule, this can also be written as:
P(A, B) = P(A | B) P(B) = P(B | A) P(A)
so that
P(A | B) = P(B | A) P(A) / P(B) ——- (1)
where P(B) = P(B, A) + P(B, A’) = P(B | A) P(A) + P(B | A’) P(A’)
Here, equation (1) is known as Bayes' theorem of probability.
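To see equation (1) in action numerically, the tiny sketch below computes P(A | B) from P(B | A),
P(A), and the law of total probability for P(B). The probabilities used are invented for the example.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) = P(B|A) P(A) + P(B|A') P(A')."""
    p_not_a = 1.0 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(B|A) = 0.9, P(A) = 0.2, P(B|A') = 0.1.
print(round(bayes_posterior(0.9, 0.2, 0.1), 3))  # -> 0.692
```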
Our aim is to explore each of the components included in this theorem. Let’s explore step by
step:
(a) Prior or State of Nature:
 Prior probabilities represent how likely each class is to occur.
 Priors are known before the training process.
 The state of nature is a random variable w, and P(wi) is the prior probability of class wi.
 If there are only two classes and the classes are exhaustive, then the priors sum to one:
P(w1) + P(w2) = 1.
(b) Class Conditional Probabilities:
 The class-conditional probability is the probability of observing a feature x given that it
belongs to a particular class wi; it is denoted P(x | wi).
 Sometimes it is also known as the likelihood.
 It is the quantity that we estimate during training: given inputs (features) X labeled with the
corresponding class w, we learn how likely that set of features is, given the class label.
(c) Evidence:
 The evidence is the probability of occurrence of a particular feature vector, i.e. P(X).
 It can be calculated using the law of total probability as P(X) = Σi P(X | wi) P(wi).
 Because it is computed from the class-conditional likelihoods and the priors, the evidence is
also obtained during training.
(d) Posterior Probabilities:
 The posterior is the probability of a class given the observed features.
 It is what we aim to compute in the test phase: given a test input (the observed features), we
ask how likely it is that those features belong to a particular class wi according to the trained
model.

For a better understanding of the above theory, we consider an example


Problem Description
Suppose we have a classification problem in which we have to classify between object-1 and
object-2 given a set of features X = [x1, x2, …, xn]T.
Objective
The main objective of designing such a classifier is to suggest actions when presented with
unseen features, i.e., objects not yet seen and not in the training data.
In this example let w denote the state of nature, with w = w1 for object-1 and w = w2 for
object-2. In reality the state of nature is unpredictable, so we treat w as a random variable that
is described probabilistically.
Priors
 Generally, we assume that there is some prior probability P(w1) that the next object is
object-1 and P(w2) that the next object is object-2. If there are no other kinds of object, as in
this problem, then the priors sum to 1, i.e. the priors are exhaustive.
 The prior probabilities reflect our prior knowledge of how likely we are to see object-1 or
object-2. They are domain-dependent; for example, the priors may change depending on the
time of year in which the objects are collected.
Deciding based on priors alone seems unreasonable: when judging multiple objects (as in a
more realistic scenario) we would always make the same decision based on the largest prior,
even though we know that the other type of object will also appear, governed by the remaining
prior probability (since the priors are exhaustive).
Consider the following different scenarios:
 If P(ω1) >>> P(ω2), our decision in favor of ω1 will be correct most of the time.
 But if P(ω1) = P(ω2), we have only a 50% chance of being right. In general, the probability
of error is the minimum of P(ω1) and P(ω2), and later we will see that under these conditions
no other decision rule can yield a larger probability of being correct.

Feature Extraction Process (extracting features from the images)

A suggested set of features: length, width, shape of an object, etc.

In our example we use the width x, which is discriminative enough to improve the decision
rule of our classifier. Different objects will yield different width readings, and we treat this
variability in probabilistic terms: we consider x to be a continuous random variable whose
distribution depends on the type of object wj. This distribution, written p(x|ωj), is a probability
density function (pdf) and is known as the class-conditional probability density function.
Therefore, the pdf p(x|ω1) is the probability density function for feature x given that the state
of nature is ω1, with the same interpretation for p(x|ω2).

Fig. The class-conditional pdfs p(x|ω1) and p(x|ω2) for the two classes
Suppose that we know both the prior probabilities P(ωj) and the conditional densities p(x|ωj).
We can then arrive at the Bayes formula for the posterior probabilities:

P(ωj | x) = p(x | ωj) P(ωj) / p(x), where p(x) = Σj p(x | ωj) P(ωj)

Bayes' formula tells us that by observing the measurement x we can convert the prior P(ωj)
into the posterior P(ωj|x), the probability of ωj given that feature value x has been measured.
p(x|ωj) is known as the likelihood of ωj with respect to x.
The evidence factor, p(x), works merely as a scale factor that guarantees that the posterior
probabilities sum to one over all the classes.
Bayes’ Decision Rule
The decision rule given the posterior probabilities is as follows
If P(w1|x) > P(w2|x) we would decide that the object belongs to class w1, or else class w2.
Probability of Error
To justify our decision we look at the probability of error. Whenever we observe x we have
P(error | x) = P(w1|x) if we decide w2, and P(w2|x) if we decide w1.
Since the classes are exhaustive, if we choose the correct class with probability P, then the
leftover probability 1 − P is the probability that the object is not the one we decided on.
We can minimize the probability of error by deciding the class with the greater posterior, so
that the probability of error is as small as possible. So we finally get
P(error|x) = min [ P(ω1|x), P(ω2|x) ]
and our Bayes decision rule as:
Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2
This type of decision rule highlights the role of the posterior probabilities. With the help of
Bayes' theorem, we can also express the rule in terms of the class-conditional and prior
probabilities.
The evidence is unimportant as far as the decision is concerned. As discussed earlier, it works
as just a scale factor that states how frequently we will measure the feature with value x; it
assures P(ω1|x) + P(ω2|x) = 1.
By eliminating this scale factor from the decision rule, Bayes' theorem gives the equivalent
rule:
Decide ω1 if p(x|ω1)P(ω1) >p(x|ω2)P(ω2); otherwise decide ω2
Now, let’s consider 2 cases:
 Case-1: If class conditionals are equal i.e, p(x|ω1)= p(x|ω2), then we arrive at our premature
decision rule governed by just priors.
 Case-2: On the other hand, if priors are equal i.e, P(ω1)= P(ω2) then the decision is entirely
based on class conditionals p(x|ωj).
This completes our example formulation!
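The following sketch puts the two-class rule into code. Everything numeric here is invented for
illustration: the class-conditional densities p(x|ωj) are assumed to be Gaussians with made-up
means and standard deviations, and the priors are chosen arbitrarily; the decision simply compares
p(x|ω1)P(ω1) with p(x|ω2)P(ω2).

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Class-conditional density p(x | w) modeled as a 1-D Gaussian (an assumption)."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

# Hypothetical class-conditional parameters (mean width, std of width) and priors.
params = {"w1": (4.0, 1.0), "w2": (7.0, 1.5)}
priors = {"w1": 0.6, "w2": 0.4}

def decide(x):
    scores = {w: gaussian_pdf(x, *params[w]) * priors[w] for w in params}
    evidence = sum(scores.values())                 # p(x), only a scale factor
    posteriors = {w: s / evidence for w, s in scores.items()}
    decision = max(posteriors, key=posteriors.get)  # decide the class with the larger posterior
    error = min(posteriors.values())                # P(error | x) = min posterior
    return decision, posteriors, error

for width in (3.5, 5.5, 8.0):
    decision, posteriors, error = decide(width)
    rounded = {w: round(p, 3) for w, p in posteriors.items()}
    print(f"x={width}: decide {decision}, posteriors={rounded}, P(error|x)={error:.3f}")
```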

Generalization of the preceding ideas for Multiple Features and Classes


Bayes classification: Posterior, likelihood, prior, and evidence
P(wi | X)= P(X | wi) P(wi) / P(X)
Posterior = Likelihood* Prior/Evidence
We now discuss those cases which have multiple features as well as multiple classes,
Let the Multiple Features be X1, X2, … Xn and Multiple Classes be w1, w2, … wn, then:
P(wi | X1, …. Xn) = P(X1,…. , Xn|wi)*P(wi)/P(X1,… Xn)
Where,
Posterior = P(wi | X1, …. Xn)
Likelihood = P(X1,…. , Xn|wi)
Prior = P(wi)
Evidence = P(X1,… ,Xn)
For the same incoming patterns we might need to use a drastically different cost function,
which will lead to different actions altogether. Generally, different decision tasks may require
different features and yield decision boundaries quite different from those useful for our
original categorization problem.
A fuller treatment would also cover the cost (loss) function, risk analysis, and decision actions,
which further develop Bayes decision theory.

Dimensions of a Supervised Machine Learning Algorithm


Let us now recapitulate and generalize. We have a sample X = {x^t, r^t} for t = 1, …, N. The
sample is independent and identically distributed (iid); the ordering is not important and all
instances are drawn from the same joint distribution p(x, r). Here t indexes one of the N
instances, x^t is the arbitrary-dimensional input, and r^t is the associated desired output. r^t is
0/1 for two-class learning, is a K-dimensional binary vector (where exactly one of the
dimensions is 1 and all others 0) for (K > 2)-class classification, and is a real value in regression.
The aim is to build a good and useful approximation to r^t using the model g(x^t|θ). In doing
this, there are three decisions we must make:
1. Model we use in learning, denoted as g(x|θ), where g(·) is the model, x is the input, and θ
are the parameters. g(·) defines the hypothesis class H, and a particular value of θ instantiates
one hypothesis h ∈ H. For example, in class learning we have taken a rectangle as our model
whose four coordinates make up θ; in linear regression, the model is the linear function of the
input whose slope and intercept are the parameters learned from the data. The model (inductive
bias), or H, is fixed by the machine learning system designer based on his or her knowledge of
the application, and the hypothesis h is chosen (parameters are tuned) by a learning algorithm
using the training set, sampled from p(x, r).
2. Loss function, L(·), to compute the difference between the desired output, r^t, and our
approximation to it, g(x^t|θ), given the current value of the parameters, θ. The approximation
error, or loss, is the sum of losses over the individual instances:
E(θ|X) = Σt L(r^t, g(x^t|θ))
In class learning where outputs are 0/1, L(·) checks for equality or not; in regression, because
the output is a numeric value, we have ordering information for distance and one possibility is
to use the square of the difference.
3. Optimization procedure to find θ* that minimizes the total error:
θ* = arg minθ E(θ|X)
where arg min returns the argument that minimizes. In polynomial regression we can solve
analytically for the optimum, but this is not always the case. With other models and error
functions, the complexity of the optimization problem becomes important. We are especially
interested in whether it has a single minimum corresponding to a globally optimal solution, or
whether there are multiple minima corresponding to locally optimal solutions.
For this setting to work well, the following conditions should be satisfied. First, the hypothesis
class of g(·) should be large enough, that is, have enough capacity, to include the unknown
function that generated the data that is represented in X in a noisy form. Second, there should
be enough training data to allow us to pinpoint the correct (or a good enough) hypothesis from
the hypothesis class. Third, we should have a good optimization method that finds the correct
hypothesis given the training data.
Different machine learning algorithms differ either in the models they assume (their hypothesis
class/inductive bias), the loss measures they employ, or the optimization procedure they use.

Testing Machine Learning Algorithms


A tester's guide to testing machine learning models
Machine learning is the study of applying algorithms and statistics to make a computer learn
by itself without being programmed explicitly. The computer relies on an algorithm that uses a
mathematical model. This model uses a data set, known as the "training dataset", to learn and
to predict the desired outcome. There are multiple learning algorithms that can be used to solve
a problem, but the concept remains the same. All these algorithms fall into two categories:
supervised learning and unsupervised learning.
Let's look more closely at supervised learning, as it is much more widely researched and used
in applications like user profiling, recommended product lists, etc. Supervised learning
produces two types of output: categorical values (classification models), where the value comes
from a finite set (male or female; t-shirt, shirt or innerwear; etc.), and numerical values
(regression models), where the value is a real-valued scalar (income level, product ratings, etc.).
These algorithms are trained using the dataset and their outputs are predicted.
Please note that a machine learning algorithm does not generate an exact output; it provides an
approximation or a probability of the outcome.
As a tester, have you ever wondered how to test an application that learns by itself and corrects
its old mistakes? Let's look at the testing approach one can take to test such learning algorithms.
Testing approach: The answer lies in the data. In order to test a machine learning algorithm,
the tester defines three different datasets: a training dataset, a validation dataset and a test
dataset (all held-out subsets of the full dataset).
Please keep in mind that the process is iterative in nature, and it is better to refresh the validation
and test datasets on every iterative cycle.
Below is the basic approach a tester can follow in order to test the developed learning algorithm:
1. The tester first defines three datasets: a training dataset (65%), a validation dataset (20%)
and a test dataset (15%). Randomize the dataset before splitting, and do not use the
validation/test data in your training dataset. (A minimal splitting sketch is shown after the
figure below.)

Partition of the dataset and the different datasets fed to the ML models
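The sketch below illustrates step 1, the 65/20/15 split described above. scikit-learn's
train_test_split is an assumed convenience, and the data is made up; the data is shuffled and
split twice so that the validation and test sets never overlap with the training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # hypothetical features
y = (np.arange(1000) % 2)            # hypothetical labels

# First carve off 35% of the data, then split that portion into validation (20%) and test (15%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.35, shuffle=True, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=15 / 35, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 650 / 200 / 150
```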
2. Once the datasets are defined, the tester begins to train the models with the training dataset.
Once training is done, the tester evaluates the models with the validation dataset. This step is
iterative: any tweaks or changes needed for a model based on the results can be made and the
model re-evaluated. This ensures that the test dataset remains unused and can later be used to
test the evaluated model.

An iterative process to evaluate the best machine learning model


3. Once the evaluation of all the models is done, the best model, chosen on the basis of the
lowest error rate and the most accurate predictions, is picked and tested with the test dataset to
ensure the model still performs well and matches the validation-dataset results. If you find that
the model accuracy is suspiciously high, you must check that the test/validation sets have not
leaked into your training dataset.
An iterative workflow of training, evaluating and testing of ML models
What if we train them with incorrect data? If we train a model with an incorrect (corrupted)
dataset, the error rate increases; deliberately corrupted training data leads to data poisoning.
Models should therefore also be exercised with adversarial data, and the system should be
capable of sanitizing the data before sending it to train the models.
With the above information, let's understand an important concept called cross-validation,
which helps us evaluate a model's average performance.
Cross-Validation
Cross-validation is a technique where the dataset is split into multiple subsets and learning
models are trained and evaluated on these subsets. One of the most widely used techniques is
k-fold cross-validation. Here the dataset is divided into k subsets (folds) which are used for
training and validation over k iterations. Each subset is used exactly once as the validation
dataset while the remaining (k − 1) subsets form the training dataset. Once all the iterations are
completed, one can calculate the average prediction rate for each model.
Let's understand with the diagram below:

Each subset is used once as a validation dataset across the k iterations.
Now that we know the testing approach, the main part is how to evaluate the learning models
with the validation and test datasets. Let's dig in and learn the most common evaluation
techniques that a tester must be aware of.
Evaluation Techniques:
There are certain terminologies that we need to understand before diving into the evaluation
techniques:
 True Positive (TP): a positive instance correctly predicted as positive.
 True Negative (TN): a negative instance correctly predicted as negative.
 False Positive (FP): a negative instance incorrectly predicted as positive.
 False Negative (FN): a positive instance incorrectly predicted as negative.
With the above basic terminologies, now let's dive into the techniques:
1. Classification Accuracy: It is the most basic way of evaluating a learning model: the ratio
of correct (TP + TN) predictions to the total number of predictions. If the ratio is high, the
model has a high prediction rate.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
However, accuracy alone is not a good way to evaluate a model. For example, out of 100
samples of shapes, the model might correctly predict the true negative cases but have a low
success rate on the true positive ones. The overall ratio may look high, yet the model fails to
identify the correct rectangular shapes.
2. Confusion Matrix: A square N*N table where N is the number of classes that the model
needs to classify. It is best used for classification models that categorize an outcome into a
finite set of values, known as labels. One axis is the label that the model predicted and the other
is the actual label. To understand this better, let's categorize the shapes into 3 labels [Rectangle,
Circle, and Square]. As there are 3 labels, we draw a 3*3 confusion matrix in which one axis is
the actual label and the other is the predicted label.
Confusion matrix of a 3 [Actual] * 3 [Predicted] table.
With the above matrix, we can calculate two important metrics to quantify the positive
prediction rate.
Precision: Precision is how often the model is correct when it predicts a positive class, i.e. the
fraction of predicted positives that are actually positive:
Precision = TP / (TP + FP)
Let's calculate the precision of each label/class using the above matrix.

Precision calculations for each label/class

With the above calculations, the model is correct 76% of the time when it predicts the rectangle
shape. Likewise, it is correct 72% and 42% of the time when it predicts the circle and square
shapes.
Recall: This metric answers the question: out of all the actual positive labels, how many did
the model correctly identify? In other words, recall measures the number of correct positive
predictions divided by the number of positives that should have been predicted:
Recall = TP / (TP + FN)
Recall calculation for each label/class
The above simply means that the model has a correct prediction rate of 66%, 53% and 60% for
rectangles, circles, and squares.
If the classification threshold is increased, the model makes fewer positive predictions, which
tends to lower the recall but raise the precision. If the threshold is lowered, more instances are
predicted as positive, which raises the recall but admits more incorrect positive predictions and
so lowers the precision. To have a single balanced metric, we may use the F1 measure, defined
below. It gives a score between 0 and 1, where 1 means the model is perfect and 0 means it is
useless. A good score tells us that the model has low false positives [the other shapes which are
predicted as rectangles] and low false negatives [the rectangles which are not predicted as
rectangles].

F1 = 2 * (Precision * Recall) / (Precision + Recall)
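A short sketch computing these metrics with scikit-learn is shown below. The actual and predicted
shape labels are invented purely to illustrate the API, not taken from the matrix discussed above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

labels = ["rectangle", "circle", "square"]

# Hypothetical actual vs. predicted labels for a handful of shapes.
y_true = ["rectangle", "circle", "square", "rectangle", "circle", "square", "rectangle", "circle"]
y_pred = ["rectangle", "circle", "circle", "rectangle", "square", "square", "circle", "circle"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("confusion matrix (rows = actual, cols = predicted):")
print(confusion_matrix(y_true, y_pred, labels=labels))

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=labels, zero_division=0)
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}, F1={f:.2f}")
```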
There is another evaluation technique, the ROC (receiver operating characteristic) curve and
AUC (area under the ROC curve), which plots two parameters, the True Positive Rate (TPR,
i.e. recall) against the False Positive Rate (FPR), for various thresholds. We will cover this
evaluation technique in a later article.
The above describes a basic testing approach and evaluation techniques for a system that has
embedded learning capabilities.

Guidelines for Machine Learning Experiments


Machine Learning Model Experimentation Best Practices

Now that you have a model running in Production, how do you know it’s adding value for your
business and customers? How do you know the parameters utilized in this model are better than
other parameters? How do you know what you are doing is working better than what you had
in production before? These are key questions you should ask yourself prior to productionizing
any machine learning (ML) model.
Experimentation is crucial to the ML model building strategy. Experiments may encompass
using different training and testing data, models with differing hyperparameters, running
different code (even if it's a small change), and often you may find yourself running the same
code but in different environment configurations. All experiments come with completely
different metrics; consequently, many Data Scientists find themselves lost in keeping track of
everything due to not following experiment best practices. Let’s get started on a few we have
picked up along the way.
Versioning
Why is version control important? First, it lowers the risk of erasing or overwriting someone's
work or making mistakes. Second, it is a great way to support collaboration between colleagues.
A well-established requirement in software engineering is a mechanism for version control of
code, but in the Data Science and ML process it is more than just code that requires versioning:
notebooks, data, and the environment being utilized also need version control.
 Notebook versioning. Versioning your notebook is a must for keeping track of, not only
your code, but also the results of each model run of experiments. If you intend on
sharing and collaborating with your notebook, you will want to ensure you and your
peers do not step on each other’s work or make mistakes.
 Data versioning. Control of data is of utmost importance in ML. Data version control
allows for managing large datasets, project reproducibility, and the ability for scientists
to take advantage of new features while reusing existing features. Another advantage is
users will not have to remember which model uses which dataset – this mitigates risk
to model results. One way to have data version control is to save the incoming data in
specific locations with metadata tagging (or labeling) and logging to be able to
differentiate the old versus new.
 Environment versioning. This type of versioning can mean a couple of things:
infrastructure configuration and specific frameworks being used. You will want to have
a good approach for versioning both types as this is also a crucial step in ensuring your
experiments are being run 1-to-1. For example, if your experiments involve using
TensorFlow, you will need to ensure this framework is imported for your research
comparisons. Another example is when you want to promote your experiments from
Development to a Staging environment and run automated tests. You would need to
ensure the Staging environment matches all the configurations that were used in
Development. Good practice is to create step-by-step instructions via a script or some
automated process so as to avoid missteps.
Commits
Code commits require versioning to mitigate the risk of merging production code with non-
production code, as well as to avoid overwriting your peers' code and making other detrimental
mistakes. What happens if you run an experiment in between commits and forget to commit
the code first? These are dubbed "dirty commits", which occur when developers do not follow
development best practices. One best practice in this scenario is to have users create a snapshot
of their environment and code before running an experiment. This way, they have the option
of rolling back their changes to the code and configurations made prior to experimenting.
Hyperparameters
All ML models have hyperparameters to help control the behavior of the training process of
the algorithms and have a great impact on how the model will perform. To find the optimal
combinations of parameters for the best results, you will find yourself running many
experiments. In doing so, keeping track of the parameters you used for each experiment can
become cumbersome; consequently, many scientists find themselves re-running experiments
due to forgetting all the combinations used. A best practice for experimenting with
hyperparameters is to incorporate a tracking process. One way to track is to log everything via
audit logging or some form of logging that will save those parameters for every experiment.
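A minimal sketch of such tracking is shown below. The grid of hyperparameters, the model, and the
JSON-lines log file are all assumptions of the example: each experiment's parameters and
cross-validated score are appended to a log so that no combination has to be re-run just to recover
its result.

```python
import itertools
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

with open("experiments.jsonl", "a") as log:
    for n_estimators, max_depth in itertools.product(grid["n_estimators"], grid["max_depth"]):
        params = {"n_estimators": n_estimators, "max_depth": max_depth}
        score = cross_val_score(RandomForestClassifier(random_state=0, **params), X, y, cv=5).mean()
        # Log every hyperparameter combination together with its metric.
        log.write(json.dumps({"params": params, "mean_cv_accuracy": round(score, 4)}) + "\n")
        print(params, "->", round(score, 4))
```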
Metrics
What metrics should you track and save? Best practice: all of them. Metrics can change daily
or over a span of time depending on the use case and situation. For example, measuring the
performance of your current experiment may involve looking at a Confusion Matrix and
distribution of predictions, but if you only logged the data from the distribution, you could miss
out on remembering how the matrix performed and therefore waste time re-running the same
experiment to gather this extra metric. Another example of metric loss is not tracking the
timestamps of the data being collected; consequently, you may experience model decay and
not be able to incorporate proper model retraining techniques. If you are only tracking specific
metrics, you can miss out on new discoveries; moreover, proactively logging as much in
metrics as possible can help mitigate wasting time in the future.
A/B Testing
This form of testing is widely used by scientists to run different models against each other and
compare their performance on real-time data, in a controlled environment. Best practice is to
follow steps like the scientific method:
 Form your hypothesis. For ML, you will want a null hypothesis (states that there
is no difference between the control and variant groups) and an alternate hypothesis
(the outcome you want your test to prove to be true).
 Setup your control group and test group. Your control group would receive results from
Model A, and your test group would receive results from Model B. You would then
pull a sample of data via random sampling and from a specified sample size.
 Perform A/B testing. How you run your A/B tests depends on your use case and
requirements. We at Wallaroo provide three modes of experimenting for testing (a vendor-
neutral routing sketch follows this list):
 Random Split: Allows you to perform randomized control trial type experiment where
incoming data is sent to each model randomly. You can specify the percentage of
requests each model receives by assigning a ‘weight’. Weights are automatically
normalized into a percentage for you, so you don’t need to worry about them adding up
to a particular value. You can also specify a meta key field to ensure consistent handling
of grouped requests. For example, you can specify a split_key of 'session_id' to make
sure that requests from the same session are handled by the same (randomly chosen)
model.
 Key Split: Allows you to specifically choose which model handles requests for a user
(or group). For example, in a credit card fraud use case, if you want all ‘gold’ card users
to go to one fraud prediction model and all ‘black’ card users to go to another, then you
should specify ‘card_type’ to be the split_key.
 Shadow Deploy: Allows you to test new models without removing the default/control
model. This is particularly useful for “burn-in” testing a new model with real world
data without displacing the currently proven model.
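To illustrate the random-split and key-split ideas in a vendor-neutral way, here is a small sketch.
This is not Wallaroo's API; the weights, session key, and model names are invented. Requests are
routed randomly according to normalized weights, except that a session_id-style key always maps
the same group to the same model.

```python
import hashlib
import random

MODELS = {"model_a": 2, "model_b": 1}   # weights; normalized below, so they need not sum to 1

def route_random(weights=MODELS):
    """Random split: send each request to a model with probability proportional to its weight."""
    names, w = zip(*weights.items())
    return random.choices(names, weights=w, k=1)[0]

def route_by_key(split_key, weights=MODELS):
    """Key split: hash a key (e.g. session_id or card_type) so the same key always hits the same model."""
    names = sorted(weights)
    bucket = int(hashlib.sha256(split_key.encode()).hexdigest(), 16) % len(names)
    return names[bucket]

print([route_random() for _ in range(5)])                        # mixed assignments, roughly 2:1
print(route_by_key("session_42"), route_by_key("session_42"))    # always the same model
```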
Coming up with an effective experimentation strategy can be cumbersome but following some
best practices will assist in proper planning. By including versioning, commit tracking, metrics,
hyperparameter tracking and A/B testing, you will be able to keep track of all information and
results of your experiments to have the needed comparisons and confidence that you know
which setup produced the best results.

Cross-Validation and Resampling Methods


A Gentle Introduction to k-fold Cross-Validation
Cross-validation is a statistical method used to estimate the skill of machine learning models.
It is commonly used in applied machine learning to compare and select a model for a given
predictive modeling problem because it is easy to understand, easy to implement, and results
in skill estimates that generally have a lower bias than other methods.
In this tutorial, you will discover a gentle introduction to the k-fold cross-validation procedure
for estimating the skill of machine learning models.
This tutorial is divided into 5 parts; they are:
1. k-Fold Cross-Validation
2. Configuration of k
3. Worked Example
4. Cross-Validation API
5. Variations on Cross-Validation
k-Fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a
limited data sample.
The procedure has a single parameter called k that refers to the number of groups that a given
data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
When a specific value for k is chosen, it may be used in place of k in the reference to the model,
such as k=10 becoming 10-fold cross-validation.
Cross-validation is primarily used in applied machine learning to estimate the skill of a machine
learning model on unseen data. That is, to use a limited sample in order to estimate how the
model is expected to perform in general when used to make predictions on data not used during
the training of the model.
It is a popular method because it is simple to understand and because it generally results in a
less biased or less optimistic estimate of the model skill than other methods, such as a simple
train/test split.
The general procedure is as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups
3. For each unique group:
1. Take the group as a hold out or test data set
2. Take the remaining groups as a training data set
3. Fit a model on the training set and evaluate it on the test set
4. Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores
Importantly, each observation in the data sample is assigned to an individual group and stays
in that group for the duration of the procedure. This means that each sample is given the
opportunity to be used in the hold out set 1 time and used to train the model k-1 times.
This approach involves randomly dividing the set of observations into k groups, or folds, of
approximately equal size. The first fold is treated as a validation set, and the method is fit on
the remaining k − 1 folds.
— Page 181, An Introduction to Statistical Learning, 2013.
It is also important that any preparation of the data prior to fitting the model occur on the CV-
assigned training dataset within the loop rather than on the broader data set. This also applies
to any tuning of hyperparameters. A failure to perform these operations within the loop may
result in data leakage and an optimistic estimate of the model skill.
Despite the best efforts of statistical methodologists, users frequently invalidate their results by
inadvertently peeking at the test data.
— Page 708, Artificial Intelligence: A Modern Approach (3rd Edition), 2009.
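As a hedged sketch of this advice (the dataset below is synthetic and purely illustrative), wrapping
data preparation in a scikit-learn Pipeline keeps it inside the cross-validation loop, so the scaler is
re-fit on the training folds of each split rather than on the whole dataset:
# Sketch: data preparation kept inside the CV loop via a Pipeline (synthetic data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # fitted on the k-1 training folds only
    ("model", LogisticRegression()),
])
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, cv=cv)
print("mean=%.3f, std=%.3f" % (scores.mean(), scores.std()))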
The results of a k-fold cross-validation run are often summarized with the mean of the model
skill scores. It is also good practice to include a measure of the variance of the skill scores, such
as the standard deviation or standard error.
Configuration of k
The k value must be chosen carefully for your data sample.
A poorly chosen value for k may result in a mis-representative idea of the skill of the model,
such as a score with a high variance (that may change a lot based on the data used to fit the
model), or a high bias, (such as an overestimate of the skill of the model).
Three common tactics for choosing a value for k are as follows:
 Representative: The value for k is chosen such that each train/test group of data samples is
large enough to be statistically representative of the broader dataset.
 k=10: The value for k is fixed to 10, a value that has been found through experimentation to
generally result in a model skill estimate with low bias and a modest variance.
 k=n: The value for k is fixed to n, where n is the size of the dataset to give each test sample an
opportunity to be used in the hold out dataset. This approach is called leave-one-out cross-
validation.
The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference
in size between the training set and the resampling subsets gets smaller. As this difference
decreases, the bias of the technique becomes smaller
— Page 70, Applied Predictive Modeling, 2013.
A value of k=10 is very common in the field of applied machine learning, and is recommended
if you are struggling to choose a value for your dataset.
To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-
validation. Typically, given these considerations, one performs k-fold cross-validation using k
= 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that
suffer neither from excessively high bias nor from very high variance.
— Page 184, An Introduction to Statistical Learning, 2013.
If a value for k is chosen that does not evenly split the data sample, then one group will contain
a remainder of the examples. It is preferable to split the data sample into k groups with the
same number of samples, such that the sample of model skill scores are all equivalent.
For more on how to configure k-fold cross-validation, see the tutorial:
 How to Configure k-Fold Cross-Validation
Worked Example
To make the cross-validation procedure concrete, let’s look at a worked example.
Imagine we have a data sample with 6 observations:
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
The first step is to pick a value for k in order to determine the number of folds used to split the
data. Here, we will use a value of k=3. That means we will shuffle the data and then split the
data into 3 groups. Because we have 6 observations, each group will have an equal number of
2 observations.
For example:
Fold1: [0.5, 0.2]
Fold2: [0.1, 0.3]
Fold3: [0.4, 0.6]
We can then make use of the sample, such as to evaluate the skill of a machine learning
algorithm.
Three models are trained and evaluated with each fold given a chance to be the held out test
set.
For example:
 Model1: Trained on Fold1 + Fold2, Tested on Fold3
 Model2: Trained on Fold2 + Fold3, Tested on Fold1
 Model3: Trained on Fold1 + Fold3, Tested on Fold2
The models are then discarded after they are evaluated as they have served their purpose.
The skill scores are collected for each model and summarized for use.
Cross-Validation API
We do not have to implement k-fold cross-validation manually. The scikit-learn library
provides an implementation that will split a given data sample up.
The KFold() scikit-learn class can be used. It takes as arguments the number of splits, whether
or not to shuffle the sample, and the seed for the pseudorandom number generator used prior
to the shuffle.
For example, we can create an instance that splits a dataset into 3 folds, shuffles prior to the
split, and uses a value of 1 for the pseudorandom number generator.
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
The split() function can then be called on the class where the data sample is provided as an
argument. Called repeatedly, the split will return each group of train and test sets. Specifically,
arrays are returned containing the indexes into the original data sample of observations to use
for train and test sets on each iteration.
For example, we can enumerate the splits of the indices for a data sample using the
created KFold instance as follows:
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (train, test))
We can tie all of this together with our small dataset used in the worked example of the prior
section.
# scikit-learn k-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold
# data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
# prepare cross validation
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
Running the example prints the specific observations chosen for each train and test set. The
indices are used directly on the original data array to retrieve the observation values.
train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]
Usefully, the k-fold cross validation implementation in scikit-learn is provided as a component
operation within broader methods, such as grid-searching model hyperparameters and scoring
a model on a dataset.
Nevertheless, the KFold class can be used directly in order to split up a dataset prior to
modeling such that all models will use the same data splits. This is especially helpful if you are
working with very large data samples. The use of the same splits across algorithms can have
benefits for statistical tests that you may wish to perform on the data later.
Variations on Cross-Validation
There are a number of variations on the k-fold cross validation procedure.
Three commonly used variations are as follows:
 Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a single train/test
split is created to evaluate the model.
 LOOCV: Taken to another extreme, k may be set to the total number of observations in the
dataset such that each observation is given a chance to be held out of the dataset. This is
called leave-one-out cross-validation, or LOOCV for short.
 Stratified: The splitting of data into folds may be governed by criteria such as ensuring that
each fold has the same proportion of observations with a given categorical value, such as the
class outcome value. This is called stratified cross-validation.
 Repeated: This is where the k-fold cross-validation procedure is repeated n times, where
importantly, the data sample is shuffled prior to each repetition, which results in a different
split of the sample.
 Nested: This is where k-fold cross-validation is performed within each fold of cross-validation,
often to perform hyperparameter tuning during model evaluation. This is called nested cross-
validation or double cross-validation.
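As a hedged sketch (synthetic data, illustrative parameters), these variations are available directly
in scikit-learn; the splitter objects below can be passed as the cv argument to functions such as
cross_val_score:
# Sketch of the k-fold variations above in scikit-learn (synthetic, illustrative data)
from sklearn.datasets import make_classification
from sklearn.model_selection import (LeaveOneOut, RepeatedKFold,
                                     StratifiedKFold, train_test_split)

X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=1)

# Train/Test Split: the k=2 extreme
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# LOOCV: one observation held out on each iteration
loocv = LeaveOneOut()

# Stratified: each fold keeps the class proportions of the full sample
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Repeated: the whole k-fold procedure repeated with a different shuffle each time
repeated = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)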

Measuring Classifier Performance


Performance Measures for Classification Models
And methods to evaluate the performance of a classifier



If you are building any machine learning model, be it on a dummy dataset or a real-world problem,
the most important part is to determine how well your model works. This is usually determined
with a combination of two approaches:
→ using a performance metric, and using methods that take that performance metric and
provide empirical performance data.


This post will cover two broader aspects of classification models:


1. Multiple Performance Measures for a Classification Model
2. Different Methods to evaluate the performance based on the measures from point 1
The content covered will provide a conceptual grasp, and these measures can be easily applied in
real-world implementations. Almost all of the measures discussed here are already implemented in
Python machine learning libraries such as scikit-learn, SciPy, and NumPy.
Performance Measures for a Classification Model
Confusion Matrix
Q. How can we understand what types of mistakes a learned model makes?
Ans → For a classification model it is based on the counts of records correctly and incorrectly
predicted by the model. These counts are tabulated in a table called confusion matrix.
For a binary problem (2 class labels in the dataset), the confusion matrix looks like this:

                          Predicted class
                          Class = 1      Class = 0
Actual     Class = 1      f11            f10
class      Class = 0      f01            f00

Confusion matrix for a binary problem — Image by Author
 here f(i)(j) is the number of records of actual class i predicted as class j; the off-diagonal entries (i ≠ j) are the incorrect predictions
 Confusion matrix is also known as error matrix
 Each row represents the instances in the actual class whereas each column represents the
instances in the predicted class
→ This confusion matrix allows us to derive a lot of performance metrics which we will discuss
below.
Accuracy
It is closeness of the measurements to a specific value. In simpler terms, if we are measuring
something repeatedly, we say the measurement to be accurate if it is close to the true value of
the quantity being measured.

Accuracy = (number of correct predictions) / (total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)
(Introduction to Data Mining — Pang-Ning Tan, Michael Steinbach, Vipin Kumar)


Error Rate
It is the opposite of accuracy. This metric measures the performance of a model as the name
suggests in terms of incorrect predictions.

Introduction to Data Mining — Pang-Ning Tan, Michael Steinbach, Vipin Kumar


Note: It is important to note that the accuracy and error rate metrics are prone to the class
imbalance problem. The class imbalance problem occurs when the dataset contains one or more
classes in a much lower proportion (i.e., rare) compared to the rest of the classes.
Example:
→ Consider a 2-class problem
 Number of class POS instances = 10
 Number of class NEG instances = 990
If the model is predicting everything to be NEG, then
accuracy = 990 / 1000 = 99%
This is misleading because the model doesn’t detect any POS class. Detecting the rare class is
usually more interesting (examples: frauds, spams, cancer detection etc)
This requires to involve other performance measurement metrics that do not suffer similar
problems. We will discuss them further.
Precision
It is the degree to which repeated measurements under the same conditions show the same result.
It is often measured by the standard deviation of a set of values.
Example: We have an item that weighs 1g, we measure it 5 times and get the following set of
weights: {1.015, 0.990, 1.013, 1.001, 0.986}.
The precision, measured as the standard deviation, is 0.013. It means that the most precisely we
can state the weight of the item is 1 ± 0.013 g.
Having said that, in Machine Learning precision is defined as:
Precision = f11 / (f11 + f01) = TP / (TP + FP)
(Precision for a binary problem — Introduction to Data Mining — Pang-Ning Tan, Michael Steinbach, Vipin Kumar)
Precision determines the fraction of records that actually turns out to be positive in the group
the classifier has declared as a positive class. The higher the precision is, the lower the number
of False Positives committed by the model.
To understand it with an example, let’s say we are searching for documents that contain the term
‘machine learning’ in a corpus of 100 documents, of which 20 are actually relevant. When queried,
the model fetches 15 documents, and 12 of them turn out to be relevant. The precision is therefore
precision = 12 / 15 = 80%
Recall / True Positive Rate
It measures the fraction of positive examples correctly predicted by the classifier. Continuing the
same example, the corpus contains 20 relevant documents for ‘machine learning’, and the 15
documents fetched by the model include 12 of them. The recall is therefore
recall = 12 / 20 = 60%

Recall (True Positive Rate) = f11 / (f11 + f10) = TP / (TP + FN)
(Recall for a binary problem — Introduction to Data Mining — Pang-Ning Tan, Michael Steinbach, Vipin Kumar)
F-measure
The two metrics above precision and recall can be combined into a single metric called F-
measure. It is a harmonic mean of precision and recall. The harmonic mean of two
numbers x and y is close to the smaller of the two numbers. Hence, a high value of F-measure
ensures both precision and recall are reasonably high.

F-measure = (2 × Precision × Recall) / (Precision + Recall)
(F-measure for a binary problem — Introduction to Data Mining — Pang-Ning Tan, Michael Steinbach, Vipin Kumar)
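As a short, hedged sketch (the y_true and y_pred arrays below are made-up labels for a binary
problem), all of the measures above are available in scikit-learn:
# Sketch: computing the measures above with scikit-learn (made-up binary labels)
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))               # counts of correct and incorrect predictions
print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall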
Different Methods to evaluate the performance
Usually when working with a machine learning model, we need 3 splits of our dataset.
1. Training set
2. Validation set
3. Test set
Training set is used to train our model by learning the parameters of the model.
Validation set is used to learn the best hyperparameters of our model using the performance
metrics defined above.
Test set is never seen before data. The performance of the model is calculated based on the
model learnt using parameters from training set and hyperparameters from validation set by
applying the metrics mentioned above.
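As a hedged sketch (synthetic data and illustrative 60/20/20 proportions), the three splits can be
produced by calling train_test_split twice:
# Sketch: a 60/20/20 train / validation / test split (synthetic data)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200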
Note: These are not all possible performance metrics available in the literature. These are some
of the widely used ones.
The methods below are variations of the above.
Holdout Method
Split the learning sample into a training set and a test data set.
→ A model is induced on the training data set
→ Performance is evaluated on the test data set
Limitations:
→ Too little data for learning: The more data used for testing, the more reliable the performance
estimate, but the less data is available for learning.
→ Interdependence of training and test data set: If a class is underrepresented in the training
data set, it will be overrepresented in the test data set and vice versa.
Random Subsampling
The holdout method can be repeated several times to improve the estimation of a classifier’s
performance. If the estimation is performed k times then, the overall performance can be the
average of each estimate.

→ This method also encounters some of the problems associated with the holdout method
because it does not utilise as much data as possible for training.
→ It also has no control over the number of times each record is used for training and testing.
Cross-Validation
Core idea:
 use each record k times for training and once for testing
 aggregate the performance values over all k tests
k-fold cross validation
 split the learning dataset into k equi-sized subsets
 for i = 1, …, k: use the other k−1 folds for training and the i-th fold for testing
 aggregate the performance values over all k tests
leave one out cross validation
 In k-fold cross validation, if k = N where N is the number of records in the learning dataset
 Each test set will contain only one record
 Computationally expensive
Bootstrap
The methods presented so far assume that the training records are sampled without replacement.
It means that there are no duplicate records in the training and test set. In the bootstrap approach,
the training records are sampled with replacement. It means that a record already chosen for
training is put back into the original pool of records so that it is equally likely to be redrawn.
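As a hedged sketch of a single bootstrap draw (the six-value sample is made up), scikit-learn’s
resample utility samples with replacement; the records never drawn (the out-of-bag records) can
then serve as a test set:
# Sketch: one bootstrap sample drawn with replacement (made-up data)
from sklearn.utils import resample

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
boot = resample(data, replace=True, n_samples=len(data), random_state=1)
oob = [x for x in data if x not in boot]  # out-of-bag records, usable for testing
print("bootstrap sample:", boot)
print("out-of-bag      :", oob)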

Hypothesis Testing
Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Hypothesis testing is a formal procedure for investigating our ideas about the world
using statistics. It is most often used by scientists to test specific predictions, called hypotheses,
that arise from theories.
There are 5 main steps in hypothesis testing:
1. State your research hypothesis as a null hypothesis (Ho) and alternate hypothesis
(Ha or H1).
2. Collect data in a way designed to test the hypothesis.
3. Perform an appropriate statistical test.
4. Decide whether to reject or fail to reject your null hypothesis.
5. Present the findings in your results and discussion section.
Though the specific details might vary, the procedure you will use when testing a hypothesis
will always follow some version of these steps.

Step 1: State your null and alternate hypothesis


After developing your initial research hypothesis (the prediction that you want to investigate),
it is important to restate it as a null (Ho) and alternate (Ha) hypothesis so that you can test it
mathematically.
The alternate hypothesis is usually your initial hypothesis that predicts a relationship
between variables. The null hypothesis is a prediction of no relationship between the variables
you are interested in.
You want to test whether there is a relationship between gender and height. Based on your
knowledge of human physiology, you formulate a hypothesis that men are, on average, taller
than women. To test this hypothesis, you restate it as:
Ho: Men are, on average, not taller than women.
Ha: Men are, on average, taller than women.
Step 2: Collect data
For a statistical test to be valid, it is important to perform sampling and collect data in a way
that is designed to test your hypothesis. If your data are not representative, then you cannot
make statistical inferences about the population you are interested in.
To test differences in average height between men and women, your sample should have an
equal proportion of men and women, and cover a variety of socio-economic classes and any
other control variables that might influence average height.
You should also consider your scope (Worldwide? For one country?) A potential data source
in this case might be census data, since it includes data from a variety of regions and social
classes and is available for many countries around the world.
Step 3: Perform a statistical test
There are a variety of statistical tests available, but they are all based on the comparison
of within-group variance (how spread out the data is within a category) versus between-
group variance (how different the categories are from one another).
If the between-group variance is large enough that there is little or no overlap between groups,
then your statistical test will reflect that by showing a low p-value. This means it is unlikely
that the differences between these groups came about by chance.
Alternatively, if there is high within-group variance and low between-group variance, then your
statistical test will reflect that with a high p-value. This means it is likely that any difference
you measure between groups is due to chance.
Your choice of statistical test will be based on the type of data you collected.
Based on the type of data you collected, you perform a one-tailed t-test to test whether men are
in fact taller than women. This test gives you:
 an estimate of the difference in average height between the two groups.
 a p-value showing how likely you are to see this difference if the null hypothesis of no
difference is true.
Your t-test shows an average height of 175.4 cm for men and an average height of 161.7 cm
for women, with an estimate of the true difference ranging from 10.2cm to infinity. The p-value
is 0.002.
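As a hedged sketch of such a test (the simulated heights below are illustrative, not the data from
this example, and the one-sided alternative argument requires SciPy 1.6 or later), a one-tailed
two-sample t-test can be run as follows:
# Sketch: one-tailed two-sample t-test on simulated heights (illustrative data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
heights_men = rng.normal(loc=175.4, scale=7.0, size=100)
heights_women = rng.normal(loc=161.7, scale=6.0, size=100)

# Ho: men are not taller on average; Ha: men are taller on average
t_stat, p_value = stats.ttest_ind(heights_men, heights_women, alternative="greater")
print("t = %.2f, p = %.4f" % (t_stat, p_value))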
Step 4: Decide whether to reject or fail to reject your null hypothesis
Based on the outcome of your statistical test, you will have to decide whether to reject or fail
to reject your null hypothesis.
In most cases you will use the p-value generated by your statistical test to guide your decision.
And in most cases, your predetermined level of significance for rejecting the null hypothesis
will be 0.05 – that is, when there is a less than 5% chance that you would see these results if
the null hypothesis were true.
In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%).
This minimizes the risk of incorrectly rejecting the null hypothesis (Type I error).
In your analysis of the difference in average height between men and women, you find that
the p-value of 0.002 is below your cutoff of 0.05, so you decide to reject your null hypothesis
of no difference.
Step 5: Present your findings
The results of hypothesis testing will be presented in the results and discussion sections of your
research paper.
In the results section you should give a brief summary of the data and a summary of the results
of your statistical test (for example, the estimated difference between group means and
associated p-value). In the discussion, you can discuss whether your initial hypothesis was
supported by your results or not.
In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null
hypothesis. You will probably be asked to do this in your statistics assignments.
Stating results in a statistics assignment
In our comparison of mean height between men and women we found an average difference of
13.7 cm and a p-value of 0.002; therefore, we can reject the null hypothesis that men are not
taller than women and conclude that there is likely a difference in height between men and
women.
However, when presenting research results in academic papers we rarely talk this way. Instead,
we go back to our alternate hypothesis (in this case, the hypothesis that men are on average
taller than women) and state whether the result of our test was consistent or inconsistent with
the alternate hypothesis.
If your null hypothesis was rejected, this result is interpreted as being consistent with your
alternate hypothesis.

Comparing Two Classification Algorithms


There are many real-life use cases to create your unique machine learning projects. If you’re
still struggling to work on an actual use case, find something practical and unique, like a
machine learning project where you’ll show a comparison of some of the classification
algorithms in machine learning. So, if you want to know how to compare classification
algorithms, this article is for you. In this article, I will present a comparison of classification
algorithms in machine learning using Python.
Comparison of Classification Algorithms
In machine learning, classification means training a model to specify which category an entry
belongs to. There are so many classification algorithms in machine learning, so if you can show
a detailed comparison of classification algorithms in machine learning, it will become an
amazing and unique machine learning project as a beginner. For this task, you must first choose
a classification-based problem statement and determine all those classification algorithms that
may be useful for your problem. Next, you need to train classification models and show a
comparison based on their performance.
The performance of all classification algorithms will depend on the problem you are working
on. So let’s start this task by importing the necessary Python libraries, a dataset based on the
problem of classification, and some of the popular classification algorithms:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/social.csv")
print(data.head())
Age EstimatedSalary Purchased
0 19 19000 0
1 35 20000 0
2 26 43000 0
3 27 57000 0
4 19 76000 0
The dataset I’m using here is based on social media marketing; I won’t analyze this dataset at
this time, but when building your project, you should show a detailed exploration of your data.
You can find a detailed analysis of this dataset here.
Now let’s move forward to the task of comparing the performance of classification algorithms
in machine learning. Here you can either choose only one performance evaluation metric or
more, but the process will remain the same as shown in the code below:
x = np.array(data[["Age", "EstimatedSalary"]])
y = np.array(data["Purchased"])

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10, random_state=42)

# instantiate the classifiers
decisiontree = DecisionTreeClassifier()
logisticregression = LogisticRegression()
knearestclassifier = KNeighborsClassifier()
svm_classifier = SVC()
bernoulli_naiveBayes = BernoulliNB()
passiveAggressive = PassiveAggressiveClassifier()

# fit the four classifiers included in the comparison on the training set
knearestclassifier.fit(xtrain, ytrain)
decisiontree.fit(xtrain, ytrain)
logisticregression.fit(xtrain, ytrain)
passiveAggressive.fit(xtrain, ytrain)

# collect each model's accuracy score in a DataFrame
data1 = {"Classification Algorithms": ["KNN Classifier", "Decision Tree Classifier",
                                       "Logistic Regression", "Passive Aggressive Classifier"],
         "Score": [knearestclassifier.score(x, y), decisiontree.score(x, y),
                   logisticregression.score(x, y), passiveAggressive.score(x, y)]}
score = pd.DataFrame(data1)
print(score)
In the above code:
1. I first divided the data into training and test sets;
2. Then I stored all the classification algorithms provided by the scikit-learn library in
Python in their respective variables;
3. Then I used the fit method to fit the data in the algorithm;
4. Finally, I created a DataFrame, where I stored the model score on the data.
Below is the DataFrame you will see at the end:
Classification Algorithms         Score
KNN Classifier                    0.8750
Decision Tree Classifier          0.9800
Logistic Regression               0.6425
Passive Aggressive Classifier     0.6425
According to the above output, the Decision Tree classification algorithm performs the best on
this dataset.
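To connect this with the hypothesis-testing section above, here is a hedged sketch (reusing the x
and y arrays built earlier) of comparing two of these classifiers on the same cross-validation folds
with a paired t-test; a low p-value suggests the difference in their scores is unlikely to be due to
chance:
# Sketch: paired comparison of two classifiers on identical CV folds
# (x and y are assumed to be the arrays created in the code above)
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores_tree = cross_val_score(DecisionTreeClassifier(), x, y, cv=cv)
scores_logreg = cross_val_score(LogisticRegression(), x, y, cv=cv)

# Paired t-test on the per-fold accuracies; Ho: both algorithms perform the same
t_stat, p_value = stats.ttest_rel(scores_tree, scores_logreg)
print("t = %.2f, p = %.4f" % (t_stat, p_value))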
Summary
So this is how you can compare classification algorithms in machine learning using the Python
programming language. If you follow all the steps mentioned in this article while further
exploring your dataset, it will become an amazing machine learning project as a beginner. Hope
you liked this article on a comparison of classification algorithms in machine learning. Please
feel free to ask your valuable questions in the comments section below.
