UNIT 1 Notes
PAC Framework
Before getting into more detail, let us first look at the notation and terminology used to describe the PAC framework:
c — Concept: a mapping c : X -> Y; since Y = {0,1}, c : X -> {0,1}
C — Concept class (the set of concepts to learn)
H — Hypothesis set (a set of concepts which may not coincide with C)
D — Data distribution (samples are assumed to be independent and identically distributed)
S — Sample of labelled examples drawn from D
hS — Hypothesis returned by the algorithm for the sample S
ε — Accuracy parameter
δ — Confidence parameter
PAC Learning
A concept class C is said to be PAC learnable if there is an algorithm A such that, for any ε, δ > 0, after seeing N samples (where N is polynomial in 1/ε and 1/δ), the hypothesis hS returned by A has generalisation error at most ε with probability at least 1 − δ. The combination of "probably" and "approximately" in this statement gives the name PAC — Probably Approximately Correct.
* Pr[R(hS) ≤ ε] ≥ 1 − δ, where R(hS) is the generalisation error of hS
The assumptions made here are that ε, δ > 0 and that the hypothesis set H is finite. An algorithm/classifier that achieves error at most ε with probability at least 1 − δ is termed approximately correct in learning the concepts/features.
Further, if the algorithm A also runs in time polynomial in 1/ε and 1/δ, then C is said to be efficiently PAC learnable. The goal here is generalised learning (small generalisation error), not memorisation of the concepts/features by the algorithm.
Generalisation error — For a hypothesis h in H and target concept c in C, the generalisation (true) error is the probability, over a random instance x drawn from D, that h and c disagree, i.e. that h(x) != c(x).
* R(h) = Pr_{x~D}[h(x) != c(x)]
Sample complexity — Using the PAC framework we can also bound the number of samples needed for learning with high probability, assuming C is PAC learnable. To obtain a hypothesis that is approximately correct with probability at least 1 − δ (high probability), the number of training samples needed satisfies
* N ≥ (1/ε) (ln|H| + ln(1/δ))
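As a quick illustration, the bound above can be evaluated numerically. The sketch below (plain Python, with hypothetical values for |H|, ε and δ) computes the smallest integer N satisfying the inequality:

import math

def pac_sample_size(h_size, epsilon, delta):
    # smallest integer N satisfying N >= (1/epsilon) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 1000 hypotheses, 5% error tolerance, 99% confidence
print(pac_sample_size(h_size=1000, epsilon=0.05, delta=0.01))  # -> 231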
What if the hypothesis set H is infinite? This is where the VC (Vapnik–Chervonenkis) dimension comes in.
Example of VC Dimension 1
Consider graph (a), where d points belonging to two sets T and F are separated into two groups by a line (axis); any other suitable line could equally be chosen. Graph (b) shows the same situation with the labels F and T swapped, and again a separating line can be found.
Example of VC Dimension 2
Consider graph (a), where there are 4 points labelled T, T, T, F. How can we shatter it? Any line we choose cuts off the F point together with at least one T point, so there is no perfect split and the configuration in graph (a) cannot be shattered.
Consider graph (b), where there are 3 points labelled T, T, F (d + 1 points for lines in the plane, with d = 2). This set can easily be split using a line.
In the above examples we saw that sets of at most 3 points (in general position in the plane) can always be split, but sets of 4 or more points cannot always be split perfectly by a line.
So the largest set size that can still be shattered is 3, i.e. the VC dimension of linear separators in the plane is 3 (VC dimension < 4).
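The shattering argument above can also be checked by brute force. The sketch below is an illustrative (not optimised) check: for a given point set it tries every possible T/F (+1/−1) labelling and asks, via a small linear-programming feasibility problem in scipy, whether some line realises that labelling. The example points are hypothetical.

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # Check, via an LP feasibility problem, whether a linear classifier w.x + b
    # can realise the given +/-1 labelling of the points.
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, d = X.shape
    # Constraint y_i * (w.x_i + b) >= 1  rewritten as  -y_i*[x_i, 1] @ [w, b] <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(points):
    # A point set is shattered if every +/-1 labelling is linearly separable.
    n = len(points)
    return all(separable(points, labels)
               for labels in itertools.product([-1.0, 1.0], repeat=n))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points in general position
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the XOR labelling is not separable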
To summarise, in this section we got an understanding of PAC learning for finite hypothesis sets and of the VC dimension for infinite hypothesis sets.
References
https://www.youtube.com/watch?v=X4Oxst5huQA&t=909s
https://www.slideshare.net/sanghyukchun/pac-learning-42139787
Book — Foundations of Machine Learning
Supervised Learning
What are some popular machine learning methods?
Two of the most widely adopted machine learning methods are supervised
learning and unsupervised learning – but there are also other methods of machine learning.
Here's an overview of the most popular types.
Supervised learning algorithms are trained using labeled examples, such as an input where
the desired output is known. For example, a piece of equipment could have data points labeled
either “F” (failed) or “R” (runs). The learning algorithm receives a set of inputs along with the
corresponding correct outputs, and the algorithm learns by comparing its actual output with
correct outputs to find errors. It then modifies the model accordingly. Through methods like
classification, regression, prediction and gradient boosting, supervised learning uses patterns
to predict the values of the label on additional unlabeled data. Supervised learning is commonly
used in applications where historical data predicts likely future events. For example, it can
anticipate when credit card transactions are likely to be fraudulent or which insurance customer
is likely to file a claim.
Unsupervised learning is used against data that has no historical labels. The system is not told
the "right answer." The algorithm must figure out what is being shown. The goal is to explore
the data and find some structure within. Unsupervised learning works well on transactional
data. For example, it can identify segments of customers with similar attributes who can then
be treated similarly in marketing campaigns. Or it can find the main attributes that separate
customer segments from each other. Popular techniques include self-organizing maps, nearest-
neighbor mapping, k-means clustering and singular value decomposition. These algorithms are
also used to segment text topics, recommend items and identify data outliers.
Regression
What is Regression Analysis?
Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modeling the future
relationship between them.
Regression analysis includes several variations, such as linear, multiple linear, and nonlinear.
The most common models are simple linear and multiple linear. Nonlinear regression analysis
is commonly used for more complicated data sets in which the dependent and independent
variables show a nonlinear relationship.
Regression analysis offers numerous applications in various disciplines, including finance.
Simple Linear Regression:
Y = a + bX + ϵ
Where:
Y – Dependent variable
X – Independent (explanatory) variable
a – Intercept
b – Slope
ϵ – Residual (error)
Multiple Linear Regression:
Y = a + bX1 + cX2 + dX3 + ϵ
Where:
Y – Dependent variable
X1, X2, X3 – Independent (explanatory) variables
a – Intercept
b, c, d – Slopes
ϵ – Residual (error)
Multiple linear regression follows the same conditions as the simple linear model. However,
since there are several independent variables in multiple linear analysis, there is another
mandatory condition for the model:
Non-collinearity: Independent variables should show a minimum correlation with
each other. If the independent variables are highly correlated with each other, it will be
difficult to assess the true relationships between the dependent and independent
variables.
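To make the two models concrete, here is a minimal sketch using scikit-learn's LinearRegression with made-up numbers; the fitted intercept_ and coef_ correspond to the intercept a and the slopes b, c in the formulas above:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: predict revenue Y from ad spend and number of ads (X1, X2)
X = np.array([[10, 1], [20, 2], [30, 2], [40, 3], [50, 4]], dtype=float)
Y = np.array([25.0, 44.0, 53.0, 71.0, 90.0])

model = LinearRegression().fit(X, Y)          # estimates the intercept a and slopes b, c
print("intercept a:", model.intercept_)
print("slopes b, c:", model.coef_)
print("prediction for X1=35, X2=3:", model.predict([[35, 3]]))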
Regression Analysis in Finance
Regression analysis comes with several applications in finance. For example, the statistical
method is fundamental to the Capital Asset Pricing Model (CAPM). Essentially, the CAPM
equation is a model that determines the relationship between the expected return of an asset
and the market risk premium.
The analysis is also used to forecast the returns of securities based on different factors, or to forecast the performance of a business. For example, the FORECAST function in Excel applies simple linear regression and can be used to estimate a company's revenue based on the number of ads it runs.
Given easy-to-use machine learning libraries like scikit-learn and Keras, it is straightforward
to fit many different machine learning models on a given predictive modeling dataset.
The challenge of applied machine learning, therefore, becomes how to choose among a range
of different models that you can use for your problem.
Model Selection for Machine Learning
After reading this section, you will know:
Model selection is the process of choosing one among many candidate models for a predictive
modeling problem.
There may be many competing concerns when performing model selection beyond model
performance, such as complexity, maintainability, and available resources.
The two main classes of model selection techniques are probabilistic measures and resampling
methods.
Overview
This tutorial is divided into three parts; they are:
1. What Is Model Selection
2. Considerations for Model Selection
3. Model Selection Techniques
What Is Model Selection
Model selection is the process of selecting one final machine learning model from among a
collection of candidate machine learning models for a training dataset.
Model selection is a process that can be applied both across different types of models (e.g.
logistic regression, SVM, KNN, etc.) and across models of the same type configured with
different model hyperparameters (e.g. different kernels in an SVM).
When we have a variety of models of different complexity (e.g., linear or logistic regression
models with different degree polynomials, or KNN classifiers with different values of K), how
should we pick the right one?
— Page 22, Machine Learning: A Probabilistic Perspective, 2012.
For example, we may have a dataset for which we are interested in developing a classification or regression predictive model. We cannot know beforehand which model will perform best on this problem, so we fit and evaluate a suite of different models on it.
Model selection is the process of choosing one of the models as the final model that addresses
the problem.
Model selection is different from model assessment.
For example, we evaluate or assess candidate models in order to choose the best one, and this
is model selection. Whereas once a model is chosen, it can be evaluated in order to
communicate how well it is expected to perform in general; this is model assessment.
The process of evaluating a model’s performance is known as model assessment, whereas the
process of selecting the proper level of flexibility for a model is known as model selection.
— Page 175, An Introduction to Statistical Learning: with Applications in R, 2017.
Considerations for Model Selection
Fitting models is relatively straightforward, although selecting among them is the
true challenge of applied machine learning.
Firstly, we need to get over the idea of a “best” model.
All models have some predictive error, given the statistical noise in the data, the
incompleteness of the data sample, and the limitations of each different model type. Therefore,
the notion of a perfect or best model is not useful. Instead, we must seek a model that is “good
enough.”
What do we care about when choosing a final model?
The project stakeholders may have specific requirements, such as maintainability and limited
model complexity. As such, a model that has lower skill but is simpler and easier to understand
may be preferred.
Alternately, if model skill is prized above all other concerns, then the ability of the model to
perform well on out-of-sample data will be preferred regardless of the computational
complexity involved.
Therefore, a “good enough” model may refer to many things and is specific to your project,
such as:
A model that meets the requirements and constraints of project stakeholders.
A model that is sufficiently skillful given the time and resources available.
A model that is skillful as compared to naive models.
A model that is skillful relative to other tested models.
A model that is skillful relative to the state-of-the-art.
Next, we must consider what is being selected.
For example, we are not selecting a fit model, as all models will be discarded. This is because
once we choose a model, we will fit a new final model on all available data and start using it
to make predictions.
Therefore, are we choosing among algorithms used to fit the models on the training dataset?
Some algorithms require specialized data preparation in order to best expose the structure of
the problem to the learning algorithm. Therefore, we must go one step further and
consider model selection as the process of selecting among model development pipelines.
Each pipeline takes in the same raw training dataset and outputs a model that can be evaluated in the same manner, but may require different or overlapping computational steps (a minimal pipeline sketch follows the list below), such as:
Data filtering.
Data transformation.
Feature selection.
Feature engineering.
And more…
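As referenced above, here is a minimal sketch of one such candidate pipeline in scikit-learn; the transformer, feature selector, classifier and synthetic data are illustrative choices, not prescriptions:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One candidate "model development pipeline": transformation -> feature selection -> model.
# Competing pipelines (different transforms, selectors or classifiers) are evaluated the same way.
candidate = Pipeline([
    ("scale", StandardScaler()),              # data transformation
    ("select", SelectKBest(f_classif, k=5)),  # feature selection
    ("model", LogisticRegression(max_iter=1000)),
])

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
print(cross_val_score(candidate, X, y, cv=5).mean())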
The closer you look at the challenge of model selection, the more nuance you will discover.
Now that we are familiar with some considerations involved in model selection, let’s review
some common methods for selecting a model.
Model Selection Techniques
The best approach to model selection requires “sufficient” data, which may be nearly infinite
depending on the complexity of the problem.
In this ideal situation, we would split the data into training, validation, and test sets, then fit
candidate models on the training set, evaluate and select them on the validation set, and report
the performance of the final model on the test set.
If we are in a data-rich situation, the best approach […] is to randomly divide the dataset into
three parts: a training set, a validation set, and a test set. The training set is used to fit the
models; the validation set is used to estimate prediction error for model selection; the test set
is used for assessment of the generalization error of the final chosen model.
— Page 222, The Elements of Statistical Learning: Data Mining, Inference, and Prediction,
2017.
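A minimal sketch of this three-way split, assuming a synthetic dataset and a 60/20/20 partition (the proportions are an arbitrary illustrative choice):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 60% train, 20% validation, 20% test, via two chained splits
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Fit candidate models on (X_train, y_train), select the best on (X_val, y_val),
# and report the chosen model's error once on (X_test, y_test).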
This is impractical on most predictive modeling problems given that we rarely have sufficient
data, or are able to even judge what would be sufficient.
In many applications, however, the supply of data for training and testing will be limited, and
in order to build good models, we wish to use as much of the available data as possible for
training. However, if the validation set is small, it will give a relatively noisy estimate of
predictive performance.
– Page 32, Pattern Recognition and Machine Learning, 2006.
Instead, there are two main classes of techniques to approximate the ideal case of model
selection; they are:
Probabilistic Measures: Choose a model via in-sample error and complexity.
Resampling Methods: Choose a model via estimated out-of-sample error.
Let’s take a closer look at each in turn.
Probabilistic Measures
Probabilistic measures involve analytically scoring a candidate model using both its
performance on the training dataset and the complexity of the model.
It is known that training error is optimistically biased, and therefore is not a good basis for
choosing a model. The performance can be penalized based on how optimistic the training error
is believed to be. This is typically achieved using algorithm-specific methods, often linear, that
penalize the score based on the complexity of the model.
Historically various ‘information criteria’ have been proposed that attempt to correct for the
bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting
of more complex models.
– Page 33, Pattern Recognition and Machine Learning, 2006.
A model with fewer parameters is less complex, and because of this, is preferred because it is
likely to generalize better on average.
Four commonly used probabilistic model selection measures include:
Akaike Information Criterion (AIC).
Bayesian Information Criterion (BIC).
Minimum Description Length (MDL).
Structural Risk Minimization (SRM).
Probabilistic measures are appropriate when using simpler linear models like linear regression or logistic regression, where the calculation of the model-complexity penalty (the correction for in-sample bias) is known and tractable.
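As an illustration, AIC and BIC for ordinary linear regression can be computed directly from the residual sum of squares under a Gaussian likelihood. The sketch below uses synthetic data and states the criteria up to an additive constant:

import numpy as np
from sklearn.linear_model import LinearRegression

def aic_bic(model, X, y):
    # Gaussian-likelihood AIC/BIC for a fitted linear regression, up to an additive constant.
    # k counts the intercept and slopes (the error variance is sometimes counted as well).
    n = len(y)
    k = X.shape[1] + 1
    rss = np.sum((y - model.predict(X)) ** 2)
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)   # only the first feature matters

for d in (1, 2, 3):                                   # compare models of growing complexity
    m = LinearRegression().fit(X[:, :d], y)
    print(d, aic_bic(m, X[:, :d], y))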
Resampling Methods
Resampling methods seek to estimate the performance of a model (or more precisely, the model
development process) on out-of-sample data.
This is achieved by splitting the training dataset into sub train and test sets, fitting a model on
the sub train set, and evaluating it on the test set. This process may then be repeated multiple
times and the mean performance across each trial is reported.
It is a type of Monte Carlo estimate of model performance on out-of-sample data, although
each trial is not strictly independent as depending on the resampling method chosen, the same
data may appear multiple times in different training datasets, or test datasets.
Three common resampling model selection methods include:
Random train/test splits.
Cross-Validation (k-fold, LOOCV, etc.).
Bootstrap.
Most of the time probabilistic measures (described in the previous section) are not available,
therefore resampling methods are used.
By far the most popular is the cross-validation family of methods that includes many subtypes.
Probably the simplest and most widely used method for estimating prediction error is cross-
validation.
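A minimal sketch of a resampling estimate using repeated random train/test splits; the model, split size and number of repeats are arbitrary illustrative choices:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# 10 repeated random 70/30 train/test splits; the mean score is the resampling
# estimate of out-of-sample performance for this modelling choice.
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())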
Bayesian decision theory
Bayesian decision theory is the statistical approach that quantifies the tradeoff among the various classification decisions using probability (the Bayes theorem) and the costs associated with those decisions.
It is basically a classification technique that involves the use of the Bayes theorem, which is used to find conditional probabilities.
In statistical pattern recognition we focus on the statistical properties of patterns, which are generally expressed as probability densities (pdf's and pmf's); this will command most of our attention here as we try to develop the fundamentals of Bayesian decision theory.
Prerequisites
Random Variable
A random variable is a function that maps a set of possible outcomes to numerical values. For example, when tossing a coin we can map heads (H) to 1 and tails (T) to 0, so the outcome is a random variable taking the values 0 and 1.
Bayes Theorem
The conditional probability of A given B, represented by P(A | B) is the chance of occurrence
of A given that B has occurred.
P(A | B) = P(A, B) / P(B)
By using the chain rule, the joint probability can also be written as:
P(A, B) = P(A | B) P(B) = P(B | A) P(A)
P(A | B) = P(B | A) P(A) / P(B) ——- (1)
Where, P(B) = P(B, A) + P(B, A') = P(B | A) P(A) + P(B | A') P(A')
Here, equation (1) is known as the Bayes Theorem of probability.
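A tiny numeric check of equation (1), with made-up probabilities:

# Hypothetical numbers: P(A) = 0.01 (a rare class), P(B|A) = 0.9, P(B|A') = 0.05
p_a = 0.01
p_b_given_a = 0.9
p_b_given_not_a = 0.05

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # total probability
p_a_given_b = p_b_given_a * p_a / p_b                   # equation (1)
print(round(p_a_given_b, 3))                            # ~0.154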
Our aim is to explore each of the components included in this theorem. Let’s explore step by
step:
(a) Prior or State of Nature:
Prior probabilities represent how likely each class is to occur.
Priors are known before the training process.
The state of nature ωi is treated as a random variable with prior probability P(ωi).
If there are only two classes and the classes are exhaustive, the priors sum to one: P(ω1) + P(ω2) = 1.
(b) Class-Conditional Probabilities:
The class-conditional probability represents how likely a feature x is to occur given that it belongs to a particular class ωi. It is denoted by P(x | ωi).
Sometimes it is also known as the likelihood.
It is the quantity that we have to estimate while training on the data. During the training process we have input (features) X labelled with the corresponding class ω, and we figure out the likelihood of occurrence of that set of features given the class label.
(c) Evidence:
The evidence is the probability of occurrence of a particular feature, i.e. P(X).
It can be calculated using the law of total probability: P(X) = Σi P(X | ωi) P(ωi), summing over all classes i = 1, …, n.
Since the evidence is built from the class-conditional likelihoods and the priors, it is also figured out during training.
(d) Posterior Probabilities:
The posterior is the probability that an observation belongs to class ωi given the observed features.
It is what we aim to compute in the test phase: given the test input (features), we find how likely it is, according to the trained model, that the features belong to the particular class ωi.
In our example, we use the width x, which is a more discriminatory feature, to improve the decision rule of our classifier. Different objects yield different width readings, and we usually express this variability in probabilistic terms: we consider x to be a continuous random variable whose distribution depends on the type of object ωj and is expressed as p(x | ωj) (a probability density function, since x is continuous), known as the class-conditional probability density function. Therefore,
the pdf p(x | ω1) is the probability density function for feature x given that the state of nature is ω1, and p(x | ω2) has the same interpretation for ω2.
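Putting the pieces together, the sketch below computes posteriors P(ωj | x) from assumed priors and assumed Gaussian class-conditional densities, and decides for the class with the largest posterior; all the numbers are hypothetical:

import numpy as np
from scipy.stats import norm

# Hypothetical setup: p(x|w1) ~ N(4, 1), p(x|w2) ~ N(7, 1.5), priors P(w1)=0.6, P(w2)=0.4
priors = np.array([0.6, 0.4])
densities = [norm(loc=4.0, scale=1.0), norm(loc=7.0, scale=1.5)]

def posterior(x):
    # P(w_j | x) = p(x|w_j) P(w_j) / p(x) for the observed width x
    likelihoods = np.array([d.pdf(x) for d in densities])
    joint = likelihoods * priors
    return joint / joint.sum()          # divide by the evidence p(x)

for x in (3.5, 5.5, 7.5):
    post = posterior(x)
    print(x, post, "-> decide w%d" % (post.argmax() + 1))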
Each sub-sample will be used at least once as a validation dataset across all iterations (as in cross-validation, discussed later).
Now that we know the testing approach, the main part is how to evaluate the learning models with the validation and test datasets. Let's dig into it and learn the most common evaluation techniques that a tester must be aware of.
Evaluation Techniques:
There are certain terminologies that we need to understand before diving into the evaluation techniques: True Positives (TP) and True Negatives (TN) are the cases the model predicted correctly (as positive and negative respectively), while False Positives (FP) and False Negatives (FN) are the cases it predicted incorrectly.
With the above basic terminologies, now let's dive into the techniques:
1. Classification Accuracy: It's the most basic way of evaluating the learning model. It is the ratio of correct predictions (TP + TN) to the total number of predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN). If the ratio is high, then the model has a high prediction rate.
However, accuracy alone is not a good way to evaluate the model. For example, out of 100 samples of shapes, the model might have correctly predicted the True Negative cases while having a low success rate on the True Positive ones. The ratio/prediction rate may look good/high, yet the overall model fails to identify the correct rectangular shapes.
2. Confusion Matrix: It's a square N*N matrix where N is the number of classes that the model needs to classify. It is best used for classification models that categorize an outcome into a finite set of values; these values are known as labels. One axis is the label that the model predicted and the other is the actual label. To understand this better, let's categorize the shapes into 3 labels [Rectangle, Circle, and Square]. As there are 3 labels, we draw a 3*3 table (confusion matrix) in which one axis is the actual and the other the predicted label.
[Confusion matrix: 3 (Actual) * 3 (Predicted) table for the shape labels; the Remarks column in the original table was for explanation only.]
With the above matrix, we can calculate the important metrics that describe the positive prediction rate.
Precision: Precision identifies how often the model was correct when predicting the positive class: Precision = TP / (TP + FP). Using the matrix above, the precision of each label/class can be calculated this way.
Recall: Recall (also called the True Positive Rate) identifies how many of the actual positive cases the model found: Recall = TP / (TP + FN).
F1 Measure: The F1 measure combines the two as their harmonic mean: F1 = 2 * (Precision * Recall) / (Precision + Recall).
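These metrics can be computed with scikit-learn. The sketch below uses made-up true and predicted shape labels purely for illustration:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Hypothetical predictions for the three shape labels
y_true = ["Rectangle", "Circle", "Square", "Rectangle", "Circle", "Square", "Rectangle"]
y_pred = ["Rectangle", "Circle", "Rectangle", "Rectangle", "Circle", "Square", "Circle"]

labels = ["Rectangle", "Circle", "Square"]
print(confusion_matrix(y_true, y_pred, labels=labels))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, labels=labels, average=None))
print("recall   :", recall_score(y_true, y_pred, labels=labels, average=None))
print("f1       :", f1_score(y_true, y_pred, labels=labels, average=None))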
There is another evaluation technique called ROC [Receiver Operating Characteristic] and AUC [Area Under the ROC Curve], which requires plotting a graph of two parameters [True Positive Rate (TPR, or Recall) and False Positive Rate (FPR)] for various thresholds. However, we will cover this evaluation technique in a later article.
The above described is a basic testing approach and evaluation technique for a system that is
embedded with learning capabilities.
Now that you have a model running in Production, how do you know it’s adding value for your
business and customers? How do you know the parameters utilized in this model are better than
other parameters? How do you know what you are doing is working better than what you had
in production before? These are key questions you should ask yourself prior to productionizing
any machine learning (ML) model.
Experimentation is crucial to the ML model building strategy. Experiments may encompass
using different training and testing data, models with differing hyperparameters, running
different code (even if it's a small change), and often you may find yourself running the same
code but in different environment configurations. All experiments come with completely
different metrics; consequently, many Data Scientists find themselves lost in keeping track of
everything due to not following experiment best practices. Let’s get started on a few we have
picked up along the way.
Versioning
Why is version control important? First, it lowers any risk of erasing or writing over someone’s
work or making mistakes. Second, it’s a great way to incorporate collaboration between
colleagues. The most learned requirement in computer science is to ensure we have a
mechanism for version control of our code, but in the Data Science and ML process, it’s more
than just code requiring versioning. Notebooks, data, and the environment being utilized also
have a need for version control.
Notebook versioning. Versioning your notebook is a must for keeping track of, not only
your code, but also the results of each model run of experiments. If you intend on
sharing and collaborating with your notebook, you will want to ensure you and your
peers do not step on each other’s work or make mistakes.
Data versioning. Control of data is of utmost importance in ML. Data version control
allows for managing large datasets, project reproducibility, and the ability for scientists
to take advantage of new features while reusing existing features. Another advantage is
users will not have to remember which model uses which dataset – this mitigates risk
to model results. One way to have data version control is to save the incoming data in
specific locations with metadata tagging (or labeling) and logging to be able to
differentiate the old versus new.
Environment versioning. This type of versioning can mean a couple of things:
infrastructure configuration and specific frameworks being used. You will want to have
a good approach for versioning both types as this is also a crucial step in ensuring your
experiments are being run 1-to-1. For example, if your experiments involve using
TensorFlow, you will need to ensure this framework is imported for your research
comparisons. Another example is when you want to promote your experiments from
Development to a Staging environment and run automated tests. You would need to
ensure the Staging environment matches all the configurations that were used in
Development. Good practice is to create step-by-step instructions via a script or some
automated process so as to avoid missteps.
Commits
Code commits require versioning to mitigate the risk of merging production code with non-production code, as well as to avoid overwriting your peers' code and potentially making other detrimental mistakes. What happens if you run an experiment in between commits and forget to commit this code first? These are dubbed "dirty commits", which occur when developers don't follow development best practices. One best practice in this scenario is to
have users create a snapshot of their environment and code before running an experiment. This
way, they have the option of rolling back their changes to the code and configurations prior to
experimenting.
Hyperparameters
All ML models have hyperparameters to help control the behavior of the training process of
the algorithms and have a great impact on how the model will perform. To find the optimal
combinations of parameters for the best results, you will find yourself running many
experiments. In doing so, keeping track of the parameters you used for each experiment can
become cumbersome; consequently, many scientists find themselves re-running experiments
due to forgetting all the combinations used. A best practice for experimenting with
hyperparameters is to incorporate a tracking process. One way to track is to log everything via
audit logging or some form of logging that will save those parameters for every experiment.
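One minimal way to implement such tracking is sketched below as a home-grown JSON-lines logger; dedicated experiment-tracking tools exist, and the function name, file name and example values here are purely illustrative:

import json, time, uuid

def log_experiment(params, metrics, path="experiments.jsonl"):
    # Append one experiment record (hyperparameters + metrics + timestamp) to a JSONL log
    record = {
        "run_id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "metrics": metrics,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage after a training run:
log_experiment(params={"model": "random_forest", "n_estimators": 200, "max_depth": 8},
               metrics={"accuracy": 0.91, "f1": 0.88})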
Metrics
What metrics should you track and save? Best practice: all of them. Metrics can change daily
or over a span of time depending on the use case and situation. For example, measuring the
performance of your current experiment may involve looking at a Confusion Matrix and
distribution of predictions, but if you only logged the data from the distribution, you could miss
out on remembering how the matrix performed and therefore waste time re-running the same
experiment to gather this extra metric. Another example of metric loss is not tracking the
timestamps of the data being collected; consequently, you may experience model decay and
not be able to incorporate proper model retraining techniques. If you are only tracking specific
metrics, you can miss out on new discoveries; moreover, proactively logging as much in
metrics as possible can help mitigate wasting time in the future.
A/B Testing
This form of testing is widely used by scientists to run different models against each other and
compare their performance on real-time data, in a controlled environment. Best practice is to
follow steps like the scientific method:
Form your hypothesis. For ML, you will want a null hypothesis (states that there
is no difference between the control and variant groups) and an alternate hypothesis
(the outcome you want your test to prove to be true).
Setup your control group and test group. Your control group would receive results from
Model A, and your test group would receive results from Model B. You would then
pull a sample of data via random sampling and from a specified sample size.
Perform A/B testing. How you run your A/B tests depends on your use case and requirements. We at Wallaroo provide three modes of experimenting for testing:
Random Split: Allows you to perform randomized control trial type experiment where
incoming data is sent to each model randomly. You can specify the percentage of
requests each model receives by assigning a ‘weight’. Weights are automatically
normalized into a percentage for you, so you don’t need to worry about them adding up
to a particular value. You can also specify a meta key field to ensure consistent handling of grouped requests. For example, you can specify a split_key of 'session_id' to make sure that requests from the same session are handled by the same (randomly chosen) model (a minimal routing sketch appears after this list).
Key Split: Allows you to specifically choose which model handles requests for a user (or group). For example, in a credit card fraud use case, if you want all 'gold' card users to go to one fraud prediction model and all 'black' card users to go to another, then you should specify 'card_type' to be the split_key.
Shadow Deploy: Allows you to test new models without removing the default/control
model. This is particularly useful for “burn-in” testing a new model with real world
data without displacing the currently proven model.
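To make the Random Split idea concrete, here is an illustrative routing sketch in plain Python. It is not the Wallaroo API, only a toy router that sends requests to models in proportion to their weights and pins a split key to one model:

import random

class RandomSplitRouter:
    # Toy sketch (not the Wallaroo API): route each request to one of several models
    # in proportion to its weight; an optional split key pins a group to one model.
    def __init__(self, models, weights):
        total = sum(weights)
        self.models = models
        self.weights = [w / total for w in weights]   # normalize weights to percentages
        self.assignments = {}                         # split_key -> chosen model

    def route(self, split_key=None):
        if split_key is not None and split_key in self.assignments:
            return self.assignments[split_key]
        choice = random.choices(self.models, weights=self.weights, k=1)[0]
        if split_key is not None:
            self.assignments[split_key] = choice
        return choice

router = RandomSplitRouter(["model_A", "model_B"], weights=[2, 1])   # 2:1 split
print([router.route(split_key="session_42") for _ in range(3)])     # same model every time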
Coming up with an effective experimentation strategy can be cumbersome but following some
best practices will assist in proper planning. By including versioning, commit tracking, metrics,
hyperparameter tracking and A/B testing, you will be able to keep track of all information and
results of your experiments to have the needed comparisons and confidence that you know
which setup produced the best results.
[Figure: F-measure for a binary problem, from Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar]
Different Methods to evaluate the performance
Usually when working with a machine learning model, we need 3 splits of our dataset.
1. Training set
2. Validation set
3. Test set
Training set is used to train our model by learning the parameters of the model.
Validation set is used to learn the best hyperparameters of our model using the performance
metrics defined above.
The test set is never-before-seen data. The performance of the model is calculated by applying the metrics mentioned above to the model whose parameters were learnt from the training set and whose hyperparameters were chosen on the validation set.
Note: These are not all possible performance metrics available in the literature. These are some
of the widely used ones.
The methods below are variations of the above.
Holdout Method
Split the learning sample into a training set and a test data set.
→ A model is induced on the training data set
→ Performance is evaluated on the test data set
Limitations:
→ Too little data for learning: the more data used for testing, the more reliable the performance estimate, but the less data is available for learning.
→ Interdependence of training and test data set: If a class is underrepresented in the training
data set, it will be overrepresented in the test data set and vice versa.
Random Subsampling
The holdout method can be repeated several times to improve the estimation of a classifier’s
performance. If the estimation is performed k times then, the overall performance can be the
average of each estimate.
→ This method also encounters some of the problems associated with the holdout method
because it does not utilise as much data as possible for training.
→ It also has no control over the number of times each record is used for training and testing.
Cross-Validation
Core idea:
use each record k − 1 times for training and exactly once for testing
aggregate the performance values over all k tests
k-fold cross validation
split the learning dataset into k equal-sized subsets (folds)
for i = 1, …, k: use the remaining k − 1 folds for training and the i-th fold for testing
aggregate the performance values over all k tests
leave-one-out cross validation
the special case of k-fold cross validation with k = N, where N is the number of records in the learning dataset
each test set contains only one record
computationally expensive
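A minimal k-fold sketch with scikit-learn's KFold, using synthetic data; k = 5 is an arbitrary choice, and KFold(n_splits=len(X)) would give leave-one-out:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, random_state=0)

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, test_idx in kf.split(X):
    # each record appears in exactly one test fold and in k-1 training folds
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))   # aggregated performance over the k tests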
Bootstrap
The methods presented so far assume that the training records are sampled without replacement.
It means that there are no duplicate records in the training and test set. In the bootstrap approach,
the training records are sampled with replacement. It means that a record already chosen for
training is put back into the original pool of records so that it is equally likely to be redrawn.
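A minimal bootstrap sketch, assuming synthetic data and using the records not drawn in each bootstrap sample (the out-of-bag records) as the test set:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=300, random_state=0)

scores = []
for b in range(50):
    # sample n records WITH replacement; the records not drawn form the out-of-bag test set
    idx = resample(np.arange(len(X)), replace=True, n_samples=len(X), random_state=b)
    oob = np.setdiff1d(np.arange(len(X)), idx)
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))

print(np.mean(scores))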
Hypothesis Testing
Hypothesis testing is a formal procedure for investigating our ideas about the world
using statistics. It is most often used by scientists to test specific predictions, called hypotheses,
that arise from theories.
There are 5 main steps in hypothesis testing:
1. State your research hypothesis as a null hypothesis (Ho) and alternate hypothesis
(Ha or H1).
2. Collect data in a way designed to test the hypothesis.
3. Perform an appropriate statistical test.
4. Decide whether to reject or fail to reject your null hypothesis.
5. Present the findings in your results and discussion section.
Though the specific details might vary, the procedure you will use when testing a hypothesis
will always follow some version of these steps.
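As a small illustration of steps 3 and 4, the sketch below applies a two-sample t-test from scipy to made-up scores of two models and compares the p-value to a significance level of 0.05:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical example: accuracy of model A vs model B over 10 cross-validation folds
scores_a = rng.normal(loc=0.85, scale=0.02, size=10)
scores_b = rng.normal(loc=0.88, scale=0.02, size=10)

# H0: the two models have the same mean score; Ha: the means differ
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
alpha = 0.05
print(t_stat, p_value, "reject H0" if p_value < alpha else "fail to reject H0")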
import pandas as pd
import numpy as np

data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/social.csv")
print(data.head())
Age EstimatedSalary Purchased
0 19 19000 0
1 35 20000 0
2 26 43000 0
3 27 57000 0
4 19 76000 0
The dataset I'm using here is based on social media marketing. I won't analyze this dataset at this time, but when building your own project you should show a detailed exploration of your data.
Now let’s move forward to the task of comparing the performance of classification algorithms
in machine learning. Here you can either choose only one performance evaluation metric or
more, but the process will remain the same as shown in the code below:
x = np.array(data[["Age", "EstimatedSalary"]])
y = np.array(data["Purchased"])
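# Assumed setup (not shown in this excerpt): the classifiers being compared and the
# train/test split must be created before fitting them. A minimal sketch consistent
# with the variable names used below:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
knearestclassifier = KNeighborsClassifier()
decisiontree = DecisionTreeClassifier()
logisticregression = LogisticRegression()
passiveAggressive = PassiveAggressiveClassifier()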
knearestclassifier.fit(xtrain, ytrain)
decisiontree.fit(xtrain, ytrain)
logisticregression.fit(xtrain, ytrain)
passiveAggressive.fit(xtrain, ytrain)