Unit 3
Basics
o Model: In the learning process, abstraction is a significant step because it
represents raw input data in a summarized and structured format, so that meaningful
insight can be obtained from the data. This structured representation of raw input data as a
meaningful pattern is called a model.
A model can take different forms. It might be a mathematical equation, a
graph or tree structure, a computational block, etc. The
decision about which model to select for a specific data set is driven
by the learning task, that is, by the problem to be solved and the type of data.
For example, when the problem is one of prediction and the target field is
numeric and continuous, a regression model is chosen.
o Training: The process of choosing a model and fitting that model to a data set is
called model training.
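As a rough illustration of what "fitting a model to a data set" means, the sketch below trains the simplest regression model, a straight line y = a + b*x, by ordinary least squares. The function name `fit_line` and the sample data are assumptions made for this example and do not come from the reference book.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

Here the model form (a line) is chosen first, and training only estimates its two parameters from the data.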
o Bias:
Generalization searches through the huge set of abstracted knowledge to come
up with a small and manageable set of key findings. It is not possible to do an
exhaustive search by reviewing each of the abstracted findings one by one, so
heuristics are used to guide the search, and heuristics can give wrong results.
If the outcome is systematically incorrect, the learning is said to have a bias.
Model selection
o Predictive models
Models for supervised learning, or predictive models, as the name suggests,
try to predict a certain value using the values in an input data set.
Predictive models have a clear focus on what they want to learn and how
they want to learn it.
Ex. Predicting win/loss in a cricket match, predicting whether a transaction is
fraudulent.
Models which are used to predict a target feature with categorical values
are known as classification models.
The target feature is known as a class, and the categories into which a class is
divided are called levels.
Ex. of classification models: k-Nearest Neighbor (kNN), Naïve Bayes, and Decision Tree.
Models which are used to predict the numerical value of the target
feature of a data instance are known as regression models.
Linear Regression is a popular regression model. Logistic Regression is a
regression technique by name, but because it models the probability of a class
it is normally used for classification.
o Descriptive models
Models for unsupervised learning or descriptive models are used to describe a
data set or gain insight from a data set.
Based on the value of all features, interesting patterns or insights are derived
about the data set.
Descriptive models which group together similar data instances, i.e. data
instances having similar values of the different features, are called clustering
models.
UNIT3: Modelling and Evaluation
Subject: Machine Learning (3170724)
Reference Book: Machine Learning, Saikat Dutt, S. Chandramouli, Das, Pearson.
Notes compiled by: Bhagirath Prajapati, Computer Engineering dept. ADIT.
To make sure that the data in both buckets (training and test) are similar in nature, the division
is done randomly. Random numbers are used to assign data items to the
partitions. This method of partitioning the input data into two parts, training and
test data (depicted in Figure 3.1), is known as the holdout method.
Once the model is trained using the training data, the labels of the test data are
predicted using the model’s target function.
Then the predicted value is compared with the actual value of the label. This is
possible because the test data is a part of the input data with known labels.
The performance of the model is in general measured by the accuracy of
prediction of the label value.
Issue: An obvious problem with this method is that the data of different
classes may not be divided proportionately between the training and test data.
The situation is worse if the overall percentage of data belonging to certain
classes is much lower than that of other classes.
This problem can be addressed to some extent by applying stratified
random sampling in place of simple random sampling.
In stratified random sampling, the whole data set is broken into
several homogeneous groups or strata, and a random sample is selected
from each such stratum.
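The partitioning described above can be sketched as follows, under the assumption that each class label forms one stratum. The function name `stratified_holdout`, the 30% test fraction, and the fraud/genuine data are illustrative choices, not taken from the reference book.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.3, seed=0):
    """Split record indices into train/test so each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)      # one stratum per class label

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)               # random sample within the stratum
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced data set: 10 fraud records and 90 genuine ones.
labels = ["fraud"] * 10 + ["genuine"] * 90
train_idx, test_idx = stratified_holdout(labels)
fraud_in_test = sum(1 for i in test_idx if labels[i] == "fraud")
print(len(train_idx), len(test_idx), fraud_in_test)
```

With plain random sampling the rare class could end up almost entirely in one partition; sampling per stratum keeps roughly 10% fraud records in both.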
Figure 3.3 depicts the detailed approach of selecting the ‘k’ folds in k-fold
cross-validation. In the figure, each circle represents a record
in the input data set, and the different colors indicate the different classes
that the records belong to. The entire data set is broken into ‘k’ folds, out of
which one fold is selected in each iteration as the test data set. The fold selected
as the test data set is different in each of the ‘k’ iterations. Also note that, although
the circles in figure 3.3 represent the records in the input data set, the contiguous
circles shown as a fold are not necessarily consecutive records in the
data set.
Two popular methods of k-fold cross-validation:
10-fold cross-validation (10-fold CV)
o 10-fold cross-validation is by far the most popular approach. In
this approach the data is divided into 10 folds, each comprising
approximately 10% of the data. One of the folds is used as the
test data for validating the performance of a model trained on the
remaining 9 folds (or 90% of the data). This is repeated 10 times,
once for each of the 10 folds being used as the test data and the
remaining folds as the training data. The average performance
across all folds is reported.
Leave-one-out cross-validation (LOOCV)
o Leave-one-out cross-validation (LOOCV) is an extreme case of
k-fold cross-validation which uses one record or data instance at a time
as the test data. This is done to maximize the number of records used to
train the model. The number of iterations for
which it has to be run is equal to the total number of records in the
input data set. Hence it is computationally very
expensive and not used much in practice.
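The two schemes above differ only in the value of k, which the following sketch tries to make explicit. The helper names `k_fold_splits` and `cross_validate`, and the `evaluate` callback (which stands in for "train on these records, test on those, return a score"), are assumptions introduced for this example.

```python
import random


def k_fold_splits(n_records, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)       # folds need not be contiguous records
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_records, k, evaluate):
    """Average the score returned by evaluate(train, test) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_records, k)]
    return sum(scores) / len(scores)


ten_fold = list(k_fold_splits(50, k=10))   # 10-fold CV: 10 iterations
loocv = list(k_fold_splits(50, k=50))      # LOOCV: k equals the number of records
print(len(ten_fold), len(loocv))
```

For 50 records, 10-fold CV trains 10 models while LOOCV trains 50, which is why LOOCV becomes expensive as the data set grows.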
o Bootstrap sampling
It uses the technique of Simple Random Sampling with Replacement (SRSWR):
bootstrapping randomly picks data instances from the input data set, with the
possibility of the same data instance being picked multiple times.
This technique is particularly useful for input data sets of small size, i.e.
those having very few data instances.
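A minimal sketch of sampling with replacement is shown below. The function name `bootstrap_sample` and the choice to draw as many instances as the original data set contains are assumptions for illustration.

```python
import random


def bootstrap_sample(data, seed=None):
    """Draw len(data) instances from data with replacement (SRSWR)."""
    rng = random.Random(seed)
    # Each draw is independent, so the same instance may be picked repeatedly
    # and some instances may not be picked at all.
    return [rng.choice(data) for _ in data]


data = ["r1", "r2", "r3", "r4", "r5"]
sample = bootstrap_sample(data, seed=42)
print(sample)
```

Repeating the call with different seeds gives many different training sets from the same small data set.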
o Lazy vs. Eager learner
Eager learner: when the test data comes in for classification, the eager learner is
ready with the model and doesn’t need to refer back to the training data.
Eager learners take more time in the learning phase than the lazy learners.
Some of the algorithms which adopt eager learning approach include Decision
Tree, Support Vector Machine, Neural Network, etc.
Lazy learning, on the other hand, completely skips the abstraction and
generalization processes described in the context of a typical machine learning
process.
It uses the training data as-is, and uses that knowledge to classify the
unlabelled test data.
Since lazy learning uses training data as-is, it is also known as rote learning (i.e.
a memorization technique based on repetition).
Due to its heavy dependency on the given training data instances, it is also known
as instance-based learning.
Lazy learners take very little time in training because not much training actually
happens.
However, a lazy learner takes quite some time in classification because, for each tuple of test data,
a comparison-based assignment of the label happens. One of the most popular
algorithms for lazy learning is k-nearest neighbor.
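The lazy-learning behaviour can be seen in a small k-nearest neighbor sketch: there is no training step at all, and every prediction compares the query with the stored training data. The function name `knn_predict`, Euclidean distance, k = 3, and the sample points are assumptions chosen for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # All the work happens at prediction time: one distance per training record.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training data, stored as-is.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["loss", "loss", "loss", "win", "win", "win"]
prediction = knn_predict(train_points, train_labels, (5.1, 5.0))
print(prediction)
```

Because the whole training set is scanned for each query, classification cost grows with the size of the training data.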
o Overfitting:
Overfitting refers to a situation where the model has been designed in such a way
that it emulates the training data too closely.
Deviation in the training data, like noise or outliers, gets embedded in the model.
Overfitting, in many cases, occurs as a result of trying to fit an excessively complex
model to closely match the training data.
The target function, in these cases, tries to make sure all training data points are
correctly partitioned by the decision boundary.
Overfitting results in good performance with training data set, but poor
generalization and hence poor performance with test data set.
Solution:
use re-sampling techniques like k-fold cross-validation
hold back a validation data set
remove the nodes which have little or no predictive power for the given
machine learning problem (pruning, in tree-based models)
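The effect of holding back a validation set can be illustrated with a toy comparison. In the sketch below, a model that memorizes every training point (including one noisy label) is compared with a much simpler rule. The data, the noisy record `(2, 1)`, and the hand-set threshold of 5 are all assumptions made up for this example; the threshold is not learned from the data.

```python
def nearest_neighbor_model(train):
    """Memorize the training data; predict the label of the closest x value."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def threshold_model(x):
    """A deliberately simple rule: class 1 when x is at least 5 (assumed cut-off)."""
    return 1 if x >= 5 else 0


def accuracy(predict, data):
    return sum(predict(x) == label for x, label in data) / len(data)


# Training data follows "x >= 5 means class 1", except the noisy record (2, 1).
train = [(1, 0), (2, 1), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
validation = [(1.6, 0), (2.4, 0), (3.5, 0), (5.5, 1), (6.5, 1), (8.5, 1)]

memorizer = nearest_neighbor_model(train)
train_acc_memorizer = accuracy(memorizer, train)
val_acc_memorizer = accuracy(memorizer, validation)
train_acc_simple = accuracy(threshold_model, train)
val_acc_simple = accuracy(threshold_model, validation)
print(train_acc_memorizer, val_acc_memorizer, train_acc_simple, val_acc_simple)
```

The memorizing model is perfect on the training data but loses accuracy on the validation data because it has absorbed the noisy record, while the simpler rule does slightly worse on training data and better on validation data.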
o Confusion matrix: A matrix containing the counts of correct and incorrect predictions in the form of
true positives (TPs), false positives (FPs), false negatives (FNs) and true negatives (TNs) is known as a confusion matrix.
o Error rate: The percentage of misclassifications is indicated by the error rate, which is
measured as:
$$\text{Error rate} = \frac{FP + FN}{TP + FP + FN + TN} = 1 - \text{Accuracy}$$
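As a small illustration, the four confusion-matrix counts and the error rate can be computed from paired actual and predicted labels. The sketch assumes a two-class problem and takes the error rate as (FP + FN) divided by the total number of predictions; the function names and the win/loss data are invented for the example.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, FN, TN) for a two-class problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, fn, tn


def error_rate(tp, fp, fn, tn):
    """Share of all predictions that are misclassifications."""
    return (fp + fn) / (tp + fp + fn + tn)


actual = ["win", "win", "win", "loss", "loss", "loss", "win", "loss"]
predicted = ["win", "loss", "win", "loss", "win", "loss", "win", "loss"]
tp, fp, fn, tn = confusion_counts(actual, predicted, positive="win")
print(tp, fp, fn, tn, error_rate(tp, fp, fn, tn))
```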
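o Kappa: Sometimes a correct prediction, whether a TP or a TN, may happen by mere
coincidence. Such chance agreements inflate model accuracy, so ideally they should be
discounted. The Kappa value of a model indicates the model accuracy adjusted for chance agreement. It is
calculated using the formula below:
$$\kappa = \frac{P(a) - P(p_r)}{1 - P(p_r)}$$
where P(a) is the proportion of observed agreement between actual and predicted values, i.e. (TP + TN) / (TP + FP + FN + TN), and P(p_r) is the proportion of agreement expected by chance.
Note: The Kappa value can be 1 at the maximum, which represents perfect agreement between the model’s
predictions and the actual values.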
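A sketch of the Kappa calculation from the four confusion-matrix counts is given below. It assumes the usual Cohen's kappa definition for two classes, in which the chance agreement is the sum, over both classes, of (share of actual records in the class) times (share of predicted records in the class). The function name and the counts are illustrative.

```python
def kappa(tp, fp, fn, tn):
    """Cohen's kappa for a two-class confusion matrix."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total                      # observed agreement
    # Chance agreement: P(actual +) * P(predicted +) + P(actual -) * P(predicted -)
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical counts: accuracy is 0.94, but much of it is expected by chance.
value = kappa(tp=85, fp=4, fn=2, tn=9)
print(round(value, 4))
```

With these counts the accuracy is 0.94 while Kappa is about 0.72, because a large share of the agreement would be expected even from chance predictions on such imbalanced data.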
In the ROC curve, the false positive rate (on the horizontal axis) is plotted against the true
positive rate (on the vertical axis) at different classification thresholds.
The area under curve (AUC) value, as shown in the figure, is the area of the
two-dimensional space under the curve extending from (0, 0) to (1, 1), where each
point on the curve gives the false positive rate and true positive rate at a specific
classification threshold.
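To make the construction concrete, the sketch below builds the ROC points by lowering the classification threshold one score at a time and then estimates the AUC with the trapezoidal rule. The function names, the scores, and the 0/1 label encoding are assumptions for this example; it also assumes both classes are present.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points from (0, 0) to (1, 1); labels are 1 or 0."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Everything scored at or above the threshold is predicted positive.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical classifier scores
labels = [1, 1, 0, 1, 0, 0]               # 1 = positive class
points = roc_points(scores, labels)
area = auc(points)
print(points, area)
```

An AUC of 1.0 would mean every positive record is scored above every negative one; here one negative outranks one positive, which lowers the area to 8/9.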
In the figure, the AUC of classifier 1 is greater than the AUC of classifier 2. So we can
infer that classifier 1 performs better than classifier 2.
o Ensemble
As an alternative to improving the performance of a single model, several
models may be combined together.
This approach of combining different models with diverse strengths is known as
an ensemble.
The outputs from the different models are combined using a combination
function. A very simple combination strategy, say for a prediction task,
is majority voting among the combined models. For
example, if 3 out of 5 models predict ‘win’ and 2 predict ‘loss’, then the final
outcome of the ensemble using majority vote would be a ‘win’.
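The majority-vote combination function can be sketched in a few lines. The function name `majority_vote` is an assumption; the five predictions mirror the win/loss example above.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


# Five models: three predict 'win', two predict 'loss'.
outcome = majority_vote(["win", "loss", "win", "win", "loss"])
print(outcome)
```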
One of the earliest and most popular ensemble models is bootstrap aggregating,
or bagging.
Bagging uses the bootstrap sampling method to generate multiple training data sets.
These training data sets are used to generate (or train) a set of models using the
same learning algorithm. Then the outcomes of the models are combined by
majority voting (classification) or by averaging (regression).
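The bagging steps described above can be sketched end to end with a deliberately tiny base learner. Everything specific here is an assumption for illustration: the "nearest class mean" learner on a single numeric feature, 25 models, and the win/loss data. Any learning algorithm could take the base learner's place.

```python
import random
from collections import Counter


def train_nearest_mean(sample):
    """Base learner: remember the mean feature value of each class."""
    sums, counts = {}, Counter()
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_nearest_mean(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def bagging_fit(data, n_models=25, seed=0):
    """Train one base model per bootstrap sample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # sampling with replacement
        models.append(train_nearest_mean(sample))
    return models


def bagging_predict(models, x):
    """Combine the models' outputs by majority vote (classification)."""
    votes = Counter(predict_nearest_mean(model, x) for model in models)
    return votes.most_common(1)[0][0]


data = [(1.0, "loss"), (1.5, "loss"), (2.0, "loss"), (2.5, "loss"),
        (7.0, "win"), (7.5, "win"), (8.0, "win"), (8.5, "win")]
models = bagging_fit(data)
print(bagging_predict(models, 8.2))
```

For a regression task the vote in `bagging_predict` would be replaced by the average of the models' numeric outputs.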
Boosting is another key ensemble-based technique. In this type of ensemble,
weaker learning models are trained on resampled data and the outcomes are
combined using a weighted voting approach based on the performance of the
different models.
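Only the combination step of boosting is sketched below: a weighted vote in which better-performing models count for more. The function name `weighted_vote` and the weights are hypothetical; how the weak learners are trained on resampled data and how their weights are derived depends on the specific boosting algorithm and is not shown.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across models."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Two weaker models predict 'loss', one stronger model predicts 'win'.
predictions = ["win", "loss", "loss"]
weights = [0.9, 0.3, 0.4]          # assumed performance-based weights
outcome = weighted_vote(predictions, weights)
print(outcome)
```

A plain majority vote would give 'loss' here; the weighted vote gives 'win' because the single stronger model outweighs the two weaker ones.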