
UNIT3: Modelling and Evaluation

Subject: Machine Learning (3170724)


Reference Book: Machine Learning, Saikat Dutt, S. Chandramouli, Das, Pearson.
Notes compiled by: Bhagirath Prajapati, Computer Engineering dept., ADIT.

 Basics
o Model: Abstraction is a significant step in the learning process, as it
represents raw input data in a summarized and structured format from which
meaningful insight can be obtained. This structured representation of raw input
data as a meaningful pattern is called a model.
 The model might take different forms. It might be a mathematical equation, a
graph or tree structure, a computational block, etc. The decision regarding which
model is to be selected for a specific data set is driven by the learning task,
based on the problem to be solved and the type of data.
 For example, when the problem is related to prediction and the target field is
numeric and continuous, a regression model is chosen.
o Training: The process of selecting a model and fitting it to a data set is
called model training.
o Bias:
 Generalization searches through the huge set of abstracted knowledge to come
up with a small and manageable set of key findings. It is not possible to do an
exhaustive search by reviewing each of the abstracted findings one by one.
 If the outcome is systematically incorrect, the learning is said to have a bias.
 Model selection
o Predictive models
 Models for supervised learning, or predictive models, as is understandable from
the name itself, try to predict a certain value using the values in an input data set.
 The predictive models have a clear focus on what they want to learn and how
they want to learn.
 Ex. Predicting win/loss in a cricket match; predicting whether a transaction is
fraudulent
 The models which are used for prediction of target features of categorical value
are known as classification models.
 The target feature is known as a class, and the categories into which the class
values are divided are called levels.
 Ex. k-Nearest Neighbor (kNN), Naïve Bayes, and Decision Tree.
 The models which are used for prediction of the numerical value of the target
feature of a data instance are known as regression models.
 Linear Regression is a popular regression model. (Logistic Regression, despite
its name, is generally used for classification rather than numeric prediction.)
o Descriptive models
 Models for unsupervised learning or descriptive models are used to describe a
data set or gain insight from a data set.
 Based on the value of all features, interesting patterns or insights are derived
about the data set.
 Descriptive models which group together similar data instances, i.e. data
instances having similar values of the different features, are called clustering
models.

 Ex. Customer grouping or segmentation based on social, demographic, ethnic,
etc. factors; grouping of music based on different aspects like genre, language,
time-period, etc.
 The most popular model for clustering is k-Means
 Training a model (supervised learning)
o Holdout method
 In case of supervised learning, a model is trained using the labelled input data.
 In the holdout method, a subset of the input data is used as the test data for
evaluating the performance of a trained model.
 In general 70%–80% of the input data (which is obviously labelled) is used for
model training. The remaining 20%–30% is used as test data for validation of the
performance of the model.

 To make sure that the data in both the buckets are similar in nature, the division
is done randomly. Random numbers are used to assign data items to the
partitions. This method of partitioning the input data into two parts – training
and test data – is depicted in Figure 3.1.
 Once the model is trained using the training data, the labels of the test data are
predicted using the model’s target function.
 Then the predicted value is compared with the actual value of the label. This is
possible because the test data is a part of the input data with known labels.
 The performance of the model is in general measured by the accuracy of
prediction of the label value.
 Issue: An obvious problem in this method is that the division of data of different
classes into the training and test data may not be proportionate.
 This situation is worse if the overall percentage of data related to certain
classes is much less compared to other classes.
 This problem can be addressed to some extent by applying stratified
random sampling in place of simple random sampling.
 In the case of stratified random sampling, the whole data is broken into
several homogeneous groups or strata, and a random sample is selected
from each such stratum.
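 As a concrete illustration, below is a minimal sketch of a holdout split using
scikit-learn (assumed available); the data set is a placeholder:

# Hedged sketch: 70/30 holdout split with stratified random sampling.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)        # placeholder labelled input data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,       # 30% held out as test data, 70% for training
    random_state=42,     # random partitioning, made reproducible
    stratify=y,          # keep class proportions similar in both buckets
)
print(len(X_train), "training records,", len(X_test), "test records")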

o K-fold Cross-validation method


 Issue with stratified sampling: for smaller data sets, it may be difficult to
divide the data of some of the classes proportionally between the training and
test data sets.
 Repeated holdout is used to ensure the randomness of the composed data sets.
 In repeated holdout, several random holdouts are used to measure the model
performance.
 In the end, the average of all performances is taken.
 This process of repeated holdout is the basis of k-fold cross-validation technique.
 In k-fold cross-validation, the data set is divided into k completely distinct or non-
overlapping random partitions called folds.

 Figure 3.3 depicts the detailed approach of selecting the ‘k’ folds in k-fold
cross-validation. As can be observed in the figure, each of the circles represents
a record in the input data set, whereas the different colors indicate the different
classes that the records belong to. The entire data set is broken into ‘k’ folds,
out of which one fold is selected in each iteration as the test data set; the fold
selected as the test data set is different in each of the ‘k’ iterations. Also note
that though the circles in Figure 3.3 represent the records in the input data set,
the contiguous circles shown as folds do not mean that they are consecutive
records in the data set.
 Two popular methods of k-fold cross-validation:
 10-fold cross-validation (10-fold CV)
o 10-fold cross-validation is by far the most popular approach. In
this approach, for each of the 10 folds, each comprising
approximately 10% of the data, one fold is used as the test data
for validating the model trained on the remaining 9 folds (or 90%
of the data). This is repeated 10 times, once for each of the 10
folds being used as the test data, with the remaining folds as the
training data. The average performance across all folds is reported.
 Leave-one-out cross-validation (LOOCV)
o Leave-one-out cross-validation (LOOCV) is an extreme case of k-
fold cross-validation which uses one record or data instance at a
time as the test data. This is done to maximize the amount of data
used to train the model. Obviously, the number of iterations for
which it has to be run is equal to the total number of records in
the input data set. Hence, it is computationally very expensive and
not used much in practice.
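 A minimal sketch of both variants using scikit-learn (assumed available); the
data set and classifier are placeholders:

# Hedged sketch: 10-fold CV and LOOCV on a placeholder data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation: average performance across the 10 folds.
scores_10fold = cross_val_score(model, X, y, cv=10)
print("10-fold CV accuracy:", scores_10fold.mean())

# LOOCV: one iteration per record, hence expensive on large data sets.
scores_loocv = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores_loocv.mean())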

o Bootstrap sampling
 It uses the technique of Simple Random Sampling with Replacement (SRSWR)
 Bootstrapping randomly picks data instances from the input data set, with the
possibility of the same data instance being picked multiple times.

 This technique is particularly useful in the case of input data sets of small size,
i.e. having a very small number of data instances.
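 A minimal sketch of bootstrap sampling with NumPy (assumed available); the
data set is a placeholder:

# Hedged sketch: Simple Random Sampling With Replacement (SRSWR).
import numpy as np

rng = np.random.default_rng(seed=42)
data = np.arange(10)                     # placeholder data set of 10 instances

# Draw a bootstrap sample of the same size, with replacement:
boot_idx = rng.choice(len(data), size=len(data), replace=True)
bootstrap_sample = data[boot_idx]        # some instances repeat, some are absent

# Instances never picked ("out-of-bag") can serve as test data:
oob_mask = ~np.isin(np.arange(len(data)), boot_idx)
print("bootstrap sample:", bootstrap_sample)
print("out-of-bag instances:", data[oob_mask])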
o Lazy vs. Eager learner
 Eager learner: when the test data comes in for classification, the eager learner is
ready with the model and doesn’t need to refer back to the training data.
 Eager learners take more time in the learning phase than the lazy learners.
 Some of the algorithms which adopt eager learning approach include Decision
Tree, Support Vector Machine, Neural Network, etc.
 Lazy learning, on the other hand, completely skips the abstraction and
generalization processes, as explained in context of a typical machine learning
process.
 It uses the training data as-is, and uses that knowledge to classify the
unlabelled test data.
 Since lazy learning uses training data as-is, it is also known as rote learning
(i.e. a memorization technique based on repetition).
 Due to its heavy dependency on the given training data instances, it is also
known as instance-based learning.
 Lazy learners take very little time in training because not much of training actually
happens.

 However, it takes quite some time in classification, as for each tuple of test
data a comparison-based assignment of label happens. One of the most popular
algorithms for lazy learning is k-Nearest Neighbour.
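 A minimal sketch contrasting the two styles with scikit-learn (assumed
available); the data set is a placeholder:

# Hedged sketch: eager learner (decision tree) vs. lazy learner (kNN).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Eager: builds an abstracted model (the tree) up front, during fit().
eager = DecisionTreeClassifier().fit(X_train, y_train)

# Lazy: fit() essentially just stores the training data; the real work
# (distance comparisons) is deferred until predict() is called.
lazy = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

print("eager accuracy:", eager.score(X_test, y_test))
print("lazy accuracy:", lazy.score(X_test, y_test))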

 MODEL REPRESENTATION AND INTERPRETABILITY


o A key consideration in learning the target function from the training data is the
extent of generalization. This is because the input data is just a limited, specific
view, and the new, unknown data in the test data set may differ quite a bit from
the training data.
o Fitness of a target function approximated by a learning algorithm determines how
correctly it is able to classify a set of data it has never seen.
o Underfitting:
 Underfitting may occur when trying to represent non-linear data with a linear
model.
 Many times underfitting happens due to the unavailability of sufficient training
data.
 Underfitting results in both poor performance on the training data and poor
generalization to the test data.
 Solution:
 Using more training data
 Increasing the model complexity, e.g. by adding more relevant features

o Overfitting:
 Overfitting refers to a situation where the model has been designed in such a way
that it emulates the training data too closely.
 Deviations in the training data, like noise or outliers, get embedded in the model.
 Overfitting, in many cases, occurs as a result of trying to fit an excessively
complex model to closely match the training data.
 The target function, in these cases, tries to make sure all training data points are
correctly partitioned by the decision boundary.
 Overfitting results in good performance with training data set, but poor
generalization and hence poor performance with test data set.
 Solution:
 using re-sampling techniques like k-fold cross-validation
 holding back a validation data set
 pruning, i.e. removing the nodes which have little or no predictive power
for the given machine learning problem
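 To see both failure modes concretely, here is a minimal sketch that varies
polynomial degree on synthetic data (NumPy and scikit-learn assumed available):

# Hedged sketch: degree 1 underfits, degree 15 overfits a noisy sine curve.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)    # noisy target
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_err = np.mean((model.predict(X) - y) ** 2)
    test_err = np.mean((model.predict(X_test) - y_test) ** 2)
    # degree 1: high train AND test error (underfit);
    # degree 15: low train error but higher test error (overfit).
    print(f"degree {degree:2d}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")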

o Bias – variance trade-off


 In supervised learning, the class value assigned by the learning model built based
on the training data may differ from the actual class value.
 This error in learning can be of two types – errors due to ‘bias’ and errors due
to ‘variance’.
 The bias-variance trade-off is a central problem in supervised learning. Ideally,
one wants to choose a model that both accurately captures the regularities in its
training data and generalizes well to unseen data. Unfortunately, it is typically
impossible to do both simultaneously.
 Errors due to ‘Bias’:
 Errors due to bias arise from simplifying assumptions made by the model
to make the target function less complex or easier to learn.
 In short, it is due to underfitting of the model.
 Parametric models generally have high bias, making them easier to
understand/interpret and faster to learn.
 These algorithms have poor performance on data sets which are complex in
nature and do not align with the simplifying assumptions made by the
algorithm.
 Errors due to ‘Variance’:
 Errors due to variance occur from differences in the training data sets used
to train the model.
 Ideally, the difference between the data sets should not be significant, and
models trained using different training data sets should not be too
different.
 Increasing the bias will decrease the variance, and
 Increasing the variance will decrease the bias
 The goal of supervised machine learning is to achieve a balance between
bias and variance.

 Supervised learning classification:


o Basic understanding:
 1. the model predicted win and the team won (TP)
 2. the model predicted win and the team lost (FP)
 3. the model predicted loss and the team won (FN)
 4. the model predicted loss and the team lost (TN)
o Model accuracy: Model accuracy is given by the total number of correct classifications
(either as the class of interest, i.e. True Positive, or as not the class of interest, i.e. True
Negative) divided by the total number of classifications done.
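In terms of the confusion matrix counts, the standard formula is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)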

o Confusion matrix: A matrix containing correct and incorrect predictions in the form of
TPs, FPs, FNs and TNs is known as confusion matrix.
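A small illustrative confusion matrix for the win/loss example (the counts are
hypothetical, for illustration only):

                    Predicted: win    Predicted: loss
Actual: win           TP = 85            FN = 15
Actual: loss          FP = 10            TN = 90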

o Error rate: The percentage of misclassifications is indicated using error rate which is
measured as:
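In terms of the confusion matrix counts:
Error rate = (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy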

o Kappa: Sometimes correct predictions, both TPs as well as TNs, may happen by mere
coincidence. Since these chance occurrences inflate model accuracy, ideally they
should be discounted. The kappa value of a model indicates the model accuracy
adjusted for chance agreement. It is calculated using the formula below:
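Kappa = (P(a) − P(e)) / (1 − P(e))

where P(a) is the observed agreement (i.e. the model accuracy) and P(e) is the
agreement expected by mere chance, computed from the row and column totals of
the confusion matrix.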

Note: Kappa value can be 1 at the maximum, which represents perfect agreement between model’s
prediction and actual values.

o Sensitivity & specificity:


 Sometimes accuracy alone is not enough for specific types of applications
 A low count of FNs is desirable for high accuracy, but an FN may represent a
critical case. For example, if a tumour is malignant but wrongly classified as
benign by the classifier, the repercussions of such misclassification can be fatal.
 Sensitivity: The sensitivity of a model measures the proportion of positive
cases (TP examples) which were correctly classified.
 In case of the malignancy prediction of tumours, class of interest is ‘malignant’.
Sensitivity measure gives the proportion of tumours which are actually
malignant and have been predicted as malignant.
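 In terms of the confusion matrix counts: Sensitivity = TP / (TP + FN)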

 Specificity: Specificity of a model measures the proportion of negative examples
which have been correctly classified.
 In the context of malignancy prediction of tumours, specificity gives the
proportion of benign tumours which have been correctly classified.
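 In terms of the confusion matrix counts: Specificity = TN / (TN + FP)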

o Precision & recall


 Precision: Precision gives the proportion of positive predictions which are truly
positive
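 In terms of the confusion matrix counts: Precision = TP / (TP + FP)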

 Recall: Recall indicates the proportion of correct predictions of positives to the
total number of actual positives.
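 In terms of the confusion matrix counts: Recall = TP / (TP + FN), i.e. the same
measure as sensitivity.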

o F-measure: It takes the harmonic mean of precision and recall, calculated as:
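F-measure = (2 × Precision × Recall) / (Precision + Recall)

 As a consolidated illustration, here is a minimal Python sketch computing all of
the above metrics from hypothetical confusion matrix counts (the numbers match
the illustrative matrix shown earlier and are made up):

# Hedged sketch: classification metrics from hypothetical confusion matrix counts.
TP, FN, FP, TN = 85, 15, 10, 90

accuracy    = (TP + TN) / (TP + TN + FP + FN)
error_rate  = 1 - accuracy
sensitivity = TP / (TP + FN)              # recall / true positive rate
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
f_measure   = (2 * precision * sensitivity) / (precision + sensitivity)

print(f"accuracy={accuracy:.3f}  error rate={error_rate:.3f}")
print(f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}")
print(f"precision={precision:.3f}  F-measure={f_measure:.3f}")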

o Receiver operating characteristic (ROC) curves:


 It helps in visualizing the performance of a classification model.
 It shows the efficiency of a model in the detection of true positives while
avoiding the occurrence of false positives.

TPR = Sensitivity, FPR = 1 − Specificity

 In the ROC curve, the FP rate is plotted on the horizontal axis against the
true positive rate on the vertical axis, at different classification thresholds.

 This curve gives an indication of the predictive quality of a model.


 An excellent model has an AUC near 1, which means it has a good measure of
separability. A poor model has an AUC near 0, which means it has the worst
measure of separability; in fact, it is inverting the result, predicting 0s as
1s and 1s as 0s. When the AUC is 0.5, the model has no class separation
capacity whatsoever.

 The area under curve (AUC) value, as shown in the figure, is the area of the
two-dimensional space under the curve extending from (0, 0) to (1, 1), where each
point on the curve gives a pair of true and false positive rates at a specific
classification threshold.

 The AUC of classifier 1 is more than the AUC of classifier 2. So, we can draw the
inference that classifier 1 is better than classifier 2.
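 A minimal sketch of computing ROC points and AUC with scikit-learn (assumed
available); the labels and scores are placeholders:

# Hedged sketch: ROC curve points and AUC from predicted scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])                    # placeholder labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])   # model scores

# FPR/TPR pairs at each classification threshold:
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)

# Area under the ROC curve; closer to 1 means better separability.
print("AUC:", roc_auc_score(y_true, y_score))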

 IMPROVING PERFORMANCE OF A MODEL


o Model parameter tuning
 It is the process of adjusting the model fitting options.
 In the popular classification model k-Nearest Neighbour (kNN), performance can
be tuned by using different values of ‘k’, the number of nearest neighbours to be
considered.
 In the same way, the number of hidden layers can be adjusted to tune the
performance of a neural network model.
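 A minimal sketch of tuning ‘k’ via an exhaustive grid search with
cross-validation (scikit-learn assumed available; the data set is a placeholder):

# Hedged sketch: tuning the kNN hyperparameter k with 5-fold CV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},  # candidate values of k
    cv=5,                                             # 5-fold cross-validation
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
print("best CV accuracy:", search.best_score_)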

o Ensemble
 As an alternative approach to improving the performance of a single model,
several models may be combined together.
 This approach of combining different models with diverse strengths is known as
an ensemble
 The outputs from the different models are combined using a combination
function. A very simple combining strategy, say in the case of a prediction task
using an ensemble, can be majority voting among the different models combined.
For example, if 3 out of 5 models predict ‘win’ and 2 predict ‘loss’, then the
final outcome of the ensemble using majority vote would be ‘win’.

 One of the earliest and most popular ensemble models is bootstrap aggregating
or bagging.
 Bagging uses the bootstrap sampling method to generate multiple training data
sets. These training data sets are used to train a set of models using the same
learning algorithm. Then the outcomes of the models are combined by majority
voting (classification) or by averaging (regression).
 Boosting is another key ensemble-based technique. In this type of ensemble,
weaker learning models are trained on resampled data, and the outcomes are
combined using a weighted voting approach based on the performance of the
different models.
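 A minimal sketch of both ensemble styles with scikit-learn (assumed available);
the data set is a placeholder:

# Hedged sketch: bagging and boosting over decision tree base learners.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: bootstrap samples -> one model per sample -> majority vote.
bagging = BaggingClassifier(n_estimators=25, random_state=0)
print("bagging CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())

# Boosting: weak learners trained sequentially on reweighted data,
# combined by performance-weighted voting.
boosting = AdaBoostClassifier(n_estimators=25, random_state=0)
print("boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())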
