
UNIT03

Modelling and Evaluation


OUTLINE

• Selecting a Model: Predictive/Descriptive

• Training a Model for supervised learning

• Model representation and interpretability

• Evaluating performance of a model

• Improving performance of a model


• The basic learning process, irrespective of whether
the learner is a human or a machine, can be
divided into three parts:
– Data Input
– Abstraction
• abstraction is a significant step as it represents raw
input data in a summarized and structured format, such
that a meaningful insight is obtained from the data.
This structured representation of raw input data to the
meaningful pattern is called a model.
– Generalization
• Generalization searches through the huge set of
abstracted knowledge to come up with a small and
manageable set of key findings.
Selecting a model
• Y = f(X) + e
– where f is the target function
– X = independent variables/input
– Y = target variable/output
– e = random error term
– Cost function/error function
• helps to measure the extent to which the model goes wrong in
estimating the relationship between X and Y. In that sense, the cost
function tells how badly the model is performing.
– Loss function: a function defined on a single data point; the cost
function is typically an aggregate of the loss over all data points.
– Objective function
• The objective function takes in data and the model (along with its
parameters) as input and returns a value to be optimized during
training.
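The following minimal Python sketch (not from the slides; the function names are illustrative) shows a squared-error loss defined on individual data points and the mean squared error used as the corresponding cost function:

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    # Loss: defined on an individual data point
    return (y_true - y_pred) ** 2

def mse_cost(Y, Y_hat):
    # Cost: aggregates the loss over all data points and tells how badly
    # the model estimates the relationship between X and Y
    return np.mean(squared_error_loss(Y, Y_hat))

# Actual vs. predicted values of the target variable Y
Y = np.array([3.0, 5.0, 7.0])
Y_hat = np.array([2.5, 5.5, 6.0])
print(mse_cost(Y, Y_hat))  # 0.5
```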
• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Association analysis
• Reinforcement
• Most important factors while selecting a model for
machine learning
– The kind of problem we want to solve using machine
learning and
– The nature of the underlying data.
• Machine learning algorithms are broadly of two
types:
– models for supervised learning, which primarily
focus on solving predictive problems, and
– models for unsupervised learning, which solve
descriptive problems.
• Predictive model
– Supervised learning
– The predictive models have a clear focus on what
they want to learn and how they want to learn.
– The models which are used for prediction of target
features of categorical value are known as
classification models.
• k-Nearest Neighbor (kNN), Naïve Bayes, and Decision
Tree.
– The models which are used for prediction of the
numerical value of the target feature of a data
instance are known as regression models.
• Linear Regression is a popular regression model.
(Logistic Regression, despite its name, is used for
classification rather than regression.)
• Descriptive Model
– Models for unsupervised learning
– Descriptive models which group together similar
data instances, i.e. data instances having a similar
value of the different features are called clustering
models.
• K-means
– Descriptive models used for pattern discovery in
transactional data (e.g. market basket analysis) are
called association analysis models.
Training a model
(for supervised learning)
• Hold out method
• K-fold Cross-validation method
• Bootstrap sampling
• Lazy vs. Eager learner
Holdout method
• Division of the input data is random
• Random numbers are used to assign data items
to the partitions. This method of partitioning
the input data into two parts – training data and
test data – by holding back a part of the input
data for validating the trained model is known
as the holdout method.
• Effect of a larger training set: generally a better-learned model
• Effect of a larger test set: a more reliable estimate of model performance
• Problem with the holdout method
– the division of data of different classes into the
training and test data may not be proportionate.
– Solution: stratification
• In stratified random sampling, the whole data is
broken into several homogeneous groups or strata,
and a random sample is selected from each stratum.
This ensures that the generated random partitions
have proportionate representation of each class.
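A minimal sketch of the holdout method with stratification, using scikit-learn's train_test_split; the data set and the split ratio are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 30% of the input data as test data; stratify=y keeps the
# class proportions the same in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```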
K-fold Cross-validation method
• Repeated holdout method
– Problem: the randomly chosen test sets may overlap
across repetitions; k-fold cross-validation avoids this
by using each partition exactly once as test data.
• Two approaches of the k-fold cross-validation
method
– 10-fold cross-validation (10-fold CV)
– Leave-one-out cross-validation (LOOCV)
Overall approach of k-fold cross-validation
Detailed approach for fold selection
• Leave-one-out cross-validation (LOOCV)
– Leave-one-out cross-validation (LOOCV) is an
extreme case of k-fold cross-validation that uses
one record or data instance at a time as the test
data. This is done to maximize the amount of data
used to train the model. The number of iterations
for which it has to be run is equal to the total
number of data instances in the input data set.
Hence, it is computationally very expensive and
not used much in practice.
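A minimal sketch of 10-fold cross-validation and LOOCV with scikit-learn; the kNN classifier and the data set are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# 10-fold CV: each of the 10 folds is used exactly once as test data
kfold_scores = cross_val_score(model, X, y,
                               cv=KFold(n_splits=10, shuffle=True, random_state=42))
print("10-fold CV accuracy:", kfold_scores.mean())

# LOOCV: one data instance at a time is held out (as many iterations as instances)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", loo_scores.mean())
```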
Bootstrap sampling
• This technique is particularly useful for input
data sets of small size, i.e. having a very small
number of data instances.
• It uses Simple Random Sampling with
Replacement (SRSWR).
• Bootstrapping can create one or more training
data sets having ‘n’ data instances, with some of
the data instances repeated multiple times.
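A minimal sketch of bootstrap sampling (SRSWR) with NumPy; the toy data set of 10 instances is illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # a small input data set with n = 10 instances

# Draw n instances with replacement: some repeat, some are left out
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
out_of_bag = np.setdiff1d(data, bootstrap_sample)

print("bootstrap training set:", bootstrap_sample)
print("out-of-bag instances:  ", out_of_bag)
```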
Lazy vs. Eager learner
• Eager learners take more time in the learning phase
than lazy learners. Algorithms that adopt the eager
learning approach include Decision Tree, Support
Vector Machine, and Neural Network.
• Lazy learners take very little time in training because
not much training actually happens. However, they
take quite some time in classification, as for each
tuple of test data a comparison-based assignment
of label happens. One of the most popular
algorithms for lazy learning is k-Nearest Neighbour.
MODEL REPRESENTATION AND
INTERPRETABILITY

• Underfitting
• Overfitting
• Bias – variance trade-off
• Underfitting
– When the target function is kept too simple
– Unavailability of a sufficient training data set
– Underfitting results in both poor performance on
the training data and poor generalization to the
test data.

– Underfitting can be avoided by
• using more training data
• increasing model complexity, e.g. by adding more
relevant features
• Overfitting
– a situation where the model has been designed in
such a way that it emulates the training data too
closely.
– Noise and outliers may get embedded in the model
– Overfitting results in good performance on the
training data set, but poor generalization and
hence poor performance on the test data set.
– Overfitting can be avoided by
• using re-sampling techniques like k-fold cross-validation
• holding back a validation data set
• removing the nodes which have little or no predictive
power for the given machine learning problem
(pruning; see the sketch below)
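A minimal sketch of controlling overfitting by limiting a decision tree's complexity (a pruning-style restriction via max_depth); the data set is illustrative and the exact scores will vary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# The unconstrained tree typically fits the training data almost perfectly
# but tends to generalize worse than the constrained one
for name, tree in [("full tree", full_tree), ("depth-3 tree", pruned_tree)]:
    print(name, "train:", tree.score(X_train, y_train), "test:", tree.score(X_test, y_test))
```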
Bias – variance trade-off

• Bias
– Gap between predicted values and actual value
– Parametric models generally have high bias
making them easier to understand/interpret and
faster to learn. These algorithms have a poor
performance on data sets, which are complex in
nature and do not align with the simplifying
assumptions made by the algorithm.
– Underfitting results in high bias.
• Variance
– Errors due to variance arise from differences in the
training data sets used to train the model.
– The spread of the predicted values with respect to
each other
• Increasing a model's complexity typically lowers bias
but raises variance, and vice versa; a good model
strikes a balance between the two.
EVALUATING PERFORMANCE OF
A MODEL
• Supervised learning – classification
– Accuracy, error rate, kappa
– Sensitivity, specificity
– Precision, recall, F-measure
Understanding with cricket match win
example
• There are four possibilities with regards to the
cricket match win/loss prediction:
– The model predicted win and the team won (True Positive, TP)
– The model predicted win and the team lost (False Positive, FP)
– The model predicted loss and the team won (False Negative, FN)
– The model predicted loss and the team lost (True Negative, TN)
In this problem, the obvious class of interest is ‘win’.
• model accuracy is given by total number of
correct classifications (either as the class of
interest, i.e. True Positive or as not the class of
interest, i.e. True Negative) divided by total
number of classifications done.
Model accuracy = (TP + TN) / (TP + FP + FN + TN)
• Error rate: the percentage of misclassifications
Error rate = (FP + FN) / (TP + FP + FN + TN) = 1 − Model accuracy
• Kappa value of a model indicates the model accuracy
adjusted for the agreement expected purely by chance:
κ = (p_o − p_e) / (1 − p_e), where p_o is the observed
agreement (the model accuracy) and p_e is the expected
agreement by chance. Kappa value can be 1 at the
maximum, which represents perfect agreement between
the model’s predictions and the actual values.
• Sensitivity
– The sensitivity of a model measures the proportion
of positive cases (TP examples) which were
correctly classified.
Sensitivity = TP / (TP + FN)

• Specificity
– The specificity of a model measures the proportion
of negative examples which have been correctly
classified.
Specificity = TN / (TN + FP)
• There are two other performance measures of
a supervised learning model which are similar
to sensitivity and specificity.
– Precision: precision gives the proportion of
positive predictions which are truly positive.
Precision = TP / (TP + FP)
– Recall: recall indicates the proportion of correct
predictions of positives to the total number of
actual positives (it is the same measure as sensitivity).
Recall = TP / (TP + FN)
• Example:
Calculate model accuracy, error rate, kappa
value, sensitivity, specificity, precision, and recall
for the confusion matrix of the win/loss prediction
of the cricket match problem given below:
• F-measure is another measure of model
performance which combines precision and
recall. It takes the harmonic mean of
precision and recall:
F-measure = (2 × Precision × Recall) / (Precision + Recall)
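A minimal Python sketch that computes the measures above from a hypothetical win/loss confusion matrix (the counts below are illustrative, not the ones from the original example):

```python
TP, FP, FN, TN = 85, 4, 2, 9   # hypothetical confusion-matrix counts
total = TP + FP + FN + TN

accuracy    = (TP + TN) / total
error_rate  = (FP + FN) / total
sensitivity = TP / (TP + FN)            # same as recall
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
recall      = sensitivity
f_measure   = 2 * precision * recall / (precision + recall)

# Kappa: observed agreement adjusted for agreement expected by chance
p_o = accuracy
p_e = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / total ** 2
kappa = (p_o - p_e) / (1 - p_e)

print(accuracy, error_rate, kappa, sensitivity, specificity, precision, recall, f_measure)
```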
• Receiver operating characteristic (ROC) curves
– Receiver Operating Characteristic (ROC) curve
helps in visualizing the performance of a
classification model. It shows the efficiency of a
model in the detection of true positives while
avoiding the occurrence of false positives.
Supervised learning – regression

• A regression model which ensures that the
difference between the predicted and actual
values is low can be considered a good model.
• Simple linear regression: y = α + βx, where α is
the intercept and β is the slope of the fitted line.
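A minimal sketch of fitting a simple linear regression y = α + βx and measuring how far the predicted values are from the actual ones (MSE and R-squared); the synthetic data is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50).reshape(-1, 1)
y = 2.0 + 3.0 * x.ravel() + rng.normal(0, 1, size=50)  # true alpha=2, beta=3 plus noise

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

print("alpha (intercept):", model.intercept_)
print("beta (slope):     ", model.coef_[0])
print("MSE:              ", mean_squared_error(y, y_pred))
print("R-squared:        ", r2_score(y, y_pred))
```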
Unsupervised learning – clustering

• What is clustering?
• Challenges which lie in the process of clustering:
– It is generally not known how many clusters can be formed
from a particular data set. This is completely open-ended in most
cases and is provided as a user input to a clustering algorithm.
– Even if the number of clusters is given, the same number of
clusters can be formed with different groups of data instances.
In other words, a clustering outcome cannot simply be judged
right or wrong, so its quality has to be measured.
• Popular approaches adopted for cluster quality
evaluation:
– Internal evaluation
– External evaluation
• Internal Evaluation
– The internal evaluation methods generally measure cluster quality
based on the homogeneity of data belonging to the same cluster and
the heterogeneity of data belonging to different clusters.
– The silhouette coefficient, one of the most popular internal
evaluation methods, uses distance (Euclidean or Manhattan distance
most commonly) between data elements as a similarity measure.
– The value of the silhouette width ranges between −1 and +1, with a high
value indicating high intra-cluster homogeneity and inter-cluster
heterogeneity.
• There are four clusters, namely clusters 1, 2, 3, and 4. Let’s consider
an arbitrary data element ‘i’ in cluster 1, represented by the asterisk.
a(i) is the average of the distances a_i1, a_i2, …, a_in1 of the different
data elements from the i-th data element in cluster 1, assuming
there are n1 data elements in cluster 1. Mathematically,
a(i) = (a_i1 + a_i2 + … + a_in1) / n1

• In the same way, let’s calculate the distances of the arbitrary data
element ‘i’ in cluster 1 from the different data elements of
another cluster, say cluster 4, and take the average of all those
distances:
b14(average) = (b_i1 + b_i2 + … + b_in4) / n4
• where n4 is the total number of elements in cluster 4. In the same
way, we can calculate the values of b12(average) and b13(average).
b(i) is the minimum of all these values.
• Hence, we can say that b(i) = minimum [b12(average),
b13(average), b14(average)]
• The silhouette width of ‘i’ is then s(i) = (b(i) − a(i)) / max{a(i), b(i)},
and the silhouette coefficient of the clustering is the average of
s(i) over all data elements.
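A minimal sketch of internal evaluation using the average silhouette coefficient with scikit-learn; the synthetic data and the choice of k = 4 clusters are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Average silhouette width over all data elements: values close to +1 indicate
# high intra-cluster homogeneity and inter-cluster heterogeneity
print("silhouette coefficient:", silhouette_score(X, labels, metric="euclidean"))
```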
• External Evaluation
– In this approach, the class label is known for the data
set subjected to clustering.
– The clustering algorithm is assessed based on how
close its results are to those known class labels. For
example, purity is one of the most popular measures
of cluster quality – it evaluates the extent to which
clusters contain a single class.
– For a data set having ‘n’ data instances and ‘c’
known class labels which generates ‘k’ clusters,
purity is measured as:
Purity = (1/n) × Σ over the k clusters of the count of the
majority (most frequent) class in each cluster
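A minimal sketch of computing purity from known class labels; the label arrays below are illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

true_classes   = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # c = 3 known classes
cluster_labels = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0])  # k = 3 generated clusters

# Rows: clusters, columns: known classes; take the majority-class count per cluster
contingency = confusion_matrix(cluster_labels, true_classes)
purity = contingency.max(axis=1).sum() / len(true_classes)
print("purity:", purity)  # 7 majority-class assignments out of 9 ≈ 0.78
```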
IMPROVING PERFORMANCE OF A MODEL
• Can we improve the performance of our model?
• Which model should be selected for which
machine learning task?
• We have already discussed earlier that model
selection is done based on several aspects:
– Type of learning task in hand, i.e. supervised or
unsupervised
– Type of the data, i.e. categorical or numeric
– Sometimes the problem domain
– Above all, experience in working with different models
to solve problems of diverse domains
• So, assuming that an appropriate model has been
selected, its performance can still be improved.
• Various methods to improve model
performance
– Model parameter tuning
• Model parameter tuning is the process of adjusting the
model fitting options.
• For example, the popular classification model
k-Nearest Neighbour (kNN) can be tuned by trying
different values of ‘k’, the number of nearest
neighbours to be considered (a sketch follows below).
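A minimal sketch of model parameter tuning: a grid search over different values of ‘k’ for kNN with 5-fold cross-validation (scikit-learn; the data set is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of the model fitting option 'k' (n_neighbors)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best k:", search.best_params_["n_neighbors"])
print("best cross-validated accuracy:", search.best_score_)
```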
• Ensemble
– Combining different models with diverse strengths is known as ensembling.
– An ensemble helps in averaging out the biases of the different underlying models and
also in reducing the variance. Ensemble methods combine weaker learners to
create stronger ones.
• Various methods of Ensemble
– bootstrap aggregating or bagging.
– Boosting
– Random Forest
• Following are the typical steps in the ensemble process:
– Build a number of models based on the training data.
– To diversify the models generated, the training data subset can be varied
using the allocation function. Sampling techniques like bootstrapping may be
used to generate unique training data sets.
– Alternatively, the same training data may be used but with quite different
models, e.g. SVM, neural network, kNN, etc.
– The outputs from the different models are combined using a combination
function. A very simple combination strategy, say for a prediction task
using an ensemble, is majority voting across the combined models. For
example, if 3 out of 5 models predict ‘win’ and 2 predict ‘loss’, the final
outcome of the ensemble using majority vote would be ‘win’.
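A minimal sketch of an ensemble that combines quite different models (SVM, kNN, decision tree) with majority voting as the combination function; the data set and the three base models are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ],
    voting="hard",  # hard voting = majority vote of the combined models
)
print("ensemble CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```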
