
Exploratory Data Analytics

Module 2: Statistical learning & Model Selection
Prediction Accuracy
• A good learner is one with good prediction accuracy; in other words,
one with the smallest prediction error.
• A prediction error is the difference between the value a model predicts
and the value that is actually observed.
• Classical statistical analysis – goodness of fit
– R2, adjusted R2, standard error, and residual analysis
– https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-correlation/coefficient-of-determination-r-squared.html
• The assumption is that, with a high R2 value,
the model is expected to predict well for data
observed in the future.
• However, these measures will not tell us much
about the ability of the model to predict new
records.
• In the modern data science and analytics world, many prediction
performance measures are used.
• In all cases, the measures are based on the validation and test data.
• The validation set is more similar to the future records to be
predicted.
• These validation set observations are not used to select predictors.
Prediction accuracy measures
(with e_i = y_i − ŷ_i the prediction error for record i, and n the
number of records)
• Mean absolute error / deviation (MAE / MAD) = (1/n) Σ |e_i|
• Mean error (ME) = (1/n) Σ e_i
• Mean percentage error (MPE) = (100/n) Σ (e_i / y_i)
• Mean absolute percentage error (MAPE) = (100/n) Σ |e_i / y_i|
• Root mean squared error (RMSE) = sqrt( (1/n) Σ e_i^2 )
• Mean squared error (MSE) = (1/n) Σ e_i^2

Training and Test Error as A Function of
Model Complexity
• Errors that are based on the training set tell us about model fit.
• Errors based on the validation set measure the model's ability to
predict new data (prediction error).
• The fit of a model improves with the complexity of the model, i.e. as
more predictors are included in the model, the R2 value is expected to
improve.
• The training error keeps on decreasing as we increase the complexity
of the model. Here, complexity can be defined as the number of different
or complex input features (predictors) involved in the model.
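A minimal sketch of this behaviour, assuming synthetic data and a
scikit-learn polynomial-regression pipeline (the data and degrees are
illustrative, not from the slides): training error keeps falling as the
polynomial degree grows, while test error eventually rises.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # true signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 9, 15):                              # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```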
Overfitting a model
• However, reducing a model's training error too aggressively can lead
to the test error increasing rather than decreasing.
• This phenomenon is called overfitting and is one of the primary
obstacles to selecting good predictive models.
• If a learning technique learns the structure of the training data too
well, then when the model is applied to the data on which it was built,
it correctly predicts every sample value.
• In the extreme case, the model admits no error on the training data.
In addition to learning the general patterns in the data, the model has
also learned the characteristics of each training data point's unique
noise.
• This type of model is said to be overfit and will usually have poor
accuracy when predicting a new sample.
Bias-Variance Trade-off
• Machine learning is a branch of Artificial
Intelligence, which allows machines to perform
data analysis and make predictions.
• However, if the machine learning model is not accurate, it can make
prediction errors, and these prediction errors are usually known as bias
and variance (training error & test error).
• The main aim of ML/data science analysts is to reduce these errors in
order to get more accurate results.
• In machine learning, an error is a measure of
how accurately an algorithm can make
predictions for the previously unknown
dataset.
• On the basis of these errors, the machine learning model that can
perform best on the particular dataset is selected.
What is Bias?
https://www.javatpoint.com/bias-and-variance-in-machine-learning
• In general, a machine learning model analyses the data, finds patterns
in it, and makes predictions.
• While training, the model learns these patterns in the dataset and
applies them to test data for prediction.
• While making predictions, a difference occurs between the values
predicted by the model and the actual values, and this difference is
known as the bias error, or error due to bias.
• A high bias model also cannot perform well
on new data.
• Ways to reduce high bias (see the sketch after this list):
– High bias mainly occurs due to an overly simple model
• Increase the input features, as the model is underfitted
• Use more complex models, such as including some polynomial features
• Decrease the regularization term
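A minimal sketch of the last point, assuming synthetic data and
scikit-learn's Ridge regression (the data and alpha values are
illustrative): an over-regularized model underfits, and decreasing the
regularization term reduces the bias and improves the test error.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.5, 4.0]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for alpha in (1000.0, 10.0, 0.1):          # decreasing regularization strength
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"alpha={alpha:7.1f}  test MSE={test_mse:.3f}")
```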
What is a Variance Error?
• The variance is the amount by which the prediction would change if
different training data were used.
• In simple words, variance tells how much a random variable differs
from its expected value.
• Ideally, a model should not vary too much from one training dataset to
another, which means the algorithm should be good at capturing the
hidden mapping between the input and output variables.
• Variance errors are classified as either low variance or high
variance.
• Low variance means there is a small variation in the prediction of the
target function with changes in the training data set. In contrast, high
variance shows a large variation in the prediction of the target
function with changes in the training dataset.
• A model that shows high variance learns a lot and performs well on the
training dataset, but does not generalize well to an unseen dataset.
• As a result, such a model gives good results on the training dataset
but shows high error rates on the test dataset.
Problems of high variance error
• A high variance model leads to overfitting.
• It increases model complexity.

Ways to reduce high variance (see the sketch after this list):
• Reduce the input features or number of parameters, as the model is
overfitted.
• Do not use an overly complex model.
• Increase the training data.
• Increase the regularization term.
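A minimal sketch, assuming synthetic data and scikit-learn (the degree
and alpha are illustrative): a deliberately flexible polynomial model
fitted on few observations overfits, and adding a regularization term
(Ridge) reduces the variance and the test error.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_train, y_train, X_test, y_test = X[:30], y[:30], X[30:], y[30:]   # small training set

for name, reg in [("no regularization", LinearRegression()),
                  ("ridge, alpha=1.0  ", Ridge(alpha=1.0))]:
    # degree-12 polynomial features make the model flexible (high variance)
    model = make_pipeline(PolynomialFeatures(degree=12), reg).fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {test_mse:.3f}")
```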
•Low-Bias, Low-Variance: the ideal combination; predictions are accurate
on average and consistent across training sets, but it is rarely
achievable in practice.
•Low-Bias, High-Variance: predictions are accurate on average but
inconsistent; typical of an overfitted, overly complex model.
•High-Bias, Low-Variance: predictions are consistent but inaccurate on
average; typical of an underfitted, overly simple model.
•High-Bias, High-Variance: predictions are both inaccurate on average
and inconsistent.
Bias-Variance trade-off.
• While building a machine learning model, it is really important to
take care of bias and variance in order to avoid overfitting and
underfitting in the model.
• If the model is very simple with fewer parameters, it may have low
variance and high bias. Whereas, if the model has a large number of
parameters, it will have high variance and low bias.
• So, a balance must be struck between the bias error and the variance
error, and this balance is known as the Bias-Variance trade-off.
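The standard decomposition behind this trade-off, written out explicitly
(it is implied but not stated in the slides; sigma^2 denotes the
irreducible noise variance):

```latex
% Expected test error at a point x_0 for a fitted model \hat{f}
E\big[(y_0 - \hat{f}(x_0))^2\big]
  = \underbrace{\big(\mathrm{Bias}[\hat{f}(x_0)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}[\hat{f}(x_0)]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```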
Introduction to cross-validation
• An ideal predictor is one which learns all the structure in the data
but none of the noise. While the prediction error (PE) on the training
data reduces monotonically with increasing model complexity, the same is
not true for test data.
• Bias and variance move in opposing directions, and at a suitable
bias-variance combination the PE on the test data is at its minimum. The
model that achieves this lowest possible PE is the best prediction
model.
• Cross-validation is a comprehensive set of data splitting techniques
which helps to estimate this minimum point of the test PE.
Cross-validation: data splitting methods
• Why data splitting?
• What data splitting methods have we discussed so far?
– Holdout sample: training & test data
Demerits of the holdout method
• In a sparse data set, one may not have the luxury to set
aside a reasonable portion of the data for testing.
• If we happen to have a 'bad' split, the estimate is not
reliable.
Three-way Split: Training, Validation
and Test Data
• The available data is partitioned into three sets:
training, validation and test set.
• The prediction model is trained on the training
set and is evaluated on the validation set.
• A typical split is 50% for the training data and 25% each for the
validation and test sets.

 Training Data: Set of data used for learning (by the model), that is,
to fit the parameters of the machine learning model.
 Validation data:
– Set of data used to provide an unbiased evaluation of a model fitted
on the training dataset while tuning the model's hyperparameters.
– It also plays a role in other forms of model preparation, such as
feature selection and threshold cut-off selection.
– Training and validation may be iterated a few times till a 'best'
model is found.
– A set of examples used to tune the parameters of a classifier, for
example to choose the number of hidden units in a neural network.
• Test Dataset
– Set of data used to provide an unbiased evaluation of the final model
fitted on the training dataset.
– After training, validating and selecting a model, its performance is
assessed on this held-out subset of the data (the test data) before the
model is taken to production.
– After assessing the final model on the test set, the model must not be
fine-tuned any further.
• Unfortunately, data insufficiency often does not allow a three-way
split. A minimal splitting sketch follows below.
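A minimal sketch of the 50/25/25 three-way split described above,
assuming scikit-learn's train_test_split and placeholder data (the
arrays are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder feature matrix
y = np.arange(1000)                  # placeholder target

# First split off 50% for training; then split the rest 50/50 into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 500 250 250
```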
Cross-validation: K-fold cross-validation
• The original sample is randomly partitioned into K equal-
sized (or almost equal sized) subsamples.
• Of the K subsamples, a single subsample is retained as the
test set for estimating the PE, and the remaining K-1
subsamples are used as training data.
• The cross-validation process is then repeated K times (the
folds), with each of the K subsamples used exactly once as
the test set.
• The K error estimates from the folds can then be averaged
to produce a single estimate.
• The advantage of this method is that all observations are
used for both training and validation, and each observation
is used for validation exactly once.
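A minimal sketch of K-fold cross-validation with scikit-learn, assuming
a synthetic regression dataset and a linear model (both illustrative):
each of the K folds serves as the test set exactly once, and the K
errors are averaged.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # K = 5 folds
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("per-fold MSE:", np.round(fold_errors, 3))
print("cross-validated MSE:", np.mean(fold_errors))    # single averaged estimate
```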
Leave-One-Out Cross-
Validation
• LOO is the degenerate case of K-fold cross-
validation where K = n for a sample of size n.
• That means that n separate times, the prediction
function is trained on all the data except for one
point and a prediction is made for that point.
• As before the average error is computed and
used to evaluate the model.
• The evaluation given by leave-one-out cross-
validation error is good, but sometimes it may be
very expensive to compute.
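A minimal sketch of LOOCV with scikit-learn, again on assumed synthetic
data: this is simply K-fold with K = n, so the model is refitted n
separate times, which is where the computational cost comes from.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# n = 100 model fits, one per held-out observation
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV estimate of MSE:", -scores.mean())
```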
Random subsampling
• Random subsampling performs K data splits of the entire sample.
• For each data split, a fixed number of observations is
chosen without replacement from the sample and kept
aside as the test data.
• The prediction model is fitted to the training data from
scratch for each of the K splits and an estimate of the
prediction error is obtained from each test set.
• The true error estimate is obtained as the average of
the separate estimates.
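A minimal sketch of random subsampling using scikit-learn's ShuffleSplit
on assumed synthetic data: K independent splits are drawn without
replacement, the model is refitted from scratch on each training part,
and the test-set errors are averaged.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)   # K = 10 splits
errors = []
for train_idx, test_idx in splitter.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])         # refit from scratch
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("estimated prediction error (mean MSE):", np.mean(errors))
```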
