
Exploratory Data Analytics

Module 2: Statistical learning & Model Selection
Prediction Accuracy
• A good learner is one with good prediction accuracy; in other words,
one with the smallest prediction error.
• A prediction error is the difference between the value a model predicts
and the value that is actually observed.
• Classical statistical analysis – goodness of fit
– R2, adjusted R2, standard error, and residual analysis
– https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-correlation/coefficient-of-determination-r-squared.html
• The assumption is that, with a high R2 value,
the model is expected to predict well for data
observed in the future.
• However, these measures will not tell us much
about the ability of the model to predict new
records.
• In the modern data science and analytics world, many prediction
performance measures are used.
• In all cases, the measures are based on the validation and test data.
• The validation set is more similar to the future records to be
predicted.
• These validation set observations are not used to select predictors.
Prediction accuracy measures
(with e_i = y_i − ŷ_i the prediction error for record i, and n the
number of records)
• Mean absolute error / deviation (MAE / MAD) = (1/n) Σ |e_i|
• Mean error (ME) = (1/n) Σ e_i
• Mean percentage error (MPE) = (100/n) Σ (e_i / y_i)
• Mean absolute percentage error (MAPE) = (100/n) Σ |e_i / y_i|
• Root mean squared error (RMSE) = sqrt( (1/n) Σ e_i^2 )
• Mean squared error (MSE) = (1/n) Σ e_i^2

Training and Test Error as A Function of
Model Complexity
• Errors that are based on the training set tell us about model fit.
• Errors based on the validation set measure the model's ability to
predict new data (prediction error).
• The fit of a model improves with the complexity of the model, i.e. as
more predictors are included in the model, the R2 value is expected to
improve.
• The training error keeps on decreasing as we increase the complexity
of the model. Here, complexity can be defined as the number of different
or complex input features (predictors) involved in the model.
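A minimal sketch of this behaviour, assuming synthetic data and a
scikit-learn polynomial-regression pipeline (the data and degrees are
illustrative, not from the slides): training error keeps falling as the
polynomial degree grows, while test error eventually rises.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # true signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 9, 15):                              # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```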
Overfitting a model
• However, reducing a model's training error too aggressively can lead
to the test error increasing rather than decreasing.
• This phenomenon is called overfitting and is one of the primary
obstacles to selecting good predictive models.
• If a learning technique learns the structure of the training data too
well, then when the model is applied to the data on which it was built,
it correctly predicts every sample value.
• In the extreme case, the model admits no error on the training data.
In addition to learning the general patterns in the data, the model has
also learned the characteristics of each training data point's unique
noise.
• This type of model is said to be overfit and will usually have poor
accuracy when predicting a new sample.
Bias-Variance Trade-off
• Machine learning is a branch of Artificial
Intelligence, which allows machines to perform
data analysis and make predictions.
• However, if the machine learning model is not accurate, it can make
prediction errors, and these prediction errors are usually known as bias
and variance (training error & test error).
• The main aim of ML/data science analysts is to reduce these errors in
order to get more accurate results.
• In machine learning, an error is a measure of
how accurately an algorithm can make
predictions for the previously unknown
dataset.
• On the basis of these errors, the machine learning model that can
perform best on the particular dataset is selected.
What is Bias?
https://www.javatpoint.com/bias-and-variance-in-machine-learning
• In general, a machine learning model analyses the data, finds patterns
in it, and makes predictions.
• While training, the model learns these patterns in the dataset and
applies them to test data for prediction.
• While making predictions, a difference occurs between the values
predicted by the model and the actual values, and this difference is
known as the bias error, or error due to bias.
• A high bias model also cannot perform well
on new data.
• Ways to reduce high bias (see the sketch after this list):
– High bias mainly occurs due to an overly simple model
• Increase the input features, as the model is underfitted
• Use more complex models, such as including some polynomial features
• Decrease the regularization term
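A minimal sketch of the last point, assuming synthetic data and
scikit-learn's Ridge regression (the data and alpha values are
illustrative): an over-regularized model underfits, and decreasing the
regularization term reduces the bias and improves the test error.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.5, 4.0]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for alpha in (1000.0, 10.0, 0.1):          # decreasing regularization strength
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"alpha={alpha:7.1f}  test MSE={test_mse:.3f}")
```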
What is a Variance Error?
• The variance is the amount by which the prediction would change if
different training data were used.
• In simple words, variance tells how much a random variable differs
from its expected value.
• Ideally, a model should not vary too much from one training dataset to
another, which means the algorithm should be good at capturing the
hidden mapping between the input and output variables.
• Variance errors are classified as either low variance or high
variance.
• Low variance means there is a small variation in the prediction of the
target function with changes in the training data set. In contrast, high
variance shows a large variation in the prediction of the target
function with changes in the training dataset.
• A model that shows high variance learns a lot and performs well on the
training dataset, but does not generalize well to an unseen dataset.
• As a result, such a model gives good results on the training dataset
but shows high error rates on the test dataset.
Problems of high variance error
• A high variance model leads to overfitting.
• It increases model complexity.

Ways to reduce high variance (see the sketch after this list):
• Reduce the input features or number of parameters, as the model is
overfitted.
• Do not use an overly complex model.
• Increase the training data.
• Increase the regularization term.
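A minimal sketch, assuming synthetic data and scikit-learn (the degree
and alpha are illustrative): a deliberately flexible polynomial model
fitted on few observations overfits, and adding a regularization term
(Ridge) reduces the variance and the test error.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_train, y_train, X_test, y_test = X[:30], y[:30], X[30:], y[30:]   # small training set

for name, reg in [("no regularization", LinearRegression()),
                  ("ridge, alpha=1.0  ", Ridge(alpha=1.0))]:
    # degree-12 polynomial features make the model flexible (high variance)
    model = make_pipeline(PolynomialFeatures(degree=12), reg).fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {test_mse:.3f}")
```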
•Low-Bias, Low-Variance: the ideal combination; predictions are accurate
on average and consistent across training sets, but it is rarely
achievable in practice.
•Low-Bias, High-Variance: predictions are accurate on average but
inconsistent; typical of an overfitted, overly complex model.
•High-Bias, Low-Variance: predictions are consistent but inaccurate on
average; typical of an underfitted, overly simple model.
•High-Bias, High-Variance: predictions are both inaccurate on average
and inconsistent.
Bias-Variance trade-off.
• While building a machine learning model, it is really important to
take care of bias and variance in order to avoid overfitting and
underfitting in the model.
• If the model is very simple with fewer parameters, it may have low
variance and high bias. Whereas, if the model has a large number of
parameters, it will have high variance and low bias.
• So, a balance must be struck between the bias error and the variance
error, and this balance is known as the Bias-Variance trade-off.
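The standard decomposition behind this trade-off, written out explicitly
(it is implied but not stated in the slides; sigma^2 denotes the
irreducible noise variance):

```latex
% Expected test error at a point x_0 for a fitted model \hat{f}
E\big[(y_0 - \hat{f}(x_0))^2\big]
  = \underbrace{\big(\mathrm{Bias}[\hat{f}(x_0)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}[\hat{f}(x_0)]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```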
Introduction to cross-validation
• An ideal predictor is one which learns all the structure in the data
but none of the noise. While the prediction error (PE) on the training
data reduces monotonically with increasing model complexity, the same is
not true for test data.
• Bias and variance move in opposing directions, and at a suitable
bias-variance combination the PE on the test data is at its minimum. The
model that achieves this lowest possible PE is the best prediction
model.
• Cross-validation is a comprehensive set of data splitting techniques
which helps to estimate this minimum point of the test PE.
Cross-validation: data splitting methods
• Why data splitting?
• What data splitting methods have we discussed so far?
– Holdout sample: training & test data
Demerits of the holdout method
• In a sparse data set, one may not have the luxury to set
aside a reasonable portion of the data for testing.
• If we happen to have a 'bad' split, the estimate is not
reliable.
Three-way Split: Training, Validation
and Test Data
• The available data is partitioned into three sets:
training, validation and test set.
• The prediction model is trained on the training
set and is evaluated on the validation set.
• A typical split is 50% for the training data and 25% each for the
validation and test sets.

 Training Data: Set of data used for learning (by the model), that is,
to fit the parameters of the machine learning model.
 Validation data:
– Set of data used to provide an unbiased evaluation of a model fitted
on the training dataset while tuning the model's hyperparameters.
– It also plays a role in other forms of model preparation, such as
feature selection and threshold cut-off selection.
– Training and validation may be iterated a few times till a 'best'
model is found.
– A set of examples used to tune the parameters of a classifier, for
example to choose the number of hidden units in a neural network.
• Test Dataset
– Set of data used to provide an unbiased evaluation of the final model
fitted on the training dataset.
– After training, validating and selecting a model, its performance is
assessed on this held-out subset of the data (the test data) before the
model is taken to production.
– After assessing the final model on the test set, the model must not be
fine-tuned any further.
• Unfortunately, data insufficiency often does not allow a three-way
split. A minimal splitting sketch follows below.
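A minimal sketch of the 50/25/25 three-way split described above,
assuming scikit-learn's train_test_split and placeholder data (the
arrays are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder feature matrix
y = np.arange(1000)                  # placeholder target

# First split off 50% for training; then split the rest 50/50 into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 500 250 250
```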
Cross-validation: K-fold cross-validation
• The original sample is randomly partitioned into K equal-
sized (or almost equal sized) subsamples.
• Of the K subsamples, a single subsample is retained as the
test set for estimating the PE, and the remaining K-1
subsamples are used as training data.
• The cross-validation process is then repeated K times (the
folds), with each of the K subsamples used exactly once as
the test set.
• The K error estimates from the folds can then be averaged
to produce a single estimate.
• The advantage of this method is that all observations are
used for both training and validation, and each observation
is used for validation exactly once.
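A minimal sketch of K-fold cross-validation with scikit-learn, assuming
a synthetic regression dataset and a linear model (both illustrative):
each of the K folds serves as the test set exactly once, and the K
errors are averaged.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # K = 5 folds
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("per-fold MSE:", np.round(fold_errors, 3))
print("cross-validated MSE:", np.mean(fold_errors))    # single averaged estimate
```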
Leave-One-Out Cross-
Validation
• LOO is the degenerate case of K-fold cross-
validation where K = n for a sample of size n.
• That means that n separate times, the prediction
function is trained on all the data except for one
point and a prediction is made for that point.
• As before the average error is computed and
used to evaluate the model.
• The evaluation given by leave-one-out cross-
validation error is good, but sometimes it may be
very expensive to compute.
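A minimal sketch of LOOCV with scikit-learn, again on assumed synthetic
data: this is simply K-fold with K = n, so the model is refitted n
separate times, which is where the computational cost comes from.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# n = 100 model fits, one per held-out observation
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV estimate of MSE:", -scores.mean())
```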
Random subsampling
• Random subsampling performs K data splits of the entire sample.
• For each data split, a fixed number of observations is
chosen without replacement from the sample and kept
aside as the test data.
• The prediction model is fitted to the training data from
scratch for each of the K splits and an estimate of the
prediction error is obtained from each test set.
• The true error estimate is obtained as the average of
the separate estimates.
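A minimal sketch of random subsampling using scikit-learn's ShuffleSplit
on assumed synthetic data: K independent splits are drawn without
replacement, the model is refitted from scratch on each training part,
and the test-set errors are averaged.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)   # K = 10 splits
errors = []
for train_idx, test_idx in splitter.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])         # refit from scratch
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("estimated prediction error (mean MSE):", np.mean(errors))
```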
