Unit 4
STATISTICAL METHODS
Table of contents
1. Normalization
2. Feature Scaling
3. Regularization
4. Cross Validation
Normalization
Normalization
“The word ‘normalization’ is used informally in statistics, and so the term normalized data can have multiple meanings. In most cases, when you normalize data you eliminate the units of measurement, enabling you to more easily compare data from different places.”
Normalization I
In statistics and applications of statistics, normalization can have a range of meanings.
• In the simplest cases, normalization means adjusting values measured on different
scales to a notionally common scale, often prior to averaging.
• In more complicated cases, normalization may refer to more sophisticated adjust-
ments where the intention is to bring the entire probability distributions of ad-
justed values into alignment.
Normalization II
Normalization III
• There are different types of normalization in statistics: nondimensional ratios of errors, residuals, means, and standard deviations, which are hence scale invariant; some of them may be summarized as follows.
Normalization IV
Feature Scaling I
Feature scaling is used to bring all values into the range [0, 1]. This is also called unity-based normalization. It can be generalized to restrict the range of values in the dataset to an arbitrary interval [a, b].
Feature Scaling II
▶ Many methods (e.g. k-nearest neighbours) compute the distance between data points; if one feature has a much broader range of values than the others, that feature dominates the distance. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
▶ Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
▶ It’s also important to apply feature scaling if regularization is used as part of the loss function (so that coefficients are penalized appropriately).
Feature Scaling III
$x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$

To rescale a range between an arbitrary set of values [a, b], the formula becomes:

$x' = a + \dfrac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$
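As a quick illustration, here is a minimal NumPy sketch of the two rescaling formulas above; the array x and the target interval [a, b] are arbitrary example values, not part of the original slides.

```python
import numpy as np

x = np.array([2.0, 5.0, 10.0, 20.0])   # example feature values

# Rescale to [0, 1]
x_01 = (x - x.min()) / (x.max() - x.min())

# Rescale to an arbitrary interval [a, b]
a, b = -1.0, 1.0
x_ab = a + (x - x.min()) * (b - a) / (x.max() - x.min())

print(x_01)   # [0.         0.16666667 0.44444444 1.        ]
print(x_ab)   # [-1.         -0.66666667 -0.11111111  1.        ]
```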
Feature Scaling IV
Standardization (Z-score Normalization): We can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean and unit variance.

$x' = \dfrac{x - \operatorname{mean}(x)}{\sigma}$
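A minimal sketch of z-score standardization with NumPy; the example matrix X is made up, and each column is treated as a feature.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])           # rows = samples, columns = features

# Standardize each feature (column) to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))   # ~[0. 0.]
print(X_std.std(axis=0))    # [1. 1.]
```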
Bias & Variance I
Statistical bias is a feature of a statistical technique or of its results whereby the ex-
pected value of the results differs from the true underlying quantitative parameter be-
ing estimated.
Regularization I
Why Regularization?
• Regularization is necessary because least squares regression methods, where the
residual sum of squares is minimized, can be unstable. This is especially true if there
is multicollinearity in the model.
Regularization II
• However, the mere practice of model fitting comes with a major pitfall: any set of data can be fitted by some model, even if that model is ridiculously complex.
• For example, take a simple data set of two points. A set of two points can be fitted by multiple models, including a linear model (green) and an unlimited number of higher-degree polynomial models (red).
• Fitting a small amount of data will often lead to a complex, overfit model. A simpler model may be underfit and will perform poorly with predictions.
Regularization III
• Just because two data points fit a line perfectly doesn’t mean that a third point will
fall exactly on that line – in fact, it’s highly unlikely.
• Simply put, regularization penalizes more complex models in favor of simpler ones (models with smaller regression coefficients), without unduly sacrificing predictive power.
Regularization IV
Ridge Regression
Ridge regression is a way to create a parsimonious model when the number of pre-
dictor variables in a set exceeds the number of observations, or when a data set has
multicollinearity (correlations between predictor variables).
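A minimal scikit-learn sketch of ridge regression; the toy data, the deliberately collinear predictors, and the penalty strength alpha=1.0 are illustrative choices, not prescribed values.

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
# Make two predictors nearly collinear to mimic multicollinearity
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=20)
y = 3 * X[:, 0] + rng.normal(size=20)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)     # alpha controls the L2 penalty strength

print("OLS coefficients:  ", ols.coef_)    # typically large and unstable
print("Ridge coefficients:", ridge.coef_)  # shrunk toward zero, more stable
```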
Lasso Regression
Lasso regression is similar to ridge regression, but it penalizes the absolute size of the regression coefficients (an L1 penalty), which can shrink some coefficients exactly to zero and thus also performs variable selection.
Cross Validation I
Cross Validation II
Suppose I leave home every morning and need to reach the train station by 8.15 am. I can vary the time at which I leave home and the route I take to the station.
Cross Validation III
In the above example, I have two parameters (i.e., the time of departure from home and the route to take to the station), and I need to choose these parameters such that I reach the station by 8.15 am.
In order to solve this problem, I may try out different sets of ’parameters’ (i.e., different combinations of departure times and routes) on Mondays, Wednesdays, and Fridays, to see which combination is the ’best’ one. The idea is that once I have identified the best combination I can use it every day to achieve my objective.
Cross Validation IV
Problem of Overfitting: The problem with the above approach is that I may overfit, which essentially means that the best combination I identify may in some sense be unique to Mon, Wed and Fri, and that combination may not work for Tue and Thu. Overfitting may happen if, in my search for the best combination of times and routes, I exploit some aspect of the traffic situation on Mon/Wed/Fri which does not occur on Tue and Thu.
One Solution to Overfitting: Cross-Validation: Cross-validation is one solution to overfitting. The idea is that once we have identified our best combination of parameters (in our case, time and route), we test the performance of that set of parameters in a different context. Therefore, we may want to test on Tue and Thu as well, to ensure that our choices work for those days too.
Cross Validation V
Cross-validation didn’t become prevalent until huge datasets came into being. Prior to that, analysts preferred to use all the available data to test a model. With larger data sets, it makes sense to hold back a portion of the data to test the model. However, the question becomes: which portion of the data do you hold back? Most data isn’t homogeneous across its entire length, so if you choose the wrong chunk of data, you could invalidate a perfectly good model. Cross-validation solves this problem by using multiple, sequential holdout samples that cover all of the data.
Leave p-out cross-validation
• Leave-p-out cross-validation (LpOCV) is an exhaustive technique: p observations are held out for validation and the remaining (n-p) observations are used to train the model, repeated for every possible way of choosing p observations from the dataset.
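A minimal sketch of enumerating LpOCV splits with scikit-learn; the toy data, p=2, and the use of LeavePOut are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape(4, 2)   # 4 samples, 2 features (toy data)
y = np.array([0, 1, 0, 1])

lpo = LeavePOut(p=2)             # hold out every possible pair of samples
for train_idx, test_idx in lpo.split(X):
    print("train:", train_idx, "test:", test_idx)
# With n=4 and p=2 there are C(4,2) = 6 train/test splits.
```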
Leave One-out cross-validation
• Leave-one-out cross-validation (LOOCV) is an exhaustive cross-validation technique.
It is a category of LpOCV with the case of p=1.
• For a dataset with n rows, the 1st row is selected for validation and the remaining (n-1) rows are used to train the model. In the next iteration, the 2nd row is selected for validation and the rest are used for training. The process is repeated until all n rows have been used for validation.
Both of the above cross-validation techniques are exhaustive: exhaustive cross-validation methods learn and test on every possible split of the data. They share the same pros and cons.
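A minimal scikit-learn sketch of LOOCV; the noiseless toy data and the choice of LinearRegression as the model are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)   # 10 samples, 1 feature
y = 2.0 * X.ravel() + 1.0

loo = LeaveOneOut()
errors = []
for train_idx, test_idx in loo.split(X):
    # Train on n-1 rows, validate on the single held-out row
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append((pred[0] - y[test_idx][0]) ** 2)

print("LOOCV mean squared error:", np.mean(errors))   # ~0 for this noiseless toy data
```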
Holdout cross-validation I
• In the case of holdout cross-validation, the dataset is randomly split into training and validation data. Generally, more data is allocated to training than to testing. The training data is used to induce the model, and the validation data is used to evaluate the performance of the model.
Holdout cross-validation II
The more data is used to train the model, the better the model tends to be. With the holdout method, however, a sizeable portion of the data is withheld from training.
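A minimal sketch of a holdout split using scikit-learn's train_test_split; the 80/20 split, the toy data, and the linear model are arbitrary example choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 3)                                  # toy data: 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)

# Hold out 20% of the data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Validation R^2:", model.score(X_val, y_val))
```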
k-fold cross-validation I
• The dataset is divided into k groups (folds) of approximately equal size. In each iteration, one group is used as the validation data and the remaining (k-1) groups are used as training data.
• The process is repeated k times, until each group has been used once as the validation data.
k-fold cross-validation II
The final accuracy of the model is computed by taking the mean of the validation accuracies of the k models:

$\mathrm{acc}_{cv} = \dfrac{1}{k}\sum_{i=1}^{k} \mathrm{acc}_i$
LOOCV is a variant of k-fold cross-validation where k=n.
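A minimal scikit-learn sketch of k-fold cross-validation and the mean accuracy above; k=5, the synthetic classification data, and logistic regression are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print("Per-fold accuracy:", scores)
print("Mean CV accuracy: ", scores.mean())   # acc_cv = (1/k) * sum(acc_i)
```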
Stratified k-fold cross-validation I
Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
• The cross-validation techniques discussed above may not work well with an imbalanced dataset. Stratified k-fold cross-validation addresses the problem of an imbalanced dataset.
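A minimal sketch of stratified splitting on an imbalanced toy label vector; the 90/10 class ratio and k=5 are example choices.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 samples of class 0, 10 samples of class 1
X = np.random.rand(100, 4)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold keeps roughly the same 90/10 class proportion
    print(f"fold {fold}: class counts in test fold =", np.bincount(y[test_idx]))
```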
Stratified k-fold cross-validation II
Stratified k-fold cross-validation III
• The final score is computed by taking the mean of the scores across the k folds.