
Unit IV

STATISTICAL METHODS

Sumit Kr. Choubey

Table of contents

1. Normalization
Feature Scaling

2. Bias & Variance

3. Regularization

4. Cross Validation

Normalization

“The word “normalization” is used informally in statistics, and so the term normalized data can have multiple meanings. In most cases, when you normalize data you eliminate the units of measurement, enabling you to more easily compare data from different places.”

Normalization I

In statistics and applications of statistics, normalization can have a range of meanings.
• In the simplest cases, normalization means adjusting values measured on different scales to a notionally common scale, often prior to averaging.
• In more complicated cases, normalization may refer to more sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment.

1. In the case of normalization of scores in educational assessment, there may be an intention to align distributions to a normal distribution.
2. A different approach to normalization of probability distributions is quantile normalization, where the quantiles of the different measures are brought into alignment.
Normalization II

• In another usage in statistics, normalization refers to the creation of shifted and scaled versions of statistics, where the intention is that these normalized values allow the comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences, as in an anomaly time series.
• Some types of normalization involve only a rescaling, to arrive at values relative to some size variable. In terms of levels of measurement, such ratios only make sense for ratio measurements (where ratios of measurements are meaningful), not interval measurements (where only distances are meaningful, but not ratios).
Normalization III

• There are different types of normalization in statistics (nondimensional ratios of errors, residuals, means and standard deviations, which are hence scale invariant), some of which may be summarized as follows.

1. Standardization: transforming data using a z-score or t-score. In the vast majority of cases, if a statistics textbook talks about normalizing data, this is the definition of “normalization” it is probably using.
2. Feature scaling: rescaling data to have values between 0 and 1.
3. Standardizing residuals: ratios used in regression analysis can force residuals into the shape of a normal distribution.
Normalization IV

4. Normalizing moments: using the formula µ/σ.
5. Normalizing vectors (in linear algebra) to a norm of one: transforming a vector so that it has a length of one.
Feature Scaling I

Feature scaling is used to bring all values into the range [0, 1]. This is also called unity
based normalization. This can be generalized to restrict the range of values in the dataset
between any arbitrary points a and b.

Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is a preprocessing step.

Why Feature scaling?


▶ Since the range of values of raw data varies widely, in some machine learning
algorithms, objective functions will not work properly without normalization.
▶ For example, many classifiers calculate the distance between two points by the
Euclidean distance. If one of the features has a broad range of values, the
distance will be governed by this particular feature.

Feature Scaling II

▶ Therefore, the range of all features should be normalized so that each feature
contributes approximately proportionately to the final distance.
▶ Another reason why feature scaling is applied is that gradient descent converges
much faster with feature scaling than without it.
▶ It’s also important to apply feature scaling if regularization is used as part of the
loss function (so that coefficients are penalized appropriately).

Feature Scaling III

Min-Max scaling: also known as rescaling or min-max normalization, this is the simplest method and consists in rescaling the range of features to [0, 1] or [−1, 1]. The general formula for a min-max rescaling to [0, 1] is:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

To rescale the range to an arbitrary interval [a, b], the formula becomes:

$$x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$$
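The formulas above can be applied directly; the following is a minimal sketch assuming NumPy, with the array x and the target interval [a, b] chosen purely for illustration.

```python
import numpy as np

def min_max_scale(x, a=0.0, b=1.0):
    """Min-max rescaling: map the values of x linearly into the interval [a, b]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    # x' = a + (x - min(x)) * (b - a) / (max(x) - min(x))
    return a + (x - x_min) * (b - a) / (x_max - x_min)

x = np.array([2.0, 5.0, 10.0, 20.0])
print(min_max_scale(x))          # rescaled to [0, 1]
print(min_max_scale(x, -1, 1))   # rescaled to [-1, 1]
```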
Feature Scaling IV

Mean normalization: to center the data around its mean, the formula is

$$x' = \frac{x - \text{mean}(x)}{\max(x) - \min(x)}$$

Standardization (Z-score normalization): we can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean and unit variance:

$$x' = \frac{x - \text{mean}(x)}{\sigma}$$
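A similar sketch for the two transformations on this slide, again assuming NumPy and an illustrative array x:

```python
import numpy as np

def mean_normalize(x):
    """x' = (x - mean(x)) / (max(x) - min(x)); centers the data around zero."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())

def standardize(x):
    """Z-score normalization: x' = (x - mean(x)) / sigma; zero mean, unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

x = np.array([2.0, 5.0, 10.0, 20.0])
z = standardize(x)
print(z.mean(), z.std())   # approximately 0.0 and 1.0
```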
Bias & Variance I

What is it? Bias is the tendency of a statistic to overestimate or underestimate a parameter. Bias can seep into results for a slew of reasons, including sampling or measurement errors, or unrepresentative samples.

Statistical bias is a feature of a statistical technique or of its results whereby the expected value of the results differs from the true underlying quantitative parameter being estimated.

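To make the definition concrete, the following small simulation (not from the slides; it assumes NumPy) contrasts a biased and an unbiased estimator of the population variance: dividing by n systematically underestimates the true variance, while dividing by n − 1 does not.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0   # variance of N(0, 2^2)

biased, unbiased = [], []
for _ in range(20_000):
    sample = rng.normal(0.0, 2.0, size=5)
    biased.append(sample.var(ddof=0))    # divide by n: expected value falls short of 4.0
    unbiased.append(sample.var(ddof=1))  # divide by n - 1: expected value matches 4.0

print(np.mean(biased), np.mean(unbiased))   # roughly 3.2 vs roughly 4.0
```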
Regularization I

What is Regularization? Regularization is a way to avoid overfitting by penalizing high-valued regression coefficients. In simple terms, it reduces parameters and shrinks (simplifies) the model. This more streamlined, more parsimonious model will likely perform better at predictions. Regularization adds penalties to more complex models and then sorts potential models from least overfit to greatest; the model with the lowest “overfitting” score is usually the best choice for predictive power.

Why Regularization?
• Regularization is necessary because least squares regression methods, where the
residual sum of squares is minimized, can be unstable. This is especially true if there
is multicollinearity in the model.

Regularization II

• However, the mere practice of model fitting comes with a major pitfall: any set of
data can be fitted to a model, even if that model is ridiculously complex.
• For example, take a simple data set of two points. A set of two points can be fitted by multiple models, including a linear model and an unlimited number of higher-degree polynomial models.

• Fitting a small amount of data will often lead to a complex, overfit model. A simpler
model may be underfit and will perform poorly with predictions.

Regularization III

• Just because two data points fit a line perfectly doesn’t mean that a third point will
fall exactly on that line – in fact, it’s highly unlikely.
• Simply put, regularization penalizes models that are more complex in favor of sim-
pler models (ones with smaller regression coefficients) – but not at the expense of
reducing predictive power.

Regularization IV

Penalty Terms: Regularization works by biasing the coefficient estimates towards particular values (such as small values near zero). The bias is achieved by adding a tuning parameter to encourage those values (a short sketch follows this list):
• L1 regularization adds an L1 penalty equal to the absolute value of the magnitude of the coefficients. In other words, it limits the size of the coefficients. L1 can yield sparse models (i.e. models with few coefficients); some coefficients can become zero and be eliminated. Lasso regression uses this method.
• L2 regularization adds an L2 penalty equal to the square of the magnitude of the coefficients. L2 will not yield sparse models, and all coefficients are shrunk by the same factor (none are eliminated). Ridge regression and SVMs use this method.
• Elastic nets combine the L1 and L2 methods, but do add an extra hyperparameter.
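As an illustrative sketch (not from the slides), the penalty terms can be written directly on top of an ordinary least-squares loss; X, y, w and the strength lam are assumed names.

```python
import numpy as np

def loss_with_penalty(w, X, y, lam=0.1, penalty="l2"):
    """Residual sum of squares plus an L1 or L2 penalty on the coefficients w."""
    rss = np.sum((y - X @ w) ** 2)
    if penalty == "l1":      # lasso-style: sum of absolute values of the coefficients
        return rss + lam * np.sum(np.abs(w))
    if penalty == "l2":      # ridge-style: sum of squared coefficients
        return rss + lam * np.sum(w ** 2)
    return rss               # no regularization

# Larger lam means a heavier penalty on complex (large-coefficient) models.
```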
Ridge Regression
Ridge regression is a way to create a parsimonious model when the number of pre-
dictor variables in a set exceeds the number of observations, or when a data set has
multicollinearity (correlations between predictor variables).

Lasso Regression
Lasso regression applies the L1 penalty described above: because some coefficients can be shrunk all the way to zero, it yields a sparse model and can act as a form of variable selection.
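A minimal usage sketch, assuming scikit-learn is installed; the toy data and the regularization strengths alpha are placeholders, not values from the slides.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks every coefficient
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can drive coefficients to exactly zero

print(ridge.coef_.round(2))   # all five coefficients kept, just shrunk
print(lasso.coef_.round(2))   # the irrelevant coefficients tend to be zeroed out
```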
Cross Validation I

Cross-validation, also referred to as an out-of-sample technique, is an essential element of ML. It is a resampling procedure used to evaluate ML models and assess how the model will perform on an independent test dataset.

Cross Validation: Cross validation (also called rotation estimation, or out-of-sample testing) is one way to ensure your model is robust. A portion of your data (called a holdout sample) is held back; the model is trained on the bulk of the data and the holdout sample is used to test it. This is different from the “classical” method of model testing, which uses all of the data to test the model.
Cross Validation II

Consider the following situation:

Example: I want to catch the subway to go to my office. My plan is to take my car, park at the subway and then take the train to go to my office. My goal is to catch the train at 8.15 am every day so that I can reach my office on time. I need to decide the following: [A] the time at which I need to leave from my home and [B] the route I will take to drive to the station.

Cross Validation III

In the above example, I have two parameters (i.e., time of departure from home
and route to take to the station) and I need to choose these parameters such
that I reach the station by 8.15 am.

In order to solve the above problem, I may try out different sets of 'parameters' (i.e., different combinations of departure times and routes) on Mondays, Wednesdays, and Fridays, to see which combination is the 'best' one. The idea is that once I have identified the best combination I can use it every day so that I achieve my objective.

Cross Validation IV

Problem of Overfitting: The problem with the above approach is that I may overfit, which essentially means that the best combination I identify may in some sense be unique to Mon, Wed and Fri, and that combination may not work for Tue and Thu. Overfitting may happen if, in my search for the best combination of times and routes, I exploit some aspect of the traffic situation on Mon/Wed/Fri which does not occur on Tue and Thu.

One Solution to Overfitting: Cross-Validation: Cross-validation is one solution to overfitting. The idea is that once we have identified our best combination of parameters (in our case, time and route), we test the performance of that set of parameters in a different context. Therefore, we may want to test on Tue and Thu as well to ensure that our choices work for those days too.

Cross Validation V

Cross validation didn’t become prevalent until huge datasets came into being. Prior to that, analysts preferred to use all the available data to test a model. With larger data sets, it makes sense to hold back a portion of the data to test the model. However, the question becomes: which portion of the data do you hold back? Most data isn’t homogeneous across its entire length, so if you choose the wrong chunk of data, you could invalidate a perfectly good model. Cross validation solves this problem by using multiple, sequential holdout samples that cover all of the data.

Leave p-out cross-validation

• Leave p-out cross-validation (LpOCV) is an exhaustive cross-validation technique that uses p observations as validation data, while the remaining data is used to train the model (a short sketch follows below).
• This is repeated over all possible ways of cutting the original sample into a validation set of p observations and a training set.
• A variant of LpOCV with p = 2, known as leave-pair-out cross-validation, has been recommended as a nearly unbiased method for estimating the area under the ROC curve of a binary classifier.
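A minimal sketch of LpOCV using scikit-learn's LeavePOut splitter; the five-sample dataset is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(10).reshape(5, 2)   # 5 samples, 2 features
y = np.array([0, 1, 0, 1, 0])

lpo = LeavePOut(p=2)              # every pair of samples becomes a validation set
for train_idx, val_idx in lpo.split(X):
    print("train:", train_idx, "validate:", val_idx)
# C(5, 2) = 10 splits in total, which is why LpOCV gets expensive quickly
```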
Leave One-out cross-validation
• Leave-one-out cross-validation (LOOCV) is an exhaustive cross-validation technique. It is the special case of LpOCV with p = 1 (a short sketch follows below).
• For a dataset having n rows, the 1st row is selected for validation and the remaining (n − 1) rows are used to train the model. In the next iteration, the 2nd row is selected for validation and the rest are used for training. The process is repeated in this way for n steps, or for the desired number of iterations.

Both of the above cross-validation techniques are types of exhaustive cross-validation. Exhaustive cross-validation methods learn and test on all possible ways of splitting the data. They share the same pros and cons.

Pros Simple, easy to understand, and implement.

Cons • The resulting estimate has low bias but can have high variance.
• The computation time required is high.
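A short LOOCV sketch with scikit-learn; the choice of LogisticRegression and the iris data are assumptions made only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One fit per sample: n models, each validated on the single held-out row.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())   # n individual scores and their mean accuracy
```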
Holdout cross-validation I

• The holdout technique is a non-exhaustive cross-validation method that randomly splits the dataset into train and test data (a short sketch follows below).

• In holdout cross-validation, the dataset is randomly split into training and validation data. Generally, the training split is larger than the test split. The training data is used to induce the model and the validation data is used to evaluate its performance.

Holdout cross-validation II

The more data is used to train the model, the better the model is. For the holdout
cross-validation method, a good amount of data is isolated from training.

Pros Simple, easy to understand, and implement.

Cons • Not suitable for an imbalanced dataset.


• A lot of data is isolated from training the model.

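A minimal holdout sketch using scikit-learn's train_test_split; the 80/20 split and the model are illustrative choices, not prescriptions from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the holdout sample
```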
k-fold cross-validation I

• In k-fold cross-validation, the original dataset is equally partitioned into k subparts or folds. Out of the k folds or groups, for each iteration, one group is selected as validation data and the remaining (k − 1) groups are used as training data.

• The process is repeated k times, until each group has been treated as the validation set and the remaining groups as training data.

k-fold cross-validation II

The final accuracy of the model is computed by taking the mean of the validation accuracies of the k models (a code sketch follows below):

$$\text{acc}_{cv} = \frac{1}{k}\sum_{i=1}^{k} \text{acc}_i$$

LOOCV is a variant of k-fold cross-validation with k = n.

Pros • The model has low bias


• Low time complexity
• The entire dataset is utilized for both training and validation.

Cons • Not suitable for an imbalanced dataset.

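A 5-fold sketch with scikit-learn, assuming the same illustrative model and data as above; cross_val_score returns one accuracy per fold, and their mean is the acc_cv defined above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)   # one accuracy score per fold
print(scores, scores.mean())                   # acc_cv = mean of the k fold accuracies
```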
Stratified k-fold cross-validation I

Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.

• The cross-validation techniques discussed above may not work well with an imbalanced dataset. Stratified k-fold cross-validation solves the problem of an imbalanced dataset.

Stratified k-fold cross-validation II

• In stratified k-fold cross-validation, the dataset is partitioned into k groups or folds such that each fold contains approximately the same proportion of instances of each target class label as the complete dataset. This ensures that no particular class is over-represented in the validation or train data, which matters especially when the dataset is imbalanced.

Stratified k-fold cross-validation III

• The final score is computed by taking the mean of the scores of each fold (a code sketch follows below).

Pros • Works well for an imbalanced dataset.

Cons • Not suitable for a time series dataset.

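A final sketch, assuming scikit-learn's StratifiedKFold; the imbalanced labels are synthetic and exist only to show that every fold preserves the class proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)    # imbalanced: 80% class 0, 20% class 1

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold keeps the 80/20 ratio: 4 samples of class 0, 1 of class 1.
    print(np.bincount(y[val_idx]))
```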
