PREDICTIVE ANALYSIS 2
Bias
Bias refers to the error introduced by approximating a real-world problem, which may be
complex, by a simplified model.
It can be seen as the difference between the predicted values and the true values in a model.
Low Bias: A low-bias model fits the training data very well, but if its variance is also high it may
perform poorly on unseen data. Low bias is often associated with complex, flexible models.
High Bias: A high-bias model is too simplistic and may underfit the data, failing to capture
important patterns. It is often associated with overly simple models.
Variance
Variance refers to the model's sensitivity to small fluctuations or noise in the training data. It
measures how much the model's predictions would vary if we trained it on different subsets
of the data.
Low Variance: A low-variance model is stable and produces consistent predictions across
different datasets. However, it may not adapt well to complex patterns and can underfit.
High Variance: A high-variance model is sensitive to the training data and can produce very
different predictions for different datasets. It is often associated with overfitting, where the
model captures noise in the data.
Model Complexity
Model complexity refers to the sophistication or flexibility of a model in capturing
relationships within data. It's often controlled by the number of parameters or features in the
model.
Low Model Complexity: Models with low complexity are simple and have fewer parameters.
They are less likely to overfit but may underfit if the data is complex.
High Model Complexity: Models with high complexity are more sophisticated and have
many parameters. They can capture intricate patterns in the data but are prone to overfitting
when there's noise or insufficient data.
Bias-Variance tradeoff
The bias-variance trade-off is a fundamental concept in machine learning and statistics that
deals with the balance between two types of errors that a predictive model can make: bias
and variance.
Achieving the right balance between these two types of errors is crucial for building models
that generalize well to unseen data.
Prediction Error: Prediction error generally refers to the difference between the actual (ground
truth) values and the values predicted by a machine learning model. It can be measured using
various metrics, such as mean squared error (MSE), mean absolute error (MAE), or root
mean squared error (RMSE), depending on the problem type (e.g., regression or
classification). Prediction error helps assess how well the model is performing on a specific
set of data.
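As a minimal illustration, the sketch below computes MSE, MAE and RMSE with NumPy; the arrays y_true and y_pred are hypothetical placeholder values.

import numpy as np

# Hypothetical ground-truth values and model predictions (regression setting)
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.0, 6.5, 5.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)          # mean squared error
mae = np.mean(np.abs(errors))       # mean absolute error
rmse = np.sqrt(mse)                 # root mean squared error

print(f"MSE={mse:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")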
Test Error: Test error, also known as out-of-sample error or generalization error, measures
how well a machine learning model performs on data that it has never seen during training.
Typically, a dataset is split into a training set and a test set. The model is trained on the
training set and then evaluated on the test set. Test error provides an estimate of how well the
model is expected to perform on new, unseen data. It helps assess the model's ability to
generalize beyond the training data.
In short, test error (also referred to as generalization error) is the prediction error over an
independent test sample.
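A minimal sketch of estimating test error with a train/test split, using scikit-learn; the synthetic data and the choice of a linear model are assumptions for illustration only.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                   # synthetic features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out 25% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
train_err = mean_squared_error(y_train, model.predict(X_train))
test_err = mean_squared_error(y_test, model.predict(X_test))    # estimate of generalization error
print(f"training MSE={train_err:.3f}, test MSE={test_err:.3f}")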
Validation Error: Validation error is closely related to test error but is often used during the
model training process, especially in techniques like cross-validation. In k-fold cross-
validation, for example, the dataset is divided into k subsets (folds), and the model is trained
and validated k times. Validation error measures how well the model performs on the
validation set (which is part of the original data but not used in training) during each iteration
of cross-validation. It helps in hyperparameter tuning and model selection.
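For example, validation error can guide hyperparameter tuning. The sketch below scores a few ridge penalties with 5-fold cross-validation via scikit-learn; the data and the candidate alpha values are hypothetical.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=150)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    # scikit-learn returns negative MSE; flip the sign to get validation MSE
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    print(f"alpha={alpha}  mean validation MSE={-scores.mean():.3f}")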
Training error consistently decreases with model complexity, typically dropping to zero if
we increase the model complexity enough. However, a model with zero training error is
overfit to the training data and will typically generalize poorly.
High Bias, Low Variance: Models with high bias and low variance are simple and tend to
underfit the data. They may not capture complex patterns and have a relatively high training
error.
Low Bias, High Variance: Models with low bias and high variance are complex and can fit
the training data very closely. However, they are prone to overfitting and have a high test
error when applied to new data.
Balanced Trade-off: The goal is to strike a balance between bias and variance. You want a
model that is complex enough to capture the underlying patterns in the data but not so
complex that it overfits and becomes sensitive to noise.
Several techniques help manage this trade-off:
Regularization: Adding constraints or penalties to the model to reduce its complexity and
prevent overfitting.
Feature Selection: Choosing relevant features and reducing dimensionality to simplify the
model.
Ensemble Methods: Combining the predictions of multiple models to reduce variance while
maintaining low bias.
The ultimate goal of the bias-variance trade-off is to build models that generalize well to new,
unseen data, rather than just memorizing the training data or being too simplistic to be useful.
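As a rough illustration of the trade-off, the sketch below fits polynomial regressions of increasing degree to synthetic data and compares training and test MSE; the data-generating function and the degrees tried are assumptions for illustration.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)    # true signal plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

for degree in [1, 3, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    te = mean_squared_error(y_te, model.predict(X_te))
    # Low degree: high bias (underfit). Very high degree: high variance (overfit).
    print(f"degree={degree}  train MSE={tr:.3f}  test MSE={te:.3f}")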
Optimism of the Training error rate
Given a training set T = {(x1, y1), (x2, y2), ..., (xN, yN)}, the generalization error of a model f̂
is
ErrT = E[L(Y0, f̂(X0)) | T].
Note that the training set T is fixed in this expression. The point (X0, Y0) is a new test data
point, drawn from F, the joint distribution of the data. Averaging over training sets T yields
the expected error
Err = ET E[L(Y0, f̂(X0)) | T],
which is more amenable to statistical analysis. As mentioned earlier, it turns out that most
methods effectively estimate the expected error Err rather than the conditional error ErrT.
Now typically, the training error
err = (1/N) Σ_i L(yi, f̂(xi))
will be less than the true error ErrT, because the same data is being used to fit the method
and assess its error. A fitting method typically adapts to the training data, and hence the
apparent or training error err will be an overly optimistic estimate of the generalization
error ErrT.
We define the optimism as the difference between the in-sample error Errin and the training error:
op ≡ Errin − err,
where Errin is the average prediction error at the training inputs x1, ..., xN when new response
values are observed at those same inputs.
This is typically positive since err is usually biased downward as an estimate of
prediction error. Finally, the average optimism is the expectation of the optimism over
training sets
ω ≡ Ey(op).
Usually we can estimate only the expected optimism ω rather than the optimism op for our
particular training set, in the same way that we can estimate the expected error Err rather than
the conditional error ErrT.
For squared error, 0–1, and other loss functions, one can show quite generally that
ω = (2/N) Σ_i Cov(ŷi, yi).
Thus the harder we fit the data, the greater Cov(ŷi, yi) will be, thereby increasing the optimism.
This expression simplifies if ŷi is obtained by a linear fit with d inputs or basis functions.
For example, under the additive error model Y = f(X) + ε,
Σ_i Cov(ŷi, yi) = d σε², and so ω = 2 · (d/N) · σε².
The optimism increases linearly with the number d of inputs or basis functions we use,
but decreases as the training sample size increases.
An obvious way to estimate prediction error is to estimate the optimism and then add it
to the training error err. The methods described in the next section—Cp, AIC, BIC and
others—work in this way, for a special class of estimates that are linear in their
parameters.
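A minimal NumPy sketch of this idea for ordinary least squares: estimate the optimism as 2 · d · σ̂²/N and add it to the training error. The synthetic data and the noise-variance estimate taken from the fitted model itself are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(3)
N, d = 100, 5
X = np.column_stack([np.ones(N), rng.normal(size=(N, d - 1))])   # intercept plus 4 inputs
beta = np.array([1.0, 2.0, -1.0, 0.0, 0.5])
y = X @ beta + rng.normal(scale=1.0, size=N)

# Least-squares fit and training error
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
train_err = np.mean(resid ** 2)

# Noise variance estimated from this (low-bias) fit
sigma2_hat = np.sum(resid ** 2) / (N - d)

# Estimated in-sample prediction error = training error + estimated optimism
err_in_hat = train_err + 2 * d / N * sigma2_hat
print(f"training error={train_err:.3f}, estimated in-sample error={err_in_hat:.3f}")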
Estimates of In-Sample prediction Error
In-sample error is not usually of direct interest since future values of the features are not
likely to coincide with their training set values. But for comparison between models, in-
sample error is convenient and often leads to effective model selection. The reason is that the
relative (rather than absolute) size of the error is what matters.
Using the expression for the optimism above, applicable when d parameters are fit under
squared error loss, leads to a version of the so-called Cp statistic,
Cp = err + 2 · (d/N) · σ̂ε².
Here σ̂ε² is an estimate of the noise variance, obtained from the mean squared error of a
low-bias model. Using this criterion, we adjust the training error by a factor proportional to
the number of basis functions used.
The Akaike information criterion (AIC) is a similar but more generally applicable estimate of
Errin when a log-likelihood loss function is used. It relies on a relationship that holds
asymptotically as N → ∞:
−2 E[log Prθ̂(Y)] ≈ −(2/N) E[loglik] + 2 · (d/N).
Here Prθ(Y) is a family of densities for Y, θ̂ is the maximum likelihood estimate of θ, and
loglik is the maximized log-likelihood. For the Gaussian model (with the noise variance
assumed known), the AIC statistic is equivalent to Cp, and so we refer to them collectively as
AIC. To use AIC for model selection, we simply choose the model giving smallest AIC over
the set of models
considered. For nonlinear and other complex models, we need to replace d by some measure
of model complexity.
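The sketch below compares AIC across nested Gaussian linear models using the common form AIC = N · log(RSS/N) + 2d, valid up to an additive constant; the synthetic data and the candidate subsets are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(4)
N = 120
X_full = np.column_stack([np.ones(N), rng.normal(size=(N, 4))])
y = X_full[:, :3] @ np.array([1.0, 2.0, -1.5]) + rng.normal(scale=1.0, size=N)

def gaussian_aic(X, y):
    # AIC for a Gaussian linear model, up to an additive constant
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    d = X.shape[1]                      # number of fitted coefficients
    return len(y) * np.log(rss / len(y)) + 2 * d

# Nested models: intercept only, then adding one predictor at a time
for k in range(1, X_full.shape[1] + 1):
    print(f"d={k}  AIC={gaussian_aic(X_full[:, :k], y):.2f}")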
Cp (Mallows' Cp):
● Purpose: Cp was developed by Colin Mallows to assess the quality of linear
regression models. It is primarily used in the context of regression analysis.
● Calculation: Cp measures the trade-off between model fit and model
complexity. It is calculated as follows:
Cp = (SSE_p / MSE) - (n - 2p)
where:
● SSE_p: The sum of squared errors for the model with p predictor variables.
● MSE: The mean squared error for the full model with all predictor variables.
● n: The number of data points.
● p: The number of predictor variables in the model.
● Interpretation: A smaller Cp value indicates a better trade-off between model
fit and complexity. Cp is used to assess whether a model with a subset of
predictor variables is competitive with the full model while penalizing for
model complexity.
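A sketch of Mallows' Cp exactly as defined above, comparing subset models against a full linear model; the synthetic data and the particular subsets compared are arbitrary illustrations.

import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])   # intercept plus 4 predictors
y = X[:, :3] @ np.array([1.0, 1.5, -2.0]) + rng.normal(size=n)

def sse(Xs, y):
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

# MSE of the full model (all predictors)
p_full = X.shape[1] - 1                       # predictors, excluding the intercept
mse_full = sse(X, y) / (n - p_full - 1)

# Cp for each subset model containing the first p predictors, using the formula above
for p in range(1, p_full + 1):
    Xs = X[:, :p + 1]                         # intercept plus first p predictors
    cp = sse(Xs, y) / mse_full - (n - 2 * p)
    print(f"p={p}  Cp={cp:.2f}")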
The choice between Cp, AIC, and BIC depends on the specific context and goals of your
analysis. Cp is suitable for linear regression, while AIC and BIC are versatile and widely
used in various statistical modeling scenarios. BIC is the most conservative in terms of model
selection and favors simpler models the most, while AIC strikes a balance between model fit
and complexity.
The effective number of parameters is used in model evaluation to assess the complexity of a
model and to understand its generalization performance.
● Models with too many parameters relative to the amount of training data are
prone to overfitting. They may capture noise in the training data rather than
learning the underlying patterns.
● The effective number of parameters helps in evaluating whether a model is
likely to generalize well to new, unseen data.
Model Complexity and Regularization Impact:
● Regularization techniques, such as L1 and L2 regularization, directly influence
the effective number of parameters.
● Evaluation involves assessing the impact of regularization on the model's
performance and understanding how it helps control overfitting.
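For ridge (L2) regression, the effective number of parameters can be computed as the trace of the smoother matrix, which reduces to the sum of dj²/(dj² + λ) over the singular values dj of X. A minimal sketch with a hypothetical design matrix:

import numpy as np

rng = np.random.default_rng(6)
N, p = 100, 8
X = rng.normal(size=(N, p))

def ridge_effective_df(X, lam):
    # Effective number of parameters of a ridge fit: trace of the smoother matrix
    d2 = np.linalg.svd(X, compute_uv=False) ** 2   # squared singular values of X
    return np.sum(d2 / (d2 + lam))

for lam in [0.0, 1.0, 10.0, 100.0]:
    # lam = 0 recovers the full p parameters; larger penalties shrink the effective df
    print(f"lambda={lam}  effective df={ridge_effective_df(X, lam):.2f}")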
Other aspects to consider when evaluating the effective number of parameters include
interpretability, model compression, trade-off analysis, and cross-validation considerations.
Cross Validation
This method directly estimates the expected extra-sample error Err = E[L(Y, f̂(X))], the
average generalization error when the method f̂(X) is applied to an independent test sample
from the joint distribution of X and Y.
The basic idea is to split your dataset into multiple subsets or folds. The model is trained on
some of these folds and tested on the remaining fold. This process is repeated multiple times,
with different folds used as the test set in each iteration. The most common form is k-fold
cross-validation.
K-Fold Cross-Validation: The data are split into K roughly equal folds. For the k-th fold, the
model is fit on the other K − 1 folds, and its prediction error is computed on the held-out k-th
fold; the K resulting error estimates are then averaged. Typical choices are K = 5 or K = 10.
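A minimal manual K-fold cross-validation sketch for a least-squares model; the synthetic data and the choice K = 5 are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(7)
N, K = 120, 5
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=N)

# Randomly assign each observation to one of K folds
folds = rng.permutation(N) % K
cv_errors = []
for k in range(K):
    train, test = folds != k, folds == k
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)   # fit on the other K-1 folds
    cv_errors.append(np.mean((y[test] - X[test] @ beta) ** 2))   # error on the held-out fold

print(f"{K}-fold CV estimate of prediction error: {np.mean(cv_errors):.3f}")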
Bootstrapping
Bootstrapping is a resampling technique that involves generating multiple subsets (samples)
from a single dataset by randomly selecting data points with replacement. Each bootstrap
sample is of the same size as the original dataset, but it's likely to contain some duplicate data
points.
The bootstrap is a general tool for assessing statistical accuracy. First we describe the
bootstrap in general, and then show how it can be used to estimate extra-sample prediction
error. As with cross-validation, the bootstrap seeks to estimate the conditional error ErrT , but
typically estimates well only the expected prediction error Err.
Steps in Bootstrapping:
● Randomly select data points from the original dataset with replacement to form a new
bootstrap sample.
● Repeat this process to create multiple bootstrap samples.
● Perform the analysis (e.g., train a model) on each bootstrap sample.
● Aggregate the results (e.g., average predictions) over all bootstrap samples.
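A sketch of the steps above applied to prediction error: fit a least-squares model on each bootstrap sample and evaluate it on the original training set, averaging over samples. This naive estimate, written Err_boot here, is the one critiqued below; the data are synthetic.

import numpy as np

rng = np.random.default_rng(8)
N, B = 100, 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
y = X @ np.array([1.0, -2.0, 0.5, 1.5]) + rng.normal(scale=0.5, size=N)

boot_errors = []
for b in range(B):
    idx = rng.integers(0, N, size=N)                     # sample rows with replacement
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    boot_errors.append(np.mean((y - X @ beta) ** 2))     # evaluate on the ORIGINAL data

err_boot = np.mean(boot_errors)
print(f"naive bootstrap estimate of prediction error: {err_boot:.3f}")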
However, it is easy to see that Err_boot, the estimate obtained by fitting the model on each
bootstrap sample and averaging its prediction error over the original training set, does not
provide a good estimate in general. The reason is that the bootstrap datasets are acting as the
training samples, while the original training set is acting as the test sample, and these two
samples have observations in common. This overlap can make overfit predictions look
unrealistically good, and is the reason that cross-validation explicitly uses non-overlapping
data for the training and test samples.
Imagine you're training a model to identify cats and dogs. During training, your model gets
really good at recognizing the particular cats and dogs in your dataset. But the true test comes
when you unleash it on a new set of pictures with different cats and dogs. The conditional
test error tells you how well your model is expected to generalize to these fresh examples.
It's like predicting the future performance of your model. The lower the conditional test error,
the better your model is at handling unseen data. So, when you hear someone talking about
minimizing the expected test error, they're essentially fine-tuning their crystal ball to make
better predictions in the real world!
We conclude that estimation of test error for a particular training set is not easy in general,
given just the data from that same training set. Instead, cross-validation and related methods
may provide reasonable estimates of the expected error Err.