Data Science Unit 5
Introduction
Generalization in machine learning is the ability of a model to classify or forecast new data. When a model trained on one dataset is given data that was absent from the training set and still performs well, we call the model generalizable. It does not have to work on all data types, only on similar domains or datasets.
It is important to understand what unseen data is. Unseen data is data that is new to the model and was not part of training. Models naturally perform better on observations they have seen before, so the greater benefit comes from models that also perform well on unseen data.
Benefits of Generalization
Since generalization is an advantage, it is worth examining some of the factors that influence it during the model design cycle.
All models behave differently: how they treat data and how they optimize their performance varies. Decision Trees, for example, are non-parametric, which makes them prone to overfitting. To address generalization, the nature of the algorithm should be considered deliberately. High model complexity makes overfitting easy, and model regularization can restore the balance between fitting the data and generalizing. For deep networks, changing the network structure by reducing the number of weights, or constraining the network parameters (the values of the weights) themselves, can do the trick.
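As an illustration, here is a minimal sketch of that trade-off, assuming Python with scikit-learn (the notes do not prescribe a library): a small neural network whose alpha parameter adds an L2 penalty on the weights, a common form of regularization.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data (illustrative).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# alpha is an L2 penalty on the weights; larger values mean stronger regularization.
for alpha in (0.0001, 1.0):
    net = MLPClassifier(hidden_layer_sizes=(32,), alpha=alpha,
                        max_iter=2000, random_state=0).fit(X_tr, y_tr)
    # Compare accuracy on seen vs. held-out data for weak and strong regularization.
    print(alpha, net.score(X_tr, y_tr), net.score(X_te, y_te))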
The other side is the dataset used for training. Sometimes datasets are too uniform, with little variation between samples. A dataset of bicycles may be so uniform that it cannot be used to detect motorcycles. To achieve a generalized machine learning model, the dataset should contain diversity: as many different kinds of samples as possible should be included to cover a wide range of cases. During training, we can also use cross-validation techniques, e.g., K-fold, to check how sensible our model is while targeting generalization. A simple first check is to hold out part of the data, as sketched below.
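A minimal sketch of that check, assuming Python with scikit-learn and its built-in iris dataset (an illustrative choice, not from the notes):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 30% of the data so the model can be scored on observations it never saw.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("seen data:", model.score(X_tr, y_tr))
print("unseen data:", model.score(X_te, y_te))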
Non-Generalization of Models
Sometimes models do not require generalization; they should do only what they are strictly expected to do. Whether that is best depends on the application. I may want my model trained on images of motorcycles to identify all similar vehicles, including bicycles and even wheelchairs; that would be very robust. In another application this may not be good: we may want a model trained on motorcycles to identify motorcycles strictly and not bicycles, for example when counting motorcycles in a parking lot that also contains bicycles.
Using the factors above, we can decide and control when we want generalization and when we do not. Since generalization carries risks, non-generalization should be preferred when the means are available: a separate model can then be developed for bicycles and another for wheelchairs. When resources such as time and data are limited, generalization can be used instead.
A good fit is what we need to target when we want a model that can be generalized.
Sample Evaluation Metrics
Evaluation metrics measure the performance of a machine learning model and are an integral component of any data science project. They aim to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.
Confusion Matrix
A confusion matrix is a matrix representation of the prediction results of any binary testing that
is often used to describe the performance of the classification model (or “classifier”) on a set
of test data for which the true values are known.
The confusion matrix itself is relatively simple to understand, but the related terminology can be
confusing.
Each prediction can be one of four outcomes, based on how it matches up to the actual value:
● True Positive (TP): the model predicted yes, and the actual value is yes.
● True Negative (TN): the model predicted no, and the actual value is no.
● False Positive (FP): the model predicted yes, but the actual value is no.
● False Negative (FN): the model predicted no, but the actual value is yes.
A Hypothesis is speculation or theory based on insufficient evidence that lends itself to further
testing and experimentation. With further testing, a hypothesis can usually be proven true or
false.
A Null Hypothesis is a hypothesis that says there is no statistical significance between the two
variables in the hypothesis. It is the hypothesis that the researcher is trying to disprove.
Ideally, we would always reject the null hypothesis when it is false and accept it when it is indeed true.
Even though hypothesis tests are meant to be reliable, there are two types of errors that can
occur.
For example, when examining the effectiveness of a drug, the null hypothesis would be that the
drug does not affect a disease.
Type I Error:- equivalent to False Positives(FP).
The first kind of error is the rejection of a null hypothesis that is actually true. In the drug example, a Type I error means concluding that the drug is effective when it is not.
Type II Error:- equivalent to False Negatives(FN).
The other kind of error occurs when we accept a null hypothesis that is actually false. This sort of error is called a type II error, also referred to as an error of the second kind. In the drug example, a Type II error means concluding that the drug is ineffective when it actually works.
Accuracy
Accuracy = (TP + TN) / total = (TP + TN) / (TP + TN + FP + FN)
When our classes are roughly equal in size, we can use accuracy, which will give us correctly
classified values.
Accuracy is a common evaluation metric for classification problems. It’s the number of correct
predictions made as a ratio of all predictions made.
Misclassification Rate (Error Rate): overall, how often the classifier is wrong. Since accuracy is the percentage we classified correctly (the success rate), the error rate (the percentage we got wrong) can be calculated as follows:
Error Rate = (FP + FN) / total = 1 - Accuracy
Precision
Precision answers: when the model predicts yes, how often is it correct?
Precision = TP / predicted yes = TP / (TP + FP)
Recall or Sensitivity
Recall gives us the true positive rate (TPR), the ratio of true positives to everything that is actually positive:
Recall = TP / actual yes = TP / (TP + FN)
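These metrics can all be computed from the confusion-matrix counts. A minimal sketch, assuming Python with scikit-learn and a small hand-made set of labels (illustrative values, not from the notes):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # model predictions (illustrative)

# scikit-learn returns the binary counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                    # 4 1 1 4
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 0.8
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 0.8
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 0.8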
What is Cross-Validation?
Cross validation is a technique used in machine learning to evaluate the performance of a model
on unseen data. It involves dividing the available data into multiple folds or subsets, using one of
these folds as a validation set, and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of the model’s
performance. Cross validation is an important step in the machine learning process and helps to
ensure that the model selected for deployment is robust and generalizes well to new data.
The main purpose of cross validation is to prevent overfitting, which occurs when a model is
trained too well on the training data and performs poorly on new, unseen data. By evaluating the
model on multiple validation sets, cross validation provides a more realistic estimate of the
model’s generalization performance, i.e., its ability to perform well on new, unseen data.
Types of Cross-Validation
There are several types of cross validation techniques, including k-fold cross validation, leave-one-out cross validation, holdout validation, and stratified cross-validation. The choice of technique depends on the size and nature of the data, as well as the specific requirements of the modeling problem.
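A minimal k-fold sketch, assuming Python with scikit-learn (the notes name the technique but not a library): five folds, each used once as the validation set, with the scores averaged at the end.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds: each fold serves once as the validation set while the rest train the model.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged estimate of generalization performance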
When we talk about a machine learning model, we are really talking about how well it performs and how accurate its predictions are, i.e., its prediction error. A model is a good machine learning model if it generalizes properly to any new input data from the problem domain; this lets us make predictions on future data the model has never seen. To understand how well a model learns and generalizes to new data, we look at overfitting and underfitting, which are the major causes of poor performance in machine learning algorithms.
● Bias: Bias is the error due to overly simple assumptions in the learning algorithm. A high-bias model misses relevant relations between features and the target, performing poorly on both the training and the testing data, which indicates underfitting.
● Variance: Variance, on the other hand, is the error due to the model's sensitivity to fluctuations in the training data. It is the variability of the model's predictions across different training samples. High variance occurs when a model learns the training data's noise and random fluctuations rather than the underlying pattern. As a result, the model performs well on the training data but poorly on the testing data, indicating overfitting.
Underfitting
A statistical model or a machine learning algorithm is said to underfit when it is too simple to capture the complexities of the data. The model fails to learn the training data effectively, resulting in poor performance on both the training and the testing data. In simple terms, underfit models are inaccurate, especially when applied to new, unseen examples. Underfitting mainly happens when we use a very simple model with overly simplified assumptions. To address it, we need more complex models, enhanced feature representation, and less regularization.
Note: An underfitting model has high bias and low variance.
Reasons for underfitting include:
1. The model is too simple, so it may not be capable of representing the complexities in the data.
2. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
3. Excessive regularization is used to prevent overfitting, which constrains the model from capturing the data well.
In such cases, increasing model capacity, enriching the features, reducing regularization, or increasing the number of epochs or the duration of training can give better results, as sketched below.
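A minimal sketch of fixing underfitting with a more complex model, assuming Python with scikit-learn and synthetic quadratic data (illustrative, not from the notes): a plain linear model underfits, while adding polynomial features captures the pattern.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)  # quadratic target with noise

linear = LinearRegression().fit(X, y)  # a straight line: too simple for a parabola
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(linear.score(X, y))  # low R^2: the model underfits
print(poly.score(X, y))    # high R^2: richer features capture the pattern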
Overfitting
A statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained on so much detail that it starts learning from the noise and inaccurate entries in the data set, testing on new data results in high variance: the model no longer categorizes the data correctly because it has absorbed too many details and too much noise. Non-parametric and non-linear methods are common causes of overfitting, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. Solutions include using a linear algorithm if the data is linear, or constraining parameters such as the maximal depth when using decision trees, as sketched below.
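A minimal sketch of the decision-tree remedy, assuming Python with scikit-learn and synthetic data (illustrative, not from the notes): an unbounded tree memorizes the training set, while capping max_depth narrows the train/test gap.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = grow the tree until it memorizes the training data
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Compare training accuracy with accuracy on held-out data.
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))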
The following table contrasts Lasso (L1) and Ridge (L2) regression:

Aspect | Lasso Regression | Ridge Regression
Coefficient shrinkage | Strong shrinkage; can result in exact zeros. | Moderate shrinkage; coefficients are close to zero.
Feature selection | Automatically selects relevant features. | Retains all features; reduces the impact of less important ones.
Interpretability | Can provide a sparse model with selected features. | Retains all features; less sparse model.
Bias-variance trade-off | More biased but less variance. | Less biased but more variance.
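A minimal sketch of the table's first two rows, assuming Python with scikit-learn and synthetic data (illustrative, not from the notes): Lasso drives some coefficients to exactly zero, while Ridge only shrinks them.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.round(lasso.coef_, 2))  # several coefficients are exactly 0 (feature selection)
print(np.round(ridge.coef_, 2))  # all coefficients kept, shrunk toward 0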
● Overfitting: Overfitting occurs when a regression model performs well on the training
data but fails to generalize well to new, unseen data. It often happens when the model
becomes too complex, capturing noise or irregularities specific to the training set. Ridge
regression helps mitigate overfitting by adding a penalty term that discourages large
coefficient values. By shrinking the coefficients, it reduces the complexity of the model
and improves its generalization ability.
● Prediction Accuracy: When the main objective is accurate prediction rather than
interpreting individual coefficients, ridge regression can be advantageous. By reducing
the variance of coefficient estimates, it enhances the stability of the model, resulting in
improved prediction performance on new data.
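A minimal sketch of choosing the penalty strength for prediction accuracy, assuming Python with scikit-learn (illustrative, not from the notes): RidgeCV uses built-in cross-validation to pick alpha.

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=20.0, random_state=0)

# RidgeCV tries each candidate alpha with built-in cross-validation and keeps the best.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X, y)
print(model.alpha_)  # penalty strength selected for best predictive performance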
FAQ