Unit 1 - Reinforcement Learning, Overfitting, Training, Validation Sets, Metrics, Bias and Variance
Reinforcement Learning (RL) is a type of machine learning in which an agent learns how
to make decisions by interacting with an environment to maximize a cumulative reward.
Unlike supervised learning, where the model is trained on labeled data, in RL, the agent must
explore the environment, take actions, and learn from the feedback (rewards or penalties) it
receives.
Reinforcement learning has a few fundamental components that define how the agent learns
from the environment and how actions are performed:
1. Agent: The decision-maker. The agent observes the current state of the environment,
chooses an action, and learns from the consequences of its actions. The agent aims to
maximize long-term rewards.
2. Environment: The external system or world in which the agent operates. The
environment can be anything from a video game to a real-world scenario, like robotics
or self-driving cars.
3. State (s): A description of the current situation of the environment. It provides all the
relevant information that the agent needs to make a decision at a specific point in
time. States may be continuous (e.g., the position of a robot) or discrete (e.g., in a
board game).
4. Action (a): The move or decision made by the agent at any given time. Actions
influence the environment and the agent’s future state. The set of all possible actions
is called the action space.
5. Reward (r): A numerical value received by the agent after taking an action in a
particular state. The reward serves as feedback to guide the agent toward desired
behavior. Positive rewards reinforce the action, while negative rewards (penalties)
discourage undesirable behavior.
6. Policy (π): The strategy or plan that the agent follows to choose actions. It maps states
to actions and can be either deterministic (always the same action for a given state) or
stochastic (randomized actions based on probabilities).
7. Value Function (V(s)): A function that estimates how good a particular state is,
considering the expected future rewards that can be obtained starting from that state.
A higher value means that the agent can expect to accumulate more reward from that
state.
8. Action-Value Function (Q(s, a)): Similar to the value function, but here it estimates
the expected future rewards when taking a specific action a in a particular state s and
then following the policy thereafter.
9. Return (G): The total accumulated reward an agent receives from a certain time step
onward. The return is often calculated as a discounted sum of future rewards, where
rewards received later in time are given less importance using a discount factor γ.
10. Discount Factor (γ): A number between 0 and 1 that determines the importance of
future rewards. A value close to 0 makes the agent focus on immediate rewards, while
a value close to 1 makes the agent more future-oriented.
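In symbols, the discounted return from time step t can be written (this is the standard definition) as:
G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σ_{k=0}^{∞} γ^k · r_{t+k+1}
so γ controls how sharply rewards further in the future are down-weighted.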
A key part of RL is balancing exploration and exploitation: the agent must explore enough to
discover the best actions, but also exploit its current knowledge to maximize cumulative
reward. A common technique to address this is the ε-greedy strategy, where the agent usually
exploits but occasionally explores by selecting a random action with probability ε, as in the
sketch below.
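Here is a minimal Python sketch of ε-greedy action selection; the Q-value list and the number of actions are illustrative placeholders, not part of any particular algorithm.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # Explore with probability epsilon: pick a uniformly random action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical usage: four actions with current Q-value estimates.
q = [0.2, 0.8, 0.5, 0.1]
action = epsilon_greedy(q, epsilon=0.1)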
Overfitting
Overfitting occurs when a model learns the training data too closely, including its noise, and
as a result performs poorly on new data. Its defining characteristics are:
1. High Training Accuracy, Low Test Accuracy: The hallmark of overfitting is that
the model shows high performance (accuracy, low error) on the training dataset, but
when evaluated on new, unseen data, its performance drops significantly.
2. Memorizing vs. Generalizing: The model doesn't generalize well to new data
because it has learned specific details or noise from the training data that don't
represent the broader, underlying patterns. This results in poor generalization.
3. Complex Models: Overfitting is more common with complex models, especially
those with many parameters, such as deep neural networks. The more complex the
model, the more capacity it has to memorize the data, which increases the likelihood
of overfitting.
Causes of Overfitting:
1. Excessive Model Complexity: If a model has too many parameters relative to the
amount of training data, it may "over-learn" the details of the training data, including
noise. For instance, a polynomial regression with too high a degree can result in a
model that fits the noise of the data rather than the trend.
2. Insufficient Data: If the dataset is too small or not representative of the broader
population, the model may memorize specific details from the training set, which
don't generalize to unseen data.
3. Lack of Regularization: Regularization techniques, such as L1 and L2
regularization, help penalize overly complex models. Without regularization, the
model might fit excessively complex relationships in the data, leading to overfitting.
4. Too Many Features (High Dimensionality): If the dataset contains a large number
of features, the model might find spurious relationships between those features and
the target variable, even though those relationships do not exist in real-world data.
5. Noise in Data: If there’s noise in the dataset (random variations), an overly complex
model can fit these noisy patterns, causing poor performance on new data that doesn't
have the same noise.
Signs of Overfitting:
Performance Gap: The model shows great performance (e.g., high accuracy) on the
training set but much lower performance on the validation or test set.
Model Instability: Small changes in the training data can lead to large changes in the
model's behavior or predictions, which indicates that the model is too sensitive and
overly fitted to the training data.
How to Detect Overfitting:
Compare the model's performance on the training set with its performance on a held-out
validation set: if training error keeps falling while validation error stalls or rises, the model is
overfitting. Plotting both scores against model complexity or training time makes the
divergence easy to see, as in the sketch below.
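A minimal sketch of this check, using a scikit-learn decision tree on a synthetic dataset (both are illustrative stand-ins, not a prescribed setup):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for depth in (1, 3, 5, 10, None):  # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_val, y_val))
# A widening gap between training and validation accuracy as depth grows
# is the overfitting signature described above.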
Validation Set
A validation set is a subset of the data that is used to evaluate the performance of a machine
learning model during training, but it is not used for training the model itself. It helps tune the
model's hyperparameters and provides an unbiased estimate of the model's ability to
generalize to unseen data.
Data Splitting:
1. Training Set: This subset is used to train the model, meaning it is used to learn the
model's parameters (like weights in neural networks).
2. Validation Set: This subset is used to evaluate the model during the training process.
It helps in choosing the best model or hyperparameters.
3. Test Set: This subset is used to evaluate the final performance of the model after
training. It provides an unbiased measure of the model's generalization ability to
unseen data.
The validation set is often used in techniques such as k-fold cross-validation, where the data
is split into k subsets. Each subset is used as a validation set while the others are used for
training, providing a more robust estimate of model performance.
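A minimal sketch of k-fold cross-validation using scikit-learn's KFold; the toy arrays stand in for a real dataset:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy feature matrix (10 samples)
y = np.arange(10) % 2             # toy binary labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Train on X[train_idx], y[train_idx]; validate on X[val_idx], y[val_idx].
    print(f"fold {fold}: train={train_idx}, val={val_idx}")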
Key properties of the validation set:
Not Used in Training: It is important that the validation set is not used for training
the model in any way. It should be kept separate to provide an unbiased evaluation.
Helps Prevent Overfitting: By evaluating the model on the validation set during
training, you can see if the model is overfitting to the training data, helping you take
action to improve generalization.
Independent from Test Set: The validation set is used during the training process,
whereas the test set is only used after training to evaluate the final model's
performance. This ensures the test set provides an unbiased estimate of model
performance.
Example:
Suppose you are training a machine learning model to classify images of cats and dogs.
1. Training: You split the dataset into 80% for training and 20% for validation. The
model trains on the training data.
2. Validation: After each training epoch, the model is tested on the validation set to see
how well it is performing. You might tune hyperparameters (e.g., adjusting the
learning rate, trying different architectures) based on validation accuracy or loss.
3. Testing: Once you have trained the model and selected the best hyperparameters
using the validation set, you finally test the model on the test set (which was not used
during training or validation) to get the final evaluation metric.
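A minimal sketch of this three-way split with scikit-learn; make_classification stands in for the real image dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the image data

# Hold out the test set first, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% is 20% of the total, giving a 60/20/20 split.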
Key Differences: Training Set vs. Validation Set vs. Test Set:
Subset           Purpose                                                               Used in Training
Training Set     Used to train the model, optimize its parameters (weights).           Yes
Validation Set   Used to tune hyperparameters and evaluate the model during training.  No
Test Set         Used to evaluate final performance after training is complete.        No
Confusion Matrix
For a binary classification problem (e.g., predicting whether an email is spam or not), a
confusion matrix is a 2x2 table that shows how many instances of each class were predicted
versus the actual class. For a multi-class classification problem, the matrix is extended to an
N x N matrix, where N is the number of classes.
In the case of binary classification, the confusion matrix consists of four key components,
arranged as follows:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

Where:
True Positive (TP): These are the instances that were correctly classified as positive (i.e., the
model predicted positive, and the actual class was also positive).
True Negative (TN): These are the instances that were correctly classified as negative (i.e.,
the model predicted negative, and the actual class was also negative).
False Positive (FP): These are the instances that were incorrectly classified as positive (i.e.,
the model predicted positive, but the actual class was negative). This is also called a Type I
error.
False Negative (FN): These are the instances that were incorrectly classified as negative (i.e.,
the model predicted negative, but the actual class was positive). This is also called a Type II
error.
Example:
Consider a simple binary classification problem where a model is trained to predict whether
an email is spam or not spam.
                   Predicted Spam   Predicted Not Spam
Actual Spam        TP = 50          FN = 5
Actual Not Spam    FP = 3           TN = 42
True Positives (TP) = 50: The model correctly identified 50 spam emails.
True Negatives (TN) = 42: The model correctly identified 42 non-spam emails.
False Positives (FP) = 3: The model incorrectly identified 3 non-spam emails as spam.
False Negatives (FN) = 5: The model incorrectly identified 5 spam emails as non-spam.
From the confusion matrix, we can calculate several important metrics that help in evaluating
the performance of a classification model.
1. Accuracy:
Accuracy is the proportion of correctly classified instances (both true positives and true
negatives) out of all instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy = (50 + 42) / (50 + 42 + 3 + 5) = 92 / 100 = 0.92 (92% accuracy)
2. Precision (also called Positive Predictive Value):
Precision measures the accuracy of positive predictions. It answers the question: Of all the
instances that were predicted as positive, how many were actually positive?
Precision = TP / (TP + FP)
Precision = 50 / (50 + 3) = 50 / 53 ≈ 0.943 (94.3% precision)
3. Recall (also called Sensitivity, True Positive Rate):
Recall measures how well the model identifies positive instances. It answers the question: Of
all the actual positives, how many were correctly predicted as positive?
Recall = TP / (TP + FN)
Recall = 50 / (50 + 5) = 50 / 55 ≈ 0.909 (90.9% recall)
4. F1-Score:
The F1-score is the harmonic mean of precision and recall. It provides a balance between
precision and recall, and is particularly useful when you need to balance false positives and
false negatives.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
F1-Score = 2 × (0.943 × 0.909) / (0.943 + 0.909) ≈ 0.926 (92.6% F1-score)
5. Specificity (also called True Negative Rate):
Specificity measures how well the model identifies negative instances. It answers the
question: Of all the actual negatives, how many were correctly predicted as negative?
Specificity = TN / (TN + FP)
Specificity = 42 / (42 + 3) = 42 / 45 ≈ 0.933 (93.3% specificity)
6. False Positive Rate (FPR):
The false positive rate measures the proportion of negative instances that were incorrectly
classified as positive.
FPR = FP / (FP + TN)
FPR = 3 / (3 + 42) = 3 / 45 ≈ 0.067 (6.7% FPR)
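All six metrics can be computed directly from the four confusion-matrix counts; a short sketch using the spam-example numbers from this section:

TP, TN, FP, FN = 50, 42, 3, 5  # counts from the spam example above

accuracy    = (TP + TN) / (TP + TN + FP + FN)                # 0.92
precision   = TP / (TP + FP)                                 # ~0.943
recall      = TP / (TP + FN)                                 # ~0.909
f1          = 2 * precision * recall / (precision + recall)  # ~0.926
specificity = TN / (TN + FP)                                 # ~0.933
fpr         = FP / (FP + TN)                                 # ~0.067

print(accuracy, precision, recall, f1, specificity, fpr)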
For a multi-class classification problem (e.g., classifying animals as cats, dogs, and birds),
the confusion matrix is extended to an N x N matrix where N is the number of classes. Each
row represents the actual class, and each column represents the predicted class. The diagonal
elements of the matrix represent the correct classifications (True Positives for each class),
while the off-diagonal elements represent misclassifications.
            Predicted A   Predicted B   Predicted C
Actual A         50            5             2
Actual B          3           45             7
Actual C          1            4            49
From this matrix, you can compute the precision, recall, F1-score, etc., for each individual
class (A, B, C), as well as overall metrics like macro-averaged F1 or micro-averaged F1.
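A short sketch, assuming NumPy, of deriving per-class metrics and the macro/micro averages from the matrix above:

import numpy as np

# Confusion matrix from the example: rows = actual, columns = predicted (A, B, C).
cm = np.array([[50, 5, 2],
               [3, 45, 7],
               [1, 4, 49]])

tp = np.diag(cm)          # correct predictions per class
fp = cm.sum(axis=0) - tp  # predicted as this class but actually another
fn = cm.sum(axis=1) - tp  # actually this class but predicted as another

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

macro_f1 = f1.mean()            # unweighted mean of per-class F1 scores
micro_f1 = tp.sum() / cm.sum()  # for single-label problems this equals accuracy
print(precision, recall, f1, macro_f1, micro_f1)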
ROC Curve
The Receiver Operating Characteristic (ROC) curve shows how a binary classifier trades off
true positives against false positives as its decision threshold varies. Key Components:
1. True Positive Rate (TPR): Also known as Sensitivity or Recall, it is the proportion
of actual positive instances that are correctly identified by the model.
TPR = True Positives / (True Positives + False Negatives)
2. False Positive Rate (FPR): The proportion of actual negative instances that are
incorrectly classified as positive.
FPR = False Positives / (False Positives + True Negatives)
The X-axis of the ROC curve represents the False Positive Rate (FPR), and the Y-
axis represents the True Positive Rate (TPR).
The curve is generated by varying the classification threshold (the cutoff value) of the
model from 0 to 1.
A higher threshold makes the model more conservative, classifying fewer positives
(leading to a lower TPR and FPR). A lower threshold makes the model more
aggressive, classifying more instances as positive (leading to higher TPR and FPR).
The ideal point on the ROC curve is the top-left corner (0, 1), where the True
Positive Rate is 1 (all positives are correctly classified) and the False Positive Rate is
0 (no negatives are misclassified).
A model that randomly guesses would produce a diagonal line from the bottom-left to
the top-right corner (from FPR = 0 to FPR = 1, and from TPR = 0 to TPR = 1). This
line represents the performance of a model with no discriminative power.
A model that produces a curve closer to the top-left corner is considered to have better
classification performance.
The closer the curve is to the diagonal line (i.e., the line from (0, 0) to (1, 1)), the
worse the model is at distinguishing between positive and negative instances.
Example:
For a medical test for disease detection, if the ROC curve is very close to the top-left
corner, it suggests the test is excellent at identifying sick patients (high TPR) while
avoiding false positives (low FPR).
Practical Considerations:
ROC curves are particularly useful when dealing with imbalanced datasets where the
number of positive and negative classes is disproportionate.
Since ROC curves measure performance across various thresholds, they provide a
more comprehensive view of a model's effectiveness compared to simple accuracy.
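A minimal sketch of computing ROC points and the area under the curve with scikit-learn; the labels and scores are illustrative placeholders:

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                    # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_true, y_score)               # area under the ROC curve
print(fpr, tpr, auc)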
Bias and Variance
Bias and variance are two critical sources of error in machine learning models that affect their
ability to generalize well to unseen data. They help explain the trade-off between underfitting
and overfitting and are key components of the bias-variance trade-off. Here's a detailed
breakdown:
1. Bias
Definition: Bias refers to the error introduced by the model’s assumptions about the
data. A model with high bias makes strong assumptions and typically oversimplifies
the problem. It may not capture the underlying patterns in the data, leading to
systematic errors (i.e., consistent inaccuracies).
Impact: High bias leads to underfitting, where the model is too simple to accurately
represent the training data. It may fail to capture important features, relationships, or
complexities in the data.
Examples:
o A linear regression model applied to data that follows a nonlinear relationship
has high bias because it assumes a linear relationship.
o A decision tree with a very shallow depth (i.e., few splits) might have high
bias, as it will fail to capture the complexity of the data.
Mathematical View: Bias is the difference between the model's expected prediction and
the true function we're trying to approximate:
Bias = E[f(x)] − f_true(x)
In Practice:
o To reduce bias, more complex models or more flexible algorithms are often
used. However, this can increase variance, as discussed below.
2. Variance
Definition: Variance refers to the error introduced by the model’s sensitivity to small
fluctuations in the training data. A model with high variance is highly flexible and
may fit the training data very well, but it will also react to noise or random
fluctuations, leading to overfitting.
Impact: High variance leads to overfitting, where the model captures not only the
underlying patterns in the data but also the noise or random fluctuations. This makes
the model less generalizable to new, unseen data.
Examples:
o A decision tree with a very deep structure (i.e., many splits) will likely fit the
training data perfectly, but it might overfit the data, especially if there is noise.
o A high-degree polynomial regression model can create a curve that fits the
training data almost perfectly but oscillates wildly between points, fitting
noise rather than the true trend.
Mathematical View: Variance is the variability of the model’s predictions across
different training sets.
Variance = E[ (f(x) − E[f(x)])² ]
where f(x) is the model's prediction and E[f(x)] is its expected prediction averaged over
different training sets.
In Practice:
o To reduce variance, simpler models or regularization techniques (like pruning
decision trees or using L2 regularization) can help prevent overfitting.
3. Bias-Variance Trade-Off
The key idea behind the bias-variance trade-off is that reducing bias typically increases
variance, and reducing variance typically increases bias. Ideally, we want to find a model that
strikes a balance between these two:
High Bias, Low Variance: The model is very simple and doesn’t capture the data
well. The predictions are consistent but inaccurate.
Low Bias, High Variance: The model is very complex and captures noise in the data.
The predictions vary greatly depending on the training set, but they are more accurate
on the training data.
Optimal Model: The best model is one that has an appropriate balance, where both
bias and variance are minimized to achieve good generalization performance.
4. Error Decomposition
The total error in a machine learning model can be broken down into three components:
The total expected error (on new, unseen data) can be expressed as:
Total Error = Bias² + Variance + Irreducible Error
Thus, as you make a model more complex (decreasing bias), you usually increase its
variance, and vice versa.
Linear Model: A linear regression model has high bias (when the underlying data is
nonlinear) and low variance. It might underfit the data but is unlikely to overfit.
Complex Models: Random forests, neural networks, or deep learning models tend to
have lower bias and higher variance. They can overfit if not properly controlled, for
example with regularization or ensemble averaging.
Regularization: Techniques like L2 (Ridge) or L1 (Lasso) regularization aim to
reduce variance by penalizing large weights in the model, thus preventing overfitting.
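To make this concrete, here is a minimal sketch comparing an unregularized degree-15 polynomial fit with a Ridge-regularized one on the same noisy data (the dataset, degree, and alpha are illustrative choices, not a prescribed recipe):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 30)).reshape(-1, 1)
y = np.sin(3 * X).ravel() + rng.normal(0, 0.2, 30)  # noisy nonlinear data

# Degree-15 polynomial with no penalty: low bias, high variance.
plain = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X, y)

# Same features with an L2 (Ridge) penalty: large weights are shrunk,
# reducing variance at the cost of a little bias.
ridge = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0)).fit(X, y)

print(plain.score(X, y), ridge.score(X, y))  # training R² for each fit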