Task 1:
Introduction
Machine learning has become a transformative technology with applications in computer
vision, NLP, and healthcare. A key application is image classification, which categorizes
images based on visual content, with real-world uses in autonomous vehicles, medical
imaging, and object recognition. This coursework focuses on image classification using a
dataset containing objects like parachutes, oil boxes, and trucks. We develop a
Convolutional Neural Network (CNN) to accurately classify images, leveraging CNNs' ability
to extract hierarchical features. We also explore data augmentation and model
enhancements to improve performance. The report covers the problem background, CNN
model design, experimental evaluation, and key findings, demonstrating a practical
application of machine learning.
Approach
A Convolutional Neural Network (CNN) was chosen for image classification due to its ability
to automatically learn spatial hierarchies of features from raw pixel data, making it highly
effective for object recognition.
Feature Extraction: CNNs eliminate the need for manual feature engineering by
learning relevant features directly from images.
Input Size (224x224x3): Standardized size for consistency and efficient feature
extraction; RGB channels retained for colour information.
Convolutional Layers:
o Conv1 (32 filters, 3x3, ReLU): Captures low-level features (edges, textures).
o Conv2 (64 filters, 3x3, ReLU): Extracts higher-level features (object parts).
Optimizer (Adam): Efficient learning with adaptive rates and momentum for faster
convergence.
Table 1. Layer type, its parameters, and the shape of its output
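A minimal sketch of this architecture in Keras is shown below. Only the input size, the two convolutional layers, and the Adam optimizer are specified above; the pooling layers, the dense-layer width (128), and the number of output classes (10) are assumptions added to make the sketch runnable, not the exact coursework code.

```python
# Sketch of the CNN described above, using TensorFlow/Keras.
# Pooling layers, the dense width (128) and num_classes (10) are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 10  # assumed number of classes in the dataset

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),                 # RGB input, 224x224
    layers.Conv2D(32, (3, 3), activation='relu'),      # Conv1: low-level features
    layers.MaxPooling2D((2, 2)),                       # assumed pooling layer
    layers.Conv2D(64, (3, 3), activation='relu'),      # Conv2: higher-level features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),              # assumed dense width
    layers.Dense(num_classes, activation='softmax'),   # class probabilities
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()  # prints the layer/parameter/output-shape table (cf. Table 1)
```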
Evaluation protocol:
1. Dataset Splitting: The dataset was already split into training (for learning) and
validation (for tuning hyperparameters) sets, ensuring performance is assessed on unseen
data so that overfitting can be detected.
2. Performance Metrics: The following metrics were used to evaluate the model:
Accuracy: The proportion of correctly classified images; the primary metric for assessing
overall performance.
Loss: Tracks how well predictions align with true labels.
Confusion Matrix: Identifies misclassified classes.
Classification Report: Provides precision, recall, and F1-score for a detailed class-
wise evaluation.
3. Learning Curves
Training vs. Validation Loss
These curves reveal overfitting (validation loss increasing while training loss decreases) or
underfitting (both losses remaining high). A short sketch of how the metrics above can be computed is given below.
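The sketch below shows one way the classification report and confusion matrix could be produced with scikit-learn; the function name and argument names are placeholders, not the coursework code itself.

```python
# Sketch of the evaluation protocol using scikit-learn; the trained CNN and the
# validation arrays are passed in as placeholders (names are assumptions).
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def evaluate(model, val_images, val_labels):
    """Print the classification report and confusion matrix for a trained model."""
    probs = model.predict(val_images)                 # per-image class probabilities
    preds = np.argmax(probs, axis=1)                  # predicted class indices
    print(classification_report(val_labels, preds))   # precision, recall, F1 per class
    print(confusion_matrix(val_labels, preds))        # which classes get confused

# Learning curves: Keras stores per-epoch metrics in history.history after
# history = model.fit(...), e.g. history.history['loss'] vs.
# history.history['val_loss'] for the training/validation loss curves.
```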
Experiment
Figure 1 - Accuracy and loss of each epoch for both training and validation
Key Observations:
The model achieved high training accuracy but struggled to generalize to the
validation set, as evidenced by the increasing validation loss and stagnant validation
accuracy.
Overfitting became evident after the second epoch, as the model memorized the
training data instead of learning generalizable features.
2. Classification Report
Key Observations:
The model performed best on class n03888257 (F1-score: 0.68) and worst on
class n03000684 (F1-score: 0.34).
The low precision and recall for some classes (e.g., n03000684) suggest that the
model struggled to distinguish these classes from others.
Discussion of Findings
Overfitting: The model exhibited clear signs of overfitting, as evidenced by the high
training accuracy and low validation accuracy. This suggests that the model
memorized the training data instead of learning generalizable features.
Class Imbalance: The variability in precision, recall, and F1-scores across classes
indicates potential class imbalance or insufficient representation of certain classes in
the training data.
Model Complexity: The CNN architecture, while effective, may be too complex for
the dataset, leading to overfitting. Simplifying the model or adding regularization
techniques (e.g., dropout, weight decay) could improve generalization.
Figure 5 – Confusion matrix across classes
Enhancements (a combined code sketch of all four follows this list):
1. Data Augmentation
Justification: Expands dataset diversity, improving generalization by exposing the model to varied
input conditions.
Impact: Reduced overfitting, improved validation accuracy, and enhanced robustness to image
variations.
2. Dropout Layers
Implementation: Added a Dropout layer (0.5) after the dense layer, randomly deactivating 50% of
neurons during training.
3. L2 Regularization
4. Early Stopping
Implementation: Used Early Stopping (monitoring validation loss) with patience = 3 and best weight
restoration.
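The sketch below combines the four enhancements in Keras. The dropout rate (0.5), the early-stopping settings (patience = 3, best-weight restoration), and the use of L2 regularization and augmentation come from the report; the specific augmentation transforms, the L2 strength (1e-4), and the dense-layer width are assumptions.

```python
# Sketch of the enhanced model: data augmentation, Dropout(0.5), L2 weight
# decay, and early stopping. Augmentation transforms, L2 strength (1e-4) and
# dense width (128) are assumptions; dropout rate and early-stopping settings
# are as stated in the report.
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers, callbacks

num_classes = 10  # assumed

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    # 1. Data augmentation (active only during training)
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    # 3. L2 regularization on the dense layer (strength assumed)
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),
    # 2. Dropout after the dense layer
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# 4. Early stopping on validation loss with patience = 3
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                     restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=[early_stop])
```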
Figure 6 - Accuracy and loss of each epoch for both training and validation of enhanced method
The results from the improved CNN model show significant changes compared to the previous model.
The previous model achieved a training accuracy of 97.26% but a validation accuracy of only 55.95%,
with a large gap between training and validation metrics, indicating severe overfitting.
The improved model achieved a lower training accuracy (65.23%) but a higher validation accuracy
(58.37%), with a smaller gap between training and validation metrics. This suggests that the
regularization techniques (dropout, L2 regularization) and data augmentation effectively reduced
overfitting.
Figure 7 – Loss vs. Epoch curve for improved version
2. Classification Report
Precision: Ranged from 0.44 (n03425413) to 0.94 (n02102040), showing improved precision
for some classes compared to the previous model.
Recall: Ranged from 0.36 (n02102040) to 0.81 (n03888257), indicating better recall for
certain classes.
F1-score: Ranged from 0.41 (n03000684) to 0.73 (n01440764), reflecting a better balance
between precision and recall for most classes.
Overall Accuracy: The model achieved an accuracy of 58% on the validation set, slightly
higher than the previous model's 56%.
The previous model had an F1-score range of 0.34 to 0.68, while the improved model
achieved a range of 0.41 to 0.73, indicating better overall performance.
The improved model showed higher precision and recall for several classes, such
as n01440764 (F1-score: 0.73) and n03888257 (F1-score: 0.69), compared to the previous
model.
Figure 8 – Accuracy vs Epoch curve for improved version
Discussion of Findings
Reduced Overfitting: The improved model showed a smaller gap between training and
validation metrics, indicating that the regularization techniques and data augmentation
effectively reduced overfitting.
Better Generalization: The improved model achieved higher validation accuracy (58.37%)
compared to the previous model (55.95%), demonstrating better generalization to unseen
data.
Class-Specific Performance: The improved model showed higher precision, recall, and F1-
scores for several classes, indicating that it learned more robust features and performed
better on challenging classes.
Trade-offs: While the improved model achieved better generalization, it required more epochs
to converge and had a lower training accuracy, reflecting the trade-off between fitting the
training data closely and generalizing to unseen data.
Future work could focus on further fine-tuning hyperparameters, exploring advanced augmentation
techniques, or using more complex architectures to achieve even better results.
Task 2:
Introduction
This task involves solving a Gridworld problem using two reinforcement learning methods: Value
Iteration and Q-Learning. The Gridworld contains walls (w), obstacles (o), and a goal (g). The agent
starts at the top-left corner and must navigate to the goal while avoiding obstacles. The goal is to find
an optimal policy that maximizes cumulative rewards.
Value Iteration: A model-based method that computes the optimal value function and derives
the policy. It requires knowledge of the environment's dynamics.
Q-Learning: A model-free method that learns the optimal policy by updating a Q-table through
exploration and exploitation. It does not require prior knowledge of the environment.
Both methods are applied to the Gridworld, and their performance is evaluated based on their ability
to find the optimal policy.
Methods
1. Value Iteration
How It Works: Value Iteration is an iterative algorithm that computes the optimal value
function V*(s) for each state s. The value function represents the expected cumulative reward
when starting from state s and following the optimal policy. The algorithm updates the value function
using the Bellman Optimality Equation.
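For reference, the backup applied at each sweep is the standard Bellman optimality update, where P and R denote the Gridworld's transition probabilities and rewards and γ the discount factor:

```latex
V_{k+1}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V_k(s') \,\bigr]
```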
Implementation Details:
2. Iteration: The value function is updated iteratively until convergence (when the change
in V is below a small threshold ε = 10⁻⁷).
3. Policy Extraction: Once the value function converges, the optimal policy π is derived by
selecting the action that maximizes the expected cumulative reward for each state.
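A minimal sketch of this loop is given below; the transition model P (mapping each state-action pair to (probability, next state, reward) tuples) and the state/action counts are placeholders for the Gridworld defined in the task, not the coursework code.

```python
# Sketch of Value Iteration; P[s][a] is assumed to yield (prob, next_state,
# reward) tuples for the Gridworld, and n_states / n_actions are placeholders.
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, eps=1e-7):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup over all actions
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:          # convergence threshold ε = 1e-7
            break
    # Policy extraction: choose the action maximizing expected return
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return V, policy
```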
Justification:
2. Q-Learning
How It Works: Q-Learning is a model-free RL method that learns the optimal policy by iteratively
updating a Q-table. The Q-table Q(s, a) represents the expected cumulative reward for taking
action a in state s and following the optimal policy thereafter. The algorithm uses the Bellman
equation to update the Q-values.
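For reference, the update applied after each transition (s, a, r, s') is the standard Q-learning rule, with learning rate α:

```latex
Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \,\bigl[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\bigr]
```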
Implementation Details:
4. Policy Extraction: After training, the optimal policy π is derived by selecting the action with
the highest Q-value for each state.
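A minimal sketch of tabular Q-Learning with epsilon-greedy action selection is shown below. The environment interface (reset()/step() returning the next state, reward, and done flag), the learning rate α, and the episode count are assumptions, since the report does not specify them.

```python
# Sketch of tabular Q-Learning with an epsilon-greedy policy; `env` is a
# placeholder exposing reset() -> state and step(a) -> (next_state, reward,
# done). The learning rate alpha and episode count are assumptions.
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else exploit
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            # Q-learning (Bellman) update
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2
    policy = np.argmax(Q, axis=1)   # greedy policy extraction
    return Q, policy
```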
Justification:
Q-Learning is chosen because it is a model-free method that does not require prior
knowledge of the environment's dynamics. It learns the optimal policy through trial and error,
making it suitable for environments where the transition probabilities and rewards are
unknown.
The epsilon-greedy policy ensures a balance between exploration and exploitation, allowing
the agent to discover the optimal policy while avoiding suboptimal solutions.
Experiments
1. Value Iteration
Convergence: The Value Iteration algorithm converged in 250 iterations with a convergence
time of 0.0836 seconds.
Policy: The derived policy was visualized on the Gridworld, showing the optimal path from the
start to the goal while avoiding obstacles and walls.
Performance: The average reward over 100 episodes was -125.00, indicating that the policy
needs further refinement to improve performance.
2. Q-Learning
Policy: The derived policy was visualized on the Gridworld, showing the optimal path from the
start to the goal while avoiding obstacles and walls.
Performance: The average reward over 100 episodes was 67.00, demonstrating better
performance compared to Value Iteration.
Table 2 – Method comparison
Analysis:
Convergence Time: Value Iteration converged significantly faster than Q-Learning. This is
expected because Value Iteration is a model-based method that directly computes the optimal
value function using known environment dynamics.
Policy Quality: Q-Learning achieved a higher average reward (67.00) compared to Value
Iteration (-125.00). This suggests that Q-Learning's exploration strategy (epsilon-greedy)
allowed it to discover a more effective policy.
Exploration: Q-Learning's ability to explore the environment and learn from interactions
made it more robust in finding an optimal policy, especially in complex or unknown
environments.
Performance: The negative average reward for Value Iteration indicates that the derived
policy may not be optimal or that the reward structure needs adjustment. In contrast, Q-
Learning's positive average reward demonstrates its effectiveness in maximizing cumulative
rewards.
Discussion
Value Iteration converged quickly (250 iterations, 0.0836s) but yielded a suboptimal policy with an
average reward of -125. The policy graph suggests ineffective obstacle avoidance, likely due to
misaligned rewards or transition probabilities. Adjusting the reward function, such as increasing
penalties for moving toward obstacles, could improve performance.
Q-Learning, though slower, achieved a significantly higher average reward of 67.00. Its policy graph
shows better pathfinding, with the agent reliably avoiding obstacles and reaching the goal. This
improvement stems from Q-Learning’s model-free learning and epsilon-greedy exploration, allowing
for better policy discovery. The cumulative reward graph confirms steady learning progress. In
summary, Q-Learning outperformed Value Iteration in this task, achieving better rewards through
exploration and adaptability. While Value Iteration's speed makes it suitable for well-defined problems,
its reliance on accurate modelling can limit performance, as seen in the suboptimal policy graph.
Future work could refine the reward structure for Value Iteration and optimize Q-Learning's
hyperparameters for further improvements.
Part C:
To investigate the effects of the discount factor (γ) and exploration rate (ε) on the performance of Q-
learning, a systematic experimental approach was implemented. The goal was to analyse how
different combinations of these parameters influence the agent's learning process, policy quality, and
overall performance.
1. Parameter Selection
Discount Factor (γ): Five values of γ were tested: [0.1, 0.5, 0.7, 0.9, 0.99]. These values
represent a range from short-term to long-term reward focus.
Exploration Rate (ε): Five values of ε were tested: [0.1, 0.3, 0.5, 0.7, 0.9]. These values
represent a range from low to high exploration.
2. Q-learning Implementation
For each combination of γ and ε, the Q-learning algorithm was executed with the same initialization
and training loop; a sketch of this sweep is given after the metrics list below.
1. Metrics Calculation:
o Success Rate: The percentage of episodes where the agent reached the goal.
o Average Episode Length: The mean number of steps taken per episode.
o Final Q-value Variance: The variance of the Q-values at the end of training, indicating
the stability of the learned policy.
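The sketch below outlines the sweep over all (γ, ε) pairs and the metric computation. The `run_q_learning(gamma, epsilon)` callable is a hypothetical wrapper around the Q-learning sketch above that also records per-episode success flags and lengths; it is not part of the original coursework code.

```python
# Sketch of the gamma/epsilon sweep; run_q_learning(gamma, epsilon) is a
# hypothetical wrapper assumed to return (Q, successes, lengths) for one run,
# where successes and lengths are per-episode arrays.
import itertools
import numpy as np

def sweep(run_q_learning):
    gammas   = [0.1, 0.5, 0.7, 0.9, 0.99]
    epsilons = [0.1, 0.3, 0.5, 0.7, 0.9]
    results = {}
    for gamma, epsilon in itertools.product(gammas, epsilons):
        Q, successes, lengths = run_q_learning(gamma, epsilon)
        results[(gamma, epsilon)] = {
            'success_rate': float(np.mean(successes)),      # fraction of goal-reaching episodes
            'avg_episode_length': float(np.mean(lengths)),  # mean steps per episode
            'final_q_variance': float(np.var(Q)),           # stability of the learned Q-values
        }
    return results
```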
3. Visualisation
Success Rate vs. Episodes: The cumulative success rate over episodes was plotted for each
combination of γ and ε to observe how quickly the agent learned to reach the goal.
Average Episode Length vs. Episodes: The average episode length over episodes was
plotted to analyse how efficiently the agent navigated the Gridworld.
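A minimal plotting sketch for these two curves follows; the per-episode success flags and episode lengths are passed in as placeholder arrays recorded during training.

```python
# Sketch of the two diagnostic plots; `successes` and `lengths` are assumed to
# be per-episode arrays (success flag and step count) recorded during training.
import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curves(successes, lengths):
    episodes = np.arange(1, len(successes) + 1)

    # Cumulative success rate vs. episodes
    plt.figure()
    plt.plot(episodes, np.cumsum(successes) / episodes)
    plt.xlabel('Episode'); plt.ylabel('Cumulative success rate')

    # Running-average episode length vs. episodes
    plt.figure()
    plt.plot(episodes, np.cumsum(lengths) / episodes)
    plt.xlabel('Episode'); plt.ylabel('Average episode length')
    plt.show()
```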
Table 3 - numeric results for effects analysis
The discount factor determines the importance of future rewards. A higher γ values future rewards
more heavily, while a lower γ focuses on immediate rewards.
Low γ (0.1): Poor performance across all ε values, with average rewards ranging from -
175.89 to -131.58 and success rates below 0.11%. A low γ causes the agent to prioritize
immediate rewards, which is ineffective in environments where long-term planning is required
to reach the goal. The agent fails to learn a meaningful policy.
Moderate γ (0.5, 0.7, 0.9): Significant improvement in performance, especially for low ε
values. For example, with γ=0.5 and ε=0.1, the average reward is 25.47, and the success rate
is 79.78%. A moderate γ balances immediate and future rewards, enabling the agent to learn
effective policies. The agent can navigate the Gridworld efficiently, as shown by the higher
success rates and lower episode lengths.
High γ (0.99): Performance is similar to that with moderate γ values, with high success rates (e.g., 80.34% for γ=0.99 and
ε=0.1) and low episode lengths. A high γ emphasizes long-term rewards, which is beneficial in
this environment. However, the performance is comparable to moderate γ values, suggesting
diminishing returns for very high γ.
Table 4 – numeric results for effects analysis
Low ε (0.1): High success rates (e.g., 79.78% for γ=0.5 and ε=0.1) and low episode lengths
(e.g., 35.09 steps). A low ε prioritizes exploitation, allowing the agent to follow the best-known
policy. This works well when the agent has already learned a good policy but may fail if the
initial policy is poor.
Moderate ε (0.3, 0.5): Mixed results. For γ=0.5, ε=0.3 yields a success rate of 44.18%, while
ε=0.5 yields 14.76%. Moderate ε balances exploration and exploitation. While it helps the
agent discover better policies, excessive exploration (e.g., ε=0.5) can reduce performance by
diverting the agent from optimal paths.
High ε (0.7, 0.9): Poor performance, with success rates close to 0% and high episode
lengths. High ε prioritizes exploration, causing the agent to take random actions frequently.
This prevents the agent from converging to an optimal policy, as it spends too much time
exploring suboptimal paths.
Low γ and High ε: The worst performance is observed, as the agent focuses on immediate
rewards and explores excessively, failing to learn a meaningful policy.
Moderate/High γ and Low ε: The best performance is achieved, as the agent balances long-
term rewards with exploitation of the learned policy.
Discussion
The results demonstrate that the discount factor (γ) and exploration rate (ε) significantly impact the
performance of Q-learning in the Gridworld environment. There is a clear trade-off between
exploration and exploitation; while some exploration is necessary to discover good policies, excessive
exploration prevents the agent from converging to an optimal policy. For similar environments, it is
recommended to use a moderate to high γ (e.g., 0.7 to 0.9) and a low ε (e.g., 0.1 to 0.3) to achieve
the best performance. In conclusion, the results highlight the importance of carefully selecting γ and ε
in Q-learning, with moderate to high γ and low ε generally yielding the best results in this Gridworld
scenario. Future improvements could involve fine-tuning hyperparameters, implementing epsilon
decay, exploring advanced strategies like Boltzmann exploration, and refining the reward structure.
This study provided key insights into optimizing Q-learning performance.