PYTHON PROGRAMMING
ASSIGNMENT 2
Hanoi
TABLE OF CONTENTS
1. Introduction
1.1. Background and Motivation
1.2. The Challenge of Image Classification
1.3. Evolution of Image Classification
2. Building a Multi-Layer Perceptron (MLP)
2.1. Libraries Used and Rationale
2.2. Key Concepts
2.3. Implementation Steps and Detailed Rationale
3. Building a Convolutional Neural Network (CNN)
3.1. Libraries Used and Rationale
3.2. Key Concepts
3.3. Implementation Steps and Detailed Rationale
4. Performing Image Classification
4.1. Libraries Used and Rationale
4.2. Key Concepts
4.3. Implementation Steps and Detailed Rationale
5. Plotting Learning Curves
5.1. Libraries Used and Rationale
5.2. Key Concepts
5.3. Implementation Steps and Detailed Rationale
6. Plotting Confusion Matrix
6.1. Libraries Used and Rationale
6.2. Key Concepts
6.3. Implementation Steps and Detailed Rationale
7. Comparison and Discussion of Results
7.1. Quantitative Performance Metrics
7.2. Qualitative Analysis of Learning Curves
7.3. Insights from Confusion Matrices
7.4. Architectural Advantages, Disadvantages, and Computational Considerations
7.5. Impact of Implemented Techniques on Model Performance
8. Conclusion
1. Introduction
1.1. Background and Motivation
We live in a world saturated with visual information. From the countless photographs shared on social media and the vast archives of digitized art and historical documents to the real-time video feeds from surveillance systems and autonomous vehicles, images and videos constitute a significant and rapidly growing portion of global data. The ability to automatically process, analyze, and understand this visual data is no longer a niche scientific pursuit but a critical technological capability with far-reaching implications across diverse sectors. Applications range from healthcare (medical image analysis for disease diagnosis), security (facial recognition, anomaly detection), retail (visual search, automated checkout), entertainment (content-based image retrieval, special effects), and autonomous systems (environmental perception for navigation) to scientific research itself (astronomical image analysis, microscopy). The sheer volume and complexity of this visual data necessitate the development of sophisticated computational tools that can extract meaningful information and perform tasks that traditionally require human visual acuity and cognitive interpretation. This project delves into one of the most fundamental of these tasks: image classification.
Deep learning models, most notably Convolutional Neural Networks (CNNs), learn hierarchical feature representations directly from raw pixel data. This "end-to-end" learning approach, where feature extraction and classification are integrated into a single, optimizable system, has led to state-of-the-art performance on a wide range of visual tasks, often surpassing human capabilities in specific domains. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) played a pivotal role in demonstrating the power of CNNs, with models like AlexNet, VGGNet, GoogLeNet, and ResNet progressively pushing the boundaries of accuracy.
• Local Receptive Fields: Neurons in early layers connect only to small, localized
regions of the input image, allowing them to detect local patterns like edges or
corners.
• Parameter Sharing: The same set of weights (a filter or kernel) is applied across
different spatial locations in the input, enabling the network to detect a specific
feature regardless of its position (translation invariance) and significantly reducing
the number of parameters to learn.
The CNN.py script implements such an architecture, incorporating techniques like Batch Normalization and Dropout.
Given their architectural advantages for visual data, CNNs are generally expected to
outperform MLPs on image classification tasks.
2. Building a Multi-Layer Perceptron (MLP)
2.1. Libraries Used and Rationale
• torch and torch.nn: For building the neural network. nn.Module is the base class, nn.Linear defines fully connected layers.
• matplotlib.pyplot as plt: For plotting learning curves and sample image predictions.
• numpy as np: For numerical operations, often in conjunction with Matplotlib or
Scikit-learn.
• seaborn as sns: For creating more visually appealing heatmaps for the confusion matrix.
The choice of these libraries is standard for deep learning tasks with PyTorch, providing
a comprehensive toolkit for model building, data handling, training, evaluation, and
visualization.
2.2. Key Concepts
a. Network Architecture
• Input Layer: Conceptually, this layer takes the flattened image data. CIFAR-10 images (3×32×32=3072 features) are flattened within the training loop before being passed to the model.
• Hidden Layers: Two fully connected layers, fc1 (120 neurons) and fc2 (84 neurons), each followed by a ReLU activation, feeding a final 10-neuron output layer.
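A minimal sketch of this architecture, assuming the 120/84-neuron hidden layers noted in the Conclusion (any other detail is an assumption, not MLP.py verbatim):

    import torch.nn as nn
    import torch.nn.functional as F

    class MLP(nn.Module):
        """Fully connected network for flattened 3x32x32 CIFAR-10 images."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.fc1 = nn.Linear(3 * 32 * 32, 120)   # 3072 -> 120
            self.fc2 = nn.Linear(120, 84)            # 120 -> 84
            self.fc3 = nn.Linear(84, num_classes)    # 84 -> 10 logits

        def forward(self, x):
            x = F.relu(self.fc1(x))   # hidden layer 1 + ReLU
            x = F.relu(self.fc2(x))   # hidden layer 2 + ReLU
            return self.fc3(x)        # raw logits; CrossEntropyLoss applies softmax internally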
b. The Neuron
At the core of each neuron (or node) within a fully connected layer (like nn.Linear) is a
two-step computation process that transforms the inputs it receives from the previous
layer:
• Linear Transformation: The neuron first computes a weighted sum of its inputs plus a bias term, z = W·x + B.
• Activation Function: The result of this linear transformation, z (often called the
"logit" or "pre-activation"), is then passed through a non-linear activation function,
a = f (z), to produce the neuron’s final output or "activation." This non-linearity is
crucial for the network’s ability to learn complex patterns that go beyond simple
linear relationships.
The weights (W) and biases (B) associated with each neuron are the parameters of the
model. These parameters are initialized (often randomly or using specific schemes)
before training and are then iteratively adjusted during the training process (via backpropagation and an optimization algorithm) to minimize the difference between the network's predictions and the actual target values.
c. Activation Functions
The MLP.py script uses F.relu (Rectified Linear Unit) as the activation function after the
first two linear layers (fc1 and fc2).
ReLU (f (x) = max(0, x)): Introduces non-linearity, allowing the model to learn complex
mappings. Its advantages include computational efficiency and mitigating vanishing
gradients for positive inputs.
d. Data Flattening
The MLP.py script performs flattening explicitly within the training and evaluation loops
using X_train_flat = X_train.view(X_train.shape[0], -1). This reshapes the [batch_size,
3, 32, 32] image tensor into [batch_size, 3072] before feeding it to model().
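To illustrate the shape change, a toy example (the batch size of 100 matches MLP.py's training batch size; the random data is a stand-in for real images):

    import torch

    X_train = torch.randn(100, 3, 32, 32)               # dummy batch of CIFAR-10-sized images
    X_train_flat = X_train.view(X_train.shape[0], -1)   # collapse 3*32*32 into one dimension
    print(X_train_flat.shape)                           # torch.Size([100, 3072])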
• Loss of Spatial Information: The most significant drawback is that MLPs require the input image to be flattened into a 1D vector. For an image of size W×H with C channels (e.g., 32×32×3 for CIFAR-10), this results in a long vector of W×H×C features. This flattening process discards the crucial 2D (or 3D, including channels) spatial structure of the image. Pixels that are close together in the image (and thus likely related, forming edges, textures, or parts of objects) are treated independently by the initial layer after flattening, just like pixels that are far apart. The network loses the information about the relative positions of pixels and local correlations, which are vital for visual pattern recognition.
• Lack of Translation Invariance: MLPs are not inherently translation invariant.
If an object appears in the top-left corner of one training image and the bottom-
right corner of another, an MLP treats these as entirely different input patterns. It
needs to learn to recognize the object (and its features) at every possible position
independently. This is highly inefficient and requires a vast amount of training
data to cover all spatial variations. In contrast, features useful for identifying an
object (like a specific texture or edge) should be detectable regardless of where
they appear in the image.
• Sensitivity to Input Permutations: Because an MLP treats each input feature
(flattened pixel) independently in its connections to the first hidden layer, if the
order of pixels in the flattened vector were permuted, the learned weights would
no longer be meaningful, and the model would perform poorly. While images are
not typically permuted, this highlights the lack of structural understanding.
These limitations mean that while an MLP can learn to classify images, it often requires
more data, more parameters (leading to overfitting risk), and achieves lower accuracy
compared to architectures like CNNs that are specifically designed to exploit the spatial
nature of image data.
d. Training Loop
• epochs = 10.
• Training Phase: Sets model.train(), iterates over train_loader, flattens each batch, computes the loss, backpropagates, and steps the optimizer, as sketched below.
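A condensed sketch of this loop, assuming model, criterion, optimizer, and train_loader are defined as described earlier in this report:

    epochs = 10
    for epoch in range(epochs):
        model.train()                                          # training mode
        running_loss = 0.0
        for X_train, y_train in train_loader:
            X_train_flat = X_train.view(X_train.shape[0], -1)  # flatten to [batch, 3072]
            optimizer.zero_grad()                              # clear old gradients
            outputs = model(X_train_flat)                      # forward pass -> logits
            loss = criterion(outputs, y_train)                 # cross-entropy loss
            loss.backward()                                    # backpropagation
            optimizer.step()                                   # parameter update
            running_loss += loss.item()
        print(f"Epoch {epoch + 1}: train loss = {running_loss / len(train_loader):.4f}")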
g. Confusion Matrix
• Collects all predictions and true labels from the test_loader.
• Uses seaborn.heatmap to plot the confusion matrix with class labels. The plot is
displayed using plt.show().
h. Visualization of Sample Predictions
Loads a batch from test_loader, makes predictions, and displays 64 sample images
with their true and predicted labels, color-coding correct/incorrect predictions.
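A minimal sketch of such a grid (the 8×8 layout and the exact title format are assumptions; model, test_loader, and classes are as defined in MLP.py):

    import torch
    import matplotlib.pyplot as plt

    model.eval()
    images, labels = next(iter(test_loader))                   # one test batch
    with torch.no_grad():
        preds = model(images.view(images.shape[0], -1)).argmax(dim=1)

    fig, axes = plt.subplots(8, 8, figsize=(12, 12))           # 64 sample images
    for ax, img, label, pred in zip(axes.flat, images, labels, preds):
        ax.imshow(img.permute(1, 2, 0))                        # CHW -> HWC for imshow
        color = "green" if pred == label else "red"            # correct vs incorrect
        ax.set_title(f"{classes[pred]} ({classes[label]})", color=color, fontsize=7)
        ax.axis("off")
    plt.tight_layout()
    plt.show()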
3. Building a Convolutional Neural Network (CNN)
3.1. Libraries Used and Rationale
• os: Imported, potentially for num_workers logic based on OS, though the provided snippet in the Appendix shows num_workers_global = 0 if os.name == 'nt' else 2.
3.2. Key Concepts
a. Convolutional Layers
Convolutional layers are the defining elements of CNNs. They operate on input volumes
(e.g., an image or the feature map from a previous layer) and transform them into output
volumes.
◦ Stride: Controls how many pixels the filter slides over at each step. A stride
of 1 means the filter moves one pixel at a time. A stride of 2 means it moves
two pixels at a time, resulting in a smaller output feature map.
◦ Padding: Refers to adding extra pixels (usually zeros) around the border of the input volume before the convolution. This is useful for two main reasons: it allows the filter to be centered on border pixels, ensuring they are processed as thoroughly as interior pixels, and it can be used to control the spatial size of the output feature map. For instance, with a 3x3 kernel and stride 1, using padding=1 (one layer of zeros around the input) will result in an output feature map with the same height and width as the input. This is often called "same" padding. Without padding, the feature map would shrink after each convolution.
• Parameter Sharing and Local Connectivity: Two key properties make convolutional layers efficient and effective for images: parameter sharing (the same filter weights are reused at every spatial location, as described in Section 1) and local connectivity (each output value depends only on a small local region of the input).
The CNN.py script uses three convolutional layers, each with 3x3 kernels and
padding of 1.
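The output spatial size follows the standard relation out = floor((in + 2·padding − kernel) / stride) + 1, so a 3x3 kernel with stride 1 and padding 1 preserves a 32x32 input. A quick shape check (the filter count of 32 is illustrative):

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)
    x = torch.randn(1, 3, 32, 32)   # one CIFAR-10-sized input
    print(conv(x).shape)            # torch.Size([1, 32, 32, 32]) -- "same" padding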
b. Batch Normalization
Included in CNN.py after each convolutional layer and before the ReLU activation.
• Purpose: Addresses internal covariate shift, where the distribution of each layer’s
inputs changes during training as the parameters of the previous layers change.
• How it Works: Normalizes the output of the previous layer by subtracting the
batch mean and dividing by the batch standard deviation. It then scales and shifts
the normalized output using two learnable parameters (gamma and beta) per
feature map.
• Benefits: Faster training (allows higher learning rates), stabilizes training, acts
as a slight regularizer. During evaluation (model.eval()), it uses learned running
statistics.
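In equation form, with batch mean μB and batch variance σB², Batch Normalization computes, per feature map:

    x̂ = (x − μB) / sqrt(σB² + ε)    (normalize over the batch; ε is a small stability constant)
    y  = γ · x̂ + β                  (learnable scale γ and shift β)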
c. Pooling Layers
Pooling layers, such as nn.MaxPool2d used in CNN.py, are typically inserted between
successive convolutional layers.
• Purpose: Progressively reduce the spatial dimensions of the feature maps, lowering computation and parameter counts in later layers and providing a degree of translation invariance.
• Max Pooling: Specifically, max pooling selects the maximum value from a patch of features in the input feature map. This tends to retain the most prominent features within that patch. CNN.py uses MaxPool2d(kernel_size=2, stride=2) after each (Conv -> BatchNorm -> ReLU) block, which halves the height and width of the feature maps at each step.
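A quick shape check of that halving (the channel count of 32 is illustrative):

    import torch
    import torch.nn as nn

    pool = nn.MaxPool2d(kernel_size=2, stride=2)
    x = torch.randn(1, 32, 32, 32)   # feature maps after a first conv block
    print(pool(x).shape)             # torch.Size([1, 32, 16, 16]) -- H and W halved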
e. Dropout Regularization
CNN.py includes a dropout layer (self.dropout_fc = nn.Dropout(dropout_rate)) after the
first fully connected layer’s activation.
• How it Works: During training, for each forward pass, a fraction (dropout_rate) of the neurons in the layer preceding dropout are randomly "dropped" or temporarily deactivated (their outputs are set to zero). This means these neurons do not contribute to the forward pass and do not participate in backpropagation for that training step. This prevents complex co-adaptations where neurons become overly reliant on specific other neurons. Instead, each neuron must learn more robust features that are useful in conjunction with different random subsets of other neurons.
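A small demonstration of this behavior, using the dropout rate of 0.3 set in CNN.py:

    import torch
    import torch.nn as nn

    dropout = nn.Dropout(p=0.3)
    x = torch.ones(1, 8)
    dropout.train()        # training mode: ~30% of values zeroed, the rest scaled by 1/0.7
    print(dropout(x))
    dropout.eval()         # evaluation mode: dropout is disabled
    print(dropout(x))      # unchanged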
f. Hierarchical Feature Learning
• Early Layers (closer to the input): Filters in these layers have small receptive fields (the region in the original input image that a particular filter or neuron "sees" or is affected by). These layers typically learn to detect simple, low-level features such as edges, corners, color blobs, and basic textures. For example, one filter might become a horizontal edge detector, another a detector for a specific color.
• Intermediate Layers: Filters in these layers combine the simple features detected by earlier layers to learn more complex patterns or motifs, such as parts of objects (e.g., an eye, a wheel, a patch of fur). Their receptive fields are larger than those of earlier layers because they are looking at feature maps that are already abstractions of the input.
• Deeper Layers (further from the input, closer to the output): These layers combine the mid-level features to detect even more complex and abstract concepts, potentially corresponding to entire objects or significant portions of them. Their receptive fields are the largest, encompassing a substantial part of the original input image.
This hierarchical structure, where features build upon each other in increasing complexity and abstraction, mimics aspects of the human visual cortex and allows CNNs to learn powerful and discriminative representations for visual tasks. The depth of the network (number of layers) often correlates with the complexity of features it can learn.
• The forward method specifies the data flow: Conv -> BN -> ReLU -> Pool for each block, then Flatten -> FC -> ReLU -> Dropout -> Output FC (see the sketch after this list).
• transform_test includes ToTensor() and Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)).
• criterion_cnn = nn.CrossEntropyLoss().
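A sketch of a module matching that flow, assuming the 32/64/128 filter counts cited in the Conclusion (the 256-unit hidden width is an assumption, not CNN.py verbatim):

    import torch.nn as nn
    import torch.nn.functional as F

    class CNN(nn.Module):
        """Sketch of a 3-block CNN matching the described data flow."""
        def __init__(self, num_classes=10, dropout_rate=0.3):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
            self.bn1 = nn.BatchNorm2d(32)
            self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(64)
            self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
            self.bn3 = nn.BatchNorm2d(128)
            self.pool = nn.MaxPool2d(2, 2)             # halves H and W after each block
            self.fc1 = nn.Linear(128 * 4 * 4, 256)     # 32x32 -> 16 -> 8 -> 4 after pooling
            self.dropout_fc = nn.Dropout(dropout_rate)
            self.fc2 = nn.Linear(256, num_classes)

        def forward(self, x):
            x = self.pool(F.relu(self.bn1(self.conv1(x))))   # Conv -> BN -> ReLU -> Pool
            x = self.pool(F.relu(self.bn2(self.conv2(x))))
            x = self.pool(F.relu(self.bn3(self.conv3(x))))
            x = x.view(x.shape[0], -1)                       # Flatten
            x = self.dropout_fc(F.relu(self.fc1(x)))         # FC -> ReLU -> Dropout
            return self.fc2(x)                               # Output FC (logits)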
e. Main Execution Block
• Sets hyperparameters: num_epochs_global = 25, dropout_rate_global = 0.3, etc.
4. Performing Image Classification
4.1. Libraries Used and Rationale
• torch, torch.nn, torch.optim: For model building, defining loss functions, and optimization.
• CNN.py makes use of nn.BatchNorm2d and nn.Dropout from torch.nn, and os for platform-dependent num_workers setting.
The rationale remains consistent: leveraging PyTorch’s ecosystem for deep learning
and standard Python libraries for data science and visualization.
4.2. Key Concepts
a. Data Preparation
• Loading CIFAR-10: Both scripts load the CIFAR-10 dataset using torchvision.datasets.CIFAR10.
• Transformations:
◦ MLP.py: Applies transforms.ToTensor().
◦ CNN.py: For training data (transform_train): Implements data augmentation (RandomCrop, RandomHorizontalFlip), followed by ToTensor() and Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)). For test/validation data (transform_test): Applies ToTensor() and Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)).
• Validation Set Handling:
◦ MLP.py: Creates a validation set by splitting the cifar10_train_full dataset
using random_split (val_split_ratio = 0.2).
◦ CNN.py: Defines validationset by loading torchvision.datasets.CIFAR10(..., train=False, ...), thus using the official CIFAR-10 test set for generating validation metrics during training epochs.
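A condensed sketch of both data pipelines (the RandomCrop padding of 4, the data root path, and the unseeded split are assumptions):

    import torchvision
    import torchvision.transforms as transforms
    from torch.utils.data import random_split

    # CNN.py-style transforms: augmentation for training, plain normalization for test
    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),        # padding=4 is a common choice, assumed here
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    transform_test = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])

    # MLP.py-style 80/20 train/validation split (val_split_ratio = 0.2)
    cifar10_train_full = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transforms.ToTensor())
    val_size = int(0.2 * len(cifar10_train_full))
    train_set, val_set = random_split(
        cifar10_train_full, [len(cifar10_train_full) - val_size, val_size])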
b. DataLoaders
• Both scripts use DataLoader.
• Batch Sizes:
– MLP.py: batch_size_train = 100, batch_size_val = 500, batch_size_test =
500.
– CNN.py: batch_size_global = 64 for all loaders.
• Shuffling: shuffle=True for training loaders in both.
• num_workers: MLP.py uses default (0). CNN.py uses num_workers_global (e.g.,
2, or 0 for Windows).
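The corresponding loaders, in sketch form (variable names follow the split sketch above):

    from torch.utils.data import DataLoader

    # MLP.py-style loaders (batch sizes from the script)
    train_loader = DataLoader(train_set, batch_size=100, shuffle=True)   # shuffled training data
    val_loader = DataLoader(val_set, batch_size=500, shuffle=False)
    # CNN.py instead uses batch_size_global = 64 for all loaders, with
    # num_workers 0 on Windows ('nt') and 2 elsewhere.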
d. Validation
• MLP.py: Validates on its val_loader (split from train) after each training epoch.
• CNN.py: Validates on its valloader (the test set) after each training epoch within
train_model.
e. Testing
• MLP.py: Evaluates on test_loader after all training epochs.
• CNN.py: Evaluates via test_model after training (the same test split also served as its validation set).
• Input Normalization Range: The input data is scaled differently: MLP.py uses ToTensor (resulting in a [0,1] range), whereas CNN.py applies an additional Normalize transform (resulting in a [-1,1] range).
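Concretely, Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) applies x' = (x − 0.5) / 0.5 per channel, so inputs in [0, 1] are mapped to [−1, 1].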
c. Hyperparameter Configurations
The scripts utilize different configurations for several key hyperparameters, including the number of training epochs, optimizer types and their specific parameters (learning rate, weight decay), batch sizes, and the application of techniques like Batch Normalization and Dropout (present in CNN.py but not MLP.py).
5. Plotting Learning Curves
5.1. Libraries Used and Rationale
• matplotlib.pyplot as plt: Used in both scripts for creating the plots.
5.2. Key Concepts
• Loss Curves:
◦ Training Loss: This curve shows the value of the loss function calculated on the training data at the end of each epoch (or sometimes per batch). Ideally, the training loss should decrease steadily as the model learns the patterns in the training data. A flat or increasing training loss might indicate problems such as an inappropriate learning rate (too high or too low), a model that is too simple for the data, or issues with the data itself.
◦ Validation Loss: This curve shows the loss function calculated on the validation data (data not used for training the model's parameters) at the end of each epoch. The validation loss is a crucial indicator of how well the model is generalizing to new, unseen data. Ideally, it should also decrease and then stabilize.
• Accuracy Curves: The corresponding training and validation accuracy measured at the end of each epoch. Ideally both rise and then plateau at a high value; the gap between them signals how well the model generalizes.
The relationship and trends of these four curves (training loss, validation loss, training
accuracy, validation accuracy) provide deep insights into the training process.
• Underfitting:
◦ Symptoms: Both training loss and validation loss remain high or plateau at a high value. Similarly, both training accuracy and validation accuracy are low and do not improve significantly. The training and validation curves are typically close together but at an unsatisfactory performance level.
◦ Meaning: The model is too simple or lacks the capacity to learn the underlying patterns in the data. It performs poorly on both the data it was trained on and new data.
◦ Possible Solutions: Increase model complexity (e.g., add more layers, more neurons/filters), train for more epochs (if not yet converged), choose a more complex model architecture, or improve feature engineering (though less common in end-to-end deep learning).
• Good Fit:
◦ Symptoms: Both training and validation loss decrease to a low point and then
stabilize. Both training and validation accuracy increase to a satisfactory high
point and then stabilize. The gap between the training and validation curves
is small and stable.
◦ Meaning: The model has learned the general patterns in the data and gen-
eralizes well to unseen data. This is the desired outcome.
• Overfitting: A widening gap in which training loss keeps decreasing while validation loss starts to consistently increase indicates overfitting; the point where validation loss begins to rise can be an indicator for early stopping.
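Neither script implements early stopping, but a minimal sketch of the idea (train_one_epoch and evaluate are hypothetical helpers) could look like this:

    import torch

    num_epochs = 25                                # matches CNN.py's num_epochs_global
    best_val_loss = float("inf")
    patience, patience_left = 5, 5                 # tolerate 5 epochs without improvement

    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader)       # hypothetical helper
        val_loss = evaluate(model, val_loader)     # hypothetical helper
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_left = patience               # improvement: reset the counter
            torch.save(model.state_dict(), "best_model.pt")  # keep the best weights
        else:
            patience_left -= 1
            if patience_left == 0:                 # validation loss kept rising
                print(f"Early stopping at epoch {epoch + 1}")
                break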
5.3. Implementation Steps
a. For MLP.py:
• Data Collection: train_losses, val_losses, train_accuracies, val_accuracies lists
are populated after each epoch.
b. For CNN.py
• Data Collection: train_model returns the four lists of losses and accuracies.
• Plot Display: Both scripts display the generated learning curves using plt.show().
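A sketch of the plotting step shared by both scripts, assuming the four lists named above have been populated:

    import matplotlib.pyplot as plt

    epochs_range = range(1, len(train_losses) + 1)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    ax1.plot(epochs_range, train_losses, label="training loss")
    ax1.plot(epochs_range, val_losses, label="validation loss")
    ax1.set_xlabel("epoch")
    ax1.set_ylabel("loss")
    ax1.legend()

    ax2.plot(epochs_range, train_accuracies, label="training accuracy")
    ax2.plot(epochs_range, val_accuracies, label="validation accuracy")
    ax2.set_xlabel("epoch")
    ax2.set_ylabel("accuracy")
    ax2.legend()

    plt.show()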
6. Plotting Confusion Matrix
• From the confusion matrix cm, four per-class quantities can be derived for each class k:
◦ True Positives TPk: The number of samples that actually belong to class k and were correctly predicted as class k. These are the values on the main diagonal of the confusion matrix, specifically cm[k][k].
◦ False Positives FPk: The number of samples that do not belong to class k but were incorrectly predicted as class k. For class k, this is the sum of all values in column k excluding the diagonal element cm[k][k] (FPk = Σi≠k cm[i][k]).
◦ False Negatives FNk: The number of samples that actually belong to class k but were incorrectly predicted as belonging to some other class. For class k, this is the sum of all values in row k excluding the diagonal element cm[k][k] (FNk = Σj≠k cm[k][j]).
◦ True Negatives TNk: The number of samples that do not belong to class k and were correctly predicted as not belonging to class k. This is the sum of all values in the matrix that are not in row k or column k.
• The element cm[i][j] (row i, column j) represents the number of instances whose
actual (true) class was i and were predicted by the model as class j.
• Diagonal elements (cm[i][i]) show the number of correct classifications for each
class i.
• Precision for class k (TPk / (TPk + FPk)):
◦ Interpretation: "Of all the samples that the model predicted as class k, what proportion actually belonged to class k?"
◦ High precision for class k means that when the model predicts class k, it is likely correct.
• Recall for class k (TPk / (TPk + FNk)):
◦ Interpretation: "Of all the samples that truly belonged to class k, what proportion did the model correctly identify?"
◦ High recall for class k means that the model is good at finding most instances of class k.
While these metrics are not explicitly calculated and printed by the provided scripts, the generated confusion matrix contains all the necessary information to compute them, as the sketch below illustrates.
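A short sketch of that computation from the 10x10 matrix cm (cm and classes as produced by the scripts):

    import numpy as np

    cm = np.asarray(cm)                   # rows: true class, columns: predicted class
    tp = np.diag(cm)                      # TPk = cm[k][k]
    fp = cm.sum(axis=0) - tp              # FPk: column sum minus the diagonal
    fn = cm.sum(axis=1) - tp              # FNk: row sum minus the diagonal

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    for k, name in enumerate(classes):
        print(f"{name:>10}: precision={precision[k]:.3f}  recall={recall[k]:.3f}  f1={f1[k]:.3f}")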
c. Visualizing Misclassifications
The primary advantage of plotting the confusion matrix, especially as a heatmap, is the
immediate visual identification of patterns in misclassifications.
• Strong Diagonals: Indicate good overall performance, with many correct classifications for each class.
• Prominent Off-Diagonal Cells: Highlight specific pairs of classes that the model
frequently confuses. For instance, if cell cm[’cat’][’dog’] (row ’cat’, column ’dog’)
has a high value, it means many actual ’cat’ images were misclassified as ’dog’.
This visual feedback is invaluable for understanding the model’s weaknesses and
can guide efforts to improve it, such as collecting more ambiguous examples for
the confused classes or designing model features that better differentiate them.
6.3. Implementation Steps
a. For MLP.py
◦ plt.figure(figsize=(10, 8)).
◦ classes are obtained from cifar10_train_full.classes.
◦ sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes).
◦ Sets titles and labels.
◦ plt.show(): Displays the plot.
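Putting those calls together, a self-contained sketch (all_labels and all_predictions stand in for the label and prediction lists collected from test_loader):

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(all_labels, all_predictions)   # rows: true, columns: predicted
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=classes, yticklabels=classes)
    plt.title("MLP Confusion Matrix")
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.show()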
b. For CNN.py
• Collect Predictions and Labels: test_model returns cnn_predicted and cnn_labels.
◦ Takes true labels, predicted labels, class names, and model name.
◦ Computes cm.
◦ Generates heatmap using sns.heatmap.
◦ plt.show(): Displays the plot.
• Plot Display: Both scripts display the generated confusion matrices using plt.show().
7. Comparison and Discussion of Results
This section provides a comparative analysis of the Multi-Layer Perceptron (MLP)
model (from MLP.py) and the Convolutional Neural Network (CNN) model (from CNN.py)
based on their performance on the CIFAR-10 image classification task. The discussion
will cover quantitative metrics, qualitative analysis of learning curves and confusion
matrices, and architectural considerations.
• MLP (MLP.py):
◦ Describe the trends for training and validation loss/accuracy.
◦ Analyze for signs of overfitting or underfitting based on the 10 epochs of training.
• CNN: Discuss its strengths (spatial feature extraction via convolutions, parameter
sharing, hierarchical feature learning, partial translation invariance via pooling) in
the context of the CNN.py architecture (3 Conv blocks with BN, MaxPool, followed
by FC layers with Dropout).
• CNN (CNN.py):
◦ Normalization: Impact of normalizing inputs to [-1, 1].
Overall Discussion: The empirical results will be analyzed. The CNN's performance is expected to be superior, and this will be attributed to its architecture designed for images and the various training techniques employed in CNN.py. The discussion should acknowledge that the two scripts represent different experimental setups beyond just MLP vs. CNN architecture.
8. Conclusion
This section summarizes the project’s outcomes based on the analyses of MLP.py and
CNN.py, reflects on the insights gained, and suggests potential directions for future
exploration.
• Understanding of the MLP (input, 120/84-neuron hidden layers, output) and CNN
(3 Conv blocks with 32/64/128 filters, BatchNorm, Dropout) architectures.
• The fundamental differences in how MLPs and CNNs process image data.
• The utility of visualization tools like learning curves and confusion matrices for
model analysis.
8.3. Acknowledging Implementation Choices and Proposing Future Work
The two scripts (MLP.py and CNN.py) represent distinct experimental setups. Future
work could include:
• Further CNN Enhancements: Explore deeper CNNs, different filter sizes, or advanced architectures (ResNet, etc.).
• Plot and Model Saving: Implement functionality to save generated plots and
trained model weights.
• Quantitative Per-Class Metrics: Calculate and report precision, recall, and F1-
score per class.