0% found this document useful (0 votes)
21 views9 pages

Machine Learning Final Project Report

Uploaded by

truonglk8a
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views9 pages

Machine Learning Final Project Report

Uploaded by

truonglk8a
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 9

Enhancing Age and Gender Prediction with

Attention-augmented CNNs and Transfer Learning

Bui Minh Quan


Department of Artificial Intelligence
VNU University of Engineering and Technology
[email protected]

1 Introduction

1.1 Introducing the Problem

Age estimation and gender prediction from facial images are two fundamental tasks in computer vision,
with wide-ranging applications in surveillance, human-computer interaction, targeted advertising,
and social media analytics. These tasks require extracting high-level semantic attributes—such
as estimated age or perceived gender—from facial features, which can vary significantly due to
differences in illumination, expression, ethnicity, and image quality.
Traditional approaches relying on handcrafted features have generally been inadequate for achieving
robust performance, particularly under unconstrained conditions. With the advent of deep learning,
especially convolutional neural networks (CNNs), these challenges have been addressed more
effectively by enabling automatic and hierarchical feature extraction directly from image pixels.
This work draws inspiration from the study by Vikas Sheoran. [15], titled "Age and Gender Prediction
using Deep CNNs and Transfer Learning", in which the authors utilize pretrained CNN models and
transfer learning strategies to perform age and gender classification.
Building on this foundation, I propose a modified architecture aimed at improving performance
while reducing computational complexity. Specifically, I redesign parts of their model architecture to
make it more lightweight and introduce attention mechanisms to enhance feature selection. These
improvements lead to better convergence speed and measurable gains in both accuracy and loss
during training and validation.

1.2 Dataset

The UTKFace dataset comprises over 20,000 facial images, each containing a single individual.
The dataset includes diverse demographic attributes embedded in the file names, formatted as
[age]_[gender]_[race]_[date&time].jpg. The labels are defined as follows:

• Age: An integer from 0 to 116, indicating the individual’s age.


• Gender: A binary value, with 0 representing male and 1 representing female.
• Race: An integer from 0 to 4, representing White (0), Black (1), Asian (2), Indian (3), and
Other (4, including categories such as Hispanic, Latino, and Middle Eastern).
• Date and Time: A timestamp in the format YYYYMMDDHHMMSSFFF, recording when the
image was added to the UTKFace dataset.

Since this paper focuses on age and gender prediction, I extract only the age and gender values from
each image’s file name and use them as the dataset labels.

Preprint. Under review.


To train the model, I split the dataset into training and evaluation sets with an 80:20 ratio. The gender
distribution in the dataset is already well-balanced, so no further adjustments are necessary.
However, the age distribution is highly imbalanced, with certain age groups significantly underrep-
resented. To address this issue, I apply oversampling to the minority age classes in the training set,
ensuring that each age group has at least 2000 samples.

Age Range Total Dataset Training Set Evaluation Set Oversampled Training Set
0–9 3,313 2,646 667 2.930
10–19 1,502 1,198 304 2003
20–29 6,233 4,981 1252 4984
30–39 3,914 3,126 788 3252
40–49 1,820 1,452 368 2224
50–59 1,783 1,420 363 2024
60–69 1,118 889 229 2000
70–79 596 474 122 2000
80–89 425 335 90 2000
90+ 151 114 37 2400
Total 20,855 12,633 3,149 25817
Table 1: Age distribution across the full dataset, training set, evaluation set and oversampled training
set.

Gender Total Dataset Training Set Evaluation Set


Man 10327 8.261 2066
Woman 10528 8.422 2106
Total 20855 16686 4172
Table 2: Gender distribution across the full dataset, training set

1.3 The Importance of the Problem

Accurate age and gender prediction from facial images has significant applications across various
domains. Gender classification enables personalized user experiences in areas such as targeted
marketing, human-computer interaction, and social media analytics. It supports systems that adapt
content or services based on demographic insights, improving user engagement and accessibility.

Age prediction is equally critical, with applications in age-specific access control, health monitoring,
and behavioral analysis. For instance, it can enhance surveillance systems by identifying age
demographics or assist in medical diagnostics by detecting age-related visual cues. These capabilities
underscore the importance of developing robust models for age and gender prediction, addressing
real-world needs in automation, security, and personalized technology.

1.4 Challenges

Gender classification, although typically framed as a binary classification task, presents notable
difficulties in specific cases. One such challenge arises when predicting the gender of children under
five years old. At this age, facial features are often not yet fully developed and can appear strikingly
similar across genders, making accurate classification difficult even for human observers. As a result,
achieving perfect accuracy in gender prediction for very young individuals is inherently impractical.
Age estimation, on the other hand, is a substantially more complex task. The aging process is highly
individual and influenced by a combination of genetic, environmental, and lifestyle factors. This leads
to significant variation in how age is visually expressed across individuals. Furthermore, external
factors such as lighting conditions, camera angles, image quality, and facial expressions introduce
additional variability, increasing the difficulty of accurately estimating chronological age from facial
images. These challenges necessitate sophisticated and robust models capable of handling diverse
and noisy inputs.

2
An additional challenge I encountered during this project is the limited amount of time available.
Training each model for 30 to 50 epochs typically takes between 110 and 130 minutes. The need
for frequent experimentation—such as tuning hyperparameters, modifying model architectures, and
evaluating performance—further extends the time required for model development and optimization.
This constraint significantly impacts the ability to iterate and fine-tune the models effectively within
the project timeline.

1.5 Contribution

This work builds upon the methodology presented in the paper “Age and Gender Prediction using
Deep CNNs and Transfer Learning” [15], with a primary focus on improving model performance
through architectural enhancements and extensive experimentation.

Gender Classification. For the task of gender prediction, I reimplemented and redesigned model ar-
chitectures to optimize both classification accuracy and computational efficiency. Using models based
on comparable convolutional layers (e.g., Conv2D and Depthwise Separable Convolutions),
my best model achieved an accuracy of 92.69%, outperforming the original paper’s reported accuracy
of 91.26%.
To further enhance performance, I introduced various attention mechanisms into the base architecture,
including:
• Channel Attention (CA)
• Spatial Attention (SA)
• Convolutional Block Attention Module (CBAM), which combines both CA and SA
• Squeeze-and-Excitation (SE) blocks
• Hybrid modules combining SE and SA
The following CNN-based models were developed and evaluated:
• Baseline CNN
• CNN + Channel Attention (CA)
• CNN + Spatial Attention (SA)
• CNN + CBAM
• CNN + Squeeze-and-Excitation (SE)
• CNN + SE + SA
• CNN + CBAM (alternative implementation)

Transfer Learning. I also experimented with various transfer learning approaches by fine-tuning
pretrained models and modifying their classifier heads. The tested models include:
• Pretrained ResNet-50 with custom classifier layers
• Pretrained VGG16 with custom classifier layers
• ResNet-50 + CBAM
• VGG16 + CBAM
• VGG16 + SE + SA
All of the above architectures achieved strong results, with several of my models trained from scratch
outperforming the pretrained baselines presented in the original paper.

Age Estimation. For the age estimation task, I extended the same architectural principles used in
gender classification. Initial models based on a baseline CNN provided a starting point, while further
improvements were achieved through the integration of attention modules. The most effective archi-
tectures combined pretrained backbones (e.g., VGG-16 and ResNet-50) with CBAM-enhanced regres-
sion heads. These models outperformed both my own baseline and existing literature benchmarks,
demonstrating the effectiveness of attention mechanisms in improving continuous age prediction
performance.

3
2 Related Work
Gender classification and age estimation have long been active research areas with numerous ap-
proaches proposed over the years. Early works primarily relied on classical machine learning
techniques such as support vector machines (SVMs), random forests, and handcrafted feature ex-
traction methods like Local Binary Patterns (LBP) or Histogram of Oriented Gradients (HOG).
While these methods provided baseline results, their performance was limited due to the reliance on
manually designed features and insufficient capacity to model complex variations in facial data.
With the rise of deep learning, convolutional neural networks (CNNs) became the dominant approach
due to their ability to learn hierarchical features directly from raw images. More recently, transfer
learning using pretrained models such as VGG, ResNet, and SENet has enabled researchers to achieve
significantly better results on both gender classification and age estimation tasks, even with relatively
limited training data.
The reference paper [15] explores both training models from scratch and using transfer learning for
gender classification and age estimation.

Models Trained from Scratch. For models trained from scratch, the authors primarily employed
SeparableConv2D layers to reduce computational complexity while preserving performance. The
architectures were further enhanced with:
• Spatial Dropout for regularization,
• He uniform initialization for stable convergence,
• MaxNorm weight constraint to control overfitting.

Transfer Learning. In addition to training from scratch, the paper leverages pretrained CNN
architectures fine-tuned on the target tasks. This approach leverages knowledge learned from large-
scale datasets, leading to improved accuracy and faster convergence compared to training from
scratch.
Building upon these findings, my work investigates the integration of attention mechanisms—such as
channel attention, spatial attention, and combined modules like CBAM and Squeeze-and-Excitation
blocks—within both training-from-scratch and transfer learning frameworks to further boost perfor-
mance on gender and age prediction tasks.

3 Methodology
3.1 Data Preparation and Experimental Setup

I use the UTKFace dataset, which contains facial images labeled with age, gender, and ethnicity. Each
image file name is formatted as [age]_[gender]_[race]_[date_time].jpg. For this work, I
focus only on age and gender.

Custom Dataset Class. To handle the dataset, I define a custom PyTorch Dataset class. The
class reads images from a specified directory, extracts age and gender labels from the filenames, and
applies image transformations. Each image is loaded in RGB mode and resized to 128 × 128 pixels.

Data Transformation. I apply different transformations to the training and testing sets. The
training set uses data augmentation techniques such as random horizontal flipping and slight rotations
to improve generalization. Both training and testing images are normalized to have pixel values in
the range [−1, 1] using a mean and standard deviation of [0.5, 0.5, 0.5].

Train/Test Split. The dataset is split into training and testing subsets with an 80:20 ratio. The split
is done randomly but with a fixed random seed to ensure reproducibility.

Class Imbalance Handling. Although the gender distribution is relatively balanced, age labels
are highly imbalanced. To address this, I oversample the training data by repeating the dataset
multiple times until each age group has at least 2000 samples. This helps the model learn from
underrepresented age classes.

4
Dataloader. The training and testing sets are wrapped in PyTorch DataLoader objects with a batch
size of 32. The training loader shuffles the data to improve training dynamics, while the test loader
maintains order for evaluation consistency.

Training setup details For gender classification, I treat the problem as a binary classification task
and use the Cross-Entropy Loss as the objective function. For age estimation, which is a regression
task, I use the Mean Absolute Error (L1 Loss) to better handle outliers and improve the interpretability
of the results.
All models are trained using the Adam optimizer with a learning rate of 1 × 10−4 and a weight decay
of 1 × 10−5 to prevent overfitting. To enable effective learning over longer training durations, I
incorporate a step-based learning rate scheduler (StepLR) that reduces the learning rate by a factor of
0.1 every 10 epochs.
Each model is trained for 30 to 35 epochs. This training setup strikes a balance between training time
and convergence quality, and allows for consistent comparison across different model architectures
and variations.

3.2 Deep CNNs

Network architecture. The general architecture of all convolutional neural networks (CNNs) used
in this work follows a standard deep learning design pattern. At the core, each network consists of a
sequence of convolutional layers responsible for extracting hierarchical spatial features from input
facial images. These layers are typically followed by nonlinear activation functions such as ReLU to
introduce non-linearity, and pooling layers (e.g., max pooling) to progressively reduce the spatial
dimensions and retain the most salient features.
After multiple convolutional and pooling stages, the extracted features are flattened and passed through
one or more fully connected (FC) layers to perform the final prediction. For gender classification, the
final output layer has two neurons representing the male and female classes, while for age estimation,
the output is a single continuous value. This general pipeline serves as the foundation upon which
various architectural improvements are introduced in subsequent sections

Gender classification(Training from scratch): In the task of gender classification, I achieved


highly promising results with the baseline CNN model trained from scratch. Building upon this
success, I explored several architectural variants to further improve performance. These variants
include the integration of attention mechanisms such as Channel Attention (CA), Spatial Attention
(SA), and their combination via Convolutional Block Attention Module (CBAM), as well as Squeeze-
and-Excitation (SE) blocks. Each of these additions was designed to enhance the model’s ability to
focus on the most informative features within the input images.

Layer Filters Output Size Kernel Size Activation


Image - 3 x 128 x 128 - -
DepthwiseSepConv1 + BN 32 64 × 64 × 32 3×3 ReLU
MaxPooling - 32 × 32 × 32 2×2 -
DepthwiseSepConv2 + BN 64 32 × 32 × 64 3×3 ReLU
MaxPooling - 16 × 16 × 64 2×2 -
DepthwiseSepConv3 + BN 128 16 × 16 × 128 3×3 ReLU
MaxPooling - 8 × 8 × 128 2×2 -
Flatten - 8192 - -
Dense (fc1) + BN + Dropout - 256 - ReLU
Dense (fc2) - 2 - -
Table 3: Architecture of Baseline Separable CNN Model for Gender Classification

All CNN variants are built upon the same baseline structure described previously, consisting of
depthwise separable convolutional layers followed by batch normalization, ReLU activation, max
pooling, and fully connected layers. To enhance the representational capacity of the model, I explore
the addition of attention mechanisms and channel recalibration modules:

5
• CNN + Channel Attention (CA): Integrates a channel attention mechanism that adaptively
reweights feature channels based on their relative importance, helping the model focus on
the most informative features.
• CNN + Spatial Attention (SA): Applies a spatial attention module to highlight significant
spatial regions within feature maps, guiding the model to focus on relevant areas in the
image.
• CNN + CBAM: Combines channel and spatial attention sequentially through the Convolu-
tional Block Attention Module (CBAM), enhancing the network’s ability to focus both on
important channels and spatial regions.
• CNN + SE Block: Employs Squeeze-and-Excitation (SE) blocks to recalibrate channel-wise
feature responses. Although SE and CA both target channel attention, they differ slightly in
their attention computation mechanisms.
• CNN + SE + SA: Combines SE blocks with spatial attention, mimicking CBAM’s structure
but using SE as the channel-wise attention mechanism instead of the original CBAM CA
block.
• CNN + CBAM + SE: Explores the impact of using both CBAM and SE together, introducing
a double channel-attention mechanism alongside spatial attention to assess potential gains
in representational capacity.

Age Estimation (Training from Scratch). For age estimation, I implemented a baseline CNN
architecture that closely follows the design proposed in the original paper [15]. The model is built
entirely from scratch using depthwise separable convolutional layers followed by fully connected
layers. Although the architecture is nearly identical to that in the reference study, my model achieved
superior performance in terms of MAE (Mean Absolute Error). This improvement may be attributed to
differences in preprocessing strategies, optimization settings, or regularization techniques. Due to time
constraints, I was only able to train and evaluate this baseline model, but the results demonstrate that
even a simple architecture—when properly tuned—can perform competitively in the age estimation
task.

Table 4: Architecture of Separable CNN Model for age estimation

Layer Filters Output Size Kernel Size Activation


Image - 3 × 128 × 128 - -
Depthwise Separable Conv1 + BN 32 32 × 64 × 64 3×3 ReLU
Depthwise Separable Conv2 + BN 64 64 × 32 × 32 3×3 ReLU
Depthwise Separable Conv3 + BN 128 128 × 16 × 16 3×3 ReLU
Depthwise Separable Conv4 + BN 256 256 × 8 × 8 3×3 ReLU
Depthwise Separable Conv5 + BN 256 256 × 4 × 4 3×3 ReLU
Flatten - 4096 - -
FC1 + BN + Dropout - 256 - ReLU
FC2 + BN + Dropout - 512 - ReLU
FC3 (Output) - 1 - Linear

3.3 Transfer Learning

For transfer learning, I employed two widely used pretrained models: ResNet-50 and VGG-16.
During training, I froze the convolutional layers of these models and replaced their final fully
connected layers with custom classifier or regression heads tailored for gender classification and age
estimation.
To further enhance performance, I experimented with integrating attention mechanisms, specifically
the Convolutional Block Attention Module (CBAM) and a combination of Squeeze-and-Excitation
(SE) and Spatial Attention (SA), after the pretrained feature extraction layers.

6
Table 5: Architecture of the Baseline Transfer Learning Model
Layer Filters Output Size Kernel Size Activation
Input Image - 3 × 128 × 128 - -
Pretrained Backbone - 512 × 4 × 4 - -
FC1 + Dropout - 4096 - ReLU
FC2 + Dropout - 1024 - ReLU
FC3 (Output) - Number of classes or 1 (regression) - -

4 Experimentation and Results

In this section, I summarize the performance of the 13 models I developed and tested: 10 models for
gender classification and 3 models for age estimation.
The imbalance between the number of gender classification and age estimation models is due to
the significantly higher difficulty of the age estimation task. Although I experimented with a total
of 6 models for age group classification and 6 models for age regression, many of them failed to
achieve results comparable to the baseline. Therefore, only the models that demonstrated acceptable
performance are included in this report, while the under-performing ones are omitted for clarity and
relevance.

4.1 Gender classification

Origin Model Accuracy (%)


Base CNN [15] 94.52
VGG_f [15] 93.42
Original Paper
ResNet50_f [15] 94.64
SENet50_f [15] 94.94
Base CNN 92.69
CNN + Channel Attention 94.97
CNN + Spatial Attention 97.48
CNN + CBAM 98.06
CNN + SE 93.12
Ours
CNN + SE + SA 98.42
CNN + SE + CBAM 97.79
ResNet + Classifier 87.85
VGG + Classifier 98.85
VGG + CBAM 98.90
Table 6: Comparison of Gender Classification Models

Analysis. The results show that the incorporation of attention mechanisms significantly boosts the
performance of my CNN-based models. Notably, both spatial attention and CBAM (which combines
channel and spatial attention) provide large improvements over the baseline CNN. While channel
attention alone (via SE or a basic CA block) offers modest gains, it is the combination of spatial
attention with SE that leads to the best performance, surpassing even CBAM. This suggests that
spatial attention plays a more crucial role in this task, and that combining it with SE is more effective
than stacking multiple channel-focused modules. Interestingly, adding CBAM on top of SE does not
lead to further gains, and in fact, performs worse than SE + SA. This indicates that redundant channel
attention may introduce unnecessary complexity or interfere with feature learning.
In terms of transfer learning, my ResNet-based model underperforms significantly—even with CBAM
enhancement, it remains the least accurate of all models. In contrast, the VGG-16-based model
achieves nearly perfect accuracy with just a custom classifier. Adding attention modules like CBAM
or SE+SA does not meaningfully improve its performance and may even degrade it slightly, likely
due to the already strong baseline and potential overfitting from added complexity.

7
Origin Model MAE (↓)
Base CNN [15] 6.080
SENet50_f [15] 4.58
Original Paper
VGG_f [15] 4.86
ResNet50_f [15] 4.65
Base CNN 5.30
Ours ResNet + CBAM + Regression Head 3.13
VGG + CBAM + Regression Head 3.09
Table 7: Comparison of Age Estimation Models

4.2 Age estimation

Analysis. The baseline CNN model from the literature achieved an MAE of 6.080, while my
reimplementation of the same architecture yielded an improved MAE of 5.30, indicating better
optimization or training strategies. More notably, transfer learning models enhanced with attention
mechanisms significantly outperformed all others. My VGG + CBAM model achieved the lowest
MAE of 3.09, followed closely by ResNet + CBAM at 3.13. These results confirm that combin-
ing pretrained feature extractors with attention modules can substantially improve age estimation
performance, highlighting the value of feature refinement in regression tasks.

5 Conclusion
In this project, I explored both training-from-scratch and transfer learning approaches for gender
classification and age estimation. Through extensive experimentation with attention mechanisms
such as Channel attention, Spatial attention and Squeeze-and-Excitation, I significantly improved
model performance over standard baselines.
Overall, this work shows that carefully integrated attention mechanisms can substantially enhance
facial attribute prediction, both in classification and regression tasks.

8
References
[1] Wenli Chen, Quan Pan, Yufei Li, and Chuang Yang. Facial age estimation using convolutional
neural networks. arXiv preprint arXiv:2105.06746, 2021.
[2] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv
preprint arXiv:1610.02357, 2017.
[3] Yihui Dong, Guodong Zhou, Yiming Ding, Yanzhao Wang, Chang Liu, Chenglin Qian, Bingbing
Xiao, and Yue Zhang. Repghost: A hardware-efficient ghost module via re-parameterization.
arXiv preprint arXiv:2211.06088, 2022.
[4] Einat Eidinger, Roee Enbar, and Tal Hassner. Age and gender classification using convolutional
neural networks. In IEEE International Conference on Computer Vision Workshops (ICCVW),
2014.
[5] Xinyu Fan, Shiyu Liu, Mengchen Wu, and Jun Li. An empirical study of spatial attention
mechanisms in deep networks. arXiv preprint arXiv:1904.05873, 2019.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. arXiv preprint arXiv:1512.03385, 2016.
[7] Arun Kumar et al. Deep convolutional neural networks for gender classification. In 2019 Inter-
national Conference on Vision Towards Emerging Trends in Communication and Networking
(ViTECoN), pages 1–6, 2019.
[8] Gilad Levi and Tal Hassner. Deep expectation of apparent age from a single image without facial
landmarks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1586–1597,
2015.
[9] Bijit Mandal, Sanjay Bose, and Arup Kumar Ghosh. Gender classification using convolutional
neural networks. In 11th International Conference on Computer Vision Theory and Applications,
pages 221–228, 2016.
[10] Zhen Niu, Hui Wang, Xinqing Zhan, Shiguang Shan, and Xilin Chen. Apparent age estimation
using ensemble of deep convolutional neural networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Workshops (CVPRW), 2016.
[11] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age
from a single image. In Proceedings of the IEEE International Conference on Computer Vision
Workshops (ICCVW), 2015.
[12] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for
face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 815–823, 2015.
[13] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[14] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap
to human-level performance in face verification. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1701–1708, 2014.
[15] Shreyansh Joshi Vikas Sheoran and Tanisha R. Bhayani. Age and gender prediction using deep
cnns and transfer learning. arXiv preprint arXiv:2110.12633, 2021.
[16] Qilong Wang, Bangguo Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qi Hu. Eca-net: Effi-
cient channel attention for deep convolutional neural networks. arXiv preprint arXiv:1910.03151,
2020.
[17] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional
block attention module. arXiv preprint arXiv:1807.06521, 2018.

You might also like