Methodology
Datasets
Images were classified into seven age classes based on intuitive age
ranges. The dataset included faces with varying lighting conditions,
angles, and expressions to simulate real-world scenarios. The labels
were curated to avoid overlap between classes, ensuring clarity in
classification tasks.
Preprocessing
Age values from the dataset were mapped into seven distinct age
classes for multi-class classification:
Class 0: 1–2 years
Class 1: 3–9 years
Class 2: 10–20 years
Class 3: 21–27 years
Class 4: 28–45 years
Class 5: 46–65 years
Class 6: 66+ years
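A minimal Python sketch of this mapping follows; the helper name age_to_class is hypothetical, and treating ages below 1 as Class 0 is an assumption.

def age_to_class(age):
    """Map a raw age value to one of the seven age classes (0-6)."""
    if age <= 2:
        return 0   # 1-2 years
    elif age <= 9:
        return 1   # 3-9 years
    elif age <= 20:
        return 2   # 10-20 years
    elif age <= 27:
        return 3   # 21-27 years
    elif age <= 45:
        return 4   # 28-45 years
    elif age <= 65:
        return 5   # 46-65 years
    else:
        return 6   # 66+ years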
Dataset
The model used the UTKFace dataset, which contains images with
metadata embedded in their filenames in the following format:
[age]_[gender]_[race]_[date and time].jpg. For this project, only the
gender labels (denoted by the second field in the filename) were
extracted to train the gender classification model.
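Because the label is encoded directly in the filename, it can be read with simple string handling. The sketch below assumes the standard UTKFace convention of 0 = male and 1 = female; the function name is illustrative.

import os

def gender_from_filename(path):
    """Extract the gender label (second underscore-separated field)."""
    name = os.path.basename(path)
    return int(name.split("_")[1])   # 0 = male, 1 = female in UTKFace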
Preprocessing Steps
The dataset was split into training and testing sets using the
train_test_split() function from the sklearn library. (Training Set: 75%
of the dataset, Test Set: 25% of the dataset)
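A minimal sketch of this split, assuming the image array X and the gender labels y are already loaded (the random_state value is illustrative):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42   # 75% train, 25% test
)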
Dataset:
The CK+ (Cohn-Kanade Plus) dataset was used to train the emotion
recognition model. This dataset consists of facial images annotated
with discrete emotion labels, including anger, contempt, disgust,
fear, happiness, sadness, and surprise.
Preprocessing Steps:
To align with the model’s use case, the original emotion labels were
mapped into three categories:
The pre-processed dataset was divided into training and testing sets
using an 80:20 split, ensuring the model was trained on diverse
samples while retaining a portion for evaluation.
CNN Architecture
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 198, 198, 32) 320
=================================================================
The model begins with an input layer that accepts images of size
100×100 pixels, converted to grayscale. This grayscale format
reduces computational complexity by eliminating the need to
process multiple colour channels (i.e., RGB). The image is
represented as a matrix of pixel values, typically ranging from 0 to
255 for grayscale. The input layer prepares these values for further
processing by the convolutional layers.
This layer applies 32 filters, each with a kernel size of 3×3, to the
input image. A filter is essentially a small matrix that
slides over the image to detect patterns such as edges or textures.
The ReLU (Rectified Linear Unit) activation function is applied to
introduce non-linearity. ReLU replaces any negative values in the
feature map with zeros, helping the model learn more complex
patterns. Following this, an AveragePooling2D layer is used to
reduce the spatial size of the feature maps by downsampling. This
helps reduce computational load and retains only the most essential
features.
The architecture includes three such convolutional blocks, each
pairing a convolutional layer with a pooling layer. The number of
filters increases in the deeper blocks, allowing the model to capture
more abstract and hierarchical patterns.
Following the GAP layer, the model has a Dense layer with 132
neurons. Each neuron is connected to every output from the
previous layer, allowing the model to learn high-level abstractions
from the features extracted by the convolutional layers. This Dense
layer uses ReLU activation to introduce non-linearity and help the
model learn complex patterns.
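A minimal Keras sketch consistent with this description is shown below. The input size (200×200, taken from the conv2d output shape in the summary above, since a 3×3 convolution on 200×200 yields 198×198), the filter counts of the deeper blocks (64 and 128) and the final 7-way softmax layer are assumptions rather than values stated in the text.

from tensorflow.keras import layers, models

age_model = models.Sequential([
    layers.Input(shape=(200, 200, 1)),               # grayscale input (assumed size)
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.AveragePooling2D(),
    layers.Conv2D(64, (3, 3), activation="relu"),    # assumed filter count
    layers.AveragePooling2D(),
    layers.Conv2D(128, (3, 3), activation="relu"),   # assumed filter count
    layers.AveragePooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(132, activation="relu"),
    layers.Dense(7, activation="softmax"),           # seven age classes
])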
Training Process
The model is trained for 60 epochs, meaning that the entire training
dataset is passed through the network 60 times. Each epoch helps
the model gradually adjust its weights to improve performance. The
batch size is set to 512, meaning that 512 images are processed in
one go before the model’s weights are updated. This helps to
stabilize the training process by averaging gradients over a larger
set of data.
During training, the images (with their corresponding labels) are fed
into the network in batches. For each batch, the model produces
predictions, computes the loss against the true labels, and updates
its weights through backpropagation. This process continues across
the 60 epochs until the model has learned to make accurate
predictions.
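Under these settings, the training call might look like the sketch below; the compile settings (Adam optimizer, sparse categorical cross-entropy for integer class labels) are assumptions, since the text does not specify them for this model.

age_model.compile(
    optimizer="adam",                        # assumed
    loss="sparse_categorical_crossentropy",  # assumed integer labels 0-6
    metrics=["accuracy"],
)
history = age_model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=60,       # 60 full passes over the training data
    batch_size=512,  # 512 images per weight update
)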
Epoch 58/60
458/458 [==============================] - ETA: 0s - loss: 0.2502
- accuracy: 0.9002
Epoch 00058: val_accuracy did not improve from 0.82481
458/458 [==============================] - 137s 299ms/step -
loss: 0.2502 - accuracy: 0.9002 - val_loss: 0.6888 -
val_accuracy: 0.8015
Epoch 59/60
458/458 [==============================] - ETA: 0s - loss: 0.2570
- accuracy: 0.8969
Epoch 00059: val_accuracy did not improve from 0.82481
458/458 [==============================] - 136s 296ms/step -
loss: 0.2570 - accuracy: 0.8969 - val_loss: 0.7584 -
val_accuracy: 0.7819
Epoch 60/60
458/458 [==============================] - ETA: 0s - loss: 0.2733
- accuracy: 0.8898
Epoch 00060: val_accuracy did not improve from 0.82481
458/458 [==============================] - 136s 296ms/step -
loss: 0.2733 - accuracy: 0.8898 - val_loss: 0.7837 -
val_accuracy: 0.7792
CNN Architecture
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 100, 100, 1)] 0
=================================================================
This layer applies 32 filters of size 3×3 on the input images, aiming
to detect low-level features like edges and textures. The ReLU
activation function is used to introduce non-linearity into the model,
which allows it to learn complex patterns. Negative values are set to
zero by the ReLU function. A MaxPooling2D layer follows the
convolution to reduce the spatial dimensions of the feature map,
downsampling the output by a factor of 2 in both dimensions (height
and width).
Here, 128 filters of size 3×3 are applied, allowing the model to
capture even more abstract and high-level features from the data.
Again, ReLU activation is used, followed by MaxPooling2D, further
reducing the feature map size and emphasizing the most relevant
features.
The model uses 256 filters of size 3×3 to capture the most complex
patterns, leading to a deeper understanding of the data. This layer
is followed by MaxPooling2D to reduce the spatial dimensions of the
feature map and retain the critical features.
The output of the last pooling layer is a 3D tensor. The Flatten layer
converts this 3D tensor into a 1D vector, which can then be
processed by fully connected layers (Dense layers). This flattening
operation is crucial because it reshapes the feature map from the
convolutional layers into a format that can be fed into the Dense
layers.
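A minimal Keras sketch of this network is shown below. The input shape, the 32-, 128- and 256-filter blocks and the Flatten layer follow the description; the intermediate 64-filter block, the dense head and the sigmoid output are assumptions.

from tensorflow.keras import layers, models

gender_model = models.Sequential([
    layers.Input(shape=(100, 100, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, (3, 3), activation="relu"),    # assumed
    layers.MaxPooling2D(),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(256, (3, 3), activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),            # assumed head size
    layers.Dense(1, activation="sigmoid"),           # binary gender output (assumed)
])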
Training Process
The model uses the Adam optimizer, which combines the benefits of
momentum and adaptive learning rates. This optimizer is efficient
for training deep networks and helps the model converge quickly to
an optimal solution.
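A sketch of the corresponding compile step; the binary cross-entropy loss follows from the assumed sigmoid output above rather than from the text.

from tensorflow.keras.optimizers import Adam

gender_model.compile(
    optimizer=Adam(),            # adaptive learning rates with momentum-style updates
    loss="binary_crossentropy",
    metrics=["accuracy"],
)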
Epoch 28/30
555/556 [============================>.] - ETA: 0s - loss: 0.2340
- accuracy: 0.9283
Epoch 00028: loss improved from 0.23508 to 0.23439, saving model
to ./output/gender_model.h5
556/556 [==============================] - 8s 14ms/step - loss:
0.2344 - accuracy: 0.9282 - val_loss: 0.3232 - val_accuracy:
0.8929
Epoch 29/30
555/556 [============================>.] - ETA: 0s - loss: 0.2374
- accuracy: 0.9288
Epoch 00029: loss did not improve from 0.23439
556/556 [==============================] - 7s 13ms/step - loss:
0.2373 - accuracy: 0.9288 - val_loss: 0.3397 - val_accuracy:
0.8900
Epoch 30/30
556/556 [==============================] - ETA: 0s - loss: 0.2353
- accuracy: 0.9302
Epoch 00030: loss did not improve from 0.23439
556/556 [==============================] - 7s 13ms/step - loss:
0.2353 - accuracy: 0.9302 - val_loss: 0.3113 - val_accuracy:
0.8947
CNN Architecture
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 48, 48, 1)] 0
=================================================================
Here, the model applies 128 filters of size 3×3, allowing it to learn
even more intricate and abstract representations of the input
images. The ReLU activation function is again used for non-linearity,
and MaxPooling2D follows to reduce the spatial dimensions of the
feature map, enabling the model to focus on higher-level features
and reduce computational complexity.
In this layer, the model uses 256 filters of size 3×3 to capture the
most complex patterns in the image. These patterns are more
abstract and help the model better understand complex facial
expressions associated with different emotions. A MaxPooling2D
layer is again applied after the convolution to downsample the
feature map, preserving the most significant features and enhancing
the model’s ability to generalize.
The final layer is another Dense layer with 6 neurons, one for each
emotion class (Happy, Sad, Angry, Surprise, Neutral, Fear). The
Softmax activation function is used here to output a
probability distribution over the 6 possible classes. The Softmax
function ensures that the output values are between 0 and 1 and
that their sum is equal to 1, making it suitable for multi-class
classification problems. The model will predict the class with the
highest probability, which corresponds to the recognized emotion.
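A minimal Keras sketch consistent with this description: the 48×48 grayscale input, the 128- and 256-filter blocks and the 6-way softmax output come from the text, while the earlier blocks and the dense layer before the output are assumptions.

from tensorflow.keras import layers, models

emotion_model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),    # assumed
    layers.MaxPooling2D(),
    layers.Conv2D(64, (3, 3), activation="relu"),    # assumed
    layers.MaxPooling2D(),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(256, (3, 3), activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),                                # assumed
    layers.Dense(128, activation="relu"),            # assumed
    layers.Dense(6, activation="softmax"),           # six emotion classes
])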
Training Process
Epoch 48/50
17/23 [=====================>........] - ETA: 0s - loss: 0.0945 -
accuracy: 0.9963
Epoch 00048: loss did not improve from 0.09191
23/23 [==============================] - 0s 9ms/step - loss: 0.0946
- accuracy: 0.9959 - val_loss: 0.1396 - val_accuracy: 0.9837
Epoch 49/50
17/23 [=====================>........] - ETA: 0s - loss: 0.0936 -
accuracy: 0.9945
Epoch 00049: loss did not improve from 0.09191
23/23 [==============================] - 0s 9ms/step - loss: 0.0919
- accuracy: 0.9959 - val_loss: 0.1374 - val_accuracy: 0.9837
Epoch 50/50
16/23 [===================>..........] - ETA: 0s - loss: 0.0893 -
accuracy: 0.9961
Epoch 00050: loss improved from 0.09191 to 0.08838, saving model to
./output/emotion_model.h5
23/23 [==============================] - 0s 14ms/step - loss:
0.0884 - accuracy: 0.9959 - val_loss: 0.1243 - val_accuracy: 0.9797
The algorithm detects faces and marks them with bounding boxes.
For each face, a rectangle (bounding box) is drawn around the face,
which has four coordinates: x,y (top-left corner), and w,h (width and
height). The center of each bounding box is calculated to identify
the exact location of the face in the image.
The horizontal distance between the centers of two bounding boxes
is computed using the calculate_horizontal_distance() function. It
uses the following formula to calculate the horizontal distance
between two faces:
Distance = |(x1 + w1/2) − (x2 + w2/2)|
where x1, y1, w1 and h1 are the coordinates of the first bounding box,
and x2, y2, w2 and h2 are the coordinates of the second bounding box.
Once the horizontal distances between the faces are calculated, the
program checks if any face is close to another. If two faces are
within the proximity threshold (200 pixels), they are added to the
same group.
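A sketch of this distance check and the proximity-based grouping; the calculate_horizontal_distance name comes from the text, while group_nearby_faces and the greedy grouping strategy are illustrative.

def calculate_horizontal_distance(box1, box2):
    """Horizontal distance between the centers of two (x, y, w, h) boxes."""
    x1, _, w1, _ = box1
    x2, _, w2, _ = box2
    return abs((x1 + w1 / 2) - (x2 + w2 / 2))

def group_nearby_faces(boxes, threshold=200):
    """Place faces in the same group when their centers are within the threshold."""
    groups = []
    for box in boxes:
        for group in groups:
            if any(calculate_horizontal_distance(box, member) <= threshold
                   for member in group):
                group.append(box)
                break
        else:
            groups.append([box])   # start a new group
    return groups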
Couples Identification:
Each face in the group has a gender prediction based on the output
of the gender prediction model. The model classifies each face as
either "male" or "female."
Families Identification:
Each face in the group is classified into one of several age groups
based on the output of the age prediction model. The age groups used
in the classification are the seven age classes defined in the
preprocessing step (1–2 years through 66+ years).
Attention Detection
The attention detection mechanism in the code is designed to
evaluate whether individuals in an image are paying attention based
on their face orientation. This evaluation is achieved through a
combination of frontal and side-profile face detection techniques,
filtering to ensure unique detections, and labeling with visualization.
Below is a more detailed explanation of each step:
Face Detection
Haar Cascade classifiers are used for detecting left and right profile
faces. Haar features capture intensity differences in images and are
effective for this task. The grayscale version of the input image is
passed through a pre-trained Haar Cascade classifier for left profile
detection. Detected faces are returned as bounding boxes.
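A sketch of this step with OpenCV. OpenCV ships a left-profile cascade (haarcascade_profileface.xml); detecting right profiles by running the same cascade on a horizontally flipped image is an assumption about the implementation.

import cv2

profile_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_profileface.xml"
)

def detect_profiles(image_bgr):
    """Return left- and right-profile bounding boxes for an input image."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    left = profile_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    flipped = cv2.flip(gray, 1)   # mirror the image to catch right profiles
    right = profile_cascade.detectMultiScale(flipped, scaleFactor=1.1, minNeighbors=5)
    width = gray.shape[1]
    # Map boxes detected in the mirrored image back to original coordinates.
    right = [(width - x - w, y, w, h) for (x, y, w, h) in right]
    return list(left), right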
Attention-Grabbing Content
If more than 50% of individuals are not paying attention, the system
shifts focus to Attention-Grabbing Content. This content type is
designed to re-engage distracted audiences through visually striking
or highly engaging material.
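A simple sketch of this 50% rule; the function and content labels are illustrative rather than the project's actual interface.

def select_content(num_faces, num_attentive):
    """Switch to attention-grabbing content when most viewers are distracted."""
    if num_faces == 0:
        return "default"
    if (num_faces - num_attentive) / num_faces > 0.5:
        return "attention-grabbing"
    return "targeted"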
Dynamic Weighting
Loss vs Epoch
[Figure: training and validation loss per epoch for the age model.]
Training Loss:
Validation Loss:
The validation loss also shows a generally decreasing trend but with
some fluctuations. At the beginning of training, the validation loss
starts relatively high and gradually decreases as the model
improves. However, the red line shows more variability compared to
the training loss, which is common when evaluating a model on
unseen data. This fluctuation could indicate the model’s struggle to
generalize well in some epochs, but the overall decrease suggests
improvement.
Interpretation:
Both the training and validation loss decrease over time, suggesting
that the model is improving in its ability to predict age from the
input data. However, the gap between the training loss and
validation loss could also suggest a slight overfitting issue, where
the model fits the training data well but experiences some difficulty
generalizing to the validation set.
Accuracy vs Epoch
[Figure: training and validation accuracy per epoch for the age model.]
Training Accuracy:
Validation Accuracy:
Interpretation:
Confusion Matrix (rows: actual age range, columns: predicted age range):

            1-2    3-9  10-20  21-27  28-45  46-65  66-116
1-2         948      1      1      1      4      3       0
3-9         164    518     69     58     26      5       5
66-116        0      0      0      8     29     74     698
Interpretation:
The model performs extremely well for the '1-2' age range, with 948
correct predictions (diagonal element). The misclassifications are
minimal, with only 1 instance predicted as '3-9', 1 as '10-20', 1 as
'21-27', and a few others spread across other ranges. This suggests
that the model is highly accurate in predicting the '1-2' age group,
with a very low rate of confusion.
The '3-9' range shows 518 correct predictions, but there are notable
misclassifications, such as 164 instances predicted as '1-2' and 69
as '10-20'. This suggests that the model struggles to distinguish
between '3-9' and the '1-2' age range, and occasionally confuses it
with the '10-20' range.
For the '10-20' range, there are 566 correct predictions, but
misclassifications are also notable. For example, 260 instances are
predicted as '21-27' and 91 as '28-45'. This indicates that while the
model correctly predicts '10-20' most of the time, it occasionally
confuses it with higher age ranges.
The model performs well for the '21-27' range, with 1674 correct
predictions, but it also misclassifies some as '28-45' (480 instances),
suggesting the model tends to overestimate the age in this range.
There is a low level of confusion with other ranges, which suggests
good model performance for this age group.
For '28-45', the model performs very well, with 2221 correct
predictions. However, it misclassifies some as '21-27' (480
instances) and a smaller number as '46-65' (91 instances). The confusion is
relatively low, indicating strong generalization within this range.
The '46-65' range has 1203 correct predictions, but there are 43
instances misclassified as '21-27' and 359 as '28-45'. The model
shows some confusion between '46-65' and '28-45', which might be
due to the overlap in age appearance in real-life scenarios. This
suggests that the model is somewhat less accurate at distinguishing
between these two age ranges.
The '66-116' range also performs well, with 698 correct predictions.
The model confuses this group with '28-45' (29 instances) and '46-
65' (74 instances) but performs relatively well in this range. The
confusion with '46-65' may stem from age-related appearance
overlap, where older adults may appear younger due to various
factors like health or lifestyle.
Loss vs Epoch
[Figure: training and validation loss per epoch for the gender model.]
Training Loss:
Validation Loss:
The validation loss also decreases over time, starting from 0.5918 in
the first epoch and ending at 0.3113 by the 30th epoch. Similar to
the training loss, the validation loss generally decreases, although it
shows some fluctuations. This variability is typical when a model is
evaluated on unseen data. It indicates that while the model
performs well on the training data, it may experience some
challenges in generalizing to the validation set, especially in certain
epochs.
Interpretation:
Both the training and validation loss decrease over the epochs,
indicating that the model is learning to predict gender more
accurately. However, the validation loss fluctuates more than the
training loss, which may suggest some difficulty in generalizing to
new, unseen data. The gap between the training and validation loss
might indicate a slight overfitting issue, where the model fits the
training data well but struggles to generalize. This can be addressed
by improving regularization or increasing the diversity of the
training dataset.
Accuracy vs Epoch
[Figure: training and validation accuracy per epoch for the gender model.]
Training Accuracy:
Interpretation:
Loss vs Epoch
[Figure: training and validation loss per epoch for the emotion model.]
The Loss vs Epoch graph demonstrates how the model's training
and validation loss evolved over the course of 50 epochs. The x-axis
represents the epoch number (1 to 50), and the y-axis represents
the loss values.
Training Loss:
Validation Loss:
Interpretation:
The reduction in both training and validation loss over the epochs
indicates that the emotion prediction model is effectively learning
and adapting. The presence of occasional fluctuations in the
validation loss curve suggests potential overfitting in specific
epochs; however, the overall alignment between the training and
validation loss curves indicates that the model is generalizing
reasonably well.
Accuracy vs Epoch
[Figure: training and validation accuracy per epoch for the emotion model.]
Training Accuracy:
Validation Accuracy:
Interpretation:
Both training and validation accuracy improve significantly over the
epochs, with the gap between them remaining relatively small. This
small gap suggests that the model maintains a balance between
learning and generalization, avoiding severe overfitting. The
validation accuracy's stabilization and high performance towards
the later epochs indicate the model's ability to predict emotions with
a high degree of accuracy on unseen data.
Categorization of Groups
Couples: Groups containing exactly one male and one female are
classified as couples. This criterion ensures precision by combining
gender-based filtering with proximity-based grouping.
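A sketch of this categorization; the couple rule follows the criterion above, while the child/adult check used for families is an assumption based on the age classes defined earlier.

def categorize_group(genders, age_classes):
    """genders: list of 'male'/'female'; age_classes: list of class indices 0-6."""
    if sorted(genders) == ["female", "male"]:
        return "Couple"                              # exactly one male and one female
    has_child = any(c <= 1 for c in age_classes)     # classes 0-1: 1-9 years (assumed)
    has_adult = any(c >= 3 for c in age_classes)     # classes 3+: 21+ years (assumed)
    if len(genders) >= 2 and has_child and has_adult:
        return "Family"
    return "Group"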
The image is then displayed with bounding boxes around faces, lines
connecting the centers of faces within groups, and labels indicating
whether the group is a "Couple" or a "Family."
Attention Identification
Visual Representation:
Quantitative Results:
1. Preview Frame: This frame is split into two sections. The first
displays the original input image, and the second presents the
processed image with annotated details, including detected
faces and their respective attributes (e.g., age, gender,
emotion). This dual visualization allows for a side-by-side
comparison, aiding in the verification of system outputs.
User Interaction
Two buttons at the bottom of the preview frame streamline the user
experience:
Load Image: Allows users to upload an image file for
analysis.
Ad Display Mechanism
Conclusions
Future Work
Expanded Functionality