A CNN-Based Human Head Detection Algorithm Implemented On Edge AI Chip
optimizer, normalization method, weight decay, and max training epochs. In this paper, we utilize the Adam [2] optimizer with batch size 250, learning rate 0.0001, weight decay 0.0001, and weight normalization [3]. We make the training tool recreate a new database with the creating database tool every 30 epochs.
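The paper's tools are configured through profiles rather than code, but the same optimization settings can be written down in PyTorch. The following is a minimal sketch under that assumption; the network is a placeholder, since the paper's actual CNN architecture is not reproduced here.

# Illustrative only: the paper uses AVSdsp's proprietary training tools,
# but the same hyperparameters can be expressed in PyTorch.
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

model = nn.Sequential(                            # placeholder network, not the paper's CNN
    weight_norm(nn.Conv2d(3, 16, 3, padding=1)),  # weight normalization [3]
    nn.ReLU(),
    nn.Flatten(),
    weight_norm(nn.Linear(16 * 32 * 32, 10)),     # 10 output neurons (see Sec. III), assumes 32x32 input
)

optimizer = torch.optim.Adam(                     # Adam [2]
    model.parameters(),
    lr=0.0001,                                    # learning rate from the paper
    weight_decay=0.0001,                          # weight decay from the paper
)
# Batch size 250 would be set on the data loader:
# loader = torch.utils.data.DataLoader(dataset, batch_size=250, shuffle=True)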
All the executable files of the deep learning tools can be run as console applications from a command prompt, so the procedures can be scripted into batch files.
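Because everything is a console executable, the whole pipeline can be driven by a script. The following is a minimal sketch in Python, where create_db.exe, train.exe, and their arguments are hypothetical placeholders, not the real AVSdsp tool names.

# Illustrative driver for the console tools; "create_db.exe" and "train.exe"
# and their flags are hypothetical placeholders, not real AVSdsp executables.
import subprocess

EPOCHS_PER_DB = 30                      # recreate the database every 30 epochs

while True:                             # infinite training loop; terminate manually
    subprocess.run(["create_db.exe", "profile.txt"], check=True)
    subprocess.run(["train.exe", "--epochs", str(EPOCHS_PER_DB)], check=True)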
B. Image Data Augmentation
The training data is one of the key points for training deep neural network models. If the training data is not diverse enough, the trained model will overfit and will not detect untrained targets correctly. Therefore, the deep learning tools utilized in this paper are able to expand the training dataset. The creating database tool and the inference tool can randomly change the brightness, rotation, and sharpness of images, and can also replace the background. A minimal code sketch of these transformations follows the list below.
1) Rotation: Randomly rotating the images clockwise or counterclockwise within a specified range of angles [4].
2) Brightness adjustment: There are two ways to adjust the brightness of images. One is to multiply every pixel by a constant value, and the other is to add a constant value to every pixel of the image. These two ways are shown in the following formulas.

P_new = P_old × a (1)
P_new = P_old + b (2)

where P_old and P_new are the pixel values in the original and adjusted images, respectively, and a and b are constant values. The two ways can also be utilized simultaneously.
3) Blurring image: Randomly blurring or sharpening images with different strengths.
4) Mirroring image: Randomly mirroring the input
images.
5) Background replacement: If the background color is uniform and significantly different from the foreground, randomly replacing the background area with one of the given images (e.g., a landscape image).
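The following is a minimal sketch of these augmentations, assuming PIL and NumPy; the paper's own tools implement them internally, so the function below is only illustrative.

# A minimal sketch of the augmentations described above, using the
# parameter ranges from TABLE I; illustrative, not the tools' own code.
import random
import numpy as np
from PIL import Image, ImageFilter, ImageOps

def augment(img: Image.Image) -> Image.Image:
    arr = np.asarray(img).astype(np.float32)

    # Brightness: formula (1) multiplies by a, formula (2) adds b.
    a = random.uniform(0.8, 1.2)
    b = random.uniform(-15, 15)
    arr = np.clip(arr * a + b, 0, 255).astype(np.uint8)
    img = Image.fromarray(arr)

    # Rotation within a specified range of angles.
    img = img.rotate(random.uniform(-15, 15))

    # Random mirroring.
    if random.random() < 0.5:
        img = ImageOps.mirror(img)

    # Random blurring or sharpening with different strengths.
    if random.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2)))
    else:
        img = img.filter(ImageFilter.UnsharpMask(percent=random.randint(50, 150)))

    return img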
III. THE OPERATING PROCEDURE OF MIPY
In this section, we explain the standard operating procedure of the Mipy evaluation board. Fig. 3 shows the flowchart of the whole training process.

Fig. 3. The overall process of training the model.
1) Capturing images: We photograph some persons and utilize the inference tool to capture training images. The images of each person include eight body orientations, and every body orientation contains five head directions (i.e., up, down, left, right, and forward). In total, we take 800 original images and divide them into three classes, i.e., front, side, and back views of human heads, as shown in Fig. 4.

Fig. 4. Human head images. (a) Front view. (b) Back view. (c) Left side view. (d) Right side view.

2) Increasing samples: We utilize the inference tool to augment the data by replacing the uniform color background of the original images with landscape images. Moreover, every image is randomly transformed by changing the brightness, blurring, sharpening, rotating, and mirroring. The parameters that we used for augmenting the data are shown in TABLE I. Finally, the number of original images is increased from 800 to 120,000, including 108,000 training images and 12,000 testing images.

TABLE I. METHODS OF DATA AUGMENTATION AND PARAMETERS

Methods                          Parameters
Changing brightness using (1)    a = 0.8 to 1.2
Changing brightness using (2)    b = −15 to 15
Rotating                         −15° to 15°
Mirroring                        Enabled
Blurring                         Enabled
3) Editing the “creating database profile”: This profile is required by the creating database tool to set the amount of training and testing data. We utilize the tool to generate 30,000 training images and 3,000 testing images for the database. The training set contains positive and negative samples; the negative images are shifted by 50% in four directions (i.e., up, down, left, and right). Then we set up the output neuron types. Neuron 0 and neuron 5 are for positive and negative sample detection, respectively. Neurons 1-4 are reserved for class labels, and neurons 6-9 are for bounding box position regression. Every image is also labeled in this step. We consider two of the four reserved neurons to be a pair of binary classifiers with only +1 and −1 labels; then (+1,+1), (+1,−1), and (−1,−1) are utilized to represent the human head front view, side view, and back view, respectively. Then we add the file path to the profile and set the data augmentation parameters of the training and testing data. The parameters that we used for augmenting the data are shown in TABLE II. A sketch of this label encoding is given after TABLE II.

TABLE II. DATA AUGMENTATION PARAMETERS WHILE CREATING THE DATABASE

Methods                          Training set      Testing set
Changing brightness using (1)    a = 0.8 to 1.2    Disabled
Changing brightness using (2)    b = −10 to 10     Disabled
Rotating                         Disabled          Disabled
Mirroring                        Disabled          Disabled
Blurring                         Enabled           Disabled
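To make the neuron layout concrete, the following sketch builds the 10-value target vector described above. The ±1 targets for neurons 0 and 5 and the bounding box encoding are assumptions for illustration, since the tool's exact regression targets are not specified in the paper.

# A minimal sketch of the 10-neuron target layout from step 3, assuming NumPy.
import numpy as np

VIEW_CODES = {           # two of the four reserved neurons act as a binary pair
    "front": (+1, +1),
    "side":  (+1, -1),
    "back":  (-1, -1),
}

def encode_target(positive: bool, view: str, bbox=(0.0, 0.0, 0.0, 0.0)):
    t = np.zeros(10, dtype=np.float32)
    t[0] = +1.0 if positive else -1.0   # neuron 0: positive sample detection (assumed ±1)
    t[5] = -1.0 if positive else +1.0   # neuron 5: negative sample detection (assumed ±1)
    if positive:
        t[1], t[2] = VIEW_CODES[view]   # neurons 1-4 reserved; two encode the view
    t[6:10] = bbox                      # neurons 6-9: bounding box regression (illustrative)
    return t

# encode_target(True, "side") -> [+1, +1, -1, 0, 0, -1, 0, 0, 0, 0]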
7) Score calculation: We edit the score calculation profile for the inference tool and set the threshold to find out which images need additional training. The score is defined below:

S = N0 − N5 (3)

where N0 and N5 are the output values of neuron 0 and neuron 5, respectively. The output value of neuron 0 indicates how positive the input is, and the output value of neuron 5 indicates how negative the input is. A code sketch of this scoring rule is given after step 8.

8) Training the model: We utilize the batch file to create the database and run an infinite training loop to train the model. Manual termination is required in this step. In addition to the rise of the testing accuracy, we can observe the score distribution graph of each class to figure out whether the model has been well trained. When most of the negative data from class 4 is correctly predicted, the loop should be terminated. In this step, the model is called “G1”. Fig. 6 shows the graph of training and testing errors of the G1 model.
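The following is a minimal sketch of the scoring rule in (3); the threshold value is an assumption, since the paper does not state it.

# Score from formula (3) and thresholding to flag images for more training.
import numpy as np

def needs_more_training(outputs: np.ndarray, threshold: float = 0.5) -> bool:
    """outputs: the 10 output neuron values for one image; threshold is assumed."""
    s = outputs[0] - outputs[5]     # S = N0 - N5, formula (3)
    return s < threshold            # low score -> hard example, retrain on it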
11) Handling overfitting: The false positive bounding boxes detected by the inference tool with the G4 model are very few, but the testing accuracy of the G4 model is only 58.9%, so its target detection ability is not good. The G4 model is considered to be overfitting, and we decide to increase the training data.
12) Adding LFW to the database: From the Labeled Faces in the Wild (LFW) [5] dataset, we choose the eligible photos and add them to our data set. The photos need to meet the following two criteria: 1. there can only be a single person in the photo; 2. the target in the photo cannot be covered by anything. Finally, there are 9,131 photos which meet these two criteria. A quarter of them is added to the testing set, and the rest are added to the training set, as sketched below.
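A minimal sketch of this split follows; the list of eligible photo paths and the fixed seed are assumptions for illustration.

# Quarter/three-quarter split of the 9,131 eligible LFW photos.
# "lfw_eligible.txt" is a hypothetical file of photo paths, one per line.
import random

eligible = sorted(open("lfw_eligible.txt").read().split())
random.seed(0)                  # make the split reproducible (assumed)
random.shuffle(eligible)

n_test = len(eligible) // 4     # a quarter goes to the testing set
test_set = eligible[:n_test]
train_set = eligible[n_test:]   # the rest goes to the training set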
13) Improving the model: We repeat step 10 on the G4 model with the new dataset (i.e., adding the LFW images) twice, and then we get the G5 and G6 models. The testing accuracy increases to 65.5% and 70.1% for models G5 and G6, respectively. Fig. 8 shows the graphs of training and testing errors of the G5 and G6 models.
IV. EXPERIMENT RESULTS
After loading the trained model into the AI860, the Mipy evaluation board can perform the human head detection task. Fig. 9 shows the experimental results. The testing set consists of positive and negative images. We take 10% of the positive samples and 10% of the negative samples as the testing images in the beginning. TABLE III shows the confusion matrix of testing results for the G4 model. The accuracy is 58.9% and 98.8% for true positive (TP) and true negative (TN) images, respectively.

Fig. 9. The experimental results of human head detection.

TABLE III. CONFUSION MATRIX OF G4 MODEL

                      Ground Truth
G4 Model          Positive    Negative
Predict Positive    58.9%       1.2%
Predict Negative    41.4%      98.8%
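For reference, the column-normalized percentages in TABLES III-VI can be computed from per-image results as in the following sketch, assuming NumPy.

# Column-normalized confusion matrix: each column divided by its ground-truth count.
import numpy as np

def confusion_percent(pred_pos: np.ndarray, truth_pos: np.ndarray) -> np.ndarray:
    """Both inputs are boolean arrays, one entry per test image.
    Returns [[TP%, FP%], [FN%, TN%]], normalized by ground truth per column."""
    tp = np.sum(pred_pos & truth_pos)
    fn = np.sum(~pred_pos & truth_pos)
    fp = np.sum(pred_pos & ~truth_pos)
    tn = np.sum(~pred_pos & ~truth_pos)
    pos, neg = tp + fn, fp + tn
    return np.array([[tp / pos, fp / neg],
                     [fn / pos, tn / neg]]) * 100.0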
TABLE IV shows the confusion matrix of testing results for the G5 model; the accuracy is 65.6% and 99.3% for TP and TN, respectively. After adding images from the LFW database to the positive images, the accuracy of TP is higher than that of the G4 model. We find that the detection ability of the G5 model on the evaluation board has been greatly improved, but the accuracy on the testing set is not increased much. We therefore try to find the reason why the accuracy is low.

TABLE IV. CONFUSION MATRIX OF G5 MODEL

                      Ground Truth
G5 Model          Positive    Negative
Predict Positive    65.6%       0.7%
Predict Negative    34.4%      99.3%

For analysis purposes, we separate the data with the front view from the testing set and find that its accuracy is much higher than that of the side and back view data. TABLE V shows the confusion matrix of testing results for the G5 model with only human head front view data. The accuracy is 92.6% and 95.3% for TP and TN images, respectively. TABLE VI shows the corresponding front-view results for the G6 model.

TABLE V. CONFUSION MATRIX OF G5 MODEL (FRONT VIEW ONLY)

                      Ground Truth
G5 Model (front)  Positive    Negative
Predict Positive    92.6%       4.7%
Predict Negative     7.4%      95.3%

TABLE VI. CONFUSION MATRIX OF G6 MODEL (FRONT VIEW ONLY)

                      Ground Truth
G6 Model (front)  Positive    Negative
Predict Positive    98.7%       4.7%
Predict Negative     1.3%      95.3%

V. CONCLUSIONS
In this paper, we utilize a set of training tools provided by AVSdsp, and the trained model can be loaded into an AI chip to perform the task of detecting human heads. The test accuracy can reach 98.7%, and the bounding boxes detected by the trained model are accurate enough for a real-time detection system. In conclusion, when we utilize comprehensive training data, the Mipy evaluation board can achieve sufficient detection accuracy for practical applications.
ACKNOWLEDGMENT
This work was financially supported by the “Chinese Language and Technology Center” of the National Taiwan Normal University (NTNU) from the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and by the Ministry of Science and Technology, Taiwan, under Grants no. MOST 109-2634-F-003-006 and MOST 109-2634-F-003-007 through the Pervasive Artificial Intelligence Research (PAIR) Labs. We are grateful to the National Center for High-performance Computing for computer time and facilities to conduct this research.
REFERENCES
[1] Advance Video System Co., Ltd. (AVSdsp), AI courses, requirements, tool updates, Q&A area: CNN Tool v0.0.1.2c. Available: http://www.avsdsp.com/AI_Data.html
[2] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[3] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Proc. of the 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.
[4] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Proc. of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, June 2012, pp. 3642-3649.
[5] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Technical Report, University of Massachusetts, Amherst, Oct. 2007.