
A CNN-Based Human Head Detection Algorithm Implemented on Edge AI Chip


Fang-Jing Shen, Jian-Hao Chen, Wei-Yen Wang, and Dien-Lin Tsai
Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan

Lien-Chieh Shen and Ching-Tung Tseng
ADVANCE VIDEO SYSTEM CO., LTD, Taipei, Taiwan

Abstract—This paper presents an integrated circuit implementation of a human head detection algorithm. The article describes the image data augmentation techniques used for deep learning and the operating procedure of the evaluation board, named Mipy. Experimental results demonstrate the effectiveness of the proposed evaluation board in detecting human heads in indoor environments.

Keywords—convolutional neural network (CNN), human head detection, edge AI chip

Fig. 1. The exterior of the Mipy evaluation board.
I. INTRODUCTION
In recent years, many human head detection algorithms based on convolutional neural networks (CNNs) have been presented. To pursue a higher accuracy rate, the number of layers and weights in these networks keeps increasing, which makes the computation heavy and requires a large amount of memory. In contrast, an integrated circuit (IC) implementation of a CNN is a way to obtain better power consumption and computation speed. Therefore, the IC used in this paper implements a fixed neural network architecture with a low-cost integrated circuit design. Furthermore, when the network architecture cannot be changed, the setting of hyperparameters during training and the preprocessing of the training data become more important. As a result, a set of tools specifically for training the network was developed.
Fig. 2. The functional block diagram of the Mipy evaluation board.

In this paper, we utilize the Mipy evaluation board, developed by the technology company "AVSdsp", to implement the image-based human head detection task. Fig. 1 shows the exterior of the Mipy evaluation board. The evaluation board is equipped with a CNN integrated circuit named AI860. As a collaborative chip, it focuses solely on the computation of the CNN. The built-in neural network has ten output neurons. The Mipy evaluation board uses AVS05P-S as the main control chip for system control and image processing, and AI860 communicates with AVS05P-S via the Inter-Integrated Circuit bus (I2C). Fig. 2 shows the functional block diagram of the Mipy evaluation board. AVSdsp also provides a set of deep learning tools [1] for users to train models suitable for AI860.

The structure of this paper is as follows: Sec. II describes the related work. A standard operating procedure for the Mipy evaluation board is given in Sec. III. In Sec. IV, experimental results are shown to demonstrate the effectiveness of the proposed method. Finally, Sec. V concludes this paper.

II. RELATED WORK

A. Training Tools
The deep learning tools developed by AVSdsp are implemented in C++. The tool set is divided into three executable files, each used for a different task in the overall training process.

1) Creating database tool: This tool encodes the training images into a binary file to speed up execution time and applies data augmentation to the images.
2) Inference tool: This tool has three purposes, i.e., capturing images from different sources, applying data augmentation while capturing images, and inferring the classification with a trained CNN model.
3) Training tool: The training tool performs the main training loop, including forward propagation, backward propagation, loss calculation, and weight optimization. The hyperparameters for training neural networks can be set in the training profile, which is a manually editable text file. These hyperparameters include batch size, learning rate,

optimizer, normalization method, weight decay, and max training epochs. In this paper, we use the Adam [2] optimizer with batch size 250, learning rate 0.0001, weight decay 0.0001, and weight normalization [3]. We make the training tool recreate a new database with the creating database tool every 30 epochs.

All the executable files of the deep learning tools can be run as console applications from a command prompt, so the procedures can be written into batch files.

B. Image Data Augmentation
The training data is one of the key factors in training deep neural network models. If the training data is not diverse enough, the trained model will overfit and cannot detect untrained targets correctly. Therefore, the deep learning tools utilized in this paper are able to expand the training dataset. The creating database tool and the inference tool can randomly change the brightness, rotation, and sharpness of images, and can also replace the background.

1) Rotation: Randomly rotating the images clockwise or counterclockwise within a specified range of angles [4].
2) Brightness adjustment: There are two ways to adjust the brightness of an image. One is to multiply every pixel by a constant value, and the other is to add a constant value to every pixel. These two ways are shown in the following formulas:

Pnew = Pold * a    (1)

Pnew = Pold + b    (2)

where Pold and Pnew are the pixel values in the original and adjusted images, respectively, and a and b are constant values. The two adjustments can also be applied simultaneously.
3) Blurring image: Randomly blurring or sharpening images with different strengths.
4) Mirroring image: Randomly mirroring the input images.
5) Background replacement: If the background color is uniform and significantly different from the foreground, randomly replacing the background area with one of the given images (e.g., a landscape image).

III. THE OPERATING PROCEDURE OF MIPY
In this section, we explain a standard operating procedure for the Mipy evaluation board. Fig. 3 shows the flowchart of the whole training process.

Fig. 3. The overall process of training the model.

1) Capturing images: We photograph several persons and use the inference tool to capture training images. The images of each person include eight body orientations, and every body orientation contains five head directions (i.e., up, down, left, right, and forward). In total, we take 800 original images and divide them into three classes, i.e., front, side, and back views of the human head, as shown in Fig. 4.

Fig. 4. Human head images. (a) Front view. (b) Back view. (c) Left side view. (d) Right side view.

2) Increasing samples: We use the inference tool to augment the data by replacing the uniform-color background of the original images with landscape images. Moreover, every image is randomly transformed by changing the brightness, blurring, sharpening, rotating, and mirroring. The parameters that we used for augmenting the data are shown in TABLE I. Finally, the number of images is increased from the 800 originals to 120,000, including 108,000 training images and 12,000 testing images.

TABLE I. METHOD OF DATA AUGMENTATION AND PARAMETERS
Methods | Parameters
Changing brightness using (1) | a = 0.8 to 1.2
Changing brightness using (2) | b = −15 to 15
Rotating | −15° to 15°
Mirroring | Enabled
Blurring | Enabled

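As a concrete illustration, the brightness rules (1) and (2) together with the parameter ranges of TABLE I can be sketched in a few lines of NumPy. The AVSdsp tools themselves are closed C++ executables, so the function names and the clipping to the 8-bit range [0, 255] below are our own illustrative assumptions, not the tools' actual implementation:

```python
import random
import numpy as np

def adjust_brightness(img: np.ndarray, a: float = 1.0, b: float = 0.0) -> np.ndarray:
    """Apply Pnew = Pold * a (formula (1)) and Pnew = Pold + b (formula (2)),
    clipping the result to the valid 8-bit pixel range."""
    out = img.astype(np.float32) * a + b
    return np.clip(out, 0, 255).astype(np.uint8)

def random_augment(img: np.ndarray) -> np.ndarray:
    """Sample a and b from the TABLE I ranges and randomly mirror the image."""
    a = random.uniform(0.8, 1.2)   # multiplicative brightness, range from TABLE I
    b = random.uniform(-15, 15)    # additive brightness, range from TABLE I
    out = adjust_brightness(img, a, b)
    if random.random() < 0.5:      # mirroring is enabled in TABLE I
        out = out[:, ::-1]         # horizontal flip
    return out
```

Both brightness adjustments are applied in one pass here, matching the remark above that the two rules can be utilized simultaneously.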
3) Editing the "creating database profile": This profile is required by the creating database tool to set the amount of training and testing data. We use the tool to generate 30,000 training images and 3,000 testing images for the database. The training set contains positive and negative samples; the negative images are shifted by 50% in four directions (i.e., up, down, left, and right). Then we set up the output neuron types. Neuron 0 and neuron 5 are for positive and negative sample detection, respectively. Neurons 1-4 are reserved for class labels, and neurons 6-9 are for bounding box position regression. Every image is also labeled in this step. We treat two of the four reserved neurons as binary classifiers with only +1 and -1 labels; (+1,+1), (+1,-1), and (-1,-1) then represent the human head front view, side view, and back view, respectively. Finally, we add the file path to the profile and set the data augmentation parameters for the training and testing data. The parameters that we used for augmenting the data are shown in TABLE II.

TABLE II. DATA AUGMENTATION PARAMETERS WHILE CREATING DATABASE
Methods | Training set | Testing set
Changing brightness using (1) | a = 0.8 to 1.2 | Disabled
Changing brightness using (2) | b = −10 to 10 | Disabled
Rotating | Disabled | Disabled
Mirroring | Disabled | Disabled
Blurring | Enabled | Disabled

4) Editing the "training profile": This profile is required by the training tool. Its main purpose is to set the path of the model, the path of the database, and the hyperparameters (e.g., batch size, learning rate, and max epochs).
5) Pretraining the model: To make the training process more automatic, we use batch files. First, we execute the creating database tool. Next, we execute the training tool to start training. Third, we write a Python script that reads the log file generated by the training tool and records the accuracy. The batch file runs the above steps three times to create a pretrained model called "G0". Fig. 5 shows the training and testing errors of the G0 model. Obviously, in the G0 model the training error is much lower than the testing error, most likely because of overfitting.

Fig. 5. The graph of training and testing errors of G0 model.

6) False positive correction: We use the inference tool to detect false positive bounding boxes in 36 landscape images with the G0 model. The tool captures the images within the detected bounding boxes. We then treat these captured images as negative training data and add them to the training database.
7) Score calculation: We edit the score calculation profile for the inference tool and set the threshold used to find out which images need additional training. The score is defined below:

S = N0 − N5    (3)

where N0 and N5 are the output values of neuron 0 and neuron 5, respectively. The output value of neuron 0 indicates how positive the input is, and the output value of neuron 5 indicates how negative the input is.
8) Training the model: We use the batch file to create the database and run an infinite training loop to train the model. Manual termination is required in this step. In addition to the rise of the testing accuracy, we can observe the score distribution graph of each class to judge whether the model has been trained well. When most of the negative data from class 4 is correctly predicted, the loop should be terminated. The model obtained in this step is called "G1". Fig. 6 shows the training and testing errors of the G1 model.

Fig. 6. The graph of training and testing errors of G1 model.

9) Simulation: The inference tool is used to simulate the effectiveness of the G1 model. We find that there are too many false positive bounding boxes.
10) Repeating steps 6-9: We use about 1,000 landscape images in this step to capture false positive bounding boxes. The scores of these additional images need to be calculated, and a new database needs to be created in each round. After the G1 model goes through these steps again, it is called the "G2" model; after one more repetition, the "G3" model, and so on. We continue repeating this step until the model becomes the "G4" model. Fig. 7 shows the training and testing errors of the G2, G3, and G4 models.

Fig. 7. The graphs of training and testing errors. (a) G2 model. (b) G3 model. (c) G4 model.

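The score of (3) and the threshold test of step 7 can be sketched as follows. The ten-neuron output layout (neuron 0 positive, neuron 5 negative) comes from step 3, while the concrete threshold value and the variable names are hypothetical choices for illustration:

```python
def head_score(outputs):
    """Score S = N0 - N5 from (3): neuron 0 carries the positive evidence
    and neuron 5 the negative evidence, out of the ten output neurons."""
    return outputs[0] - outputs[5]

def needs_additional_training(outputs, threshold=0.5):
    """A positive sample whose score falls below the (assumed) threshold is
    flagged so that it can be fed back into the database, as in steps 6 and 10."""
    return head_score(outputs) < threshold

# Ten output neurons: index 0 = positive detection, 1-4 = class labels,
# 5 = negative detection, 6-9 = bounding-box regression (values illustrative).
confident_head = [0.9, 0, 0, 0, 0, 0.1, 0, 0, 0, 0]
ambiguous_head = [0.4, 0, 0, 0, 0, 0.5, 0, 0, 0, 0]
```

With these example outputs, the confident detection scores S = 0.8 and passes the threshold, while the ambiguous one scores S = -0.1 and is flagged for additional training.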
11) Handling overfitting: The false positive bounding boxes detected by the inference tool with the G4 model are very few, but the testing accuracy of the G4 model is only 58.9%, so its target detection ability is not good. The G4 model is considered overfitted, and we decide to increase the training data.
12) Adding LFW to the database: From the Labeled Faces in the Wild (LFW) [5] dataset, we choose the eligible photos and add them to our data set. The photos need to meet the following two criteria: 1. there can only be a single person in the photo; 2. the target in the photo cannot be covered by anything. In total, 9,131 photos meet these two criteria. A quarter of them is added to the testing set, and the rest are added to the training set.
13) Improving the model: We repeat step 10 on the G4 model with the new dataset (i.e., with the LFW images added) twice, which yields the G5 and G6 models. The testing accuracy increases to 65.5% and 70.1% for the G5 and G6 models, respectively. Fig. 8 shows the training and testing errors of the G5 and G6 models.

Fig. 8. The graphs of training and testing errors. (a) G5 model. (b) G6 model.

IV. EXPERIMENT RESULTS
After loading the trained model into the AI860, the Mipy evaluation board can perform the human head detection task. Fig. 9 shows the experimental results. The testing set consists of positive and negative images; at the beginning, we take 10% of the positive samples and 10% of the negative samples as the testing images. TABLE III shows the confusion matrix of the testing results for the G4 model. The accuracy is 58.9% and 98.8% for true positive (TP) and true negative (TN) images, respectively.

Fig. 9. The experimental results of human head detection.

TABLE III. CONFUSION MATRIX OF G4 MODEL
Predict \ Ground Truth | Positive | Negative
Positive | 58.9% | 1.2%
Negative | 41.1% | 98.8%

TABLE IV shows the confusion matrix of the testing results for the G5 model. The accuracy is 65.6% and 99.3% for TP and TN, respectively. By adding images from the LFW database to the positive images, the TP accuracy becomes higher than that of the G4 model. We find that the detection ability of the G5 model on the evaluation board is greatly improved, but the accuracy on the testing set does not increase much. We therefore try to find the reason why the accuracy is low.

TABLE IV. CONFUSION MATRIX OF G5 MODEL
Predict \ Ground Truth | Positive | Negative
Positive | 65.6% | 0.7%
Negative | 34.4% | 99.3%

For analysis purposes, we separate the front-view data from the testing set and find that its accuracy is much higher than that of the side- and back-view data. TABLE V shows the confusion matrix of the testing results for the G5 model with only human head front-view data. The accuracy is 92.6% and 95.3% for TP and TN images, respectively.

TABLE V. CONFUSION MATRIX OF G5 MODEL (FRONT VIEW ONLY)
Predict \ Ground Truth | Positive | Negative
Positive | 92.6% | 4.7%
Negative | 7.4% | 95.3%

TABLE VI shows the confusion matrix of the testing results for the G6 model with only human head front-view data. Due to the addition of the LFW data, the amount of front-view data is much larger than that of the side- and back-view data. Therefore, the accuracy on the front-view data becomes very high.

TABLE VI. CONFUSION MATRIX OF G6 MODEL (FRONT VIEW ONLY)
Predict \ Ground Truth | Positive | Negative
Positive | 98.7% | 4.7%
Negative | 1.3% | 95.3%

V. CONCLUSIONS
In this paper, we utilize a set of training tools provided by AVSdsp, and the trained model can be loaded into an AI chip to perform the task of detecting human heads. The test accuracy can reach 98.7%, and the bounding boxes detected by the trained model are accurate enough for a real-time detection system. In conclusion, when comprehensive training data is utilized, the Mipy evaluation board can reach sufficient detection accuracy for practical applications.

ACKNOWLEDGMENT
This work was financially supported by the "Chinese Language and Technology Center" of National Taiwan Normal University (NTNU) from the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and by the Ministry of Science and Technology, Taiwan, under Grants no. MOST 109-2634-F-003-006 and MOST 109-2634-F-003-007 through the Pervasive Artificial Intelligence Research (PAIR) Labs. We are grateful to the National Center for High-performance Computing for computer time and facilities to conduct this research.

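For completeness, the column-normalised percentages reported in TABLES III-VI can be reproduced from raw prediction counts with a small helper. The paper does not give the raw counts, so the counts in the usage example below are hypothetical values chosen only to reproduce the G4 rates of TABLE III:

```python
def confusion_percentages(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Normalise raw counts per ground-truth column, as in TABLES III-VI:
    each column (Positive, Negative) sums to 100%."""
    pos = tp + fn   # all ground-truth positive samples
    neg = fp + tn   # all ground-truth negative samples
    return {
        "TP": round(100.0 * tp / pos, 1),  # positives predicted positive
        "FN": round(100.0 * fn / pos, 1),  # positives predicted negative
        "FP": round(100.0 * fp / neg, 1),  # negatives predicted positive
        "TN": round(100.0 * tn / neg, 1),  # negatives predicted negative
    }
```

For example, hypothetical counts of 589/411 on 1,000 positive samples and 12/988 on 1,000 negative samples reproduce the 58.9% and 98.8% figures of TABLE III.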
REFERENCES
[1] Advance Video System CO., LTD (AVSdsp), AI courses, requirements, tool updates, Q&A area: CNN Tool v0.0.1.2c. Available: http://www.avsdsp.com/AI_Data.html
[2] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[3] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Proc. of the 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.
[4] D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, June 2012, pp. 3642-3649.
[5] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Technical Report, University of Massachusetts, Amherst, Oct. 2007.
