Design of Intelligent Classroom Facial Recognition
1. Introduction
Education plays an important role in modern society, and almost every family tries to help their children win at the starting line. With the rapid development of Internet technology and artificial intelligence (AI), intelligent teaching systems (ITS), often referred to as the intelligent classroom [1], have been introduced into modern education to provide better teaching services.
Measures to improve the quality and outcomes of education begin with improving students' performance in the classroom [2]. Specifically, positive changes in students' behavior help teachers achieve better results, which benefits the teaching process. Observing each student's performance in an intelligent environment, for example through facial expressions, allows teachers to adjust their teaching methods dynamically based on rapid feedback from this real-time interaction, and thus improves the quality of education.
Accordingly, the initial goal of the project is to implement a real-time computer vision system that automatically provides intelligent insight into the observed students. With the rapid development of deep learning in computer vision, real-time target tracking and detection technology has made great progress and can be applied in this intelligent environment to help assess the performance of each student in the classroom.
2. Background
These ideas, inspired by biological concepts, are called receptive fields: features of the animal visual cortex that act as detectors sensitive to certain types of stimuli, such as edges. This biological function can be applied flexibly to computer vision through convolution operations [4]; in image processing, convolution can be used to filter images to produce different visual effects. A set of convolution filters can be combined to form a convolution layer of a neural network [4]. Figure 1 shows an example of a convolutional network. Stacking such convolution layers forms a convolutional neural network (CNN).
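As a minimal illustration of this filtering idea (a sketch, not code from the paper; the kernel and image below are invented for demonstration), a simple edge-sensitive convolution can be applied to a grayscale image in Python:

import numpy as np
from scipy.signal import convolve2d

# A small Laplacian-style kernel that responds strongly to edges.
edge_kernel = np.array([[ 0, -1,  0],
                        [-1,  4, -1],
                        [ 0, -1,  0]], dtype=np.float32)

# Illustrative "image": a bright square on a dark background.
image = np.zeros((48, 48), dtype=np.float32)
image[16:32, 16:32] = 1.0

# Convolving the image with the kernel highlights the square's edges,
# mimicking the edge-sensitive behaviour of a receptive field.
edges = convolve2d(image, edge_kernel, mode="same", boundary="symm")
print(edges.min(), edges.max())

A convolution layer in a CNN learns many such kernels from data rather than fixing them by hand.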
3.1. Dataset
The project uses the FER-2013 facial expression dataset prepared by Pierre-Luc Carrier and Aaron Courville. Figure 3 shows FER-2013 samples. The dataset, shown in figure 4, consists of 48x48-pixel grayscale face images [5], each labeled with a numeric code from 0 to 6 covering seven emotion categories: anger, disgust, fear, happiness, sadness, surprise, and neutral.
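As a hedged sketch (not the authors' code), FER-2013 is distributed on Kaggle as a CSV file with an emotion column and a space-separated pixels column [5]; a minimal loader in Python might look like this:

import numpy as np
import pandas as pd

# FER-2013 label codes 0-6 in order.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def load_fer2013(csv_path="fer2013.csv"):
    """Read the FER-2013 CSV and return images of shape (N, 48, 48, 1) and labels (N,)."""
    data = pd.read_csv(csv_path)
    # Each row stores 48 * 48 = 2304 space-separated grayscale pixel values.
    pixels = data["pixels"].apply(lambda s: np.array(s.split(), dtype=np.float32))
    images = np.stack(pixels.to_list()).reshape(-1, 48, 48, 1) / 255.0
    labels = data["emotion"].to_numpy(dtype=np.int64)
    return images, labels

# images, labels = load_fer2013()  # the CSV path/filename here is an assumption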
3.3. Models
In the context of the intelligent classroom, the project aims to obtain face detection and facial expression recognition models that can be deployed in real time on hardware-constrained architectures with high accuracy and good performance. Therefore, the project not only applies deep learning, which has become a popular technique in computer vision in recent years, but also adopts a traditional machine learning method using multiple classifiers for target detection, which facilitates comparison.
(1) Machine learning method
Initially, the project designed the process using a machine learning approach. Figure 6 shows the entire process of this approach. Based on the training set, the project extracts facial landmarks, which help to detect the face in the region of interest, and the extracted features are fed into the classifier. The project uses multi-class SVM, logistic regression, and random forest classifiers for comparison. The test set is then used to validate the trained model and evaluate its performance. Finally, the model predicts and outputs the classification results for the facial expressions.
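A minimal sketch of such a pipeline with scikit-learn is shown below; the feature matrix X and label vector y are assumed to come from the landmark-extraction step, which is not reproduced here, and the placeholder data is for illustration only:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X: (n_samples, n_features) landmark-based features; y: emotion codes 0-6.
X = np.random.rand(200, 136).astype(np.float32)   # placeholder features
y = np.random.randint(0, 7, size=200)             # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The three classifier families compared in the paper.
classifiers = {
    "multi-class SVM": SVC(kernel="rbf", decision_function_shape="ovr"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: test accuracy = {accuracy:.3f}")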
3.4. Implementation
The project then applied this modified CNN architecture to the FER-2013 dataset. Training was done on an NVIDIA Jetson TX1, a GPU-based supercomputer-on-module with 256 CUDA cores. Running 150k training steps for the final model takes about four hours, and the weights can be stored in an 870-kilobyte file. By reducing the computational cost of the architecture, the project is able to chain the two models and run them sequentially on the same image without a significant increase in processing time. After obtaining the pre-trained models, the complete pipeline, including the face detection module and facial expression classification, takes 0.22 ± 0.0003 seconds on a MacBook Pro laptop (2.7 GHz i7, 16 GB RAM). This corresponds to a 1.5x speed-up compared with the original architecture of a general CNN. The pre-trained model is used to detect faces and recognize facial expressions from a real-time camera on a laptop computer.
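The exact layer configuration of the modified network is not reproduced in this section; purely as a sketch of the kind of design described (fully connected layers removed, depthwise separable convolutions, and global average pooling feeding a softmax over the seven classes), a tf.keras version might look as follows. The number of blocks and filter sizes are assumptions for illustration, not the paper's exact architecture:

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # FER-2013 emotion categories

def build_small_emotion_cnn(input_shape=(48, 48, 1)):
    """Small fully-convolutional classifier built from depthwise separable convolutions."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(8, 3, padding="same", activation="relu")(inputs)

    # Stacks of depthwise separable convolutions keep the parameter count low.
    for filters in (16, 32, 64):
        x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=2)(x)

    # No fully connected layers: a final convolution plus global average pooling
    # produces the class scores directly.
    x = layers.Conv2D(NUM_CLASSES, 3, padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Activation("softmax")(x)
    return models.Model(inputs, outputs)

model = build_small_emotion_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # the parameter count stays small, consistent with a sub-megabyte weight file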
Figure 10 shows the confusion matrix for facial expression classification with the obtained model. Some misclassifications were observed; for example, "sadness" was predicted as "fear" and "anger" was predicted as "disgust".
The features learned from the original facial expressions by the general and the modified CNN models can be observed in figure 11. The white areas in figure 11 (c) correspond to the pixel values that map to the selected neurons activated in the last convolution layer. We can observe that the CNN learns to activate on features such as teeth, eyebrows, and widened eyes, and that each feature remains consistent within the same label. These results suggest that the CNN is learning human-interpretable features. The explainable results also help us to understand some misclassifications; for example, people wearing glasses or with bushy facial hair were predicted to be "angry". This happens because the label "anger" is highly activated by frowning, and the frown features are confused with the darker frames of glasses or facial hair. In addition, the features learned by our modified model (figure 11 (c)) are more interpretable than those learned by the general CNN model (figure 11 (b)); using more parameters in the original architecture therefore results in less robust features.
Figure 11 (a) Original facial expression samples; Figure 11 (b) Guided back-propagation visualization of the general CNN model
"happy" emotions into "positive" states, with those with "neutral" emotions being considered as
"focused" states.
5. Conclusion
In the face detection and facial expression recognition module, based on the FER-2013 emotion dataset, the project compares a traditional rule-based method with a deep-learning-based CNN method, seeking a suitable model with higher recognition accuracy and better performance. Experiments show that the CNN model achieves a higher mean average precision (mAP), but, because it generates millions of parameters, its latency on the hardware-constrained equipment used in the project is very large. To solve this problem, the project modifies the general CNN model by removing the fully connected layers and combining depthwise separable convolutions with residual modules. Compared with the original model, the modified model reduces the number of parameters by a factor of 80. After 150k training steps, the project obtained the final model: recognition speed increased 1.5-fold and mAP increased from 65.4% to 70.1%. Even with some misclassifications, the model can provide teachers with feedback on students' states, including focused, positive, negative, or surprised. For future work, the project will attempt to reduce misclassification by reducing interference with the learned features, such as from wearing glasses, which may cause results to be classified as negative.
References
[1] Daniel Faggella (2017, September). Examples of Artificial Intelligence in Education. Retrieved
September 1, 2017 from TechEmergence Web site: https://www.techemergence.com/
[2] Room 241 Teams (2012, December). Strategies to Improve Classroom Behaviour and Academic
Outcomes. Retrieved December 26, 2012, from Concordia University-Portland Web site:
https://education.cu-portland.edu/blog/
[3] Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object
detection and semantic segmentation. In Proceedings of the IEEE conference on computer
vision and pattern recognition. 2014 (pp. 580–587)
[4] Marr, D., and Hildreth, E. Theory of edge detection. Proceedings of the Royal Society of London
B: Biological Sciences 207, 1167 (1980), 187–217.
[5] Dataset in Challenges in Representation Learning: Facial Expression Recognition Challenge.
Retrieved 2013, from Kaggle Web site: https://www.kaggle.com/c/challenges-in-
representation-learning-facial-expression-recognition-challenge/data