Real-Time Categorization of Driver's Gaze Zone Using Deep Learning Techniques

Abstract— This paper presents a study in which the driver's gaze zone is categorized using new deep learning techniques. Since the sequence of gaze zones of a driver precisely reflects what he is doing and how he behaves, it allows us to infer his drowsiness, focus, or distraction by analyzing the images coming from a camera. A Haar feature based face detector is combined with a correlation filter based MOSSE tracker for the face detection task, to handle the tough visual environment in the car. The driving database is a big-data collection constructed using a recording setup within a compact sedan driven around an urban area. The gaze zones consist of 9 categories depending on where the driver is looking during driving. A convolutional neural network is trained to categorize the driver's gaze zone from a detected face image using a multi-GPU platform, and its network parameters are then transferred to a GPU within a PC running on Windows so that the system operates on a real-time basis. The results suggest that the correct rate of gaze zone categorization reaches 95% on average, indicating that our system outperforms state-of-the-art gaze zone categorization methods based on conventional computer vision techniques.

Keywords— Real-time system; Driver's gaze zone; Deep learning; Convolutional neural network; Driver distraction and fatigue

I. INTRODUCTION

According to a report of the World Health Organization (WHO), traffic accidents are the 8th leading cause of mortality worldwide, with nearly 1.3 million people dying each year as a result of road traffic crashes [1]. Unlike cancers and other diseases, traffic-related deaths have external causes. Moreover, given its continuous upward trend, traffic-related mortality is projected to rise to 1.9 million people annually by 2020, becoming the 5th leading cause of mortality. The main causes of traffic accidents are known to be speeding, distraction, and drowsiness. Distracted or drowsy driving is particularly dangerous because it can easily result in high-impact accidents and secondary accidents.

For those reasons, car companies have been putting extensive research effort into developing a variety of driving assistance systems. Most of the studies focus on the driver's eyes, because the eyes reveal the driver's condition immediately and intuitively and can therefore be used in detecting distraction and drowsiness. One of the favored methods to detect drowsiness is to measure the eye blink frequency, partly because it is a non-contact method. Brainwave measurement can also be used for drowsiness detection, but it is difficult to commercialize due to the inconvenience of attaching sensors to the driver's head. For distraction detection, the driver's gaze is most commonly used: the fraction of time the driver's gaze deviates from the concentrated front gaze is measured during driving, and performance is evaluated by distinguishing one gaze zone from the others.

In 2011, Lee et al. [2] developed a gaze zone detection system based on a head pose estimation method. To check the system performance, they established a driver database using an infra-red camera. An adaptive template matching method was adopted to detect the face, head pose was then estimated by measuring yaw and pitch values, and 18 gaze zones were categorized based on those values. The detection accuracy of this system reached 47% in the strictly correct estimation rate (SCER) and 86% in the loosely correct estimation rate (LCER). Similarly, Fu et al. [3] categorized 12 gaze zones using yaw and pitch, demonstrating mean detection rates of 99% and 92%, respectively. However, because these two methods rely on head pose, they cannot detect gaze changes that occur without moving the head.

In 2014, Tawari et al. [4] proposed a system involving 10 gaze zones, including an unknown zone and a blink category. Two cameras were used in making the driving database, tracking 44 landmarks within the driver's face while estimating the head pose using the Pose from Orthography and Scaling (POSIT) algorithm, followed by detection of the iris center using a HOG descriptor and of the iris direction by comparing it with the facial landmarks adjacent to the eyes. The gaze zones were then determined by merging the iris and head pose detection results. In performance testing, a mean detection accuracy of 93% was achieved, an increase of 20% compared to the detection performance relying on head pose alone.

To handle these problems, we propose a gaze zone detection system based on face tracking using a MOSSE tracker [6] with a single camera, together with a convolutional neural network (CNN) that categorizes the gaze zones on a real-time basis.
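To make the proposed detect-then-track pipeline concrete, the following is a minimal Python/OpenCV sketch (not the authors' code, which is C++/MFC): a Haar cascade finds the face, and a MOSSE correlation-filter tracker follows it between detections. The video path and the fallback policy on tracking failure are our assumptions.

```python
import cv2

# Haar cascade shipped with OpenCV; the MOSSE tracker needs the
# opencv-contrib build (older builds expose cv2.TrackerMOSSE_create).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
tracker = None

cap = cv2.VideoCapture("driving.avi")  # placeholder input video
while True:
    grabbed, frame = cap.read()
    if not grabbed:
        break

    if tracker is None:
        # Detection phase: run the Haar detector on the whole frame.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)
        if len(faces) > 0:
            x, y, w, h = faces[0]
            tracker = cv2.legacy.TrackerMOSSE_create()
            tracker.init(frame, (int(x), int(y), int(w), int(h)))
    else:
        # Tracking phase: the correlation filter follows the face cheaply;
        # fall back to detection when it loses the target (e.g., an abrupt
        # illumination change in the car).
        ok, bbox = tracker.update(frame)
        if ok:
            x, y, w, h = [int(v) for v in bbox]
            face = frame[y:y + h, x:x + w]
            # 'face' would be resized to 227x227 and passed to the CNN.
        else:
            tracker = None
```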
II. DRIVER DATABASE AND FACE TRACKING

A. Driving Database

A driving database is essential for most driver-related research. However, most of the existing driver databases were made by car companies and are not freely available for academic research. We surveyed several driving datasets that could potentially fit our study. An example is the RobeSafe Driver Monitoring Video dataset (RS-DMV) [5], which has a high resolution (1390x480 pixels) at 30 fps. However, this dataset turned out not to fit our purpose. First, the face area is often partially cut off during driving, partly because of the camera position setup, so that the face detector cannot operate and the head pose estimation process cannot proceed. Secondly, since head pose ground truth values are absent, it is not possible to evaluate system performance. We therefore decided that it would be a better option to build our own database, as other researchers often do.

The driver's front scenes were captured with a CCD camera, and his head movements were also recorded using a 3D gyro sensor connected to a laptop computer, as shown in Fig. 2. We use a Ueye camera with a 6mm lens setup that allows us to acquire high-resolution face images, as shown in Fig. 2(B). Care is taken to point the camera directly at the driver's face when it is installed on the front window of the car, so that we are able to capture the driver's full face even when the driver moves his head in any direction and/or handles the wheel.

Before the test driving session, each subject driver was instructed to drive naturally and yet to move his head and eyes sporadically to look at 8 gaze zones in the car. Fig. 1(A) depicts the 8 gaze zones: (1) front view, (2) back-mirror, (3) right-front view, (4) left back-mirror, (5) wheel handle, (6) audio, (7) gear box, (8) right back-mirror. Adding blink to these gaze zones, a total of 9 gaze zones were used. Blink was counted as a gaze zone because, if blink were excluded from the detection data, missing sections would occur in the performance test, given that a driver normally blinks 5 times per minute. In addition, we considered bright and dim illumination for the database, and driving scenes with the driver facing the sun or passing through a tunnel were filmed. Fig. 1(B) and (C) show the bright and dark driver's-seat light conditions, with the driver facing the sun and with his back to the sun, respectively. We included such illumination changes in the database because drivers encounter different light conditions as the driving direction and the time of day change, and because illumination poses problems for most vision systems.

Considering experience-dependent driving behaviors, care was also taken to balance the drivers from very experienced to beginner-level; experienced drivers, for example, usually move the head in synchrony with gaze changes.

Fig. 1. The gaze zones defined in the present study (A) and illumination change in the driver's seat: bright illumination (B) and dim illumination (C).

Fig. 2. Apparatus to acquire driver video images and head pose data during the test driving. (A) An AHRS 3D gyro sensor. (B) Ueye camera with 6mm lens. (C) The driver head-set where the gyro sensor is mounted. (D) A subject wearing the head-set.
Fig. 3. Example images of the 9 gaze zones included in the driving dataset. 0: blink, 1: driver's front, 2: rear view mirror, 3: passenger's front, 4: left mirror, 5: cluster and steering handle, 6: car navigation and audio system, 7: gearbox, 8: right mirror.
Fig. 5. Structure of our convolutional neural network.

Fig. 6. System flowchart: frame → face detection (Haar) or, once a face has been detected, MOSSE tracking → face crop & resize → CNN gaze zone categorization → next frame.

… range cannot be determined unless accompanying head pose changes occur.
TABLE I. NUMBER OF IMAGES PER ZONE INCLUDED IN THE DATASET (UNIT: 100)

Gaze zone    0     1     2     3     4     5     6     7     8
Images       38    169   9     30    5     59    8     4     34
B. Network structure

Following the dataset, the second most important component of a CNN is its network structure. We organized the network as follows. The input image size was set to 227x227, cropped from 256x256. The number of convolution layers was reduced from 5 to 3, and the sizes of the hidden layers were reduced similarly. The pooling between the layers was taken from the original network [7], and the probability of each gaze zone was derived from the final output layer, with the Rectified Linear Unit (ReLU) as the activation function. As such, we created a high-precision CNN suitable for real-time processing. Learning was administered by splitting the data at the ratio of training : validation : test = 7 : 1.5 : 1.5.
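The paper specifies only the coarse shape of this network (a 227x227 input cropped from 256x256, three convolution layers, pooling carried over from the original network [7], ReLU activations, and nine outputs). The following is a minimal sketch of such a reduced network in Caffe's Python NetSpec API; the filter counts, kernel sizes, layer names, and the LMDB source path are illustrative assumptions, not the authors' published configuration.

```python
import caffe
from caffe import layers as L, params as P

def gaze_zone_net(lmdb_source, batch_size=64):
    n = caffe.NetSpec()
    # 227x227 crops taken from the stored 256x256 face images.
    n.data, n.label = L.Data(source=lmdb_source, backend=P.Data.LMDB,
                             batch_size=batch_size, ntop=2,
                             transform_param=dict(crop_size=227))
    n.conv1 = L.Convolution(n.data, num_output=96, kernel_size=11, stride=4)
    n.relu1 = L.ReLU(n.conv1, in_place=True)
    n.pool1 = L.Pooling(n.relu1, pool=P.Pooling.MAX, kernel_size=3, stride=2)
    n.conv2 = L.Convolution(n.pool1, num_output=128, kernel_size=5, pad=2)
    n.relu2 = L.ReLU(n.conv2, in_place=True)
    n.pool2 = L.Pooling(n.relu2, pool=P.Pooling.MAX, kernel_size=3, stride=2)
    n.conv3 = L.Convolution(n.pool2, num_output=128, kernel_size=3, pad=1)
    n.relu3 = L.ReLU(n.conv3, in_place=True)
    n.pool3 = L.Pooling(n.relu3, pool=P.Pooling.MAX, kernel_size=3, stride=2)
    n.fc1 = L.InnerProduct(n.pool3, num_output=512)
    n.relu4 = L.ReLU(n.fc1, in_place=True)
    n.score = L.InnerProduct(n.relu4, num_output=9)  # 9 gaze zones
    n.loss = L.SoftmaxWithLoss(n.score, n.label)
    return n.to_proto()

with open("train.prototxt", "w") as f:
    f.write(str(gaze_zone_net("train_lmdb")))
```

In deployment, the SoftmaxWithLoss layer would be replaced by a plain Softmax so that the nine per-zone probabilities can be read out directly.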
C. Deep learning platform

Many deep learning frameworks are available, such as Theano, Caffe, and Torch [8][9]. They were created for different purposes and target performances, and they use different computer languages; Theano and Caffe are based on Python and C++, respectively. We constructed our system with the C++-based Caffe as an MFC-based system running on Windows. However, since Caffe on a single GPU requires a long learning time, it would delay the system development cycle. Therefore, we used the NVidia DIGITS deep learning workstation as our training platform [10]. Table II lists the specifications of the workstation used for our training and of the PC used for the testing system. Note that the number of CUDA cores in this workstation is 3072 x 4 = 12,288.

The time required for 500 training epochs with the 35,900 categorized images was 5 hours for the workstation and 14 hours for the PC. One of the reasons for this difference in training time is the improved speed of the latest version of Caffe, which allows us to run the 4 GPUs concurrently.

TABLE II. SPECIFICATION COMPARISON BETWEEN THE TRAINING WORKSTATION AND THE TESTING PC

             Training system      Testing system
OS           Ubuntu               Windows
CPU          i7-5930              i7-3770
GPU          NVidia Titan X x 4   NVidia GTX 980
CUDA cores   3072 x 4             2048

IV. EXPERIMENT

A. Gaze zone detection system

Fig. 6 shows the flowchart of the entire system. Face localization is performed by combining face detection and tracking on a real-time basis, followed by gaze zone detection using the CNN. The OpenCV library and C++ tools are used for input and output image processing, and the Caffe library is used for the deep learning network. For the experiment, we used video datasets of 4 different drivers, totaling 7200 frames. The evaluation result is displayed as a confusion matrix comparing the actual gaze zone (ground truth) with the gaze zones estimated by the present system.
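To make the per-frame inference step concrete, here is a hedged Python sketch of loading the trained model through Caffe's deploy interface and classifying a tracked face crop; the file names and the 'data'/'prob' blob names are our assumptions, not details given in the paper.

```python
import caffe
import cv2
import numpy as np

caffe.set_mode_gpu()
net = caffe.Net("deploy.prototxt", "gaze_zone.caffemodel", caffe.TEST)
net.blobs["data"].reshape(1, 3, 227, 227)

def classify_face(face_bgr):
    """Return (zone index, probability) for a cropped face image."""
    img = cv2.resize(face_bgr, (227, 227)).astype(np.float32)
    img = img.transpose(2, 0, 1)        # HWC -> CHW, as Caffe expects
    net.blobs["data"].data[0] = img
    prob = net.forward()["prob"][0]     # softmax over the 9 zones
    return int(prob.argmax()), float(prob.max())
```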
Fig. 7. Examples of the results yielded by the CNN-based gaze zone detection system proposed in this study. The system display shows three parts: an image display part, a CNN output value display part, and a gaze zone display part.
B. Result

This section presents the results of detecting the gaze zones in the 7200 frames. Fig. 9 shows the confusion matrix, indicating that the mean detection rate of the proposed system reaches 95%. Note that gaze zone 2 (rear view mirror) shows the highest error rate. This is because its gaze differs from that of gaze zone 3 (the passenger's front) mainly in iris direction, without an accompanying head movement, a difference that is not clearly noticeable in persons with small eyes. However, except for gaze zone 2, all gaze zones exhibit high detection rates, because a gaze change is typically accompanied by a head pose change as well. Fig. 8 shows how the CNN converges to the target performance as the training epochs go by.

The time required for identifying the gaze zone using the CNN was about 25-28 ms. The face tracking part was configured to use the OpenCV library with GPU CUDA support, and the MOSSE tracker was set to use 4 cores for pixel-by-pixel calculation with Open Multi-Processing (OpenMP) to reduce the processing time. The overall system speed is about 20 frames/s, which allows real-time operation. An MFC UI was used for monitoring the system status, as shown in Fig. 7. The UI consists of three parts: a driver face display part, a CNN output value display part, and the gaze zone detection result displayed as a number at the bottom.

Fig. 8. Loss and accuracy of the network as the deep learning training epochs progress.
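The evaluation described above amounts to a 9x9 confusion matrix over per-frame zone labels. A short sketch of that computation is below (scikit-learn based, with placeholder label files; not the authors' tooling).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.loadtxt("ground_truth.txt", dtype=int)  # placeholder, zones 0-8
y_pred = np.loadtxt("predictions.txt", dtype=int)   # placeholder, zones 0-8

cm = confusion_matrix(y_true, y_pred, labels=list(range(9)))
rates = cm / cm.sum(axis=1, keepdims=True)  # row-normalize per true zone
print("mean detection rate: %.3f" % np.diag(rates).mean())
```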
V. CONCLUSION

First, we built a driving database under natural driving conditions. Secondly, a Haar feature face detector and a MOSSE tracker were combined to detect the face reliably in the visually tough in-car environment. Finally, a CNN model was developed for the driver's gaze zone categorization. The proposed system achieved a mean detection accuracy of over 95%, and its processing speed was high enough for real-time applications.

Recognizing the face or facial features of the driver has been an important issue, since such studies have many applications. And yet, most of them have been based on conventional computer vision techniques. Recent progress in deep learning research demonstrates its high potential in almost every area, including computer vision. In particular, when monitoring driver behavior, typically with a camera, the setup generates a huge amount of streaming data, and researchers often tend to select some portion of the data rather than using it all. However, such a huge volume of data is necessary, or even essential, in training a deep learning system, as shown in the present study.
Of course, we need a fast machine to carry out this kind of training. A GPU or multi-GPU system, at the moment, provides an ideal platform to crunch such streaming data.

Fig. 9. Confusion matrix for the 9 gaze zone categories.
REFERENCES

[1] World Health Organization. Global status report on road safety: time for action. 2009.
[2] Lee, S. J., Jo, J., Jung, H. G., Park, K. R., & Kim, J. Real-time gaze estimator based on driver's head orientation for forward collision warning system. IEEE Transactions on Intelligent Transportation Systems, 2011, 12(1), 254-267.
[3] Fu, X., Guan, X., Peli, E., Liu, H., & Luo, G. Automatic calibration method for driver's head orientation in natural driving environment. IEEE Transactions on Intelligent Transportation Systems, 2013, 14(1), 303-312.
[4] Tawari, A., Chen, K. H., & Trivedi, M. M. Where is the driver looking: Analysis of head, eye and iris for robust gaze zone estimation. In IEEE Conference on Intelligent Transportation Systems, 2014, 988-994.
[5] Nuevo, J., Bergasa, L. M., & Jiménez, P. RSMAT: Robust simultaneous modeling and tracking. Pattern Recognition Letters, 2010, 2455-2463.
[6] Bolme, D. S., Beveridge, J. R., Draper, B., & Lui, Y. M. Visual object tracking using adaptive correlation filters. In IEEE Conference on Computer Vision and Pattern Recognition, 2010, 2544-2550.
[7] Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012, 1097-1105.
[8] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., & Bengio, Y. Theano: new features and speed improvements. arXiv:1211.5590, 2012.
[9] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., & Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, 2014, 675-678.
[10] NVidia DIGITS. https://developer.nvidia.com/digits