
Real-time categorization of driver's gaze zone using the deep learning techniques

In-Ho Choi (Dept. of Computer Eng., Sejong University, Kwangjin-Gu, Seoul, Korea, [email protected])
Sung Kyung Hong (Dept. of Aerospace Eng., Sejong University, Kwangjin-Gu, Seoul, Korea, [email protected])
Yong-Guk Kim* (Dept. of Computer Eng., Sejong University, Kwangjin-Gu, Seoul, Korea, *[email protected])

Abstract—This paper presents a study in which the driver's gaze zone is categorized using new deep learning techniques. Since the sequence of gaze zones of a driver reflects precisely what he is doing and how he behaves, it allows us to infer his drowsiness, focus, or distraction by analyzing the images coming from a camera. A Haar feature based face detector is combined with a correlation filter based MOSSE tracker for the face detection task, to handle the tough visual environment in the car. The driving database is a big-data collection that was constructed using a recording setup within a compact sedan driven around an urban area. The gaze zones consist of 9 categories, depending on where the driver is looking during driving. A convolutional neural network is trained to categorize the driver's gaze zone from a given face-detected image using a multi-GPU platform, and its network parameters are then transferred to a GPU within a PC running Windows so that it operates on a real-time basis. The performance is evaluated by identifying each gaze zone from the others. Results suggest that the correct rate of gaze zone categorization reaches 95% on average, indicating that our system outperforms state-of-the-art gaze zone categorization methods based on conventional computer vision techniques.

Keywords—Real-time system; Driver's gaze zone; Deep learning; Convolutional neural network; Driver distraction and fatigue

I. INTRODUCTION

According to a report of the World Health Organization (WHO), traffic accidents are the 8th leading cause of mortality worldwide, with nearly 1.3 million people dying each year as a result of road traffic crashes [1]. Unlike cancers and other diseases, traffic-related deaths have external causes. Moreover, due to its continuous upward trend, traffic-related mortality is projected to increase to 1.9 million people annually by 2020, becoming the 5th leading cause of mortality. The main causes of traffic accidents are known to be speed, distraction, and drowsiness. Distracted or drowsy driving is particularly dangerous because it can easily result in high-impact accidents and secondary accidents.

For those reasons, car companies have been putting extensive research effort into developing a variety of driving assistance systems. Most of the studies focus on the driver's eyes, because the eyes reveal the driver's condition immediately and intuitively, and this can be used in detecting distraction and drowsiness. One of the favored methods for detecting drowsiness is to measure the eye blink frequency, partly because it is a non-contact method. Brainwave measurement can also be used for drowsiness detection, but it is difficult to commercialize due to the inconvenience of attaching sensors to the driver's head. For distraction detection, the driver's gaze is most commonly used: the ratio of time the driver's gaze deviates from the concentrated front gaze is measured during driving.

In 2011, Lee et al. [2] developed a gaze zone detection system based on head pose estimation. To check the system's performance, they established a driver database using an infrared camera. The adaptive template matching method was adopted to detect the face, and the head pose was then estimated by measuring yaw and pitch values, from which 18 gaze zones were categorized. The detection accuracy of this system reached 47% in the strictly correct estimation rate (SCER) and 86% in the loosely correct estimation rate (LCER), respectively. Similarly, Fu et al. [3] categorized 12 gaze zones using yaw and pitch, demonstrating 99% and 92% mean detection rates, respectively. However, because these two methods rely on head pose, they cannot detect gaze shifts that occur without moving the head.

In 2014, Tawari et al. [4] proposed a system involving 10 gaze zones, including an unknown zone and a blink category. Two cameras were used in making the driving database. The system tracks 44 landmarks within the driver's face while estimating the head pose using the Pose from Orthography and Scaling (POSIT) algorithm, followed by detection of the iris center using the HOG descriptor and of the iris direction by comparing it with the facial landmarks adjacent to the eyes. The gaze zones were then determined by merging the iris and head pose detection results. In performance testing, a mean detection accuracy of 93% was achieved, marking an increase of 20% over detection relying on head pose alone.

To handle these problems, we propose a gaze zone detection system based on face tracking using a MOSSE tracker with a single camera and a convolutional neural network (CNN) that categorizes the gaze zones on a real-time basis.

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2006969) and by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the Global IT Talent support program (IITP-2015-R0134-15-1032) supervised by the IITP (Institute for Information and Communication Technology Promotion).
II. DRIVER DATABASE AND FACE TRACKING

A. Driving Database

A driving database is essential for most driver-related research. However, most of the existing driver databases were made by car companies and are not freely available for academic research. We surveyed several driving datasets that could potentially fit our study. One example is the RobeSafe Driver Monitoring Video dataset (RS-DMV) [5], which offers high-resolution (1390x480 pixels) video at 30 fps. However, we found that this dataset does not fit our purpose. First, the face area is often partially cut off during driving, partly because of the camera position, so the face detector cannot operate and the head pose estimation process cannot proceed. Secondly, since head pose ground-truth values are absent, it is not possible to evaluate system performance. We therefore decided that it would be better to build our own database, as other researchers often do.

The driver's front scenes were captured with a CCD camera, and his head movements were recorded using a 3D gyro sensor connected to a laptop computer, as shown in Fig. 1(B). We use a Ueye camera with a 6mm lens, which allows us to acquire high-resolution face images, as shown in Fig. 2(B). Care is taken to aim the camera directly at the driver's face when it is installed on the front window of the car, as shown in Fig. 1(C), so that we are able to capture the driver's full face even when the driver moves his head in any direction and/or handles the wheel.

It was found during our preliminary study that a conventional gyro sensor often interacts with the surrounding electromagnetic field, introducing unwanted noise into the main signal. Since an airplane or a drone confronts a similar situation, high-fidelity gyro sensors have been developed to overcome this problem. One of them is the Attitude and Heading Reference System (AHRS), based on a Micro-ElectroMechanical System (MEMS) consisting of a 3-axis accelerometer, a 3-axis electromagnetic sensor, and a 3-axis gyro sensor on a DSP board with a TMS chip on which a few sensing algorithms are installed, as shown in Fig. 2(A).

Before the test driving session, each subject driver was instructed to drive naturally and yet to move his head and eyes sporadically to look at 8 gaze zones in the car. Fig. 1(A) depicts the 8 gaze zones: (1) front view, (2) rear-view mirror, (3) right-front view, (4) left mirror, (5) wheel handle, (6) audio, (7) gear box, (8) right mirror. Adding blink to these gaze zones, a total of 9 gaze zones were used. Blink was counted as a gaze zone because, if blink were excluded from the detection data, missing sections would occur in the performance test, given that a driver normally blinks 5 times per minute. In addition, we considered bright and dim illumination for the database, and driving scenes with the driver facing the sun or passing through a tunnel were filmed. Fig. 1(B) and (C) show the bright and dark driver's-seat lighting conditions, with the driver facing the sun and with his back to the sun, respectively. We included such illumination changes in the database because drivers encounter different lighting conditions as the driving direction and the time of day change, and because illumination poses problems for most vision systems.

Considering experience-dependent driving behaviors, care was taken to balance the drivers from very experienced to beginner level. Whereas experienced drivers usually move the head in synchrony with gaze changes, it was found that beginners tend to change only their gaze without moving the head very much.

Fig. 1. The gaze zones defined in the present study (A) and illumination change in the driver's seat: bright illumination (B) and dim illumination (C).

Fig. 2. Apparatus used to acquire driver video images and head pose data during the test driving: (A) an AHRS 3D gyro sensor, (B) Ueye camera with 6mm lens, (C) the driver head-set on which the gyro sensor is mounted, (D) a subject wearing the head-set.
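As a small aid for the code sketches that appear in later sections, the 9 zone indices and their labels (matching the numbering of Fig. 3 below) can be kept in a lookup table. This is illustrative glue code only, not part of the authors' system:

```python
# Gaze zone indices and labels, numbered as in Fig. 3 of this paper.
GAZE_ZONES = {
    0: "blink",
    1: "driver's front",
    2: "rear-view mirror",
    3: "passenger's front",
    4: "left mirror",
    5: "cluster and steering handle",
    6: "car navigation and audio system",
    7: "gearbox",
    8: "right mirror",
}
```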
Fig. 3. Example images of the 9 gaze zones included in the driving dataset. 0: blink, 1: driver's front, 2: rear-view mirror, 3: passenger's front, 4: left mirror, 5: cluster and steering handle, 6: car navigation and audio system, 7: gearbox, 8: right mirror.

B. Face Tracking

A high-performance machine learning algorithm such as a convolutional neural network (CNN) requires accurate face tracking and face detection as prerequisites for good performance. However, face detection based on Haar features, the most commonly used object detection method at present, cannot be applied efficiently to the driver's face because of frequent changes in head direction and lighting conditions. Therefore, we used the Minimizing the Output Sum of Squared Error (MOSSE) tracker proposed by Bolme et al. [6]. Among many trackers, the MOSSE tracker is known to be the most stable with respect to rotation, scale, and occlusion as well as illumination changes. The operating principle of the MOSSE tracker is to find the correlation in the Fourier domain. Eq. (1) represents the MOSSE filter:

G = F ⊙ H*    (1)

Here the input image is given as F, the filter as H, and the 2D Gaussian map as G. F is acquired by applying a cosine window to the input image and taking its 2D Fast Fourier Transform (FFT). ⊙ refers to element-wise multiplication, whereas * refers to the complex conjugate of the given matrix. When tracking a moving object such as a driver's head, drift and convergence problems are inevitable due to rapid head position changes. We addressed these problems by combining the MOSSE tracker with a face detector: MOSSE filters are constantly generated while the face is being detected, and a previously generated MOSSE filter takes over the tracking in case of a face detection failure. In other words, face detection is used for generating the initial MOSSE tracker template and for its recovery, and the MOSSE tracker keeps tracking the face whenever face detection fails.

When the driver is looking at the front, both the Haar feature face detector and the MOSSE tracker detect the face correctly, as shown in the left column of Fig. 4. However, the face detector fails as the head pose increases, as shown in the middle column of the same figure. As the head pose increases further, or the face is partially occluded by a hand, even the MOSSE tracker is not able to track the face correctly, as shown in the right column of Fig. 4.

Fig. 4. Performance comparison between the face detector, face tracking, and the combination of the two systems for strong head pose cases.
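The detector-tracker hand-off described above can be sketched as follows. This is a minimal illustration rather than the authors' C++/MFC implementation: it assumes OpenCV's stock frontal-face Haar cascade and the MOSSE tracker shipped with the opencv-contrib package (exposed as cv2.legacy.TrackerMOSSE_create in recent 4.x builds), and the camera index is a placeholder.

```python
import cv2

# Haar-feature face detector; OpenCV ships this cascade with its data package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

tracker = None               # MOSSE tracker, rebuilt on every successful detection
cap = cv2.VideoCapture(0)    # placeholder frame source

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)

    if len(faces) > 0:
        # Detection succeeded: refresh the MOSSE template from the detected box
        # (the paper's "detection generates the initial template and its recovery").
        x, y, w, h = (int(v) for v in faces[0])
        tracker = cv2.legacy.TrackerMOSSE_create()  # needs opencv-contrib-python
        tracker.init(frame, (x, y, w, h))
        box = (x, y, w, h)
    elif tracker is not None:
        # Detection failed (e.g., strong head pose): the last MOSSE filter, which
        # correlates G = F (*) H* in the Fourier domain (Eq. 1), takes over.
        found, box = tracker.update(frame)
        if not found:
            box = None       # both stages lost the face (e.g., occlusion by a hand)
    else:
        box = None

    if box is not None:
        x, y, w, h = (int(v) for v in box)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("face", frame)
    if cv2.waitKey(1) == 27:  # Esc quits
        break
```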
III. CONVOLUTIONAL NEURAL NETWORK

A. Dataset

The performance of a learning algorithm such as a CNN is affected by the amount of training data: a sufficient number of samples is necessary to achieve the desired target, and too few samples lower the network performance. Table 1 presents the dataset of gaze zone images generated by the gaze zone categorization described above. It shows that the driver's front zone occupies the greatest proportion (47%), followed by the left and right mirrors and blink. The total dataset comprised 35,900 images.

We defined the blink zone as the state of completely closed eyes. Gaze changes accompanied by head pose changes can be easily discerned, because head movement compensates when the gaze angle goes beyond the range reachable by the eyes alone. In many situations, however, the iris direction remains frontal while the head pose changes. Therefore, we defined the gaze zones according to the iris direction, without regard to the head direction. Fig. 3 shows image examples with their respective gaze zones. Gaze zones 2-6, adjacent to gaze zone 1, can be clearly determined, whereas the gaze zones outside this range cannot be determined unless accompanying head pose changes occur.

B. Network structure

Next to the dataset, the second most important component of a CNN is the network structure. We organized our network building on AlexNet, created by Hinton's research team at the University of Toronto [7], which is based on 5 convolution layers and 3 hidden layers. Given that this network is usually used for assessing a large number of categories, such as the ILSVRC-2010 DB, it was unnecessarily large and heavy for detecting only 9 gaze zones. Therefore, we adjusted it to a structure best suited for real-time processing. Fig. 5 illustrates the CNN architecture used in our study.

Fig. 5. Structure of our convolutional neural network.

Fig. 6. System flowchart: each frame is passed to the face detector, with the MOSSE tracker taking over when detection fails; if a face is found, it is cropped and resized and fed to the convolutional neural network, which classifies the gaze zone; otherwise the system moves on to the next frame.

TABLE I. NUMBER OF IMAGES PER ZONE INCLUDED IN THE DATASET (UNIT: 100)

Gaze zone   0    1    2   3    4    5   6   7   8
Images      38   169  9   30   59   8   4   5   34
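Since Fig. 5 gives the architecture only pictorially, the following sketch illustrates a comparable slimmed-down network: 3 convolution layers instead of AlexNet's 5, smaller fully connected layers, a 227x227 input, and a 9-way output. It is written in PyTorch for compactness (the authors' model was defined in Caffe), and the channel and layer widths are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class GazeZoneNet(nn.Module):
    """AlexNet-like CNN cut down to 3 conv layers for 9 gaze zones.

    Layer widths are illustrative; the paper specifies only the overall
    shape (5 -> 3 conv layers, smaller hidden layers, 227x227 input).
    """
    def __init__(self, num_zones: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 48, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(48, 128, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(128, 192, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(192 * 6 * 6, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_zones),   # one logit per gaze zone
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Sanity check on the paper's input size: 227x227 crops from 256x256 frames.
logits = GazeZoneNet()(torch.randn(1, 3, 227, 227))
assert logits.shape == (1, 9)
```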
The input image size was set to 227x227, cropped from 256x256. The number of convolution layers was reduced from 5 to 3, and the sizes of the hidden layers were reduced similarly. The pooling between the layers was taken from the original network, and the probability of each gaze zone was derived from the final output layer using the Rectified Linear Unit (ReLU). As such, we created a high-precision CNN suitable for real-time processing. Learning was administered by splitting the data at the ratio of training : validation : test = 7 : 1.5 : 1.5.

C. Deep learning platform

Many deep learning frameworks are available, such as Theano, Caffe, and Torch [8][9]. They were created for different purposes and target performances, and they use different computer languages: Theano and Caffe are based on Python and C++, respectively. We constructed our system using the C++-based Caffe as an MFC-based system running on Windows. However, as Caffe on a single GPU requires a long training time, it would delay the system development cycle. Therefore, we used the NVidia DIGITS deep learning workstation as the training platform [10]. Table 2 lists the specifications of the workstation used for our training and of the PC used for the testing system. Note that the number of CUDA cores in this workstation is 3072 x 4 = 12,288.

The time required for 500 training epochs with the 35,900 categorized images was 5 hours on the workstation and 14 hours on the PC. One of the reasons for this difference in training time is the improved speed of the latest version of Caffe, which allows us to run 4 GPUs concurrently.

TABLE II. SPECIFICATION COMPARISON BETWEEN THE TRAINING WORKSTATION AND THE TESTING PC

             Training system      Testing system
OS           Ubuntu               Windows
CPU          i7-5930              i7-3770
GPU          NVidia Titan X x 4   NVidia GTX 980
CUDA cores   3072 x 4             2048

IV. EXPERIMENT

A. Gaze zone detection system

Fig. 6 shows the flowchart of the entire system. Face detection is performed by combining face detection and tracking on a real-time basis, followed by gaze zone detection using the CNN. For input and output image processing, the OpenCV library and C++ tools are used, and the Caffe library has been used for the deep learning network. For the experiment, we used the video datasets of 4 different drivers, totaling 7,200 frames. The evaluation result is displayed as a confusion matrix comparing the actual gaze zones (ground truth) with the gaze zones estimated by the present system.
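The crop-resize-classify step of Fig. 6 reduces to a few lines once a face box is available from the detector/tracker stage. A hedged sketch, reusing the illustrative network class above; the normalization is an assumption, since the paper does not state its exact preprocessing:

```python
import cv2
import torch

def classify_gaze_zone(model, frame, box):
    """Crop the detected face, resize to the CNN input, return (zone, prob).

    Mirrors the Fig. 6 flow: face crop & resize -> CNN -> classify.
    `box` is the (x, y, w, h) rectangle from the detector/tracker stage.
    """
    x, y, w, h = box
    face = cv2.resize(frame[y:y + h, x:x + w], (227, 227))  # CNN input size
    # HWC uint8 -> NCHW float in [0, 1]; the paper's normalization is assumed.
    tensor = torch.from_numpy(face).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        probs = torch.softmax(model(tensor), dim=1)[0]      # 9 zone probabilities
    zone = int(probs.argmax())
    return zone, float(probs[zone])

# e.g.: zone, p = classify_gaze_zone(GazeZoneNet(), frame, (x, y, w, h))
```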
Fig. 7. Examples of the results yielded by the CNN-based gaze zone detection system proposed in this study. The system display shows three parts: an image display part, a CNN output value display part, and a gaze zone display part.

B. Result

This section presents the results of detecting the gaze zones over the 7,200 test frames. Fig. 9 shows the confusion matrix, indicating that the mean detection rate of the proposed system reaches 95%. Note that gaze zone 2 (rear-view mirror) has the highest error rate. This is because it differs from gaze zone 3 (the passenger's front) mainly in iris direction, without head movement, and this difference is not clearly noticeable in persons with small eyes. Except for gaze zone 2, however, all gaze zones exhibit high detection rates, because a gaze change is typically accompanied by a head pose change as well. Fig. 8 shows how the CNN converges to the target performance as training epochs go by.

The time required for identifying the gaze zone using the CNN was about 25-28 ms. The face tracking part was configured to use the OpenCV library with GPU CUDA support, and the MOSSE tracker was set to use 4 cores for pixel-by-pixel calculation using Open Multi-Processing (OpenMP) to reduce the processing time. The overall system speed is ~20 frames/s, which allows real-time operation. An MFC UI was used for monitoring the system status, as shown in Fig. 7; it consists of three parts: a driver face display, a CNN output value display, and the gaze zone detection result displayed as a number at the bottom.

Fig. 8. Loss and accuracy of the network as deep learning training epochs progress.

V. CONCLUSION

First, we built a driving database under natural driving conditions. Secondly, a Haar feature face detector and a MOSSE tracker were combined to detect the face reliably in the visually tough environment of the car. Finally, a CNN model was developed for the driver's gaze zone categorization. The proposed system achieved a mean detection accuracy of over 95%, and its processing speed was high enough for real-time applications.

Recognizing the face or facial features of the driver has been an important issue, since such studies have many applications. And yet, most of them have been based on conventional computer vision techniques. Recent progress in deep learning research demonstrates its high potential in almost every area, including computer vision. In particular, when monitoring driver behavior with a camera, the setup generates huge streaming data, and researchers often tend to select some portion of the data rather than use it all. However, such huge data is necessary, or even essential, in training a deep learning system, as shown in the present study.
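The confusion matrix of Fig. 9 is simple bookkeeping over the 7,200 labeled frames. A minimal sketch of that bookkeeping follows; the paper does not spell out its averaging, so treating the mean detection rate as the average of the per-zone rates (the diagonal of the row-normalized matrix) is an assumption here:

```python
import numpy as np

def confusion_and_mean_rate(truth, pred, num_zones=9):
    """Confusion matrix (rows = ground truth) and mean per-zone detection rate."""
    cm = np.zeros((num_zones, num_zones), dtype=int)
    for t, p in zip(truth, pred):
        cm[t, p] += 1
    rates = cm.diagonal() / cm.sum(axis=1).clip(min=1)  # row-normalized diagonal
    return cm, rates.mean()

# e.g.: cm, mean_rate = confusion_and_mean_rate(gt_zones, cnn_zones)  # 7,200 frames
```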
Fig. 9. Confusion matrix for the 9 gaze zone categories.

Of course, we need a fast machine to carry out this kind of training. A GPU or multi-GPU system, at the moment, provides an ideal platform to crunch such streaming data.

REFERENCES

[1] World Health Organization. Global status report on road safety: time for action. 2009.
[2] Lee, S. J., Jo, J., Jung, H. G., Park, K. R., & Kim, J. Real-time gaze estimator based on driver's head orientation for forward collision warning system. Intelligent Transportation Systems, 2011, 12(1), 254-267.
[3] Fu, X., Guan, X., Peli, E., Liu, H., & Luo, G. Automatic calibration method for driver's head orientation in natural driving environment. Intelligent Transportation Systems, 2013, 14(1), 303-312.
[4] Tawari, A., Chen, K. H., & Trivedi, M. M. Where is the driver looking: Analysis of head, eye and iris for robust gaze zone estimation. In Intelligent Transportation Systems, 2014, 988-994.
[5] Nuevo, J., Bergasa, L. M., & Jiménez, P. RSMAT: Robust simultaneous modeling and tracking. Pattern Recognition Letters, 2010, 2455-2463.
[6] Bolme, D. S., Beveridge, J. R., Draper, B., & Lui, Y. M. Visual object tracking using adaptive correlation filters. In Computer Vision and Pattern Recognition, 2010, 2544-2550.
[7] Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012, 1097-1105.
[8] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., & Bengio, Y. Theano: new features and speed improvements. 2012, arXiv:1211.5590.
[9] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., & Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, 2014, 675-678.
[10] NVidia DIGITS. https://developer.nvidia.com/digits
