Helmet Use Detection of Tracked Motorcycles Using CNN-Based Multi-Task Learning
ABSTRACT Automated detection of motorcycle helmet use through video surveillance can facilitate
efficient education and enforcement campaigns that increase road safety. However, existing detection
approaches have a number of shortcomings, such as the inability to track individual motorcycles across
multiple frames or to distinguish the helmet use of drivers from that of passengers. Furthermore, datasets used to
develop approaches are limited in terms of traffic environments and traffic density variations. In this
paper, we propose a CNN-based multi-task learning (MTL) method for identifying and tracking individual
motorcycles and registering rider-specific helmet use. We further release the HELMET dataset, which includes
91,000 annotated frames of 10,006 individual motorcycles from 12 observation sites in Myanmar. Along
with the dataset, we introduce an evaluation metric for helmet use and rider detection accuracy, which can be
used as a benchmark for evaluating future detection approaches. We show that the use of MTL for concurrent
visual similarity learning and helmet use classification improves the efficiency of our approach compared
to earlier studies, allowing a processing speed of more than 8 FPS on consumer hardware, and a weighted
average F-measure of 67.3% for detecting the number of riders and helmet use of tracked motorcycles. Our
work demonstrates the capability of deep learning as a highly accurate and resource-efficient approach to
collect critical road safety related data.
INDEX TERMS Deep learning, traffic surveillance, motorcycle safety, helmet use detection, tracking.
I. INTRODUCTION
Nowadays, drivers' adherence to traffic laws is mainly monitored and enforced by traffic police officers through direct observation. Yet implementations of road surveillance infrastructure are increasingly being used to automatically identify safety related behaviors through traffic video analysis. Approaches have been developed to register relatively simple variables, such as traffic flow and density [1], [2], speed [3]–[5], traffic light violations [6], or collisions [7]. More recently, computer vision has been used to register more complex road user behaviors, such as driver mobile phone use [8] and unauthorized use of car-pooling lanes [9]. Since for many developing countries the main form of motorized transport consists of motorcycles, the detection of motorcycle helmet use of riders through machine learning has also been explored [10], [11]. The availability of exact and concurrent data about motorcycle helmet use on the street is crucial to injury prevention, as it can be used for targeted enforcement and effective education campaigns.

The registration of helmet use through human observers naturally consists of four basic elements that any automated detection method must also possess to produce comparably detailed helmet use estimates. (1) Detection: Initially, active motorcycles need to be detected. (2) Tracking: Individual motorcycles need to be tracked through the road environment, to ensure that each motorcycle is only registered once, regardless of how long it is observed. (3) Rider differentiation: For an accurate calculation of motorcycle helmet use and to produce position-specific helmet use data, rider numbers and positions (i.e. distinguishing the driver and passenger(s)) per motorcycle need to be registered. (4) Site-diversity:
Helmet use numbers need to be accurately registered, independent of the road environment at an observation site. Hence, automated approaches need to show accuracy for more than one road environment. While these four basic elements of motorcycle helmet use observation come naturally to human observers, existing automated detection approaches either do not include all four elements or have low performance on some of them (see Section II). In particular, the lack of rider differentiation is a crucial element for the application of automated helmet use detection in the field. Researchers repeatedly find evidence of an influence of rider position and rider number on helmet use on individual motorcycles [12]–[15]. Hence, the differentiation of rider helmet use for drivers and passengers is a crucial metric that should not be omitted in automated detection approaches. The lack of broad applicability and robustness prevents the substitution of human observers through automated approaches in helmet use observation.

Hence, we present a deep learning based automatic detection approach that contains all four basic elements of human-observer helmet use registration, i.e. detection, tracking, rider differentiation, and site-diversity. The proposed work builds on and extends a previous approach for frame-based helmet use detection [10], which did not include tracking of motorcycles and in which the dataset was not made public. To encourage the development of diverse detection approaches, we make this dataset available with the publication of this article. In addition, we further propose a benchmark metric for the assessment of automated detection approaches.

In summary, our main contributions are twofold:
• We propose a comprehensive CNN-based approach for helmet use detection of tracked motorcycles, containing all basic elements utilized by human observers. A multi-task learning (MTL) framework is developed for both visual similarity learning and patch-based helmet use classification, which increases computational efficiency as well as detection accuracy. The source code and pre-trained model are available in [16].
• We publish a diverse, large-scale, annotated dataset for motorcycle detection, called HELMET. It contains 10,006 annotated motorcycles in 910 video clips, recorded throughout the country of Myanmar, covering 12 observation sites across 7 cities. To the best of our knowledge, it is the largest and most diverse motorcycle helmet use detection dataset. Based on the dataset, we propose a metric to evaluate the performance of helmet use detection algorithms, which takes account of both spatial and temporal detection. The dataset, together with the source code for performance evaluation, is available in [17].

II. RELATED WORK
To date, a number of approaches for the automated detection of motorcycle helmet use in recorded video data have been proposed [10], [11], [18]–[24], details of which can be found in Table 1. For the initial step of active motorcycle detection, approaches can be broadly categorized into conventional methods [18]–[21] and deep-learning-based methods [10], [11], [22]–[24].

For the detection of active motorcycles, most conventional methods follow similar procedures. First, a background subtraction method is used to extract moving objects/vehicles from the video data. After this, a binary classifier (e.g. a support vector machine (SVM)) is used to detect motorcycles. In another step, the head region of the motorcyclists is localized, and an additional classifier is used to distinguish helmet use from non-helmet use. To improve the performance of the binary classifier, hand-crafted features are used; a common one is to extract a histogram of oriented gradients (HOG) [25] from the detected head regions of riders. Such methods, however, do not work well when there are many motorcycles and/or there is more than one rider on a motorcycle. Instead of designing hand-crafted features, deep learning based methods strive to automatically develop representations from raw image data that are most suitable for the helmet use detection task. In [24], helmet use is classified in the detected head regions of riders using a convolutional neural network (CNN). In [22] and [11], two independent CNNs are trained: one is used to distinguish motorcycles from other vehicles, the other to classify helmet and non-helmet in the head region of riders. Since it is time-consuming to detect motorcycles and helmet use through two separate CNNs, [10] and [23] use one single CNN to detect motorcycles and helmet use simultaneously.

The tracking of individual motorcycles through single frames of a recorded video is only included in half of the existing approaches presented in Table 1. While video data recorded with traffic surveillance infrastructure is inherently frame-based, helmet use data produced through automatic detection must be projected onto individual motorcycles to allow a valid appraisal of helmet use. Hence, frame-based detection results for motorcycle and rider counts, as well as helmet use, must be remapped to individual motorcycles which appear in multiple frames. This can only be achieved by approaches that link frame-based detection to cross-frame tracking. This tracking is missing in some approaches (e.g. [11]). To compensate for this lack of tracking, it is necessary to either use single frame detection at a fixed point/line in the frame to prevent the repeated detection of the same motorcycle (e.g. [20]) or to collect helmet use data in every video frame without tracking, leading to the loss of information on the number of motorcycles registered at an observation site (e.g. [10]). Both of these shortcuts lead to a decrease in helmet use data quality and in addition prevent the use of multiple frames of an individual motorcycle for helmet use and rider detection.

For rider number and position detection, only one of the approaches listed in Table 1 generates detailed information on this [10]. And while other approaches (e.g. [20]) use head counts on the motorcycle as a substitute for rider numbers, this information is not mapped on rider positions
(i.e. driver vs. passenger). As the specific position and number of riders on a motorcycle directly relates to their helmet use [12]–[15], the lack of this critical information presents a clear barrier for the application of automated helmet use detection approaches in the field.

On the element of site-diversity, the existing datasets used to develop automated motorcycle helmet use detection approaches (Table 1) show a critical lack of diverse observation sites and a general lack of detailed information on the road environments used. Five of the datasets [19]–[21], [23], [24] only contain data from one recording site, prohibiting robust evaluation of the developed solutions in diverse traffic environments. The two datasets which contain more recording sites do not distinguish helmet use between motorcycle drivers and passengers [11], [22]. This lack of data diversity and level of annotation detail in existing datasets hinders the development of widely applicable detection solutions.

FIGURE 1. An illustration of the proposed approach for helmet use detection of tracked motorcycles.

… on them, on a single frame level. In the second step, each detected active motorcycle is tracked through adjacent frames, using both the motion state of the motorcycle as well as the visual similarity between detected active motorcycles. In the last step, when a track terminates, i.e. an individual motorcycle leaves the view of the video camera, the helmet use class of the tracked motorcycle is predicted, i.e. rider number, their position, and their helmet use are identified. All three steps are described in detail in the following sections.

A. MOTORCYCLE DETECTION
Detecting a motorcycle in a single frame is a classic object detection task. To this end, we trained a state-of-the-art object detection algorithm to detect motorcycles in the dataset. Today's prevalent algorithms for object detection can be subdivided into two broad approaches: one-stage and two-stage. While the two-stage algorithms have overall higher accuracies in object detection, they are comparably slower, as frames are processed twice, once for identifying potential object locations in a frame, and once more for detecting the actual objects. Single-stage methods combine the steps of localizing potential objects and object detection into a single processing stage, which results in a small decrease in accuracy, but a large decrease in the processing time. A relatively new single-stage method is RetinaNet [26], which uses a multi-scale feature pyramid combined with focal loss to successfully overcome detection accuracy limitations. RetinaNet achieves faster detection than two-stage methods, while having a higher detection accuracy than comparable single-stage methods such as YOLO [27]. We therefore applied a RetinaNet model for detecting motorcycles.

Since motorcycle detection is very similar to other object detection tasks, instead of training from scratch, we fine-tuned a RetinaNet model with pre-trained weights obtained on the COCO dataset [28].
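To make the detection step concrete, the sketch below runs a COCO-pre-trained RetinaNet on a single video frame and keeps only motorcycle detections. It is a minimal illustration only: the released model is Keras/TensorFlow based (see [16]), whereas this sketch uses torchvision's RetinaNet implementation, and the 0.5 confidence threshold is an assumption, not a value taken from the paper.

```python
# Minimal sketch: single-frame motorcycle detection with a COCO-pre-trained
# RetinaNet (torchvision). Illustrative only; the authors' released model is
# Keras/TensorFlow based, and the 0.5 score threshold is an assumption.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

COCO_MOTORCYCLE_LABEL = 4  # COCO category id for "motorcycle"

model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True)
model.eval()

def detect_motorcycles(frame_path, score_threshold=0.5):
    """Return [x1, y1, x2, y2] boxes of motorcycles detected in one frame."""
    image = to_tensor(Image.open(frame_path).convert("RGB"))
    with torch.no_grad():
        output = model([image])[0]  # dict with 'boxes', 'labels', 'scores'
    keep = (output["labels"] == COCO_MOTORCYCLE_LABEL) & \
           (output["scores"] >= score_threshold)
    return output["boxes"][keep].tolist()

# boxes = detect_motorcycles("frame_0001.jpg")
```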
TABLE 2. Notations for multiple motorcycle tracking.

Let s_t^(i) and P_t^(i) denote the state vector and the state covariance matrix of a Kalman filter [29]. Meanwhile, let M = {m_t^(1), . . . , m_t^(n)} be the newly arrived measurements at time t. Each measurement is denoted as m_t^(j) = (b_t^(j), x_t^(j)), where b_t^(j) is the predicted bounding box, and x_t^(j) is the cropped image patch from the predicted bounding box. Each image patch is re-scaled to 192 × 192. Furthermore, we normalize the bounding box by the frame width and height so that all the numbers fall between 0 and 1. Given a bounding box (l, u, w, h), its centroid z is computed as z = (l + w/2, u + h/2).

For an existing track v^(i), we first predict its new state ŝ_t^(i) and new state covariance P̂_t^(i) using the estimated state s_{t−1}^(i) and estimated state covariance P_{t−1}^(i) at time t − 1.

… where x^(n) denotes the N cropped image patches that are assigned to track v^(i), and φ(·; θ) corresponds to the feature vector learned by an InceptionV3 deep neural network model, to be defined later in Section III-D. Hence, we have a combined distance as the product of the motion distance and the visual dissimilarity:

    D_ij = D^M_ij · D^V_ij.    (4)

To sum up, Eq. (4) indicates that any track and measurement are similar only if they have similar visual appearances and similar motions.

Applying the Munkres assignment algorithm to the distance matrix D, any new measurement can either be assigned to an existing track or initiate a new track. If a measurement m_t^(j) is assigned to an existing track v^(i), the track is updated as:

    K = P̂_t^(i) H^T (H P̂_t^(i) H^T + R)^(−1),
    s_t^(i) = ŝ_t^(i) + K (z_t^(j) − H ŝ_t^(i)),
    P_t^(i) = (I − K H) P̂_t^(i),
    B^(i) = B^(i) ∪ m_t^(j),    (5)

where K is the Kalman gain; otherwise it initiates a new track v^(i+1), with track information updated by:

    s_t^(i+1) = (l_t^(j) + w_t^(j)/2, u_t^(j) + h_t^(j)/2, 0, 0)^T.
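The assignment step in Eqs. (4)–(5) can be sketched as follows: build the combined distance matrix from a motion term and the embedding distance, solve the assignment with the Munkres/Hungarian algorithm [30], and apply the standard Kalman update to matched tracks. This is a simplified illustration under assumed ingredients: a plain Euclidean centroid distance stands in for the motion distance D^M_ij, the feature arrays stand in for φ(·; θ), and the gating threshold MAX_DIST is invented for the example.

```python
# Sketch of track-measurement association (cf. Eqs. (4)-(5)).
# Assumptions: Euclidean centroid distance replaces the paper's motion distance,
# the feature arrays stand in for phi(.; theta), MAX_DIST is an invented gate.
import numpy as np
from scipy.optimize import linear_sum_assignment

MAX_DIST = 0.5  # hypothetical gate on the combined distance (not from the paper)

def associate(track_centroids, track_features, meas_centroids, meas_features):
    """Munkres assignment on D_ij = D^M_ij * D^V_ij (cf. Eq. (4)).

    track_centroids: (T, 2) predicted centroids of existing tracks
    track_features:  (T, d) embeddings of the tracks' last patches
    meas_centroids:  (M, 2) centroids of new detections
    meas_features:   (M, d) embeddings of the new detections' patches
    Returns matched (track, measurement) index pairs and unmatched measurements.
    """
    d_motion = np.linalg.norm(track_centroids[:, None] - meas_centroids[None], axis=-1)
    d_visual = np.linalg.norm(track_features[:, None] - meas_features[None], axis=-1)
    D = d_motion * d_visual                        # combined distance matrix
    rows, cols = linear_sum_assignment(D)          # Munkres / Hungarian algorithm [30]
    matched = [(i, j) for i, j in zip(rows, cols) if D[i, j] < MAX_DIST]
    unmatched = sorted(set(range(len(meas_centroids))) - {j for _, j in matched})
    return matched, unmatched                      # unmatched measurements start new tracks

def kalman_update(s_pred, P_pred, z, H, R):
    """Eq. (5): standard Kalman measurement update for one matched track."""
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)   # Kalman gain
    s = s_pred + K @ (z - H @ s_pred)
    P = (np.eye(len(s_pred)) - K @ H) @ P_pred
    return s, P
```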
When a track terminates, its helmet use class is predicted from all image patches within the track. More specifically, let (x^(n))_{n=1}^N be the cropped image patches that are assigned to a tracked motorcycle; then the track's helmet use class is estimated as:

    ŷ = arg max_{y∈{1,2,...,C}} (1/N) Σ_{n=1}^N g(x^(n); W),    (8)

where g(·; W) is a deep convolutional neural network (CNN), parameterized by W.

… 2) Compute the distance between the learned feature vectors of patch x^(a) and patch x^(b):

    d(x^(a), x^(b)) = ‖φ(x^(a); θ) − φ(x^(b); θ)‖_2,    (9)

3) Given φ(x^(b); θ), predict the helmet use class p^(b) = f(φ(x^(b); θ); w^(b)), with the softmax regression model f(·) parameterized by weight w^(b).

Using the MTL model, the helmet use classification model in Eq. (8) can be rewritten as g(·; W) = f(φ(·; θ); w), where the visual similarity learning task and the helmet use classification task share the weights θ in the training and predicting process, which not only significantly decreases the computational cost, but also improves generalization by using the domain information contained in the related tasks [34].

For the first and third tasks, we use the cross-entropy loss for optimization:

    L_1(x^(a), y^(a)) = − Σ_{i=1}^K y_i^(a) log(p_i^(a)),
    L_3(x^(b), y^(b)) = − Σ_{i=1}^K y_i^(b) log(p_i^(b)).    (10)

FIGURE 2. The proposed architecture for patch-based helmet use classification and visual similarity learning.
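As a concrete illustration of the shared-weight design, the sketch below builds a Keras model in which an ImageNet-pre-trained InceptionV3 body produces the feature φ(x; θ) feeding both an embedding output (used for the pairwise distance in Eq. (9)) and a softmax helmet-class output, and shows how track-level predictions can be averaged as in Eq. (8). It is a minimal sketch under assumptions: the pooled backbone output is used directly as the feature, a single softmax layer plays the role of f(·; w), NUM_CLASSES and the helper names are placeholders, and the pairwise loss for the similarity task (e.g. a contrastive formulation [35]) is not shown; none of this is taken verbatim from the released implementation [16].

```python
# Minimal sketch of the shared-backbone MTL model (cf. Fig. 2, Eqs. (8)-(9)).
# Assumptions: the pooled InceptionV3 output is used directly as phi(x; theta),
# a single softmax layer implements f(.; w), and NUM_CLASSES is a placeholder.
import numpy as np
import tensorflow as tf

NUM_CLASSES = 10  # placeholder for the number of helmet use classes C

def build_mtl_model():
    backbone = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet",
        input_shape=(192, 192, 3), pooling="avg")    # phi(.; theta), shared weights
    patch = tf.keras.Input(shape=(192, 192, 3))
    feature = backbone(patch)                        # phi(x; theta)
    probs = tf.keras.layers.Dense(
        NUM_CLASSES, activation="softmax")(feature)  # f(phi(x; theta); w)
    return tf.keras.Model(inputs=patch, outputs=[feature, probs])

def track_helmet_class(patch_probs):
    """Eq. (8): average the per-patch softmax outputs of a track, then argmax."""
    return int(np.argmax(np.mean(np.asarray(patch_probs), axis=0)))

# model = build_mtl_model()
# feats, probs = model.predict(patch_batch)              # patch_batch: (N, 192, 192, 3)
# visual_distance = np.linalg.norm(feats[0] - feats[1])  # Eq. (9) for one pair
# y_hat = track_helmet_class(probs)                      # track-level prediction
# (input preprocessing, the training loop, and the pairwise similarity loss are omitted)
```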
• Precision is the ratio of the number of correct E_Detect to the total number of E_Detect (correct and incorrect) in the i-th class.
• Recall is the ratio of the number of correct E_Detect to the number of correct E_Detect combined with missed E_GT in the i-th class.
• For the i-th class, the F-measure is the harmonic mean of precision and recall:

    F_i = 2 · (Precision × Recall) / (Precision + Recall)    (13)

• Since samples across all helmet use classes L_GT are imbalanced, we use a weighted aggregate F-measure in the dataset, defined as:

    F_weighted = 1 / ( Σ_{i=1}^C w_i · (1/F_i) )    (14)

where C is the number of helmet use classes, and the weight on the i-th class w_i is proportional to the number of samples in the i-th class; the w_i's sum to one.

TABLE 4. Segmentation of video clips in the dataset for performance evaluation; the training-validation-test split ratio is 70%, 10%, and 20%.

The Kalman filter noise covariance matrices were set to

    Q = diag(0.1, 0.25, 0.1, 0.25),  R = diag(0.05, 0.05).

To train and evaluate the MTL model, we first generated all pairs of image patches in each video clip. Next we randomly sampled 2,000,000, 100,000, and 200,000 pairs from the training, validation, and test sets respectively. In each subset, 50% of the image pairs come from the same tracks and 50% of the image pairs come from different tracks. In our work, the CNN body was initialized with the pre-trained weights on ImageNet [38] and all FC layers were initialized with random weights.

For both deep learning models, i.e. the motorcycle detection model and the MTL model, the Adam optimizer [39] was used with the default parameters β1 = 0.9, β2 = 0.999, and a custom learning rate α. In our experiments, we tried α = 10^−2, 10^−3, . . . , 10^−5 and chose the value that gives the best validation result. Considering the large size of the data, we only trained for 10 epochs with a batch size of 2 for the motorcycle detection model and 128 for the MTL model. In the training process, we saved the best model that gave the minimum loss on the validation set and reported its final performance on the test set.
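A small sketch of the weighted aggregate F-measure in Eqs. (13)–(14): per-class F-measures are combined using weights proportional to class frequency via the weighted harmonic mean. The precision/recall values and class counts in the usage example are made up for illustration.

```python
# Sketch of the evaluation aggregate in Eqs. (13)-(14).
# The per-class precision/recall values and class counts below are made up.
import numpy as np

def f_measure(precision, recall):
    """Eq. (13): harmonic mean of precision and recall for one class."""
    return 2.0 * precision * recall / (precision + recall)

def weighted_f_measure(per_class_f, class_counts):
    """Eq. (14): weighted harmonic mean of per-class F-measures.

    Weights w_i are proportional to the number of samples in class i and sum to one.
    """
    f = np.asarray(per_class_f, dtype=float)
    w = np.asarray(class_counts, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w / f)

# Example with invented numbers:
# per_class_f = [f_measure(0.9, 0.8), f_measure(0.7, 0.6), f_measure(0.5, 0.4)]
# print(weighted_f_measure(per_class_f, class_counts=[500, 120, 30]))
```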
FIGURE 4. Motorcycle detection results on sampled frames, where the top row corresponds to the ground truth human annotation (green bounding boxes), and the bottom row corresponds to the results (yellow bounding boxes) predicted by the fine-tuned RetinaNet.

FIGURE 5. Examples of the output distances of the MTL model between image pairs of the same track (vertical comparison) and different tracks (horizontal comparison), with smaller numbers showing higher visual similarity.

FIGURE 7. Helmet use detection of tracked motorcycles on sampled frames in three observation sites. Each column corresponds to the frames sampled (every 5th frame) from a video clip, where different bounding box colors correspond to different predicted classes. The predicted helmet use class and track id are labelled at the top of the bounding boxes.
… different schemes, motion-based or hybrid, for motorcycle tracking. For computational speed, it can be observed that the first two approaches achieve the highest number of processed frames per second (25.12 and 13.42 FPS), as they only use one network to simultaneously predict motorcycle and helmet use class. However, it can be observed that this high speed is achieved at the expense of detection accuracy, as the two approaches are prone to produce missing detections due to the imbalanced helmet use classes. This in turn decreases tracking performance. Comparing RetinaNet and YOLOv2, the advantage of the multi-scale feature pyramid with focal loss is apparent, as RetinaNet has a higher accuracy than YOLOv2, at the expense of processing speed. This difference is present whether motorcycle and helmet use are detected simultaneously or separately. Finally, combining motion similarity with visual similarity during tracking (our hybrid tracking approach) improves the F-measure value with little additional computational cost.

E. RESULTS AND ANALYSIS OF HELMET USE DETECTION
Detailed results of our proposed approach for helmet use detection of tracked motorcycles in each individual class are presented in Table 6. We achieved a 67.3% weighted F-measure on the test set of the HELMET dataset. Our approach works well on common classes of up to two riders per motorcycle. Considering only these common classes, the weighted F-measure improves to 70.6%. Fig. 7 shows detection results on some sampled frames. For more detailed results, we have attached video samples of our approach as supplementary files.

In addition, we present the location-wise performance in Fig. 8, which shows the weighted F-measure for each observation site in the HELMET dataset. Video clips from the test set for all sites can be found in the supplementary files of this article. It can be observed that the approach works well for most locations, with nine of the observation sites showing an F-measure of around 70% and above. However, comparatively low accuracy can be observed for Bago_urban and NyaungU_urban, with F-measures slightly below 60%, and Yangon_II with an F-measure of only 39.1%. Looking at the video clips in the test dataset (see supplementary files), it becomes apparent that the three sites with the lowest …

FIGURE 8. Performance evaluation in each observation site.

F. COMPUTATIONAL COST
Our approach was implemented using the Python Keras library with TensorFlow as a backend and ran on two NVIDIA Titan Xp GPUs. In our implementation, instead of keeping every cropped image patch in a track, we retain its visual feature and helmet use prediction output only, which reduces both computational space and time. The overall processing speed of our method is 8.32 FPS. More specifically, the computational time for motorcycle
detection is 0.059 seconds per frame; the computational time for visual feature extraction and patch-based helmet use classification is 0.058 seconds per frame; and the computational time for tracking is negligible, merely 0.003 seconds per frame.

VI. CONCLUSION
In this paper, we have proposed a deep learning based method to automatically perform three elements of human observer motorcycle helmet use registration, i.e. detection and tracking of active motorcycles, as well as identification of rider number per motorcycle, rider position, and rider specific helmet use. In addition, we have applied our approach to video data from diverse road environments, which included adverse factors such as occlusion, differences in camera angle, an imbalanced number of coded classes, as well as differing rider numbers per motorcycle and varying traffic densities. All of these elements make our approach more comprehensive than earlier approaches for the automated detection of motorcycle helmet use (see Table 1). Our results show a generally high accuracy of our approach. For the element of frame-based detection of motorcycles, we achieve an average precision of 95.3%. The visual similarity element of motorcycle tracking of our approach achieves 0.967 AUC, in this first application of CNN-based tracking of active motorcycles. For the element of detection of helmet use class, i.e. the registration of rider number, position, and rider specific helmet use, we achieve an accuracy of 80.6% on a frame based level. Especially the imbalanced number of classes in the HELMET dataset contributes to wrong classifications. For the comprehensive application of our approach, all its elements are combined, i.e. motorcycle detection, tracking, and helmet use class prediction are jointly applied. Our results show a weighted F-measure of 67.3% for the helmet use detection of tracked motorcycles, showing that our approach can be used to generate reliable motorcycle, rider number, and position specific helmet use estimates.

The results of our ablation study show that our approach achieves a comparatively high accuracy against ablation experiments. While this high accuracy comes at the expense of computational efficiency, our approach can process more than 8 FPS on consumer hardware, which is close to real-time speed for 10 FPS video data. Overall, our work shows that all four basic elements of helmet use registration through human observers can be implemented in a CNN-based approach that is computationally efficient on consumer hardware. Furthermore, the inclusion of detailed rider differentiation is an enhancement of existing approaches. In addition to presenting our helmet use detection approach, we publish the HELMET dataset with this paper, which includes diverse traffic video data that can be used to train and evaluate similar approaches. Since existing datasets have a number of shortcomings and are not readily available to researchers, we hope that the publication of the HELMET dataset will advance the development and evaluation of detection approaches similar to the one in this paper.

There are some limitations to our work. The detection accuracy can be much compromised when dealing with uncommon traffic environments, or street scenes with parked motorcycles. Hence, the current approach is partly constrained in real-world applicability, as observation site specific elements could decrease detection accuracy. And while the HELMET dataset is a first step towards using more diverse datasets for the development of automated helmet detection approaches, further data needs to be collected to make approaches universally applicable.

For future research, we intend to enhance the HELMET dataset by incorporating scenes with more diverse traffic infrastructure, e.g. crossroads, to ensure more robust application of the approach. Also, more training data that contains parked motorcycles will be acquired and used for training, so that these objects will not be detected as false positives.

REFERENCES
[1] B. Coifman, D. Beymer, P. McLauchlan, and J. Malik, "A real-time computer vision system for vehicle tracking and traffic surveillance," Transp. Res. C, Emerg. Technol., vol. 6, no. 4, pp. 271–288, Aug. 1998.
[2] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, "Traffic flow prediction with big data: A deep learning approach," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 2, pp. 865–873, Apr. 2015.
[3] S. Gupte, O. Masoud, R. F. K. Martin, and N. P. Papanikolopoulos, "Detection and classification of vehicles," IEEE Trans. Intell. Transp. Syst., vol. 3, no. 1, pp. 37–47, Mar. 2002.
[4] J. Wu, Z. Liu, J. Li, C. Gu, M. Si, and F. Tan, "An algorithm for automatic vehicle speed detection using video camera," in Proc. 4th Int. Conf. Comput. Sci. Edu., Jul. 2009, pp. 193–196.
[5] T. N. Schoepflin and D. J. Dailey, "Dynamic camera calibration of roadside traffic management cameras for vehicle speed estimation," IEEE Trans. Intell. Transp. Syst., vol. 4, no. 2, pp. 90–98, Jun. 2003.
[6] D.-W. Lim, S.-H. Choi, and J.-S. Jun, "Automated detection of all kinds of violations at a street intersection using real time individual vehicle tracking," in Proc. 5th IEEE Southwest Symp. Image Anal. Interpretation, Apr. 2002, pp. 126–129.
[7] H. Veeraraghavan, O. Masoud, and N. P. Papanikolopoulos, "Computer vision algorithms for intersection monitoring," IEEE Trans. Intell. Transp. Syst., vol. 4, no. 2, pp. 78–89, Jun. 2003.
[8] Y. Artan, O. Bulan, R. P. Loce, and P. Paul, "Driver cell phone usage detection from HOV/HOT NIR images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2014, pp. 225–230.
[9] Y. Artan, O. Bulan, R. P. Loce, and P. Paul, "Passenger compartment violation detection in HOV/HOT lanes," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 2, pp. 395–405, Feb. 2016.
[10] F. W. Siebert and H. Lin, "Detecting motorcycle helmet use with deep learning," Accident Anal. Prevention, vol. 134, Jan. 2020, Art. no. 105319.
[11] B. Yogameena, K. Menaka, and S. Saravana Perumaal, "Deep learning-based helmet wear analysis of a motorcycle rider for intelligent surveillance system," IET Intell. Transp. Syst., vol. 13, no. 7, pp. 1190–1198, Jul. 2019.
[12] F. W. Siebert, D. Albers, U. Aung Naing, P. Perego, and S. Chamaiparn, "Patterns of motorcycle helmet use–A naturalistic observation study in Myanmar," Accident Anal. Prevention, vol. 124, pp. 146–150, May 2019.
[13] R. D. Ledesma, S. S. López, J. Tosi, and F. M. Poó, "Motorcycle helmet use in Mar del Plata, Argentina: Prevalence and associated factors," Int. J. Injury Control Saf. Promotion, vol. 22, no. 2, pp. 172–176, Apr. 2015.
[14] D. V. Hung, M. R. Stevenson, and R. Q. Ivers, "Prevalence of helmet use among motorcycle riders in Vietnam," Injury Prevention, vol. 12, no. 6, pp. 409–413, Dec. 2006.
[15] A. M. Bachani, N. T. Tran, S. Sann, M. F. Ballesteros, C. Gnim, A. Ou, P. Sem, X. Nie, and A. A. Hyder, "Helmet use among motorcyclists in Cambodia: A survey of use, knowledge, attitudes, and practices," Traffic Injury Prevention, vol. 13, no. sup1, pp. 31–36, Mar. 2012.
[16] H. Lin. (2020). Helmet Use Detection Source Code. [Online]. Available: https://github.com/LinHanhe/Helmet_use_detection
[17] H. Lin and F. W. Siebert. (2020). The HELMET Dataset. [Online]. Available: https://osf.io/4pwj8/
[18] J. Chiverton, "Helmet presence classification with motorcycle detection and tracking," IET Intell. Transport Syst., vol. 6, no. 3, pp. 259–269, 2012.
[19] R. Silva, K. Aires, T. Santos, K. Abdala, R. Veras, and A. Soares, "Automatic detection of motorcyclists without helmet," in Proc. XXXIX Latin Amer. Comput. Conf. (CLEI), Oct. 2013, pp. 1–7.
[20] R. Waranusast, N. Bundon, V. Timtong, C. Tangnoi, and P. Pattanathaburt, "Machine vision techniques for motorcycle safety helmet detection," in Proc. 28th Int. Conf. Image Vis. Comput. New Zealand (IVCNZ), Nov. 2013, pp. 35–40.
[21] K. Dahiya, D. Singh, and C. K. Mohan, "Automatic detection of bike-riders without helmet using surveillance videos in real-time," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 3046–3051.
[22] C. Vishnu, D. Singh, C. K. Mohan, and S. Babu, "Detection of motorcyclists without helmet in videos using convolutional neural network," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 3036–3041.
[23] N. Boonsirisumpun, W. Puarungroj, and P. Wairotchanaphuttha, "Automatic detector for bikers with no helmet using deep learning," in Proc. 22nd Int. Comput. Sci. Eng. Conf. (ICSEC), Nov. 2018, pp. 1–4.
[24] L. Shine and C. V. Jiji, "Automated detection of helmet on motorcyclists from traffic surveillance videos: A comparative analysis using hand-crafted features and CNN," Multimedia Tools Appl., vol. 79, pp. 14179–14199, Feb. 2020.
[25] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2005, pp. 886–893.
[26] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 318–327, Feb. 2020.
[27] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6517–6525.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV), Zurich, Switzerland: Springer, 2014, pp. 740–755.
[29] R. E. Kalman, "A new approach to linear filtering and prediction problems," J. Basic Eng., vol. 82, no. 1, pp. 35–45, Mar. 1960.
[30] J. Munkres, "Algorithms for the assignment and transportation problems," J. Soc. Ind. Appl. Math., vol. 5, no. 1, pp. 32–38, Mar. 1957.
[31] G. J. McLachlan, "Mahalanobis distance," Resonance, vol. 4, no. 6, pp. 20–26, 1999.
[32] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a 'Siamese' time delay neural network," in Proc. Adv. Neural Inf. Process. Syst., 1994, pp. 737–744.
[33] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[34] R. Caruana, "Multitask learning," Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997.
[35] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, Jun. 2006, pp. 1735–1742.
[36] A. Shen, "BeaverDam: Video annotation tool for computer vision training labels," M.S. thesis, EECS Dept., Univ. California, Berkeley, CA, USA, Dec. 2016. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-193.html
[37] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, and C. C. Chen, "A large-scale benchmark dataset for event recognition in surveillance video," in Proc. CVPR, Jun. 2011, pp. 3153–3160.
[38] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[40] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.

HANHE LIN received the Ph.D. degree from the Department of Information Science, University of Otago, New Zealand, in 2016. He is currently a Postdoctoral Researcher with the Department of Computer and Information Science, University of Konstanz, Germany. His research interests include machine learning and deep learning-based applications, visual quality assessment, and crowdsourcing.

JEREMIAH D. DENG (Member, IEEE) received the B.Eng. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 1989, the M.Eng. degree from the South China University of Technology (SCUT), Guangzhou, China, in 1992, and the Ph.D. degree under the co-supervision of SCUT and The University of Hong Kong, Hong Kong, in 1995. He joined the University of Otago, Dunedin, New Zealand, as a Research Fellow, in 1999, where he is currently an Associate Professor with the Department of Information Science. He has published over 110 technical articles in machine learning, signal processing, and mobile computing.

DEIKE ALBERS received the M.Sc. degree in human factors engineering from the Technical University of Munich (TUM), where she is currently pursuing the Ph.D. degree with the Chair of Ergonomics. She is also a Researcher. Her research interests include traffic safety and the validity of usability assessments under different testing conditions.

FELIX WILHELM SIEBERT received the M.Sc. degree in human factors from the Technische Universität Berlin, Germany, and the Ph.D. degree in psychology from the Leuphana University of Lüneburg, Germany. He holds a postdoctoral position with the Department of Psychology, Friedrich-Schiller University of Jena, Germany. His research interests include the safety of vulnerable road users and the impact of new forms of mobility on transport systems.