Unsupervised Traffic Accident Detection in First-Person Videos
∗ The first two authors contributed equally.
1 Robotics Institute, University of Michigan, Ann Arbor, MI 48109, USA. {brianyao,ematkins}@umich.edu
2 School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408, USA. {mx6,wang617,djcran}@iu.edu

I. INTRODUCTION

Autonomous driving has the potential to transform the world as we know it, revolutionizing transportation by making it faster, safer, cheaper, and less labor intensive. A key challenge is building autonomous systems that can accurately perceive and safely react to the huge diversity of situations encountered on real-world roadways. The problem is that driving situations obey a long-tailed distribution, such that a very small number of common situations makes up the vast majority of what a driver encounters, and a virtually infinite number of rare scenarios — animals running into the roadway, cars driving on the wrong side of the street, etc. — make up the rest. While each of these individual scenarios is rare, they can and do happen. In fact, the chance that one of them will occur on any given day is actually quite high.

Existing work in computer vision has applied deep learning-based visual classification to recognize actions in video collected by dashboard-mounted cameras [1], [2]. The long-tailed distribution of driving events means that unusual events may occur so infrequently that it may be impossible to collect training data for them, or even to anticipate that they might occur [3]. In fact, some studies indicate that driverless cars would need to be tested for billions of miles before enough of these rare situations occur to even accurately measure system safety [4], much less to collect sufficient training data to make them work well.

An alternative approach is to avoid modeling all possible driving scenarios, and instead to train models that recognize "normal," safe roadway conditions and then signal an anomaly when events that do not fit the model are observed. Unlike fully-supervised classification-based work, this unsupervised approach cannot identify exactly what anomaly has occurred, but it may still provide sufficient information for the driving system to recognize an unsafe situation and take evasive action. This paper proposes a novel approach that learns a deep neural network model to predict the future locations of objects such as cars, bikes, and pedestrians in the field of view of a dashboard-mounted camera on a moving ego-vehicle. These models can be easily learned from massive collections of dashboard-mounted video of normal driving, and no manual labeling is required. We then compare the predicted locations to the actual locations observed in the next few video frames. We hypothesize that anomalous roadway events can be detected by looking for major deviations between the predicted and actual locations, because unexpected roadway events (such as cars striking other objects) result in sudden, unexpected changes in an object's speed or position.

Perhaps the closest related work to ours is Liu et al. [3], who also detect anomalous events in video. Their technique tries to predict entire future RGB frames and then looks for deviations between those and observed RGB frames. While their approach can work well for static cameras, accurately predicting whole frames is extremely difficult when cameras are rapidly moving, as in the driving scenario. We side-step this difficult problem by detecting objects and predicting their trajectories, as opposed to trying to predict whole frames. To model the moving camera, we explicitly predict the future odometry of the ego-vehicle; this also allows us to detect significant deviations between the predicted and real ego-motion, which can be used to classify whether the ego-vehicle is involved in the accident or is just an observer. We evaluate our technique in extensive experiments on three datasets, including a new labeled dataset of some 1,500 video traffic accidents from dashboard cameras that we collected from YouTube. We find that our method significantly outperforms a number of baselines, including the published state-of-the-art in anomaly detection.
II. RELATED WORK

Trajectory Prediction. Extensive research has investigated trajectory prediction, often posed as a sequence-to-sequence generation problem. Alahi et al. [5] introduce a Social-LSTM for pedestrian trajectories and their interactions. The proposed social pooling method is further improved by Gupta et al. [6] to capture global context in a Generative Adversarial Network (GAN). Social pooling is also applied to vehicle trajectory prediction in Deo et al. [7] with multi-modal maneuver conditions. Other work [8], [9] captures scene context information using attention mechanisms to assist trajectory prediction. Lee et al. [10] incorporate Recurrent Neural Networks (RNNs) with conditional variational autoencoders (CVAEs) to generate multimodal predictions and choose the best by ranking scores.

While the above methods are designed for third-person views from static cameras, recent work has considered vision in first-person (egocentric) videos that capture the natural field of view of the person or agent (e.g., vehicle) wearing the camera, studying the camera wearer's actions [11], [12], trajectories [13], interactions [14], [15], etc. Bhattacharyya et al. [16] predict future locations of pedestrians from vehicle-mounted cameras, modeling observation uncertainties with a Bayesian LSTM network. Yagi et al. [17] incorporate different kinds of cues into a convolution-deconvolution (Conv1D) network to predict pedestrians' future locations. Yao et al. [18] extend this work to autonomous driving scenarios by proposing a multi-stream RNN encoder-decoder (RNN-ED) architecture with both past vehicle locations and image features as inputs for anticipating vehicle locations.

Video Anomaly Detection. Video anomaly detection has received considerable attention in computer vision and robotics [19]. Previous work mainly focuses on video surveillance scenarios, typically using unsupervised learning methods based on reconstruction of normal training data. For example, Hasan et al. [20] propose a 3D convolutional Auto-Encoder (Conv-AE) to model non-anomalous frames. To take advantage of temporal information, [21], [22] use a Convolutional LSTM Auto-Encoder (ConvLSTM-AE) to capture regular visual and motion patterns simultaneously. Luo et al. [23] propose a special framework of sRNN, called temporally-coherent sparse coding (TSC), to preserve the similarities between frames within normal and abnormal events. Liu et al. [3] detect anomalies by looking for differences between a predicted future frame and the actual frame. However, in dynamic autonomous driving scenarios, it is hard to reconstruct either the current or future RGB frames due to the ego-car's intense motion, and it is even harder to detect abnormal events. This paper proposes detecting accidents on roads by using the difference between predicted and actual trajectories of other vehicles. Our method not only eliminates the computational cost of reconstructing full RGB frames, but also localizes potential anomaly participants.

Prior work has also detected anomalies such as moving violations and car collisions on roads. Chan et al. [1] introduce a dataset of crowd-sourced dashcam videos and a dynamic-spatial-attention RNN model for accident detection. Herzig et al. [2] propose a Spatio-Temporal Action Graph (STAG) network to model the latent graph structure of spatial and temporal relations between objects. These methods are based on supervised learning, which requires arduous human annotation and makes the unrealistic assumption that all abnormal patterns have been observed in the training data. This paper instead considers the challenging but practical problem of predicting accidents with unsupervised learning. To evaluate our approach, we introduce a new dataset with traffic accidents involving objects such as cars and pedestrians.

III. UNSUPERVISED TRAFFIC ACCIDENT DETECTION IN FIRST-PERSON VIDEOS

Autonomous vehicles must monitor the roadway ahead for signs of unexpected activity that may require evasive action. A natural way to detect these anomalies is to look for unexpected or rare movements in the first-person perspective of a front-facing, dashboard-mounted camera on a moving ego-vehicle. Prior work [3] proposes monitoring for unexpected scenarios by using past video frames to predict the current video frame, and then comparing it to the observed frame and looking for major differences. However, this does not work well for moving cameras on vehicles, where the perceived optical motion in the frame is induced by both moving objects and camera ego-motion. More importantly, anomaly detection systems do not need to accurately predict all information in the frame, since anomalies are unlikely to involve peripheral objects such as houses or billboards by the roadside. This paper thus assumes that an anomaly may exist if an object's real-world observed trajectory deviates from the predicted trajectory. For example, when a vehicle should move through an intersection but instead suddenly stops, a collision may have occurred.

Following Liu et al. [3], our model is trained with a large-scale dataset of normal, non-anomalous driving videos. This allows the model to learn normal patterns of object and ego motion, then recognize deviations without the need to explicitly train the model with examples of every possible anomaly. This video data is easy to obtain and does not require hand labeling. Considering the influence of ego-motion on perceived object location, we incorporate a future ego-motion prediction module [18] as an additional input. At test time, we use the model to predict the current locations of objects based on the last few frames of data, and determine whether an abnormal event has happened based on three different anomaly detection strategies per Section III-B.
Fig. 2: Overview of the future object localization model.

A. Future Object Localization

1) Bounding Box Prediction: Following [18], we denote an observed object's bounding box as X_t = [c^x_t, c^y_t, w_t, h_t] at time t, where (c^x_t, c^y_t) is the location of the center of the box and w_t and h_t are its width and height in pixels, respectively. We denote the object's future bounding box trajectory for the δ frames after time t as Y_t = {Y_{t+1}, Y_{t+2}, ..., Y_{t+δ}}, where each Y is a bounding box parameterized by center, width, and height. Given the image evidence O_t observed at time t, a visible object's location X_t, and its corresponding historical information H_{t−1}, our future object localization model predicts Y_t. This model is inspired by the multi-stream RNN encoder-decoder framework of Yao et al. [18], but with a completely different network structure. For each frame, [18] receives and re-processes the previous 10 frames before making a decision, whereas our model only needs to process the current information, making it much faster at inference time. Our model is shown in Figure 2. Two encoders based on gated recurrent units (GRUs) receive an object's current bounding box and pixel-level spatiotemporal features as inputs, where the spatiotemporal features are extracted by a region-of-interest pooling (RoIPool) operation using bilinear interpolation from precomputed optical flow fields.
interpolation from precomputed optical flow fields. 13 else
2) Ego-Motion Cue: Ego-motion information of the mov- 14 T rks[j].Xt = T rks[j].Ŷt−1
ing camera has been shown necessary for accurate future 15 T rks[j].Ŷt = F OL(T rks[j].Xt , Ot )
object localization [16], [18]. Let Et be the ego-vehicle’s 16 end
pose at time t; Et = {φt , xt , zt } where φt is the yaw 17 end
angle and xt and zt are the positions along the ground
plane with respect to the vehicle’s starting position in the
first video frame. We predict the ego-vehicle’s odometry by
using another RNN encoder-decoder module to encode ego- B. Traffic Accident Detection
position change vector Et − Et−1 and decode future ego- In this section, we propose three different strategies for
position changes E = {Êt+1 −Et , Êt+2 −Et , ..., Êt+δ −Et }. traffic accident detection by monitoring the prediction ac-
We use the change in ego-position to eliminate accumulated curacy and consistency of objects’ future locations. The key
odometry errors. The output E is then combined with the idea is that object trajectories and locations in non-anomalous
hidden state of the future object localization decoder to form events can be precisely predicted, while deviations from
the input into the next time step. predicted behaviors suggest an anomaly.
Fig. 3: Overview of our unsupervised traffic accident detection methods. The three brackets correspond to: (1) Predicted
bounding box accuracy method (pink); (2) Predicted box mask accuracy method (green); (3) Predicted bounding box
consistency method (purple). All methods use multiple previous FOL outputs to compute anomaly scores.
1) Predicted Bounding Boxes - Accuracy: One simple method for recognizing abnormal events is to directly measure the similarity between predicted object bounding boxes and their corresponding observations. The future object localization (FOL) model predicts bounding boxes for the next δ future frames, i.e., at each time t each object has δ bounding boxes predicted from times t − δ through t − 1, respectively. We first average the positions of the δ bounding boxes, then compute the intersection over union (IoU) between the averaged bounding box and the observed box location, where a higher IoU means greater agreement between the two boxes. We average the computed IoU values over all observed objects and then compute an aggregate anomaly score L_{bbox} ∈ [0, 1]:

L_{bbox} = 1 - \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\delta}\sum_{j=1}^{\delta}\mathrm{IoU}\left(\hat{Y}^{i}_{t,t-j},\, Y^{i}_{t}\right),   (1)

where N is the total number of observed objects, Ŷ^i_{t,t−j} is the bounding box of object i at time t predicted from time t − j, and Y^i_t is its observed bounding box at time t. This method relies upon accurate object tracking to match the predicted and observed bounding boxes.
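As a concrete reference, here is a small NumPy sketch of the score in Eq. (1); the [cx, cy, w, h] box format follows Section III-A.1, and the helper names are ours, not the released code.

import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as [cx, cy, w, h]."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def bbox_anomaly_score(pred_boxes, obs_boxes):
    """Eq. (1): pred_boxes[i] holds the delta predictions for object i at time t,
    obs_boxes[i] is its observed box at time t; returns L_bbox in [0, 1]."""
    per_object = [np.mean([iou(p, obs) for p in preds])
                  for preds, obs in zip(pred_boxes, obs_boxes)]
    return 1.0 - float(np.mean(per_object))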
2) Predicted Box Mask - Accuracy: Although tracking algorithms such as Deep-SORT [24] offer reasonable accuracy, it is still possible to lose or mis-track objects. We found that inaccurate tracking particularly happens in severe traffic accidents because of the twisting and distortion of object appearances. Moreover, severe ego-motion also results in inaccurate tracking due to sudden changes in object locations. This increases the number of false negatives of the metric proposed above, which simply ignores objects that are not successfully tracked in a given frame. To solve this problem, we first convert all areas within the predicted bounding boxes to a binary mask, with areas inside the boxes having value 1 and the background having value 0, and do the same with the observed boxes. We then calculate an anomaly score based on the IoU between these two binary masks:

I^{(u,v)} = \begin{cases} 1, & \text{if pixel } (u,v) \text{ lies within some box } X^{i} \\ 0, & \text{otherwise,} \end{cases}   (2)

L_{mask} = 1 - \mathrm{IoU}\left(\hat{I}_{t,t-1},\, I_{t}\right),   (3)

where I^{(u,v)} is pixel (u, v) of mask I, X^i is the i-th bounding box, Î_{t,t−1} is the mask predicted from time t − 1, and I_t is the observed mask at time t. In other words, while the metric in the last section compares bounding boxes on an object-by-object basis, this metric simply compares the bounding boxes of all objects simultaneously. The main idea is that accurate prediction results will still have a relatively large IoU compared with the ground-truth observation.
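A minimal sketch of Eqs. (2)-(3): predicted and observed boxes are rasterized into binary masks of the frame size and compared by IoU. The frame resolution and helper names are assumptions, not values from the paper.

import numpy as np

def boxes_to_mask(boxes, height, width):
    """Eq. (2): binary mask that is 1 inside any box [cx, cy, w, h] (in pixels)."""
    mask = np.zeros((height, width), dtype=bool)
    for cx, cy, w, h in boxes:
        x1, y1 = max(0, int(cx - w / 2)), max(0, int(cy - h / 2))
        x2, y2 = min(width, int(cx + w / 2)), min(height, int(cy + h / 2))
        mask[y1:y2, x1:x2] = True
    return mask

def mask_anomaly_score(pred_boxes, obs_boxes, height=720, width=1280):
    """Eq. (3): 1 - IoU between the predicted and observed box masks."""
    pred_mask = boxes_to_mask(pred_boxes, height, width)
    obs_mask = boxes_to_mask(obs_boxes, height, width)
    union = np.logical_or(pred_mask, obs_mask).sum()
    inter = np.logical_and(pred_mask, obs_mask).sum()
    # if neither mask contains any box, report no anomaly
    return 1.0 - (inter / union if union > 0 else 1.0)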
3) Predicted Bounding Boxes - Consistency: The above methods rely on accurate detection of objects in the current frame to compute anomaly scores. However, the detection of anomaly participants is not always accurate due to changes in appearance and mutual occlusions. We hypothesize that the visual and motion features of an anomaly do not only appear once it happens, but are usually accompanied by a salient pre-event. We thus propose another strategy that detects anomalies by computing the consistency of future object localization outputs from several previous frames, while eliminating the effect of inaccurate detection and tracking.

As discussed in Section III-B.1, our model has δ predicted bounding boxes for each object in video frame t. We compute the standard deviation (STD) over all δ predicted bounding boxes to measure their similarity:

L_{pred} = \frac{1}{N}\sum_{i=1}^{N}\max_{\{c^{x},\,c^{y},\,w,\,h\}}\mathrm{STD}\left(\hat{Y}^{i}_{t,t-j}\right),   (4)

where the STD is taken over the δ predictions Ŷ^i_{t,t−j}, j = 1, ..., δ, of object i. We take the maximum STD over the four components of the bounding boxes since different anomalies may be indicated by different effects on the bounding box; e.g., suddenly stopped cross traffic may only have a large STD along the horizontal axis. A low STD suggests the object is following normal movement patterns and thus the predictions are stable, while a high STD suggests abnormal motion. For all three methods, we follow [3] to normalize the computed anomaly scores for evaluation.
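A matching NumPy sketch of Eq. (4), assuming pred_boxes[i] is the (δ, 4) array of boxes predicted for object i at the current frame; the per-video score normalization of [3] is omitted here.

import numpy as np

def consistency_anomaly_score(pred_boxes):
    """Eq. (4): mean over objects of the largest per-coordinate STD
    across each object's delta predictions for the current frame."""
    scores = []
    for preds in pred_boxes:                     # preds: (delta, 4) array for one object
        preds = np.asarray(preds, dtype=float)
        scores.append(preds.std(axis=0).max())   # max STD over cx, cy, w, h
    return float(np.mean(scores)) if scores else 0.0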
TABLE I: Comparison of publicly available datasets for video anomaly detection. ∗ Surveillance videos. ∗∗ Egocentric videos (training videos are all normal, while some testing videos contain anomalies).

Dataset                      | # videos | # training frames | # testing frames | # anomaly events | typical participants
UCSD Ped1/Ped2∗ [25]         | 98       | 9,350             | 9,210            | 77               | bike, pedestrian, cart, skateboard
CUHK Avenue∗ [26]            | 37       | 15,328            | 15,324           | 47               | bike, pedestrian
UCF-Crime∗ [27]              | 1,900    | 1,610 videos      | 290 videos       | 1,900            | car, pedestrian, animal
ShanghaiTech∗ [23]           | 437      | 274,515           | 42,883           | 130              | bike, pedestrian
Street Accidents (SA)∗∗ [1]  | 994      | 82,900            | 16,500           | 165              | car, truck, bike
A3D∗∗                        | 1,500    | 79,991 (HEV-I)    | 128,175          | 1,500            | car, truck, bike, pedestrian, animal
IV. EXPERIMENTS

To evaluate our method on realistic traffic scenarios, we introduce a new dataset, AnAn Accident Detection (A3D), of on-road abnormal event videos compiled as 1,500 video clips from a YouTube channel [28] of dashboard cameras from different cars in East Asia. Each video contains an abnormal traffic event, and the events occur at different temporal locations across videos. We labeled each video with anomaly start and end times under the consensus of three human annotators. The annotators were instructed to label the anomalies based on common sense, with the start time defined to be the point at which the accident is inevitable and the end time the point when all participants recover to a normal moving condition or fully stop.

We compare our A3D dataset with existing video anomaly detection datasets in Table I. A3D includes a total of 128,175 frames (individual videos range from 23 to 208 frames) at 10 frames per second, and is clustered into 18 types of traffic accidents, each labeled with a brief description. A3D includes driving scenarios with different weather conditions (e.g., sunny, rainy, snowy), places (e.g., urban, countryside), and participant types (e.g., cars, motorcycles, pedestrians, animals). In addition to start and end times, each traffic anomaly is labeled with a binary value indicating whether the ego-vehicle is involved, to provide a better understanding of the event. Note that this could especially benefit the first-person vision community; for example, rear-end collisions are the most difficult to detect with traditional anomaly detection methods. About 60% of accidents in the dataset involve the ego-vehicle, and the others are observed by moving cars from a third-person perspective.

Since A3D does not contain normal videos, we use the publicly available Honda Egocentric View Intersection (HEV-I) [18] dataset to train our model. HEV-I was designed for future object localization and consists of 230 on-road videos at intersections in the San Francisco Bay Area. Each video is 10-60 seconds in length. Since HEV-I and A3D were collected in different places with different kinds of cameras, there is no overlap between the training and testing datasets. Following prior work [18], we produce object bounding boxes using Mask R-CNN [29] pre-trained on the COCO dataset and find tracking IDs using Deep-SORT [24].

A. Implementation Details

We implemented our model in PyTorch [30] and performed experiments on a system with a Pascal Nvidia Titan Xp GPU.1 We use ORB-SLAM 2.0 [31] for ego-odometry calculation and compute optical flow using FlowNet 2.0 [32]. We use a 5×5 RoIPool operator to produce the final flattened feature vector O_t ∈ R^{50}. The gated recurrent unit (GRU) [33] is our basic RNN cell. The GRU hidden state sizes for the future object localization and ego-motion prediction models were set to 512 and 128, respectively. To learn the network parameters, we use the RMSprop [34] optimizer with default parameters, learning rate 10^{−4}, and no weight decay. Our models were optimized in an end-to-end manner, and the training process was terminated after 100 epochs with a batch size of 32. The best model was selected according to its performance on future object localization.

1 The code and dataset will be made publicly available upon publication.
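The 5×5 RoIPool over optical flow described above can be approximated with torchvision's RoIAlign, which also uses bilinear sampling; the sketch below, including the box format and single-image batch handling, is our illustration rather than the authors' code.

import torch
from torchvision.ops import roi_align

def roi_flow_feature(flow, box_xyxy):
    """Pool a 5x5 window from a 2-channel optical flow field and flatten it
    to the 50-d feature O_t used by the box-prediction stream."""
    # flow: (2, H, W) precomputed flow; box_xyxy: [x1, y1, x2, y2] in pixels
    rois = torch.tensor([[0.0, *map(float, box_xyxy)]])   # batch index 0 + box
    pooled = roi_align(flow.unsqueeze(0), rois, output_size=(5, 5), aligned=True)
    return pooled.flatten()                                # shape (50,)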
B. Evaluation Metrics

For evaluation, we follow the video anomaly detection literature [25] and compute frame-level Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC). A higher AUC value indicates better performance.
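Given per-frame anomaly scores and binary frame labels, the frame-level AUC can be computed with scikit-learn as below; this is a generic sketch, not the authors' evaluation script.

import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(scores_per_video, labels_per_video):
    """Concatenate per-frame anomaly scores and 0/1 anomaly labels over all
    test videos, then compute a single frame-level ROC AUC."""
    scores = np.concatenate(scores_per_video)
    labels = np.concatenate(labels_per_video)
    return roc_auc_score(labels, scores)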
C. Baselines

K-Nearest Neighbor Distance. We segment each video into a bag of short video chunks of 16 frames. Each chunk is labeled as either normal or anomalous based on the annotation of its 8th frame. We then feed each chunk into an I3D [35] network pre-trained on the Kinetics dataset, and extract the outputs of the last fully-connected layer as its feature representation. All videos in the HEV-I dataset are used as normal data. The normalized distance of each test video chunk to the centroid of its K nearest normal (K-NN) video chunks is computed as the anomaly score. We show results for K = 1 and K = 5 in this paper.
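A sketch of the scoring step of this baseline, assuming I3D chunk features have already been extracted; the distance normalization mentioned in the text is left out.

import numpy as np

def knn_anomaly_scores(test_feats, normal_feats, k=5):
    """For each test chunk feature, find its k nearest normal chunk features
    and use the distance to their centroid as the (unnormalized) anomaly score."""
    scores = []
    for f in test_feats:                                  # f: (D,) I3D feature
        d = np.linalg.norm(normal_feats - f, axis=1)      # distance to every normal chunk
        nearest = normal_feats[np.argsort(d)[:k]]
        scores.append(np.linalg.norm(f - nearest.mean(axis=0)))
    return np.array(scores)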
Conv-AE [20]. We reimplement the Conv-AE model for unsupervised video anomaly detection following [20]. The input images are encoded by 3 convolutional layers and 2 pooling layers, and then decoded by 3 deconvolutional layers and 2 upsampling layers for reconstruction. The anomaly score computation follows [20]. The model is trained on a mixture of the SA (Table I) and HEV-I datasets for 20 epochs, and the best model is selected.
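A rough PyTorch sketch of such an encoder-decoder, with 3 convolution + 2 pooling layers and 3 deconvolution + 2 upsampling layers; channel widths, kernel sizes, and input channels are assumptions, not taken from [20].

import torch.nn as nn

class ConvAE(nn.Module):
    """Sketch of the Conv-AE baseline used for frame reconstruction."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.ConvTranspose2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.ConvTranspose2d(64, in_ch, 3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))   # reconstruction of the input frame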
State-of-the-art [3]. The future frame prediction network with a Generative Adversarial Network (GAN) achieved state-of-the-art results for video anomaly detection. This work detects abnormal events by leveraging the difference between a predicted future frame and its ground truth. To compare fairly with our method, we used the publicly available code from the authors of [3] and fine-tuned it on the same data as Conv-AE. Training is terminated after 100,000 iterations and the best model is selected.
TABLE II: Experimental results on the A3D and SA datasets (frame-level AUC).

Methods                   | A3D  | A3D (w/o Ego) | SA [1]
K-NN (K = 1)              | 48.0 | 51.3          | 48.2
K-NN (K = 5)              | 47.8 | 51.2          | 48.1
Conv-AE [20]              | 49.5 | 49.9          | 50.4
State-of-the-art [3]      | 46.2 | 51.2          | 50.4
FOL-AvgIoU                | 49.7 | 57.0          | 53.4
FOL-MinIoU                | 48.4 | 56.0          | 52.6
FOL-Mask                  | 54.1 | 54.9          | 54.8
FOL-AvgSTD (pred only)    | 59.3 | 60.2          | 55.8
FOL-MaxSTD (pred only)    | 60.1 | 59.8          | 55.6

D. Results on the A3D Dataset

Quantitative Results. We evaluated the baselines, the state-of-the-art method, and our proposed method on the A3D dataset. As shown in the first column of Table II, our method outperforms the K-NN baselines as well as Conv-AE and the state-of-the-art. As a comparative study, we evaluate the performance of our future object localization (FOL) methods with the three metrics presented in Section III-B. FOL-AvgIoU uses the metric in Eq. (1), while FOL-MinIoU is a variation in which we take the minimum IoU over all observed objects instead of computing the average, resulting not only in anomaly detection but also in anomalous object localization. However, FOL-MinIoU can perform worse since it is not robust to outliers such as a failed prediction for a normal object, which is more frequent in videos with a large number of objects. FOL-Mask uses the metric in Eq. (3) and significantly outperforms the above two methods. This method does not rely on accurate tracking, so it handles cases with mis-tracked objects. However, it may mis-label a frame as an anomaly if object detection loses some normal objects. Our best methods use the prediction-only metric defined in Eq. (4), which has two variations, FOL-AvgSTD and FOL-MaxSTD. Similar to the IoU-based methods, FOL-MaxSTD finds the most anomalous object in the frame. By using only predictions, our method is free from unreliable object detection and tracking when an anomaly happens, including the false negatives (in IoU-based methods) and the false positives (in mask-based methods) caused by losing objects. However, this method can fail in cases where predicting the future locations of an object is difficult, e.g., an object with low resolution, intense ego-motion, or multiple object occlusions due to heavy traffic.

We also evaluated the methods after removing videos in which the ego-car is involved in the anomaly (A3D w/o Ego in Table II), to show how ego-motion influences anomaly detection. As shown in the first and second columns of Table II, FOL-AvgIoU and FOL-MinIoU are impacted by ego-motion while the other methods are relatively robust. This further shows that it is necessary to reduce the dependency on accurate object detection and tracking when anomalies occur.

Qualitative Results. Fig. 4 shows two sample results of our best method and the published state-of-the-art on the A3D dataset. In the upper example, the predictions of all observed traffic participants are accurate and consistent at the beginning. The ego-car is hit at around the 30th frame by the white car on its left, causing inaccurate and unstable predictions and thus generating high anomaly scores. After the crash, the ego-car stops and the predictions recover, as presented in the last two images. Fig. 6 shows a failure case where our method raises false alarms at the beginning due to inconsistent prediction of the left-most car, which is occluded by trees. This is because our model takes all objects into consideration equally rather than focusing on important objects. The false negatives show that our method is not able to detect an accident if the participants are totally occluded (e.g., the bike) or if the motion pattern happens to appear normal from a particular viewpoint (e.g., the middle car).

E. Results on the SA Dataset

We also compared the performance of our model and the baselines on the Street Accidents (SA) [1] dataset of on-road accidents in Taiwan. This dataset was collected from dashboard cameras with 720p resolution from the driver's point of view. Note that we use SA only for testing, and still train on the HEV-I dataset. We follow prior work [1] and report evaluation results on 165 test videos containing different anomalies. The right-most column of Table II shows the results of the different methods on SA. In general, our best method outperforms all baselines and the published state-of-the-art. The SA testing dataset is much smaller than A3D, and we have informally observed that it is biased towards anomalies involving bikes. It also contains videos collected from cyclist head-mounted cameras, which have irregular camera angles and large vibrations. Figure 5 shows an example of anomaly detection on the SA dataset.

V. CONCLUSION

This paper proposes an unsupervised deep learning framework for traffic accident detection in egocentric videos. A key challenge is the rapid motion of the ego-car, which makes visual reconstruction of either current or future RGB frames from regular training data difficult. Instead, we predict traffic participant trajectories as well as their future locations, and monitor the prediction accuracy and consistency as a signal that an anomaly may have occurred. We introduce a new dataset consisting of a variety of real-world accidents on roads, and we evaluate our method on two traffic accident detection datasets. Experiments show that our model significantly outperforms published baselines.
Fig. 4: Two examples of our best method and a state-of-the-art method on the A3D dataset.
Fig. 5: An example of our best method and a state-of-the-art method on SA dataset [1].
Fig. 6: A failure case of our method on the A3D dataset with false alarms and false negatives.
REFERENCES

[1] F.-H. Chan, Y.-T. Chen, Y. Xiang, and M. Sun, "Anticipating accidents in dashcam videos," in ACCV, 2016.
[2] R. Herzig, E. Levi, H. Xu, E. Brosh, A. Globerson, and T. Darrell, "Classifying collisions with spatio-temporal action graph networks," arXiv:1812.01233, 2018.
[3] W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection - a new baseline," in CVPR, 2018.
[4] N. Kalra and S. M. Paddock, "Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?" Transportation Research Part A: Policy and Practice, 2016.
[5] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in CVPR, 2016.
[6] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in CVPR, 2018.
[7] N. Deo, A. Rangesh, and M. M. Trivedi, "How would surround vehicles move? A unified framework for maneuver classification and motion prediction," T-IV, 2018.
[8] A. Sadeghian, F. Legros, M. Voisin, R. Vesel, A. Alahi, and S. Savarese, "Car-Net: Clairvoyant attentive recurrent network," in ECCV, 2018.
[9] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, and S. Savarese, "SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints," arXiv:1806.01482, 2018.
[10] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, "DESIRE: Distant future prediction in dynamic scenes with interacting agents," in CVPR, 2017.
[11] Y. Li, Z. Ye, and J. M. Rehg, "Delving into egocentric actions," in CVPR, 2015.
[12] M. Ma, H. Fan, and K. M. Kitani, "Going deeper into first-person activity recognition," in CVPR, 2016.
[13] G. Bertasius, A. Chan, and J. Shi, "Egocentric basketball motion planning from a single first-person image," in CVPR, 2018.
[14] C. Fan, J. Lee, M. Xu, K. K. Singh, Y. J. Lee, D. J. Crandall, and M. S. Ryoo, "Identifying first-person camera wearers in third-person videos," arXiv:1704.06340, 2017.
[15] M. Xu, C. Fan, Y. Wang, M. S. Ryoo, and D. J. Crandall, "Joint person segmentation and identification in synchronized first- and third-person videos," arXiv:1803.11217, 2018.
[16] A. Bhattacharyya, M. Fritz, and B. Schiele, "Long-term on-board prediction of people in traffic scenes under uncertainty," in CVPR, 2018.
[17] T. Yagi, K. Mangalam, R. Yonetani, and Y. Sato, "Future person localization in first-person videos," in CVPR, 2018.
[18] Y. Yao, M. Xu, C. Choi, D. J. Crandall, E. M. Atkins, and B. Dariush, "Egocentric vision-based future vehicle localization for intelligent driving assistance systems," arXiv:1809.07408, 2018.
[19] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," in ACM CSUR, 2009.
[20] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," in CVPR, 2016.
[21] J. R. Medel and A. Savakis, "Anomaly detection in video using predictive convolutional long short-term memory networks," arXiv:1612.00390, 2016.
[22] Y. S. Chong and Y. H. Tay, "Abnormal event detection in videos using spatiotemporal autoencoder," in ISNN, 2017.
[23] W. Luo, W. Liu, and S. Gao, "A revisit of sparse coding based anomaly detection in stacked RNN framework," in ICCV, 2017.
[24] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in ICIP, 2017.
[25] W. Li, V. Mahadevan, and N. Vasconcelos, "Anomaly detection and localization in crowded scenes," TPAMI, 2014.
[26] C. Lu, J. Shi, and J. Jia, "Abnormal event detection at 150 fps in MATLAB," in ICCV, 2013.
[27] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in CVPR, 2018.
[28] https://www.youtube.com/channel/UC-Oa3wml6F3YcptlFwaLgDA/featured.
[29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in ICCV, 2017.
[30] http://pytorch.org/.
[31] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," T-RO, 2017.
[32] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in CVPR, 2017.
[33] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Gated feedback recurrent neural networks," in ICML, 2015.
[34] G. Hinton, N. Srivastava, and K. Swersky, "Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent."
[35] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in CVPR, 2017.