MIT DriveSeg Manual
Li Ding¹, Jack Terwilliger¹, Rini Sherony², Bryan Reimer¹, and Lex Fridman¹
¹ Massachusetts Institute of Technology (MIT)
² Collaborative Safety Research Center, Toyota Motor North America
Figure 1: Examples from the proposed MIT DriveSeg dataset. Annotations are overlaid on the frames.
Amazon Mechanical Turk (MTurk); 2) create an open-source, densely annotated video driving scene dataset that can help with future research in various fields, e.g. spatiotemporal scene perception, predictive modeling, and semi-automatic annotation process development.

3.1. Dataset Overview

We collect a long, untrimmed video (2 minutes 47 seconds, 5,000 frames in total) at 1080P (1920x1080) resolution, 30 fps, which is a single daytime driving trip around crowded city streets, and annotate it with fine, per-frame, pixel-wise semantic labels. Examples from the dataset are shown in Fig. 1.

This dataset is made freely available to academic and non-academic entities for non-commercial purposes, such as academic research, teaching, and scientific publications. Permission is granted to use the data given that you agree to the license terms (see Appendix A).

Dataset Split. We do not specify an official training / validation / testing split of the dataset. Instead, we release the whole dataset and encourage people to experiment with different split and sampling settings depending on the task, especially in the video domain. For the standard image semantic segmentation task, we suggest a split of 3,000 frames for training, 500 for validation, and 1,500 for testing.
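For concreteness, the snippet below is a minimal Python sketch of the suggested 3,000 / 500 / 1,500 split. The contiguous, index-based ordering is an assumption for illustration only; it is not an official specification of how the split must be drawn.

NUM_FRAMES = 5000  # total frames in the released video

def suggested_split(num_frames=NUM_FRAMES):
    # A simple contiguous split over frame indices 0..4999 (assumed ordering).
    indices = list(range(num_frames))
    train = indices[:3000]
    val = indices[3000:3500]
    test = indices[3500:]
    return train, val, test

train_idx, val_idx, test_idx = suggested_split()
assert (len(train_idx), len(val_idx), len(test_idx)) == (3000, 500, 1500)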
Figure 2: Front-end of our annotation tool.
Our annotation process involves four stages: 1) task creation, 2) task distribution, 3) annotation validation, and 4) the assembly of sub-scene annotations into full-scene annotations.

For stage 1, the creation of tasks, we label which frames contain the classes we are interested in and group the frames into sets of 3. This stage removes cases where a worker is asked to annotate the presence of a class which is not present in a frame. Since this stage only requires labeling the frame numbers in which a member of a particular class enters the visual scene and the frame numbers in which the last member of that class leaves, it is much faster and cheaper than creating a semantic segmentation task for every frame and letting annotators discover that the class is not present. This approach creates significant time and cost savings, especially for rare classes, such as motorcycles in our case.
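As a rough illustration of stage 1, the Python sketch below turns per-class presence intervals (frame of first entry, frame of last exit) into annotation tasks covering three frames each. The interval format, class names, and function name are hypothetical and only meant to convey the idea.

def make_tasks(presence_intervals, group_size=3):
    # presence_intervals: class name -> list of (first_frame, last_frame) spans
    # during which at least one member of the class is visible.
    tasks = []
    for class_name, intervals in presence_intervals.items():
        frames = []
        for first, last in intervals:
            frames.extend(range(first, last + 1))
        # Group the relevant frames into sets of `group_size` (3 in our case),
        # so no worker is asked to annotate a class that is absent.
        for i in range(0, len(frames), group_size):
            tasks.append({"class": class_name, "frames": frames[i:i + group_size]})
    return tasks

# Hypothetical example: a rare class appears only in a short interval.
example = {"motorcycle": [(1200, 1260)], "pedestrian": [(0, 4999)]}
print(len(make_tasks(example)), "tasks")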
For stage 2, the distribution of tasks, we submit our tasks to MTurk and specify additional information which controls how our tasks are distributed:

Reward. This is the amount of money a worker receives for completing our task. We specify different rewards for different classes based on the estimated duration and effort of the annotation.

Qualifications. This allows us to limit the pool of workers who may work on our tasks based on 1) the worker's approval rate, calculated from all of the worker's work on the MTurk platform, 2) the total number of tasks the worker has completed, and 3) the qualification task we designed for every new worker taking our task for the first time, which is a test task that can be evaluated against known ground truth.
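As a sketch of stage 2, the following shows how such a task might be submitted through the boto3 MTurk client with a per-class reward and the three qualification requirements described above. The reward amount, thresholds, external question file, and the custom qualification type ID are placeholders, not the actual settings used for DriveSeg.

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Placeholder ExternalQuestion XML pointing at our web-based annotation tool.
question_xml = open("external_question.xml").read()

qualification_requirements = [
    # MTurk system qualification: worker approval rate across the platform.
    {"QualificationTypeId": "000000000000000000L0",
     "Comparator": "GreaterThanOrEqualTo", "IntegerValues": [95]},
    # MTurk system qualification: total number of approved HITs.
    {"QualificationTypeId": "00000000000000000040",
     "Comparator": "GreaterThanOrEqualTo", "IntegerValues": [500]},
    # Custom qualification granted after passing our ground-truth test task
    # (placeholder ID).
    {"QualificationTypeId": "REPLACE_WITH_CUSTOM_QUALIFICATION_ID",
     "Comparator": "EqualTo", "IntegerValues": [1]},
]

mturk.create_hit(
    Title="Annotate vehicles in 3 driving-scene frames",
    Description="Draw pixel-accurate annotations for all vehicles.",
    Reward="0.50",  # set per class from estimated duration and effort
    MaxAssignments=1,
    AssignmentDurationInSeconds=3600,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=question_xml,
    QualificationRequirements=qualification_requirements,
)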
For stage 3, annotation validation, we use both automated and manual processes for assessing the quality of worker annotations. In addition to the initial qualification task, workers are occasionally assigned additional test tasks, indistinguishable from regular (non-ground-truth) tasks, to check whether they are still following our instructions. If a worker's annotation deviates significantly from the ground truth, they are disqualified from working on our tasks in the future. The process of comparing workers' annotations with the ground truth is automated by calculating the Jaccard distance. The threshold score is class dependent, since it is easier to score high on less complex objects like the road than on pedestrians. For our manual validation process, we visually verify that a worker's annotations are of sufficient quality using a tool which steps through annotated frames like a video player and allows approving/rejecting work and blocking workers via key presses.
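The automated check can be expressed compactly; the sketch below compares a worker mask against the ground-truth mask via the Jaccard index (one minus the Jaccard distance) and applies a class-dependent threshold. The threshold values here are made up for illustration and are not the ones used in practice.

import numpy as np

# Illustrative thresholds only; easier classes (e.g. road) get stricter ones.
JACCARD_THRESHOLD = {"road": 0.95, "vehicle": 0.85, "pedestrian": 0.70}

def jaccard_index(worker_mask, gt_mask):
    # Both masks are boolean arrays of the same shape.
    intersection = np.logical_and(worker_mask, gt_mask).sum()
    union = np.logical_or(worker_mask, gt_mask).sum()
    return 1.0 if union == 0 else intersection / union

def passes_validation(worker_mask, gt_mask, class_name):
    return jaccard_index(worker_mask, gt_mask) >= JACCARD_THRESHOLD[class_name]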
For stage 4, the merging of sub-annotations, we combine the class-level annotations for a given frame into a full-scene annotation. For this task, we automatically compose the final full-scene annotation one class at a time. Our algorithm first draws the background classes, such as road and sidewalk, then stationary foreground objects, such as poles and buildings, and finally dynamic foreground objects such as pedestrians and vehicles. In order for this to work, we carefully designed the instructions for each class so that they fit together harmoniously. The order in which we draw the classes dictates the instructions: when annotating the i-th of n total classes, a worker must annotate the boundaries between objects of class i and classes j where j >= i. In other words, if we draw the road annotations before the vehicle annotations, workers do not need to draw the boundary between road and vehicle when annotating road, since this work will be handled by the workers annotating vehicles.
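A minimal sketch of this composition step is shown below, assuming per-class binary masks for one frame. The class names, their integer IDs, and the exact draw order are illustrative; only the idea of painting later (more dynamic) classes over earlier (background) ones follows the description above.

import numpy as np

# Illustrative draw order: background, then stationary foreground,
# then dynamic foreground. Later classes overwrite earlier ones.
DRAW_ORDER = ["road", "sidewalk", "building", "pole", "vehicle", "pedestrian"]
CLASS_ID = {name: i + 1 for i, name in enumerate(DRAW_ORDER)}  # 0 = unlabeled

def compose_full_scene(class_masks, height, width):
    # class_masks: class name -> boolean mask for one frame.
    label_map = np.zeros((height, width), dtype=np.uint8)
    for name in DRAW_ORDER:
        mask = class_masks.get(name)
        if mask is not None:
            # Overwriting means workers on class i never need to trace
            # boundaries against classes drawn after it (j >= i).
            label_map[mask] = CLASS_ID[name]
    return label_map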
5. Research Directions

There are many potential research directions that can be pursued with this densely annotated video scene dataset. We provide a few open research questions where the dataset may be helpful:

Spatio-temporal semantic segmentation. Having explored the value of temporal information in [3], we are interested in further research on novel ways to utilize temporal data, such as optical flow and driving state, to improve perception beyond what static images alone allow.

Predictive modeling. Can we know ahead of time what is going to happen on the road? Predictive power is an important component of human intelligence and can be crucial to the safety of autonomous driving. The dataset we provide is temporally consistent and can therefore be used for research on predictive perception.

Transfer learning. How much extra data do we need if we have a perception system trained in Europe and want to use it in Boston, US? In practice, much of the literature shows that pre-trained networks help generalization to other datasets and tasks. Transfer learning is key to training a deep neural network with limited data.

Deep learning with video encoding. Most current deep learning systems operate on RGB-encoded images. However, preserving the exact RGB values of every single frame in a video is too expensive in computation and storage. Can perception systems instead learn directly from compressed video representations?

Solving redundancy of video frames. How can we efficiently find useful data among visually similar frames? What is the best frame rate for a good perception system? One of the most important problems for real-time applications is the trade-off between efficiency and accuracy.

6. Conclusion

The MIT DriveSeg dataset [3] has been used to show the value of temporal dynamics information, and it allows the computer vision community to explore modeling both short-term and long-term context as part of the driving scene segmentation task.
Acknowledgments

This work was in part supported by the Toyota Collaborative Safety Research Center. The views and conclusions being expressed are those of the authors and do not necessarily reflect those of Toyota.

References

[1] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
[2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3] L. Ding, J. Terwilliger, R. Sherony, B. Reimer, and L. Fridman. Value of temporal dynamics information in driving scene segmentation. arXiv preprint arXiv:1904.00758, 2019.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[5] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[6] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In ICCV, 2013.
[7] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[8] G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, pages 5000–5009, 2017.
[9] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.
[10] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1):157–173, 2008.
[11] D. Tsai, M. Flagg, and J. M. Rehg. Motion coherent tracking with multi-label MRF optimization. In BMVC, 2010.
[12] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
[13] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Appendix

A. License Agreement

The MIT DriveSeg Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, and scientific publications. Permission is granted to use the data given that you agree:

1. That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (MIT, Toyota) do not accept any responsibility for errors or omissions.

2. That you include a reference to the MIT DriveSeg Dataset in any work that makes use of the dataset. For research papers, cite our preferred publication as listed on our website; for other media, cite our preferred publication as listed on our website or link to the website.

3. That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works insofar as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow the dataset, or something similar in character, to be recovered.

4. That you may not use the dataset or any derivative work for commercial purposes, for example, licensing or selling the data, or using the data with the purpose of procuring a commercial gain.

5. That all rights not expressly granted to you are reserved by us (MIT, Toyota).