maneuvers for several reasons. Firstly, with other cars on the road, even when the input images are similar, different human drivers may make completely different decisions, which results in an ill-posed problem that is confusing when training a regressor. For example, with a car directly ahead, one may choose to follow the car, to pass the car from the left, or to pass the car from the right. When all these scenarios exist in the training data, a machine learning model will have difficulty deciding what to do given almost the same images. Secondly, the decision-making for behavior reflex is too low-level. The direct mapping cannot see a bigger picture of the situation. For example, from the model’s perspective, passing a car and switching back to a lane are just a sequence of very low level decisions for turning the steering wheel slightly in one direction and then in the other direction for some period of time. This level of abstraction fails to capture what is really going on, and it increases the difficulty of the task unnecessarily. Finally, because the input to the model is the whole image, the learning algorithm must determine which parts of the image are relevant. However, the level of supervision to train a behavior reflex model, i.e. the steering angle, may be too weak to force the algorithm to learn this critical information.

We desire a representation that directly predicts the affordance for driving actions, instead of visually parsing the entire scene or blindly mapping an image to steering angles. In this paper, we propose a direct perception approach [7] for autonomous driving, a third paradigm that falls in between mediated perception and behavior reflex. We propose to learn a mapping from an image to several meaningful affordance indicators of the road situation, including the angle of the car relative to the road, the distance to the lane markings, and the distance to cars in the current and adjacent lanes. With this compact but meaningful affordance representation as perception output, we demonstrate that a very simple controller can then make driving decisions at a high level and drive the car smoothly.

Our model is built upon the state-of-the-art deep Convolutional Neural Network (ConvNet) framework to automatically learn image features for estimating affordance related to autonomous driving. To build our training set, we ask a human driver to play a car racing video game TORCS for 12 hours while recording the screenshots and the corresponding labels. Together with the simple controller that we design, our model can make meaningful predictions for affordance indicators and autonomously drive a car in different tracks of the video game, under different traffic conditions and lane configurations. At the same time, it enjoys a much simpler structure than the typical mediated perception approach. Testing our system on car-mounted smartphone videos and the KITTI dataset [6] demonstrates good real-world perception as well. Our direct perception approach provides a compact, task-specific affordance description for scene understanding in autonomous driving.

1.1. Related work

Most autonomous driving systems from industry today are based on mediated perception approaches. In computer vision, researchers have studied each task separately [6]. Car detection and lane detection are two key elements of an autonomous driving system. Typical algorithms output bounding boxes on detected cars [4, 13] and splines on detected lane markings [1]. However, these bounding boxes and splines are not the direct affordance information we use for driving. Thus, a conversion is necessary, which may result in extra noise. Typical lane detection algorithms such as the one proposed in [1] suffer from false detections. Structures with rigid boundaries, such as highway guardrails or asphalt surface cracks, can be mis-recognized as lane markings. Even with good lane detection results, critical information for car localization may be missing. For instance, given that only two lane markings are usually detected reliably, it can be difficult to determine if a car is driving on the left lane or the right lane of a two-lane road.

To integrate different sources into a consistent world representation, [5, 22] proposed a probabilistic generative model that takes various detection results as inputs and outputs the layout of the intersection and traffic details.

For behavior reflex approaches, [17, 18] are the seminal works that use a neural network to map images directly to steering angles. More recently, [11] train a large recurrent neural network using a reinforcement learning approach. The network’s function is the same as [17, 18], mapping the image directly to the steering angles, with the objective to keep the car on track. Similarly to us, they use the video game TORCS for training and testing.
(a) angle (b) in lane: toMarking (c) in lane: dist (d) on mark.: toMarking (e) on marking: dist (f) overlapping area
Figure 3: Illustration of our affordance representation. A lane changing maneuver needs to traverse the “in lane system”
and the “on marking system”. (f) shows the designated overlapping area used to enable smooth transitions.
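As a reading aid, the representation illustrated in Figure 3 can be pictured as a small record type. The sketch below is only an illustration, not the authors' code: it uses the indicator names enumerated in Figure 4, and it assumes that an inactive indicator is represented by `None`. The class name, the helper properties, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Indicator names follow Figure 4; grouping them in tuples is an assumption.
IN_LANE: Tuple[str, ...] = (
    "toMarking_LL", "toMarking_ML", "toMarking_MR", "toMarking_RR",
    "dist_LL", "dist_MM", "dist_RR",
)
ON_MARKING: Tuple[str, ...] = (
    "toMarking_L", "toMarking_M", "toMarking_R", "dist_L", "dist_R",
)


@dataclass
class Affordance:
    """The 13 affordance indicators; None marks an inactive indicator."""
    angle: float  # always active, in radians
    # "in lane system"
    toMarking_LL: Optional[float] = None
    toMarking_ML: Optional[float] = None
    toMarking_MR: Optional[float] = None
    toMarking_RR: Optional[float] = None
    dist_LL: Optional[float] = None
    dist_MM: Optional[float] = None
    dist_RR: Optional[float] = None
    # "on marking system"
    toMarking_L: Optional[float] = None
    toMarking_M: Optional[float] = None
    toMarking_R: Optional[float] = None
    dist_L: Optional[float] = None
    dist_R: Optional[float] = None

    def _any_active(self, names: Tuple[str, ...]) -> bool:
        return any(getattr(self, name) is not None for name in names)

    @property
    def in_lane_active(self) -> bool:
        return self._any_active(IN_LANE)

    @property
    def on_marking_active(self) -> bool:
        return self._any_active(ON_MARKING)

    @property
    def has_left_lane(self) -> bool:
        # Only meaningful while the "in lane system" is active: on a
        # boundary lane the indicators of the missing neighbor are inactive.
        return self.toMarking_LL is not None


# Hypothetical reading: leftmost lane of a two-lane road, driving in lane.
example = Affordance(angle=0.02, toMarking_ML=-2.0, toMarking_MR=2.0,
                     toMarking_RR=6.0, dist_MM=30.0, dist_RR=55.0)
```

In the overlapping area of Figure 3f both systems are active, so both `in_lane_active` and `on_marking_active` would be true for such a sample.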
In terms of deep learning for autonomous driving, [14] is a successful example of a ConvNet-based behavior reflex approach. The authors propose an off-road driving robot DAVE that learns a mapping from images to a human driver’s steering angles. After training, the robot demonstrates capability for obstacle avoidance. [9] proposes an off-road driving robot with self-supervised learning ability for long-range vision. In their system, a multi-layer convolutional network is used to classify an image segment as a traversable area or not. For depth map estimation, DeepFlow [20] uses ConvNets to achieve very good results for driving scene images on the KITTI dataset [6]. For image features, deep learning also demonstrates significant improvement [12, 8, 3] over hand-crafted features, such as GIST [16]. In our experiments, we will make a comparison between learned ConvNet features and GIST for direct perception in driving scenarios.

2. Learning affordance for driving perception

To efficiently implement and test our approach, we use the open source driving game TORCS (The Open Racing Car Simulator) [21], which is widely used for AI research. From the game engine, we can collect critical indicators for driving, e.g. speed of the host car, the host car’s relative position to the road’s central line, the distance to the preceding cars. In the training phase, we manually drive a “label collecting car” in the game to collect screenshots (first person driving view) and the corresponding ground truth values of the selected affordance indicators. This data is stored and used to train a model to estimate affordance in a supervised learning manner. In the testing phase, at each time step, the trained model takes a driving scene image from the game and estimates the affordance indicators for driving. A driving controller processes the indicators and computes the steering and acceleration/brake commands. The driving commands are then sent back to the game to drive the host car. Ground truth labels are also collected during the testing phase to evaluate the system’s performance. In both the training and testing phase, traffic is configured by putting a number of pre-programmed AI cars on road.

2.1. Mapping from an image to affordance

We use a state-of-the-art deep learning ConvNet as our direct perception model to map an image to the affordance indicators. In this paper, we focus on highway driving with multiple lanes. From an ego-centric point of view, the host car only needs to be concerned with the traffic in its current lane and the two adjacent (left/right) lanes when making decisions. Therefore, we only need to model these three lanes. We train a single ConvNet to handle three lane configurations together: a road of one lane, two lanes, or three lanes. Shown in Figure 2 are the typical cases we are dealing with. Occasionally the car has to drive on lane markings, and in such situations only the lanes on each side of the lane marking need to be monitored, as shown in Figure 2e and 2f.

Highway driving actions can be categorized into two major types: 1) following the lane center line, and 2) changing lanes or slowing down to avoid collisions with the preceding cars. To support these actions, we define our system to have two sets of representations under two coordinate systems: “in lane system” and “on marking system”. To achieve two major functions, lane perception and car perception, we propose three types of indicators to represent driving situations: heading angle, the distance to the nearby lane markings, and the distance to the preceding cars. In total, we propose 13 affordance indicators as our driving scene representation, illustrated in Figure 3. A complete list of the affordance indicators is enumerated in Figure 4. They are the output of the ConvNet as our affordance estimation and the input of the driving controller.

The “in lane system” and “on marking system” are activated under different conditions. To have a smooth transition, we define an overlapping area, where both systems are active. The layout is shown in Figure 3f.

Except for heading angle, all the indicators may output an inactive state. There are two cases in which an indicator will be inactive: 1) when the car is driving in either the “in lane system” or “on marking system” and the other system is deactivated, then all the indicators belonging to that system are inactive. 2) when the car is driving on boundary lanes (left most or right most lane), and there is either no
always:
  1) angle: angle between the car’s heading and the tangent of the road
“in lane system”, when driving in the lane:
  2) toMarking_LL: distance to the left lane marking of the left lane
  3) toMarking_ML: distance to the left lane marking of the current lane
  4) toMarking_MR: distance to the right lane marking of the current lane
  5) toMarking_RR: distance to the right lane marking of the right lane
  6) dist_LL: distance to the preceding car in the left lane
  7) dist_MM: distance to the preceding car in the current lane
  8) dist_RR: distance to the preceding car in the right lane
“on marking system”, when driving on the lane marking:
  9) toMarking_L: distance to the left lane marking
  10) toMarking_M: distance to the central lane marking
  11) toMarking_R: distance to the right lane marking
  12) dist_L: distance to the preceding car in the left lane
  13) dist_R: distance to the preceding car in the right lane

Figure 4: Complete list of affordance indicators in our direct perception representation.

while (in autonomous driving mode)
    ConvNet outputs affordance indicators
    check availability of both the left and right lanes
    if (approaching the preceding car in the same lane)
        if (left lane exists and available and lane changing allowable)
            left lane changing decision made
        else if (right lane exists and available and lane changing allowable)
            right lane changing decision made
        else
            slow down decision made
    if (normal driving)
        center line = center line of current lane
    else if (left/right lane changing)
        center line = center line of objective lane
    compute steering command
    compute desired speed
    compute acceleration/brake command based on desired speed

Figure 5: Controller logic.
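The controller logic of Figure 5 can be sketched in a few Python functions. This is a minimal illustration under stated assumptions, not the authors' implementation: the steering and speed formulas follow Eq. (1) and Eq. (2) of Section 2.2, while the function names, the decision labels, the unit choice (meters and m/s, so 72 km/h becomes 20 m/s), the clamping of the speed, and the values of `C`, `c` and `d` are all hypothetical. The speed drop applied while turning is not modeled because the text does not give its formula.

```python
import math
from typing import Tuple


def steer_cmd(angle: float, dist_center: float, road_width: float,
              C: float = 1.0) -> float:
    """Eq. (1): steerCmd = C * (angle - dist_center / road_width)."""
    return C * (angle - dist_center / road_width)


def optimal_velocity(dist: float, v_max: float, c: float, d: float) -> float:
    """Eq. (2): v = v_max * (1 - exp(-c / v_max * dist - d))."""
    return v_max * (1.0 - math.exp(-c / v_max * dist - d))


def decide(approaching: bool, left_ok: bool, right_ok: bool) -> str:
    """Lane-change / slow-down branch of Figure 5.

    left_ok / right_ok stand for "lane exists and available and
    lane changing allowable".
    """
    if not approaching:
        return "keep"
    if left_ok:
        return "change_left"
    if right_ok:
        return "change_right"
    return "slow_down"


def control_step(angle: float, dist_center: float, road_width: float,
                 dist_mm: float, approaching: bool, left_ok: bool,
                 right_ok: bool, v_max: float = 20.0,
                 c: float = 2.0, d: float = -0.5) -> Tuple[str, float, float]:
    """One controller tick: returns (decision, steering, desired speed).

    dist_center must be measured to the center line of the current lane in
    normal driving and of the objective lane during a lane change.
    c and d are placeholder coefficients; the paper leaves them to calibration.
    """
    decision = decide(approaching, left_ok, right_ok)
    steering = steer_cmd(angle, dist_center, road_width)
    speed = v_max  # baseline desired speed (72 km/h = 20 m/s)
    if decision == "slow_down":
        follow = optimal_velocity(dist_mm, v_max, c, d)
        speed = max(0.0, min(v_max, follow))  # clamping is an assumption
    return decision, steering, speed
```

With these placeholder coefficients the desired speed reaches zero at a gap of 5 m and approaches `v_max` as the gap grows, which matches the described car-following behavior including a full stop.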
left lane or no right lane, then the indicators corresponding to the non-existing adjacent lane are inactive. According to the indicators’ value and active/inactive state, the host car can be accurately localized on the road.

2.2. Mapping from affordance to action

The steering control is computed using the car’s position and pose, and the goal is to minimize the gap between the car’s current position and the center line of the lane. Defining dist_center as the distance to the center line of the lane, we have:

$$\mathrm{steerCmd} = C \cdot \left(\mathrm{angle} - \mathrm{dist\_center} / \mathrm{road\_width}\right) \qquad (1)$$

where C is a coefficient that varies under different driving conditions, and angle ∈ [−π, π]. When the car changes lanes, the center line switches from the current lane to the objective lane. The pseudocode describing the logic of the driving controller is listed in Figure 5.

At each time step, the system computes desired speed. A controller makes the actual speed follow the desired speed by controlling the acceleration/brake. The baseline desired speed is 72 km/h. If the car is turning, a desired speed drop is computed according to the past few steering angles. If there is a preceding car in close range and a slow down decision is made, the desired speed is also determined by the distance to the preceding car. To achieve car-following behavior in such situations, we implement the optimal velocity car-following model [15] as:

$$v(t) = v_{max}\left(1 - \exp\left(-\frac{c}{v_{max}}\,\mathrm{dist}(t) - d\right)\right) \qquad (2)$$

where dist(t) is the distance to the preceding car, v_max is the largest allowable speed, c and d are coefficients to be calibrated. With this implementation, the host car can achieve stable and smooth car-following under a wide range of speeds and even make a full stop if necessary.

3. Implementation

Our direct perception ConvNet is built upon Caffe [10], and we use the standard AlexNet architecture [12]. There are 5 convolutional layers followed by 4 fully connected layers with output dimensions of 4096, 4096, 256, and 13, respectively. Euclidean loss is used as the loss function. Because the 13 affordance indicators have various ranges, we normalize them to the range of [0.1, 0.9].

We select 7 tracks and 22 traffic cars in TORCS, shown in Figure 6 and Figure 7, to generate the training set. We replace the original road surface textures in TORCS with over 30 customized asphalt textures of various lane configurations and asphalt darkness levels. We also program different driving behaviors for the traffic cars to create different traffic patterns. We manually drive a car on each track multiple times to collect training data. While driving, the screenshots are simultaneously down-sampled to 280 × 210 and stored in a database together with the ground truth labels. This data collection process can be easily automated by using an AI car. Yet, when driving manually, we can intentionally create extreme driving conditions (e.g. off the road, collide with other cars) to collect more effective training samples, which makes the ConvNet more powerful and significantly reduces the training time.

In total, we collect 484,815 images for training. The training procedure is similar to training an AlexNet on ImageNet data. The differences are: the input image has a resolution of 280 × 210 and is no longer a square image. We do not use any crops or a mirrored version. We train our model from scratch. We choose an initial learning rate of 0.01, and each mini-batch consists of 64 images randomly selected from the training samples. After 140,000 iterations, we stop the training process.

In the testing phase, when our system drives a car in TORCS, the only information it accesses is the front facing image and the speed of the car. Right after the host car overtakes a car in its left/right lane, it cannot judge whether it is
safe to move to that lane, simply because the system cannot see things behind. To solve this problem, we make an assumption that the host car is faster than the traffic. Therefore, if sufficient time has passed since its overtaking (indicated by a timer), it is safe to change to that lane. The control frequency in our system for TORCS is 10 Hz, which is sufficient for driving below 80 km/h. A schematic of the system is shown in Figure 8.

Figure 6: Each track is customized to the configuration of one-lane, two-lane, and three-lane with multiple asphalt darkness levels. The rest of the tracks are used in the testing set.

Figure 7: Examples of the 22 cars used in the training set. The rest of the cars are used in the testing set.

Figure 8: System architecture. The ConvNet processes the TORCS image and estimates 13 indicators for driving. Based on the indicators and the current speed of the car, a controller computes the driving commands which will be sent back to TORCS to drive the host car in it.

4. TORCS evaluation

We first evaluate our direct perception model on the TORCS driving game. Within the game, the ConvNet output can be visualized and used by the controller to drive the host car. To measure the estimation accuracy of the affordance indicators, we construct a testing set consisting of tracks and cars not included in the training set.

In the aerial TORCS visualization (Figure 10a, right), we treat the host car as the reference object. As its vertical position is fixed, it moves horizontally with a heading computed from angle. Traffic cars only move vertically. We do not visualize the curvature of the road, so the road ahead is always represented as a straight line. Both the estimation (empty box) and the ground truth (solid box) are displayed.

4.1. Qualitative assessment

Our system can drive very well in TORCS without any collision. In some lane changing scenarios, the controller may slightly overshoot, but it quickly recovers to the desired position of the objective lane’s center. As seen in the TORCS visualization, the lane perception module is pretty accurate, and the car perception module is reliable up to 30 meters away. In the range of 30 meters to 60 meters, the ConvNet output becomes noisier. In a 280 × 210 image, when the traffic car is over 30 meters away, it actually appears as a very tiny spot, which makes it very challenging for the network to estimate the distance. However, because the speed of the host car does not exceed 72 km/h in our tests, reliable car perception within 30 meters can guarantee satisfactory control quality in the game.

To maintain smooth driving, our system can tolerate moderate error in the indicator estimations. The car is a continuous system, and the controller is constantly correcting its position. Even with some scattered erroneous estimations, the car can still drive smoothly without any collisions.

4.2. Comparison with baselines

To quantitatively evaluate the performance of the TORCS-based direct perception ConvNet, we compare it with three baseline methods. We refer to our model as “ConvNet full” in the following comparisons.

1) Behavior reflex ConvNet: The method directly maps an image to steering using a ConvNet. We train this model on the driving game TORCS using two settings: (1) The training samples (over 60,000 images) are all collected while driving on an empty track; the task is to follow the lane. (2) The training samples (over 80,000 images) are collected while driving in traffic; the task is to follow the lane, avoid collisions by switching lanes, and overtake slow preceding cars. The video on our project website shows the typical performance. For (1), the behavior reflex system can easily follow empty tracks. For (2), when testing on the same track where the training set is collected, the trained system demonstrates some capability at avoiding collisions by turning left or right. However, the trajectory is erratic. The behavior is far different from a normal human driver and is unpredictable: the host car collides with the preceding cars frequently.

2) Mediated perception (lane detection): We run the Caltech lane detector [1] on TORCS images. Because only two lanes can be reliably detected, we map the coordinates of spline anchor points of the top two detected lane markings to the lane-based affordance indicators. We train a system composed of 8 Support Vector Regression (SVR) and 6 Support Vector Classification (SVC) models (using libsvm [2]) to implement the mapping (a necessary step for mediated perception approaches). The system layout is similar to
the GIST-based system (next section) illustrated in Figure 9, but without car perception.

Because the Caltech lane detector is a relatively weak baseline, to make the task simpler, we create a special training set and testing set. Both the training set (2430 samples) and testing set (2533 samples) are collected from the same track (not among the 7 training tracks for ConvNet) without traffic, and in a finer image resolution of 640 × 480. We discover that, even when trained and tested on the same track, the Caltech lane detector based system still performs worse than our model. We define our error metric as Mean Absolute Error (MAE) between the affordance estimations and ground truth distances. A comparison of the errors for the two systems is shown in Table 1.

Parameter     angle  to_LL  to_ML  to_MR  to_RR  to_L   to_M   to_R
Caltech lane  0.048  1.673  1.179  1.084  1.220  1.113  1.060  0.895
ConvNet full  0.025  0.260  0.197  0.179  0.239  0.291  0.262  0.231

Table 1: Mean Absolute Error (angle is in radians, the rest are in meters) on the testing set for the Caltech lane detector baseline.

3) Direct perception with GIST: We compare the hand-crafted GIST descriptor with the deep features learned by the ConvNet’s convolutional layers in our model. A set of 13 SVR and 6 SVC models are trained to convert the GIST feature to the 13 affordance indicators defined in our system. The procedure is illustrated in Figure 9. The GIST descriptor partitions the image into 4 × 4 segments. Because the ground area represented by the lower 2 × 4 segments may be more relevant to driving, we try two different settings in our experiments: (1) convert the whole GIST descriptor, and (2) convert the lower 2 × 4 segments of GIST descriptor. We refer to these two baselines as “GIST whole” and “GIST half” respectively.

Figure 9: GIST baseline. Procedure of mapping GIST descriptor to the 13 affordance indicators for driving using SVR and SVC.

Due to the constraints of libsvm, training with the full dataset of 484,815 samples is prohibitively expensive. We instead use a subset of the training set containing 86,564 samples for training. Samples in the sub training set are collected on two training tracks with two-lane configurations. To make a fair comparison, we train another ConvNet on the same sub training set for 80,000 iterations (referred to as “ConvNet sub”). The testing set is collected by manually driving a car on three different testing tracks with two-lane configurations and traffic. It has 8,639 samples.

The results are shown in Table 2. The dist (car distance) errors are computed when the ground truth cars lie within [2, 50] meters ahead. Below two meters, cars in the adjacent lanes are not visually present in the image.

Results in Table 2 show that the ConvNet-based system works considerably better than the GIST-based system. By comparing “ConvNet sub” and “ConvNet full”, it is clear that more training data is very helpful for increasing the accuracy of the ConvNet-based direct perception system.

5. Testing on real-world data

5.1. Smartphone video

We test our TORCS-based direct perception ConvNet on real driving videos taken by a smartphone camera. Although trained and tested in two different domains, our system still demonstrates reasonably good performance. The lane perception module works particularly well. The algorithm is able to determine the correct lane configuration, localize the car in the correct lane, and recognize lane changing transitions. The car perception module is slightly noisier, probably because the computer graphics models of cars in TORCS look quite different from the real ones. Please refer to the video on our project website for the result. A screenshot of the system running on real video is shown in Figure 10b. Since we do not have ground truth measurements, only the estimations are visualized.

(a) Autonomous driving in TORCS (b) Testing on real video
Figure 10: Testing the TORCS-based system. The estimation is shown as an empty box, while the ground truth is indicated by a solid box. For testing on real videos, without the ground truth, we can only show the estimation.

5.2. Car distance estimation on the KITTI dataset

To quantitatively analyze how the direct perception approach works on real images, we train a different ConvNet on the KITTI dataset [6]. The task is estimating the distance to other cars ahead.

The KITTI dataset contains over 40,000 stereo image pairs taken by a car driving through European urban areas. Each stereo pair is accompanied by a Velodyne LiDAR 3D point cloud file. Around 12,000 stereo pairs come with of-
Parameter     angle  to_LL  to_ML  to_MR  to_RR  dist_LL  dist_MM  dist_RR  to_L   to_M   to_R   dist_L  dist_R
GIST whole    0.051  1.033  0.596  0.598  1.140  18.561   13.081   20.542   1.201  1.310  1.462  30.164  30.138
GIST half     0.055  1.052  0.547  0.544  1.238  17.643   12.749   22.229   1.156  1.377  1.549  29.484  31.394
ConvNet sub   0.043  0.253  0.180  0.193  0.289  6.168    8.608    9.839    0.345  0.336  0.345  12.681  14.782
ConvNet full  0.033  0.188  0.155  0.159  0.183  5.085    4.738    7.983    0.316  0.308  0.294  8.784   10.740

Table 2: Mean Absolute Error (angle is in radians, the rest are in meters) on the testing set for the GIST baseline.
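For readers who want to reproduce the metric behind the tables, the following is a minimal sketch of a Mean Absolute Error computation for one indicator. It assumes that estimates and ground truth are given as parallel lists in the indicator's unit, and that the optional range filter corresponds to the [2, 50] meter window the text applies to the dist errors. The function name and signature are hypothetical.

```python
from typing import Optional, Sequence


def mean_absolute_error(estimates: Sequence[float],
                        truths: Sequence[float],
                        lo: Optional[float] = None,
                        hi: Optional[float] = None) -> float:
    """MAE over samples whose ground truth lies within [lo, hi].

    Leaving lo and hi as None keeps every sample. The range filter is
    applied to the ground truth value, not to the estimate.
    """
    if len(estimates) != len(truths):
        raise ValueError("estimates and truths must have the same length")
    errors = [
        abs(est - true)
        for est, true in zip(estimates, truths)
        if (lo is None or true >= lo) and (hi is None or true <= hi)
    ]
    if not errors:
        raise ValueError("no samples fall inside the requested range")
    return sum(errors) / len(errors)
```

For a car-distance indicator one would call `mean_absolute_error(est, gt, lo=2.0, hi=50.0)`; for angle or lane-marking indicators the range arguments can be omitted.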
settings for the KITTI-based ConvNet are altered from the previous TORCS-based ConvNet. In most KITTI images,

origin is the center of the host car, the y axis is along the host car’s heading, while the x axis is pointing to the right of the host car (Figure 11a). We ask the ConvNet to estimate the coordinate (x, y) of the cars “ahead” of the host car in this system.

There can be many cars in a typical KITTI image, but only those closest to the host car are critical for driving decisions. So it is not necessary to detect all the cars. We partition the space in front of the host car into three areas according to x coordinate: 1) central area, x ∈ [−1.6, 1.6] meters, where cars are directly in front of the host car. 2) left area, x ∈ [−12, −1.6) meters, where cars are to the left of the host car. 3) right area, x ∈ (1.6, 12] meters, where cars are to the right of the host car. We are not concerned with cars outside this range. We train the ConvNet to estimate the coordinate (x, y) of the closest car in each area (Figure 11a). Thus, this ConvNet has 6 outputs.

Due to the low resolution of input images, cars far away can hardly be discovered by the ConvNet. We adopt a two-ConvNet structure. The close range ConvNet covers 2 to 25 meters (in the y coordinate) ahead, and its input is the entire KITTI image resized to 497 × 150 resolution. The far range ConvNet covers 15 to 55 meters ahead, and its input is a cropped KITTI image covering the central 497 × 150 area. The final distance estimation is a combination of the two ConvNets’ outputs. We build our training samples mostly from the KITTI officially labeled images, with some additional samples we labeled ourselves. The final number is around 14,000 stereo pairs. This is still insufficient to successfully train a ConvNet. We augment the dataset by using both the left camera and right camera images, mirroring all the images, and adding some negative samples that do not contain any car. Our final training set contains 61,894 images. Both ConvNets are trained on this set for 50,000 iterations. We label another 2,200 images as our testing set, on which we compute the numerical estimation error.

(a) (b)
Figure 11: Car distance estimation on the KITTI dataset. (a) The coordinate system is defined relative to the host car. We partition the space into three areas, and the objective is to estimate the coordinate of the closest car in each area. (b) We compare our direct perception approach to the DPM-based mediated perception. The central crop of the KITTI image (indicated by the yellow box in the upper left image and shown in the lower left image) is sent to the far range ConvNet. The bounding boxes output by DPM are shown in red, as are its distance projections in the LiDAR visualization (right). The ConvNet outputs and the ground truth are represented by green and black boxes, respectively.

5.3. Comparison with DPM-based baseline

We compare the performance of our KITTI-based ConvNet with the state-of-the-art DPM car detector (a mediated perception approach). The DPM car detector is provided by [5] and is optimized for the KITTI dataset. We run the detector on the full resolution images and convert the bounding boxes to distance measurements by projecting the central point of the lower edge to the ground plane (zero height) using the calibrated camera model. The projection is very accurate given that the ground plane is flat, which holds for most KITTI images. DPM can detect multiple cars in the image, and we select the closest ones (one on the host car’s left, one on its right, and one directly in front of it) to compute the estimation error. Since the images are taken while the host car is driving, many images contain closest cars that only partially appear in the left lower corner or right lower corner. DPM cannot detect these partial cars, while the ConvNet can better handle such situations. To make the comparison fair, we only count errors when the closest cars fully appear in the image. The error is computed when the traffic cars show up within 50 meters ahead (in the y coordinate). When there is no car present, the ground truth is set as 50 meters. Thus, if either model has a false positive, it will be penalized. The Mean Absolute Error (MAE) for the y and x coordinate, and the Euclidean distance d between the estimation and the ground truth of the car position are shown in Table 3. A screenshot of the system is shown in Figure 11b.
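The selection of the closest car per area, and the 50 meter default that penalizes false positives, can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used for the paper: it takes the left area to be [−12, −1.6), the mirror image of the right area; it treats "closest" as the smallest y coordinate; and it only covers the y error, since the text does not say which x value is used when no car is present. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Car = Tuple[float, float]  # (x, y) in meters, relative to the host car

NO_CAR_Y = 50.0  # ground truth y when no car is present, per the text


def area_of(x: float) -> Optional[str]:
    """Map a lateral offset to one of the three areas, or None if outside."""
    if -1.6 <= x <= 1.6:
        return "central"
    if -12.0 <= x < -1.6:
        return "left"
    if 1.6 < x <= 12.0:
        return "right"
    return None


def closest_y(cars: List[Car], area: str) -> float:
    """Smallest y among cars of the given area within 50 m, else 50 m."""
    ys = [y for x, y in cars if area_of(x) == area and 0.0 < y <= NO_CAR_Y]
    return min(ys) if ys else NO_CAR_Y


def y_error(estimated: List[Car], truth: List[Car], area: str) -> float:
    """Absolute y error for one area of one image."""
    return abs(closest_y(estimated, area) - closest_y(truth, area))


# Hypothetical scene: one car ahead, two on the left, one far off to the right.
truth_cars = [(0.5, 30.0), (-3.0, 12.0), (-5.0, 8.0), (20.0, 5.0)]
```

In this example the right area holds no car, so its ground truth is 50 m; a detector that hallucinates a car at 20 m on the right would be charged a 30 m error for that image.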
Figure 12: Activation patterns of neurons. The neurons’
activation patterns display strong correlations with the host
car’s heading, the location of lane markings, and traffic cars.
References
[1] M. Aly. Real time detection of lane markers in urban streets. In Intelligent Vehicles Symposium, 2008 IEEE, pages 7–12. IEEE, 2008.
[2] C.-C. Chang and C.-J. Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[3] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[4] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.
[5] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3d traffic scene understanding from movable platforms. Pattern Analysis and Machine Intelligence (PAMI), 2014.
[6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 2013.
[7] J. J. Gibson. The ecological approach to visual perception. Psychology Press, 1979.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[9] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun. Learning long-range vision for autonomous off-road driving. Journal of Field Robotics, 26(2):120–144, 2009.
[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[11] J. Koutník, G. Cuccu, J. Schmidhuber, and F. J. Gomez. Evolving large-scale neural networks for vision-based torcs. In FDG, pages 206–212, 2013.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[13] P. Lenz, J. Ziegler, A. Geiger, and M. Roser. Sparse scene flow segmentation for moving object detection in urban environments. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pages 926–932. IEEE, 2011.
[14] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun. Off-road obstacle avoidance through end-to-end learning. In Advances in neural information processing systems, pages 739–746, 2005.
[15] G. F. Newell. Nonlinear effects in the dynamics of car following. Operations research, 9(2):209–229, 1961.
[16] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 42(3):145–175, 2001.
[17] D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Technical report, DTIC Document, 1989.
[18] D. A. Pomerleau. Neural network perception for mobile robot guidance. Technical report, DTIC Document, 1992.
[19] S. Ullman. Against direct perception. Behavioral and Brain Sciences, 3(03):373–381, 1980.
[20] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. Deepflow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1385–1392. IEEE, 2013.
[21] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2014.
[22] H. Zhang, A. Geiger, and R. Urtasun. Understanding high-level semantics by modeling traffic patterns. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 3056–3063. IEEE, 2013.