Distant Traffic Light Recognition Using Semantic Segmentation
Abstract
Traffic light recognition is an important task for automatic driving support systems. Conventional traffic light recognition techniques are categorized into model-based methods, which frequently suffer from environmental changes such as sunlight, and machine-learning-based methods, which have difficulty detecting distant and occluded traffic lights because they fail to represent features efficiently. In this work, we propose a method for recognizing distant traffic lights that uses semantic segmentation to extract traffic light regions from images and a convolutional neural network (CNN) to classify the state of the extracted traffic lights. Since semantic segmentation classifies objects pixel by pixel in consideration of the surrounding information, it can successfully detect distant and occluded traffic lights. Experimental results show that the proposed semantic segmentation improves the detection accuracy for distant traffic lights, yielding an accuracy improvement of 12.8% over detection by an object detection method. In addition, our CNN-based classifier identified the traffic light state more than 30% more accurately than color thresholding classification.
1 Chubu University, Kasugai, Japan

Corresponding Author:
Shota Masaki, [email protected]

In an automatic driving support system, pedestrian detection and understanding the surrounding environment are crucial. Traffic light recognition is particularly indispensable. It consists of two steps: detecting a traffic light from an in-vehicle camera image and recognizing the state of the traffic light (i.e., green, red, or other). When we drive a vehicle manually, we check a distant traffic light and adjust the speed accordingly, as the vehicle requires a certain amount of distance to completely stop after braking. The same is true in autonomous driving, and thus it is necessary to detect and classify distant traffic lights.

Conventional traffic light recognition techniques can be categorized into model-based methods and machine-learning-based methods. Model-based methods detect and classify traffic lights by modeling the color and shape of traffic lights. By investigating the color distribution in the light part and setting an appropriate threshold, we can detect the entire traffic light and the light part. However, problems can occur because of environmental changes such as sunlight, which can cause detection failures or mistaken detection of tail lights or street lights. Therefore, model-based methods require fine adjustments to the threshold and parameters depending on the target environment. As for machine-learning methods, they typically utilize convolutional neural network (CNN)-based object detection. However, CNN-based methods do not describe the features of lower resolution objects efficiently, which results in the failure of distant object detection.

In this paper, we propose using CNN-based semantic segmentation for detecting distant traffic lights and a CNN for identifying the state of the traffic light. Semantic segmentation based on a CNN classifies objects pixel by pixel considering the surrounding information, which enables us to detect distant traffic lights, occluded traffic lights, or both. After detecting the traffic lights, we identify the state of those lights by using a CNN classifier. A CNN-based classifier trained with various appearance data can describe features that are more robust to diffuse illuminations and accurately identify the traffic light states. We performed experiments with the Cityscapes dataset (1) to evaluate the accuracies of semantic segmentation and detection of traffic lights over different distances. We also evaluated the classification accuracy of the proposed CNN-based method in comparison with another machine-learning-based method. The contributions of this paper are as follows:
- To achieve distant traffic light recognition, we propose a detection method based on a semantic segmentation framework. We compare the accuracy over different distances by using the Cityscapes dataset.
- To clarify the most accurate traffic light state classification, we evaluate and compare the CNN-based classifier and a conventional machine-learning-based classification method.

Related Work

Traffic Light Recognition

In general, traffic light recognition consists of two steps: detection and classification. Several traffic light recognition methods have been proposed over the last few decades (2-6), including model-based approaches, machine-learning-based approaches, and combinations of the two.

The model-based approaches recognize traffic lights by using color thresholds and object shape information (2), while the machine-learning-based ones utilize a CNN or support vector machine (SVM) (3, 4). As for combinations of these approaches, Soares et al. (5) proposed a method that uses Haralick texture measures and a CNN classifier.

To detect traffic lights from in-vehicle camera images, CNN-based detection methods have been widely investigated and utilized (the details are discussed below). Although most detection methods deal with localization of the target object and object class estimation simultaneously, Gupta et al. (6) proposed a method that first detects traffic lights by a detection method and then classifies them by a graph embedding Grassmann discriminant analysis.

Object Detection

Object detection estimates the object class and its position in an image. Several CNN-based detection methods have been proposed, including two-stage and one-stage detection methods. Examples of the one-stage detection methods include the single shot multibox detector (SSD) (7). SSD detects object candidate regions using object class prediction and default boxes of several aspect ratios for multi-scaled feature maps. To capture efficient multi-scale features, the feature pyramid network (FPN) (8) has been introduced. Zhao et al. (9) proposed an object detection method with FPN called M2Det, which uses a multi-level feature pyramid network (MLFPN) to detect objects more accurately compared with a detection method with a single FPN. In this paper, we focus on detecting small and distant traffic lights. Our method leverages a semantic segmentation framework to detect small objects.

Semantic Segmentation

Semantic segmentation estimates object classes pixel by pixel. As with object detection methods, several CNN-based semantic segmentation methods have been proposed (10-14). The pyramid scene parsing network (PSPNet) (11) introduces a pyramid pooling module in which the application of multi-scaled pooling to feature maps enables both global and local features to be captured. DeepLab-based architectures (12-14) introduce an Atrous Spatial Pyramid Pooling (ASPP) module that applies a pyramid pooling module with atrous convolutions. This enables broader multi-scaled receptive fields to be captured than when using the conventional pyramid pooling module.

Proposed Method

As stated above, traffic light recognition needs to detect lights from an image and identify the state of the detected lights. Model-based approaches often make detection mistakes because of tail lights and street lights, while with machine-learning-based approaches it is difficult to detect distant traffic lights, occluded traffic lights, or both, because of insufficient feature representations for lower resolution objects. In this paper, we propose detecting distant traffic lights by using a semantic segmentation architecture. CNN-based semantic segmentation successfully detects small and distant traffic lights while suppressing incorrect detections and the effect of illumination changes. In the proposed method, we detect traffic lights by estimating a bounding rectangle for the semantic segmentation results and then identify the state of the detected traffic lights by using a CNN classifier. Figure 1 shows the process flow of the proposed method.

Traffic Light Detection

We utilize DeepLab v3plus (14) as the semantic segmentation method. DeepLab v3plus is a segmentation method with an encoder-decoder structure. Multiscale features are acquired by the Atrous Spatial Pyramid Pooling of the encoder. It also uses low-level feature maps to improve the recognition accuracy near object boundaries. The source code used is the TensorFlow implementation (https://github.com/tensorflow/models/tree/master/research/deeplab). We train the segmentation network in an ordinary semantic segmentation manner and then use the trained network to estimate the segmentation results for an in-vehicle camera image. The traffic lights are then detected from these results by estimating bounding rectangles for the traffic light regions, as sketched below.
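The bounding-rectangle step can be implemented with standard connected-component analysis over the predicted class map. The following is a minimal sketch, assuming the network outputs a per-pixel class map and that TRAFFIC_LIGHT_ID matches the Cityscapes label id used during training; the function and variable names are our own illustration, not part of the paper's released code.

```python
import cv2
import numpy as np

TRAFFIC_LIGHT_ID = 6  # assumed Cityscapes train-id for "traffic light"

def detect_traffic_lights(class_map: np.ndarray) -> list[tuple[int, int, int, int]]:
    """Return bounding rectangles (x, y, w, h) of traffic light regions
    found in a per-pixel class map produced by semantic segmentation."""
    # Binary mask of pixels predicted as traffic light.
    mask = (class_map == TRAFFIC_LIGHT_ID).astype(np.uint8)
    # Each connected component is treated as one traffic light candidate.
    num, labels = cv2.connectedComponents(mask)
    boxes = []
    for i in range(1, num):  # label 0 is background
        ys, xs = np.where(labels == i)
        x, y = int(xs.min()), int(ys.min())
        boxes.append((x, y, int(xs.max()) - x + 1, int(ys.max()) - y + 1))
    return boxes
```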
For the backbone, we use Xception65 pre-trained on ImageNet (15) and train for 90,000 iterations. We use SGD as the optimizer, with an initial learning rate of 0.007, a momentum of 0.9, and a weight decay of 0.0005. We use a "poly" policy for scheduling the learning rate.
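The "poly" policy decays the learning rate polynomially toward zero over training. A minimal sketch of the schedule as commonly implemented in the DeepLab codebase (the power of 0.9 is the usual default there and is an assumption; the paper does not state the exponent):

```python
def poly_learning_rate(base_lr: float, step: int, max_steps: int,
                       power: float = 0.9) -> float:
    """'Poly' schedule: decays base_lr toward 0 as step approaches max_steps."""
    return base_lr * (1.0 - step / max_steps) ** power

# Example with the paper's settings: base_lr = 0.007 over 90,000 iterations.
for step in (0, 45_000, 89_999):
    print(step, poly_learning_rate(0.007, step, 90_000))
```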
Traffic Light Classification

The next step is to classify the state of the traffic lights detected by the semantic segmentation. We use color-based (model-based) and CNN-based (machine-learning-based) classifiers and evaluate their accuracies. For the color-based classifier, we use a simple color threshold and object shape features. Specifically, we extract light regions by using the thresholds shown in Table 1 and apply template matching of a circular shape.

Table 1. Threshold in HSV Color Space for Traffic Light State Classification

         H                          S             V
Red      0<H<40 or 160<H<180       70<S<200      100<V<255
Green    55<H<110                  70<S<200      100<V<255

Note: H = hue; S = saturation; V = value.
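As an illustration, the color-based classifier can be written with OpenCV's inRange using the Table 1 thresholds. This is a minimal sketch under our own assumptions (OpenCV's H range of 0-179, which the table's bounds are consistent with, and a cropped light region as input); the circular template matching step is omitted.

```python
import cv2
import numpy as np

# HSV thresholds from Table 1, as (lower, upper) bounds per channel.
RED_RANGES = [((0, 70, 100), (40, 200, 255)),
              ((160, 70, 100), (180, 200, 255))]
GREEN_RANGE = ((55, 70, 100), (110, 200, 255))

def classify_color(crop_bgr: np.ndarray) -> str:
    """Classify a cropped traffic light as red/green/other by in-range pixel counts."""
    hsv = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2HSV)
    red = sum(int(cv2.inRange(hsv, np.array(lo), np.array(hi)).sum())
              for lo, hi in RED_RANGES)
    lo, hi = GREEN_RANGE
    green = int(cv2.inRange(hsv, np.array(lo), np.array(hi)).sum())
    if max(red, green) == 0:
        return "other"
    return "red" if red >= green else "green"
```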
We use ResNet50 (16) as the CNN-based traffic light state classifier. To train the ResNet50 for traffic light classification, we fine-tune a network pre-trained on the ImageNet dataset (15) for a 3-class classification problem; the output labels are red, green, and other. When using a CNN to classify the traffic light, we need to make the input images the same size, so we crop the traffic light image into a square that includes the estimated bounding rectangle. Then, the cropped image is resized to 224 × 224 pixels.
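The square crop around the estimated bounding rectangle could be done as follows. This is a sketch under our own assumptions (the square is centered on the rectangle and clipped at image borders; the paper does not specify these details):

```python
import cv2
import numpy as np

def crop_square(image: np.ndarray, box: tuple[int, int, int, int],
                out_size: int = 224) -> np.ndarray:
    """Crop a square containing the bounding rectangle and resize it."""
    x, y, w, h = box
    # Square side covering the rectangle, clamped to the image size.
    side = min(max(w, h), image.shape[0], image.shape[1])
    cx, cy = x + w // 2, y + h // 2
    # Center the square on the rectangle, shifting it inward at borders.
    x0 = int(np.clip(cx - side // 2, 0, image.shape[1] - side))
    y0 = int(np.clip(cy - side // 2, 0, image.shape[0] - side))
    crop = image[y0:y0 + side, x0:x0 + side]
    return cv2.resize(crop, (out_size, out_size))
```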
The ResNet50 used is the PyTorch implementation (https://github.com/pytorch/vision). We use SGD as the optimizer. The initial learning rate is set to 0.1, momentum to 0.9, and weight decay to 0.0005. The learning rate is multiplied by 0.1 at 50, 100, and 200 epochs.
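This training setup maps directly onto torchvision and the standard PyTorch schedulers. A minimal sketch with the stated hyperparameters (the loss function and data pipeline are our assumptions; the paper does not specify them):

```python
import torch
from torchvision import models

# ImageNet-pretrained ResNet50, re-headed for 3 classes: red, green, other.
model = models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 3)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0005)
# Learning rate multiplied by 0.1 at epochs 50, 100, and 200.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50, 100, 200], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()  # assumed; the paper does not name the loss
```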
Experiments

In this section, we evaluate the proposed method from the viewpoints of traffic light detection and traffic light state classification. For the object detection evaluation, we also evaluate the detection accuracy over different traffic light distances.

Dataset

We used the Cityscapes dataset (1) in the following experiments. It contains in-vehicle camera images taken in 50 German cities and has 30-class segmentation labels (road, pedestrian, traffic light, etc.). There are 2,975 images with fine annotation labels for training, 500 for validation, and 1,525 for testing. In our experiment, we use the training and validation samples with 19 classes for semantic segmentation. However, there is no annotation of traffic light states, so we annotate ground truth labels for traffic lights whose light state can be recognized. Moreover, to evaluate the accuracy of the proposed method in relation to traffic light distance, we use the disparity data originally included in the Cityscapes dataset. We denote the focal length as f, the distance between the stereo cameras as T, and the disparity of a target object as x_l - x_r. The distance Z is then given by

Z = fT / (x_l - x_r).  (1)
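Equation 1 is ordinary stereo triangulation. For example, with a focal length of 2,262 px and a baseline of 0.22 m (values of roughly the order of the Cityscapes stereo rig; treat them as illustrative assumptions), a disparity of 5 px puts an object at about 100 m:

```python
def distance_from_disparity(f_px: float, baseline_m: float,
                            disparity_px: float) -> float:
    """Equation 1: Z = f * T / (x_l - x_r)."""
    return f_px * baseline_m / disparity_px

# Illustrative numbers only, not the paper's calibration:
print(distance_from_disparity(2262.0, 0.22, 5.0))   # ~99.5 m
print(distance_from_disparity(2262.0, 0.22, 2.5))   # ~199 m
```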
Using the distance information acquired from the disparity, we check the visibility of traffic lights at different distances. Figure 2 shows that at distances of 100 and 200 m, the traffic lights become much smaller in the image, making them difficult to detect.

To train the detection methods, we use coordinates extracted from the semantic segmentation labels. For traffic light state classification, we build a dataset by trimming the traffic light region from the images and segmentation labels of the Cityscapes dataset. We trim each image to a square and resize it to 224 × 224 pixels to match the network input size. Trimmed images are annotated as red, green, or other.

Baselines and Evaluation Metrics

As a comparative method for traffic light detection, we use M2Det (9), which is a state-of-the-art object detection method. For the state classification, we evaluate performance by recall:

Recall = TP / (TP + FN).  (2)

Our semantic segmentation also detects back-facing traffic lights and other objects. Such objects should not be classified as red or green traffic lights. Therefore, in the evaluation of red and green lights, we additionally consider misdetections, that is, FP.
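The evaluation thus reduces to tallying TP, FN, and FP per class and applying Equation 2. A minimal sketch, assuming (ground truth, prediction) pairs that have already been matched to traffic light instances; the helper name is our own:

```python
from collections import Counter

def recall_and_fp(pairs: list[tuple[str, str]], target: str) -> tuple[float, int]:
    """pairs: (ground_truth, prediction) per matched object.
    Returns recall = TP / (TP + FN) for `target`, plus the FP count,
    where FP counts non-target objects predicted as `target`."""
    counts = Counter()
    for gt, pred in pairs:
        if gt == target:
            counts["TP" if pred == target else "FN"] += 1
        elif pred == target:
            counts["FP"] += 1
    positives = counts["TP"] + counts["FN"]
    recall = counts["TP"] / positives if positives else 0.0
    return recall, counts["FP"]

# Example: two red lights (one missed) and one back-facing object called red.
print(recall_and_fp([("red", "red"), ("red", "other"), ("other", "red")], "red"))
```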
Results for Traffic Light Classification

Figure 5 shows the results for the traffic light classification. The color-based classification failed to classify red lights under the brighter environment (sunlight). Table 3 shows the recalls for the light state classification. Recall of the color-based (model-based) classifier was about 60%. This is because the light regions of traffic lights are affected by sunlight, which increases the number of pixels whose color falls outside the threshold and makes it difficult to tune the color-based method's thresholds finely. In contrast, recall of the CNN-based method was over 90% for both red and green lights.

The classification results considering FP are shown in Table 4. The FP rate of the CNN-based classifier exceeded that of the color-based classifier by 5.05 percentage points for red and 0.97 points for green. However, the FN rate of the CNN was reduced by about 27 points compared with the color-based method, which means the CNN achieved a higher classification performance. These results demonstrate that, although FP of the CNN-based classifier was slightly higher, FN was greatly reduced and TP showed the highest values.
Figure 3. Traffic light detection results on the Cityscapes dataset (1): (a) detection results by M2Det (9) and (b) semantic-segmentation-based detection results by DeepLab v3plus (14).
Note: In (a), green bounding boxes indicate lights detected by M2Det. In (b), red circles indicate detected traffic lights; in the result on the right, our segmentation-based detection could detect a traffic light while the state-of-the-art detection method failed.

Figure 5. Visualization of traffic light classification results by (a) color-based classifier and (b) CNN-based classifier.
Note: Red and green bounding boxes indicate that the classifier decided on red and green lights, respectively. CNN = convolutional neural network.
Table 4. Classification Result Considering Misdetections (FP)

Classifier         Class    TP (%)    FN (%)    FP (%)
Color threshold    Red      60.60     29.44     10.05
                   Green    58.12     35.07      6.81
CNN                Red      86.53      1.63     15.10
                   Green    86.53      5.70      7.78

Note: TP = successfully classified traffic lights; FN = missed or misclassified traffic lights; FP = misdetections classified as the given state; CNN = convolutional neural network.

Table 5. Accuracy of Traffic Light Recognition through the Entire Process

Detector         Classifier         Class    TP (%)    FN (%)    FP (%)
M2Det            Color threshold    Red      46.75     47.19      6.06
                                    Green    50.79     47.09      2.12
                 CNN                Red      47.55     34.34     18.11
                                    Green    63.21     32.64      4.15
DeepLab v3plus   Color threshold    Red      53.69     35.25     11.07
                                    Green    50.74     40.39      8.87
                 CNN                Red      75.39      9.38     15.23
                                    Green    74.04     14.91     11.06

Note: TP = successfully recognized traffic lights; FN = missed or misclassified traffic lights; FP = misdetections; CNN = convolutional neural network.
Conclusion

In this paper, we proposed a method to recognize distant traffic lights, occluded traffic lights, or both, from in-vehicle camera images. The proposed method consists of two steps: semantic-segmentation-based traffic light detection and classification with a CNN to identify the traffic light states. Experimental results showed that our semantic-segmentation-based detection can detect distant traffic lights with more accuracy than another object detection method. Moreover, for state classification, our CNN-based classifier achieved higher recognition accuracy than the model-based method. In particular, the CNN-based method was more robust to images captured under large illumination changes. The features of traffic signals do not change significantly in different environments; the most influential cause of degradation in state identification performance is considered to be sunlight. Since the proposed method can recognize traffic signals even under the influence of such light, it is considered to be robust to environmental changes. On the other hand, the semantic segmentation used for detection has a problem in that the recognition accuracy for domains that are not included in the training data (e.g., night and snow) is greatly reduced. Our future work will include developing a semantic segmentation method that can deal with the differences over different environments and regions universally.

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: Shota Masaki, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi; data collection: Shota Masaki; analysis and interpretation of results: Shota Masaki, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi; draft manuscript preparation: Shota Masaki, Tsubasa Hirakawa. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (SIP), Automated Driving for Universal Services/Research on the recognition technology required for automated driving technology (levels 3 and 4).

References

1. Cordts, M., M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. Proc., IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, IEEE, New York, 2016, pp. 3213-3223.
2. Siogkas, G., E. Skodras, and E. Dermatas. Traffic Lights Detection in Adverse Conditions Using Color, Symmetry and Spatiotemporal Information. Proc., International Conference on Computer Vision Theory and Applications (VISAPP), Rome, Italy, 2012, pp. 620-627.
3. Weber, M., P. Wolf, and J. M. Zöllner. DeepTLR: A Single Deep Convolutional Network for Detection and Classification of Traffic Lights. Proc., IEEE Intelligent Vehicles Symposium (IV), Gothenburg, Sweden, 2016, pp. 342-348.
4. Behrendt, K., L. Novak, and R. Botros. A Deep Learning Approach to Traffic Lights: Detection, Tracking, and Classification. Proc., IEEE International Conference on Robotics and Automation (ICRA), Singapore, IEEE, New York, 2017, pp. 1370-1377.
5. da Silva Soares, J. C., T. B. Borchartt, A. C. de Paiva, and A. de Almeida Neto. Methodology Based on Texture, Color and Shape Features for Traffic Light Detection and Recognition. Proc., International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 2018, pp. 1-7.
6. Gupta, A., and A. Choudhary. A Framework for Traffic Light Detection and Recognition Using Deep Learning and Grassmann Manifolds. Proc., IEEE Intelligent Vehicles Symposium (IV), Paris, France, 2019, pp. 600-605.
7. Liu, W., D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. In Lecture Notes in Computer Science: European Conference on Computer Vision (ECCV) (B. Leibe, J. Matas, N. Sebe, and M. Welling, eds.), Vol. 9905, Amsterdam, The Netherlands, October 8-16, 2016, Springer, Cham, pp. 21-37.
8. Lin, T. Y., P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature Pyramid Networks for Object Detection. Proc., IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, IEEE, New York, 2017, pp. 936-944.
9. Zhao, Q., T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling. M2Det: A Single-Shot Object Detector Based on Multi-Level Feature Pyramid Network. Proc., 33rd AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, 2019, pp. 9259-9266.
10. Long, J., E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. Proc., IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, IEEE, New York, 2015, pp. 3431-3440.
11. Zhao, H., J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. Proc., IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, IEEE, New York, 2017, pp. 6230-6239.
12. Chen, L., G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 40, No. 4, 2018, pp. 834-848.
13. Chen, L.-C., G. Papandreou, F. Schroff, and H. Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint arXiv:1706.05587, 2017. http://arxiv.org/abs/1706.05587.
14. Chen, L. C., Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Proc., European Conference on Computer Vision (ECCV), Munich, Germany, 2018, pp. 833-851.
15. Deng, J., W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. Proc., IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, IEEE, New York, 2009, pp. 248-255.
16. He, K., X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. Proc., IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, IEEE, New York, 2016, pp. 770-778.