Traffic Density Controller
Whereas a feed-forward neural network was employed for crowd counting [16-18], the present study adopted a deep convolutional neural network (CNN) to estimate the number of vehicles from a video shoot. CNNs have recently recorded great success in recognizing medical CT images and human faces in the field of computer vision [19-21]. The present study began with the expectation that a CNN would perform well in counting vehicles, a task much simpler than recognizing medical CT images or human faces.

The next section describes the entire framework of the present vehicle-counting scheme and expounds on the principle of the CNN. How to collect data to train and test a CNN is described in the third section. The counted results and comparisons with those from the most prevalent methodologies, as well as with those from other previous studies adopting various methodologies, are shown in the fourth section. The fifth section draws conclusions and suggests possible extensions of the present study.

II. MODELING FRAMEWORK

Preparing data to feed a CNN is the starting point of the present study. The input features of a CNN are the RGB values of an image at the pixel level. Whereas a CNN requires no preprocessing to extract input features, each input image must have a label, since a CNN belongs to the category of supervised machine learning. In the present study, vehicles within each input image were counted manually in order to tag a label to the image. This labeling task is easier than that performed by existing CNNs to detect objects, which requires drawing a bounding box for each target object. Nonetheless, it may take great effort to manually count vehicles in all input images. An efficient way to circumvent this difficulty will be suggested in the fourth section.

Input images to train and test a CNN model were obtained from video shoots taken at the approach of an actual intersection. One video frame per second was chosen to prepare the input images. Most machine learning models tend to over-fit the training data. To avoid over-fitting, the trained model should be validated against a new dataset that was never used in the training stage. Thus, it is important to divide the available input images into a training set and a test set. After dividing the input data, the training set was augmented using various filters, so that the CNN model could accommodate situations that the original training data did not account for.

Unfortunately, at the present time there is no systematic way to determine the best model structure for a CNN within a practical computing time. A plausible model structure must be selected by trial and error. While searching for the best model structure, hyper-parameters should be determined on a third dataset other than the training and test datasets; to establish the model structure, 5% of the training images were set aside. After training, the model performance was evaluated and compared based on the test data that had been separated from the training data. The background subtraction method was chosen as a baseline to verify the utility of the present model. Finally, the comparison was conducted based on three performance indices: the mean absolute error (MAE), the correlation coefficient with observed numbers, and the percent root mean squared error (%RMSE).

A CNN model played a key role within the entire modeling framework by counting vehicles from a video image. However, it is difficult to explain what mechanism makes it possible to count vehicles. The most plausible way to guess at the mechanics is to investigate the high-level features that the CNN extracts via its filters. The features that the present model extracted from the traffic images are shown and discussed in the fourth section.

Fig. 1. CNN model structure.

Fig. 1 shows the structure of the CNN model that was adopted in the present study. The original high-resolution (90×600) input images were downsized to a tractable dimension (30×200). Since each input image was in color with RGB values, the dimensions of the input image were 3×30×200. The first convolution layer was created using 40 3×3×3 filters, each of which slid through an input image with a stride of 1. Each cell value of the convolutional hidden layer was computed as the linear combination of the weights of a filter and the values of the portion of the target image that the filter covered, and was then activated by a rectified linear unit (ReLU). The ReLU outperforms the conventional sigmoid function, which is one of the recent breakthroughs in deep learning [22]. At this stage, each filter captured its own basic feature regardless of the feature's location within the image. In addition, using filters had the advantage of reducing the number of weight parameters to be estimated, since each filter shared its weight parameters wherever it resided within an image. To avoid a layer-by-layer dimensionality reduction, target images were padded with null columns and rows consisting of 0s prior to convolving the filters. After convolution, a new layer was created by pooling each 5×5 block of cells of the convolved layer with average values, which had a smoothing effect on the images.

At the next stage, a second convolution layer was created by allowing 80 2×2×40 filters to slide through the previous pooled layer. The second-level convolution filters extracted more complex features than those elicited from the first-level filters. After average pooling again, the second convolutional hidden layer was flattened to facilitate connection to a generic hidden layer. The connection between the flattened layer and the next fully connected layer was the same as that between
two consecutive hidden layers of a feed-forward neural network. The fully connected layer linearly fed the final output layer of a single node that represented the observed number of vehicles.

III. LITERATURE SURVEY

A CNN is known to recognize objects irrespective of scale, location, or orientation. In particular, one of the main motivations of the present study was to confirm whether a CNN can count partially occluded vehicles. Also, real-world traffic images may contain either a few instances of vehicles or a very large number of them. Whether a CNN can count vehicles regardless of the congestion level was another issue that the present study tried to resolve. Answers to these questions will be clearly presented in the fourth section.

The training method of a CNN is no different from that of a feed-forward neural network. The basic theory is to derive weight parameters that minimize the sum of squared errors between observed and estimated output values, which is formulated as a loss function. A back-propagation algorithm is used to derive the gradient of the loss function with respect to each weight parameter. The algorithm, however, has a fatal drawback: the error derivatives are likely to shrink as they are propagated from the top to the bottom layers, a phenomenon referred to as the vanishing gradient problem. Owing to the adoption of a ReLU for activating the node values instead of the conventional sigmoid function, the back-propagation algorithm successfully trained the proposed CNN model. A ReLU maps node values greater than 0 to themselves and values less than 0 to 0, which prevents the vanishing gradient problem. Readers who are interested in the details of CNNs can refer to this framework.

Another advantage of a CNN is that the number of weight parameters to be estimated can be reduced considerably compared with a conventional feed-forward neural network. A feed-forward neural network has a large number of weight parameters because each cell of an input image must connect to all hidden nodes of the second hidden layer. A CNN, however, takes only the filter parameters into account, which makes it possible to recognize a large-dimension image.

IV. CONCLUSIONS

The present study demonstrated a novel approach to counting vehicles on a road segment in order to accurately quantify traffic density at an aggregate level for traffic control and management. The approach succeeded in counting vehicles with an acceptable accuracy that was comparable to the results of existing methodologies. It was concluded that the proposed CNN model is applicable to measuring traffic density at the HCM level.

However, further studies will be necessary to tackle several difficulties regarding the proposed approach. Even though the proposed model required no hand-crafted feature engineering, how to determine the hyper-parameters of a CNN was not broached. A CNN model contains several hyper-parameters, such as the number of hidden layers, the number of hidden nodes within each hidden layer, the filter size for each convolutional hidden layer, and the number of filters used for each hidden layer. A systematic way to determine the optimal values of these hyper-parameters will govern the performance of counting vehicles.

In addition, vehicle details were ignored when counting vehicles in the present CNN model; namely, the CNN model counted vehicles regardless of size, make, and type. Of course, the purpose of counting vehicles in the present study was confined to evaluating traffic flows at the aggregate level. However, in the future, advanced counting technology should recognize the details of each vehicle. In particular, distinguishing between moving and stopped vehicles is very important for traffic control and management. A CNN model based on consecutive images is now under construction to count vehicles while discerning whether each vehicle is moving or not. If this succeeds, the next version of the present study will measure the space mean speed as well as the traffic density. The space mean speed is another important parameter in traffic engineering, and it cannot be measured directly with existing surveillance systems.

V. RESULT/OUTPUT

An innovative system for detecting and extracting vehicles in traffic surveillance scenes is presented. The system locates moving objects in complex road scenes by implementing an advanced background subtraction methodology. The innovation concerns a histogram-based filtering procedure that collects the scattered background information carried in a series of frames, at the pixel level, to generate reliable instances of the actual background. The proposed algorithm reconstructs a background instance on demand under any traffic conditions. The rationale of the approach is to detect the moving objects from the difference between the current frame and a reference frame, often called the "background image" or "background model". Background subtraction is mostly performed when the image in question is part of a video stream. It provides important cues for numerous applications in computer vision, for example surveillance tracking or human pose estimation.

REFERENCES

[1] P. Ryus, M. Vandehey, L. Elefteriadou, R. G. Dowling, and B. K. Ostrom, Highway Capacity Manual 2010. Washington, D.C.: Transportation Research Board, 2010.
[2] S. Messelodi, C. M. Modena, and M. Zanin, “A computer vision system for the detection and classification of vehicles at urban road intersections,” Pattern Anal. Appl., vol. 8, no. 1–2, pp. 17–31, Sep. 2005.
[3] N. Buch, J. Orwell, and S. A. Velastin, “Urban road user detection and classification using 3-D wireframe models,” IET Comput. Vis. J., vol. 4, no. 2, pp. 105–116, Jun. 2010.
[4] H. Veeraraghavan, O. Masoud, and N. Papanikolopoulos, “Vision-based monitoring of intersections,” in Proc. IEEE 5th Int. Conf. Intell. Transp. Syst., Singapore, 2002, pp. 7–12.
[5] K. Park, D. Lee, and Y. Park, “Video-based detection of street-parking violation,” in Proc. Int. Conf. Image Process., Comput. Vis., Pattern Recognit., Las Vegas, NV, 2007, vol. 1, pp. 152–156.
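To make the comparison metrics concrete, the three performance indices named in the modeling framework (the MAE, the correlation coefficient with observed counts, and the %RMSE) can be computed as in the minimal sketch below. Two assumptions are made, since the text does not spell them out: the correlation coefficient is taken to be the Pearson correlation, and %RMSE is taken as the RMSE normalized by the mean observed count; the per-image vehicle counts are hypothetical.

```python
import numpy as np

def mae(y_obs, y_est):
    # Mean absolute error between observed and estimated vehicle counts.
    return float(np.mean(np.abs(y_obs - y_est)))

def pearson_r(y_obs, y_est):
    # Correlation coefficient with the observed numbers (Pearson form,
    # an assumption about the paper's definition).
    return float(np.corrcoef(y_obs, y_est)[0, 1])

def pct_rmse(y_obs, y_est):
    # Percent RMSE, normalized here by the mean observed count
    # (an assumption about the paper's definition).
    rmse = np.sqrt(np.mean((y_obs - y_est) ** 2))
    return float(100.0 * rmse / np.mean(y_obs))

# Hypothetical per-image vehicle counts, for illustration only.
observed = np.array([12.0, 8.0, 15.0, 20.0, 5.0])
estimated = np.array([11.0, 9.0, 14.0, 21.0, 5.0])

print(mae(observed, estimated))                  # 0.8
print(round(pearson_r(observed, estimated), 4))  # 0.986
print(round(pct_rmse(observed, estimated), 2))   # 7.45
```

All three indices are computed on the held-out test set, never the training set, in line with the split described in the modeling framework.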
T-ITS-16-09-0605.R1 4
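The background-subtraction rationale of the result section, detecting moving objects from the difference between the current frame and a reference background model, can be illustrated with a short numpy sketch. The per-pixel median background and the fixed difference threshold used here are stand-in simplifications, not the histogram-based filtering procedure described above, and the frames are synthetic.

```python
import numpy as np

def estimate_background(frames):
    # Build a reference "background model" from a stack of frames.
    # A per-pixel median is a simple stand-in for the histogram-based
    # filtering procedure described in the text.
    return np.median(frames, axis=0)

def foreground_mask(frame, background, threshold=25):
    # Flag pixels whose difference from the reference frame exceeds a
    # threshold as moving objects (the threshold value is illustrative).
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold

# Synthetic grayscale frames: a static scene, plus a current frame in
# which a bright 5x10 "vehicle" patch has appeared.
rng = np.random.default_rng(0)
static = rng.integers(0, 30, size=(20, 40)).astype(np.uint8)
frames = np.stack([static] * 10)
current = static.copy()
current[5:10, 15:25] = 200  # the moving object in the current frame

bg = estimate_background(frames).astype(np.uint8)
mask = foreground_mask(current, bg)
print(mask.sum())  # 50 pixels flagged as foreground
```

In a real deployment the background model would be refreshed as traffic conditions change, which is exactly the on-demand reconstruction the proposed algorithm is claimed to provide.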