Article
Estimation of the Number of Passengers in a Bus
Using Deep Learning
Ya-Wen Hsu, Yen-Wei Chen and Jau-Woei Perng *
Department of Mechanical and Electro-Mechanical Engineering, National Sun Yat-sen University,
Kaohsiung 804201, Taiwan; [email protected] (Y.-W.H.); [email protected] (Y.-W.C.)
* Correspondence: [email protected]; Tel.: +886-7-525-2000 (ext. 4281)
Received: 28 February 2020; Accepted: 7 April 2020; Published: 12 April 2020
Abstract: For the development of intelligent transportation systems, real-time information on the
number of people on buses not only helps transport operators schedule their fleets but also makes it
easier for passengers to plan their travel times accordingly.
This study proposes a method for estimating the number of passengers on a bus. The method is based
on deep learning to estimate passenger occupancy in different scenarios. Two deep learning methods
are used to accomplish this: the first is a convolutional autoencoder, mainly used to extract features
from crowds of passengers and to determine the number of people in a crowd; the second is the you
only look once version 3 architecture, mainly for detecting the area in which head features are clearer
on a bus. The results obtained by the two methods are summed to calculate the current passenger
occupancy rate of the bus. To demonstrate the algorithmic performance, experiments estimating the
number of passengers at different times of day and at different bus stops were performed. The results indicate
that the proposed system performs better than some existing methods.
Keywords: crowd density estimation; deep learning; object detection; passenger counting; YOLOv3
1. Introduction
In recent years, with the rapid development of technologies such as sensing, communication,
and management, improving the efficiency of traditional transportation systems through advanced
technological applications is becoming more feasible. Therefore, intelligent transportation systems have
gradually become a focus of transportation development around the world. Currently, many related
applications exist in the field of bus information services; for example, people can utilize a dynamic
information web page or a mobile app to inquire about the location and arrival time of every bus.
If more comprehensive information is provided on existing bus information platforms, the quality of
public transport services will be significantly improved. Thus, the number of passengers willing to use
public transport will increase. The information regarding the load on each bus is critical to the control
and management of public transport. The ride information includes the number of passengers entering
and leaving at each stop and the number of passengers remaining on the bus. Through intelligent
traffic monitoring, passengers can preview a bus’s occupancy status in real time and then make a
decision based on the additional information and evaluate the expected waiting time. Furthermore, passenger
transport operators can manage vehicle scheduling based on this information, effectively reducing
operational costs without degrading service quality while providing passengers with more useful
ride information.
In the past, some research groups have explored the counting of passengers on buses.
Luo et al. proposed extracting footprint data through contact pedals to determine the directions
in which passengers entered and left the bus [1]. Oberli et al. proposed radio-frequency
identification (RFID) technology to implement passenger counting, although the recognition result is
susceptible to the position of the RF antenna and the direction of radiation [2]. With the popularity
of surveillance cameras and advances in computer vision technology, image-based people-counting
methods have been continuously proposed [3–10]. References [3–5,10] employed images from actual
buses as experimental scenes. The images were captured from a camera mounted on the ceiling of a
bus, with an almost vertical angle of view. In these works, the heads of the passengers were a noticeable
detection feature in the images, and their methods included motion vectors, feature-point-tracking,
and hair color detection. The authors in a recent article [10] proposed a deep learning-based
convolutional neural network (CNN) architecture to detect passengers’ heads. Reference [11] proposed
a counting method by combining the Adaboost algorithm with a CNN for head detection. This study
was divided into three phases; namely, two offline training phases and one online detection phase.
The first phase used the Adaboost algorithm to learn the features obtained from the histogram of
oriented gradients (HOG). The second phase established a new dataset from the preliminary results
detected in the first phase as the modeling data for training the CNN. The resulting model was used as
a classifier for the head feature in the online detection phase.
The above studies show that, although object detection technology is mature, a single algorithm is
unlikely to achieve high detection performance in complex environments and crowd-intensive situations.
Existing works on counting bus passengers are generally concerned with the image of the bus door and
ignore the overall situation inside the bus. Over time, however, the estimated number of passengers
on the bus becomes significantly less accurate owing to the accumulation of counting errors as
passengers get on and off the bus. Only a few
studies have directly calculated numbers of passengers on buses. Reference [12] proposed a shallow
CNN architecture suitable for estimating the level of crowdedness in buses, which was a modification
of the multi-column CNN architecture. A fully connected layer was added to the network output,
and then the output of the model was classified into five levels of crowdedness [12].
In terms of crowd density estimation, the regression model was used in [13–18] to calculate the
number of people in a crowd in a public scene. This approach mitigates the occlusion problem between
people in a crowd. The first step is to obtain the necessary characteristic information from
the image, such as edge, gradient, and foreground features. Then, linear regression, ridge regression,
or Gaussian process regression is used to learn the relationship between the crowd features and the
number of people. In [19], the image is first divided into small blocks, and multiple information
sources are used to cope with extremely dense crowds and complex backgrounds. The number of
individuals in an extremely dense crowd is estimated from a single image: at various scales,
HOG-based detection and Fourier analysis are used to detect people's heads, and interest points are
used to count local neighbors. Each block then obtains a count from these three different estimates,
and the total number of people in the image is the sum over all blocks. The author in [20] proposed a
new supervised learning architecture that uses the crowd density image corresponding to an original
crowd image to obtain the number of people in an area by integrating over that region. The loss
function is defined as the difference between the estimated density map and the ground-truth density
map and is used to guide learning. This work provided a direction for subsequent deep learning
applications in crowd density estimation.
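The density-map idea in [20] can be made concrete with a few lines of code: the crowd count is the integral (sum) of the density map, and training penalizes the difference between the estimated and ground-truth maps. The sketch below is only illustrative; the simple mean-squared-error form is our assumption rather than the exact loss used in [20].

```python
import numpy as np

def count_from_density_map(density_map: np.ndarray) -> float:
    # The crowd count is the integral of the density map over the image area.
    return float(density_map.sum())

def density_loss(pred_map: np.ndarray, gt_map: np.ndarray) -> float:
    # Penalize the pixel-wise difference between the estimated and
    # ground-truth density maps (a simple mean squared error here).
    return float(np.mean((pred_map - gt_map) ** 2))
```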
The CNN architecture is used in [21–25] to solve the problems of severe occlusion and distortion
of the scene perspective. The CNN implements an end-to-end training process that inputs the raw
data and outputs the final result without the need to pass through foreground segmentation or
feature extraction. Using supervised learning, the original image is passed through multiple convolutional
layers to produce the corresponding density map. The authors in [26,27] proposed the use of a
multi-column CNN (MCNN) architecture to map the relationship between crowd images and density
images. The filters corresponding to the different columns of the convolutional layers can perform
operations on images of different input sizes. Thus, the problem concerning the different sizes of
people's heads in an image is solved. Finally, the three-column CNN is linearly weighted to obtain
the corresponding number of people in the original image. The author in [28] reported that crowd
estimation involves three characteristics; namely, the entire body, head, and background. The body
feature information helps to judge whether a pedestrian exists at a specific position. However,
existing density labels are mostly generated according to the method in [20]; adding semantic
structure information therefore improves the accuracy of the data labeling and makes the estimation
of crowded scenes more effective. The authors in [29] improved the
scale-adaptive CNN (SaCNN) proposed in [25]. Adaptive Gaussian kernels are used to estimate
the parameter settings for different head sizes. The original image is reduced to a size of 256 × 256,
which improves the diversity of training samples and increases the network’s generalization ability.
The authors in [30] proposed a multi-column multi-task CNN (MMCNN), in which the training input is a
single-channel image of 960 × 960 pixels. During training, each image is evenly divided into 16
non-overlapping blocks to avoid over-fitting and poor generalization. In the evaluation phase, the
image is divided into blocks ranging from 120 × 120 to 480 × 480 for input; all block images are
extracted sequentially, and the outputs of the three columns are merged into the multi-task result,
which comprises a mask of the crowd density map, a congestion-degree analysis, and masks of the
foreground and background.
Although bus crowdedness is evaluated in [12], the network output is only a coarse classification
level. In the present study, we propose the use of a deep learning object detection
method and the establishment of a convolutional autoencoder (CAE) to extract the characteristics of
passengers in crowded areas to evaluate the number of people on a bus. The results of these two
methods are summed into the total number of passengers on the bus.
The rest of the paper is organized as follows. In Section 2, the proposed system is described.
In Section 3, the proposed methodology is introduced. Section 4 presents the experimental results and
evaluates the proposed schemes. Finally, conclusions and discussions on ideas for future work are
provided in Section 5.
The proposed method consists of the following two parts:
1. Passenger counting based on object detection: For areas where the head features of passengers
are more visible, the deep learning object detection method is employed to calculate the number
of passengers in the image. The you only look once version 3 (YOLOv3) network model with a
high detection rate was selected.
2. Passenger counting based on density estimation: We propose a CAE architecture suitable for
scenarios of crowded areas on a bus. This model filters all objects in the original image that do not
possess passenger characteristics and outputs the feature information of the passengers in
the image. A sketch of how the two counts can be combined is given after this list.
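The combination of the two components can be sketched as follows. This is a minimal illustration, not the authors' implementation: the arguments `detect_heads_yolov3` and `predict_density_cae` stand for the trained YOLOv3 and CAE models described later, and the crowded-area rectangle is a placeholder for the region defined in Figure 4.

```python
import numpy as np

# Hypothetical crowded-area rectangle in the front-camera image (x, y, w, h).
CROWDED_AREA = (0, 120, 320, 240)

def count_passengers(front_img, rear_img, detect_heads_yolov3, predict_density_cae):
    """Combine YOLOv3 head detection with CAE density estimation.

    detect_heads_yolov3(image) -> list of head bounding boxes (x, y, w, h)
    predict_density_cae(patch) -> 2D density map (numpy array)
    Both callables are assumed to wrap the trained models described in the paper.
    """
    x, y, w, h = CROWDED_AREA
    crowded_patch = front_img[y:y + h, x:x + w]

    # Density-based count for the crowded area of the front camera view.
    density_map = predict_density_cae(crowded_patch)
    crowded_count = float(np.sum(density_map))

    # Detection-based count for areas with clearly visible heads:
    # the rest of the front view and the whole rear view.
    front_heads = [b for b in detect_heads_yolov3(front_img)
                   if not _inside(b, CROWDED_AREA)]
    rear_heads = detect_heads_yolov3(rear_img)

    return round(crowded_count) + len(front_heads) + len(rear_heads)

def _inside(box, area):
    # A detection is assigned to the crowded area if its center falls inside it.
    bx, by, bw, bh = box
    ax, ay, aw, ah = area
    cx, cy = bx + bw / 2, by + bh / 2
    return ax <= cx <= ax + aw and ay <= cy <= ay + ah
```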
Figure 2. Positions of the surveillance cameras: (a) internal configuration of the bus; images from the (b) front camera and (c) rear camera.
For passenger counting based on object detection, a total of 5365 images from inside the bus
recorded from February to November 2018 were utilized as the training set. Moreover, 500 sets of
in-vehicle images from November 2018 to May 2019 were utilized for the experimental tests. We defined
the images simultaneously captured from the front and rear cameras as the same set. In this dataset,
the periods of different lighting conditions are included, which are the daytime scene, nighttime
interior lighting scene, and scenes affected by light, as shown in Figure 3a–c. Some of the most common
situations in the images of a bus are shown in Figure 3; the pictures in the first and second rows present
different cases for the front and rear cameras, respectively. In the image of the daytime scene, we can
easily observe the current number of passengers on the bus. Meanwhile, in the nighttime scene, images
from the front camera are sometimes disturbed by the fluorescent light of the bus, and images from
the rear camera show the colors of some passengers’ heads as similar to the scene outside the bus.
In the scenes affected by light, images from the front camera have uneven light and shade due to
sunlight exposure.
In the images from the front camera, the range of passenger congestion occurs in a fixed area.
Therefore, we define the space of passenger congestion on the bus, as shown in Figure 4. When this
area is filled with passengers, serious occlusion problems generally occur, resulting in less distinct
head characteristics of the passengers. A convolutional autoencoder model based on a density map was
used to predict the number of passengers in this area.
For training the passenger density data, a total of 3776 images of crowded areas inside the bus
were marked. To increase the diversity of the training images, data augmentation was performed on each
image by adding Gaussian noise and adjusting the brightness, as shown in Figure 5. The original dataset
was thereby augmented to 11,328 images. The images of crowded areas were divided into training and validation
samples to train the neural network, and the segmentation ratio of the two samples was 7:3.
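The augmentation step can be reproduced with a few NumPy/OpenCV operations, as in the sketch below; the darkening factor and noise standard deviation are illustrative placeholders, since the paper does not report the exact parameters.

```python
import cv2
import numpy as np

def augment(image: np.ndarray):
    """Return darkened and noise-added variants of a crowded-area image."""
    # Brightness adjustment (darkening), cf. Figure 5b; the factor is an assumption.
    darkened = cv2.convertScaleAbs(image, alpha=0.6, beta=0)

    # Additive Gaussian noise, cf. Figure 5c; the sigma value is an assumption.
    noise = np.random.normal(0, 10, image.shape)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    return darkened, noisy
```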
Figure 3. Examples of in-vehicle images obtained from each surveillance camera: (a) daytime scene, (b) nighttime interior lighting scene, (c) scenes affected by light.
Figure 4. Crowded area definition.
Figure 5. Data augmentation: (a) original image, (b) darkened, and (c) added noise.
3. Methodology
3.1. Passenger Counting Based on Object Detection
YOLO is an end-to-end CNN often used for object detection and recognition, and it has been
successfully applied to many detection fields, such as traffic sign, pedestrian, and vehicle detection.
By developing the detection process into a single deep neural network, the predicted bounding box
and classification probability can be obtained simultaneously to achieve faster and higher-precision
object detection. New versions of the YOLO model have been released, and its detection performance
continues to improve. Owing to its high efficiency, we directly used the YOLOv3 algorithm [31] based
on the Darknet-53 architecture as our detection model. The process of the YOLO structure for
passenger detection is described below.
First, images with ground-truth bounding boxes of heads and the corresponding classification labels
are input to the network. The input images are reduced to a resolution of 448 × 448 and divided into
S × S grids. If an object's center falls into a grid cell, that cell is responsible for predicting B
bounding boxes. Each bounding box contains five information factors, namely Px, Py, Pw, Ph, and c,
where Px and Py represent the center coordinates of the bounding box relative to the bounds of the
grid cell. Pw and Ph are the width and height predicted relative to the entire image, respectively.
The confidence c is defined using Equation (1).

c = P0 × PIOU. (1)

Here, P0 represents the probability of the box containing a detection object, and PIOU is the
intersection over union between the detection object and the predicted bounding box.
Each bounding box corresponds to a degree of confidence. If there is no target in the grid cell,
the confidence is 0; if there is a target, the confidence is equal to PIOU.
YOLO's loss function λloss is calculated using Equation (2).

\lambda_{loss} = \sum_{i=0}^{S^2} \left( E_{coord} + E_{IOU} + E_{class} \right). \quad (2)

Here, Ecoord represents the coordinate error, EIOU is the confidence (c = P0 × PIOU) error, and Eclass
is the classification error between the predicted results and the ground truth.
As the number of training iterations increases, the weight parameters of the network model are
continuously updated until the loss function drops below a preset value, at which point the network
model is considered to be completely trained.
3.2. Passenger Counting Based on Density Estimation
3.2.1. Inverse K-Nearest Neighbor Map Labeling
First, the position of each passenger's head in the crowd image is labeled, as shown in Figure 6.
The red cross symbol in the figure represents the head coordinates of a person to be marked. If there
is a head at pixel (xh, yh), it is represented as a delta function δ(x − xh, y − yh). An image with N
heads labeled is depicted by Equation (3).

H(x) = \sum_{h=1}^{N} \delta(x - x_h, y - y_h). \quad (3)

To convert the labeled image into a continuous density map, a common method involves performing a
convolution operation on H(x) and a Gaussian function Gσ(x) to obtain the density map Dg, as shown by
Equation (4).

D_g(x, f(\cdot)) = H(x) * G_{\sigma}(x) = \sum_{h=1}^{N} \frac{1}{\sqrt{2\pi}\, f(\sigma_h)} \exp\left( -\frac{(x - x_h)^2 + (y - y_h)^2}{2 f(\sigma_h)^2} \right), \quad (4)

where σh is a size determined by the k-nearest neighbor (kNN) distance of each head position (xh, yh)
from other head positions (a fixed size is also used), and f is a manually determined function for
scaling σh to decide the kernel size of the Gaussian function.
We adopt inverse kNN (ikNN) maps as an alternative labeling method to the commonly used density map.
According to [32], the ikNN map performs better than density maps when there are ideally selected
spread parameters. Here, f is defined as a simple scalar function f(σh) = βσh, and β is a scalar that
is manually adjusted. The full kNN map is defined by Equation (5).

K(x, k) = \frac{1}{k} \min_{k} \left( \sqrt{(x - x_h)^2 + (y - y_h)^2} \right), \; \forall h \in H, \quad (5)

where H is the list of all head positions. The above formula calculates the kNN distance from each
pixel (x, y) to each head position (xh, yh), averaged over the k nearest heads.
The calculation to generate the ikNN map M is depicted in Equation (6). The ikNN map is shown in
Figure 7.

M = \frac{1}{K(x, k) + 1}. \quad (6)

Figure 7. Inverse kNN map of the crowded area.
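A minimal sketch of this labeling procedure is given below, reading Equation (5) as the mean distance from a pixel to its k nearest annotated heads and Equation (6) as its inverse. The implementation is a direct, unoptimized transcription of the equations and is not the authors' code; k and the image size are assumed to be supplied by the caller.

```python
import numpy as np

def iknn_map(head_positions, height, width, k=3):
    """Compute the inverse k-nearest-neighbor map, Equations (5) and (6).

    head_positions: list of (x_h, y_h) annotated head coordinates.
    Returns an array M with M = 1 / (K(x, k) + 1) for every pixel x.
    """
    heads = np.asarray(head_positions, dtype=np.float32)   # shape (N, 2)
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)

    # Distances from every pixel to every annotated head position.
    dists = np.linalg.norm(pixels[:, None, :] - heads[None, :, :], axis=2)

    # K(x, k): mean of the k smallest head distances at each pixel, Eq. (5).
    k = min(k, len(heads))
    knn = np.sort(dists, axis=1)[:, :k].mean(axis=1)

    # Inverse kNN map, Eq. (6).
    return (1.0 / (knn + 1.0)).reshape(height, width)
```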
Additionally, in the rear camera view, passengers alongside the aisle of the first row in the rear
seating area may be blocked by passengers in the second row of seats, as shown in Figure 10a. In the
front camera view, the passengers seated in the two middle seats of the first row are seen more
clearly than in the rear camera view, so this part of the passenger information was considered within
the estimation range of the front camera, as shown in Figure 10b.
Therefore, according to the aforementioned calculation methods of the front and rear cameras, we
manually counted the actual number of people in the 500 sets of bus test data, and these statistics
serve as the model testing data.
Figure 10. Seating area definition: (a) rear camera; (b) front camera.
Figure 11. Repeat detection area: (a) without correction and (b) with correction.
In the rear camera view, there is no occlusion problem. Therefore, we mainly used the YOLO detection
method to count the number of passengers in the seats and aisles of the bus. Finally, the results of
the CAE density estimation and YOLO detection can be summed to obtain the current number of
passengers on the bus.
4.3. Evaluation of Passenger Number Estimation
After defining the test dataset, to verify the effectiveness of the proposed algorithm in passenger
number estimation, we compare it to two methods: SaCNN [25] and MCNN [26]. To account for the
performances of the different detection models, we used the mean absolute error (MAE) and root mean
squared error (RMSE), as shown in Equations (7) and (8), to evaluate the effectiveness of the models,
following [28–30]. Nimg is the number of test images, x_g^i is the actual number of passengers in the
i-th test image, and x̂_p^i is the number of passengers on the bus estimated by the different methods.
Table 1 shows the performances of the different methods for 500 sets of test data from the bus images.

MAE = \frac{1}{N_{img}} \sum_{i=1}^{N_{img}} \left| x_g^i - \hat{x}_p^i \right|, \quad (7)

RMSE = \sqrt{ \frac{1}{N_{img}} \sum_{i=1}^{N_{img}} \left( x_g^i - \hat{x}_p^i \right)^2 }. \quad (8)
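Equations (7) and (8) translate directly into code. The sketch below assumes the ground-truth and estimated passenger counts are available as equal-length sequences.

```python
import numpy as np

def mae_rmse(ground_truth, estimated):
    """Mean absolute error and root mean squared error, Equations (7) and (8)."""
    x_g = np.asarray(ground_truth, dtype=np.float64)
    x_p = np.asarray(estimated, dtype=np.float64)
    mae = np.mean(np.abs(x_g - x_p))
    rmse = np.sqrt(np.mean((x_g - x_p) ** 2))
    return mae, rmse
```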
The estimation performance results shown in Table 1 indicate that if only a single neural network
architecture was used to estimate the number of passengers in the bus scenario, whether it is a
density-based method or a detection-based method, there would be a significant error in the model
performance evaluation. In our experimental scene, the field of view is relatively small, resulting
in extreme differences in the size of passengers' heads between those closer to and farther from the
camera. As a result, estimates using a single network architecture are poor. To further investigate
the performance of the proposed algorithm, we defined, within the original bus passenger dataset,
images in which the bus interior contains more than 25 passengers as crowded, as shown in Figure 12;
104 sets of images belong to this category. In the next part, only the YOLOv3 detection method is
analyzed and compared with the proposed method.
Figure 12a shows a picture of 25 passengers on the bus. Although there is space in the aisle section,
as seen in the front camera view, the seating area seen in the rear camera is almost full. Figure 12b
shows 30 passengers inside the bus. The front camera shows the passengers on both sides of the
aisle, and there are many passengers in the defined crowded area of the image. Figure 12c shows 35
passengers in the bus. This picture shows that most of the space, as seen in the front camera, has been
filled, and the seating area seen in the rear camera is full. Figure 12d shows 40 passengers in the
bus. The space shown in the front camera view is full, and there are a few passengers standing in
the area near the door. A few passengers can be seen standing in the aisle in the rear camera image
as well. Based on this, the crowded passenger dataset was explicitly selected for further analysis
and comparison. The analysis results are shown in Table 2. The estimation results indicate that the
performance of the proposed algorithm is better than when using only the YOLOv3 detection method.
From the estimation results, we observed that the most significant performance difference between the
two methods occurred in the defined crowded area.
Figure 12. Examples of crowded scenes: (a) 25 passengers, (b) 30 passengers, (c) 35 passengers, and (d) 40 passengers.
Figure 13a–c presents the front camera images from the bus and the detection results from employing
YOLOv3 and the proposed architecture of this study, respectively. Take the first row of images as an
example. The ground truth of the number of passengers in the image is 25, whereas the results of the
YOLOv3 detection and the proposed method are 17 and 24, respectively. The YOLOv3 result in Figure 13b
shows that more detection failures occur in the defined crowded passenger areas if only this single
detection algorithm is employed to estimate the congestion of passengers for the front camera.
As shown in Figure 13c, the density estimation network can compensate for the low detection rate in
the crowded area, and the density image presents the distribution of each passenger in the crowded
area, so the final estimate of the number of passengers on the bus is closer to the actual number
than that from a single detection method.

Table 2. Performances of different methods for the crowded bus dataset.

Crowded Dataset (104 Sets)
Method                  MAE    RMSE
YOLOv3                  4.93   5.31
Density-CAE + YOLOv3    1.98   2.66

Figure 13. Estimation results for each method: (a) original image, (b) YOLOv3 only, and (c) proposed architecture.
4.4. Evaluation of System Performance in Continuous Time
To assess the effectiveness of the proposed model, we also performed model tests on continuous-time
bus images. The selected scenes can be divided into three main conditions: afternoon, evening, and
nighttime. These conditions were chosen because, in the afternoon, the image inside the bus is more
susceptible to light exposure, while in the evening, many crowded situations occur. At night, the
image of the bus is affected by the interior fluorescent lights and scenes outside the bus. In the
following section, we introduce these three continuous-time scenes individually and evaluate the
estimation results for the passengers inside the vehicle.
4.4.1. Estimated Results (Afternoon)
In the performance analysis for the afternoon, we selected the test video from 4:00–5:00 p.m. on
23 March 2019. This period was selected because of the relationship between the driving route and the
direction of the bus: during this period, the sunlight caused uneven brightness in the images inside
the bus. Moreover, students were travelling from school during this period, so the congestion of
passengers can also be observed. The occurrence of the above situations can verify the effectiveness
of the proposed model framework. Figure 14 shows the change in the number of passengers when the bus
arrives at each bus stop. The bus stop numbers are arranged chronologically from left to right on the
horizontal axis. During this period, the bus had a total of 27 stops. The yellow polyline shown in the
figure indicates the actual number of passengers on the bus, and the dark blue polyline is the number
of passengers estimated by the proposed framework. The green dotted line marks a stop at which a large
number of passengers entered the bus, increasing the number of people on the bus to approximately 30.
The orange dotted line marks a stop at which a large number of passengers left the bus, where
approximately half the passengers got off. The black dotted line is the bus terminal.
5. Conclusions
In this study, the images captured by the front and rear surveillance cameras installed
on the bus are used to estimate the number of passengers. The algorithm used is a combination of a
deep learning object detection method and the CAE architecture. The CAE density estimation model
was used to extract the passenger features of the crowded area, and YOLOv3 was used to detect
the areas with more apparent head features. Then, the results obtained by the two methods were
summed to estimate the number of passengers in the vehicle. Moreover, this result was compared
with other methods. In the final performance evaluation, the MAEs for the bus passenger dataset and
the crowded dataset were 1.35 and 1.98, respectively. In these experiments, the RMSEs were 2.02 and
2.66, respectively. Furthermore, we estimated the number of passengers on a bus over three continuous
time periods, namely afternoon, evening, and nighttime. The results were consistent with the variations in
passenger numbers at each stop.
Although the algorithm used in this study has better estimation performance, the proposed
CAE density estimation network model is still susceptible to light exposure, which reduces accuracy.
This issue will be addressed in our future work. In the future, we also hope to combine the proposed
algorithm for estimating the number of passengers with the method of counting passengers getting on
and off a bus to provide more reliable information in terms of bus load.
Author Contributions: Conceptualization, Y.-W.C. and J.-W.P.; methodology, Y.-W.H. and Y.-W.C.; software,
Y.-W.H. and Y.-W.C.; validation, Y.-W.H.; writing—original draft preparation, Y.-W.H.; writing—review and
editing, J.-W.P. All authors have read and agreed to the published version of the manuscript.
Funding: The authors would like to thank the Ministry of Science and Technology of R.O.C. for financially
supporting this research under contract number MOST 108-2638-E-009-001-MY2.
Acknowledgments: We thank United Highway Bus Co., Ltd. and Transportation Bureau of Kaohsiung City
Government in Taiwan for their assistance.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Luo, Y.; Tan, J.; Tian, X.; Xiang, H. A device for counting the passenger flow is introduced. In Proceedings of
the IEEE International Conference on Vehicular Electronics and Safety, Dongguan, China, 28–30 July 2013.
2. Oberli, C.; Torriti, M.T.; Landau, D. Performance evaluation of UHF RFID technologies for real-time passenger
recognition in intelligent public transportation systems. IEEE Trans. Intell. Transp. Syst. 2010, 11, 748–753.
[CrossRef]
3. Chen, C.H.; Chang, Y.C.; Chen, T.Y.; Wang, D.J. People counting system for getting in/out of a bus based
on video processing. In Proceedings of the International Conference on Intelligent Systems Design and
Applications, Kaohsiung, Taiwan, 26–28 November 2008.
4. Yang, T.; Zhang, Y.; Shao, D.; Li, Y. Clustering method for counting passengers getting in a bus with single
camera. Opt. Eng. 2010, 49. [CrossRef]
5. Chen, J.; Wen, Q.; Zhuo, C.; Mete, M. Automatic head detection for passenger flow analysis in bus surveillance
videos. In Proceedings of the IEEE International Conference on Vehicular Electronics and Safety, Dongguan,
China, 28–30 October 2013.
6. Hu, B.; Xiong, G.; Li, Y.; Chen, Z.; Zhou, W.; Wang, X.; Wang, Q. Research on passenger flow counting based
on embedded system. In Proceedings of the International IEEE Conference on Intelligent Transportation
Systems (ITSC), Qingdao, China, 8–11 October 2014.
7. Mukherjee, S.; Saha, B.; Jamal, I.; Leclerc, R.; Ray, N. A novel framework for automatic passenger counting.
In Proceedings of the IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September
2011.
8. Xu, H.; Lv, P.; Meng, L. A people counting system based on head-shoulder detection and tracking in
surveillance video. In Proceedings of the International Conference On Computer Design and Applications,
Qinhuangdao, China, 25–27 June 2010.
9. Zeng, C.; Ma, H. Robust head-shoulder detection by PCA-based multilevel HOG-LBP detector for people
counting. In Proceedings of the International Conference on Pattern Recognition, Istanbul, Turkey, 23–26
August 2010.
10. Liu, G.; Yin, Z.; Jia, Y.; Xie, Y. Passenger flow estimation based on convolutional neural network in public
transportation system. Knowl. Base Syst. 2017, 123, 102–115. [CrossRef]
11. Gao, C.; Li, P.; Zhang, Y.; Liu, J.; Wang, L. People counting based on head detection combining Adaboost and
CNN in crowded surveillance environment. Neurocomputing 2016, 208, 108–116. [CrossRef]
12. Wang, Z.; Cai, G.; Zheng, C.; Fang, C. Bus-crowdedness estimation by shallow convolutional neural
network. In Proceedings of the International Conference on Sensor Networks and Signal Processing (SNSP),
Xi’an, China, 28–31 October 2018.
13. Chan, A.B.; Vasconcelos, N. Bayesian Poisson regression for crowd counting. In Proceedings of the IEEE
International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009.
14. Chen, K.; Loy, C.C.; Gong, S.; Xiang, T. Feature mining for localised crowd counting. In Proceedings of the
British Machine Vision Conference (BMVC), Surrey, England, 3–7 September 2012.
15. Xu, B.; Qiu, G. Crowd density estimation based on rich features and random projection forest. In Proceedings
of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA,
7–10 March 2016.
16. Borstel, M.; Kandemir, M.; Schmidt, P.; Rao, M.; Rajamani, K.; Hamprecht, F. Gaussian process density
counting from weak supervision. In Proceedings of the European Conference on Computer Vision (ECCV),
Amsterdam, The Netherlands, 11–14 October 2016.
17. Chan, A.B.; Liang, Z.S.J.; Vasconcelos, N. Privacy preserving crowd monitoring: Counting people without
people models or tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Anchorage, AK, USA, 23–28 June 2008.
18. Chan, A.B.; Vasconcelos, N. Counting people with low-level features and Bayesian regression. IEEE Trans.
Image Process. 2012, 21, 2160–2177. [CrossRef] [PubMed]
19. Idrees, H.; Saleemi, I.; Seibert, C.; Shah, M. Multi-source multi-scale counting in extremely dense crowd
images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR,
USA, 23–28 June 2013.
20. Lempitsky, V.; Zisserman, A. Learning to count objects in images. In Proceedings of the International
Conference on Neural Information Processing Systems (NIPS), Hyatt Regency, Vancouver, BC, Canada,
6–11 December 2010.
21. Wang, J.; Wang, L.; Yang, F. Counting crowd with fully convolutional networks. In Proceedings of the
International Conference on Multimedia and Image Processing (ICMIP), Wuhan, China, 17–19 March 2017.
22. Zhang, C.; Li, H.; Wang, X.; Yang, X. Cross-scene crowd counting via deep convolutional neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA,
USA, 7–12 June 2015.
23. Sindagi, V.A.; Patel, V.M. CNN-based cascaded multi-task learning of high-level prior and density estimation
for crowd counting. arXiv 2017, arXiv:1707.09605.
24. Sindagi, V.A.; Patel, V.M. Generating high-quality crowd density maps using contextual pyramid CNNs.
arXiv 2017, arXiv:1708.00953v1.
25. Zhang, L.; Shi, M.; Chen, Q. Crowd counting via scale-adaptive convolutional neural network. arXiv 2017,
arXiv:1711.04433.
26. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional
neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Las Vegas, NV, USA, 27–30 June 2016.
27. Weng, W.T.; Lin, D.T. Crowd density estimation based on a modified multicolumn convolutional neural
network. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro,
Brazil, 8–13 July 2018.
28. Huang, S.; Li, X.; Zhang, Z.; Wu, F.; Gao, S.; Ji, R.; Han, J. Body structure aware deep crowd counting.
IEEE Trans. Image Process. 2018, 27, 1049–1059. [CrossRef] [PubMed]
29. Sang, J.; Wu, W.; Luo, H.; Xiang, H.; Zhang, Q.; Hu, H.; Xia, X. Improved crowd counting method based on
scale-adaptive convolutional neural network. IEEE Access 2019, 7, 24411–24419. [CrossRef]
30. Yang, B.; Cao, J.; Wang, N.; Zhang, Y.; Zou, L. Counting challenging crowds robustly using a multi-column
multi-task convolutional neural network. Signal Process. Image Commun. 2018, 64, 118–129. [CrossRef]
31. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
32. Olmschenk, G.; Tang, H.; Zhu, Z. Improving dense crowd counting convolutional neural networks using
inverse k-nearest neighbor maps and multiscale upsampling. arXiv 2019, arXiv:1902.05379v3.
33. Masci, J.; Meier, U.; Ciresan, D.; Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature
extraction. In Proceedings of the Artificial Neural Networks and Machine Learning (ICANN), Espoo, Finland,
14–17 June 2011.
34. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated convolutional neural networks for understanding the highly
congested scenes. arXiv 2018, arXiv:1802.10062v4.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).