Article
Estimation of the Number of Passengers in a Bus
Using Deep Learning
Ya-Wen Hsu, Yen-Wei Chen and Jau-Woei Perng *
Department of Mechanical and Electro-Mechanical Engineering, National Sun Yat-sen University,
Kaohsiung 804201, Taiwan; [email protected] (Y.-W.H.); [email protected] (Y.-W.C.)
* Correspondence: [email protected]; Tel.: +886-7-525-2000 (ext. 4281)
Received: 28 February 2020; Accepted: 7 April 2020; Published: 12 April 2020
Abstract: For the development of intelligent transportation systems, real-time information on the
number of people on buses not only helps transport operators schedule their fleets but also makes it
easier for passengers to plan their travel times accordingly.
This study proposes a method for estimating the number of passengers on a bus. The method is based
on deep learning to estimate passenger occupancy in different scenarios. Two deep learning methods
are used to accomplish this: the first is a convolutional autoencoder, mainly used to extract features
from crowds of passengers and to determine the number of people in a crowd; the second is the you
only look once version 3 architecture, mainly for detecting the area in which head features are clearer
on a bus. The results obtained by the two methods are summed to calculate the current passenger
occupancy rate of the bus. To demonstrate the algorithmic performance, experiments estimating the
number of passengers at different times of day and at different bus stops were performed. The results indicate
that the proposed system performs better than some existing methods.
Keywords: crowd density estimation; deep learning; object detection; passenger counting; YOLOv3
1. Introduction
In recent years, with the rapid development of technologies such as sensing, communication,
and management, improving the efficiency of traditional transportation systems through advanced
technological applications is becoming more feasible. Therefore, intelligent transportation systems have
gradually become a focus of transportation development around the world. Currently, many related
applications exist in the field of bus information services; for example, people can utilize a dynamic
information web page or a mobile app to inquire about the location and arrival time of every bus.
If more comprehensive information is provided on existing bus information platforms, the quality of
public transport services will be significantly improved. Thus, the number of passengers willing to use
public transport will increase. The information regarding the load on each bus is critical to the control
and management of public transport. The ride information includes the number of passengers entering
and leaving at each stop and the number of passengers remaining on the bus. Through intelligent
traffic monitoring, passengers can preview a bus’s occupancy status in real time and then make a
decision based on the additional information and evaluate the expected waiting time. Furthermore, passenger
transport operators can manage vehicle scheduling based on this information, effectively reducing
operational costs without degrading service quality while providing passengers with more useful
ride information.
In the past, some research groups have explored the counting of passengers on buses.
Luo et al. proposed extracting footprint data through contact pedals to determine the directions
in which passengers entered and left the bus [1]. Oberli et al. proposed radio-frequency
identification (RFID) technology to implement passenger counting, although the recognition result is
susceptible to the position of the RF antenna and the direction of radiation [2]. With the popularity
of surveillance cameras and advances in computer vision technology, image-based people-counting
methods have been continuously proposed [3–10]. References [3–5,10] employed images from actual
buses as experimental scenes. The images were captured from a camera mounted on the ceiling of a
bus, with an almost vertical angle of view. In these works, the heads of the passengers were a noticeable
detection feature in the images, and their methods included motion vectors, feature-point-tracking,
and hair color detection. The authors in a recent article [10] proposed a deep learning-based
convolutional neural network (CNN) architecture to detect passengers’ heads. Reference [11] proposed
a counting method by combining the Adaboost algorithm with a CNN for head detection. This study
was divided into three phases; namely, two offline training phases and one online detection phase.
The first phase used the Adaboost algorithm to learn the features obtained from the histogram of
oriented gradients (HOG). The second phase established a new dataset from the preliminary results
detected in the first phase as the modeling data for training the CNN. The resulting model was used as
a classifier for the head feature in the online detection phase.
The above studies show that, although object detection technology is mature, a single algorithm is
unlikely to achieve high detection performance in complex environments and crowd-intensive situations.
Existing works on counting bus passengers are generally concerned with the image of the bus door and
ignore the overall situation inside the bus. Over time, however, the estimated number of passengers
on the bus becomes significantly less accurate owing to the accumulation of counting errors as
passengers get on and off the bus. Only a few
studies have directly calculated numbers of passengers on buses. Reference [12] proposed a shallow
CNN architecture suitable for estimating the level of crowdedness in buses, which was a modification
of the multi-column CNN architecture. A fully connected layer was added to the network output,
and then the output of the model was classified into five levels of crowdedness [12].
In terms of crowd density estimation, the regression model was used in [13–18] to calculate the
number of people in a crowd in a public scene. This approach mitigates the occlusion problem between
people in a crowd. The first step is to obtain the necessary characteristic information from
the image, such as edge, gradient, and foreground features. Then, linear regression, ridge regression,
or Gaussian process regression is used to learn the relationship between the crowd features and the
number of people. In [19], the image is first divided into small blocks, and multiple information
sources are used to cope with extremely dense crowds and complex backgrounds. The number of
individuals in an extremely dense crowd is estimated from a single image: at various scales,
HOG-based detection and Fourier analysis are used to detect people's heads, and interest points are
used to count local neighbors. Each block then obtains a count from these three different estimates,
and the total number of people in the image is the sum over all blocks. The author in [20] proposed a
new supervised learning architecture that uses the crowd density image corresponding to an original
crowd image to obtain the number of people in an area by integrating over that region. The loss
function is defined as the difference between the estimated density map and the ground-truth density
map and is used to guide learning. This work provided a direction for subsequent deep learning
applications in crowd density estimation.
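The density-map idea in [20] can be made concrete with a few lines of code: the crowd count is the integral (sum) of the density map, and training penalizes the difference between the estimated and ground-truth maps. The sketch below is only illustrative; the simple mean-squared-error form is our assumption rather than the exact loss used in [20].

```python
import numpy as np

def count_from_density_map(density_map: np.ndarray) -> float:
    # The crowd count is the integral of the density map over the image area.
    return float(density_map.sum())

def density_loss(pred_map: np.ndarray, gt_map: np.ndarray) -> float:
    # Penalize the pixel-wise difference between the estimated and
    # ground-truth density maps (a simple mean squared error here).
    return float(np.mean((pred_map - gt_map) ** 2))
```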
The CNN architecture is used in [21–25] to solve the problems of severe occlusion and distortion
of the scene perspective. The CNN implements an end-to-end training process that inputs the raw
data and outputs the final result without the need to pass through foreground segmentation or
feature extraction. Using supervised learning, the original image is passed through multiple convolutional
layers to produce the corresponding density map. The authors in [26,27] proposed the use of a
multi-column CNN (MCNN) architecture to map the relationship between crowd images and density
images. The filters corresponding to the different columns of the convolutional layers can perform
operations on images of different input sizes. Thus, the problem concerning the different sizes of
people's heads in an image is solved. Finally, the three-column CNN is linearly weighted to obtain
the corresponding number of people in the original image. The author in [28] reported that crowd
estimation involves three characteristics; namely, the entire body, head, and background. The body
feature information helps to judge whether a pedestrian exists at a specific position. However,
existing density labels are mostly generated according to the method in [20]; adding semantic
structure information therefore improves the accuracy of the data labeling and makes the estimation
of crowded scenes more effective. The authors in [29] improved the
scale-adaptive CNN (SaCNN) proposed in [25]. Adaptive Gaussian kernels are used to estimate
the parameter settings for different head sizes. The original image is reduced to a size of 256 × 256,
which improves the diversity of training samples and increases the network’s generalization ability.
The authors in [30] proposed a multi-column multi-task CNN (MMCNN), in which the training input is a
single-channel image of 960 × 960 pixels. During training, each image is evenly divided into 16
non-overlapping blocks to avoid over-fitting and poor generalization. In the evaluation phase, the
image is divided into blocks ranging from 120 × 120 to 480 × 480 for input; all block images are
extracted sequentially, and the outputs of the three columns are merged into the multi-task result,
which comprises a mask of the crowd density map, a congestion-degree analysis, and masks of the
foreground and background.
Although bus crowdedness is evaluated in [12], the network output is only a coarse classification
level. In the present study, we propose the use of a deep learning object detection
method and the establishment of a convolutional autoencoder (CAE) to extract the characteristics of
passengers in crowded areas to evaluate the number of people on a bus. The results of these two
methods are summed into the total number of passengers on the bus.
The rest of the paper is organized as follows. In Section 2, the proposed system is described.
In Section 3, the proposed methodology is introduced. Section 4 presents the experimental results and
evaluates the proposed schemes. Finally, conclusions and discussions on ideas for future work are
provided in Section 5.
The proposed method consists of the following two parts:
1. Passenger counting based on object detection: For areas where the head features of passengers
are more visible, the deep learning object detection method is employed to calculate the number
of passengers in the image. The you only look once version 3 (YOLOv3) network model with a
high detection rate was selected.
2. Passenger counting based on density estimation: We propose a CAE architecture suitable for
scenarios of crowded areas on a bus. This model filters all objects in the original image that do not
possess passenger characteristics and outputs the feature information of the passengers in
the image. A sketch of how the two counts can be combined is given after this list.
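The combination of the two components can be sketched as follows. This is a minimal illustration, not the authors' implementation: the arguments `detect_heads_yolov3` and `predict_density_cae` stand for the trained YOLOv3 and CAE models described later, and the crowded-area rectangle is a placeholder for the region defined in Figure 4.

```python
import numpy as np

# Hypothetical crowded-area rectangle in the front-camera image (x, y, w, h).
CROWDED_AREA = (0, 120, 320, 240)

def count_passengers(front_img, rear_img, detect_heads_yolov3, predict_density_cae):
    """Combine YOLOv3 head detection with CAE density estimation.

    detect_heads_yolov3(image) -> list of head bounding boxes (x, y, w, h)
    predict_density_cae(patch) -> 2D density map (numpy array)
    Both callables are assumed to wrap the trained models described in the paper.
    """
    x, y, w, h = CROWDED_AREA
    crowded_patch = front_img[y:y + h, x:x + w]

    # Density-based count for the crowded area of the front camera view.
    density_map = predict_density_cae(crowded_patch)
    crowded_count = float(np.sum(density_map))

    # Detection-based count for areas with clearly visible heads:
    # the rest of the front view and the whole rear view.
    front_heads = [b for b in detect_heads_yolov3(front_img)
                   if not _inside(b, CROWDED_AREA)]
    rear_heads = detect_heads_yolov3(rear_img)

    return round(crowded_count) + len(front_heads) + len(rear_heads)

def _inside(box, area):
    # A detection is assigned to the crowded area if its center falls inside it.
    bx, by, bw, bh = box
    ax, ay, aw, ah = area
    cx, cy = bx + bw / 2, by + bh / 2
    return ax <= cx <= ax + aw and ay <= cy <= ay + ah
```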
Figure 2. Positions of the surveillance cameras: (a) internal configuration of the bus; images from the (b) front camera and (c) rear camera.
For passenger counting based on object detection, a total of 5365 images from inside the bus
recorded from February to November 2018 were utilized as the training set. Moreover, 500 sets of
in-vehicle images from November 2018 to May 2019 were utilized for the experimental tests. We defined
the images simultaneously captured from the front and rear cameras as the same set. In this dataset,
the periods of different lighting conditions are included, which are the daytime scene, nighttime
interior lighting scene, and scenes affected by light, as shown in Figure 3a–c. Some of the most common
situations in the images of a bus are shown in Figure 3; the pictures in the first and second rows present
different cases for the front and rear cameras, respectively. In the image of the daytime scene, we can
easily observe the current number of passengers on the bus. Meanwhile, in the nighttime scene, images
from the front camera are sometimes disturbed by the fluorescent light of the bus, and images from
the rear camera show the colors of some passengers’ heads as similar to the scene outside the bus.
In the scenes affected by light, images from the front camera have uneven light and shade due to
sunlight exposure.
In the images from the front camera, the range of passenger congestion occurs in a fixed area.
Therefore, we define the space of passenger congestion on the bus, as shown in Figure 4. When this
area is filled with passengers, serious occlusion problems generally occur, resulting in less distinct
head characteristics of the passengers. A convolutional autoencoder model based on a density map was
used to predict the number of passengers in this area.
For training the passenger density data, a total of 3776 images of crowded areas inside the bus
were marked. To increase the diversity of the training images, data augmentation was performed on each
image by adding Gaussian noise and adjusting the brightness, as shown in Figure 5. The original dataset
was thereby augmented to 11,328 images. The images of crowded areas were divided into training and validation
samples to train the neural network, and the segmentation ratio of the two samples was 7:3.
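The augmentation step can be reproduced with a few NumPy/OpenCV operations, as in the sketch below; the darkening factor and noise standard deviation are illustrative placeholders, since the paper does not report the exact parameters.

```python
import cv2
import numpy as np

def augment(image: np.ndarray):
    """Return darkened and noise-added variants of a crowded-area image."""
    # Brightness adjustment (darkening), cf. Figure 5b; the factor is an assumption.
    darkened = cv2.convertScaleAbs(image, alpha=0.6, beta=0)

    # Additive Gaussian noise, cf. Figure 5c; the sigma value is an assumption.
    noise = np.random.normal(0, 10, image.shape)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    return darkened, noisy
```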
Figure 3. Examples of in-vehicle images obtained from each surveillance camera: (a) daytime scene, (b) nighttime interior lighting scene, (c) scenes affected by light.
Figure 4. Crowded area definition.
Figure 5. Data augmentation: (a) original image, (b) darkened, and (c) added noise.
3. Methodology
3.1. Passenger Counting Based on Object Detection
YOLO is an end-to-end CNN often used for object detection and recognition, and it has been
successfully applied to many detection fields, such as traffic sign, pedestrian, and vehicle detection.
By developing the detection process into a single deep neural network, the predicted bounding box
and classification probability can be obtained simultaneously to achieve faster and higher-precision
object detection. New versions of the YOLO model have been released, and its detection performance
continues to improve. Owing to its high efficiency, we directly used the YOLOv3 algorithm [31] based
on the Darknet-53 architecture as our detection model. The process of the YOLO structure for
passenger detection is described below.
First, images with ground-truth bounding boxes of heads and the corresponding classification labels
are input to the network. The input images are reduced to a resolution of 448 × 448 and divided into
S × S grids. If an object's center falls into a grid cell, that cell is responsible for predicting B
bounding boxes. Each bounding box contains five information factors, namely Px, Py, Pw, Ph, and c,
where Px and Py represent the center coordinates of the bounding box relative to the bounds of the
grid cell. Pw and Ph are the width and height predicted relative to the entire image, respectively.
The confidence c is defined using Equation (1).

c = P0 × PIOU. (1)

Here, P0 represents the probability of the box containing a detection object, and PIOU is the
intersection over union between the detection object and the predicted bounding box.
Each bounding box corresponds to a degree of confidence. If there is no target in the grid cell,
the confidence is 0; if there is a target, the confidence is equal to PIOU.
YOLO's loss function λloss is calculated using Equation (2).

\lambda_{loss} = \sum_{i=0}^{S^2} \left( E_{coord} + E_{IOU} + E_{class} \right). \quad (2)

Here, Ecoord represents the coordinate error, EIOU is the confidence (c = P0 × PIOU) error, and Eclass
is the classification error between the predicted results and the ground truth.
As the number of training iterations increases, the weight parameters of the network model are
continuously updated until the loss function drops below a preset value, at which point the network
model is considered to be completely trained.
3.2. Passenger Counting Based on Density Estimation
3.2.1. Inverse K-Nearest Neighbor Map Labeling
First, the position of each passenger's head in the crowd image is labeled, as shown in Figure 6.
The red cross symbol in the figure represents the head coordinates of a person to be marked. If there
is a head at pixel (xh, yh), it is represented as a delta function δ(x − xh, y − yh). An image with N
heads labeled is depicted by Equation (3).

H(x) = \sum_{h=1}^{N} \delta(x - x_h, y - y_h). \quad (3)

To convert the labeled image into a continuous density map, a common method involves performing a
convolution operation on H(x) and a Gaussian function Gσ(x) to obtain the density map Dg, as shown by
Equation (4).

D_g(x, f(\cdot)) = H(x) * G_{\sigma}(x) = \sum_{h=1}^{N} \frac{1}{\sqrt{2\pi}\, f(\sigma_h)} \exp\left( -\frac{(x - x_h)^2 + (y - y_h)^2}{2 f(\sigma_h)^2} \right), \quad (4)

where σh is a size determined by the k-nearest neighbor (kNN) distance of each head position (xh, yh)
from other head positions (a fixed size is also used), and f is a manually determined function for
scaling σh to decide the kernel size of the Gaussian function.
We adopt inverse kNN (ikNN) maps as an alternative labeling method to the commonly used density map.
According to [32], the ikNN map performs better than density maps when there are ideally selected
spread parameters. Here, f is defined as a simple scalar function f(σh) = βσh, and β is a scalar that
is manually adjusted. The full kNN map is defined by Equation (5).

K(x, k) = \frac{1}{k} \min_{k} \left( \sqrt{(x - x_h)^2 + (y - y_h)^2} \right), \; \forall h \in H, \quad (5)

where H is the list of all head positions. The above formula calculates the kNN distance from each
pixel (x, y) to each head position (xh, yh), averaged over the k nearest heads.
The calculation to generate the ikNN map M is depicted in Equation (6). The ikNN map is shown in
Figure 7.

M = \frac{1}{K(x, k) + 1}. \quad (6)

Figure 7. Inverse kNN map of the crowded area.
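A minimal sketch of this labeling procedure is given below, reading Equation (5) as the mean distance from a pixel to its k nearest annotated heads and Equation (6) as its inverse. The implementation is a direct, unoptimized transcription of the equations and is not the authors' code; k and the image size are assumed to be supplied by the caller.

```python
import numpy as np

def iknn_map(head_positions, height, width, k=3):
    """Compute the inverse k-nearest-neighbor map, Equations (5) and (6).

    head_positions: list of (x_h, y_h) annotated head coordinates.
    Returns an array M with M = 1 / (K(x, k) + 1) for every pixel x.
    """
    heads = np.asarray(head_positions, dtype=np.float32)   # shape (N, 2)
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)

    # Distances from every pixel to every annotated head position.
    dists = np.linalg.norm(pixels[:, None, :] - heads[None, :, :], axis=2)

    # K(x, k): mean of the k smallest head distances at each pixel, Eq. (5).
    k = min(k, len(heads))
    knn = np.sort(dists, axis=1)[:, :k].mean(axis=1)

    # Inverse kNN map, Eq. (6).
    return (1.0 / (knn + 1.0)).reshape(height, width)
```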
Additionally, in the rear camera view, passengers alongside the aisle of the first row in the rear
seating area may be blocked by passengers in the second row of seats, as shown in Figure 10a. In the
front camera view, the passengers seated in the two middle seats of the first row are seen more
clearly than in the rear camera view, so this part of the passenger information was considered within
the estimation range of the front camera, as shown in Figure 10b.
Therefore, according to the aforementioned calculation methods of the front and rear cameras, we
manually counted the actual number of people in the 500 sets of bus test data, and these statistics
serve as the model testing data.
Figure 10. Seating area definition: (a) rear camera; (b) front camera.
Figure 11. Repeat detection area: (a) without correction and (b) with correction.
In the rear camera view, there is no occlusion problem. Therefore, we mainly used the YOLO detection
method to count the number of passengers in the seats and aisles of the bus. Finally, the results of
the CAE density estimation and YOLO detection can be summed to obtain the current number of
passengers on the bus.
4.3. Evaluation of Passenger Number Estimation
After defining the test dataset, to verify the effectiveness of the proposed algorithm in passenger
number estimation, we compare it to two methods: SaCNN [25] and MCNN [26]. To account for the
performances of the different detection models, we used the mean absolute error (MAE) and root mean
squared error (RMSE), as shown in Equations (7) and (8), to evaluate the effectiveness of the models,
following [28–30]. Nimg is the number of test images, x_g^i is the actual number of passengers in the
i-th test image, and x̂_p^i is the number of passengers on the bus estimated by the different methods.
Table 1 shows the performances of the different methods for 500 sets of test data from the bus images.

MAE = \frac{1}{N_{img}} \sum_{i=1}^{N_{img}} \left| x_g^i - \hat{x}_p^i \right|, \quad (7)

RMSE = \sqrt{ \frac{1}{N_{img}} \sum_{i=1}^{N_{img}} \left( x_g^i - \hat{x}_p^i \right)^2 }. \quad (8)
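Equations (7) and (8) translate directly into code. The sketch below assumes the ground-truth and estimated passenger counts are available as equal-length sequences.

```python
import numpy as np

def mae_rmse(ground_truth, estimated):
    """Mean absolute error and root mean squared error, Equations (7) and (8)."""
    x_g = np.asarray(ground_truth, dtype=np.float64)
    x_p = np.asarray(estimated, dtype=np.float64)
    mae = np.mean(np.abs(x_g - x_p))
    rmse = np.sqrt(np.mean((x_g - x_p) ** 2))
    return mae, rmse
```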
The estimation performance results shown in Table 1 indicate that if only a single neural network
architecture was used to estimate the number of passengers in the bus scenario, whether it is a
density-based method or a detection-based method, there would be a significant error in the model
performance evaluation. In our experimental scene, the field of view is relatively small, resulting
in extreme differences in the size of passengers' heads between those closer to and farther from the
camera. As a result, estimates using a single network architecture are poor. To further investigate
the performance of the proposed algorithm, we defined, within the original bus passenger dataset,
images in which the bus interior contains more than 25 passengers as crowded, as shown in Figure 12;
104 sets of images belong to this category. In the next part, only the YOLOv3 detection method is
analyzed and compared with the proposed method.
Figure 12a shows a picture of 25 passengers on the bus. Although there is space in the aisle section,
as seen in the front camera view, the seating area seen in the rear camera is almost full. Figure 12b
shows 30 passengers inside the bus. The front camera shows the passengers on both sides of the
aisle, and there are many passengers in the defined crowded area of the image. Figure 12c shows 35
passengers in the bus. This picture shows that most of the space, as seen in the front camera, has been
filled, and the seating area seen in the rear camera is full. Figure 12d shows 40 passengers in the
bus. The space shown in the front camera view is full, and there are a few passengers standing in
the area near the door. A few passengers can be seen standing in the aisle in the rear camera image
as well. Based on this, the crowded passenger dataset was explicitly selected for further analysis
and comparison. The analysis results are shown in Table 2. The estimation results indicate that the
performance of the proposed algorithm is better than when using only the YOLOv3 detection method.
From the estimation results, we observed that the most significant performance difference between the
two methods occurred in the defined crowded area.
Figure 12. Examples of crowded scenes: (a) 25 passengers, (b) 30 passengers, (c) 35 passengers, and (d) 40 passengers.
Figure 13a–c presents the front camera images from the bus and the detection results from employing
YOLOv3 and the proposed architecture of this study, respectively. Take the first row of images as an
example. The ground truth of the number of passengers in the image is 25, whereas the results of the
YOLOv3 detection and the proposed method are 17 and 24, respectively. The YOLOv3 result in Figure 13b
shows that more detection failures occur in the defined crowded passenger areas if only this single
detection algorithm is employed to estimate the congestion of passengers for the front camera.
As shown in Figure 13c, the density estimation network can compensate for the low detection rate in
the crowded area, and the density image presents the distribution of each passenger in the crowded
area, so the final estimate of the number of passengers on the bus is closer to the actual number
than that from a single detection method.

Table 2. Performances of different methods for the crowded bus dataset.

Crowded Dataset (104 Sets)
Method                  MAE    RMSE
YOLOv3                  4.93   5.31
Density-CAE + YOLOv3    1.98   2.66

Figure 13. Estimation results for each method: (a) original image, (b) YOLOv3 only, and (c) proposed architecture.
4.4. Evaluation of System Performance in Continuous Time
To assess the effectiveness of the proposed model, we also performed model tests on continuous-time
bus images. The selected scenes can be divided into three main conditions: afternoon, evening, and
nighttime. These conditions were chosen because, in the afternoon, the image inside the bus is more
susceptible to light exposure, while in the evening, many crowded situations occur. At night, the
image of the bus is affected by the interior fluorescent lights and scenes outside the bus. In the
following section, we introduce these three continuous-time scenes individually and evaluate the
estimation results for the passengers inside the vehicle.
4.4.1. Estimated Results (Afternoon)
In the performance analysis for the afternoon, we selected the test video from 4:00–5:00 p.m. on
23 March 2019. This period was selected because of the relationship between the driving route and the
direction of the bus: during this period, the sunlight caused uneven brightness in the images inside
the bus. Moreover, students were travelling from school during this period, so the congestion of
passengers can also be observed. The occurrence of the above situations can verify the effectiveness
of the proposed model framework. Figure 14 shows the change in the number of passengers when the bus
arrives at each bus stop. The bus stop numbers are arranged chronologically from left to right on the
horizontal axis. During this period, the bus had a total of 27 stops. The yellow polyline shown in the
figure indicates the actual number of passengers on the bus, and the dark blue polyline is the number
of passengers estimated by the proposed framework. The green dotted line marks a stop at which a large
number of passengers entered the bus, increasing the number of people on the bus to approximately 30.
The orange dotted line marks a stop at which a large number of passengers left the bus, where
approximately half the passengers got off. The black dotted line is the bus terminal.
5. Conclusions
In this study, the images captured by the front and rear surveillance cameras installed
on the bus are used to estimate the number of passengers. The algorithm used is a combination of a
deep learning object detection method and the CAE architecture. The CAE density estimation model
was used to extract the passenger features of the crowded area, and YOLOv3 was used to detect
the areas with more apparent head features. Then, the results obtained by the two methods were
summed to estimate the number of passengers in the vehicle. Moreover, this result was compared
with other methods. In the final performance evaluation, the MAEs for the bus passenger dataset and
the crowded dataset were 1.35 and 1.98, respectively. In these experiments, the RMSEs were 2.02 and
2.66, respectively. Furthermore, we estimated the number of passengers on a bus over three continuous
time periods, namely afternoon, evening, and nighttime. The results were consistent with the variations in
passenger numbers at each stop.
Although the algorithm used in this study has better estimation performance, the proposed
CAE density estimation network model is still susceptible to light exposure, which reduces accuracy.
This issue will be addressed in our future work. In the future, we also hope to combine the proposed
algorithm for estimating the number of passengers with the method of counting passengers getting on
and off a bus to provide more reliable information in terms of bus load.
Author Contributions: Conceptualization, Y.-W.C. and J.-W.P.; methodology, Y.-W.H. and Y.-W.C.; software,
Y.-W.H. and Y.-W.C.; validation, Y.-W.H.; writing—original draft preparation, Y.-W.H.; writing—review and
editing, J.-W.P. All authors have read and agreed to the published version of the manuscript.
Funding: The authors would like to thank the Ministry of Science and Technology of R.O.C. for financially
supporting this research under contract number MOST 108-2638-E-009-001-MY2.
Acknowledgments: We thank United Highway Bus Co., Ltd. and Transportation Bureau of Kaohsiung City
Government in Taiwan for their assistance.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Luo, Y.; Tan, J.; Tian, X.; Xiang, H. A device for counting the passenger flow is introduced. In Proceedings of
the IEEE International Conference on Vehicular Electronics and Safety, Dongguan, China, 28–30 July 2013.
2. Oberli, C.; Torriti, M.T.; Landau, D. Performance evaluation of UHF RFID technologies for real-time passenger
recognition in intelligent public transportation systems. IEEE Trans. Intell. Transp. Syst. 2010, 11, 748–753.
[CrossRef]
3. Chen, C.H.; Chang, Y.C.; Chen, T.Y.; Wang, D.J. People counting system for getting in/out of a bus based
on video processing. In Proceedings of the International Conference on Intelligent Systems Design and
Applications, Kaohsiung, Taiwan, 26–28 November 2008.
4. Yang, T.; Zhang, Y.; Shao, D.; Li, Y. Clustering method for counting passengers getting in a bus with single
camera. Opt. Eng. 2010, 49. [CrossRef]
5. Chen, J.; Wen, Q.; Zhuo, C.; Mete, M. Automatic head detection for passenger flow analysis in bus surveillance
videos. In Proceedings of the IEEE International Conference on Vehicular Electronics and Safety, Dongguan,
China, 28–30 October 2013.
6. Hu, B.; Xiong, G.; Li, Y.; Chen, Z.; Zhou, W.; Wang, X.; Wang, Q. Research on passenger flow counting based
on embedded system. In Proceedings of the International IEEE Conference on Intelligent Transportation
Systems (ITSC), Qingdao, China, 8–11 October 2014.
7. Mukherjee, S.; Saha, B.; Jamal, I.; Leclerc, R.; Ray, N. A novel framework for automatic passenger counting.
In Proceedings of the IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September
2011.
8. Xu, H.; Lv, P.; Meng, L. A people counting system based on head-shoulder detection and tracking in
surveillance video. In Proceedings of the International Conference On Computer Design and Applications,
Qinhuangdao, China, 25–27 June 2010.
9. Zeng, C.; Ma, H. Robust head-shoulder detection by PCA-based multilevel HOG-LBP detector for people
counting. In Proceedings of the International Conference on Pattern Recognition, Istanbul, Turkey, 23–26
August 2010.
10. Liu, G.; Yin, Z.; Jia, Y.; Xie, Y. Passenger flow estimation based on convolutional neural network in public
transportation system. Knowl. Base Syst. 2017, 123, 102–115. [CrossRef]
11. Gao, C.; Li, P.; Zhang, Y.; Liu, J.; Wang, L. People counting based on head detection combining Adaboost and
CNN in crowded surveillance environment. Neurocomputing 2016, 208, 108–116. [CrossRef]
12. Wang, Z.; Cai, G.; Zheng, C.; Fang, C. Bus-crowdedness estimation by shallow convolutional neural
network. In Proceedings of the International Conference on Sensor Networks and Signal Processing (SNSP),
Xi’an, China, 28–31 October 2018.
13. Chan, A.B.; Vasconcelos, N. Bayesian Poisson regression for crowd counting. In Proceedings of the IEEE
International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009.
14. Chen, K.; Loy, C.C.; Gong, S.; Xiang, T. Feature mining for localised crowd counting. In Proceedings of the
British Machine Vision Conference (BMVC), Surrey, England, 3–7 September 2012.
15. Xu, B.; Qiu, G. Crowd density estimation based on rich features and random projection forest. In Proceedings
of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA,
7–10 March 2016.
16. Borstel, M.; Kandemir, M.; Schmidt, P.; Rao, M.; Rajamani, K.; Hamprecht, F. Gaussian process density
counting from weak supervision. In Proceedings of the European Conference on Computer Vision (ECCV),
Amsterdam, The Netherlands, 11–14 October 2016.
17. Chan, A.B.; Liang, Z.S.J.; Vasconcelos, N. Privacy preserving crowd monitoring: Counting people without
people models or tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Anchorage, AK, USA, 23–28 June 2008.
18. Chan, A.B.; Vasconcelos, N. Counting people with low-level features and Bayesian regression. IEEE Trans.
Image Process. 2012, 21, 2160–2177. [CrossRef] [PubMed]
19. Idrees, H.; Saleemi, I.; Seibert, C.; Shah, M. Multi-source multi-scale counting in extremely dense crowd
images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR,
USA, 23–28 June 2013.
20. Lempitsky, V.; Zisserman, A. Learning to count objects in images. In Proceedings of the International
Conference on Neural Information Processing Systems (NIPS), Hyatt Regency, Vancouver, BC, Canada,
6–11 December 2010.
21. Wang, J.; Wang, L.; Yang, F. Counting crowd with fully convolutional networks. In Proceedings of the
International Conference on Multimedia and Image Processing (ICMIP), Wuhan, China, 17–19 March 2017.
22. Zhang, C.; Li, H.; Wang, X.; Yang, X. Cross-scene crowd counting via deep convolutional neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA,
USA, 7–12 June 2015.
23. Sindagi, V.A.; Patel, V.M. CNN-based cascaded multi-task learning of high-level prior and density estimation
for crowd counting. arXiv 2017, arXiv:1707.09605.
24. Sindagi, V.A.; Patel, V.M. Generating high-quality crowd density maps using contextual pyramid CNNs.
arXiv 2017, arXiv:1708.00953v1.
25. Zhang, L.; Shi, M.; Chen, Q. Crowd counting via scale-adaptive convolutional neural network. arXiv 2017,
arXiv:1711.04433.
26. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional
neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Las Vegas, NV, USA, 27–30 June 2016.
27. Weng, W.T.; Lin, D.T. Crowd density estimation based on a modified multicolumn convolutional neural
network. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro,
Brazil, 8–13 July 2018.
28. Huang, S.; Li, X.; Zhang, Z.; Wu, F.; Gao, S.; Ji, R.; Han, J. Body structure aware deep crowd counting.
IEEE Trans. Image Process. 2018, 27, 1049–1059. [CrossRef] [PubMed]
29. Sang, J.; Wu, W.; Luo, H.; Xiang, H.; Zhang, Q.; Hu, H.; Xia, X. Improved crowd counting method based on
scale-adaptive convolutional neural network. IEEE Access 2019, 7, 24411–24419. [CrossRef]
30. Yang, B.; Cao, J.; Wang, N.; Zhang, Y.; Zou, L. Counting challenging crowds robustly using a multi-column
multi-task convolutional neural network. Signal Process. Image Commun. 2018, 64, 118–129. [CrossRef]
31. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
32. Olmschenk, G.; Tang, H.; Zhu, Z. Improving dense crowd counting convolutional neural networks using
inverse k-nearest neighbor maps and multiscale upsampling. arXiv 2019, arXiv:1902.05379v3.
33. Masci, J.; Meier, U.; Ciresan, D.; Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature
extraction. In Proceedings of the Artificial Neural Networks and Machine Learning (ICANN), Espoo, Finland,
14–17 June 2011.
34. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated convolutional neural networks for understanding the highly
congested scenes. arXiv 2018, arXiv:1802.10062v4.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).