Identification of Vehicle-Pedestrian Collision - Yao 2018


sustainability

Article
Identification of Vehicle-Pedestrian Collision
Hotspots at the Micro-Level Using Network Kernel
Density Estimation and Random Forests: A Case
Study in Shanghai, China
Shenjun Yao 1,2 , Jinzi Wang 1,2 , Lei Fang 3 and Jianping Wu 1,2, *
1 Key Laboratory of Geographic Information Science (Ministry of Education), East China Normal University,
Shanghai 200241, China; [email protected] (S.Y.); [email protected] (J.W.)
2 School of Geographic Sciences, East China Normal University, Shanghai 200241, China
3 Department of Environmental Science and Engineering, Fudan University, Shanghai 200438, China;
[email protected]
* Correspondence: [email protected]; Tel.: +86-21-5434-1204

Received: 12 November 2018; Accepted: 10 December 2018; Published: 13 December 2018

Abstract: The improvement of pedestrian safety plays a crucial role in developing safe and
friendly walking environments, which can contribute to urban sustainability. A preliminary step
in improving pedestrian safety is to identify hazardous road locations for pedestrians. This study
proposes a framework for the identification of vehicle-pedestrian collision hot spots by integrating
the information about both the likelihood of the occurrence of vehicle-pedestrian collisions and the
potential for the reduction in vehicle-pedestrian crashes. First, a vehicle-pedestrian collision density
surface was produced via network kernel density estimation. By assigning a threshold value, possible
vehicle-pedestrian hot spots were identified. To obtain the potential for vehicle-pedestrian collision
reduction, random forests was employed to model the density with a set of variables describing
vehicle and pedestrian flows. The potential for crash reduction was then measured as the difference
between the observed vehicle-pedestrian crash density and the prediction produced by the random
forests models. The final hotspots were determined by excluding those with a crash reduction value
of no more than zero. The method was applied to the identification of hazardous road locations
for pedestrians in a district in Shanghai, China. The result indicates that the method is useful for
decision-making support.

Keywords: kernel density; random forests; pedestrians; crash; hotspots; safety; walking

1. Introduction
People start and end most of their trips on foot in their daily lives. However, mainly due to the
lack of awareness, pedestrians are often at high risk for death and injury. According to the World
Health Organization [1], approximately 1.24 million traffic deaths occur annually on the world’s roads,
of which about 22% involve pedestrians. As walking positively influences health and the environment,
encouraging walking can help develop a sustainable community. Despite a shift from motorized to
sustainable transport modes (such as walking and cycling) that has focused attention on pedestrian
safety, there is still much room for improvement to ensure a safe walking environment for pedestrians.
A preliminary step to improve pedestrian safety is to identify hazardous road locations for
pedestrians. This task plays a crucial role in safety countermeasure proposals and resource allocation.
From a geography perspective, hazardous road locations are usually represented by clusters of
traffic collisions. In the literature, extensive research has focused on the detection of traffic collision

Sustainability 2018, 10, 4762; doi:10.3390/su10124762 www.mdpi.com/journal/sustainability


concentration at the micro level [2–12]. The studies can be categorized into two types [13,14]. The first
is the link-attribute class, which segments the road network into basic spatial units (BSUs) and
treats the traffic collisions as attributes attached to the BSUs. The other is the event-based type,
where individual traffic collision events represented by x and y coordinates in space are analyzed.
In traffic collision analysis, kernel density estimation (KDE) is one of the most popular event-based
approaches [15]. KDE has been widely applied to the identification of hazardous road locations.
Although some researchers employed traditional planar KDE [16–18] that estimates density in
two-dimensional space where traffic collisions are weighted based on the Euclidean distance, there has
been a growing trend in applying network KDE (NKDE), which estimates density in a one-dimensional
space where distance is calculated along the road network mainly because traffic collisions are a
network-constrained phenomenon. For instance, Xie and Yan [5] developed a novel NKDE approach
to estimate the density of network-constrained point events and applied it to the analysis of 2005 traffic
crash data in the Bowling Green, Kentucky, USA area. The results indicate that the NKDE is more
appropriate than standard planar KDE for density estimation of traffic collisions, since the latter is
likely to overestimate the density values.
In the context of road safety, hazardous road locations are usually referred to as traffic collision
“hotspots”, “blackspots”, “sites with promise”, or “high risk locations”. A number of previous studies
employed different methods to detect traffic collision hot spots based on traffic collision frequency
and rate [19–22] aggregated by BSUs. Unlike spatial analysts who are interested in spatial analytical
techniques for the detection of traffic collision clusters, traffic safety researchers are more concerned
with the definition of hazardous road locations. Although using a simple ranking approach is the most
convenient way of defining a traffic collision hotspot, it is thought that the method is naive and is likely
to cause a large number of false positives. In handling this, previous studies have proposed other
measures to define hazardous (or unsafe) road locations. For instance, McGuigan [23,24] measured
the “potential of accident reduction”, which was calculated as the difference between the observed and
the expected crash count at a site given exposure. Mahalel et al. [25] suggested that locations that are
selected for treatment should maximize the expected total reduction of traffic collisions. The premise
of these studies is that only excess traffic collisions can be prevented by appropriate treatments [26].
However, most of these studies focused on vehicle-vehicle collisions and dealt with collision frequency.
The method has not yet been applied to vehicle-pedestrian collision density.
As there has been no consensus on the best method of detecting hazardous road locations, this
study proposes an integrated micro-level method that incorporates both traffic crash intensity and the
potential for reduction to identify vehicle-pedestrian collision hot spots. The reasons for developing
the method are twofold. Firstly, there is a growing trend among nations worldwide to set a “zero”
tolerance vision in terms of fatalities to protect road users. To realize the ambitious target of zero road
fatalities and serious injuries on roads, researchers and engineers should be concerned with locations
where traffic collisions happen frequently. Secondly, in safety practice, resources are usually insufficient
for treating every hazardous road location. Policy-makers may not be interested in traffic crash clusters
that only result from high traffic volume. They may, instead, like to know hazardous road locations
that produce the maximum reduction in traffic deaths and injuries when appropriately treated. In this
light, we attempted to develop a framework to integrate both crash density and reduction potential
information sources for decision-making support for pedestrian safety.
The following section first introduces the steps for identifying vehicle-pedestrian hot spots, with
emphasis on models we used to analyse vehicle-pedestrian collisions. The study area and data
are introduced in Section 3, and the results are presented and discussed in Section 4, followed by
conclusions and further research directions in Section 5.

2. Method
The proposed framework for the identification of vehicle-pedestrian collision hot spots involves
three steps: producing a vehicle-pedestrian collision density surface, measuring the potential for
vehicle-pedestrian collision reduction, and identifying the vehicle-pedestrian collision hot spots.
This section introduces the models and approaches employed in each step.

2.1. Generation of Vehicle-Pedestrian Collision Density Surface


The NKDE method was used for detecting the vehicle-pedestrian collision hot spots by following
the approach in Xie and Yan [5] and Loo et al. [12]. First, by analogy with standard planar KDE, where
the entire two-dimensional space is divided into regular grids, the roads were divided into BSUs in
equal intervals to ensure regularly spaced locations along a network for density estimation [5]. Next,
the center points of BSUs were obtained as reference points. For each reference point (RP), the density
estimate, f (i), is calculated by:
$$ f(i) = \frac{1}{Nb} \sum_{j=1}^{N} \mathrm{Kern}\left(\frac{d_{ij}}{b}\right) \qquad (1) $$

where b is the bandwidth, dij is the network distance between reference point i and vehicle-pedestrian
traffic collision j, and Kern(.) is a kernel function that measures the distance decay effect, such as
Uniform, Triangle, Quartic, Triweight, and Gaussian [27]. In this study, the length of BSU was set as
200 m, and the Quartic function was chosen as the kernel function, which is determined by:

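$$ \mathrm{Kern}\left(\frac{d_{ij}}{b}\right) = \begin{cases} \frac{15}{16}\left(1 - \frac{d_{ij}^{2}}{b^{2}}\right)^{2} & \text{if } 0 < \frac{d_{ij}}{b} \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2) $$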

Although the BSU length and the choice of kernel function may have limited influence on the
results, the selection of bandwidth has significant impacts on the resultant density surface [4,5,12].
A small bandwidth may produce a sharp density pattern and may result in a large number of tiny
isolated individual clusters, and a broad bandwidth produces smooth density surface where hazardous
road locations are likely to be mixed with safe neighboring locations. In this research, the bandwidth
was chosen as 250 m—an intermediate value—to ensure an appropriate density surface.
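As an illustration only, Equations (1) and (2) can be sketched in a few lines of Python. The sketch assumes the network distances from a reference point to every collision have already been computed along the road network; the function names and the sample distances are hypothetical and do not come from the study.

```python
def quartic_kernel(u):
    """Quartic kernel as printed in Equation (2): 15/16 * (1 - u^2)^2 for 0 < u <= 1."""
    if 0 < u <= 1:
        return 15.0 / 16.0 * (1.0 - u * u) ** 2
    return 0.0


def nkde_density(network_distances, bandwidth):
    """Density estimate f(i) of Equation (1) for a single reference point.

    network_distances: distances (m) measured along the road network from the
    reference point to each of the N collisions (assumed precomputed).
    bandwidth: the bandwidth b in the same unit.
    """
    n = len(network_distances)
    total = sum(quartic_kernel(d / bandwidth) for d in network_distances)
    return total / (n * bandwidth)


# Hypothetical distances from one reference point to four collisions.
distances = [50.0, 125.0, 250.0, 400.0]
density = nkde_density(distances, bandwidth=250.0)
```

With the 250 m bandwidth, only the collisions at 50 m and 125 m contribute here; the one at exactly 250 m receives zero weight because the quartic kernel vanishes at the bandwidth edge.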

2.2. Calculation of Potential of Vehicle-Pedestrian Collision Reduction


The potential for vehicle-pedestrian collision reduction was measured as the difference between
the observed and the estimated crash density values. The former is obtained using Equations (1) and
(2); the latter can be calculated by modelling the vehicle-pedestrian crash density with variables that
describe not only vehicle volume but also pedestrian flow. Although traditional statistical models have
been widely used in traffic collision modelling [28–30], applying machine learning methods [31–33] has
become a growing trend. A typical example is Chang [31] who analysed freeway collisions with neural
network (NN) approaches and found that NN models had better predictive performance because of
their exceptional ability in approximating the complicated nonlinearity. However, NNs have limited
ability to illustrate the influence of risk factors due to the “black-box” drawback and are likely to cause
a severe over-fitting problem. To balance the explanatory ability of risk factors and the accuracy of
traffic collision prediction, we employed the random forest (RF) method [34,35] for modeling traffic
collisions, because the technique is relatively robust to outliers and can evaluate the relative importance
of potential predictors [36]. The RF technique is being increasingly applied to many research fields
such as classification of land cover [37], identification of fire occurrence [38], mapping of oil spill [39],
detection of gold potential [40], and diagnosis of tree health [41]; however, it has rarely been applied to
the modeling of traffic collision density.
RF was first proposed by Breiman [35]. The technique relies on the “bagging” method that
constructs each tree independently by using a bootstrap sample of the dataset [42]. A random forest
consists of many trees, each of which is generated by drawing bootstrap samples from the original
dataset, with “out-of-bag” (OOB) data for validation. Unlike in standard trees where each node is
split using the best among all predictors, in a random forest, each node is split by randomly sampling
a subset of predictors and choosing the best split among those variables [34]. The outcome of the
RFs is determined by averaging the predictions of all the trees [35]. The importance of each predictor
can be estimated by examining the increase in prediction error when permuting the OOB data for
that variable and leaving all others unchanged. Two commonly used measures in RFs for assessing
variable importance are the mean decrease in accuracy and the decrease in node impurity. As the
former indicator is considered a more reliable measure [43], it was used for measuring the variable
importance in this study.
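The permutation idea behind the mean decrease in accuracy can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it permutes each column of a generic evaluation set rather than the per-tree OOB data, and the toy predictor, data, and function names are all hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when one column of X is shuffled at a time."""
    rng = random.Random(seed)
    baseline = mse(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        total_increase = 0.0
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Replace only the chosen column; all other columns stay unchanged.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            total_increase += mse(y, predict(X_perm)) - baseline
        importances.append(total_increase / n_repeats)
    return importances


def toy_predict(X):
    """Hypothetical fitted model that uses only the first predictor."""
    return [2.0 * row[0] for row in X]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]
importances = permutation_importance(toy_predict, X, y)
```

Shuffling the first column degrades the predictions, so its importance is positive, whereas shuffling the unused second column leaves the error unchanged and yields an importance of zero.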
This study employed the scikit-learn (SKlearn, The French Institute for Research in Computer
Science and Automation, Rocquencourt, France) toolkit [44], which provides machine learning tools
in Python for data mining and data analysis. In SKlearn, the RandomForestRegressor tool was
used for implementing the RF algorithm. It contains several parameters that allow users to specify
modifications for optimizing the model, including n_estimators (the number of decision trees), criterion
(the method to measure the quality of a split), max_depth (the maximum depth of a decision tree),
and min_samples_split (the minimum number of samples required to split a node). SKlearn also
provides functions that enable users to measure the prediction accuracy of the model, such as
cross_val_score, mean_squared_error, mean_absolute_error, and r2_score, which compute the
cross-validation score, mean squared error, mean absolute error, and R2, respectively. The attribute
feature_importances_ is used for measuring the importance of each variable.
Although independent validation samples are not necessary for RF, they allow the assessment of
the generalization capability of the method [38,45]. In this light, the dataset was randomly divided into
two parts: 70% for calibration and 30% for validation. The procedure was repeated n times, resulting
in n sub-samples. The final predicted density value was determined by averaging predictions from
RF models based on n sub-samples. The potential for vehicle-pedestrian collision reduction was then
obtained by calculating the difference between the observed vehicle-pedestrian collision density and
the final prediction. In this study, n was set to five.
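The repeated-split procedure can be sketched as below. The sketch makes two assumptions that the text does not spell out: each of the n models predicts the density for every BSU, and the final prediction is the plain average of those n predictions. To keep the example dependency-free, a trivial mean predictor stands in for the random forest; any object with fit and predict methods could be passed instead. All names and data are hypothetical.

```python
import random


class MeanRegressor:
    """Stand-in model (NOT a random forest): always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def averaged_predictions(X, y, make_model, n_repeats=5, train_frac=0.7, seed=0):
    """Fit one model per random 70/30 split and average predictions over all rows."""
    rng = random.Random(seed)
    indices = list(range(len(y)))
    n_train = int(round(train_frac * len(y)))
    sums = [0.0] * len(y)
    for _ in range(n_repeats):
        rng.shuffle(indices)
        train = indices[:n_train]  # calibration part; the rest is held out
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in enumerate(model.predict(X)):
            sums[i] += p
    return [s / n_repeats for s in sums]


def reduction_potential(observed, predicted):
    """Potential for collision reduction: observed minus predicted density."""
    return [o - p for o, p in zip(observed, predicted)]


# Hypothetical densities: nine ordinary BSUs and one with an unusually high value.
X = [[float(i)] for i in range(10)]
y = [0.01] * 9 + [0.10]
predicted = averaged_predictions(X, y, MeanRegressor)
potential = reduction_potential(y, predicted)
```

In this toy case only the last BSU has a clearly positive potential, because its observed density far exceeds what the exposure-only stand-in model predicts.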

2.3. Identification of Vehicle-Pedestrian Collision Hot Spots


The potential vehicle-pedestrian collision hot spots were first detected by setting a threshold
value for crash density. For each of these locations, the potential for vehicle-pedestrian collision
reduction was examined. If the value was no more than zero, the site was treated as a false positive
and was excluded from the hot spots. The final hazardous road locations for pedestrians only included
those with the potential for collision reduction above zero. Following Harirforoush and Bellalite [4],
the threshold value was set to three standard deviations from the mean value in this research.
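The two-stage filter described above can be sketched as follows. The text does not say whether the population or sample standard deviation was used, nor whether the threshold comparison is strict; this sketch assumes the population form and a strict comparison. The function name and the data are hypothetical.

```python
from statistics import mean, pstdev


def identify_hotspots(observed, predicted, k=3.0):
    """Return (threshold, candidate indices, final hot spot indices).

    Candidates exceed mean + k * standard deviation of the observed density.
    Final hot spots are candidates whose reduction potential
    (observed - predicted) is above zero.
    """
    threshold = mean(observed) + k * pstdev(observed)
    candidates = [i for i, d in enumerate(observed) if d > threshold]
    final = [i for i in candidates if observed[i] - predicted[i] > 0]
    return threshold, candidates, final


# Hypothetical densities for 40 BSUs, two of which stand out.
observed = [0.0] * 38 + [1.0, 1.2]
predicted = [0.0] * 38 + [1.1, 0.5]
threshold, candidates, final = identify_hotspots(observed, predicted)
```

Both outlying BSUs pass the density threshold, but the first is dropped as a false positive because the model predicts a higher density than was observed there.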

3. Study Area and Data


We analysed vehicle-pedestrian collisions occurring in 2015 in Changning District, which is
located in the urban core of Shanghai, China. The vehicle-pedestrian collision data were collected
from the Shanghai 110 Calling Center. The total length of arterial, secondary, and branch roads in this
district is about 295 km. In 2015, 1200 vehicle-pedestrian collisions occurred in the district. Figure 1
shows the spatial distribution of vehicle-pedestrian crashes in the study area. In traffic safety research,
the analysis is usually conducted based on crash data observed for 3- to 5-year periods; however, this
study only used a dataset for one year. The reasons for this are twofold. First, given the length of the
road network in the study area, 1200 vehicle-pedestrian collisions are able to depict overall pedestrian
safety. It is not necessary to pool 3- or 5-year datasets to ensure the representativeness of the events.
Second, since the late 2000s, the Shanghai Police has enforced a set of safety rules, which may result in
significant yearly variation in safety performance.

Figure 1. Spatial distribution of vehicle-pedestrian collisions in 2015 in the study area.



As mentioned earlier, to determine the potential for vehicle-pedestrian collision reduction,
the vehicle-pedestrian collision density should be modeled by RF with variables that describe both
vehicle and pedestrian volume. Because it is challenging and extremely costly to collect detailed
information on the vehicle and pedestrian volume along roads, we employed proxy variables that may
reflect the spatial variation in flows.
One crucial variable delineating traffic volume is the Global Positioning System (GPS) data
extracted from GPS-equipped taxis. Such on-vehicle GPS data have been widely used in various fields
such as urban traffic surveillance, trip pattern identification, city structure recognition, and traffic
safety on arterial roads [46–49], but have not been applied to the modeling of vehicle-pedestrian
collision density. The data were collected from nearly 13,000 GPS-equipped taxis from Shanghai
Qiangsheng Holding Co., Ltd. (Shanghai, China). The Qiangsheng family owns about 25% of the total
number of taxis, which represents 4–7% of the vehicle population in Shanghai [49]. The Qiangsheng
taxi GPS tracking point database contains information including vehicle identification (ID), time,
speed, and longitude and latitude recorded by GPS receivers on the vehicles about every 10 s. With
locational information, GPS points were plotted onto a map. A map-matching process was then
conducted to ensure that the tracking points were assigned to appropriate roads [50]. For each
reference point, the number of taxis that passed was calculated. In this study, taxi GPS tracking data
from 1–7 March 2016 were used for the calculation. The average daily taxi flow was introduced as the
vehicle exposure variable for the vehicle-pedestrian collision density prediction models. One crucial
issue is that the travel patterns and characteristics of taxicabs may differ from those of general traffic.
A typical problem is that unoccupied taxis tend to cluster in some specific types of places such as
shopping malls and metro stations. Including unoccupied taxis may cause overestimation of traffic
flow in these locations. As trajectories of occupied taxis are more likely to reflect travel demands and
hence the variation in real traffic, only taxis with passengers were included in the sample.
In addition to vehicle flow, the pedestrian volume plays a crucial role in vehicle-pedestrian
safety models. In the absence of detailed pedestrian flow data, we employed a set of variables that
comprehensively reflect characteristics of pedestrian flow. As different uses of land may suggest
diverse activities of human beings, which influence different features of pedestrian flow [51–53],
we employed land use data to reflect the spatial variation in pedestrian exposure. Point of Interest
(POI) data that could be used to further segment the activities were also introduced into the RF model
to incorporate more detailed features on pedestrian flow. In this research, land use data were derived
from Landsat (National Aeronautics and Space Administration, Washington, DC, US) images from
2014 with a spatial resolution of 30 m. POIs were collected from Baidu, Inc. (Beijing, China) in 2014.
The company provides application programming interfaces whereby users are allowed to develop
programs for collecting POI information from Baidu Map. As some land use and POI variables are
highly correlated, not all types of land use and POIs were integrated into the prediction models. Table 1
describes the variables that were finally introduced in the vehicle-pedestrian collision density models.
The result of the collinearity test for these variables was 3.4, reflecting little collinearity.

Table 1. Description of variables in the vehicle-pedestrian collision density models.

Variable Name   Data Source                                    Description
NoMetro         Point of Interest (POI)                        No. of metro stations within 500 m of a Reference Point (RP)
NoBusStop       POI                                            No. of bus stops within 500 m of a RP
NoGov           POI                                            No. of government institutions within 500 m of a RP
NoBank          POI                                            No. of banking service facilities within 500 m of a RP
NoComBld        POI                                            No. of commercial buildings within 500 m of a RP
NoRetShp        POI                                            No. of retail shops within 500 m of a RP
NoMedi          POI                                            No. of medical service facilities within 500 m of a RP
NoEdu           POI                                            No. of educational institutions within 500 m of a RP
NoComp          POI                                            No. of companies within 500 m of a RP
NoPlaza         POI                                            No. of pedestrian plazas within 500 m of a RP
NoResi          POI                                            No. of residence places within 500 m of a RP
NoRest          POI                                            No. of restaurants within 500 m of a RP
AreaResi        Land use                                       Residential area (sq. m) within 500 m of a RP
AreaIndu        Land use                                       Industrial area (sq. m) within 500 m of a RP
AreaCom         Land use                                       Commercial area (sq. m) within 500 m of a RP
NoTaxi          Global Positioning System tracking point       No. of taxis that pass a RP

Due to data availability, we used the 2015 vehicle-pedestrian collision data, taxi GPS data from
2016, and land use and POI datasets from 2014. Since Changning District is located in the urban
area of Shanghai where the features of the built environment did not vary significantly from 2014 to
2016, it was reasonable to conduct analysis based on datasets collected from different years during
this period.
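To make the POI count variables in Table 1 concrete, the following sketch counts points within 500 m of a reference point. It assumes projected coordinates in metres and straight-line distance; the text does not state how the 500 m neighbourhood was measured, so both are assumptions, and the coordinates and function name are hypothetical.

```python
import math


def count_within(reference_point, points, radius=500.0):
    """Count points whose straight-line distance to the reference point is <= radius."""
    rx, ry = reference_point
    return sum(1 for px, py in points if math.hypot(px - rx, py - ry) <= radius)


# Hypothetical projected coordinates (m) of one reference point and nearby POIs.
reference_point = (0.0, 0.0)
retail_shops = [(300.0, 399.0), (300.0, 401.0), (0.0, 0.0), (-499.0, 0.0), (600.0, 0.0)]
metro_stations = [(1200.0, 50.0)]

features = {
    "NoRetShp": count_within(reference_point, retail_shops),
    "NoMetro": count_within(reference_point, metro_stations),
}
```

Repeating this for every reference point and POI category would give one row of predictor values per BSU.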

4. Result and Discussion


There were 1723 BSUs after the segmentation process. Following Equations (1) and (2),
the vehicle-pedestrian density surface was produced, and the mean and standard deviation values
were 0.008 and 0.01, respectively. The threshold value for identifying potential vehicle-pedestrian
collision hot spots was computed as 0.038, which resulted in 35 possible hazardous road locations
for pedestrians.
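As a quick check, the reported threshold follows from the mean-plus-three-standard-deviations rule of Section 2.3 applied to the rounded mean and standard deviation given above (the symbols T, \mu, and \sigma are notation introduced here for the threshold, mean, and standard deviation):

$$ T = \mu + 3\sigma = 0.008 + 3 \times 0.01 = 0.038 $$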
The RF models were established using GridSearchCV in SKlearn for parameter adjustment. In this
study, n_estimators, max_depth, and min_samples_split were set from 100 to 200, 2 to 30, and 2 to 20,
respectively. The values of the mean cross-validation score, mean squared error, median absolute
error, and R2 for each sample are presented in Table 2. Regardless of the sample, the value of R2 was
above 0.60. The mean cross-validation scores were about 0.60 and fluctuated only slightly, which suggests
that the results were relatively stable. The values of the mean squared error and median absolute
error were small. All these indicators reflect that the RF models could explain, to a large extent,
the variation in vehicle-pedestrian collision density when vehicle and pedestrian exposure variables
were considered. The result also indicates that the occurrence of vehicle-pedestrian collisions may
result from exposures (vehicle and pedestrian flows in this study), as well as from some risk factors
that require further investigation for treatment. This is the reason why it was essential to consider the
potential for collision reduction.

Table 2. Results of Random Forest (RF) models.

Sample     Mean Cross-Validation Score   Mean Squared Error   Median Absolute Error   R2
Sample 1   0.61 (±0.12)                  0.0040               0.0247                  0.6191
Sample 2   0.59 (±0.09)                  0.0037               0.0260                  0.6868
Sample 3   0.59 (±0.12)                  0.0039               0.0260                  0.6351
Sample 4   0.56 (±0.20)                  0.0048               0.0278                  0.6457
Sample 5   0.58 (±0.07)                  0.0032               0.0292                  0.6624

As mentioned before, the RF technique has strength in dealing with the complicated nonlinear
relationship between the vehicle (or pedestrian) flow and occurrence of vehicle-pedestrian collisions.
Although it may have some black-box problems, RF is capable of providing importance of variables
(also called “features” in RF). Figure 2 shows the value of the importance for each variable with different
samples. Although the importance of each variable varied in different samples, two variables (the
number of retail shops and the taxi flow) ranked as the top two regardless of which sample was used.
The mean feature importance of the two variables among the five samples was 0.3 and 0.15, respectively,
indicating their ability to predict the occurrence of vehicle-pedestrian collisions. As mentioned before,
previous studies have already investigated the relationship between land use characteristics and the
occurrence of traffic crashes involving pedestrians [30,51], and it was found that vehicle-pedestrian
collisions were more likely to happen in commercial areas. In this study, the commercial land was
further segmented into different types of places such as retail shops and restaurants. The average
importance value of the number of retail shops ranked in first place (see NoRetShp in Figure 2); the value
of the restaurant count ranged from 0.04 to 0.08. This may have occurred because different kinds of
activities may produce diverse types of pedestrian flow, thus significantly influencing the occurrence
of vehicle-pedestrian collisions. The findings suggest that introducing POIs into the vehicle-pedestrian
crash prediction models is desirable.

Figure 2. Feature importance of variables in each sample.

The final predicted vehicle-pedestrian collision density was produced by averaging the predictions
of five samples, and the potential of collision reduction was then calculated by subtracting the
prediction from the observation of vehicle-pedestrian collision density. Altogether, there were 634 BSUs
with collision reduction potential. By comparing the resultant locations with those detected by merely
setting the density threshold value, 4 of 35 potential hot spots were excluded. Figure 3 shows the
spatial distribution of hot spots that were finally determined as hazards for pedestrians (see solid black
lines in Figure 3), as well as locations with no crash reduction potential (see solid red lines in Figure 3).
It can be observed from the figure that hot spots were also clustered, resulting in several hot zones
for pedestrians. Some notable hot spots in this district (see the ellipse in Figure 3) were located in
Tian Shan Road, Gu Bei Road, Mao Tai Road, Lou Shan Guan Road, and South Yu Ping Road. If the
potential for vehicle-pedestrian collision reduction was not considered, the length of the roads that
required further examination, including those colored in both black and red in the figure, was 2.7 km
in total. When the proposed integrated method was applied, only 1.8 km of road segments were
identified as hazardous. This allows engineers and policy-makers to focus their efforts on locations
where there might be a higher likelihood of improving pedestrian safety.
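The two-stage selection described above can be sketched as follows. This is a simplified illustration rather than the study's implementation: it works on individual spatial units, assumes that a unit is a candidate when its observed density is at or above the threshold, and treats a positive difference between the observed density and the mean prediction as reduction potential. The function name `select_hot_spots`, the dictionary keys, the threshold, and all values are hypothetical.

```python
from statistics import mean


def select_hot_spots(units, threshold):
    """Return (candidates, confirmed) lists of unit ids.

    units: list of dicts with keys
        "id"          - unit identifier
        "observed"    - observed collision density
        "predictions" - predicted densities, one per sample model
    A unit is a candidate when its observed density reaches the threshold
    (assumed here to be inclusive). It is confirmed when, in addition,
    observed minus mean prediction is positive.
    """
    candidates, confirmed = [], []
    for unit in units:
        if unit["observed"] < threshold:
            continue
        candidates.append(unit["id"])
        potential = unit["observed"] - mean(unit["predictions"])
        if potential > 0:
            confirmed.append(unit["id"])
    return candidates, confirmed


# Hypothetical units; the densities are invented for illustration.
units = [
    {"id": "A", "observed": 9.0, "predictions": [6.0, 7.0, 6.5, 7.5, 6.0]},  # potential 2.4
    {"id": "B", "observed": 8.0, "predictions": [8.5, 9.0, 8.0, 9.5, 8.5]},  # potential -0.7
    {"id": "C", "observed": 2.0, "predictions": [1.0, 1.5, 1.0, 1.2, 1.3]},  # below threshold
]

candidates, confirmed = select_hot_spots(units, threshold=5.0)
print(candidates, confirmed)
```

In this toy case, unit B passes the density threshold but is dropped because the model already predicts a density at least as high as the one observed. With the figures reported above, the same filter removes 4 of 35 candidates, leaving 31, and shortens the road length needing examination from 2.7 km to 1.8 km, a reduction of 0.9 km or about one third.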

Figure 3. Spatial distribution of vehicle-pedestrian collision hot spots.


Notably, in the absence of detailed vehicle and pedestrian exposure information at the micro level,
we employed three variables (taxi flow, land use, and POI data) to reflect the variation in traffic
and pedestrian characteristics across the study area by following previous studies on the relationship
between the vehicle volume (or pedestrian flow) and taxi flow (or land use characteristics) [52–54].
Although the focus of this research was not the validation of the three variables as proxies of vehicle
and pedestrian flow, the way in which vehicle and pedestrian exposure can be measured has always
been an area of interest in road safety research [30]. With more experiments on the feasibility of proxy
variables being performed in future, better tools can be developed to increase the precision of the
estimation, and the proposed method in this research could be further improved.
5. Conclusions
The improvement in pedestrian safety plays a crucial role in developing a safe and friendly
walking environment to help ensure urban sustainability. Given the importance of hot spot detection
in safety management, we proposed a framework for the identification of hazardous road locations
for pedestrians by integrating the likelihood of the occurrence of vehicle-pedestrian collisions and
the potential for the reduction in traffic collisions involving vehicles and pedestrians. The research
is significant both theoretically, in enriching the methodology of hotspot identification, and
practically, in providing useful information for policy-makers to propose countermeasures for
pedestrian safety.
The method through which traffic and pedestrian exposures are measured by taxi trajectories,
land use, and POI variables has not been fully explored. As a further step, research efforts may be
dedicated to additional validation experiments. We used the proposed framework to identify the
vehicle-pedestrian crash hot spots in only one period. If more vehicle-pedestrian collision data in other
periods are available, the usefulness of the framework can be further examined. As the identification
of hazardous road locations is the first step in safety improvement programs, future studies should
investigate risk factors and the treatment of hot spots.

Author Contributions: Conceptualization, S.Y.; methodology, S.Y. and J.W. (Jianping Wu); software, J.W.
(Jinzi Wang); validation, S.Y., L.F. and J.W. (Jianping Wu); formal analysis, S.Y.; investigation, S.Y.; resources,
S.Y. and J.W. (Jianping Wu); data curation, S.Y. and J.W. (Jinzi Wang); writing—original draft preparation, S.Y.;
writing—review and editing, S.Y., L.F. and J.W. (Jianping Wu); visualization, S.Y.; supervision, J.W. (Jianping Wu);
project administration, S.Y.; funding acquisition, S.Y. and J.W. (Jianping Wu).
Funding: This research was funded by National Key R&D Program of China, grant No. 2017YFE0100700; National
Natural Science Foundation of China, grant No. 41701462; and China Postdoctoral Science Foundation, grants
No. 2016M601539 and No. 2018T110371.
Acknowledgments: The authors would like to thank Jie Zhu for technical support, and greatly appreciate the
valuable comments from editors and three reviewers.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the
study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to
publish the results.

References
1. WHO. Global Status Report on Road Safety 2015; World Health Organization: Geneva, Switzerland, 2015.
2. Loo, B.P.Y.; Yao, S. The Identification of Traffic Crash Hot Zones under the Link-Attribute and Event-Based
Approaches in a Network-Constrained Environment. Comput. Environ. Urban Syst. 2013, 41, 249–261.
[CrossRef]
3. Yamada, I.; Thill, J.C. Local Indicators of Network-Constrained Clusters in Spatial Patterns Represented by a
Link Attribute. Ann. Assoc. Am. Geogr. 2010, 100, 269–285. [CrossRef]
4. Harirforoush, H.; Bellalite, L. A New Integrated GIS-Based Analysis to Detect Hotspots: A Case Study of the
City of Sherbrooke. Accid. Anal. Prev. 2016, in press. [CrossRef] [PubMed]
5. Xie, Z.; Yan, J. Kernel Density Estimation of Traffic Accidents in a Network Space. Comput. Environ.
Urban Syst. 2008, 32, 396–406. [CrossRef]
6. Xie, Z.; Yan, J. Detecting Traffic Accident Clusters with Network Kernel Density Estimation and Local Spatial
Statistics: An Integrated Approach. J. Transp. Geogr. 2013, 31, 64–71. [CrossRef]
7. Cheng, W.; Washington, S.P. Experimental Evaluation of Hotspot Identification Methods. Accid. Anal. Prev.
2005, 37, 870–881. [CrossRef] [PubMed]
8. Long, T.T.; Somenahalli, S.V.C. Using GIS to Identify Pedestrian-Vehicle Crash Hot Spots and Unsafe Bus
Stops. J. Public Trans. 2011, 14, 99–114. [CrossRef]
9. Hao, Y.; Liu, P.; Chen, J.; Wang, H. Comparative Analysis of the Spatial Analysis Methods for Hotspot
Identification. Accid. Anal. Prev. 2014, 66, 80–88. [CrossRef]
10. Nie, K.; Wang, Z.; Du, Q.; Ren, F.; Tian, Q. A Network-Constrained Integrated Method for Detecting
Spatial Cluster and Risk Location of Traffic Crash: A Case Study from Wuhan, China. Sustainability 2015, 7,
2662–2677. [CrossRef]
11. Naji, H.A.H.; Xue, Q.; Lyu, N.; Wu, C.; Zheng, K. Evaluating the Driving Risk of near-Crash Events Using a
Mixed-Ordered Logit Model. Sustainability 2018, 10, 2868. [CrossRef]
12. Loo, B.P.; Yao, S.; Wu, J. Spatial Point Analysis of Road Crashes in Shanghai: A GIS-Based Network Kernel
Density Method. In Proceedings of the 19th International Conference on Geoinformatics, Shanghai, China,
24–26 June 2011.
13. Yamada, I.; Thill, J.C. Local Indicators of Network-Constrained Clusters in Spatial Point Patterns. Geogr. Anal.
2007, 39, 268–292. [CrossRef]
14. Yao, S.; Loo, B.P.; Yang, B.Z. Traffic Collisions in Space: Four Decades of Advancement in Applied GIS.
Ann. GIS 2016, 22, 1–14. [CrossRef]
15. Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman & Hall/CRC Press: Boca Raton,
FL, USA, 1986.
16. Flahaut, B.; Mouchart, M.; Martin, E.S.; Thomas, I. The Local Spatial Autocorrelation and the Kernel Method
for Identifying Black Zones: A Comparative Approach. Accid. Anal. Prev. 2003, 35, 991–1004. [CrossRef]
17. Erdogan, S.; Yilmaz, I.; Baybura, T.; Gullu, M. Geographical Information Systems Aided Traffic Accident
Analysis System Case Study: City of Afyonkarahisar. Accid. Anal. Prev. 2008, 40, 174–181. [CrossRef]
[PubMed]
18. Krisp, J.M.; Durot, S. Segmentation of Lines Based on Point Densities—An Optimisation of Wildlife Warning
Sign Placement in Southern Finland. Accid. Anal. Prev. 2007, 39, 38–46. [CrossRef] [PubMed]
19. Deacon, J.A.; Charles, V.Z.; Deen, R.C. Identification of Hazardous Rural Highway Locations. Transp. Res. Rec.
1974, 410. [CrossRef]
20. Norden, M.; Orlansky, J.; Jacobs, H. Application of Statistical Quality-Control Techniques to Analysis of
Highway-Accident Data. Highw. Res. Board Bull. 1956, 117, 17–31.
21. Morin, D.A. Application of Statistical Concepts to Accident Data. Highw. Res. Rec. 1967, 188, 72–79.
22. Stokes, R.; Mutabazi, M. Rate-Quality Control Method of Identifying Hazardous Road Locations.
Transp. Res. Rec. 1996, 1542, 44–48. [CrossRef]
23. McGuigan, D.R.D. The Use of Relationships between Road Accidents and Traffic Flow in “Black-Spot”
Identification. Traffic Eng. Control 1981, 22, 448–453.
24. McGuigan, D.R.D. Non-Junction Accident Rates and Their Use In ‘black-Spot’ Identification. Traffic Eng. Control
1982, 23, 60–65.
25. Mahalel, D.; Hakkert, A.S.; Prashker, J.N. A System for the Allocation of Safety Resources on a Road Network.
Accid. Anal. Prev. 1982, 14, 45–56. [CrossRef]
26. Cheng, W.; Washington, S. New Criteria for Evaluating Methods of Identifying Hot Spots. Transp. Res. Rec.
2008, 2083, 76–85. [CrossRef]
27. Waller, L.A.; Gotway, C.A. Applied Spatial Statistics for Public Health Data; Wiley-Interscience: Hoboken, NJ,
USA, 2004.
28. Huang, H.; Hong, C.C. Modeling Road Traffic Crashes with Zero-Inflation and Site-Specific Random Effects.
Stat. Methods Appl. 2010, 19, 445–462. [CrossRef]
29. Anastasopoulos, P.C.; Mannering, F.L. A Note on Modeling Vehicle Accident Frequencies with
Random-Parameters Count Models. Accid. Anal. Prev. 2009, 41, 153–159. [CrossRef] [PubMed]
30. Yao, S.; Loo, B.P.Y.; Lam, W.W.Y. Measures of Activity-Based Pedestrian Exposure to the Risk of
Vehicle-Pedestrian Collisions: Space-Time Path Vs. Potential Path Tree Methods. Accid. Anal. Prev. 2015, 75,
320–332. [CrossRef] [PubMed]
31. Chang, L.Y. Analysis of Freeway Accident Frequencies: Negative Binomial Regression Versus Artificial
Neural Network. Saf. Sci. 2005, 43, 541–557. [CrossRef]
32. Xie, Y.; Lord, D.; Zhang, Y. Predicting Motor Vehicle Collisions Using Bayesian Neural Network Models:
An Empirical Analysis. Accid. Anal. Prev. 2007, 39, 922–933. [CrossRef]
33. Zeng, Q.; Huang, H.; Xin, P.; Wong, S.C.; Gao, M. Rule Extraction from an Optimized Neural Network for
Traffic Crash Frequency Modeling. Accid. Anal. Prev. 2016, 97, 87–95. [CrossRef]
34. Liaw, A.; Wiener, M. Classification and Regression by Randomforest. R News 2002, 2, 18–22.
35. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
36. Gromping, U. Variable Importance Assessment in Regression: Linear Regression Versus Random Forest.
Am. Stat. 2009, 63, 308–319. [CrossRef]
37. Haas, J.; Ban, Y. Urban Growth and Environmental Impacts in Jing-Jin-Ji, the Yangtze, River Delta and the
Pearl River Delta. Int. J. Appl. Earth Obs. Geoinf. 2014, 30, 42–55. [CrossRef]
38. Oliveira, S.; Oehler, F.; San-Miguel-Ayanz, J.; Camia, A.; Pereira, J.M.C. Modeling Spatial Patterns of Fire
Occurrence in Mediterranean Europe Using Multiple Regression and Random Forest. For. Ecol. Manag. 2012,
275, 117–129. [CrossRef]
39. Topouzelis, K.; Psyllos, A. Oil Spill Feature Selection and Classification Using Decision Tree Forest on Sar
Image Data. ISPRS J. Photogramm. Remote Sens. 2012, 68, 135–143. [CrossRef]
40. Rodriguez-Galiano, V.F.; Chica-Olmo, M.; Chica-Rivas, M. Predictive Modelling of Gold Potential with the
Integration of Multisource Information Based on Random Forest: A Case Study on the Rodalquilar Area,
Southern Spain. Int. J. Geogr. Inf. Sci. 2014, 28, 1336–1354. [CrossRef]
41. Wang, H.; Zhao, Y.; Pu, R.; Zhang, Z. Mapping Robinia Pseudoacacia Forest Health Conditions by Using
Combined Spectral, Spatial, and Textural Information Extracted from Ikonos Imagery and Random Forest
Classifier. Remote Sens. 2015, 7, 9020–9044. [CrossRef]
42. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140. [CrossRef]
43. Genuer, R.; Poggi, J.M.; Tuleau-Malot, C. Variable Selection Using Random Forests. Pattern Recognit. Lett.
2010, 31, 2225–2236. [CrossRef]
44. Scikit-learn. Available online: https://scikit-learn.org/stable/ (accessed on 18 November 2018).
45. Cutler, D.R.; Edwards, T.C.; Beard, K.H.; Cutler, A.; Hess, K.T.; Gibson, J.; Lawler, J.J. Random Forests for
Classification in Ecology. Ecology 2007, 88, 2783–2792. [CrossRef]
46. Li, Q.; Zhang, T.; Yu, Y. Using Cloud Computing to Process Intensive Floating Car Data for Urban Traffic
Surveillance. Int. J. Geogr. Inf. Sci. 2011, 25, 1303–1322. [CrossRef]
47. Liu, X.; Gong, L.; Gong, Y.; Liu, Y. Revealing Travel Patterns and City Structure with Taxi Trip Data.
J. Transp. Geogr. 2015, 43, 78–90. [CrossRef]
48. Gao, S.; Wang, Y.; Gao, Y.; Liu, Y. Understanding Urban Traffic-Flow Characteristics: A Rethinking of
Betweenness Centrality. Environ. Plan. B Plan. Des. 2013, 40, 135–153. [CrossRef]
49. Wang, X.; Fan, T.; Chen, M.; Deng, B.; Wu, B.; Tremont, P. Safety Modeling of Urban Arterials in Shanghai,
China. Accid. Anal. Prev. 2015, 83, 57–66. [CrossRef] [PubMed]
50. Chen, B.Y.; Yuan, H.; Li, Q.; Lam, W.H.K.; Shaw, S.L.; Yan, K. Map-Matching Algorithm for Large-Scale
Low-Frequency Floating Car Data. Int. J. Geogr. Inf. Sci. 2014, 28, 22–38. [CrossRef]
51. Yang, B.Z.; Loo, B.P.Y. Land Use and Traffic Collisions: A Link-Attribute Analysis Using Empirical Bayes
Method. Accid. Anal. Prev. 2016, 95, 236–249. [CrossRef]
52. Ozbil, A.; Peponis, J.; Stone, B. Understanding the Link between Street Connectivity, Land Use and Pedestrian
Flows. Urban Des. Int. 2011, 16, 125–141. [CrossRef]
53. Lamíquiz, P.J.; López-Domínguez, J. Effects of Built Environment on Walking at the Neighbourhood Scale.
A New Role for Street Networks by Modelling Their Configurational Accessibility? Transp. Res. A Policy Pract.
2015, 74, 148–163. [CrossRef]
54. Castro, P.S.; Zhang, D.; Li, S. Urban Traffic Modelling and Prediction Using Large Scale Taxi Gps Traces.
In Proceedings of the 10th International Conference, Pervasive 2012, Newcastle, UK, 18–22 June 2012.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
