Applying Machine Learning Approaches To Analyze The Vulnerable Roadusers' Crashes at Statewide Traffic Analysis Zones
Article info
Article history:
Received 8 June 2018
Received in revised form 30 March 2019
Accepted 16 April 2019
Available online xxxx
Keywords: Machine learning; Macro-level; Decision tree regression; Statewide traffic analysis zone; Ensemble technique

Abstract
Introduction: In this paper, we present machine learning techniques to analyze pedestrian and bicycle crashes by developing macro-level crash prediction models. Methods: We collected the 2010–2012 Statewide Traffic Analysis Zone (STAZ) level crash data and developed a rigorous machine learning approach (i.e., decision tree regression (DTR) models) for both pedestrian and bicycle crash counts. To our knowledge, this is the first application of DTR models in the burgeoning macro-level traffic safety literature. Results: The DTR models uncovered the most significant predictor variables for both response variables (pedestrian and bicycle crash counts) in terms of three broad categories: traffic, roadway, and socio-demographic characteristics. Additionally, spatial predictor variables of neighboring STAZs were considered along with those of the targeted STAZ in both DTR models. The DTR model considering spatial predictor variables (spatial DTR model) was compared with the model without spatial predictor variables (aspatial DTR model), and the comparison showed that the spatial DTR model achieved better prediction accuracy than the aspatial DTR model. Finally, the current research effort contributed to the safety literature by applying some ensemble techniques (i.e., bagging, random forest, and gradient boosting) in order to improve the prediction accuracy of the DTR models (weak learner) for macro-level crash counts. The study revealed that all the ensemble techniques performed slightly better than the DTR model and that the gradient boosting technique outperformed the other competing ensemble techniques in macro-level crash prediction models.
© 2019 National Safety Council and Elsevier Ltd. All rights reserved.
1. Introduction

Walking and bicycling are the most active forms of transportation; they have the lowest impact on the environment and improve the physical health of pedestrians and bicyclists. Transportation agencies are increasingly promoting walking and bicycling options for short distance trips to mitigate climate change and the obesity problem among adults. However, the most common problem impeding the preference for walking and bicycling is traffic safety concerns. According to the latest traffic safety data from the National Highway Traffic Safety Administration (NHTSA), pedestrian and bicycle deaths increased by 9.0% and 1.3%, respectively, in 2016 compared to the calendar year 2015 (NHTSA, 2017a). Thus, the safety challenges associated with pedestrians and bicyclists remain an important concern for transportation policy. The safety risk posed to active transportation users in Florida is exacerbated compared to active transportation users in the US. In 2015, while the national average for pedestrian and bicyclist fatalities per 100,000 population was 1.67 and 2.50, respectively, the corresponding number for the state of Florida was 3.10 (ranked second among all states) and 7.40 (ranked first among all states), which clearly presents the challenge faced in Florida (NHTSA, 2015; NHTSA, 2017b). Crash prediction models applied to pedestrian and bicycle crashes would give transportation planners valuable insights for identifying the contributing factors related to pedestrian and bicyclist crashes, which might be helpful for policy implications at a planning level.

In transportation safety research, crash prediction models are developed for two levels: (1) micro-level and (2) macro-level. The former focuses on crashes at a segment or intersection to identify the influence of contributing factors with the objective of offering engineering solutions. On the other hand, macro-level crashes from a spatial aggregation such as traffic analysis zone, census block, census tract, or county are considered to quantify the significant factors at a macro-level so that countermeasures can be provided from a planning perspective. Statistical models, such as Poisson and negative binomial regression, have been employed to analyze both micro- and macro-level crashes for many years. However, statistical models have their own model-specific assumptions which can lead to inaccurate results of injury likelihood (Chang & Chen, 2005; Rahman, 2018; Rahman, Abdel-Aty, Hasan, & Cai, 2019; Saad, Abdel-aty, & Lee, 2019; Saad, Abdel-aty, Lee, & Cai, 2019; Saad, Abdel-aty, Lee, & Wang, 2018a; Yuan & Abdel-Aty, 2018). In this regard, this study contributes to the safety literature by undertaking pedestrian

⁎ Corresponding author.
E-mail address: [email protected] (M.S. Rahman).
https://doi.org/10.1016/j.jsr.2019.04.008
0022-4375/© 2019 National Safety Council and Elsevier Ltd. All rights reserved.
Please cite this article as: M.S. Rahman, M. Abdel-Aty, S. Hasan, et al., Applying machine learning approaches to analyze the vulnerable road-users' crashes at statewide traf..., Journal of Safety Research, https://doi.org/10.1016/j.jsr.2019.04.008
2 M.S. Rahman et al. / Journal of Safety Research xxx (2019) xxx
and bicycle crash prediction modeling using a widely applied machine learning technique: decision tree regression (DTR). To the best of our knowledge, none of the studies have explored machine learning techniques in analyzing pedestrian and bicycle crashes at the macro-level. In this regard, three broad categories of predictor variables including traffic, roadway, and socio-demographic characteristics are considered in the DTR model development and validation. In addition, the attributes of the neighboring zones are considered as predictor variables along with the targeted STAZs' attributes in the DTR models to improve the prediction accuracy of pedestrian and bicycle crashes. Furthermore, the current study has undertaken some ensemble techniques (i.e., bagging, random forest, and gradient boosting) to improve the prediction accuracy of the DTR models, considered as weak learners, which provides valuable insights on advancing crash prediction modeling techniques for macro-level crash analysis.

2. Literature review

Road traffic accidents are widely recognized as a national health problem which affects society both emotionally and economically (Blincoe, Seay, & Zaloshnja, 2000; NHTSA, 2005). A considerable number of research efforts have examined crash frequency estimation (vehicle, pedestrian, and bicycle) (see Lord & Mannering (2010) for a detailed review). These studies have been conducted for different modes of vehicle (automobiles and motorbikes), pedestrian and bicycle, and for different scales – micro (such as intersection and segment) and macro-level (such as census tract, traffic analysis zone (TAZ), county). An exhaustive review of micro-level (see Eluru, Bhat, and Hensher (2008), Lord, Washington, and Ivan (2005), Nashad, Yasmin, Eluru, Lee, and Abdel-Aty (2016) for detailed micro-level literature review) and macro-level (see Cai, Abdel-Aty, Lee, and Eluru (2017), Cai, Lee, Eluru, and Abdel-Aty (2016), Lee, Yasmin, Eluru, Abdel-Aty, and Cai (2018), Wang, Yuan, Schultz, and Fang (2018) for detailed macro-level literature review) crash frequency studies is beyond the scope of this paper. These studies have heavily focused on econometric statistical modeling approaches for the prediction of traffic crashes while exploring significant contributing factors related to crash occurrence. However, statistical models can lead to inaccurate estimations of injury likelihood if the prespecified model assumptions and the underlying relationship between dependent and independent variables of these models are invalid (Chang & Chen, 2005).

Moreover, the presence of a large number of zeros in pedestrian and bicycle crashes is one of the major methodological challenges in statistical modeling to analyze the contributing factors related to pedestrian and bicyclist crashes. In crash count models, the presence of excess zeros may result from two underlying processes or states of crash frequency likelihoods: crash-free state (or zero crash state) and crash state (see Mannering, Shankar, and Bhat (2016) for more explanation). In the presence of such a dual state, application of a single-state model may result in biased and inconsistent parameter estimates. In a statistical framework, the potential relaxation of the single-state count model for addressing the issue of excess zeros is the zero inflated (ZI) model (Shankar, Milton, & Mannering, 1997). However, several research studies have criticized the application of dual-state ZI models for traffic safety analysis (Lord et al., 2005; Lord, Washington, & Ivan, 2007; Son, Kweon, & Park, 2011). A ZI model assumes that two types of zeros exist, i.e., sampling zeros and structural zeros. For traffic safety, the structural zeros correspond to inherently safe conditions implying zero crash by nature and the sampling zeros correspond to potential crash conditions implying zero crash only by chance (Lord et al., 2005; Lord et al., 2007). Hence, the statistical assumption of having structural zeros is unrealistic as a traffic crash could occur under any conditions.

Recently, machine learning and/or data mining techniques have become popular in transportation safety research to determine the factors associated with traffic crashes. Unlike statistical models, machine learning techniques are non-parametric methods which do not require any pre-defined underlying relationships between the target variable and predictors (Gong, Abdel-Aty, Cai, & Rahman, 2019; Tavakoli Kashani, Rabieyan, & Besharati, 2014). Among the machine learning techniques, the decision tree model, which can identify and easily explain the complex patterns associated with crash risk, has gained much popularity in the transportation safety literature (Chang & Chen, 2005; Chang & Chien, 2013; Chang & Wang, 2006; Pande, Abdel-Aty, & Das, 2010). To overcome the shortcomings of statistical modeling, a decision tree can be a preferred alternative for forecasting traffic crashes with reasonable interpretations. Unlike statistical models, decision trees do not need any predefined model assumption or underlying relationship between dependent and independent variables. They deal well with multicollinear independent variables and treat discrete variables with more than two levels satisfactorily (Karlaftis & Golias, 2002; Washington & Wolf, 1997). Moreover, decision tree models can help in deciding how to subdivide heavily skewed target variables (i.e., zero crash counts) into ranges, while statistical modeling has some limitations for dealing with heavily skewed data (Song & Lu, 2015). Therefore, decision tree models might be a preferred option to analyze a heavily skewed response variable, which is most common in pedestrian and bicycle crashes. A summary of earlier studies employing decision tree models in the traffic safety literature is presented in Table 1 (Abdel-Aty, Keller, & Brady, 2005; Chang & Chen, 2005; Chang & Chien, 2013; Chang & Wang, 2006; De Oña, López, & Abellán, 2013; Eustace, Alqahtani, & Hovey, 2018; Iragavarapu, Lord, & Fitzpatrick, 2015; Karlaftis & Golias, 2002; Kashani & Mohaymany, 2011; Montella, Aria, D'Ambrosio, & Mauriello, 2012; Pande et al., 2010; Tavakoli Kashani et al., 2014; Wah, Nasaruddin, Voon, & Lazim, 2012; Zheng, Lu, & Denver, 2016). The information provided in the table includes the study unit considered, the methodological approach employed, and the target variables analyzed in the decision tree framework. The following observations can be inferred from the table. It is evident that all the existing decision tree-based safety studies are conducted at a micro-level such as roadway segments and intersections. To the best of our knowledge, none of the studies have explored decision tree methods in order to build the crash prediction model at the macro-level. It is also noticed that most of the model structures employed in developing decision trees are classification trees, except for two studies (Abdel-Aty et al., 2005; Karlaftis & Golias, 2002) which conducted hierarchical tree-based regression for developing the micro-level crash prediction model. Within the decision tree structure, those studies did not explore the total number of pedestrian and bicycle crashes; they have predominantly analyzed crash frequency by severity levels or other different attribute levels.

With the different modeling techniques, vulnerable road users' (pedestrian and bicycle) crashes have been investigated separately. Different types of contributory factors were identified from previous studies. Table 2 provides a summary of contributing factors related to vulnerable road users' (non-motorist) crashes including both macro- (Cai, Abdel-Aty, & Lee, 2017; Cai et al., 2016; Lee, Abdel-Aty, Choi, & Huang, 2015; Nashad et al., 2016; Ukkusuri, Hasan, & Aziz, 2011; Zhang, Bigham, Ragland, & Chen, 2015) and micro-level (Abdel-Aty et al., 2005; Abdel-Aty, Chundi, & Lee, 2007; Eluru et al., 2008; Lee & Abdel-Aty, 2005; Pitt, Guyer, Hsieh, & Malek, 1990; Prati, Pietrantoni, & Fraboni, 2017; Roudsari et al., 2004) studies separately. Some general observations can be made from Table 2. In macro-level analysis, most of the studies found that the length of sidewalks, total employment, number of public transit commuters, number of commuters by walk, number of commuters by bike, vehicle miles travel, population density, number of rail and bus stations, number of hotels, motels, and guest houses, number of schools, proportion of uneducated people, and number of signalized intersections had a positive impact on the vulnerable road users' crash frequency. However, the median household income, proportion of heavy vehicles, proportion of high speed (55 mph or higher) roads, average geodesic distance, betweenness centrality, and clustering coefficients reduced the likelihood of pedestrian and bicyclist
Table 1
Summary of previous traffic safety studies using decision tree and ensemble techniques.
Technique | Study | Study unit (level) | Methodological approach | Target variable
Decision tree | Kashani et al. (2014) | Roadway segment (Micro) | Classification tree | Injury severity level – injury, fatality
Decision tree | Zheng et al. (2016) | Highway-rail grade crossings (Micro) | Classification tree | Highway-rail grade crossings crash
Decision tree | Kashani et al. (2011) | Two-lane, two-way rural roads segments (Micro) | Classification tree | Injury severity level – light injury, serious injury, fatality
Decision tree | Iragavarapu et al. (2015) | Road segments, pedestrian crash (Micro) | Classification tree | Injury severity level – fatal or non-fatal
Decision tree | Chang and Chen (2005) | National freeway (Micro) | Classification tree | Injury severity level (0–4, 4 representing 4 or more crashes)
Decision tree | Wah et al. (2012) | Roadway segments (Micro) | Classification tree | Category of frequencies of motorcycle accidents – zero frequency (0), low frequency (1–19), high frequency (20 and above)
Decision tree | Chang and Wang (2006) | Roadway segments (Micro) | Classification tree | Injury severity level – fatality, injury, no-injury
Decision tree | Pande et al. (2010) | Roadway segments (Micro) | Classification tree | Binary variable – crash vs non-crash
Decision tree | Chang and Chien (2013) | National freeways (Micro) | Classification tree | Injury severity level – fatality, injury, no-injury
Decision tree | De Oña et al. (2013) | Road segments, rural highways (Micro) | Classification tree | Accident severity – slightly injured, killed or seriously injured (KSI) (state B)
Decision tree | Montella et al. (2012) | Roadway segments, powered two-wheeler crashes (Micro) | Classification tree | Several response variables – severity, crash type, involved vehicles, alignment
Decision tree | Eustace et al. (2018) | Road segments (Micro) | Classification tree | Injury severity level – fatal/injury, and property damage only
Decision tree | Abdel-Aty et al. (2005) | Road segments (Micro) | Regression tree | Total crash, angle crash, left turn crash, head on crash, pedestrian crash, rear-end crash, right turn crash, sideswipe crash
Decision tree | Karlaftis and Golias (2002) | Road segments (Micro) | Regression tree | Total number of crashes
Ensemble techniques | Sohn et al. (2002) | Road segments (Micro) | Arcing and bagging | Injury severity level – bodily injury and property damage
crashes. Table 2 also provides summary findings from earlier studies regarding the contributing factors that are related to the non-motorist crash risks in micro-level analysis. Overall, studies analyzing non-motorized crash injury severity indicate that non-motorists who are male, intoxicated, and very young or elderly are more prone to severe injuries, as are pedestrians struck by an alcohol intoxicated driver, by non-sedan vehicles (SUVs, pick-up vans), and by high speed vehicles. Moreover, vulnerable road users' injury occurred in crashes at school zone
Table 2
Summary of contributory factors for vulnerable road users' crashes.
Level | Study | Factors with positive association | Factors with negative association
Macro | Cai, Abdel-Aty, and Lee (2017) | Proportion of length of local roads, signalized intersection density, and length of sidewalks, pedestrian and bicycle commuters, and population equal to or older than 65 years old | Median household income, and proportion of heavy vehicle mileage
Macro | Cai et al. (2016) | Length of sidewalks, number of total employments, public transportation commuter, walk commuters, bike commuters, population density, and signalized intersection density | Proportion of heavy vehicle mileage in VMT, distance to nearest urban area
Macro | Nashad et al. (2016) | Vehicle miles travel, total population, public transit commuters, bike commuters, walk commuters, school enrolment density, length of sidewalk, proportion of urban roads | Proportion of heavy vehicles, distance to nearest urban area, proportion of industry employments
Macro | Zhang et al. (2015) | Number of commercial properties, number of bus lines, number of 4-way intersections, number of housing units, vehicle miles travel | Average geodesic distance, betweenness centrality, and clustering coefficients
Macro | Lee et al. (2015) | Total population, number of rail and bus station, number of hotels, motels, and guest house per square mile, and number of schools per square mile | Median household income, proportion of high-speed roads (55 mph or higher), proportion of people working at home
Macro | Ukkusuri et al. (2011) | Proportion of African-American population, industrial land use proportion of total land use, total number of signalized intersections, number of bus stops, and proportion of uneducated people | Median age population, number of three approach intersections, proportion of local road

Level | Study | Contributing factors
Micro | Prati et al. (2017) | Road type, age of cyclist, gender of cyclist, and the type of opponent vehicle
Micro | Eluru et al. (2008) | Age of the individual, the speed limit on the roadway, location of crashes, and time-of-day
Micro | Abdel-Aty et al. (2007) | Driver's age, gender, and alcohol use, pedestrian's/bicyclist's age, number of lanes, median type, and speed limits
Micro | Abdel-Aty et al. (2005) | Right turn channelized on major roads, exclusive left turn lanes on minor roads, daily traffic volume on major roads, speed limits on minor roads, total left turn lanes of minor road
Micro | Lee et al. (2015) | Higher traffic volume, drivers' age, drivers' sex, vehicle type, traffic control devices, locations, and lighting conditions
Micro | Roudsari et al. (2004) | Drivers' age, vehicle class, and speed limits on the roadway
Micro | Pitt et al. (1990) | Drivers' sex, gender, vehicle characteristics, speed of the roadway, and the time of the day
to obtain subsamples b and c,

$$D_a = \sum_{l=1}^{L} \left( Y_{(a)l} - \mu_{(a)} \right)^2 = \text{total deviance in sample (node) } a \qquad (4)$$

$$D_b = \sum_{m=1}^{M} \left( Y_{(b)m} - \mu_{(b)} \right)^2 = \text{total deviance in sample (node) } b \qquad (5)$$

$$D_c = \sum_{n=1}^{N} \left( Y_{(c)n} - \mu_{(c)} \right)^2 = \text{total deviance in sample (node) } c \qquad (6)$$

$$\mu_{(b)} = \frac{1}{M} \sum_{m=1}^{M} Y_{(b)m} = \text{arithmetic mean of subsample (node) } b \qquad (7)$$

$$\mu_{(c)} = \frac{1}{N} \sum_{n=1}^{N} Y_{(c)n} = \text{arithmetic mean of subsample (node) } c \qquad (8)$$

It is worth mentioning that M is the sample size of subsample (node) b, and N is the sample size of subsample (node) c. In a regression tree, a predictor variable X_i taken from X_{n,p} is sought to partition the column vector Y such that the deviance reduction function shown in Eq. (9) is maximized.

$$\Delta = \sum_{l=1}^{L} \left( Y_{(a)l} - \mu_{(a)} \right)^2 - \sum_{m=1}^{M} \left( Y_{(b)m} - \mu_{(b)} \right)^2 - \sum_{n=1}^{N} \left( Y_{(c)n} - \mu_{(c)} \right)^2 \qquad (9)$$

While searching the matrix X_{n,p}, two items must be sought to maximize Eq. (9): the variable X_i and the numerical value at which the corresponding partition of Y will produce the maximum value of the deviance reduction function. When this maximal partition is found, the original data in node a are partitioned into two subsamples b and c having minimal combined deviance compared with all possible subsamples. Thus, the reduction in node a deviance is greatest when the deviances at nodes b and c are smallest. As mentioned earlier, for the numerical variable importance score, decision tree regression looks at the improvement measure attributable to each variable in its role as either a primary or a surrogate splitter. The values of all these improvements are summed over each node and totaled, and are then scaled relative to the best performing variable. The variable with the highest sum of improvements is scored 100, and all other variables will have lower scores ranging downwards towards zero. A variable can obtain an importance score of zero in decision tree regression only if it never appears as either a primary or a surrogate splitter. Because such a variable plays no role anywhere in the tree, eliminating it from the dataset should make no difference to the results.

In a regression tree, tree growth will continue until there are homogenous observations in each terminal node. At first, the regression tree produces the maximal tree with a complex structure that overfits the training data. The maximal tree produces good prediction accuracy on the training data but worse prediction accuracy on the testing sample. In other words, a complex tree overfits the training observations, which results in overstated confidence in predictions and inclusion of insignificant predictor variables. The most common method used to reduce the overfitting problem is called pruning. This method uses criteria about model complexity to trim the full tree model to a smaller and more manageable or practical tree size, which reduces overfitting significantly (Washington, 2000; Washington & Wolf, 1997). Pruning is performed according to the cost-complexity algorithm. The principle behind pruning is to remove the branches that add little to the predictive value of the tree. The pruning process starts with the maximal tree and selectively prunes upward to produce a sequence of sub-trees of the maximal tree, and eventually collapses to the tree of the root node. The pruning process relies on a complexity parameter which is defined through a cost function of misclassification of the data and the tree size (Kashani & Mohaymany, 2011). For each tree created, the “misclassification error rate” or “misclassification cost”, or in other words, the “goodness of fit” index, is calculated as Eq. (10):

$$\text{Misclassification error rate} = \sum_{m=1}^{M} p(m) \left[ 1 - \sum_{j=1}^{J} p^2(j \mid m) \right] \qquad (10)$$
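The split search described around Eqs. (4)–(9) can be illustrated with a minimal sketch. It assumes a single numeric predictor and uses midpoints between consecutive distinct values as candidate thresholds; the function names and the toy data are hypothetical and are not taken from the study. A full search would repeat the same loop over every predictor column and keep the variable and threshold with the largest deviance reduction.

```python
def deviance(values):
    """Sum of squared deviations from the node mean (Eqs. 4-6)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, delta) maximizing the deviance reduction of Eq. (9)
    for one predictor x and response y (hypothetical helper)."""
    d_a = deviance(y)  # deviance of the parent node a
    best_threshold, best_delta = None, 0.0
    distinct = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        y_b = [yi for xi, yi in zip(x, y) if xi <= threshold]  # subsample b
        y_c = [yi for xi, yi in zip(x, y) if xi > threshold]   # subsample c
        delta = d_a - deviance(y_b) - deviance(y_c)
        if delta > best_delta:
            best_threshold, best_delta = threshold, delta
    return best_threshold, best_delta


# Toy data (assumed for illustration): one predictor and a crash count.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0, 1, 0, 5, 6, 7]
threshold, delta = best_split(x, y)
```

On this toy data the two low-count and high-count groups are separated at the midpoint between 3.0 and 10.0, which is where the combined deviance of the two subsamples is smallest.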
Fig. 1. Misclassification error rates for both training and testing data.
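The pattern summarized in Fig. 1 (training error falling as the tree grows, testing error reaching a minimum) can be reproduced with a small sketch. Note the substitution: instead of the cost-complexity pruning used in the paper, this sketch simply limits tree depth as a stand-in for tree complexity, uses one predictor, and runs on synthetic data; all names and data below are assumptions for illustration only.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def grow(x, y, depth):
    """Grow a one-predictor regression tree to at most `depth` levels."""
    node = {"mean": sum(y) / len(y)}
    distinct = sorted(set(x))
    if depth == 0 or len(distinct) < 2:
        return node  # terminal node
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [(xi, yi) for xi, yi in zip(x, y) if xi <= t]
        right = [(xi, yi) for xi, yi in zip(x, y) if xi > t]
        cost = sse([p[1] for p in left]) + sse([p[1] for p in right])
        if best is None or cost < best[0]:
            best = (cost, t, left, right)
    _, t, left, right = best
    node["threshold"] = t
    node["left"] = grow([p[0] for p in left], [p[1] for p in left], depth - 1)
    node["right"] = grow([p[0] for p in right], [p[1] for p in right], depth - 1)
    return node


def predict(node, xi):
    """Follow the splits down to a terminal node and return its mean."""
    while "threshold" in node:
        node = node["left"] if xi <= node["threshold"] else node["right"]
    return node["mean"]


def mse(tree, x, y):
    """Mean squared prediction error of `tree` on (x, y)."""
    return sum((predict(tree, xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


rng = random.Random(0)


def sample(n):
    """Synthetic data (assumed): a step function plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [(5.0 if xi > 5 else 1.0) + rng.gauss(0, 1) for xi in xs]
    return xs, ys


x_train, y_train = sample(60)  # learning sample
x_test, y_test = sample(40)    # testing sample

depths = range(0, 7)
train_err, test_err = [], []
for d in depths:
    tree = grow(x_train, y_train, d)
    train_err.append(mse(tree, x_train, y_train))
    test_err.append(mse(tree, x_test, y_test))

# The selected tree is the one with the smallest error on the testing sample.
best_depth = min(depths, key=lambda d: test_err[d])
```

Because each deeper tree only refines the leaves of the shallower one, the training error cannot increase with depth, whereas the testing error is free to rise once the extra splits start fitting noise.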
to find a tree with respect to a measure of misclassification cost on the testing dataset so that the information in the learning dataset will not overfit. Towards this end, the data is usually divided into two subsets, one for learning and the other for testing. The learning sample is used to split nodes, while the testing sample is used to compare the misclassification cost for all the subtrees. When the tree grows larger and larger, the misclassification cost for the learning sample decreases monotonically, indicating that the maximal tree always gives the best fit to the learning data. On the other hand, the misclassification cost for the testing sample first decreases and then increases after reaching a minimum. This indicates that the saturated tree is greatly overfitted when applied to analyze the testing sample. Therefore, the optimal tree is the one at which the misclassification cost for the testing sample reaches its minimum (see Breiman et al. (1998) for a detailed review). Fig. 1 shows how an optimal tree is selected from the decision trees created. From the figure, with an increase in complexity (more terminal nodes), the misclassification cost for the training data keeps decreasing. However, for the test data, first there is a decrease, and then an increase is observed. An optimal tree is the one that has the least misclassification cost for the test data.

3.2. Ensemble techniques

An ensemble technique is defined by a set of individually trained learners whose predictions are combined in order to improve the prediction accuracy of a single learner (i.e., regression tree). The prediction of an ensemble technique typically requires more computation compared to a single learner, so ensemble techniques can be seen as compensating for poor learning algorithms by performing a lot of extra computation. In this paper, we have undertaken bagging, random forests, and boosting as methods for creating three ensemble techniques of regression trees to construct more powerful prediction models.

The basic idea underlying bagging is to reduce the variance of the decision tree by creating several subsets of data from the training sample with replacement and building the final output by averaging all the predictions. To be more specific, if several similar datasets are created by resampling with replacement, which is called bootstrapping, and a number of regression trees are grown without pruning and averaged, the variance component of the output error is reduced. Mathematically, it is possible to calculate $\hat{f}^{1}(x), \hat{f}^{2}(x), \ldots, \hat{f}^{B}(x)$ using B separate training sets, and to average them in order to obtain a single low variance statistical learning model, given by Eq. (11):

$$\hat{f}_{avg}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x) \qquad (11)$$

However, this is not practical because multiple training sets are generally not available. Instead, one can bootstrap by taking repeated samples from the single training dataset (James et al., 2013). This generates B different bootstrapped training datasets; the model is trained on the bth bootstrapped training set in order to get $\hat{f}^{1}(x), \hat{f}^{2}(x), \ldots, \hat{f}^{B}(x)$, and finally all the predictions are averaged (see Eq. (12)):

$$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x) \qquad (12)$$

This empirical formulation is called bagging.

Random forest is similar to bagging in that bootstrap samples are drawn to construct multiple trees. The main difference from bagging is that random forest performs one extra step: a random selection of predictor variables is used to grow the trees rather than all variables. The number of predictors used to find the best split at each
Table 3
Sample characteristics of the road accidents attributes.
Variable | Description | Targeted STAZ: Mean | SD | Max | Neighboring STAZs: Mean | SD | Max
Crash variables
Pedestrian crash | Total number of pedestrian crashes per STAZ | 1.907 | 3.315 | 39.000 | – | – | –
Bicycle crash | Total number of bicycle crashes per STAZ | 1.797 | 3.309 | 88.000 | – | – | –
Socio-demographic variables
Population density | Population density per square mile | 2520.3 | 4043.3 | 63,069.0 | 2330.2 | 3489.7 | 57,181.9
Proportion of families without vehicle | Total number of families with no vehicle in STAZ/Total number of families in STAZ | 0.095 | 0.123 | 1.000 | 0.095 | 0.108 | 1.000
School enrolments density | Total school enrolment per square mile in STAZ | 775.02 | 5983.05 | 255,147.24 | 684.22 | 2900.54 | 102,285.73
Proportion of urban area | Total urban area in STAZ/Total area in STAZ | 0.722 | 0.430 | 1.000 | 0.650 | 0.434 | 1.000
Distance to the nearest urban area | Distance of the STAZ to the nearest urban area | 2.140 | 5.441 | 44.101 | – | – | –
Hotels, motels, and timeshare rooms density | Hotels, motels, and timeshare rooms density per square mile | 172.49 | 941.71 | 32,609.84 | 121.678 | 528.078 | 11,397.148
No of total employment | Total employment in STAZ | 1140.10 | 1722.45 | 31,932.15 | 6917.245 | 6725.135 | 76,533.000
Proportion of industry employment | Proportion of industry employment | 0.176 | 0.232 | 1.000 | 0.183 | 0.177 | 1.000
Proportion of commercial employment | Proportion of commercial employment | 0.299 | 0.235 | 1.000 | 0.305 | 0.177 | 1.000
Proportion of service employment | Proportion of service employment | 0.525 | 0.257 | 1.000 | 0.495 | 0.186 | 1.000
No of commuters by public transportation | No of commuters using public transportation | 18.813 | 54.273 | 934.000 | 119.582 | 246.299 | 3559.985
No of commuters by cycling | No of commuters using bicycle | 5.894 | 19.804 | 775.000 | 90.869 | 128.399 | 1902.135
No of commuters by walking | No of commuters by walking | 14.354 | 34.680 | 1288.000 | 37.566 | 74.484 | 1634.530
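The neighboring-STAZ columns in Table 3 are described in the text as averages over the zones adjacent to each targeted STAZ. A minimal sketch of that calculation follows. The zone identifiers, variable names, values, and the adjacency list are all hypothetical, and how adjacency is defined (for example, shared boundaries) is an assumption here rather than something the source specifies.

```python
def neighbor_means(values, adjacency):
    """For every zone, average each variable over its adjacent zones.

    values:    {zone: {variable: value}}   (targeted-zone attributes)
    adjacency: {zone: [adjacent zones]}    (assumed to be given)
    """
    result = {}
    for zone, neighbors in adjacency.items():
        if not neighbors:
            result[zone] = None  # isolated zone: nothing to average
            continue
        result[zone] = {
            var: sum(values[n][var] for n in neighbors) / len(neighbors)
            for var in values[zone]
        }
    return result


# Hypothetical three-zone example.
values = {
    "A": {"pop_density": 1000.0, "employment": 200.0},
    "B": {"pop_density": 3000.0, "employment": 400.0},
    "C": {"pop_density": 5000.0, "employment": 900.0},
}
adjacency = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
spatial = neighbor_means(values, adjacency)
```

The resulting dictionary can be joined to the targeted-zone attributes so that each zone carries both its own (aspatial) predictors and the averaged (spatial) predictors of its neighbors.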
node is a randomly chosen subset of the total number of predictors. In random forest, the trees are grown to maximum size without pruning, and aggregation is by averaging the trees. Suppose there are N observations and M predictor variables in the learning dataset. First, subsets of data are drawn with replacement from the full training sample, as in bagging. Then, a subset of the M predictor variables is selected randomly, and whichever variable gives the best split is used to split the node; this is repeated at each node. The main advantage of random forest over bagging is that random predictor selection diminishes correlations among unpruned trees and constructs a learning model with low bias and low variance at the same time.

Boosting is another approach for improving the predictions resulting from a series of decision trees. Like bagging, boosting creates several subsets of data and constructs a final output by combining the predictions of the resulting trees. Unlike bagging, the training set used for each individual learner is chosen based on the performance of the earlier learner(s). In boosting, observations that were incorrectly predicted by previous learners are chosen more often than observations that were correctly predicted. Consequently, boosting attempts to produce new learners for its ensemble that are better able to correctly predict examples for which the current ensemble performance is poor. It is worth mentioning that in bagging, the resampling of the training set does not depend on the performance of the earlier learners. In machine learning, gradient boosting has gained much popularity for building powerful predictive models from weak learners. Specifically, gradient boosting techniques use a base weak learner and try to boost its performance by iteratively shifting the focus towards problematic observations that were difficult to predict. This ensemble technique identifies problematic observations by the large residuals computed in the previous iterations (Mayr, Binder, Gefeller, & Schmid, 2014).

4. Data preparation

This study focuses on pedestrian and bicycle crashes at the Statewide Traffic Analysis Zone (STAZ) level. STAZs are geographic entities delineated by state or local transportation officials to tabulate traffic-related data such as journey-to-work and place-of-work statistics (Cai, 2017). The data provide crash information for 8518 STAZs, with an average area of 6.472 square miles. Data for the empirical study were obtained from Florida for the years 2010 to 2012. About 16,240 pedestrian-involved and 15,307 bicycle-involved crashes that occurred in Florida in this 3-year period were compiled for the analysis. Among the STAZs, 46.18% have zero pedestrian crashes, while 49.86% have no bicycle crashes. The crash records were collected from the Florida Department of Transportation (FDOT) Crash Analysis Reporting (CAR) and Signal Four Analytics (S4A) databases. Three broad categories of predictors are considered in our study: roadway characteristics, traffic characteristics, and socio-demographic characteristics. The response variables are the total numbers of pedestrian and bicycle crashes in each zone. The predictor data were obtained from the FDOT Transportation Statistics Division and the US Census Bureau. The attributes were then aggregated at the STAZ level using a geographical information system (GIS). As discussed earlier, the current analysis considered spatial predictor variables, which correspond to characteristics of neighboring STAZs, along with those of the target STAZs. In macroscopic and microscopic analyses, crashes occurring in a spatial unit or site are aggregated to obtain the crash frequency. The aggregation process might introduce errors in identifying the exogenous variables for the spatial unit or site. To accommodate such spatial unit or site induced bias, spatial correlation should be considered in the crash model estimates. Towards this end, for every STAZ, the adjacent STAZs are identified. Based on the identified neighbors, a new variable is obtained by averaging the values of each predictor variable over the surrounding STAZs. This average is the corresponding spatial predictor variable of the targeted STAZ. The descriptive statistics of the response and predictor variables are summarized in Table 3. Specifically, the table provides the predictor values at the STAZ level as well as for the neighboring STAZs. For the targeted and the neighboring STAZs, all the predictor variables are calculated for each of the 8518 zones. Table 3 includes the mean, standard deviation, minimum, and maximum values of the corresponding predictor variables for the 8518 STAZs, including both aspatial and spatial variables. It is worth mentioning that the predictor variables of the targeted STAZs are calculated from the STAZs for which the pedestrian and bicycle crash frequencies are counted, while the variables of the neighboring STAZs denote the average values of the variables calculated from all the surrounding STAZs adjacent to the targeted STAZ.

Roadway characteristics included are road lengths for different functional classes, signalized intersection density, length of bike lanes and sidewalks, etc. Intersection density denotes the number of intersections per street mile in a STAZ. Vehicle-miles-traveled (VMT) and the proportion of heavy vehicles in VMT are considered as traffic characteristics. For demographic characteristics, population density, proportion of families without a vehicle, proportion of urban area, number of commuters by public transportation, etc. are considered.

5. Modeling results and discussions

5.1. Model assessment

In this study, of the 8518 STAZs, 70% were randomly selected as the training set for model development, while 30% were employed as the testing set for model validation. In the first step, the model estimation process involved estimating four models: (1) DTR aspatial model for pedestrian crashes, (2) DTR spatial model for pedestrian crashes, (3) DTR aspatial model for bicycle crashes, and (4) DTR spatial model for bicycle crashes. Prior to discussing the model results, we compare the estimated models in Table 4. The table presents the Average Squared Error (ASE) and Standard Deviation of Error (SDE) for the four DTR models with training and testing samples. It is worth mentioning that a series of trees was produced in order to achieve the best DTR model for each of the four models mentioned above. The model with the lower ASE and SDE is the preferred DTR

Table 4
Comparison of predictability between different models.

                                  Without spatial        With spatial predictor variables
                                  predictor variables    (% reduction from aspatial model)
Pedestrian crashes
Training (N = 5963)
  No of predictor variables used  10                     12
  ASE                             5.597                  5.142 (8.1)
  SDE                             2.366                  2.268 (4.1)
Testing (N = 2555)
  No of predictor variables used  10                     12
  ASE                             6.328                  6.178 (2.4)
  SDE                             2.516                  2.485 (1.2)
Bicycle crashes
Training (N = 5963)
  No of predictor variables used  9                      12
  ASE                             5.413                  5.092 (5.9)
  SDE                             2.327                  2.257 (3.0)
Testing (N = 2555)
  No of predictor variables used  9                      12
  ASE                             6.724                  5.926 (11.8)
  SDE                             2.594                  2.435 (6.1)

ASE = Average Squared Error, SDE = Standard Deviation of Error.
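The neighbor averaging described in Section 4 and the percentage reductions reported in Table 4 are simple arithmetic, and the following minimal Python sketch illustrates both. The zone names, adjacency lists, and commuter counts are hypothetical. The paper does not give formulas for ASE and SDE, so the definitions below (mean squared error, and the population standard deviation of the errors) are assumptions; they are at least consistent with the SDE values in Table 4 being close to the square roots of the ASE values.

```python
from statistics import mean, pstdev

def neighbor_average(values, adjacency):
    """Spatial predictor: average of a variable over each zone's adjacent zones."""
    spatial = {}
    for zone, neighbors in adjacency.items():
        # A zone without neighbors gets None rather than an invented value.
        spatial[zone] = mean([values[n] for n in neighbors]) if neighbors else None
    return spatial

def ase(observed, predicted):
    """Average squared error (assumed definition: mean of squared errors)."""
    return mean([(y - p) ** 2 for y, p in zip(observed, predicted)])

def sde(observed, predicted):
    """Standard deviation of the errors (assumed: population form)."""
    return pstdev([y - p for y, p in zip(observed, predicted)])

def pct_reduction(aspatial, spatial):
    """Percentage reduction of a measure in the spatial model vs. the aspatial one."""
    return 100.0 * (aspatial - spatial) / aspatial

# Hypothetical toy zones: public-transportation commuters per zone.
transit_commuters = {"A": 10, "B": 30, "C": 50}
adjacency = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
spatial_transit = neighbor_average(transit_commuters, adjacency)

# Values taken from Table 4 (pedestrian crashes, training sample).
ped_train_ase_drop = pct_reduction(5.597, 5.142)
ped_train_sde_drop = pct_reduction(2.366, 2.268)
print(round(ped_train_ase_drop, 1), round(ped_train_sde_drop, 1))
```

Rounded to one decimal, the two reductions reproduce the 8.1 and 4.1 shown in Table 4 for the pedestrian training sample.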
model. The percentage reductions of ASE and SDE in the spatial model compared to the aspatial model were quantified to observe the impact of spatial variables on both pedestrian and bicycle crash frequency. For instance, ASE and SDE in the training (testing) dataset for the pedestrian spatial model were 8.1% (2.4%) and 4.1% (1.2%) lower, respectively, than in the corresponding aspatial model (Table 4). For the bicycle spatial model, the percentage reductions of ASE and SDE in the training (testing) datasets were 5.9% (11.8%) and 3.0% (6.1%), respectively, in comparison with the bicycle aspatial model. Furthermore, the spatial models have a higher number of significant predictor variables than the aspatial models. For example, the pedestrian spatial model has 12 significant predictor variables, while 10 variables are significant in the pedestrian aspatial model. Generally, a higher number of predictor variables can overfit DTR models to the training data (Chang & Chen, 2005; Chang & Wang, 2006). However, in our analysis, the spatial models had lower ASE and SDE in both the training and testing data, which indicates that the additional variables did not overfit the DTR models. Therefore, across the pedestrian and bicycle crash prediction models, the models with spatial predictor variables (spatial models) offer better prediction accuracy in terms of ASE and SDE for both training and testing data sets. However, the spatial variables have a higher impact on the bicycle crash frequency model than on the pedestrian crash frequency model in terms of the percentage reduction of ASE and SDE in the testing data (Table 4). The result seems reasonable because travel to neighboring zones is more likely by bicycle than on foot. The State of Florida has 8518 STAZs, with an average area of 6.472 square miles. In a nutshell, this result highlights that including predictor variables of adjacent STAZs improves crash prediction models built with machine learning techniques (DTR models), which confirms similar results obtained using statistical modeling techniques in Cai et al. (2016).

5.2. DTR model estimation and interpretation

As previously mentioned, DTR partitions the data into relatively homogeneous terminal nodes, and it takes the mean value observed in each node as its predicted value. The empirical analysis involved a series of DTR model estimations in order to achieve the lowest possible ASE and SDE. Towards this end, lists of variables were entered into each model and their relative importance was also produced. Variable importance is calculated based on the deviance (D) or sum of squared errors (SSE) attributed to each variable, which is a measure of dispersion. The first partition of the observations in the DTR models is based on the most important predictor variable, the one that yields the maximum reduction in variability of the response variable. Then, further partitions are made based on the hierarchy of most important variables. The importance value of the most important variable is 1. All other variables are then assigned an importance relative to it. The variable importance results of the four models (two model types, with and without spatial predictor variables of neighboring STAZs, for each of pedestrian and bicycle crashes) are displayed in Table 5 and Table 6, respectively. Across the models for either pedestrian or bicycle crashes, the significant variables are quite comparable. While the relative importance results are presented for all DTR models across pedestrian and bicycle crashes, the graphical discussion of decision tree regression focuses on the DTR models with spatial predictor variables, which offer the best fit.

5.2.1. DTR models for pedestrian crash

5.2.1.1. Decision tree. Forty-two predictor variables, both aspatial and spatial, were used for developing the spatial DTR model to predict the quantitative target variable: total number of pedestrian crashes. Fig. 2 shows the resulting regression tree, which has 32 terminal nodes. It shows that the number of commuters by public transportation, total employment, signalized intersection density, number of commuters by walking, number of commuters by public transportation in neighboring STAZs, vehicle miles traveled, length of sidewalk, and length of bike lanes are the primary splitters in the regression tree, implying that these variables are critical for predicting pedestrian crashes at the macro-level. The interpretation of the decision tree regression results is straightforward. Each rectangular box in the figure contains the node id, the average number of pedestrian crashes, and the number of observations in the dataset that reach the node. The initial split at node 1 is based on the total number of commuters by public transportation. This indicates that this variable was found to minimize the deviance most and is therefore the variable used for the first split. The tree then breaks into two branches, each of which has the opportunity to branch again. In this case, the left branch is further divided by total employment, while the right branch is divided by signalized intersection density. Each of the new branches splits again until a terminal node states the expected number of pedestrian crashes. As an example, consider a zone where the total number of commuters by public transportation is less than 52, total employment is less than 709, length of sidewalk is equal to or greater than 0.45 mile, and vehicle miles traveled is equal to or greater than 162,346.6. From the training set of this tree, an average of 5 pedestrian crashes in 3 years is expected for such a hypothetical zone, based on 10 observations of similar zones.

The interaction effects of the variables in a decision tree are completely different from those in regular regression. The hierarchical structure

Table 5
Variable importance for pedestrian crash of STAZs.
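Section 5.2 states that the first split uses the variable giving the largest reduction in deviance (SSE) and that importance is scaled so the top variable equals 1. The sketch below is a hedged illustration of that idea on hypothetical toy data; the variable names and values are invented. It scores each variable only by its single best split at the root, which is a simplification: the paper does not specify how importance is accumulated over the whole tree.

```python
def sse(values):
    """Sum of squared errors around the mean (the deviance of a node)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, sse_reduction) for the best 'x < threshold' split."""
    parent = sse(ys)
    best = (None, 0.0)
    for t in sorted(set(xs))[1:]:  # candidate thresholds at observed values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        gain = parent - sse(left) - sse(right)
        if gain > best[1]:
            best = (t, gain)
    return best

def relative_importance(gains):
    """Scale gains so that the most important variable has importance 1."""
    top = max(gains.values())
    return {name: g / top for name, g in gains.items()}

# Hypothetical toy zones (not data from the paper).
data = {
    "transit_commuters": [5, 10, 60, 80, 90, 20],
    "sidewalk_miles": [0.2, 0.9, 0.3, 0.8, 0.1, 0.7],
}
crashes = [1, 2, 8, 9, 10, 2]

splits = {name: best_split(xs, crashes) for name, xs in data.items()}
importance = relative_importance({n: g for n, (_, g) in splits.items()})
first_split_variable = max(splits, key=lambda n: splits[n][1])
print(first_split_variable, splits[first_split_variable][0], importance)
```

In this toy example the root split falls on the commuter variable, mirroring the role the number of public-transportation commuters plays in the pedestrian tree.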
of a tree means that the response to one input variable depends on the values of other inputs in the tree, so interactions between predictors are automatically modeled (Elith, Leathwick, & Hastie, 2008). The interaction of the variables is local in the decision tree technique, while classical regression has global interactions between the variables. Local interaction means that the interaction applies only for certain values of the predictor variables instead of all values of the corresponding predictor variables. Many interaction terms can be observed in the decision tree regression shown in Fig. 2. In the previous example, vehicle miles traveled only has an effect in the model for the subset of data for which the length of sidewalk is equal to or greater than 0.45 mile, total employment is less than 709, and the total number of commuters by public transportation is less than 52. Therefore, these variables interact with each other locally in the corresponding tree.

5.2.1.2. Variable importance. For the DTR spatial model, seven predictor variables of the targeted STAZs and five predictor variables of the neighboring STAZs are found to be the most important variables for forecasting pedestrian crashes. The five significant predictor variables of neighboring STAZs confirm the importance of including spatial variables in order to predict pedestrian crashes at the macro-level. The variable importance results for both models (aspatial and spatial) for pedestrian crashes are presented in Table 5. To emphasize the predictor variables, we also ranked each variable by its importance, with 1 as the most important variable and 12 as the least important variable in the spatial model.

The following observations can be made based on the results presented in Table 5. The most important variable for determining the number of pedestrian crashes at the macro-level is the number of commuters using public transport, with relative importance 1.0. Statistical modeling results intuitively support that commuters by public transportation reflect zones with higher pedestrian activity, resulting in increased crash risk (Abdel-Aty, Lee, Siddiqui, & Choi, 2013). The next most important variable for predicting pedestrian crashes is total employment, which is a surrogate measure of pedestrian exposure (Siddiqui et al., 2012). Hence, it is expected that total employment has a higher impact on crash frequency. Signalized intersection density, number of walk commuters, length of sidewalks, and length of bike lanes represent the likelihood of pedestrian access. Therefore, these variables are found to be significant in the DTR model. The VMT variable is a measure of vehicle exposure and, as expected, a significant predictor of pedestrian crashes. It is interesting to note that distance to the nearest urban area; hotel, motel, and timeshare room density; and proportion of urban area are significant predictor variables (ranks 8, 9, and 10) in the DTR aspatial model, while those variables are not significant in the DTR spatial model, which offers the better fit. Among the significant spatial predictor variables, the number of commuters by public transportation is the most important for predicting pedestrian crashes. Cai et al. (2016) found that commuters by public transportation in neighboring STAZs have a positive impact on pedestrian crashes. Moreover, the number of commuters by walking, population density, proportion of families without a vehicle, and school enrolment density in neighboring STAZs are significant spatial variables for pedestrian crashes at the macro-level.

5.2.2. DTR models for bicycle crash

5.2.2.1. Decision tree. The same number of predictor variables (42), including both spatial and aspatial, were used to build the decision tree regression model for bicycle crash frequency. Fig. 3 shows the resulting regression tree for bicycle crash frequency at the macro-level. The tree has 25 terminal nodes with 12 significant predictor variables, including total employment, number of commuters using bicycle, number of commuters using public transport, vehicle miles traveled, length of bike lanes, number of commuters using bicycle in neighboring STAZs, population density in neighboring STAZs, etc. From Fig. 3, total employment was found to minimize the deviance most and is therefore the variable used for the first split. From the tree, it can also be seen that the left and right branches are further divided by the number of commuters using bicycle in the targeted STAZ and the number of commuters using bicycle in neighboring STAZs, respectively. Hence, these two variables are primary splitters in the regression tree, implying that they are critical for predicting bicycle crashes at the macro-level. To interpret the tree, consider a STAZ where total employment is less than 889, the number of commuters using bicycle is equal to or greater than 9, and population density in neighboring STAZs is equal to or greater than 2674.5. From Fig. 3, an average of 3.82 bicycle crashes in 3 years is expected for such a hypothetical STAZ, based on 111 observations of similar STAZs. To explain the interaction in the above example, population density in neighboring STAZs only has an effect in the model for the subset of data for which the number of commuters using bicycle is equal to or greater than 9 and total employment is less than 889.

5.2.2.2. Variable importance. In the DTR model with spatial variables presented in Table 6, eight variables of the targeted STAZs and four

Table 6
Variable importance for bicycle crash of STAZs.
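The worked examples in Sections 5.2.1.1 and 5.2.2.1 can be read as nested rules, which also makes the notion of local interaction concrete. The sketch below encodes only the single path quoted for each tree, using the thresholds and leaf averages stated in the text. The function and argument names are hypothetical, and every other branch returns None because its leaf value is not reported in the text.

```python
def pedestrian_leaf(transit_commuters, total_employment, sidewalk_miles, vmt):
    """Follow the one pedestrian-tree path quoted in Section 5.2.1.1."""
    if transit_commuters < 52:
        if total_employment < 709:
            if sidewalk_miles >= 0.45:
                # VMT is consulted only inside this branch: a local interaction.
                if vmt >= 162_346.6:
                    return 5.0  # average crashes in 3 years, 10 training zones
    return None  # leaf values for other branches are not given in the text

def bicycle_leaf(total_employment, bike_commuters, neighbor_pop_density):
    """Follow the one bicycle-tree path quoted in Section 5.2.2.1."""
    if total_employment < 889:
        if bike_commuters >= 9:
            # Neighboring population density matters only on this branch.
            if neighbor_pop_density >= 2674.5:
                return 3.82  # average crashes in 3 years, 111 training zones
    return None

# Hypothetical zones that satisfy (or fail) the quoted conditions.
print(pedestrian_leaf(30, 500, 0.6, 200_000.0))
print(pedestrian_leaf(30, 500, 0.6, 100_000.0))
print(bicycle_leaf(500, 12, 3000.0))
```

Changing VMT has no effect on the result unless the three outer conditions hold, which is the sense in which the interaction is local rather than global.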
variables of the neighboring STAZs are responsible for predicting bicycle crash frequency.

The impacts of some predictor variables in the pedestrian and bicycle crash prediction models are quite similar. A possible reason is that STAZs with high pedestrian activity are also likely to experience high bicyclist activity. Among the parent (targeted) STAZ variables, total employment is the most important predictor variable of bicycle crashes. The other important variables for bicycle crash propensity are, in order, vehicle miles traveled (VMT), number of commuters using bicycle, number of commuters by walking, length of sidewalks, number of commuters using public transport, signalized intersection density, and proportion of urban area. There are four main differences in the STAZ variable impacts between pedestrian and bicycle crash frequency in terms of variable importance. First, the density of hotel, motel, and timeshare rooms is not a significant variable for predicting bicycle crashes. This result is intuitive because tourists are less likely to use bicycles. Second, school enrolment density does have a significant impact on bicycle crashes, as students are possibly more likely to use bicycles for traveling to school. Third, the length of sidewalks in the STAZ does not have significant importance for predicting bicycle crashes, whereas sidewalk length is found to be a significant variable for predicting pedestrian crashes. Finally, the spatial variables in the pedestrian crash frequency model are all ranked at the bottom. The bicycle crash frequency model shows a dissimilar trend, with higher rankings of spatial variables. In terms of the effect of spatial variables, the important variables differ between pedestrians and bicyclists. Population density and school enrolment density in neighboring STAZs are important spatial variables for both pedestrian and bicycle crash prediction models. The number of commuters using bicycle and the length of bike lanes in neighboring STAZs are found to be significantly associated with bicycle crashes.

5.3. Ensemble techniques results

To improve the prediction accuracy of the DTR models, we used three ensemble techniques: (1) bagging, (2) random forest, and (3) gradient boosting. Fig. 1 illustrates the basic framework of the three ensemble techniques proposed in the pedestrian and bicycle crash prediction models. Some observations can be made from this framework. All three ensemble techniques combine several decision trees to produce better predictive performance than a single decision tree.

Bagging creates several subsets of data by bootstrap resampling, while random forest uses the same process and additionally takes a random subset of predictors. Unlike bagging and random forest, boosting generates multiple training samples by re-weighting, which can
Table 7
Comparison of predictability across ensemble techniques.
Measure of effectiveness Decision tree Bagging (% Reduction) Random Forests (% Reduction) Gradient Boosting (% Reduction)
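The three ensemble techniques of Section 5.3 can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the weak learner here is a one-split regression stump rather than a full decision tree, the data are hypothetical, the gradient boosting variant assumes squared-error loss with a fixed learning rate, and the random-forest variant simply restricts each stump to a random subset of predictors.

```python
import random

def fit_stump(rows, ys, features):
    """Least-squares regression stump (one split) over the allowed features."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, ys) if r[f] < t]
            right = [y for r, y in zip(rows, ys) if r[f] >= t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    if best is None:  # no valid split (e.g. identical rows): predict the mean
        m = sum(ys) / len(ys)
        return lambda r: m
    _, f, t, lm, rm = best
    return lambda r: lm if r[f] < t else rm

def bagging(rows, ys, n_trees, rng, n_features=None):
    """Average of stumps fit on bootstrap samples.

    With n_features set, each stump sees a random predictor subset,
    which gives a random-forest-style variant.
    """
    all_features = list(range(len(rows[0])))
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feats = all_features if n_features is None else rng.sample(all_features, n_features)
        trees.append(fit_stump([rows[i] for i in idx], [ys[i] for i in idx], feats))
    return lambda r: sum(t(r) for t in trees) / len(trees)

def gradient_boosting(rows, ys, n_trees, learning_rate=0.5):
    """Squared-error boosting: each new stump is fit to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    features = list(range(len(rows[0])))
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(rows, residuals, features)
        stumps.append(stump)
        preds = [p + learning_rate * stump(r) for p, r in zip(preds, rows)]
    return lambda r: base + learning_rate * sum(s(r) for s in stumps)

def ase(model, rows, ys):
    """Average squared error of a model on (rows, ys)."""
    return sum((y - model(r)) ** 2 for r, y in zip(rows, ys)) / len(ys)

# Hypothetical zones: (transit commuters, total employment) -> crash count.
rows = [(5, 100), (10, 900), (20, 300), (60, 200), (80, 1200), (90, 700)]
ys = [1, 3, 2, 7, 10, 9]
rng = random.Random(0)

single = fit_stump(rows, ys, [0, 1])
bagged = bagging(rows, ys, 25, rng)
forest = bagging(rows, ys, 25, rng, n_features=1)
boosted = gradient_boosting(rows, ys, 25)
for name, model in [("stump", single), ("bagging", bagged),
                    ("random forest", forest), ("boosting", boosted)]:
    print(name, round(ase(model, rows, ys), 3))
```

The printed values are training errors on six toy points and say nothing about the relative ranking reported in the paper; the sketch is only meant to show how bootstrap resampling, random predictor subsets, and residual fitting differ mechanically.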
improves the accuracy of single learner. Finally, bagging, random forest, The paper is not without limitations. While the decision tree regres-
and boosting estimate the final prediction by averaging multiple esti- sion is considered, we do not consider other data mining techniques to
mates of individual trees. check the prediction accuracy. It will be an interesting exercise to model
The aforementioned three ensemble techniques were implemented the other data mining techniques such as neural network, support vec-
based on the methodology showed in Fig. 1 and the goodness of fit mea- tor machine and their ensembles. Moreover, it might be beneficial to ex-
sure such as ASE and SDE are calculated for the spatial models of pedes- plore the similar models for multiple spatial units and several years.
trian and bicycle crashes. The comparison results of the ensemble
techniques along with the DTR models (weak learners) for both pedes- Acknowledgment
trian and bicycle crashes are presented in Table 7.
The table presents the ASE and SDE for the ensemble techniques and The authors would like to gratefully acknowledge Florida Depart-
DTR model for both training and testing samples. The percentage reduc- ment of Transportation (FDOT) for providing access to the Florida crash
tions of ASE and SDE of these ensemble techniques compared to the data.
base model (DTR model) were also calculated in order to compare the
improvements across the models. For example, in the pedestrian crash References
frequency model, gradient boosting decreased the ASE and SDE of train-
ing (testing) dataset by 5.6% (4.3%) and 2.9% (2.1%), respectively, com- Abdel-Aty, M., Chundi, S. S., & Lee, C. (2007). Geo-spatial and log-linear analysis of pedes-
trian and bicyclist crashes involving school-aged children. Journal of Safety Research,
pared to the decision tree regression. Three significant conclusions can 38(5), 571–579. https://doi.org/10.1016/j.jsr.2007.04.006.
be made from the results highlighted in Table 7. First, all models with Abdel-Aty, M., Keller, J., & Brady, P. (2005). Analysis of types of crashes at signalized inter-
ensemble techniques perform slightly better than the original DTR sections by using complete crash data and tree-based regression. Transportation
Research Record Journal of the Transportation Research Board, 1908, 37–45.
model given the complexity of these models. Second, gradient boosting Abdel-Aty, M., Lee, J., Siddiqui, C., & Choi, K. (2013). Geographical unit based analysis in
provides the best performance in all ensemble techniques compared to the context of transportation safety planning. Transportation Research Part A: Policy
the other counterparts. Third, Random forests is better than bagging in and Practice, 49, 62–75.
Blincoe, L., Seay, A., & Zaloshnja, E. (2000). T.Miller, Romano, E., S.Luchter, R.Spicer, 2002.
terms of goodness-of-fit measures. The economic impact of motor vehicle crashes. DOT HS, 809, 446.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1998). Classification and Regression
Trees. Chapman & Hall/CRC.
Cai, Q. (2017). Integrating the macroscopic and microscopic traffic safety analysis using hier-
6. Conclusion
archical models.
Cai, Q., Abdel-Aty, M., & Lee, J. (2017). Macro-level vulnerable road users crash analysis: A
This study applied machine learning techniques for pedestrian and Bayesian joint modeling approach of frequency and proportion. Accident Analysis &
Prevention, 107(May), 11–19. https://doi.org/10.1016/j.aap.2017.07.020.
bicycle crash analyses that captures the effects of important predictor
Cai, Q., Abdel-Aty, M., Lee, J., & Eluru, N. (2017). Comparative analysis of zonal systems for
variables at the macro-level. The study conducted decision tree regres- macro-level crash modeling. Journal of Safety Research, 61, 157–166.
sion (DTR) modeling analysis to highlight the importance of various Cai, Q., Lee, J., Eluru, N., & Abdel-Aty, M. (2016). Macro-level pedestrian and bicycle crash
traffic, roadway, and socio-demographic characteristics of the STAZ on analysis: Incorporating spatial spillover effects in dual state count models. Accident;
Analysis and Prevention, 93(407), 14–22. https://doi.org/10.1016/j.aap.2016.04.018.
the pedestrian and bicycle crash occurrence. To the best of the authors' Cai, Q., Saad, M., Abdel-aty, M., & Yuan, J. (2018). Safety impact of weaving distance on
knowledge, this is the first attempt to employ such DTR models at the freeway facilities with managed lanes using both microscopic traffic and driving sim-
macro-level. The study also considered spatial predictor variables from ulations. Transportation Research Record, 53. https://doi.org/10.1177/
0361198118780884.
neighboring STAZs in order to improve the prediction accuracy of DTR Chang, L. Y., & Chen, W. C. (2005). Data mining of tree-based models to analyze freeway
models for both pedestrian and bicycle crashes. It was found that the in- accident frequency. Journal of Safety Research, 36(4), 365–375.
troduction of spatial predictor variables on DTR models clearly Chang, L. Y., & Chien, J. T. (2013). Analysis of driver injury severity in truck-involved acci-
dents using a non-parametric classification tree model. Safety Science, 51(1), 17–22.
outperformed the DTR models that did not consider the spatial variables Chang, L. Y., & Wang, H. W. (2006). Analysis of traffic injury severity: An application of
in terms of goodness-of-fit measures. From the diagram of the decision non-parametric classification tree techniques. Accident; Analysis and Prevention, 38
tree regression, we can observe that the interactions between predictor (5), 1019–1027.
De Oña, J., López, G., & Abellán, J. (2013). Extracting decision rules from police accident
variables are automatically modeled as the response to one input vari-
reports through decision trees. Accident; Analysis and Prevention, 50, 1151–1160.
able depends on values of other inputs in the tree. It is also clear that Ekram, A. -A., & Rahman, M. S. (2018). Effects of connected and autonomous vehicles on
the interactions between the predictor variables are local in the decision contraflow operations for emergency evacuation: A microsimulation study.
Proceedings of the 97th Annual Meeting of the Transportation Research Board.
tree regression, while the classical regression have global interaction
Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees.
between the independent variables. To facilitate a policy analysis at The Journal of Animal Ecology, 77(4), 802–813. https://doi.org/10.1111/j.1365-2656.
the macro-level, variable importance of DTR models for both pedestrian and
bicycle crashes was computed. The variable importance results clearly
highlighted the significant predictor variables of the targeted and
neighboring STAZs, including traffic (such as VMT), roadway (such as
signalized intersection density and the length of sidewalks and bike lanes),
and sociodemographic characteristics (such as population density and
commuters by public transportation, walking, and bicycling) for both
pedestrian and bicycle crashes. From a planning perspective, it is important
to identify zones with large numbers of public transit, pedestrian, and
bicyclist commuters and large employment areas, and to undertake
infrastructure upgrades there to improve safety. Finally, the study applied
ensemble techniques (bagging, random forest, and gradient boosting) to
improve the prediction accuracy for pedestrian and bicycle crashes. The
results revealed that all of the ensemble techniques offer a slightly better
fit than the original DTR models, given the added complexity of these
techniques. Moreover, random forest performed better than bagging in terms of
goodness-of-fit measures. Gradient boosting outperformed the other two
ensemble techniques and was found to be the best technique for predicting
pedestrian and bicycle crashes at the macro-level.
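The ensemble ideas summarized above can be illustrated with a minimal sketch. It is not the models estimated in this study: the data are synthetic, the feature names (vmt, pop_density) are hypothetical placeholders, the weak learner is a depth-1 regression tree rather than a full DTR model, and all function names are introduced here for illustration only. Bagging averages trees fitted to bootstrap samples, the random-forest-like variant additionally restricts each tree to a random subset of features, and gradient boosting fits each new tree to the residuals of the current prediction.

```python
import random
from statistics import mean

def mse(y, pred):
    """Mean squared error between observed and predicted values."""
    return mean((a - b) ** 2 for a, b in zip(y, pred))

def fit_stump(X, y, features=None):
    """Depth-1 regression tree: choose the (feature, threshold) split with the lowest SSE."""
    features = range(len(X[0])) if features is None else features
    best = (float("inf"), None, None, mean(y), mean(y))  # sse, feature, threshold, left, right
    for j in features:
        # Exclude the largest value so the right-hand side is never empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if sse < best[0]:
                best = (sse, j, t, ml, mr)
    return best[1:]

def predict_stump(stump, row):
    j, t, ml, mr = stump
    if j is None:  # no valid split was found; fall back to the overall mean
        return ml
    return ml if row[j] <= t else mr

def fit_bagging(X, y, n_estimators=25, max_features=None, seed=0):
    """Bagging: one stump per bootstrap sample. Setting max_features gives a
    random-forest-like variant that also samples the candidate features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feats = rng.sample(range(p), max_features) if max_features else None
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return stumps

def predict_bagging(stumps, row):
    return mean(predict_stump(s, row) for s in stumps)

def fit_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump is fitted to the current residuals.
    Returns the base prediction, the stumps, and the training MSE after each round."""
    base = mean(y)
    pred = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        s = fit_stump(X, resid)
        stumps.append(s)
        pred = [pi + learning_rate * predict_stump(s, row) for pi, row in zip(pred, X)]
        history.append(mse(y, pred))
    return base, stumps, history

# Synthetic zone-level data (assumption): two features scaled to [0, 1] and a count response.
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(60)]  # hypothetical [vmt, pop_density]
y = [max(0, round(2 + 3 * (a > 0.5) + 2 * b + rng.gauss(0, 0.5))) for a, b in X]

single = fit_stump(X, y)
bag = fit_bagging(X, y)
forest_like = fit_bagging(X, y, max_features=1)
base, boosted, history = fit_boosting(X, y)

print("single stump MSE:", mse(y, [predict_stump(single, r) for r in X]))
print("bagging MSE:     ", mse(y, [predict_bagging(bag, r) for r in X]))
print("forest-like MSE: ", mse(y, [predict_bagging(forest_like, r) for r in X]))
print("boosting MSE:    ", history[-1])
```

These are training errors on toy data, so the ranking printed here says nothing about the ranking reported in the paper; a real comparison would use held-out zones and full-depth trees.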
Lee, C., & Abdel-Aty, M. (2005). Comprehensive analysis of vehicle-pedestrian crashes at Saad, M., Abdel-aty, M., Lee, J., & Wang, L. (2019). Integrated safety and operational anal-
intersections in Florida. Accident; Analysis and Prevention, 37(4), 775–786. https://doi. ysis of the access design of managed toll lanes. Transportation Research Record.
org/10.1016/j.aap.2005.03.019. https://doi.org/10.1177/0361198118823502.
Lee, J., Abdel-Aty, M., Choi, K., & Huang, H. (2015). Multi-level hot zone identification for Shankar, V., Milton, J., & Mannering, F. (1997). Modeling accident frequencies as zero-al-
pedestrian safety. Accident; Analysis and Prevention, 76, 64–73. tered probability processes: An empirical inquiry. Accident; Analysis and Prevention,
Lee, J., Yasmin, S., Eluru, N., Abdel-Aty, M., & Cai, Q. (2018). Analysis of crash proportion by 29(6), 829–837.
vehicle type at traffic analysis zone level: a mixed fractional split multinomial logit Siddiqui, C., Abdel-Aty, M., & Choi, K. (2012). Macroscopic spatial analysis of pedestrian
modeling approach with spatial effects. Accident Analysis & Prevention, 111(Septem- and bicycle crashes. Accident; Analysis and Prevention, 45, 382–391.
ber 2017), 12–22. Sohn, S. Y., & Lee, S. H. (2003). Data fusion, ensemble and clustering to improve the clas-
Lord, D., & Mannering, F. (2010). The statistical analysis of crash-frequency data: A review sification accuracy for the severity of road traffic accidents in Korea. Safety Science, 41
and assessment of methodological alternatives. Transportation Research Part A: Policy (1), 1–14.
and Practice, 44(5), 291–305. Son, H. D., Kweon, Y. J., & Park, B. B. (2011). Development of crash prediction models with
Lord, D., Washington, S., & Ivan, J. N. (2007). Further notes on the application of zero-in- individual vehicular data. Transportation Research Part C: Emerging Technologies, 19
flated models in highway safety. Accident; Analysis and Prevention, 39(1), 53–57. (6), 1353–1363.
Lord, D., Washington, S. P., & Ivan, J. N. (2005). Poisson, poisson-gamma and zero-inflated Song, Y. -Y., & Lu, Y. (2015). Decision tree methods: Applications for classification and
regression models of motor vehicle crashes: Balancing statistical fit and theory. prediction. Shanghai Archives of Psychiatry, 27(2), 130–135.
Accident; Analysis and Prevention, 37(1), 35–46. Tavakoli Kashani, A., Rabieyan, R., & Besharati, M. M. (2014). A data mining approach to
Mannering, F. L., Shankar, V., & Bhat, C. R. (2016). Unobserved heterogeneity and the sta- investigate the factors influencing the crash severity of motorcycle pillion passengers.
tistical analysis of highway accident data. Analytic Methods in Accident Research, 11, Journal of Safety Research, 51, 93–98.
1–16. Ukkusuri, S., Hasan, S., & Aziz, H. (2011). Random parameter model used to explain effects
Mayr, A., Binder, H., Gefeller, O., & Schmid, M. (2014). The evolution of boosting algo- of built-environment characteristics on pedestrian crash frequency. Transportation
rithms: From machine learning to statistical modelling. Methods of Information in Research Record Journal of the Transportation Research Board, 2237, 98–106. https://
Medicine, 53(6), 419–427. doi.org/10.3141/2237-11.
Montella, A., Aria, M., D'Ambrosio, A., & Mauriello, F. (2012). Analysis of powered two- Wah, Y. B., Nasaruddin, N., Voon, W. S., & Lazim, M. A. (2012). Decision tree model for
wheeler crashes in Italy by classification trees and rules discovery. Accident; count data. World Congress on Engineering I, 4–9.
Analysis and Prevention, 49, 58–72. Wang, X., Yuan, J., Schultz, G. G., & Fang, S. (2018). Investigating the safety impact of road-
Mounce, S. R., Ellis, K., Edwards, J. M., Speight, V. L., Jakomis, N., & Boxall, J. B. (2017). En- way network features of suburban arterials in Shanghai. Accident Analysis &
semble decision tree models using RUSBoost for estimating risk of iron failure in Prevention, 113(January), 137–148. https://doi.org/10.1016/j.aap.2018.01.029.
drinking water distribution systems. Water Resources Management, 31(5), Washington, S. (2000). Iteratively specified tree-based regression: Theory and trip gener-
1575–1589. ation example. Journal of Transportation Engineering, 126(6), 482–491.
Nashad, T., Yasmin, S., Eluru, N., Lee, J., & Abdel-Aty, M. A. (2016). Joint modeling of pedes- Washington, S., & Wolf, J. (1997). Hierarchical tree-based versus ordinary least squares
trian and bicycle crashes: A copula based approach. Transportation Research Record, linear regression models: Theory and example applied to trip generation.
2601, 119–127. Transportation Research Record, 1581(1), 82–88. https://doi.org/10.3141/1581-11.
NHTSA (2005). Motor vehicle traffic crashes as a leading cause of death in the United States Wu, Y., Abdel-Aty, M., Wang, L., & Rahman, M. S. (2019). Improving flow and safety in low
2002 1 Young, 3. visibility conditions by applying connected vehicles and variable speed limits tech-
NHTSA (2015). Traffic safety facts: Bicyclists and other cyclists. nologies. Transportation Research Board 98th Annual Meeting.
NHTSA (2017a). 2016 motor vehicle crashes: Overview. Traffic Safety Facts Research Note, Yuan, J., & Abdel-Aty, M. (2018). Approach-level real-time crash risk analysis for signal-
1–9. ized intersections. Accident Analysis & Prevention, 119, 274–289.
NHTSA (2017b). Traffic Safety Facts: Pedestrian. Zhang, Y., Bigham, J., Ragland, D., & Chen, X. (2015). Investigating the associations be-
Pande, A., Abdel-Aty, M., & Das, A. (2010). A classification tree based modeling approach tween road network structure and non-motorist accidents. Journal of Transport
for segment related crashes on multilane highways. Journal of Safety Research, 41(5), Geography, 42, 34–47. https://doi.org/10.1016/j.jtrangeo.2014.10.010.
391–397. Zheng, Z., Lu, P., & Denver, T. (2016). Accident prediction for highway-rail grade crossings
Pitt, R., Guyer, B., Hsieh, C. C., & Malek, M. (1990). The severity of pedestrian injuries in using decision tree approach: An empirical analysis. Transportation Research Record
children: An analysis of the pedestrian injury causation study. Accident; Analysis Journal of the Transportation Research Board, 2545, 115–122.
and Prevention, 22(6), 549–559. https://doi.org/10.1016/0001-4575(90)90027-I.
Md Sharikur Rahman is a graduate research assistant and Ph.D. student at the
University of Central Florida. His research area is traffic safety analysis.
He received his B.S. in Civil Engineering from Bangladesh University of
Engineering and Technology. His research interests lie in the field of
microscopic traffic safety analysis. He is also the vice president of the
American Society of Highway Engineers at UCF.

Mohamed Abdel-Aty is a Pegasus Professor, Chair of the Civil, Environmental
and Construction Engineering Department at the University of Central Florida,
and a registered professional engineer in Florida. His main expertise and
interests are in the areas of traffic safety analysis, simulation, big data
and data analytics, and intelligent transportation systems (ITS). In 2015, he
was awarded the Pegasus Professorship, the highest honor at the university.
He is the Editor-in-Chief of Accident Analysis and Prevention and a member of
multiple TRB Standing Committees, including Highway Safety Performance
(ANB25), User Information Systems (AND20), and Safety Data, Analysis and
Evaluation (ANB20). Dr. Abdel-Aty is a leading traffic safety expert at both
the national and international levels. In addition, he has been invited to
deliver many keynote speeches at conferences around the world, including in
Belgium, Brazil, China, Korea, Turkey, KSA, Qatar, and UAE.

Samiul Hasan received the bachelor's and master's degrees in civil
engineering from Bangladesh University of Engineering and Technology in 2004
and 2007, respectively, and the Ph.D. degree in transportation and
infrastructure systems from Purdue University in 2013. He is currently an
Assistant Professor with the Department of Civil, Environmental and
Construction Engineering, University of Central Florida, Orlando, FL, USA.
His research interests include human mobility, urban computing, network
modeling, agent-based simulation, and disaster management. He received the
Best Dissertation Award presented by the Transportation Science and Logistics
Society of the Institute for Operations Research and the Management Sciences
in 2014.

Qing Cai is a postdoctoral researcher in the Civil, Environmental and
Construction Engineering Department at the University of Central Florida
(UCF). He received his Ph.D. in transportation engineering from the same
university. His main expertise and interests are in the areas of traffic
safety, transportation planning, socio-demographic and land-use modeling, and
data analytics.