Applied Sciences: Comprehensive Analysis of Traffic Accidents in Seoul: Major Factors and Types Affecting Injury Severity

This document summarizes a study that analyzed traffic accident data from Seoul to identify major factors affecting injury severity. The study used machine learning methods including ensemble, regression, and clustering on the Seoul traffic accident dataset. The results showed that pedestrian-related factors like accidents involving pedestrians, rather than driver-related or environmental factors, were most important in determining injury severity. This suggests more preventative measures focused on pedestrian safety could help reduce serious injuries from traffic accidents in Seoul.

Applied Sciences
Article
Comprehensive Analysis of Traffic Accidents in Seoul: Major
Factors and Types Affecting Injury Severity
Hyeonchoel Jeong 1, Inhi Kim 2, Keejun Han 3,* and Jungeun Kim 1,*

1 Department of Computer Science and Engineering, Kongju National University, Cheonan 31080, Korea;
[email protected]
2 Department of Urban Systems Engineering, Kongju National University, Cheonan 31080, Korea;
[email protected]
3 Intelligent Convergence Research Laboratory, Electronics and Telecommunications Research Institute,
Daejeon 34129, Korea
* Correspondence: [email protected] (K.H.); [email protected] (J.K.)

Abstract: Accident and fatality rates of traffic accidents worldwide are steadily increasing every year;
thus, considerable effort has been made to prevent traffic accidents and prepare countermeasures.
This study aims to identify the major factors and types that affect the severity of traffic accidents
in Seoul by utilizing the Seoul Metropolitan Government’s traffic accident dataset. To achieve this,
we perform a comprehensive analysis by adopting various machine learning techniques, including not
only supervised learning methods but also unsupervised learning methods. As a result of the experiment,
we derived several critical factors that were found to affect the severity of traffic accidents via
supervised learning methods (i.e., ensemble-based and regression-based algorithms) and discovered
dominant accident types via unsupervised learning methods (i.e., clustering-based algorithms). One
of our primary findings is that, in contrast to common sense, environmental factors such as weather,
season, and day of the week do not significantly affect the severity of traffic accidents in Seoul.
Moreover, all methods highlight the importance of pedestrian-related factors, implying that it is
highly necessary to prepare more meticulous institutional measures for pedestrians to reduce the
negative influence of serious traffic accidents in Seoul.

Keywords: traffic accidents analysis; machine learning; logistic regression; XGBoost; DBSCAN

Citation: Jeong, H.; Kim, I.; Han, K.; Kim, J. Comprehensive Analysis of Traffic Accidents in Seoul:
Major Factors and Types Affecting Injury Severity. Appl. Sci. 2022, 12, 1790.
https://doi.org/10.3390/app12041790

Academic Editors: Paweł Droździel, Radovan Madleňák, Saugirdas Pukalskas, Drago Sever
and Marcin Ślęzak

Received: 17 January 2022
Accepted: 6 February 2022
Published: 9 February 2022

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).

Appl. Sci. 2022, 12, 1790. https://doi.org/10.3390/app12041790 https://www.mdpi.com/journal/applsci

1. Introduction
Traffic accidents have emerged as a serious social problem today, as the number of car
registrations has increased rapidly owing to global economic growth and improvements in
living standards [1–3]. According to the report published by the World Health Organization
(WHO) [4] in 2018, nearly 1.35 million people worldwide die in traffic accidents every year,
implying that one person dies in a traffic accident every 24 s, an increase of 100,000 people
compared to 2015. In addition, according to the Centers for Disease Control and Prevention
(CDC) [5], the cost of medical and productivity losses associated with deaths from car
accidents in one year exceeds $63 billion. Therefore, it is necessary to identify major
factors and types of traffic accidents to prevent traffic accidents in advance based on the
results obtained.
Along these lines, a number of related studies and policies are being carried out abroad.
However, there is still a lack of understanding of the major causes and mechanisms of
serious traffic accidents in Seoul. Seoul is the largest city in South Korea, with various modes
of transportation used by almost 10 million citizens and vehicles every day, implying that
traffic accidents would cause tremendous social and economic losses.
The results of traffic accident data analysis may vary depending on the characteristics
of the local traffic environment. Thus, it is necessary to focus on the intrinsic properties
of Seoul for a deeper understanding of the causes and mechanisms of traffic accidents in
Seoul. Furthermore, traffic accidents are caused by a combination of various factors such as
human error, road conditions, and the environment. This means that we need to perform a
comprehensive analysis of traffic accident datasets. Additionally, there is no single method
that always yields the best results in all cases; therefore, various methodologies with
different philosophies should be used for complex analysis.
In this study, we aim to identify the significant factors and types that affect the severity
of traffic accidents by focusing on the cases of Seoul. To this end, we used big data on
traffic accidents in Seoul pertaining to various factors by adopting three widely used
machine learning techniques: ensemble-based, regression-based, and clustering-based
methodologies. Throughout the analysis, we found that the severity of traffic accidents is
mainly determined by pedestrian-related variables, not by driver-related variables, which
differs from the results reported in previous studies [6–10]. We assume that this is because
of the unique characteristics of Seoul, where a daily traffic volume of almost 10 million
vehicles [11] has inevitably fostered a vehicle-oriented transportation environment.
This paper makes the following contributions.
• We analyzed a set of features that affect the number of traffic accidents by classifying
the features into three main factors (human, road, and environment) with a focus on
Seoul, the capital of the Republic of Korea.
• We unveiled the significant features that affect the severity of traffic accidents by
exploiting various machine learning approaches: ensemble, regression, and clustering-
based analytics.
• By performing further qualitative analysis, we suggest that establishing more preven-
tive measures against pedestrian accidents would be an adequate approach to reduce
the number of fatal injuries due to traffic accidents in Seoul.
The remainder of this paper is organized as follows: Section 2 reviews previous
research. Section 3 describes the characteristics of the dataset we used for the analysis.
Section 4 introduces three methods that we adopted, and Section 5 shows our findings
based on the analysis using these methods. Finally, Section 6 concludes the study.

2. Related Work
Extensive studies have been carried out for the analysis of traffic accidents abroad.
Chong et al. [12] used the traffic accident dataset provided by the General Estimates
System (GES) to classify no injury, possible injury, non-incapacitating injury, incapacitating
injury, and fatal injury classes for the period ranging from 1995 to 2000. The accuracy was
compared and analyzed using decision trees, support vector machines, neural networks,
and hybrid decision trees. The hybrid approach outperformed the other methodologies in
the non-incapacitating injury, incapacitating injury, and fatal injury classes, whereas
decision trees were shown to be the most suitable for the no injury and possible injury
classes. Feng et al. [13] applied the Apriori algorithm to the British traffic accident dataset
and then explored the rules with high support and lift. They found strong
correlations in environmental characteristics, speed limits, and locations. Juan et al. [14]
used Bayesian networks to analyze traffic accidents according to injury severity. Accident
type, driver age, lighting, and the number of injuries were the most important factors
associated with serious or fatal traffic accidents.
Zhao et al. [15] used the Bayesian network collision severity model to further analyze
the complex combination relationship between single- and multi-vehicle traffic accidents.
In addition, they ranked five-factor combination sequences for the number of deaths and
three-factor combination sequences for the number of injuries according to severity, thereby
revealing the critical reason. Dong et al. [16] used a mixed logit model to investigate
the difference in probability of accidents between single- and multi-vehicle accidents and
used disaggregated data with response variables classified as no accidents, single-vehicle
accidents, and multi-vehicle accidents. The analysis revealed that speed intervals, section
lengths, and wet road surfaces are important for both single- and multi-vehicle accidents, while most
other variables are important only for multi-vehicle accidents. Thanapong et al. [17] conducted a
study to determine ways to reduce rear-end collisions and fatal rear-end collisions. To this end,
a classification and regression tree (CART) was used, and the predictors for at-fault and
not-at-fault driver models showed that the driver age was the most important, followed by the
number of lanes and median opening area. Furthermore, the use of safety equipment was found to
be the most important factor affecting fatality. Ahmed [18] applied logistic regression
to identify important variables affecting traffic accident deaths. The experiment showed that
the major factors affecting traffic accident deaths were
speed, location, and vehicle type.
Several research efforts have been devoted to the analysis of traffic accidents in Korea.
Bhin and Son [19] used a decision tree model to select the variables that determine
the severity of traffic accidents according to the gender of bus drivers and
analyzed severity by gender using an ordered logit model. The analysis found that the signal
violation variable (in the violation of law category) was selected for all genders, and
that the same variables were selected in the overall bus driver severity model and the male
bus driver severity model, indicating that the carelessness of the driver greatly affected the
severity of the accident. Lim et al. [20] analyzed traffic accident factors on roads with a
width of less than 9 m using logistic regression models and found that drivers were driving
straight and that women and pedestrians were driving bicycles. Kim et al. [21] conducted a
study on the effects of traffic accidents on their occurrence according to the age of drivers,
considering their human characteristics. Poisson regression analysis was used to develop a
severity model for elderly and non-elderly people, and it showed that elderly drivers had
an impact on their ability to predict stopping distance, discriminate surrounding situations,
and respond to attention.
Table 1 summarizes related work. Compared with previous related studies, the major
differences in this paper are as follows. First, we focused on Seoul, the capital of the Republic
of Korea. Since the results of traffic accident analysis can differ depending on the
characteristics of the local traffic environment, it is crucial to consider the intrinsic properties
of the target region. Second, we analyzed common trends and differences through both
supervised and unsupervised learning methods. Finally, we reported new interesting
findings that have not been reported previously.

Table 1. Summary of related work.

| Author | Objective | Analytical Algorithms |
|---|---|---|
| Chong et al. [12] | Performance comparison of four models to predict traffic accident severity | Decision tree, SVM, neural network, and hybrid decision tree |
| Feng et al. [13] | Discovery of dominant rules for traffic accidents | Apriori algorithm |
| Juan et al. [14] | Discovery of major factors affecting traffic accident severity | Bayesian network |
| Zhao et al. [15] | Discovery of major factors for traffic accidents | Bayesian network |
| Dong et al. [16] | Study of the difference in accident probability between single and multi-vehicle accidents | Mixed logit model |
| Thanapong et al. [17] | Discovery of factors that reduce rear-end collisions and fatal rear-end collisions | CART |
| Ahmed [18] | Identification of important variables for traffic accident deaths | Logistic regression |
| Bhin and Son [19] | Investigation of factors affecting the severity of accidents according to gender | Decision tree, ordered logit |
| Lim et al. [20] | Discovery of variables affecting traffic accidents on roads with a width of less than 9 m | Logistic regression |
| Kim et al. [21] | Investigation of driver skills by age and the influence of personality factors on the occurrence of traffic accidents | Poisson regression |

3. Characteristics of Dataset
The dataset we used includes 362,298 cases of traffic accidents that occurred between
2010 and 2018 in Seoul, provided by a public data portal [22]. The characteristics of the
dataset are summarized in Table 2.

Table 2. Characteristics of the dataset.

| Factors | Variables | Data Type | Attribute Values |
|---|---|---|---|
| Human Factor | Accident type | Categorical | Side collision, backup collision, head-on collision, rear-end collision, crossing, passing on driveway, passing on the edge of the road, passing on the sidewalk, other |
| Human Factor | Violation of law | Categorical | Non-compliance with safe driving obligation, not keeping a safe distance, signal violation, intersection crossing procedure violation, centerline violation, pedestrian protection obligation violation, speeding, other |
| Human Factor | Perpetrator’s gender | Categorical | Male, female, unidentified |
| Human Factor | Perpetrator’s age | Categorical | 1~117 years old, unidentified |
| Human Factor | Victim’s gender | Categorical | Male, female, unidentified |
| Human Factor | Victim’s age | Categorical | 1~117 years old, unidentified |
| Road Factor | Road surface | Categorical | Dry, wet, frozen, snow, flooding, unidentified |
| Road Factor | Road type | Categorical | At an intersection, near an intersection, on a crosswalk, near a crosswalk, over the bridge, single road, crosswalk at an intersection, inside a tunnel, over an overpass, in an underpass, railroad crossing, unidentified, other |
| Road Factor | Perpetrator’s vehicle type | Categorical | Passenger car, lorry, two-wheeler, van, bicycle, motorized bicycle, heavy equipment, specialty vehicle, unidentified, other |
| Road Factor | Victim’s vehicle type | Categorical | Passenger car, lorry, two-wheeler, van, bicycle, motorized bicycle, heavy equipment, specialty vehicle, pedestrian, unidentified, other |
| Environmental Factor | Occurrence day | Categorical | 1 January 2010~31 December 2018 |
| Environmental Factor | Occurrence time | Categorical | 00:00~23:00 |
| Environmental Factor | Day of the week | Categorical | Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday |
| Environmental Factor | Weather | Categorical | Sunny, rainy, cloudy, snowy, foggy, unidentified |

In the pre-processing step, data with unknown values (i.e., null) were removed. Addi-
tionally, extremely low-frequency attributes (i.e., less than 0.01%) were removed because it
is challenging to develop a good model if the data distribution is imbalanced.
The classification criteria for slight injury and serious injury are as follows: “slight
injury” implies an injury that requires treatment for more than five days but less than three
weeks due to a traffic accident. In contrast, “serious injury” implies an injury that requires
treatment for at least three weeks due to a traffic accident. Further, “death” is considered as
death within 30 days from the time of a traffic accident. In this study, serious injuries and
deaths were both treated as serious accidents, including life-threatening cases [23].
Because the data are all categorical, we used one-hot encoding to transform them
into a vector space model so that conventional machine learning algorithms can be applied.
However, one-hot encoding can dramatically increase the number of variables, resulting in poor
classification performance of the algorithm. Therefore, we grouped the attribute values to
reduce the number of attributes. For example, days of occurrence were grouped by season,
and accident occurrence times were grouped into dawn (0:00–6:00), day (6:00–18:00), and
night (18:00–24:00). Further, the ages were grouped into underage (0–18 years), youth
(19–34 years), middle-aged (35–49 years), old-aged (50–64 years), and elderly (≥65 years).
After pre-processing, the categories and distribution ratios of each variable are
presented in separate tables for human factors, road factors, and environmental factors
(Tables 3–5).
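The grouping and encoding steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`group_time`, `group_age`, `one_hot`), the record layout, and the treatment of the boundary hours (6:00 counted as day, 18:00 as night) are assumptions made only for illustration.

```python
# Illustrative sketch only; helper names and boundary handling are assumptions.

def group_time(hour):
    """Map an occurrence hour (0-23) to dawn / day / night."""
    if 0 <= hour < 6:
        return "dawn"
    if 6 <= hour < 18:
        return "day"
    return "night"


def group_age(age):
    """Map an age in years to the five age groups used in the text."""
    if age <= 18:
        return "underage"
    if age <= 34:
        return "youth"
    if age <= 49:
        return "middle-aged"
    if age <= 64:
        return "old-aged"
    return "elderly"


def one_hot(records, columns):
    """One-hot encode categorical records into 0/1 indicator rows."""
    # One indicator feature per (column, observed value) pair.
    feature_names = [
        f"{col}={value}"
        for col in columns
        for value in sorted({rec[col] for rec in records})
    ]
    rows = []
    for rec in records:
        active = {f"{col}={rec[col]}" for col in columns}
        rows.append([1 if name in active else 0 for name in feature_names])
    return feature_names, rows


# Two made-up accident records, grouped first and then encoded.
records = [
    {"time": group_time(2), "victim_age": group_age(70)},
    {"time": group_time(14), "victim_age": group_age(25)},
]
names, rows = one_hot(records, ["time", "victim_age"])
```

Grouping before encoding keeps the number of indicator columns small: in this sketch, 24 possible hours collapse into three time features and all ages into five age features.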

Table 3. Categories and frequencies of variables (human factor).

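| Variable | Category | Frequency | Serious (%) | Slight (%) |
|---|---|---|---|---|
| Accident type | Side collision | 93,841 | 32.0 | 68.0 |
| Accident type | Backup collision | 574 | 11.8 | 88.2 |
| Accident type | Head-on collision | 10,444 | 41.9 | 58.1 |
| Accident type | Rear-end collision | 66,062 | 25.3 | 74.7 |
| Accident type | Crossing | 33,684 | 58.0 | 42.0 |
| Accident type | Passing on driveway | 7261 | 41.7 | 58.3 |
| Accident type | Passing on the edge of the road | 5399 | 35.6 | 64.4 |
| Accident type | Passing on the sidewalk | 3998 | 46.5 | 53.5 |
| Accident type | Other | 77,542 | 35.2 | 64.8 |
| Violation of law | Non-compliance with safe driving obligation | 158,966 | 35.0 | 65.0 |
| Violation of law | Not keeping a safe distance | 44,245 | 24.5 | 75.5 |
| Violation of law | Signal violation | 39,416 | 34.6 | 65.4 |
| Violation of law | Pedestrian protection obligation violation | 12,384 | 29.6 | 70.4 |
| Violation of law | Centerline violation | 11,297 | 43.5 | 56.5 |
| Violation of law | Violation of pedestrian protection obligation | 10,754 | 51.7 | 48.3 |
| Violation of law | Speeding | 455 | 78.7 | 21.3 |
| Violation of law | Other | 21,288 | 32.2 | 67.8 |
| Perpetrator’s gender | Male | 254,924 | 35.4 | 64.6 |
| Perpetrator’s gender | Female | 43,881 | 34.7 | 65.3 |
| Perpetrator’s age | Underage | 10,424 | 37.4 | 62.6 |
| Perpetrator’s age | Youth | 62,835 | 35.5 | 64.5 |
| Perpetrator’s age | Middle-aged | 106,285 | 36.1 | 63.9 |
| Perpetrator’s age | Old-aged | 89,769 | 34.4 | 65.6 |
| Perpetrator’s age | Elderly | 29,492 | 34.7 | 65.3 |
| Victim’s gender | Male | 224,221 | 33.2 | 66.8 |
| Victim’s gender | Female | 74,584 | 41.7 | 58.3 |
| Victim’s age | Underage | 20,436 | 37.3 | 62.7 |
| Victim’s age | Youth | 70,421 | 31.4 | 68.6 |
| Victim’s age | Middle-aged | 88,565 | 31.5 | 68.5 |
| Victim’s age | Old-aged | 89,212 | 36.0 | 64.0 |
| Victim’s age | Elderly | 30,171 | 52.1 | 47.9 |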

Table 4. Categories and frequencies of variables (road factor).

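| Variable | Category | Frequency | Serious (%) | Slight (%) |
|---|---|---|---|---|
| Road surface | Dry | 264,085 | 35.1 | 64.9 |
| Road surface | Wet | 31,596 | 37.0 | 63.0 |
| Road surface | Frozen | 2041 | 31.5 | 68.5 |
| Road surface | Snow | 1083 | 33.4 | 66.6 |
| Road type | At an intersection | 70,541 | 37.1 | 62.9 |
| Road type | Near an intersection | 50,830 | 31.9 | 68.1 |
| Road type | On a crosswalk | 11,889 | 54.5 | 45.5 |
| Road type | Near a crosswalk | 6037 | 47.5 | 52.5 |
| Road type | Over the bridge | 2916 | 32.8 | 67.2 |
| Road type | Single road | 151,963 | 33.7 | 66.3 |
| Road type | Crosswalk at an intersection | 2508 | 41.7 | 58.3 |
| Road type | Inside the tunnel | 773 | 33.6 | 66.4 |
| Road type | Over the overpass | 755 | 29.4 | 70.6 |
| Road type | In the underpass | 593 | 32.9 | 67.1 |
| Perpetrator’s vehicle type | Passenger car | 215,661 | 33.3 | 66.7 |
| Perpetrator’s vehicle type | Lorry | 24,215 | 41.0 | 59.0 |
| Perpetrator’s vehicle type | Two-wheeler | 19,822 | 39.3 | 60.7 |
| Perpetrator’s vehicle type | Van | 19,772 | 42.2 | 57.8 |
| Perpetrator’s vehicle type | Bicycle | 9037 | 37.8 | 62.2 |
| Perpetrator’s vehicle type | Motorized bicycle | 8251 | 38.6 | 61.4 |
| Perpetrator’s vehicle type | Heavy equipment | 1530 | 45.2 | 54.8 |
| Perpetrator’s vehicle type | Specialty vehicles | 517 | 44.9 | 55.1 |
| Victim’s vehicle type | Passenger car | 137,014 | 33.9 | 76.1 |
| Victim’s vehicle type | Lorry | 9781 | 37.1 | 72.9 |
| Victim’s vehicle type | Two-wheeler | 31,428 | 46.1 | 53.9 |
| Victim’s vehicle type | Van | 13,235 | 33.0 | 67.0 |
| Victim’s vehicle type | Bicycle | 14,368 | 41.8 | 58.2 |
| Victim’s vehicle type | Motorized bicycle | 12,696 | 46.3 | 53.7 |
| Victim’s vehicle type | Heavy equipment | 472 | 30.9 | 69.1 |
| Victim’s vehicle type | Specialty vehicles | 277 | 28.2 | 71.8 |
| Victim’s vehicle type | Pedestrian | 79,534 | 49.8 | 50.2 |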

Table 5. Categories and frequencies of variables (environmental factor).

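| Variable | Category | Frequency | Serious (%) | Slight (%) |
|---|---|---|---|---|
| Season | Spring | 78,545 | 35.8 | 64.2 |
| Season | Summer | 75,948 | 35.0 | 65.0 |
| Season | Autumn | 75,905 | 35.6 | 64.4 |
| Season | Winter | 68,407 | 34.6 | 65.4 |
| Time | Day | 152,061 | 35.2 | 64.8 |
| Time | Night | 93,772 | 34.5 | 65.5 |
| Time | Dawn | 52,972 | 36.9 | 63.1 |
| Day of week | Monday | 40,767 | 35.6 | 64.4 |
| Day of week | Tuesday | 42,735 | 35.0 | 65.0 |
| Day of week | Wednesday | 43,732 | 35.6 | 64.4 |
| Day of week | Thursday | 43,672 | 35.7 | 64.3 |
| Day of week | Friday | 47,497 | 35.5 | 64.5 |
| Day of week | Saturday | 46,122 | 34.3 | 65.7 |
| Day of week | Sunday | 34,280 | 35.3 | 64.7 |
| Weather | Sunny | 258,349 | 35.0 | 65.0 |
| Weather | Rainy | 22,294 | 36.8 | 63.2 |
| Weather | Foggy | 16,084 | 37.8 | 62.2 |
| Weather | Snowy | 2078 | 33.3 | 66.7 |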

3.1. Human Factors


Table 3 shows categories and frequencies of human factors. Human factors are classified
into six categories: accident type, violation of law, perpetrator’s gender, perpetrator’s
age, victim’s gender, and victim’s age. First, in the category of accident type, we can
observe that crossing has the most significant influence on the severity of traffic accidents
when considering the frequency and ratio of serious injuries. Second, in the category of the
violation of law, speeding shows the highest ratio of serious injuries at 78.7% despite the
low frequency. Third, the perpetrator’s gender and age have little effect on the severity of
the traffic accident, whereas the victim’s gender and age have a more significant impact:
accidents tend to be more severe when the victim is a woman or an older adult.

3.2. Road Factors


Table 4 shows categories and frequencies of road factors. Road factors are classified
into four categories: road surfaces, road types, perpetrator’s vehicle types, and victim’s
vehicle types. First, the condition of the road surface does not have a significant effect
on the severity of traffic accidents. When the road type is a crosswalk, accidents with
more serious injuries occur. Furthermore, in the category of perpetrator’s vehicle type, the
proportion of accidents with serious injury is high in the order of heavy equipment and
specialty vehicles, while in the category of victim’s vehicle type, the pedestrian shows the
highest ratio of accidents with serious injury at 49.8%.

3.3. Environmental Factors


Table 5 shows categories and frequencies of environmental factors. Environmental
factors are classified into four categories: season, time, day of the week, and weather.
Examining the day of the week, it appears that Sunday has fewer traffic accidents than
other days of the week. Further, examining the time, day of the week, and weather, there is
little difference in the ratio of serious injury accidents. In particular, in contrast to common
sense, it is interesting that the proportion of serious injuries on snowy or rainy days is not
higher than that on other days. Thus, based on observations, it seems that there is not much
connection between environmental factors and the severity of traffic accidents compared to
other factors.

4. Analytical Methods
Widely used analytical methodologies, including ensemble-based and regression-
based classifications, were applied to investigate the interrelationship between a dependent
variable (i.e., the severity of traffic accidents) and independent variables (i.e., human, road,
and environmental factors). We also adopted clustering to group data to determine the
nature of each group so that we can discover dominant patterns of severe traffic accidents.
In this work, we used eXtreme Gradient Boosting (XGBoost) because of its robustness to
overfitting, which is critical for the classification problem [24]; logistic regression because
of its suitability for handling categorical data [25]; and DBSCAN because it does not require
the number of clusters to be specified in advance [26].

4.1. XGBoost
XGBoost [24] is a boosting-based ensemble algorithm that combines multiple decision trees
and improves on the overfitting behavior, speed, and stability of
existing tree-based models. XGBoost sequentially trains decision trees on the training data,
and the objective function of XGBoost is defined as follows.

\[
\mathrm{Obj}(\theta) = \sum_{i=1}^{m} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{K} \Omega(f_k), \qquad \theta = (f_1, f_2, \ldots, f_K) \tag{1}
\]

Here, i indexes the ith sample in the dataset, m is the total number of
samples fed into the kth tree, and K is the total number of trees. $y_i$ is the class label,
while $\hat{y}_i$ is the predicted label. l is the loss function and Ω is the regularization term.
XGBoost adopts an additive strategy to improve the value of the objective function
by adding a new decision tree to the previous ones at each iteration. When the tth tree is
constructed, the predicted value $\hat{y}_i^{(t)}$ can be formulated as follows.

\[
\hat{y}_i^{(t)} = \sum_{k=1}^{t-1} f_k(x_i) + f_t(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{2}
\]

According to Equations (1) and (2), the objective function can be formulated as follows.

\[
\mathrm{Obj}(\theta)^{t} = \sum_{i=1}^{m} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \sum_{k=1}^{t} \Omega(f_k) \tag{3}
\]

If the tree contains a total of T leaf nodes, the index of each leaf node is defined as j
and the weight of the samples for each leaf node is $w_j$. Then, the regularization term Ω(f)
is defined as follows.

\[
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2} \tag{4}
\]

Here, γ and λ represent penalty factors.
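
To make Equations (1), (2), and (4) concrete, the following minimal Python sketch evaluates the objective for a toy ensemble of two hand-written decision stumps. It is not the XGBoost library and not the authors' code: the squared loss, the stump definitions, and all function names are assumptions chosen only so that the arithmetic can be followed by hand.

```python
# Toy illustration of the regularized additive objective; all names are hypothetical.

def regularization(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum_j w_j^2, as in Equation (4)."""
    num_leaves = len(leaf_weights)
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def additive_prediction(trees, x):
    """Prediction of the ensemble: the sum of all tree outputs, as in Equation (2)."""
    return sum(tree(x) for tree in trees)


def squared_loss(y, y_hat):
    """Assumed loss function l; the text does not fix a particular loss."""
    return (y - y_hat) ** 2


def objective(trees, leaf_weights_per_tree, X, y, gamma, lam):
    """Data term plus one regularization term per tree, as in Equation (1)."""
    data_term = sum(
        squared_loss(yi, additive_prediction(trees, xi)) for xi, yi in zip(X, y)
    )
    reg_term = sum(regularization(w, gamma, lam) for w in leaf_weights_per_tree)
    return data_term + reg_term


# Two decision stumps, each with two leaves (T = 2).
tree_1 = lambda x: 0.5 if x[0] > 0 else -0.5
tree_2 = lambda x: 0.25 if x[1] > 0 else -0.25
leaf_weights = [[0.5, -0.5], [0.25, -0.25]]

X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1.0, -1.0]

total = objective([tree_1, tree_2], leaf_weights, X, y, gamma=1.0, lam=2.0)
# Data term: 2 * (1 - 0.75)^2 = 0.125; regularization: 2.5 + 2.125 = 4.625.
```

In this example the penalty factors dominate the objective, which shows how γ discourages additional leaves and λ discourages large leaf weights.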

4.2. Logistic Regression


Regression analysis is a statistical technique for predicting the value of dependent
variables from independent variables by understanding the causal relationship between
variables. It is used to analyze the relevance of dependent variables to independent
variables. A typical multiple linear regression (MLR) formula is equivalent to Equation (5).

\[
p_{\mathrm{MLR}}(y_i \mid x_i) = v + w^{T} x_i, \qquad i = 1, 2, \ldots, m \tag{5}
\]

where $X = [x_1, \ldots, x_m]^{T} \in \mathbb{R}^{m \times n}$ is a set of training data and
$Y = [y_1, \ldots, y_m] \in \mathbb{R}^{m}$ is a set of labels. $w \in \mathbb{R}^{n}$ are weighting
values and v represents the intercept. $p_{\mathrm{MLR}}(y_i \mid x_i)$
is the predicted value of $y_i$ when the independent variable $x_i$ attains a certain value.
A typical regression analysis can acquire any value depending on the independent variable;
thus, the $p_{\mathrm{MLR}}(y_i \mid x_i)$ value can extend to infinity. If the dependent variable is a binary
categorical variable, linear regression does not properly represent the relationship between
the independent and dependent variables.
Therefore, logistic regression (LR) [25] can be used instead of linear regression if the
dependent variable is binary (i.e., yi ∈ {−1, +1}). Using logistic regression, the value of
the dependent variable can be represented as a value between zero and one. Expressing
logistic regression as a formula is equivalent to Equation (6).

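\[
p_{\mathrm{LR}}(y_i \mid x_i) = \frac{\exp\left(\left(v + w^{T} x_i\right) y_i\right)}{1 + \exp\left(\left(v + w^{T} x_i\right) y_i\right)} \tag{6}
\]

When the independent variable $x_i$ acquires a certain value, the predicted value
$p_{\mathrm{LR}}(y_i \mid x_i)$ can be interpreted as a probability between 0 and 1.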
The average logistic loss function is calculated from the negative log-likelihood of the
logistic model with respect to all samples.
\[
\ell_{\mathrm{avg}}(w, v) = \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp\left(-y_i \left(w^{T} x_i + v\right)\right)\right) \tag{7}
\]

The model parameters w and v are determined by maximum likelihood estimation, that is,
by minimizing the average logistic loss function.

\[
\text{minimize} \quad \ell_{\mathrm{avg}}(w, v) \tag{8}
\]

By adding a regularization term on the weights, which is a standard technique for preventing
overfitting, to the average logistic loss function, we can keep the weights from growing too
large and improve the generalization performance of our models.

\[
\text{minimize} \quad \ell_{\mathrm{avg}}(w, v) + R(C) \tag{9}
\]

Here, R(C) is the regularization function, which can have different forms depending
on the regularization method. The ℓ1-regularized logistic regression problem is

\[
\text{minimize} \quad \ell_{\mathrm{avg}}(w, v) + \frac{1}{C} \sum_{i=1}^{n} |w_i| \tag{10}
\]

The ℓ2-regularized logistic regression problem is

\[
\text{minimize} \quad \ell_{\mathrm{avg}}(w, v) + \frac{1}{C} \sum_{i=1}^{n} w_i^{2} \tag{11}
\]

where C is a regularization parameter used to adjust the balance between the average logistic
loss and the magnitude of the weight vector, measured by the ℓ1-norm or ℓ2-norm.
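
Equations (6), (7), (10), and (11) can be written out directly in a few lines of Python. The sketch below is illustrative only: the function names and the toy data are assumptions, labels are taken in {−1, +1} as in the text, and no optimizer is included, so it evaluates the objective rather than fitting a model.

```python
import math

# Illustrative sketch of the regularized logistic objective; names are hypothetical.

def predict_proba(w, v, xi, yi=1):
    """p_LR(y_i | x_i) from Equation (6), rewritten as 1 / (1 + exp(-(v + w^T x_i) y_i))."""
    z = (sum(wj * xj for wj, xj in zip(w, xi)) + v) * yi
    return 1.0 / (1.0 + math.exp(-z))


def average_logistic_loss(w, v, X, y):
    """Average logistic loss from Equation (7), with labels in {-1, +1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + v)
        total += math.log1p(math.exp(-margin))
    return total / len(X)


def regularized_loss(w, v, X, y, C, penalty="l1"):
    """Objective of Equation (10) (penalty='l1') or Equation (11) (penalty='l2')."""
    if penalty == "l1":
        reg = sum(abs(wj) for wj in w)
    elif penalty == "l2":
        reg = sum(wj * wj for wj in w)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return average_logistic_loss(w, v, X, y) + reg / C


# Toy data: two one-hot style feature vectors with opposite labels.
X = [(1.0, 0.0), (0.0, 1.0)]
y = [1, -1]

loss_at_zero = average_logistic_loss([0.0, 0.0], 0.0, X, y)   # equals log(2)
l1_objective = regularized_loss([1.0, -2.0], 0.0, X, y, C=1.0, penalty="l1")
l2_objective = regularized_loss([1.0, -2.0], 0.0, X, y, C=1.0, penalty="l2")
```

Because the penalty is divided by C, a larger C weakens the regularization and a smaller C strengthens it.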

4.3. DBSCAN
DBSCAN [26] is an unsupervised learning method that clusters data with similar
characteristics by grouping the dense parts of the data. D is the user’s database, and points
p, q ∈ D are d-dimensional vectors. Further, $N_{Eps}(p) = \{q \in D \mid \mathrm{dist}(p, q) \le Eps\}$ is the set
of points within the radius Eps centered on point p. When a point p satisfies $p \in N_{Eps}(q)$
and $|N_{Eps}(q)| \ge MinPts$, point q is defined as a core point,
and point p is directly density-reachable from point q. Thus, if there are at least MinPts
points within the Eps radius of point q, then point q is classified as a core point. If a chain
of points $p_1, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ exists where $p_{i+1}$ is directly
density-reachable from $p_i$, then point p is defined as density-reachable from point q.
Moreover, if a point o exists from which both p and q are density-reachable, then p and q
are defined as density-connected. When $C_1, \ldots, C_k$ are the clusters
within D, we define $noise = \{p \in D \mid \forall i : p \notin C_i\}$ as the set of noise points, that is,
the points that do not belong to any cluster [26].
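
The definitions above translate into a short procedure. The following pure-Python sketch is a simplified illustration, not the authors' implementation and not a library API: the function names, the Euclidean distance, and the toy points are assumptions, and the quadratic-time neighborhood search is kept only for readability.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (includes idx itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks noise points."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # Point i is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
        cluster_id += 1
    return labels


# Two dense groups and one isolated point (made-up coordinates).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is never passed in; it emerges from Eps and MinPts, which is the property cited as the reason for choosing DBSCAN.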

5. Results
We conducted the experimental analysis on the Seoul Metropolitan Government’s traffic
accident dataset with the following focuses: (i) critical factors affecting the severity of
traffic accidents (Section 5.1) and (ii) representative types of traffic accidents (Section 5.2).
All experiments were performed on a PC with an AMD Ryzen 7 2700X eight-core 3.7 GHz CPU
and 32 GB of RAM, running Windows 10. All algorithms were implemented in Python. In the
data preprocessing step, we used one-hot encoding to handle the categorical data. For
supervised learning methods, the ratio between the training set and the test set is 75/25.
All results in this section are statistically significant since the
p-values are less than a typical significance level of 0.01. The source code of all experiments is
fully available at https://github.com/hyunchul1357/traffic-accident-analysis (accessed on
13 January 2022).

5.1. Factor Analysis through XGBoost and Logistic Regression


In XGBoost, there are three hyper-parameters to be first optimized to prevent overfit-
ting and increase accuracy. Table 6 lists the results of the hyper-parameter optimization.
We observed that the highest accuracy, 68.95%, is achieved when the learning rate is 0.1,
the depth of the tree is 3, and the number of weak learners is 200. However,
the difference in accuracy according to hyper-parameter changes is not large. In general,
hyper-parameter tuning is performed to prevent overfitting or underfitting the model in
order to find accurate trends in the dataset. The reason that hyper-parameter optimization
does not dramatically change the results is that the dataset has a clear tendency.

Table 6. Effects of hyper-parameters of XGBoost.

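| Learning Rate | Estimators | Max Depth | Accuracy % |
|---|---|---|---|
| 0.1 | 100 | 3 | 68.76 |
| 0.1 | 100 | 5 | 68.69 |
| 0.1 | 100 | 7 | 68.19 |
| 0.1 | 200 | 3 | 68.95 |
| 0.1 | 200 | 5 | 68.42 |
| 0.1 | 200 | 7 | 67.71 |
| 0.2 | 100 | 3 | 68.90 |
| 0.2 | 100 | 5 | 68.54 |
| 0.2 | 100 | 7 | 67.63 |
| 0.2 | 200 | 3 | 68.76 |
| 0.2 | 200 | 5 | 67.64 |
| 0.2 | 200 | 7 | 66.58 |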
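
The search summarized in Table 6 amounts to evaluating every combination of the three hyper-parameters and keeping the best one. The short Python sketch below reproduces only the selection step from the accuracies reported in Table 6; it does not train any model, and the variable names are hypothetical.

```python
from itertools import product

# Candidate values for the three hyper-parameters listed in Table 6.
learning_rates = [0.1, 0.2]
n_estimators = [100, 200]
max_depths = [3, 5, 7]

# Accuracies (%) copied from Table 6, keyed by (learning rate, estimators, max depth).
reported_accuracy = {
    (0.1, 100, 3): 68.76, (0.1, 100, 5): 68.69, (0.1, 100, 7): 68.19,
    (0.1, 200, 3): 68.95, (0.1, 200, 5): 68.42, (0.1, 200, 7): 67.71,
    (0.2, 100, 3): 68.90, (0.2, 100, 5): 68.54, (0.2, 100, 7): 67.63,
    (0.2, 200, 3): 68.76, (0.2, 200, 5): 67.64, (0.2, 200, 7): 66.58,
}

# Enumerate the full grid and pick the combination with the highest accuracy.
grid = list(product(learning_rates, n_estimators, max_depths))
best_params = max(grid, key=lambda params: reported_accuracy[params])
best_accuracy = reported_accuracy[best_params]

# Gap between the best and worst configurations, in percentage points.
spread = max(reported_accuracy.values()) - min(reported_accuracy.values())
```

Across all twelve configurations the accuracies differ by only a little over two percentage points, consistent with the observation that tuning does not change the results dramatically.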

Table 7 lists the top five independent variables after learning the XGBoost.

Table 7. Importance of independent variables.

Independent Variables F Score


Victim’s vehicle type = Pedestrian 53
Victim’s vehicle type = Passenger car 48
Perpetrator’s vehicle type = Passenger car 47
Violation of law = Signal violation 46
Victim’s vehicle type = Two-wheeler 46

Comparing Table 7 with Tables 3–5 shows that the victim’s vehicle type = pedestrian,
violation of law = signal violation, and victim’s vehicle type = two-wheeler are considered
important variables in judging serious accidents, while victim’s vehicle type = passenger
car and perpetrator’s vehicle type = passenger car are considered important variables in
determining slight accidents. In particular, the victim’s vehicle type = pedestrian was
chosen as the most important variable in determining the severity of traffic accidents.
In logistic regression, we need to choose between L1 and L2 regularization and optimize
the C value, which controls the strength of the regularization and thus the degree of fitting.
Table 8 shows the results of the hyper-parameter optimization. Based on the experimental
results, C was set to 1 with L1 regularization. As with XGBoost, the accuracy of logistic
regression varies little across hyper-parameter settings, which again supports the clear
trend in the dataset.
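
For reference, the role of C can be written out explicitly. The paper does not state its exact objective, so the form below is an assumption: it is the common formulation in which C multiplies the data-fit term, with labels y_i in {-1, +1}, feature vectors x_i, weights w, and intercept b.

```latex
% Assumed formulation (not given in the paper): C weights the logistic loss,
% so a larger C means weaker regularization.
\begin{align}
\text{L1:}\quad & \min_{w,\,b}\; \lVert w \rVert_1
  + C \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i (w^{\top} x_i + b)}\bigr) \\
\text{L2:}\quad & \min_{w,\,b}\; \tfrac{1}{2}\, w^{\top} w
  + C \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i (w^{\top} x_i + b)}\bigr)
\end{align}
```

Under this form, a very small C lets the penalty dominate and shrinks the coefficients toward zero, while a large C approaches the unregularized fit. The L1 penalty additionally tends to set some coefficients exactly to zero.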

Table 8. Effects of hyper-parameters of Logistic Regression.

Regularization   C       RMSLE    RMSE     MAE       Training Set Accuracy (%)   Test Set Accuracy (%)
L1               0.001   0.3955   0.5706   73.1889   67.43                       67.44
L1               0.01    0.3904   0.5631   63.7788   68.23                       68.30
L1               0.1     0.3894   0.5618   62.4428   68.35                       68.44
L1               1       0.3896   0.5618   62.3266   68.37                       68.44
L1               10      0.3896   0.5618   62.3238   68.36                       68.44
L2               0.001   0.3909   0.5639   65.8057   68.09                       68.20
L2               0.01    0.3896   0.5621   62.7888   68.34                       68.40
L2               0.1     0.3896   0.5619   62.3579   68.36                       68.43
L2               1       0.3896   0.5618   62.3266   68.36                       68.44
L2               10      0.3896   0.5618   62.3238   68.36                       68.44

Table 9 shows the top 10 regression coefficients. A higher value means a stronger
association with serious traffic accidents. The results show that speeding by the perpetrator
has the largest impact, and that accidents are more likely to be serious when the victim is
on a motorized bicycle, on foot, or on a bicycle, or when the victim is elderly. Additionally,
a two-wheeler as the perpetrator’s vehicle type has a high impact on the severity of traffic
accidents, supporting the claim that motorcyclists are more likely to be seriously injured
in a traffic crash than people in passenger cars. In fact, the death rate for two-wheelers
increased steadily in Seoul from 2010 to 2018 owing to the growing number of single-person
households and the demand for delivery services, although the overall rate of death by
traffic accidents slowly decreased during the same period.

Table 9. Ten Variables leading to serious traffic accidents.

Variables Regression Coefficient Values


Violation of law = Speeding 1.8490
Perpetrator’s vehicle type = Two-wheeler 0.6885
Victim’s vehicle type = Motorized bicycle 0.6717
Victim’s age = Elderly 0.5820
Victim’s vehicle type = Pedestrian 0.5254
Victim’s vehicle type = Bicycle 0.5200
Violation of law = Centerline violation 0.4805
Road type = Crossing 0.4100
Violation of law = Signal violation 0.4005
Perpetrator’s vehicle type = Heavy equipment 0.3730
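
One way to read the magnitudes in Table 9 is to convert each coefficient to an odds ratio by exponentiating it. The sketch below assumes the positive class of the model is "serious accident" and that each variable is a 0/1 indicator produced by one-hot encoding. It uses the first three rows of Table 9 and, for contrast, the most negative coefficient from Table 10.

```python
import math

# Coefficients copied from Table 9 (first three rows) and Table 10 (first row).
coefficients = {
    "Violation of law = Speeding": 1.8490,
    "Perpetrator's vehicle type = Two-wheeler": 0.6885,
    "Victim's vehicle type = Motorized bicycle": 0.6717,
    "Accident type = Backup collision": -1.0902,
}

# exp(coefficient) is the factor by which the odds of the positive class
# (assumed here to be "serious") change when the indicator switches from 0 to 1,
# with all other variables held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
```

Under these assumptions, the speeding coefficient of 1.8490 corresponds to odds of a serious accident about 6.35 times higher, whereas the backup-collision coefficient of −1.0902 corresponds to odds roughly one third as large.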

It is worth noting that violation of law = signal violation, victim’s vehicle type =
pedestrian, and victim’s vehicle type = two-wheeler are identified by both logistic regression
and XGBoost as critical variables affecting the severity of traffic accidents. This
demonstrates the need for countermeasures against perpetrators’ signal violations and
against accidents involving two-wheeled vehicles and pedestrians.
Table 10 shows the bottom 10 regression coefficients. A lower value means a higher
influence on slight traffic accidents. The results show that the backup collision is most
closely related to slight accidents, followed by victim’s vehicle type = passenger car and
accident type = passing on the edge of the road. In general, the rate of slight accidents
appears to be high because the vehicle speed is not high when backing up or passing
along the edge of the road. In addition, in Table 4, when the victim’s vehicle type is a
passenger car, many slight accidents occur, and a similar trend appears in the regression
analysis result.

Table 10. Ten variables leading to slight traffic accidents.

Variables Regression Coefficient Values


Accident type = Backup collision −1.0902
Victim’s vehicle type = Passenger car −0.4670
Accident type = Passing on the edge of the road −0.4471
Victim’s age = Underage −0.3909
Road type = Crosswalk at an intersection −0.3828
Victim’s vehicle type = Lorry −0.2660
Perpetrator’s vehicle type = Passenger car −0.2199
Road type = Near an intersection −0.2105
Victim’s age = Youth −0.2057
Victim’s gender = Male −0.1959

5.2. Cluster Analysis through DBSCAN


In DBSCAN, there are two input parameters: (i) eps and (ii) MinPts. We determined
these hyper-parameters following the heuristic suggested in the literature [26,27]. MinPts
is set to approximately twice the dimensionality, thus 30 for a 14-dimensional space. Then,
the value of eps is estimated by plotting the distance to the (MinPts−1)th nearest neighbor
for each sampled point, sorted in descending order, and reading off the distance at the
“elbow” of the curve. As a result of DBSCAN, we derived three major clusters corresponding
to 97.8% of the entire dataset.
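
The eps heuristic above can be sketched as follows. This is an illustrative standard-library version on synthetic 2-D points, not the authors' code: the function name `k_distance_curve`, the toy sample, and the small MinPts value are assumptions, and the elbow is meant to be read off a plot of the resulting curve rather than computed automatically.

```python
import math
import random

def k_distance_curve(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    curve = []
    for i, p in enumerate(points):
        # Distances to every other point (brute force, O(n^2); fine for a sample).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        curve.append(dists[k - 1])
    return sorted(curve, reverse=True)

# Synthetic 2-D sample: one dense blob plus a few scattered points.
rng = random.Random(0)
blob = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(60)]
scattered = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(5)]
sample = blob + scattered

min_pts = 4  # illustrative value; the paper uses 30 for its 14-dimensional data
curve = k_distance_curve(sample, k=min_pts - 1)

# Isolated points tend to appear at the head of the curve; the long flat tail
# reflects the dense region. eps is chosen near the bend between the two.
print("largest k-distances:", [round(d, 2) for d in curve[:5]])
print("median k-distance:", round(curve[len(curve) // 2], 2))
```

Points whose (MinPts−1)th-neighbor distance lies above the chosen eps are the ones DBSCAN will tend to treat as noise, which is why the elbow is a natural cut-off.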
Table 11 shows the mode of each variable in each cluster after clustering. Given the
high proportions of accident type = side collision and road type = at an intersection,
cluster 1 appears to represent side collisions within intersections. Cluster 2 appears to
represent rear-end collisions, given the high proportion of accident type = rear-end
collision. Cluster 3 is considered to represent pedestrian accidents, given that the
proportion of victim’s vehicle type = pedestrian is 100%. In cluster 3, violation of
law = non-compliance with safe-driving obligation accounts for 92.1%, which is much higher
than in the other two clusters. It can be inferred that non-compliance with the safe-driving
obligation leads to many pedestrian accidents. Overall, the clustering results show that
environmental factors do not significantly influence traffic accidents, since their
distributions are nearly the same across the clusters. This supports the previous results,
in which environmental factors such as weather do not significantly impact the occurrence
of traffic accidents in Seoul. It is also interesting to note that in all clusters the
victim is most often male.

Table 11. Characteristics of three major clusters.

Variables                    Cluster 1                    Cluster 2                    Cluster 3
Season                       Spring (26.5%)               Spring (26.9%)               Summer (27.3%)
Time                         Day (47.5%)                  Day (52.0%)                  Day (56.6%)
Day of the week              Friday (16.1%)               Saturday (17.2%)             Friday (16.7%)
Accident type                Side collision (57.6%)       Rear-end collision (46.3%)   Other (53.0%)
Violation of law             Non-compliance with          Non-compliance with          Non-compliance with
                             safe-driving obligation      safe-driving obligation      safe-driving obligation
                             (45.3%)                      (59.8%)                      (92.1%)
Road surface                 Dry (98.3%)                  Dry (96.2%)                  Dry (99.4%)
Weather                      Sunny (97.6%)                Sunny (95.7%)                Sunny (98.4%)
Road type                    At an intersection (61.1%)   Single road (95.5%)          Single road (95.0%)
Perpetrator’s vehicle type   Passenger car (87.9%)        Passenger car (86.5%)        Passenger car (77.2%)
Perpetrator’s gender         Male (90.4%)                 Male (89.9%)                 Male (84.9%)
Perpetrator’s age            Middle-aged (43.7%)          Middle-aged (40.0%)          Middle-aged (40.7%)
Victim’s vehicle type        Passenger car (80.3%)        Passenger car (75.2%)        Pedestrian (100%)
Victim’s gender              Male (89.1%)                 Male (85.9%)                 Male (54.2%)
Victim’s age                 Middle-aged (38.9%)          Old-aged (36.8%)             Middle-aged (25.1%)
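
A summary like Table 11 can be produced by taking, for every cluster, the most frequent value of each variable and its share of the cluster. The sketch below is a minimal illustration with toy data: the function name `cluster_modes`, the records, and the use of -1 as the noise label (a common DBSCAN convention) are assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict

def cluster_modes(records, labels, variables):
    """For each cluster, return the most common value of each variable and its share (%)."""
    by_cluster = defaultdict(list)
    for record, label in zip(records, labels):
        if label != -1:  # -1 marks noise points (assumed convention)
            by_cluster[label].append(record)

    summary = {}
    for label, members in by_cluster.items():
        summary[label] = {}
        for var in variables:
            value, count = Counter(m[var] for m in members).most_common(1)[0]
            summary[label][var] = (value, round(100 * count / len(members), 1))
    return summary

# Toy data; the values are illustrative and do not come from the Seoul dataset.
records = [
    {"accident_type": "Side collision", "victim_vehicle": "Passenger car"},
    {"accident_type": "Side collision", "victim_vehicle": "Passenger car"},
    {"accident_type": "Rear-end collision", "victim_vehicle": "Two-wheeler"},
    {"accident_type": "Other", "victim_vehicle": "Pedestrian"},
    {"accident_type": "Other", "victim_vehicle": "Pedestrian"},
    {"accident_type": "Side collision", "victim_vehicle": "Pedestrian"},
    {"accident_type": "Other", "victim_vehicle": "Bicycle"},
]
labels = [0, 0, 0, 1, 1, 1, -1]

summary = cluster_modes(records, labels, ["accident_type", "victim_vehicle"])
for label in sorted(summary):
    print("cluster", label, summary[label])
```

A variable whose mode and share are nearly identical in every cluster, as with weather and road surface in Table 11, does not help distinguish the clusters from one another.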

6. Conclusions and Discussion


In this study, we used a traffic accident dataset covering accidents in Seoul from 2010
to 2018 to identify the major factors and types that affect the severity of traffic
accidents. To obtain a good classification, infrequent or skewed data were removed or
re-grouped during preprocessing, and the data were then analyzed using XGBoost, logistic
regression, and DBSCAN, which are representative methodologies widely used in the field.
In the XGBoost results, the cases where the perpetrator violated a signal or the victim was
riding a two-wheeled vehicle were found to be important variables in judging a serious
traffic accident. In addition, the cases where the victim’s or the perpetrator’s vehicle
type was a passenger car had a significant influence in judging a slight accident. In
logistic regression, the top and bottom 10 variables were analyzed according to their
regression coefficients to identify factors affecting the severity of traffic accidents.
As a result, the factors most strongly associated with serious traffic accidents were, in
order, speeding by the perpetrator, a two-wheeler or motorized bicycle, an elderly victim,
a pedestrian victim, and a bicycle. In contrast, environmental factors did not significantly
affect traffic accidents. The clustering analysis derived three major clusters, representing
side collisions within intersections, rear-end collisions, and pedestrian accidents.
Considering the three methodologies as a whole, environmental factors such as season, day
of the week, and weather were found to have no significant effect on the severity of
traffic accidents. On the other hand, it is worth noting that pedestrian-related variables
appear in common across all three approaches, which suggests establishing more preventive
measures against pedestrian accidents in order to reduce fatal injuries from traffic
accidents in Seoul.
In practice, actual traffic accidents are caused by a combination of factors that are more
specific and diverse than the variables in the dataset used in this study. For example, the
effect of a variable such as speeding may vary depending on how fast the vehicle was
actually traveling. Factors such as the driver’s vision or seat belt use may also affect the
outcome. If data including more diverse information become available, a more specific
analysis will be possible; thus, an active open-data policy is needed. Notwithstanding these
limitations, our study still provides important insights into the distinctive features of
traffic conditions in Seoul that can help improve the city’s traffic safety.

Author Contributions: Conceptualization, J.K. and K.H.; methodology, J.K., K.H., I.K. and H.J.;
software, H.J.; validation, J.K., I.K. and H.J.; investigation, H.J.; resources, J.K. and H.J.; data curation,
H.J.; writing—original draft preparation, J.K. and H.J.; writing—review and editing, J.K., K.H. and
I.K.; visualization, J.K., K.H. and H.J.; supervision, J.K. and K.H.; project administration, J.K. and K.H.;
funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the research grant of the Kongju National University in
2021 and by the National Research Foundation of Korea (NRF) grant funded by the Korea government
(MSIT) (No. 2021R1A4A1031509).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available at Public Data Portal,
http://www.data.go.kr (accessed on 18 March 2021).
Acknowledgments: The authors would like to extend their thanks to the reviewers and editors for
helping to improve this paper.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Yang, J.; Bi, J.; Zhang, H.Y.; Li, F.Y.; Zhou, J.B.; Liu, B.B. Evolvement of the relationship between environmental pollution accident
and economic growth in China. China Environ. Sci. 2010, 30, 571–576.
2. Law, T.H.; Noland, R.B.; Evans, A.W. Factors associated with the relationship between motorcycle deaths and economic growth.
Accid. Anal. Prev. 2009, 41, 234–240. [CrossRef] [PubMed]
3. Kopits, E.; Cropper, M. Traffic fatalities and economic growth. Accid. Anal. Prev. 2009, 37, 169–178. [CrossRef] [PubMed]
4. World Health Organization (WHO). Global Status Report on Road Safety 2018. Available online:
https://www.who.int/publications/i/item/9789241565684 (accessed on 2 November 2020).
5. Centers for Disease Control and Prevention. WISQARS Injury CENTER. Available online: https://www.cdc.gov/injury/wisqars/
(accessed on 23 April 2021).
6. Salgado, M.S.L.; Colombage, S.M. Analysis of fatalities in road accidents. Forensic Sci. Int. 1988, 36, 91–96. [CrossRef]
7. Jacobs, G.D.; Sayer, I. Road accidents in developing countries. Accid. Anal. Prev. 1983, 15, 33–353. [CrossRef]
8. Mohamed, E.A. Predicting causes of traffic road accidents using multi-class support vector machines. Comput. Commun. 2014, 11,
441–447.
9. Shanks, N.J.; Ansari, M.; Ai-Kalai, D. Road traffic accidents in Saudi Arabia. Public Health 1994, 108, 27–34. [CrossRef]
10. Yang, B.M.; Kim, J.H. Road traffic accidents and policy interventions in Korea. Inj. Contr. Saf. Promot. 2003, 10, 89–94. [CrossRef]
[PubMed]
11. Seoul Urban Solution Agency. Seoul Transportation, Report: Safe and Convenient Seoul Transportation that Puts People First.
Available online: http://susa.or.kr/sites/default/files/resources/%5BSeoul%20Urban%20Solutions%5D%5BTransportation%5DSeoul%20Public%20Transportation%28English%29.pdf (accessed on 7 March 2021).
12. Chong, M.; Abraham, A.; Paprzycki, M. Traffic accident analysis using machine learning paradigms. Informatica 2005, 29, 89–98.
13. Feng, M.; Zheng, J.; Ren, J.; Liu, Y. Towards Big Data Analytics and Mining for UK Traffic Accident Analysis, Visualization & Prediction.
In Proceedings of the 2020 12th International Conference on Machine Learning and Computing, Shenzhen, China,
15–17 February 2020.
14. De Oña, J.; Mujalli, R.O.; Calvo, F.J. Analysis of traffic accident injury severity on Spanish rural highways using Bayesian networks.
Accid. Anal. Prev. 2011, 43, 402–411. [CrossRef] [PubMed]
15. Chen, H.; Zhao, Y.; Ma, X. Critical factors analysis of severe traffic accidents based on Bayesian network in China. J. Adv. Transp.
2020, 4, 8878265. [CrossRef]
16. Dong, B.; Ma, X.; Chen, F.; Chen, S. Investigating the differences of single-vehicle and multivehicle accident probability using
mixed logit model. J. Adv. Transp. 2018, 2018, 2702360. [CrossRef] [PubMed]
17. Champahom, T.; Jomnonkwao, S.; Chatpattananan, V.; Karoonsoontawong, A.; Ratanavaraha, V. Analysis of rear-end crash on
Thai highway: Decision tree approach. J. Adv. Transp. 2019, 2019, 2568978. [CrossRef]
18. Ahmed, L.A. Using logistic regression in determining the effective variables in traffic accidents. Appl. Math. Sci. 2017, 11,
2047–2058. [CrossRef]
19. Bhin, M.Y.; Son, S.K. Analysis of factors influencing traffic accident severity according to gender of bus drivers. J. Korean Soc.
Transp. 2018, 36, 440–451. [CrossRef]
20. Lim, Y.J.; Moon, H.J.; Kang, P.K. Analysis on factors of traffic accident on roads having width of less than 9 meters. J. Korean Inst.
Intell. Transp. Syst. 2014, 13, 96–106.
21. Kim, T.H.; Kim, E.K.; Rho, J.H. Analysis of Old Driver’s Accident Influencing Factors Considering Human Factors. J. Korean Soc.
Saf. 2009, 24, 69–77.
22. Public Data Portal. Available online: http://www.data.go.kr/ (accessed on 18 March 2021).
23. The Road Traffic Authority. Available online: http://taas.koroad.or.kr/sta/acs/exs/wordArngPopup.do (accessed on
11 January 2021).
24. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
25. Peng, C.J.; Lee, K.L.; Ingersoll, G.M. An introduction to logistic regression analysis and reporting. J. Educ. Res. 2002, 96, 3–14.
[CrossRef]
26. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise.
KDD 1996, 96, 226–231.
27. Sander, J.; Ester, M.; Kriegel, H.P.; Xu, X. Density-based clustering in spatial databases: The algorithm gdbscan and its applications.
Data Min. Knowl. Disc. 1998, 2, 169–194. [CrossRef]
