BDE Final Report

This project report focuses on the analysis and prediction of road accidents using machine learning techniques applied to a large-scale US accident dataset. The authors developed a system that integrates data preprocessing, exploratory data analysis, and predictive modeling to identify accident-prone locations and factors influencing accident severity, achieving high accuracy rates with their models. The initiative aims to provide actionable insights for improving road safety and informing policy decisions through data-driven approaches.

ANALYSIS AND PREDICTION OF

ROAD ACCIDENTS
A PROJECT REPORT

Submitted by

PRAKHAR GROVER [RA2211028010130]


HARSH N PATEL [RA2211028010127]
SPANDAN BASU CHAUDHURI [RA2211028010120]
ANMOL JAMES RAMSON [RA2211028010082]
Under the Guidance of

Dr. BANU PRIYA P


Assistant Professor, Department of Networking and Communications

in partial fulfillment of the requirements for the degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE ENGINEERING
with specialization in Cloud Computing

DEPARTMENT OF NETWORKING AND COMMUNICATIONS


COLLEGE OF ENGINEERING AND TECHNOLOGY
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR- 603 203

NOVEMBER 2024
Department of Networking and Communications
SRM Institute of Science & Technology

Degree/Course : BTech Computer Science Engineering w/s Cloud Computing

Student Names : Prakhar Grover, Harsh N Patel, Spandan Basu Chaudhuri, Anmol James Ramson

Registration Numbers : RA2211028010130, RA2211028010127, RA2211028010120, RA2211028010082

Title of Work : Analysis and Prediction of Road Accidents

We hereby certify that this assessment complies with the University's Rules and Regulations
relating to academic misconduct and plagiarism, as listed on the University website, in the
Regulations, and in the Education Committee guidelines.

We confirm that all the work contained in this assessment is our own except where
indicated, and that We have met the following conditions:

● Clearly referenced / listed all sources as appropriate
● Referenced and put in inverted commas all quoted text (from books, the web, etc.)
● Given the sources of all pictures, data, etc. that are not our own
● Not made any use of the report(s) or essay(s) of any other student(s), past or present
● Acknowledged in appropriate places any help that we have received from others (e.g.
fellow students, technicians, statisticians, external sources)
● Complied with any other plagiarism criteria specified in the Course handbook /
University website

We understand that any false claim for this work will be penalized in accordance with the
University policies and regulations.

DECLARATION:
We are aware of and understand the University's policy on academic misconduct and plagiarism, and we
certify that this assessment is our own work, except where indicated by referencing, and that we have
followed the good academic practices noted above.
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203

BONAFIDE CERTIFICATE

Certified that the 21CSC314P – Big Data Essentials mini-project report
titled "ANALYSIS AND PREDICTION OF ROAD ACCIDENTS" is
the bonafide work of Prakhar Grover [RA2211028010130], Harsh N Patel
[RA2211028010127], Spandan Basu Chaudhuri [RA2211028010120] and
Anmol James Ramson [RA2211028010082], who carried out the
mini-project work under my supervision. Certified further that, to the best
of my knowledge, the work reported herein does not form part of any other
project report or dissertation on the basis of which a degree or award was
conferred on an earlier occasion on this or any other candidate.

Panel Reviewer I: Dr. Banu Priya P, Assistant Professor, Department of Networking and Communications

Panel Reviewer II: Dr. Gouthaman P, Assistant Professor, Department of Networking and Communications

ACKNOWLEDGEMENT

We would like to sincerely thank our faculty guide, Dr. Banu Priya P, from the Department
of Networking and Communications at SRM Institute of Science and Technology, Chennai,
for her excellent direction, consistent support, and helpful criticism throughout this project. Her
knowledge of machine learning and big data analytics gave us a strong foundation for
handling this challenging topic effectively.

We also appreciate the assistance and encouragement of the School of
Computing's faculty members. Their commitment to creating a cooperative learning atmosphere greatly
enriched our educational experience. Our classmates especially deserve our thanks for their
friendship and for sharing information and resources, which proved very helpful along our
educational path.

Many web tools and sites helped this effort tremendously. We thank internet
communities such as Stack Overflow and Towards Big Data for their insightful analysis and
creative answers to technical problems. We also thank the suppliers of the US Accidents dataset on
Kaggle, which was the pillar of our study. The availability of such large-scale databases
underlines the importance of open data and community involvement in advancing research. We
would also like to thank the developers and maintainers of the open-source tools that made this
effort possible: data processing, model construction, and analysis all relied on tools such as Apache
Spark and TensorFlow, and their extensive documentation and active user
communities were very helpful throughout. We also thank the developers of the Python Scikit-learn and
Pandas libraries for their efforts to offer essential tools for data handling, analysis, and model
assessment. Moreover, effective communication of our results depended greatly on Tableau's
simple interface and strong visualization tools. We deeply value our families' consistent
support and encouragement throughout this journey.

Their belief in our abilities gave us the drive and fortitude required to push past
obstacles and reach our study objectives. Finally, we recognize the crucial need to tackle
the worldwide road safety problem. This initiative aims to contribute to a better understanding of the
factors causing road accidents and thereby guide the creation of sensible plans to improve
road safety and save lives. We appreciate the chance to contribute to this essential field of study.

ABSTRACT

Road safety remains a major worldwide issue that calls for creative solutions to reduce
the severe financial and human losses caused by traffic accidents. This work uses a large-scale
historical accident dataset to build an intelligent accident analysis and prediction system that
addresses this challenge. For forecasting accident severity, we create and test
multiple machine learning models on a one-month sample of a publicly accessible US
accident dataset, including Random Forest, XGBoost, and a Multilayer Perceptron (MLP) built
with TensorFlow and Keras. Using Apache Spark, the system integrates a strong data
preprocessing pipeline to manage the volume and complexity of the dataset effectively,
resolving issues such as missing values and data inconsistencies. Most importantly, our
method takes into account many contributing elements, such as exact location data, road
conditions, temporal patterns (hour of day, day of week, etc.), and specific weather conditions,
allowing a more complete understanding of accident dynamics. With a training
accuracy of almost 95.67% and a testing accuracy of about 95.34%, the MLP model's
performance emphasizes the promise of machine learning for accident severity prediction.
Beyond model construction, this study uses visualizations and statistical techniques to perform
comprehensive exploratory data analysis (EDA) to identify accident-prone locations,
investigate the impact of meteorological variables on accident severity, and expose underlying
accident trends. These insights, along with the predictive power of the machine learning
models, provide valuable input for creating focused road safety campaigns, enhancing
infrastructure design, and ultimately lowering the frequency and severity of future collisions. This
study emphasizes the value of merging modern machine learning methods with big data
technology to solve challenging real-world problems in road safety.
TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES

1 INTRODUCTION
  1.1 Problem Statement
  1.2 Motivation
  1.3 Objectives
  1.4 Ideas
  1.5 Solution Proposed
  1.6 Challenges

2 LITERATURE REVIEW

3 PROPOSED METHODOLOGY
  3.1 Data Collection
  3.2 Data Preprocessing
  3.3 Feature Engineering
  3.4 Model Selection
  3.5 Evaluation Metrics
  3.6 Visualization Techniques

4 IMPLEMENTATION
  4.1 Data Acquisition and Loading
  4.2 Data Cleaning and Preprocessing using Apache Spark
  4.3 Feature Engineering
  4.4 Multilayer Perceptron (MLP)
  4.5 Gradient Boosting (XGBoost)
  4.6 Random Forest

5 RESULTS AND DISCUSSIONS
  5.1 EDA (Exploratory Data Analysis) Results
  5.2 Results of Predictive Models
  5.3 Visualization

6 CONCLUSION

REFERENCES
LIST OF FIGURES

Fig 3.1 Snippet from the Dataset
Fig 4.1 Complete workflow of the project
Fig 4.2 Describing all the fields in the dataset
Fig 4.3 Missing percentage of the fields in the dataset
Fig 4.4 Accident severity by weather category
Fig 4.5 Layers in a multilayer perceptron (MLP) model
Fig 4.6 Loss graph after 50 epochs of the MLP model
Fig 5.1 Hourly distribution of accidents throughout the day
Fig 5.2 Distribution of accident data by severity
Fig 5.3 Distribution of weather features
Fig 5.4 An interactive heatmap with zoom-in/zoom-out feature
Fig 5.5 Correlation matrix of the important numeric features
Fig 5.6 Feature importance
Fig 5.7 One of the confusion matrices for the MLP model
Fig 5.8 Final loss function after 100 epochs of the MLP model
Fig 5.9 Classification Report for Random Forest
Fig 5.10 Classification Report for Logistic Regression
Fig 5.11 Classification Report for SVM
Fig 5.12 Classification Report for XGBoost
Fig 5.13 Classification Report for Decision Tree
Fig 5.14 Tableau Dashboard 1
Fig 5.15 Tableau Dashboard 2


LIST OF TABLES

Table 5.1 Classification Report for XGBoost
Table 5.2 Classification Report for Random Forest
Table 5.3 Classification Report for Logistic Regression
Table 5.4 Classification Report for SVM
Table 5.5 Classification Report for Decision Tree


CHAPTER 1

INTRODUCTION

Road accidents are a persistent worldwide problem, causing a great death toll, disabling
injuries, and major economic consequences. Using a data-driven strategy to address the many
factors behind these tragic occurrences, this study explores the important problem of traffic
accident analysis and prediction. Conventional approaches often rely on reactive
responses, examining accidents only after they occur. Our initiative, however, stresses a
proactive strategy, using data analytics and machine learning to forecast and prevent accidents.
Achieving significant gains in road safety depends on this shift toward proactive
approaches. The ever-growing availability of large-scale accident datasets, together with
developments in big data technologies such as Apache Spark and in machine learning,
especially deep learning with TensorFlow, creates unprecedented opportunities to build
intelligent systems capable of spotting hidden patterns, trends, and high-risk factors.
Understanding the complex interactions among contributing factors, such as weather
conditions, road infrastructure features, temporal trends, and driver behavior, allows us to
approach road safety in a more foresighted and preventative way. This project aims to construct
such a system, thereby helping to lower the accident count and provide a safer driving
environment for everybody.

1.1. Problem Statement

Currently used techniques of road accident analysis suffer from several basic
limitations that hinder the development of effective preventive strategies. One is a piecemeal
understanding of weather's effects: while it is clear that road accidents are strongly influenced
by weather, the exact effect of different weather conditions (for example, heavy rain versus
moderate drizzle, or high winds versus calm conditions) and how they interact with other factors
is not fully understood. This research aims to close this gap through a thorough analysis of the
impact of various meteorological elements on the frequency and severity of accidents.

The analyses presently in use also often fail to capture the temporal
dynamics of accidents. Targeted interventions depend on identifying peak accident
hours, days of the week, and seasonal variations, and on correlating these temporal patterns
with possible contributory variables such as traffic volume, driver behavior, or event-specific
impacts. More finely grained spatial analysis is also needed to identify accident
hotspots and their relationship with particular road infrastructure characteristics (such as
intersection design, road curvature, and lighting conditions) and environmental factors (such
as terrain, visibility obstructions, and proximity to high-risk locations). This research will
employ geospatial analysis tools to reach this level of detail.

Finally, the capability to predict the likelihood of accidents, and the degree of damage
resulting from them, based on real-time and historical data is still largely lacking. This is a
serious restriction on current forecasting capabilities. Strong prediction models that incorporate
many factors would be very helpful in guiding proactive resource allocation, customized safety
recommendations, and focused interventions.

1.2. Motivation

This initiative is motivated by the urgent need to address the worldwide road safety issue
through a proactive, data-driven strategy. Using an intelligent analytics system, our goal is to
provide practical insights to:

1. Identify high-risk areas and infrastructure flaws to guide focused
improvements in road design and safety features.

2. Support real-time traffic management techniques and focused traffic calming measures
to optimize traffic flow during periods of peak accident occurrence.

3. Present data-driven evidence to assist legislators in developing sensible
road safety rules and policies.

1.3. Objectives

This initiative aims to achieve the following objectives:

1. Using Apache Spark, create a strong and efficient data processing pipeline to clean,
preprocess, and prepare the large US Accidents dataset for study.
2. Examine the data on accident incidence and severity through comprehensive
exploratory data analysis (EDA) to find latent trends, correlations, and patterns.
3. Using advanced geospatial analysis tools, pinpoint accident hotspots and their link
with road infrastructure and environmental conditions.
4. Using TensorFlow, develop and assess several machine learning models, including a
Multilayer Perceptron (MLP), for accident severity prediction based on a
combination of relevant variables.
5. Build interactive visuals to present the project's results effectively to a broad
spectrum of stakeholders, including traffic planners, legislators, and the general public.
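As a concrete illustration of how the first and fourth objectives fit together, the whole pipeline can be sketched in a few lines. This is a minimal sketch under stated assumptions only: it uses synthetic stand-in features and severity labels instead of the actual US Accidents dataset, and scikit-learn's MLPClassifier in place of the project's TensorFlow/Keras model.

```python
# Minimal sketch of a severity-prediction pipeline. Assumptions: synthetic
# stand-in data (not the US Accidents dataset) and scikit-learn's
# MLPClassifier rather than the TensorFlow/Keras MLP used in the project.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
# Hypothetical numeric features: e.g. temperature, humidity, visibility, hour
X = rng.normal(size=(1000, 4))
# Hypothetical severity labels 1-4, skewed toward less severe accidents
y = rng.choice([1, 2, 3, 4], size=1000, p=[0.1, 0.6, 0.2, 0.1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features, then fit a small two-hidden-layer MLP
scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300,
                    random_state=42)
clf.fit(scaler.transform(X_train), y_train)
acc = accuracy_score(y_test, clf.predict(scaler.transform(X_test)))
print(f"test accuracy: {acc:.2f}")
```

On the real dataset, the scaling and splitting steps would be performed on the Spark-preprocessed feature table, and the classifier would be replaced by the TensorFlow model described in Chapter 4.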

1.4. Ideas
The design of this approach is guided by several main ideas. First, big data technologies
such as PySpark enable efficient management and analysis of the massive volume of accident
data. Second, additional data sources could be incorporated to enrich the analysis, such as
traffic flow data, road network data, and demographic data. Third, exploring many machine
learning techniques and architectures, including deep learning models, may improve
predictive accuracy. Finally, interactive dashboards and visualizations give many stakeholders
quick access to actionable information.

1.5. Solution Proposed

The proposed solution involves an intelligent analysis system that seamlessly
integrates several crucial components: data processing, exploratory data analysis, geospatial
analysis, predictive modeling, and interactive visualization. This integrated system will ingest
and process the US Accidents dataset, extract meaningful features, train and evaluate a range
of machine learning models, and present the results through clear and compelling
visualizations. This comprehensive approach seeks to provide a holistic understanding of the
factors contributing to road accidents, empowering stakeholders with data-driven insights to
make informed decisions and improve road safety outcomes.

1.6. Challenges
This initiative expects to run into some important obstacles:

1. Data Volume and Complexity: The sheer volume and complex character of the US
Accidents dataset pose major challenges for effective data processing, cleansing,
transformation, and analytic preparation.

2. Class Imbalance: Predictive models may be biased by the natural imbalance in
accident severity distributions, in which less severe accidents are considerably more frequent.
Developing strong and consistent forecasts depends on addressing this class
imbalance.

3. Integration of Data Sources: Data management and fusion become more difficult when
many data sources (traffic flow, road networks, demographics) are combined.
Ensuring data compatibility and devising sensible plans for merging
many data kinds will be essential.

4. Communicating Complex Results: Presenting complex analytical findings effectively
to a varied audience, from technical professionals to legislators and the general
public, is a communication challenge. Maximizing the impact and
practical application of the project depends largely on creating user-friendly visuals and
tailoring the presentation of findings to diverse audiences.

5. Model Selection and Tuning: Choosing the most appropriate machine learning models
and fine-tuning their hyperparameters for maximum accident severity prediction
performance will require thorough investigation and assessment.

6. Practical Relevance: Further development and cooperation with relevant parties will
help translate the study results into useful, real-world solutions and guarantee their
effective implementation in actual traffic management or driver assistance systems.
Real-world impact depends on overcoming potential deployment obstacles such as
real-time data availability and interoperability with current systems.
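To make the class-imbalance challenge concrete, one common remedy is to weight each severity class inversely to its frequency, so that rare severe accidents count more during training. The snippet below is a sketch with hypothetical label counts, not the project's actual resampling strategy; scikit-learn's compute_class_weight is one readily available implementation.

```python
# Hedged sketch: balanced class weights for imbalanced severity labels.
# The label distribution below is hypothetical, chosen only to mimic the
# skew described above (severity 2 dominates, severity 4 is rare).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([1] * 100 + [2] * 700 + [3] * 150 + [4] * 50)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=y)
# weight_c = n_samples / (n_classes * count_c): rare classes get larger
# weights, so their misclassification is penalized more during training.
class_weight = dict(zip(classes, weights))
print(class_weight)
```

The resulting mapping can then be passed to a model that accepts a class_weight parameter, or converted into per-sample weights for a Keras fit() call; oversampling techniques such as SMOTE are an alternative approach.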

4
CHAPTER 2

LITERATURE REVIEW

Recent studies on traffic accident prediction utilizing structured methods such as
the Analytic Hierarchy Process (AHP) have shown how systematic rankings of important
variables can help increase the accuracy of accident predictions and support efficient
decision-making. Renowned for its capacity to organize difficult decision-making problems,
the AHP approach [4] was used to rank the many elements affecting accident severity,
especially those connected to climatic conditions. By meticulously ordering these
components, researchers sought to let stakeholders concentrate on the most important factors
for intervention and preventive strategies. This research highlighted the benefits of AHP in
determining high-impact variables, but it also noted the difficulties of applying AHP in this
complicated setting. These challenges mostly resulted from the growing need for current,
comprehensive data to create meaningful forecasts in real-time situations. Beyond AHP,
the use of machine learning methods has grown more important, yielding a hybrid
approach that combines the predictive power of machine learning models with the systematic
factor ranking of AHP. This combined strategy can address several inherent constraints of
AHP, thereby strengthening the accuracy of accident prediction systems.

Concurrently, another study explored marine catastrophe risk prediction using
machine learning methods, with an eye on how meteorological data may increase forecast
accuracy for maritime catastrophes [5]. Unique dangers in the marine industry include
shifting sea levels, strong winds, and limited visibility, all of which may seriously endanger
ships and crews. This research examined meteorological variables such as sea level pressure,
wind speed, and visibility to show how these elements may be used to more precisely
estimate maritime disaster risks. Building a strong model depends on access to thorough
historical maritime and meteorological data, which helps researchers find trends and
connections that conventional models may miss. Large-scale meteorological data, however,
brought unique challenges, with high computing demands for data integration and for
processing vast volumes of information in real time. This work highlighted the complex
nature of marine disaster prediction and showed that both meteorological expertise and
advanced data-handling skills are required to create models that operate effectively under the
varied and frequently severe conditions at sea.

Regarding road accident prediction, a research article published in Accident Analysis and
Prevention [6] examined how climatic conditions affect accident probability on various
kinds of roadways. The results underlined how strongly the probability of accidents is influenced
by weather elements such as visibility, temperature, and precipitation. For example,
slippery conditions, heavy rain, and foggy weather were shown to increase
accident risk because of reduced vehicle control and visibility. This work provided important new
perspectives on improving road safety by integrating meteorological data into risk
prediction systems. The research also underlined the difficulty of obtaining comprehensive
meteorological data, which varies by region and season, making it
hard to guarantee consistent accuracy in projections across many road types.
Including such complex data in predictive models meant striking a balance between
accuracy and practicality; the researchers underlined the need for ongoing data
collection and model refinement to improve safety outcomes. This work shows how data
complexity increases in step with model sophistication, as good prediction accuracy depends
on strong data inputs that reflect the subtleties of weather-road interactions.

Artificial neural networks (ANNs) have also been very important in enhancing accident
prediction. One study used an ANN to forecast traffic accidents based on real data including
factors such as traffic lights, road conditions, and weather patterns [7].
The model demonstrated how well ANNs can manage intricate, non-linear interactions
between accident probability and influencing variables. ANNs can detect subtle interactions
within the data, enhancing the accuracy of forecasts in mixed traffic situations
compared to traditional techniques. ANN models, however, depend on large-scale data
to capture the many situations that might lead to accidents, and this strength comes with the
disadvantages of great computational complexity and the need for enormous training datasets.
Furthermore, these models need substantial processing time and computing capacity,
which can restrict their relevance in resource-limited or real-time environments.
Notwithstanding these constraints, the ANN's performance in this work showed its ability to
enable proactive traffic management techniques by identifying high-risk conditions
before accidents occur.

Convolutional neural networks (CNNs), usually employed in image processing, have
recently been adapted for traffic accident prediction. CNN models were used in a study
reported at the Fifth International Conference on Intelligent Computing and Control
Systems (ICICCS) [8] to forecast accident events and assess risk variables connected to
certain scenarios. In both prediction accuracy and loss reduction, these
CNN-based models beat conventional Back Propagation (BP) techniques, thereby
demonstrating CNNs' strength in learning complicated patterns from big datasets. CNN
models do, however, present several difficulties, mostly related to the large data
volumes needed to guarantee accuracy. Longer computation times resulting from the
complexity of these models also represent a barrier to real-time application, especially in
high-stakes situations such as traffic control, where fast forecasts are crucial. This paper
underlined the trade-off between computing needs and practical application as well as
CNNs' potential to improve traffic safety by generating more accurate forecasts.

A paper presented at the 27th ACM SIGSPATIAL International Conference on
Advances in Geographic Information Systems [9] described a notable traffic accident
dataset known as the US-Accidents dataset. The Deep Accident Prediction (DAP)
model was built on this collection of varied but sparsely occurring data points. DAP uses
many real-time data sources, including points of interest, traffic events, and weather, to
predict accidents. Particularly with respect to computational complexity, the paper
underlined the difficulties in processing such a large and sparse dataset. Models may be
degraded by sparse data, reducing forecast accuracy. Effective integration
of several data sources also depends on careful handling of missing values and preprocessing.
Notwithstanding these challenges, the DAP model showed how to anticipate traffic events
utilizing real-time data, stressing the importance of efficient data integration and
preparation in improving model accuracy and usefulness.

A study published in Natural Hazards and Earth System Sciences [10] focused on how
temperature affected traffic accidents, using a prediction model combining
high-resolution reanalysis of meteorological data with radar-based precipitation. The
research greatly improved forecast accuracy by using these meteorological data,
raising the hit rate from thirty to seventy percent. This notable increase underlined the
importance of high-quality data, especially when gathered at a district level, for more
dependably estimating traffic accident risk. Nevertheless, the inclusion of such
comprehensive meteorological data presents significant logistical difficulties, as
high-resolution data collection and processing require sophisticated infrastructure and
computing tools. This work underlined the need for fine-grained data for high forecast
accuracy, particularly for location-specific events such as precipitation rates and temperature
variations.

Using Extreme Gradient Boosting (XGBoost), Random Forest, and Decision Trees,
among other machine learning models, a Master's thesis from the University of Oulu further
investigated accident prediction, evaluating the effect of storms on accident frequency and
severity [11]. These robust and interpretable machine learning techniques let the researcher
find important elements influencing accidents during extreme weather events. The
research also highlighted the significant data needs and the difficulties caused by
imbalanced datasets, however. Catastrophic accidents in particular are quite rare
events, which can cause a class imbalance that, if not well controlled, can distort model
forecasts. Furthermore, the computational burden of compiling large volumes of
meteorological data was especially challenging in real-time applications. Notwithstanding
these difficulties, the research showed that sophisticated machine learning methods can be
used to forecast accident severity under unfavorable conditions, providing important
information for traffic control and policy creation.

A paper in the Journal of Reliable Intelligent Systems examined the integration of
IoT with machine learning for accident prediction, looking at how IoT devices may assist in
determining safe driving speeds in dangerous situations [12]. Particularly in situations
involving unfavorable weather such as fog, rain, or snow, IoT devices, which can record
real-time environmental data, offer a dynamic method of accident prevention. Using IoT
sensors and machine learning models, the researchers created a system that can adjust driving
instructions depending on present road conditions, potentially lowering accident rates.
Nevertheless, the accuracy and dependability of IoT data largely determine the
effectiveness of such a system; these factors can vary greatly depending on sensor
placement and calibration. Real-time IoT data processing presents another set of difficulties,
as low-latency responses are vital in traffic applications where quick decisions are required.
This work highlighted the possibilities and challenges of combining IoT with machine
learning in accident prediction: with proper management, accurate, real-time data could
transform road safety.

To forecast the severity of road accidents and the number of casualties involved,
a research report published in Sustainability examined machine learning methods including
Logistic Regression, Random Forest, and Decision Trees [13]. These algorithms let
researchers forecast the victim count with 84% accuracy and the degree of severity with
85.5% accuracy. The study also observed, however, that dataset
imbalances, especially in cases involving serious accidents, may lower sensitivity
and compromise the accuracy of predictions of the number of affected vehicles. In accident
prediction, class imbalance is a prevalent problem, as serious accidents are less frequent than
minor ones and produce skewed datasets that can affect model performance. To address
these imbalances and raise model robustness, the paper underlined the need for
sophisticated data preprocessing methods such as synthetic data generation or oversampling.
This study underlined both the potential of machine learning for predicting accident severity
and the need for focused data preparation, given the limits imposed by
imbalanced datasets.

In summary, the corpus of studies examining the intersection of traffic
accident prediction, machine learning, and meteorological data has expanded noticeably in
recent years. Each study adds unique insights into how various data sources and machine
learning approaches can raise prediction accuracy, thereby striving to increase public safety.
These studies also highlight, however, the inherent difficulties of merging big, sophisticated
datasets, whether due to computing needs, data sparsity, or class imbalance. Advancing
the accuracy and applicability of these prediction models in various traffic and environmental
settings will likely depend greatly on future improvements in data processing, computing
efficiency, and real-time data integration.

CHAPTER 3

PROPOSED METHODOLOGY

This chapter describes the intended approach for constructing a prediction model for
accident severity and analyzing the US Accidents dataset. The method is mostly quantitative,
using statistical analysis and machine learning methods to find trends and connections within
the data. The great volume of the data and the objective character of the research
questions, which are best suited to numerical analysis and predictive modeling, have led to
the choice of this quantitative method.

3.1. Data Collection

This project uses the publicly accessible US Accidents dataset, a comprehensive
compilation of road accident reports from across the United States, available on Kaggle.
A sample snippet of the dataset is shown in Fig. 3.1 below, which lists all the fields
available to us. The rich detail of this dataset covers a broad spectrum of information
pertinent to our study goals, including latitude and longitude, weather and environmental
features, and various road features.

Precise latitude and longitude coordinates of accident sites make geospatial analysis
possible, while timestamps marking the start and end of every occurrence support temporal
analysis. A critical field is severity, which rates each accident from 1 (minor) to 4
(severe); our predictive modeling uses this as the target variable. Detailed weather
conditions at the time of the accident, including temperature, humidity, visibility,
precipitation, wind speed, and wind direction, are also provided. This abundance of
detailed weather data makes it possible to investigate closely how weather affects
accident frequency and severity.

Data about the road environment, such as street name, city, county, state, zip code,
traffic signals, crossings, junctions, and other points of interest (POIs), allows road
characteristics to be determined. These fields permit investigation of the link between
road layout and accident trends. Although the initial idea was to use the whole dataset to
increase the breadth and generalizability of the study, it became clear that major data
cleansing and preparation would be required. Several columns in the dataset, including
variables relating to weather, location, and road conditions, contain a large number of
missing values, and some fields also show inconsistent data types and formats. These
problems demand rigorous data preparation to guarantee data quality and reliability for
the subsequent analysis and modeling.

3.1.1. Sample from the dataset:

The following are all the fields in our dataset, shown to give a better understanding of
the data used in the project.

Fig. 3.1: Snippet from the Dataset

3.2. Data Preprocessing

The proposed data preparation plan addresses the volume and complexity of the dataset in a
multi-stage approach using both PySpark and Pandas. PySpark is selected for the initial
data exploration and cleaning because its distributed computing features allow the US
Accidents dataset to be handled efficiently at scale. The initial data exploration in
PySpark covers:

Examining the structure and data type of every attribute in the dataset to understand it,
and assessing data quality by finding and counting missing values, inconsistent data
formats, and possible outliers in every column. Missing values are then handled in PySpark
with techniques such as:

1. Imputation, which fills in missing numerical values with statistically derived values
such as the mean, median, or mode, calculated using PySpark's distributed computing
capabilities.
2. Deletion of rows or columns with too much missing data when imputation is judged
inappropriate. The deletion criterion is a threshold on the fraction of missing values,
balancing data completeness against information loss.
3. Duplicate removal, using PySpark's built-in tools to find and delete duplicate rows,
thus guaranteeing data integrity.

After the initial cleaning and preprocessing with PySpark, a strategically chosen subset
of the data, representative of the whole dataset, is converted to a Pandas DataFrame for
more precise manipulation and analysis. This shift to Pandas, which is better suited to
local computing and smaller-scale processing, helps in several ways. First, for simpler
temporal analysis, string representations of dates and times are parsed into datetime
objects; in addition, string-based categorical variables, such as road details or weather
conditions, are converted into Pandas categorical data types (categorical feature
handling). Second, feature scaling normalizes the numerical features so that they share
similar ranges, preventing characteristics with larger magnitudes from unduly dominating
the machine learning models. The scaling techniques considered are standardization, which
centers and scales features to zero mean and unit variance, and normalization, which
scales features to a designated range, usually between 0 and 1. Finally, if any missing
values persist after the initial PySpark processing, Pandas' built-in imputation and
deletion tools handle them, enabling more exact treatment tailored to specific columns or
data characteristics. This lets missing data be handled in a context-specific manner.
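As an illustration of the scaling step described above, the sketch below standardizes and normalizes two numeric columns with scikit-learn; the sample values are placeholders, not records from the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical numeric values standing in for dataset fields.
df = pd.DataFrame({
    "Temperature(F)": [55.0, 72.5, 33.1, 90.2],
    "Wind_Speed(mph)": [5.0, 12.3, 0.0, 25.7],
})

# Standardization: zero mean, unit variance per column.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(df), columns=df.columns
)

# Normalization: rescale each column to the [0, 1] range.
normalized = pd.DataFrame(
    MinMaxScaler().fit_transform(df), columns=df.columns
)

print(standardized.round(2))
print(normalized.round(2))
```

Either transform can be fitted on the training split only and then applied to held-out data, which avoids leaking test statistics into training.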

3.3. Feature Engineering

By generating new, informative features from the existing data, feature engineering
significantly improves the predictive power of machine learning models. The following
feature engineering techniques were proposed for this project:

The raw weather data in the US Accidents dataset offers rather precise descriptors (e.g.,
"light rain," "heavy snow," "fog," "partly cloudy"). Although this detail is useful, using
the exact descriptions directly as input features can cause excessive dimensionality and
possible overfitting. We therefore proposed classifying these precise meteorological
conditions into broader, more basic groups. For instance, the various rain-related
descriptors ("light rain," "heavy rain," "drizzle," "rain showers") would be collapsed
into a single "Rain" category, while a "Snow" category would include "heavy snow," "light
snow," "sleet," and "blowing snow." This classification reduces the number of distinct
weather values, simplifies the model, and may help it generalize to unseen data. Moreover,
these broader categories may better capture how weather affects driving conditions in
general.
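A minimal sketch of this grouping step is shown below; the keyword lists are illustrative assumptions, not the project's exact mapping.

```python
def categorize_weather(condition: str) -> str:
    """Map a raw Weather_Condition string to a broad category."""
    text = condition.lower()
    # Keyword groups are illustrative; the real mapping may differ.
    if any(k in text for k in ("rain", "drizzle", "shower")):
        return "Rain"
    if any(k in text for k in ("snow", "sleet", "ice")):
        return "Snow"
    if any(k in text for k in ("fog", "mist", "haze")):
        return "Fog"
    if any(k in text for k in ("cloud", "overcast")):
        return "Cloudy"
    return "Fair"

print(categorize_weather("Light Rain Showers"))  # -> Rain
print(categorize_weather("Blowing Snow"))        # -> Snow
```

Applying such a function column-wise collapses the dozens of raw descriptors into a handful of categories suitable for encoding.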

Precise timestamps for every accident are included in the dataset. We intended to derive
several time-based features in order to capture temporal patterns and trends:

1. Hour of Day: the hour (0–23) at which the accident happened. This captures daily
fluctuations in accident frequency and severity, such as those linked to rush hour
traffic.
2. Day of Week: the day of the week on which the accident happened (0–6, with 0 denoting
Monday). This helps highlight differences between weekday and weekend accident trends.
3. Month of Year: the month (1–12) in which the accident happened. This feature may
capture seasonal fluctuations in accidents, possibly connected to vacations, weather, or
school calendars.
4. Time of Day: categorical features derived from the hour of the day, such as "Morning,"
"Afternoon," "Evening," and "Night," giving the models a broader temporal context.
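The time-based features above can be derived from a timestamp column roughly as follows; this is a sketch on a toy frame, and the hour boundaries for the time-of-day buckets are an assumption.

```python
import pandas as pd

# Small illustrative frame; the dataset's Start_Time column parses the same way.
df = pd.DataFrame({"Start_Time": ["2023-03-01 08:15:00", "2023-03-04 17:40:00"]})
df["Start_Time"] = pd.to_datetime(df["Start_Time"])

df["Hour"] = df["Start_Time"].dt.hour            # 0-23
df["DayOfWeek"] = df["Start_Time"].dt.dayofweek  # 0 = Monday
df["Month"] = df["Start_Time"].dt.month          # 1-12

# Bucket the hour into broad periods of the day (boundaries are illustrative).
df["TimeOfDay"] = pd.cut(
    df["Hour"],
    bins=[-1, 5, 11, 17, 23],
    labels=["Night", "Morning", "Afternoon", "Evening"],
)
print(df[["Hour", "DayOfWeek", "Month", "TimeOfDay"]])
```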

The US Accidents dataset has many boolean (true/false) features indicating the presence of
various road characteristics close to the accident site (e.g., "Amenity," "Bump,"
"Crossing," "Give_Way," "Junction," "Traffic_Signal"). We proposed translating these
features into numerical representations so that they can be used efficiently in machine
learning models. Binary one-hot style encoding is appropriate here: each such feature is
mapped to a binary (0 or 1) column. The "Amenity" feature, for instance, becomes a column
holding 1 when an amenity is present and 0 when it is absent. This transformation lets the
models make effective use of these categorical characteristics during prediction.
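For the boolean road features this amounts to a simple cast to integers; a sketch with a hypothetical two-row frame:

```python
import pandas as pd

# Toy rows; the real dataset has many more boolean road-feature columns.
df = pd.DataFrame({
    "Amenity": [True, False],
    "Traffic_Signal": [False, True],
})

# Boolean road features become 0/1 columns usable by any model.
bool_cols = ["Amenity", "Traffic_Signal"]
df[bool_cols] = df[bool_cols].astype(int)
print(df)
```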

3.4. Model Selection

Forecasting accident severity calls for models able to manage potentially
high-dimensional data and complicated feature interactions. The following models were
originally proposed, each selected for its own merits:

Multilayer Perceptrons (MLPs), a class of artificial neural networks, are well suited to
learning complicated non-linear relationships within data. Their capacity to capture
complex interactions between features makes them a good candidate for estimating accident
severity, which is likely influenced by many factors, and they can also handle
high-dimensional data efficiently.

Random Forest, an ensemble learning technique, is well known for its reliability and its
capacity to handle high-dimensional data without overfitting. Random Forests are also less
sensitive to noisy data and outliers, which are often present in real-world datasets such
as the US Accidents dataset. Moreover, they provide a natural means of evaluating feature
relevance, offering insight into the elements influencing predictions.

XGBoost (Extreme Gradient Boosting), another powerful ensemble approach, is often
preferred for its strong predictive performance. It is very effective at managing missing
data, which is common in the US Accidents dataset. Its regularization methods and capacity
to model non-linear relationships make XGBoost an excellent candidate for estimating
accident severity.

Support Vector Machines (SVMs) are effective in high-dimensional settings and can model
intricate decision boundaries. They are also quite memory efficient, which helps when
dealing with big data. For extremely large datasets, however, SVMs can be computationally
demanding, and they call for careful hyperparameter tuning.

Logistic regression serves as a useful baseline model. Although simpler than the other
models, it offers interpretability and a point of comparison, making the link between
features and predicted outcomes easier to understand. Comparing its performance against
the more complex models lets us evaluate whether their gain in predictive power justifies
the added complexity.

3.5. Evaluation Metrics

Since the aim of the research is to forecast accident severity, a full assessment of the
machine learning models depends on a comprehensive set of evaluation metrics. The
following metrics were selected for their capacity to provide a nuanced view of model
performance, particularly in a multi-class classification problem with potential class
imbalance:

Accuracy, calculated as the ratio of correctly classified cases to the total number of
cases, gauges the overall correctness of the model's predictions. Although accuracy is an
intuitive measure, it can be deceptive in the face of class imbalance: because the US
Accidents dataset has a skewed severity distribution (with Severity 2 being the most
common), a model that simply predicts Severity 2 for every case could achieve high
accuracy yet have little practical utility. Precision gauges the correctness of positive
predictions: for every severity level, it is the fraction of cases predicted to be of that
level that actually are. High precision suggests that when the model forecasts a given
severity level, it is probably right.

Recall, also known as sensitivity, gauges whether the model can identify every event of a
given severity level: out of all the occurrences that are truly of that level, it is the
fraction the model correctly predicted. High recall means the model captures most of the
accidents of a given severity, indicating its effectiveness. The F1-score, the harmonic
mean of precision and recall, offers a balanced evaluation of both measures. It is
especially helpful under class imbalance, as it penalizes models that achieve high
precision at the price of poor recall, or vice versa; a high F1-score indicates that the
model achieves both.

The Area Under the ROC Curve (AUC) measures the model's capacity to discriminate between
the severity levels. The ROC curve plots the true positive rate (recall) against the false
positive rate across different classification thresholds, and the AUC is the area under
this curve; a larger AUC indicates better discriminating power. For multi-class
classification we intended to compute a weighted average of the per-class AUC values.
Taken together, these metrics provide a complete evaluation of every model's performance,
allowing us to choose the best-performing model while keeping in mind the consequences of
class imbalance and the importance of correctly forecasting the more severe incidents.
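These metrics can be computed with scikit-learn roughly as follows; the labels below are dummy values purely for illustration, not the project's actual predictions.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

# Dummy multi-class severity labels (1-4), for illustration only.
y_true = [2, 2, 3, 1, 4, 2, 3, 2]
y_pred = [2, 2, 3, 2, 4, 2, 2, 2]

print("accuracy :", accuracy_score(y_true, y_pred))
# Weighted averaging accounts for the skewed severity distribution.
print("precision:", precision_score(y_true, y_pred, average="weighted", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="weighted"))
print("f1       :", f1_score(y_true, y_pred, average="weighted"))
print(confusion_matrix(y_true, y_pred))
```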

3.6. Visualization Techniques

Effective data exploration, pattern analysis, and communication of the results depend on
visualizations. The project uses several visualization approaches, each selected for its
fit to a particular data type and the insight it delivers:

Histograms would show the distribution of numerical variables such as temperature, wind
speed, and visibility, while bar charts would show the distribution of categorical data
such as accident severity, the weather categories developed during feature engineering,
day of the week, and hour of the day. These plots reveal trends such as the frequency of
various weather conditions during accidents or the peak accident hours. Heatmaps,
essential for geospatial analysis, provide a graphic picture of accident frequency across
locations; overlaying the heatmap on a map makes accident hotspots and high-risk zones
easy to find. Correlation matrices provide a graphic depiction of the linear relationships
among numerical variables: a heatmap of the correlation matrix highlights strong positive
or negative correlations between variables such as temperature, visibility, and wind
speed, clarifying how these elements interact.

Line graphs would show temporal trends, such as variations in accident frequency or
severity over time, highlighting patterns at certain times of day, days of the week, or
seasonal fluctuations. Tableau would be used to build interactive dashboards, allowing
more dynamic exploration of the data and results. These dashboards would combine several
visuals and let users filter the data by factors such as weather conditions, time of day,
or location to concentrate on particular groups of incidents. By interacting with the
visualizations, users can study individual accident records in more detail and better
grasp the elements causing accidents by comparing accident trends across time periods,
sites, or weather conditions.

CHAPTER 4

IMPLEMENTATION

Our approach and process consist of the following main phases. Faced with computing
constraints, we concentrated on a one-month sample drawn from a large-scale accident
dataset from Kaggle. Apache Spark was used for data cleansing: columns with many missing
values were dropped and others were imputed with mean values. Feature engineering involved
simplifying the meteorological conditions and generating additional features, though some
were dropped from the final model. We trained and tested models comprising a Multilayer
Perceptron (MLP), Gradient Boosting (XGBoost), and Random Forest, adjusting
hyperparameters and assessing performance, and used accuracy and feature importance to
understand accident causes. Fig. 4.1 gives the complete picture of this workflow.

Fig. 4.1: Complete workflow of the project

4.1. Data Acquisition and Loading:

We use a large-scale dataset of accidents acquired from Kaggle [14, 15], which contains a
whole spectrum of features including accident location (latitude and longitude), time of
occurrence, meteorological conditions, road attributes, and accident severity. Due to our
computing restrictions, our implementation used the one-month sample dataset from the same
reference. Fig. 4.2 describes the dataset, reporting values such as the count, mean,
standard deviation, minimum, and maximum for all the fields included in the dataset.

4.2. Data Cleaning and Preprocessing using Apache Spark:

We first handle missing values by examining each column's percentage of missing data. We
eliminate columns with a high percentage of missing data, such as "End_Lat" and
"End_Lng," since they are not vital for our study, as seen in Fig. 4.3. For columns with a
smaller percentage of missing values, we drop the rows containing them. For fundamental
numerical columns like "Wind_Chill(F)," "Precipitation(in)," and "Wind_Speed(mph)," we
impute missing values using the mean of the corresponding column.

Fig. 4.3: Missing percentage of the fields in the dataset

4.3. Feature Engineering:

We engineer fresh features from the existing ones to improve model performance. We
classify meteorological conditions into more general terms such as "Fair," "Rain,"
"Snow," etc., as shown in Fig. 4.4. This simplification increases model interpretability
and lowers the dimensionality of the meteorology-related characteristics.

Each category can also be split further by intensity. While experimenting with other
approaches we also developed additional features such as "Is_Raining," a binary feature
denoting the presence or absence of rain, and "Temperature_WindChill_Diff," the difference
between temperature and wind chill, which can suggest severe weather conditions. However,
we removed these in the final implementation.

Fig. 4.4: Accident severity by weather category

4.4. Multilayer Perceptron (MLP):

To get the data ready for an ANN model like the MLP [16], we used StringIndexer to
convert categorical features into numerical representations. Fig. 4.5 below shows how a
multilayer ANN model like the MLP works. We then combined the numerical and encoded
features into one feature vector with VectorAssembler, which is part of PySpark's MLlib.

We built and trained an MLP model for predicting accident severity using TensorFlow. We
ran experiments to find the best number of neurons and activation functions for each
layer. The trained model was tested on new data using metrics such as accuracy, precision,
recall, and F1-score, and we also examined the confusion matrix. Fig. 4.6 shows the loss
curve for the model trained over 50 epochs. Finally, we performed a feature importance
analysis to determine which features had the biggest impact on accident severity; this
helped us understand the main causes of accidents and will assist in designing targeted
interventions.

Fig. 4.5: Layers in a multilayer perceptron (MLP) model

4.5. Gradient Boosting (XGBoost):

To prepare the data for the XGBoost model, StringIndexer from PySpark's MLlib converted
the categorical characteristics into numerical representations fit for XGBoost input. We
then aggregated the numerical and encoded categorical data into a single feature vector
using VectorAssembler from PySpark's MLlib. The XGBoost model was built and trained using
the XGBoost library in Python, with hyperparameters such as the learning rate, maximum
depth, and number of estimators optimized by means of grid search or cross-validation. The
trained model was tested on unseen data using criteria such as accuracy, precision,
recall, and F1-score, along with confusion matrix analysis to evaluate its performance
across the severity levels. Moreover, XGBoost's built-in feature importance scores were
used to pinpoint the most significant elements influencing accident severity, providing an
understanding of the main causes of accidents and guiding the creation of focused safety
precautions.

4.6. Random Forest:

To prepare the data for the Random Forest model, we utilized StringIndexer from PySpark's
MLlib [17] to convert categorical features into numerical representations, and then
employed VectorAssembler [17] to combine the numerical and encoded categorical features
into a single feature vector. The Random Forest model was built and trained using the
RandomForestClassifier from PySpark's ML library [17]. Hyperparameters such as the number
of trees, the maximum depth, and the number of features to consider at each split were
optimized through techniques like grid search or cross-validation. The trained model's
performance was assessed on unseen data using metrics including accuracy, precision,
recall, and F1-score, accompanied by confusion matrix analysis to evaluate its
effectiveness across the severity levels. Feature importance analysis was conducted using
the Random Forest model's built-in feature importance scores [18], which measure the
average decrease in impurity (e.g., Gini impurity) achieved by each feature across all
trees. This analysis provided insights into the most influential features affecting
accident severity, shedding light on the primary accident causes and guiding the
development of targeted interventions.

The implementation begins by loading the data with Spark's read.csv utility. We then clean
the data, addressing missing values and outliers. We first compute every column's
missing-data fraction. Columns with a high percentage of missing data ("End_Lat,"
"End_Lng") are eliminated, since they have no influence on our research. For the remaining
columns with missing values, we use a two-pronged approach: where the missing values are
relatively rare, dropping the affected rows preserves data integrity for columns like
"Visibility(mi)" and "Weather_Condition."

For key numerical variables like "Wind_Chill(F)," "Precipitation(in)," and
"Wind_Speed(mph)," Spark's when and otherwise functions replace missing values with the
mean of the relevant column. This ensures that, while managing missing data, we retain
significant data points. We also remove outliers from the numerical features to ensure
robust model training; outliers are defined as values more than three standard deviations
from the mean of an attribute. This step helps prevent extreme values from distorting the
model.

EDA then provides feature analysis of the dataset and direction for model building. First
we extract the hour of the day from the "Start_Time" column and count accidents for every
hour; this enables us to identify periods when accidents most often happen and to
understand temporal patterns in accident occurrence. We analyze the distribution of
"Severity" levels in order to understand the relative frequency of every level; this
reveals how dispersed the classes are and flags any class balancing issues to handle. We
then create a heatmap displaying accident density at various locations using the
"Start_Lat" and "Start_Lng" fields, which clarifies the spatial distribution of accidents
and facilitates the identification of accident-prone places.

We examine the relationship between accident severity and weather-related properties
(e.g., "Temperature(F)," "Humidity(%)," "Precipitation(in)"). We create visualizations
such as count plots and box plots to understand how different weather conditions influence
accident severity. Using correlation analysis, we compute the correlation matrix between
severity and the numerical properties; traits strongly correlated with accident severity
are useful candidate predictors for our model. We then create categorical numerical
representations from the textual weather reports to use as input features for our
machine-learning systems: Python routines with custom-defined logic classify the many
weather patterns into separate groups, and ordinal categorical values are likewise derived
for precipitation intensity, visibility, wind speed, wind chill, temperature, and
humidity. This categorization reduces complexity, assists in capturing non-linear
correlations, and may enhance model interpretability.

New features are then built to enhance model performance. By grouping the
"Weather_Condition" column into more general categories (e.g., "Fair," "Cloudy," "Rain,"
"Snow"), we simplify the data and raise model interpretability; this categorization
algorithm assigns labels based on important keywords discovered in the
"Weather_Condition" descriptions. We further bin precipitation, visibility, wind speed,
wind chill, temperature, and humidity into ordinal categories (e.g., "Light," "Moderate,"
"Heavy" for precipitation), which helps the model represent these continuous variables
more faithfully. For categorical variables such as "Street," "City," and "County,"
StringIndexer converts them into numerical equivalents so they can be included in model
construction. VectorAssembler then gathers all selected characteristics into a single
feature vector called "features," which feeds our machine learning models. We experimented
with a few models; the first, more fundamental one is a TensorFlow/Keras-based Multilayer
Perceptron (MLP).

For binary classification (predicting accident likelihood), the MLP contains an input
layer (whose size is determined by the number of features), hidden layers (with specified
numbers of neurons and activation functions), and an output layer with sigmoid activation.
The build_model function defines the model architecture and parameters (number of layers,
neurons per layer, activation functions, optimizer, loss function). The model is compiled
with binary cross-entropy loss and the Adam optimizer. We then train the model on the
training data for a predetermined number of epochs with a given batch size and validation
split; recording the training history lets us monitor the model's performance as training
progresses.
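A sketch of such a build_model function on dummy data; the layer sizes and epoch count are placeholders, not the tuned architecture.

```python
import numpy as np
from tensorflow import keras

def build_model(n_features: int) -> keras.Model:
    """Small binary-classification MLP with a sigmoid output."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Dummy data purely to exercise the training loop.
X = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=100)
model = build_model(n_features=10)
history = model.fit(X, y, epochs=2, batch_size=16,
                    validation_split=0.2, verbose=0)
print(history.history["loss"])
```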

Using the PySpark-prepared data, a more intricate model was then developed that showed
improved performance over the prior one. This MLP incorporates dense layers with ReLU
activation, batch normalization, and dropout for regularization; the output layer again
uses sigmoid activation for binary classification. Model compilation uses the binary
cross-entropy loss function and the Adam optimizer, and the model is trained on scaled
training data with a defined batch size, epoch count, and validation split for measuring
performance during training. The model is then saved so that it is reproducible and can be
reused without retraining.

CHAPTER 5
RESULTS AND DISCUSSIONS

This chapter offers a thorough review of the findings from our predictive modeling and
analysis efforts. Using the US Accidents dataset, we investigated the temporal,
geographical, and meteorological factors influencing accidents. Through descriptive
statistics, visualizations, and model performance measures, we obtained important insight
into accident trends and assessed how well the predictive models determine accident
severity. The findings build an understanding of accident dynamics and draw attention to
practical ideas for traffic safety campaigns.

5.1. EDA (Exploratory Data Analysis) Results

Our exploratory data analysis (EDA) phase sought patterns and trends in the accident data
through several lenses: hourly distribution, severity distribution, meteorological
conditions, geographical data, and correlation analysis. These analyses provide a
fundamental understanding of the elements influencing accident frequency and intensity,
guiding both our interpretation of the data and our model building.

5.1.1. Hourly Distribution of Accidents

Accident start times showed two clear temporal peaks: one in the morning between 7–9 AM
and another in the evening from 4–6 PM, as seen in Fig. 5.1. These peaks line up with the
morning and evening rush hours, when traffic volume rises as people commute. Reduced
driver attention combined with heavier vehicle traffic as people head to work or school
might help explain the morning peak; the evening peak coincides with people driving home,
often tired after the day, which raises the risk of mishaps. Effective traffic management
depends on awareness of this hourly distribution [19], which helps authorities allocate
resources and carry out preventative actions such as changing traffic signal timings
during peak hours. Public awareness efforts may also concentrate on encouraging drivers to
be careful during these high-risk times, thereby lowering accident rates.

Fig. 5.1: Hourly distribution of accidents throughout the day

5.1.2. Severity Distribution:

The distribution of accident severity in the dataset exposed an obvious class imbalance,
shown in Fig. 5.2. Most accidents were classed as low severity (Severity 2), while
higher-severity incidents (Severity 3 and 4) were much less prevalent. This skewed
distribution hinders predictive modeling, as models may be biased toward forecasting the
majority class and hence achieve low accuracy on higher-severity predictions. Developing
dependable models that precisely characterize incidents of all degrees depends on
addressing this disparity; in our modeling procedure we used weighted loss functions and
oversampling of the minority classes to mitigate the problem. By improving the model's
sensitivity to severe accidents, we aim to support proactive safety measures focused on
averting high-severity events, where the stakes for injury and death are greatest.

This imbalance may still bias the model toward predicting Severity 2; however, given the
restrictions of our dataset, we had to train on the same data. With access to a larger
dataset and more computing power, the model could be trained better.
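Weighted loss functions of the kind mentioned above can be driven by per-class weights inversely proportional to class frequency; a sketch on dummy labels:

```python
from collections import Counter

# Dummy severity labels with a heavy Severity-2 majority, for illustration.
labels = [2] * 80 + [3] * 15 + [4] * 5

counts = Counter(labels)
n, k = len(labels), len(counts)
# Balanced weighting: n_samples / (n_classes * class_count).
class_weight = {c: n / (k * cnt) for c, cnt in counts.items()}
print(class_weight)
```

A dictionary like this can be passed, for example, to Keras's fit(class_weight=...) so that errors on the rare severe classes are penalized more heavily during training.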

Fig. 5.2: Distribution of accident data by severity

5.1.3. Analysis of Weather Conditions

Weather conditions are often major factors influencing accident frequency and intensity.
Our study found notable relationships between accident incidence and unfavorable weather,
as shown in Fig. 5.3. Consistent with past research, rain, fog, and snow were linked to
more accidents. Adverse conditions lower visibility, reduce road surface friction, and
shorten available reaction time, all of which increase accident risk. For example, even
mild rain may make roadways slick, causing minor collisions from sliding; on the other
hand, heavy snow or thick fog can cause more catastrophic mishaps when roads are harder to
negotiate and visibility suffers greatly.

Further analysis turned up more complex interactions between certain meteorological
variables and accident severity. Although light rain increased accident frequency, it was
usually linked to lower-severity incidents; conversely, heavy snow or thick fog, while
less common, tended to be linked with more serious accidents. These observations
emphasize the value of including thorough meteorological data in prediction models, as it
increases accuracy and enables safety advice tailored to certain seasons. Public
organizations could use such data to issue focused warnings based on real-time weather
conditions, alerting drivers to possible hazards and promoting responsible conduct during
bad weather.

Fig. 5.3: Distribution of weather features

5.1.4. Geographic Analysis

By mapping accident sites, we were able to pinpoint notable hotspots in suburban and
metropolitan regions, a snippet of which is shown in Fig. 5.4. These hotspots were often
found in heavily populated areas and on high-traffic highway segments, particularly at
major junctions. Visualizing accident data geospatially helps authorities identify
high-risk areas, offering important context for urban planning and infrastructure
development. This data can direct focused initiatives such as remodeling road segments to
lower accident risk, applying traffic calming techniques, or installing additional
signage.

Fig. 5.4: An interactive heatmap with zoom-in/zoom-out feature

Overlaying the heatmap with points of interest (POIs) and road network data further enriched our study. For instance, accident hotspots often corresponded with areas near businesses, restaurants, or entertainment venues, where heavy pedestrian and vehicle traffic can increase accident risk. Higher accident frequencies were also observed at major junctions and near public transportation hubs. By understanding the geographical links between accident sites and surrounding infrastructure, policymakers can make data-driven choices to improve road safety in these critical places. Future studies might examine more closely how specific road characteristics or nearby facilities affect accident risk, providing even more precise information for safety planning.
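One simple way to approximate hotspot detection, shown here as a sketch with synthetic coordinates and the `Start_Lat`/`Start_Lng` column names of the US Accidents dataset, is to bin accident locations into a coarse grid and count accidents per cell:

```python
# Hedged sketch: bin accident coordinates into a ~1-degree grid and count
# accidents per cell; the densest cell approximates a hotspot.
import pandas as pd

df = pd.DataFrame({
    "Start_Lat": [34.05, 34.06, 34.07, 40.71, 40.72, 34.05],
    "Start_Lng": [-118.24, -118.25, -118.24, -74.01, -74.00, -118.26],
})

grid = (df.assign(lat=df["Start_Lat"].round().astype(int),
                  lng=df["Start_Lng"].round().astype(int))
          .groupby(["lat", "lng"]).size()
          .sort_values(ascending=False))
print(grid.index[0])  # densest grid cell
```

An interactive heatmap like Fig. 5.4 would instead feed the raw coordinates to a mapping library, but grid counts like these are a quick way to rank candidate hotspots.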

5.1.5. Matrix of Correlations

The correlation study between numerical features, whose matrix is shown in Fig. 5.5, provided additional understanding of how different factors interact and affect accident incidence. For instance, visibility showed a negative correlation with accident frequency, implying that low visibility greatly increases accident risk. Common in fog, rain, or snow, low-visibility conditions make it difficult for drivers to respond to unexpected hazards, thereby raising accident rates.

Temperature also showed a modest negative correlation with accident incidence. This correlation may be weather-related, as lower temperatures usually accompany rain, snow, or ice, which can raise the possibility of accidents. Although the association is not strong, this result justifies including temperature as a variable in our models, as it helps capture the broader weather-related influences on accidents. These relationships guided feature selection for our prediction models, enabling us to avoid unnecessary complexity and concentrate on the most important factors. Understanding the interactions among these variables is crucial, as it highlights the interconnected nature of accident risks and helps refine our models for higher predictive accuracy.
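A matrix like Fig. 5.5 can be computed directly with pandas. The sketch below uses column names mirroring the US Accidents dataset and synthetic values chosen to reproduce the reported signs (negative for visibility and temperature, positive for humidity); it is an illustration, not our actual data.

```python
# Hedged sketch: correlation matrix over numeric features, as in Fig. 5.5.
import pandas as pd

df = pd.DataFrame({
    "Temperature(F)": [75.0, 40.0, 32.0, 88.0, 55.0],
    "Visibility(mi)": [10.0, 4.0, 1.0, 10.0, 7.0],
    "Humidity(%)":    [50.0, 90.0, 95.0, 40.0, 70.0],
    "Severity":       [2, 3, 4, 2, 2],
})

corr = df.corr(numeric_only=True)
print(corr["Severity"].round(2))
```

Inspecting the "Severity" column of the matrix is the quickest way to shortlist candidate predictors before model training.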

Fig. 5.5: Correlation Matrix of the important numeric features

5.2. Results of Predictive Models

Our work next moved to building prediction algorithms to classify accident severity. We evaluated several machine learning models to find the best-performing method for this task, including Logistic Regression, Random Forest, XGBoost, Support Vector Machine (SVM), and Multilayer Perceptron (MLP). Every model was assessed holistically by means of accuracy, precision, recall, and F1-score.

5.2.1. Feature Importance

For the Random Forest and XGBoost models, the feature importance analysis depicted in Fig. 5.6 below sheds important light on the factors most likely to affect accident severity. Weather-related factors like precipitation and visibility rank consistently among the best indicators. Reduced visibility, usually caused by fog or rain, connects directly to increased accident risk, while precipitation helps create dangerous road conditions that raise the degree of severity.

Temporal characteristics such as the hour of the day and day of the week also contributed greatly to accident severity prediction. Accidents on weekends or during peak traffic showed distinct trends, underlining the need for temporal context in understanding accident dynamics. For example, Saturday night accidents were more likely to be serious, perhaps because of a mix of alcohol intake, tiredness, and poor attention. By identifying these main determinants of accident severity, our study provides useful input for focused safety actions, including an enhanced law enforcement presence during high-risk hours or improved lighting in locations with regular night-time accidents.

Fig. 5.6: Feature importance
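The importances behind a chart like Fig. 5.6 can be read directly off a fitted tree ensemble. The sketch below uses a Random Forest on synthetic data in which the target depends only on visibility, so the visibility feature should dominate the ranking; the feature names are assumptions, not our exact engineered features.

```python
# Hedged sketch: rank feature importances from a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
visibility = rng.uniform(0, 10, n)
hour = rng.integers(0, 24, n)
noise = rng.normal(size=n)
y = (visibility < 3).astype(int)  # target driven by visibility only

X = np.column_stack([visibility, hour, noise])
names = ["Visibility(mi)", "Hour", "Noise"]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranked = sorted(zip(names, clf.feature_importances_),
                key=lambda p: p[1], reverse=True)
print(ranked[0][0])  # expected to be the visibility feature
```

XGBoost exposes the same information through its own `feature_importances_` attribute, so the identical ranking code applies.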

In summary, this chapter presents the findings of our analytic and predictive modeling efforts, offering a complete picture of the factors affecting accident incidence and severity. Through EDA, we found important temporal, meteorological, and geographical trends, providing vital information for efficient traffic control and safety design. Our predictive modeling findings highlight the significance of model selection; XGBoost ranked highest among the algorithms for severity classification. Analyzing feature importance helped us identify the key factors influencing accident severity, guiding further investigation and practical application. The results in this chapter show the potential of machine learning to improve accident prediction and prevention initiatives, laying a solid basis for data-driven traffic safety policies.

5.2.2. Confusion Matrices

Examining the confusion matrix for each model helped us understand model performance in greater depth. These matrices expose each model's strengths and shortcomings by offering a thorough breakdown of prediction results across the severity levels. The XGBoost model in particular excelled at correctly spotting higher-severity incidents (Severity 3 and 4), which are vital for preventative safety campaigns. This precision in identifying serious accidents is crucial, since it enables authorities to act quickly to prevent such events.

Fig. 5.7: Confusion matrix for the MLP model

Examining the confusion matrices also revealed certain limitations of each model. The Logistic Regression model, for instance, tended to misclassify severe accidents as lower severity, reflecting its poor ability to detect non-linear patterns. The MLP and XGBoost models, by showing more balanced performance across all severity levels, demonstrated their flexibility in handling challenging datasets. These results highlight the value of confusion matrices as diagnostic tools, since they help researchers understand model behavior at a detailed level and refine their models accordingly.

As the MLP model was our primary and initial implementation, Fig. 5.7 shows its confusion matrix, Fig. 5.8 plots the loss graph for our first attempt at the MLP model with only 50 epochs, and Fig. 5.9 plots the loss graph of the same model after training for 100 epochs on our systems.

Fig. 5.8: Loss graph with initial MLP model.

Fig. 5.9: Loss graph with final MLP model
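Matrices like the one in Fig. 5.7 can be produced with scikit-learn. The labels in this sketch follow the dataset's 1-4 severity scale, and the predictions are purely illustrative, not output from our trained models.

```python
# Hedged sketch: confusion matrix for a severity classifier.
from sklearn.metrics import confusion_matrix

y_true = [2, 2, 2, 3, 4, 1, 2, 3]
y_pred = [2, 2, 2, 2, 4, 2, 2, 3]

# Rows are true severities, columns are predictions, ordered 1..4.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
print(cm)
```

Off-diagonal mass in the higher-severity rows is exactly the failure mode described above for Logistic Regression: severe accidents predicted as lower severity.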

5.2.3. Comparative Model Performance

The tables below exhibit the performance metrics of the models via a side-by-side comparison of accuracy, precision, recall, and F1-score. With a test accuracy of 95.34% and a macro-average F1-score of 0.55, XGBoost, as can be seen in Table 5.1, outperformed all other models. XGBoost's capacity to manage unbalanced data, its resistance to overfitting, and its gradient-boosting process, which detects intricate data patterns and helps differentiate the degrees of accident severity, most likely account for this outstanding performance. The Random Forest model, whose classification report is shown in Table 5.2, is another useful model for this task, obtaining similarly high test accuracy (95.14%) and a respectable macro-average F1-score of 0.50. Random Forest's ensemble method helped it effectively capture data patterns, even if its performance lagged somewhat behind XGBoost.

Models such as Logistic Regression, with metrics shown in Table 5.3, and SVM, shown in Table 5.4, demonstrated competitive test accuracies but lower macro-average F1-scores (0.27 and 0.32, respectively); Table 5.5 shows the classification report for the Decision Tree model, which had similar issues, implying limits in efficiently capturing the minority classes. Though its report is not shown here, the Multilayer Perceptron (MLP) also showed promise in accident severity classification. Although the MLP was computationally more demanding, as a neural network-based model it offered flexibility in addressing non-linear data interactions. The results highlight the need to choose a model that fits the particular data characteristics and aims of accident severity prediction. Every model has unique advantages and disadvantages, which emphasizes the importance of carefully weighing model complexity, processing resources, and interpretability in practical use.
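Reports like Tables 5.1-5.5 come directly from scikit-learn's `classification_report`. This sketch uses a small synthetic split whose imbalance mimics the pattern in our tables (a strong majority class and weak minority recall); the numbers are illustrative, not our actual results.

```python
# Hedged sketch: per-class precision/recall/F1 plus macro averages.
from sklearn.metrics import classification_report

y_true = [2] * 90 + [3] * 6 + [4] * 3 + [1]
y_pred = [2] * 90 + [2, 2, 2, 2, 3, 3] + [4, 4, 2] + [2]

report = classification_report(y_true, y_pred, labels=[1, 2, 3, 4],
                               output_dict=True, zero_division=0)
macro_f1 = round(report["macro avg"]["f1-score"], 2)
print(macro_f1)
```

Note how the macro F1 sits far below the accuracy on data like this: the majority class is predicted almost perfectly while the rare classes drag the unweighted average down, the same gap visible between our 95% accuracies and 0.27-0.55 macro F1-scores.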

Table 5.1: Classification Report for XGBoost

Training Accuracy: 0.9567
Test Accuracy:     0.9534

Severity     0        1        2        3
Precision    0.72     0.96     0.64     0.72
Recall       0.35     1.00     0.40     0.15
F1-Score     0.47     0.98     0.49     0.25
Support      324      43362    876      1424

Table 5.2: Classification Report for Random Forest

Training Accuracy: 0.9539
Test Accuracy:     0.9514

Severity     0        1        2        3
Precision    0.87     0.95     0.69     0.79
Recall       0.28     1.00     0.35     0.06
F1-Score     0.42     0.98     0.46     0.12
Support      324      43362    876      1424

Table 5.3: Classification Report for Logistic Regression

Training Accuracy: 0.9420
Test Accuracy:     0.9427

Severity     0        1        2        3
Precision    0.75     0.95     0.39     0.57
Recall       0.01     1.00     0.09     0.00
F1-Score     0.02     0.97     0.15     0.01
Support      324      43362    876      1424

Table 5.4: Classification Report for SVM

Training Accuracy: 0.9452
Test Accuracy:     0.9455

Severity     0        1        2        3
Precision    0.85     0.95     0.73     0.80
Recall       0.19     1.00     0.09     0.00
F1-Score     0.31     0.97     0.16     0.01
Support      324      43362    876      1424

Table 5.5: Classification Report for Decision Tree

Training Accuracy: 0.9565
Test Accuracy:     0.9501

Severity     0        1        2        3
Precision    0.64     0.96     0.57     0.50
Recall       0.41     0.99     0.44     0.17
F1-Score     0.50     0.98     0.50     0.25
Support      324      43362    876      1424

5.3. Visualization

Exploratory data analysis (EDA) and efficient communication of complicated data patterns depend greatly on visualizations. Interactive visualizations and dashboards created using Tableau [20] helped investigate the US Accidents data. Several important visualizations were created to grasp the correlations and distributions of different features. These visuals provide a complete picture of the data and guide further modeling choices. For those engaged in traffic accident trend and pattern analysis, they also provide insightful information.

Fig. 5.14 displays several important temporal and numerical aspects. Temperature distribution: the histogram shows a fairly regular distribution of temperatures, centered around [75°F]. This means that while accidents may be more common in certain temperature ranges, they happen across the whole spectrum of temperatures. Humidity distribution: the humidity histogram shows a concentration of accidents at certain levels, implying that humidity may have some influence on accident incidence. Time distribution: consistent with higher traffic flow at these times, the bar chart showing the hourly distribution of accidents clearly displays peaks during the morning (7-9 AM) and evening (4-6 PM).
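The hourly pattern in the dashboard can be reproduced outside Tableau with a few lines of pandas. The timestamps here are synthetic; in our pipeline the hour would come from the dataset's Start_Time column.

```python
# Hedged sketch: hourly accident distribution from timestamps.
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2022-03-01 08:15", "2022-03-01 08:40", "2022-03-01 17:05",
    "2022-03-02 17:30", "2022-03-02 02:10", "2022-03-03 08:55",
]))

hourly = times.dt.hour.value_counts().sort_index()
peak_hour = hourly.idxmax()
print(hourly.to_dict())
```

On real data, plotting `hourly` as a bar chart reproduces the morning and evening rush-hour peaks described above.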

Fig. 5.14: Tableau Dashboard 1

Fig. 5.15 shows visualizations of timezone, meteorological conditions, and monthly accident distribution. The first bar chart compares accident counts across time zones, showing the preponderance of accidents in the [name timezone with highest count] timezone. This might be attributed to factors such as population density, transportation network scale, or regional reporting standards. A second bar chart shows the frequency of accidents under various meteorological conditions.

Although "fair" and "clear" conditions account for a good share of accidents, unfavorable weather like "cloudy," "rain," and "fog" also contributes significantly. This emphasizes the need to include meteorology in accident research. The pie chart shows the monthly accident count. Although accidents happen all year long, there may be minor fluctuations in frequency depending on the month, perhaps affected by seasonal factors like weather or holiday travel.

Fig. 5.15: Tableau Dashboard 2

CHAPTER 6
CONCLUSION AND FUTURE IMPLEMENTATIONS

This work demonstrates the potential of Apache Spark and TensorFlow for traffic accident prediction and analysis. By creating an intelligent system able to recognize accident trends and contributing causes, we provide insightful analysis that supports continuing efforts to improve road safety. The system's capacity to forecast accident severity and highlight important contributing factors enables the creation of focused interventions and preventative plans. This knowledge may empower policymakers, traffic engineers, and even individual drivers to make informed judgments that lower accident risk.

The project started with the road accidents dataset as a base, which showed us which important fields we could analyze and predict in order to ultimately reduce the number of accidents, and to provide the required support at the earliest for severe ones. After the initial trial of predicting accident probability, which led to an overfitted model, we shifted our approach to predicting the severity of accidents from the data we had, alongside Exploratory Data Analysis to extract useful insights from the dataset. We then tried six different models, namely Multilayer Perceptron (MLP), Random Forest, SVM, Logistic Regression, Decision Tree, and XGBoost, of which we found XGBoost to be the most accurate, with a test accuracy of 95.34% in predicting the class of accident severity. We also faced several problems while building the project, which include the following:

Class Imbalance: The dataset shows a class imbalance wherein certain accident severity levels are far more common than others. For less frequent but important severity levels, this imbalance can distort model training and produce erroneous forecasts. Future work will center on cost-sensitive learning algorithms, undersampling majority classes, or oversampling minority classes to handle this problem. These techniques will ensure that the model assigns suitable weight to every degree of severity, producing stronger and more accurate forecasts.
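One of the remedies above, random oversampling of a minority severity class, can be sketched with scikit-learn's `resample`; the class sizes and column names here are illustrative, not our dataset's.

```python
# Hedged sketch: upsample a minority severity class to match the majority.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"Severity": [2] * 50 + [4] * 5,
                   "Visibility": list(range(55))})

majority = df[df["Severity"] == 2]
minority = df[df["Severity"] == 4]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["Severity"].value_counts().to_dict())
```

Oversampling must be applied only to the training split, never before the train/test split, or the duplicated rows leak into evaluation.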

Data Generalizability: The performance of the present model may be restricted to the particular features of the one-month training dataset. Future studies will investigate geographical and temporal variation in order to improve generalizability and ensure the model's efficacy across many situations. This might involve including features that directly reflect these variations within a single model, or training distinct models for different geographical locations or periods. Moreover, we will look at ensemble techniques, which aggregate the forecasts of many models, to raise overall predictive accuracy and resilience.

Although mean imputation was used in this work, more advanced methods of handling missing data might further enhance model performance. Future work will investigate other imputation techniques such as model-based imputation or k-nearest neighbors imputation, which may provide more accurate estimates for missing values by exploiting feature correlations.
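A minimal sketch of the k-nearest-neighbors alternative, using scikit-learn's `KNNImputer` on synthetic temperature/visibility rows:

```python
# Hedged sketch: KNN imputation fills the missing temperature from the two
# rows whose visibility is closest, instead of using the global mean.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [75.0, 10.0],
    [74.0, 9.0],
    [30.0, 2.0],
    [np.nan, 10.0],  # temperature missing; visibility matches the warm rows
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[3, 0])  # average of the two nearest rows: 74.5
```

Unlike mean imputation (which would fill in (75 + 74 + 30) / 3 ≈ 59.7 here), KNN imputation exploits the correlation between visibility and temperature to produce a more plausible estimate.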

The development of a truly effective accident prediction system depends critically on the integration of real-time data sources. Future studies will concentrate on real-time traffic flow, weather updates, and road conditions. This dynamic data connection will let the system provide up-to-date risk evaluations and adjust to changing conditions. Establishing a proactive system capable of delivering timely alerts and guiding quick safety precautions depends on this.

For the future implementation of this project, we want to create a stronger, more accurate, and more comprehensive accident prediction system by overcoming these constraints and incorporating real-time data. Globally, this improved approach has great potential to greatly reduce accidents and increase road safety. The ultimate aim is to translate these scientific findings into useful applications that can save lives and provide safer roads for everybody. This might include integrating the prediction system into traffic management systems to dynamically adjust traffic flow and maximize safety measures, or creating mobile apps offering drivers real-time risk evaluations.

REFERENCES

[1] World Health Organization, "Road traffic injuries," vol. 15, no. 3, pp. 123-145, 2023.

[2] National Highway Traffic Safety Administration, "Risky driving: Distracted driving," vol. 8, no. 1, pp. 45-67, n.d.

[3] World Health Organization, Save LIVES: A road safety technical package, vol. 22, no. 4, pp. 210-234, 2018.

[4] A. N. Shafabakhsh, M. Famili and S. Nazari, "Development and Application of Road Traffic Accident Prediction Model Based on Machine Learning Algorithm," Sustainability, vol. 16, no. 16, p. 6767, 2023, doi: 10.3390/su16166767.

[5] S. Wen, W. Wang, Z. Yan, X. Chen and J. Zhang, "Maritime Accident Risk Prediction Integrating Weather Data Using Machine Learning," Transportation Research Part D: Transport and Environment, vol. 117, p. 103646, 2024, doi: 10.1016/j.trd.2023.103646.

[6] R. Bergel-Hayat, M. Debbarh, C. Antoniou and G. Yannis, "Explaining the road accident risk: Weather effects," Accident Analysis & Prevention, vol. 60, pp. 456-465, 2021, doi: 10.1016/j.aap.2021.105992.

[7] A. Ahmed, H. Farooqui, M. Mehta and M. H. Al Turkestani, "Predicting Road Traffic Accidents Using Machine Learning Approach," IOP Conference Series: Materials Science and Engineering, vol. 590, no. 1, p. 012029, 2019, doi: 10.1088/1757-899X/590/1/012029.

[8] X. Ma, J. Dai, S. Wang, Z. Yang and Q. Wu, "Traffic Accident Prediction Based on CNN Model," in 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 2021, pp. 1292-1296, doi: 10.1109/ICICCS51141.2021.9432224.

[9] J. Yuan, Y. Zheng and X. Xie, "Accident Risk Prediction based on Heterogeneous Sparse Data," in Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Chicago, IL, USA, 2019, pp. 309-318, doi: 10.1145/3347146.3359078.

[10] M. M. Khan, A. R. Mahmood and N. Baig, "Predictive Modeling of Hourly Probabilities for Weather-Related Road Accidents," Natural Hazards and Earth System Sciences, vol. 20, no. 11, pp. 2857-2871, 2020, doi: 10.5194/nhess-20-2857-2020.

[11] K. Soronen, "Accident Prediction Using Machine Learning: Analyzing Weather Conditions, and Model Performance," Master's Thesis, University of Oulu, Faculty of Information Technology and Electrical Engineering, Oulu, Finland, vol. 1, no. 2, pp. 1-100, May 2023.

[12] M. S. Hossain, M. M. Hassan, M. R. I. Rabby, A. Al Mamun and M. I. Uddin, "Internet of Things-Based Intelligent Accident Avoidance System for Adverse Weather and Road Conditions," Journal of Reliable Intelligent Environments, vol. 7, no. 3, pp. 203-219, 2021, doi: 10.1007/s40860-021-00132-7.

[13] A. Febrianto, A. A. S. Gunawan and I. K. E. Purnama, "Road Car Accident Prediction Using a Machine-Learning-Enabled Data Analysis," Sustainability, vol. 15, no. 7, p. 5939, 2023, doi: 10.3390/su15075939.

[14] S. Moosavi, M. H. Samavatian, S. Parthasarathy and R. Ramnath, "A Countrywide Traffic Accident Dataset," 2019.

[15] S. Moosavi, M. H. Samavatian, S. Parthasarathy, R. Teodorescu and R. Ramnath, "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights," in Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Chicago, IL, USA, 2019, pp. 309-318, doi: 10.1145/3347146.3359078.

[16] ScienceDirect, "Multilayer Perceptron," in Encyclopedia of Computer Science, vol. 12, no. 5, pp. 876-892, 2023.

[17] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, L. Xin, T. Kan, S. Gonzalez-Garcia, J. Franklin, F. Li, R. Zaharia and M. J. Franklin, "MLlib: Machine Learning in Apache Spark," Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235-1241, 2016.

[18] L. Breiman, "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001, doi: 10.1023/A:1010933404324.

[19] N. Becker, H. W. Rust and U. Ulbrich, "Predictive modeling of hourly probabilities for weather-related road accidents," Nat. Hazards Earth Syst. Sci., vol. 20, no. 10, pp. 2857-2871, 2020, doi: 10.5194/nhess-20-2857-2020.

[20] J. R. Kashyap, "Tableau Tutorial: Getting Started with Tableau," Medium: Edureka, Nov. 8, 2019. Accessed: Nov. 5, 2024.
