BDE Final Report
ROAD ACCIDENTS
A PROJECT REPORT
Submitted by
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE ENGINEERING
with specialization in Cloud Computing
NOVEMBER 2024
Department of Networking and Communications
SRM Institute of Science & Technology
We hereby certify that this assessment complies with the University's Rules and Regulations
relating to academic misconduct and plagiarism, as listed on the University website, in the
Regulations, and in the Education Committee guidelines.
We confirm that all the work contained in this assessment is our own except where
indicated, and that we have met the following conditions:
We understand that any false claim in respect of this work will be penalized in accordance
with the University policies and regulations.
DECLARATION:
We are aware of and understand the University's policy on academic misconduct and plagiarism,
and we certify that this assessment is our own work, except where indicated by referencing, and
that we have followed the good academic practices noted above.
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203
BONAFIDE CERTIFICATE
ACKNOWLEDGEMENT
We would like to sincerely thank our faculty guide, Dr. Banu Priya P., of the Department
of Networking and Communications at SRM Institute of Science and Technology, Chennai,
for her excellent direction, consistent support, and helpful criticism throughout this project. Her
knowledge of machine learning and big data analytics gave us a strong foundation for handling
this challenging topic effectively.
Many web tools and sites helped this effort tremendously. We thank online
communities such as Stack Overflow and Towards Data Science for their insightful analysis and
creative answers to technical problems. We also thank the suppliers of the US Accidents dataset
on Kaggle, which was the pillar of our study. The availability of such large-scale databases
emphasizes the importance of open data and community involvement in advancing research. We
would also like to thank the developers and maintainers of the open-source tools that made this
effort possible. Tools like Apache Spark and TensorFlow were essential for data processing,
model construction, and analysis, and their extensive documentation and active user
communities were very helpful throughout. We also thank the developers of the Python
Scikit-learn and Pandas libraries for providing necessary tools for data handling, analysis, and
model assessment. Moreover, successful communication of our results depended greatly on
Tableau's simple interface and strong visualization tools. We deeply value our families'
consistent support and encouragement throughout this journey.
Their conviction in our abilities gave us the drive and fortitude required to push past
obstacles and reach our study objectives. Finally, we recognize the crucial importance of
tackling the worldwide road safety problem. This project aims to contribute to a better
understanding of the elements causing road accidents and thereby guide the creation of sensible
plans to increase road safety and save lives. We appreciate the chance to contribute to this
essential field of study.
ABSTRACT
Road safety remains a major worldwide issue that calls for creative solutions to reduce
the major financial and human losses related to traffic accidents. This work uses a large-scale
historical accident dataset to build an intelligent accident analysis and prediction system,
thereby addressing this challenge. For forecasting accident severity, we create and test
multiple machine learning models, including Random Forest, XGBoost, and a Multilayer
Perceptron (MLP) built with TensorFlow and Keras, using a one-month sample from a publicly
accessible US accident dataset. Using Apache Spark, the system integrates a robust data
preprocessing pipeline to manage the volume and complexity of the dataset effectively,
resolving issues such as missing values and data inconsistencies. Most importantly, our
method takes into account many contributing elements, such as exact location data, road
conditions, temporal patterns (hour of day, day of week, etc.), and specific weather conditions,
allowing a more complete understanding of accident dynamics. With a training
accuracy of about 95.67% and a testing accuracy of about 95.34%, the MLP model's
performance emphasizes the promise of machine learning for accident severity prediction.
Beyond model construction, this study uses visualizations and statistical techniques to conduct
comprehensive exploratory data analysis (EDA), identifying accident-prone locations,
investigating the impact of meteorological variables on accident severity, and exposing
underlying accident trends. These insights, along with the predictive power of the machine
learning models, provide valuable input for creating focused road safety campaigns, enhancing
infrastructure design, and ultimately lowering the frequency and intensity of future collisions.
This study emphasizes the value of merging modern machine learning methods with big data
technology to solve challenging real-world problems in road safety.
TABLE OF CONTENTS
CHAPTER NO. TOPIC PAGE NO.
ABSTRACT iii
LIST OF FIGURES v
LIST OF TABLES vi
1 INTRODUCTION 1
2 LITERATURE REVIEW 5
3 PROPOSED METHODOLOGY 10
4 IMPLEMENTATION 19
5 RESULTS AND DISCUSSIONS 27
6 CONCLUSION 41
REFERENCES 43
LIST OF FIGURES
Fig 5.8 Final loss function after 100 epochs of the MLP model 36
CHAPTER 1
INTRODUCTION
Road accidents are a persistent worldwide problem, causing a heavy death toll, disabling
injuries, and major economic consequences. Using a data-driven strategy to address the many
elements behind these tragic events, this study explores the important problem of traffic
accident analysis and prediction. Conventional approaches are often reactive, examining
accidents only after they occur. Our project, however, stresses a proactive strategy that uses
data analytics and machine learning to forecast and prevent accidents. Achieving significant
gains in road safety depends on this shift toward proactive approaches. The ever-growing
availability of large-scale accident datasets, together with developments in big data
technologies like Apache Spark and machine learning techniques, especially deep learning with
TensorFlow, creates unprecedented opportunities to build intelligent systems capable of
spotting hidden patterns, trends, and high-risk factors. Understanding the complex interactions
among contributing elements (weather conditions, road infrastructure features, temporal
trends, and driver behavior) allows us to approach road safety with greater foresight and
prevention. This project aims to construct such a system, thereby helping to lower the accident
count and provide a safer driving environment for everybody.
Currently used techniques of road accident analysis suffer from several basic
limitations that hinder the development of efficient preventive strategies. The first is a
piecemeal understanding of weather's effects: while it is clear that road accidents are strongly
influenced by weather, the exact effect of various weather conditions (for example, heavy rain
versus moderate drizzle, or high winds versus calm conditions) and how they interact with
other elements is not fully understood. This research aims to close this gap through thorough
analysis of the impact of many meteorological elements on the frequency and severity of
accidents.
The evaluations presently used also often fail to properly incorporate the temporal
dynamics of accidents. Targeted interventions depend on identifying peak accident hours, days
of the week, and seasonal variations, and on correlating these temporal patterns with possible
contributing variables such as traffic volume, driver behavior, or event-specific impacts.
Finer-grained spatial analysis is likewise needed to identify accident hotspots and their
relationship with particular road infrastructure characteristics (such as intersection design, road
curvature, and lighting conditions) and environmental factors (such as terrain, visibility
obstructions, and proximity to high-risk locations). This research will employ geospatial
analysis tools to reach this level of detail.
Based on real-time and historical data, the capability to predict the likelihood of
accidents, and the degree of damage resulting from them, is largely lacking. This is a serious
restriction on current forecasting power. Strong prediction models that incorporate many
factors would be very helpful for proactive resource allocation, customized safety
recommendations, and focused interventions, as they would help guide development.
1.2. Motivation
This project is motivated by the urgent need to address the worldwide road safety issue
through a proactive, data-driven strategy. Using an intelligent analytics system, our goal is to
provide practical insights for:
1. Identifying high-risk areas and infrastructure flaws, to guide focused
enhancements in road design and safety features.
2. Applying real-time traffic management techniques and focused traffic-calming devices,
to improve traffic flow during periods of peak accident occurrence.
1.3. Objectives
1. Using Apache Spark, create a robust and effective data processing pipeline to clean,
preprocess, and prepare the large US Accidents dataset for analysis.
2. Examine the data on accident incidence and severity through comprehensive
exploratory data analysis (EDA) to find latent trends, correlations, and patterns.
3. Using advanced geospatial analysis tools, pinpoint accident hotspots and their link
with road infrastructure and environmental conditions.
4. Using TensorFlow, develop and assess several machine learning models, including a
Multilayer Perceptron (MLP), for accident severity prediction based on a
confluence of relevant variables.
5. Create interactive visualizations to present the project's results effectively to a broad
spectrum of stakeholders, including traffic planners, legislators, and the general public.
1.4. Ideas
The creation of this approach is guided by several main ideas. Big data technologies
such as PySpark enable efficient management and analysis of the massive volume of accident
data. First, other data sources could be incorporated to enrich the research: traffic flow data,
road network data, and demographic data. Second, exploring many machine learning
techniques and architectures, including deep learning models, may improve predictive
accuracy. Lastly, building interactive dashboards and visualizations gives many stakeholders
quick access to actionable information.
1.6. Challenges
This project anticipates several important obstacles:
1. Data Volume and Complexity: The sheer volume and complex character of the US
Accidents dataset pose major challenges for effective data processing, cleansing,
transformation, and analytic preparation.
2. Integration of Data Sources: Data management and fusion become more difficult when
many data sources (traffic flow, road networks, demographics) are combined.
Ensuring data compatibility and creating sensible plans for merging many data kinds
will be essential.
3. Model Selection and Tuning: Choosing the most appropriate machine learning models
and fine-tuning their hyperparameters for maximum accident severity prediction
performance will require thorough investigation and assessment.
4. Practical Relevance: Further development and cooperation with relevant parties will
help translate the study results into useful, real-world solutions and guarantee their
efficient implementation in actual traffic management or driver assistance systems.
Real-world effect depends on overcoming possible deployment obstacles like real-time
data availability and interoperability with current systems.
CHAPTER 2
LITERATURE REVIEW
severe marine circumstances.
Regarding road accident prediction, a research article in Accident Analysis and
Prevention [6] examined how climatic conditions affect accident probability on various
kinds of roadways. The results underlined how strongly accident probability is influenced
by weather elements like visibility, temperature, and precipitation. For example,
slippery conditions, heavy rain, and foggy weather were shown to increase
accident risk because of reduced vehicle control and vision. This work provided important new
perspectives on improving road safety by integrating meteorological data into risk
prediction systems. The research also underlined the difficulty of obtaining comprehensive
meteorological data, which differs by area and season, making it hard to guarantee
consistent prediction accuracy across many road types. Including such complex data in
predictive models meant striking a balance between accuracy and practicality; the researchers
underlined the need for ongoing data collection and model development to improve safety
outcomes. This work shows how data complexity grows in step with model sophistication,
as good prediction accuracy depends on strong data inputs that reflect the subtleties of
weather-road interactions.
Convolutional Neural Networks (CNNs) have recently been adapted for traffic accident
prediction. CNN models were used in a study reported at the Fifth International Conference
on Intelligent Computing and Control Systems (ICICCS) [8] to forecast accident events and
assess risk variables connected to specific scenarios. In terms of both prediction accuracy
and loss reduction, these CNN-based models beat conventional Back Propagation (BP)
techniques, demonstrating CNNs' strength in learning complicated patterns from big datasets.
CNN models do, however, present several difficulties, mostly related to the large data
volumes needed to guarantee accuracy. Longer computation times resulting from the
complexity of these models also represent a barrier to real-time application, especially in
high-stakes situations like traffic control where fast forecasts are crucial. This paper
underlined the trade-off between computing needs and practical application, as well as
CNNs' potential to improve traffic safety by generating more accurate forecasts.
A study published in Natural Hazards and Earth System Sciences [10] focused on how
weather conditions affected traffic accidents, using a prediction model combining
high-resolution reanalysis meteorological data with radar-based precipitation data. The
study greatly improved forecast accuracy by using these meteorological data, raising
the hit rate from thirty to seventy percent. This notable increase underlined the
importance of high-quality data, especially when gathered at a district level, for more
dependable estimation of traffic accident risk. Nevertheless, the inclusion of such
comprehensive meteorological data presents significant logistical difficulties, as
high-resolution data collection and processing need sophisticated infrastructure and
computing tools. This work underlined the need for fine-grained data for high forecast
accuracy, particularly in relation to location-specific phenomena such as precipitation rates
and temperature variations.
Using machine learning models including Extreme Gradient Boosting (XGBoost),
Random Forest, and Decision Trees, a Master's thesis from the University of Oulu further
investigated accident prediction, evaluating the effect of storms on accident frequency and
severity [11]. These robust and interpretable machine learning techniques let the researcher
find important elements influencing accidents during extreme weather events. The
research also highlighted the significant data requirements and the difficulties caused by
imbalanced datasets, however: catastrophic accidents in particular are quite rare
events, which can cause a class imbalance that, if not well controlled, distorts model
forecasts. Furthermore, the computing burden of compiling extensive meteorological data
was challenging, notably in real-time applications. Notwithstanding these
difficulties, the research showed that sophisticated machine learning methods can be used
to forecast accident severity under unfavorable conditions, providing important
information for traffic control and policy creation.
change road safety.
In order to forecast the degree of road accidents and the number of casualties involved,
a research report published in Sustainability examined machine learning methods including
Logistic Regression, Random Forest, and Decision Trees [13]. These algorithms let
researchers forecast the victim count with 84% accuracy and the degree of severity with
85.5% accuracy. The research did observe, nevertheless, that dataset
imbalances, especially in situations involving serious accidents, may lower sensitivity
and compromise the accuracy of predictions of the number of vehicles impacted. In accident
prediction, class imbalance is a prevalent problem, as serious accidents are less frequent than
minor ones and produce skewed datasets that can affect model performance. To address
these imbalances and raise model robustness, the paper underlined the need for
sophisticated data preparation methods such as synthetic data creation or oversampling. This
study underlined the possibilities of machine learning in the prediction of accident severity,
as well as the requirement for focused data preparation and the limits presented by
imbalanced datasets.
Ultimately, the body of research examining the intersection of traffic accident
prediction, machine learning, and meteorological data has expanded noticeably in recent
years. Each study adds unique insights into how various data sources and machine learning
approaches can raise prediction accuracy, striving to increase public safety. These
studies also highlight, however, the inherent difficulties of merging big, sophisticated
datasets, whether due to computing needs, data sparsity, or class imbalance. Advancing
the accuracy and applicability of these prediction models across various traffic and
environmental settings will likely depend heavily on future improvements in data processing,
computing efficiency, and real-time data integration.
CHAPTER 3
PROPOSED METHODOLOGY
This chapter describes the intended approach for constructing a prediction model for
accident severity and evaluating the US Accidents dataset. The method is mostly quantitative,
using statistical analysis and machine learning to find trends and connections within
the data. The large volume of the data and the objective character of the research
questions, which are best suited to numerical analysis and predictive modeling, led to the
choice of this quantitative method.
This project makes use of the publicly accessible US Accidents dataset, a thorough
compilation of road accident reports from across the United States available on Kaggle;
a sample snippet of the dataset is shown in Fig. 3.1 below, illustrating all the fields already
available to us. The rich detail of this dataset covers a broad spectrum of information
pertinent to our study goals, including latitude and longitude, weather and environmental
features, and various road features.
Precise latitude and longitude coordinates of accident sites make geospatial analysis
possible; timestamps marking the start and finish times of every occurrence likewise support
temporal analysis. Severity, which rates each accident from 1 (minor) to 4 (severe), is a
critical attribute and serves as the target variable for our predictive modeling. Detailed
weather conditions at the time of the accident, including temperature, humidity, visibility,
precipitation, wind speed, and wind direction, are provided. This abundance of detailed
weather data makes it possible to investigate closely how weather affects accident frequency
and severity.
Data about the road environment, such as street name, city, county, state, zip code,
traffic signals, crossings, junctions, and other points of interest (POIs), allows road
characteristics to be determined. These characteristics allow investigation of the link between
road construction and accident trends. Although the first idea was to increase the breadth
and generalizability of the study by using the whole dataset, it became clear that major data
cleansing and preparation would be required. Several columns in the
dataset, including variables relating to weather, location data, and road conditions, contain
very many missing values. Some fields also show inconsistent data types and formats.
These problems require rigorous data preparation to guarantee data quality and
dependability for subsequent analysis and modeling.
Fig. 3.1 lists all the fields in our dataset, giving a better sense of the data underlying
the project.
Fig. 3.1: Snippet from the Dataset
3.2. Data Preprocessing
Using both PySpark and Pandas in a multi-stage approach, the proposed data
preparation plan addresses the volume and complexity of the dataset. Initial data exploration
and cleaning use PySpark, selected for its distributed computing features, which allow
effective handling of the US Accidents dataset at scale. Initially, PySpark-based data
exploration would cover: examining the structure and data types of every column to
understand the dataset, and finding and counting missing values, incompatible data formats,
and possible outliers in every column to assess data quality. Missing values would then be
handled in PySpark with suitable techniques, including dropping columns with a high
proportion of missing data, dropping rows where missing values are rare, and imputing key
numerical columns with their means.
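A minimal sketch of this exploration stage is given below; it assumes a local Spark session
and a CSV copy of the dataset (the file path is illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("USAccidentsPrep").getOrCreate()

    # Load the raw CSV with header and inferred schema (path is illustrative).
    df = spark.read.csv("US_Accidents_sample.csv", header=True, inferSchema=True)

    # Examine the structure and data types of every column.
    df.printSchema()

    # Count missing values per column to assess data quality.
    missing = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                         for c in df.columns])
    missing.show(vertical=True)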
After the first cleaning and preprocessing with PySpark, a strategically chosen,
representative subset of the data would be converted into a Pandas DataFrame for more
precise data manipulation and analysis. This shift to Pandas, which is better suited to local
computing and smaller-scale data processing, helps us in several ways. First, parsing string
representations of dates and times into datetime objects simplifies temporal analysis, and
converting string-based categorical variables, such as road details or weather conditions, into
suitable Pandas categorical data types handles categorical features. Second, feature scaling
methods normalize numerical features so they have similar ranges; this prevents
characteristics with larger magnitudes from unduly affecting the machine learning models.
The particular scaling techniques considered are standardization (centering and scaling the
features to zero mean and unit variance) and normalization (scaling the features to a
designated range, usually between 0 and 1). Finally, if any missing values persist after the
first PySpark processing, Pandas' built-in imputation or deletion tools would handle them,
enabling more exact treatment tailored to certain columns or data features. This lets missing
data be handled in a context-specific way.
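A minimal sketch of this stage, assuming sampled_df is the representative Spark sample
described above (the column names are illustrative):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # Move the representative sample into Pandas for local analysis.
    pdf = sampled_df.toPandas()

    # Parse string timestamps into datetime objects for temporal analysis.
    pdf["Start_Time"] = pd.to_datetime(pdf["Start_Time"], errors="coerce")

    # Treat string-based descriptors as Pandas categorical dtypes.
    pdf["Weather_Condition"] = pdf["Weather_Condition"].astype("category")

    num_cols = ["Temperature(F)", "Humidity(%)", "Visibility(mi)", "Wind_Speed(mph)"]

    # Standardization: zero mean and unit variance.
    pdf[num_cols] = StandardScaler().fit_transform(pdf[num_cols])
    # Alternatively, normalization to the [0, 1] range:
    # pdf[num_cols] = MinMaxScaler().fit_transform(pdf[num_cols])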
By generating fresh, informative features from the current data, feature engineering
can significantly improve the predictive capability of machine learning models. The following
feature engineering techniques were proposed for this project:
The raw weather data in the US Accidents dataset offers rather exact descriptors (e.g.,
"light rain," "heavy snow," "fog," "partly cloudy"). Although this information is useful,
machine learning models may suffer from excessive dimensionality and possible overfitting if
these exact descriptions are used directly as input features. We therefore proposed
classifying these exact meteorological conditions into broader, more basic groups.
Different rain-related descriptors ("light rain," "heavy rain," "drizzle," "rain showers"), for
instance, would be included under a single "Rain" category. Similarly, a "Snow" category
would include "heavy snow," "light snow," "sleet," and "blowing snow." This classification
simplifies the model and could help it generalize to unseen data by reducing the number
of distinct weather values. Moreover, these broader categories might better capture how
weather affects driving conditions in general.
Precise timestamps for every accident are included in the dataset. We intended to
derive several time-based characteristics in order to capture temporal patterns and trends, as
sketched in the code after this list:
1. Hour of Day: the hour (0–23) at which the accident happened. This is essential for
capturing daily fluctuations in accident frequency and severity, such as those linked to
rush-hour traffic.
2. Day of Week: the day of the week (0–6, with 0 denoting Monday) on which the
accident happened. This feature helps highlight differences between weekend and
weekday accident trends.
3. Month of Year: the month (1–12) in which the accident happened. This feature may
capture seasonal fluctuations in accidents, perhaps connected to vacations, weather, or
school calendars.
4. Time-of-Day Category: categorical features derived from the hour of the day, such as
"Morning," "Afternoon," "Evening," and "Night," give the models a broader temporal
backdrop.
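A minimal Pandas sketch of these derivations (the exact bucket boundaries for the
time-of-day category are an assumption):

    import pandas as pd

    pdf["Start_Time"] = pd.to_datetime(pdf["Start_Time"], errors="coerce")

    pdf["Hour"] = pdf["Start_Time"].dt.hour            # 0-23
    pdf["DayOfWeek"] = pdf["Start_Time"].dt.dayofweek  # 0 = Monday
    pdf["Month"] = pdf["Start_Time"].dt.month          # 1-12

    # Coarser time-of-day buckets give the models broader temporal context.
    pdf["TimeOfDay"] = pd.cut(pdf["Hour"],
                              bins=[-1, 5, 11, 17, 23],
                              labels=["Night", "Morning", "Afternoon", "Evening"])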
The US Accidents dataset has many boolean (true/false) features showing the presence
of various road characteristics close to the accident site (e.g., "Amenity," "Bump," "Crossing,"
"Give_Way," "Junction," "Traffic_Signal"). We proposed converting these traits into
numerical representations so they can be used efficiently in machine learning models.
One-hot encoding proved appropriate: for every distinct value of a categorical
feature, one-hot encoding produces a new binary (0 or 1) column. The "Amenity" feature, for
instance, would be turned into a new column with 1 for the presence of an amenity and 0 for
its absence. This change lets the models use these categorical characteristics efficiently during
prediction.
Multilayer Perceptrons (MLPs), a class of artificial neural networks, are well suited
to learning complicated non-linear relationships within data. Their capacity to capture complex
interactions between features makes them a good candidate for estimating accident
severity, which is probably influenced by many elements. MLPs can also efficiently manage
high-dimensional data.
Random Forest, an ensemble of decision trees, offers robustness to noise and overfitting
along with built-in measures of feature relevance and insightful analysis of the elements
influencing predictions.
XGBoost (Extreme Gradient Boosting), another potent ensemble approach, is often
preferred for its great predictive effectiveness. It is very effective at managing missing data,
which is common in the US Accidents dataset. XGBoost is an excellent candidate for
estimating accident severity because of its regularization methods and capacity to detect
non-linear correlations.
Logistic regression is a useful starting model. Although simpler than the other
models, it gives interpretability and a baseline for comparison, making the link between
characteristics and expected results easier to understand. Comparing its performance with
that of more complicated models lets one evaluate whether the increase in predictive ability
justifies the added complexity.
Since the aim of the research is to forecast accident severity, a full assessment of the
machine learning models depends on a comprehensive collection of evaluation criteria.
Particularly in the framework of a multi-class classification problem with possible class
imbalance, the following metrics were selected for their capacity to provide a nuanced
understanding of model performance:
Precision measures, for each severity level, the proportion of correctly predicted
events of that level out of all the events predicted to be of that level. High precision
suggests that the model is probably correct when it forecasts a given degree of severity.
Recall, often known as sensitivity, gauges whether the model can accurately identify
every event of a given degree of severity: out of all the occurrences that are really of that
degree, it determines the percentage accurately predicted. High recall means the model
captures most of the accidents of a certain degree, indicating its efficacy. The F1-score, the
harmonic mean of precision and recall, offers a fair evaluation of both measures. It is
especially helpful in cases of class imbalance, as it penalizes models that reach high precision
at the price of poor recall, or vice versa. A high F1-score indicates that the model achieves
high recall along with high precision.
The Area Under the ROC Curve (AUC) measures the model's capacity to differentiate
between the degrees of severity. The ROC curve plots the true positive rate (recall) against
the false positive rate at different classification thresholds; the AUC is the area under this
curve, and a larger AUC indicates better discriminating capacity. For multi-class
classification, we intended to compute a weighted average of the AUC values for every
degree of severity. Taken together, these measures would provide a complete evaluation of
every model's performance, allowing us to choose the best-performing model while keeping
in mind the consequences of class imbalance and the importance of correctly forecasting
more severe incidents.
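As a minimal sketch with scikit-learn (y_test, y_pred, and y_proba are assumed outputs of
a fitted classifier):

    from sklearn.metrics import classification_report, roc_auc_score

    # Per-class precision, recall, and F1-score.
    print(classification_report(y_test, y_pred))

    # Weighted one-vs-rest AUC across the severity classes;
    # y_proba holds one probability column per class.
    auc = roc_auc_score(y_test, y_proba, multi_class="ovr", average="weighted")
    print(f"Weighted AUC: {auc:.3f}")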
Effective data exploration, pattern analysis, and communication of the results depend
on visualizations. The project uses several visualization approaches, each selected for its fit
in displaying particular data kinds and delivering certain insights:
Histograms would show the distributions of numerical variables such as temperature,
wind speed, and visibility. The distribution of categorical data like accident severity, weather
categories (developed during feature engineering), day of the week, and hour of the day
would be shown via bar charts. These representations reveal trends like the frequency of
various weather conditions during accidents or peak accident hours. Essential for geospatial
research, heatmaps provide a graphic picture of accident frequency across many sites;
overlaying the heatmap on a map makes it easy to find accident hotspots and high-risk zones.
Correlation matrices provide a graphic depiction of the linear links among numerical
variables. Strong positive or negative correlations between variables like temperature,
visibility, and wind speed would be highlighted by a heatmap depiction of the correlation
matrix, clarifying the interaction of these elements.
CHAPTER 4
IMPLEMENTATION
Our approach and process consist of the following main phases. Owing to computing
constraints, we concentrated on a one-month sample of a large-scale accident dataset
from Kaggle. Apache Spark was used for data cleansing; columns with many missing values
were dropped, and others were imputed with mean values. Feature engineering involved
simplifying meteorological conditions and generating additional features, though some were
dropped for the final model. We trained and tested models comprising a Multilayer
Perceptron (MLP), Gradient Boosting (XGBoost), and Random Forest, adjusting
hyperparameters and assessing performance, and used accuracy and feature importance to
understand accident causes. Refer to Fig. 4.1 for a complete overview of the workflow.
We use a large-scale accident dataset acquired from Kaggle [14, 15], which contains
a whole spectrum of features including accident location (latitude and longitude), time of
occurrence, meteorological conditions, road attributes, and accident severity. For our
implementation, due to computing restrictions, we used the one-month sample dataset from
the same source. Fig. 4.2 describes the dataset, reporting values like the count, mean,
standard deviation, minimum, and maximum for all the fields included in the dataset.
4.2. Data Cleaning and Preprocessing using Apache Spark:
We first examine each column's percentage of missing values in order to handle them.
Columns with a high percentage of missing data that are not vital to our study, such as
"End_Lat" and "End_Lng," are eliminated, as seen in Fig. 4.3. For columns with a smaller
percentage of missing values, we drop the affected rows. For fundamental numerical columns
like "Wind_Chill(F)," "Precipitation(in)," and "Wind_Speed(mph)," we impute missing
values using the mean value of the corresponding column.
We engineer fresh features from the existing ones to improve model performance. We
classify meteorological conditions into more general terms such as "Fair," "Rain," "Snow,"
etc., as shown in Fig. 4.4. This simplification increases model interpretability and lowers the
dimensionality of weather-related characteristics.
Each category can further be split based on its intensity. While experimenting with
other approaches, we also developed additional features like "Is_Raining," a binary feature
denoting the presence or absence of rain, and "Temperature_WindChill_Diff," the difference
between temperature and wind chill, which can suggest severe weather conditions. However,
we removed these in the final implementation.
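A minimal sketch of the keyword-based grouping applied as a Spark UDF (the exact
keyword-to-category mapping shown here is an assumption):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    def simplify_weather(condition):
        # Map a raw Weather_Condition string to a broad category by keyword.
        if condition is None:
            return "Unknown"
        c = condition.lower()
        if "snow" in c or "sleet" in c:
            return "Snow"
        if "rain" in c or "drizzle" in c or "shower" in c:
            return "Rain"
        if "fog" in c or "mist" in c or "haze" in c:
            return "Fog"
        if "cloud" in c or "overcast" in c:
            return "Cloudy"
        return "Fair"

    simplify_udf = F.udf(simplify_weather, StringType())
    df = df.withColumn("Weather_Category", simplify_udf(F.col("Weather_Condition")))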
To get the data ready for an ANN model like the MLP [16], we used StringIndexer to
convert categorical features into numerical representations. Fig. 4.5 below shows how a
multilayer ANN model like the MLP works. We then combined the numerical and encoded
features into one feature vector with VectorAssembler, part of PySpark's MLlib.
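A minimal sketch of this encoding and assembly step (the column lists are illustrative):

    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Convert a categorical column into a numerical index.
    indexer = StringIndexer(inputCol="Weather_Category",
                            outputCol="Weather_Category_idx",
                            handleInvalid="keep")
    df = indexer.fit(df).transform(df)

    # Combine numerical and encoded columns into one feature vector.
    assembler = VectorAssembler(inputCols=["Temperature(F)", "Humidity(%)",
                                           "Visibility(mi)", "Wind_Speed(mph)",
                                           "Weather_Category_idx"],
                                outputCol="features")
    df = assembler.transform(df)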
We built and trained an MLP model for predicting accident severity using TensorFlow.
We ran experiments to find the best number of neurons and activation functions for each
layer. The trained model was tested on new data using metrics such as accuracy, precision,
recall, and F1-score, and we also examined the confusion matrix. Fig. 4.6 shows the loss
graph for the model after 50 epochs of training. Finally, we performed a feature importance
analysis to determine which features had the biggest impact on accident severity. This helped
us understand the main causes of accidents and will assist in creating specific interventions.
Fig. 4.5: Layers in a multilayer perceptron (MLP) model
To prepare the data for the Random Forest model, we utilized StringIndexer from
PySpark's MLlib [17] to convert categorical features into numerical representations. We then
employed VectorAssembler from PySpark's MLlib [17] to combine both numerical and
encoded categorical features into a single feature vector. The Random Forest model [17] was
built and trained using the RandomForestClassifier from PySpark's ML library [17].
Hyperparameters like the number of trees, maximum depth, and the number of features to
consider at each split were optimized through techniques like grid search or cross-validation.
The trained model's performance was assessed on unseen data using metrics including
accuracy, precision, recall, and F1-score, accompanied by confusion matrix analysis to
evaluate its effectiveness across different severity levels. Feature importance analysis was
conducted using the Random Forest model's built-in feature importance scores [18], which
measure the average decrease in impurity (e.g., Gini impurity) achieved by each feature
across all trees. This analysis provided insights into the most influential features affecting
accident severity, shedding light on the primary accident causes and guiding the development
of targeted interventions.
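A minimal sketch of this training and evaluation flow with PySpark's ML library (the label
column, split ratio, and hyperparameter values are illustrative):

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    train, test = df.randomSplit([0.8, 0.2], seed=42)

    rf = RandomForestClassifier(labelCol="Severity_idx", featuresCol="features",
                                numTrees=100, maxDepth=10)
    model = rf.fit(train)

    # Evaluate accuracy on unseen data.
    preds = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(labelCol="Severity_idx",
                                                  predictionCol="prediction",
                                                  metricName="accuracy")
    print("Test accuracy:", evaluator.evaluate(preds))

    # Impurity-based feature importances, aligned with the assembler inputs.
    print(model.featureImportances)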
First comes loading the data using Spark's read.csv utility. We next address missing
values and outliers through cleaning. We initially compute every column's missing-data
fraction. Since they have no influence on our research, columns with a high percentage of
missing data ("End_Lat," "End_Lng") are eliminated. For other columns including missing
values, we use a two-pronged approach: if the missing values are relatively rare, dropping
the affected rows helps maintain data integrity for columns like "Visibility(mi),"
"Weather_Condition," etc.
For key numerical variables like "Wind_Chill(F)," "Precipitation(in)," and
"Wind_Speed(mph)," Spark's when and otherwise functions replace missing values with the
mean value of the relevant column. This ensures that, while managing missing data, we retain
significant information. We ensure robust model training by removing outliers from numerical
features, where outliers are values more than three standard deviations from the mean of an
attribute. This phase helps prevent extreme values from distorting the model.
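A minimal sketch of these two steps (the column lists are illustrative):

    from pyspark.sql import functions as F

    # Mean imputation with when/otherwise for key numerical columns.
    for c in ["Wind_Chill(F)", "Precipitation(in)", "Wind_Speed(mph)"]:
        mean_val = df.select(F.mean(F.col(c))).first()[0]
        df = df.withColumn(c, F.when(F.col(c).isNull(), mean_val)
                              .otherwise(F.col(c)))

    # Keep rows within three standard deviations of the mean.
    for c in ["Temperature(F)", "Wind_Speed(mph)"]:
        stats = df.select(F.mean(F.col(c)).alias("mu"),
                          F.stddev(F.col(c)).alias("sigma")).first()
        df = df.filter(F.col(c).between(stats["mu"] - 3 * stats["sigma"],
                                        stats["mu"] + 3 * stats["sigma"]))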
EDA then provides feature-level analysis of the dataset and direction for model
building. The following EDA steps are performed. First we extract the hour of the day from
the "Start_Time" column and count accidents for every hour. This enables us to identify the
periods when accidents most often happen and to understand temporal patterns in accident
occurrence. We analyze the distribution of "Severity" levels in order to understand the
relative frequency of every level; this helps us handle any class-balancing issues and shows
how dispersed the classes are. Then we create a heatmap displaying accident density at
various locations using the "Start_Lat" and "Start_Lng" parameters. This clarifies the spatial
distribution of accidents and facilitates the identification of accident-prone places.
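The report does not name the mapping library used; one minimal sketch uses folium (pdf is
assumed to be the Pandas sample of the data):

    import folium
    from folium.plugins import HeatMap

    # Accident coordinates, dropping rows without a location.
    coords = pdf[["Start_Lat", "Start_Lng"]].dropna().values.tolist()

    # Interactive map with zoom in/out, centered on the continental US.
    m = folium.Map(location=[39.5, -98.35], zoom_start=4)
    HeatMap(coords, radius=8).add_to(m)
    m.save("accident_heatmap.html")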
New features are then created to enhance model performance. By grouping the
"Weather_Condition" column into more general categories (e.g., "Fair," "Cloudy," "Rain,"
"Snow," etc.), we simplify the data and raise model interpretability. This categorization
algorithm assigns labels based on important keywords discovered in the
"Weather_Condition" descriptions. We further rank precipitation, visibility, wind speed, wind
chill, temperature, and humidity into ordinal categories (e.g., "Light," "Moderate," "Heavy"
for precipitation). For the model, this helps depict continuous variables more faithfully.
Categorical variables like "Street," "City," "County," etc. are converted into numerical
equivalents by StringIndexer, allowing us to include these categories in our model
construction. VectorAssembler gathers all selected characteristics into a single feature vector
called "features", which feeds our machine learning model.
We experimented with a few models; the first, more fundamental one is a
TensorFlow/Keras-based Multilayer Perceptron (MLP).
The model consists of an input layer, hidden layers (with varying numbers of neurons and
activation functions), and an output layer with sigmoid activation. The build_model function
defines the model architecture and parameters (number of layers, neurons per layer,
activation functions, optimizer, loss function). The model is compiled with binary
cross-entropy loss and the Adam optimizer. We then train the model on the training data for
a predetermined number of epochs with a given batch size and validation split. Recording the
training history lets us monitor the model's performance as it evolves.
A more intricate model was then developed, with data preparation again done in
PySpark; it showed improved performance over the prior one. This MLP is defined by dense
layers with ReLU activation, batch normalization, and dropout for regularization; the output
layer uses sigmoid activation for binary classification. Model compilation is again guided by
the binary cross-entropy loss function and the Adam optimizer. The model is trained on
scaled training data using a defined batch size, epoch count, and validation split for
measuring performance during training. The trained model is then saved, so it is reproducible
and can be reused without retraining.
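A minimal sketch of the architecture described above; since the text specifies a sigmoid
output and binary cross-entropy, it assumes a binarized severity target, and the layer sizes,
dropout rates, and training settings are illustrative:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_model(input_dim):
        model = models.Sequential([
            layers.Input(shape=(input_dim,)),
            layers.Dense(128, activation="relu"),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(64, activation="relu"),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(1, activation="sigmoid"),  # binary output
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model

    model = build_model(X_train_scaled.shape[1])
    history = model.fit(X_train_scaled, y_train, epochs=100,
                        batch_size=256, validation_split=0.2)
    model.save("mlp_severity.keras")  # reuse later without retraining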
CHAPTER 5
RESULTS AND DISCUSSIONS
This chapter offers a thorough review of the findings from our predictive modeling
and analysis efforts. Using the US Accidents dataset, we investigated the temporal,
geographical, and meteorological aspects impacting accidents. By means of descriptive
statistics, visualizations, and model performance measures, we obtained important insight
into accident trends and assessed predictive model performance in determining accident
severity. The findings deepen our knowledge of accident dynamics and draw attention to
practical ideas for traffic safety campaigns.
Accident start times showed two clear temporal peaks: one in the morning
between 7–9 AM and another in the evening from 4–6 PM, as seen in Fig. 5.1. These peaks
line up with morning and evening rush hours, when traffic volume usually rises as people
commute. Reduced driver attention combined with more vehicle traffic as people head
to work or school might help explain the morning peak. The evening peak often corresponds
to people going home, perhaps tired after the day, which raises the risk of accidents. Effective
traffic management depends on an awareness of this hourly distribution [19], which helps
authorities allocate resources optimally and perhaps carry out preventative actions such as
changing traffic signal timings during peak hours. Public awareness efforts may also
concentrate on encouraging drivers to be careful during these high-risk times, thereby
lowering accident rates.
Fig. 5.1: Hourly distribution of accidents throughout the day
The distribution of accident severity in the dataset exposed an obvious class
imbalance, shown in Fig. 5.2. Most accidents were classed as low severity (Severity 2), while
higher-severity incidents (Severity 3 and 4) were much less prevalent. This skewed
distribution hampers predictive modeling, as models may become biased toward forecasting
the majority class, producing low accuracy for higher-severity predictions. Developing
dependable models that precisely characterize incidents of all degrees depends on addressing
this disparity. To help offset the problem, we used weighted loss functions in our modeling
procedure and oversampled the minority classes. By improving the model's sensitivity to
severe accidents, we aim to promote proactive safety measures focused on averting
high-severity events, where the stakes for injury and death are greatest.
This imbalance might still bias the model toward predicting severity level 2; however,
given the restrictions of our dataset, we had to train on the same data. With access to a larger
dataset and more computing power, the model could be trained better.
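A minimal sketch of the weighted-loss idea in Keras, assuming y_train holds integer
severity labels (scikit-learn's "balanced" heuristic is one way to derive the weights):

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    classes = np.unique(y_train)
    weights = compute_class_weight(class_weight="balanced",
                                   classes=classes, y=y_train)
    class_weight = dict(zip(classes, weights))

    # Keras scales each sample's loss by its class weight, so errors on
    # rare severity levels are penalized more heavily.
    model.fit(X_train_scaled, y_train, epochs=50, batch_size=256,
              validation_split=0.2, class_weight=class_weight)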
Fig. 5.2: Distribution of accident data by severity
Weather conditions are often major factors influencing accident frequency and
intensity. Our study found notable relationships between accident incidence and unfavorable
weather, as shown in Fig. 5.3. Consistent with past research, rain, fog, and snow were
linked to more accidents. Adverse conditions lower visibility, reduce road surface friction,
and leave drivers less time to react, all of which increase accident risk. For example, even
mild rain may make roadways slick, which can cause minor collisions from sliding. On the
other hand, heavy snow or thick fog can lead to more catastrophic accidents, as roads become
more difficult to negotiate and visibility suffers greatly.
responsible conduct during bad storms.
Fig. 5.4: An interactive heatmap with zoom-in/zoom-out feature
Overlaying the heatmap with points of interest (POIs) and road network data further
enhanced our study. For instance, accident hotspots often corresponded with places close to
businesses, restaurants, or entertainment venues, where heavy pedestrian and vehicle traffic
can increase accident risk. Higher accident frequencies are seen at major junctions and close
to public transportation hubs. Policymakers can make data-driven choices to improve road
safety in these important places by understanding the geographical links between accident
sites and surrounding infrastructure. Future studies might look more closely at how certain
road characteristics or surrounding facilities affect accident risk, providing perhaps more
exact information for safety planning.
significant association, this result justifies the inclusion of temperature as a variable in our
models, because it helps capture the larger weather-related impacts on accidents. These
relationships direct feature selection for our prediction models, enabling us to avoid
needless complexity and concentrate on the most important elements. Understanding the
interactions among these variables is crucial, as it emphasizes the linked character of accident
risks and helps improve our models for higher predictive accuracy.
5.2.1. Feature Importance
For the Random Forest and XGBoost models, the feature importance analysis depicted
in Fig. 5.6 below sheds important light on the elements most likely to affect accident
severity. Weather-related factors like precipitation and visibility were consistently among the
best indicators. Reduced visibility, usually resulting from fog or rain, connects directly to
increased accident risk, while precipitation helps create dangerous road conditions that
increase severity.
Temporal characteristics such as the hour of the day and day of the week also
contributed much to accident severity prediction. Weekend and peak-traffic accidents showed
distinct trends that underline the importance of temporal context in understanding accident
dynamics. For example, Saturday night accidents were more likely to be serious, perhaps
because of a mix of alcohol intake, tiredness, and poor attention. By identifying these main
determinants of accident severity, our study provides useful information for focused safety
actions, including an enhanced law enforcement presence during high-risk hours or better
lighting in locations seeing regular night-time accidents.
Finally, in this chapter we presented the findings of our predictive modeling and
analysis efforts, offering a complete picture of the elements affecting accident incidence
and severity. Through EDA, we found important temporal, meteorological, and geographical
trends, providing vital information for efficient traffic control and safety design. Our
predictive modeling findings highlighted the significance of model selection; XGBoost
ranked highest among the algorithms for severity classification. Analyzing feature importance
helped us find the key factors influencing accident severity, guiding further investigation and
practical applications. The results in this chapter show the potential of machine learning to
improve accident prediction and prevention initiatives, laying a solid basis for data-driven
traffic safety policies.
Examining the confusion matrix for each model helped us understand model
performance even further. These matrices expose each model's strengths and shortcomings
by offering a thorough breakdown of prediction results across the severity levels. The
XGBoost model in particular shone at accurately spotting higher-severity incidents
(Severity 3 and 4), which are vital for preventative safety campaigns. This precision in
identifying serious accidents is crucial, since it enables authorities to act quickly to prevent
such events.
Examining the confusion matrices also revealed certain limitations of each model.
Reflecting its poor ability to detect non-linear patterns, the Logistic Regression model, for
instance, tended to misclassify catastrophic accidents as lower severity. By showing more
balanced performance across all severity levels, the MLP and XGBoost models demonstrated
their flexibility in handling challenging datasets. These results highlight the value of
confusion matrices as diagnostic tools, as they help researchers grasp model behavior at a
detailed level and refine their models.
The MLP model being the primary and initial implementation of our project, Fig. 5.7
presents the confusion matrix of the trained MLP model, and Fig. 5.8 plots the loss graph for
our initial attempt at the MLP model with only 50 epochs. Fig. 5.9 plots the loss function of
the same model after training for 100 epochs on our systems.
5.2.3. Comparative Model Performance
The tables below exhibit the performance metrics of the models via side-by-side
comparison of accuracy, precision, recall, and F1-score. With a test accuracy of 95.34% and a
macro-average F1-score of 0.55, XGBoost outperformed all other models, as can be seen in
Table 5.1. XGBoost's capacity to manage unbalanced data and its resistance to overfitting,
along with its gradient-boosting process, which detects intricate data patterns and helps
differentiate degrees of accident severity, most likely account for this outstanding
performance. The Random Forest model is another useful model for this work; its
classification report, shown in Table 5.2, similarly records excellent test accuracy (95.14%)
and a good macro-average F1-score (0.50). Random Forest's ensemble method helped it
capture data patterns effectively, even if its performance lagged somewhat behind XGBoost's.
Models such as Logistic Regression, with metrics as shown in Table 5.3, and SVM, as
shown in Table 5.4, demonstrated competitive test accuracies but lower macro-average
F1-scores (0.27 and 0.32, respectively); Table 5.5 shows the classification report for the
Decision Tree model, which had similar issues, implying limits in efficiently capturing
the minority classes. Though not shown here, the Multilayer Perceptron (MLP) also showed
promise in accident severity classification. Although computationally more demanding, as a
neural network-based model the MLP offered flexibility in addressing non-linear data
interactions. The results highlight the need to choose a model fit for the particular data
features and aims of accident severity prediction. Every model has unique advantages and
disadvantages, which emphasizes the importance of carefully weighing model complexity,
processing resources, and interpretability in practical use.
Table 5.1: Classification Report for XGBoost (per-class metrics for severity classes 0–3)
Table 5.2: Classification Report for Random Forest (per-class metrics for severity classes 0–3)
Table 5.3: Classification Report for Logistic Regression (per-class metrics for severity classes 0–3)
Table 5.4: Classification Report for SVM (per-class metrics for severity classes 0–3)
Table 5.5: Classification Report for Decision Tree (per-class metrics for severity classes 0–3)
5.3. Visualization
As shown in Fig. 5.14, several important temporal and numerical aspects stand out.
Temperature distribution: the histogram shows a fairly regular distribution of temperatures,
centered around 75°F. This means that while accidents could be more common in certain
temperature zones, they happen across a whole spectrum of temperatures. The humidity
distribution also displays patterns, with accidents concentrated at particular humidity levels;
this implies that humidity may have some influence on accident incidence. Time distribution:
consistent with higher traffic flow at these times, the bar chart showing the hourly
distribution of accidents clearly displays maxima during the morning (7–9 AM) and evening
(4–6 PM).
These patterns underline the need to incorporate meteorology into accident research.
The pie chart shows the monthly accident count. Although accidents happen all year long,
there are minor fluctuations in frequency depending on the month, perhaps influenced by
seasonal elements like weather or holiday travel.
CHAPTER 6
CONCLUSION AND FUTURE IMPLEMENTATIONS
This work shows the potential of Apache Spark and TensorFlow for traffic accident
prediction and analysis. By creating an intelligent system able to recognize accident trends
and contributing causes, we provide insightful analysis supporting continuing efforts to
improve road safety. The system's capacity to forecast accident severity and highlight
important contributing elements enables the creation of focused treatments and preventative
plans. Policymakers, traffic engineers, and even individual drivers may be empowered by
this knowledge to make wise judgments that lower accident risk.
The project started with the road accidents dataset as a base, which helped us recognize
the important fields we could analyze and predict in order to ultimately reduce the number
of accidents, and also to try to provide the required support at the earliest for severe
accidents. After an initial trial of predicting accident probability led to an overfitted model,
we shifted our approach to predicting the severity of accidents based on the data we had,
alongside exploratory data analysis to get useful insights from the dataset. We then tried six
different models, namely Multilayer Perceptron (MLP), Random Forest, SVM, Logistic
Regression, Decision Tree, and XGBoost, of which we found XGBoost to be the most
accurate, with a test accuracy of 95.34% in predicting the accident severity class. We also
faced several problems while building the project, including the following:
Class Imbalance: The dataset shows a class imbalance in which certain accident
severity levels are far more common than others. For less frequent but important
severity levels, this imbalance might distort model training and produce erroneous forecasts.
Future research will center on cost-sensitive learning algorithms, undersampling
majority classes, or oversampling minority classes to handle this problem. These techniques
will help ensure that the model assigns suitable weight to every degree of severity, producing
more robust and accurate forecasts.
Data Generalizability: The performance of the present model might be restricted to the
particular features of the one-month training dataset. Future studies will investigate
geographical and temporal fluctuations in order to improve generalizability and guarantee the
model's efficacy across many situations. This might involve including elements that
directly reflect these variations within a single model, or training distinct models for various
geographical locations or periods. Moreover, we will look at ensemble techniques, which
aggregate the forecasts of many models, to raise overall predictive accuracy and resilience.
Although mean imputation was used in this work, more advanced methods of
addressing missing data might further enhance model performance. Future research will
investigate other imputation techniques such as model-based imputation or k-nearest
neighbors imputation, which may provide more accurate estimates for missing values by
exploiting feature correlations.
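A minimal sketch of the k-nearest neighbors option with scikit-learn (pdf is assumed to be
a Pandas frame of the numerical columns; the neighbor count is illustrative):

    from sklearn.impute import KNNImputer

    num_cols = ["Wind_Chill(F)", "Precipitation(in)", "Wind_Speed(mph)"]

    # Each missing value is estimated from the five most similar rows,
    # exploiting correlations among the numerical features.
    imputer = KNNImputer(n_neighbors=5)
    pdf[num_cols] = imputer.fit_transform(pdf[num_cols])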
For the future implementation of this project, we want to create a more robust,
accurate, and comprehensive accident prediction system by overcoming these constraints
and incorporating real-time data. Globally, this improved approach has great potential to help
greatly reduce accidents and increase road safety. Translating these scientific findings into
useful applications that can save lives and provide safer highways for everybody is the
ultimate aim. This might involve integrating the prediction system into traffic management
systems to dynamically adjust traffic flow and maximize safety measures, or creating mobile
apps offering drivers real-time risk evaluations.
REFERENCES
[1] World Health Organization, "Road traffic injuries," vol. 15, no. 3, pp. 123-145,
2023.
[3] World Health Organization, Save LIVES: A road safety technical package, vol.
22, no. 4, pp. 210-234, 2018.
[5] S. Wen, W. Wang, Z. Yan, X. Chen and J. Zhang, "Maritime Accident Risk
Prediction Integrating Weather Data Using Machine Learning," Transportation
Research Part D: Transport and Environment, vol. 117, p. 103646, 2024, doi:
10.1016/j.trd.2023.103646.
[8] X. Ma, J. Dai, S. Wang, Z. Yang and Q. Wu, "Traffic Accident Prediction Based
on CNN Model," in 2021 5th International Conference on Intelligent
Computing and Control Systems (ICICCS), Madurai, India, 2021, pp. 1292-1296,
doi: 10.1109/ICICCS51141.2021.9432224.
[9] J. Yuan, Y. Zheng and X. Xie, "Accident Risk Prediction based on Heterogeneous
Sparse Data," in Proceedings of the 27th ACM SIGSPATIAL International
Conference on Advances in Geographic Information Systems, Chicago, IL, USA,
2019, pp. 309-318, doi: 10.1145/3347146.3359078.
[17] X. Meng et al., "MLlib: Machine Learning in Apache Spark," Journal of Machine
Learning Research, vol. 17, no. 1, pp. 1235-1241, 2016.
[18] L. Breiman, "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001,
doi: 10.1023/A:1010933404324.