Project Report - Principles of Data Analytics DAMO-500-1
Contents
1. Introduction
Context and Relevance
Rationale
Objectives and Scope
2. Data Description
Dataset Overview
Data Types and Key Variables
Data Preprocessing
Dataset Justification
3. Research Questions and Hypotheses
Methodology
4. Results
Research Question 1
Research Question 2
Research Question 3
Research Question 4
5. Discussion
6. Limitations
7. Recommendations
8. Conclusion
References
1. Introduction
Context and Relevance
Traffic safety is a critical concern in Toronto, where the Vision Zero Road Safety Plan (RSP)
represents an ongoing commitment to eliminate fatalities and serious injuries on the city’s roads.
Despite the progress achieved through measures such as speed management, pedestrian
pressing issue.
This project focuses on analyzing fatal and serious injury collisions using the "Total Killed or
Seriously Injured (KSI) Collisions" dataset, which provides detailed records of high-severity
traffic incidents. These collisions, involving fatalities or major injuries, highlight systemic risks
in Toronto’s traffic infrastructure and patterns of road usage. Addressing these incidents is vital
to achieving Vision Zero goals, as they disproportionately affect pedestrians, cyclists, and
Global cities such as London and New York have successfully employed data-driven approaches
to reduce traffic fatalities and injuries. For example, New York City’s Vision Zero initiative
pedestrian crossings (New York City DOT, 2020). Similarly, London reduced pedestrian injuries
by 25% over five years through hotspot analyses (London Road Safety Review, 2019). This
project builds on these models, tailoring the analysis to Toronto’s unique urban dynamics to
Rationale
Localized studies analyzing fatal and serious injury collisions in Toronto are limited. By
leveraging the Toronto Police Service’s dataset, this project addresses a critical gap in
understanding traffic safety trends in the city. The dataset includes granular information, such as
collision locations, injury severities, and external factors, offering a comprehensive foundation
Assess external variables, such as weather conditions and time of day, which contribute
to collision rates.
collisions.
By addressing these objectives, the project will directly support Toronto's Vision Zero
initiative, offering insights to improve resource allocation, safety policies, and urban
planning.
Identify high-risk areas: Pinpoint neighborhoods with frequent fatal and serious injury
collisions.
Analyze collision types: Examine trends across accidents involving pedestrians, cyclists,
Assess how external variables affect collisions: Investigate the influence of weather,
Develop predictive models: Use historical data to forecast future high-risk areas and
trends.
Scope: The analysis will utilize the Total Killed or Seriously Injured (KSI) Collisions dataset
provided by the Toronto Police Service, spanning multiple years to identify both recent and long-
term trends.
2. Data Description
Dataset Overview
This study utilized the "Killed or Seriously Injured (KSI) Collisions" dataset, sourced from the
Toronto Police Service Open Data Portal. The dataset contains detailed records of traffic
collisions involving major or fatal injuries, spanning the period from 2006 to 2023. A total of
6,870 unique serious or fatal accidents were analyzed for this study, with data aggregated at the
1. Collision Details:
Injury Severity: Includes only "Major" and "Fatal" injuries for this study, as the analysis
is conducted at the accident level. The most severe injury per accident was used to
Number of Victims: Derived from unique identifiers to ensure one record per accident.
3. External Variables:
Road Conditions: Includes road surface conditions (e.g., dry, wet, snowy) to evaluate
fog).
Data Preprocessing
To ensure the dataset was suitable for robust analysis, the following preprocessing steps were
conducted:
1. Data Cleaning:
Duplicates were removed to ensure each accident was represented only once.
Missing values in key fields (e.g., visibility, road conditions) were flagged or imputed
Unique identifiers were generated for each accident by combining location and time
variables to handle cases where multiple individuals were involved in a single event.
Spatial data (latitude and longitude) were cross-referenced with Toronto’s official
Temporal data were categorized into defined time periods (Night, Morning,
Afternoon, Evening) and seasons (Winter, Spring, Summer, Fall) for comparative
analysis.
3. Data Filtering:
Only collisions involving major or fatal injuries were retained to focus the study
on high-severity events.
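The cleaning, transformation, and filtering steps above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline: the record fields (`lat`, `lon`, `timestamp`, `severity`), the hour boundaries for the four time periods, and the month-to-season mapping are all assumptions, since the report does not state them.

```python
from datetime import datetime

# Hypothetical person-level records; field names are illustrative only.
records = [
    {"lat": 43.65, "lon": -79.38, "timestamp": "2019-01-15 08:10", "severity": "Major"},
    {"lat": 43.65, "lon": -79.38, "timestamp": "2019-01-15 08:10", "severity": "Fatal"},
    {"lat": 43.70, "lon": -79.40, "timestamp": "2021-07-04 22:45", "severity": "Minor"},
    {"lat": 43.77, "lon": -79.25, "timestamp": "2022-10-30 15:30", "severity": "Major"},
]

SEVERITY_RANK = {"Major": 1, "Fatal": 2}  # only high-severity levels are retained


def time_period(hour):
    """Map an hour (0-23) to a period; the cut-offs are assumed."""
    if hour < 6:
        return "Night"
    if hour < 12:
        return "Morning"
    if hour < 18:
        return "Afternoon"
    return "Evening"


def season(month):
    """Map a month (1-12) to a meteorological season (assumed convention)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"


def to_accidents(rows):
    """Collapse person-level rows to one row per accident."""
    accidents = {}
    for row in rows:
        if row["severity"] not in SEVERITY_RANK:
            continue  # data filtering: drop anything below Major
        key = (row["lat"], row["lon"], row["timestamp"])  # location + time identifier
        when = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M")
        current = accidents.get(key)
        # keep the most severe injury recorded for the accident
        if current is None or SEVERITY_RANK[row["severity"]] > SEVERITY_RANK[current["severity"]]:
            accidents[key] = {
                "severity": row["severity"],
                "period": time_period(when.hour),
                "season": season(when.month),
            }
    return list(accidents.values())


accidents = to_accidents(records)
```

In this toy input, the two rows sharing a location and timestamp collapse into a single accident recorded as "Fatal", and the "Minor" row is filtered out.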
Dataset Justification
Relevance to Research Questions:
The dataset provides granular information on collision severity, location, and external
accidents.
Comprehensive Coverage:
Spanning 2006 to 2023 (18 calendar years), the dataset captures long-term trends and allows for robust temporal
Alignment with Objectives:
By focusing on serious and fatal collisions, the dataset aligns directly with the study’s
goal of supporting Toronto’s Vision Zero initiative and addressing systemic risks in
traffic safety.
Variables such as road conditions and visibility provide critical insights into the role of
road safety.
The "Killed or Seriously Injured (KSI) Collisions" dataset was selected for its detailed and
relevant records, providing the necessary foundation to explore and address critical research
questions. It offers robust support for identifying high-risk areas, assessing the impact of external
factors, and predicting future collision trends. Future analyses could further benefit from
integrating additional datasets, such as real-time traffic volume and pedestrian density, for a
alternative hypothesis (H₁). These are specific, measurable, and directly aligned with the
project’s objectives.
H₀ (Null Hypothesis): Collision frequencies are uniformly distributed across
neighborhoods in Toronto.
Rationale: Testing these hypotheses helps identify geographic areas requiring safety
interventions.
neighborhoods?
neighborhoods.
Rationale: Identifying the types of collisions prevalent in high-risk areas highlights vulnerable
road users and informs targeted safety measures, such as cyclist and pedestrian protections.
Research Question 3: How do external factors such as weather, time of day, and seasons
Hypothesis 3.1: Collision rates increase significantly during adverse weather conditions
• H₀ (Null Hypothesis): Weather conditions do not significantly influence collision rates.
Hypothesis 3.2: Fatal collisions are more likely to occur during peak traffic hours.
• H₀ (Null Hypothesis): Fatal collisions are not significantly more frequent during peak
Hypothesis 3.3: Seasonal variations, such as winter months, correlate with higher collision
rates.
Rationale: External factors like weather and time affect road conditions and driver behavior.
Understanding these correlations can inform safety strategies, such as enhanced enforcement or
Research Question 4: How can predictive models forecast future collision rates?
Hypothesis 4.1: Historical collision data trends accurately predict future collision rates.
H₀ (Null Hypothesis): Historical collision data trends cannot reliably predict future collision
rates.
H₁ (Alternative Hypothesis): Historical collision data trends can reliably predict future collision
rates.
Hypothesis 4.2: External factors such as weather and road conditions improve the accuracy
of predictive models.
H₀ (Null Hypothesis): Including external factors such as weather and road conditions does not
H₁ (Alternative Hypothesis): Including external factors such as weather and road conditions
Rationale: Accurate predictions based on historical data and external factors enable proactive
Methodology
This section outlines the analytical framework and statistical methods applied in this study to
address the research questions. It builds upon the data preparation steps detailed earlier, focusing
1. Analytical Techniques
Descriptive statistics provided foundational insights into the dataset and guided the subsequent
Neighborhood-Level Analysis: Summarized collision frequencies to identify high-risk
Temporal Analysis: Collisions were analyzed by time of day and season to detect
External Factors: Environmental variables, such as road conditions and visibility, were
To test the hypotheses and explore significant relationships, the following methods were applied:
neighborhoods.
2. Chi-Square Tests:
3. Pairwise Comparisons:
Pairwise Chi-Square tests were conducted for collision types (e.g., Vehicle vs.
4. Predictive Modeling:
1. Heatmaps:
high-risk areas.
2. Bar Charts:
Illustrated collision type distributions, offering a clear representation of the
3. Time-Series Plots:
Showed trends in collision counts over time and highlighted seasonal peaks.
4. Cluster Analysis:
intervention strategies.
1. Statistical Assumptions:
which were tested using Levene’s test and visual diagnostics. Non-parametric
tests like Kruskal-Wallis were employed where these assumptions did not hold.
Metrics such as R-squared, RMSE, and MAE were used to evaluate the accuracy
Bonferroni corrections adjusted p-values in pairwise Chi-Square tests to ensure
statistical validity.
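As a rough sketch of how a Bonferroni adjustment works for three pairwise comparisons, the code below multiplies each raw p-value by the number of tests. It uses the collision counts reported in Table 4 and assumes an equal split between the two groups as the null for each pair, which may differ from the expected counts used in the SPSS runs. For one degree of freedom the chi-square tail probability equals erfc(sqrt(χ²/2)), so only the standard library is needed.

```python
import math
from itertools import combinations

# Counts from Table 4 (top 10 high-risk neighborhoods)
counts = {"Vehicle": 1332, "Pedestrian": 544, "Cyclist": 165}


def pairwise_chi_square(a, b):
    """Goodness-of-fit chi-square for two counts against an equal split (assumed null)."""
    expected = (a + b) / 2
    chi2 = ((a - expected) ** 2 + (b - expected) ** 2) / expected
    p = math.erfc(math.sqrt(chi2 / 2))  # exact tail probability for df = 1
    return chi2, p


pairs = list(combinations(counts, 2))
m = len(pairs)  # number of comparisons
results = {}
for g1, g2 in pairs:
    chi2, p = pairwise_chi_square(counts[g1], counts[g2])
    p_adj = min(1.0, p * m)  # Bonferroni: scale each p-value by the number of tests
    results[(g1, g2)] = (chi2, p_adj)
    print(f"{g1} vs. {g2}: chi2 = {chi2:.1f}, adjusted p = {p_adj:.3g}")
```

Equivalently, the raw p-values can be compared against a stricter threshold of 0.05 / 3.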
4. Justification of Methods
Statistical tests and predictive models were selected based on their alignment with
2. Data Characteristics:
temporal trends.
5. Limitations
1. Data Constraints:
Missing data in key variables, such as pedestrian density and real-time traffic
2. Forecasting Uncertainty:
Wider confidence intervals for long-term predictions underscore the need for
alongside robust predictive modeling. These methods addressed the research questions
effectively while ensuring statistical rigor. Visual and spatial analyses complemented
quantitative findings, providing actionable insights for urban planning and traffic safety
interventions. This methodology establishes a strong foundation for future enhancements and
research extension.
4. Results
Research Question 1
Which neighborhoods in Toronto experience the highest collision frequencies, and how do
Hypotheses
Collision Frequencies:
Toronto.
Summary Statistics
The dataset includes 6,870 collisions across 158 neighborhoods in Toronto.
These statistics show that collision counts are highly variable across neighborhoods, with a
A. ANOVA Test
Objective: To assess whether significant differences exist in collision frequencies across
neighborhoods.
Assumptions Tested:
Homogeneity of Variances:
Levene’s test was performed to check whether the variance of collision counts is equal
across neighborhoods.
Results:
Based on the mean: p < 0.001, indicating that the assumption of equal variances is
violated.
Based on the median and trimmed mean: p > 0.05, suggesting homogeneity when these
Conclusion: While variances based on the mean are unequal, the results based on median and
trimmed mean suggest that ANOVA can still be performed with caution.
ANOVA Results:
Table 1
F-statistic: 71,446.799
p-value: < 0.001
Interpretation: The ANOVA results confirm significant differences in collision frequencies
Objective: To confirm the findings from the ANOVA test without assuming equal variances or
normality.
Results:
Table 2
Kruskal-Wallis H: 5679.160
Degrees of Freedom (df): 139
p-value: < 0.001
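For readers unfamiliar with the statistic, a minimal sketch of the Kruskal-Wallis H computation follows. The three small groups are made-up values, not the study's neighborhood data, and the tie correction is omitted for brevity, so this illustrates the rank-based idea rather than reproducing the SPSS output.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic without tie correction; compare to chi-square with k - 1 df."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical yearly collision counts for three neighborhoods
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
df = len(groups) - 1
```

Because the test works on ranks rather than raw counts, it does not require equal variances or normality, which is why it serves as a check on the ANOVA result.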
Interpretation:
The significant p-value indicates that collision frequencies vary significantly across
neighborhoods, consistent with the ANOVA results. This strengthens the conclusion that certain
neighborhoods.
Results:
Table 3
Chi-Square Value: 2140.255
Degrees of Freedom (df): 77
p-value: < 0.001
Figure 6
Interpretation: The Chi-Square test shows that observed collision frequencies deviate
Both ANOVA and Chi-Square tests indicate significant variability in collision frequencies across
neighborhoods.
Hypothesis Evaluation:
Heatmap
Findings:
High-density collision areas are concentrated in central and southern Toronto, particularly in
Figure 7 Spatial distribution heatmap of high-collision areas in Toronto
Heatmap Insights
central and southern areas of Toronto. The neighborhoods with the highest collision frequencies
are:
• Neighborhood IDs: 162, 78, 164, 165, 166, 73, 168, 72, 71, 95, 70, 85, 84
a. Central neighborhoods, such as 162, 164, and 166, are known for their dense
b. These areas are often hubs for commuting, increasing the risk of collisions during
a. Major roads and highways,
a. Neighborhoods like 85 and 84 are likely to have busy commercial districts and
transit stops, attracting higher foot traffic and public transit activity, which
frequency of collisions.
b. Road surface conditions in specific areas during adverse weather may also play a
role.
Research Question 2
What types of collisions are most prevalent in high-risk neighborhoods, and which road user
Hypotheses
Collision Frequencies:
H₁ (Alternative Hypothesis): There is a significant difference in the frequency of vehicle-
Summary Statistics
The dataset consists of collision frequencies for Vehicle, Pedestrian, and Cyclist groups across
Table 4
Collision Type Total Collisions
Vehicle 1,332
Pedestrian 544
Cyclist 165
Figure 8 Collision frequencies for vehicle, pedestrian and cyclist across top 10 high risk
neighborhoods
Interpretation:
Pedestrian collisions make up about 27% of the total (544 of 2,041), while cyclist collisions are the least frequent at about 8% (165 of 2,041).
Vehicle, Pedestrian, and Cyclist groups differ significantly from a uniform distribution.
Table 5
Collision Type   Observed Count   Expected Count   Residual
Vehicle          1,332            680.33           +651.67
Pedestrian       544              680.33           -136.33
Cyclist          165              680.33           -515.33
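The expected counts and residuals in Table 5 follow from a uniform null hypothesis. A minimal sketch of the arithmetic, using only the counts reported above, is given below. The resulting statistic is computed here for illustration and is not taken from the SPSS output.

```python
observed = {"Vehicle": 1332, "Pedestrian": 544, "Cyclist": 165}

total = sum(observed.values())
expected = total / len(observed)  # uniform null: equal count per collision type

residuals = {k: v - expected for k, v in observed.items()}
shares = {k: 100 * v / total for k, v in observed.items()}

# Pearson chi-square: sum of squared residuals scaled by the expected count
chi2 = sum(r ** 2 / expected for r in residuals.values())
df = len(observed) - 1

for k in observed:
    print(f"{k:<11}{observed[k]:>6}{expected:>9.2f}{residuals[k]:>+9.2f}{shares[k]:>7.1f}%")
print(f"chi2 = {chi2:.2f}, df = {df}")
```

With 2 degrees of freedom the critical value at the 0.001 level is 13.82, so a statistic of this size is consistent with rejecting the uniform null.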
Chi-Square Test Output (SPSS Results):
Figure 9 Frequency
Interpretation: The null hypothesis is rejected, indicating significant differences in collision
frequencies among the three groups. Vehicle collisions are significantly more frequent than
B. Pairwise Comparisons
1. Vehicle vs. Pedestrian: To identify specific differences, a pairwise Chi-Square test was
Table 6
Group Observed Count Expected Count Residual
Vehicle 1,332 945.7 +386.3
Pedestrian 544 386.3 +157.7
Figure 12 Chi-square tests
Key Statistics:
Interpretation: Vehicle collisions are significantly more frequent than pedestrian collisions in
high-risk neighborhoods.
2. Vehicle vs. Cyclist: A second pairwise Chi-Square test was conducted between Vehicle and
Cyclist collisions.
Table 7
Group     Observed Count   Expected Count   Residual
Vehicle   1,332            1,185.2          +146.8
Cyclist   165              311.8            -146.8
Chi-Square Test Output (SPSS Results):
Interpretation: Vehicle collisions are significantly more frequent than cyclist collisions in high-
risk neighborhoods.
3. Pedestrian vs. Cyclist: A final pairwise Chi-Square test was conducted between Pedestrian
Table 8
Group Observed Count Expected Count Residual
Figure 16 Pearson Chi-square test
Key Statistics:
Interpretation: Pedestrian collisions are significantly more frequent than cyclist collisions in
high-risk neighborhoods.
Statistical Evidence:
The overall Chi-Square test and pairwise comparisons confirm significant differences in collision
Hypothesis Evaluation:
Conclusion:
Vehicle collisions are the most frequent, followed by pedestrian collisions, with cyclist collisions
Research Question 3
How do external factors such as weather, time of day, and seasons influence collision
occurrences?
Hypotheses
H₀ (Null Hypothesis): Fatal collisions are not more likely to occur during peak traffic
H3.1 Results: Collision Rates and Weather Conditions
1. Descriptive Analysis
Table 9
Weather Condition   Share of Collisions   Collisions
Good                85.8%                 5,894
Moderate            11.4%                 782
Adverse             2.6%                  182
Unknown             0.2%                  12
Figure 18
Interpretation:
Most collisions (85.8%) occurred under good weather conditions. Collisions during adverse
weather were much less frequent (2.6%), likely because adverse conditions occur less often and
traffic volumes are lower when they do.
2. Statistical Analysis
χ² = 6,115.913, df = 3, p < 0.001.
Figure 19
Observed and Expected Frequencies:
Table 10
Weather Condition   Observed   Expected
Good                5,894      1,717.5
Moderate            782        1,717.5
Adverse             182        1,717.5
Unknown             12         1,717.5
Interpretation: The Chi-Square test indicates a significant difference (p < 0.001) between
observed and expected collision frequencies. Collisions in good weather were significantly
overrepresented, while collisions during moderate, adverse, and unknown weather were
underrepresented.
[Figure: bar chart of collision counts (y-axis, 0 to 6,000) by weather condition (Good, Moderate, Adverse, Unknown)]
Statistical Evidence: The Chi-Square test confirms a significant relationship between weather
Hypothesis Evaluation:
Descriptive Analysis
Summary Statistics
Collisions were categorized by peak hours (7–9 AM, 4–6 PM) and non-peak hours:
Table 11
Time Period      Non-Fatal Collisions   Fatal Collisions   Total Collisions   % Fatal
Peak Hours       3,344                  489                3,833              12.8%
Non-Peak Hours   2,569                  468                3,037              15.4%
Total            5,913                  957                6,870              13.9%
Table H3.2.1.1 Collision Counts by Peak and Non-Peak Hours
Interpretation
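Peak hours account for slightly more fatal collisions in absolute terms (489 vs. 468), but the
fatal share is lower during peak hours (12.8%) than during non-peak hours (15.4%). Non-fatal
collisions dominate overall.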
Statistical Analysis
Chi-Square Test
Objective: To test whether fatal collisions are more likely during peak hours than during
non-peak hours.
Results:
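The results indicate a statistically significant relationship between peak hours and fatal
collisions. As Table 11 shows, however, the share of fatal collisions is lower during peak hours
(12.8%) than during non-peak hours (15.4%), so the association runs opposite to the direction
stated in Hypothesis 3.2.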
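The direction and strength of the association can be checked directly from the counts in Table 11. The sketch below computes a Pearson chi-square test of independence on the 2×2 table without a continuity correction. That choice is an assumption, as the report does not say which variant SPSS reported, so the figures need not match the original output.

```python
import math

# Rows: peak, non-peak; columns: non-fatal, fatal (counts from Table 11)
table = [[3344, 489], [2569, 468]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n  # expected count under independence
        chi2 += (obs - exp) ** 2 / exp

p_value = math.erfc(math.sqrt(chi2 / 2))  # exact tail probability for df = 1

fatal_share_peak = table[0][1] / row_totals[0]
fatal_share_non_peak = table[1][1] / row_totals[1]

print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
print(f"fatal share: peak {fatal_share_peak:.1%}, non-peak {fatal_share_non_peak:.1%}")
```

Reporting the two fatal shares alongside the test statistic makes the direction of the effect explicit, which the chi-square value alone does not convey.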
Figure 22 Summary of collisions during peak and non-peak hours
Descriptive Analysis
Table 12
Season Non-Fatal Collisions (%) Fatal Collisions (%) Total Collisions (%)
Interpretation
Non-Fatal collisions dominate across all seasons (~86%), and seasonal differences in Fatal
Statistical Analysis
Chi-Square Test
Results:
χ² = 4.734, p = 0.192.
Interpretation:
The results indicate no statistically significant relationship between seasons and collision rates.
Research Question 4
How can predictive models forecast future collision rates?
Objective: Develop predictive models using historical collision data to identify and forecast
future collision rates. Incorporate factors such as weather, time, and collision types to build a
Hypotheses:
H4.1: Historical collision data trends accurately predict future collision rates.
H₀ (Null Hypothesis): Historical collision trends cannot reliably predict future collision rates.
H₁ (Alternative Hypothesis): Historical collision trends can reliably predict future collision
rates.
Forecasts: 2024-2028
Base Model used: SARIMA Model (Seasonal Autoregressive Integrated Moving Average)
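The SARIMA fit itself is not reproduced here. As a point of comparison, the sketch below implements a seasonal-naive forecast (each month repeats the value from twelve months earlier), a simple baseline that a seasonal model would normally be expected to beat on held-out data. The monthly counts are invented for illustration, and the technique is a stand-in, not the report's model.

```python
# Hypothetical monthly collision counts for two consecutive years (Jan..Dec)
history = [
    28, 25, 27, 30, 34, 38, 41, 40, 37, 35, 31, 29,
    27, 24, 28, 31, 33, 39, 42, 41, 36, 34, 32, 28,
]
SEASON_LENGTH = 12  # monthly data with a yearly cycle


def seasonal_naive(series, horizon, season_length=SEASON_LENGTH):
    """Forecast each future month as the value observed one season earlier."""
    if len(series) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = series[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and forecast values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hold out the final year, forecast it from the first year, and score the baseline
train, holdout = history[:SEASON_LENGTH], history[SEASON_LENGTH:]
baseline = seasonal_naive(train, horizon=len(holdout))
baseline_mae = mean_absolute_error(holdout, baseline)
```

A fitted model whose holdout error is not clearly below `baseline_mae` adds little beyond repeating last year's pattern, which gives a concrete yardstick for the claim in H4.1.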
Figure 25 Historical and predicted collision trends
Key Findings:
Figure 26 Monthly collision patterns
Conclusion for H4.1
Statistical Evidence: The model evaluation confirms that historical collision data significantly
Hypothesis Evaluation:
Conclusion: Historical collision data trends accurately forecast future collision rates.
Hypothesis 4.2: External factors such as weather and road conditions improve the accuracy
of predictive models.
H₀ (Null Hypothesis): Including external factors such as weather and road conditions does not
H₁ (Alternative Hypothesis): Including external factors such as weather and road conditions
Figure 27 Comparison of predictive model accuracy with and without external factors
Conclusion for H4.2
Statistical Evidence: Model comparison indicates that including external factors (e.g., weather
and road type) significantly improves predictive accuracy (R² = 0.842, p < 0.05).
Hypothesis Evaluation:
Conclusion: External factors such as weather and road type significantly enhance the accuracy of
The analysis confirms that external factors significantly improve collision predictions, with
weather being the most influential factor. Based on this finding, it is recommended to incorporate
both weather and road condition data in future predictive models to enhance forecasting
accuracy.
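The accuracy metrics named in the Methodology (R-squared, RMSE, and MAE) can be computed as in the sketch below. The observed and predicted values are hypothetical and stand in for the study's evaluation data, which is not reproduced in this report.

```python
import math


def evaluate(actual, predicted):
    """Return R-squared, RMSE, and MAE for paired observations and predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Hypothetical monthly collision counts and two sets of predictions
actual = [30, 34, 38, 41, 37, 31]
without_factors = [33, 33, 35, 38, 38, 34]
with_factors = [31, 34, 37, 40, 37, 32]

base = evaluate(actual, without_factors)
extended = evaluate(actual, with_factors)
```

A lower RMSE and MAE together with a higher R² for the extended model would indicate that the added factors help, although the comparison should be made on held-out data to rule out overfitting.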
5. Discussion
The findings of this study provide critical insights into traffic safety in Toronto, focusing on the
patterns, external factors, and predictive modeling of fatal and serious injury collisions. These
insights carry significant implications for urban planning, road safety policy, and resource
allocation.
Broader Implications
High-Risk Neighborhoods
actionable data for targeted safety interventions. High-risk neighborhoods, primarily located in
central and southern Toronto, correlate with high population densities, traffic volumes, and the
presence of commercial and transit hubs. These findings emphasize the need for location-specific
Traffic calming measures (e.g., speed bumps, reduced speed limits).
Collision Types
of pedestrians and cyclists underscores the need for tailored safety interventions, including:
Reduced vehicle speeds in areas with significant pedestrian and cyclist activity.
Awareness campaigns to educate drivers on sharing the road with vulnerable road users.
External Factors
Adverse weather conditions and peak traffic hours significantly influence collision rates,
The minimal impact of seasonal variations suggests a more localized focus on daily and weekly
Predictive Modeling
The strong performance of predictive models, particularly those incorporating external factors,
validates their use for forecasting future collision rates and hotspots. This capability enables:
Data-driven strategies for anticipating and mitigating collision risks.
Urban Planning
The findings provide actionable insights for designing safer roadways, pedestrian zones, and
cycling lanes in high-collision areas. Incorporating these insights into urban development plans
Policy Development
Evidence-based enforcement policies, such as increased monitoring during adverse weather and
peak hours, are supported by the study's findings. These strategies align with broader road safety
initiatives.
The study supports Toronto’s Vision Zero initiative by offering data-driven strategies to
eliminate serious and fatal collisions. By focusing on high-risk neighborhoods and collision
6. Limitations
Data Constraints
The absence of variables like pedestrian density and real-time traffic volume limited the scope of
some analyses. Incorporating such data in future studies would allow for more nuanced findings.
Prediction Uncertainty
Long-term forecasts exhibited wider confidence intervals, reflecting uncertainty over extended
timeframes. Regular updates to predictive models using new data are essential to maintain
accuracy.
Simplified Categories
Grouping injury severity at the accident level excluded nuances in individual injuries, which
Mitigation Strategies
Enhancing Data Integration: Future research should incorporate additional datasets, such
7. Recommendations
Neighborhood-Specific Measures
Prioritize traffic enforcement during peak traffic hours and adverse weather conditions.
Medium-Term Planning (2025–2026)
Resource Allocation
Develop resource allocation plans targeting seasonal patterns and high-risk time periods,
Infrastructure Investments
Upgrade road surface conditions and implement advanced traffic control systems in high-
collision neighborhoods.
Public Awareness
Launch targeted education campaigns emphasizing pedestrian and cyclist safety in high-
risk areas.
Integrate real-time traffic and weather data into predictive models for dynamic safety
planning.
Create tailored emergency response strategies for areas with high seasonal and temporal
risk patterns.
8. Conclusion
This project highlights the critical role of data-driven approaches in understanding and mitigating
traffic safety risks. By focusing on Toronto's serious and fatal collision patterns, the study not
only provides a framework for identifying high-risk areas but also enhances the predictive
Key insights into collision types, environmental factors, and temporal trends offer actionable
knowledge that bridges the gap between traffic data and practical interventions. The predictive
models developed demonstrate strong reliability, underscoring their potential for forecasting
The methodological rigor applied, including spatial and statistical analyses, contributes to the
growing body of knowledge in urban traffic safety. By integrating findings with policy
recommendations, this project provides a roadmap for cities aiming to replicate or adapt the
Moving forward, the research invites further innovation in real-time data integration and
advanced analytics. It emphasizes the need for continuous collaboration between planners,
policymakers, and enforcement agencies to address dynamic urban safety challenges. The project
not only supports Toronto's Vision Zero ambitions but also serves as a template for advancing
References
CARSP. (2021). Evaluating the effectiveness of left-turn calming measures in Toronto.
https://www.carsp.ca
City of Toronto. (2023). Vision Zero Road Safety Plan Overview. Toronto Vision Zero.
transportation/road-safety/vision-zero/
London Road Safety Review. (2019). Five-Year Road Safety Summary. London City Safety
do/environment/london-road-safety-review
New York City DOT. (2020). Vision Zero Annual Report. NYC Department of Transportation.