Lab Project



Course title: Big Data and IOT Lab


Course Code: CSE 413

Submitted to:
Mr. Amir Sohel
(Senior Lecturer, Computer Science and Engineering)

Submitted by:
Name: Pritom Das Shanto    ID: 201-15-13748
Name: Sazid Ahmed Tonmoy   ID: 201-15-13774
Section: 55-G

Methodology: The graphic prediction methodology employed in this
project involves a step-by-step process to anticipate and visualize
trends, patterns, or outcomes within a given dataset. The methodology
is designed to provide insights into the data through the creation of
predictive graphics. By using this data and the power of computer
modeling, we aim to help the government make smarter decisions
about how to manage traffic. Whether it's improving roads or planning
for the future, our goal is to use data to make our cities work better for
everyone. Throughout this report, we'll take you on a journey—from
understanding the data we have to building computer models that can
predict future traffic.

Proposed Model: In our quest to unravel the intricacies of traffic
prediction in the United Kingdom, we propose the implementation of a Decision Tree
model as the cornerstone of our predictive analytics framework.
Decision Trees, a form of machine learning algorithm, excel at
deciphering complex relationships within datasets. Specifically tailored
to our traffic prediction objectives, the Decision Tree will autonomously
learn from the features embedded in our dataset. These features
include the spatial dimensions of roads, the density of vehicles, and
additional contextual factors. The Decision Tree functions like a virtual
traffic analyst, creating a hierarchical flowchart of decisions based on
the observed patterns in the data. This model holds the potential to
capture both straightforward and nuanced dependencies, making it an
ideal candidate for discerning the multifaceted nature of traffic
dynamics. By strategically branching through the data, the Decision Tree
endeavors to uncover the latent rules governing traffic volumes in
various local authorities. Our research methodology involves the
fine-tuning of hyperparameters within the Decision Tree, ensuring
optimal performance in capturing the underlying patterns while
guarding against overfitting. Furthermore, the model's interpretability
allows us to glean valuable insights into the relative importance of
different features, contributing to a more profound understanding of
the factors influencing traffic variations. The proposed Decision Tree
model aligns with the broader objective of not only predicting traffic
but also shedding light on the intricate interplay of variables shaping
traffic patterns. This research endeavors to harness the predictive
power of machine learning to inform more effective urban planning and
traffic management strategies.
Project Pipeline (flowchart summary):

1. Data Collection: gather the UK government dataset; check dataset quality.
2. Data Preprocessing: handle missing values; label-encode categorical variables.
3. Exploratory Data Analysis (EDA): visualize data distributions; explore correlations between variables.
4. Feature Engineering: derive new features if needed.
5. Modeling: Linear Regression, RFE, K-NN, Ridge and Lasso Regression, Decision Trees, Random Forest.
6. Hyperparameter Tuning: optimize the Decision Tree's hyperparameters.
7. Model Evaluation: evaluate model performance using metrics (e.g., MAE).
8. Conclusion: summarize key findings and provide insights for future work.
Dataset Overview: The dataset at the core of this traffic prediction
study is an invaluable compilation provided by the UK government,
shedding light on the intricate tapestry of traffic dynamics within local
authorities during the year 2019. Through meticulous curation, this
dataset captures a diverse array of features crucial to understanding
vehicular movement, including local authority identifiers, road link
lengths, and counts of cars and taxis. The dataset's temporal focus on
2019 provides a contemporary lens for analysis, offering a snapshot of
traffic conditions during a pivotal period. With structured organization
and thoughtful preprocessing, the dataset serves as the bedrock for our
endeavor to unravel the complexities of traffic patterns and contribute
to informed decision-making in urban planning and transportation
management.

Traffic Volume Prediction:


Local Authority ID: A unique identifier for each local authority.
Local Authority Name: The name of the local authority.
Year: The year for which the data is recorded.
Link Length (km): The total length of road links in kilometers within the local authority's jurisdiction.
Link Length (miles): The total length of road links in miles within the local authority's jurisdiction.
Cars and Taxis: The total number of registered cars and taxis within the local authority.
All Motor Vehicles: The total number of all registered motor vehicles within the local authority.

Dataset Preprocessing and Analysis:

The journey toward insightful traffic predictions begins with a rigorous
dataset preprocessing phase. Originating from UK government records,
the dataset was meticulously refined to address inherent challenges,
ensuring data quality and preparing it for in-depth analysis and
modeling.

Handling Missing Values: The identification and handling of missing
values were paramount in preserving the dataset's integrity. An initial
assessment revealed the distribution of missing values across features.
Imputation strategies, tailored to each variable's characteristics, were
employed, balancing data completeness without introducing bias.
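The imputation code itself is not reproduced in this report; as a hedged sketch, a typical median/mode imputation step with pandas might look like the following, using toy values in place of the real dataset:

```python
import pandas as pd

# Toy stand-in for the UK traffic dataset (column names follow the report).
df = pd.DataFrame({
    "link_length_km": [12.5, None, 8.1, 20.3],
    "local_authority_name": ["London", "Manchester", None, "London"],
})

# Numeric feature: impute with the median, which resists outliers.
df["link_length_km"] = df["link_length_km"].fillna(df["link_length_km"].median())

# Categorical feature: impute with the most frequent value (mode).
df["local_authority_name"] = df["local_authority_name"].fillna(
    df["local_authority_name"].mode()[0]
)

missing_after = int(df.isna().sum().sum())  # 0 once imputation is complete
```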

Encoding Categorical Variables: Given the categorical nature of certain
features, particularly 'local_authority_name,' a meticulous encoding
process was undertaken. This transformation converted textual labels
into numerical representations, facilitating the seamless integration of
categorical information into machine learning models.

Outlier Treatment: Outliers, disruptive elements potentially skewing
model performance, underwent scrutiny. The quantile method was
employed to robustly identify and adjust extreme values. This step
aimed not only to enhance model resilience but also to glean insights
into potential anomalies within the traffic data.
Data Scaling: Recognizing the diverse scales of features, a meticulous
scaling process was implemented. Features such as 'link_length_km' and
'link_length_miles' underwent normalization to a consistent scale. This
ensures that all features contribute proportionately to model training,
mitigating the influence of variables with inherently larger magnitudes.

Exploratory Data Analysis (EDA): Beyond mere preprocessing, the
dataset was subjected to comprehensive Exploratory Data Analysis
(EDA). Visualizations, statistical summaries, and correlation analyses
were employed to unravel hidden patterns, providing a profound
understanding of feature relationships and setting the stage for
subsequent modeling.

Final Dataset: The culmination of these preprocessing steps yielded a
refined dataset primed for analytical exploration. By tactically
addressing missing values, encoding categorical variables, treating
outliers, and scaling features, the dataset stands as a robust foundation
for predictive modeling. The analytical depth achieved through EDA
ensures that the dataset not only meets the technical requirements of
machine learning models but also harbors rich insights into the
intricacies of traffic dynamics in the United Kingdom.

Label Encoder: Transforming Categories into Numbers

In the realm of machine learning, algorithms often require numerical input, making it
necessary to convert categorical variables into a format that can be
effectively processed. This is where the Label Encoder steps in,
providing a systematic way to transform categorical labels into
numerical representations.

Process Overview: The Label Encoder operates on the principle of
assigning a unique numerical value to each distinct category within a
categorical variable. This transformation is crucial for algorithms that
rely on mathematical operations, allowing them to interpret and learn
from categorical data.
Example: Consider the 'local_authority_name' variable, initially
containing distinct city names. The Label Encoder would assign a
unique numerical code to each city, such as:
London: 0
Manchester: 1
Birmingham: 2
...
This transformation ensures that the categorical information is converted
into a format compatible with machine learning algorithms.
Impact on Model: The effectiveness of label encoding depends on the
specific characteristics of the dataset and the chosen machine learning
algorithm.
In essence, the Label Encoder acts as a bridge, enabling the translation
of categorical diversity into a numerical language that machine learning
models comprehend, thus enhancing the analytical capabilities of our
traffic prediction project.
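A minimal scikit-learn sketch of this step follows. Note that scikit-learn's LabelEncoder assigns codes in alphabetical order of the class names, so the actual numbers may differ from the illustration above:

```python
from sklearn.preprocessing import LabelEncoder

cities = ["London", "Manchester", "Birmingham", "London"]

le = LabelEncoder()
codes = le.fit_transform(cities)  # alphabetical: Birmingham=0, London=1, Manchester=2

# The mapping is invertible, so encoded values translate back to names.
decoded = le.inverse_transform(codes)
```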

Quartiles Method for Outlier Management in Traffic Data

In the quest for accurate traffic predictions, the Quartiles Method emerges as a
pivotal player in ensuring the quality and reliability of our dataset.
Outliers, those data points deviating significantly from the norm, possess
the potential to skew analyses and compromise the effectiveness of
machine learning models. To tackle this challenge, we leverage the
Quartiles Method, a statistical technique designed to systematically
identify and address outliers while preserving the integrity of our
traffic-related features.
The method operates by dividing our dataset into quartiles—Q1, Q2 (the
median), and Q3—each representing 25% of the data. The Interquartile
Range (IQR), calculated as the difference between Q3 and Q1, forms the
basis for identifying outliers. By establishing upper and lower bounds
derived from the IQR, we create a robust framework for detecting data
points that fall beyond acceptable ranges. These outliers are then subject
to careful adjustment or removal, a process that ensures our subsequent
predictive models are not unduly influenced by irregularities in the data.
Applied specifically to features such as 'link_length_km' and
'link_length_miles,' the Quartiles Method becomes an integral
component of our preprocessing pipeline. Its strength lies not only in its
statistical rigor but also in its ability to adapt to diverse data
distributions, making it a versatile tool in the pursuit of accurate traffic
predictions. This approach aligns with our commitment to robust data
preparation, ensuring that our models are trained on a representative
dataset that captures the nuances of traffic dynamics within the United
Kingdom's local authorities.
Define Outlier Bounds:
Lower Bound: Q1 - 1.5 * IQR
Upper Bound: Q3 + 1.5 * IQR
Any data point below the Lower Bound or above the Upper Bound is
considered an outlier.
Outliers can be adjusted or removed, depending on the nature and
context of the data.
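The bounds above translate directly into pandas; the sketch below uses toy values, since the real feature columns are not reproduced here:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])  # 95.0 is a clear outlier

# Quartiles and the interquartile range.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Standard 1.5 * IQR fences.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

n_outliers = int(((s < lower) | (s > upper)).sum())
clipped = s.clip(lower, upper)  # adjust extreme values to the bounds
```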

MinMaxScaler: Scaling Features for Enhanced Model Performance.


In the realm of traffic prediction, the MinMaxScaler emerges as a vital
tool for harmonizing the diverse scales of our dataset features. Designed
to transform numerical data into a specified range, the MinMaxScaler
ensures that each feature contributes proportionately to model training,
fostering a more accurate and efficient predictive process.

Scaling Process Overview: The MinMaxScaler operates by transforming
each feature to a specified range, typically between 0 and 1. The formula
for scaling is as follows:

X_scaled = (X - X_min) / (X_max - X_min)

Here, X represents the original feature values, X_min and X_max are the
feature's minimum and maximum, and X_scaled is the rescaled value.
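A short scikit-learn sketch of the same transformation, with toy values standing in for a feature such as 'link_length_km':

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[2.0], [4.0], [10.0]])  # toy feature column

scaler = MinMaxScaler()               # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)    # (X - X_min) / (X_max - X_min)
```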

Train-Test Split: Enabling Robust Model Evaluation

In the realm of traffic prediction, the Train-Test Split methodology plays a
crucial role in ensuring the
reliability and generalizability of our machine learning models. This
approach is designed to systematically divide our dataset into training
and testing sets, providing a clear distinction between the data used for
model training and the data reserved for evaluation.
Process Overview:
Data Division: The dataset, encompassing features like 'link_length_km'
and 'link_length_miles,' is strategically divided into two subsets: a
training set and a testing set.
Training Set:
The training set forms the foundation for model learning. Machine
learning algorithms utilize this portion to identify patterns and
relationships within the data, enabling them to make predictions.
Testing Set:
The testing set, distinct from the training set, remains untouched during
the model learning process. Once the model is trained, it is evaluated on
this independent subset to assess its performance on new, unseen data.
Evaluation Metrics: Metrics such as Mean Absolute Error (MAE), Mean
Squared Error (MSE), or R-squared are often employed to quantify the
model's predictive accuracy on the testing set.
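The split itself is a single call in scikit-learn. A sketch with placeholder arrays, using the 80/20 ratio adopted later in this report:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # placeholder feature matrix
y = np.arange(50)                  # placeholder target

# 80% of the rows for training, 20% held out; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```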

Data Visualization: Traffic Dynamics

Embarking on a journey to decipher the complexities of traffic dynamics
within the United Kingdom, our approach integrates meticulous data exploration and
visualization. Through a symbiotic dance of statistical analyses and
graphical representations, we illuminate the patterns inherent in our
dataset, laying the groundwork for informed and nuanced traffic
predictions.

Exploratory Data Analysis (EDA):


Histograms and Distribution Plots:
Visualizing the distribution of features, such as 'link_length_km' and
'link_length_miles,' unveils the inherent characteristics of these
variables. This aids in identifying central tendencies and outliers,
offering a foundational understanding of the dataset.
Scatter Plots: Delving into relationships between numerical variables,
particularly exploring how 'cars_and_taxis' correlates with the overall
'all_motor_vehicles' count, provides a visual narrative of traffic
dynamics. Scatter plots breathe life into data points, revealing trends that
statistical measures might overlook.
Correlation Heatmaps: Correlation heatmaps serve as a compass,
guiding us through the intricate web of relationships within the dataset.
They highlight the strength and direction of connections between
different features, offering key insights into variables that significantly
impact traffic patterns.
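As a hedged sketch of the correlation step, the matrix can be computed with pandas on a miniature stand-in table (the seaborn heatmap call is shown commented out, as it requires a display):

```python
import pandas as pd

# Miniature stand-in for the traffic table.
df = pd.DataFrame({
    "link_length_km":    [10.0, 20.0, 30.0, 40.0],
    "link_length_miles": [6.2, 12.4, 18.6, 24.8],
    "cars_and_taxis":    [100, 220, 290, 410],
})

corr = df.corr()  # pairwise Pearson correlations

# In a notebook, this matrix is typically rendered as a heatmap:
# import seaborn as sns; sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Because link length in miles is a fixed multiple of link length in kilometers, those two columns correlate perfectly, which is itself a useful EDA finding (one of them is redundant).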
Feature Engineering Insights:
Derived Feature Plots:
Visualizing newly engineered features adds a layer of depth to our
understanding. These plots provide qualitative insights into how these
derived dimensions contribute to predicting traffic volumes, informing
the feature selection process.
Outlier Detection and Handling:
Box Plots:
Box plots stand as sentinels, helping identify outliers within features like
'cars_and_taxis' or 'link_length_km.' Their visual representation allows
us to make informed decisions about the treatment of extreme values,
ensuring the integrity of our analyses.

Model Evaluation:
Predicted vs. Actual Plots:
The journey culminates in visualizing the performance of our predictive
models. Predicted vs. actual plots offer a tangible view of the accuracy
of our predictions, guiding us towards a deeper understanding of model
behavior.
Residual Plots: Residual plots provide a lens into the discrepancies
between predicted and actual values, offering insights into systematic
errors and areas for model refinement.

In essence, data visualization becomes the tapestry that weaves together
the narrative of traffic data. It goes beyond numbers, providing a visual
language that enhances our understanding and empowers us to build
models that capture the essence of traffic dynamics in the intricate
mosaic of the United Kingdom's local authorities.

Dataset Splitting:
Features were fitted to the train set after the split was complete. Because
the dataset was not fully balanced, we validated the model with several
cross-validation techniques. Using the feature space, the train dataset
size is 80% and the test dataset size is 20%.
Linear regression is a statistical method used to model the relationship
between a dependent variable (also called the target or response
variable) and one or more independent variables (predictors or
features). The basic idea is to find the best-fitting linear relationship
that explains the variation in the dependent variable based on the
independent variables. The model's performance is evaluated using
various metrics, including Mean Squared Error (MSE) = 0.00018 and
R-squared = 0.9930. These metrics help assess how well the model
predicts the dependent variable.
Predictions: Once trained, the linear regression model can be used to
make predictions on new or unseen data by plugging in values for the
independent variables. Linear regression is a foundational and
interpretable model commonly used for predicting numerical
outcomes. However, it assumes a linear relationship, and its
performance may be affected if this assumption is not met.
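A minimal sketch of fitting and scoring a linear regression follows; synthetic near-linear data stands in for the traffic features, so the MSE and R-squared reported above come from the real dataset, not this toy:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.01, size=100)  # near-linear target

model = LinearRegression().fit(X, y)   # ordinary least squares fit
pred = model.predict(X)

mse = mean_squared_error(y, pred)
r2 = r2_score(y, pred)
```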
Recursive Feature Elimination (RFE) is a feature selection technique
commonly used in machine learning to improve model performance
and reduce overfitting. The main idea behind RFE is to recursively
remove the least important features from the model until the optimal
subset of features is identified.

The criteria for selecting the optimal subset of features can vary. It
could be based on a certain number of features, or it could involve
monitoring the model's performance (e.g., cross-validated accuracy) as
features are removed.
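A sketch of RFE with scikit-learn, using a synthetic matrix in which only two of five features carry signal:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] + 2.0 * X[:, 2]  # only columns 0 and 2 matter

# Recursively drop the weakest feature (smallest coefficient) until two remain.
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

selected = sorted(np.flatnonzero(selector.support_))
```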

K-Nearest Neighbors (K-NN) is a simple and intuitive machine
learning algorithm used for both classification and regression tasks. The
basic idea behind K-NN is to predict the class (for classification) or value
(for regression) of a data point by looking at the "k" data points nearest
to it in the feature space. For a classification task, the majority class
among the 'k' nearest neighbors is assigned to the data point in
question; this is often referred to as a majority voting mechanism. For
regression, the average of the neighbors' values is used instead. Because
K-NN relies on distances, it can degrade in high-dimensional feature
spaces, where feature scaling or dimensionality reduction methods may
be applied to address these issues.
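A small regression sketch of the neighbor-averaging idea with toy one-dimensional data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

# Each prediction is the mean target of the 3 nearest training points.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
pred = knn.predict([[2.5], [10.5]])
```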

Ridge Regression and Lasso Regression are regularization techniques
used in linear regression to prevent overfitting and improve the model's
generalization performance. Both methods add a regularization term to
the linear regression objective function, influencing the model's
coefficients.
Ridge Regression:
Mean Squared Error: 0.00019135976636646827
R-squared: 0.9925488990648439
Lasso Regression:
Mean Squared Error: 0.025685953685680597
R-squared: -0.00015085282465543415
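A sketch contrasting the two penalties on synthetic data: the L2 penalty in Ridge shrinks all coefficients, while the L1 penalty in Lasso tends to drive irrelevant coefficients toward zero. The alpha values are illustrative, not the ones used in the project:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1]   # feature 2 is irrelevant

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty: can zero coefficients out

ridge_r2 = ridge.score(X, y)
lasso_irrelevant_coef = float(lasso.coef_[2])  # expected to be near zero
```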
Decision Tree: This classifier falls under the category of decision tree
classifiers, which construct a tree model from training data for use in
prediction. It is an upgrade to the earlier ID3 decision tree classifier. By
applying this classifier, we obtain 90% accuracy.

Random Forest: This classifier is a member of the ensemble learning
family, which constructs numerous tree models from the training set
that can be utilized for prediction at a later stage. It is far less prone to
the overfitting issue that single decision trees frequently have. By
applying this classifier, we obtain 91% accuracy.
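A minimal sketch of the ensemble idea on synthetic data (synthetic because the real dataset is not reproduced here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] + X[:, 1]  # smooth target the forest should recover

# 100 trees, each trained on a bootstrap sample; predictions are averaged.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
train_r2 = rf.score(X, y)
```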

Hyperparameter Tuning for Decision Trees:


In the process of building and optimizing our predictive model for traffic
prediction, we employed a Decision Tree algorithm and utilized
hyperparameter tuning to enhance its performance. The tuning was
carried out using Grid Search Cross-Validation, which exhaustively
searches through a predefined hyperparameter space to find the
combination that yields the best model performance.
The tuned Decision Tree model was then evaluated on our test dataset.
The results are as follows:

Mean Squared Error (MSE): 0.00021811152483390198
R-squared (R²): 0.991507248271061

The low MSE indicates that the model's predictions are close to the
actual values, while the high R-squared value suggests that a significant
proportion of the variance in the target variable is captured by the model.
This fine-tuned Decision Tree model showcases the effectiveness of
hyperparameter optimization in enhancing predictive accuracy,
providing a robust foundation for traffic prediction in our project.
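A hedged sketch of the tuning loop follows; the report does not list the exact hyperparameter space searched, so the grid below is illustrative only:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2.0 * X[:, 0] + X[:, 1]  # synthetic stand-in target

# Illustrative grid; the project's actual search space is not given.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]}

# Exhaustively evaluate every combination with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
best_tree = search.best_estimator_  # refit on the full data by default
```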

Tuned Random Forest Model Evaluation


In the pursuit of optimizing our predictive model for traffic prediction,
we turned to the Random Forest algorithm and conducted a thorough
tuning process. The purpose was to enhance the model's performance by
fine-tuning its hyperparameters. After the tuning process, we evaluated
the model on our test dataset to assess its predictive capabilities.
The performance of the tuned Random Forest model was evaluated
using two key metrics:
Mean Squared Error (MSE): 0.00021654343020907377
R-squared (R²): 0.9915683062016141

Result:

Model Name           | Test MSE       | Test R²      | Test Acc. | Train MSE     | Train R²     | Train Acc.
Linear Regression    | 0.0018069      | 0.930110     | 93%       | 0.0004155901  | 0.9239424822 | 92.39%
Random Forest        | 0.0002229957   | 0.98104599   | 98.10%    | 0.0018768     | 0.9803452    | 98%
K-NN                 | 0.000441089    | 0.92282501   | 92.28%    | 2.37375195    | 0.90907383   | 90%
Ridge Regression     | 0.0001913596   | 0.9125488990 | 91.25%    | 0.000427806   | 0.903308     | 90.33%
Lasso Regression     | 0.025685953    | -0.00155828  | --        | 0.0256299     | 0.0          | --
Decision Tree        | 0.000317148    | 0.98765097   | 98.77%    | 0.00015306    | 0.974028     | 97.4%
Tuned Random Forest  | 0.00021654343  | 0.9915683062 | 99.15%    | 0.000228005   | 0.9911039    | 99.11%
Tuned Decision Tree  | 0.000218111521 | 0.9916072483 | 99.16%    | 0.00021654343 | 0.9912683062 | 99.12%
Result Analysis: Comparing the results obtained by implementing all
of the above algorithms, the best algorithm is the Tuned Decision Tree,
with an accuracy of 99.16%. The second best is the Tuned Random
Forest, at 99.15%. The weakest of these algorithms is Ridge Regression,
which reached an average score of only 91.25%, the lowest compared
with the Tuned Decision Tree.
