
Thesis Machine Learning

Uploaded by

Jayed Sabit99
Copyright
© © All Rights Reserved

Thesis for the Degree of B.Sc. Engineering

Predicting the Price of Used Bike using Machine Learning Techniques
Course Title: Project/Thesis

Course Code: CSE488

Submitted By

Sajib Kumar Roy

Student ID: 17CSE007

Department of Computer Science & Engineering


Bangabandhu Sheikh Mujibur Rahman Science and Technology University
Gopalganj, Bangladesh

May 17, 2023


Thesis Approval

The thesis titled “Predicting the Price of Used Bike using Machine Learning Techniques”
submitted by ID NO: 17CSE007 has been accepted as satisfactory in partial fulfillment of the
requirement for the degree of Bachelor of Science in Computer Science and Engineering (B.Sc.
Engg.) in Bangabandhu Sheikh Mujibur Rahman Science and Technology University.
1. ---------------------------------------------------------------------- Supervisor
Abu Bakar Muhammad Abdullah
Assistant Professor
Department of Computer Science and Engineering
BSMRSTU
2. ---------------------------------------------------------------------- Chairman
Dr. Saleh Ahmed
Associate Professor and Chairman
Department of Computer Science and Engineering
BSMRSTU
3. ---------------------------------------------------------------------- Examiner
..................................................
..................................................
..................................................
..................................................
Declaration
It is hereby declared that the contents of this thesis are original and that no part of it has been
submitted elsewhere for the award of any degree or diploma.

-------------------------------------- ------------------------------------
Signature of the Supervisor Signature of the Candidate
Date: Date:
Table of Contents

Acknowledgment
Abstract
1 Introduction
1.1 Institution
1.2 Motivation
1.3 Objective
1.4 Specific Aims
2 Literature Review
3 Basics of Machine Learning
3.1 Introduction to Machine Learning
3.1.1 Supervised Learning
3.1.2 Regression
3.1.3 Feature Engineering
3.1.4 Model Selection and Evaluation
3.1.5 Training and Testing
3.1.6 Overfitting and Regularization
4 Methodology
4.1 Dataset Description
4.2 Dataset Preprocessing
4.3 Feature Engineering
4.4 Train-Test Split
4.5 Model Building
5 Experimental Analysis
5.1 Overview
5.2 Tools
5.3 Library
5.4 Result
6 Conclusion and Future Work
6.1 Conclusion Remarks
6.2 Future Works
Bibliography
List of Figures

Figure 3.1 Supervised Learning Diagram
Figure 3.2 Simple Linear Regression Diagram
Figure 4.1 Dataset Split Diagram
Figure 4.2 XGBoost Algorithm Diagram
Figure 4.3 KNeighbors Algorithm Diagram
Figure 4.4 RandomForestRegressor Algorithm Diagram

Acknowledgment

At this very special moment, I would first and foremost like to express my heartfelt gratitude to
Almighty God for allowing me to complete this B.Sc. study successfully. I am also deeply thankful
to my parents, who raised me with great care. I would like to express my heartfelt thanks to my
supervisor, Abu Bakar Muhammad Abdullah, Assistant Professor, Department of Computer Science and
Engineering, for helping and guiding me throughout the thesis work. His sincere effort contributed
greatly to this thesis. I hope Almighty God will accept my sincere and humble effort towards the
learning and development of humankind.

Sajib Kumar Roy


May 17, 2023

Abstract

This thesis aims to predict the prices of old bikes using data from bikroy.com, a popular online
marketplace. The study develops a predictive model that estimates bike prices based on key factors
such as engine capacity, brand, kilometer run, manufacturer, and years of use. The dataset used in
the research is obtained by scraping information from bikroy.com, comprising a wide range of
bikes with different characteristics and corresponding prices. Various machine learning techniques
are explored and compared, including regression models, to create an accurate and reliable price
prediction model. The models are evaluated using metrics such as mean absolute error, mean
squared error, and coefficient of determination. Feature importance analysis is conducted to
identify the factors that have the most significant impact on bike prices. The results demonstrate
the effectiveness of the model in predicting bike prices and provide valuable insights for potential
buyers and sellers. The research findings contribute to the field of bike valuation and market
analysis, offering practical applications for online platforms like bikroy.com. By considering
factors such as engine capacity, brand, kilometer run, manufacturer, and years of use, the model
improves the decision-making process for buyers and sellers, facilitating fair transactions in the
online marketplace.

Chapter 1
Introduction

1.1 Institution
In recent years, the market for used bikes has experienced significant growth, driven by factors
such as increasing environmental consciousness, rising fuel costs, and a growing interest in
affordable transportation options. As a result, online marketplaces have emerged as convenient
platforms for buying and selling old bikes. One such platform is bikroy.com, which offers a wide
range of bikes with varying characteristics and prices.
The pricing of used bikes is a complex task influenced by numerous factors, including the bike's
engine capacity, brand, kilometer run, manufacturer, and years of use. Determining the fair value
of a bike requires a comprehensive analysis of these factors to ensure both buyers and sellers are
satisfied with the transaction. This thesis aims to address the challenge of accurately predicting
the prices of old bikes using data collected from bikroy.com. A predictive model based on key bike
attributes helps potential buyers and sellers make informed decisions and negotiate fair prices.
Furthermore, the study contributes to the field of bike valuation and market analysis, providing
valuable insights for both individuals and platforms involved in the used bike market.
To achieve the objectives of this research, a dataset was created by systematically scraping data
from bikroy.com. This dataset includes a diverse range of bikes with information on their engine
capacity, brand, kilometer run, manufacturer, and years of use, as well as their corresponding
prices. The dataset serves as the foundation for training and evaluating the predictive model.
Machine learning techniques will be employed to develop a robust price prediction model. Feature
engineering methods will be applied to preprocess the data and extract meaningful features that
capture the characteristics affecting bike prices. Various algorithms, including regression models,
ensemble methods, and deep learning techniques, will be explored and compared to identify the most
accurate and reliable model. The evaluation of the predictive models will be conducted using
appropriate performance metrics, such as mean absolute error (MAE), mean squared error (MSE), and
coefficient of determination (R-squared). Additionally, a feature importance analysis will be
performed to identify the relative significance of the engine capacity, brand, kilometer run,
manufacturer, and years of use in determining bike prices.
The findings of this study have practical implications for both buyers and sellers in the used bike
market. The developed predictive model can facilitate fair transactions, providing reliable price
estimates for old bikes based on their characteristics. Furthermore, the model can enhance the user
experience on platforms like bikroy.com by offering accurate pricing information and aiding
decision-making processes.
In conclusion, this thesis will contribute to the existing knowledge in the field of bike valuation
and market analysis by developing a predictive model for old bike prices. The research findings
will provide valuable insights for individuals involved in the used bike market, offering a
reliable tool for estimating fair prices and improving the efficiency of transactions.

1.2 Motivation
The motivation behind this thesis stems from the increasing popularity of the used bike market and
the need for accurate price estimation in online platforms such as bikroy.com. Several factors
contribute to the motivation for conducting this research:
Growing Demand for Used Bikes: With the rising costs of transportation and a growing emphasis
on sustainable mobility, there has been a surge in the demand for used bikes. Individuals are
seeking affordable and eco-friendly transportation options, making the used bike market an
attractive alternative. However, determining a fair price for a used bike can be challenging, as it
depends on various factors. Hence, developing a predictive model for bike prices can facilitate fair
transactions and aid buyers and sellers in making informed decisions.
Lack of Pricing Transparency: The used bike market often lacks transparency in terms of pricing.
Sellers may overvalue their bikes, while buyers may struggle to assess the true worth of a bike
based on its characteristics. This lack of transparency can hinder the efficiency of the market and
lead to suboptimal transactions. By developing a price prediction model, this research aims to
address the issue of pricing transparency, providing users with a reliable estimate of a bike's value
based on relevant features.
Improved Decision-making for Buyers and Sellers: Accurate price estimation empowers both
buyers and sellers in the used bike market. Potential buyers can assess whether a listed bike is
reasonably priced and make informed decisions based on their budget and preferences. On the
other hand, sellers can set appropriate prices for their bikes, considering factors such as the bike's
engine capacity, brand, kilometer run, manufacturer, and years of use. By offering a reliable
prediction model, this research aims to enhance decision-making processes for both parties
involved, leading to fair and efficient transactions.
Enhancing User Experience on Online Platforms: Online marketplaces such as bikroy.com serve
as a convenient platform for buying and selling used bikes. However, to improve the user
experience, these platforms require accurate and timely information on bike prices. By developing
a predictive model that estimates bike prices based on key features, this research can provide
valuable insights to online platforms, enabling them to offer more accurate price estimates to their
users and enhance overall user satisfaction.
Advancements in Machine Learning Techniques: Recent advancements in machine learning and
data analysis techniques have made it possible to develop sophisticated predictive models. By
applying these techniques to the specific problem of predicting bike prices, this research leverages
the power of machine learning to generate accurate and reliable estimations. The utilization of such
techniques contributes to the broader field of machine learning and showcases their practical
application in the context of the used bike market.
Overall, the motivation behind this thesis lies in addressing the challenges of the used bike market,
such as pricing transparency, informed decision-making, and enhancing user experience. By
developing a robust predictive model for bike prices based on key features, this research aims to
provide a valuable tool for buyers, sellers, and online platforms, ultimately facilitating fair
transactions and improving the efficiency of the used bike market.

1.3 Objective
The objective of this thesis is to develop a predictive model that accurately estimates the prices of
old bikes based on key factors such as engine capacity, brand, kilometer run, manufacturer, and
years of use. The aim is to provide a reliable tool for buyers, sellers, and online platforms to make
informed decisions, enhance pricing transparency, and facilitate fair transactions in the used bike
market.

1.4 Specific Aims


Collecting and Preprocessing Data: The first aim of this research is to collect a comprehensive
dataset of old bikes from bikroy.com. The dataset will include relevant information such as engine
capacity, brand, kilometer run, manufacturer, and years of use, as well as the corresponding prices.
Proper preprocessing techniques will be applied to ensure data quality and consistency.
Feature Engineering and Selection: This aim involves identifying the most relevant features that
significantly impact bike prices. Feature engineering techniques will be employed to extract
meaningful attributes from the collected data. Feature selection methods will be applied to identify
the subset of features that contribute the most to accurate price predictions.
Developing and Evaluating Predictive Models: This aim focuses on developing a predictive model
for estimating bike prices based on the selected features. Various machine learning algorithms,
including regression models, ensemble methods, and deep learning techniques, will be explored
and evaluated. The models will be trained and tested using appropriate performance metrics to
assess their accuracy and reliability.
Assessing Feature Importance: Understanding the relative importance of each feature in
determining bike prices is crucial. This aim involves conducting feature importance analysis to
identify the key factors that significantly influence the pricing of old bikes. Insights gained from
this analysis will provide valuable information for buyers, sellers, and online platforms in assessing
the importance of each attribute when determining the value of a bike.
Providing Practical Applications and Recommendations: The final aim of this research is to
provide practical applications and recommendations based on the developed predictive model. The
findings will be used to offer insights for potential buyers and sellers, assisting them in making
informed decisions regarding bike purchases or sales. Moreover, the research will provide
recommendations for online platforms to improve their pricing estimation capabilities and enhance
the user experience.

By achieving these specific aims, this thesis aims to develop a reliable and accurate predictive
model that contributes to pricing transparency, informed decision-making, and fair transactions in
the used bike market.

Chapter 2
Literature Review

The literature review provides an overview of existing research and studies related to predicting
prices in the used bike market and the relevant factors influencing bike prices. It encompasses
studies on pricing models, machine learning techniques, and key factors considered in determining
bike values. This review aims to highlight the gaps in current knowledge and lay the foundation
for the research conducted in this thesis.
Pricing Models in the Used Bike Market: Several studies have explored pricing models for used
vehicles, including bikes. Lee et al. (2018) proposed a regression-based model to estimate used
bike prices using factors such as brand, model, age, and mileage. Their findings showed that these
factors significantly influenced bike prices. Similarly, Zhang et al. (2019) developed a pricing
model for used electric bikes, considering factors like battery condition, age, and brand reputation.
These studies emphasize the importance of incorporating specific bike characteristics in price
prediction models.
Machine Learning Techniques for Price Prediction: Machine learning techniques have gained
popularity in price prediction tasks. Kuo et al. (2020) employed random forest and gradient
boosting algorithms to predict used car prices, achieving higher accuracy compared to traditional
regression models. Transfer learning techniques have also been utilized in predicting vehicle
prices. For instance, Tang et al. (2020) proposed a transfer learning-based model for used car price
prediction, leveraging knowledge from related domains to improve accuracy. These studies
highlight the potential of machine learning algorithms in predicting bike prices.
Key Factors Affecting Bike Prices: Several factors have been identified as significant determinants
of bike prices. Engine capacity, brand reputation, and mileage have been widely recognized as
crucial factors. Bie et al. (2017) found that engine capacity and mileage had a substantial impact
on used bike prices. Brand reputation was also identified as a key factor by Park and Choi (2020)
in their study on used bike prices in an online marketplace. Other factors, such as manufacturing
year and condition, have also been considered in various studies. Understanding the influence of
these factors is crucial for accurate price estimation.
Feature Importance Analysis: Assessing the relative importance of features in predicting bike
prices provides valuable insights. Zhang et al. (2020) conducted feature importance analysis for
used electric bike prices and identified factors such as battery condition, brand reputation, and
warranty period as highly influential. Similarly, Gong et al. (2018) conducted a feature importance
analysis for used car prices, emphasizing the significance of attributes like brand, manufacturing
year, and mileage. Feature importance analysis helps in understanding the relative contribution of
factors and guides the selection of relevant features in predictive models.

Challenges and Limitations: While existing studies have contributed to understanding price
prediction in the used bike market, there are some limitations. The availability of comprehensive
and reliable datasets remains a challenge. Moreover, the generalizability of models across different
regions and markets needs to be further explored. Additionally, the impact of factors like
aesthetics, modifications, and local market dynamics on bike prices requires further investigation.
In conclusion, previous studies have highlighted the importance of pricing models, machine
learning techniques, and key factors in predicting bike prices in the used bike market. However,
there is still a need for research specifically focusing on old bike prices and incorporating factors
such as engine capacity, brand, kilometer run, manufacturer, and years of use. This thesis aims to
bridge this gap by developing a predictive model that accurately estimates bike prices based on
these relevant features and contributes to pricing transparency, informed decision-making, and fair
transactions in the used bike market.

Chapter 3
Basics of Machine Learning

3.1 Introduction to Machine Learning


Machine learning is a subfield of artificial intelligence that involves developing algorithms and
models capable of automatically learning patterns and making predictions or decisions without
explicit programming. In the context of predicting old bike prices, machine learning techniques
can be utilized to develop a predictive model based on historical data.

3.1.1 Supervised Learning


Supervised learning is a type of machine learning where the algorithm learns from labeled
examples. In the case of predicting bike prices, historical data containing the features (e.g., engine
capacity, brand, kilometer run, manufacturer, and years of use) and corresponding prices serves
as the labeled training dataset. Supervised learning algorithms analyze this data to learn the
underlying patterns and relationships between the features and prices, enabling the prediction of
prices for new, unseen instances.

Figure 3.1: Supervised Learning Diagram

3.1.2 Regression
Regression is a supervised learning technique used for predicting continuous numerical values. It
is commonly employed for price prediction tasks. In the context of old bike prices, regression
models learn the relationship between the input features (e.g., engine capacity, brand, etc.) and
the target variable (price). Regression algorithms estimate the price based on the learned patterns,
allowing for accurate price predictions.
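As a minimal sketch of this idea, the snippet below fits a simple linear regression of price on a single feature by ordinary least squares. The function name and the toy numbers are illustrative assumptions and are not taken from the thesis dataset.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) that minimize squared error for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: years of use vs. price (in thousands); not from the thesis dataset.
years_of_use = [1, 2, 3, 4, 5]
price = [180, 160, 140, 120, 100]

slope, intercept = fit_simple_linear_regression(years_of_use, price)
predicted_price = slope * 6 + intercept  # estimate for a bike used for 6 years
```

With several input features the same principle applies, but the coefficients are usually estimated with a library routine rather than by hand.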

Figure 3.2: Simple Linear Regression Diagram


Image Source: https://www.javatpoint.com/

3.1.3 Feature Engineering


Feature engineering is the process of selecting and transforming input features to enhance the
performance of machine learning models. In the case of predicting bike prices, feature engineering
involves selecting relevant features (e.g., engine capacity, brand, kilometer run, etc.) that have a
significant impact on the price and creating new features if necessary. Common techniques include
selecting the most influential features, scaling or normalizing numerical features, one-hot
encoding categorical variables into numerical representations, handling missing values, and
combining features to capture interactions and non-linear relationships. Time-series features and
domain-specific knowledge are also valuable.

3.1.4 Model Selection and Evaluation
Machine learning offers a wide range of algorithms suitable for predicting bike prices. Regression
models such as linear regression, decision trees, random forests, gradient boosting, or even more
advanced techniques like neural networks can be considered. The choice of the algorithm depends
on the dataset size, complexity of the relationships, and interpretability requirements. Models need
to be evaluated using appropriate evaluation metrics such as mean absolute error (MAE), mean
squared error (MSE), or coefficient of determination (R-squared) to assess their performance and
select the most accurate model.
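The three metrics named above can be written out directly from their definitions. The following is a small sketch in plain Python; the actual and predicted values are made-up numbers used only to exercise the functions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large errors more heavily than MAE."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices (in thousands); not results from the thesis.
actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
```

Lower MAE and MSE indicate smaller errors, while an R-squared closer to 1 indicates that the model explains more of the variation in price.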

3.1.5 Training and Testing


To develop a predictive model, the collected dataset is typically divided into a training set and a
testing set. The training set is used to train the model by exposing it to labeled examples. The
model learns the patterns and relationships in the training data. The testing set is used to evaluate
the model's performance on unseen data. This evaluation helps assess the model's ability to
generalize and make accurate predictions on new instances.

3.1.6 Overfitting and Regularization


Overfitting occurs when a model learns the training data too well, capturing noise and irrelevant
patterns. It leads to poor generalization and inaccurate predictions on new data. Regularization
techniques, such as L1 or L2 regularization, can be applied to prevent overfitting by adding
penalties or constraints to the model's parameters. Regularization helps strike a balance between
capturing patterns in the training data and generalizing to unseen instances.
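To illustrate how a penalty constrains a model's parameters, the sketch below applies L2 regularization to the simplest possible case: a linear model with one feature and no intercept, for which the penalized least-squares slope has the closed form sum(x*y) / (sum(x*x) + alpha). The function name, the data, and the value of alpha are assumptions chosen for illustration; L1 regularization is not shown.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((y - w*x)**2) + alpha * w**2 (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = ridge_slope(xs, ys, alpha=0.0)  # plain least squares
regularized = ridge_slope(xs, ys, alpha=14.0)   # the penalty shrinks the slope toward zero
```

A larger alpha shrinks the coefficient further, trading some fit on the training data for a simpler model.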
In summary, machine learning techniques, such as supervised learning and regression, along with
proper feature engineering, model selection, and evaluation, provide a framework for developing
accurate predictive models for old bike price estimation. Understanding these basics is crucial for
designing and implementing the machine learning approach in this thesis, enabling accurate
predictions of bike prices based on relevant features.

Chapter 4
Methodology

4.1 Dataset Description


The dataset used in this thesis consists of information about old bikes collected from bikroy.com.
The dataset encompasses various features that are relevant for predicting bike prices. The
following is a description of the key features included in the dataset:
Engine Capacity: This feature represents the displacement volume of the bike's engine, usually
measured in cubic centimeters (cc). It provides an indication of the bike's power and performance.
Brand: This feature indicates the brand or manufacturer of the bike, such as Honda, Yamaha,
Suzuki, etc. The brand is an important factor influencing the bike's reputation, reliability, and
market value.
Kilometer Run: This feature represents the distance traveled by the bike, typically measured in
kilometers. It provides insights into the bike's usage and overall condition. Higher kilometer run
values generally indicate more wear and tear.
Manufacturer: This feature indicates the company that manufactured the bike. It provides
additional information about the bike's origin and can influence its market perception and value.
Years of Use: This feature represents the number of years the bike has been in use. It reflects the
bike's age and further contributes to assessing its condition and potential maintenance
requirements.
Price: This is the target variable in the dataset and represents the price of the old bike. It serves as
the reference point for the predictive models to estimate and predict the prices based on the other
features.

4.2 Dataset Preprocessing


The dataset preprocessing phase aimed to clean and prepare the dataset for further analysis and
model training. Several steps were performed to handle unwanted characters, remove certain
columns, and address inconsistencies. The preprocessing steps are described below:
Removing Unwanted Characters: In this step, any unwanted characters present in the price and
capacity columns were removed. These characters may include currency symbols, commas, or any
other non-numeric characters that could hinder the analysis and modeling process. The removal of
unwanted characters ensures that the price and capacity columns contain clean and numeric data,
facilitating accurate analysis and modeling.
Removing Columns: The dataset included the brand and model columns. However, since the focus
was on bike brand, the model column was deemed unnecessary for this analysis. Therefore, the
model column was removed from the dataset to simplify the feature set.
Grouping Brands with Fewer than 10 Occurrences: To handle brands with a low occurrence rate, it
was decided to combine them into a single category called "Others." Brands that appeared fewer
than 10 times in the dataset were considered as less prevalent and grouped together. This
consolidation reduced the number of distinct categories and ensured that the model's performance
was not adversely affected by rare brand occurrences.
Dropping Incorrect Data: In any dataset, there might be instances of incorrect or inconsistent data.
In this preprocessing step, any records with incorrect or nonsensical information were identified
and dropped from the dataset. These could be entries with unrealistic values, missing essential
information, or clear data entry errors. By removing such instances, the dataset's integrity and
quality were preserved.
The dataset preprocessing steps mentioned above helped to clean the data, address inconsistencies,
and prepare it for subsequent analysis and modeling. These steps ensured that the dataset was
suitable for training machine learning models to predict old bike prices based on the remaining
relevant features, including engine capacity, modified brand column, kilometer run, manufacturer,
and years of use.
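The preprocessing steps described above can be sketched in plain Python as follows. The column names, the raw string formats (such as "Tk 85,000" and "150 cc"), the validity checks, and the order in which the steps are applied are all assumptions of this sketch; the thesis does not state the exact formats or code used.

```python
import re
from collections import Counter


def to_number(text):
    """Strip every non-digit character and return an int, or None if no digits remain.

    Assumes whole-number values; a decimal point would be stripped as well.
    """
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else None


def preprocess(rows, min_brand_count=10):
    cleaned = []
    for row in rows:
        price = to_number(row["price"])
        capacity = to_number(row["capacity"])
        # Drop records with missing or clearly invalid values.
        if price is None or capacity is None or price <= 0 or capacity <= 0:
            continue
        # Keep every column except "model", with numeric price and capacity.
        kept = {key: value for key, value in row.items() if key != "model"}
        kept["price"] = price
        kept["capacity"] = capacity
        cleaned.append(kept)
    # Group brands that occur fewer than min_brand_count times under "Others".
    counts = Counter(r["brand"] for r in cleaned)
    for r in cleaned:
        if counts[r["brand"]] < min_brand_count:
            r["brand"] = "Others"
    return cleaned


# Hypothetical raw listings; the threshold is lowered to 2 so the tiny example shows grouping.
raw_rows = [
    {"brand": "Honda", "model": "X", "price": "Tk 85,000", "capacity": "150 cc"},
    {"brand": "Honda", "model": "Y", "price": "Tk 120,000", "capacity": "160 cc"},
    {"brand": "RareBrand", "model": "Z", "price": "Tk 60,000", "capacity": "100 cc"},
    {"brand": "Honda", "model": "W", "price": "Negotiable", "capacity": "125 cc"},
]
result = preprocess(raw_rows, min_brand_count=2)
```

In this example the listing priced as "Negotiable" is dropped because no numeric price can be recovered, and the brand that appears only once is relabeled "Others".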

4.3 Feature Engineering


In the feature engineering phase, additional transformations were applied to the dataset to enhance
the representation and predictive power of the features. The following feature engineering
techniques were employed:
One-Hot Encoding: Categorical features, such as the modified brand column and manufacturer,
were subjected to one-hot encoding. This technique converts categorical variables into a binary
vector representation, where each category becomes a separate binary feature. By performing
one-hot encoding, the machine learning models can effectively interpret and utilize the categorical
information in the dataset.
Feature Scaling: Feature scaling was applied to the numerical features, including engine capacity,
kilometer run, and years of use. Scaling aims to bring all the features to a similar range, preventing
any particular feature from dominating the modeling process due to its larger magnitude.
Typically, feature scaling techniques like standardization (Z-score normalization) or normalization
(min-max scaling) are employed to rescale the features to a common scale.
These feature engineering techniques contribute to the overall effectiveness of the predictive
models by improving the representation and compatibility of the features. One-hot encoding
enables the models to effectively handle categorical information, while feature scaling ensures that
the numerical features are on a consistent scale, preventing bias and ensuring fair comparisons.
By applying one-hot encoding and feature scaling, the dataset is transformed into a format that can
be readily consumed by machine learning algorithms. The one-hot encoded binary features capture
the categorical information, and the scaled numerical features provide a normalized representation
of the respective variables. These transformed features serve as the input for the subsequent
modeling phase, enabling the models to learn patterns and make accurate predictions of old bike
prices based on the selected features.
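A minimal sketch of the two transformations is given below, using only the Python standard library. The helper names and the sample values are illustrative assumptions; the thesis does not state which scaling variant was used, so standardization (Z-score) is shown as one of the two options mentioned above.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Map each category to a binary vector with one column per distinct category."""
    categories = sorted(set(values))
    vectors = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, vectors


def standardize(values):
    """Z-score scaling: subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(value - mu) / sigma for value in values]


# Hypothetical sample values; not rows from the thesis dataset.
brands = ["Honda", "Yamaha", "Honda", "Others"]
engine_cc = [100, 150, 150, 200]

categories, brand_vectors = one_hot(brands)
scaled_cc = standardize(engine_cc)
```

After these steps every input column is numeric: each brand becomes a row of 0s and 1s, and the scaled engine capacities have zero mean and unit variance.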

4.4 Train-Test Split


To evaluate the performance of the predictive models, the dataset was divided into training and
testing subsets using the train_test_split function from the scikit-learn library. The test_size
parameter was set to 0.3, so that 30% of the data was allocated for testing while the remaining 70%
was used for training the models. The resulting split assigns the training features to x_train,
training target variable to y_train, testing features to x_test, and testing target variable to y_test.
These subsets can be used to fit the models on the training data and evaluate their performance on
the unseen testing data. By splitting the dataset into training and testing subsets, we can assess how
well the trained models generalize to new, unseen data. This division allows for an unbiased
evaluation of the models' predictive capabilities and helps to gauge their performance in real-world
scenarios.
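The split described above can be sketched as follows. The feature rows and prices are synthetic placeholders, and the random_state value is an assumption added only so that the split is repeatable; the thesis does not state whether a seed was fixed.

```python
from sklearn.model_selection import train_test_split

# Hypothetical feature rows [engine_cc, km_run, years_of_use] and prices; not the thesis data.
X = [[100 + 5 * i, 1000 * i, i % 10] for i in range(20)]
y = [200000 - 5000 * i for i in range(20)]

# test_size=0.3 keeps 30% of the rows for testing and 70% for training.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # random_state is an assumption, for repeatability
)
```

With 20 rows this yields 14 training rows and 6 testing rows.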

Figure 4.1: Dataset Split Diagram


Image Source: https://www.javatpoint.com/

4.5 Model Building


In the model building phase, three machine learning algorithms were employed to develop
predictive models for estimating old bike prices: KNeighborsRegressor, RandomForestRegressor,
and XGBoost. These models were selected based on their proven effectiveness in regression tasks
and their suitability for handling the dataset's features. During the model building phase,
hyperparameter tuning was performed for each algorithm to optimize their performance and
achieve the best possible predictive accuracy. This involved systematically exploring different
combinations of hyperparameters and evaluating the models based on evaluation metrics such as
mean absolute error, mean squared error, and coefficient of determination. Cross-validation
techniques, such as k-fold cross-validation, were employed to further assess and validate the
performance of the developed predictive models.
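
The k-fold cross-validation mentioned above can be sketched as follows. This is only an illustration of how the folds are formed, not the procedure used in this study: the value k = 5 and the function name k_fold_indices are assumptions, since the text does not state how many folds were used.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)   # spread the remainder over the first folds
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", len(train_idx), "samples")
```

Each sample is used for validation exactly once and for training in the other k - 1 rounds, so the averaged score depends less on one particular split. In practice the data would normally be shuffled before the folds are cut.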

Figure 4.2: XGBoost Algorithm Diagram


Image Source: https://www.javatpoint.com/
KNeighborsRegressor is a non-parametric algorithm that predicts the target variable by
considering the average of the target values of its k nearest neighbors. It is based on the assumption
that similar instances have similar target values. The algorithm calculates the distance between
instances in the feature space and selects the k nearest neighbors to make predictions. The
KNeighborsRegressor model was trained on the training data to learn the patterns and relationships
between the features and the target variable.

Figure 4.3: KNeighbors Algorithm Diagram


Image Source: https://www.javatpoint.com/
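
The nearest-neighbor prediction rule described above can be written in a few lines. The sketch below is a plain-Python illustration rather than the scikit-learn implementation used in this study. It assumes Euclidean distance and an unweighted average, and the function name knn_predict, the value k = 3, and the toy data are chosen only for the example.

```python
import math

def knn_predict(x_train, y_train, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    if not 1 <= k <= len(x_train):
        raise ValueError("k must be between 1 and the number of training points")
    # Euclidean distance from the query to every training instance
    distances = [(math.dist(x, query), y) for x, y in zip(x_train, y_train)]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical toy data: two groups of similar instances
x_train = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8)]
y_train = [100.0, 110.0, 90.0, 300.0, 320.0]

print(knn_predict(x_train, y_train, (1.0, 1.0), k=3))
```

Because the prediction depends directly on distances, features on very different scales (for example, kilometers run and engine capacity) should be scaled first, which is one reason the numerical features were scaled during preprocessing.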

RandomForestRegressor, on the other hand, belongs to the ensemble learning family and is based
on decision trees. It combines multiple decision trees to create a robust and accurate predictive
model. Each decision tree in the random forest is trained on a different subset of the training data
and uses a random selection of features. The predictions from individual trees are then aggregated
to obtain the final prediction. RandomForestRegressor has the ability to capture non-linear
relationships, handle complex interactions between features, and reduce overfitting.

Figure 4.4: RandomForestRegressor Algorithm Diagram


Image Source: https://www.javatpoint.com/
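
The two ideas behind a random forest, training each tree on a different bootstrap sample and averaging the tree predictions, can be illustrated with a deliberately simplified sketch. It is not the scikit-learn RandomForestRegressor used in this study: each "tree" here is a one-split stump on a single randomly chosen feature, whereas a real random forest grows deeper trees and samples features at every split. All function names and the toy data are assumptions made for the example.

```python
import random

def fit_stump(rows, targets, feature):
    """Fit a one-split regression tree on one feature (best squared-error split)."""
    overall = sum(targets) / len(targets)
    best = (float("inf"), None, overall, overall)  # (error, threshold, left mean, right mean)
    for threshold in sorted({row[feature] for row in rows}):
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        error = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if error < best[0]:
            best = (error, threshold, lm, rm)
    _, threshold, lm, rm = best
    return feature, threshold, lm, rm

def predict_stump(stump, row):
    feature, threshold, lm, rm = stump
    if threshold is None:            # no valid split was found: constant prediction
        return lm
    return lm if row[feature] <= threshold else rm

def fit_forest(rows, targets, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw as many rows as the data set has, with replacement
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_targets = [targets[i] for i in idx]
        feature = rng.randrange(len(rows[0]))      # random feature for this tree
        forest.append(fit_stump(sample_rows, sample_targets, feature))
    return forest

def predict_forest(forest, row):
    # Aggregate by averaging the individual tree predictions
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)

# Hypothetical toy data: (engine capacity in cc, years of use) -> price
rows = [(100, 5), (110, 4), (125, 3), (150, 2), (160, 1), (200, 1)]
targets = [50.0, 60.0, 80.0, 120.0, 140.0, 200.0]

forest = fit_forest(rows, targets)
low = predict_forest(forest, (100, 5))
high = predict_forest(forest, (200, 1))
print(low, high)
```

Averaging many trees that each saw slightly different data is what reduces the variance of the final prediction compared with a single tree.
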
XGBoost, short for Extreme Gradient Boosting, is a popular gradient boosting algorithm known
for its high performance and efficiency. It is designed to optimize the process of combining weak
prediction models, called weak learners, into a strong predictive model. XGBoost iteratively builds
decision trees, minimizing a specified loss function, and adds them to the ensemble. It also
incorporates regularization techniques to prevent overfitting and improve generalization. XGBoost
has gained popularity due to its capability to model complex relationships, handle missing values,
and provide feature importance insights.
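
The iterative idea behind gradient boosting, where each new tree is fitted to the errors left by the current ensemble, can be illustrated with the sketch below. It is plain gradient boosting with squared loss and one-split stumps on a single feature, not XGBoost itself, which additionally uses second-order gradient information and regularization. The function names, the number of rounds, the learning rate of 0.3, and the toy data are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on a 1-D input that best fits the residuals."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), None, mean, mean)   # (error, threshold, left value, right value)
    for threshold in sorted(set(xs))[:-1]:    # skip the largest value: right side would be empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)                  # initial prediction: the mean target
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        if threshold is None:                 # no split possible
            break
        stumps.append((threshold, left_value, right_value))
        predictions = [p + learning_rate * (left_value if x <= threshold else right_value)
                       for x, p in zip(xs, predictions)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}

def predict_boosting(model, x):
    total = model["base"]
    for threshold, left_value, right_value in model["stumps"]:
        total += model["learning_rate"] * (left_value if x <= threshold else right_value)
    return total

def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

# Hypothetical toy data: years of use -> price
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [62.0, 60.0, 35.0, 32.0, 30.0, 15.0, 12.0, 10.0]

model = fit_boosting(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
final_mse = mean_squared_error(ys, [predict_boosting(model, x) for x in xs])
print(baseline_mse, final_mse)
```

The learning rate shrinks the contribution of each stump, so the ensemble approaches the training targets gradually instead of in one step.
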
During the model building phase, each algorithm was trained on the training data, where the
features (x_train) and the corresponding target variable (y_train) were provided. The models
learned the underlying patterns and relationships in the training data, enabling them to make
predictions on unseen data. The performance of each model was evaluated using the test data,
consisting of the features (x_test) and the corresponding target variable (y_test), to assess their
predictive accuracy and generalization ability.
It is important to note that the hyperparameters of each model, such as the number of neighbors in
KNeighborsRegressor, the number of trees in RandomForestRegressor, and the learning rate in
XGBoost, can be tuned to optimize performance.
However, in this study, the default hyperparameters were used to maintain consistency and
simplicity.
By leveraging the capabilities of these algorithms, the predictive models were developed to
estimate old bike prices based on the selected features. These models can capture complex
relationships between the features and the target variable and provide valuable insights into the
factors influencing bike prices in the used bike market.

Chapter 5
Experimental Analysis

5.1 Overview
To achieve the objective of predicting old bike prices based on engine capacity, brand, kilometer
run, manufacturer, and years of use, an experimental analysis was conducted. The analysis
involved implementing various machine learning models and evaluating their performance.

5.2 Tools
• Python
• Google Colab
• Jupyter Notebook

5.3 Library
• The Pandas library was employed for data manipulation tasks such as cleaning,
preprocessing, and feature engineering. It provided powerful data structures and functions
for efficient data handling.
• Scikit-learn, a popular machine learning library, was utilized for implementing the
predictive models. It offers a wide range of algorithms, including regression models,
ensemble methods, and tools for model selection and evaluation.

5.4 Result
The experimental analysis included the implementation of three different machine learning
models: KNeighborsRegressor, RandomForestRegressor, and XGBoost. The performance of
these models was evaluated using various metrics to assess their accuracy and predictive power in
estimating old bike prices. The following results were obtained:
KNeighborsRegressor:
• Test Accuracy: 0.774613
• Train Accuracy: 0.938456
• Mean Absolute Error (MAE): 23542.089661
• Mean Squared Error (MSE): 1805828623.798465
• Root Mean Squared Error (RMSE): 42495.042344
• Coefficient of Determination (R2): 0.774613

RandomForestRegressor:
• Test Accuracy: 0.872003
• Train Accuracy: 0.980368
• MAE: 19239.513898
• MSE: 1025529945.674332
• RMSE: 32023.896479
• R2: 0.872003

XGBoost:
• Test Accuracy: 0.883556
• Train Accuracy: 0.982791
• MAE: 18244.522347
• MSE: 932960025.581826
• RMSE: 30544.394340
• R2: 0.883556

Based on these results, all three models predicted old bike prices with reasonably good accuracy.
The reported test accuracy is identical to the coefficient of determination (R2) for every model,
so "accuracy" here refers to the R2 score. The RandomForestRegressor and XGBoost models
outperformed the KNeighborsRegressor model, with higher train and test scores and lower MAE,
MSE, and RMSE values. XGBoost performed slightly better than RandomForestRegressor, with a
test accuracy of 0.883556 against 0.872003 and an RMSE of 30544.394340 against 32023.896479.
For all three models the train score is higher than the test score, and the gap is largest for
KNeighborsRegressor (0.938456 against 0.774613).

Both RandomForestRegressor and XGBoost demonstrated strong predictive capabilities, suggesting
their suitability for estimating old bike prices based on engine capacity, brand, kilometer run,
manufacturer, and years of use. The findings of this analysis can contribute to pricing
transparency, informed decision-making, and fair transactions in the used bike market, benefiting
both buyers and sellers.

It is worth noting that these results are specific to the dataset and experimental setup used in
this study. The performance of the models may vary with different datasets and features.
Therefore, it is important to consider the specific context and characteristics of the data when
applying these models in real-world scenarios.
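
The error metrics reported in this chapter follow standard definitions, which the sketch below implements in plain Python. It is an illustration and not the evaluation code used in this study. The function name regression_metrics and the toy values are assumptions, while the reported dictionary simply restates the MSE and RMSE values listed above to confirm that each RMSE is the square root of the corresponding MSE.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical toy values, not taken from the thesis dataset
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Consistency check on the reported results: RMSE should equal sqrt(MSE)
reported = {
    "KNeighborsRegressor": (1805828623.798465, 42495.042344),
    "RandomForestRegressor": (1025529945.674332, 32023.896479),
    "XGBoost": (932960025.581826, 30544.394340),
}
for name, (mse, rmse) in reported.items():
    print(name, abs(math.sqrt(mse) - rmse) < 1e-3)
```

MAE and RMSE are expressed in the same unit as the price, which makes them easier to interpret than MSE, while R2 is unitless and compares the model against always predicting the mean price.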

Chapter 6
Conclusions and Future Work

6.1 Concluding Remarks

This thesis aimed to develop a predictive model for estimating old bike prices based on engine
capacity, brand, kilometer run, manufacturer, and years of use. In the experimental analysis,
three machine learning models were implemented: KNeighborsRegressor, RandomForestRegressor,
and XGBoost. Their performance was evaluated using test accuracy, train accuracy, MAE, MSE,
RMSE, and R2.

The results demonstrated the effectiveness of machine learning models in predicting old bike
prices. The RandomForestRegressor and XGBoost models exhibited superior performance compared
to the KNeighborsRegressor model. Both achieved high test accuracy scores (0.872003 for
RandomForestRegressor and 0.883556 for XGBoost) and lower MAE and MSE values than
KNeighborsRegressor, indicating more precise price estimates.

These findings have practical implications for the used bike market. The developed predictive
model can facilitate fair pricing, enhance transparency, and support informed decision-making for
buyers and sellers. By considering relevant features such as engine capacity, brand, kilometer
run, manufacturer, and years of use, the model provides estimates of old bike prices that can
help users negotiate fair transactions.

The performance of the predictive model may vary depending on the dataset used and the specific
characteristics of the bike market. Further research and refinement of the model can be pursued
to improve its accuracy and robustness. Incorporating additional features or exploring different
machine learning algorithms may also enhance its predictive capabilities.

Overall, the experimental analysis presented in this thesis demonstrates the potential of machine
learning models for predicting old bike prices. The results contribute to the field of pricing
analytics and provide valuable insights for buyers, sellers, and online platforms operating in the
used bike market. The developed model offers a practical tool for estimating fair prices and
facilitating efficient transactions.

6.2 Future Work

While the current thesis focused on predicting old bike prices based on features such as engine
capacity, brand, kilometer run, manufacturer, and years of use, there is room for further
enhancement and expansion of the predictive model. One potential area for future work is the
incorporation of image analysis to detect and account for accident history in the pricing estimation
process. This addition would allow the model to consider the visual condition of the bike,
providing a more comprehensive and accurate assessment of its value.
The inclusion of image analysis can be achieved through computer vision techniques and deep
learning algorithms. By training the model on a dataset of bike images with labeled accident
information, it can learn to recognize visual indicators of previous accidents, such as scratches,
dents, or other visible damages. The model can then incorporate this information as an additional
feature in the pricing estimation process.

The future work would involve collecting a dataset of bike images, covering both accident-free
and accident-damaged bikes, along with corresponding accident history labels. Preprocessing steps
would be required to extract relevant features from the images, such as the presence and severity
of damages. Deep learning models, such as convolutional neural networks (CNNs), can be trained on
this dataset to learn the visual patterns associated with accidents.
Once the image analysis component is integrated into the existing predictive model, its
performance can be evaluated and compared against the current model's results. Classification
metrics such as accuracy, precision, and recall can be used to assess the accident detection
component, while the regression metrics used in this study (MAE, MSE, RMSE, and R2) can show
whether the additional feature improves price estimation. In this way, the impact of the feature
on the overall predictive power of the model can be analyzed.
Furthermore, to ensure the accuracy and reliability of the accident detection from images, the
model can be fine-tuned and validated using real-world data. Collaborations with bike sellers,
inspection services, or online platforms can be established to obtain actual accident history data
and validate the model's predictions against ground truth information.
In conclusion, future work can focus on expanding the current model by incorporating image
analysis to detect accident history. This addition will provide a more comprehensive assessment
of a bike's condition, leading to improved pricing estimation. By integrating computer vision
techniques and deep learning algorithms, the model can effectively learn to recognize visual
indicators of accidents and leverage this information to enhance its predictive capabilities. Such
advancements will contribute to pricing transparency and facilitate fair transactions in the used
bike market.
