ITS307 Group 4 Report

This document discusses predicting used car prices through machine learning models. It describes related work using random forests, support vector machines and neural networks. The paper proposes using K-nearest neighbors regression on a dataset containing vehicle features to predict prices, with an accuracy of up to 85%.


Second Hand Car Price Prediction
ITS307 DATA ANALYTICS
BACHELOR OF SCIENCE IN INFORMATION
TECHNOLOGY (YEAR III, SEMESTER I)

RESEARCHER (S)
Jamyang Gyeltshen 12200048
Domtu Drukpa 12200043
Kinley Pemo 12200058
Gyem Tshering 12200045

GUIDED BY
NIMA DEMA
Gyalpozhing College of Information Technology Gyalpozhing,
Mongar
Table of Contents

1. Abstract
2. Introduction
2.1 Problem statement
2.2 Aim
2.3 Scope and limitations
3. Related work
4. Methodology
4.1 Proposed Methods
4.2 Evaluation metrics
5. Result and Discussion
6. Conclusion
7. References

1. Abstract

As the mobile internet improves by leaps and bounds, the traditional offline model of used car
trading has gradually lost the ability to meet the needs of consumers, and online used car
trading platforms have emerged in response. Second-hand car price assessment is the
premise of second-hand car trading, and a reasonable price reflects the objective, fair, and true
nature of the second-hand car market. Determining whether the listed price of a used car is
reasonable is a challenging task, due to the many factors that drive a used vehicle's price on the
market. The focus of this project is developing machine learning models that can accurately
predict the price of a used car based on its features, so that buyers can make informed purchases.
We implement and evaluate various learning methods on a dataset consisting of the sale prices of
different makes and models of cars. Our results show that the KNeighbors Regressor yields the
best results.

Keywords: second hand car; price; used car; linear regression; models; car trading; reasonable
price; machine learning

2. Introduction

With the growing number of private cars and the development of the used car
market, used cars have become a top priority for buyers. The price of a used car is an
important aspect of a successful transaction for both buyers and sellers. For car buyers,
knowing the price of used cars allows for trading with peace of mind; for car sellers,
evaluating the residual value of used cars helps them set prices reasonably. In other
commodity markets, such as stock markets, gold markets, and agricultural markets, price
forecasting has been a key focus of research. Used cars, as a commodity, can be priced in the
same way. However, used car transactions are much more complex than other commodity
transactions, as the sale price is influenced not only by the basic features of the car itself, such as
brand, power, and structure, but also by the condition of the car, such as mileage, fuel type, usage
time, and transmission type. In addition, there is a lack of readily available methods for
determining which factors affect the sale price most strongly. At the same time, online
transactions also make it difficult to assess the price of used cars. Used cars are experience
goods. Unlike search goods, it is difficult for consumers to make a purchase decision based on the
vehicle configuration alone; the actual user experience has a big impact on the purchase. This
exacerbates the difficulty of predicting used car prices accurately. Therefore, screening the
characteristic variables that affect the price of used cars and improving the accuracy of price
prediction is of great significance for fair transactions between buyers and sellers and for the
sustainable and healthy development of the used car market.

Problem statement

We came up with this project mainly to solve some of the real-life problems related to car
pricing. Most buyers are unsure how to buy a car at a fair price, and buyers in particular
sometimes become victims of brokers who charge them extra.

Aims:

The main aim of this project is to predict the price of used cars using various machine
learning (ML) models.

Scope and Limitations

User Scope

The scope of this project is to understand and predict car prices in Bhutan. It can be used by
any user who is interested in buying a used car and wants a relatively accurate price prediction
based on the information the user inputs.

System Scope

The system scope of this project is to determine the price of a used car using machine learning,
based on multiple aspects including vehicle mileage, year of manufacture, fuel
consumption, transmission, fuel type, and brand of the car.

Limitation

In recent years, the automobile world has changed drastically owing to resource shortages
after the pandemic, which led to drastic changes in used car prices. Hence, car prices changed
quickly during this study, which will affect the accuracy of the price predictions.

3. Related Work

Several studies and related works have been done previously to predict used car prices
around the world using different methodologies and approaches, with varying results of
accuracy from 50% to 90%.

I. To predict the price of used cars, researchers (Nabarun Pal, 2018) used a
supervised learning method known as Random Forest. A Kaggle dataset was used as
the basis for predicting used car prices. To determine the price impact of each
feature, careful exploratory data analysis was performed. A Random Forest with 500
Decision Trees was trained. Random Forest is most commonly used for classification, but
they used it as a regression model by formulating the task as an equivalent
regression problem. The experimental results showed a training accuracy of
95.82% and a testing accuracy of 83.63%. By selecting the most correlated
features, the model can accurately predict the car price.

II. (Gegic, Isakovic, Keco, Masetic, & Kevric, 2019), from the International Burch
University in Sarajevo, used three different machine learning techniques to predict
used car prices: Support Vector Machine, Random Forest, and Artificial Neural Network.
The data was scraped from a local Bosnian website for used cars and totalled 797 car
samples after pre-processing. The results showed that using only one machine learning
algorithm achieved accuracies below 50%, whereas combining the algorithms with
pre-classification of prices using Random Forest achieved accuracies of up to 87.38%.

III. (K.Samruddhi & Kumar, 2020) proposed a supervised machine learning model
using K-Nearest Neighbor to predict used car prices from a data set obtained from
Kaggle containing 14 different attributes. With this method, accuracy reached up to
85% after trying different values of K and changing the ratio of training data to
testing data; better accuracy was obtained when the percentage of data used for
testing was increased. The model was also cross-validated with 5 and 10
folds using the K-fold method.

IV. Pal et al. (2018) used Random Forest, a supervised learning method, to estimate the price
of used cars. The model was chosen after careful exploratory data analysis to determine
the effect of each feature on the price. A Random Forest with 500 Decision Trees was
trained on the data. The test results showed a training accuracy of 95.82% and a test
accuracy of 83.63%. The model was able to accurately estimate the price of the cars by
choosing the most suitable features.

V. Noor and Jan (2017) present a vehicle price forecasting system using a supervised
machine learning technique. The research uses multiple linear regression
as the estimation method and reports 98% prediction precision. In
multiple linear regression there are several independent variables but only one
dependent variable, whose actual and predicted values are compared to find the
precision of the results. In their system the price is the predicted variable, and it is
derived from factors such as vehicle model, brand, city, version, color, mileage, alloy
wheels, and power steering.

Hence, from the literature review it is concluded that used car price prediction is an
important topic that is being studied by many researchers. On Kaggle's dataset, the random
forest technique achieved a test accuracy of 83.63%. The researchers have
tested multiple regressors and the final model is a regression model using linear regression.

4. Methodology

Figure: System Flowchart.

For the second-hand car price prediction, we first collected the data from one of India's
second-hand car selling websites using a web scraping tool called Octoparse, which is
online software for scraping the web to extract the required data. After the web scraping,
we have the raw data, which still includes all the unnecessary features, so we carry out a
data cleaning process to remove the unwanted features and null values and apply some
feature engineering techniques to make the data ready for model training. Then, using the
cleaned data, we train the model, perform the price prediction, and evaluate the
performance of the model.

Dataset

For this project we collected the dataset from a popular existing second-hand car
selling website in India, https://droom.in/, which is India's first and only online
marketplace for buying and selling new and used automobiles. With over 65% of the
online automobile transactions market share, Droom is the largest auto portal in India,
with a fully transactional online platform enabling end-to-end transactions between buyer
and seller. It has over 238,105 vehicles listed on its platform.

Octoparse

For the data extraction we used Octoparse, a modern visual web data
extraction tool. Both experienced and inexperienced users find it easy to use
Octoparse to bulk-extract information from websites, and most scraping tasks need no
coding. It is one of the easiest and fastest ways to get data from the web without
having to write code.

Figure: Octoparse Logo

After the data extraction with Octoparse, the data is exported in Excel format.
The exported file is first cleaned and is then used for training the machine learning
model.

For the data cleaning process:

First, drop the features which are not needed for our model.

Rename the columns with their respective names.

Remove the "km" suffix that is attached to every value in the km_driven feature.
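The cleaning steps above can be sketched with pandas. This is only an illustration under stated assumptions: the report does not list the scraped columns, so the column names `page_url`, `Title`, `Kms` and `Price` and all sample values are hypothetical; only `km_driven` and its "km" suffix come from the text.

```python
import pandas as pd

# Hypothetical sample of scraped rows; the real Octoparse export is not shown in the report.
raw = pd.DataFrame({
    "page_url": ["u1", "u2", "u3"],               # assumed unwanted feature
    "Title": ["Maruti Swift", "Hyundai i20", None],
    "Kms": ["45,000 km", "30500 km", "12,000 km"],
    "Price": [350000, 420000, 510000],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Drop features that are not needed for the model.
    df = df.drop(columns=["page_url"])
    # 2. Rename the columns to descriptive names.
    df = df.rename(columns={"Title": "name", "Kms": "km_driven", "Price": "selling_price"})
    # 3. Strip the "km" suffix and thousands separators, then convert to integers.
    df["km_driven"] = (
        df["km_driven"]
        .str.replace("km", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
        .astype(int)
    )
    # Remove rows that contain null values.
    return df.dropna().reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

The third row is dropped because its name is missing, and `km_driven` ends up as a numeric column that a model can use directly.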

Evaluation Metrics

Evaluation metrics are used to measure the quality of a statistical or machine learning
model. Evaluating machine learning models or algorithms is essential for revealing
weaknesses in a model or flaws in a dataset, because the metrics make it possible to
compare the results of different models.

Since we have chosen KNN as the algorithm for our model training, we have chosen
the following evaluation metrics for model evaluation:

1. Mean squared error (MSE)

Mean squared error (MSE) measures the amount of error in statistical models. It assesses
the average squared difference between the observed and predicted values. It is
calculated by squaring each error (the difference between an observed value and the
value predicted by the model) and taking the mean of those squared errors. As the
predicted values fall closer to the observed values, the model has less error, decreasing
the MSE. A model with less error produces more precise predictions.

When a model has no error, the MSE equals zero. As model error increases, its value
increases. The mean squared error is also known as the mean squared deviation (MSD).
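Written out, the definition above takes the following form. The symbols are introduced here for illustration and are not used elsewhere in the report: n is the number of observations, y_i an observed price and \hat{y}_i the corresponding predicted price.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```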

2. Root Mean Square Error (RMSE)

Root mean square error (RMSE), or root mean square deviation, is one of the most commonly
used measures for evaluating the quality of predictions. It is the square root of the MSE
and shows how far predictions fall from measured true values using Euclidean distance.

Lower values of RMSE indicate a better fit. The closer the RMSE is to zero, the
better the regression model.
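A small worked example may make the relationship between MSE, RMSE and Euclidean distance concrete. The prices below are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical observed and predicted prices (in thousands).
y_true = np.array([300.0, 450.0, 500.0, 620.0])
y_pred = np.array([310.0, 430.0, 500.0, 650.0])

errors = y_true - y_pred              # [-10, 20, 0, -30]
mse = np.mean(errors ** 2)            # (100 + 400 + 0 + 900) / 4 = 350
rmse = np.sqrt(mse)                   # same units as the prices

# Equivalent view: Euclidean distance between the two vectors, scaled by sqrt(n).
rmse_from_distance = np.linalg.norm(errors) / np.sqrt(len(errors))

print(f"MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```

Because RMSE is expressed in the same units as the target, it is usually easier to interpret than MSE when talking about prices.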

3. Mean Absolute Error(MAE)

Mean Absolute Error (MAE) is a model evaluation metric used with regression models. It
tells us how big an error we can expect from the prediction on average.
MAE measures the average magnitude of the errors in a set of
predictions, without considering their direction. It is the average
over the test sample of the absolute differences between prediction and actual observation,
where all individual differences have equal weight.

Lower values of MAE indicate a better fit. The closer the MAE is to zero, the
better the regression model.
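The sketch below illustrates what "equal weight" means in practice by comparing MAE with RMSE on two hypothetical sets of predictions. The helper names and the numbers are illustrative assumptions, not part of the project code.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute differences between observed and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Square root of the mean squared difference."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

y_true = [300.0, 450.0, 500.0, 620.0]           # hypothetical prices (in thousands)
even = [310.0, 440.0, 510.0, 610.0]             # every prediction is off by 10
one_large_miss = [300.0, 450.0, 500.0, 580.0]   # a single prediction is off by 40

print(mae(y_true, even), rmse(y_true, even))
print(mae(y_true, one_large_miss), rmse(y_true, one_large_miss))
```

Both sets of predictions have the same MAE, but the RMSE is higher for the set with one large miss, because squaring penalises large errors more heavily.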

4. R-Squared or R2 Score

R-Squared (R² or the coefficient of determination) is a statistical measure in a regression
model that determines the proportion of variance in the dependent variable that can be
explained by the independent variables. In other words, R-squared shows how well the data
fit the regression model (the goodness of fit).

R-squared usually takes values between 0 and 1, although it can be negative when the
model predicts worse than simply using the mean of the observed values. Although the
measure provides some useful insights regarding the regression model, the user should not
rely only on this measure in the assessment of a statistical model.

A low R-squared figure is generally a bad sign for predictive models. However, in some
cases, a good model may show a small value.
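As a minimal sketch, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample prices are assumptions made for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

y_true = [300.0, 450.0, 500.0, 620.0]   # hypothetical prices (in thousands)

perfect = r_squared(y_true, y_true)                        # predictions match exactly
mean_only = r_squared(y_true, [np.mean(y_true)] * 4)       # always predict the mean
reversed_order = r_squared(y_true, [620.0, 500.0, 450.0, 300.0])  # badly wrong predictions

print(perfect, mean_only, reversed_order)
```

Perfect predictions give 1, always predicting the mean gives 0, and predictions that are worse than the mean give a negative value.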

Experimental Setup

1. Python

Python is a very popular general-purpose, interpreted, interactive, object-oriented, and
high-level programming language. Python is a dynamically-typed and garbage-collected
programming language. This project uses the Python programming language in
combination with the Flask framework.

2. Anaconda

Anaconda is a distribution of packages built for data science. It comes with conda, a
package and environment manager. Anaconda is an open-source distribution of the
Python and R programming languages for data science that aims to simplify package
management and deployment. Package versions in Anaconda are managed by the
package management system, conda, which analyzes the current environment before
executing an installation to avoid disrupting other frameworks and packages.

3. Jupyter Notebook

The Jupyter Notebook is a server-client application that allows editing and running
notebook documents via a web browser. The Jupyter Notebook can be executed on a
local desktop requiring no internet access or can be
installed on a remote server and accessed through the internet.

4. Pandas

Pandas is a Python package that provides fast, flexible, and expressive data structures
designed to make working with "relational" or "labeled" data both easy and intuitive. It
aims to be the fundamental high-level building block for doing practical, real world data
analysis in Python. Additionally, it has the broader goal of becoming the most powerful
and flexible open source data analysis / manipulation tool available in any language. It is
already well on its way towards this goal.

5. Scikit-learn

Scikit-learn is a key library for the Python programming language that is typically used in
machine learning projects. Scikit-learn is focused on machine learning tools including
mathematical, statistical and general purpose algorithms that form the basis for many
machine learning technologies.

6. Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Matplotlib makes easy things easy and hard things possible.
It can create publication-quality plots and interactive figures that can zoom, pan, and
update.

7. Seaborn

Seaborn is a library for making statistical graphics in Python. It builds on top of
matplotlib and integrates closely with pandas data structures. Seaborn helps you explore
and understand your data.

8. NumPy

NumPy is a Python library used for working with arrays. It also has functions for working
in the domain of linear algebra, Fourier transforms, and matrices. NumPy was created in
2005 by Travis Oliphant. It is an open source project and can be used freely.

5. Result and Discussion

For model training, we tried several machine learning algorithms to compare the
performance and accuracy score of each. The KNeighbors Regressor scored the highest
accuracy for both the test and train data, and from this we conclude that the K-nearest
neighbours regression algorithm is best suited to predicting the price of second-hand
cars for our dataset.
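A comparison of this kind could be run along the following lines with scikit-learn. This is a sketch and not the project's actual code: the data is synthetic, the two features (`year`, `km_driven`) and the price formula are invented, and all hyperparameters are library defaults apart from `n_neighbors=5`. It also assumes that the "accuracy score" refers to the R² value returned by `.score()` for regressors, which the report does not state.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned dataset (the real data is not reproduced here).
rng = np.random.default_rng(0)
n = 400
year = rng.integers(2005, 2023, size=n)
km_driven = rng.integers(5_000, 200_000, size=n)
X = np.column_stack([year, km_driven]).astype(float)
y = 50_000 + 30_000 * (year - 2005) - 1.5 * km_driven + rng.normal(0, 20_000, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Distance-based and kernel-based models are wrapped with a scaler.
models = {
    "SVM (SVR)": make_pipeline(StandardScaler(), SVR()),
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "KNeighbors Regressor": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # .score() returns R^2 for scikit-learn regressors.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name:22s} train R2 = {train_r2:.3f}  test R2 = {test_r2:.3f}")
```

Which model ranks highest depends entirely on the data, so the ordering produced by this synthetic example says nothing about the results on the real dataset.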

The following are the algorithms that we evaluated before choosing the best one:

1. SVM Algorithm

Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression. For classification, the algorithm works by finding
the hyperplane in N-dimensional space that distinctly separates the data points; for
regression, the same idea is used to fit a function to the data.

This algorithm is used in used car price prediction because it can find complex
relationships in car price data without requiring a lot of manual transformations.
Another reason is the ability of the algorithm to give accurate results when
dealing with datasets containing hundreds or thousands of records.
The accuracy score obtained for the SVM algorithm is as follows:

2. Linear Regression

Linear Regression is a machine learning algorithm based on supervised learning.
This algorithm is one of the most suitable algorithms for used car price
prediction.

Linear regression is used for predicting the used car price because it predicts a
dependent variable, here the selling price of the car, based on the
given features. The algorithm finds the linear relationship between the
features and the selling price of the car.
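To show what finding the linear relationship means, the sketch below fits price = b0 + b1·age + b2·km by ordinary least squares with NumPy. The five cars and their prices are hypothetical, and the feature choice is an assumption made for illustration.

```python
import numpy as np

# Hypothetical cars: [age in years, km driven in thousands] -> price in thousands.
X = np.array([[1, 10], [3, 35], [5, 60], [7, 80], [9, 120]], dtype=float)
y = np.array([800, 650, 500, 380, 200], dtype=float)

# Add a column of ones so that the model also learns an intercept b0.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(age, km):
    """Predicted price for a car of the given age and mileage."""
    return float(coef[0] + coef[1] * age + coef[2] * km)

print("intercept and slopes:", np.round(coef, 2))
print("predicted price for a 4-year-old car with 50k km:", round(predict(4, 50), 1))
```

The fitted coefficients are the values that minimise the sum of squared differences between the predicted and observed prices.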

The accuracy score obtained for linear regression is as follows:

3. Decision Tree

In a decision tree, to make a prediction for a given record, the algorithm starts
from the root node of the tree. It compares the value of the root attribute
with the corresponding attribute of the record and, based on the comparison, follows the
branch and jumps to the next node.

We used a decision tree regressor since our dataset is numerical and our target
value is continuous. It works by splitting the data in a tree-like pattern
into smaller and smaller subsets. Then, when predicting the selling price, it
predicts based on the subset that the car's features fall into.
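The splitting idea can be illustrated with a single split on one feature, which is a deliberately simplified sketch rather than a full decision tree: a real regressor repeats this search recursively over all features. The data and the function name `best_split` are hypothetical.

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on x that minimises the summed squared error of the two leaf means."""
    best_threshold, best_sse = None, np.inf
    values = np.unique(x)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for threshold in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= threshold], y[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold

# Hypothetical data: car age in years -> price in thousands.
age = np.array([1, 2, 3, 8, 9, 10], dtype=float)
price = np.array([800, 780, 760, 300, 280, 260], dtype=float)

threshold = best_split(age, price)
left_mean = price[age <= threshold].mean()
right_mean = price[age > threshold].mean()

def predict(a):
    # A leaf predicts the mean price of the training cars that fell into it.
    return left_mean if a <= threshold else right_mean

print(threshold, predict(2), predict(9))
```

Here the split separates the newer cars from the older ones, and each side predicts the average price of its own subset.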

4. K-Nearest Neighbour (KNN)

K-Nearest Neighbour is one of the simplest machine learning algorithms based on the
supervised learning technique.

The KNN algorithm uses 'feature similarity' to predict the values of new data points.
In our dataset, the selling price of a new car is assigned a value based on how closely
the car resembles the points in the training set. In this way, the KNN algorithm
automatically estimates the price of new data based on feature similarity.
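The feature-similarity idea can be sketched in a few lines of NumPy: find the k most similar training cars and average their prices. The training cars, the choice of k = 3, and the standardisation step are assumptions made for this illustration; the report does not state the value of K or the preprocessing that was used.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict by averaging the targets of the k training points closest to x_new."""
    # Standardise the features so that km_driven does not dominate the distance.
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    Z_train = (X_train - mean) / std
    z_new = (np.asarray(x_new, dtype=float) - mean) / std
    distances = np.linalg.norm(Z_train - z_new, axis=1)   # Euclidean distance to each car
    nearest = np.argsort(distances)[:k]                   # indices of the k most similar cars
    return float(y_train[nearest].mean())

# Hypothetical training cars: [year, km_driven] -> selling price.
X_train = np.array([[2018, 20_000], [2017, 30_000], [2016, 40_000],
                    [2010, 120_000], [2009, 130_000], [2008, 150_000]], dtype=float)
y_train = np.array([600_000, 560_000, 520_000, 200_000, 180_000, 160_000], dtype=float)

newer_car = knn_predict(X_train, y_train, [2017, 28_000], k=3)
older_car = knn_predict(X_train, y_train, [2009, 125_000], k=3)
print(newer_car, older_car)
```

A new car that resembles the three newer training cars receives their average price, and one that resembles the three older cars receives the lower average.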

6. Conclusion

In recent years, online used car trading platforms have developed rapidly, but they still
face many problems. In practice, institutions and individuals differ in how they screen the
characteristic variables of used car prices and predict used car prices. Such
conditions can easily lead to unsound development of the market and make it difficult
to establish a unified evaluation system, which causes great difficulties in used car
transactions. In terms of theory, traditional used car price evaluation methods rely too
much on the subjective judgment of evaluators and can no longer meet the needs of
online transactions in the used car market. Therefore, it is necessary to establish an
efficient, reasonable, fair, and accurate used car price evaluation system. This
paper therefore analyzes the factors affecting the price of used cars from various aspects
of the car, such as name, model, fuel type, transmission, and mileage. Based on this, we
trained a machine learning model that is able to predict the car price from the features
mentioned above. With the implementation of the linear regression model, we
achieved a model accuracy of 84.44% for predicting the car price.

7. References

[1] K. Noor and S. Jan, “Vehicle Price Prediction System using Machine Learning Techniques,”
International Journal of Computer Applications, vol. 167, no. 9, pp. 27–31, Jun.
2017, doi: 10.5120/ijca2017914373

[2] N. Pal, P. Arora, D. Sundararaman, P. Kohli, and S. S. Palakurthy, “How much is my car
worth? A methodology for predicting used cars prices using Random Forest,”
arxiv.org, Nov. 2017 [Online]. Available: https://arxiv.org/abs/1711.06970

[3] “TEM JOURNAL - Technology, Education, Management, Informatics,”
www.temjournal.com. [Online]. Available:
https://www.temjournal.com/content/81/TEMJournalFebruary2019_113_118.html.
[Accessed: Dec. 01, 2022]

[4] A. Kharwal, “Car Price Prediction with Machine Learning | Aman Kharwal,”
thecleverprogrammer, Aug. 04, 2021. [Online]. Available:
https://thecleverprogrammer.com/2021/08/04/car-price-prediction-with-machine-learning/.
[Accessed: Dec. 01, 2022]

[5] T. Akhtar, “Predicting Car Price using Machine Learning,” Medium, Nov. 23, 2020. [Online].
Available: https://towardsdatascience.com/predicting-car-price-using-machine-learning-8d2df3898f16

[6] “Build software better, together,” GitHub. [Online]. Available:
https://github.com/topics/car-price-prediction. [Accessed: Dec. 01, 2022]

[7] K. Kumari, “Car Price Prediction - Machine Learning vs Deep Learning,” Analytics Vidhya,
Jul. 25, 2021. [Online]. Available:
https://www.analyticsvidhya.com/blog/2021/07/car-price-prediction-machine-learning-vs-deep-learning/.
[Accessed: Dec. 01, 2022]
