
Cite as: AIP Conference Proceedings 2516, 240001 (2022); https://doi.org/10.1063/5.0108560
Published Online: 30 November 2022
© 2022 Author(s).
Predicting Resale Car Prices Using Machine Learning
Regression Models With Ensemble Techniques

Jayashree Rama Krishnan a) and Vanitha Selvaraj b)

Department of Computer Application, College of Science and Humanities,
SRM Institute of Science and Technology,
Kattankulathur, Chennai, India - 603203.

a) [email protected]
b) Corresponding author: [email protected]

Abstract. Cars are the preferred vehicle for comfortable travel, especially on long trips. Nowadays, people try to buy new cars, but new tax policies increase the price by adding additional charges. So, most people opt for second-hand cars because of their nominal price. Online portals have also helped a lot with buying and selling used cars. Here, machine learning algorithms play a vital role in predicting the right price for the right car. In this paper, Multiple Linear Regression, KNN, Random Forest, Gradient Boosting, and XGBoost models are developed and their results are compared for accuracy. Among these, XGBoost gives the highest R-squared score: 88% on training data and 87% on test data. In the existing system, all null values are dropped during pre-processing, and a label encoder is used for converting categorical data to numerical form, but that is suitable only for ordinal data. In the proposed system, two datasets are used to analyze the performance of imputation: all null values are imputed in one dataset, and the results are compared with a second, non-imputed dataset. For categorical data conversion, one-hot encoding is used to represent feature availability. Finally, the results are analyzed and the pros and cons of each technique are discussed.

Keywords: Machine Learning, Multiple Linear Regression, KNN, Random Forest, Gradient Boosting, XGBoost,
Car price prediction.

INTRODUCTION
Cars are becoming one of the important possessions in our homes. New cars are too costly, and their prices keep increasing for several reasons. This has changed people's mindset toward buying second-hand cars. In the last couple of years, many e-commerce portals such as autoportal.com, cartrade.com, gaadi.com, and cars24.com have been developed for buying and selling used cars. They let both buyers and sellers complete a car transaction instantly from home. Here, price is an important factor that people have to think about while buying or selling a car.
To predict the correct price, machine learning algorithms do a wonderful job. This is done by considering various features of a car, such as age, kilometers driven, horsepower, model, brand, and fuel type. Using statistical measures, the machine learning algorithms predict the price of the car, so buyers and sellers can be confident before meeting each other. The goal of this paper is to predict used car prices using several regression models (Multiple Linear Regression, KNN, Random Forest, Gradient Boosting, and XGBoost) and to compare their results.

LITERATURE REFERENCES
Pudaruth [7] used different machine learning algorithms, namely multiple linear regression, k-nearest neighbors, decision trees, and naïve Bayes, for predicting car prices in Mauritius. The author collected the dataset manually from local newspapers over a short period. He considered the following variables to create a model: brand, model, cubic capacity, mileage in km, manufacturing year, exterior color, transmission type,
and price. However, the author found that naïve Bayes and decision trees were unable to handle numeric values. Also, fewer records were used in his model. The accuracy reached 70%.
Nabarun et al. [8] used an ensemble technique, Random Forest, to predict used car prices. The authors achieved 95% accuracy on training data and 83% on test data by creating 500 decision tree models. That model is over-fitted (there is a large gap between training and test accuracy); the over-fitting ratio is somewhat reduced in the Random Forest model of this paper, which scores 92% on training data and 85% on test data.
Monburinon et al. [9] used advanced algorithms, gradient boosting and random forest, for predicting used car prices, and compared them with a traditional regression model, Multiple Linear Regression. Among those, gradient boosting (MSE = 0.28) outperformed random forest (MSE = 0.35) and multiple linear regression (MSE = 0.55). In that paper, the authors used label encoding for categorical data conversion because of system limitations. With label encoding, even though the values of a particular column are discrete, the converted values are treated as if some ordinal relationship existed between them, which can feed wrong information into the model. This problem is rectified in the proposed system by applying the one-hot encoding technique for categorical-to-numerical conversion, and the accuracy also improved: the Mean Squared Error (MSE) is 0.29 for Multiple Linear Regression and 0.16 for Random Forest and Gradient Boosting.
Noor et al. [10] used the multiple linear regression algorithm to predict used car prices. The authors collected the dataset over a short period and included the following car features: price, cubic capacity, exterior color, date when the ad was posted, number of ad views, power steering, mileage in kilometers, rim type, type of transmission, engine type, city, registered city, model, version, make, and model year. After applying feature selection, the authors kept only a few input features: engine type, price, model year, and model. Finally, the authors attained a prediction accuracy of 98%.
Enis Gegic et al. [11] proposed car price prediction using machine learning techniques. These authors combined Support Vector Machine, Random Forest, and Artificial Neural Network (ANN) models as an ensemble. They collected the data from the web portal www.autopijaca.ba and built the model to predict the prices of used cars in Bosnia and Herzegovina. The prediction model accuracy is 87%.

DATASET AND PRE-PROCESSING


The dataset is collected from www.kaggle.com in CSV format [1]. It contains used car details taken from eBay Kleinanzeigen, a German e-commerce company: 50,001 records and 19 columns in total (13 categorical columns and 6 numerical columns). To prepare the dataset for the ML model, the following preprocessing steps are applied:

Dropping Unwanted Columns


In this dataset, the dateCrawled, name, abtest, dateCreated, lastseen, and offerType columns are considered irrelevant and are removed from the dataset [2].
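As an illustrative sketch, the column removal can be done with pandas; the file name autos.csv and the DataFrame variable df are assumptions, while the column names are those listed above:

```python
import pandas as pd

# Load the Kaggle eBay-Kleinanzeigen listings (the file name is illustrative).
df = pd.read_csv("autos.csv")

# Drop the columns treated as irrelevant to price prediction.
irrelevant = ["dateCrawled", "name", "abtest", "dateCreated", "lastseen", "offerType"]
df = df.drop(columns=irrelevant)
```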

Removing Outliers
Outliers are extreme values compared to the other values in the dataset. They need to be removed to get a more generalized model.
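The paper does not state which outlier rule is applied; the sketch below assumes the common 1.5 × IQR fence on the numerical columns price and powerPS, continuing from the df above:

```python
# Assumed approach: keep only rows inside the 1.5 * IQR fences of the
# numerical columns; the paper does not specify its exact outlier rule.
for col in ["price", "powerPS"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[(df[col] >= q1 - 1.5 * iqr) & (df[col] <= q3 + 1.5 * iqr)]
```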

Imputing Null Values


Imputation is the process of replacing null values with some other values. In this dataset, float-typed columns are replaced by the median, and the other types are replaced by the most frequent value of the column. The mean and mode were also tried, but they gave lower scores than the median.
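A minimal sketch of the stated strategy with scikit-learn's SimpleImputer, assuming the same df (median for float columns, most frequent value for the rest):

```python
from sklearn.impute import SimpleImputer

float_cols = df.select_dtypes(include="float").columns
other_cols = df.columns.difference(float_cols)

# Median for float columns, most frequent value for all remaining columns.
df[float_cols] = SimpleImputer(strategy="median").fit_transform(df[float_cols])
df[other_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[other_cols])
```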

Conversion of Categorical Value


One-hot encoding is used for numerical conversion to represent the availability of a feature; after conversion, a total of 301 columns are created. With a label encoder, even though there is no relationship between the values of a variable, the model would assume that some ordinal relationship exists between them.
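A sketch of the conversion with pandas get_dummies, assuming every remaining object-typed column is categorical:

```python
# One-hot encode every remaining object-typed (categorical) column; each
# category becomes a 0/1 column that marks the availability of that feature.
categorical_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=list(categorical_cols))
```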

Log Transform
In this dataset, the target values (price) are highly skewed, so a log transform is used to bring the data closer to a normal distribution [3].
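A sketch of the target transform; log1p (log(1 + x)) is an assumption here, chosen so that a zero price does not map to negative infinity:

```python
import numpy as np

# Log-transform the skewed target; log1p (log(1 + x)) is assumed here so
# that a zero price does not map to negative infinity.
df["price"] = np.log1p(df["price"])
```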
After pre-processing, two datasets are maintained for model creation: one for evaluating imputed data, with 42,772 records, and another for non-imputed data, with 32,884 records after removing all rows containing null values.

EXPLORATORY DATA ANALYSIS (EDA)


Exploratory data analysis helps visualize the data using different types of graphs through pictorial representation. The following graphs are used to analyze the nature of the dataset.

FIGURE 1. Boxplot for powerPS

Figure 1 shows a boxplot of the statistical measures of powerPS: minimum, maximum, and percentile values. The first horizontal black line from the bottom marks the minimum value of powerPS, the second line the 25th percentile, the third line the 50th percentile (the median), the fourth line the 75th percentile, and the fifth line the maximum value. Beyond the fifth line there are thick adjacent black circles that look like a thick vertical line because of overlapping data points; these are considered outliers.

FIGURE 2. Count Plot for VehicleType

Figure 2 shows a bar graph of the vehicle types and their counts. There are 8 different vehicle types in this dataset, including "others". Among these, the limousine type has the highest count.

FIGURE 3. Kilometer vs Price

Figure 3 shows a boxplot of the close relationship between kilometer and price: as the kilometer reading increases, the price of the car gradually decreases. The first box is considered an outlier. From this, the graph leads to the conclusion that kilometer is one of the important features for predicting the price.

TABLE 1. Correlation Table

              Price    PowerPS  Kilometer     Age
Price         1.000      0.575     -0.440  -0.336
PowerPS       0.575      1.000     -0.016  -0.151
Kilometer    -0.440     -0.016      1.000   0.292
Age          -0.336     -0.151      0.292   1.000

In Table 1, the correlation between the columns is measured to identify the important features that should be given to the model. Correlation is evaluated only on numerical values. Here, powerPS and kilometer are correlated with the price value. If the absolute value is >= 0.5 and < 0.8, it is considered a moderate dependency; if the value is >= 0.8, it is considered a highly dependent variable.
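Table 1 can be reproduced with pandas, assuming the numerical columns are named price, powerPS, kilometer, and age (age being a derived column, presumably computed from the registration year, which is an assumption):

```python
# Pearson correlation over the numerical features only (Table 1); "age" is
# assumed to be a derived column (car age computed from the registration year).
corr = df[["price", "powerPS", "kilometer", "age"]].corr()
print(corr.round(3))
```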

METHODOLOGY

FIGURE 4. Proposed System Architecture

Figure 4 shows the flowchart of the proposed system and explains the complete workflow of the machine learning model.

After pre-processing, the dataset is ready for model building. The dataset is split into two parts: one for training the model and another for evaluating it. In this paper, the ratio of division is 0.70 for training and 0.30 for testing. After creating the model, it is evaluated with regression metrics; if the accuracy is satisfactory the model is deployed, otherwise the model is improved through parameter tuning.
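A sketch of this split with scikit-learn, assuming the log-transformed price column is the target; the random_state is illustrative:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["price"])
y = df["price"]

# 70% of the rows for training, 30% held out for evaluation;
# the random_state is illustrative.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```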

MULTIPLE LINEAR REGRESSION (MLR)


MLR is a prediction algorithm that uses statistical calculations to forecast future values. The difference between linear and multiple linear regression is that MLR uses more than one independent variable to predict the target (dependent) variable. It helps to find the relationship between the independent and dependent variables and shows how changes in the independent variables affect the dependent variable [4].

y  c0 + c1* x1 + c2 * x2 + c3 * x3 + …cn * xn + e (1)

Eq.1 where y=dependent variable


xi = independent variables
c0 =y-intercept (Constant)
cn =coefficients of each independent variable called slope.
e = error/residuals of the model
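A minimal sketch of fitting Eq. (1) with scikit-learn, assuming the X_train and y_train arrays from the split above:

```python
from sklearn.linear_model import LinearRegression

# Ordinary least squares fit of Eq. (1): intercept_ holds c0 and
# coef_ holds the slopes c1..cn, one per independent variable.
mlr = LinearRegression()
mlr.fit(X_train, y_train)
print(mlr.intercept_, mlr.coef_[:5])
```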

K-NEAREST NEIGHBORS (KNN)
KNN works on the logic of "feature similarity" to predict values for new data points. It first measures the distance between the new data point and its neighboring data points. The distance is calculated with a mathematical function; the most common metrics are Manhattan distance and Euclidean distance. In Python, the choice is made through the power parameter p: if p = 1, Manhattan distance is used; if p = 2, Euclidean distance is chosen. In this paper, the chosen power parameter is 2.

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$    (2)
Eq. (2) gives the Euclidean distance, where x is the new point and y is an existing point. KNN also follows the lazy-learner technique, meaning the learning process is delayed until a request is made to the system.

In KNN, the number of neighbors (n_neighbors) is a very important parameter; it controls how many neighboring data points are used to check feature similarity. For example, if n_neighbors = 2, the 2 nearest data points are chosen to check feature similarity. By trying a range of values, the most optimal value is selected. In this paper, the range is 1 to 20 and the selected optimal value is 6 neighbors. For prediction, KNN calculates the average value of the nearest neighbors.
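A sketch of this neighbor search with scikit-learn's GridSearchCV; 5-fold cross-validation is an assumption, since the paper does not state how the candidate values were compared:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Search n_neighbors over 1..20 with Euclidean distance (p=2);
# 5-fold cross-validation is an assumption.
grid = GridSearchCV(
    KNeighborsRegressor(p=2),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)  # the paper reports 6 as the optimal value
```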

RANDOM FOREST (RF)
Random Forest is an ensemble algorithm: many models are created, and each model chooses its sample set randomly from the training data. The final decision is calculated from the results of the different models. This is called the bagging technique. Each model is a decision tree, and combining their results is called voting. Random Forest is one of the best-performing models because it takes decisions based on the majority of the voting system. It can be defined as:

$f(x) = t_1(x) + t_2(x) + t_3(x) + \dots + t_n(x)$    (3)

In Eq. (3), f is the sum of the various base models t_i. In this paper, 100 trees are created, the maximum tree depth is 100, and the minimum samples per split is 10.
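A sketch with scikit-learn using the stated hyperparameters; random_state is illustrative:

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters as reported above: 100 trees, maximum depth 100,
# minimum samples per split 10; random_state is illustrative.
rf = RandomForestRegressor(
    n_estimators=100, max_depth=100, min_samples_split=10, random_state=42
)
rf.fit(X_train, y_train)
```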

GRADIENT BOOSTING (GB)
Gradient Boosting is also an ensemble model like Random Forest (it creates multiple trees) and can be seen as an improved version of it. In RF, all the models are decision trees created independently, and their results are evaluated and combined at the end. In GB, models are created sequentially: each new model is a weak learner trained to correct the errors of the previous models, and the results are then combined.
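A sketch with scikit-learn's GradientBoostingRegressor; the learning rate shown is the library default, since the paper does not report its value:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Sequential boosting of shallow trees; the learning rate shown is the
# library default, since the paper does not report its value.
gb = GradientBoostingRegressor(learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
```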

XGBOOST
XGBoost is also an ensemble technique and is an improvement over the Gradient Boosting algorithm; it is among the most powerful ensemble techniques. It uses a gradient descent algorithm to reduce the loss function when adding new models and applies advanced regularization (L1 & L2), which also improves model generalization [5]. The advantage of XGBoost is very fast training, which can be distributed across clusters. Ensemble techniques of this kind generally produce better results compared with other techniques [6].
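A sketch with the xgboost library; the regularization values are illustrative defaults, not the paper's settings:

```python
from xgboost import XGBRegressor

# reg_alpha and reg_lambda are the L1 and L2 regularization terms
# mentioned above; the values here are illustrative defaults.
xgb = XGBRegressor(n_estimators=100, reg_alpha=0.0, reg_lambda=1.0)
xgb.fit(X_train, y_train)
```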

RESULTS

FIGURE 5. Scatter Plot for Actual and Predicted Price (before imputation)

Figure 5 shows a scatter plot of the actual and predicted prices, and all the values are closely related. From this plot it can be concluded that the predicted values lie very near the actual values, which is why most of the blue dots cannot be seen separately in this graph.

TABLE 2. Comparison metrics (before imputation)

Model                        MAE    MSE    RMSE   R-Squared (train)   R-Squared (test)
Multiple Linear Regression   0.38   0.29   0.54   0.78                0.76
KNN (K=6)                    0.36   0.26   0.51   0.81                0.80
Random Forest                0.30   0.19   0.43   0.92                0.85
Gradient Boosting            0.28   0.16   0.44   0.85                0.84
XGBoost                      0.27   0.16   0.40   0.88                0.87

MAE = Mean Absolute Error; MSE = Mean Squared Error; RMSE = Root Mean Squared Error.

Table 2 shows the different accuracy metrics used to evaluate the performance of the algorithms on the non-imputed dataset. Here, XGBoost gives the highest accuracy. The MAE, MSE, and RMSE values should be close to zero for a better model prediction [3]; a large value indicates a large error. The R-squared values should be close to 1; if R-squared is 1, the prediction is 100% perfect.

TABLE 3. Comparison metrics (after imputation)

Model                        MAE    MSE    RMSE   R-Squared (train)   R-Squared (test)
Multiple Linear Regression   0.47   0.42   0.64   0.70                0.70
KNN (K=6)                    0.37   0.28   0.53   0.79                0.79
Random Forest                0.34   0.24   0.49   0.90                0.82
Gradient Boosting            0.32   0.20   0.45   0.86                0.84
XGBoost                      0.31   0.21   0.46   0.91                0.84

Table 3 shows the same accuracy metrics evaluated on the imputed dataset. Here, Gradient Boosting performs well, but the accuracy of all the algorithms is lower compared to the non-imputed dataset.

CONCLUSION
In this paper, Python programming is used to create all of these regression models, and their results are compared. In this comparison, data imputation does not increase the accuracy, but it may work well on other datasets. Even though the ensemble algorithms have the capability to handle missing values, the missing values are also handled manually with the desired values. Before imputation, XGBoost gives the highest R-squared score: 88% for training data and 87% for test data. The next is Gradient Boosting, which performs well compared to Random Forest, with 85% for training data and 84% for test data. The lowest is Multiple Linear Regression, with 78% for training data and 76% for test data. After imputation, Gradient Boosting provides better performance than XGBoost and the others. The ratio of over-fitting is also high in both XGBoost and Random Forest. Note that the parameters are not changed between the two experiments, for example the number of neighbors in KNN, the number of trees created in Random Forest, and the learning rate in the boosting techniques.
The overall result is that the boosting algorithms (XGBoost and Gradient Boosting) performed well with less over-fitting. Random Forest took more time compared with the other models, and there is also some over-fitting (a difference between the training and test data scores). KNN also takes more time to find the optimal value, but its over-fitting ratio is lower than Random Forest's. Finally, MLR executes fast but gives lower accuracy. So, increasing the accuracy level always increases the complexity of the algorithm as well as the execution time.

FUTURE WORK
The performance of this model can be increased with hyperparameter tuning. All of these methods will also be applied to other real datasets, such as OLX listings, to compare model performance.

REFERENCES
1. Information on https://www.kaggle.com/datasets
2. G. Chandrashekar and F. Sahin, "A survey on feature selection methods", Computers & Electrical Engineering, 40(1), pp. 16-28, (2014).
3. M. C. Newman, "Regression analysis of log-transformed data: Statistical bias and its correction", Environmental Toxicology and Chemistry, 12(6), pp. 1129-1133, (1993).
4. S. Kuiper, "Introduction to Multiple Regression: How Much Is Your Car Worth?", Journal of Statistics Education, 16(3), (2008).
5. T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system", KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794, (2016).
6. S. Lessmann and S. Voß, "Car resale price forecasting: The impact of regression method, private information, and heterogeneity on forecast accuracy", International Journal of Forecasting, 33(4), pp. 864-877, (2017).
7. S. Pudaruth, "Predicting the price of used cars using machine learning techniques", International Journal of Information & Computation Technology, 4(7), pp. 753-764, (2014).
8. N. Pal, P. Arora, D. Sundararaman, P. Kohli, and S. Sumanth Palakurthy, "How much is my car worth? A methodology for predicting used cars' prices using Random Forest", Advances in Information and Communication Networks, 1, pp. 413-422, (2017).
9. N. Monburinon, P. Chertchom, T. Kaewkiriya, S. Rungpheung, S. Buya, and P. Boonpou, "Prediction of prices for used cars by using regression models", 5th International Conference on Business and Industrial Research (ICBIR), pp. 115-119, (2018).
10. K. Noor and S. Jan, "Vehicle price prediction system using machine learning techniques", International Journal of Computer Applications, 167(9), pp. 27-31, (2017).
11. E. Gegic, B. Isakovic, D. Keco, Z. Masetic, and J. Kevric, "Car price prediction using machine learning techniques", TEM Journal, 8(1), pp. 113-118, (2019).
