Car Price Prediction
Abstract
India has one of the largest automobile markets in the world. Every day, many owners sell their cars after a period of use to other buyers, who then become the second or third owner, and so on. Platforms such as cars24.com, cardekho.com and OLX.com give these sellers a marketplace for their used cars, but deciding what the price of the car should be is the toughest question of all. Machine learning algorithms can bring a solution to this problem. Using historical data on used-car sales and machine learning techniques such as supervised learning, a fair price of the car can be predicted. Here I used machine learning algorithms such as Random Forest and Extra Tree Regression, along with the powerful Python library Scikit-Learn, to predict the selling price of a used car. The results show that both algorithms are highly accurate in their predictions, irrespective of whether the dataset is large or small.
Keywords: Machine Learning, Supervised Learning, Random Forest, Extra Tree Regression
Introduction
Car price prediction is an interesting and popular problem. According to data obtained from the Agency for Statistics of BiH, 921,456 vehicles were registered in 2014, of which 84% were cars for personal use. This number increased by 2.7% compared with 2013, and it is likely that this trend will continue, so the number of cars will keep growing in the future. This adds additional significance to the problem of car price prediction. Accurate car price prediction requires expert knowledge, because the price usually depends on many distinctive features and factors. Typically, the most significant ones are brand and model, age, horsepower and mileage. The fuel type used in the car, as well as fuel consumption per mile, highly affects the price of a car due to frequent changes in the price of fuel. Different features such as exterior colour, number of doors, type of transmission, dimensions, safety, air conditioning, interior and whether it has navigation or not will also influence the car price. In this paper, we applied different methods and techniques in order to achieve higher precision in used car price prediction.
Predicting the resale value of a car is not a simple task. It is common knowledge that the value of used cars depends on a number of factors. The most important ones are usually the age of the car, its make (and model), the origin of the car (the original country of the manufacturer), its mileage (the number of kilometres it has run) and its horsepower. Due to rising fuel prices, fuel economy is also of prime importance. Unfortunately, in practice, most people do not know exactly how much fuel their car consumes for each kilometre driven. Other factors such as the type of fuel it uses, the interior style, the braking system, acceleration, the volume of its cylinders (measured in cc), safety index, its size, number of doors, paint colour, weight of the car, consumer reviews, prestigious awards won by the car manufacturer, its physical state, whether it is a sports car, whether it has cruise control, whether it is automatic or manual transmission, whether it belonged to an individual or a company, and other options such as air conditioner, sound system, power steering, cosmic wheels and GPS navigator may all influence the price as well. Some special factors to which buyers attach importance in Mauritius are the number of previous owners, whether the car has been involved in serious accidents and whether it is a lady-driven car. The look and feel of the car certainly contribute a lot to the price. As we can see, the price depends on a large number of factors. Unfortunately, information about all these factors is not always available, and the buyer must make the decision to purchase at a certain price based on a few factors only.
CHAPTER-2
Literature Survey
Several studies have used machine learning to predict the price of used cars. The predictions are based on historical data collected from Kaggle. Different techniques such as Random Forest, multiple linear regression analysis, k-nearest neighbours, naïve Bayes and decision trees have been used to make the predictions. A considerable number of distinct attributes are examined for reliable and accurate prediction; for this purpose, three machine learning techniques (Artificial Neural Network, Support Vector Machine and Random Forest) have been applied. Other work models prices in the second-hand car market using neural networks: a price evaluation model based on big-data analysis is proposed, which takes advantage of widely circulated vehicle data and a large number of vehicle transaction records to analyse the price data for each type of vehicle by using an optimized neural network algorithm. It aims to establish a second-hand car price evaluation model to obtain the price that best matches the car.
The Problem
The prices of new cars are fixed by the manufacturer, with some additional costs incurred by the government in the form of taxes. So, customers buying a new car can be assured that the money they invest will be worth it. But due to the increased prices of new cars and the financial incapability of customers to buy them, used car sales are on a global increase. There is a need for a used car price prediction system that effectively determines the worthiness of a car using a variety of features. Even though there are websites that offer this service, their prediction methods may not be the best. Besides, different models and systems may differ in their predictive power for a used car's actual market value, so it is important to know their comparative performance.
CHAPTER 3
Working of Project
The Client
Being able to predict the market value of used cars can help both buyers and sellers.
Used car sellers (dealers): They are one of the biggest target groups that can be interested in the results of this study. If used car sellers better understand what makes a car desirable and what the important features are for a used car, then they can take this knowledge into account and offer a better service.
The Data
The data used in this project was downloaded from Kaggle. This dataset contains information about used cars. The data can be used for many purposes, such as price prediction, to exemplify the use of linear regression in machine learning. The columns in the given dataset are as follows (a short loading sketch is given after the list):
1. name
2. year
3. selling_price
4. km_driven
5. fuel
6. seller_type
7. transmission
8. Owner
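As a quick illustration, the dataset can be loaded and inspected with pandas as sketched below; the file name car_data.csv is only an assumed placeholder for the downloaded Kaggle file.

import pandas as pd

# Load the used-car listings downloaded from Kaggle
# ("car_data.csv" is an assumed file name for illustration).
df = pd.read_csv("car_data.csv")

# Confirm that the columns listed above are present and inspect the data.
print(df.columns.tolist())
print(df.shape)
print(df.head())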
Data Wrangling
In this section, we discuss how data cleaning and wrangling methods were applied to the used cars data file. Before cleaning the data, some exploration and data visualization were carried out on the data set. This gave some idea of, and guidance on, how to deal with missing values and extreme values. After data cleaning, data exploration was applied again in order to verify the effect of the cleaning steps.
Data cleaning: The first step of data cleaning was to remove unnecessary features. For this purpose, the 'url', 'image_url', 'lat', 'long', 'city_url', 'desc', 'city' and 'VIN' features were dropped entirely. As a next step, the number of null data points and the percentage of null data points for each feature were investigated. After that, some missing values were filled with appropriate values. For the missing 'condition' values, attention was paid to filling them depending on category: the average odometer of each 'condition' sub-category was calculated, and missing values were then filled by comparing a car's odometer with these average values for each condition sub-category. In addition, cars with a model year later than 2019 were filled as 'new', and those between 2017 and 2019 were filled as 'like new'. At the end of this process, all missing values in the 'condition' feature were cleaned.
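A minimal sketch of this filling logic is shown below, assuming column names such as condition, odometer and year and an assumed file name vehicles.csv; the fallback value 'good' is an illustrative choice rather than something specified in the report.

import pandas as pd

df = pd.read_csv("vehicles.csv")  # assumed file name for the listings data

# Drop features that do not help with price prediction.
drop_cols = ['url', 'image_url', 'lat', 'long', 'city_url', 'desc', 'city', 'VIN']
df = df.drop(columns=[c for c in drop_cols if c in df.columns])

# Count and percentage of missing values per feature.
missing = df.isnull().sum()
print((missing / len(df) * 100).round(2).sort_values(ascending=False))

# Average odometer reading for each 'condition' sub-category.
avg_odo = df.groupby('condition')['odometer'].mean()

def infer_condition(row):
    if pd.notnull(row['condition']):
        return row['condition']
    if row['year'] > 2019:        # very recent model years
        return 'new'
    if row['year'] >= 2017:
        return 'like new'
    if pd.notnull(row['odometer']):
        # pick the condition whose average odometer is closest to this car's reading
        return (avg_odo - row['odometer']).abs().idxmin()
    return 'good'                 # illustrative fallback when odometer is also missing

df['condition'] = df.apply(infer_condition, axis=1)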
While exploring the data, we look at different combinations of features with the help of visuals. This helps us understand our data better and gives us some clues about patterns in the data.
An examination of price trend
Price is the feature that we are predicting in this study. Before applying any models, taking a closer look at how prices are distributed and how they trend gives a useful reference point.
SQL
Structured Query Language (SQL) is an indispensable skill in the data science industry and, generally speaking, learning this skill is relatively straightforward. However, most forget that SQL isn't just about writing queries; that is just the first step down the road. Ensuring that queries are performant, or that they fit the context that you're working in, is a whole other matter. That's why this SQL overview provides a small peek at some steps that you can go through to evaluate your queries:
• First off, you’ll start with a short overview of the importance of learning SQL for jobs in data
science;
• Next, you’ll first learn more about how SQL query processing and execution so that you can
adequately understand the importance of writing qualitative queries: more specifically, you’ll see
With that in mind, you’ll not only go over some query anti-patterns that beginners make when
writing queries, but you’ll also learn more about alternatives and solutions to those possible
mistakes; You’ll also learn more about the set-based versus the procedural approach to querying.
• You’ll also see that these anti-patterns stem from performance concerns and that, besides the
“manual” approach to improving SQL queries, you can analyze your queries also in a more
7
structured, in-depth way by making use of some other tools that help you to see the query plan;
And,
• You’ll briefly go more into time complexity and the big O notation to get an idea about the
time complexity of an execution plan before you execute your query; Lastly,
• You'll briefly get some pointers on how you can tune your query further.
SQL is far from dead: it’s one of the most in-demand skills that you find in job descriptions from
the data science industry, whether you’re applying for a data analyst, a data engineer, a data
scientist or any other roles. This is confirmed by 70% of the respondents of the 2016 O’Reilly
Data Science Salary Survey, who indicate that they use SQL in their professional context.
What’s more, in this survey, SQL stands out way above the R (57%) and Python (54%)
programming languages.
You get the picture: SQL is a must-have skill when you're working towards getting a job in the data science industry.
Not bad for a language that was developed in the early 1970s, right?
But why exactly is it so frequently used? And why isn't it dead even though it has been around for such a long time?
There are several reasons: one of the first is that companies mostly store data in Relational Database Management Systems (RDBMS), and you need SQL to access that data. SQL is the lingua franca of data: it gives you the ability to interact with almost any database, or even to build your own locally!
As if this wasn’t enough yet, keep in mind that there are quite a few SQL implementations that
are incompatible between vendors and do not necessarily follow standards. Knowing the
8
standard SQL is thus a requirement for you to find your way around in the (data science)
industry.
On top of that, it’s safe to say that SQL has also been embraced by newer technologies, such as
Hive, a SQL-like query language interface to query and manage large datasets, or Spark SQL,
which you can use to execute SQL queries. Once again, the SQL that you find there will differ
from the standard that you might have learned, but the learning curve will be considerably easier.
If you do want to make a comparison, consider it as learning linear algebra: by putting all that
effort into this one subject, you know that you will be able to use it to master machine learning as
well!
• It’s is fairly easy to learn, even for total newbies. The learning curve is quite easy and gradual,
• It follows the “learn once, use anywhere” principle, so it’s a great investment of your time!
• It’s an excellent addition to programming languages; In some cases, writing a query is even
To improve the performance of your SQL query, you first have to know what happens internally when you execute it. First, the query is parsed into a "parse tree": the query is analyzed to see whether it satisfies the syntactical and semantical requirements, and the parser creates an internal representation of the query, which is then handed to the optimizer. It is then the task of the optimizer to find the optimal execution or query plan for the given query. The execution plan defines exactly what algorithm is used for each operation and how the execution of operations is coordinated.
To find the optimal execution plan, the optimizer enumerates all possible execution plans, determines the quality or cost of each plan, takes information about the current database state into account and then chooses the best one as the final execution plan. Because query optimizers can be imperfect, database users and administrators sometimes need to manually examine and tune the plans produced by the optimizer.
Now you probably wonder what is considered to be a "good query plan". As you already read, the cost of a plan plays a considerable role. More specifically, things such as the
number of disk I/Os that are required to evaluate the plan, the plan’s CPU cost and the overall
response time that can be observed by the database client and the total execution time are
essential. That is where the notion of time complexity will come in. You’ll read more about this
later on.
Next, the chosen query plan is executed and evaluated by the system's execution engine, and the results of your query are returned.
Writing SQL Queries
What might not have become clear from the previous section is that the Garbage In, Garbage Out (GIGO) principle naturally surfaces within query processing and execution: the one who formulates the query also holds the keys to the performance of your SQL queries. If the optimizer gets a poorly formulated query, it will only be able to do so much. That means that there are some things you can do when you're writing a query. As you already saw in the introduction, the responsibility is two-fold: it's not only about writing queries that live up to a certain standard, but also about getting an idea of where performance problems might be lurking.
An ideal starting point is to think of "spots" within your queries where issues might sneak in. In general, there are a few clauses and keywords, such as SELECT DISTINCT, the LIKE operator, the OR operator and the HAVING clause, where newbies can expect performance issues to occur.
Granted, this approach is simple and naive, but as a beginner, these clauses and statements are
excellent pointers, and it’s safe to say that when you’re just starting out, these spots are the ones
where mistakes happen and, ironically enough, where they’re also hard to spot.
However, you should also realize that performance needs a context to become meaningful: merely saying that these clauses and keywords are bad isn't the way to go when you're thinking about SQL performance. Having a WHERE or HAVING clause in your query doesn't automatically make it a bad query.
Take a look at the following section to learn more about anti-patterns and alternative approaches
to building up your query. These tips and tricks are meant as a guide. How and if you actually
need to rewrite your query depends on the amount of data, the database and the number of times
you need to execute the query, among other things. It entirely depends on the goal of your query
and having some prior knowledge about the database that you want to query is crucial!
1. Only Retrieve The Data You Need
The mindset of "the more data, the better" isn't one that you should necessarily live by when you're writing SQL queries: not only do you risk obscuring your insights by getting more than what you actually need, but your performance might also suffer because your query pulls up too much data. That's why it's generally a good idea to look out for the SELECT statement, the DISTINCT clause and the LIKE operator.
A first thing that you can already check when you have written your query is whether the
SELECT statement is as compact as possible. Your aim here should be to remove unnecessary
columns from SELECT. This way you force yourself only to pull up data that serves your query
goal. In case you have correlated subqueries that have EXISTS, you should try to use a constant
in the SELECT statement of that subquery instead of selecting the value of an actual column.
Remember that a correlated subquery is a subquery that uses values from the outer query. And
note that, even though NULL can work in this context as a “constant”, it’s very confusing!
The SELECT DISTINCT statement is used to return only distinct (different) values. DISTINCT is a clause that you should definitely try to avoid if you can; as you have read in other examples, the execution time only increases if you add this clause to your query. It's therefore always a good idea to consider whether you really need this DISTINCT operation to take place to get the results that you want.
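A small sketch of this idea, using Python's built-in sqlite3 module with an assumed in-memory table named cars (table and column names are illustrative, not from the report):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (name TEXT, year INTEGER, selling_price REAL, fuel TEXT)")
conn.executemany("INSERT INTO cars VALUES (?, ?, ?, ?)",
                 [("swift", 2014, 3.5, "Petrol"),
                  ("city", 2016, 6.8, "Diesel"),
                  ("swift", 2014, 3.5, "Petrol")])

# Anti-pattern: pull every column, then deduplicate rows that were never needed.
wasteful = conn.execute("SELECT DISTINCT * FROM cars").fetchall()

# Better: select only the columns that serve the query goal.
focused = conn.execute("SELECT name, selling_price FROM cars WHERE fuel = 'Petrol'").fetchall()

print(wasteful)
print(focused)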
The LIKE Operator
When you use the LIKE operator in a query, the index isn't used if the pattern starts with % or _: it will prevent the database from using an index (if one exists). Of course, from another point of view, you could also argue that this type of query potentially leaves the door open to retrieving too many rows.
Once again, your knowledge of the data that is stored in the database can help you to formulate a
pattern that will filter correctly through all the data to find only the rows that really matter for
your query.
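The sketch below contrasts the two pattern styles using sqlite3; the cars table, its columns and the plan output are assumptions for illustration, and the exact behaviour depends on the database engine.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (name TEXT COLLATE NOCASE, selling_price REAL)")
conn.execute("CREATE INDEX idx_cars_name ON cars(name)")

# A leading wildcard usually forces a full scan: the index on "name" cannot narrow the search.
leading_wildcard = "SELECT name, selling_price FROM cars WHERE name LIKE '%swift%'"

# A prefix pattern lets the engine use the index as a range scan.
prefix_pattern = "SELECT name, selling_price FROM cars WHERE name LIKE 'swift%'"

for query in (leading_wildcard, prefix_pattern):
    plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    print(query)
    print(plan)   # compare SCAN vs SEARCH ... USING INDEX in the two plans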
When you cannot avoid filtering down on your SELECT statement, you can consider limiting your results in other ways. Here's where approaches such as the LIMIT clause and data type conversions come in.
You should always use the most efficient, that is, the smallest, data types possible: there's always a risk when you provide a huge data type where a smaller one would be sufficient.
However, when you add data type conversions to your query, you only increase the execution time.
An alternative is to avoid data type conversions as much as possible. Note also that it's not always possible to remove or omit data type conversions from your queries, but you should definitely aim to be careful when including them, and when you do, you should test the effect of the conversion.
3. Don’t Make Queries More Complex Than They Need To Be
The data type conversions bring you to the next point: you should not over-engineer your queries. Try to keep them simple and efficient. This might seem too simple or even too obvious to be a tip, but you'll see in the examples mentioned in the next sections that you can easily start making simple queries more complex than they need to be.
The OR Operator
When you use the OR operator in your query, it’s likely that you’re not using an index.
Remember that an index is a data structure that improves the speed of the data retrieval in your
database table, but it comes at a cost: there will be additional writes, and additional storage space
is needed to maintain the index data structure. Indexes are used to quickly locate or look up data
without having to search every row in a database every time the database table is accessed.
If you don’t make use of the indexes that the database includes, your query will inevitably take
longer to run. That’s why it’s best to look for alternatives to using the OR operator in your query;
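The sketch below, built on an assumed cars table with sqlite3, rewrites an OR condition as an IN list, which is a common alternative (whether OR actually prevents index use depends on the engine and its statistics):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, fuel TEXT, selling_price REAL)")
conn.execute("CREATE INDEX idx_cars_fuel ON cars(fuel)")

# OR conditions on the same column can defeat a straightforward index lookup in some engines.
with_or = "SELECT id, selling_price FROM cars WHERE fuel = 'Petrol' OR fuel = 'Diesel'"

# The equivalent IN list (or a UNION of two indexed queries) is usually friendlier to the optimizer.
with_in = "SELECT id, selling_price FROM cars WHERE fuel IN ('Petrol', 'Diesel')"

for query in (with_or, with_in):
    print(query)
    print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())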
The first query uses the WHERE clause to restrict the number of rows that need to be summed, whereas the second query sums up all the rows in the table and then uses HAVING to throw away the sums it calculated. In these types of cases, the alternative with the WHERE clause is clearly the better one. You can see that this is not about limiting the result set, but instead about limiting the intermediate number of rows that have to be aggregated.
Note that the difference between these two clauses lies in the fact that the WHERE clause introduces a condition on individual rows, while the HAVING clause introduces a condition on aggregations, that is, on results of a selection where a single result, such as MIN, MAX or SUM, has been produced from multiple rows.
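Since the two queries referred to above are not reproduced in the text, here is a small illustrative pair using sqlite3 with an assumed sales table; both return the same result, but the WHERE version filters rows before aggregation:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("South", 100.0), ("South", 250.0), ("North", 80.0), ("East", 40.0)])

# Filtering individual rows first with WHERE limits what has to be aggregated.
where_first = """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE region = 'South'
    GROUP BY region
"""

# Aggregating every group and then discarding most of them with HAVING does more work.
having_later = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING region = 'South'
"""

print(conn.execute(where_first).fetchall())
print(conn.execute(having_later).fetchall())   # same result, more intermediate rows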
You see, evaluating the quality of queries, and writing and rewriting them, is not an easy job when you take into account that they need to be as performant as possible; avoiding anti-patterns and considering alternatives will also be part of your responsibility when you write queries that you want to perform well.
This list was just a small overview of some anti-patterns and tips that will hopefully help beginners; if you'd like to get an insight into what more senior developers consider the most common anti-patterns, there is plenty of further reading available.
What was implicit in the above anti-patterns is the fact that they actually boil down to the difference between the set-based and the procedural approach to querying.
The procedural approach to querying is an approach much like programming: you tell the system what to do and how to do it. An example of this is redundant conditions in joins, or cases where you abuse the HAVING clause, as in the above examples, in which you query the database by performing a function and then calling another function, or you use logic that contains loops, conditions, User Defined Functions (UDFs), cursors and so on to get the final result. In this approach, you'll often find yourself asking for a subset of the data, then requesting another subset of the data, and so on.
It’s no surprise that this approach is often called “step-by-step” or “row-by-row” querying.
The other approach is the set-based approach, where you just specify what to do. Your role consists of specifying the conditions or requirements for the result set that you want to obtain from the query. How your data is retrieved, you leave to the internal mechanisms that determine the implementation of the query: you let the database engine determine the best algorithms and processing logic to execute your query.
Since SQL is set-based, it's hardly a surprise that this approach will be quite a bit more effective than the procedural one, and it also explains why, in some cases, SQL can work faster than code. The set-based approach to querying is also the one that most top employers in the data science industry ask for.
You'll often need to switch between these two types of approaches. Note that if you ever find yourself with a procedural query, you should consider rewriting or refactoring it.
Knowing that anti-patterns aren’t static and evolve as you grow as an SQL developer and the fact
that there’s a lot to consider when you’re thinking about alternatives also means that avoiding
query anti-patterns and rewriting queries can be quite a difficult task. Any help can come in
17
handy, and that’s why a more structured approach to optimize your query with some tools might
be the way to go. Note also that some of the anti-patterns mentioned in the last section had roots
in performance concerns, such as the AND, OR and NOT operators and their lack of index
usage.
Thinking about performance doesn’t only require a more structured approach but also a more in-
depth one.
Be that as it may, this structured and in-depth approach will mostly be based on the query plan, which, as you remember, is the result of the query first being parsed into a "parse tree" and defines precisely what algorithm is used for each operation and how the execution of operations is coordinated.
Query Optimization
As you have read in the introduction, it could be that you need to examine and tune the plans that are produced by the optimizer manually. In such cases, you will need to analyze your query plans.
To get hold of such a plan, you will need to use the tools that your database management system provides.
Some tools that you might have at your disposal are the following:
• Some packages feature tools which will generate a graphical representation of a query plan.
• Other tools will be able to provide you with a textual description of the query plan. One example is the EXPLAIN PLAN statement in Oracle, but the name of the instruction varies according to the RDBMS that you're working with: elsewhere you might find EXPLAIN (MySQL, PostgreSQL) or EXPLAIN QUERY PLAN (SQLite).
Note that if you’re working with PostgreSQL, you make the difference between EXPLAIN,
where you just get a description that states the idea of how the planner intends to execute the
query without running it, while EXPLAIN ANALYZE actually executes the query and returns to
you an analysis of the expected versus actual query plans. Generally speaking, a real execution
plan is one where you actually run the query, whereas an estimated execution plan works out
what it would do without executing the query. Although logically equivalent, an actual execution
plan is much more useful as it contains additional details and statistics about what actually
In the remainder of this section, you’ll learn more about EXPLAIN and ANALYZE and how you
can use these two to learn more about your query plan and the possible performance of your
query.
To do this, you’ll get started with some examples in which you’ll work with two tables:
You can retrieve the current information of the table one_million with the help of EXPLAIN;
Make sure to put it right on top of your query, and when it’s run, it’ll give you back the query
plan:
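A minimal sketch of this step is given below; it uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in for PostgreSQL's EXPLAIN, and the one_million table and its columns are assumed for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE one_million (counter INTEGER, value REAL)")

query = "SELECT * FROM one_million WHERE counter = 42"

# Estimated plan: how the engine intends to run the query, without executing it.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)

# After adding an index, the plan changes from a full table scan to an index search.
conn.execute("CREATE INDEX idx_counter ON one_million(counter)")
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)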
Now that you have examined the query plan briefly, you can start digging deeper and think about performance in more formal terms with the help of computational complexity theory. This is an area of theoretical computer science that focuses on, among other things, classifying computational problems according to their difficulty; these computational problems can range from algorithms to the queries that you write.
For queries, however, you're not necessarily classifying them according to their difficulty, but rather according to the time it takes to run them and get some results back. This is explicitly referred to as time complexity, and to articulate or measure this type of complexity, you can use the big O notation.
With the big O notation, you express the runtime in terms of how quickly it grows relative to the
input, as the input gets arbitrarily large. The big O notation excludes coefficients and lower order
terms so that you can focus on the important part of your query’s running time: its rate of
growth. When expressed this way, dropping coefficients and lower order terms, the time
complexity is said to be described asymptotically. That means that the input size goes to infinity.
In database language, complexity measures how much longer a query takes to run as the size of the data grows. Note that the size of your database doesn't only increase as more data is stored in tables: the mere fact that indexes are present in the database also plays a role in its size.
As you have seen before, the execution plan defines, among other things, what algorithm is used for each operation, which means that every query's execution time can be logically expressed as a function of the size of the tables involved in the query plan, referred to as a complexity function. In other words, you can use the big O notation and your execution plan to estimate how a query's running time will grow.
In the following subsections, you'll get a general idea of the four types of time complexity, and you'll see some examples of how a query's time complexity can vary according to the context in which it is run. Note, though, that there are different types of indexes, different execution plans and different implementations for various databases to consider, so the time complexities listed below are generalizations.
An algorithm is said to run in constant time if it requires the same amount of time regardless of the input size. When you're talking about a query, it will run in constant time if it requires the same amount of time regardless of the size of the tables involved.
Creating a View
We will start by generating a simple chart. In this section, we will get to know our data and will begin to ask questions about the data to gain insights. There are some important terms we should know first:
Dimension
Measures
Aggregation
Dimensions are qualitative data, such as a name or date. By default, Tableau automatically classifies data that contains qualitative or categorical information as a dimension, for example, any field with text or date values. These fields generally appear as column headers for rows of data, such as Customer Name or Order Date, and also define the level of granularity shown in the view.
Measures are quantitative, numerical data. By default, Tableau treats any field containing this kind of data as a measure, for example, sales transactions or profit. Data that is classified as a measure can be aggregated based on a given dimension.
Aggregation is row-level data rolled up to a higher category, such as the sum of sales or total profit. Tableau automatically sorts the fields into Measures and Dimensions; however, for any field this classification can be changed if needed.
Steps
1. Go to the worksheet. Click on the tab Sheet 1 at the bottom left of the Tableau workspace.
2. Once you are in the worksheet, from Dimensions under the Data pane, drag Order Date to the Columns shelf. On dragging Order Date to the Columns shelf, a column for each year of orders in the dataset is created. An 'Abc' indicator is visible under each column, which implies that text or numeric data can be dragged here. On the other hand, if we pulled Sales here, a cross-tab would be created which would show the total sales for each year.
3. Similarly, from the Measures tab, drag the Sales field onto the Rows shelf. Tableau populates a chart with sales aggregated as a sum; total aggregated sales for each year by order date is displayed. Tableau always populates a line chart for a view that includes a time field, which in this example is Order Date.
Refining the View
Let us delve deeper and try to find more insights from the data. Let's start by adding the product categories to look at sales totals in a different way.
Steps
1. Drag Category to the Columns shelf and place it next to YEAR(Order Date); Category should be placed to the right of Year. In doing so, the view immediately changes from a line chart to a bar chart. The chart shows the overall sales for every product category by year.
Learn More: to view information about each data point (that is, mark) in the view, hover over one of the bars to reveal a tooltip. The tooltip displays the total sales for that category.
2. The view above nicely shows sales by category, i.e., furniture, office supplies, and technology.
We can also infer that furniture sales are growing faster than sales of office supplies except for
2016. Hence it will be wise to focus sales efforts on furniture instead of office supplies.
But furniture is a vast category and consists of many different items. How can we identify which
furniture item is contributing towards maximum sales? To help us answer that question, we
decide to look at products by Sub-category to see which items are the big sellers. Let's say for the
Furniture category; we want to look at details about only bookcases, chairs, furnishings, and
tables.
We will double-click or drag the Sub-Category dimension to the Columns shelf. Sub-Category is another discrete field; it further dissects Category and displays a bar for every sub-category, broken down by category and year. However, this is a humongous amount of data to make sense of visually. In the next section, we will learn about filters, colour and other ways to make the view more comprehensible.
Emphasizing the Results
In this section, we will try to focus on specific results. Filters and colours are ways to add more focus to the details.
Adding filters to the view: filters can be used to include or exclude values in the view. Here we add two simple filters to the worksheet to make it easier to look at product sales by sub-category.
Steps: In the Data pane, under Dimensions, right-click Order Date and select Show Filter. Repeat the same for Sub-Category. Filters are a type of card and can be moved around the worksheet by simple drag and drop.
Adding colours to the view: colours can be helpful in the visual identification of a pattern. Steps: In the Data pane, under Measures, drag Profit to Colour on the Marks card. It can be seen that Bookcases, Tables and even Machines contribute to negative profit, i.e., loss. A powerful insight.
CHAPTER 4
Let's take a closer look at the filters to find out more about the unprofitable products.
Steps
1. In the view, in the Sub-Category filter card, uncheck all boxes except Bookcases, Tables, and Machines. This brings to light an interesting fact: while in some years Bookcases and Machines are profitable, in other years they make a loss.
2. Select All in the Sub-Category filter card to show all the sub-categories again.
3. From the Dimensions, drag Region to the Rows shelf and place it to the left of the SUM(Sales) tab. We notice that Machines in the South are reporting a higher negative profit overall than in the other regions.
4. Let us now give a name to the sheet. At the bottom left of the workspace, double-click Sheet 1 and rename it Sales by Product and Region.
5. In order to preserve the view, Tableau allows us to duplicate our worksheet so that we can continue working without losing the original view.
6. In your workbook, right-click the Sales by Product and Region sheet, select Duplicate, and rename the new sheet.
7. In the new worksheet, from Dimensions, drag Region to the Filters shelf to add it as a filter in the view.
8. In the Filter Region dialogue box, clear all check boxes except South and then click OK. Now we can focus on sales and profit in the South. We find that machine sales had a negative profit in 2014 and again in 2016. We will investigate this in the next section.
9. Lastly, do not forget to save the results by selecting File > Save As and giving the workbook a suitable name.
This dataset consists of information about used cars listed on cardekho.com. It has 9 columns, and each column holds information about a specific feature: Car_Name gives information about the car's make; Year is the year in which the car was purchased new; Selling_Price is the price at which the car is being sold, and is the target label for the price prediction; Km_Driven is the number of kilometres the car has been driven; Fuel is the fuel type of the car (CNG, petrol, diesel, etc.); Seller_Type tells whether the seller is an individual or a dealer; Transmission gives information about whether the car is automatic or manual; and Owner is the number of previous owners of the car.
Step 1: Setting up a virtual environment. This should be the initial step when you are building an end-to-end project. We need a new virtual environment because each project requires a different set of dependencies, and we can install all the essential libraries into that environment. Follow these steps to do so.
It might be possible that we get an issue about the absence of Jupyter Notebook; in that case we have to install it inside the environment as well. Once our environment is created, we will do the complete project in this environment.
Step 2: Acquiring the dataset and importing all the essential libraries. I have taken the dataset from here; it is in CSV format. Now I will import all the essential libraries that will be needed for this project.
Here we have cars of 98 different companies, and the name of the company won't affect a car's price; the price depends on how many years the car has been used, the fuel type, and so on, so I will drop the name column. Numerical features do not pose any conversion issue, but if the datatype is categorical then we need to convert all those categorical features into numerical values.
If we observe the dtypes of the dataframe, we can see that there are some features with an object data type. In the next step I will create a cat_df and store all the categorical features in it.
Now we will make the categorical dataframe with all the features having categorical variables, and will drop all the categorical features from the original dataframe, as sketched below.
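A minimal sketch of this split, using an illustrative stand-in frame in place of the cardekho data loaded earlier:

import pandas as pd

# Illustrative frame standing in for the cardekho data.
df = pd.DataFrame({
    "year": [2014, 2016],
    "selling_price": [3.5, 6.8],
    "fuel": ["Petrol", "Diesel"],
    "seller_type": ["Dealer", "Individual"],
    "transmission": ["Manual", "Automatic"],
})

# Collect the object-dtype (categorical) columns into cat_df ...
cat_cols = df.select_dtypes(include="object").columns
cat_df = df[cat_cols].copy()

# ... and drop them from the original dataframe, leaving only numerical features.
df = df.drop(columns=cat_cols)
print(cat_df.columns.tolist(), df.columns.tolist())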
number of years the car has been used = current year - purchase year
So, as we are in 2020 and the car is from 2014, the number of years it has been used will be 2020 - 2014 = 6. To compute the number of years the car has been used, we first need to add a current_year column. Once the current_year column has been added, we add the number_of_years column and then drop the year and current_year columns, as sketched below.
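A short sketch of this feature engineering step (the column names year and number_of_years follow the description above):

import pandas as pd

df = pd.DataFrame({"year": [2014, 2017], "selling_price": [3.5, 6.8]})

# number of years the car has been used = current year - purchase year
df["current_year"] = 2020
df["number_of_years"] = df["current_year"] - df["year"]

# year and current_year are no longer needed once the derived feature exists.
df = df.drop(columns=["year", "current_year"])
print(df)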
On average, a car in the dataset has been driven 36,947 kilometres, and the maximum distance a car has travelled is 5,00,000 kilometres. The highest ex-showroom price present in the dataset is 92.6 lakh. The maximum number of years a car has been used before being put up for sale is 17 years. The maximum number of owners that have used a single car is 3. The maximum selling price for a used car is 35 lakh.
Step 4: Data Visualization
This is the most important step of the data science life cycle; here we understand the behaviour of the data. Let's understand it by doing. The more years you use your car, the lower the amount you will get for it. The less the car has been driven, the higher its price: as we see in the graph, at the maximum distance of 500,000 kilometres the car's price is near zero, or we can say nobody is willing to pay any amount for those cars.
Plotting a pair plot: we cannot visualize a multi-dimensional scatter plot, hence by using a pair plot we can visualize every pair of dimensions against each other (see the sketch below). As we can see, there is very little overlap in the dataset, so we cannot use KNN, linear regression or SVM, and because of the dynamic nature of the dataset we also cannot rely on a single decision tree, so we will go with random forest and XGBoost.
Univariate analysis: this is when the analysis involves a single variable at a time. Most cars have been sold within a price range of 1-10 lakh, while only a few have been sold in the 25-35 lakh range. Cars that have travelled less distance are in more demand, especially if the car has travelled within a range of 0-5,000 kilometres; people are more attracted towards them.
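A sketch of the pair plot mentioned above; the small stand-in frame and its values are illustrative only, whereas in the project the cleaned cardekho dataframe would be used.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Small stand-in frame with the kind of numerical columns discussed above.
df = pd.DataFrame({
    "selling_price": [3.5, 6.8, 1.2, 4.4],
    "km_driven": [45000, 30000, 90000, 52000],
    "number_of_years": [6, 4, 9, 5],
})

# Pair plot of every pair of numerical features.
sns.pairplot(df)
plt.show()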
C.D.F Plot
It shows what percentage of a variable's values are less than or equal to the corresponding x-axis value. Let's take the above example: what percentage of vehicles have a selling price of less than or equal to 15 lakh? As we can see, 94.7% of the cars on cardekho have a price of 15 lakh or less. So one thing is clear: if we want to purchase a used car in the price range of 20-25 lakh, we won't prefer to go to cardekho.com.
From the above graph we can also understand that vehicles whose original price lies within the range of 0-20 lakh recover approximately 50% of their value when their owners sell them.
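A small sketch of how such an empirical CDF can be computed and plotted; the price values used here are illustrative, not taken from the dataset.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative selling prices in lakh.
prices = np.array([1.2, 2.5, 3.5, 4.4, 6.8, 9.0, 12.0, 35.0])
sorted_prices = np.sort(prices)
cdf = np.arange(1, len(sorted_prices) + 1) / len(sorted_prices)

plt.step(sorted_prices, cdf, where="post")
plt.xlabel("Selling price (lakh)")
plt.ylabel("Fraction of cars with price <= x")
plt.show()

# Reading the curve at x = 15 gives the share of cars priced at 15 lakh or less.
print(np.mean(prices <= 15))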
As we have already made cat_df with all the categorical features, we will drop all the features containing categorical variables from the original dataframe, and at the end, after doing feature engineering on cat_df, we will concatenate cat_df with the original dataframe. By doing this we get back a single, fully numerical dataframe.
Now we will do feature engineering on cat_df to convert the categorical variables into numerical variables. But before that, we will check how many unique categorical values each feature contains.
Once this is done, we have converted all the features into numerical variables. Here we will also check the correlation between features. Feature selection is normally used when we have a large number of features in a dataset, but in this dataset we only check how the features are correlated with each other, as sketched below.
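A brief sketch of this correlation check; the stand-in frame imitates the fully numerical dataframe obtained after encoding cat_df.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for the fully numerical dataframe obtained after encoding.
df = pd.DataFrame({
    "selling_price": [3.5, 6.8, 1.2, 4.4],
    "km_driven": [45000, 30000, 90000, 52000],
    "number_of_years": [6, 4, 9, 5],
    "fuel_Petrol": [1, 0, 1, 1],
})

# Heatmap of pairwise correlations between features.
sns.heatmap(df.corr(), annot=True, cmap="RdYlGn")
plt.show()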
Step 6: Pre-Modeling Steps
Next, we check which model will be best for our dataset. From the pair plot it is clear that we have to take a model which can make predictions on non-linear data and on a combination of categorical and numerical features; such models are decision tree, random forest and XGBoost. We then check which model gives higher accuracy and select the model on that basis. We can choose our best-fit model using the cross-validation score.
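A sketch of such a comparison using cross_val_score; synthetic data stands in for the prepared car dataframe, ExtraTreesRegressor is included because Extra Tree Regression is mentioned in the abstract, and XGBoost could be added in the same way if the xgboost package is installed.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features X and the selling price y.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=100, random_state=0),
}

# Compare models by their cross-validated R^2 score and keep the best one.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, round(scores.mean(), 3))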
As we can see from the accuracy score results above, XGBoost gives better accuracy.
Feature importance: checking which features are the most important for the output feature out of all the input features.
The most important features include Transmission_Manual, Seller_Type_Individual, Fuel_Type_Petrol, Fuel_Type_Diesel and No_of_Years.
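A brief sketch of how these importances can be read off a fitted tree ensemble; the synthetic data and generic feature names stand in for the encoded car features listed above.

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in data with generic feature names.
X, y = make_regression(n_samples=300, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rank features by how much each one contributes to the trees' splits.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))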
Saving the model in serialized format: we save the model in a serialized format, and when we need to make a prediction we simply load the pickle file and predict using that serialized model; we do not need to build a new model again to make predictions on new test data.
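A short sketch of this save-and-reload step with pickle; the model, the synthetic data and the file name random_forest_model.pkl are illustrative.

import pickle
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Serialize the trained model once...
with open("random_forest_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and later load it back for prediction without retraining.
with open("random_forest_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:3]))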
Before understanding the working of the random forest, we must look into the ensemble technique. Ensemble simply means combining multiple models: thus a collection of models is used to make predictions rather than an individual model.
1. Bagging– It creates a different training subset from sample training data with replacement &
the final output is based on majority voting. For example, Random Forest.
2. Boosting– It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. For example, AdaBoost, XGBoost.
As mentioned earlier, random forest works on the bagging principle. Now let's dive in and understand bagging in detail.
Bagging
Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest. Bagging chooses a random sample from the data set; hence each model is generated from samples of the original data drawn with replacement, which is known as row sampling. This step of row sampling with replacement is called bootstrap. Now each model is trained independently, and each generates results. The final output is based on majority voting after combining the results of all models. This step, which involves combining all the results and generating the output based on majority voting, is known as aggregation.
Now let's look at an example by breaking it down with the help of the following figure. Here bootstrap samples are taken from the actual data (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) with replacement, which means there is a high possibility that each sample won't contain unique data. Now the models (Model 01, Model 02, and Model 03) obtained from these bootstrap samples are trained independently. Each model generates results as shown.
Now the happy emoji has a majority when compared to the sad emoji, so based on majority voting the final output is the happy emoji.
Steps involved in the random forest algorithm:
Step 1: In random forest, n random records are taken from a data set having k records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on majority voting for classification and averaging for regression.
For example: consider the fruit basket as the data as shown in the figure below. Now n number
of samples are taken from the fruit basket and an individual decision tree is constructed for each
sample. Each decision tree will generate an output as shown in the figure. The final output is
considered based on majority voting. In the below figure you can see that the majority decision
tree gives output as an apple when compared to a banana, so the final output is taken as an apple.
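A minimal sketch of these steps for regression (bootstrap row sampling, one decision tree per sample, and averaging as the aggregation step), using synthetic data as a stand-in:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(10):                              # 10 bootstrap samples / models
    idx = rng.integers(0, len(X), size=len(X))   # Step 1: row sampling with replacement
    tree = DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx])  # Step 2
    trees.append(tree)

# Steps 3-4: each tree predicts, and the predictions are averaged (aggregation).
predictions = np.mean([tree.predict(X[:5]) for tree in trees], axis=0)
print(predictions)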
Important Features of Random Forest
1. Diversity– Not all attributes/variables/features are considered while making an individual tree; each tree is different.
2. Immune to the curse of dimensionality– Since each tree does not consider all the features, the feature space is reduced.
3. Parallelization– Each tree is created independently out of different data and attributes. This means that we can make full use of the CPU to build random forests.
4. Train-test split– In a random forest we don't have to segregate the data for train and test, as there will always be around 30% of the data which is not seen by a given decision tree.
5. Stability– Stability arises because the result is based on majority voting/averaging.
Difference between a decision tree and a random forest: a random forest is a collection of decision trees; still, there are a lot of differences in their behaviour. Random forests are much more successful than single decision trees, but only if the individual trees are diverse and acceptable.
Important Hyperparameters
Hyperparameters are used in random forests either to enhance the performance and predictive power of the model or to make the model faster.
The following hyperparameters mainly affect predictive power:
1. n_estimators– the number of trees the algorithm builds before averaging the predictions.
2. min_samples_leaf– the minimum number of samples required to be at a leaf node.
The following hyperparameters mainly affect speed:
1. n_jobs– tells the engine how many processors it is allowed to use. If the value is 1, it can use only one processor; a value of -1 means there is no limit.
2. random_state– controls the randomness of the sample. The model will always produce the same results if it has a definite value of random_state and if it has been given the same hyperparameters and training data.
3. oob_score– OOB means "out of bag". It is a random forest cross-validation method in which roughly one-third of the sample is not used to train the data and is instead used to evaluate the model's performance.
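The sketch below sets these hyperparameters explicitly on a scikit-learn RandomForestRegressor; the synthetic data and the chosen values are illustrative.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

model = RandomForestRegressor(
    n_estimators=200,      # number of trees built before averaging the predictions
    min_samples_leaf=2,    # minimum number of samples allowed at a leaf node
    n_jobs=-1,             # -1 means no limit on the processors used
    random_state=42,       # makes the sampling, and therefore the results, reproducible
    oob_score=True,        # evaluate on the out-of-bag samples
)
model.fit(X, y)
print("OOB score:", round(model.oob_score_, 3))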
Popular features of used cars
When buying a used car, people pay serious attention to the odometer value of the car. We can see that the odometer reading changes the price of a car significantly. On the other hand, this does not mean that only low-odometer cars are sold: depending on the price, high-odometer cars also have buyers.
Machine Learning Model
This section uses applied machine learning models as a framework for the data analysis. The problem is a supervised learning task, which refers to fitting a model of a dependent variable to the independent variables, with the goal of accurately predicting the dependent variable for future observations.
Pre-processing the data
Label encoding. In the dataset, there are 13 predictors; 2 of them are numerical variables while the rest are categorical. In order to apply machine learning models, we need a numeric representation of the features. Therefore, all non-numeric features were transformed into numerical form, as sketched below.
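A minimal sketch of this transformation with scikit-learn's LabelEncoder; the small stand-in frame and its columns are illustrative.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frame with categorical predictors similar to those in the dataset.
df = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol"],
    "transmission": ["Manual", "Automatic", "Manual"],
    "price": [3.5, 6.8, 1.2],
})

# Give every non-numeric feature an integer representation.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)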
Train/test split. In this process, 20% of the data was split off as test data and 80% of the data was used for training.
Scaling the data. While exploring the data in the previous sections, it was seen that the data is not normally distributed. Without scaling, the machine learning models will tend to disregard the coefficients of features that have low values, because their impact will be very small compared to features with large values. While scaling, it is also important to scale with the correct method, because inappropriate scaling can mislead the models (E., & Marwala, 2012). The min-max scaler is appropriate especially when the data does not follow a normal distribution and we want outliers to have reduced influence. Besides, both ridge and lasso regression are affected by the scale of the features, as sketched below.
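A sketch of the split-and-scale step with scikit-learn; the synthetic data and the 80/20 proportions follow the description above.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded features X and the price y.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# 80% of the data for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Min-max scaling squeezes every feature into [0, 1]; the scaler is fitted on the
# training data only and then applied to the test data to avoid leakage.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(), X_train_scaled.max())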
Random Forest
Random forest is a set of multiple decision trees. Deep decision trees may suffer from overfitting, but random forest prevents overfitting by creating the trees on random subsets of the data and features. That is why it is a good fit for this problem.
CHAPTER 5
Conclusion
Due to the increased prices of new cars and the financial incapability of customers to buy them, used car sales are on a global increase. Therefore, there is an urgent need for a used car price prediction system which effectively determines the worthiness of a car using a variety of features. The proposed system helps to determine an accurate price for a used car. By building different models, the aim was to get different perspectives and eventually compare their performance. The purpose of this study was to predict the prices of used cars by using a dataset that has 13 predictors and 380,962 observations. With the help of data visualization and exploratory data analysis, the dataset was uncovered and the features were explored in depth. The relations between features were examined. At the last stage, predictive models were built and their results compared.
Future Scope
In the future, this machine learning model may be integrated with various websites that can provide real-time data for price prediction. We may also add a larger amount of historical car price data, which can help to improve the accuracy of the machine learning model. We can build an Android app as a user interface for interacting with users. For better performance, we plan to judiciously design deep learning network structures, use adaptive learning rates and train on clusters of data rather than the whole dataset.
References
1. Agencija za statistiku BiH. (n.d.). Retrieved from http://www.bhas.ba [accessed July 18, 2018].
2. Listiani, M. (2009). Support vector regression analysis for price prediction in a car leasing application. Master's thesis, Hamburg University of Technology.
3. Du, J., Xie, L., & Schroeder, S. (2009). Practice Prize Paper—PIN Optimal Distribution of Auction Vehicles System. Marketing Science, 28(4).
4. Ho, T. K. (1995, August). Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (Vol. 1, pp. 278-282). IEEE.
5. Auto pijaca BiH. (n.d.). Retrieved from https://www.autopijaca.ba [accessed August 10, 2018].
6. Gongqi, S., Yansong, W., & Qiang, Z. (2011, January). New Model for Residual Value Prediction of the Used Car Based on BP Neural Network and Nonlinear Curve Fit. In Third International Conference on Measuring Technology and Mechatronics Automation. IEEE.
7. Pudaruth, S. (2014). Predicting the price of used cars using machine learning techniques. International Journal of Information & Computation Technology, 4(7), 753-764.
8. Noor, K., & Jan, S. (2017). Vehicle Price Prediction System using Machine Learning Techniques. International Journal of Computer Applications, 167(9).
9. Auto pijaca BiH. (n.d.). Retrieved from https://www.autopijaca.ba [accessed August 10, 2018].
10. Weka 3 - Data Mining with Open Source Machine Learning Software in Java. (n.d.). Retrieved from https://www.cs.waikato.ac.nz/ml/weka/.
11. Ho, T. K. (1995, August). Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (Vol. 1, pp. 278-282). IEEE.
12. Russell, S. (2015). Artificial Intelligence: A Modern Approach (3rd edition). Pearson Education.
13. Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125-137.
14. Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821-837.