
Car Popularity Prediction: A Machine Learning Approach
Sunakshi Mamgain, Srikant Kumar, Kabita Manjari Nayak
Department of Computer Science
M.Tech, I.I.I.T.-Bhubaneswar
Bhubaneswar, India

Swati Vipsita
Department of Computer Science
I.I.I.T.-Bhubaneswar
Bhubaneswar, India

Abstract— Technology today points toward a future in which machines react and think as humans do. Artificial Intelligence, Machine Learning, Knowledge Engineering and Deep Learning play an essential role in this development. In this paper, we identify a real-world problem, predicting the popularity of a car company's cars, as a regression or classification problem and solve it using machine learning approaches.

Keywords—Machine Learning, Regression, Classification, Supervised Machine Learning, Logistic Regression, KNN, Random Forest.

I. INTRODUCTION

In the era in which we live, technology has a big impact on our lives. Artificial intelligence [6], knowledge engineering, machine learning, deep learning [4][5] and natural language processing [7][8] are emerging technologies that play an important role in today's leading projects. Artificial intelligence is the field that aims to create machines that work intelligently and react as humans do.

Machine learning is an essential, core part of artificial intelligence, providing the ability to learn and improve automatically. Its focus is on creating programs that can take in data and learn from it by themselves. Earlier, statisticians and developers worked together to predict the success, failure, future, etc. of a product. This process delayed product development and launch. Maintaining such a product as technology and data change is also a major challenge.

Machine learning made this process easier and faster. Machine learning algorithms are broadly categorized into four paradigms:
• Supervised learning [7] [9] [10]: The algorithm analyzes a known training dataset and produces a function that predicts output values. What it has learned from past labeled data can then be applied to new data to predict future events.
• Unsupervised learning: The algorithm is used when the training data is neither classified nor labeled. It studies how a system can infer a function that describes hidden structure in unlabeled data. Clustering is one approach to unsupervised learning.
• Semi-supervised learning [6] [11]: It combines characteristics of both unsupervised and supervised learning. These algorithms use a small amount of labeled data and a large amount of unlabeled data.
• Reinforcement learning [12]: The algorithm interacts with its environment by taking actions and discovering errors. It allows machines and software agents to determine the ideal behavior in a specific context so that performance is maximized.

Regression and classification are the two types of problem in supervised learning. In classification, a conclusion is drawn from observed values: a mapping function f on input variables x is used to approximate a discrete output variable y. The output of classification is generally discrete, but it can also be continuous, in the form of a probability for each class label. A regression problem has a real or continuous value as its output variable: a mapping function f on input variables x is used to approximate a continuous output variable y. The output of regression is generally continuous, but it can also be discrete, in the form of an integer for a class label. A problem with many output variables is referred to as a multivariate regression problem.

In this paper we focus on a problem taken from HackerRank, in which a company is trying to launch a new car modified on the basis of the popular features of its existing cars. The popularity will be predicted using a machine learning approach. The problem can be classified as a regression problem, specifically a multivariate regression problem, and it falls under supervised learning. Various supervised learning algorithms will therefore be used for this prediction.

II. RELATED WORKS

In the paper "Predicting stock movement direction with machine learning: An extensive study on S&P 500 stocks" [1], the authors reviewed classification algorithms such as random forest, gradient boosted trees, artificial neural network and logistic regression to predict

978-1-5386-5257-2/18/$31.00 ©2018 IEEE


463 stocks of the S&P 500. To study the predictability of these stocks, the authors performed multiple experiments with these classification algorithms. The results of predicting future prices from past data fell short of what the authors expected. However, they showed a large gain in predictability associated with European and Asian indexes that had closed shortly beforehand.

In the paper "Performance evaluation of predictive models for missing data imputation in weather data" [2], the authors suggested a new approach to managing missing data in weather data. They performed various tests with the NCDC dataset to assess the prediction error of five methods: linear regression, SVM, random forest, KNN and kernel ridge. To handle the missing values in the dataset they took two actions: (1) removing the entire row that contains a missing value, and (2) imputing the missing data. They applied both methods and compared the observed results.

In the paper "Amazon EC2 Spot Price Prediction using Regression Random Forests" [3], the authors proposed a Regression Random Forests (RRFs) model to forecast the Amazon EC2 spot price one week ahead and one month ahead. This prediction model helps in planning when to acquire a spot instance. It also predicts the execution cost and suggests when the user should bid in order to minimize that cost.

III. ALGORITHMS

A. KNN (K-Nearest Neighbor) [13]:
A distinctive property of KNN is that it has no explicit training phase, or that its training phase is minimal and fast. KNN does not use the training data for generalization; instead, all of the training data is needed in the testing phase. KNN is therefore often referred to as a lazy algorithm.

Process of KNN
Assuming:
1. The data set is a matrix of dimension $N \times P$.
2. It contains $P$ scenarios $s^1, \ldots, s^P$.
3. Each scenario contains $N$ features: $s^i = \{s^i_1, \ldots, s^i_N\}$.
4. $o$ is the vector of output values, one for each scenario: $o = \{o^1, \ldots, o^P\}$.

Steps:
1. The output values of the $X$ nearest neighbors of the query scenario $q$ are stored in a vector $r = \{r^1, \ldots, r^X\}$ by repeating the following steps $X$ times in a loop:
a. Take the next scenario $s^i$ from the data set, where $i$ denotes the current iteration in the domain $\{1, \ldots, P\}$.
b. If $d(q, s^i) < t$ or $t$ is not set, then $t \leftarrow d(q, s^i)$ and $f \leftarrow o^i$, where $d$ is the distance between two scenarios.
c. Repeat until all entries in the data set have been visited.
d. Store $t$ in vector $c$ and $f$ in vector $r$.
2. The arithmetic mean of $r$ is calculated:

$\bar{r} = \frac{1}{X} \sum_{i=1}^{X} r^i$    (1)

3. Return $\bar{r}$ as the output value for the query scenario $q$.

B. Logistic Regression [14]:
Logistic regression is a predictive analysis that is appropriate when the dependent variable is binary. It is used to describe data and to explain the relation between one dependent variable and one or more independent variables. It is a statistical method in which the dependent variable is binary, with the presence of the characteristic of interest coded as 1. The aim is to find the best-fitting model that describes the relation between the binary characteristic of interest and the independent variables. Logistic regression generates the coefficients of a formula that predicts a logit transformation:

$\operatorname{logit}(p) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$    (2)

Here $p$ denotes the probability that the characteristic of interest is present. The logit transformation is defined as the logged odds:

$\text{odds} = \frac{p}{1 - p}$    (3)

and

$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$    (4)

In logistic regression, estimation is made by choosing the parameters that maximize the likelihood of observing the sample values.

C. Random Forest [15] [16]:
Random forest is a supervised classification algorithm that creates a forest of decision trees and introduces randomness into how they are built. A larger number of trees generally gives more accurate results. Random forest is used for both classification and regression tasks. The random forest classifier can handle missing values and can be modeled for categorical values.

Random forest works in two stages:
1. The first stage creates the random forest.
a. Select K features randomly from the m features.
b. Among the K features, calculate the node d using the best split point.
c. Split the node into daughter nodes.
d. Repeat steps a to c until the target number of nodes has been reached.
e. Repeat steps a to d n times to create n trees. Thus a forest is built.
2. The next stage makes a prediction using the random forest classifier:
a. Using the test features and the rules of each randomly created decision tree, an outcome is predicted and stored.
b. Votes are calculated for each predicted target.
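The KNN regression procedure of Section III.A can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `knn_regress`, the Euclidean distance, and the sample cars are all assumptions, and sorting by distance replaces the repeated scan over the data set while selecting the same X neighbors.

```python
import math


def knn_regress(scenarios, outputs, query, x):
    """Return the mean output of the x scenarios nearest to `query`."""
    if not 1 <= x <= len(scenarios):
        raise ValueError("x must be between 1 and the number of scenarios")
    # Pair each scenario's distance to the query with its output value.
    # Euclidean distance is an assumption; the text does not name a metric.
    distances = [(math.dist(s, query), o) for s, o in zip(scenarios, outputs)]
    # Keep the outputs of the x closest scenarios (the vector r in the text).
    nearest = sorted(distances, key=lambda pair: pair[0])[:x]
    r = [o for _, o in nearest]
    # Arithmetic mean of r, as in equation (1).
    return sum(r) / len(r)


# Hypothetical cars: (buying_price, maintenance_cost, safety_rating).
cars = [(1, 1, 3), (1, 2, 3), (4, 4, 1), (3, 4, 1)]
popularity = [4, 3, 1, 1]

prediction = knn_regress(cars, popularity, query=(1, 1, 2), x=2)
print(prediction)
```

Because every distance is computed at query time, the sketch also shows why KNN is called lazy: nothing is learned before a prediction is requested.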

2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA)
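The logit transformation used by logistic regression (Section III.B) is the natural logarithm of the odds p / (1 - p), and a fitted model sets it equal to a linear combination of the inputs. The sketch below checks that relationship numerically. The coefficients `b0` and `b1` and the input `x1` are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def odds(p):
    """Odds of the characteristic being present, p / (1 - p)."""
    return p / (1.0 - p)


def logit(p):
    """Logged odds of probability p."""
    return math.log(odds(p))


def inverse_logit(z):
    """Map a linear predictor z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and input for a one-variable model.
b0, b1 = -1.0, 0.5
x1 = 4.0

z = b0 + b1 * x1          # linear predictor, i.e. logit(p)
p = inverse_logit(z)      # probability implied by the model
print(round(p, 3), round(logit(p), 3))
```

Applying `logit` to the recovered probability returns the linear predictor, which is the sense in which the model "predicts a logit transformation".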
The highest-voted predicted target is taken as the final prediction.

D. Support Vector Machine [17] [18]:
Support Vector Machine (SVM) is also a supervised machine learning algorithm, used mostly for classification problems and also for regression problems. The objective of SVM is to find the optimal separating hyperplane, the one that maximizes the margin of the training data while classifying it correctly. Such a hyperplane generalizes better to unseen data.

IV. EXPERIMENT DETAILS AND NUMERICAL SIMULATIONS

Two data sets are available as .csv (comma-separated values) files:
1. Train.csv:
This file is the training dataset. Each row provides information on one car, with values such as buying_price, maintenance_cost, number_of_doors, number_of_seats, etc. Some of the attributes are explained as follows:
a. buying_price: The buying_price attribute describes the buying price of the car. It ranges over [1…4], where 1 represents the lowest price and 4 the highest price.
b. maintenance_cost: The maintenance_cost attribute describes the maintenance cost of the car. It ranges over [1…4], where 1 represents the lowest maintenance cost and 4 the highest maintenance cost.
c. number_of_doors: The number_of_doors attribute describes the number of doors in the car. Its values range over [2...5], where each value represents the number of doors in the car.
d. number_of_seats: The number_of_seats attribute describes the number of seats in the car. Its values are [2, 4, 5], where each value represents the number of seats in the car.
e. luggage_boot_size: The luggage_boot_size attribute denotes the luggage boot size. Its values range over [1..3], where 1 is the smallest and 3 the largest luggage boot size.
f. safety_rating: The safety_rating attribute describes the safety rating of the car. Its values range over [1...3], where 1 represents low safety and 3 high safety.
g. popularity: The popularity attribute describes the popularity of the car. Its values range over [1…4], where 1 represents an unacceptable car, 2 an acceptable car, 3 a good car, and 4 the best car.

We performed the experiment in the Python programming language, using the pandas, numpy, matplotlib, seaborn and sklearn libraries. A snippet of the training data is shown in Fig 1, and the schema of the training data is shown in Fig 2. A brief description of the training data is shown in Fig 3.

Fig 1: Training Data

Fig 2: Training Data Schema

Fig 3: Training Data Description
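As a rough illustration of the kind of inspection summarized in Figs 1 to 3, the sketch below parses a few rows in the Train.csv column layout and reports the range of each attribute. The paper used pandas for this step; the sketch swaps in Python's standard `csv` module to stay dependency-free. The column names come from the attribute list above, while the three sample rows and the helper names `load_rows` and `describe` are hypothetical.

```python
import csv
import io
from statistics import mean

# Hypothetical rows in the Train.csv layout; not taken from the real file.
SAMPLE = """buying_price,maintenance_cost,number_of_doors,number_of_seats,luggage_boot_size,safety_rating,popularity
3,1,2,4,1,3,2
1,2,4,5,3,3,4
4,4,3,2,2,1,1
"""


def load_rows(handle):
    """Read CSV rows and convert every attribute to an integer."""
    reader = csv.DictReader(handle)
    return [{name: int(value) for name, value in row.items()} for row in reader]


def describe(rows):
    """Return the min, max and mean of each column."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        summary[column] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
    return summary


rows = load_rows(io.StringIO(SAMPLE))
summary = describe(rows)
for column, stats in summary.items():
    print(column, stats)
```

To run it against the real data, one would pass `open("Train.csv", newline="")` to `load_rows` in place of the in-memory sample.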

Training data visualization:

Fig 4 is a bar chart of the popularity parameter, where the x axis represents popularity on a scale of 1 to 4 and the y axis represents the total count of cars at each popularity value. Fig 5 is a hexplot of the popularity parameter, where the x axis represents safety_rating on a scale of 1 to 3 and the y axis represents popularity on a scale of 1 to 4. Fig 6 is a stacked plot of the popularity parameter, where the x axis represents buying_price and maintenance_cost on a scale of 1 to 4 and the y axis represents safety_rating and popularity on a scale of 0 to 3.5.

2. Test.csv:
This is the test dataset of cars. It has the attributes described above, excluding popularity. The goal is to predict the popularity of the cars in the test dataset from their remaining attributes.

Fig 4: Bar Chart representation of Popularity parameter

Fig 5: Hexplot of popularity

Fig 6: Stacked Plot of parameter popularity
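The bar chart and the hexplot are both views of simple counts, which can be computed without any plotting library. The sketch below tallies those counts for a handful of hypothetical (safety_rating, popularity) pairs; the data is invented for illustration and the plotting itself is left out.

```python
from collections import Counter

# Hypothetical (safety_rating, popularity) pairs; not taken from Train.csv.
records = [(1, 1), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (3, 4)]

# Bar-chart data: number of cars at each popularity level (1 to 4).
popularity_counts = Counter(popularity for _, popularity in records)

# Hexplot-style data: number of cars in each (safety_rating, popularity) cell.
cell_counts = Counter(records)

for level in range(1, 5):
    print(f"popularity {level}: {popularity_counts[level]} cars")
```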

V. RESULT AND DISCUSSION

After executing a machine learning algorithm, the next step is to determine the effectiveness of the model using performance metrics. Different performance metrics are used for different machine learning algorithms. For classification, we use metrics [19] such as accuracy, cross validation, precision, recall and F1 score. If the algorithm is used for prediction (for example stock price prediction, housing prediction and, as in our case, car popularity prediction), we use Root Mean Square Error (RMSE) [20] and Mean Square Error (MSE) [20]. Because the output data is absent, we are unable to measure the performance of the machine learning algorithms we applied to this problem. However, we stored the predicted output values produced by the algorithms implemented in this paper in a .csv file. The accuracy we calculated for the machine learning models we implemented is shown in Table 1.

VI. CONCLUSION AND FUTURE WORK

Machine learning is a fast-growing approach to solving real-world problems. This paper focused on supervised learning algorithms, namely Logistic Regression, KNN, SVM and Random Forest, for predicting popularity on a scale of [1…4] for a car company. Table 1 shows that SVM gives the best result. For future work, our focus will therefore be on modifying the SVM model used and trying to make its predictions more accurate. We will also focus on implementing the problem using deep learning and neural network algorithms, as they provide more generalization of problems.

REFERENCES
[1] Jiao, Yang, and Jérémie Jakubowicz. "Predicting stock movement direction with machine learning: An extensive study on S&P 500 stocks." Big Data (Big Data), 2017 IEEE International Conference on. IEEE, 2017.
[2] Gad, Ibrahim, and B. R. Manjunatha. "Performance evaluation of predictive models for missing data imputation in weather data." Advances in Computing, Communications and Informatics (ICACCI), 2017 International Conference on. IEEE, 2017.
[3] Khandelwal, Veena, Anand Chaturvedi, and Chandra Prakash Gupta. "Amazon EC2 Spot Price Prediction using Regression Random Forests." IEEE Transactions on Cloud Computing, 2017.
[4] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436.
[5] Le, Quoc V., Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng. "On optimization methods for deep learning." In Proceedings of the 28th International Conference on Machine Learning, pp. 265-272. Omnipress, 2011.
[6] Zhu, Xiaojin. "Semi-supervised learning literature survey." (2005).
[7] Olsson, Fredrik. "A literature survey of active machine learning in the context of natural language processing." (2009).
[8] Cambria, Erik, and Bebo White. "Jumping NLP curves: A review of natural language processing research." IEEE Computational Intelligence Magazine 9.2 (2014): 48-57.
[9] Kotsiantis, Sotiris B., I. Zaharakis, and P. Pintelas. "Supervised machine learning: A review of classification techniques." Emerging Artificial Intelligence Applications in Computer Engineering 160 (2007): 3-24.
[10] Khan, A., Baharudin, B., Lee, L.H. and Khan, K., 2010. "A review of machine learning algorithms for text-documents classification." Journal of Advances in Information Technology, 1(1), pp. 4-20.
[11] Jiang, J. "A literature survey on domain adaptation of statistical classifiers." URL: http://sifaka.cs.uiuc.edu/jiang4/domainadaptation/survey. 2008 Mar 6;3.
[12] Kaelbling, L.P., Littman, M.L. and Moore, A.W., 1996. "Reinforcement learning: A survey." Journal of Artificial Intelligence Research, 4, pp. 237-285.
[13] Ban, Tao, Ruibin Zhang, Shaoning Pang, Abdolhossein Sarrafzadeh, and Daisuke Inoue. "Referential kNN regression for financial time series forecasting." In International Conference on Neural Information Processing, pp. 601-608. Springer, Berlin, Heidelberg, 2013.
[14] Dutta, A., Bandopadhyay, G. and Sengupta, S., 2015. "Prediction of stock performance in Indian stock market using logistic regression." International Journal of Business and Information, 7(1).
[15] Liaw, A. and Wiener, M. "Classification and regression by randomForest." R News (2002), 2(3), pp. 18-22.
[16] Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P. and Feuston, B.P. "Random forest: a classification and regression tool for compound classification and QSAR modeling." Journal of Chemical Information and Computer Sciences (2003), 43(6), pp. 1947-1958.
[17] Smola, A.J. and Schölkopf, B. "A tutorial on support vector regression." Statistics and Computing (2004), 14(3), pp. 199-222.
[18] Gunn, S.R. "Support vector machines for classification and regression." ISIS technical report (1998), 14(1), pp. 5-16.
[19] Williams, N., Zander, S. and Armitage, G. "A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification." ACM SIGCOMM Computer Communication Review (2006), 36(5), pp. 5-16.
[20] Willmott, C.J. and Matsuura, K. "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance." Climate Research (2005), 30(1), pp. 79-82.

TABLE I. TRAINING AND TESTING ACCURACY OF MODELS

Model                 Training Accuracy   Test Accuracy
KNN                   0.97                0.94
Logistic Regression   0.83                0.99
Random Forest         0.86                0.98
SVM                   0.97                0.99
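The metrics named in Section V can be computed directly from a list of actual values and a list of predicted values. The sketch below shows accuracy, MSE and RMSE on a small hypothetical example. The label lists are invented for illustration and are unrelated to the figures in Table I.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual label exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean square error."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical popularity labels on the [1..4] scale.
actual = [1, 2, 3, 4, 1]
predicted = [1, 2, 4, 4, 3]

print(accuracy(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

Accuracy treats every miss alike, whereas MSE and RMSE penalize a prediction of 3 for an actual 1 more heavily than a prediction of 4 for an actual 3, which matters when popularity is read as an ordered scale.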
