Report
The automobile industry is among the most profitable industries today. Rising incomes in both
the rural and urban sectors and the availability of easy finance are the main drivers of the
high-volume car segments. Competition is also heating up, with a host of new players and
global manufacturers entering the market. This analysis and visualization of an automobile
dataset will help existing and new car manufacturers in India understand customer
expectations and the current state of the thousands of vehicle variants on the market. The
Indian car business is shaped by the presence of many national and multinational
manufacturers from around the world, which are covered in the dataset. This project presents
visualizations at several levels using barplots, histograms, scatter plots, boxplots, violin
plots and so on, together with a data analysis of consumer automobiles, in order to understand
the buying and pricing behavior for vehicles currently on the market and to predict the prices
of future cars from their other attributes.
The objective of this project is to visualize the considered Indian automobile dataset and
draw insights from it through data analysis and machine learning algorithms in the R
programming language. The dataset describes Indian cars through features such as model,
manufacturer, year, transmission, engine and power. One insight that can be drawn from it is
the price of a specific car model, estimated from the other attributes of that model with a
machine learning algorithm such as linear regression. The objective also includes studying
the attributes of the dataset, finding the relationships (statistically, the correlations)
between them and visualizing the findings. These relationships provide useful input for
building a model that predicts the price of a vehicle from its other attributes. Such
analytics help owners set the selling price of a vehicle without relying on rough estimates,
which can underestimate the price and cost the owner part of the vehicle's value. This kind
of analytics therefore has a practical industry use case: it can support end products that
give consumers insight into the attributes of automobiles and show which segments of
automobiles are successful in the market.
Introduction
Background
The project performs visualization and data analysis on the automobile dataset in order to
determine the relationships between the different features of a vehicle. The visualization
starts with univariate analysis, which examines the data one attribute at a time, then moves
to bivariate analysis and finally to multivariate analysis, which deals with more than two
attributes at the same time. We use the Indian automobile dataset and analyze attributes such
as the engine capacity and power of the automobiles in the R programming language. One
insight that can be drawn from this dataset is the price of a specific car model, estimated
from its other attributes with machine learning algorithms such as linear regression or
polynomial regression. Finally, we build a machine learning model that predicts the price of
a vehicle from its other attributes.
Objective
The primary objective of this project is to visualize the considered Indian automobile
dataset and draw insights from it through data analysis and machine learning algorithms in
the R programming language. A second objective is to derive a prediction model that can
estimate the price of a car model from parameters such as manufacturer, year and horsepower,
using machine learning algorithms such as linear regression. The dataset describes Indian
cars through features such as model, manufacturer, year, transmission, engine and power. The
objective also includes studying these attributes, consolidating the relationships
(statistically, the correlations) between them and visualizing the findings. Some of the
features may be redundant and may not contribute much to the prediction model, so eliminating
such attributes is also considered.
Motivation
This project was chosen for its practical applications. Many people struggle to price a
vehicle when selling it online, so a model that predicts the price of a particular car model
is useful to an owner who wants to sell. In addition, given attributes such as manufacturer,
engine capacity and horsepower, the price of an upcoming car can be closely estimated before
its release. Prediction models of this kind can be offered on websites, either to estimate
the price of a vehicle before the manufacturer reveals it or to help users price the vehicles
they are selling.
Contributions of project
The data is taken from Kaggle, a website that hosts a variety of datasets from all over the
world. The dataset contains 5975 rows and 14 columns and covers more than 1200 car model
variants. The target is the price of a vehicle in rupees, to be predicted from 5 multivalued
discrete attributes (manufacturer, location, fuel type, transmission, ownership) and 6
continuous attributes (year, km driven, mileage, engine, seats, horsepower). Car prices range
from a few lakhs to a few crores. The dataset contains missing values, and some required
attributes, such as mileage, which can only be non-zero, were wrongly recorded as zero. Since
the rows with missing values amount to less than one percent of the data, those rows are
deleted, and zero values are imputed with the mode of the corresponding attribute.
Organization of project
The project is organized in two parts: visualization and data analysis. The visualization
part consists of univariate analysis, which examines the data one attribute at a time,
bivariate analysis, which uses two attributes, and multivariate analysis, which deals with
more than two attributes at the same time. The data analysis part deals with finding the
relationships between attributes and building a prediction model capable of predicting the
price of a vehicle from its other attributes.
Project Resource Requirements
Software Requirements
• R software environment for statistical computing
• RStudio IDE (Integrated Development Environment)
• ggplot2 data visualization package
• plotly visualization library
• dplyr data manipulation package
• mlbench package
• caret machine learning package
Hardware Requirements
• Intel-compatible chipset
• 1GB RAM
• 20GB of free disk space
• Windows or Linux or Mac
LITERATURE SURVEY
Background
This section surveys papers on linear and polynomial regression and analyzes the methodology
of each paper together with its advantages and drawbacks.
Literature Review
Each entry below lists: Authors, Method, Purpose, Advantages, Disadvantages.

Authors: Dacheng Tao
Method: Multivariate linear regression and its principal component regression deal well with data having low-dimensional vectors. When the dimension grows higher, it leads to the under-sample problem.
Purpose: To find two low-dimensional coefficients so that the principal components selection problem is avoided.
Advantages: Numerical experiments carefully discussed the influences of all the factors in the models and showed that MMR is more effective in case of multilinear regression problems.
Disadvantages: There are two types of error in PCR, which result from small sample size and noise, respectively.

Authors: Wolfgang Tysiak
Method: Regression Analysis of Intrinsic Linear Models with Automated Transformations of Monotone Predictors.
Purpose: To detect several monotone transformations by means of logistic functions in the predictors.
Advantages: The technique used to optimize the transformation and the OLS criteria in one step can also be applied to discover the structure of distributed lags within the data.
Disadvantages: Due to the numerical optimization, a relatively large sample size is needed.

Authors: Jinfang Sheng
Method: BTP prediction model based on regression analysis.
Purpose: The prediction values and correlated variables are sent back to the L1 system through the BTP control model to realize a closed-loop control system.
Advantages: The prediction result is calculated based on both the prediction value from the N-BTP-Prediction model and the compensation value from the linear regression model.
Disadvantages: Prediction accuracy is a little bit low.

Authors: H. Shakouri
Method: Fuzzy linear regression models with absolute errors and optimum uncertainty.
Purpose: To find the parameters of a linear fuzzy regression. It is designed and solved, by which a minimum degree of acceptable uncertainty.
Advantages: This approach is much more accurate compared to the other methods.
Disadvantages: This is the idea of reducing the distance between the output of the possibility model and the measured output, while trying to increase their conjunction.

Authors: A. Martinez-Coll
Method: Comparison of near infrared spectroscopy (NIRS) signal quantitation by multilinear regression.
Purpose: To compare signal quantitation by conventional multiple regression.
Advantages: The best multilinear regressions had significant shortcomings with regard to underestimating flow values below the mean.
Disadvantages: Other methods are more accurate than multilinear regression methods.

Authors: X. Feng
Method: Contact temperature prediction of high voltage switchgear based on multiple linear regression model.
Purpose: To analyse and process monitoring point data, the regression model of temperature is established by using the multiple linear regression method.
Advantages: Multiple linear regression has a high accuracy in the long-term prediction of temperature.
Disadvantages: Multiple linear regression uses big data.

Authors: S. Yamamoto
Method: Polynomial regression-based model-free predictive control for nonlinear systems.
Purpose: To predict the optimal control input, it utilizes a massive stored and observed input/output dataset.
Advantages: It shows a satisfactory control performance when large datasets are available around the reference trajectory.
Disadvantages: It is preferable to the previously maintained datasets.

Authors: Shunxin Wang
Method: Parametric modelling of the coupling channel of conducted interference based on multi-linear regression model.
Purpose: It is necessary to understand the characteristic of the coupling channel, so that the EMI of the EUT can be effectively analyzed and restrained.
Advantages: The modelling method provides an approach to analyse the internal interference coupling channel of the EUT that needs to pass the CE test standard.
Disadvantages: Parametric modelling method in frequency domain and its evaluation method based on the multi-linear regression model.

Authors: Ahmed Karama
Method: A Multi Linear Regression Approach for Handling Missing Values with Unknown Dependent Variable.
Purpose: To predict missing values for a data set with an unknown dependent variable.
Advantages: Splitting the data set into training and test sets, finding the dependent variable from the dataset while performing training set, then using the model to predict missing values in the test set.
Disadvantages: It outperforms BGA for all the missing value ratios.

Authors: M. C. Roziqin
Method: A comparison of Monte Carlo linear and dynamic polynomial regression in predicting dengue fever case.
Purpose: Comparing and calculating the deviation value of the predicted number of cases, as a result of the prediction, to the number of actual cases.
Advantages: Dynamic polynomial regression is able to predict very well as compared to the Monte Carlo linear regression method.
Disadvantages: The smallest MSE as compared to linear regression, exponential regression, and quadratic regression.

Authors: S. Edebalı
Method: Prediction of wastewater treatment plant performance using multilinear regression.
Purpose: To understand the effects of the tested parameters, a regression function was developed with the multilinear regression method.
Advantages: High correlation coefficient of determination and lowest mean squared error (MSE) value between the measured and predicted output variables.
Disadvantages: Coefficient of determination and mean squared error were not obtained with this model.

Authors: J. Wu
Method: Personalized Collaborative Filtering Recommendation Algorithm based on Linear Regression.
Purpose: To minimize the value of the linear regression cost function to obtain the item label.
Advantages: Minimizes the value of the cost function. The average deviation between the predicted score and the actual value is calculated.
Disadvantages: Prediction errors are present but smaller than other algorithms.

Authors: S. Yamamoto
Method: A model-free predictive control method based on polynomial regression.
Purpose: To propose a model-free predictive control method for nonlinear systems on the basis of polynomial regression.
Advantages: Estimating the coefficients of polynomial regression, an appropriate control input can be determined by containing the input/output data of the controlled system.
Disadvantages: Maintaining a rich dataset is important; that is, the dataset must contain input/output data that is near the desired output.

Authors: Renato Monteiro
Method: Dimension reduction and coefficient estimation in multivariate linear regression.
Purpose: To formulate the dimension reduction and coefficient estimation in the multivariate linear model.
Advantages: This method is extended to a non-parametric model for predicting multiple responses.
Disadvantages: It can be unstable because of the discrete nature of selecting the number of factors.

Authors: Anatolii V. Omelchenko
Method: Polynomial regression coefficients estimation in finite differences space.
Purpose: To define i-th order regression coefficient with respect to equidistantly spaced samples are finite differences of the same order.
Advantages: Allows us to create efficient methods to synthesize estimators for polynomial regression coefficients at presence of correlated disturbances.
Disadvantages: Roundoff errors on precision of regression coefficients calculation.
Summary
From the papers studied, we conclude that a machine learning model based on polynomial
regression would be the best predictive model for a dataset like this automobile data. The
root mean square error (RMSE) is used as the evaluation metric for quantifying the prediction
error. The goal is a prediction model that can estimate the price of a car model from
parameters such as manufacturer, year and horsepower. A linear relationship among the
attributes may not always suffice, so a polynomial model lets non-linearly related attributes
be captured, making the model more precise and reducing the RMSE.
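As a small illustration of the evaluation metric, the root mean square error (RMSE) is the square root of the mean of the squared differences between actual and predicted values. The sketch below assumes two numeric vectors of equal length; the helper name rmse and the sample prices are hypothetical and are not taken from the project code.

```r
# Root mean square error between actual and predicted values.
# `rmse` is a hypothetical helper name, not part of the project code.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Made-up prices (in lakhs) purely for illustration.
actual    <- c(5.0, 7.5, 12.0)
predicted <- c(5.0, 7.5, 14.0)

# The errors are 0, 0 and 2, so the result equals sqrt(4 / 3).
rmse(actual, predicted)
```

A lower value means the predictions sit closer to the actual prices, and the result is expressed in the same unit as the price itself.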
References
[1] W. Tysiak, "Regression Analysis of Intrinsic Linear Models with Automated
Transformations of Monotone Predictors," 2005 IEEE Intelligent Data Acquisition and
Advanced Computing Systems: Technology and Applications, Sofia, 2005, pp. 620-623, doi:
10.1109/IDAACS.2005.283059.
[2] H. Shakouri, G. R. Nadimi and F. Ghaderi, "Fuzzy linear regression models with absolute
errors and optimum uncertainty," 2007 IEEE International Conference on Industrial
Engineering and Engineering Management, Singapore, 2007, pp. 917-921, doi:
10.1109/IEEM.2007.4419325.
[3] A. E. Tümer and S. Edebalı, "Prediction of wastewater treatment plant performance using
multilinear regression and artificial neural networks," 2015 International Symposium on
Innovations in Intelligent SysTems and Applications (INISTA), Madrid, 2015, pp. 1-5, doi:
10.1109/INISTA.2015.7276742.
[5] B. Wang, Y. Fang, J. Sheng, W. Gui and Y. Sun, "BTP Prediction Model Based on ANN
and Regression Analysis," 2009 Second International Workshop on Knowledge Discovery
and Data Mining, Moscow, 2009, pp. 108-111, doi: 10.1109/WKDD.2009.179.
[7] S. Wang, F. Dai and T. Zheng, "Parametric modeling of the coupling channel of
conducted interference based on multi-linear regression model," 2016 IEEE MTT-S
International Conference on Numerical Electromagnetic and Multiphysics Modeling and
Optimization (NEMO), Beijing, 2016, pp. 1-2, doi: 10.1109/NEMO.2016.7561656.
[8] A. Karama, M. Farouk and A. Atiya, "A Multi Linear Regression Approach for Handling
Missing Values with Unknown Dependent Variable (MLRMUD)," 2018 14th International
Computer Engineering Conference (ICENCO), Cairo, Egypt, 2018, pp. 195-201, doi:
10.1109/ICENCO.2018.8636126.
[9] H. Li and S. Yamamoto, "Polynomial regression-based model-free predictive control for
nonlinear systems," 2016 55th Annual Conference of the Society of Instrument and Control
Engineers of Japan (SICE), Tsukuba, 2016, pp. 578-582, doi: 10.1109/SICE.2016.7749264.
[10] Yuan, M., Ekici, A., Lu, Z. and Monteiro, R. (2007), Dimension reduction and
coefficient estimation in multivariate linear regression. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 69: 329-346. doi:10.1111/j.1467-9868.2007.00591.x
[12] M. C. Roziqin, A. Basuki and T. Harsono, "A comparison of Montecarlo linear and
dynamic polynomial regression in predicting dengue fever case," 2016 International
Conference on Knowledge Creation and Intelligent Computing (KCIC), Manado, 2016, pp.
213-218, doi: 10.1109/KCIC.2016.7883649.
[13] H. Li and S. Yamamoto, "A model-free predictive control method based on polynomial
regression," 2016 SICE International Symposium on Control Systems (ISCS), Nagoya, 2016,
pp. 1-6, doi: 10.1109/SICEISCS.2016.7470167.
[14] X. Feng, Y. Zhou, T. Hua, Y. Zou and J. Xiao, "Contact temperature prediction of high
voltage switchgear based on multiple linear regression model," 2017 32nd Youth Academic
Annual Conference of Chinese Association of Automation (YAC), Hefei, 2017, pp. 277-280,
doi: 10.1109/YAC.2017.7967419.
Proposed Architecture
Methodology
The architecture of the project is divided into two parts: visualization and data analysis.
The visualization part deals with plotting the attributes, while the data analysis part deals
with finding the relationships between the attributes in the dataset.
First, the dataset is preprocessed: it is cleaned of missing and NaN values, and data
imputation also takes place in this step. The dataset contains missing values, and some
required attributes, such as mileage, which can only be non-zero, were wrongly recorded as
zero. Since the rows with missing values amount to less than one percent of the data, those
rows are deleted, and zero values are imputed with the mode of the corresponding attribute.
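The two cleaning steps described above could be sketched as follows. This is a minimal illustration on made-up data, not the project's actual code: the data frame, its column names and the helper mode_value are all assumptions, and base R has no built-in function for the statistical mode.

```r
# Hypothetical toy data standing in for the automobile dataset.
cars <- data.frame(
  Mileage = c(18.2, 0, 18.2, 21.1, NA, 15.0),
  Seats   = c(5, 5, 7, 5, 5, NA)
)

# Most frequent non-zero, non-missing value of a numeric vector.
# `mode_value` is an illustrative helper; ties resolve to the smallest value.
mode_value <- function(x) {
  x <- x[!is.na(x) & x != 0]
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Step 1: delete rows that contain any missing value.
cars <- na.omit(cars)

# Step 2: replace wrongly recorded zeros with the mode of the column.
cars$Mileage[cars$Mileage == 0] <- mode_value(cars$Mileage)

cars
```

Computing the mode after the missing rows are dropped keeps the imputed value consistent with the rows that actually remain in the data.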
The visualization part consists of univariate analysis, which examines the data one attribute
at a time, bivariate analysis, which uses two attributes, and multivariate analysis, which
deals with more than two attributes at the same time. The distributions of the attributes are
visualized using count plots, barplots, histograms and so on. The bivariate analysis uses
scatter plots, box plots, violin plots and so on. Similar plots are used in the multivariate
analysis, where the third and further dimensions are shown on a two-dimensional plot by
mapping them to the color or size of the plotted elements.
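The three levels of analysis could look roughly like this with ggplot2. The data frame and its column names are made up for illustration; only the idea of mapping extra attributes to color and size comes from the text above.

```r
library(ggplot2)

# Hypothetical toy data; column names are illustrative.
cars <- data.frame(
  Power     = c(74, 88, 126, 140, 67, 190),
  Price     = c(3.5, 5.2, 9.8, 12.5, 2.9, 27.0),
  Fuel_Type = c("Petrol", "Diesel", "Diesel", "Petrol", "Petrol", "Diesel"),
  Seats     = c(5, 5, 7, 5, 5, 7)
)

# Univariate: distribution of a single attribute.
p_uni <- ggplot(cars, aes(x = Price)) +
  geom_histogram(bins = 5)

# Bivariate: price distribution within each fuel type.
p_bi <- ggplot(cars, aes(x = Fuel_Type, y = Price)) +
  geom_boxplot()

# Multivariate: x and y carry two attributes, colour and size carry two more.
p_multi <- ggplot(cars, aes(x = Power, y = Price,
                            colour = Fuel_Type, size = Seats)) +
  geom_point()
```

Printing any of these objects, for example print(p_multi), draws the corresponding plot.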
The data analysis applies machine learning algorithms to the automobile dataset in order to
study the relationships between its attributes, consolidate the correlations between them and
visualize the findings. Some of the features may be redundant and may not contribute much to
the prediction model, so eliminating such attributes is also considered. These relationships
provide useful input for building a model that predicts the price of a vehicle from
parameters such as manufacturer, year and horsepower.
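One way to inspect these relationships is a correlation matrix, sketched below in base R. The data values, the column names and the 0.9 threshold for flagging redundant predictors are all assumptions chosen for illustration.

```r
# Hypothetical toy data; the column names are illustrative.
cars <- data.frame(
  Engine  = c(998, 1197, 1498, 1968, 2179, 2993),
  Power   = c(67, 82, 110, 140, 150, 255),
  Mileage = c(23.1, 20.4, 18.9, 17.0, 15.1, 12.5),
  Seats   = c(5, 5, 5, 5, 7, 5),
  Price   = c(3.2, 4.8, 8.5, 15.0, 14.2, 45.0)
)

# Pearson correlation between every pair of numeric attributes.
corr <- cor(cars)

# Correlation of each predictor with the target, strongest first.
price_corr <- corr[setdiff(colnames(corr), "Price"), "Price"]
price_corr[order(abs(price_corr), decreasing = TRUE)]

# Predictor pairs with |r| above the (assumed) threshold are candidates for
# elimination, since one of the two carries mostly redundant information.
predictors <- corr[rownames(corr) != "Price", colnames(corr) != "Price"]
high <- which(abs(predictors) > 0.9 & upper.tri(predictors), arr.ind = TRUE)
redundant <- data.frame(
  attr_1 = rownames(predictors)[high[, "row"]],
  attr_2 = colnames(predictors)[high[, "col"]],
  r      = predictors[high]
)
redundant
```

Correlation only captures linear association, so a low value here does not rule out a non-linear relationship with price.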
Implementation Details and User Manuals
Introduction
The data is taken from Kaggle, a website that hosts a variety of datasets from all over the
world. The dataset contains 5975 rows and 14 columns, covering cars and their variants. The
target is the price of a vehicle in rupees, to be predicted from 5 multivalued discrete
attributes (manufacturer, location, fuel type, transmission, ownership) and 6 continuous
attributes (year, km driven, mileage, engine, seats, horsepower).
Implementation Details
DATASET AND PACKAGES
We first import all the packages and libraries used for exploring the data. The data is then
loaded with the read.csv function, with the path to the dataset CSV file given as the
argument. Exploration and visualization are done with the ggplot2 and plotly packages in R.
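A minimal sketch of the loading step is shown below. The report does not give the real file path, so a small temporary CSV with made-up rows stands in for the dataset; the file contents and column names are assumptions.

```r
# The real dataset path is not given in the report, so a small temporary CSV
# with made-up rows stands in for it here.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Year,Mileage Km/L,Price",
  "Car A,2015,18.2,5.5",
  "Car B,2012,21.1,3.1"
), csv_path)

# read.csv() returns a data frame. By default it rewrites column names into
# syntactically valid R names, so "Mileage Km/L" becomes "Mileage.Km.L".
auto_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick look at column types and basic statistics.
str(auto_data)
summary(auto_data)
```

The renaming behavior explains why column names with spaces or slashes in the CSV header appear with dots once the data is in R.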
PREPROCESSING
The dataset contains missing values, and some required attributes, such as mileage, which can
only be non-zero, were wrongly recorded as zero. Since the rows with missing values amount to
less than one percent of the data, those rows are deleted, and zero values are imputed with
the mode of the corresponding attribute. In addition, the engine capacity and engine power
values had units such as cc and hp appended to them, which must be removed before the
attributes can be converted from text to numeric type. The attributes are then ordered
according to their datatype. This completes the basic preprocessing of the dataset.
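Removing the appended units could be done with a regular expression, as in the sketch below. The sample strings and the helper name strip_units are hypothetical; the real columns may contain other formats that need extra handling.

```r
# Hypothetical raw values with units appended, as described above.
raw <- data.frame(
  Engine = c("1197 cc", "1498 cc", "998 cc"),
  Power  = c("82 hp", "108.5 hp", "67 hp"),
  stringsAsFactors = FALSE
)

# Remove every character that is not a digit or a decimal point, then convert
# the remaining text to a number. `strip_units` is an illustrative helper.
strip_units <- function(x) as.numeric(gsub("[^0-9.]", "", x))

raw$Engine <- strip_units(raw$Engine)
raw$Power  <- strip_units(raw$Power)

str(raw)
```

Entries that contain no digits at all become NA under this approach, so they would surface in the missing-value check rather than pass through silently.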
VISUALIZATION
The visualization starts with univariate analysis, which examines the data one attribute at a
time, then moves to bivariate analysis and finally to multivariate analysis, which deals with
more than two attributes at the same time. The distributions of the attributes are visualized
using count plots, barplots, histograms and so on. Before the bivariate analysis, the values
of both dimensions are scaled so that the plots display properly. The bivariate analysis uses
scatter plots, box plots, violin plots and so on. Similar plots are used in the multivariate
analysis, where the third and further dimensions are shown on a two-dimensional plot by
mapping them to the color or size of the plotted elements. The data is then split into
training and test sets for model building, training and testing.
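A random train/test split can be sketched in base R as follows. The 80/20 ratio, the seed and the toy data are assumptions, since the report does not state the proportions used; the caret package also offers createDataPartition() for the same purpose.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hypothetical toy data with 10 rows.
cars <- data.frame(
  Power = seq(60, 150, by = 10),
  Price = seq(3, 12, by = 1)
)

# Randomly pick 80% of the row indices for training; the rest form the test set.
train_idx <- sample(nrow(cars), size = floor(0.8 * nrow(cars)))
train <- cars[train_idx, ]
test  <- cars[-train_idx, ]

nrow(train)
nrow(test)
```

Keeping the test rows out of model fitting is what makes the error measured on them an honest estimate of performance on unseen cars.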
FITTING DATA TO REGRESSION MODEL
Now that we know what our data looks like, we use machine learning models to predict the
price of a vehicle from the values of its other attributes. We use the caret package to
train, test and tune various regression models on our data and compare the results. Before
that, the categorical attributes and the number of levels in each are identified, since
categorical variables cannot be fed directly into the model. Instead, we create dummy
variables to represent each level of a categorical variable, similar to one-hot encoding. The
numerical values are then standardized to mean zero and variance one, which puts them on a
common scale (standardization does not confine them to the range zero to one). The scaled
data is used for training the machine learning model. The model used to predict the price is
a multiple regression model, chosen because it can handle more than one predictor attribute.
Multiple regression is trained on the dummy-encoded variables. Polynomial regression may also
need to be considered, since the relationship between the dependent and independent variables
is not always linear. The Price attribute is regressed on the remaining numerical and
categorical attributes to create the regression model.
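The encoding, scaling and fitting steps could be sketched as below. To keep the example self-contained, base R's lm() is used in place of caret's training functions, and the data, column names and the choice of a degree-2 term for Power are assumptions made for illustration.

```r
# Hypothetical toy data; column names and values are illustrative.
cars <- data.frame(
  Power        = c(67, 82, 110, 140, 150, 255, 88, 120),
  Year         = c(2012, 2014, 2016, 2017, 2015, 2018, 2013, 2016),
  Transmission = factor(c("Manual", "Manual", "Manual", "Automatic",
                          "Manual", "Automatic", "Manual", "Automatic")),
  Price        = c(3.2, 4.8, 8.5, 15.0, 14.2, 45.0, 4.1, 11.0)
)

# Dummy encoding: model.matrix() expands each factor into 0/1 indicator
# columns; the intercept column is dropped and one level acts as reference.
dummies <- model.matrix(~ Transmission, data = cars)[, -1, drop = FALSE]

# Standardize numeric predictors to mean 0 and standard deviation 1.
num_scaled <- scale(cars[, c("Power", "Year")])

model_data <- data.frame(num_scaled, dummies, Price = cars$Price)

# Multiple linear regression of Price on all remaining attributes.
fit_linear <- lm(Price ~ ., data = model_data)

# Polynomial variant: adds a squared term for Power.
fit_poly <- lm(Price ~ poly(Power, 2) + Year + TransmissionManual,
               data = model_data)

# In-sample RMSE of both fits; a held-out test set would be used in practice.
rmse_linear <- sqrt(mean(residuals(fit_linear)^2))
rmse_poly   <- sqrt(mean(residuals(fit_poly)^2))
c(linear = rmse_linear, polynomial = rmse_poly)
```

Because the polynomial model contains the linear one as a special case, its in-sample error cannot be higher; whether it generalizes better has to be judged on the test set.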
User Manual
The project is divided into two parts: visualization and data analysis. The visualization
part consists of univariate analysis, which examines the data one attribute at a time,
bivariate analysis, which uses two attributes, and multivariate analysis, which deals with
more than two attributes at the same time. The data analysis part deals with finding the
relationships between attributes and building a prediction model capable of predicting the
price of a vehicle from its other attributes.
Experiment Results and Analysis
Introduction
The attributes of the dataset have been visualized, highlighting the relationships between
them. After the model is fitted to the data, prices can be predicted, and regression plots
are drawn to show how strongly each attribute is correlated with the dependent variable,
price.
Results
UNIVARIATE ANALYSIS
HISTOGRAMS
Fig 1.1 to Fig 1.6
BOXPLOTS
Fig 2.1 to Fig 2.6
BOXPLOTS
Fig 2.9 to Fig 2.12
VIOLINPLOTS
Fig 2.13, Fig 2.14
MULTIVARIATE ANALYSIS
BARPLOTS
Fig 3.1 to Fig 3.3
SCATTERPLOTS
Fig 3.4 to Fig 3.15
BOXPLOTS
Fig 3.16, Fig 3.19
VIOLINPLOTS
Fig 3.20, Fig 3.21
Fig 4.1
REGRESSION PLOTS
Fig 4.2 to Fig 4.8
CONCLUSION AND FUTURE WORK
CONCLUSION
We have visualized the considered Indian automobile dataset and derived insights from it
through data analysis and machine learning algorithms in the R programming language. We
performed univariate analysis, which examines the data one attribute at a time, bivariate
analysis, which uses two attributes, and multivariate analysis, which deals with more than
two attributes at the same time, using barplots, histograms, scatter plots, boxplots and
violin plots. The relationships found between the attributes of a vehicle provided useful
input for building a model that predicts the price of a vehicle from its other attributes. We
derived one polynomial regression model and studied its results, outcomes and
interpretations, along with the methods for evaluating such models. From the data analysis,
we found that Engine, Power and Mileage are the attributes that affect the price of a car the
most, while the remaining attributes have a smaller impact. We therefore conclude that price
is heavily correlated with the engine, power and mileage attributes of the dataset.
FUTURE WORK
In a future extension of this project, more data related to this dataset can be collected to
add features for prediction and for finding correlations between the variables that affect
the price of a vehicle. More advanced machine learning models can be used to reduce the error
of the current model. The hyperparameters can also be tuned during training to decrease the
RMSE, bringing the predictions closer to the actual values and increasing the precision of
the model. If many more attributes are added to the dataset, a deep learning approach with an
artificial neural network (ANN) could be taken, which may predict automobile prices more
accurately.
APPENDIX
VISUALIZATION CODE:
UNIVARIATE ANALYSIS
library(plotly)
summary(auto_data)
BIVARIATE ANALYSIS
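library(dplyr)   # provides %>% and filter()
library(plotly)
# Cleaning data: keep rows with a positive seat count and mileage, and drop
# cars priced at 70 or above (Price appears to be recorded in lakhs)
new_df <- auto_data %>%
  filter(Seats > 0, Mileage.Km.L > 0, Price < 70)
# Keep columns 3 to 14, then drop the second of the remaining columns
new_df <- new_df[3:14]
new_df <- new_df[, -2]
# Convert Price from lakhs to rupees
new_df$Price <- new_df$Price * 100000
# Checking that no missing values remain in the cleaned data
sum(is.na(new_df)) == 0
auto_data <- new_df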
REGRESSION PLOT
library(caret)
library(plotly)
library(heatmaply)