National University of Sciences and Technology
School of Mechanical and Manufacturing Engineering
Robotics and Intelligent Machines Engineering
Project Report: Car Price Prediction
Intent:
Train a machine learning algorithm to predict the car price based on the selected features.
Method:
The machine learning method used for this project, Car Price Prediction, is Multiple Linear
Regression with Gradient Descent.
Process:
1. Packages:
• Numpy
• Pandas
• [Link]
2. Data:
The dataset was collected from PAKWHEELS, processed, and saved in CSV (comma-delimited) format.
The data is loaded with the Pandas read_csv() function and then separated into features and label,
stored as ‘x_train’ and ‘y_train’, respectively.
The data() function returns ‘x_train’ and ‘y_train’ to the main() function for further use.
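A minimal sketch of what such a data() function might look like is given below; the CSV file name
and the ‘Price’ label column are illustrative assumptions, not taken from the original code.

    import pandas as pd

    def data(path="pakwheels_cars.csv"):
        # Load the processed PAKWHEELS dataset from CSV (file name assumed).
        df = pd.read_csv(path)
        # Every column except the price is treated as a feature;
        # the label column name 'Price' is an assumption.
        x_train = df.drop(columns=["Price"]).values
        y_train = df["Price"].values
        return x_train, y_train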
3. Normalization:
The feature set is then normalized using the formula
X = (X − mean) / (standard deviation)
The normal() function returns ‘x_train’, ‘data_mean’ and ‘data_std’ (the normalized features, the
mean of the features and the standard deviation of the features, respectively) to the main()
function. ‘data_mean’ and ‘data_std’ will later be used to normalize the TEST INPUT, in the [Link],
to predict the car price.
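A sketch of the normal() function, assuming plain NumPy z-score normalization:

    import numpy as np

    def normal(x_train):
        # Column-wise mean and standard deviation of the features.
        data_mean = np.mean(x_train, axis=0)
        data_std = np.std(x_train, axis=0)
        # Apply X = (X - mean) / (standard deviation).
        x_train = (x_train - data_mean) / data_std
        return x_train, data_mean, data_std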
4. Complete Data
The feature set is completed for processing by adding a ‘bias’ unit to every row in the
data_complete() function, using the [Link]() function.
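A sketch of how data_complete() could append the bias column with NumPy (the exact helper used
in the original code is not shown here):

    import numpy as np

    def data_complete(x):
        # Prepend a column of ones (the bias unit) to every row.
        ones = np.ones((x.shape[0], 1))
        return np.hstack((ones, x))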
5. Parameter Initialization
The parameter vector ‘theta’ is randomly initialized using [Link]() and is
returned to the main() function.
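A sketch of the random initialization, assuming NumPy’s uniform random number generator:

    import numpy as np

    def init_theta(n_params):
        # One parameter per column of the completed feature set,
        # including the bias column (10 parameters for this dataset).
        return np.random.rand(n_params, 1)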
6. Gradient Descent
The gradient() function is used to optimize the parameters theta and reduce the cost iteratively:
on each iteration the theta values are updated and the cost is recorded, which generates a cost
history. The last training cost is then multiplied by one million to obtain the final cost for the
training set.
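A sketch of a gradient descent routine of this kind; the learning rate and iteration count below are
placeholder assumptions, and y is expected as an (m, 1) column vector as described later in the
report.

    import numpy as np

    def cost(x, y, theta):
        # Mean squared error cost: J = sum((X.theta - y)^2) / (2m)
        m = y.shape[0]
        h = x.dot(theta)
        return float(np.sum((h - y) ** 2) / (2 * m))

    def gradient(x, y, theta, alpha=0.01, iterations=1000):
        # Batch gradient descent: update theta on each iteration and
        # record the cost to build the cost history.
        m = y.shape[0]
        cost_history = []
        for _ in range(iterations):
            h = x.dot(theta)
            theta = theta - (alpha / m) * x.T.dot(h - y)
            cost_history.append(cost(x, y, theta))
        return theta, cost_history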
7. Mean Absolute Error:
The mae() function calculates the mean absolute error on the training set, the CV set and the test
set of the data. It is called in the main() function and its results are displayed to the user with
print().
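A sketch of the mae() function, following the error formula quoted later in the report:

    import numpy as np

    def mae(x, y, theta):
        # Mean absolute error: sum(|h - y|) / m
        m = y.shape[0]
        h = x.dot(theta)
        return float(np.sum(np.abs(h - y)) / m)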
Flow of Algorithm:
Retrieve Data → Data Cleansing → Define Features and Label → Separate the data into: 1. Training
set, 2. Cross-validation set, 3. Test set.
Normalize the Data → Append bias units → Choose a Learning Algorithm.
Train Model: 1. Generate random theta values → 2. Calculate the hypothesis → 3. Calculate the
initial cost function → 4. Gradient descent: update the values of theta iteratively to generate a
cost history.
Predict Car Prices.
Code Running Instructions:
The code has been split into two parts: Training and Prediction.
Training:
In the training part, the data is read from a CSV file and separated into a training feature set and
a training label set. For the provided dataset, the dimensions of the feature set and the label set
are (16281, 9) and (16281,), respectively.
The data is then split into another pair of sets, the cross-validation feature set and the
cross-validation label set, with dimensions (5425, 9) and (5425,), respectively.
Finally, the last pair of sets, the test feature set and the test label set, is created in the same
way, with dimensions (5428, 9) and (5428,).
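For illustration, a split that reproduces the reported shapes (roughly 60/20/20 of the 27,134 rows)
might look like the sketch below; the exact split logic of the original code is assumed, not quoted.

    def split_data(x, y):
        # Split the feature matrix and labels into training,
        # cross-validation and test sets matching the reported
        # row counts 16281 / 5425 / 5428.
        train_end = 16281
        cv_end = train_end + 5425
        x_train, y_train = x[:train_end], y[:train_end]
        x_cv, y_cv = x[train_end:cv_end], y[train_end:cv_end]
        x_test, y_test = x[cv_end:], y[cv_end:]
        return x_train, y_train, x_cv, y_cv, x_test, y_test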
These feature sets are then completed by appending a column of ones (the bias units), and the label
sets are reshaped into column vectors, so the resulting dimensions become
(16281, 10), (16281, 1), (5425, 10), (5425, 1), (5428, 10) and (5428, 1).
Random theta values are initialized. An example value set of theta (a 10 × 1 column vector) is
[0.49354865, 0.62653841, 0.11832303, 0.0742843, 0.42119429, 0.39886133, 0.27029176, 0.76941718,
0.92276763, 0.51739262].
After completing the sets, the cost formula is used to evaluate the cost function before and after
training, and also for the cross-validation data set. These values are:
Cost before training: 1.4038338473544203
Final Cost for Train Set: 0.28237719451856336
Final Cost for CV Set: 0.2663015740160541
The graph of the final cost for the training set and the cross-validation set is plotted as:
Figure 1: Final cost for training set and cross-validation set
The next task is to calculate the mean absolute error on the three sets created earlier.
This is done using the mean absolute error formula, error = [Link](abs(h − y)) / m,
where all the variables used have already been defined in the code.
The results for these errors are
Mean Absolute Error for Training Set: 0.38244467157992945
Mean Absolute Error for CV Set: 0.3486779215566904
Mean Absolute Error for Test Set: 0.3678959796760039.
Prediction:
The training2 code file is imported into this prediction file so that the values calculated for all
the created sets can be reused.
In the prediction() function an array of the nine input features is created, and the data is
normalized using the mean and standard deviation obtained from the training2 file.
This array is then appended with a bias unit of one so that the dot product of the test input with
the final value of theta can be calculated.
The value obtained in the previous step is multiplied by one million to obtain an appropriate price
for the car whose price is to be predicted.
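A sketch of how such a prediction() step could be written; the feature ordering and the names
imported from training2 are assumptions made for illustration.

    import numpy as np

    def prediction(features, data_mean, data_std, theta):
        # 'features' holds the nine raw feature values of the car.
        x = (np.asarray(features, dtype=float) - data_mean) / data_std
        # Prepend the bias unit so the shape matches theta (10 x 1).
        x = np.hstack(([1.0], x))
        # Scale by one million, since the model was trained on prices
        # expressed in millions.
        return x.dot(theta).item() * 1_000_000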
An example prediction is attached in the following snapshot.
Conclusion:
The algorithm used in this program predicts car prices by dividing the provided data into training,
cross-validation and test sets. The features are normalized and the calculations described above are
performed to generate efficient prediction results.