Final Report Team 4
OIM 454
Project description
This project uses a car sales data set containing information about a range of vehicles. The data set supports various analyses and machine learning tasks related to the automotive industry, such as predicting car prices, understanding market trends, or identifying factors that influence sales. The data set was obtained from Analytixlabs for prediction purposes. Our analysis focuses mainly on which variables matter most when predicting car prices.
Business questions
1. Which manufacturer has the greatest effect on price?
2. Which predictor variable has the largest overall impact on car price?
3. Can we determine which variables may cause overfitting in the future?
Data Preprocessing
Random sampling
The data set does not have enough records to warrant random sampling, so no sampling was performed.
Handling missing data
1. Opened the Data Mining ribbon in Excel
2. Chose Transform, then Missing Data Handling
3. Selected all variables and applied "delete record" as the handling method
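For reference, a minimal pandas sketch of the same "delete record" treatment is shown below; the file name car_sales.csv and the use of pandas are assumptions, since our actual work was done through the Excel ribbon.

```python
import pandas as pd

# Hypothetical file name; the report works from the Analytixlabs car sales workbook in Excel.
df = pd.read_csv("car_sales.csv")

# Equivalent of the "delete record" option: drop every row with a missing value in any column.
clean = df.dropna(axis=0, how="any").reset_index(drop=True)

print(f"Rows before: {len(df)}, rows after: {len(clean)}")
```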
Summary characteristics
There appear to be three outliers, one in each of the scatter plots. We have concluded that these are extreme values rather than errors; such values are plausible given that the data come from a reliable source.
Correlation table
Our dependent variable is the price of the car in thousands.
Possible independent variables: based on our correlation table, possible independent variables are engine size, horsepower, and Power_perf_factor, as these variables show high correlation with our dependent variable. We also believe that the categorical variables manufacturer and vehicle type have a strong relationship with our dependent variable, and these variables are crucial for answering our research questions.
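A short sketch of how such a correlation table could be produced in pandas, continuing from the cleaned data above; the column name Price_in_thousands is an assumption.

```python
# Correlation of the numeric predictors with the dependent variable.
numeric = clean.select_dtypes(include="number")
corr_with_price = numeric.corr()["Price_in_thousands"].sort_values(ascending=False)
print(corr_with_price)  # Engine_size, Horsepower, and Power_perf_factor should rank near the top
```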
All of the distributions we examined are roughly normally distributed, with a few outliers present that may skew our results later on.
As all three scatterplots show, there is a strong positive correlation between our dependent variable (price in thousands) and the three numerical variables we chose. Price in thousands and Power_perf_factor appear to have the strongest positive correlation, as that trendline is the steepest: the higher a car's power performance factor, the higher its price tends to be.
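The scatterplots and trendlines described above could be reproduced with a sketch like the following, continuing from the cleaned data; column names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Scatter of Power_perf_factor against Price_in_thousands with a least-squares trendline.
x = clean["Power_perf_factor"].to_numpy(dtype=float)
y = clean["Price_in_thousands"].to_numpy(dtype=float)

slope, intercept = np.polyfit(x, y, deg=1)

plt.scatter(x, y, alpha=0.6)
plt.plot(x, slope * x + intercept, color="red",
         label=f"trend: {slope:.2f}x + {intercept:.2f}")
plt.xlabel("Power_perf_factor")
plt.ylabel("Price_in_thousands")
plt.legend()
plt.show()
```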
Models Used
1st Model: Linear Regression
We chose this model because our dependent variable is numerical and we believed it would work well with our data set. First, we had to create dummy variables for all categorical variables so they could be used in the linear regression.
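A minimal sketch of the dummy-variable step, assuming the categorical columns are named Manufacturer and Vehicle_type:

```python
# One-hot encode the categorical predictors so they can enter the regression;
# "Manufacturer" and "Vehicle_type" are assumed column names.
dummied = pd.get_dummies(clean, columns=["Manufacturer", "Vehicle_type"], drop_first=True)
print([c for c in dummied.columns if c.startswith(("Manufacturer_", "Vehicle_type_"))])
```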
Through some trial and error, we decided to exclude Sales_in_thousands, Latest_launch, Power_perf_factor, and the car's year resale value because they were highly correlated with our output variable. We also excluded the model variable because it had too many unique values.
The variables we ended up using were: Horsepower, Length, Curb_weight, Fuel_efficiency, Manufacturer_BMW, Manufacturer_Lexus, Manufacturer_Mercedes-B, Manufacturer_Porsche, and Vehicle_type_Car.
The R² values for the training and validation (prediction) summaries are adequate, at .93 and .79 respectively.
The training data set has a much lower RMSE and a higher R² than the validation set, indicating that the model fits the training data better.
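A sketch of how the same fit and training-versus-validation comparison could be done in scikit-learn is shown below, continuing from the dummied data; the feature and column names are assumptions, and the random split will not exactly match the partition used in Analytic Solver.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Predictors retained in the final model; the dummy column names are assumptions.
features = ["Horsepower", "Length", "Curb_weight", "Fuel_efficiency",
            "Manufacturer_BMW", "Manufacturer_Lexus", "Manufacturer_Mercedes-B",
            "Manufacturer_Porsche", "Vehicle_type_Car"]
X = dummied[features]
y = dummied["Price_in_thousands"]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)

for name, Xs, ys in [("Training", X_train, y_train), ("Validation", X_valid, y_valid)]:
    pred = model.predict(Xs)
    print(f"{name}: R2 = {r2_score(ys, pred):.2f}, "
          f"RMSE = {np.sqrt(mean_squared_error(ys, pred)):.2f}")
```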
Output
Our linear regression shows which manufacturers have the highest impact on price. From the results we can determine that BMW, Porsche, Lexus, and Mercedes-B have the highest impact, given how low their p-values are (lower than .05).
The regression also shows which mechanical aspects of a car have the greatest impact on price: fuel_efficiency, length, curb_weight, and horsepower.
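The p-value check described above could be reproduced with a statsmodels refit such as the following sketch, continuing from the training split above.

```python
import statsmodels.api as sm

# Refit with statsmodels to read off coefficient p-values, as used when singling out
# BMW, Porsche, Lexus, and Mercedes-B (p < .05).
X_const = sm.add_constant(X_train.astype(float))
ols = sm.OLS(y_train.astype(float), X_const).fit()
print(ols.summary())                       # full coefficient table
print(ols.pvalues[ols.pvalues < 0.05])     # predictors significant at the 5% level
```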
2nd Model: Regression Tree
Output:
The R² values for the training and validation sets are both adequate, at .63 and .86 respectively.
As for RMSE, the training summary is much lower, indicating a better fit on the training partition. Overall, the model fits the training data better.
Results:
The full tree has 7 nodes and 4 levels.
Power performance factor was the most impactful variable in predicting price, as it yielded the highest feature importance score (359.61).
Among the other predictor variables, curb weight, width, and Sales_in_thousands all yielded importance scores of around 30.
Our best pruned tree further confirms the overwhelming impact that Power_perf_factor has on price, as all other variables shown in the full tree have been removed.
This model has shown us how strongly different variables contribute when predicting the price of different cars.
Our results show that, for price prediction, the best predictor variable in our data set is Power_perf_factor.
The regression tree has also revealed that a couple of variables, such as Sales_in_thousands, may have been causing overfitting issues.
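A rough scikit-learn sketch of a regression tree like the one described above is shown below; the predictor list is an assumption, and scikit-learn reports normalized importances rather than Analytic Solver's raw scores.

```python
from sklearn.tree import DecisionTreeRegressor

# Regression tree sketch; max_depth=4 mirrors the four levels of the full tree.
tree_features = ["Power_perf_factor", "Curb_weight", "Width", "Sales_in_thousands",
                 "Horsepower", "Engine_size"]
reg_tree = DecisionTreeRegressor(max_depth=4, random_state=42)
reg_tree.fit(dummied[tree_features], dummied["Price_in_thousands"])

# Note: scikit-learn normalizes importances to sum to 1, unlike Analytic Solver's scores.
importance = pd.Series(reg_tree.feature_importances_, index=tree_features)
print(importance.sort_values(ascending=False))
```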
The third model we used was a classification tree. We chose a binary output variable, expensive or not, with a cutoff of $40,000.
The input variables were: engine size, horsepower, wheelbase, width, length, curb weight, fuel capacity, power performance factor, manufacturer, and vehicle type.
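A minimal scikit-learn sketch of this classification setup, assuming the $40,000 cutoff (40 in thousands) and the dummy columns created earlier:

```python
from sklearn.tree import DecisionTreeClassifier

# Binary "expensive" label; manufacturer and vehicle-type dummies stand in for the
# two categorical inputs. Column names are assumptions.
dummied["Expensive"] = (dummied["Price_in_thousands"] > 40).astype(int)

manufacturer_cols = [c for c in dummied.columns if c.startswith("Manufacturer_")]
clf_features = ["Engine_size", "Horsepower", "Wheelbase", "Width", "Length",
                "Curb_weight", "Fuel_capacity", "Power_perf_factor",
                "Vehicle_type_Car"] + manufacturer_cols

clf = DecisionTreeClassifier(random_state=42)
clf.fit(dummied[clf_features], dummied["Expensive"])
print(f"Training accuracy: {clf.score(dummied[clf_features], dummied['Expensive']):.2f}")
```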
Output/Results:
Our metrics show that our classification model is 100% accurate in predicting whether or not a car is labeled "expensive". Due to our limited data set, there was no best pruned tree.
Once again, we see that power performance factor is a very important node affecting whether a car is expensive.
Engine size also has a large effect on whether a car is expensive.
Although many Ford cars have the qualities typically associated with an expensive car, the tree shows that cars manufactured by Ford are rarely labeled expensive.
Consider a hypothetical car with the following attributes:
● Horsepower: 220
● Length: 190 inches
● Curb Weight: 3.500 thousand pounds
● Fuel Efficiency: 22 miles per gallon
● Manufacturer: Porsche
● Vehicle Type: Car
Price in thousands = 11.485 + 0.155×220 − 0.354×190 + 12.273×3.500 + 0.456×22 + 28.237×1 − 4.008×1
= 11.485 + 34.1 − 67.26 + 42.955 + 10.032 + 28.237 − 4.008
= 55.541, or $55,541
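The same calculation, written out as a short Python check using the coefficients quoted above:

```python
# Plugging the hypothetical car into the fitted equation; the coefficients are the ones
# quoted in the worked example above.
intercept = 11.485
coefs = {"Horsepower": 0.155, "Length": -0.354, "Curb_weight": 12.273,
         "Fuel_efficiency": 0.456, "Manufacturer_Porsche": 28.237, "Vehicle_type_Car": -4.008}
car = {"Horsepower": 220, "Length": 190, "Curb_weight": 3.500,
       "Fuel_efficiency": 22, "Manufacturer_Porsche": 1, "Vehicle_type_Car": 1}

price = intercept + sum(coefs[k] * car[k] for k in coefs)
print(f"Predicted price: {price:.2f} thousand dollars")  # ≈ 55.54, matching the hand calculation
```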
Model Recommendation
Considering the trade-off between interpretability and predictive power, the Regression Tree
model is recommended. Although it has a slightly lower R² compared to linear regression, it
better captures the complexity of the relationships between variables and provides more
actionable insights about variable importance, which can help avoid overfitting in the future.
Conclusions
Our analysis indicates that several key factors significantly impact car prices. Among
manufacturers, luxury brands such as Porsche, Mercedes-Benz, BMW, and Lexus have the
greatest effect on increasing prices. Mechanical attributes like horsepower, curb weight, and
power performance factor are also critical in determining car prices. For car manufacturers and
dealerships, focusing on enhancing these features can justify higher pricing. However, it is
crucial to monitor for potential overfitting by regularly validating models with new data. To
classify cars as expensive, a model considering both mechanical attributes and brand can be
highly accurate. These insights can guide strategic decisions in production, marketing, and sales
to target the high-end market effectively.
Summary
Lessons Learned
1. Importance of Data Quality: Accurate and comprehensive data is crucial for reliable
predictions. We observed that handling missing data and excluding highly correlated
variables helped improve the model's performance.
2. Model Selection and Validation: Different models provide varying insights and
performance levels. The linear regression model offered simplicity and interpretability,
while the regression tree captured complex interactions. Validation metrics like R² and
RMSE were essential in assessing model fit and avoiding overfitting.
3. Feature Importance: Identifying the most influential variables, such as manufacturer and
mechanical attributes (horsepower, curb weight, fuel efficiency), guided our
understanding of what drives car prices. This knowledge is valuable for making
data-driven business decisions.
4. Handling Outliers: Recognizing and deciding on the treatment of outliers is crucial. We
determined that the outliers in our dataset were genuine extreme values rather than errors,
ensuring the integrity of our analysis.
5. Categorical Variables: Creating dummy variables for categorical data allowed us to
incorporate qualitative factors like manufacturer and vehicle type into our models,
enhancing their predictive power.
Future Improvements
1. Dataset Size and Diversity: The data set had a limited number of records, which may not fully capture the diversity of the car market. In the future, expanding the data set to include more records across different manufacturers, models, and regions would improve model generalizability.
2. Missing Data Handling: While we chose to delete records with missing data, other
imputation methods (e.g., mean imputation, k-nearest neighbors) could be explored to
retain more data and potentially enhance model accuracy.
3. Feature Engineering: Additional features such as market conditions, economic indicators,
and consumer preferences could be included to provide a more comprehensive analysis of
car prices.
4. Longitudinal Data: Incorporating time-series data on car sales and prices can help in
understanding trends over time and predicting future prices more accurately.
5. Advanced Modeling Techniques: Experimenting with more sophisticated machine
learning techniques such as Random Forests, Gradient Boosting Machines, or Neural
Networks could yield better predictive performance and insights.