Final Report Team 4

The report analyzes car sales data to predict car prices using various models, including linear regression, regression trees, and classification trees. Key findings indicate that luxury brands and mechanical attributes like horsepower significantly impact car prices, with the regression tree model recommended for its ability to capture complex relationships. The report emphasizes the importance of data quality, model validation, and feature selection for accurate predictions and suggests future improvements such as expanding the dataset and exploring advanced modeling techniques.


454 Final Report: Predicting Car Prices

Noah Colson, Jack Hutchins, Joseph Colman, Joshua Kravets

Isenberg School of Management, University of Massachusetts

OIM 454

Professor Ying Liu

May 16, 2024


Table of Contents
1. Project Description
2. Business Questions
3. Data Preprocessing
4. Models Used
5. Conclusions
6. Summary

Project Description
This project uses a car sales dataset that includes information about a range of different cars. The dataset can support various analyses and machine learning tasks related to the automotive industry, such as predicting car prices, understanding market trends, or identifying factors that influence sales. The dataset was obtained from AnalytixLabs. Our analysis focuses mainly on which variables are the most important when predicting car prices.

Business Questions
1. Which manufacturer has the greatest effect on price?

2. Which mechanical attributes of cars have the highest impact on price?

3. Which predictor variable has the largest overall impact on car price?

4. Can we identify variables that may cause overfitting in the future?

5. What factors affect whether a car will be labeled expensive or not?

Data Preprocessing
Random sampling
The dataset does not have enough records to require random sampling.
Handling missing data
1. Opened the Data Mining tab in the Excel ribbon
2. Chose Transform, then Missing Data Handling
3. Selected all variables and applied "delete record" as the handling method
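The delete-record strategy we applied in Excel can be sketched outside the add-in as well; here is a minimal pandas equivalent (the column names and values are illustrative, not our actual data):

```python
import pandas as pd

# Toy data with missing values; column names are illustrative
df = pd.DataFrame({
    "Horsepower": [190.0, None, 300.0],
    "Price_in_thousands": [38.0, 21.5, None],
})

# "Delete record" handling: drop any row that contains a missing value
clean = df.dropna()
print(len(clean))  # only the fully populated row survives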

Summary characteristics
There appear to be three outliers, one in each of the scatter plots. We concluded that they are genuine extreme values rather than errors; such values are plausible given that the data comes from a reliable source.
Correlation table
Our dependent variable is the price of the car in thousands of dollars.

Possible independent variables: based on our correlation table, Engine_size, Horsepower, and Power_perf_factor are strong candidates, as each yielded a high correlation with our dependent variable. We also believe the categorical variables Manufacturer and Vehicle_type are strongly associated with our dependent variable, and these variables are crucial for answering our research questions.

All of the distributions we chose are generally normally distributed, with a few outliers present that could skew our results. As all three scatterplots show, there is a strong positive correlation between our dependent variable (price in thousands) and the three numerical variables we chose. Price in thousands and power performance factor show the strongest positive correlation, as that trendline is the steepest: the higher a car's power performance factor, the higher its price.
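As a quick numerical illustration of what a strong positive correlation looks like, here is a small sketch with made-up horsepower and price values (not drawn from our dataset):

```python
import numpy as np

# Made-up values for illustration only, not our dataset
horsepower = np.array([120, 150, 200, 250, 300])
price_in_thousands = np.array([18.0, 22.5, 31.0, 42.0, 55.0])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(horsepower, price_in_thousands)[0, 1]
print(round(r, 3))  # close to 1, i.e. a strong positive correlation
```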
Models Used
1st Model: Linear Regression
We chose this model because our dependent variable is numerical, and we believed it would work well with our dataset. First, we had to create dummy variables for all categorical variables so they could be used in the linear regression.
Through some trial and error, we decided to exclude Sales_in_thousands, Latest_launch, Power_perf_factor, and year_resale_value because they were highly correlated with our output variable. We also excluded Model because it had too many unique values.
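The dummy-variable step (which we performed in the Excel add-in) corresponds to one-hot encoding; a minimal pandas sketch, with assumed column names, looks like this:

```python
import pandas as pd

# Tiny illustrative frame; column names and values are assumptions
cars = pd.DataFrame({
    "Manufacturer": ["BMW", "Ford", "Porsche"],
    "Vehicle_type": ["Car", "Car", "Car"],
    "Horsepower": [190, 155, 300],
})

# One 0/1 dummy column is created per category level
dummies = pd.get_dummies(cars, columns=["Manufacturer", "Vehicle_type"])
print(sorted(dummies.columns))
```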
The variables we ended up using were Horsepower, Length, Curb_weight, Fuel_efficiency, Manufacturer_BMW, Manufacturer_Lexus, Manufacturer_Mercedes-B, Manufacturer_Porsche, and Vehicle_type_car.
The R² values for the training and prediction summaries are both adequate: .93 and .79, respectively.
The training dataset has a much lower RMSE and a higher R² than the prediction dataset, which suggests the model fits the training data better and may be mildly overfitting.
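For reference, R² and RMSE, the two fit metrics we report throughout, can be computed as follows (a generic sketch of the formulas, not tied to our spreadsheet output):

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Perfect predictions give R² = 1.0 and RMSE = 0.0
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
      rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```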

Output

Our output equation is: Price in thousands = 11.485 + 0.155Horsepower − 0.354Length + 12.273Curb_weight + 0.456Fuel_efficiency + 10.219Man_BMW + 5.41Man_Lexus + 19.231Man_Mercedes-B + 28.237Man_Porsche − 4.008Vehicle_type_car

We also wanted to test the equation, so we made up a hypothetical car: Horsepower 195, Length 182, Curb weight 3.276 (thousand pounds), Fuel efficiency 25, Manufacturer BMW.
Price in thousands = 11.485 + 0.155×195 − 0.354×182 + 12.273×3.276 + 0.456×25 + 10.219×1
= 39.107, or $39,107
Compared with the prices of the BMWs in our dataset, this test suggests the model predicts BMW prices accurately.
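The output equation can also be written as a small function. The coefficients come straight from the regression output above; the parameter names are our own shorthand:

```python
def predict_price_thousands(horsepower, length, curb_weight, fuel_efficiency,
                            is_bmw=0, is_lexus=0, is_mercedes=0, is_porsche=0,
                            is_passenger_car=0):
    """Predicted price in thousands of dollars, per the fitted equation."""
    return (11.485
            + 0.155 * horsepower
            - 0.354 * length
            + 12.273 * curb_weight      # curb weight in thousands of pounds
            + 0.456 * fuel_efficiency
            + 10.219 * is_bmw
            + 5.41 * is_lexus
            + 19.231 * is_mercedes
            + 28.237 * is_porsche
            - 4.008 * is_passenger_car)

# Hypothetical BMW from the report: about $39,107
print(round(predict_price_thousands(195, 182, 3.276, 25, is_bmw=1), 3))
```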
How do our results answer our questions?

Our linear regression has shown which manufacturers have the highest impact on price. From our results we can determine that BMW, Porsche, Lexus, and Mercedes-B have the highest impact, given how low their p-values are (lower than .05).

Our linear regression also shows which mechanical aspects of a car have the highest impact on price: fuel_efficiency, length, curb_weight, and horsepower.

2nd Model: Regression Tree

For the regression tree we used every variable in our dataset except resale price. We excluded this variable because resale price is highly correlated with the dependent variable, which would cause multicollinearity; leaving it in could have led to overfitting. The Model variable was again excluded because it has too many unique values.
Input variables: Sales_in_thousands, engine_size, horsepower, wheelbase, width, length, curb_weight, fuel_capacity, fuel_efficiency, power_perf_factor, Manufacturer, and Vehicle_type

Output:
The R² values for training and validation are both adequate, at .63 and .86, respectively.
As for RMSE, the training summary's value is much lower, indicating a better fit on the training data. Overall, the model fits the training set better.
Results:
The full tree has 7 nodes and 4 levels.
Power_perf_factor was the most impactful variable in predicting price, as it yielded the highest feature importance score (359.61).
Among the other predictors, Curb_weight, Width, and Sales_in_thousands each yielded an importance score of around 30.
Our best pruned tree further confirms the overwhelming impact of Power_perf_factor on price, as all the other variables shown in the full tree have been removed.
This model has shown us how strongly each variable contributes when predicting the price of different cars.
Our results show that, for price prediction, the best predictor variable in our dataset is Power_perf_factor.
The regression tree also revealed that a couple of variables, such as Sales_in_thousands, may have been causing overfitting issues.
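Feature importance scores like the ones above reflect how much each variable reduces prediction error across the tree's splits. A small synthetic sketch (not our data, and using scikit-learn rather than the Excel add-in we actually used) shows the idea: the column that drives the target receives by far the highest importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: column 0 plays the role of power_perf_factor and
# dominates the target, so it should receive the highest importance score
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
print(tree.feature_importances_)  # column 0 should dominate
```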
3rd Model: Classification Tree

For our third model we used a classification tree. We chose a binary output variable, expensive or not, using a cutoff of $40,000.
The input variables were: Engine_size, Horsepower, Wheelbase, Width, Length, Curb_weight, Fuel_capacity, Power_perf_factor, Manufacturer, and Vehicle_type
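The binary target can be built from the price column with a simple rule. In this sketch we assume prices at or above the cutoff count as expensive; the exact boundary handling is a modeling choice and the function name is our own:

```python
def label_expensive(price_in_thousands, cutoff=40.0):
    """Binary label for the classification tree; prices are in thousands."""
    # Assumption: a price exactly at the cutoff counts as expensive
    return "expensive" if price_in_thousands >= cutoff else "not expensive"

print(label_expensive(55.541), label_expensive(21.5))
```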

Output/Results:
Our metrics show that the classification model is 100% accurate on our data in predicting whether a car is labeled "expensive". Due to our limited dataset, there was no best pruned tree.
Once again, we see that power performance factor is a very important node affecting whether a car is expensive.
Engine size has a large effect on the expense label as well.
Although many Ford cars have the qualities of cars typically labeled expensive, the tree shows that cars manufactured by Ford are rarely labeled expensive.
Consider a hypothetical car with the following attributes:

● Horsepower: 220
● Length: 190 inches
● Curb Weight: 3.500 thousand pounds
● Fuel Efficiency: 22 miles per gallon
● Manufacturer: Porsche
● Vehicle Type: Car

Using the output equation from the linear regression model:

Price in thousands = 11.485 + 0.155×220 − 0.354×190 + 12.273×3.500 + 0.456×22 + 28.237×1 − 4.008×1

= 11.485 + 34.1 − 67.26 + 42.955 + 10.032 + 28.237 − 4.008

= 55.541, or $55,541

Thus, the predicted price of this hypothetical Porsche car is $55,541.

Comparison Between Models


● Linear Regression: This model is simple to implement and interpret. It provided an R² of
0.93 for the training set and 0.79 for the prediction set, indicating a good fit but some
overfitting. The model identified significant predictors such as manufacturer, horsepower,
length, curb weight, and fuel efficiency.
● Regression Tree: This model accounted for non-linear relationships and interactions
between variables. It provided an R² of 0.86 for the validation set, which is robust but
lower than the linear regression's training R² of 0.93. The tree showed that the power
performance factor is the most important predictor, followed by curb weight and width.
● Classification Tree: This model categorized cars as expensive or not with a cutoff of
$40,000. It achieved 100% accuracy in classification due to the limited dataset, with
power performance factor and engine size being the most significant predictors.

Model Recommendation
Considering the trade-off between interpretability and predictive power, the Regression Tree
model is recommended. Although it has a slightly lower R² compared to linear regression, it
better captures the complexity of the relationships between variables and provides more
actionable insights about variable importance, which can help avoid overfitting in the future.

Conclusions
Our analysis indicates that several key factors significantly impact car prices. Among
manufacturers, luxury brands such as Porsche, Mercedes-Benz, BMW, and Lexus have the
greatest effect on increasing prices. Mechanical attributes like horsepower, curb weight, and
power performance factor are also critical in determining car prices. For car manufacturers and
dealerships, focusing on enhancing these features can justify higher pricing. However, it is
crucial to monitor for potential overfitting by regularly validating models with new data. To
classify cars as expensive, a model considering both mechanical attributes and brand can be
highly accurate. These insights can guide strategic decisions in production, marketing, and sales
to target the high-end market effectively.

Summary
Lessons Learned

1. Importance of Data Quality: Accurate and comprehensive data is crucial for reliable
predictions. We observed that handling missing data and excluding highly correlated
variables helped improve the model's performance.
2. Model Selection and Validation: Different models provide varying insights and
performance levels. The linear regression model offered simplicity and interpretability,
while the regression tree captured complex interactions. Validation metrics like R² and
RMSE were essential in assessing model fit and avoiding overfitting.
3. Feature Importance: Identifying the most influential variables, such as manufacturer and
mechanical attributes (horsepower, curb weight, fuel efficiency), guided our
understanding of what drives car prices. This knowledge is valuable for making
data-driven business decisions.
4. Handling Outliers: Recognizing and deciding on the treatment of outliers is crucial. We
determined that the outliers in our dataset were genuine extreme values rather than errors,
ensuring the integrity of our analysis.
5. Categorical Variables: Creating dummy variables for categorical data allowed us to
incorporate qualitative factors like manufacturer and vehicle type into our models,
enhancing their predictive power.

Potential Issues and Future Extensions

1. Dataset Size and Diversity: The dataset had a limited number of records, which may not
fully capture the diversity of the car market. In the future, expanding the dataset to include
more records across different manufacturers, models, and regions would improve model
generalizability.
2. Missing Data Handling: While we chose to delete records with missing data, other
imputation methods (e.g., mean imputation, k-nearest neighbors) could be explored to
retain more data and potentially enhance model accuracy.
3. Feature Engineering: Additional features such as market conditions, economic indicators,
and consumer preferences could be included to provide a more comprehensive analysis of
car prices.
4. Longitudinal Data: Incorporating time-series data on car sales and prices can help in
understanding trends over time and predicting future prices more accurately.
5. Advanced Modeling Techniques: Experimenting with more sophisticated machine
learning techniques such as Random Forests, Gradient Boosting Machines, or Neural
Networks could yield better predictive performance and insights.
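The mean-imputation alternative mentioned in point 2 above can be sketched in pandas as follows (toy values, not our dataset):

```python
import pandas as pd

# Toy column with one missing value
horsepower = pd.Series([190.0, None, 300.0])

# Mean imputation: replace the missing entry with the column mean (245.0)
filled = horsepower.fillna(horsepower.mean())
print(filled.tolist())  # [190.0, 245.0, 300.0]
```

Unlike deleting records, this keeps all rows, at the cost of shrinking the column's variance.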

Task Distribution Table

Task Description: Group Member(s)

● Data Collection and Initial Cleaning: Noah
● Missing Data Handling: Josh
● Exploratory Data Analysis: Joe
● Model Selection and Implementation: Jack
● Linear Regression Model: Noah
● Regression Tree Model: Josh
● Classification Tree Model: Joe
● Model Validation and Comparison: Jack
● Results Interpretation and Reporting: Noah
● Hypothetical Example and Predictions: Josh
● Lessons Learned and Future Extensions: Joe
● Task Coordination and Documentation: Jack
