Final Report Team 4
OIM 454
Project description
This project uses a car sales data set containing information about a range of vehicles. The data set supports various analyses and machine learning tasks related to the automotive industry, such as predicting car prices, understanding market trends, or identifying factors that influence sales. The data set was obtained from Analytixlabs for prediction purposes. Our analysis focuses mainly on which variables matter most when predicting car prices.
Business questions
1. Which manufacturer has the greatest effect on price?
2. Which predictor variable has the largest overall impact on car price?
3. Can we determine which variables may cause overfitting in the future?
Data Preprocessing
Random sampling
The data set does not have enough records to warrant random sampling, so no sampling was performed.
Handling missing data
1. Opened the Data Mining ribbon in Excel
2. Chose Transform, then Missing Data Handling
3. Selected all variables and applied "delete record" as the handling method
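For reference, a minimal pandas sketch of the same "delete record" treatment is shown below; the file name car_sales.csv and the use of pandas are assumptions, since our actual work was done through the Excel ribbon.

```python
import pandas as pd

# Hypothetical file name; the report works from the Analytixlabs car sales workbook in Excel.
df = pd.read_csv("car_sales.csv")

# Equivalent of the "delete record" option: drop every row with a missing value in any column.
clean = df.dropna(axis=0, how="any").reset_index(drop=True)

print(f"Rows before: {len(df)}, rows after: {len(clean)}")
```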
Summary characteristics
There appear to be three outliers, one in each of the scatter plots. We have concluded that these are extreme values rather than errors; such values are plausible given that the data come from a reliable source.
Correlation table
Our dependent variable is the price of the car in thousands.
Possible independent variables: based on our correlation table, possible independent variables are engine size, horsepower, and Power_perf_factor, as these variables show high correlation with our dependent variable. We also believe that the categorical variables manufacturer and vehicle type have a strong relationship with our dependent variable, and these variables are crucial for answering our research questions.
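A short sketch of how such a correlation table could be produced in pandas, continuing from the cleaned data above; the column name Price_in_thousands is an assumption.

```python
# Correlation of the numeric predictors with the dependent variable.
numeric = clean.select_dtypes(include="number")
corr_with_price = numeric.corr()["Price_in_thousands"].sort_values(ascending=False)
print(corr_with_price)  # Engine_size, Horsepower, and Power_perf_factor should rank near the top
```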
All of the distributions we examined are roughly normally distributed, with a few outliers present that may skew our results later on.
As all three scatterplots show, there is a strong positive correlation between our dependent variable (price in thousands) and the three numerical variables we chose. Price in thousands and Power_perf_factor appear to have the strongest positive correlation, as that trendline is the steepest: the higher a car's power performance factor, the higher its price tends to be.
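The scatterplots and trendlines described above could be reproduced with a sketch like the following, continuing from the cleaned data; column names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Scatter of Power_perf_factor against Price_in_thousands with a least-squares trendline.
x = clean["Power_perf_factor"].to_numpy(dtype=float)
y = clean["Price_in_thousands"].to_numpy(dtype=float)

slope, intercept = np.polyfit(x, y, deg=1)

plt.scatter(x, y, alpha=0.6)
plt.plot(x, slope * x + intercept, color="red",
         label=f"trend: {slope:.2f}x + {intercept:.2f}")
plt.xlabel("Power_perf_factor")
plt.ylabel("Price_in_thousands")
plt.legend()
plt.show()
```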
Models Used
1st Model: Linear Regression
We chose this model because our dependent variable is numerical and we believed it would work well with our data set. First, we had to create dummy variables for all categorical variables so they could be used in the linear regression.
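A minimal sketch of the dummy-variable step, assuming the categorical columns are named Manufacturer and Vehicle_type:

```python
# One-hot encode the categorical predictors so they can enter the regression;
# "Manufacturer" and "Vehicle_type" are assumed column names.
dummied = pd.get_dummies(clean, columns=["Manufacturer", "Vehicle_type"], drop_first=True)
print([c for c in dummied.columns if c.startswith(("Manufacturer_", "Vehicle_type_"))])
```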
Through some trial and error, we decided to exclude Sales_in_thousands, Latest_launch, Power_perf_factor, and the car's year resale value because they were highly correlated with our output variable. We also excluded the model variable because it had too many unique values.
The variables we ended up using were: Horsepower, Length, Curb_weight, Fuel_efficiency, Manufacturer_BMW, Manufacturer_Lexus, Manufacturer_Mercedes-B, Manufacturer_Porsche, and Vehicle_type_Car.
The R² values for the training and validation (prediction) summaries are adequate, at .93 and .79 respectively.
The training data set has a much lower RMSE and a higher R² than the validation set, indicating that the model fits the training data better.
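A sketch of how the same fit and training-versus-validation comparison could be done in scikit-learn is shown below, continuing from the dummied data; the feature and column names are assumptions, and the random split will not exactly match the partition used in Analytic Solver.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Predictors retained in the final model; the dummy column names are assumptions.
features = ["Horsepower", "Length", "Curb_weight", "Fuel_efficiency",
            "Manufacturer_BMW", "Manufacturer_Lexus", "Manufacturer_Mercedes-B",
            "Manufacturer_Porsche", "Vehicle_type_Car"]
X = dummied[features]
y = dummied["Price_in_thousands"]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)

for name, Xs, ys in [("Training", X_train, y_train), ("Validation", X_valid, y_valid)]:
    pred = model.predict(Xs)
    print(f"{name}: R2 = {r2_score(ys, pred):.2f}, "
          f"RMSE = {np.sqrt(mean_squared_error(ys, pred)):.2f}")
```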
Output
Our linear regression shows which manufacturers have the highest impact on price. From the results we can determine that BMW, Porsche, Lexus, and Mercedes-B have the highest impact, given how low their p-values are (lower than .05).
The regression also shows which mechanical aspects of a car have the greatest impact on price: fuel_efficiency, length, curb_weight, and horsepower.
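The p-value check described above could be reproduced with a statsmodels refit such as the following sketch, continuing from the training split above.

```python
import statsmodels.api as sm

# Refit with statsmodels to read off coefficient p-values, as used when singling out
# BMW, Porsche, Lexus, and Mercedes-B (p < .05).
X_const = sm.add_constant(X_train.astype(float))
ols = sm.OLS(y_train.astype(float), X_const).fit()
print(ols.summary())                       # full coefficient table
print(ols.pvalues[ols.pvalues < 0.05])     # predictors significant at the 5% level
```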
2nd Model: Regression Tree
Output:
The R² values for the training and validation sets are both adequate, at .63 and .86 respectively.
As for RMSE, the training summary is much lower, indicating a better fit on the training partition. Overall, the model fits the training data better.
Results:
The full tree has 7 nodes and 4 levels.
Power performance factor was the most impactful variable in predicting price, as it yielded the highest feature importance score (359.61).
Among the other predictor variables, curb weight, width, and Sales_in_thousands all yielded importance scores of around 30.
Our best pruned tree further confirms the overwhelming impact that Power_perf_factor has on price, as all other variables shown in the full tree have been removed.
This model has shown us how strongly different variables contribute when predicting the price of different cars.
Our results show that, for price prediction, the best predictor variable in our data set is Power_perf_factor.
The regression tree has also revealed that a couple of variables, such as Sales_in_thousands, may have been causing overfitting issues.
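A rough scikit-learn sketch of a regression tree like the one described above is shown below; the predictor list is an assumption, and scikit-learn reports normalized importances rather than Analytic Solver's raw scores.

```python
from sklearn.tree import DecisionTreeRegressor

# Regression tree sketch; max_depth=4 mirrors the four levels of the full tree.
tree_features = ["Power_perf_factor", "Curb_weight", "Width", "Sales_in_thousands",
                 "Horsepower", "Engine_size"]
reg_tree = DecisionTreeRegressor(max_depth=4, random_state=42)
reg_tree.fit(dummied[tree_features], dummied["Price_in_thousands"])

# Note: scikit-learn normalizes importances to sum to 1, unlike Analytic Solver's scores.
importance = pd.Series(reg_tree.feature_importances_, index=tree_features)
print(importance.sort_values(ascending=False))
```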
The third model we used was a classification tree. We chose a binary output variable, expensive or not, with a cutoff of $40,000.
The input variables were: engine size, horsepower, wheelbase, width, length, curb weight, fuel capacity, power performance factor, manufacturer, and vehicle type.
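A minimal scikit-learn sketch of this classification setup, assuming the $40,000 cutoff (40 in thousands) and the dummy columns created earlier:

```python
from sklearn.tree import DecisionTreeClassifier

# Binary "expensive" label; manufacturer and vehicle-type dummies stand in for the
# two categorical inputs. Column names are assumptions.
dummied["Expensive"] = (dummied["Price_in_thousands"] > 40).astype(int)

manufacturer_cols = [c for c in dummied.columns if c.startswith("Manufacturer_")]
clf_features = ["Engine_size", "Horsepower", "Wheelbase", "Width", "Length",
                "Curb_weight", "Fuel_capacity", "Power_perf_factor",
                "Vehicle_type_Car"] + manufacturer_cols

clf = DecisionTreeClassifier(random_state=42)
clf.fit(dummied[clf_features], dummied["Expensive"])
print(f"Training accuracy: {clf.score(dummied[clf_features], dummied['Expensive']):.2f}")
```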
Output/Results:
Our metrics show that our classification model is 100% accurate in predicting whether or not a car is labeled "expensive". Due to our limited data set, there was no best pruned tree.
Once again, we see that power performance factor is a very important node affecting whether a car is expensive.
Engine size also has a large effect on whether a car is expensive.
Although many Ford cars have the qualities typically associated with an expensive car, the tree shows that cars manufactured by Ford are rarely labeled expensive.
Consider a hypothetical car with the following attributes:
● Horsepower: 220
● Length: 190 inches
● Curb Weight: 3.500 thousand pounds
● Fuel Efficiency: 22 miles per gallon
● Manufacturer: Porsche
● Vehicle Type: Car
Price in thousands = 11.485 + 0.155×220 − 0.354×190 + 12.273×3.500 + 0.456×22 + 28.237×1 − 4.008×1
= 11.485 + 34.1 − 67.26 + 42.955 + 10.032 + 28.237 − 4.008
= 55.541, or $55,541
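The same calculation, written out as a short Python check using the coefficients quoted above:

```python
# Plugging the hypothetical car into the fitted equation; the coefficients are the ones
# quoted in the worked example above.
intercept = 11.485
coefs = {"Horsepower": 0.155, "Length": -0.354, "Curb_weight": 12.273,
         "Fuel_efficiency": 0.456, "Manufacturer_Porsche": 28.237, "Vehicle_type_Car": -4.008}
car = {"Horsepower": 220, "Length": 190, "Curb_weight": 3.500,
       "Fuel_efficiency": 22, "Manufacturer_Porsche": 1, "Vehicle_type_Car": 1}

price = intercept + sum(coefs[k] * car[k] for k in coefs)
print(f"Predicted price: {price:.2f} thousand dollars")  # ≈ 55.54, matching the hand calculation
```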
Model Recommendation
Considering the trade-off between interpretability and predictive power, the Regression Tree
model is recommended. Although it has a slightly lower R² compared to linear regression, it
better captures the complexity of the relationships between variables and provides more
actionable insights about variable importance, which can help avoid overfitting in the future.
Conclusions
Our analysis indicates that several key factors significantly impact car prices. Among
manufacturers, luxury brands such as Porsche, Mercedes-Benz, BMW, and Lexus have the
greatest effect on increasing prices. Mechanical attributes like horsepower, curb weight, and
power performance factor are also critical in determining car prices. For car manufacturers and
dealerships, focusing on enhancing these features can justify higher pricing. However, it is
crucial to monitor for potential overfitting by regularly validating models with new data. To
classify cars as expensive, a model considering both mechanical attributes and brand can be
highly accurate. These insights can guide strategic decisions in production, marketing, and sales
to target the high-end market effectively.
Summary
Lessons Learned
1. Importance of Data Quality: Accurate and comprehensive data is crucial for reliable
predictions. We observed that handling missing data and excluding highly correlated
variables helped improve the model's performance.
2. Model Selection and Validation: Different models provide varying insights and
performance levels. The linear regression model offered simplicity and interpretability,
while the regression tree captured complex interactions. Validation metrics like R² and
RMSE were essential in assessing model fit and avoiding overfitting.
3. Feature Importance: Identifying the most influential variables, such as manufacturer and
mechanical attributes (horsepower, curb weight, fuel efficiency), guided our
understanding of what drives car prices. This knowledge is valuable for making
data-driven business decisions.
4. Handling Outliers: Recognizing and deciding on the treatment of outliers is crucial. We
determined that the outliers in our dataset were genuine extreme values rather than errors,
ensuring the integrity of our analysis.
5. Categorical Variables: Creating dummy variables for categorical data allowed us to
incorporate qualitative factors like manufacturer and vehicle type into our models,
enhancing their predictive power.
Future Improvements
1. Dataset Size and Diversity: The data set had a limited number of records, which may not fully capture the diversity of the car market. In the future, expanding the data set to include more records across different manufacturers, models, and regions would improve model generalizability.
2. Missing Data Handling: While we chose to delete records with missing data, other
imputation methods (e.g., mean imputation, k-nearest neighbors) could be explored to
retain more data and potentially enhance model accuracy.
3. Feature Engineering: Additional features such as market conditions, economic indicators,
and consumer preferences could be included to provide a more comprehensive analysis of
car prices.
4. Longitudinal Data: Incorporating time-series data on car sales and prices can help in
understanding trends over time and predicting future prices more accurately.
5. Advanced Modeling Techniques: Experimenting with more sophisticated machine
learning techniques such as Random Forests, Gradient Boosting Machines, or Neural
Networks could yield better predictive performance and insights.