
Course: 4KG3

Assignment 1 - Linear regression for prediction

1. Using data visualization for initial variable investigation and selection (20%)

a) What are the distributions for the dependent and independent variables?

There are 11 variables: 10 independent variables (Age, Mileage, Fuel Type, Horse Power, Metallic, Automatic, CC, Doors, QuartTax and Weight) and one dependent variable, Price.
By examining the distributions, it is observed that some continuous variables, such as Price, Mileage, and Weight, are left-skewed, while Age exhibits a right-skewed distribution. Most variables, including Price, Age, Mileage, Horse Power, CC, QuartTax and Weight, have outliers; the remaining variables are relatively concentrated.
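As an illustration of how this check could be done outside JMP, here is a minimal Python sketch (the file name UsedCars.csv and the exact column names are assumptions):

import pandas as pd

# Load the used-car data (file name and column names are assumptions).
df = pd.read_csv("UsedCars.csv")

numeric_cols = ["Price", "Age", "Mileage", "Horse Power", "CC", "QuartTax", "Weight"]

for col in numeric_cols:
    s = df[col]
    # Skewness sign: positive = right-skewed, negative = left-skewed.
    skew = s.skew()
    # Count outliers with the usual 1.5 * IQR rule.
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
    print(f"{col}: skewness={skew:.2f}, outliers={outliers}")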

b) How do they relate to each other?

Based on the Scatterplot Matrix analysis and "Fit Y by X" with Price, there is a negative relationship between Price and Age and between Price and Mileage: as the age or mileage of a used car increases, its price tends to decrease, and vice versa. In contrast, there is a positive relationship between Price and Weight, indicating that heavier vehicles are typically more expensive. Horse Power and QuartTax also rise with Price, while Fuel Type, Doors, Metallic, Automatic and CC have less clear effects on Price.
Scatterplot Matrix:

Fit Y by X by price:
c) What appears to be the three or four most important car specifications for predicting the used car price?

The most important specifications affecting the price of a used car are Age, Mileage, and Weight. From the Scatterplot Matrix and Fit Y by X results above, Price has a strong negative correlation with Age and Mileage, meaning that the older the car and the higher the mileage, the lower the price. Conversely, there is a strong positive correlation between Weight and Price, indicating that heavier vehicles tend to be more expensive.
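A quick way to quantify these relationships is to compute the correlation of each numeric variable with Price; a minimal sketch, under the same file and column-name assumptions as above:

import pandas as pd

df = pd.read_csv("UsedCars.csv")  # assumed file name
numeric_cols = ["Age", "Mileage", "Horse Power", "CC", "QuartTax", "Weight", "Price"]

# Correlation of each numeric predictor with Price, from most negative to most positive.
corr_with_price = df[numeric_cols].corr()["Price"].drop("Price")
print(corr_with_price.sort_values())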

2. Data preparation and partition for training, validation, and testing (10%)

a) Why do we need to convert fuel type to dummy variables?

Since the Linear Regression model only deals with numerical variables. Dummy variables transform categorical
(discrete) data into numerical data. Adding dummy variables to the analysis will help to create a better fit of the
model. To ensure that the model runs effectively and predicts accurately, we need to convert the Fuel Type to a
dummy variable.
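A minimal sketch of this encoding step in Python (the file name and the Fuel Type column label are assumptions; JMP performs the equivalent coding internally):

import pandas as pd

df = pd.read_csv("UsedCars.csv")  # assumed file name

# One-hot encode Fuel Type; drop_first keeps one category (e.g. Petrol)
# as the baseline and avoids perfect collinearity.
df = pd.get_dummies(df, columns=["Fuel Type"], drop_first=True)
print(df.filter(like="Fuel Type").head())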

b) Why do we need to do data partition?


In constructing the prediction model, the data were divided into Training, Validation and Test sets. The model is trained on the training set, its performance is evaluated on the validation set, and the test set provides a final check on how the model performs on data it has never seen. Without data partitioning, the model may overfit the noise in the training data, reducing its predictive ability and accuracy. Data partitioning reflects the performance of the model more objectively and realistically, mitigates overfitting and improves the performance of the prediction model.
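A minimal sketch of such a split outside JMP (the 60/20/20 proportions are an assumption; JMP's validation-column feature does the same job):

import pandas as pd

df = pd.read_csv("UsedCars.csv")  # assumed file name

# Shuffle, then split 60% / 20% / 20% into training / validation / test.
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: int(0.6 * n)]
valid = shuffled.iloc[int(0.6 * n): int(0.8 * n)]
test = shuffled.iloc[int(0.8 * n):]
print(len(train), len(valid), len(test))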

3. Run a linear regression with all available variables (30%)

a) What is the mathematical formula of the regression model obtained?
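The fitted coefficients come from the JMP prediction expression in the output, so only the general form can be written here; the Fuel Type dummy levels (CNG and Diesel, with Petrol as the baseline) are an assumption about how the dummies were coded:

\[
\begin{aligned}
\widehat{\text{Price}} ={}& b_0 + b_1\,\text{Age} + b_2\,\text{Mileage} + b_3\,\text{Horse Power} + b_4\,\text{Metallic} + b_5\,\text{Automatic} \\
&+ b_6\,\text{CC} + b_7\,\text{Doors} + b_8\,\text{QuartTax} + b_9\,\text{Weight} + b_{10}\,\text{Fuel Type[CNG]} + b_{11}\,\text{Fuel Type[Diesel]}
\end{aligned}
\]

where the coefficients \(b_0, \dots, b_{11}\) are the values estimated from the training data in the JMP prediction expression.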

b) How do you calculate the predicted price and the prediction error (residual) for each record?

We can use the prediction expression to calculate the predicted price: for each record, plug the values of the independent variables into the prediction expression to obtain the predicted price.

The residual is the difference between the actual value and the predicted value (residual = actual value - predicted value). The adjusted R-square does not give the individual residuals, but it does indicate their overall size: when the adjusted R-square is high, the model fits the data well and the predicted values deviate relatively little from the actual values, and vice versa.
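A minimal sketch of this calculation using scikit-learn instead of JMP (file and column names are assumptions):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("UsedCars.csv")  # assumed file name
df = pd.get_dummies(df, columns=["Fuel Type"], drop_first=True)

X = df.drop(columns=["Price"])
y = df["Price"]

model = LinearRegression().fit(X, y)

# Predicted price and residual (actual - predicted) for each record.
df["Predicted Price"] = model.predict(X)
df["Residual"] = df["Price"] - df["Predicted Price"]
print(df[["Price", "Predicted Price", "Residual"]].head())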

c) Show the error distributions for training, validation and testing. What are the differences between them?

Below are the error distributions for Training, Validation and Testing.

The residual mean of the Training set is close to 0, and the standard deviation is 1332.2942, indicating that the model fits the training data well. The Validation set has a residual mean of 74.991602 and a standard deviation of 1422.1056, slightly higher than the training set but still within acceptable limits. The Test set has a residual mean of -134.0457 and a standard deviation of 1332.3742, similar to the validation set, and the overall performance is stable.

The R-square values are 0.8711 for the Training set, 0.8571 for the Validation set, and 0.8781 for the Test set. These three values are close to each other. Overall, they indicate that the model performs well across the different datasets and there is no obvious overfitting problem; the model's performance is reliable and it predicts price consistently.
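To reproduce summary numbers of this kind outside JMP, one could fit the model on the training rows and then summarize the residuals by partition; a sketch, assuming a Partition column that labels each row:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Assumed file name and an assumed "Partition" column with values
# "Training", "Validation" or "Test" for each row.
df = pd.read_csv("UsedCars.csv")
df = pd.get_dummies(df, columns=["Fuel Type"], drop_first=True)
X = df.drop(columns=["Price", "Partition"])
y = df["Price"]

# Fit on the training rows only, then score every partition.
train_rows = df["Partition"] == "Training"
model = LinearRegression().fit(X[train_rows], y[train_rows])
df["Residual"] = y - model.predict(X)

for part in ["Training", "Validation", "Test"]:
    rows = df["Partition"] == part
    resid = df.loc[rows, "Residual"]
    r2 = r2_score(y[rows], model.predict(X[rows]))
    print(f"{part}: mean={resid.mean():.2f}, sd={resid.std():.2f}, R2={r2:.4f}")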
4. Automated variable selection (30%)

a) What methods have you used for variable selection?

Three methods were used for variable selection: Forward Selection, Backward Elimination and Mixed Stepwise.

Forward Selection: Max Validation R-Square is used as the stopping rule.

Backward Elimination: Max Validation R-Square is used as the stopping rule.

Mixed Stepwise: a p-value threshold is used as the stopping rule.


b) What is the best set of variables you will use? What are the criteria used for selection?

The best set of variables to select is Age, Weight, Mileage, Horse Power, QuartTax and Fuel Type (CNG).

Max Validation R-Square was used as the criterion for selection. In my analysis, the Backward Elimination method has a higher R-Square (0.8661) and a higher Adjusted R-Square (0.8627) than the Forward Selection method (R-Square 0.8653 and Adjusted R-Square 0.8605). In addition, Backward Elimination has lower BIC and AIC values, demonstrating better performance and simplicity. Therefore, Backward Elimination is the preferred method for selecting variables.

The selection procedure follows the Backward Elimination method: it starts with all predictors, removes the least relevant predictor one at a time, and stops when the Validation R-Square no longer improves when additional variables are removed. By doing so, Age, Weight, Mileage, Horse Power, QuartTax and Fuel Type (CNG) are selected as effective predictors in the model, as sketched below.
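The following is a simplified, greedy sketch of the backward-elimination idea with Validation R-Square as the criterion (scikit-learn instead of JMP; the file, column and partition names are assumptions):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("UsedCars.csv")  # assumed file name
df = pd.get_dummies(df, columns=["Fuel Type"], drop_first=True)
train = df[df["Partition"] == "Training"]      # assumed partition column
valid = df[df["Partition"] == "Validation"]

def valid_r2(features):
    """Fit on the training rows, return R-Square on the validation rows."""
    m = LinearRegression().fit(train[features], train["Price"])
    return r2_score(valid["Price"], m.predict(valid[features]))

features = [c for c in df.columns if c not in ("Price", "Partition")]
best = valid_r2(features)

# Repeatedly drop a variable whose removal improves validation R-Square;
# stop when no removal improves it any further.
improved = True
while improved and len(features) > 1:
    improved = False
    for f in features:
        candidate = [c for c in features if c != f]
        score = valid_r2(candidate)
        if score > best:
            best, features, improved = score, candidate, True
            break  # restart the scan with the reduced variable set

print("Selected variables:", features, "Validation R-Square:", round(best, 4))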

5. Possible real-world applications (10%)

a) What is the potential business value for used car price prediction?

A used car price prediction model can greatly benefit dealerships and buyers. For dealerships, it helps in pricing cars
competitively, ensuring a fair profit margin. Buyers can make informed decisions, avoiding overpaying for a vehicle.
The model can also analyze trends, helping sellers understand market demands. Additionally, it streamlines the
buying and selling process, reducing the time spent on negotiations. By predicting accurate prices, it promotes
transparency and trust between buyers and sellers. Overall, this model enhances decision-making, improves
customer satisfaction, and boosts sales in the used car market.

b) Search the web and find out potential utilities of linear regression in other domains such as finance, healthcare
etc.

Linear regression has potential utility in finance and healthcare. Take finance as an example: linear regression is primarily used to identify relationships between different financial variables, allowing analysts to predict future values such as stock prices, forecast company performance, and assess the impact of various factors on investment returns. It does this by analyzing historical data and fitting a "best fit" line that shows how the variables move in relation to stock price or company value. For example, linear regression can be used to calculate a stock's "beta", which measures its volatility relative to the overall market.
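A minimal sketch of the beta calculation, using made-up placeholder return numbers purely to show the computation:

import numpy as np

# Placeholder weekly returns for a stock and the market index (assumed data).
stock_returns = np.array([0.012, -0.004, 0.021, 0.006, -0.015, 0.009])
market_returns = np.array([0.010, -0.002, 0.015, 0.004, -0.011, 0.007])

# Beta is the slope of the regression of stock returns on market returns,
# i.e. Cov(stock, market) / Var(market).
beta, alpha = np.polyfit(market_returns, stock_returns, deg=1)
print(f"beta = {beta:.2f}, alpha = {alpha:.4f}")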

6. Report how much time you have spent on this assignment, what problems you have faced and what you have learned.

This assignment took me about 10 hours to finish. It was the first time I used JMP, so it took some time to understand its functions. Through the professor's demonstrations and videos in class, I gradually became familiar with the JMP software and learned how to use it to run basic linear regression. By doing the assignment, I gained a more comprehensive understanding of data distributions, data partitioning and their importance in data analysis. I also learned how to use three different stepwise methods to select the most meaningful variables, and which criteria can be used to judge their performance so that we can build the most effective prediction model. One problem I met was understanding the partitioning and how it affects outcomes. At first, I did not understand why each partition generates a different prediction expression. Later, I learned that when we partition the data, the data set is divided into three subsets: training, validation and test. Each time we repartition, different subsets are created, so the prediction expression differs as well. I also learned that linear regression is a useful data analysis tool for building predictive models that can be widely applied in different sectors, such as finance, healthcare and marketing, to make informed decisions.
