Report Revathy
LINEAR REGRESSION PROJECT
REVATHY PRABHAKARAN
Student No: 8903669
1. Abstract
2. Gathering Data
o Dataset Selection
o Dataset Description
3. Initial Modeling
o Variables Fixed
o Results of Missing Values
o Rows Removed
o Duplicates Found and Removed
o Structure of the Cleaned Dataset
4. Diagnostics
o Diagnostic Methods Employed
o Assumption Checks and Results
5. Model Selection
o List of Evaluated Models
6. Model Evaluation
o Evaluation Metrics
o Choosing the Best Model
7. Prediction
o Model Performance on Existing Data
8. Conclusion
o Key Findings
o Suggestions for Future Work
9. Appendix (if applicable)
o Dataset Link
o R program codes
o Diagnostic Plots
Abstract
This report explores using linear regression to predict wine quality. The analysis utilizes a
dataset containing measurements of various wine properties, like fixed acidity, volatile
acidity, and alcohol content. The goal is to develop a model predicting wine quality based
on these properties.
Following data exploration, an initial linear regression model was fitted. Diagnostic tests
assessed the validity of the model's assumptions, and performance was evaluated using
R-squared and root mean squared error (RMSE). Logarithmic transformations and
interaction terms were then applied to improve the model.
Four models were evaluated on their performance and adherence to linear regression
assumptions: Multiple Linear Regression, Logarithmic Transformations, Interaction Terms,
and Forward Selection. Model 3 (Interaction Terms) was chosen as the best-performing
model because of its superior predictive accuracy and model fit. Predictions were made
using Model 3, and the results were analyzed to identify the most influential predictors of
wine quality.
This analysis offers insights into the factors impacting wine quality and demonstrates the
application of linear regression for creating predictive models. Further research and
refinement of the model could enhance predictive accuracy and understanding of factors
influencing wine quality.
Gathering Data
Dataset Selection:
For this project, the Wine Quality dataset from Kaggle was selected (Raj Parmar, 2016). It
records measured wine properties: fixed acidity, volatile acidity, citric acid, residual sugar,
chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol content,
together with wine quality scores on a scale from 0 to 10.
Dataset Description:
The dataset comprises 6497 observations and 12 variables. The response variable is wine
quality, and the predictors include fixed acidity, volatile acidity, citric acid, residual sugar,
chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
content. The objective is to predict wine quality based on these properties.
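Before modeling, a dataset of this kind is typically checked for its dimensions, missing values, and duplicated rows. The following is a minimal sketch of those checks in R. It uses a small hypothetical data frame (`wine_toy`) rather than the Kaggle file, and the column names are assumptions modeled on the predictors listed above.

```r
# Hypothetical toy data standing in for the wine data; values and column
# names are illustrative assumptions, not taken from the Kaggle dataset.
wine_toy <- data.frame(
  fixed_acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, NA),
  volatile_acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

dim(wine_toy)                                    # number of rows and columns
missing_per_column <- colSums(is.na(wine_toy))   # missing values per column
n_duplicates <- sum(duplicated(wine_toy))        # fully duplicated rows

# Drop rows with missing values first, then drop duplicated rows
wine_clean <- unique(na.omit(wine_toy))
str(wine_clean)                                  # structure of the cleaned data
```

On the real data, the same calls would be applied to the data frame read from the downloaded CSV file.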
Initial Modeling
Diagnostics
Model Selection
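The abstract names forward selection as one of the four candidate models. As a rough illustration of the technique only, the sketch below runs AIC-based forward selection with base R's `step()`. It uses the built-in `mtcars` data as a stand-in, so the response (`mpg`) and the selected predictors are assumptions for demonstration and say nothing about the wine data or the report's own Model 4.

```r
# Stand-in data: mtcars ships with base R and is NOT the wine dataset.
null_model <- lm(mpg ~ 1, data = mtcars)   # intercept-only starting point
full_model <- lm(mpg ~ ., data = mtcars)   # upper bound: all predictors

# Forward selection: starting from the null model, repeatedly add the
# predictor that lowers AIC the most, and stop when no addition helps.
forward_model <- step(
  null_model,
  scope = list(lower = formula(null_model), upper = formula(full_model)),
  direction = "forward",
  trace = 0                                # suppress step-by-step output
)

formula(forward_model)                     # predictors that were retained
```

For the wine data, the same pattern would use `quality` as the response and the cleaned data frame in place of `mtcars`.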
Model Evaluation
Each model's performance was evaluated using R-squared, adjusted R-squared, and root
mean squared error (RMSE) to determine its goodness of fit and predictive accuracy. The
results were used to assess how effectively each model captures the variability in wine
quality.
After careful consideration, Model 3, which incorporates interaction terms between
predictors, was selected as the optimal model because of its higher R-squared, lower
RMSE, and adherence to linear regression assumptions.
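To make the comparison concrete, the sketch below computes R-squared, adjusted R-squared, and RMSE for an additive model and for a model with an interaction term. It is an illustration under stated assumptions: the data are the built-in `mtcars` set, not the wine data, and `fit_metrics` is a hypothetical helper name.

```r
# Stand-in data: mtcars (bundled with R), not the wine dataset.
additive_model    <- lm(mpg ~ wt + hp, data = mtcars)
interaction_model <- lm(mpg ~ wt * hp, data = mtcars)  # adds the wt:hp term

# Hypothetical helper: collect in-sample fit metrics for one model
fit_metrics <- function(model) {
  s <- summary(model)
  c(r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    rmse          = sqrt(mean(residuals(model)^2)))
}

# One row per model, one column per metric
comparison <- rbind(additive    = fit_metrics(additive_model),
                    interaction = fit_metrics(interaction_model))
print(round(comparison, 3))
```

Because the additive model is nested in the interaction model, in-sample R-squared cannot decrease and RMSE cannot increase when the interaction term is added. Adjusted R-squared, which penalizes extra terms, is therefore the more informative of the three when comparing models of different size.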
Prediction
The assessment of the candidate models on the wine data indicates that Model 3
outperforms the others. Model 3's predictions on the existing wine dataset range from
approximately 4.32 to 7.01, each value being the predicted quality score for one
observation. This variation in predicted scores suggests that Model 3 captures differences
in quality between wine samples. Together with the evaluation metrics, it supports the
inference that Model 3 predicts the quality scores of wine samples well from the dataset's
features.
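Predictions of this kind, and their range, can be obtained with `predict()` and `range()`. The sketch below is a minimal example under stated assumptions: it fits a stand-in interaction model on the built-in `mtcars` data, so neither the model nor the numbers correspond to the report's Model 3.

```r
# Stand-in data and model: mtcars, not the wine dataset or the report's Model 3.
interaction_model <- lm(mpg ~ wt * hp, data = mtcars)

# Predictions on the same data that was used to fit the model
predicted <- predict(interaction_model, newdata = mtcars)
range(predicted)   # smallest and largest predicted value

# In-sample RMSE: typical distance between predicted and observed values
rmse <- sqrt(mean((mtcars$mpg - predicted)^2))
rmse
```

The range describes how widely the predictions spread, while the RMSE describes how close they are to the observed values, so the two are usually reported together.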
Conclusion
In conclusion, this study investigates the use of linear regression for predicting wine quality
from measured wine properties. Among the models developed, Model 3, which incorporates
interaction terms, emerges as the most accurate predictor. Its effectiveness in capturing
variation in wine quality is evident from its R-squared and RMSE. This analysis
demonstrates the usefulness of linear regression for predicting wine quality and points to
further model refinement as a way to improve predictive accuracy.
Appendix
# Diagnostics
# Residuals vs. Fitted Values Plot
plot(model, which = 1)
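# Model Selection
library(stargazer) # Provides stargazer() for the model summaries below
# 2. Logarithmic Transformation
wine_data_log <- wine_data
# log(x + 1) keeps the transform defined when volatile_acidity is 0
wine_data_log$volatile_acidity <- log(wine_data_log$volatile_acidity + 1)
model2 <- lm(quality ~ ., data = wine_data_log)
stargazer(model2, type = "text") # Print model summary using stargazer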
# 3. Interaction Terms
wine_data_interaction <- wine_data
# Product of fixed and volatile acidity, stored as an explicit column
wine_data_interaction$interaction_term <-
  wine_data_interaction$fixed_acidity * wine_data_interaction$volatile_acidity
# "." already expands to every column except quality, including interaction_term
model3 <- lm(quality ~ ., data = wine_data_interaction)
stargazer(model3, type = "text") # Print model summary using stargazer
# Diagnostics
# Residuals vs. Fitted Values Plot
plot(model3, which = 1)
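# Model Evaluation
# Function to calculate and return evaluation metrics
evaluate_model <- function(model) {
  # Calculate desired metrics (e.g., R-squared, RMSE, AIC)
  r_squared <- summary(model)$r.squared
  rmse <- sqrt(mean((model$residuals)^2))
  aic <- AIC(model)
  # Assumed completion: the original listing is cut off before the return value
  c(r_squared = r_squared, rmse = rmse, aic = aic)
}
# Assumed completion: evaluate the models fitted in this listing
model_evaluations <- sapply(list(model2 = model2, model3 = model3), evaluate_model)
print(model_evaluations)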
3) Diagnostic Plots
Fig. 1 to Fig. 12: diagnostic plots (figures not reproduced)