
WINE QUALITY PREDICTION - LINEAR REGRESSION PROJECT

Submitted By: Revathy Prabhakaran
Student No: 8903669
Course Name: Multivariate Statistics-8031
Submitted To: Bip. Thapa
College: Conestoga College
Table of Contents

1. Abstract
2. Gathering Data
o Dataset Selection
o Dataset Description
3. Initial Modeling
o Variables Fixed
o Results of Missing Values
o Rows Removed
o Duplicates Found and Removed
o Structure of the Cleaned Dataset
4. Diagnostics
o Diagnostic Methods Employed
o Assumption Checks and Results
5. Model Selection
o List of Evaluated Models
6. Model Evaluation
o Evaluation Metrics
o Choosing the Best Model
7. Prediction
o Model Performance on Existing Data
8. Conclusion
o Key Findings
o Suggestions for Future Work
9. Appendix
o Dataset Link
o R program codes
o Diagnostic Plots
Abstract
This report explores using linear regression to predict wine quality. The analysis utilizes a
dataset containing measurements of various wine properties, like fixed acidity, volatile
acidity, and alcohol content. The goal is to develop a model predicting wine quality based
on these properties.
Following data exploration, an initial linear regression model was created. Tests were
conducted to assess the validity of the model's assumptions, and its performance was
evaluated using metrics like R-squared and root mean squared error (RMSE). Techniques
like logarithmic transformations and interaction terms were then applied to improve the
model.
Four models were evaluated based on their performance and adherence to linear
regression assumptions. The models are Multiple Linear Regression, Logarithmic
Transformations, Interaction Terms, and Forward Selection. Model 3 (Interaction Terms),
which incorporates interaction terms between predictors, was chosen as the best
performing model due to its superior predictive accuracy and model fit. Predictions were
made using Model 3, and the results were analyzed to identify the most influential
predictors of wine quality.
This analysis offers insights into the factors impacting wine quality and demonstrates the
application of linear regression for creating predictive models. Further research and
refinement of the model could enhance predictive accuracy and understanding of factors
influencing wine quality.

Gathering Data

Dataset Selection:
For this project, the Wine Quality dataset sourced from Kaggle was selected. It contains
various wine properties, including fixed acidity, volatile acidity, citric acid, residual sugar,
chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol content,
and wine quality scores ranging from 0 to 10. (Raj Parmar, 2016)
Dataset Description:
The dataset comprises 6497 observations and 12 variables. The response variable is wine
quality, and the predictors include fixed acidity, volatile acidity, citric acid, residual sugar,
chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
content. The objective is to predict wine quality based on these properties.

Initial Modelling

Variables Fixed: The dataset initially contained 13 variables: "type", "fixed_acidity",
"volatile_acidity", "citric_acid", "residual_sugar", "chlorides", "free_sulfur_dioxide",
"total_sulfur_dioxide", "density", "pH", "sulphates", "alcohol", and "quality". The column
"type" was removed because it is a categorical variable outside the scope of this study.
Results of Missing Values: The missing_values output reports the count of missing entries
for each variable in the dataset, identifying which variables required handling.
Rows Removed: The number of rows removed can be inferred by comparing the dataset's row
count before and after cleaning: 6497 observations before versus 6463 after, so 34 rows
in total were dropped by the missing-value and duplicate filters combined.
Duplicates Found and Removed: The duplicated_rows output lists any duplicated rows
identified in the dataset; duplicates were then eliminated using the
!duplicated(wine_data) condition. The exact number of duplicates detected and removed is
not reported separately.
Structure of the Cleaned Dataset: The cleaned dataset consists of 6463 observations and
12 variables, encompassing both numerical and character types. Each row represents a
wine sample, with variables reflecting attributes such as acidity, residual sugar, and
alcohol content. Missing values and duplicate rows were addressed during the data
cleaning process.

Diagnostics

Diagnostic Methods Employed:


The analysis employed two main diagnostic methods: the Residuals vs Fitted Values plot and
the Normal Q-Q plot. The Residuals vs Fitted Values plot assessed linearity and
homoscedasticity, while the Normal Q-Q plot evaluated the normality of the residuals.
Residuals were also plotted against each predictor individually.

Assumption Checks and Results:


Upon applying these diagnostic methods, the following results were obtained:
1. Residuals vs Fixed Acidity: No significant violations were detected, and a reasonable
fit was observed. (Fig. 1)
2. Normal Q-Q Plot: A slight deviation from the normal distribution was noted. (Fig. 2)
3. Residuals vs Volatile Acidity: A tendency to underpredict quality at high volatile
acidity was observed. (Fig. 3)
4. Residuals vs Citric Acid: No major violations were found, and a reasonable fit was
observed. (Fig. 4)
5. Residuals vs Residual Sugar: Only a minimal relationship with the residuals was
observed. (Fig. 5)
6. Residuals vs Chlorides: No substantial violations were detected, and a reasonable fit
was observed. (Fig. 6)
7. Residuals vs Free Sulfur Dioxide: No significant violations were detected, and a
reasonable fit was observed. (Fig. 7)
8. Residuals vs Total Sulfur Dioxide: Possible issues with the model fit were identified.
(Fig. 8)
9. Residuals vs Density: No significant violations were observed, and a reasonable fit was
noted. (Fig. 9)
10. Residuals vs pH: No major violations were detected, and a reasonable fit was observed.
(Fig. 10)
11. Residuals vs Sulphates: Residuals tend to increase with increasing sulphates, although
some data points fall outside this trend. (Fig. 11)
12. Residuals vs Alcohol: No significant violations were found, and a reasonable fit was
observed. (Fig. 12)
To improve the model's performance, potential non-linear relationships and interactions
between predictors were explored. Four models were evaluated based on their
performance.

Model Selection

1. Model 1: Multiple Linear Regression - includes all predictors.
2. Model 2: Logarithmic Transformation - applies a log transformation (log(x + 1)) to
volatile acidity.
3. Model 3: Interaction Terms - adds an interaction term between fixed acidity and
volatile acidity.
4. Model 4: Forward Selection - uses forward stepwise selection to choose the best subset
of predictors.
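Written out, Model 3 takes the following form (a sketch based on the interaction term
constructed in the appendix code, which multiplies fixed acidity by volatile acidity while
the remaining predictors enter linearly):

```latex
\text{quality}_i = \beta_0 + \sum_{j=1}^{11} \beta_j x_{ij}
  + \beta_{12}\,\bigl(\text{fixed\_acidity}_i \times \text{volatile\_acidity}_i\bigr)
  + \varepsilon_i
```

The interaction coefficient $\beta_{12}$ lets the effect of fixed acidity on predicted
quality vary with the level of volatile acidity, which a purely additive model cannot
capture.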

Model Evaluation

The models' performances were evaluated using metrics such as R-squared, adjusted
R-squared, and root mean squared error (RMSE) to determine their goodness of fit and
predictive accuracy. The results were used to assess each model's effectiveness in
capturing the variability in wine quality.
After careful consideration, Model 3, incorporating interaction terms between predictors,
was selected as the optimal model due to its superior performance in terms of higher R-
squared, lower RMSE, and adherence to linear regression assumptions.
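For reference, the metrics used above are defined as follows, where $y_i$ are the observed
quality scores, $\hat{y}_i$ the model's fitted values, $\bar{y}$ their mean, $n$ the number
of observations, and $p$ the number of predictors:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},
\qquad
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}
```

A higher R-squared and a lower RMSE indicate a better fit, which is the basis on which
Model 3 was preferred.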

(R output for model evaluation)

Prediction

The assessment of the different models on the wine data indicates that Model 3 outperforms
the other models analyzed. Model 3's predictions on the existing wine dataset span
approximately 4.32 to 7.01, representing the predicted quality score for each observation.
The variation in predicted quality scores suggests that Model 3 effectively captures the
intricacies of wine sample quality. Thus, it can be inferred that Model 3 excels at
predicting the quality scores of wine samples from the dataset's features.
Conclusion

In conclusion, this study investigates the use of linear regression for predicting wine quality
using diverse properties. Among the models developed, Model 3, which incorporates
interaction terms, emerges as the most precise predictor. Its effectiveness in capturing the
subtleties of wine quality is evident from metrics such as R-squared and RMSE. This
analysis emphasizes the significance of linear regression in predicting wine quality and
proposes potential avenues for refining models to improve predictive precision.

Suggestions for Future Work:

1. Incorporating additional features beyond the current set of wine properties, for
example the variable "type", which was ignored in this study because it is categorical.
2. Applying more complex machine learning / deep learning algorithms such as Random
Forests, Decision Trees, Gradient Boosting, and Neural Networks, which may capture
non-linear relationships more effectively.
3. Validating the model's performance on another dataset to evaluate its generalizability.

Appendix

1. Raj Parmar. (2016). Wine Quality Dataset. Kaggle. Retrieved from
https://www.kaggle.com/datasets/rajyellow46/wine-quality
2. Program used for analysis using R:
# Load required libraries (they are already installed)
library(dplyr)
library(ggplot2)
library(tidyr)
library(MASS)
library(stargazer)
library(corrplot)

# Load dataset (assuming the file path is correct)
wine_data <-
  read.csv("C:\\Users\\revak\\OneDrive\\Desktop\\CaseStudy2\\winequalityN.csv")

# Fix variable names
names(wine_data) <- c("type", "fixed_acidity", "volatile_acidity", "citric_acid",
                      "residual_sugar", "chlorides",
                      "free_sulfur_dioxide", "total_sulfur_dioxide", "density", "pH",
                      "sulphates", "alcohol", "quality")

# Check for missing values
missing_values <- colSums(is.na(wine_data))
print(missing_values)

# Remove rows with missing values
wine_data <- wine_data[complete.cases(wine_data), ]

# Check for duplicated rows
duplicated_rows <- wine_data[duplicated(wine_data), ]
print(duplicated_rows)

# Remove duplicated rows
wine_data <- wine_data[!duplicated(wine_data), ]

# Check the structure of the cleaned dataset
str(wine_data)

# Diagnostics
# Fit the baseline model before plotting (the original listing plotted an
# undefined "model" object; "type" is excluded from the formula here, and the
# column itself is dropped below)
model <- lm(quality ~ . - type, data = wine_data)

# Residuals vs. Fitted Values Plot
plot(model, which = 1)

# Normal Q-Q Plot
qqnorm(model$residuals)
qqline(model$residuals)

# Create a dataframe of predictor variables
predictors <- wine_data[, -which(names(wine_data) == "quality")]

# Model Selection

# Remove type column
wine_data <- wine_data[, -which(names(wine_data) == "type")]

# 1. Multiple Linear Regression (Baseline)
model1 <- lm(quality ~ ., data = wine_data)
stargazer(model1, type = "text") # Print model summary using stargazer

# 2. Logarithmic Transformation
wine_data_log <- wine_data
wine_data_log$volatile_acidity <- log(wine_data_log$volatile_acidity + 1)
model2 <- lm(quality ~ ., data = wine_data_log)
stargazer(model2, type = "text") # Print model summary using stargazer

# 3. Interaction Terms
wine_data_interaction <- wine_data
wine_data_interaction$interaction_term <- wine_data_interaction$fixed_acidity *
wine_data_interaction$volatile_acidity
model3 <- lm(quality ~ . + interaction_term, data = wine_data_interaction)
stargazer(model3, type = "text") # Print model summary using stargazer

# Diagnostics for model3
# Residuals vs. Fitted Values Plot
plot(model3, which = 1)

# Normal Q-Q Plot
qqnorm(model3$residuals)
qqline(model3$residuals)

# Create a dataframe of predictor variables
predictors <- wine_data[, -which(names(wine_data) == "quality")]

# Residuals vs. Predictor Variables Plots
for (predictor in colnames(predictors)) {
  plot(predictors[[predictor]], model3$residuals, xlab = predictor,
       ylab = "Residuals", main = paste("Residuals vs.", predictor))
}

# Calculate VIF for each predictor variable (requires the car package)
vif <- car::vif(model3)
print(vif)

# 4. Forward selection using MASS::stepAIC (start from the intercept-only
# model so that terms can actually be added; starting from the full model,
# direction = "forward" would leave nothing to add)
null_model <- lm(quality ~ 1, data = wine_data)
full_scope <- formula(terms(quality ~ ., data = wine_data)) # expand "." into all predictors
forward_model <- stepAIC(null_model, scope = full_scope, direction = "forward")

# Extract the final model formula
best_model_formula <- formula(forward_model)

# Create the final model using lm
model4 <- lm(best_model_formula, data = wine_data)

# Summary and further analysis
stargazer(model4, type = "text") # Print model summary using stargazer

# Model Evaluation
# Function to calculate and store evaluation metrics
evaluate_model <- function(model) {
  # Calculate desired metrics (R-squared, RMSE, AIC)
  r_squared <- summary(model)$r.squared
  rmse <- sqrt(mean((model$residuals)^2))
  aic <- AIC(model)

  # Return a data frame with metrics, labelled with the model's name
  return(data.frame(model = deparse(substitute(model)), r_squared, rmse, aic))
}

# Evaluate each model
model_evaluations <- rbind(
  evaluate_model(model1),
  evaluate_model(model2),
  evaluate_model(model3),
  evaluate_model(model4)
)

print(model_evaluations)

# model3 has the best scores

# Make predictions on existing data using the best model
# (predict() takes newdata =, not data =)
predictions_existing_data <- predict(model3, newdata = wine_data_interaction)

# Print the predictions
print(predictions_existing_data)

3) Diagnostic Plots

Figs. 1-12: diagnostic plots (Residuals vs Fitted Values, Normal Q-Q, and residuals
versus each predictor), as referenced in the Diagnostics section.
