Wine Quality Analysis

The document outlines a mini project focused on analyzing wine quality using machine learning techniques on a dataset of Portuguese 'Vinho Verde' wines. It includes sections on data preparation, exploratory data analysis, various regression methods employed, results evaluation, and conclusions drawn from the findings. Key insights reveal the importance of physicochemical properties, particularly alcohol content, in predicting wine quality, with models like Random Forest and SVM showing the best performance.


Mini Project

UNC514: Data Science Fundamentals

Wine Quality Analysis

Submitted by: Kshitiz Arora (102215226)

Submitted to: Department of ECE

THAPAR INSTITUTE OF ENGINEERING & TECHNOLOGY, PATIALA, PUNJAB

November 2024
Table of Contents

Sr. No   Content                                                                             Page limit

1        Introduction (objectives and problem statement)                                     Max 2 pages
2        Data Preparation (detailed introduction of the dataset used)                        Max 2 pages
3        Exploratory Data Analysis (the various EDA operations you have applied)             Max 4 pages
4        Methods Used (define and discuss all classification/regression models you have used) Max 5 pages
5        Results and Evaluation (include all plots and graphs with proper numbering and names) At least 5 pages
6        Conclusions                                                                         1 page
7        References (websites, books, research articles)                                     1 page


1. Introduction

1.1 Overview

Wine is more than just a drink; for thousands of years, humanity has been captivated by the
complexities of this timeless elixir. But what truly makes a wine exceptional? Is it the
sharpness of its acidity, the warmth of its alcohol, or the perfect harmony of sweetness and
bitterness?

Our goal is not to replace the human touch but to complement it—to provide winemakers,
enthusiasts, and consumers alike with a powerful tool that demystifies the science of wine.
By analyzing the physicochemical properties of wine, such as pH, alcohol content, residual
sugar, and acidity, we aim to uncover the secrets hidden in each drop, revealing the factors
that set apart an ordinary wine from an award-winning masterpiece.

1.2 Problem Definition and Scope

The process of evaluating wine quality has long been shrouded in subjectivity, relying on the
discerning taste of experts who may differ in their judgments. This subjectivity creates a
challenge for winemakers striving to consistently craft high-quality products. How can the
industry ensure a fair, consistent, and scalable method of assessing wine quality?

With this project, we tackle the pressing problem of developing a data-driven approach to
predict wine quality based on measurable attributes. The dataset at our disposal, rich with
information about red wine, offers the perfect canvas for this endeavor.

Through this journey, we embrace the challenge of translating the mystique of wine into
code, algorithms, and visualizations, all while preserving its soul. Because at the end of the
day, a good wine doesn’t just taste great—it tells a story. And with this project, we’re here to
write the next chapter.
2. Data Preparation
2.1 Dataset Used

The dataset utilized in this project is derived from the physicochemical and sensory
properties of Portuguese "Vinho Verde" wines.

● Name: Wine Quality Dataset


● Origin: Sourced from the UCI Machine Learning Repository and available on Kaggle.
● Owner: The dataset was originally curated by Paulo Cortez and colleagues.
● Type: Structured dataset, primarily numerical in nature.
● Country: Portugal (specifically focused on the "Vinho Verde" wine region).
● Size: 1,599 instances.

Column Names and Descriptions

The dataset contains 11 input variables and 1 output variable:

Input Variables (Physicochemical Properties):

1. Fixed Acidity: Concentration of non-volatile acids (e.g., tartaric acid) that do not
evaporate.
2. Volatile Acidity: Concentration of volatile acids (e.g., acetic acid), which can
contribute to a vinegar-like taste if excessive.
3. Citric Acid: Enhances freshness and adds a citrus-like flavor.
4. Residual Sugar: Sugar left after fermentation; affects sweetness.
5. Chlorides: Salt content in wine; influences taste balance.
6. Free Sulfur Dioxide: Protects wine from oxidation and spoilage.
7. Total Sulfur Dioxide: The total amount of SO₂, including bound forms.
8. Density: Relates to the alcohol and sugar content.
9. pH: Indicates acidity/basicity of wine.
10. Sulphates: Adds to the wine's preservation and taste complexity.
11. Alcohol: Ethanol content in the wine.

Output Variable (Sensory Property):


12. Quality: Rated on a scale of 0 to 10 based on sensory evaluation by wine tasters.
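
The column layout above can be checked when the file is read. The following is a minimal sketch using only the standard library; it assumes the semicolon-separated layout and lowercase header names of the UCI CSV file, and the two rows in `SAMPLE` are illustrative stand-ins rather than the real file. The helper name `load_wine` is hypothetical.

```python
import csv
import io

# Column names in the order listed above (lowercase header form is an assumption).
COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]

# Two illustrative rows in a semicolon-separated layout; not the real dataset file.
SAMPLE = (
    ";".join(COLUMNS) + "\n"
    "7.4;0.70;0.00;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0.00;2.6;0.098;25;67;0.9968;3.20;0.68;9.8;5\n"
)

def load_wine(text, delimiter=";"):
    """Parse CSV text into a list of dicts: float features plus an int quality."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for raw in reader:
        row = {name: float(raw[name]) for name in COLUMNS[:-1]}
        row["quality"] = int(raw["quality"])
        rows.append(row)
    return rows

rows = load_wine(SAMPLE)
print(len(rows), rows[0]["alcohol"], rows[0]["quality"])
```

For a comma-separated copy of the data, passing `delimiter=","` would be the only change needed.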
3. Exploratory Data Analysis
3.1 Getting Insights About the Dataset

● Dataset Shape and Size: The red wine dataset contains 1,599 samples, while the
white wine dataset includes 4,898 samples, with 12 columns each.
● Data Types: All features are numerical and suitable for direct application of machine
learning models without additional conversions.
● Summary Statistics: Used descriptive statistics to examine the distribution of each
feature, including mean, median, minimum, and maximum values, along with
standard deviation.
● Class Distribution: Analyzed the distribution of the quality variable to observe
imbalances. It was found that most wines fall into the average quality range (scores of
5–6), while high-quality and poor-quality wines are less frequent.
● Correlation Analysis: Computed the Pearson correlation coefficients between
features to identify relationships.
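
The three checks listed above (summary statistics, class distribution, Pearson correlation) can be sketched in a few lines. This is a dependency-free illustration, not the project's actual code; the `alcohol` and `quality` values are made-up examples, and the function names are hypothetical.

```python
import statistics
from collections import Counter

# Illustrative values only; not taken from the real dataset.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0, 12.8]
quality = [5, 5, 6, 6, 7, 7]

def summarize(values):
    """Return the descriptive statistics named in the text for one feature."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

stats = summarize(alcohol)
counts = Counter(quality)        # class distribution of the quality scores
r = pearson(alcohol, quality)    # linear association between alcohol and quality
print(stats, dict(counts), round(r, 3))
```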

3.2 Data Encoding (If Applicable)

● No categorical data was present in the dataset, so no encoding was required. The data
was inherently numerical, simplifying preprocessing.

3.3 Data Visualization

● Distribution Plots:
○ Examined the distribution of numerical variables such as alcohol, pH and
citric acid.
○ Observed that variables like alcohol exhibit a right-skewed distribution, while
others like density are nearly symmetric.
● Box Plots:
○ Explored variations in wine quality against features like volatile acidity and
residual sugar.
● Correlation Heatmap:
○ Generated a heatmap to visualize the correlation matrix, identifying features
positively correlated with quality, such as alcohol, and features negatively
correlated with it, such as volatile acidity.
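
The skew seen in a distribution plot can also be expressed as a number. The sketch below computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second); positive values indicate a right-skewed feature. The sample values are illustrative only.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Illustrative samples: a long right tail versus a symmetric spread.
right_tailed = [9.0, 9.2, 9.4, 9.5, 9.8, 13.5]
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]

print(skewness(right_tailed), skewness(symmetric))
```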
4. Methods Used
Regression Techniques:
4.1 Linear Regression:
● The simplest regression model, used to predict the wine quality score as a
linear combination of features. While interpretable, it struggled with
non-linear relationships.
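
As a small illustration of the idea, the one-feature case of ordinary least squares has a closed form. The sketch below fits quality against a single feature; the data are made up, and the project itself used all eleven features.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Illustrative data: alcohol content versus quality score.
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [5, 5, 6, 7]

slope, intercept = fit_line(alcohol, quality)
predicted = slope * 11.5 + intercept  # predicted quality for a wine at 11.5% alcohol
print(slope, intercept, predicted)
```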

4.2 Ridge Regression:


● A variant of linear regression, it included a penalty term to control overfitting
and handle multicollinearity among features. The regularization strength
(alpha) was tuned for optimal results.
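
The effect of the penalty can be seen in the one-feature case, where the ridge slope has a closed form. This sketch assumes the objective sum((y - b - w*x)^2) + alpha * w^2 with an unpenalized intercept; the data and the alpha values are illustrative, not the tuned values from the project.

```python
def fit_ridge_line(x, y, alpha):
    """One-feature ridge regression: the slope is shrunk toward zero as alpha grows."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / (sxx + alpha)   # alpha = 0 recovers ordinary least squares
    intercept = my - slope * mx   # the intercept is not penalized
    return slope, intercept

# Illustrative data: alcohol content versus quality score.
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [5, 5, 6, 7]

for alpha in (0.0, 5.0, 50.0):
    print(alpha, fit_ridge_line(alpha=alpha, x=alcohol, y=quality))
```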

4.3 Support Vector Regression:


● SVR extends SVM to regression by fitting a function that tolerates errors
within a margin (epsilon) and penalizes only larger deviations. Non-linear
kernels like RBF were applied to model complex patterns.
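
Two building blocks of this method are easy to write down: the epsilon-insensitive loss, which ignores errors inside the margin, and the RBF kernel, which measures similarity between two samples. The sketch below shows both; the `epsilon` and `gamma` defaults are arbitrary illustrative choices, not the values used in the project.

```python
import math

def epsilon_insensitive(y_true, y_pred, epsilon=0.1):
    """Loss is zero inside the tolerance band and grows linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-gamma * squared_distance)

inside = epsilon_insensitive(6.0, 6.05)    # error smaller than epsilon
outside = epsilon_insensitive(6.0, 7.0)    # error larger than epsilon
same = rbf_kernel([9.4, 0.7], [9.4, 0.7])  # identical samples
far = rbf_kernel([0.0, 0.0], [1.0, 1.0])   # samples one unit apart on each axis
print(inside, outside, same, far)
```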

4.4 Decision Tree Regression


● Modeled the wine quality score using a decision tree, which segmented the
data space based on physicochemical features. Pruning techniques were used
to prevent overfitting.
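
The segmentation step can be illustrated with a single split: a regression tree picks the threshold that minimizes the summed squared error of the two resulting groups. The sketch below searches one feature only, on made-up data; depth limits and pruning are not shown.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) of the best single split on one feature."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Illustrative data: low-alcohol wines rated 5, high-alcohol wines rated 7.
alcohol = [9.0, 9.5, 10.0, 12.0, 12.5, 13.0]
quality = [5, 5, 5, 7, 7, 7]

print(best_split(alcohol, quality))
```

A full tree would apply the same search recursively to each side and across all features.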

4.5 Lasso Regression:


● Similar to ridge regression but employed L1 regularization, driving
coefficients of less important features to zero. This helped in feature selection
and simplified the model.
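
The way L1 regularization drives a coefficient exactly to zero can be seen in the one-feature case, where the solution is a soft-thresholding step. This sketch assumes the objective (1 / (2n)) * sum((y - b - w*x)^2) + alpha * |w|, which is the form scikit-learn's `Lasso` documents; the data and alpha values are illustrative.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def fit_lasso_line(x, y, alpha):
    """One-feature lasso slope via soft-thresholding on centered data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    rho = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    z = sum((a - mx) ** 2 for a in x) / n
    slope = soft_threshold(rho, alpha) / z
    intercept = my - slope * mx
    return slope, intercept

# Illustrative data: alcohol content versus quality score.
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [5, 5, 6, 7]

for alpha in (0.0, 0.25, 1.0):
    print(alpha, fit_lasso_line(alcohol, quality, alpha))
```

With a large enough alpha the slope is exactly zero, which is the feature-selection behavior described above.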

4.6 Regression Analysis Using Artificial Neural Networks:


● A neural network regression model was trained to predict continuous quality
scores. The architecture consisted of multiple hidden layers, and the model
was trained using backpropagation. Dropout layers and batch normalization
were used to prevent overfitting and accelerate convergence.
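
To make the training loop concrete, the sketch below trains a very small network (one input, one tanh hidden layer, one output) with hand-written backpropagation and full-batch gradient descent. It is a toy under stated assumptions: the layer size, learning rate, and data are arbitrary, and dropout and batch normalization are left out for brevity, so it does not reproduce the architecture used in the project.

```python
import math
import random

def train_mlp(xs, ys, hidden=4, lr=0.02, epochs=300, seed=0):
    """Train a 1-input, 1-output tanh network; returns the loss per epoch."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # input -> hidden weights
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0
    n = len(xs)
    losses = []
    for _ in range(epochs):
        gw1, gb1, gw2, gb2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden, 0.0
        loss = 0.0
        for x, y in zip(xs, ys):
            # Forward pass.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            pred = sum(w2[j] * h[j] for j in range(hidden)) + b2
            err = pred - y
            loss += err * err / n  # mean squared error
            # Backward pass: chain rule through the output and hidden layers.
            d_pred = 2.0 * err / n
            gb2 += d_pred
            for j in range(hidden):
                gw2[j] += d_pred * h[j]
                d_pre = d_pred * w2[j] * (1.0 - h[j] ** 2)  # tanh derivative
                gw1[j] += d_pre * x
                gb1[j] += d_pre
        # Gradient descent update.
        for j in range(hidden):
            w1[j] -= lr * gw1[j]
            b1[j] -= lr * gb1[j]
            w2[j] -= lr * gw2[j]
        b2 -= lr * gb2
        losses.append(loss)
    return losses

# Illustrative standardized feature and centered target values.
xs = [-1.5, -0.5, 0.5, 1.5]
ys = [-0.75, -0.75, 0.25, 1.25]

losses = train_mlp(xs, ys)
print(losses[0], losses[-1])
```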

4.7 Naïve Bayes Regression:

● Naïve Bayes Regression predicts wine quality by assuming Gaussian
distributions for features. It is simple and efficient but limited by its
independence assumption and struggles with complex relationships, making it
a baseline for comparison.
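
The Gaussian and independence assumptions can be illustrated with a small Gaussian Naïve Bayes model. The sketch below treats each quality score as a class and predicts the most probable score; this is one possible reading of the method, since the source does not say how continuous predictions were produced. The data and function names are illustrative.

```python
import math
from collections import defaultdict

def fit_gnb(X, y, var_floor=1e-9):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor  # floor avoids division by zero
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class with the highest log posterior under the independence assumption."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, s2 in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Illustrative rows: [alcohol, volatile acidity] with quality labels.
X = [[9.2, 0.70], [9.6, 0.66], [9.4, 0.74], [12.0, 0.30], [12.4, 0.34], [12.2, 0.26]]
y = [5, 5, 5, 7, 7, 7]

model = fit_gnb(X, y)
print(predict_gnb(model, [9.5, 0.68]), predict_gnb(model, [12.1, 0.31]))
```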
5. Results and Evaluation
5.1 Results

Figure 1. Comparison Between R^2 Score for various regression models for min. value

Figure 2. Accuracy % Comparison between all regression models for min. value

Figure 3. MAE Comparison for min. value

Figure 4. RMSE Comparison for min. value

Figure 5. MAPE Comparison for min. value

Figure 6. Comparison Between R^2 Score for various regression models for max. value

Figure 7. Accuracy % Comparison between all regression models for max. value

Figure 8. MAE Comparison for max. value

Figure 9. RMSE Comparison for max. value

Figure 10. MAPE Comparison for max. value

Figure 11. Comparison Between R^2 Score for the mean value

Figure 12. Accuracy % Comparison between all regression models for mean value

Figure 13. Comparison Between R^2 Score for various regression models for Standard Deviation

Figure 14. Accuracy % Comparison between all regression models for standard deviation

Figure 15. Comparison Between R^2 Score for Normal Data

Figure 16. Comparison Between MAE, MAPE, and RMSE Score for Mean Value

Figure 17. Comparison of Accuracy % for Normal Data

Figure 18. Comparison Between MAE, MAPE, and RMSE Score for Normal Data
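
For reference, the four error measures named in the figure captions can be computed as follows. This is a sketch using the standard definitions of R^2, MAE, RMSE, and MAPE; the source does not state how "Accuracy %" was derived, so it is not reproduced here. The true and predicted values are illustrative.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, RMSE, and MAPE (in percent) for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when a true value is 0; quality scores here are nonzero.
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse, "mape": mape}

# Illustrative true and predicted quality scores.
y_true = [5, 6, 7, 5]
y_pred = [5.5, 6.0, 6.5, 5.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```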
6. Conclusions
In this project, we delved into the intriguing world of wine quality prediction by applying a variety of
machine-learning models to the Vinho Verde wine dataset. The project’s primary objective was to
determine how physicochemical properties influence the quality of wine and to leverage these insights
for accurate predictions of wine quality scores.

Throughout the process, we performed thorough Exploratory Data Analysis (EDA), where we
visualized distributions, correlations, and potential outliers within the dataset. This helped in
understanding the inherent patterns in the data, such as the strong correlation between alcohol content
and wine quality, and the negative correlation between volatile acidity and quality. Data cleaning and
preprocessing were also essential steps, ensuring that the dataset was ready for modeling by checking for
missing values and confirming that the all-numerical features needed no encoding.

We applied a wide range of classification and regression models, from simpler techniques like Naïve
Bayes and Decision Trees to more complex methods like Random Forests, Support Vector Machines
(SVM), and Neural Networks. Each model was tested for its ability to predict wine quality, either as a
categorical outcome (good vs. bad wine) or as a continuous variable (quality score). The classification
models showed varied success, with Random Forest and SVM achieving high accuracy in
distinguishing between "good" and "bad" wines. The regression models, particularly Random Forest
Regression and Neural Networks, demonstrated the best performance in predicting continuous quality
scores.

Key findings from the analysis include:

1. Feature Importance: The alcohol content was consistently one of the most important
predictors of wine quality.
2. Data Imbalance: The dataset's skewed distribution of wine quality classes necessitated special
attention to balance and model tuning.
3. Model Performance: Among the models, Random Forest and SVM outperformed others in
terms of predictive accuracy. The Neural Network provided a robust prediction for continuous
quality scores but was computationally more expensive.

Ultimately, the results highlight the value of machine learning in understanding complex data and
making predictions that can be applied to real-world problems, such as wine quality assessment in the
food and beverage industry. This project not only showcased the power of various algorithms but also
the importance of data preprocessing, feature selection, and model evaluation in building an effective
predictive system.

In conclusion, this analysis offers valuable insights into the factors influencing wine quality and
serves as a foundation for future improvements, such as incorporating additional features or
experimenting with advanced ensemble techniques. The lessons learned here can be applied to other
domains where prediction based on numerical features plays a crucial role in decision-making.
7. References

● P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
Decision Support Systems, Elsevier, 47(4):547-553, 2009.
● UCI Machine Learning Repository
Wine Quality Dataset
● Kaggle
Wine Quality Dataset
● Scikit-learn Documentation
https://scikit-learn.org
● Matplotlib Documentation
https://matplotlib.org
● Pandas Documentation
https://pandas.pydata.org
● Seaborn Documentation
https://seaborn.pydata.org
● Python Official Documentation
https://python.org
● Relevant Tutorials and Articles (Feature Engineering Techniques for Regression
Models)
Annexure 1:
