Wine Quality Analysis
November 2024
1.1 Overview
Wine is more than just a drink; for thousands of years, humanity has been captivated by the
complexities of this timeless elixir. But what truly makes a wine exceptional? Is it the
sharpness of its acidity, the warmth of its alcohol, or the perfect harmony of sweetness and
bitterness?
Our goal is not to replace the human touch but to complement it—to provide winemakers,
enthusiasts, and consumers alike with a powerful tool that demystifies the science of wine.
By analyzing the physicochemical properties of wine, such as pH, alcohol content, residual
sugar, and acidity, we aim to uncover the secrets hidden in each drop, revealing the factors
that set apart an ordinary wine from an award-winning masterpiece.
The process of evaluating wine quality has long been shrouded in subjectivity, relying on the
discerning taste of experts who may differ in their judgments. This subjectivity creates a
challenge for winemakers striving to consistently craft high-quality products. How can the
industry ensure a fair, consistent, and scalable method of assessing wine quality?
With this project, we tackle the pressing problem of developing a data-driven approach to
predict wine quality based on measurable attributes. The dataset at our disposal, rich with
information about red wine, offers the perfect canvas for this endeavor.
Through this journey, we embrace the challenge of translating the mystique of wine into
code, algorithms, and visualizations, all while preserving its soul. Because at the end of the
day, a good wine doesn’t just taste great—it tells a story. And with this project, we’re here to
write the next chapter.
2. Data Preparation
2.1 Dataset Used
The dataset used in this project describes the physicochemical and sensory
properties of Portuguese "Vinho Verde" wines. Each sample is described by the
following 11 physicochemical features:
1. Fixed Acidity: Concentration of non-volatile acids (e.g., tartaric acid) that do not
evaporate.
2. Volatile Acidity: Concentration of volatile acids (e.g., acetic acid), which can
contribute to a vinegar-like taste if excessive.
3. Citric Acid: Enhances freshness and adds a citrus-like flavor.
4. Residual Sugar: Sugar left after fermentation; affects sweetness.
5. Chlorides: Salt content in wine; influences taste balance.
6. Free Sulfur Dioxide: Protects wine from oxidation and spoilage.
7. Total Sulfur Dioxide: The total amount of SO₂, including bound forms.
8. Density: Relates to the alcohol and sugar content.
9. pH: Indicates acidity/basicity of wine.
10. Sulphates: Adds to the wine's preservation and taste complexity.
11. Alcohol: Ethanol content in the wine.
● Dataset Shape and Size: The red wine dataset contains 1,599 samples, while the
white wine dataset includes 4,898 samples, with 12 columns each (the 11 features
listed above plus the quality score).
● Data Types: All features are numerical and suitable for direct application of machine
learning models without additional conversions.
● Summary Statistics: Used descriptive statistics to examine the distribution of each
feature, including mean, median, minimum, and maximum values, along with
standard deviation.
● Class Distribution: Analyzed the distribution of the quality variable to observe
imbalances. It was found that most wines fall into the average quality range (scores of
5–6), while high-quality and poor-quality wines are less frequent.
● Correlation Analysis: Computed the Pearson correlation coefficients between
features to identify relationships.
● No categorical data was present in the dataset, so no encoding was required. The data
was inherently numerical, simplifying preprocessing.
● Distribution Plots:
○ Examined the distribution of numerical variables such as alcohol, pH and
citric acid.
○ Observed that variables like alcohol exhibit a right-skewed distribution, while
others like density are nearly uniform.
● Box Plots:
○ Explored variations in wine quality against features like volatile acidity and
residual sugar.
● Correlation Heatmap:
○ Generated a heatmap to visualize the correlation matrix, identifying strong
positive correlations, such as that between alcohol and quality, and negative
ones, such as that between volatile acidity and quality.
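The exploration steps listed above (summary statistics, class distribution, and Pearson correlation) can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the rows are made-up toy values rather than records from the dataset, and the helper names `summarize` and `pearson` are hypothetical. In practice a data-frame library would typically do this work.

```python
from collections import Counter
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical toy rows, NOT taken from the real dataset:
# (alcohol, volatile_acidity, quality)
rows = [
    (9.0, 0.95, 4),
    (9.4, 0.70, 5),
    (9.8, 0.88, 5),
    (10.0, 0.60, 5),
    (10.5, 0.58, 6),
    (11.2, 0.40, 6),
    (12.0, 0.35, 7),
    (12.8, 0.30, 7),
]

alcohol = [r[0] for r in rows]
volatile_acidity = [r[1] for r in rows]
quality = [r[2] for r in rows]


def summarize(values):
    """Descriptive statistics for one numerical feature."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "std": stdev(values),  # sample standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


alcohol_summary = summarize(alcohol)
quality_counts = Counter(quality)  # class distribution of the target
r_alcohol = pearson(alcohol, quality)
r_volatile = pearson(volatile_acidity, quality)

print(alcohol_summary)
print(sorted(quality_counts.items()))
print(round(r_alcohol, 3), round(r_volatile, 3))
```

With the toy values chosen here, the correlation with quality is positive for alcohol and negative for volatile acidity, mirroring the direction of the relationships reported above. The magnitudes carry no meaning for the real data.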
4. Methods Used
Regression Techniques:
4.1 Linear Regression:
● The simplest regression model, used to predict the wine quality score as a
linear combination of features. While interpretable, it struggled with
non-linear relationships.
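As a rough sketch of the idea, the example below fits an ordinary least-squares line using a single feature. This is a deliberate simplification: the model described above combines all features, whereas this version uses only alcohol to keep the arithmetic visible. The data points are made-up toy values, and the function names `fit_simple_linear` and `predict` are hypothetical.

```python
from statistics import mean


def fit_simple_linear(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted quality score for a single feature value."""
    return slope * x + intercept


# Hypothetical toy values, NOT taken from the real dataset.
alcohol = [9.0, 9.5, 10.0, 11.0, 12.0, 12.5]
quality = [4, 5, 5, 6, 7, 7]

slope, intercept = fit_simple_linear(alcohol, quality)
predicted_quality = predict(10.5, slope, intercept)
print(slope, intercept, predicted_quality)
```

Because the prediction is a straight line in the feature, any curvature in the true relationship goes unmodeled, which is one way to read the limitation with non-linear relationships noted above.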
Figure 1. Comparison between R^2 scores for various regression models for min. value
Figure 2. Accuracy % comparison between all regression models for min. value
Figure 6. Comparison between R^2 scores for various regression models for max. value
Figure 7. Accuracy % comparison between all regression models for max. value
Figure 11. Comparison between R^2 scores for the mean value
Figure 12. Accuracy % comparison between all regression models for mean value
Figure 13. Comparison between R^2 scores for various regression models for standard deviation
Figure 14. Accuracy % comparison between all regression models for standard deviation
Figure 13. Comparison between R^2 scores for normal data
Figure 14. Comparison between MAE, MAPE, and RMSE scores for mean value
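The figures above compare models using R^2, MAE, MAPE, and RMSE. For reference, the sketch below computes these four metrics from their standard definitions. The function name `regression_metrics` and the example values are hypothetical, and the report's "Accuracy %" measure is not reproduced here because its definition is not given.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, MAPE (in percent), and RMSE for paired values.

    MAPE divides by each true value, so y_true must not contain zeros.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "mape": mape, "rmse": rmse}


# Hypothetical quality scores and predictions, for illustration only.
metrics = regression_metrics([5, 6, 7, 5], [5, 6, 6, 6])
print(metrics)
```

Higher is better for R^2, while lower is better for the three error metrics, so the two kinds of chart should be read in opposite directions.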
Throughout the process, we performed thorough Exploratory Data Analysis (EDA), where we
visualized distributions, correlations, and potential outliers within the dataset. This helped in
understanding the inherent patterns in the data, such as the strong correlation between alcohol content
and wine quality, and the negative correlation between volatile acidity and quality. Data cleaning and
preprocessing were also essential steps, ensuring that the dataset was ready for modeling by handling
missing values and confirming that the all-numerical features required no encoding.
We applied a wide range of classification and regression models, from simpler techniques like Naïve
Bayes and Decision Trees to more complex methods like Random Forests, Support Vector Machines
(SVM), and Neural Networks. Each model was tested for its ability to predict wine quality, either as a
categorical outcome (good vs. bad wine) or as a continuous variable (quality score). The classification
models showed varied success, with Random Forest and SVM achieving high accuracy in
distinguishing between "good" and "bad" wines. The regression models, particularly Random Forest
Regression and Neural Networks, demonstrated the best performance in predicting continuous quality
scores.
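To make the classification framing concrete, the sketch below turns numeric quality scores into "good" and "bad" labels and scores a set of predictions by accuracy. The cut-off of 7 is an assumption made for illustration, since the threshold used in the project is not stated here. The scores and predicted labels are made up, and `to_label` and `accuracy` are hypothetical helper names.

```python
GOOD_THRESHOLD = 7  # assumption: the report does not state its cut-off


def to_label(score, threshold=GOOD_THRESHOLD):
    """Map a numeric quality score to a binary class label."""
    return "good" if score >= threshold else "bad"


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical quality scores, NOT taken from the real dataset.
scores = [5, 6, 7, 4, 8, 5]
true_labels = [to_label(s) for s in scores]

# Output of some hypothetical classifier on the same six wines.
predicted_labels = ["bad", "bad", "good", "bad", "bad", "bad"]

acc = accuracy(true_labels, predicted_labels)
print(true_labels, acc)
```

Under this framing the regression task keeps the raw score as its target, while the classification task only needs the derived label.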
1. Feature Importance: The alcohol content was consistently one of the most important
predictors of wine quality.
2. Data Imbalance: The dataset's skewed distribution of wine quality classes necessitated special
attention to balance and model tuning.
3. Model Performance: Among the models, Random Forest and SVM outperformed others in
terms of predictive accuracy. The Neural Network provided a robust prediction for continuous
quality scores but was computationally more expensive.
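The imbalance noted in item 2 can be addressed in several ways, and the specific approach taken is not detailed here. One common option, shown below as an assumption rather than as the project's method, is to weight each class inversely to its frequency so that rare quality scores count for more during training. The function name `balanced_class_weights` and the label list are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }


# Hypothetical quality labels skewed toward the middle of the scale.
labels = [5, 5, 5, 5, 6, 6, 6, 7]

weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive weights above 1 and frequent classes weights below 1, so a model that accepts per-class weights is penalized more for misclassifying the uncommon high-quality or poor-quality wines.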
Ultimately, the results highlight the value of machine learning in understanding complex data and
making predictions that can be applied to real-world problems, such as wine quality assessment in the
food and beverage industry. This project not only showcased the power of various algorithms but also
the importance of data preprocessing, feature selection, and model evaluation in building an effective
predictive system.
In conclusion, this analysis offers valuable insights into the factors influencing wine quality and
serves as a foundation for future improvements, such as incorporating additional features or
experimenting with advanced ensemble techniques. The lessons learned here can be applied to other
domains where prediction based on numerical features plays a crucial role in decision-making.
References