Project Report AS
(MBA 2023-25)
Advanced Statistics
Project Report on
MANOVA
Dr Lalit Kumar
Submitted By:
Table of Contents
Abstract
Introduction
Motivation and Importance of Predicting Wine Quality
Objectives
Problem Statement
Data
Methodology
Code Breakdown
Future Work
Evaluation
Analysis
Findings from Analysis
Data Insights and Visualization
Challenges and Considerations
Insights and Implications
Final Thoughts
Conclusion
1. Abstract
This report presents a comprehensive analysis of the relationship between the physicochemical
properties of red wine and its quality rating, employing Multivariate Analysis of Variance
(MANOVA). The study leverages a dataset comprising various attributes of red wine, including
pH, alcohol content, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide,
total sulfur dioxide, density, and sulfates, alongside a subjective wine quality score. These
features serve as independent variables, while the wine quality score is treated as the dependent
variable.
The aim of this analysis is to investigate how these multiple chemical properties collectively
influence the perceived quality of wine, which is rated on a scale from 0 to 10. Data
preprocessing was performed to clean the dataset: it was checked for missing values, outliers
were identified, and column names were standardized to ensure compatibility with the statistical
methods applied.
Following the MANOVA analysis, post-hoc tests were conducted using Tukey's Honestly
Significant Difference (HSD) test to further explore pairwise comparisons between different
wine quality levels. These post-hoc tests help to identify specific wine properties that contribute
most significantly to higher or lower quality scores.
The findings from this analysis reveal that certain variables, such as alcohol content, volatile
acidity, and residual sugar, exhibit statistically significant relationships with wine quality,
whereas others, such as chlorides and density, have less pronounced effects. The results of this
study offer valuable insights into the underlying factors that contribute to the sensory experience
of wine, providing potential guidelines for winemakers to fine-tune production processes for
enhanced wine quality. Furthermore, the findings highlight the importance of a balanced
chemical composition in producing higher-quality wines and can serve as a scientific foundation
for further research in oenology.
2. Introduction
The wine industry has long been regarded as a field where tradition and scientific innovation
intersect. Winemakers have historically relied on their expertise, intuition, and subjective tasting
to judge the quality of wine. However, in recent years, advancements in data analysis and
machine learning have provided new avenues for understanding the intricate relationships
between a wine’s chemical composition and its sensory qualities. In this context, statistical
methods such as Multivariate Analysis of Variance (MANOVA) offer powerful tools to explore
how various chemical properties contribute to wine quality and enable winemakers to optimize
their processes based on objective data.
Wine quality is a multifaceted characteristic, typically determined by human sensory panels that
assess attributes such as flavor, aroma, mouthfeel, and overall balance. These subjective
assessments are then translated into numerical scores that rate the wine's quality on a defined
scale, often ranging from 0 to 10. However, this quality is inherently influenced by the chemical
composition of the wine. Factors such as alcohol content, acidity levels, and sugar concentrations
play critical roles in shaping the overall sensory experience. Thus, identifying the underlying
chemical properties that drive these quality scores is of great interest to winemakers, researchers,
and the beverage industry at large.
The physicochemical properties of wine are quantifiable metrics that can be measured and
analyzed using modern laboratory techniques. Common attributes include pH, residual sugar,
alcohol content, volatile acidity, and levels of various sulfites. Each of these factors can affect
the wine's taste, structure, and preservation. For example, acidity influences the sharpness and
freshness of the wine, while alcohol content contributes to its body and perceived warmth. By
analyzing these variables in tandem, one can gain deeper insights into how different chemical
properties work together to influence the final product's quality. This is where MANOVA, a
statistical method designed to analyze the relationship between multiple dependent and
independent variables simultaneously, becomes invaluable.
In this report, we apply MANOVA to a dataset containing the physicochemical properties and
corresponding quality scores of red wine samples. The goal of this analysis is to determine which
chemical attributes, when considered collectively, have a significant impact on the overall
quality of the wine. Unlike univariate techniques, which examine each variable in isolation,
MANOVA allows for the examination of multiple variables at once, capturing the
interdependencies and interactions between them. This provides a more holistic understanding of
the factors that contribute to wine quality, which is critical for both scientific research and
practical applications in the wine industry.
The dataset used for this study consists of several physicochemical properties for a set of red
wine samples, including pH, alcohol content, volatile acidity, citric acid, residual sugar,
chlorides, free sulfur dioxide, total sulfur dioxide, density, and sulfates. Each wine sample is also
rated on a quality scale from 0 to 10, based on sensory evaluations. The challenge is to determine
how these measurable properties can be used to predict the quality rating and to what extent each
variable influences the final quality score. Understanding these relationships can help
winemakers adjust their processes, such as fermentation or aging, to achieve wines with desired
characteristics and higher quality ratings.
The MANOVA approach is particularly well-suited for this analysis because it can detect the
combined effects of multiple independent variables on the dependent variable (wine quality).
This is important because wine quality is likely influenced by complex interactions between
chemical properties rather than by individual factors acting independently. For example, a certain
level of acidity might enhance the wine's flavor if balanced by appropriate levels of sugar and
alcohol, but it could detract from the overall quality if the sugar content is too low or the alcohol
content too high. MANOVA enables us to examine these types of interactions and assess their
collective impact on wine quality.
In addition to applying MANOVA, we also conduct post-hoc analyses using Tukey’s Honestly
Significant Difference (HSD) test to further explore which specific wine properties contribute to
differences in quality ratings between wine samples. This helps in identifying not only whether
the physicochemical properties affect quality but also how different levels of these properties
differentiate higher-quality wines from lower-quality ones. The findings of this analysis provide
valuable insights that could inform winemaking practices and lead to improvements in wine
quality, ultimately benefiting both producers and consumers.
In summary, this report aims to bridge the gap between the scientific analysis of wine
composition and the subjective evaluation of wine quality by applying statistical techniques to
uncover the relationships between wine's chemical properties and its perceived quality. By doing
so, we hope to contribute to a deeper understanding of the factors that influence wine quality and
provide actionable insights that can be used by winemakers to enhance the production process.
2.1. Motivation and Importance of Predicting Wine Quality
In the global wine industry, which is both highly competitive and deeply rooted in tradition, the
quality of wine plays a pivotal role in determining its market success, pricing, and consumer
demand. Wine quality is not only a measure of a winemaker’s craftsmanship but also a critical
factor influencing the branding, marketing, and commercial positioning of a wine. Consequently,
understanding and predicting wine quality based on measurable chemical properties is of
paramount importance to winemakers, researchers, and industry stakeholders alike.
Traditionally, the assessment of wine quality has been a subjective process, relying heavily on
the expertise of trained sommeliers and sensory panels. These individuals evaluate the sensory
characteristics of wine—such as taste, aroma, appearance, and mouthfeel—to provide a quality
score. While these sensory evaluations are crucial, they are also influenced by human perception,
which can be inconsistent due to factors such as palate fatigue, personal biases, and variability
between tasters. Additionally, sensory evaluation is time-consuming, expensive, and requires the
presence of highly trained professionals. This has motivated researchers and winemakers to seek
objective, data-driven approaches to assess and predict wine quality, especially during the early
stages of production.
The advent of machine learning, statistical analysis, and predictive modeling has opened up new
possibilities for assessing wine quality in a more objective and scalable manner. By utilizing
measurable physicochemical properties—such as pH, acidity, alcohol content, and sulfur dioxide
levels—researchers can develop models that predict wine quality without the need for costly and
subjective sensory evaluations. This transition towards data-driven wine quality prediction not
only increases the efficiency of the winemaking process but also enables winemakers to make
informed decisions about adjustments in production, such as optimizing fermentation conditions
or blending different batches to achieve the desired quality.
Furthermore, predicting wine quality based on chemical analysis provides valuable insights into
the scientific principles underlying the sensory experience of wine. While sensory attributes such
as flavor and aroma are complex and influenced by many factors, the chemical composition of
the wine can be quantified and analyzed to determine how it contributes to these attributes.
Understanding these relationships allows winemakers to fine-tune their processes, leading to
consistent quality improvements and the production of wines that meet specific consumer
preferences.
From an industry perspective, the ability to predict wine quality with greater accuracy has
several practical benefits:
1. Improved Production Efficiency: During the winemaking process, there are numerous
stages—such as grape harvesting, fermentation, and aging—where decisions need to be
made that impact the final quality of the wine. By predicting quality based on measurable
variables early in the process, winemakers can intervene and make adjustments to
optimize the final product. For instance, if a wine is predicted to have a lower quality due
to high acidity or low alcohol content, changes in fermentation techniques or blending
options can be explored to achieve better balance and flavor.
2. Cost Reduction: Predicting wine quality using chemical analysis reduces reliance on
time-intensive sensory panels, lowering the costs associated with quality control. This
can be especially beneficial for large-scale wineries that need to assess quality across
numerous batches of wine quickly and efficiently. Additionally, early prediction of
quality can help prevent the production of suboptimal wine, thus minimizing financial
losses associated with producing unsellable or low-quality batches.
3. Consistency in Wine Production: One of the key challenges in winemaking is achieving
consistency in quality across different vintages and batches. Wine is a natural product,
and factors such as grape variety, soil composition, and climate conditions can vary
significantly from year to year. By using predictive models based on chemical analysis,
winemakers can maintain more consistent quality by adjusting their methods in response
to chemical measurements, even when raw materials or environmental conditions vary.
This consistency is crucial for maintaining brand reputation and customer loyalty.
4. Consumer Satisfaction and Marketability: Wine consumers, particularly those who
purchase premium wines, have high expectations for quality. Predictive models that help
ensure high-quality production can improve consumer satisfaction and drive
marketability. Consistently producing wines that meet or exceed consumer expectations
can enhance a winery’s reputation, lead to better reviews and ratings, and ultimately drive
sales. In a market where competition is fierce, the ability to consistently deliver high-
quality wines can be a decisive factor for commercial success.
5. Sustainability and Waste Reduction: Predicting wine quality early in the production
process can also contribute to sustainability efforts by reducing waste. Winemakers can
identify potential issues with wine quality before significant resources are invested in the
aging or bottling stages, allowing them to adjust production methods or make decisions
that minimize resource use. This not only reduces the environmental footprint of
winemaking but also aligns with growing consumer demand for sustainable and eco-
friendly practices in the food and beverage industry.
6. Scientific Advancement in Oenology: From a research perspective, the development
and refinement of predictive models for wine quality contribute to the broader field of
oenology—the science of wine and winemaking. By applying statistical and machine
learning techniques, researchers can explore the complex interactions between wine’s
chemical properties and its sensory characteristics, leading to new discoveries about the
factors that influence taste, aroma, and overall quality. This knowledge not only benefits
winemakers but also advances the scientific understanding of wine as a product of both
art and science.
In this context, the use of Multivariate Analysis of Variance (MANOVA) offers a particularly
powerful approach for predicting wine quality. Unlike univariate methods that assess each
variable in isolation, MANOVA evaluates the joint influence of multiple independent variables
on wine quality. This is especially important for wine, where the interaction between different
chemical properties (e.g., the balance between alcohol content and acidity) can significantly
impact the final quality. By identifying which combinations of physicochemical properties are
most strongly associated with higher quality scores, winemakers can make more informed
decisions about how to adjust their production processes to achieve the desired results.
In conclusion, the motivation to predict wine quality using objective data-driven methods stems
from a desire to improve production efficiency, reduce costs, maintain consistency, and
ultimately enhance consumer satisfaction. The ability to predict and control wine quality is not
only a competitive advantage for wineries but also a step towards a more scientifically informed
and sustainable future for the wine industry. This report, by applying MANOVA to the analysis
of red wine quality, seeks to contribute to this growing body of knowledge and provide practical
insights for winemakers aiming to optimize their products.
2.2. Objectives
The primary objective of this report is to explore and analyze the relationship between various
physicochemical properties of red wine and its perceived quality rating using Multivariate
Analysis of Variance (MANOVA). Specifically, the study aims to identify which combinations
of measurable chemical characteristics—such as pH, alcohol content, volatile acidity, and
residual sugar—have a statistically significant influence on the overall quality of the wine as
rated by a sensory panel. By leveraging the power of MANOVA, this analysis seeks to go
beyond individual factor effects and uncover how multiple variables interact to collectively
shape the sensory experience of wine.
The objectives of this study can be broken down into several key components:
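1. Data Exploration and Preprocessing:
o The first objective is to explore the dataset and prepare it for analysis by
handling missing values, identifying outliers, and
ensuring that column names are standardized and variables are appropriately
structured for multivariate analysis.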
o The dataset will be carefully examined to understand the distributions of
individual variables, ensuring that no inconsistencies or anomalies are present that
could bias the results. This data preprocessing step is crucial for accurate
statistical analysis and reliable conclusions.
2. Application of MANOVA:
o The core objective of this report is to apply MANOVA to evaluate the
simultaneous effects of multiple independent variables (the chemical properties of
wine) on a single dependent variable (wine quality). MANOVA will help
determine whether the differences in the quality scores can be explained by
differences in the wine’s chemical composition. This multivariate approach is key
because wine quality is likely influenced by the interaction of various chemical
factors rather than being determined by any single variable in isolation.
o The report aims to demonstrate how MANOVA can provide a comprehensive
analysis by considering the combined effects of multiple variables, capturing the
complexity of wine’s chemical makeup and its impact on quality.
3. Identification of Significant Variables:
o Another major objective is to identify which physicochemical properties of wine
have the most significant impact on quality. The study will assess whether certain
variables (e.g., alcohol content, volatile acidity, or pH) have a greater influence on
the quality score and how these variables interact with one another.
o By identifying these significant variables, the analysis seeks to provide actionable
insights for winemakers, guiding them toward optimizing their production
processes. Understanding which chemical properties drive higher quality could
inform decisions regarding fermentation techniques, blending practices, and other
key stages in the winemaking process.
4. Post-Hoc Analysis Using Tukey's Honestly Significant Difference (HSD) Test:
o In addition to performing MANOVA, the report aims to conduct post-hoc tests
using Tukey’s HSD to make pairwise comparisons between different wine quality
levels. This step will help further investigate how specific combinations of
chemical properties differentiate high-quality wines from lower-quality ones.
o The objective here is to provide clarity on which variables contribute to
differences in wine quality across different quality categories. Tukey’s HSD test
will highlight significant differences in chemical compositions between wines
rated at different levels, offering additional insights into how winemakers can
fine-tune these variables to achieve desired quality outcomes.
5. Visualizing and Interpreting Results:
o Another objective is to effectively visualize the results of the MANOVA and
post-hoc analyses. Clear and intuitive visualizations (e.g., plots, graphs, and
charts) will be used to present the relationships between physicochemical
properties and wine quality in a way that is easily interpretable by both technical
and non-technical audiences.
o The study aims to provide a clear narrative that connects the statistical results
with practical implications, helping winemakers and other stakeholders in the
wine industry understand how the findings can be applied to improve wine
quality.
6. Contributing to Oenological Knowledge:
o A broader objective is to contribute to the growing body of research in oenology
by using data-driven methods to deepen the understanding of the chemical factors
that influence wine quality. By applying MANOVA in this context, the report
aims to showcase the value of multivariate statistical techniques in wine analysis
and provide a methodological framework that can be replicated or extended in
future studies.
o The findings of this report will add to the scientific understanding of how
measurable properties of wine impact sensory perception and quality ratings,
helping bridge the gap between subjective taste evaluations and objective
chemical analysis.
7. Providing Practical Recommendations for Winemakers:
o Lastly, an important objective is to provide actionable recommendations for
winemakers based on the findings of the analysis. By pinpointing the key
chemical properties that influence wine quality, this report aims to offer practical
guidance on how production processes—such as fermentation, aging, and
blending—can be adjusted to enhance wine quality.
o The goal is to empower winemakers to make informed decisions using objective
data, ultimately improving the consistency and quality of the wines they produce.
In summary, the objectives are:
o To explore the dataset and perform data preprocessing for accurate analysis.
o To apply MANOVA to determine the collective influence of multiple physicochemical
variables on wine quality.
o To identify the most significant chemical properties that impact wine quality.
o To conduct post-hoc analysis using Tukey's HSD test for pairwise comparisons between
wine quality categories.
o To visualize and interpret the results for clear, actionable insights.
o To contribute to the field of oenology by using data-driven methods to predict wine
quality.
o To provide practical recommendations for winemakers to enhance the quality of their
products.
By achieving these objectives, this report seeks to demonstrate the practical and scientific value
of using multivariate statistical methods like MANOVA to analyze and predict wine quality
based on its chemical composition. The findings are expected to have important implications for
the wine industry, particularly in terms of improving production techniques and enhancing the
consistency of wine quality.
3. Problem Statement
Wine quality is a key determinant of its market success, consumer preference, and pricing in the
competitive global wine industry. However, the process of evaluating wine quality is often
subjective, relying heavily on sensory evaluations by trained sommeliers or tasting panels. These
assessments, while valuable, can be inconsistent due to the inherent variability in human
perception, influenced by factors such as palate fatigue, personal biases, and environmental
conditions. Additionally, sensory evaluations are time-consuming, expensive, and difficult to
scale for large quantities of wine production.
Given the growing demand for objective, reliable, and efficient methods to assess wine quality,
there is a pressing need to explore data-driven approaches that can predict quality based on
measurable chemical properties. Wine is a complex product whose sensory qualities—such as
taste, aroma, and mouthfeel—are directly influenced by its underlying chemical composition.
Attributes such as pH, alcohol content, acidity, and sugar levels play critical roles in determining
how a wine is perceived. Therefore, identifying the relationship between these physicochemical
properties and the sensory quality of wine is crucial for optimizing the winemaking process and
ensuring consistent product quality.
Despite the importance of this relationship, winemakers often face the challenge of
understanding how multiple chemical properties interact to influence overall wine quality. While
some factors, such as alcohol content and acidity, are well-known to affect the sensory
experience, the combined effect of various physicochemical variables on wine quality remains
difficult to analyze using traditional univariate statistical methods. This complexity necessitates
the use of more advanced statistical techniques that can account for the simultaneous influence of
multiple variables.
The primary problem addressed in this report is how to accurately predict wine quality based on
its physicochemical properties, using a multivariate statistical approach. Specifically, this study
seeks to determine which combinations of chemical attributes are most strongly associated with
higher or lower wine quality ratings. This analysis will provide insights into how different
variables interact and contribute to the overall sensory experience, enabling winemakers to better
control and optimize the quality of their products.
Problem Definition:
The specific problem this report seeks to solve is how to develop a multivariate statistical model
that can accurately predict wine quality based on its chemical properties. By applying
Multivariate Analysis of Variance (MANOVA), the goal is to analyze the relationship between
several independent variables (physicochemical properties) and the dependent variable (wine
quality) to determine the factors that most significantly influence the final quality score.
Additionally, the study aims to identify whether these variables act independently or interact in
complex ways to affect the overall sensory perception of the wine.
Ultimately, solving this problem will provide a deeper understanding of the underlying drivers of
wine quality, helping winemakers make data-informed decisions to enhance production methods
and deliver consistently high-quality wines to the market.
4. Data
4.1. Data Collection
The dataset used in this study is sourced from publicly available wine quality datasets,
commonly used in wine quality prediction studies. It contains physicochemical properties and
corresponding quality ratings for a variety of red wine samples from the Vinho Verde region in
Portugal. Each wine sample is described by a series of chemical attributes that reflect its acidity,
alcohol content, and sugar levels, among other factors. The dependent variable, wine quality, is a
numerical score assigned by a sensory panel based on the wine’s overall taste and quality.
Dependent Variable
In this study, the dependent variable (DV) is wine quality, the expert tasters' evaluation of
the wine. Ratings are given on a scale from 0 to 10, and the scores observed in this dataset
range from 3 to 8. For the purpose of the MANOVA, quality is treated as a categorical variable
with three levels (low, medium, high).
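Independent Variables
The independent variables (IVs) are the physicochemical properties of the wine, which serve as
predictors for the wine quality rating. These properties are continuous variables: fixed
acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total
sulfur dioxide, density, pH, sulfates, and alcohol content.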
The dataset contains 1,599 instances (wine samples) with 11 features (physicochemical
properties) and a target variable (quality), for 12 columns in total. The wine quality ratings
are based on a scale from 0 to 10, where 0 is the lowest quality and 10 is the highest, though
the observed scores fall between 3 and 8. The quality distribution is imbalanced: most samples
receive a middle score of 5 or 6, with far fewer samples in the highest and lowest quality
categories.
The features in the dataset cover a range of chemical attributes that are known to influence the
sensory characteristics of wine, such as acidity, alcohol, and sugar levels. These attributes
provide the basis for predicting the quality of the wine through statistical and machine learning
techniques.
4.3. Data Preprocessing
Data preprocessing is a critical step that transforms raw data into a format suitable for analysis
and model training. This process ensures that the data is clean, organized, and optimally prepared
for machine learning algorithms. The following steps were undertaken during data
preprocessing:
4.3.1. Loading and Combining Datasets: Initially, the separate datasets for white and
red wines were loaded into the environment. These datasets, while containing the
same features, pertain to different types of wine. After loading, they were
combined into a single dataset, with an additional categorical column indicating
the type of wine (either "red" or "white"). This differentiation allows for a more
nuanced analysis of the data.
4.3.2. Separation of Features and Target Variable: Post-combination, the dataset was
structured to separate the features (physicochemical properties) from the target
variable (wine quality score). This delineation is essential for supervised learning,
where the model learns to predict the target variable based on input features.
4.3.3. Data Cleaning: During this phase, any missing or erroneous data points were
identified and addressed to ensure a complete and accurate dataset. This may
involve removing rows with missing values, imputing values, or correcting any
inconsistencies found within the data.
4.3.4. Feature Scaling: Feature scaling was performed using the StandardScaler from
the sklearn library. Standardization transforms the features so that they each have
a mean of 0 and a standard deviation of 1. This process is particularly crucial for
machine learning models like SVR and ANN, which are sensitive to the scale of
input data. Without scaling, features with larger numeric ranges (such as alcohol
content) could disproportionately influence the model compared to features with
smaller ranges (like pH), leading to suboptimal performance.
4.3.5. Train-Test Split: To validate the model’s predictive ability, the dataset was split
into training and testing sets. In this project, 80% of the data was allocated for
training, while 20% was reserved for testing. This split is vital for evaluating how
well the model generalizes to unseen data.
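As a rough illustration of steps 4.3.4 and 4.3.5, the sketch below shows an 80/20 split followed by z-score scaling. It uses plain NumPy as a stand-in for sklearn's StandardScaler, and the function names and the synthetic data are hypothetical. Fitting the scaling statistics on the training rows only is a design choice assumed here, not something the report states.

```python
import numpy as np

def split_80_20(X, y, seed=0):
    """Shuffle the rows and return an 80% / 20% train-test split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def standardize(X_train, X_test):
    """Z-score both splits using the training mean and standard deviation."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Synthetic stand-in data: 100 samples, 2 features on very different scales.
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(10.4, 1.0, 100), rng.normal(3.3, 0.15, 100)])
y = rng.integers(3, 9, 100)  # hypothetical quality scores in 3..8

X_train, X_test, y_train, y_test = split_80_20(X, y)
X_train_z, X_test_z = standardize(X_train, X_test)
```

Reusing the training mean and standard deviation for the test rows keeps information from the held-out samples out of the scaling step.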
5. Methodology
The purpose of this study is to analyze the relationship between various physicochemical
properties of red wine and its quality score using Multivariate Analysis of Variance
(MANOVA). MANOVA is an extension of ANOVA that tests whether one or more grouping
(independent) variables affect several continuous dependent variables simultaneously. In the
MANOVA model itself, the categorical wine quality score is therefore the grouping factor, and
the 11 physicochemical properties form the vector of response variables whose group means are
compared. The rest of the report refers to the properties as independent variables and to
quality as the dependent variable because its practical goal is to explain quality from
chemistry. The methodology can be broken down into several key steps:
1. Data Preprocessing
Before conducting MANOVA, the dataset was subjected to several preprocessing steps to ensure
that the data was in a suitable format for multivariate analysis:
Data Cleaning: The dataset was checked for missing values, outliers, and
inconsistencies. No missing values were found, and outliers were examined to assess
whether they represented valid extreme values or data entry errors.
Standardization: Since the physicochemical properties of wine are measured in different
units (e.g., pH, alcohol percentage, and residual sugar in grams), it was necessary to
standardize the data. Z-score normalization was applied to ensure all variables
contributed equally to the MANOVA model.
Transformation of Dependent Variable: The quality ratings were transformed from a
scale of 0-10 to three categories to reduce class imbalance:
o Low Quality: Ratings of 3-4
o Medium Quality: Ratings of 5-6
o High Quality: Ratings of 7-8
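The two transformations described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the report's own code: the function names and the sample values are hypothetical, and sending scores below 3 to "low" and above 8 to "high" is an assumption, since the report only defines the bins for scores 3 to 8.

```python
from statistics import mean, pstdev

def quality_category(score):
    """Map a raw quality score to the report's three groups.

    Assumption: scores outside 3-8 fall into the nearest group.
    """
    if score <= 4:
        return "low"
    if score <= 6:
        return "medium"
    return "high"

def z_scores(values):
    """Standardize a column to mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]

# Illustrative values only, not taken from the dataset.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.8]
quality = [3, 5, 6, 7, 8]

alcohol_z = z_scores(alcohol)
groups = [quality_category(q) for q in quality]
```

After this step every property is on a comparable scale, and each sample carries one of three group labels that serve as the MANOVA grouping factor.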
2. Exploratory Data Analysis (EDA)
Prior to the formal statistical analysis, exploratory data analysis (EDA) was performed to
understand the dataset’s characteristics and relationships among variables:
3. Multivariate Analysis of Variance (MANOVA)
MANOVA was employed as the primary statistical technique to investigate the relationship
between the physicochemical properties of red wine and its quality ratings. This method was
chosen for its ability to assess the collective effect of multiple independent variables on the
dependent variable, taking into account the correlations between the physicochemical properties.
The steps involved in the MANOVA are as follows:
4. Post-Hoc Analysis Using Tukey’s Honestly Significant Difference (HSD) Test
After the MANOVA analysis, Tukey’s HSD test was performed as a post-hoc analysis to
identify specific differences between the wine quality categories. This test was used to make
pairwise comparisons between the means of different wine quality groups (e.g., comparing low
vs. medium and medium vs. high-quality wines). Tukey’s HSD controls the family-wise Type I
error rate across these multiple comparisons, so that a difference between groups is flagged as
significant only after adjusting for the number of pairs tested.
The goal of this step was to determine which physicochemical properties significantly
distinguish wines of different quality categories. For example, the test might reveal that alcohol
content significantly differs between high- and medium-quality wines, providing actionable
insights for winemakers aiming to produce higher-quality wines.
5. Hypotheses
Null hypothesis (H₀): μ₁ = μ₂ = μ₃
o Where μ₁, μ₂, and μ₃ are the mean vectors of the physicochemical properties for
wines in the low-, medium-, and high-quality categories, respectively.
Alternative hypothesis (H₁): At least one of the mean vectors (μ₁, μ₂, or μ₃) differs from the
others.
6. Visualization of Results
To aid in the interpretation of the MANOVA results, various visualizations were created:
properties. This technique transforms the multivariate data into a lower-dimensional
space while preserving the differences between groups.
Boxplots and Pairwise Comparisons: Boxplots were used to display the distribution of
each physicochemical property across the three quality categories, helping to visually
compare the group means. Pairwise comparisons from Tukey’s HSD test were also
visualized to illustrate significant differences between groups.
Summary of Methodology
The methodology applied in this study was designed to rigorously analyze the relationship
between red wine’s physicochemical properties and its quality. By using MANOVA, we were
able to examine the collective influence of multiple variables on wine quality, followed by post-
hoc tests to pinpoint the factors that most significantly differentiate wines of varying quality.
This statistical approach provided valuable insights into the complex interactions between wine’s
chemical components and its perceived quality, offering practical recommendations for
winemakers.
6. Code Breakdown
1. Importing Libraries
2. Loading and Cleaning the Dataset, and Running MANOVA
ds: The dataset is loaded from the file "win equality.csv" into a pandas DataFrame called ds.
ds.columns.str.replace(' ', '_'): This replaces any spaces in the column names with
underscores to make them easier to reference in code, especially for statistical formulas.
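A minimal sketch of the loading and column-cleaning step is given below. To keep it runnable without the data file, an in-memory CSV with a few made-up rows stands in for the real file; the `csv_text` variable and its contents are assumptions, not the report's data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file so the sketch runs without it.
csv_text = io.StringIO(
    "fixed acidity,volatile acidity,residual sugar,alcohol,quality\n"
    "7.4,0.70,1.9,9.4,5\n"
    "7.8,0.88,2.6,9.8,5\n"
    "11.2,0.28,1.9,9.8,6\n"
)
ds = pd.read_csv(csv_text)

# Replace spaces so the columns are valid names inside statistical formulas.
ds.columns = ds.columns.str.replace(" ", "_")
print(list(ds.columns))
```

With underscores in place, a column such as volatile_acidity can be written directly in a model formula without quoting.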
mv_test(): Called on the MANOVA model built from the specified formula, this method runs the
multivariate tests and reports the test statistics for each term in the model.
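The model-fitting step might look like the sketch below, which uses the MANOVA class from statsmodels. The data are simulated purely for illustration, and the three dependent variables, the grouping column `quality_cat`, and the group shifts are assumptions; the report's own formula is not reproduced in the text.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Simulated data (assumption): three quality groups with shifted means.
rng = np.random.default_rng(0)
n = 40
shift = np.repeat([0.0, 0.5, 1.5], n)
ds = pd.DataFrame({
    "quality_cat": np.repeat(["Low", "Medium", "High"], n),
    "alcohol": 10 + shift + rng.normal(0, 0.8, 3 * n),
    "volatile_acidity": 0.6 - 0.1 * shift + rng.normal(0, 0.1, 3 * n),
    "pH": 3.3 + rng.normal(0, 0.15, 3 * n),
})

# Dependent variables go on the left of "~", the grouping factor on the right.
model = MANOVA.from_formula(
    "alcohol + volatile_acidity + pH ~ quality_cat", data=ds
)
result = model.mv_test()
print(result)

# Pull the Wilks' lambda row for the quality term out of the results.
stat_table = result.results["quality_cat"]["stat"]
wilks_value = float(stat_table.loc["Wilks' lambda", "Value"])
wilks_p = float(stat_table.loc["Wilks' lambda", "Pr > F"])
```

The printed summary contains one table per model term; the table for the quality term is the one relevant to the hypothesis of equal mean vectors.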
The MANOVA test returns several statistics:
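Wilks' Lambda: Measures the proportion of variance in the dependent variables (the
chemical properties) not explained by the independent variable (quality). Values closer
to 0 indicate stronger separation between groups.
Pillai's Trace: Sums the proportion of variance explained along each discriminant
dimension. It is generally regarded as the most robust of the four statistics.
Hotelling-Lawley Trace: Sums the ratios of between-group to within-group variation
across all discriminant dimensions.
Roy's Greatest Root: Considers only the dimension with the largest separation between
groups.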
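For reference, the four statistics can be written in terms of the eigenvalues of the matrix product E⁻¹H. These are the standard textbook definitions; the symbols H (between-group sums-of-squares-and-cross-products matrix), E (within-group counterpart), λᵢ (eigenvalues), and s (number of non-zero eigenvalues) are introduced here and do not appear elsewhere in this report.

```latex
\begin{align*}
\Lambda_{\text{Wilks}} &= \frac{\det(E)}{\det(H + E)} = \prod_{i=1}^{s} \frac{1}{1 + \lambda_i}
  & V_{\text{Pillai}} &= \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i} \\
T_{\text{Hotelling-Lawley}} &= \sum_{i=1}^{s} \lambda_i
  & \theta_{\text{Roy}} &= \lambda_{\max}
\end{align*}
```

Some texts define Roy's statistic as λ_max / (1 + λ_max) instead; the form shown here is the largest eigenvalue itself.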
The MANOVA test was used to assess the combined effect of wine quality on multiple
chemical properties (e.g., acidity, alcohol).
The test evaluated four main statistics: Wilks' Lambda, Pillai's Trace, Hotelling-Lawley
Trace, and Roy's Greatest Root.
For all test statistics, the p-values were highly significant (p < 0.0001).
This indicates a statistically significant multivariate difference in the chemical
composition of wines across quality levels.
In simpler terms, the overall chemical profile of a wine differs significantly from one
quality level to another.
Explanation:
groups=ds['quality']: Specifies the independent variable (quality) by which the data
are grouped.
alpha=0.05: The significance level is set at 5%, meaning we consider a result
significant if the p-value is less than 0.05.
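A minimal sketch of such a Tukey HSD call, using pairwise_tukeyhsd from statsmodels, is shown below. The alcohol values are simulated and the three quality scores were chosen only for illustration; the endog argument (alcohol) is an assumption based on the interpretation that follows.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated alcohol values (assumption) for three quality scores.
rng = np.random.default_rng(1)
ds = pd.DataFrame({
    "quality": np.repeat([3, 5, 8], 30),
    "alcohol": np.concatenate([
        rng.normal(10.0, 0.7, 30),
        rng.normal(10.2, 0.7, 30),
        rng.normal(12.1, 0.7, 30),
    ]),
})

# endog is the measured variable; groups is the factor defining the groups.
tukey = pairwise_tukeyhsd(endog=ds["alcohol"], groups=ds["quality"], alpha=0.05)
print(tukey.summary())

# One entry per pair, in the order (3, 5), (3, 8), (5, 8).
mean_diffs = tukey.meandiffs
rejected = tukey.reject
```

Each row of the summary gives the mean difference for one pair of groups, its adjusted p-value, a confidence interval, and whether the null hypothesis of equal means is rejected at the chosen alpha.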
Interpretation:
The Tukey HSD test compares the alcohol content across different wine quality levels. The
output shows several significant pairwise differences:
Low Quality vs. Premium Quality shows a significant difference (reported p-value of 0.0,
i.e., p < 0.001) with a mean alcohol content difference of 2.13%.
Other significant differences are seen between Low Quality and High Quality, and
Below Average vs. Premium Quality.
This indicates that higher quality wines tend to have significantly higher alcohol content
compared to lower quality wines, confirming alcohol content as a distinguishing factor for wine
quality.
Visualizations
Boxplot with Jitter
1. Boxplot Creation:
o A boxplot is generated using sns.boxplot, showing the distribution of alcohol
content for each quality group. The y-axis represents alcohol content, and the x-
axis denotes wine quality scores.
2. Adding Jitter:
o The code adds a strip plot (scatter plot overlaid on the box plot) with jitter to
display individual data points. This helps to visualize the distribution of data
points, especially when many overlap.
3. Plot Customization:
o Titles and axis labels are added to clarify the meaning of the plot: "Wine Quality
Score" on the x-axis and "Alcohol Content (%)" on the y-axis.
4. Quality Labels:
o The x-axis labels are updated with descriptive names for the quality levels,
ranging from 'Very Poor' to 'Excellent', making the plot easier to interpret.
5. Displaying the Plot:
o The code finalizes the layout and shows the plot.
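The five steps above could be sketched as follows. The data are simulated, and the four intermediate quality labels between 'Very Poor' and 'Excellent' are hypothetical placeholders, since the text names only the two endpoints.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data (assumption): alcohol rises slightly with quality score.
rng = np.random.default_rng(2)
ds = pd.DataFrame({
    "quality": np.repeat([3, 4, 5, 6, 7, 8], 25),
    "alcohol": np.concatenate(
        [rng.normal(9.5 + 0.5 * i, 0.8, 25) for i in range(6)]
    ),
})

fig, ax = plt.subplots(figsize=(8, 5))
# 1. Boxplot of alcohol content per quality score.
sns.boxplot(data=ds, x="quality", y="alcohol", ax=ax)
# 2. Jittered strip plot overlaid to show the individual points.
sns.stripplot(data=ds, x="quality", y="alcohol", jitter=True,
              color="black", alpha=0.4, size=3, ax=ax)
# 3. Title and axis labels.
ax.set_title("Alcohol Content by Wine Quality")
ax.set_xlabel("Wine Quality Score")
ax.set_ylabel("Alcohol Content (%)")
# 4. Descriptive tick labels (middle four names are hypothetical).
labels = ["Very Poor", "Poor", "Below Average",
          "Above Average", "Good", "Excellent"]
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
# 5. Finalize the layout; plt.show() would display it interactively.
fig.tight_layout()
```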
Interpreting Results:
This visualization provides a summary of how alcohol content varies across different
wine quality levels. The boxplots show the median, quartiles, and potential outliers, while
individual points give insight into the spread of data within each quality level.
Barplot of Mean Alcohol Content
Interpreting Results:
This plot highlights the differences in mean alcohol content across quality levels. The
error bars show the precision of the mean estimates, and the overall mean provides a
benchmark for comparison. It gives a clear view of which quality levels have higher or
lower average alcohol content compared to others and the overall population.
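A bar plot of this kind could be produced along the following lines. The sample values are made up, and the use of the standard error of the mean for the error bars is an assumption, since the text does not state which error measure was plotted.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: three quality scores with three wines each.
ds = pd.DataFrame({
    "quality": [3, 3, 3, 5, 5, 5, 8, 8, 8],
    "alcohol": [9.0, 10.0, 11.0, 9.5, 10.0, 10.5, 12.0, 12.5, 13.0],
})

# Mean and standard error of the mean per quality level.
summary = ds.groupby("quality")["alcohol"].agg(["mean", "sem"])
overall_mean = ds["alcohol"].mean()

fig, ax = plt.subplots()
ax.bar(summary.index.astype(str), summary["mean"],
       yerr=summary["sem"], capsize=4)
# Dashed reference line at the overall mean as a benchmark.
ax.axhline(overall_mean, linestyle="--", color="gray", label="Overall mean")
ax.set_xlabel("Wine Quality Score")
ax.set_ylabel("Mean Alcohol Content (%)")
ax.legend()
fig.tight_layout()
```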
Heatmap of Statistical Significance Between Wine Quality Levels for Alcohol and
Sulphates Content
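Color Mapping: A custom color map is defined, with white for no significance, light blue
for alcohol significance, light green for sulphates significance, and orange for
significance in both alcohol and sulphates.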
Plotting the Heatmap: The heatmap is generated with annotations showing the numeric
significance levels between quality groups, and custom labels are applied to the x and y
axes using the descriptive quality levels.
Legend Creation: A legend is added to explain the significance patterns, with custom
labels for 'Not Significant', 'Significant for Alcohol Only', 'Significant for Sulphates
Only', and 'Significant for Both'.
Displaying the Plot: The layout is adjusted, and the plot is displayed with the title
“Statistical Significance Pattern Between Quality Levels for Alcohol and Sulphates
Content.”
Interpretation:
This heatmap visualizes the statistical significance of differences in alcohol and sulphates
content between various wine quality levels based on Tukey's HSD test results. The matrix
compares six quality levels: 'Low', 'Below Average', 'Average', 'Above Average', 'High', and
'Premium'. The colors represent different types of significance:
White (0): No significant difference for either alcohol or sulphates between the two
quality levels.
Light Blue (1): Significant difference in alcohol content only.
Light Green (2): Significant difference in sulphates content only.
Orange (3): Significant difference in both alcohol and sulphates.
Key observations:
Significant differences in both alcohol and sulphates are found between lower quality
levels (like 'Low' and 'Below Avg') and higher levels ('High' and 'Premium').
For comparisons like 'Below Avg' vs. 'Above Avg' and 'Average' vs. 'Above Avg',
significant differences are found only in alcohol content.
The highest quality levels ('High' vs. 'Premium') show no significant differences,
suggesting they may have similar alcohol and sulphates profiles.
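The 0 to 3 coding behind the heatmap can be sketched as below. The three pairs marked here are illustrative only and were chosen to mirror the key observations; they are not the full Tukey output, and the helper `mark` is a hypothetical name.

```python
import numpy as np

levels = ["Low", "Below Average", "Average", "Above Average", "High", "Premium"]
n = len(levels)

# Boolean matrices: True where Tukey's HSD flagged the pair as significant.
alcohol_sig = np.zeros((n, n), dtype=bool)
sulphates_sig = np.zeros((n, n), dtype=bool)

def mark(matrix, a, b):
    """Flag the pair (a, b) symmetrically in a significance matrix."""
    i, j = levels.index(a), levels.index(b)
    matrix[i, j] = matrix[j, i] = True

# Illustrative entries only (assumption), mirroring the key observations.
mark(alcohol_sig, "Low", "Premium")
mark(sulphates_sig, "Low", "Premium")
mark(alcohol_sig, "Average", "Above Average")

# 0 = neither, 1 = alcohol only, 2 = sulphates only, 3 = both.
codes = alcohol_sig.astype(int) + 2 * sulphates_sig.astype(int)
print(codes)
```

The resulting integer matrix can then be passed to a heatmap function together with a four-color map (white, light blue, light green, orange).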
Effect Size Visualization and Correlation Analysis
Interpretation:
This visualization provides insight into how large the differences are between various
wine quality levels based on alcohol content. Cohen's d values help quantify the effect
size:
o Small effect (d ≈ 0.2): Minor differences between quality levels.
o Medium effect (d ≈ 0.5): Moderate differences.
o Large effect (d ≈ 0.8): Substantial differences.
This information is critical for understanding which quality groups differ meaningfully
from one another.
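As an illustration, Cohen's d for two quality groups can be computed as in the sketch below. It assumes the common pooled-standard-deviation form of d; the report does not state which variant was used, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical alcohol values for two quality groups.
premium = [12.0, 12.5, 13.0]
low = [9.0, 10.0, 11.0]

d = cohens_d(premium, low)
print(round(d, 3))
```

A positive d means the first group has the higher mean; swapping the arguments flips the sign but not the magnitude.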
Correlation Analysis:
o The code calculates and prints the Pearson correlation coefficient between alcohol
content and wine quality.
o Output: The correlation value (correlation) is displayed, indicating the
strength and direction of the relationship between these two variables.
Interpretation:
This provides an overall measure of how strongly alcohol content is associated with the perceived
quality of wine, supplementing the effect size analysis with an overall trend.
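The correlation step can be sketched with pandas as follows. The five data points are made up for illustration; the variable name correlation follows the text.

```python
import pandas as pd

# Hypothetical sample of alcohol content and quality scores.
ds = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.8],
    "quality": [3, 5, 5, 6, 8],
})

# Series.corr computes the Pearson coefficient by default.
correlation = ds["alcohol"].corr(ds["quality"])
print(f"Pearson correlation between alcohol and quality: {correlation:.3f}")
```

A value near +1 indicates that higher alcohol content goes with higher quality scores, a value near 0 indicates no linear relationship, and a value near -1 indicates an inverse relationship.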
1. Grouping and Aggregating:
The code groups the dataset ds by the 'quality' variable and calculates descriptive
statistics for the 'alcohol' content. It computes the following for each quality level:
o Count: The number of data points.
o Mean: The average alcohol content.
o Standard Deviation (Std Dev): A measure of the spread or variability in alcohol
content.
o Minimum (Min): The lowest alcohol content value.
o Maximum (Max): The highest alcohol content value.
The results are rounded to three decimal places for clarity.
2. Renaming Columns:
The aggregated statistics columns are renamed to more descriptive labels: 'Count', 'Mean',
'Std Dev', 'Min', and 'Max'. The index is labeled as 'Quality Level' to represent the
different quality groups.
3. Displaying Results:
The code prints out the descriptive statistics in a table format, with a clear heading and
separator for readability.
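The three steps above can be sketched as follows. The sample data, the variable name `alcohol_stats`, and the exact heading text are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample: three quality scores with three wines each.
ds = pd.DataFrame({
    "quality": [3, 3, 3, 5, 5, 5, 8, 8, 8],
    "alcohol": [9.0, 10.0, 11.0, 9.5, 10.0, 10.5, 12.0, 12.5, 13.0],
})

# 1. Group by quality and aggregate the alcohol column.
alcohol_stats = (
    ds.groupby("quality")["alcohol"]
    .agg(["count", "mean", "std", "min", "max"])
    .round(3)
)

# 2. Rename the columns and label the index.
alcohol_stats.columns = ["Count", "Mean", "Std Dev", "Min", "Max"]
alcohol_stats.index.name = "Quality Level"

# 3. Print the table with a heading and a separator line.
print("Descriptive Statistics for Alcohol Content by Quality Level")
print("-" * 60)
print(alcohol_stats)
```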
Interpreting Results:
Count: This shows how many observations exist for each quality level, giving an idea of
sample sizes.
Mean: Indicates the average alcohol content for each quality level. Higher mean values
suggest a potential relationship between quality and alcohol content.
Std Dev: The standard deviation reflects the variability of alcohol content within each
quality level. Larger values suggest greater diversity in alcohol content for that quality
level.
Min and Max: These provide the range of alcohol content values, showing the extremes
within each quality group.
7. Conclusion
This study utilized Multivariate Analysis of Variance (MANOVA) to investigate the relationship
between wine quality and various physicochemical properties, including alcohol content, pH, and
volatile acidity. The primary aim was to ascertain whether these properties exhibit significant
variations across different wine quality categories (low, medium, and high quality).
MANOVA Results
The MANOVA test, assessed via Wilks' Lambda, yielded a significant result (Lambda = 0.6394,
p < 0.0001), which led to the rejection of the null hypothesis. This indicates that the
physicochemical properties significantly differ among the wine quality categories.
Following the MANOVA, Tukey’s HSD test was employed for post-hoc analysis to identify
which specific physicochemical properties differ across the wine quality levels:
Alcohol Content:
o Low vs. High Quality: A mean difference of 1.51% alcohol (p < 0.001).
o Low vs. Premium Quality: A mean difference of 2.14% alcohol (p < 0.001).
o Medium vs. Premium Quality: A mean difference of 2.19% alcohol (p < 0.001).
Volatile Acidity:
o Low vs. High Quality: A mean difference of -0.48 (p < 0.001), suggesting that
high-quality wines typically have lower volatile acidity.
Total Sulfur Dioxide and pH:
o Total sulfur dioxide: A mean difference of 31.61 mg/L between low and medium-
quality wines (p = 0.02).
o pH levels: Displayed smaller yet statistically significant differences across some
quality levels.
These findings underscore the significant roles of alcohol content, volatile acidity, total sulfur
dioxide, and pH in distinguishing wine quality categories. Notably, higher alcohol content and
lower volatile acidity are closely linked to higher-quality wines.
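To assess the relationship between physicochemical properties and wine quality, the null
hypothesis that the mean vectors of these properties are equal across quality categories was
tested against the alternative that at least one category differs.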
Despite uncovering significant insights, the study faced limitations, including the presence of
outliers and an imbalanced dataset with fewer samples in the premium quality category.
Nonetheless, the statistically significant findings provide actionable insights for winemakers and
producers.
The evaluation involved advanced statistical methods like MANOVA and Tukey’s HSD to
analyze the relationship between wine's chemical properties and its quality. The primary goal
was to identify which chemical factors significantly contribute to higher wine quality levels,
particularly focusing on alcohol content.
Analysis
The analysis was conducted on a dataset containing chemical features and quality scores of
wines. Both multivariate and pairwise comparisons were performed to gain insights into how
wine quality is impacted by its chemical composition. Significant findings emerged, particularly
concerning the relationship between alcohol content and wine quality.
MANOVA results confirmed that the chemical properties, taken together, differ
significantly across wine quality levels (p-value < 0.0001).
The Tukey’s HSD test highlighted that higher wine quality levels are associated with
significantly elevated alcohol content.
Descriptive statistics and visualizations demonstrated that Premium Quality wines
(Quality 8) exhibited the highest average alcohol content, while lower-quality wines had
significantly less.
The analysis provided valuable insights into how winemakers can enhance wine quality by
adjusting chemical compositions, particularly alcohol levels. The implications for winemakers
are evident: by managing key chemical factors, they can potentially elevate the perceived quality
of their wines and produce more consistent, high-quality products.
Final Thoughts
This analysis yielded a comprehensive understanding of the relationships between wine quality
and its chemical composition, especially the significant role of alcohol content. The application
of MANOVA and Tukey’s HSD provided robust statistical support for these findings, while
visualizations enhanced the accessibility and interpretability of the data.