Xstkfinal

This project report from the Faculty of Applied Science at Ho Chi Minh City University of Technology focuses on analyzing the quality of wine using probability and statistics. It details the dataset properties, data collection methods, and the importance of physicochemical variables in predicting wine quality. The report includes sections on data handling, descriptive statistics, and multiple regression analysis to explore the relationships between various factors affecting wine quality.

Uploaded by

T.T.P
Vietnam National University Ho Chi Minh City

Ho Chi Minh City University of Technology


Faculty of Applied Science
🙞···☼···🙜

PROBABILITY AND STATISTICS


PROJECT REPORT

Class: CC07 Group: 2


Instructor: Dr. Nguyen Tien Dung

No. Student Student ID Faculty


1 Trần Duy Phát 2052644 Applied Science
2 Lý Thanh Thúy Vy 1852885 Chemical Engineering
3 Đào Các Tường 2050025 Chemical Engineering
4 Lý Phổ Phương 2153710 Chemical Engineering
5 Võ Đăng Hoàng Vũ 2153982 Chemical Engineering
Ho Chi Minh City – May, 2023

CONTRIBUTION OF MEMBERS

No. Student Student ID Contribution


1 Trần Duy Phát 2052644 100%
2 Lý Thanh Thúy Vy 1852885 100%
3 Đào Các Tường 2050025 100%
4 Lý Phổ Phương 2153710 100%
5 Võ Đăng Hoàng Vũ 2153982 100%
Table of Contents
CONTRIBUTION OF MEMBERS
1. INTRODUCTION
2. PROBLEM DEFINING
2.1. Definition
2.2. Datasets properties
2.3. Data collection
2.4. Hypothesis
3. HANDLING THE DATA
3.1. Import the dataset
3.2. Data cleaning
3.2.1. Checking N/A
3.2.2. Removing duplicate
3.2.3. Data summary
4. DESCRIPTIVE STATISTICS
4.1. Univariate Analysis
4.1.1. Quality of wine
4.1.2. Level of alcohol
4.1.3. Density of wine
4.1.4. Level of Volatile acidity
4.1.5. Level of Chlorides (level of salt)
4.1.6. Summary
4.2. Bivariate Analysis
4.2.1. Correlation test
4.2.1.1. Theories
5. MULTIPLE REGRESSION
5.1. What is multiple regression?
5.2. Applying into predicting the quality of the wine
6. TOTAL SUMMARY
6.1 Variables affect each other
6.2 Variables affect the wine's quality
REFERENCES
1. INTRODUCTION
Wine is an alcoholic drink typically made from fermented grapes. Yeast consumes
the sugar in the grapes and converts it to ethanol and carbon dioxide, releasing heat in the
process. Different varieties of grapes and strains of yeasts are major factors in different
styles of wine. These differences result from the complex interactions between the
biochemical development of the grape, the reactions involved in fermentation, the grape's
growing environment (terroir), and the wine production process. Many countries enact
legal appellations intended to define styles and qualities of wine. These typically restrict
the geographical origin and permitted varieties of grapes, as well as other aspects of wine
production. Wines can be made by fermentation of other fruit crops such as plum, cherry,
pomegranate, blueberry, currant and elderberry.
Wine has long played an important role in religion. Red wine was associated with
blood by the ancient Egyptians and was used by both the Greek cult of Dionysus and the
Romans in their Bacchanalia; Judaism also incorporates it in the Kiddush, and Christianity
in the Eucharist. Egyptian, Greek, Roman, and Israeli wine cultures are still connected to
these ancient roots. Similarly, the largest wine regions in Italy, Spain, and France have
heritages connected to sacramental wine. Likewise, viticulture traditions in the
Southwestern United States started within New Spain, as Catholic friars and monks first
produced wines in New Mexico and California.
Wine testing is a meticulous process that covers the appearance, the smell, and the
taste of the wine, and ends in a quality score. Wine is always tested before it reaches
the market. Testing is also needed in the R&D department of any winery, because any small
change in the production process can change the quality of the wine.
Wine testing has been a human-based process, which is slow and inefficient. In order to
produce more high-quality wines, machines need to help, and for that to happen they need
a way to understand the wine. Unlike humans, machines cannot judge whether a smell is
good or a taste is delicious; they can only test the wine by its properties. Our goal is
to predict the quality of a wine from its properties.
The wine under test is Vinho Verde, a wine originating from Portugal. Vinho Verde is not
a grape variety; the name refers to wine made from grapes grown in the Vinho Verde
region. This wine was chosen because of the wide variety of wines it covers.
The data set is related to red variants of the wine. Due to privacy and logistic issues,
only physicochemical (inputs) and sensory (the output) variables are available (e.g. there
is no data about grape types, wine brand, wine selling price, etc.).
The dataset is obtained via:
Paulo Cortez, University of Minho, Guimarães, Portugal,
http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho
Verde Region(CVRVV), Porto, Portugal
@2009
2. PROBLEM DEFINING
2.1. Definition
As mentioned, wine testing is a meticulous, human-based process, which is slow and
inefficient. To check the quality of mass-produced wine before it reaches the market,
machines must help.
Machines do not understand intuitive qualities such as a good smell or a delicious
taste. The idea is to measure all the physicochemical properties of the wine and, using
the quality score given by a specialist, try to predict the quality of the wine from
those properties.

2.2. Datasets properties


The dataset lists the physicochemical properties and the quality score of each wine:
Variable | Unit | Description
Input variables
Fixed acidity | g/dm3 | Most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
Volatile acidity | g/dm3 | The amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
Citric acid | g/dm3 | Found in small quantities, citric acid can add 'freshness' and flavor to wines
Residual sugar | g/dm3 | The amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
Chlorides | g/dm3 | The amount of salt (sodium chloride) in the wine
Free sulfur dioxide | mg/dm3 | The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
Total sulfur dioxide | mg/dm3 | Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
Density | g/cm3 | The density of wine is close to that of water, depending on the percent alcohol and sugar content
pH | - | Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
Sulphates | g/dm3 | Potassium sulphate, a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
Alcohol | % by volume | The percent alcohol content of the wine
Output variable
Quality | score between 0 and 10 | The quality of the wine

2.3. Data collection


The dataset is a list of wines that were tested and examined for their quality and
physicochemical properties. Each bottle of wine is examined, its properties are
recorded, and it is then given a score by specialists.
The wines in the list can vary from a sample of new wine to wine from a factory that
needs to be checked before reaching the market. With the scores given in advance, we
hope to succeed in predicting the result.

2.4. Hypothesis
The plan is to first test the dependency among all the input variables, and then test
the dependency of the quality on the physicochemical properties. We can already state
some hypotheses that are likely to hold:
● The amount of CO2 (which gives the wine its well-known gassy taste) can affect
the quality.
● A higher alcohol level may affect the quality.
● The density of the wine can affect the quality.
● The amount of salt in the wine can affect the quality.
● And more.
The main hypothesis is that the physicochemical properties may affect each other. We
do not yet know which property affects which; the analysis will give us the answer.

3. HANDLING THE DATA


3.1. Import the dataset
First, we load the required packages (install any that are missing):
# packages for cleaning
#install.packages("janitor")
#install.packages("dplyr")
library(janitor)
library(dplyr)
# packages for plotting
#install.packages("ggplot2")
#install.packages("GGally")
library(ggplot2)
library(GGally)
# import the dataset
winequality <- read.csv('winequality-red.csv')
The last command imports the dataset and stores it under the name winequality.
From now on, the dataset is referred to in R as winequality.
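One practical caveat, offered as a hedged sketch: the copy of winequality-red.csv distributed by the UCI Machine Learning Repository is semicolon-separated, while other copies are comma-separated. The snippet below writes a tiny hypothetical file (the values are made up for illustration, and the three columns are only a subset of the real header) and shows how the column count reveals a wrong separator.

```r
# Hypothetical three-row sample in the semicolon-separated layout;
# the numbers are illustrative, not taken from the real dataset.
tmp <- tempfile(fileext = ".csv")
writeLines(c("fixed acidity;alcohol;quality",
             "7.4;9.4;5",
             "7.8;9.8;5",
             "11.2;9.8;6"), tmp)

wrong <- read.csv(tmp)             # default sep = ",": each line becomes one field
right <- read.csv(tmp, sep = ";")  # matching separator: three separate columns

ncol(wrong)    # a single column signals the wrong separator
ncol(right)
names(right)   # read.csv replaces spaces with dots, e.g. fixed.acidity
```

If ncol() returns 1 after import, the separator most likely needs to be changed before continuing.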

3.2. Data cleaning


Before analyzing the data, we must check and maybe clean the data (if needed).
3.2.1. Checking N/A
First, we look over the dataset to see whether it contains any N/A values. This can be
done with the following commands:
dim(winequality)
colSums(is.na(winequality))

The first command prints the dimensions of the dataset before any cleaning. The second
returns the number of N/A values in each column:



● fixed acidity: 0
● volatile acidity: 0
● citric acid: 0
● residual sugar: 0
● chlorides: 0
● free sulfur dioxide: 0
● total sulfur dioxide: 0
● density: 0
● pH: 0
● sulphates: 0
● alcohol: 0
● quality: 0
We can see that there are no missing values in our data, which means no cleaning is
needed in this respect.
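As a minimal illustration of what the check reports when values are missing, the sketch below uses a toy data frame with two deliberately missing entries (hypothetical data, not the wine dataset).

```r
# Toy data frame with two deliberately missing values (hypothetical data)
toy <- data.frame(alcohol = c(9.4, NA, 10.2),
                  pH      = c(3.5, 3.2, NA),
                  quality = c(5L, 6L, 5L))

na_per_column <- colSums(is.na(toy))     # number of NA values in each column
na_per_column
anyNA(toy)                               # TRUE as soon as a single NA exists

complete <- toy[complete.cases(toy), ]   # keep only rows without any NA
nrow(complete)
```

Dropping incomplete rows with complete.cases() is only one possible treatment; it is shown here because it is the simplest.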

3.2.2. Removing duplicate


Next, we remove duplicate entries (if there are any) using this command:
winequality <- winequality %>% distinct()
To check whether any duplicate entries were removed, we print the dimensions again:
dim(winequality)

The number of rows has decreased, which means the dataset contained duplicate rows.
Note that, because removal takes only one command, we skip the separate step of checking
for duplicates before removing them.
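For readers who do want the explicit check, a small base-R sketch on hypothetical data counts the duplicated rows before dropping them. Here unique() plays the role of dplyr's distinct().

```r
# Hypothetical data: row 3 repeats row 1
toy <- data.frame(alcohol = c(9.4, 9.8, 9.4, 10.0),
                  quality = c(5L, 5L, 5L, 6L))

n_before   <- nrow(toy)
n_dup      <- sum(duplicated(toy))   # rows that repeat an earlier row
toy_unique <- unique(toy)            # base-R counterpart of dplyr::distinct()
n_after    <- nrow(toy_unique)

c(before = n_before, duplicates = n_dup, after = n_after)
```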

3.2.3. Data summary


After cleaning the dataset, we take a summary look at it. We investigate:
● Dimensions of the dataset, using:
dim(winequality)

● Min, max, mean, median, 1st quartile, and 3rd quartile, using:
summary(winequality)
● Structure of the dataset, using:
str(winequality)

After investigating the dataset, we make the following observations:

● Mean residual sugar level is 5.4 g/l, but there is a sample of very sweet wine with
65.8 g/l (an outlier).
● Mean free sulfur dioxide is 30.5 ppm. The maximum value is 289, which is quite high
given that the 3rd quartile (75%) is 41 ppm.
● The pH of the wine ranges from 2.7 to 4, with a mean of 3.2.
● There are no basic wines in this dataset (no high pH levels).
● Alcohol: the lightest wine is 8%, the strongest is 14.9%.
● The minimum quality mark is 3, the mean is 5.8, and the highest is 9.
Also, as there are outliers in our dataset, we need to keep them in mind when analyzing
the dataset further.
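The report does not state a formal outlier rule. One common convention, shown here only as an assumed example, is to flag values more than 1.5 times the interquartile range beyond the quartiles. The vector below is hypothetical.

```r
# Hypothetical residual sugar values in g/dm3 (not from the dataset)
x <- c(1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.6, 15.5)

q     <- quantile(x, c(0.25, 0.75))   # 1st and 3rd quartiles
iqr   <- q[[2]] - q[[1]]              # interquartile range
lower <- q[[1]] - 1.5 * iqr           # lower fence
upper <- q[[2]] + 1.5 * iqr           # upper fence

outliers <- x[x < lower | x > upper]  # values outside the fences
outliers
```

Flagged values are candidates for inspection, not automatically errors; a very sweet wine can be a genuine observation.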
4. DESCRIPTIVE STATISTICS
4.1. Univariate Analysis
First, we look at some of the variables and their plots, and determine which ones to
examine further.
4.1.1. Quality of wine
As our main concern is the quality of wine, we look at it first.
We have seen the summary for all the data, but we call it again for easier analysis.
We also add a command that builds a frequency table for us.
summary(winequality$quality)
length(winequality$quality)
table(winequality$quality)

As we can see, the quality varies from 3 to 8, and there is no outlier. The scores are
integers, so we plot the pmf of quality with a bin width of 1 and x-axis limits of 2 to 9
(larger than 2 and less than 9).

ggplot(winequality, aes(x = quality)) +
geom_histogram(binwidth = 1, fill = "red") +
scale_x_continuous(breaks = seq(3, 8, 1), limits = c(2, 9)) +
scale_y_sqrt()
We can see that quality has a roughly normal distribution with its peak at 5 and 6 (5 is
slightly higher).
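The empirical pmf behind such a plot can also be computed directly. The following sketch uses a short hypothetical vector of scores rather than the real column.

```r
# Hypothetical quality scores (not from the dataset)
quality <- c(5, 5, 6, 6, 5, 7, 4, 6, 5, 8)

counts <- table(quality)       # frequency of each score
pmf    <- prop.table(counts)   # relative frequencies; they sum to 1
pmf

mode_score <- names(pmf)[which.max(pmf)]   # most frequent score, as a string
mode_score
```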

4.1.2. Level of alcohol


The alcohol level is one of the important properties that we want to look at. We
follow the same steps, first getting its summary. As it varies far more than quality, we
do not build a table for it (the variation is visible once it is plotted).

summary(winequality$alcohol)

Then we plot it.

The alcohol level distribution looks skewed. The most frequent alcohol level is 9.5%,
and the mean is 10.49%.
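The plotting command for alcohol does not appear in the text. A minimal sketch, assuming a histogram analogous to the other univariate plots, is given below. The data are simulated to be right-skewed, and the bin width of 0.1 is an assumption, not a value taken from the report.

```r
library(ggplot2)

# Simulated right-skewed alcohol levels (hypothetical, not the wine data)
set.seed(1)
toy <- data.frame(alcohol = 8.5 + rgamma(200, shape = 2, scale = 1))

p <- ggplot(toy, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.1, fill = "red") +   # assumed bin width
  labs(x = "alcohol (% by volume)", y = "count")
p
```

With the real data, toy would be replaced by winequality.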

4.1.3. Density of wine


Density is another property that we want to look at. The steps are the same as for
the alcohol level, so we shorten the explanation:

summary(winequality$density)

Then we plot it, using a log10 scale to see the distribution more clearly:

ggplot(winequality, aes(x = density)) +
geom_histogram(binwidth = 0.0002, fill = "red") +
scale_x_log10(limits = c(min(winequality$density), 1.00370),
breaks = seq(min(winequality$density), 1.00370, 0.002))
Looking at the summary, we see two outliers: 1.0103 and 1.03898. To see the distribution
of density more clearly, we used a log10 scale and limited the x-axis range. With these
adjustments, the density distribution of the wine looks approximately normal.

4.1.4. Level of Volatile acidity


A high level of volatile acidity can lead to a bad taste, so we look at it as well:
Summary:
summary(winequality$volatile.acidity)

Plotting:
ggplot(winequality, aes(x = volatile.acidity)) +
geom_histogram(binwidth = 0.001, fill = "red") +
scale_x_log10(breaks = seq(min(winequality$volatile.acidity),
max(winequality$volatile.acidity), 0.1))
On the log10 scale, volatile acidity looks approximately normally distributed.

4.1.5. Level of Chlorides (Level of salt)


As mentioned before, a high level of salt in the wine is not good, so we look at it as
well:
Summary:
summary(winequality$chlorides)

Plotting:
ggplot(winequality, aes(x = chlorides)) +
geom_histogram(binwidth = 0.01, fill = "red") +
scale_x_log10(breaks = seq(min(winequality$chlorides), max(winequality$chlorides), 0.1))
The chlorides distribution is initially skewed, so we used a log10 scale to see the
distribution more clearly.

4.1.6. Summary
After examining all the properties, we decided to show only some of them, as it is
unnecessary to present every plot.
We can see that some of the variables are approximately normally distributed and some
are quite skewed.

4.2. Bivariate Analysis


4.2.1. Correlation test.
4.2.1.1. Theories.
Correlation analysis is a statistical method used to evaluate the strength of the
relationship between two or more quantitative variables. A high correlation coefficient
means that the variables have a strong relationship with each other, while a weak
correlation means that the variables are hardly associated.
Statistical correlation is measured by the coefficient of correlation (r). Its numerical
value ranges from +1.0 to -1.0. It gives us an indication of both the strength and the
direction of the relationship between variables.
In general, r > 0 indicates a positive relationship while r < 0 signals a negative
relationship. r = 0 indicates no linear relationship between the variables (which does
not by itself imply that they are independent).
r = +1.0 describes a perfect positive correlation and r = -1.0 describes a perfect
negative correlation. The closer the coefficient is to +1.0 or -1.0, the stronger the
relationship between the variables.

4.2.1.2. Methodologies.
The method that we use to perform correlation analysis:
• Pearson correlation formula:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
The p-value can then be determined via the t-value, which follows the t-distribution
with n - 2 degrees of freedom:
$$t = \frac{r}{\sqrt{1 - r^2}} \cdot \sqrt{n - 2}$$
If the p-value is below 5%, the correlation between the variables is considered
statistically significant.
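To make the two formulas concrete, the sketch below evaluates them by hand on a small hypothetical sample and compares the result with R's built-in cor.test(). The two-sided p-value is an assumption about the test the report has in mind.

```r
# Hypothetical paired observations (not from the wine dataset)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
n <- length(x)

# Pearson correlation coefficient, term by term as in the formula
r <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# t statistic with n - 2 degrees of freedom and its two-sided p-value
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Cross-check against the built-in test
builtin <- cor.test(x, y, method = "pearson")
all.equal(r, unname(builtin$estimate))
all.equal(p_value, builtin$p.value)
```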

4.2.2. Linear regression


Regression analysis is a collection of statistical tools that are used to model and
explore relationships between variables that are related in a nondeterministic manner.
Multiple linear regression attempts to model a linear relationship between a dependent
variable (response) and several independent variables (predictors/regressors).
We start from simple linear regression with a single predictor. Consider X_i and Y_i
where i = 1, 2, 3, ..., n. The model states that:
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
where \alpha and \beta are regression coefficients and \varepsilon_i is a random error
that follows the normal distribution, \varepsilon_i \sim N(0, \sigma^2).

In practice, we can extend this to j predictors X_{i1}, X_{i2}, ..., X_{ij} for each
observation i, which gives:
$$Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_j X_{ij} + \varepsilon_i$$
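As a quick check of what fitting such a model does, the sketch below simulates data from a known model with two predictors and recovers the coefficients with lm(). All names and numbers are hypothetical.

```r
# Simulate from Y = 1 + 2*x1 - 0.5*x2 + noise (hypothetical coefficients)
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

fit <- lm(y ~ x1 + x2)   # least-squares estimates of alpha, beta_1, beta_2
coef(fit)                # should be close to 1, 2 and -0.5
```

The estimates differ slightly from the true values because of the random error term.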

4.2.3. Relationship visualization.


As stated in the hypothesis, we also want to find the relations between the variables,
if there are any. We calculate the correlation between the variables.
We can calculate the correlations between all the variables at once, then look only at
the pairs with high correlation. The correlations are calculated with the Pearson
correlation formula.
We can also plot the data and the linear regression line for every pair with a single
command:
ggpairs(winequality)
From the plot, we can see that these pairs have high correlation:
● citric acid and fixed acidity: 0.667.
● density and fixed acidity: 0.670.
● pH and fixed acidity: -0.687
● citric acid and volatile acidity: -0.551
● pH and citric acid: -0.550
● total sulfur dioxide and free sulfur dioxide: 0.667.
● alcohol and density: -0.505.
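The pairs listed above can also be extracted numerically instead of being read off the plot. The sketch below does this on simulated data. The 0.5 cut-off is an assumption chosen to match the magnitudes in the list, and the column names a, b, c are hypothetical.

```r
# Simulated data: b is built from a, c is unrelated noise (hypothetical)
set.seed(7)
n <- 300
a <- rnorm(n)
toy <- data.frame(a = a,
                  b = a + rnorm(n, sd = 0.5),
                  c = rnorm(n))

cm <- cor(toy, method = "pearson")         # full correlation matrix
cm[upper.tri(cm, diag = TRUE)] <- NA       # keep each pair only once

pairs <- as.data.frame(as.table(cm))       # long format: Var1, Var2, Freq
names(pairs) <- c("var1", "var2", "r")
pairs <- pairs[!is.na(pairs$r) & abs(pairs$r) > 0.5, ]   # assumed threshold
pairs
```

With the real data, toy would be replaced by winequality.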
These correlations need to be double-checked by plotting. We ignore the pairs between
pH and the acids or acidity, as their relation is already well known. The method we use
is linear regression: we draw the data points and the fitted line.
To do so, we write our own function for easier coding:
f <- function(dataset, x, y) {
# x and y are column names given as strings
ggplot(dataset, aes(x = .data[[x]], y = .data[[y]])) + # the plot
geom_point(alpha = 1/5, position = position_jitter(height = 0), size = 2) + # plot points
geom_smooth(method = 'lm') + # plot the regression line
labs(x = x, y = y) # label the axes with the column names
}
The function plots the points and the fitted regression line for us.
● Citric acid and Fixed acidity: 0.667.
# Citric acid and fixed acidity
p <- f(winequality, "citric.acid", "fixed.acidity")
p + coord_cartesian(xlim=c(min(winequality$citric.acid),max(winequality$citric.acid)),
ylim=c(min(winequality$fixed.acidity),max(winequality$fixed.acidity)))

● Density and Fixed acidity: 0.670.


# Density and fixed acidity
p <- f(winequality, "density", "fixed.acidity")
p + coord_cartesian(xlim=c(min(winequality$density),max(winequality$density)),
ylim=c(min(winequality$fixed.acidity),max(winequality$fixed.acidity)))
● Citric acid and Volatile acidity: -0.551
# citric acid and volatile acidity
p <- f(winequality, "citric.acid", "volatile.acidity")
p + coord_cartesian(xlim=c(min(winequality$citric.acid),max(winequality$citric.acid)),
ylim=c(min(winequality$volatile.acidity),max(winequality$volatile.acidity)))
● Total sulfur dioxide and free sulfur dioxide: 0.667.
# total sulfur dioxide and free sulfur dioxide
p <- f(winequality, "free.sulfur.dioxide", "total.sulfur.dioxide")
p + coord_cartesian(xlim=c(min(winequality$free.sulfur.dioxide),max(winequality$free.sulfur.dioxide)),
ylim=c(min(winequality$total.sulfur.dioxide),max(winequality$total.sulfur.dioxide)))

● Alcohol and density: -0.505.


# density vs. alcohol plot
p <- f(winequality, "density", "alcohol")
p + coord_cartesian(xlim=c(min(winequality$density),max(winequality$density)),
ylim=c(min(winequality$alcohol),max(winequality$alcohol)))
The plots are consistent with the calculated correlations.
Summary:
After analyzing the variables pair by pair, we found relations between:
● Citric acid and Fixed acidity, correlation value: 0.667.
● Density and Fixed acidity, correlation value: 0.670.
● Citric acid and Volatile acidity, correlation value: -0.551
● Total sulfur dioxide and Free sulfur dioxide, correlation value: 0.667.
● Alcohol and Density, correlation value: -0.505.
5. MULTIPLE LINEAR REGRESSION
5.1. What is multiple regression?
In practice, checking dependency pair by pair is inefficient. Multiple linear
regression estimates all the coefficients at once, which is especially convenient when
the calculation is done by machine.
Starting from the simple linear regression formula, we extend the model to j predictors
X_{i1}, X_{i2}, ..., X_{ij} for each observation i, which gives:
$$Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_j X_{ij} + \varepsilon_i$$

5.2. Applying into predicting the quality of the wine:


We use multiple linear regression to test which variables affect the wine's quality,
with the following code:
abc <- glm(quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
sulphates + alcohol, data = winequality)

summary(abc)
The first command builds the model with all the variables; the second displays its
summary on the screen.

We look at the variables whose p-value is nearly 0 (marked with the code *** in the
output):
● volatile acidity with the coefficient of: -1.1204370
● chlorides with the coefficient of: -1.9302567
● total sulfur dioxide with the coefficient of: -0.0027073
● sulphates with the coefficient of: 0.9147023
● alcohol with the coefficient of: 0.2895307
Each coefficient gives the expected change in the quality score when that variable
increases by one unit while the other variables are held fixed. For example, with the
coefficient of -1.12, each increase of 1 g/dm3 in volatile acidity decreases the
predicted quality score by about 1.12 points.
As we can see, several variables affect the result.
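Reading the starred rows off the printed summary can also be done programmatically. The sketch below fits the same kind of model on simulated data and keeps the rows with a p-value below 0.001, the level that R marks with ***. The data and coefficients are hypothetical.

```r
# Simulated data with two real effects and one pure-noise predictor (hypothetical)
set.seed(3)
n <- 500
toy <- data.frame(volatile.acidity = runif(n, 0.1, 1.2),
                  alcohol          = runif(n, 8, 14),
                  noise            = rnorm(n))
toy$quality <- 3 - 1.1 * toy$volatile.acidity + 0.3 * toy$alcohol +
  rnorm(n, sd = 0.6)

fit   <- glm(quality ~ volatile.acidity + alcohol + noise, data = toy)
coefs <- summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)

# Rows that would carry *** in the printed summary
signif_rows <- coefs[coefs[, "Pr(>|t|)"] < 0.001, , drop = FALSE]
signif_rows

# Predicted change in quality for +0.1 g/dm3 volatile acidity, others held fixed
effect <- 0.1 * coef(fit)[["volatile.acidity"]]
effect
```

The last line illustrates the interpretation of a coefficient: a change of 0.1 units in the predictor shifts the prediction by one tenth of the coefficient.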

6. TOTAL SUMMARY
6.1 Variables affect each other
As predicted at the beginning, there are variables that affect each other besides the
obvious ones. The resulting pairs are:
● Citric acid and Fixed acidity, correlation value: 0.667.
● Density and Fixed acidity, correlation value: 0.670.
● Citric acid and Volatile acidity, correlation value: -0.551
● Total sulfur dioxide and Free sulfur dioxide, correlation value: 0.667.
● Alcohol and Density, correlation value: -0.505.

6.2 Variables affect the wine's quality


There are quite a lot of variables that affect the wine quality:
● Volatile acidity with the coefficient of: -1.1204370
● Chlorides with the coefficient of: -1.9302567
● Total sulfur dioxide with the coefficient of: -0.0027073
● Sulphates with the coefficient of: 0.9147023
● Alcohol with the coefficient of: 0.2895307
REFERENCES
[1] Bevans, R. (2022, November 15). Linear Regression in R | A Step-by-Step Guide &
Examples. Scribbr. https://www.scribbr.com/statistics/linear-regression-in-r/
[2] Kelly, L., PhD. (2020, November 20). Practice 9 Calculating Confidence Intervals in R
| R Practices for Learning Statistics. https://bookdown.org/logan_kelly/r_practice/p09.html
