Xstkfinal
CONTRIBUTION OF MEMBERS
2.3. Hypothesis
The plan is to first test the dependence among all the input variables, and then test
the dependence of quality on the physicochemical properties. Before that, we can state
some hypotheses that are likely to hold:
● The amount of CO2 (which gives the wine its famous gassy taste) can affect
the quality.
● More alcohol may affect the quality.
● The density of the wine can affect the quality.
● The amount of salt in the wine can also affect the quality.
● And more...
The main hypothesis is that the physicochemical properties may affect each other. We
do not yet know which one affects which; the tests will give us the answer.
#install.packages("janitor")
library(janitor)
library(dplyr)
# packages for plotting
#install.packages("ggplot2")
#install.packages("GGally")
library(ggplot2)
library(GGally)
winequality <- read.csv('winequality-red.csv')
This command imports the dataset and stores it under the name winequality.
Note that, from now on, the dataset is referred to in R as winequality.
dim(winequality)
colSums(is.na(winequality))
The dimension has decreased, which means there were duplicate rows.
Note that, because removing duplicates takes only one command, we skip checking for
them first and remove them directly.
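The removal command itself is not shown above, so the following is only a minimal sketch of the idea, using base R's `duplicated()` as a stand-in. The data frame `toy` and its values are hypothetical and are not taken from the wine dataset.

```r
# Hypothetical toy data frame with one repeated row (not the real data).
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 10.0),
  quality = c(5, 5, 5, 6)
)

# Dimensions before removing duplicates: 4 rows, 2 columns.
dim(toy)

# Keep only the first occurrence of each row (base R stand-in).
toy_unique <- toy[!duplicated(toy), ]

# Fewer rows afterwards means duplicates were present.
dim(toy_unique)
```

Comparing the two `dim()` results is what reveals that duplicate rows existed, without a separate checking step.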
As we can see, the quality varies from 3 to 8, and there are no outliers. The scores
take whole-number values, so we plot the pmf of quality with limits from 2 to 9 (greater
than 2 and less than 9) and a step of 1.
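As a rough illustration of how such a pmf can be computed before plotting, here is a small base R sketch. The vector `quality` is a hypothetical stand-in for `winequality$quality`, and the grid 3 to 8 is assumed from the range reported above.

```r
# Hypothetical stand-in for winequality$quality (not the real data).
quality <- c(5, 5, 6, 7, 5, 6, 3, 8, 6, 5, 4, 6)

# Count each score on a fixed integer grid so that missing scores show as 0.
scores <- 3:8
counts <- table(factor(quality, levels = scores))

# Divide by the number of observations to obtain the pmf.
pmf <- as.numeric(counts) / length(quality)
names(pmf) <- scores

print(pmf)
```

The resulting `pmf` vector has one probability per integer score and sums to 1, so it can be passed directly to a bar plot with a step of 1 on the x-axis.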
summary(winequality$alcohol)
summary(winequality$density)
We then plot it; to see the distribution more clearly, we use a log10 scale:
Plotting:
# I("red") sets the fill colour literally instead of mapping it as an aesthetic
qplot(volatile.acidity, data = winequality, fill = I("red"), binwidth = 0.001) +
  scale_x_log10(breaks = seq(min(winequality$volatile.acidity),
                             max(winequality$volatile.acidity), 0.1))
Volatile acidity appears approximately normally distributed on the log10 scale.
Plotting:
# I("red") sets the fill colour literally instead of mapping it as an aesthetic
qplot(chlorides, data = winequality, fill = I("red"), binwidth = 0.01) +
  scale_x_log10(breaks = seq(min(winequality$chlorides),
                             max(winequality$chlorides), 0.1))
The chlorides distribution is initially skewed, so we used a log10 scale to see the
distribution more clearly.
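To illustrate why a log10 transform helps with a right-skewed variable, the sketch below compares the sample skewness before and after the transform. The data are simulated (log-normal) and only stand in for a skewed column such as chlorides; the helper `skewness()` is a hypothetical function defined here, not one from a package used in this report.

```r
# Simulated right-skewed data; a hypothetical stand-in for a skewed column.
set.seed(7)
x <- rlnorm(500, meanlog = -2.5, sdlog = 0.5)

# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Skewness on the raw scale and on the log10 scale.
skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

print(c(raw = skew_raw, log10 = skew_log))
```

A skewness much closer to 0 after the transform indicates a more symmetric shape, which is what makes the histogram easier to read on the log10 axis.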
4.1.6. Summary
After examining all the properties, we decided to show only some of them, as it is
unnecessary to see every plot.
We can see that some of the properties are approximately normally distributed and some
are quite skewed.
4.2.1.2. Methodologies.
The method we use to perform correlation analysis:
• Pearson correlation formula:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
The p-value can then be determined from the t-value, which follows the t-distribution
with n − 2 degrees of freedom:
$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$
If the p-value is below 5%, the correlation between the two variables is significant.
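The Pearson coefficient, the t-value and the p-value can be computed step by step and checked against R's built-in `cor.test()`. This is only a sketch: the vectors `x` and `y` are hypothetical toy values standing in for two columns of winequality.

```r
# Hypothetical toy data; x and y stand in for two columns of winequality.
x <- c(7.4, 7.8, 11.2, 7.9, 7.3, 6.7, 8.5, 10.2)
y <- c(0.00, 0.04, 0.56, 0.06, 0.00, 0.08, 0.28, 0.45)
n <- length(x)

# Pearson correlation computed from its definition.
r <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# t statistic with n - 2 degrees of freedom.
t_value <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t-distribution.
p_value <- 2 * pt(-abs(t_value), df = n - 2)

# Built-in equivalent, for comparison.
builtin <- cor.test(x, y, method = "pearson")
print(c(r = r, t = t_value, p = p_value))
```

The manual values should agree with `builtin$estimate`, `builtin$statistic` and `builtin$p.value`; in practice `cor.test()` alone is enough.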
In practice, we can extend this to multiple predictors, writing $X_{ij}$ for the j-th
predictor of observation i, which gives the model:
$$Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_j X_{ij} + \varepsilon_i$$
summary(abc)
The first line builds the model with all the variables; the second displays it on
the screen.
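Since the model-building line is not reproduced here, the following is a sketch of what such a fit may look like, assuming a linear model of quality on all other columns. The data frame `toy`, its simulated values and the name `model` are hypothetical and do not come from the report.

```r
# Hypothetical simulated data standing in for winequality (not the real data).
set.seed(1)
toy <- data.frame(
  alcohol = runif(50, 8, 14),
  sulphates = runif(50, 0.3, 1.2)
)
toy$quality <- 3 + 0.3 * toy$alcohol + 0.9 * toy$sulphates + rnorm(50, sd = 0.5)

# "quality ~ ." regresses quality on every other column of the data frame.
model <- lm(quality ~ ., data = toy)

# Coefficient table: estimate, standard error, t value and p-value per term.
coef_table <- summary(model)$coefficients
print(coef_table)
```

The `Pr(>|t|)` column of this table holds the p-values that `summary()` annotates with significance codes when it prints the full model.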
We want to look at the variables whose p-value is nearly 0 (marked with the
significance code *** after them):
● volatile acidity with the coefficient of: -1.1204370
● chlorides with the coefficient of: -1.9302567
● total sulfur dioxide with the coefficient of: -0.0027073
● sulphates with the coefficient of: 0.9147023
● alcohol with the coefficient of: 0.2895307
Each coefficient measures how strongly the variable affects the quality score. For
example, with the coefficient of -1.12, each one-unit increase in volatile acidity lowers
the predicted quality score by about 1.12 points, holding the other variables fixed.
As we can see, many variables affect the result.
6. TOTAL SUMMARY
6.1 Variables affect each other
As predicted at the beginning, there are variables that affect each other
besides the obvious ones. The resulting pairs are:
● Citric acid and Fixed acidity, correlation value: 0.667.
● Density and Fixed acidity, correlation value: 0.670.
● Citric acid and Volatile acidity, correlation value: -0.551.
● Total sulfur dioxide and Free sulfur dioxide, correlation value: 0.667.
● Alcohol and Density, correlation value: -0.505.
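A list of strongly correlated pairs like the one above can be extracted from a correlation matrix programmatically. The sketch below uses simulated data, and the cutoff of 0.5 in absolute value is an assumption inferred from the listed pairs rather than a threshold stated in this report.

```r
# Simulated data: b is built from a, c is independent (hypothetical stand-in).
set.seed(42)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.5), c = rnorm(100))

# Pearson correlation matrix of all columns.
corr <- cor(toy)

# Keep each pair once (upper triangle) and only those with |r| > 0.5 (assumed cutoff).
idx <- which(upper.tri(corr) & abs(corr) > 0.5, arr.ind = TRUE)

# One row per strongly correlated pair.
strong <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r = corr[idx]
)
print(strong)
```

Applied to the full dataset, the same steps would use `cor(winequality)` in place of `cor(toy)`.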