
Department of Biotechnology, University of Sialkot

Report submission

Semester / MS
2nd / MS Biotechnology

Submitted by

Group # 2
Group members
 Hadia Akram 22104014-022
 Sumbal 22104014-006
 Sumaira 22104014-002
 Shingraf Naz 22104014-015
 Maria Akbar 22104014-012

Submitted to
Sir Waqas

Subject
Advanced Biostatistics

Date of submission
20th August, 2023
Variables:
A variable is a characteristic that can be measured and can assume different values. Variables are categorized into two types:
 Qualitative: Variables that express qualitative attributes such as colour, race, gender, etc.
 Quantitative: Variables which have numeric values. Quantitative data may be discrete or continuous.
Discrete: Quantitative data that is countable is called discrete data. Example: number of students, number of employees in a company.
Continuous: Quantitative data that is measurable is called continuous data. Example: height, weight.

Histogram
A histogram is a graphical representation of continuous data. It is used to check the normality of the data. If the graph shows a single peak in the middle (bell-shaped), the data is normally distributed.

Example to construct histogram:


No. of students    Height of students (cm)
1                  141
2                  143
3                  145
4                  145
5                  147
6                  152
7                  143
8                  144
9                  149
10                 141
11                 138
12                 143
Mr. Larry, a famous doctor, is researching the heights of students studying in the 8th standard. He has gathered 12 students and wants to know the height range into which the largest number of them falls.
Input:
height=c(141,143,145,145,147,152,143,144,149,141,138,143)
hist(height)
hist(height, xlab="height of person", ylab="frequency", xlim=c(138,152), ylim=c(0,10))

Output:
> height=c(141,143,145,145,147,152,143,144,149,141,138,143)
> hist(height)
> hist(height, xlab="height of person", ylab="frequency", xlim=c(138,152), ylim=c(0,10))

Interpretation:
The peak of the graph tells us that most of the students' heights lie between 142 and 144 cm.
The peak is toward the left (at the start), which shows that the data is positively skewed.

Box plot:
A box and whisker plot, also called a box plot, displays the five-number summary of a set of data. The five-number summary is the:
1) Minimum value
2) First quartile
3) Median
4) Third quartile
5) Maximum value

Input:
boxplot(height)
summary(height)

Output:
> boxplot(height)
> summary(height)
Min. 1st Qu. Median Mean 3rd Qu. Max.
138.0 142.5 143.5 144.2 145.5 152.0

Interpretation:
This box plot gives us the following information:
Minimum value = 138
"1st Qu." means the first quartile (25th percentile) = 142.5
"Median" is the middle value = 143.5
"3rd Qu." is the third quartile (75th percentile) = 145.5
"Max." is the maximum value = 152.0

Measure of dispersion
The measures of dispersion help to interpret the variability of data. They show how squeezed or scattered the variable is. Several measures of dispersion help us gain more insight into the data:
 Range
 Variance
 Standard Deviation
 Coefficient of variation

Example:
Suppose XUZ Pvt. Ltd is a company where each of its 19 employees spends the following amount (in dollars) on lunch:
money_spent=c(5,3,4,10,12,15,11,13,9,11,8,6,25,23,21,19,17,20,16)
How can we use R to compute measures of dispersion for these expenditures, considering their spread and variability?

1. Range
The range is the difference between the largest and smallest values of the data. It is the simplest measure of dispersion. For the data given above:

 Range = max. value - min. value
       = 25 - 3
       = 22
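R's range() function does not return this difference directly; it returns the minimum and maximum as a pair. A minimal sketch, using the money_spent vector defined above:

Command
range(money_spent)                    # returns the pair c(3, 25)
diff(range(money_spent))              # difference of that pair: the range, 22
max(money_spent) - min(money_spent)   # equivalent computation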
2. Variance
The variance is the square of the standard deviation. In RStudio, the following
command is used.
Command for variance is:
var(money_spent)
Output
> var(money_spent)
[1] 43.05263

3. Standard deviation
The standard deviation is the square root of the variance. In RStudio, the
following command is used:
Command
sd(money_spent)
Output
> sd(money_spent)
[1] 6.56145

4. Coefficient of variation
A lower CV indicates lower relative variability, meaning the data points are relatively
close to the mean. Conversely, a higher CV suggests higher relative variability, indicating
that the data points are more spread out from the mean.
We can find coefficient of variation by following formula:

CV = (Standard Deviation / Mean) x 100


The resulting CV value is expressed as a percentage.
Command
cv=(sd(money_spent)/mean(money_spent)*100)
cv
Output
> cv=(sd(money_spent)/mean(money_spent)*100)
> cv
[1] 50.2691

 Interpretation
The output suggests that the money_spent (lunch) dataset has a variance of 43.05263 and a standard deviation of 6.56145. The coefficient of variation (CV) is calculated to be 50.26918.
Interpreting these results, we can say that the money_spent dataset exhibits a moderate degree of variability. The CV value of 50.26918 indicates that the standard deviation is approximately 50.27% of the mean. This suggests that the data points in the dataset are relatively spread out from the mean value.
Importance of measure of dispersion

 Understanding Data Spread: Measures of dispersion give you an idea of how spread out or clustered the data points are. If the dispersion is high, the data points are widely scattered. If the dispersion is low, the data points are closer to the center.
 Comparing Datasets: When comparing two or more datasets, measures of dispersion
help you determine which dataset has greater variability. This can provide insights into
the consistency or volatility of different data sets.
 Assessing Data Quality: High dispersion might indicate inconsistencies or errors in
data collection. It can help identify outliers, which are data points significantly different
from the rest of the data and might need further investigation.
 Analysis: Measures of dispersion are important for various statistical analyses. They can affect the assumptions and outcomes of different statistical methods.
Regression Analysis:-
Regression analysis is a powerful statistical method that allows you to examine the
relationship between two or more variables of interest.
 Dependent Variable: This is the main factor that you're trying to understand or predict.
 Independent Variables: These are the factors that you hypothesize have an impact
on your dependent variable.
The two basic types of regression are simple linear regression and multiple linear regression,
although there are non-linear regression methods for more complicated data and analysis.
Simple linear regression uses one independent variable to explain or predict the outcome of
the dependent variable Y, while multiple linear regression uses two or more independent
variables to predict the outcome. In order for regression results to be properly interpreted,
several assumptions about the data and the model itself must hold.
Simple linear regression:

Equation:
y = β0 + β1x + ε, where β0 is the intercept, β1 is the slope, and ε is the error term.

Null Hypothesis (H0): There is no linear relationship between x and y, meaning β1 = 0.
Alternative Hypothesis (Ha): There is a linear relationship between x and y, meaning β1 ≠ 0.
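As a minimal sketch of fitting a simple linear regression in R (the x and y vectors below are purely illustrative, not data from this report):

x=c(1,2,3,4,5)              # illustrative predictor
y=c(2.1,3.9,6.2,8.1,9.8)    # illustrative response
model=lm(y~x)               # fits y = b0 + b1*x by least squares
summary(model)              # the t-test on the x coefficient addresses H0: beta1 = 0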

Multiple linear regression:

Equation:
y = β0 + β1x1 + β2x2 + … + βkxk + ε

Null Hypothesis (H0): β1 = β2 = β3 = … = βk = 0.
Alternative Hypothesis (H1): At least one β is significant.
Example:

To read the data from a file: command

read=read.csv(choose.files())

Result:
> read=read.csv(choose.files())
> read
Y X1 X2 X3
1 43 22 18 20
2 32 20 19 19
3 34 21 13 18
4 41 20 20 17
5 54 24 19 21
6 52 22 22 21
7 34 22 16 21
8 34 20 18 21
9 39 22 19 22
10 44 21 17 19

Result:
> price=read$Y; food=read$X1; decor=read$X2; service=read$X3
> price
[1] 43 32 34 41 54 52 34 34 39 44
To build the regression model:
command
reg=lm(price~food+decor+service)
Result
> reg=lm(price~food+decor+service)
> reg
Call:
lm(formula = price ~ food + decor + service)

Coefficients:
(Intercept)      food     decor   service
    -65.872     5.305     1.831    -2.014
The fitted equation has the form Ŷ = a + b1X1 + b2X2 + b3X3:

Price = -65.872 + 5.305(food) + 1.831(decor) - 2.014(service)

For restaurant 1 (food = 22, decor = 18, service = 20):

Ŷ = -65.872 + 5.305(22) + 1.831(18) - 2.014(20) = 43.516

Residual = observed value - expected (fitted) value:
Residual = y - Ŷ = 43 - 43.516 = -0.516

This is the residual for the restaurant 1 price. (A property of least squares is that the residuals sum to zero: Σ(yi - Ŷi) = 0.)
The residuals can be extracted using the command:
reg$residuals
Result
> reg$residuals
1 2 3 4 5 6 7 8 9 10
0.2014636 -0.4566691 -0.6130033 0.3186037 0.1977655 0.6641691 -0.3500792 -0.5867071 -0.1946958 0.8204527

reg$fitted.values
Result:
> reg$fitted.values
1 2 3 4 5 6 7 8 9 10
104.7985 115.4567 116.6143 116.6814 111.8022 120.3358 121.3501 110.5867 110.1947 113.1795

From these fitted values we can calculate the residual of every other observation, one by one, as the observed value minus the fitted value.
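Rather than subtracting one by one, the whole residual vector can be obtained in a single step; a sketch, assuming the price vector and the reg model object created above:

price - reg$fitted.values   # observed minus fitted, for every restaurant at once
reg$residuals               # lm() stores the same vector inside the model object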

summary(reg)
> summary(reg)
Call:
lm(formula = price ~ food + decor + service)
Residuals:
    Min      1Q  Median      3Q     Max
-4.7337 -2.1235 -0.2114  2.5769  5.6227
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -65.8716    24.2595  -2.715  0.03486
food          5.3048     1.2994   4.082  0.00648
decor         1.8306     0.5758   3.179  0.01909
service      -2.0144     1.0628  -1.895  0.10686

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.032 on 6 degrees of freedom
Multiple R-squared: 0.8173, Adjusted R-squared: 0.726
F-statistic: 8.95 on 3 and 6 DF, p-value: 0.01238

Interpretation:-
Multiple R-squared: 0.8173
Adjusted R-squared: 0.726
p-value of the model: 0.01238
p-value of food: 0.00648
p-value of decor: 0.01909

Hypothesis:
H0: β1 = β2 = β3 = 0
H1: At least one regression coefficient is significant.
Food (β1) and decor (β2) are statistically significant, as their p-values are less than 0.05, so we accept the alternative hypothesis for them, while service (β3) is not statistically significant because its p-value (0.10686) is greater than 0.05. The non-significant variable (service) will not be included in the regression model.

Model goodness:-
The model fits well, with an adjusted R-squared of 0.726.

Correlation/correlation co-efficient
Correlation, specifically the correlation coefficient, is a statistical measure that quantifies the
strength and direction of the linear relationship between two variables. It tells us how closely
two variables vary together

Command:
plot(price, decor)
Result:

Scatter plot interpretation:
The scatter plot between price and decor shows a positive relationship.

cor(price, decor)
Result:
> cor(price, decor)
[1] 0.5491141
Interpretation:
The correlation coefficient (r) between price and decor is 0.5491141, which lies in the range 0.26 ≤ r ≤ 0.74, indicating a moderate positive correlation.

(Reference: http://rstudio-pubs-static.s3.amazonaws.com/536833_75890191109d4f6baef4f69fa6390856.html#multiple-regression)
t-test analysis
The t-test is a statistical procedure commonly available in programming languages like R and other statistical software. It is used to perform hypothesis tests that assess whether there is a significant difference between the means of two groups, or between the mean of a sample and a known population mean.
Important points:
1. P value: It is a probability value; the decision about the hypotheses is made on its basis. If the p value is greater than 0.05, we accept the null hypothesis and reject the alternative hypothesis; if it is less than 0.05, we accept the alternative hypothesis and reject the null hypothesis. A hypothesis is a statement or proposition about a population parameter or a relationship between variables that can be tested using data. There are two types of hypothesis.

 Null Hypothesis (H0):


The null hypothesis is a statement of no effect or no difference. It suggests that any
observed differences or effects in the data are due to random chance or variability. It is
denoted by H0
H0: μ = μo
 Alternative Hypothesis (H1):
The alternative hypothesis is a statement that contradicts the null hypothesis. It proposes a
specific effect, difference, or relationship that is being investigated. It is denoted by H1
H1: μ ≠ μo
μ: Population parameter of interest
μ0: Specific value hypothesized for the population parameter under the null
hypothesis

2. T-value: The t-value is a statistical measure used in hypothesis testing, specifically in t-tests. It quantifies the difference between a sample statistic and a hypothesized population parameter.

3. Confidence interval: A confidence interval (CI) is a statistical range of values used to estimate a population parameter with a certain level of confidence. It provides a way to quantify the uncertainty associated with a sample statistic (like a mean, proportion, or difference) and gives a range within which the true population parameter is likely to fall. A 95 percent confidence interval indicates that we are 95 percent confident that the true population mean lies within the interval.

4. Level of significance: The level of significance, often denoted by the symbol "α" (alpha), is a critical concept in hypothesis testing and statistical analysis. It represents the threshold at which you are willing to reject the null hypothesis when conducting a statistical test. For example, a level of significance of 5 percent means the chance of a Type I error is 5 percent.
5. Shapiro test: The Shapiro-Wilk test is used to assess whether a dataset follows a normal distribution. The command used for the Shapiro test is shapiro.test(variable). The null hypothesis of the Shapiro-Wilk test states that the data follows a normal distribution.
H0: the data is normal (supported when the p value is greater than 0.05)

The alternative hypothesis contradicts the null hypothesis and suggests that the data does not follow a normal distribution.
H1: the data is not normal (supported when the p value is less than 0.05)
Types of t-test
There are different types of t-test such as
 One sample t-test
 Paired sample t-test
 Two sample t-test

One sample t-test

This test is used to estimate the mean of single population. It is useful when you have a
sample and want to test whether the sample mean is significantly different from a specific
value.
Example:
Researchers are interested in whether the resting pulse rate of long-distance runners differs from a hypothesized population value. They randomly take a sample of 8 long-distance runners, measure their resting pulses, and obtain the following data: 45, 42, 64, 54, 58, 49, 48, 56. Perform a one-sample t-test on this data.
Input:
pulse_rate=c(45,42,64,54,58,49,48,56)
shapiro.test(pulse_rate)
t.test(pulse_rate)

Output:
> pulse_rate=c(45,42,64,54,58,49,48,56)
> shapiro.test(pulse_rate)

Shapiro-Wilk normality test


data: pulse_rate
W = 0.9761, p-value = 0.9411
> t.test(pulse_rate)
One Sample t-test
data: pulse_rate
t = 20.122, df = 7, p-value = 1.875e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
45.88912 58.11088
sample estimates:
mean of x
52

Interpretation:
 The Shapiro-Wilk test is used to assess whether a dataset follows a normal distribution. In this case, the test was applied to the "pulse_rate" data. As the p value (0.9411) is more than 0.05, the null hypothesis holds and the data is normal. We can then apply the t-test.
 The t-test p-value is 1.875e-07, which is less than 0.05, so we accept the alternative hypothesis and reject the null hypothesis.
H0: μ = 0
H1: μ ≠ 0
 The degrees of freedom are 7 (the number of observations minus 1). The 95 percent confidence interval indicates that we are 95 percent confident that the true population mean of the "pulse_rate" lies within the interval of 45.88912 to 58.11088.
 The sample mean (52) represents the average "pulse_rate" value observed in the given
sample.
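Note that t.test(pulse_rate) compares the sample mean against 0 by default. To test against a specific hypothesized population mean, the mu argument of t.test() is used; a sketch (the value 72 below is purely hypothetical):

t.test(pulse_rate, mu=72)   # tests H0: true mean pulse rate equals 72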

Paired sample t-test:

A paired sample t-test is a statistical test used to compare the means of two related or paired
samples. It is suitable for situations where each observation in one sample is directly related
or matched to an observation in the other sample.
Example
A study was conducted to investigate the effectiveness of a new treatment on a group of
individuals. "Before" and "after" measurements were taken for each individual. The "before"
measurements (in mmHg) are as follows: 213.4, 225.0, 217.0, 183.7, 197.2, 223.6, 224.2,
215.2, and 202.4. The "after" measurements (in mmHg) are: 200.1, 216.4, 195.6, 175.0,
201.3, 214.8, 215.7, 200.7, and 211.7. Perform a paired-sample t-test to determine if there is a
statistically significant difference between the blood pressure readings before and after the
treatment?
Input
# blood pressure before treatment
before=c(213.4,225.0,217.0,183.7,197.2,223.6,224.2,215.2,202.4)
# blood pressure after treatment
after=c(200.1,216.4,195.6,175.0,201.3,214.8,215.7,200.7,211.7)
d=before-after
shapiro.test(d)
t.test(before, after, paired=TRUE)

Output:
> # blood pressure before treatment
> before=c(213.4,225.0,217.0,183.7,197.2,223.6,224.2,215.2,202.4)
> # blood pressure after treatment
> after=c(200.1,216.4,195.6,175.0,201.3,214.8,215.7,200.7,211.7)
> d=before-after
> d
[1] 13.3 8.6 21.4 8.7 -4.1 8.8 8.5 14.5 -9.3
> shapiro.test(d)
Shapiro-Wilk normality test
data: d
W = 0.90194, p-value = 0.2634
> t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = 2.514, df = 8, p-value = 0.03615
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.6471293 14.9973151
sample estimates:
mean difference
7.822222

Interpretation:
 The p value of the Shapiro test is 0.2634, which is more than 0.05; this means the paired differences are normal. As the data is normal, we can apply the t-test.
 The p value is 0.03615, which is less than 0.05, so we accept the alternative hypothesis and reject the null hypothesis.
H0: μ1 - μ2 = 0
H1: μ1 - μ2 ≠ 0
 The degrees of freedom are 8. The 95 percent confidence interval indicates that we are 95 percent confident that the true mean difference lies within the interval of 0.6471293 to 14.9973151.
 The sample estimate is the calculated mean difference between the "before" and "after" measurements, which is 7.822222.

Independent sample t test

The independent samples t-test is a statistical test used to compare the means of two
independent groups and determine if there is a significant difference between them. It's also
known as the two-sample t-test. It's often used in research and analysis when you want to
compare the effects of different treatments, interventions, or conditions on different groups

For Example
Let's run an example of an independent sample t test! Our hypothetical scenario is that we are
comparing scores from two teaching methods. We drew two random samples of students.
Students in one group learned using Method A while the other group used Method B. These
samples contain entirely separate students. Now, we want to determine whether the two
means are different.
Method A    Method B
72.4 72.1
72.1 89.8
69.7 98
61.2 84.4
76.5 80.5
81.2 84.8
75.8 70.8
71.6 90.8
82 73.1
52.7 93.7
64.7 83.3

Input:
#independent sample t test
method_1=c (72.4,72.1,69.7,61.2,76.5,81.2,75.8,71.6,82,52.7,64.7)
method_2=c (72.1,89.8,98,84.4,80.5,84.8,70.8,90.8,73.1,93.7,83.3)
shapiro.test(method_1)
shapiro.test(method_2)
group=rep(c("method_1","method_2"), each=11)
data.frame(group,method)
data=data.frame(group,method)
var.test (method~group, data=data)
t.test (method_1, method_2, var. equal=TRUE)
Output:
> #independent sample t test
> method_1=c(72.4,72.1,69.7,61.2,76.5,81.2,75.8,71.6,82,52.7,64.7)
> method_2=c(72.1,89.8,98,84.4,80.5,84.8,70.8,90.8,73.1,93.7,83.3)
> shapiro.test(method_1)
Shapiro-Wilk normality test
data: method_1
W = 0.94084, p-value = 0.5305
> shapiro.test(method_2)
Shapiro-Wilk normality test
data: method_2
W = 0.94745, p-value = 0.6116
> method=c(method_1,method_2)
> group=rep(c("method_1","method_2"),each=11)
> data=data.frame(group,method)
> data
group method
group method
1 method_1 72.4
2 method_1 72.1
3 method_1 69.7
4 method_1 61.2
5 method_1 76.5
6 method_1 81.2
7 method_1 75.8
8 method_1 71.6
9 method_1 82.0
10 method_1 52.7
11 method_1 64.7
12 method_2 72.1
13 method_2 89.8
14 method_2 98.0
15 method_2 84.4
16 method_2 80.5
17 method_2 84.8
18 method_2 70.8
19 method_2 90.8
20 method_2 73.1
21 method_2 93.7
22 method_2 83.3
> var.test(method~group, data=data)
F test to compare two variances
data: method by group
F = 0.92239, num df = 10, denom df = 10, p-value = 0.9009
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2481681 3.4283292
sample estimates:
ratio of variances
0.9223893
> t.test(method_1, method_2, var.equal=TRUE)
Two Sample t-test

data: method_1 and method_2


t = -3.4008, df = 20, p-value = 0.002836
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-20.739094 -4.969997
sample estimates:
mean of x mean of y
70.90000 83.75455
Interpretations:
1. Normality of variables
The Shapiro-Wilk test is a statistical test used to assess the normality of a dataset. It helps you determine whether a given dataset follows a normal distribution. The null hypothesis of the Shapiro-Wilk test is that the data is normally distributed, while the alternative hypothesis is that the data is not normally distributed.

H0: The variable method_1 is normally distributed
H1: The variable method_1 is not normally distributed

The p-value for method_1 is 0.5305, which is greater than 0.05, so we accept the null hypothesis and reject the alternative hypothesis.

H0: The variable method_2 is normally distributed
H1: The variable method_2 is not normally distributed

The p-value for method_2 is 0.6116, which is greater than 0.05, so we accept the null hypothesis and reject the alternative hypothesis. This means that the data for both method_1 and method_2 is normally distributed.

2. Homogeneity of variances
The assumption of homogeneity of variances refers to the requirement that the variances of the two groups being compared are roughly equal; only then can the equal-variance t-test be applied. The F-test is a statistical test used to compare the variances of two populations. As the p value is 0.9009, which is more than 0.05, we accept H0 and reject H1, which means that both populations have equal variances.
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
The calculated ratio of variances is 0.9223893. This estimate is close to 1, further suggesting that the variances are not substantially different.

3. Testing of population means

The p value of the independent sample t test is 0.002836, which is less than 0.05, so we accept the alternative hypothesis and reject the null hypothesis.
H0: μ1 = μ2
H1: μ1 ≠ μ2
The degrees of freedom are 20. The 95 percent confidence interval indicates that we are 95 percent confident that the true difference in population means lies within the interval of -20.739094 to -4.969997.
The mean of method 1 is 70.90000 and that of method 2 is 83.75455. The means of the two groups are quite different from each other.
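Had the F-test instead rejected the equality of variances, the var.equal=TRUE assumption would not hold; R's default, var.equal=FALSE, then gives Welch's t-test, which does not assume equal variances. A sketch:

t.test(method_1, method_2)   # var.equal defaults to FALSE, giving Welch's t-test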
Comparison between the one sample t test, paired sample t test, and independent sample t test

Aspect      | One-sample t-test                                  | Paired-sample t-test                                         | Independent-sample t-test
Purpose     | Compare the sample mean to a known or hypothesized mean | Compare means of related observations (before-after)   | Compare means of two separate groups
Data type   | Single sample with a set of measurements           | Paired observations (before and after)                       | Two independent groups of observations
Hypotheses  | H0: μ = μ0; H1: μ ≠ μ0                             | H0: μ1 - μ2 = 0; H1: μ1 - μ2 ≠ 0                             | H0: μ1 = μ2; H1: μ1 ≠ μ2
Assumptions | Data is normally distributed                       | Paired differences are normally distributed and independent  | Both samples are normally distributed, have similar variances, and are independent
R code      | t.test(variable)                                   | t.test(before, after, paired=TRUE)                           | t.test(group1, group2, var.equal=TRUE)
ANOVA
ANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the
difference between the means of more than two groups.
In ANOVA, “groups” or “levels” refer to the different categories of the independent variable
being compared. For example, if the independent variable is “eggs,” the levels might be Non-
Organic, Organic, and Free Range Organic. The dependent variable could then be the price
per dozen eggs.

One way ANOVA


One-way ANOVA refers to a type of ANOVA test where there will be only one independent
variable. The test compares means of groups, generally three or more groups, to analyze the
variance. The one-way ANOVA compares the means between the groups you are interested
in and determines whether any of those means are statistically significantly different from
each other.

Hypothesis
 The null hypothesis states that all population means are equal:

H0: µ1 = µ2 = … = µk, where µ = group mean and k = number of groups.

 If, however, the one-way ANOVA returns a statistically significant result, we accept the alternative hypothesis (H1), which states that at least one population mean differs from the others.

Assumptions
One-way ANOVA makes several assumptions about the data in order to produce reliable
results. These assumptions need to be checked before interpreting the ANOVA results. The
main assumptions are:

1. Normality: The data within each group should be approximately normally distributed.
2. Homogeneity of Variances (Homoscedasticity): The variances of the dependent
variable should be roughly equal across all groups. This assumption is important
because ANOVA is sensitive to unequal variances. Levene's test or similar tests can
be used to check this assumption.
 Null Hypothesis (Ho): The variances of the groups being compared are equal.
 Alternative Hypothesis (H1): At least one group's variance is different from
the others.
3. Random Sampling: The samples should be randomly selected from their respective
populations.
At this point, it is important to realize that the one-way ANOVA is an omnibus test statistic
and cannot tell you which specific groups were statistically significantly different from each
other, only that at least two groups were. To determine which specific groups differed from
each other, you need to use a post hoc test.

Calculation method
The calculation method involves the F-ratio formula:
F = (mean sum of squares between the groups) / (mean sum of squares within the groups) = MSB/MSW
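The same ratio can be computed by hand in R; a minimal sketch using three small illustrative groups (not the income data analyzed below):

g1=c(4,5,6); g2=c(6,7,8); g3=c(9,10,11)                # illustrative groups only
x=c(g1,g2,g3); k=3; n=length(x)
group_means=sapply(list(g1,g2,g3), mean)
group_sizes=sapply(list(g1,g2,g3), length)
MSB=sum(group_sizes*(group_means-mean(x))^2)/(k-1)     # between-group mean square
MSW=sum((g1-mean(g1))^2,(g2-mean(g2))^2,(g3-mean(g3))^2)/(n-k)   # within-group mean square
MSB/MSW                                                # the F ratio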
Example:
A research study seeks to explore whether there exists a statistically significant discrepancy in income among individuals with different educational levels. The educational backgrounds are divided into three categories: "bachelor's," "master's," and "doctoral." The investigation incorporates income data collected from a total of 18 individuals, with 6 individuals in each educational group, distributed as follows:
 Bachelor's Degree: Income ($ in thousands): $48.5, $50.2, $52.1, $45.8, $47.6, $49.3
 Master's Degree: Income ($ in thousands): $56.3, $57.8, $55.1, $54.2, $53.6, $55.9
 Doctoral Degree: Income ($ in thousands): $65.7, $67.3, $68.8, $66.5, $64.9, $67.1
The purpose of this study is to analyze the income differences among the three educational
levels. This analysis will involve conducting appropriate statistical tests to determine if there
is a noteworthy disparity in mean income across the groups.

Solution:
Following Commands were used to perform one way ANOVA.
 Bachelor_Degree=c(48.5,50.2,52.1,45.8,47.6,49.3)
 Master_Degree=c(56.3,57.8,55.1,54.2,53.6,55.9)
 Doctoral_Degree=c(65.7,67.3,68.8,66.5,64.9,67.1)
 degrees=c(48.5,50.2,52.1,45.8,47.6,49.3,56.3,57.8,55.1,54.2,53.6,55.9,65.7,67.3,68.8,66.5,64.9,67.1)
 group=rep(c("Bachelor_Degree","Master_Degree","Doctoral_Degree"),each=6)
 data=data.frame(degrees,group)
 library("car")
 leveneTest(degrees,group)
 ANOVA=aov(degrees~group,data=data)
 summary(ANOVA)
 TukeyHSD(ANOVA)

Information of each command performed


1. Bachelor_Degree, Master_Degree, Doctoral_Degree: These are variables that store
income data for individuals with different education levels.
2. degrees: This variable combines all the income data into a single vector for analysis.
3. group: This variable assigns labels to each income data point based on education
level, organizing the data for grouping.
4. data=data.frame(degrees,group): This creates a data frame using the degrees and
group variables, bringing the data together in a structured format.
5. library("car"): This loads the "car" package, which contains useful functions for
statistical analysis, including the Levene's test.
6. leveneTest(degrees,group): This tests if the variances of income data across
education groups are equal or not, important for ANOVA.
7. ANOVA=aov(degrees~group,data=data): This fits a one-way ANOVA model,
analyzing if there are significant differences in mean income among education groups.
8. summary(ANOVA): This summarizes the ANOVA results, showing the F-statistic,
p-value, and whether education groups significantly affect income.
9. TukeyHSD(ANOVA): This performs a Tukey post hoc test to identify which
education groups have significantly different incomes after ANOVA.

Output:
> Bachelor_Degree=c(48.5,50.2,52.1,45.8,47.6,49.3)
> Master_Degree=c(56.3,57.8,55.1,54.2,53.6,55.9)
> Doctoral_Degree=c(65.7,67.3,68.8,66.5,64.9,67.1)
> degrees=c(48.5,50.2,52.1,45.8,47.6,49.3,56.3,57.8,55.1,54.2,53.6,55.9,65.7,67.3,68.8,66.5,64.9,67.1)
> group=rep(c("Bachelor_Degree","Master_Degree","Doctoral_Degree"),each=6)
> data=data.frame(degrees,group)
> library("car")
> leveneTest(degrees,group)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.6138 0.5543
15
> ANOVA=aov(degrees~group,data=data)
> summary(ANOVA)
Df Sum Sq Mean Sq F value Pr(>F)
group 2 972.3 486.1 164.6 6.23e-11 ***
Residuals 15 44.3 3.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> TukeyHSD(ANOVA)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = degrees ~ group, data = data)

$group
diff lwr upr p adj
Doctoral_Degree-Bachelor_Degree 17.800000 15.222666 20.377334 0.00e+00
Master_Degree-Bachelor_Degree 6.566667 3.989333 9.144001 2.31e-05
Master_Degree-Doctoral_Degree -11.233333 -13.810667 -8.655999 0.00e+00

Interpretation of output obtained:


Levene's Test for Homogeneity of Variance:
The Levene's test assesses if the variances of income data are equal across education groups.
The result is as follows:
 Df (Degrees of Freedom): There are 2 degrees of freedom in the numerator (number
of groups minus one) and 15 degrees of freedom in the denominator (total number of
observations minus number of groups).
 F value: The calculated F value is 0.6138.
 p-value: The p-value is 0.5543.
 Interpretation: Since the p-value is greater than the common significance level of
0.05, you don't have strong evidence to reject the assumption of equal variances.

"This suggests that the assumption of homogeneity of variances is met, which is necessary for the reliability of ANOVA results. This means the null hypothesis H0 is accepted and the alternative hypothesis is rejected; hence the data is said to have homogeneity."

One-Way ANOVA:
The ANOVA summary provides information about differences in income across education
groups:
 Df (Degrees of Freedom): There are 2 degrees of freedom for the "group" factor and
15 degrees of freedom for residuals.
 Sum Sq (Sum of Squares): The sum of squares between groups is 972.3, and the
sum of squares within groups is 44.3.
 Mean Sq (Mean Square): The mean square is calculated by dividing the sum of
squares by the degrees of freedom.
 F value: The calculated F value is 164.6.
 p-value: The p-value is almost zero (6.23e-11).
 Interpretation: The very low p-value suggests that there is strong evidence to
conclude that at least one education group's mean income is significantly different
from the others.

“This means that we reject null hypothesis and accept alternative hypothesis which
suggests that at least one education group's mean income is significantly different from
the others.”

Tukey HSD Post Hoc Test:


The Tukey HSD test identifies which education groups have significantly different means, giving a p-value for each pairwise comparison.
According to the Tukey HSD results:

 The difference in mean income between Doctoral and Bachelor's degrees is significant (p-value < 0.001). The extremely low p-value suggests that this difference is highly unlikely to have occurred by chance. It provides strong evidence that Doctoral degree holders generally earn more than those with Bachelor's degrees.
 The difference in mean income between Master's and Bachelor's degrees is
significant (p-value = 2.31e-05). The p-value of 2.31e-05 indicates a very small
probability of observing such a difference by chance alone. This result adds further
weight to the conclusion that Master's degree holders typically earn more than those
with Bachelor's degrees.
 The difference in mean income between Master's and Doctoral degrees is
significant (p-value < 0.001). The p-value of less than 0.001 provides strong
evidence that this difference is not likely due to random variation. It reinforces the
conclusion that Doctoral degree holders generally enjoy higher earnings than those
with Master's degrees.
This means that there are significant differences in mean income among all pairs of education
groups.

ANOVA table
The ANOVA table provides a summarized view of the results of your one-way ANOVA
analysis and is generated as an output of the statistical software, such as R, after performing
the analysis.
Having completed the one-way ANOVA analysis in R and obtained the output, we can construct an ANOVA table.

Source of variation | DF            | Sum of squares (SS) | Mean sum of squares | F
Between samples     | k-1 = 3-1 = 2 | 972.3               | 972.3/2 = 486.15    | 486.15/2.953 = 164.612
Within samples      | n-k = 18-3 = 15 | 44.3              | 44.3/15 = 2.953     |
Total               | n-1 = 18-1 = 17 | 1016.6            |                     |

F value = 164.612

"In conclusion, the combination of visualizations, statistical tests, and analytical techniques
used in this report has revealed valuable insights and trends within the data. Through
histograms, boxplots, dispersion measures, regression and correlation analyses, t-tests, and
one-way ANOVA, we've deeply explored the intricacies of the data. These findings
collectively improve our understanding of the subject, emphasizing important relationships
and distinctions. This analysis underscores the significance of a comprehensive approach in
data exploration and adds depth to our understanding of the topic."

References:
 https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_hypothesistesting-anova/bs704_hypothesistesting-anova_print.html
 https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php
 https://www.simplypsychology.org/anova.html#ANOVA-F-value
 https://www.wallstreetmojo.com/one-way-anova/
 https://slideplayer.com/slide/8367889/
 https://www.geeksforgeeks.org/paired-sample-t-test-in-excel/
 https://www.ibm.com/docs/en/spss-statistics/SaaS?topic=tests-paired-samples-t-test
 https://www.bing.com/search?q=independent+sample+t+test+example
 International Journal of Computers for Mathematical Learning, 12(1) (2018), pp. 23-55, doi:10.1007/s10758-007-9110-6
 http://rstudio-pubs-static.s3.amazonaws.com/536833_75890191109d4f6baef4f69fa6390856.html#multiple-regression
 https://www.alchemer.com/resources/blog/regression-analysis/
