Report Stats PDF
Report submission
Semester
2nd / MS Biotechnology
Submitted by
Group # 2
Group members
Hadia Akram 22104014-022
Sumbal 22104014-006
Sumaira 22104014-002
Shingraf Naz 22104014-015
Maria Akbar 22104014-012
Submitted to
Sir Waqas
Subject
Advanced biostatistics
Date of submission
20th August, 2023
Variables:
A variable is a characteristic that can be measured and can assume different values. Variables are categorized into two types:
Qualitative: Variables that express qualitative attributes such as colour, race, and gender.
Quantitative: Variables that take numeric values. Quantitative data may be discrete or continuous.
Discrete: Quantitative data that is countable is called discrete data.
Continuous: Quantitative data that is measurable is called continuous data.
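As an illustrative sketch in R (the variables here are hypothetical examples, not data from this report), discrete data are naturally summarized by counting, while continuous data are summarized by measuring:

```r
# discrete (countable): number of children per family
children <- c(0, 2, 1, 3, 2, 0, 1, 2)
table(children)   # frequency of each count

# continuous (measurable): height in cm
height <- c(141.2, 143.5, 150.1, 147.8)
mean(height)      # continuous values are summarized numerically
```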
Histogram
A histogram is a graphical representation of continuous data. It is used to check the normality of the data: if the graph shows a single peak in the middle, like a bell shape, the data is approximately normally distributed.
S. No. Height (cm)
1 141
2 143
3 145
4 145
5 147
6 152
7 143
8 144
9 149
10 141
11 138
12 143
Mr. Larry, a famous doctor, is researching the height of students studying in the 8th standard. He has gathered data from 12 students and wants to know which height interval contains the most students.
Input:
height=c(141,143,145,145,147,152,143,144,149,141,138,143)
hist(height)
hist(height, xlab="height of person", ylab="frequency", xlim=c(138,152), ylim=c(0,10))
Output:
> height=c(141,143,145,145,147,152,143,144,149,141,138,143)
> hist(height)
> hist(height, xlab="height of person", ylab="frequency", xlim=c(138,152), ylim=c(0,10))
Interpretation:
The peak of the graph tells us that most of the students' heights lie between 142 and 144 cm.
The peak is on the left side (at the start), which shows that the data is positively skewed.
Box plot:
A box and whisker plot, also called a box plot, displays the five-number summary of a set of data. The five-number summary is the:
1) Minimum value
2) First quartile
3) Median
4) Third quartile
5) Maximum value
Input:
boxplot(height)
summary(height)
Output:
> boxplot(height)
> summary(height)
Min. 1st Qu. Median Mean 3rd Qu. Max.
138.0 142.5 143.5 144.2 145.5 152.0
Interpretation:
This box plot gives us the following information:
Minimum value = 138
"1st Qu." is the first quartile (25th percentile) = 142.5
"Median" is the middle value = 143.5
"Mean" is the arithmetic average = 144.2
"3rd Qu." is the third quartile (75th percentile) = 145.5
"Max." is the maximum value = 152.0
Measure of dispersion
The measures of dispersion help to interpret the variability of data: they show how squeezed or scattered the variable is. Several measures of dispersion help us get more insight into the data:
Range
Variance
Standard Deviation
Coefficient of variation
Example:
Suppose XYZ Pvt. Ltd is a company where each of its 19 employees spends the following amount (in dollars) on lunch:
money_spent=c(5,3,4,10,12,15,11,13,9,11,8,6,25,23,21,19,17,20,16)
How can we use R to compute measures of dispersion for these expenditures, considering their spread and variability?
1. Range
The range is the difference between the largest and smallest values of the data and is the simplest measure of dispersion. For the data above, the range is 25 − 3 = 22.
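In R, the range (and the variance quoted later in the interpretation) can be computed as follows; this is a minimal sketch using the money_spent vector from the example:

```r
money_spent <- c(5,3,4,10,12,15,11,13,9,11,8,6,25,23,21,19,17,20,16)

range(money_spent)        # smallest and largest values: 3 and 25
diff(range(money_spent))  # range = 25 - 3 = 22

# 2. Variance (this report's interpretation quotes 43.05263)
var(money_spent)
```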
3. Standard deviation
The standard deviation is the square root of the variance. In RStudio, the following command is used:
Command
sd(money_spent)
Output
> sd(money_spent)
[1] 6.56145
4. Coefficient of variation
A lower CV indicates lower relative variability, meaning the data points are relatively
close to the mean. Conversely, a higher CV suggests higher relative variability, indicating
that the data points are more spread out from the mean.
We can find the coefficient of variation with the following formula: CV = (standard deviation / mean) × 100.
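A minimal sketch of this computation in R; R has no built-in CV function, so it is computed from sd() and mean():

```r
money_spent <- c(5,3,4,10,12,15,11,13,9,11,8,6,25,23,21,19,17,20,16)

# coefficient of variation: standard deviation as a percentage of the mean
cv <- sd(money_spent) / mean(money_spent) * 100
cv   # approximately 50.27
```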
Interpretation
The results show that the money_spent dataset has a variance of 43.05263, a standard deviation of 6.56145, and a coefficient of variation (CV) of 50.26918.
Interpreting these results, the money_spent dataset exhibits a moderate degree of variability: the CV value of 50.26918 indicates that the standard deviation is approximately 50.27% of the mean, which suggests that the data points are relatively spread out from the mean value.
Importance of measure of dispersion
Equation: y = β0 + β1x + ε
Null Hypothesis (H0): There is no linear relationship between x and y, meaning β1 = 0.
Alternative Hypothesis (Ha): There is a linear relationship between x and y, meaning β1 ≠ 0.
Equation: y = β0 + β1x1 + β2x2 + ... + βkxk + ε
Null Hypothesis (H0): β1 = β2 = β3 = ... = βk = 0.
Alternative Hypothesis (H1): at least one β is non-zero.
Example:
Result:
> price=data$Y; food=data$X1; décor=data$X2; service=data$X3
> price
[1] 42 32 34 41 54 52 34 34 39 44
To build regression model:
command
reg=lm(price~food+décor+service)
Result
> reg=lm(price~food+décor+service)
> reg
Call:
lm(formula = price ~ food + décor + service)
Coefficients:
(Intercept)        food       décor     service
    -65.872       5.305       1.831      -2.014
The fitted regression equation Y = a + b1X1 + b2X2 + b3X3 becomes:
price = −65.872 + 5.305(food) + 1.831(décor) − 2.014(service)
reg$fitted.values
Result:
> reg$fitted.values
1 2 3 4 5 6 7 8 9 10
104.7985 115.4567 116.6143 116.6814 111.8022 120.3358 121.3501 110.5867 110.1947 113.1795
From these fitted values we can calculate the residual for each observation one by one: residual = observed value − fitted value.
summary(reg)
> summary(reg)
Call:
lm(formula = price ~ food + décor + service)
Residuals:
Min 1Q Median 3Q Max
-4.7337 -2.1235 -0.2114 2.5769 5.6227
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -65.8716    24.2595  -2.715  0.03486
food          5.3048     1.2994   4.082  0.00648
décor         1.8306     0.5758   3.179  0.01909
service      -2.0144     1.0628  -1.895  0.10686
Interpretation:
Multiple R-squared: 0.8173
Adjusted R-squared: 0.726
p-value (overall model): 0.01238
p-value of food: 0.00648
p-value of décor: 0.01909
Hypothesis:
H0: β1 = β2 = β3 = 0
H1: At least one regression coefficient is significant.
food (β1) and décor (β2) are statistically significant because their p-values are less than 0.05, so we accept the alternative hypothesis, while service (β3) is not statistically significant because its p-value (0.107) is greater than 0.05. The non-significant variable (service) will not be included in the regression model.
Model goodness:
The model fits well, with an adjusted R-squared of 0.726.
Correlation/correlation co-efficient
Correlation, specifically the correlation coefficient, is a statistical measure that quantifies the
strength and direction of the linear relationship between two variables. It tells us how closely
two variables vary together
Command:
plot(price, décor)
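As a minimal sketch (using the price vector shown earlier and a hypothetical décor vector, since the full dataset is not reproduced in this report), the correlation coefficient and scatter plot can be obtained with cor() and plot():

```r
price <- c(42, 32, 34, 41, 54, 52, 34, 34, 39, 44)
decor <- c(18, 14, 15, 17, 21, 20, 15, 14, 16, 19)  # hypothetical values for illustration

cor(price, decor)    # correlation coefficient, between -1 and +1
plot(decor, price)   # scatter plot of the relationship
```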
t-test analysis
The t-test is a statistical hypothesis test, commonly available in R and other statistical software, used to assess whether there is a significant difference between the means of two groups or between the mean of a sample and a known population mean.
Important points:
1. P-value: It is a probability value on the basis of which the hypothesis decision is made. If the p-value is greater than 0.05 we accept the null hypothesis and reject the alternative hypothesis; if it is less than 0.05 we accept the alternative hypothesis and reject the null hypothesis. A hypothesis is a statement or proposition about a population parameter or a relationship between variables that can be tested using data. There are two types of hypothesis.
2. Level of significance: The level of significance, often denoted by the symbol "α" (alpha), is a critical concept in hypothesis testing and statistical analysis. It represents the threshold at which you are willing to reject the null hypothesis when conducting a statistical test; for example, a level of significance of 5 percent means the chance of error (rejecting a true null hypothesis) is 5 percent.
3. Shapiro test: The Shapiro-Wilk test is used to assess whether a dataset follows a normal distribution. The command used for the Shapiro test is shapiro.test(variable). The null hypothesis states that the data follow a normal distribution; the alternative hypothesis suggests that they do not.
H0: the data are normally distributed (p-value greater than 0.05)
H1: the data are not normally distributed (p-value less than 0.05)
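As a minimal sketch, the test can be tried on simulated data; rnorm() draws are random, so set.seed() is used for reproducibility and the exact p-value will vary with the seed:

```r
set.seed(1)
x <- rnorm(30, mean = 50, sd = 5)  # simulated normally distributed data
shapiro.test(x)                    # expect a large p-value: no evidence against normality
```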
Types of t-test
There are different types of t-test such as
One sample t-test
Paired sample t-test
Two sample t-test
One sample t-test:
This test is used to estimate the mean of a single population. It is useful when you have a sample and want to test whether the sample mean is significantly different from a specific value.
Example:
Researchers are interested in whether the resting pulse rate of long-distance runners differs from that of the general population. They randomly take a sample of 8 long-distance runners, measure their resting pulse, and obtain the following data: 45, 42, 64, 54, 58, 49, 48, 56. Perform a one-sample t-test on this data.
Input:
pulse_rate=c(45,42,64,54,58,49,48,56)
shapiro.test(pulse_rate)
t.test(pulse_rate, mu = 60)  # a hypothesized mean is required; 60 is assumed here to match the reported p-value of 0.017 (t.test() without mu tests against 0)
Output:
> pulse_rate=c (45,42,64,54,58,49,48,56)
> shapiro.test(pulse_rate)
Interpretation:
The Shapiro-Wilk test is used to assess whether a dataset follows a normal distribution. In
this case, the test was applied to the "pulse_rate" data. As the p value is more than 0.05 it
means it follows null hypothesis and data is normal. Then we can apply t-test when data
is normal.
The p-value of the t-test is 0.017, which is less than 0.05, so we accept the alternative hypothesis and reject the null hypothesis.
H0: μ = μ0 (the population mean equals the hypothesized value)
HA: μ ≠ μ0
The degrees of freedom are 7, equal to the number of observations minus 1. The 95 percent confidence interval indicates that we are 95 percent confident that the true population mean pulse rate lies within the interval 45.88912 to 58.11088.
The sample mean (52) represents the average "pulse_rate" value observed in the given
sample.
Paired sample t-test:
A paired sample t-test is a statistical test used to compare the means of two related or paired
samples. It is suitable for situations where each observation in one sample is directly related
or matched to an observation in the other sample.
Example
A study was conducted to investigate the effectiveness of a new treatment on a group of
individuals. "Before" and "after" measurements were taken for each individual. The "before"
measurements (in mmHg) are as follows: 213.4, 225.0, 217.0, 183.7, 197.2, 223.6, 224.2,
215.2, and 202.4. The "after" measurements (in mmHg) are: 200.1, 216.4, 195.6, 175.0,
201.3, 214.8, 215.7, 200.7, and 211.7. Perform a paired-sample t-test to determine if there is a
statistically significant difference between the blood pressure readings before and after the
treatment?
Input
#blood pressure before treatment
before=c(213.4,225.0,217.0,183.7,197.2,223.6,224.2,215.2,202.4)
#blood pressure after treatment
after=c(200.1,216.4,195.6,175.0,201.3,214.8,215.7,200.7,211.7)
d=before-after
shapiro.test(d)
t.test(before, after, paired=TRUE)
Output:
> #blood pressure before treatment
> before=c(213.4,225.0,217.0,183.7,197.2,223.6,224.2,215.2,202.4)
> #blood pressure after treatment
> after=c(200.1,216.4,195.6,175.0,201.3,214.8,215.7,200.7,211.7)
> d=before-after
> shapiro.test(d)
Shapiro-Wilk normality test
data: d
W = 0.90194, p-value = 0.2634
> t.test (before, after, paired=TRUE)
Paired t-test
data: before and after
t = 2.514, df = 8, p-value = 0.03615
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.6471293 14.9973151
sample estimates:
mean difference
7.822222
Interpretation:
The p-value of the Shapiro test is 0.2634, which is more than 0.05, meaning the differences are normally distributed. As the data is normal, we can apply the t-test.
The p-value of the t-test is 0.03615, which is less than 0.05, so we accept the alternative hypothesis and reject the null hypothesis.
H0: u1-u2=0
H1: u1-u2≠0
The degrees of freedom are 8. The 95 percent confidence interval indicates that we are 95 percent confident that the true mean difference (before − after) lies within the interval 0.6471293 to 14.9973151.
The sample estimate is the calculated mean difference between the "before" and "after"
measurements, which is 7.822222.
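As a sketch, the paired t statistic reported above can be verified by hand from the differences; t is the mean difference divided by its standard error:

```r
before <- c(213.4,225.0,217.0,183.7,197.2,223.6,224.2,215.2,202.4)
after  <- c(200.1,216.4,195.6,175.0,201.3,214.8,215.7,200.7,211.7)

d <- before - after                            # paired differences
t_stat <- mean(d) / (sd(d) / sqrt(length(d)))  # mean difference / standard error
t_stat                                         # approximately 2.514, matching t.test()
```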
Two sample t-test:
The independent samples t-test is a statistical test used to compare the means of two
independent groups and determine if there is a significant difference between them. It's also
known as the two-sample t-test. It's often used in research and analysis when you want to
compare the effects of different treatments, interventions, or conditions on different groups
For example:
Let‟s run an example of independent sample t test! Our hypothetical scenario is that we are
comparing scores from two teaching methods. We drew two random samples of students.
Students in one group learned using Method A while the other group used Method B. These
samples contain entirely separate students. Now, we want to determine whether the two
means are different.
Method A Method B
72.4 72.1
72.1 89.8
69.7 98
61.2 84.4
76.5 80.5
81.2 84.8
75.8 70.8
71.6 90.8
82 73.1
52.7 93.7
64.7 83.3
Input:
#independent sample t test
method_1=c(72.4,72.1,69.7,61.2,76.5,81.2,75.8,71.6,82,52.7,64.7)
method_2=c(72.1,89.8,98,84.4,80.5,84.8,70.8,90.8,73.1,93.7,83.3)
shapiro.test(method_1)
shapiro.test(method_2)
method=c(method_1,method_2)
group=rep(c("method_1","method_2"), each=11)
data.frame(group,method)
data=data.frame(group,method)
var.test(method~group, data=data)
t.test(method_1, method_2, var.equal=TRUE)
Output:
#independent sample t test
> method_1=c (72.4,72.1,69.7,61.2,76.5,81.2,75.8,71.6,82,52.7,64.7)
> method_2=c (72.1,89.8,98,84.4,80.5,84.8,70.8,90.8,73.1,93.7,83.3)
> shapiro.test(method_1)
Shapiro-Wilk normality test
data: method_1
W = 0.94084, p-value = 0.5305
> shapiro.test(method_2)
Shapiro-Wilk normality test
data: method_2
W = 0.94745, p-value = 0.6116
> method=c(method_1,method_2)
> group=rep(c("method_1","method_2"),each=11)
> data.frame(group,method)
group method
1 method_1 72.4
2 method_1 72.1
3 method_1 69.7
4 method_1 61.2
5 method_1 76.5
6 method_1 81.2
7 method_1 75.8
8 method_1 71.6
9 method_1 82.0
10 method_1 52.7
11 method_1 64.7
12 method_2 72.1
13 method_2 89.8
14 method_2 98.0
15 method_2 84.4
16 method_2 80.5
17 method_2 84.8
18 method_2 70.8
19 method_2 90.8
20 method_2 73.1
21 method_2 93.7
22 method_2 83.3
> data=data.frame(group,method)
> var.test (method~group, data=data)
F test to compare two variances
data: method by group
F = 0.92239, num df = 10, denom df = 10, p-value = 0.9009
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2481681 3.4283292
sample estimates:
ratio of variances
0.9223893
> t.test(method_1, method_2, var.equal=TRUE)
Two Sample t-test
The p-value of the Shapiro test for method_1 is 0.5305, which is greater than 0.05, so we accept the null hypothesis and reject the alternative hypothesis. The p-value for method_2 is 0.6116, which is also greater than 0.05, so again we accept the null hypothesis. This means that the data for both method_1 and method_2 are normally distributed.
2. Homogeneity of variances
The assumption of homogeneity of variances requires that the variances of the two groups being compared are roughly equal before this test is applied. The F-test is a statistical test used to compare the variances of two populations. As the p-value is 0.9009, which is more than 0.05, we accept H0 and reject H1, which means both populations have equal variances.
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
The calculated ratio of variances is 0.9223893. This estimate is close to 1, further suggesting
that the variances are not substantially different.
Hypothesis
The null hypothesis states that all population means are equal (H0: μ1 = μ2 = ... = μk); the alternative hypothesis states that at least one group mean differs.
Assumptions
One-way ANOVA makes several assumptions about the data in order to produce reliable results, which need to be checked before interpreting the ANOVA results. The main assumptions are:
Normality: the data in each group are approximately normally distributed.
Homogeneity of variances: the groups have roughly equal variances.
Independence: the observations are independent of one another.
Calculation method
The calculation method involves the F-ratio formula.
F = Mean sum of squares between the groups/ Mean sum of squares within groups
= MSB/MSW
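This ratio can be computed by hand in R; a minimal sketch using the income data from the example that follows:

```r
bachelor <- c(48.5,50.2,52.1,45.8,47.6,49.3)
master   <- c(56.3,57.8,55.1,54.2,53.6,55.9)
doctoral <- c(65.7,67.3,68.8,66.5,64.9,67.1)

groups <- list(bachelor, master, doctoral)
n <- sapply(groups, length)             # group sizes
means <- sapply(groups, mean)           # group means
grand_mean <- mean(unlist(groups))

SSB <- sum(n * (means - grand_mean)^2)                        # between-group sum of squares
SSW <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))  # within-group sum of squares

MSB <- SSB / (length(groups) - 1)       # df between = k - 1 = 2
MSW <- SSW / (sum(n) - length(groups))  # df within = N - k = 15
MSB / MSW                               # F value, approximately 164.6, matching aov()
```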
Example:
A research study seeks to explore whether there exists a statistically significant discrepancy
in income among individuals with different educational levels. The educational backgrounds
are divided into three categories: "bachelor‟s," "master‟s," and "doctoral." The investigation
incorporates income data collected from a total of 18 individuals, with 6 individuals in each
educational group, distributed as follows:
Bachelor‟s Degree: Income ($ in thousands): $48.5, $50.2, $52.1, $45.8, $47.6, $49.3
Master‟s Degree: Income ($ in thousands): $56.3, $57.8, $55.1, $54.2, $53.6, $55.9
Doctoral Degree: Income ($ in thousands): $65.7, $67.3, $68.8, $66.5, $64.9, $67.1
The purpose of this study is to analyze the income differences among the three educational
levels. This analysis will involve conducting appropriate statistical tests to determine if there
is a noteworthy disparity in mean income across the groups.
Solution:
Following Commands were used to perform one way ANOVA.
Bachelor_Degree=c(48.5,50.2,52.1,45.8,47.6,49.3)
Master_Degree=c(56.3,57.8,55.1,54.2,53.6,55.9)
Doctoral_Degree=c(65.7,67.3,68.8,66.5,64.9,67.1)
degrees=c(48.5,50.2,52.1,45.8,47.6,49.3,56.3,57.8,55.1,54.2,53.6,55.9,65.7,67.3,68.8,66.5,64.9,67.1)
group=rep(c("Bachelor_Degree","Master_Degree","Doctoral_Degree"),each=6)
data=data.frame(degrees,group)
library("car")
leveneTest(degrees,group)
ANOVA=aov(degrees~group,data=data)
summary(ANOVA)
TukeyHSD(ANOVA)
Output:
> Bachelor_Degree=c(48.5,50.2,52.1,45.8,47.6,49.3)
> Master_Degree=c(56.3,57.8,55.1,54.2,53.6,55.9)
> Doctoral_Degree=c(65.7,67.3,68.8,66.5,64.9,67.1)
> degrees=c(48.5,50.2,52.1,45.8,47.6,49.3,56.3,57.8,55.1,54.2,53.6,55.9,65.7,67.3,68.8,66.5,64.9,67.1)
> group=rep(c("Bachelor_Degree","Master_Degree","Doctoral_Degree"),each=6)
> data=data.frame(degrees,group)
> library("car")
> leveneTest(degrees,group)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.6138 0.5543
15
> ANOVA=aov(degrees~group,data=data)
> summary(ANOVA)
Df Sum Sq Mean Sq F value Pr(>F)
group 2 972.3 486.1 164.6 6.23e-11 ***
Residuals 15 44.3 3.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> TukeyHSD(ANOVA)
Tukey multiple comparisons of means
95% family-wise confidence level
$group
diff lwr upr p adj
Doctoral_Degree-Bachelor_Degree 17.800000 15.222666 20.377334 0.00e+00
Master_Degree-Bachelor_Degree 6.566667 3.989333 9.144001 2.31e-05
Master_Degree-Doctoral_Degree -11.233333 -13.810667 -8.655999 0.00e+00
One-Way ANOVA:
The ANOVA summary provides information about differences in income across education
groups:
Df (Degrees of Freedom): There are 2 degrees of freedom for the "group" factor and
15 degrees of freedom for residuals.
Sum Sq (Sum of Squares): The sum of squares between groups is 972.3, and the
sum of squares within groups is 44.3.
Mean Sq (Mean Square): The mean square is calculated by dividing the sum of
squares by the degrees of freedom.
F value: The calculated F value is 164.6.
p-value: The p-value is almost zero (6.23e-11).
Interpretation: The very low p-value suggests that there is strong evidence to
conclude that at least one education group's mean income is significantly different
from the others.
This means that we reject the null hypothesis and accept the alternative hypothesis, which suggests that at least one education group's mean income is significantly different from the others.
ANOVA table
The ANOVA table provides a summarized view of the results of your one-way ANOVA
analysis and is generated as an output of the statistical software, such as R, after performing
the analysis.
As we have already performed the one-way ANOVA analysis in R and obtained the output, we can construct the ANOVA table:
Source of variation    df         SS       MS      F value
Between groups         3-1=2      972.3    486.1   164.6
Within groups          18-3=15    44.3     3.0
Total                  18-1=17    1016.6
Conclusion:
The combination of visualizations, statistical tests, and analytical techniques used in this report has revealed valuable insights and trends within the data. Through histograms, box plots, dispersion measures, regression and correlation analyses, t-tests, and one-way ANOVA, we have explored the data in depth. These findings collectively improve our understanding of the subject, emphasizing important relationships and distinctions, and underscore the significance of a comprehensive approach to data exploration.
References:
https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_hypothesistesting-anova/bs704_hypothesistesting-anova_print.html
https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php
https://www.simplypsychology.org/anova.html#ANOVA-F-value
https://www.wallstreetmojo.com/one-way-anova/
https://www.geeksforgeeks.org/paired-sample-t-test-in-excel/
https://www.ibm.com/docs/en/spss-statistics/SaaS?topic=tests-paired-samples-t-test
International Journal of Computers for Mathematical Learning, 12(1) (2018), pp. 23-55, doi:10.1007/s10758-007-9110-6
http://rstudio-pubs-static.s3.amazonaws.com/536833_75890191109d4f6baef4f69fa6390856.html#multiple-regression
https://www.alchemer.com/resources/blog/regression-analysis/