In data analysis, you will often want to compare the means of two populations or samples, and the technique you should use depends on the type of data you have and how it is grouped. Comparison-of-means tests help determine whether your groups have similar means. This article covers the statistical tests used for comparing means in R programming.
Comparing Means in R Programming
As discussed above, the appropriate technique depends on the type of data and how it is grouped. Let's discuss the techniques one by one for the different types of data.
Comparing the means of one-sample data
Two main techniques are used to compare a one-sample mean to a known standard mean:
- One Sample T-test
- One-Sample Wilcoxon Test
One Sample T-test
The One-Sample T-Test is used to test the statistical difference between a sample mean and a known or assumed/hypothesized value of the mean in the population.
Implementation in R:
For performing a one-sample t-test in R, use the function t.test(). The syntax for the function is given below:
Syntax: t.test(x, mu = 0)
Parameters:
- x: the name of the variable of interest
- mu: set equal to the mean specified by the null hypothesis
Example:
R
# simulate 50 sales values around a mean of 140
set.seed(0)
sweetSold <- rnorm(50, mean = 140, sd = 5)

# test whether the sample mean differs from the hypothesized value 150
result <- t.test(sweetSold, mu = 150)
print(result)
Output:
One Sample t-test
data: sweetSold
t = -15.249, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 150
95 percent confidence interval:
138.8176 141.4217
sample estimates:
mean of x
140.1197
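Before relying on the one-sample t-test, it can be worth checking the normality assumption behind it. As a minimal sketch (not part of the original example), the base-R Shapiro-Wilk test can be applied to the same simulated data; if normality looks doubtful, the one-sample Wilcoxon test described next is the usual alternative.
R
# check whether the simulated sales data look roughly normal;
# a large p-value is consistent with the normality assumption
shapiro.test(sweetSold)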
One-Sample Wilcoxon Test
The one-sample Wilcoxon signed-rank test is a non-parametric alternative to a one-sample t-test when the data cannot be assumed to be normally distributed. It’s used to determine whether the median of the sample is equal to a known standard value i.e. a theoretical value.
Implementation in R:
To perform a one-sample Wilcoxon test, R provides the function wilcox.test(), which can be used as follows:
Syntax: wilcox.test(x, mu = 0, alternative = "two.sided")
Parameters:
- x: a numeric vector containing your data values
- mu: the theoretical mean/median value. Default is 0 but you can change it.
- alternative: the alternative hypothesis. Allowed value is one of "two.sided" (default), "greater" or "less".
Example: Here, let's use an example data set containing the weights of 10 rabbits. We want to know whether the median weight of the rabbits differs from 25 g.
R
set.seed(1234)

# simulate the weights of 10 rabbits
myData <- data.frame(
  name = paste0(rep("R_", 10), 1:10),
  weight = round(rnorm(10, 30, 2), 1)
)
print(myData)

# test whether the median weight differs from 25 g
result <- wilcox.test(myData$weight, mu = 25)
print(result)
Output:
name weight
1 R_1 27.6
2 R_2 30.6
3 R_3 32.2
4 R_4 25.3
5 R_5 30.9
6 R_6 31.0
7 R_7 28.9
8 R_8 28.9
9 R_9 28.9
10 R_10 28.2
Wilcoxon signed rank test with continuity correction
data: myData$weight
V = 55, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
In the above output, the p-value of the test is 0.005793, which is less than the significance level alpha = 0.05. So we can reject the null hypothesis and conclude that the median weight of the rabbits is significantly different from 25 g (p-value = 0.005793).
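The alternative argument listed above can also be used for a one-sided test. As a variation on the example (not part of the original), the following sketch tests whether the median rabbit weight is greater than 25 g:
R
# one-sided test: is the median weight greater than 25 g?
wilcox.test(myData$weight, mu = 25, alternative = "greater")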
Comparing the means of paired samples
There are mainly two techniques used to compare the means of paired samples. These are:
- Paired sample T-test
- Paired Samples Wilcoxon Test
Paired sample T-test
This is a statistical procedure that is used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject is measured two times, resulting in pairs of observations.
Implementation in R:
For performing a paired-samples t-test in R, use the function t.test(). The syntax for the function is given below.
Syntax: t.test(x, y, paired =TRUE)
Parameters:
- x, y: numeric vectors
- paired: a logical value specifying that we want to compute a paired t-test
Example:
R
set.seed(0)
shopOne <- rnorm(50, mean = 140, sd = 4.5)
shopTwo <- rnorm(50, mean = 150, sd = 4)

# var.equal = TRUE runs a pooled two-sample t-test here;
# see the paired sketch after the output for the paired = TRUE form
result <- t.test(shopOne, shopTwo, var.equal = TRUE)
print(result)
Output:
Two Sample t-test
data: shopOne and shopTwo
t = -13.158, df = 98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.482807 -8.473061
sample estimates:
mean of x mean of y
140.1077 150.0856
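Note that the call above passes var.equal = TRUE rather than paired = TRUE, so it actually runs a pooled two-sample t-test, which is what the output header shows. For a genuinely paired design, assuming shopOne and shopTwo are two measurements on the same 50 units, a minimal sketch following the syntax given earlier would be:
R
# paired t-test: each element of shopOne is matched with the
# corresponding element of shopTwo
t.test(shopOne, shopTwo, paired = TRUE)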
Paired Samples Wilcoxon Test
The paired samples Wilcoxon test is a non-parametric alternative to the paired t-test, used to compare paired data when the data are not normally distributed.
Implementation in R:
To perform a paired samples Wilcoxon test, R provides the function wilcox.test(), which can be used as follows:
Syntax: wilcox.test(x, y, paired = TRUE, alternative = "two.sided")
Parameters:
- x, y: numeric vectors
- paired: a logical value specifying that we want to compute a paired Wilcoxon test
- alternative: the alternative hypothesis. Allowed value is one of "two.sided" (default), "greater" or "less".
Example: Here, let's use an example data set containing the weights of 10 rabbits before and after a treatment. We want to know whether there is any significant difference in the median weights before and after the treatment.
R
# rabbit weights before and after the treatment
before <- c(190.1, 190.9, 172.7, 213, 231.4,
            196.9, 172.2, 285.5, 225.2, 113.7)
after <- c(392.9, 313.2, 345.1, 393, 434,
           227.9, 422, 383.9, 392.3, 352.2)

myData <- data.frame(
  group = rep(c("before", "after"), each = 10),
  weight = c(before, after)
)
print(myData)

# paired Wilcoxon signed-rank test
result <- wilcox.test(before, after, paired = TRUE)
print(result)
Output:
group weight
1 before 190.1
2 before 190.9
3 before 172.7
4 before 213.0
5 before 231.4
6 before 196.9
7 before 172.2
8 before 285.5
9 before 225.2
10 before 113.7
11 after 392.9
12 after 313.2
13 after 345.1
14 after 393.0
15 after 434.0
16 after 227.9
17 after 422.0
18 after 383.9
19 after 392.3
20 after 352.2
Wilcoxon signed rank test
data: before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
In the above output, the p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the rabbits before treatment is significantly different from the median weight after treatment (p-value = 0.001953).
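Because the paired Wilcoxon test is a signed-rank test on the pairwise differences, an equivalent formulation (a sketch, not part of the original example) is a one-sample Wilcoxon test on the differences before - after, which yields the same statistic and p-value:
R
# one-sample signed-rank test on the paired differences
wilcox.test(before - after, mu = 0)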
Comparing the means of more than two groups
The main techniques used to compare the means of more than two groups are:
- Analysis of Variance (ANOVA)
  - One way ANOVA
  - Two way ANOVA
- MANOVA Test
- Kruskal–Wallis Test
One way ANOVA
The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of the independent two-samples t-test for comparing means in a situation where there are more than two groups. In one-way ANOVA, the data is organized into several groups based on a single grouping variable.
Implementation in R:
For performing the one-way analysis of variance (ANOVA) in R, use the function aov(). The function summary.aov() is used to summarize the analysis of the variance model. The syntax for the function is given below.
Syntax: aov(formula, data = NULL)
Parameters:
- formula: A formula specifying the model.
- data: A data frame in which the variables specified in the formula will be found
Example:
The one-way ANOVA test is performed using the built-in mtcars dataset, between the disp attribute (a continuous attribute) and the gear attribute (a categorical attribute).
R
# mtcars ships with base R, so no extra package is needed
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
print(summary(mtcars_aov))
Output:
Df Sum Sq Mean Sq F value Pr(>F)
factor(mtcars$gear) 2 280221 140110 20.73 2.56e-06 ***
Residuals 29 195964 6757
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The summary shows that the gear attribute is highly significant for displacement (indicated by the three stars). The p-value is less than 0.05, so we reject the null hypothesis and conclude that gear is significantly related to displacement.
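After a significant one-way ANOVA, it is common to ask which gear groups differ from each other. A minimal sketch of a post-hoc comparison (not part of the original example) using base R's Tukey Honest Significant Differences on the fitted model:
R
# pairwise comparisons of mean displacement between the gear levels
TukeyHSD(mtcars_aov)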
Two way ANOVA
The two-way ANOVA test is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable. It takes two categorical grouping variables into consideration.
Implementation in R:
For performing the two-way analysis of variance (ANOVA) in R, also use the function aov(). The function summary.aov() is used to summarize the analysis of variance model. The syntax for the function is given below.
Syntax: aov(formula, data = NULL)
Parameters:
- formula: A formula specifying the model.
- data: A data frame in which the variables specified in the formula will be found
Example: The two-way ANOVA test is performed using the built-in mtcars dataset, with the disp attribute (a continuous attribute) as the response and the gear and am attributes (both categorical) as the grouping variables.
R
# disp modeled by gear, am and their interaction
mtcars_aov2 <- aov(mtcars$disp ~ factor(mtcars$gear) *
                       factor(mtcars$am))
print(summary(mtcars_aov2))
Output:
Df Sum Sq Mean Sq F value Pr(>F)
factor(mtcars$gear) 2 280221 140110 20.695 3.03e-06 ***
factor(mtcars$am) 1 6399 6399 0.945 0.339
Residuals 28 189565 6770
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The summary shows that the gear attribute is highly significant for displacement (three stars), while the am attribute is not. The p-value for gear is less than 0.05, so gear is significantly related to displacement; the p-value for am is greater than 0.05, so am is not significantly related to displacement.
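The same two-way model can also be written with the data argument instead of the mtcars$ prefixes; this is purely a stylistic variant (not from the original example) that fits the identical model and only changes how the terms are labeled in the output:
R
# equivalent model specification using the data argument
summary(aov(disp ~ factor(gear) * factor(am), data = mtcars))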
MANOVA Test
Multivariate analysis of variance (MANOVA) is simply an ANOVA (analysis of variance) with several dependent variables. It is an extension of ANOVA. In an ANOVA, we test for statistical differences in one continuous dependent variable across the levels of an independent grouping variable. MANOVA extends this analysis by taking multiple continuous dependent variables and bundling them into a weighted linear composite variable. MANOVA then tests whether this composite differs across the different levels, or groups, of the independent variable.
Implementation in R:
R provides the function manova() to perform the MANOVA test. The class "manova" differs from class "aov" in selecting a different summary method. The function manova() calls aov() and then adds class "manova" to the result object for each stratum.
Syntax: manova(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, …)
Parameters:
- formula: A formula specifying the model.
- data: A data frame in which the variables specified in the formula will be found. If missing, the variables are searched for in the standard way.
- projections: Logical flag
- qr: Logical flag
- contrasts: A list of contrasts to be used for some of the factors in the formula.
- …: Arguments to be passed to lm, such as subset or na.action.
Example: To perform the MANOVA test in R, let's take the iris data set.
R
library(dplyr)

myData <- iris
set.seed(1234)

# preview 10 random rows of the data set
dplyr::sample_n(myData, 10)
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.5 2.5 4.0 1.3 versicolor
2 5.6 2.5 3.9 1.1 versicolor
3 6.0 2.9 4.5 1.5 versicolor
4 6.4 3.2 5.3 2.3 virginica
5 4.3 3.0 1.1 0.1 setosa
6 7.2 3.2 6.0 1.8 virginica
7 5.9 3.0 4.2 1.5 versicolor
8 4.6 3.1 1.5 0.2 setosa
9 7.9 3.8 6.4 2.0 virginica
10 5.1 3.4 1.5 0.2 setosa
To know whether there is any significant difference in sepal and petal length between the different species, perform the MANOVA test. The function manova() can be used as follows.
R
# MANOVA with sepal length and petal length as the two responses
result <- manova(cbind(Sepal.Length, Petal.Length) ~ Species,
                 data = iris)
summary(result)
Output:
Df Pillai approx F num Df den Df Pr(>F)
Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
Residuals 147
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the output above, it can be seen that the two response variables differ highly significantly across Species.
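To see which of the two response variables contributes to the multivariate effect, the univariate ANOVAs for each response can be inspected. A minimal sketch (not part of the original example) using summary.aov() on the MANOVA fit:
R
# separate one-way ANOVA summary for each response variable
summary.aov(result)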
Kruskal–Wallis test
The Kruskal–Wallis test is a rank-based test similar to the Mann–Whitney U test, but it can be applied to one-way data with more than two groups. It is a non-parametric alternative to the one-way ANOVA test and extends the two-sample Wilcoxon test to more than two groups. A group of data samples is independent if they come from unrelated populations and the samples do not affect each other. Using the Kruskal–Wallis test, it can be decided whether the population distributions are similar without assuming them to follow the normal distribution.
Implementation in R:
R provides a method kruskal.test() which is available in the stats package to perform a Kruskal-Wallis rank-sum test.
Syntax: kruskal.test(x, g, formula, data, subset, na.action, …)
Parameters:
- x: a numeric vector of data values, or a list of numeric data vectors.
- g: a vector or factor object giving the group for the corresponding elements of x
- formula: a formula of the form response ~ group where response gives the data values and group a vector or factor of the corresponding groups.
- data: an optional matrix or data frame containing the variables in the formula.
- subset: an optional vector specifying a subset of observations to be used.
- na.action: a function which indicates what should happen when the data contain NAs.
- …: further arguments to be passed to or from methods.
Example: Let’s use the built-in R data set named PlantGrowth. It contains the weight of plants obtained under control and two different treatment conditions.
R
# PlantGrowth is a built-in data set of plant weights under
# a control and two treatment conditions
myData <- PlantGrowth
print(myData)
print(levels(myData$group))
Output:
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl
7 5.17 ctrl
8 4.53 ctrl
9 5.33 ctrl
10 5.14 ctrl
11 4.81 trt1
12 4.17 trt1
13 4.41 trt1
14 3.59 trt1
15 5.87 trt1
16 3.83 trt1
17 6.03 trt1
18 4.89 trt1
19 4.32 trt1
20 4.69 trt1
21 6.31 trt2
22 5.12 trt2
23 5.54 trt2
24 5.50 trt2
25 5.37 trt2
26 5.29 trt2
27 4.92 trt2
28 6.15 trt2
29 5.80 trt2
30 5.26 trt2
[1] "ctrl" "trt1" "trt2"
Here the column "group" is called a factor and the different categories ("ctrl", "trt1", "trt2") are called factor levels. The levels are ordered alphabetically. We want to know whether there is any significant difference between the average weights of plants in the three experimental conditions. The test can be performed using the function kruskal.test() as given below.
R
myData <- PlantGrowth

# Kruskal-Wallis rank-sum test of weight across the three groups
result <- kruskal.test(weight ~ group, data = myData)
print(result)
Output:
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
As the p-value is less than the significance level 0.05, it can be concluded that there are significant differences between the treatment groups.
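To find out which pairs of treatment groups differ, a common follow-up (not part of the original example) is a pairwise Wilcoxon test with a correction for multiple testing:
R
# pairwise comparisons between group levels, Benjamini-Hochberg adjusted
pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                     p.adjust.method = "BH")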