R Programming Unit 4
UNIT 4
Statistical Testing and Modelling
Introduction
Statistical testing and modelling are fundamental aspects of data analysis and research,
used to make inferences and draw conclusions from data. They help researchers and analysts make
decisions, assess the significance of relationships, and understand the reliability of their findings.
Here’s an introduction to statistical testing and modelling:
Statistical Testing
Statistical testing involves the use of data-driven methods to make inferences about
populations or data samples. The process typically follows these steps:
Formulate Hypotheses: Establish a null hypothesis (no effect) and an alternative
hypothesis (there is an effect or difference).
Select a Test: Choose an appropriate statistical test based on the nature of the data and the
research question.
Collect Data: Gather data from the sample or population of interest.
Compute the Test Statistic: Calculate a test statistic based on the collected data and the
chosen test.
Determine Significance: Compare the test statistic to a critical value or compute a p-value
to determine whether the results are statistically significant.
Common types of statistical tests include t-tests, ANOVA (analysis of variance), chi-
square tests, and regression analysis.
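As a compact sketch of these steps in R (the data and hypothesized mean below are illustrative):
#Step 1: H0: mean = 50 vs. H1: mean != 50
#Step 2: choose a one-sample t-test
#Step 3: illustrative sample data
x<-c(48, 52, 51, 49, 53, 50, 47, 52)
#Step 4: compute the test statistic
res<-t.test(x, mu=50)
#Step 5: compare the p-value to the significance level 0.05
res$p.value < 0.05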
Statistical Modelling
Statistical modeling involves the use of mathematical models to describe and understand
relationships within data. This process helps in predicting outcomes, understanding complex
systems, and identifying underlying patterns in data. Key steps in statistical modeling include:
Data Collection and Preparation: Gather and preprocess data for analysis, ensuring data
quality and relevance.
Model Selection: Choose an appropriate statistical model based on the data characteristics
and the research objectives.
Model Fitting: Estimate the parameters of the selected model using techniques such as
least squares estimation, maximum likelihood estimation, or Bayesian inference.
Model Evaluation: Assess the model’s performance and validity using various metrics,
such as goodness-of-fit measures, prediction accuracy, and diagnostic tests.
Model Interpretation: Interpret the results and make conclusions about the relationships
between variables and the overall model’s predictive capability.
Statistical modeling techniques include linear regression, logistic regression, time series
analysis, machine learning models, and more advanced methods like neural networks and decision
trees.
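As a minimal sketch of these steps, using linear regression on R's built-in mtcars dataset:
#Model selection and fitting: simple linear regression of mpg on weight
model<-lm(mpg ~ wt, data=mtcars)
#Model evaluation: goodness of fit via R-squared
summary(model)$r.squared
#Model interpretation: estimated intercept and slope
coef(model)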
Both statistical testing and modeling are essential tools in various fields such as economics,
social sciences, natural sciences, and engineering.
Sampling Distributions in R
Samples
Generating samples is a common task when working with statistical analyses, simulations, and data
modeling. R provides several functions and methods to create samples from different types of
distributions or datasets. Here are some common ways to generate samples in R:
Random Sampling from a Vector: use the sample() function to randomly sample elements
from a vector or a set of values.
Ex:
#Random sampling from a vector
x<-c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
rand_samp<-sample(x, size=5, replace=FALSE)
rand_samp
Output: [1] 5 3 9 1 4
Sampling Distribution:
A sampling distribution is a theoretical probability distribution that describes the behavior of a
statistic based on repeated sampling from a population. It provides information about the variability
of a statistic, such as the sample mean or sample proportion, across multiple samples of the same
size drawn from the same population. Understanding sampling distributions is crucial in statistical
inference, as they help assess how much a statistic varies from sample to sample.
Syntax:
hist(v,main,xlab,ylab,col)
where
v is a vector containing the values used in the histogram.
main indicates the title of the chart.
col is used to set the color of the bars.
xlab is used to give a description of the x-axis.
ylab is used to give a description of the y-axis.
Ex:
#Setting the parameter
pop_mean<-100
pop_sd<-15
sam_size<-30
ns<-1000
#Generating the population data
pop<-rnorm(10000, mean=pop_mean, sd=pop_sd)
#Creating an empty vector to store sample mean
sam_mean<-numeric(ns)
#Simulating multiple samples and calculating means
for(i in 1:ns){
sam<-sample(pop, size=sam_size)
sam_mean[i]<-mean(sam)
}
#Plotting the sampling distribution of the sample mean
hist(sam_mean, main="Sampling Distribution of the Mean", xlab="Sample Mean", col="skyblue")
Output: a histogram of the 1000 sample means, centered near the population mean of 100.
Testing:
"Testing" refers to the process of evaluating or examining something to assess its
characteristics, performance, or quality. In the context of software development or quality
assurance, testing refers to the systematic process of verifying that a software application or system
meets specified requirements and functions as expected. It involves the identification of errors,
bugs or other issues that may affect the functionality, reliability or security of the software.
Regression Testing: This type of testing in R involves checking whether recent code changes
have affected the existing functionality and if any new bugs or errors have been introduced.
Each type of testing serves a specific purpose and helps ensure the overall quality and
effectiveness of the software product. The goal of testing is to identify and resolve any defects or
issues before the software is released to end-users, thereby enhancing the user experience and
minimizing potential risks and vulnerabilities.
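As a sketch, automated regression tests for R code are commonly written with the testthat package (assuming it is installed); re-running such tests after code changes helps catch newly introduced bugs. The function and test below are hypothetical examples:
library(testthat)
#A simple function under test
add_nums<-function(a, b) a + b
#These expectations should keep passing after future code changes
test_that("add_nums still returns correct sums", {
expect_equal(add_nums(2, 3), 5)
expect_equal(add_nums(-1, 1), 0)
})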
Factor in R:
Factor in R is a variable used to categorize and store the data, having a limited number of
different values. It stores the data as a vector of integer values. Factor in R is also known as a
categorical variable that stores both string and integer data values as levels. Factor is mostly used
in Statistical Modeling and exploratory data analysis with R.
In a dataset, we can distinguish two types of variables:
Categorical
Continuous
In descriptive statistics for categorical variables in R, the values are limited and usually based
on a particular finite group. For example, a categorical variable in R can be country, year, gender, or
occupation.
A continuous variable, however, can take any value from integer to decimal. For example, we
can have the revenue, the price of a share, etc.
Categorical Variable:
Categorical variables in R are stored as factors. Character data is not supported by machine
learning algorithms, so the only way is to convert each string to an integer.
Syntax:
factor(x=character(), levels, labels=levels, ordered=is.ordered(x))
Arguments:
x: a vector of categorical data in R. Needs to be a string or integer, not decimal.
levels: a vector of possible values taken by x. This argument is optional. The default value
is the unique list of items of the vector x.
labels: adds a label to the categorical data in R. For example, 1 can take the label 'male' while
0 takes the label 'female'.
ordered: determines whether the levels should be ordered in categorical data in R.
Ex:
gen_v<-c("Male", "Female", "Female", "Male", "Male")
class(gen_v)
fac_gen_v<-factor(gen_v)
class(fac_gen_v)
Output:
[1] "character"
[1] "factor"
A categorical variable in R can be divided into nominal categorical variable and ordinal
categorical variable.
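A short sketch of the levels, labels, and ordered arguments (the 0/1 coding follows the labels description above; the size data is illustrative):
#Label the integer codes: 0 becomes 'female', 1 becomes 'male'
resp<-c(1, 0, 1, 1, 0)
fac_resp<-factor(resp, levels=c(0, 1), labels=c("female", "male"))
print(fac_resp)
#An ordinal (ordered) factor: the levels have a natural order
sizes<-factor(c("small", "large", "medium"),
levels=c("small", "medium", "large"), ordered=TRUE)
print(sizes)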
Ex:
col_v<-c("blue", "red", "green", "white", "black", "yellow")
#convert the vector
fac_col<-factor(col_v)
print(fac_col)
Output:
##[1] blue red green white black yellow
## Levels: black blue green red white yellow
HYPOTHESIS TESTING
Introduction:
Hypothesis testing is a fundamental statistical method used to make inferences about
population parameters based on sample data. It involves evaluating the validity of a claim or
hypothesis about a population parameter using sample evidence. The process of hypothesis testing
follows a structured framework and aims to assess whether the evidence from the data supports or
contradicts the proposed hypothesis.
Hypothesis testing is the process of testing the hypothesis made by the researcher in order to validate it. To
perform hypothesis testing, a random sample of data from the population is taken and the test
is performed. Based on the results of the test, the hypothesis is either accepted or rejected.
This concept is known as Statistical Inference.
The main t-tests covered below are:
One Sample T-Test
Two Sample T-Test
Paired T-Test
R provides a powerful environment for conducting various statistical tests and analyses,
allowing researchers and data analysts to test specific hypotheses and draw meaningful
conclusions.
After running a test such as t.test() and storing the result in res, the test statistic and p-value
can be extracted from the result object:
test_statistic<-res$statistic
p_value<-res$p.value
Make a Decision/Interpret the Decision:
Compare the p-value to the significance level (α) to decide whether to reject the null
hypothesis or fail to reject it.
#Ex of making a decision
alpha<-0.05
if(p_value<alpha){
print("Reject the null hypothesis")
}else{
print("Fail to reject the null hypothesis")
}
One Sample T-Test:
Syntax: t.test(x, mu)
Parameters:
x: represents a numeric vector of data.
mu: represents the true value of the mean.
Ex:
#defining sample vector
x<-rnorm(100)
#One sample T-Test
t.test(x, mu=5)
Output:
One Sample t-test
data: x
t = -56.42, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
-0.2482380 0.1083696
sample estimates:
mean of x
-0.06993419
Data: the dataset ‘x’ was used for the test
The determined t-value is -56.42
Degree of Freedom(df): the t-test has 99 degrees of freedom.
The p-value is less than 2.2e-16, which indicates that there is substantial evidence refuting the null
hypothesis.
Alternative hypothesis: The true mean is not equal to five, according to the alternative
hypothesis.
95% confidence interval: (-0.2482380, 0.1083696) is the confidence interval's value. This
range denotes the values that, with 95% confidence, contain the true population
mean.
Ex:
data<-c(5.1, 4.8, 5.2, 4.9, 5.0, 5.1, 4.9, 5.2, 5.0)
#assuming the null hypothesis that the true mean is 5
res<-t.test(data, mu=5)
print(res)
Output:
One Sample t-test
data: data
t = 0.78899, df = 7, p-value = 0.456
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.82526 5.34974
sample estimates:
mean of x
5.0875
Two Sample T-Test:
The two-sample t-test compares the means of two independent groups.
Syntax: t.test(x, y)
Parameters: x and y are numeric vectors.
Ex:
#sample data for two groups
g1<-c(25,27,26,28,30,29,31,32,30,28)
g2<-c(22,24,26,27,25,28,29,31,20,28)
res<-t.test(g1,g2)
print(res)
Output:
Welch Two Sample t-test
data: g1 and g2
t = 2.0526, df = 15.676, p-value = 0.05719
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.08973199 5.28973199
sample estimates:
mean of x mean of y
28.6 26.0
wilcox.test():
The wilcox.test() function is used to conduct a Wilcoxon rank-sum test (Mann-Whitney U test)
to compare the medians of two independent samples. This non-parametric test is used when the
assumptions of the t-test are not met, such as when the data are not normally distributed or when
the sample sizes are small.
Syntax:
wilcox.test(x, y, alternative=c("two.sided","less","greater"), mu=0, paired=FALSE,
conf.int=FALSE, conf.level=0.95)
x and y: numeric vectors containing the two samples that you want to compare.
alternative: A character string specifying the alternative hypothesis, which can be
"two.sided" (the default), "less", or "greater".
mu: a numeric value giving the hypothesized median under the null hypothesis.
paired: A logical value indicating whether the samples are paired.
conf.int: A logical value indicating whether to compute the confidence interval of the true
location shift.
conf.level: the confidence level of the interval, which is set to 0.95 by default.
Ex:
g1<-c(17,12,20,21,19)
g2<-c(15,14,18,16,13)
res<-wilcox.test(g1,g2,alternative="two.sided")
print(res)
Output:
Wilcoxon rank sum exact test
data: g1 and g2
W = 19, p-value = 0.2222
alternative hypothesis: true location shift is not equal to 0
Paired T-Test:
A paired t-test is performed using the t.test() function. The paired t-test is used when you
have two related groups or samples. The data should be paired because the measurements in each
group are not independent, and each observation in one group has a unique corresponding
measurement in the other group.
Ex:
b<-c(23, 21, 29, 18, 25)
a<-c(19, 20, 21, 17, 22)
res<-t.test(b,a, paired=TRUE)
print(res)
Output:
Paired t-test
data: b and a
t = 2.6389, df = 4, p-value = 0.05765
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.1771993 6.9771993
sample estimates:
mean difference
3.4
b and a are the two sets of paired data. Each element in b(before) corresponds to the same
element at the same index in a(after).
t.test() is used to perform the paired t-test. Setting paired=TRUE indicates that the two
samples are paired.
The result is stored in the res variable.
Finally, the result is printed.
Chi-Square Test:
The Chi-Square Test is used to analyze a frequency table (i.e., a contingency table), which
is formed by two categorical variables. The chi-square test evaluates whether there is a significant
relationship between the categories of the two variables.
The Chi-Square Test is a statistical method which is used to determine whether two
categorical variables have a significant correlation between them. These variables should be from
the same population and should be categorical, like Yes/No, Red/Green, Male/Female, etc. The
chi-square test is also useful when comparing the tallies or counts of categorical responses between
two (or more) independent groups.
Syntax: chisq.test(x, y=NULL, correct=TRUE)
x: A contingency table (a matrix or a data frame) containing the observed counts or
frequencies of the categories.
y: if you have a contingency table, you can provide it as x, and y can be left as NULL. If
you have a matrix or data frame with rows and columns representing two categorical
variables, you can provide both x and y.
correct: A logical value indicating whether to apply continuity correction. It is usually set to
TRUE.
An example of how to use the chi-square test in R:
Ex:
Consider the survey data in the MASS library, which represents the data from a survey
conducted on students.
#load the MASS package
library(MASS)
str(survey)
Output:
'data.frame': 237 obs. of 12 variables:
$ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
$ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
$ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
$ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
$ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
$ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
$ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
$ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
$ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
$ Height: num 173 178 NA 160 165 ...
$ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
$ Age : num 18.2 17.6 16.9 20.3 23.7 ...
The above result shows the dataset has many Factor variables which can be considered as
categorical variables.
In this model, we consider the variables "Exer" and "Smoke".
The Smoke column records the students' smoking habits, and the Exer column records their exercise
level.
We test the hypothesis that the students' smoking habit is independent of their exercise level at the
0.05 significance level.
Ex:
stu_data=data.frame(survey$Smoke, survey$Exer)
stu_data=table(survey$Smoke, survey$Exer)
print(stu_data)
#Perform the chi-square test of independence
print(chisq.test(stu_data))
Output:
Freq None Some
Heavy 7 1 3
Never 87 18 84
Occas 12 3 4
Regul 9 1 7
Pearson's Chi-squared test
data: stu_data
X-squared = 5.4885, df = 6, p-value = 0.4828
As the p-value of 0.4828 is greater than 0.05, we conclude that the smoking habit is
independent of the exercise level of the students, and hence there is weak or no correlation between
the two variables.
Ex:
ob_data<-matrix(c(45, 20, 15, 60), nrow=2, byrow=TRUE)
colnames(ob_data)<-c("Category A", "Category B")
rownames(ob_data)<-c("Group 1", "Group 2")
res<-chisq.test(ob_data)
print(res)
Output:
Pearson's Chi-squared test with Yates' continuity correction
data: ob_data
X-squared = 32.481, df = 1, p-value = 1.204e-08
In this example, the contingency table has two rows representing two groups and two columns
representing two categories.
We specify row and column names for the table.
We use the chisq.test() function to perform the chi-square test on the contingency table.
The results are stored in the res variable.
We print the results, which include the chi-squared statistic, degrees of freedom, the p-value,
and other information.
Disadvantages:
Assumptions: Hypothesis testing relies on various assumptions about the data, such as normality,
independence and homogeneity of variance. Violations of these assumptions can affect the validity
of the results.
Simplification of Reality: Hypothesis testing often simplifies complex real-world scenarios,
leading to a reduction in the richness of the data and potentially overlooking important nuances.
Risk of Errors: There is a risk of committing Type I errors (rejecting a true null hypothesis) and
Type II errors (failing to reject a false null hypothesis), especially when the sample size is small
or the effect size is small.
Limited Scope: Hypothesis testing is constrained by the specific hypothesis formulated by the
researchers. It may not capture the full complexity of relationships between variables in the data.
Proportions Testing:
Proportions testing is commonly used to analyze categorical data, especially when working
with binary outcomes or proportions. It helps assess whether the proportions of successes or events
differ significantly across groups or whether a sample proportion significantly deviates from an
expected proportion.
In R, you can perform various types of proportions testing using specific functions such as
prop.test() and binom.test().
Syntax:
The prop.test() function is used to perform proportions testing.
prop.test(x, n, p=NULL, alternative="two.sided", conf.level=0.95)
x: Number of successes (a vector of counts of successes).
n: Number of trials or total sample size (a vector of the same length as x).
p: A single number giving the hypothesized probability of success. If not given, it is set to
the overall proportion of successes.
alternative: A character string specifying the alternative hypothesis, which can be
"two.sided", "less", or "greater".
Ex:
Program to perform a proportions test in R:
su<-20
tot_tr<-50
res<-prop.test(su,tot_tr, alternative="two.sided",conf.level=0.95)
print(res)
Output:
1-sample proportions test with continuity correction
data: su out of tot_tr, null probability 0.5
X-squared = 1.62, df = 1, p-value = 0.2031
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.2673293 0.5479516
sample estimates:
p
0.4
Here su represents the number of successes and tot_tr represents the total
number of trials. The alternative parameter is set to "two.sided" to test a two-tailed alternative
hypothesis, and the conf.level parameter is set to 0.95 to calculate a 95% confidence interval.
One-Proportion Z-Test:
The one-proportion z-test is used to test whether an observed proportion differs from a
hypothesized proportion. For example, whether the proportion of persons with a given
characteristic in one area is different from or identical to that in other areas.
Implementation in R:
In R, the functions used for performing a one-proportion z-test are binom.test() and prop.test().
Syntax:
binom.test(x, n, p=0.5, alternative="two.sided")
prop.test(x, n, p=NULL, alternative="two.sided", correct=TRUE)
Parameters:
x: number of successes, or a vector of length 2 giving the numbers of successes and failures.
n: size of the dataset (number of trials).
p: probability of success. It must be in the range of 0 to 1.
alternative: a character string specifying the alternative hypothesis.
correct: a logical indicating whether Yates' continuity correction should be applied where
possible.
Ex:
Let's assume that 30 out of 70 people recommended street food to their friends. To test
this claim, a random sample of 150 people was obtained. Of these 150 people, 80 indicated that they
recommended street food to their friends. Is this claim accurate? Use alpha=0.05.
Solution: Now x=80, p=0.30, n=150. We want to know whether the proportion of people who
recommend street food differs from 0.30. Let's use the function prop.test() in R.
#Using prop.test()
prop.test(x=80, n=150, p=0.3, correct=FALSE)
Output:
1-sample proportions test without continuity correction
data: 80 out of 150, null probability 0.3
X-squared = 38.889, df = 1, p-value = 4.486e-10
alternative hypothesis: true p is not equal to 0.3
95 percent confidence interval:
0.4536625 0.6113395
sample estimates:
p
0.5333333
Two-Proportion Z-Test:
The two-proportion z-test compares the proportions of two independent groups or samples.
The z-test is used to determine if there is a significant difference between the proportions of
success in the two groups. The prop.test() function can be used to conduct a two-proportion z-test.
Ex:
res<-prop.test(x=c(490,400), n=c(500,500))
res
Output:
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
Type I Error:
A type I error, also known as an alpha error, occurs when the null hypothesis is rejected
even though it is actually true. This means that the test results lead to the conclusion that there is
a significant effect when in fact there is no such effect. In the context of R, the probability of
committing a Type I error is represented by the significance level, denoted by alpha (α).
General syntax for conducting a hypothesis test that may result in a Type I error in R:
#sample data
d<-c(3, 5, 4, 6, 7, 8, 2, 3, 5, 4)
#One-sample t-test
res<-t.test(d, mu=4.5)
#Extract the p-value
p_val<-res$p.value
alpha<-0.05
#Check for a Type I error
if(p_val<alpha){
print("Reject the null hypothesis; a Type I error might have occurred.")
}else{
print("Fail to reject the null hypothesis.")
}
Output: [1] "Fail to reject the null hypothesis."
In this example, a one-sample t-test using the t.test() function in R. we then extract the p-
value from the test results and define the significance level (alpha). If the p-value is less than the
significance
Type II Error:
A Type II error, often denoted as β (beta), is a statistical term used in hypothesis testing
that occurs when the null hypothesis is not rejected when it is actually false. In simpler terms, it
is the error of failing to detect a real effect or difference when it exists. This error is particularly
relevant when evaluating the power of a statistical test, which is the probability of correctly
rejecting the null hypothesis when it is false.
Ex:
#Generate two sets of data with different means
set.seed(123)
d1<-rnorm(100, mean=5, sd=2)
d2<-rnorm(100, mean=7, sd=2)
#The true means differ, so failing to reject the null hypothesis here would be a Type II error
res<-t.test(d1, d2)
print(res)
For specific types of statistical tests (e.g., t-tests, ANOVA, chi-squared tests), different functions and
packages can be used to perform power analysis and calculate power. For instance, in R you can
use the pwr package and its functions pwr.t.test(), pwr.anova.test(), and pwr.chisq.test() to conduct power analysis for
various test scenarios.
Ex:
To calculate the power of a t-test in R using the pwr.t.test function
#install.packages("pwr")
#Load the required library
library(pwr)
#Set parameters (sample size and significance level are assumed for illustration)
effect_size<-0.8 #This is the standardized effect size (Cohen's d)
sample_size<-30 #Assumed sample size per group
alpha<-0.05
#Calculate the power of a two-sample t-test and extract the power value
res<-pwr.t.test(d=effect_size, n=sample_size, sig.level=alpha, type="two.sample")
print(res$power)
In this example, we are performing a two-sample t-test and assuming equal variances
between the groups. We set the effect size, sample size per group, and significance level. The
pwr.t.test function calculates the power of the t-test and we extract the power value from the
analysis. The script then prints the statistical power to console.
ANOVA (Analysis of Variance):
ANOVA is used to compare the means of three or more groups. Ex:
A car company wishes to compare the average petrol consumption of THREE similar
models of car and has available six vehicles of each model.
A teacher is interested in a comparison of average percentage marks attained in the
examinations of FIVE different subjects and has available the marks of eight students who
all completed each examination.
ANOVA function in R: Use the aov() function in R to perform ANOVA. This function takes a
formula as an argument, typically in the form of response_variable ~ group_variable.
Interpreting ANOVA Results: After running the ANOVA, you can use the summary() function
to view the ANOVA table, which provides key statistics such as the F-value, degrees of freedom,
and p-value.
One-way ANOVA:
ANOVA can be performed as a one-way ANOVA (one independent variable) or a two-way
ANOVA (two independent variables).
ANOVA is a statistical method used to compare means of three or more groups to
determine whether there are any statistically significant differences among the means. It is called
“one-way” because there is only one independent variable affecting the dependent variable. This
independent variable typically represents the different groups or levels being compared.
Syntax:
model<-aov(response_variable ~ group_variable, data=your_data_frame)
where:
model: This is the variable that will store the ANOVA model
aov(): The function used to fit the ANOVA model
response_variable: This is the continuous response variable you are analyzing across
different groups.
group_variable: This represents the categorical variable that defines the different groups.
your_data_frame: This refers to the data frame that contains your data.
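The output below comes from a three-group example; as a minimal sketch, code of the following kind produces such a table (the group values here are illustrative, so the exact numbers will differ):
Ex:
#Three groups of illustrative measurements
g1<-c(10, 12, 11, 13, 12)
g2<-c(15, 16, 14, 17, 15)
g3<-c(9, 8, 10, 9, 11)
#Combine into a data frame with a value column and a group label
data<-data.frame(value=c(g1, g2, g3),
group=factor(rep(c("g1", "g2", "g3"), each=5)))
#Fit the one-way ANOVA model and view the ANOVA table
model<-aov(value ~ group, data=data)
summary(model)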
Output:
Df Sum Sq Mean Sq F value Pr(>F)
group 2 91.2 45.6 12.67 0.0011 **
Residuals 12 43.2 3.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In this example:
There are three groups of data: g1, g2 and g3
The data are combined into a data frame data, where the variable value contains the
measurements and the variable group represents the group labels.
The aov() function is used to fit the one-way ANOVA model, with value as the response variable
and group as the independent variable.
Two-way ANOVA:
The two-way ANOVA is used to analyze the effect of two categorical independent
variables on a continuous dependent variable. Here’s an example of how to perform a two-way
ANOVA in R.
set.seed(123)
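#The rest of this example is a sketch under assumed data: two hypothetical
#categorical factors (fertilizer and water) and a continuous response (yield)
fertilizer<-factor(rep(c("A", "B"), each=10))
water<-factor(rep(c("low", "high"), times=10))
yield<-rnorm(20, mean=10, sd=2)
dat<-data.frame(yield, fertilizer, water)
#Fit the two-way ANOVA model with both factors and view the ANOVA table
model<-aov(yield ~ fertilizer + water, data=dat)
summary(model)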