
R PROGRAMMING V - SEM BCA

UNIT 4
Statistical Testing and Modelling

Introduction
Statistical testing and modelling are fundamental aspects of data analysis and research,
used to make inferences and draw conclusions from data. They help researchers and analysts make
decisions, assess the significance of relationships, and understand the reliability of their findings.
Here’s an introduction to statistical testing and modelling:

Statistical Testing
Statistical testing involves the use of data-driven methods to make inferences about
populations or data samples. The process typically follows these steps:
Formulate Hypotheses: Establish a null hypothesis (no effect) and an alternative
hypothesis (there is an effect or difference).
Select a Test: Choose an appropriate statistical test based on the nature of the data and the
research question.
Collect Data: Gather data from the sample or population of interest.
Compute the Test Statistic: Calculate a test statistic based on the collected data and the
chosen test.
Determine Significance: Compare the test statistic to a critical value or compute a p-value
to determine whether the results are statistically significant.
Common types of statistical tests include t-tests, ANOVA (analysis of variance), chi-
square tests, and regression analysis.
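
As a quick sketch of these steps in R (the two score vectors below are assumed data, invented purely for illustration):

#Hypothetical data: exam scores under two teaching methods
method_a<-c(72, 85, 78, 90, 88)
method_b<-c(70, 75, 80, 72, 74)
#H0: the group means are equal; H1: they differ
res<-t.test(method_a, method_b)
res$statistic #the computed test statistic
res$p.value #compared against the chosen significance level (e.g., 0.05)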

Statistical Modelling
Statistical modeling involves the use of mathematical models to describe and understand
relationships within data. This process helps in predicting outcomes, understanding complex
systems, and identifying underlying patterns in data. Key steps in statistical modeling include:

Data Collection and Preparation: Gather and preprocess data for analysis, ensuring data
quality and relevance.

Model Selection: Choose an appropriate statistical model based on the data characteristics
and the research objectives.

Model Fitting: Estimate the parameters of the selected model using techniques such as
least squares estimation, maximum likelihood estimation, or Bayesian inference.

Model Evaluation: Assess the model’s performance and validity using various metrics,
such as goodness-of-fit measures, prediction accuracy, and diagnostic tests.


Model Interpretation: Interpret the results and make conclusions about the relationships
between variables and the overall model’s predictive capability.

Statistical modeling techniques include linear regression, logistic regression, time series
analysis, machine learning models, and more advanced methods like neural networks and decision
trees.
Both statistical testing and modeling are essential tools in various fields such as economics,
social sciences, natural sciences, and engineering.
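
As a brief sketch of this workflow in R, using the built-in mtcars dataset and a simple linear regression (one possible model choice among those listed above):

#Model selection and fitting: regress mpg on car weight (built-in mtcars data)
model<-lm(mpg ~ wt, data=mtcars)
#Model evaluation: R-squared, coefficient tests and residual summary
summary(model)
#Model interpretation: predicted mpg for a hypothetical car with wt=3 (i.e., 3000 lbs)
predict(model, newdata=data.frame(wt=3))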

Sampling Distributions in R
Samples

Generating samples is a common task when working with statistical analyses, simulations, and data
modeling. R provides several functions and methods to create samples from different types of
distributions or datasets. Here are some common ways to generate samples in R:

Random Sampling from a Vector: Use the sample() function to randomly sample elements
from a vector or a set of values.
Ex of random sampling from a vector
x<-c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
rand_samp<-sample(x, size=5, replace=FALSE)
rand_samp
Output: [1] 5 3 9 1 4

Sampling Distribution:
A sampling distribution is a theoretical probability distribution that describes the behavior of a
statistic based on repeated sampling from a population. It provides information about the variability
of a statistic, such as the sample mean or sample proportion, across multiple samples of the same
size drawn from the same population. Understanding sampling distributions is crucial in statistical
inference, as they help assess how much a statistic is expected to vary from sample to sample.

In statistics, a population is an entire pool from which a statistical sample is drawn. A
population may refer to an entire group of people, objects, events, hospital visits, or measurements.
A population can thus be said to be an aggregate observation of subjects grouped together by a
common feature.
 A sampling distribution is a statistic that is arrived at through repeated sampling from a
larger population.
 It describes the range of possible outcomes for a statistic, such as the mean or mode of some
variable, as it truly exists in a population.
 The majority of data analyzed by researchers are actually drawn from samples, and not
populations.


Syntax:
hist(v,main,xlab,ylab,col)
where
v is a vector containing the values used in the histogram
main indicates the title of the chart.
col is used to set the color of the bars.
xlab is used to give a description of the x-axis
ylab is used to give a description of the y-axis

Steps to calculate Sampling Distributions in R:


Step 1: First, define the number of samples (n=1000).
n<-1000
Step 2: Next, create a vector (sam_mean) of length 'n' filled with NA values; the rep() function
is used to replicate a value in a vector.
Syntax: rep(value_to_be_replicated, number_of_times)
Step 3: Fill the sam_mean vector with sample means from the considered population: each
sample of 20 observations is generated with rnorm(), which draws from a normal distribution
with a mean of 10 (mean) and a standard deviation of 10 (sd), and is averaged with the mean()
function.
Syntax: mean(x, trim=0)
rnorm(n, mean, sd)
Step 4: To check the created samples, use head(), which returns the first six elements of a
data frame (vector, list, etc.).
Syntax: head(data_frame, no_of_rows_to_be_returned) #by default the second argument is set to
6 in R.
Step 5: To visualize the sample means, plot a histogram (for better visualization) using the
hist() function in R.
Step 6: Finally, find the probability that a generated sample mean is greater than or equal
to 10.

#define number of samples
n<-1000
#create empty vector to hold the sample means
sam_mean=rep(NA, n)
#fill empty vector with means
for(i in 1:n){
sam_mean[i]=mean(rnorm(20, mean=10, sd=10))
}
head(sam_mean)

#create histogram to visualize


hist(sam_mean, main="Sampling distribution", xlab="Sample Means",
ylab="Frequency", col="blue")

#To cross check, find the mean and sd of the sample means
mean(sam_mean)
sd(sam_mean)
#To find the probability
sum(sam_mean>=10)/length(sam_mean)

Output: (a histogram of the sample means, roughly bell-shaped and centered near 10, followed by the computed mean, sd, and probability)

Ex:
#Setting the parameter
pop_mean<-100
pop_sd<-15
sam_size<-30
ns<-1000
#Generating the population data
pop<-rnorm(10000, mean=pop_mean, sd=pop_sd)
#Creating an empty vector to store sample mean
sam_mean<-numeric(ns)
#Simulating multiple samples and calculating means
for(i in 1:ns){
sam<-sample(pop, size=sam_size)
sam_mean[i]<-mean(sam)
}


#Plotting the sampling distribution of means


hist(sam_mean,
main="Sampling Distribution of Sample Means",
xlab="Sample Mean",
ylab="Frequency",
col="skyblue"
)

Output: (a histogram of the sample means, centered near the population mean of 100)

Testing:
“Testing” refers to the process of evaluating or examining something to assess its
characteristics, performance, or quality. In the context of software development or quality
assurance, testing refers to the systematic process of verifying that a software application or system
meets specified requirements and functions as expected. It involves the identification of errors,
bugs or other issues that may affect the functionality, reliability or security of the software.

Software Testing includes various types and levels of testing:


Hypothesis Testing: R provides functions and packages for conducting various hypothesis tests,
such as t-tests, ANOVA, chi-square tests and others to assess the significance of relationships or
differences in data.
Unit Testing: Unit testing in R involves the process of evaluating individual components or units
of code to ensure that each component functions correctly in isolation. testthat and RUnit are
popular R packages used for unit testing.
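
A minimal unit-test sketch using the testthat package (add_one is a hypothetical function defined only for this example; testthat must be installed):

library(testthat)
#a hypothetical function to be tested
add_one<-function(x) x + 1
#a unit test verifying its behavior in isolation
test_that("add_one increments its input by one", {
expect_equal(add_one(1), 2)
expect_equal(add_one(-1), 0)
})
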
Integration Testing: Integration testing in R involves testing how different components or
functions work together to ensure that the integrated parts of the code function correctly as a whole.


Regression Testing: This type of testing in R involves checking whether recent code changes
have affected the existing functionality and whether any new bugs or errors have been introduced.

Each type of testing serves a specific purpose and helps ensure the overall quality and
effectiveness of the software product. The goal of testing is to identify and resolve any defects or
issues before the software is released to end-users, thereby enhancing the user experience and
minimizing potential risks and vulnerabilities.

Factor in R:
Factor in R is a variable used to categorize and store the data, having a limited number of
different values. It stores the data as a vector of integer values. Factor in R is also known as a
categorical variable that stores both string and integer data values as levels. Factor is mostly used
in Statistical Modeling and exploratory data analysis with R.
In a dataset we can distinguish two types of variables:
 Categorical
 Continuous

In descriptive statistics for categorical variables in R, the set of values is limited and usually based
on a particular finite group. For example, a categorical variable in R can be country, year, gender,
or occupation.
A continuous variable, however, can take any value from integer to decimal. For example,
we can have the revenue, the price of a share, etc.
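
A small sketch of the distinction (the vectors are assumed data, invented for illustration):

#categorical: a limited set of values, stored as a factor
country<-factor(c("India", "USA", "India"))
#continuous: any numeric value
revenue<-c(102.5, 98.3, 110.0)
class(country) #"factor"
class(revenue) #"numeric"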

Categorical Variable:
Categorical variables in R are stored as factors. Characters are not supported by many machine
learning algorithms, and the only way to use them is to convert each string to an integer.

Syntax:
factor(x=character(), levels, labels=levels, ordered=is.ordered(x))

Arguments:
 x: a vector of categorical data in R. Needs to be a string or integer, not decimal.
 levels: a vector of possible values taken by x. This argument is optional. The default value
is the unique list of items of the vector x.
 labels: add a label to the categorical data in R. For example, 1 can take the label 'male' while
0 takes the label 'female'.
 ordered: determines whether the levels should be ordered in categorical data in R.

Ex:
gen_v<-c("Male", "Female", "Female", "Male", "Male")
class(gen_v)


fac_gen_v<-factor(gen_v)
class(fac_gen_v)

Output:
[1] "character"
[1] "factor"

A categorical variable in R can be divided into nominal categorical variables and ordinal
categorical variables.

Nominal Categorical Variable:


A nominal categorical variable has several values, but the order does not matter. For instance,
male or female. Nominal categorical variables in R do not have an ordering.

Ex:
col_v<-c("blue", "red", "green", "white", "black", "yellow")
#convert the vector to a factor
fac_col<-factor(col_v)
print(fac_col)

Output:
##[1] blue red green white black yellow
## Levels: black blue green red white yellow

Ordinal Categorical Variable:


Ordinal categorical variables do have a natural ordering. We can specify the order, from the
lowest to the highest, through the levels argument together with order=TRUE.
Ex:
day_v<-c('evening', 'morning', 'afternoon', 'midday', 'midnight', 'evening')
fac_day<-factor(day_v, order=TRUE, levels=c('morning', 'midday', 'afternoon', 'evening',
'midnight'))
fac_day
Output:
[1] evening morning afternoon midday midnight evening
Levels: morning < midday < afternoon < evening < midnight

#count the number of occurrences of each level


summary(fac_day)
Output:
morning midday afternoon evening midnight
1 1 1 2 1
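
Because fac_day is an ordered factor, its levels can also be compared directly; a short sketch:

fac_day[1] > fac_day[2] #evening > morning, so TRUE
min(fac_day) #the smallest level present: morning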


HYPOTHESIS TESTING
Introduction:
Hypothesis testing is a fundamental statistical method used to make inferences about
population parameters based on sample data. It involves evaluating the validity of a claim or
hypothesis about a population parameter using sample evidence. The process of hypothesis testing
follows a structured framework and aims to assess whether the evidence from the data supports or
contradicts the proposed hypothesis.

The key steps involved in hypothesis testing:


 Formulate Hypotheses: Define a null hypothesis (H0) that represents the default or status
quo assumption and an alternative hypothesis (Ha or H1) that contradicts the null
hypothesis and represents the claim or effect you want to test.
 Collect Data: Gather relevant data through experiments, surveys or observations. The data
collected should be representative of the population under investigation and should align
with the research question and hypothesis.
 Choose a Statistical Test: Select an appropriate statistical test based on the type of data
and the nature of the hypothesis being tested. Common tests include t-tests, z-tests, chi-
square tests, ANOVA, regression analysis, and others, depending on the specific research
context.
 Compute the Test Statistic and P-value: Calculate the test statistic from the sample data
and determine the probability of observing the test statistic, or a more extreme value, under
the assumption that the null hypothesis is true. This probability is known as the p-value.
 Compare the P-value with the Significance Level: Compare the calculated p-value with the
predetermined significance level (α) to decide whether to reject the null hypothesis. If the
p-value is less than or equal to the significance level, the null hypothesis is rejected in favor
of the alternative hypothesis.
 Draw Conclusions: Based on the statistical results, draw conclusions about the validity of
the null hypothesis and make inferences about the population parameter of interest.
Consider the practical significance of the results in addition to the statistical significance.

Hypothesis testing is the process of testing the hypothesis made by the researcher in order to
validate it. To perform hypothesis testing, a random sample of data from the population is taken
and testing is performed. Based on the results of the testing, the hypothesis is either accepted or
rejected. This concept is known as Statistical Inference. Common t-tests include:
 One sample T-Testing
 Two sample T-Testing
 Paired T-Test

R provides a powerful environment for conducting various statistical tests and analyses,
allowing researchers and data analysts to test specific hypotheses and draw meaningful
conclusions.


Objectives of conducting Hypothesis Testing:


 Testing Research Hypotheses: Hypothesis testing allows researchers to formally evaluate
research questions or hypotheses based on empirical data.
 Making Data-Driven Decisions: By using hypothesis testing in R, data analysts can make
informed decisions about whether to accept or reject a specific claim or hypothesis based
on the available data.
 Quantifying Uncertainty: Hypothesis testing provides a framework for quantifying the
uncertainty associated with sample estimates and determining the likelihood of observed
effects or differences occurring by chance.
 Comparing Groups or Conditions: Hypothesis testing facilitates the comparison of
group means, proportions, variances or other parameters, enabling researchers to identify
significant differences or similarities between different groups or conditions.
 Evaluating Relationships: Hypothesis testing allows analysts to assess the strength and
significance of relationships between variables, helping to determine whether observed
associations are statistically meaningful.

Four-Step Process of Hypothesis Testing:


The four-step process of hypothesis testing in R involves a systematic approach to evaluate
a claim about a population parameter based on sample data. Here’s an overview of the four steps
and how they can be implemented in R.

1. State the Hypothesis:


 Null Hypothesis(H0): A statement of no effect or no difference.
 Alternative Hypothesis(H1 or Ha): A statement contradicting the null hypothesis.
Ex of stating hypotheses:
#H0: Population mean is equal to a specific value.
#H1: Population mean is not equal to a specific value.
H0: µ=100
H1: µ≠100
2. Set the Significance Level:
Choose the significance level (α), typically 0.05 or 0.01, which represents the threshold for
deciding whether to reject the null hypothesis.
# Ex of setting the significance level
alpha<-0.05
3. Calculate the Test Statistic and P-value/Analyze Sample Data:
Conduct the appropriate statistical test based on the data and hypothesis to calculate the
test statistic and the corresponding p-value.
#Ex of calculating the test statistic and p-value
#Assuming the data is in the vector 'data'
res<-t.test(data)
# Extract the test statistic and p-value


test_statistic<-res$statistic
p_value<-res$p.value
4. Make a Decision/Interpret Decision:
Compare the p-value to the significance level (α) to decide whether to reject the null
hypothesis or fail to reject it.
#Ex of making a decision
if(p_value<alpha){
print("Reject the null hypothesis")
}else{
print("Fail to reject the null hypothesis")
}
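
Putting the four steps together, a minimal end-to-end sketch (the sample vector is assumed data, invented for illustration):

#Hypothetical sample data
data<-c(98, 102, 101, 97, 103, 100, 99, 104)
#significance level
alpha<-0.05
#one-sample t-test of H0: mu=100
res<-t.test(data, mu=100)
p_value<-res$p.value
if(p_value<alpha){
print("Reject the null hypothesis")
}else{
print("Fail to reject the null hypothesis")
}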

One Sample T-Testing:


The one-sample t-test is used to compare the mean of a single sample to a known or
hypothesized population mean. In R, a one-sample t-test can be conducted using the t.test()
function.
Syntax: t.test(x, mu)

Parameters:
 x: represents a numeric vector of data.
 mu: represents the true value of the mean.

Ex:
#defining sample vector
x<-rnorm(100)
#One sample T-Test
t.test(x, mu=5)

Output:
One Sample t-test
data: x
t = -56.42, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
-0.2482380 0.1083696
sample estimates:
mean of x
-0.06993419
 Data: the dataset ‘x’ was used for the test
 The computed t-value is -56.42.
 Degrees of Freedom (df): the t-test has 99 degrees of freedom.


 The p-value is < 2.2e-16, which indicates that there is substantial evidence refuting the null
hypothesis.
 Alternative hypothesis: The true mean is not equal to five, according to the alternative
hypothesis.
 95% confidence interval: (-0.2482380, 0.1083696). This
range denotes the interval that, with 95% confidence, contains the true population
mean.
Ex:
data<-c(5.1, 4.8, 5.2, 4.9, 5.0, 5.1, 4.9, 5.2, 5.0)
#assuming the null hypothesis that the true mean is 5
res<-t.test(data, mu=5)
print(res)
Output:
One Sample t-test
data: data
t = 0.78899, df = 7, p-value = 0.456
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.82526 5.34974
sample estimates:
mean of x
5.0875

Two Sample T-Testing:


A two-sample t-test is used to compare the means of two independent groups to determine
whether they are significantly different from each other. In R, you can perform a two-sample t-test
using the t.test() function.

Syntax: t.test(x, y)
Parameters: x and y are numeric vectors

Ex:
#sample data for two groups
g1<-c(25,27,26,28,30,29,31,32,30,28)
g2<-c(22,24,26,27,25,28,29,31,20,28)

res<-t.test(g1,g2)
print(res)
Output:
Welch Two Sample t-test


data: g1 and g2
t = 2.0526, df = 15.676, p-value = 0.05719
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.08973199 5.28973199
sample estimates:
mean of x mean of y
28.6 26.0

wilcox.test():
wilcox.test() function is used to conduct a Wilcoxon rank-sum test (Mann-Whitney U test)
to compare the medians of two independent samples. This non-parametric test is used when the
assumptions of the t-test are not met, such as when the data are not normally distributed or when
the sample sizes are small.

Syntax:
wilcox.test(x, y, alternative=c("two.sided", "less", "greater"), mu=0, paired=FALSE,
conf.int=FALSE, conf.level=0.95)

 x and y: numeric vectors containing the two samples that you want to compare.
 alternative: a character string specifying the alternative hypothesis, which can be
"two.sided" (the default), "less" or "greater".
 mu: a numeric value giving the hypothesized location shift under the null hypothesis.
 paired: a logical value indicating whether the samples are paired.
 conf.int: a logical value indicating whether to compute the confidence interval of the true
location shift.
 conf.level: the confidence level of the interval, which is set to 0.95 by default.

Ex:
g1<-c(17,12,20,21,19)
g2<-c(15,14,18,16,13)
res<-wilcox.test(g1,g2,alternative="two.sided")
print(res)

Output:
Wilcoxon rank sum exact test
data: g1 and g2
W = 19, p-value = 0.2222
alternative hypothesis: true location shift is not equal to 0

Paired T-Test:


A paired t-test is performed using the t.test() function. The paired t-test is used when you
have two related groups or samples. The data are paired: the measurements in the two
groups are not independent, and each observation in one group has a unique corresponding
measurement in the other group.

Ex:
b<-c(23, 21, 29, 18, 25)
a<-c(19, 20, 21, 17, 22)

res<-t.test(b,a, paired=TRUE)
print(res)

Output:
Paired t-test
data: b and a
t = 2.6389, df = 4, p-value = 0.05765
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.1771993 6.9771993
sample estimates:
mean difference
3.4
 b and a are the two sets of paired data. Each element in b (before) corresponds to the
element at the same index in a (after).
 t.test() is used to perform the paired t-test. Setting paired=TRUE indicates that the two
samples are paired.
 The result is stored in the res variable.
 Finally, the result is printed.

Chi-Square Test:
The Chi-Square Test is used to analyze a frequency table (i.e., a contingency table) formed
by two categorical variables. The chi-square test evaluates whether there is a significant
relationship between the categories of the two variables.

The Chi-Square Test is a statistical method used to determine whether two
categorical variables have a significant correlation between them. These variables should be from
the same population and should be categorical, like Yes/No, Red/Green, Male/Female, etc. The
chi-square test is also useful when comparing the tallies or counts of categorical responses between
two (or more) independent groups.
Syntax: chisq.test(x, y=NULL, correct=TRUE)



 x: a contingency table (a matrix or a data frame) containing the observed counts or
frequencies of the categories.
 y: if you have a contingency table, you can provide it as x, and y can be left as NULL. If
you have a matrix or data frame with rows and columns representing two categorical
variables, you can provide both x and y.
 correct: a logical value indicating whether to apply continuity correction. It is usually set to
TRUE.
An example of how to use the chi-square test in R:

Ex:
Consider the survey data in the MASS library, which represents the data from a survey
conducted on students.
#load the MASS package
library(MASS)
str(survey)

Output:
'data.frame': 237 obs. of 12 variables:
$ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
$ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
$ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
$ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
$ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
$ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
$ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
$ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
$ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
$ Height: num 173 178 NA 160 165 ...
$ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
$ Age : num 18.2 17.6 16.9 20.3 23.7 ...

The above result shows the dataset has many Factor variables, which can be considered
categorical variables.
In this model, we consider the variables "Exer" and "Smoke".

The Smoke column records the students' smoking habit. The Exer column records their exercise
level.
We test whether the students' smoking habit is independent of their exercise level at
the 0.05 significance level.


Ex:
stu_data=data.frame(survey$Smoke, survey$Exer)
stu_data=table(survey$Smoke, survey$Exer)
print(stu_data)
Output:
Freq None Some
Heavy 7 1 3
Never 87 18 84
Occas 12 3 4
Regul 9 1 7

Finally, we apply the chisq.test() function to the contingency table stu_data.


#applying chisq.test() function
print(chisq.test(stu_data))
Output:
Pearson's Chi-squared test
data: stu_data
X-squared = 5.4885, df = 6, p-value = 0.4828

As the p-value of 0.4828 is greater than 0.05, we conclude that the smoking habit is
independent of the exercise level of the students, and hence there is weak or no correlation between
the two variables.

Ex:
ob_data<-matrix(c(45, 20, 15, 60), nrow=2, byrow=TRUE)
colnames(ob_data)<-c("Category A", "Category B")
rownames(ob_data)<-c("Group 1", "Group 2")
res<-chisq.test(ob_data)
print(res)
Output:
Pearson's Chi-squared test with Yates' continuity correction
data: ob_data
X-squared = 32.481, df = 1, p-value = 1.204e-08

In this example,
 the matrix has two rows representing two groups and two columns representing two categories.
 We specify row and column names for the table.
 We use the chisq.test() function to perform the chi-square test on the contingency table.
 The results are stored in the res variable.
 We print the results, which include the chi-squared statistic, degrees of freedom, the p-value,
and other information.


Advantages and Disadvantages of Hypothesis Testing


Advantages:
Objective decision making: Hypothesis testing provides a structured framework for making
objective decisions based on data analysis, reducing the influence of personal bias.
Scientific Validity: It allows researchers to draw conclusions about population parameters based
on sample data, ensuring scientific validity in the research process.
Standardized approach: Hypothesis testing follows a standardized approach, making it easier for
other researchers to replicate the analysis and verify the results.
Quantitative Results: It provides quantitative results in the form of test statistics and p-values,
enabling researchers to make statistically sound inferences about the research hypotheses.
Controlled Experiments: Hypothesis testing is well-suited for controlled experiments, where
researchers can establish causality and test specific hypotheses.

Disadvantages:
Assumptions: Hypothesis testing relies on various assumptions about the data, such as normality,
independence and homogeneity of variance. Violations of these assumptions can affect the validity
of the results.
Simplification of Reality: Hypothesis testing often simplifies complex real-world scenarios,
leading to a reduction in the richness of the data and potentially overlooking important nuances.
Risk of Errors: There is a risk of committing Type I errors (rejecting a true null hypothesis) and
Type II errors (failing to reject a false null hypothesis), especially when the sample size is small
or the effect size is small.
Limited Scope: Hypothesis testing is constrained by the specific hypothesis formulated by the
researchers. It may not capture the full complexity of relationships between variables in the data.

Proportions Testing:
Proportions testing is commonly used to analyze categorical data, especially when working
with binary outcomes or proportions. It helps assess whether the proportions of successes or events
differ significantly across groups, or whether a sample proportion significantly deviates from an
expected proportion.
In R you can perform various types of proportions testing using specific functions and
methods. The most common entry point is the prop.test() function.
Syntax:
The prop.test() function is used to perform proportions testing.
prop.test(x, n, p=NULL, alternative="two.sided", conf.level=0.95)
 x: number of successes (a vector of counts of successes).
 n: number of trials or total sample size (a vector of the same length as x).
 p: a single number giving the hypothesized probability of success. If not given, the test uses
0.5 for a single proportion (or tests equality of proportions across groups).
 alternative: a character string specifying the alternative hypothesis, which can be
"two.sided", "less" or "greater".


 conf.level: Confidence level of the interval.

Ex:
Program to perform a proportions test in R:
su<-20
tot_tr<-50

res<-prop.test(su,tot_tr, alternative="two.sided",conf.level=0.95)
print(res)

Output:
1-sample proportions test with continuity correction
data: su out of tot_tr, null probability 0.5
X-squared = 1.62, df = 1, p-value = 0.2031
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.2673293 0.5479516
sample estimates:
p
0.4

Here su represents the number of successes and tot_tr represents the total
number of trials. The alternative parameter is set to "two.sided" to test a two-tailed alternative
hypothesis, and the conf.level parameter is set to 0.95 to calculate a 95% confidence interval.

One Sample T-Testing:


 The one-sample t-test approach collects a large amount of data and tests it on random
samples.
 To perform a t-test in R, normally distributed data is required. This test is used to compare
the mean of the sample with the population mean.

For example: testing whether the height of persons living in one area is different from or identical
to that of persons living in other areas.

Syntax:
t.test(x, mu)
parameters:
x: represents a numeric vector of data.
mu: represents the true value of the mean.
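
For instance, a short sketch with assumed height data (values invented for illustration):

#Hypothetical heights (in cm) sampled from one area; H0: the mean height is 170
heights<-c(168, 172, 171, 169, 174, 170, 173, 167)
t.test(heights, mu=170)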


One-Proportion Z-Test:

A one-proportion z-test is used to assess whether a sample proportion significantly differs from a
hypothesized population proportion. The test evaluates whether there is enough evidence to reject
the null hypothesis that the sample proportion is equal to the specified value.
The one-proportion z-test is used to compare an observed proportion to a theoretical one
when there are only two categories.
For example: we have a population of mice containing half males and half females (p = 0.5 = 50%).
Some of these mice (n=160) have developed spontaneous cancer, including 95 males and
65 females.
We want to know whether cancer affects more males than females. So in this problem:
 The number of successes (males with cancer) is 95.
 The observed proportion (po) of males is 95/160.
 The observed proportion (q) of females is 1-po.
 The expected proportion (pe) of males is 0.5 (50%).
 The number of observations (n) is 160.

The formula for the one-proportion z-test

The test statistic (also known as the z-statistic) can be calculated as follows:

z = (po - pe) / √(pe q / n)

where, po: the observed proportion
q: 1 - pe
pe: the expected proportion
n: the sample size

Implementation in R:
In the R language, the functions used for performing a proportion z-test are binom.test() and prop.test().

Syntax:
binom.test(x, n, p=0.5, alternative="two.sided")
prop.test(x, n, p=NULL, alternative="two.sided", correct=TRUE)

Parameters:
 x: number of successes (or a vector of the numbers of successes and failures) in the data set.
 n: size of the data set.
 p: probability of success. It must be in the range of 0 to 1.
 alternative: a character string specifying the alternative hypothesis.
 correct: a logical indicating whether Yates' continuity correction should be applied where
possible.
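
Applying these functions to the mice example above (95 males with cancer out of 160 mice, expected proportion 0.5), a minimal sketch:

#one-proportion test: 95 successes out of 160 trials, null probability 0.5
prop.test(x=95, n=160, p=0.5, correct=FALSE)
#binom.test(95, 160, p=0.5) gives the exact binomial version of the same test
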
Ex:


Let's assume the claim that 30% of people recommend street food to their friends. To test
this claim, a random sample of 150 people was obtained. Of these 150 people, 80 indicated that
they recommended street food to their friends. Is this claim accurate? Use alpha=0.05.
Solution: Now x=80, p=0.30, n=150. We want to know whether the true proportion of people
who recommend street food differs from the claimed 30%. Let's use the function prop.test() in R.

#Using prop.test()
prop.test(x=80, n=150, p=0.3, correct=FALSE)

Output:
1-sample proportions test without continuity correction
data: 80 out of 150, null probability 0.3
X-squared = 38.889, df = 1, p-value = 4.486e-10
alternative hypothesis: true p is not equal to 0.3
95 percent confidence interval:
0.4536625 0.6113395
sample estimates:
p
0.5333333

Two-Proportion Z-Test:
The two-proportion z-test compares the proportions of two independent groups or samples.
The z-test is used to determine if there is a significant difference between the proportions of
successes in the two groups. The prop.test() function is used to conduct a two-proportion z-test.

Ex: we have two groups of individuals:


Group A, with lung cancer: n=500
Group B, healthy individuals: n=500
The number of smokers in each group is as follows:
Group A, with lung cancer: n=500, 490 smokers, pA = 490/500 = 0.98 (98%)
Group B, healthy individuals: n=500, 400 smokers, pB = 400/500 = 0.80 (80%)
In this setting:
The overall proportion of smokers is p = (490+400)/1000 = 0.89 (89%)
The overall proportion of non-smokers is q = 1 - p = 0.11 (11%)
Syntax:
prop.test(x, n, p=NULL, alternative="two.sided", correct=TRUE)
 x: a vector of counts of successes
 n: a vector of counts of trials
 alternative: a character string specifying the alternative hypothesis
 correct: a logical indicating whether Yates' continuity correction should be applied where
possible


Ex:
res<-prop.test(x=c(490,400), n=c(500,500))
res
Output:
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80

Error in Hypothesis Testing:


In hypothesis testing, an error is a mistaken conclusion about the acceptance or rejection of a
particular hypothesis. There are mainly two types of errors in hypothesis testing:
1. Type I Error (also known as alpha error): A Type I error occurs when we reject the null
hypothesis but the null hypothesis is correct. This case is also known as a false positive.
2. Type II Error (also known as beta error): A Type II error occurs when we fail to reject
the null hypothesis when the null hypothesis is incorrect (the alternative hypothesis is correct).
This case is also known as a false negative.

Type I and Type II Error


                Null hypothesis is True             Null hypothesis is False
Rejected        Type I error (false positive),      Correct decision (true positive),
                probability = α                     probability = 1 - β
Not Rejected    Correct decision (true negative),   Type II error (false negative),
                probability = 1 - α                 probability = β

Type I Error:
A type I error, also known as an alpha error, occurs when the null hypothesis is rejected
even though it is actually true. This means that the test results lead to the conclusion that there is
a significant effect when in fact there is no such effect. In the context of R, the probability of
committing a Type I error is represented by the significance level, denoted alpha (α).
General syntax for conducting a hypothesis test that may result in a Type I error in R:
#sample data
d<-c(3, 5, 4, 6, 7, 8, 2, 3, 5, 4)
#one-sample t-test
res<-t.test(d, mu=4.5)
#extract the p-value
p_val<-res$p.value


#set the significance level
alpha<-0.05
#check for Type I error
if(p_val<alpha){
print("Reject the Null hypothesis, Type I error might have occured.")
}else{
print("Fail o reject the null hypothesis.")
}
Output: [1] "Fail to reject the null hypothesis."

In this example, we run a one-sample t-test using the t.test() function in R. We then extract the p-
value from the test results and define the significance level (alpha). If the p-value is less than the
significance level, the null hypothesis is rejected; if the null hypothesis were actually true, that
rejection would be a Type I error.

Type II Error:
A Type II error, often denoted as β (beta), is a statistical term used in hypothesis testing
that occurs when the null hypothesis is not rejected when it is actually false. In simpler terms, it
is the error of failing to detect a real effect or difference when it exists. This error is particularly
relevant when evaluating the power of a statistical test, which is the probability of correctly
rejecting the null hypothesis when it is false.
Ex:
#Generate two sets of data with different means
set.seed(123)
d1<-rnorm(100, mean=5, sd=2)
d2<-rnorm(100, mean=7, sd=2)

#Perform a two sample t-test with unequal variances


res<-t.test(d1, d2, var.equal=FALSE)

#Extract the p-value and true means


p_val<-res$p.value
true_mean1<-mean(d1)
true_mean2<-mean(d2)

#Set the significance level


alpha<-0.05

#check for Type II error


if(p_val > alpha){
print("Fail to reject the null hypothesis, a Type II error might have occurred.")
}else{


print("Reject the null hypothesis.")


}
Output: [1] "Reject the null hypothesis."

Power of Hypothesis Testing:


In hypothesis testing, the power of a statistical test is the probability that the test will correctly
reject the null hypothesis when the alternative hypothesis is true. In simpler terms, it is the
probability of detecting an effect or a difference if it truly exists. High statistical power is desirable
as it indicates that the test can effectively identify true effects, thus reducing the likelihood of Type
II errors (false negatives).
The power of a statistical test is influenced by several factors:
Effect Size (ES): The magnitude of the difference or effect you are trying to detect. A larger effect
size increases statistical power.
Sample size (n): The number of observations or participants in your study. Increasing the sample
size generally enhances statistical power.
Significance Level (α): The chosen level of significance, often set to 0.05. A higher alpha increases
power but also increases the risk of Type I errors.
Variability (Standard Deviation or Variance): The amount of variability or spread in the data.
A smaller variance increases power.
To calculate statistical power in hypothesis testing, you can use statistical software like R
or specialized power analysis packages. Here's the general formula for the power calculation:
Power = 1 - P(Type II Error) = 1 - β
Where
 Power is the statistical power.
 P(Type II Error) is the probability of committing a Type II error.
 β (beta) denotes this same probability, so Power = 1 - β.

For specific types of statistical tests (e.g., t-tests, ANOVA, chi-squared tests), different functions
and packages can be used to perform power analysis and calculate power. For instance, in R you
can use the pwr package (with functions such as pwr.t.test, pwr.anova.test or pwr.chisq.test) to
conduct power analysis for various test scenarios.

Ex:
To calculate the power of a t-test in R using the pwr.t.test function

#install.packages("pwr")
#Load the required library
library(pwr)

#Set parameters
effect_size<-0.8 #This is the standardized effect size (Cohen's d)


n<-50 # Sample size per group


alpha <- 0.05 #Significance level

#Conduct a power analysis


pow_anl<-pwr.t.test(n=n, d=effect_size, sig.level=alpha, type="two.sample")
#Extract the powervalue
pow_val<-pow_anl$power
#Print the result
print(paste("Statistical power: ",pow_val))

In this example, we are performing a power analysis for a two-sample t-test, assuming equal
variances between the groups. We set the effect size, sample size per group, and significance level.
The pwr.t.test function calculates the power of the t-test, and we extract the power value from the
analysis. The script then prints the statistical power to the console.
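
The same function can be run in reverse to find the sample size needed for a target power; a small sketch (assumes the pwr package is loaded as above):

#solve for n: omit n and specify the desired power instead
pwr.t.test(d=0.8, power=0.8, sig.level=0.05, type="two.sample")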

Analysis of Variance (ANOVA):


Analysis of Variance (ANOVA) is a statistical technique commonly used to study
differences between two or more group means. The ANOVA test is centered on the different
sources of variation in a typical variable. ANOVA in R primarily provides evidence of the
existence of mean equality between the groups. This statistical method is an extension of the
t-test; it is used in situations where the factor variable has more than two groups. ANOVA is a
statistical test for estimating how a quantitative dependent variable changes according to the levels
of one or more categorical independent variables. ANOVA tests whether there is a difference in
the means of the groups at each level of the independent variable.
The null hypothesis (H0) of the ANOVA is no difference in means, and the alternative
hypothesis (Ha) is that the means are different from one another. ANOVA using the aov() function,
which stands for “analysis of variance”. ANOVA is commonly used in various fields, including
psychology, biology and social sciences, to compare group means and understand the effects of
different factors on the outcome variable.

Ex:
 A car company wishes to compare the average petrol consumption of THREE similar
models of car and has available six vehicles of each model.
 A teacher is interested in a comparison of average percentage marks attained in the
examinations of FIVE different subjects and has available the marks of eight students who
all completed each examination.

The basic steps for performing ANOVA in R:


Data preparation: Organize your data in a way that represents the groups or factors you want to
compare. This typically involves setting up a data frame where each column represents a different
group or factor.


ANOVA function in R: Use the aov() function in R to perform ANOVA. This function takes a
formula as an argument, typically in the form response_variable ~ group_variable.
Interpreting ANOVA Results: After running the ANOVA, you can use the summary() function
to view the ANOVA table, which provides key statistics such as the F-value, degrees of freedom
and p-value.

Some key properties of ANOVA:


 Decomposition of Variance: ANOVA decomposes the total variance in the data into
different components, such as the variance between groups and the variance within groups.
This decomposition helps to understand the relative contribution of different sources of
variability.
 F-Test: ANOVA uses the F-test to assess whether the variance among group means is
larger than the variance within the groups. The F-test compares the ratio of the mean square
variation between groups to the mean square variation within groups.
 Assumptions: ANOVA relies on several assumptions, including the normality of the data
within each group, homogeneity of variances and independence of observations. Violations
of these assumptions can affect the validity of ANOVA results.
 One-way and Two-way ANOVA: ANOVA can be categorized into one-way and two-
way ANOVA based on the number of factors being studied. One-way ANOVA examines
the effects of a single factor, whereas two-way ANOVA investigates the effects of two factors
simultaneously.
 Post hoc Tests: In cases where ANOVA indicates significant differences between groups,
post hoc tests can identify which specific groups differ significantly from each other.
Common post hoc tests include Tukey's HSD (Honestly Significant Difference),
Bonferroni and Scheffe tests.
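
For example, Tukey's HSD is available in base R via the TukeyHSD() function applied to a fitted aov model; a minimal sketch using the built-in mtcars data:

#post hoc pairwise comparisons after a one-way ANOVA
model<-aov(disp ~ factor(gear), data=mtcars)
TukeyHSD(model) #adjusted p-values and confidence intervals for each pair of gear levels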

One-way ANOVA:
The main distinction is between a one-way ANOVA (one independent variable) and a two-way
ANOVA (two independent variables).
One-way ANOVA is a statistical method used to compare the means of three or more groups to
determine whether there are any statistically significant differences among the means. It is called
"one-way" because there is only one independent variable affecting the dependent variable. This
independent variable typically represents the different groups or levels being compared.

Syntax:
model<-aov(response_variable ~ group_variable, data=your_data_frame)

Here:
 model: This is the variable that will store the ANOVA model
 aov(): The function used to fit the ANOVA model


 response_variable: This is the continuous response variable you are analyzing across
different groups.
 group_variable: This represents the categorical variable that defines the different groups.
 your_data_frame: This refers to the data frame that contains your data.

#Example data for one-way ANOVA:


g1<-c(18, 20, 22, 24, 19)
g2<-c(15, 17, 16, 19, 18)
g3<-c(21, 23, 25, 24, 22)

#Combine the data into a data frame


d<-data.frame(value=c(g1, g2, g3), group=factor(rep(1:3, each=5)))

#Perform one-way ANOVA


model<-aov(value ~ group, data=d)
print(summary(model))

Output:
Df Sum Sq Mean Sq F value Pr(>F)
group 2 91.2 45.6 12.67 0.0011 **
Residuals 12 43.2 3.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this example:
 There are three groups of data: g1, g2 and g3
 The data are combined into a data frame d, where the variable value contains the
measurements and the variable group represents the group labels.
 The aov() function is used to fit the one-way ANOVA model, with value as the response
variable and group as the independent variable.

Two-way ANOVA:
The two-way ANOVA is used to analyze the effect of two categorical independent
variables on a continuous dependent variable. Here’s an example of how to perform a two-way
ANOVA in R.

set.seed(123)

group1<-gl(2, 4, labels=c("A", "B"))


group2<-gl(2, 2, 8, labels=c("X", "Y"))
response<-rnorm(8, mean=10, sd=2)


d<-data.frame(group1, group2, response)

model<-aov(response ~ group1 + group2 + group1:group2, data=d)


print(summary(model))
Output:
Df Sum Sq Mean Sq F value Pr(>F)
group1 1 0.020 0.020 0.005 0.946
group2 1 0.026 0.026 0.007 0.939
group1:group2 1 12.844 12.844 3.286 0.144
Residuals 4 15.635 3.909
 A sample dataset with two categorical independent variables (group1 and group2) and one
continuous dependent variable (response).
 The gl() function is used to generate factor levels for the groups.
 The data is combined into a data frame d.
 The aov() function is used to perform the two-way ANOVA, including the main effects of
group1 and group2 as well as their interaction (group1:group2)
 The summary() function is then used to print the ANOVA results, including the F-value,
p-value and other relevant statistics.

Calculate the test statistic using the aov() function


mtcars_aov<-aov(mtcars$disp ~ factor(mtcars$gear))
sum<-summary(mtcars_aov)
print(sum)
Output:
Df Sum Sq Mean Sq F value Pr(>F)
factor(mtcars$gear) 2 280221 140110 20.73 2.56e-06 ***
Residuals 29 195964 6757
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 Df: The model's degrees of freedom.
 Sum Sq: The sums of squares, which represent the variability that the model is able to
account for.
 Mean Sq: The variance explained by each component, represented by the mean squares.
 F value: The measure used to compare the mean squares between and within
groups.
 Pr(>F): The F-statistic's p-value, which denotes the factor's statistical significance.
 Residuals: The deviations from the group means, known as residuals, together with their
summary statistics.
 Significance Codes: Asterisks (*) are used to show the degree of significance; *, ** and ***
stand for p < 0.05, p < 0.01, and p < 0.001 respectively.
