UNIT-4
Statistical Testing
Statistical tests are mathematical tools for analysing quantitative data generated
in a research study and making inferences. Here are the general steps involved in
statistical testing:
Formulate Hypotheses:
Null Hypothesis (H0): This is a statement of no effect or no difference in the
population or data samples.
Alternative Hypothesis (H1 or Ha): This is a statement that there is an effect or a
difference in the population.
Select the Appropriate Test:
Choose a statistical test based on the nature of your data and the type of
comparison you are making (e.g., t-test, chi-square test, ANOVA, etc.).
Collect and Prepare Data:
Ensure that your sample is representative and meets the assumptions of the
chosen test. Clean and organize the data for analysis.
Calculate Test Statistic:
Compute the test statistic based on the formula associated with the chosen
statistical test.
Determine the Critical Region:
Identify the critical region or critical values for the test statistic based on the
chosen significance level.
Make a Decision:
Compare the calculated test statistic with the critical value(s). If the test statistic
falls in the critical region, reject the null hypothesis. If it falls outside the critical
region, fail to reject the null hypothesis.
Draw Conclusions:
Based on your decision, draw conclusions about the research question.
Report Results:
Clearly communicate the results of the statistical test, including the test statistic,
p-value (if applicable), and any relevant confidence intervals.
Consider Limitations:
Discuss any limitations or assumptions made during the analysis.
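The steps above can be sketched in R as follows (a minimal example; the sample values, mu = 50 and alpha = 0.05 are hypothetical choices for illustration):

```r
# Hypothetical sample: 8 measurements, testing H0: true mean equals 50
x <- c(52.1, 49.8, 53.4, 51.2, 50.9, 52.8, 51.7, 50.4)
alpha <- 0.05                   # chosen significance level
result <- t.test(x, mu = 50)    # calculate the test statistic and p-value
result$statistic                # the calculated t statistic
result$p.value                  # the p-value
if (result$p.value < alpha) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis")
}
```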
Statistical Modelling
Statistical modelling is a powerful technique used in data analysis to uncover
patterns, relationships, and trends within datasets. By applying statistical methods
and models, researchers and analysts can gain insights, make predictions, and
support decision-making processes. Key steps include specifying a model, fitting
it to the data, validating its assumptions, and interpreting the results.
INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLEGE
Sampling Distributions in R
A sampling distribution is the distribution of a statistic obtained through
repeated sampling from a larger population.
It describes the range of possible outcomes of a statistic, such as the
mean or mode of some variable, as it truly exists in the population.
The majority of data analyzed by researchers are actually drawn from
samples (a part of the pool of data), not populations (the entire pool of data).
Steps to Calculate Sampling Distributions in R:
Step 1: First we define the number of samples (n = 1000).
n <- 1000
Step 2: Next we create a vector (sample_means) of length ‘n’ filled with NA
values. The rep() function is used to replicate values in a vector.
Syntax: rep(value_to_be_replicated, number_of_times)
sample_means <- rep(NA, n)
Step 3: Next we fill the sample_means vector with sample means from the
considered population, computed with the mean() function. Each sample of 20
observations (n) is generated with rnorm(), which draws from a normal
distribution with mean 10 and standard deviation 10 (sd).
Step 4: To check the created samples we use head(), which returns the first six
elements of an object (vector, list, data frame, etc.).
sd(sample_means)
# To find probability
sum(sample_means >= 10)/length(sample_means)
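Putting the four steps together, the complete script reads as follows (a sketch; the population parameters mean = 10 and sd = 10 with samples of size 20 follow the description above):

```r
n <- 1000                       # Step 1: number of samples
sample_means <- rep(NA, n)      # Step 2: empty vector for the sample means
for (i in 1:n) {                # Step 3: fill with means of 20 normal draws
  sample_means[i] <- mean(rnorm(20, mean = 10, sd = 10))
}
head(sample_means)              # Step 4: inspect the first six sample means
sd(sample_means)                # spread of the sampling distribution
sum(sample_means >= 10) / length(sample_means)  # P(sample mean >= 10)
```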
Hypothesis Testing
As we might know, when we infer something from data, we make an inference
based on a collection of samples rather than the true population. The main
question that comes from it is: can we trust the result from our data to make a
general assumption of the population? This is the main goal of hypothesis testing.
There are several steps that we should follow to properly conduct a hypothesis
test. The four key steps involved are
State the Hypotheses, form our null hypothesis and alternative hypothesis.
Null Hypothesis (H0): This is a statement of no effect or no difference in
the population or data samples.
Alternative Hypothesis (H1 or Ha): This is a statement that there is an
effect or a difference in the population.
Formulate an analysis plan and set the criteria for decision(Set our
significance level). The significance level varies depending on our use
case, but the default value is 0.05.
Calculate the Test statistic and P-value. Perform a statistical test that
suits our data. The probability of obtaining a result at least as extreme as
the observed one, assuming the null hypothesis is true, is known as the p-value.
Check the resulting p-Value and Make a Decision. If the p-Value is
smaller than our significance level, then we reject the null hypothesis in
favour of our alternative hypothesis. If the p-Value is higher than our
significance level, then we fail to reject the null hypothesis.
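The output below can be produced by a one-sample t-test of the following form (a sketch; x is assumed to be 100 standard-normal draws tested against mu = 5, so the exact numbers vary from run to run):

```r
x <- rnorm(100)            # 100 draws from a standard normal distribution
res <- t.test(x, mu = 5)   # one-sample t-test of H0: true mean equals 5
print(res)
```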
Output:
One Sample t-test
data: x
t = -49.504, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
-0.1910645 0.2090349
sample estimates:
mean of x
0.008985172
Data: The dataset ‘x’ was used for the test.
The determined t-value is -49.504.
Degrees of Freedom (df): The t-test has 99 degrees of freedom.
The p-value is less than 2.2e-16, which indicates that there is strong
evidence against the null hypothesis.
Alternative hypothesis: The true mean is not equal to five, according to the
alternative hypothesis.
95 percent confidence interval: the confidence interval is
(-0.1910645, 0.2090349). With 95% confidence, this range contains the
true population mean.
Two Sample T-Testing
In two-sample T-testing, two sample vectors are compared. If var.equal = TRUE,
the test assumes that the variances of both samples are equal.
Syntax: t.test(x, y)
Parameters:
x and y: Numeric vectors
Example:
# Defining sample vectors
x <- rnorm(100)
y <- rnorm(100)
# Performing the two-sample t-test
t.test(x, y)
One-Sample Wilcoxon Signed-Rank Test
This is a non-parametric alternative to the one-sample t-test, used when the
data cannot be assumed to be normally distributed.
Syntax: wilcox.test(x, mu = 0, alternative = "two.sided")
Parameters:
x: a numeric vector containing your data values
mu: the theoretical mean/median value. Default is 0 but you can change it.
alternative: the alternative hypothesis. Allowed value is one of “two.sided”
(default), “greater” or “less”.
Example
# R program to illustrate
# one-sample Wilcoxon signed-rank test
# (myData is assumed to contain a numeric column 'weight')
wilcox.test(myData$weight, mu = 25, alternative = "less")
Output:
data: myData$weight
V = 55, p-value = 1
alternative hypothesis: true location is less than 25
Paired Samples Wilcoxon Test
This is a non-parametric alternative to the paired t-test.
Syntax: wilcox.test(x, y, paired = TRUE, alternative = "two.sided")
Parameters:
x, y: numeric vectors
paired: a logical value specifying that we want to compute a paired
Wilcoxon test
alternative: the alternative hypothesis. Allowed value is one of “two.sided”
(default), “greater” or “less”.
Example
# R program to illustrate
# Paired Samples Wilcoxon Test
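A minimal sketch, assuming hypothetical before-and-after weight measurements for the same ten subjects:

```r
# Hypothetical weights of the same subjects before and after treatment
before <- c(190, 172, 178, 175, 174, 180, 169, 185, 178, 182)
after  <- c(171, 153, 160, 157, 148, 162, 150, 168, 161, 163)
res_w <- wilcox.test(before, after, paired = TRUE, alternative = "two.sided")
print(res_w)
```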
Paired t-test
A paired t-test is used to check whether there is a significant difference between
two population means when the data are in the form of matched pairs.
Syntax: t.test(x, y, paired = TRUE, alternative = "two.sided")
where
x,y: numeric vectors
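As a sketch, assuming hypothetical matched-pair scores measured before and after an intervention:

```r
# Hypothetical paired measurements for six subjects
before <- c(12.1, 13.5, 11.8, 14.2, 12.9, 13.1)
after  <- c(13.0, 14.1, 12.5, 15.0, 13.2, 14.0)
res_p <- t.test(before, after, paired = TRUE, alternative = "two.sided")
print(res_p)
```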
Chi-Square Test
The chi-square test is used to analyse the relationship between two categorical
variables.
Syntax: chisq.test(data)
Parameters:
data: data is a table containing count values of the variables in the table.
Example:
# Load the library.
library("MASS")
# Build a contingency table of two categorical variables
# (the Cars93 dataset from the MASS package is assumed here)
car.data <- table(Cars93$AirBags, Cars93$Type)
chisq.test(car.data)
data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723
Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
Limitations of Hypothesis Testing:
Sensitivity to Sample Size: Small sample sizes can lead to less reliable results.
The power of a test (the ability to detect a true effect) increases with larger
sample sizes, and small samples may fail to detect real differences.
Risk of Errors: Hypothesis testing can produce Type I and Type II errors; the
balance between these errors depends on the chosen significance level and
statistical power.
Limited Scope: Hypothesis testing typically focuses on specific hypotheses and
may not provide a complete picture of the data.
Proportion Test
Proportion testing is commonly used to analyze categorical data, especially
when working with binary outcomes or proportions.
Syntax:
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"),
conf.level = 0.95, correct = TRUE)
where
x->a vector of counts of successes, a one-dimensional table with two
entries, or a two-dimensional table (or matrix) with 2 columns, giving the
counts of successes and failures, respectively.
n->a vector of counts of trials; ignored if x is a matrix or a table.
p->a vector of probabilities of success. The length of p must be the same
as the number of groups specified by x, and its elements must be greater
than 0 and less than 1.
alternative->a character string specifying the alternative hypothesis, must
be one of "two.sided" (default), "greater" or "less". You can specify just the
initial letter. Only used for testing the null that a single proportion equals a
given value, or that two proportions are equal; ignored otherwise.
conf.level->confidence level of the returned confidence interval. Must be
a single number between 0 and 1. Only used when testing the null that a
single proportion equals a given value, or that two proportions are equal;
ignored otherwise.
correct->a logical indicating whether Yates' continuity correction should
be applied where possible.
Example:
smokers <- c( 83, 90, 129, 70 )
patients <- c( 86, 93, 136, 82 )
prop.test(smokers, patients)
output:
4-sample test for equality of proportions without continuity correction
INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE
The test statistic for a one-proportion z-test is
z = (po - pe) / sqrt(pe * q / n)
where,
po: the observed proportion
q: 1 - po
pe: the expected proportion
n: the sample size
Implementation in R
In R, the functions used for performing a one-proportion test are binom.test()
and prop.test().
Syntax:
binom.test(x, n, p = 0.5, alternative = "two.sided")
prop.test(x, n, p = NULL, alternative = "two.sided", correct = TRUE)
Parameters:
x = number of successes and failures in data set.
n = size of data set.
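A minimal sketch (the counts, 30 successes in 50 trials tested against p = 0.5, are hypothetical):

```r
# Exact binomial test and its z-test approximation
res_b <- binom.test(x = 30, n = 50, p = 0.5, alternative = "two.sided")
res_z <- prop.test(x = 30, n = 50, p = 0.5, alternative = "two.sided",
                   correct = TRUE)
print(res_b)
print(res_z)
```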
The test statistic for a two-proportion z-test is
z = (pA - pB) / sqrt(p * q * (1/nA + 1/nB))
where,
pA: the proportion observed in group A with size nA
pB: the proportion observed in group B with size nB
p and q: the overall pooled proportions
In R, the function used for performing a z-test is prop.test().
Syntax:
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"),
correct = TRUE)
Parameters:
x = number of successes and failures in data set.
n = size of data set.
p = probabilities of success. It must be in the range of 0 to 1.
alternative = a character string specifying the alternative hypothesis.
correct = a logical indicating whether Yates’ continuity correction should
be applied where possible.
Example:
# prop Test in R
prop.test(x = c(342, 290),n = c(400, 400))
Output:
sample estimates:
prop 1 prop 2
 0.855  0.725
Errors in Hypothesis Testing
Errors in hypothesis testing occur when a hypothesis is wrongly accepted or
rejected. There are two types of errors.
Type I Error:
A Type I error occurs when the null hypothesis (H0) of an experiment is true but
is nevertheless rejected. It amounts to asserting an effect that is not present
(a false hit). A Type I error is often called a false positive (an event that
indicates a given condition is present when it is absent). It is denoted by
alpha (α).
Type II Error
A Type II error occurs when the null hypothesis is false but mistakenly fails to
be rejected. It amounts to failing to detect an effect that is present (a miss).
A Type II error is also known as a false negative (a real effect is missed by
the test) in an experiment checking for a condition with an outcome of true or
false. A Type II error occurs when a true alternative hypothesis is not
accepted. It is denoted by beta (β).
Type I and Type II Errors Example
Example 1: Let us consider the null hypothesis that a man is not guilty of a
crime. A Type I error here would be convicting the man when he is innocent; a
Type II error would be acquitting him when he is actually guilty.
# Defining two samples (hypothetical data) and conducting a t-test
sample1 <- rnorm(100, mean = 0, sd = 15)
sample2 <- rnorm(100, mean = 0.5, sd = 15)
test_result <- t.test(sample1, sample2)
# Parameters
effect_size <- 0.5 # The difference between null and alternative hypotheses
sample_size <- 100 # The number of observations in each group
sd <- 15 # The standard deviation
alpha <- 0.05 # The significance level
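The Type II error in the output below can be obtained with power.t.test (a sketch restating the parameters above; power.t.test computes the power of a two-sample t-test, and the Type II error is its complement):

```r
effect_size <- 0.5   # difference between null and alternative means
sample_size <- 100   # number of observations in each group
sd <- 15             # standard deviation
alpha <- 0.05        # significance level
# Power of a two-sample t-test with these parameters
pow <- power.t.test(n = sample_size, delta = effect_size,
                    sd = sd, sig.level = alpha)$power
type_II_error <- 1 - pow   # Type II error = 1 - power
# Print Type II Error
print(type_II_error)
```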
Output
> # Print Type II Error
> print(type_II_error)
[1] 0.9436737
Analysis of Variance (ANOVA)
ANOVA, also known as Analysis of Variance, is used to investigate relations
between categorical variables and continuous variables in the R Programming
Language. It is a type of hypothesis testing that compares group means by
analysing variance.
ANOVA test involves setting up:
Null Hypothesis: The default assumption, or null hypothesis, is that there is no
meaningful relationship or impact between the variables. The null hypothesis is
commonly written as H0.
Alternative Hypothesis: The alternative hypothesis states that there is a
meaningful relationship or impact between the variables, and is commonly
written as H1 or Ha.
Syntax in R:
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts =
NULL, …)
Arguments
formula - A formula specifying the model.
data - A data frame in which the variables specified in the formula will be
found. If missing, the variables are searched for in the standard way.
projections - Logical flag: should the projections be returned?
qr - Logical flag: should the QR decomposition be returned?
contrasts - A list of contrasts to be used for some of the factors in the
formula. These are not used for any Error term, and supplying contrasts for
factors only in the Error term will give a warning.
... - Arguments to be passed to lm, such as subset or na.action.
head(mtcars)
mtcars_aov <- aov(mtcars$disp~factor(mtcars$gear))
summary(mtcars_aov)
Output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Here we print the first six records of our dataset to get an idea of its structure.
For example, a two-way ANOVA could test the effects of 3 types of fertilizer
and 2 different planting densities on crop yield.
A two-way ANOVA test is performed using the mtcars dataset, which comes with
base R, between disp, a continuous attribute, and two categorical attributes,
gear and am.
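A sketch of the two-way ANOVA described above, using + for main effects (an interaction could be added with * instead):

```r
# Two-way ANOVA: disp explained by gear and am, both treated as factors
mtcars_aov2 <- aov(disp ~ factor(gear) + factor(am), data = mtcars)
summary(mtcars_aov2)
```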
The histogram shows the mean values of gear with respect to displacement. Here
the categorical variables are gear and am, on which the factor() function is
used, and the continuous variable is disp.