

UNIT-4
Statistical Testing
Statistical tests are mathematical tools for analysing quantitative data generated
in a research study and drawing inferences from it. Here are the general steps
involved in statistical testing:
Formulate Hypotheses:
Null Hypothesis (H0): This is a statement of no effect or no difference in the
population or data samples.
Alternative Hypothesis (H1 or Ha): This is a statement that there is an effect or a
difference in the population.
Select the Appropriate Test:
Choose a statistical test based on the nature of your data and the type of
comparison you are making (e.g., t-test, chi-square test, ANOVA, etc.).
Collect and Prepare Data:
Ensure that your sample is representative and meets the assumptions of the
chosen test. Clean and organize the data for analysis.
Calculate Test Statistic:
Compute the test statistic based on the formula associated with the chosen
statistical test.
Determine the Critical Region:
Identify the critical region or critical values for the test statistic based on the
chosen significance level.
Make a Decision:
Compare the calculated test statistic with the critical value(s). If the test statistic
falls in the critical region, reject the null hypothesis. If it falls outside the critical
region, fail to reject the null hypothesis.
Draw Conclusions:
Based on your decision, draw conclusions about the null hypothesis.
Report Results:
Clearly communicate the results of the statistical test, including the test statistic,
p-value (if applicable), and any relevant confidence intervals.
Consider Limitations:
Discuss any limitations or assumptions made during the analysis.
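As a minimal end-to-end sketch of these steps in R (the sample values and the hypothesized mean of 50 are invented purely for illustration):

# Hypothetical sample; H0: population mean = 50, H1: mean != 50
x <- c(52.1, 49.8, 53.4, 51.0, 50.7, 54.2, 48.9, 52.6)
alpha <- 0.05                                         # chosen significance level

t_stat <- (mean(x) - 50) / (sd(x) / sqrt(length(x)))  # test statistic
t_crit <- qt(1 - alpha / 2, df = length(x) - 1)       # two-sided critical value

# Decision: reject H0 if the statistic falls in the critical region
if (abs(t_stat) > t_crit) "Reject H0" else "Fail to reject H0"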
Statistical Modelling
Statistical modelling is a powerful technique used in data analysis to uncover
patterns, relationships, and trends within datasets. By applying statistical methods
and models, researchers and analysts can gain insights, make predictions, and
support decision-making processes. Key steps are


 Data Collection and Preparation- Gather and preprocess data.
 Model Selection- Choose an appropriate statistical model.
 Model Fitting- Use techniques like least squares estimation or Bayesian
inference to estimate the parameters.
 Model Evaluation- Assess the performance using metrics such as goodness-of-fit
measures, prediction accuracy and diagnostic tests.
 Model Interpretation- Interpret the results and draw conclusions.

Eg., linear regression, logistic regression, time series analysis.
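As a brief illustration of these steps (a sketch; the choice of predicting mpg from wt in the built-in mtcars data is an assumption for demonstration, not part of the notes):

# Model selection and fitting: simple linear regression (least squares)
fit <- lm(mpg ~ wt, data = mtcars)

# Model evaluation: goodness of fit and diagnostics
summary(fit)$r.squared           # proportion of variance explained
par(mfrow = c(2, 2)); plot(fit)  # residual diagnostic plots

# Model interpretation: estimated intercept and slope
coef(fit)

# Prediction for a new observation (a car weighing 3000 lbs, i.e. wt = 3.0)
predict(fit, newdata = data.frame(wt = 3.0))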

Sampling Distributions in R
 A sampling distribution is the distribution of a statistic obtained through
repeated sampling from a larger population.
 It describes the range of possible outcomes for a statistic, such as the
mean or mode of some variable, as it truly exists in the population.
 The majority of data analyzed by researchers are actually drawn from
samples (a part of the pool of data), not populations (the entire pool of data).
Steps to Calculate Sampling Distributions in R:
Step 1: Here, first we have to define the number of samples (n = 1000).
n <- 1000
Step 2: Next we create a vector (sample_means) of length n filled with NA
values; the rep() function is used to replicate a value a given number of times.
Syntax: rep(value_to_be_replicated, number_of_times)
Step 3: We then fill the sample_means vector with sample means from the
considered population: each sample of 20 observations (n) is generated with
rnorm(), which draws from a normal distribution with mean 10 and standard
deviation 10 (sd), and its mean is computed with mean().

Syntax: mean(x, trim = 0)

Syntax: rnorm(n, mean, sd)

Step 4: To check the created samples we use head(), which returns the first six
elements of an object (data frame, vector, list, etc.).

Syntax: head(data_frame, no_of_rows_to_be_returned) # By default the second
argument is set to 6 in R.


Step 5: To visualize the sample_means data we plot a histogram (for better
visualization) using the hist() function in R.
Syntax: hist(v, main, xlab, ylab, col)
Step 6: Finally we find the probability of the generated sample means being
greater than or equal to 10.
Example:
# define number of samples
n <- 1000

# create empty vector of length n
sample_means <- rep(NA, n)

# fill the empty vector with sample means
for (i in 1:n) {
  sample_means[i] <- mean(rnorm(20, mean = 10, sd = 10))
}
head(sample_means)

# create histogram to visualize
hist(sample_means, main = "Sampling Distribution",
     xlab = "Sample Means", ylab = "Frequency", col = "blue")

# cross-check: find mean and sd of the sample means
mean(sample_means)
sd(sample_means)

# find the probability of a sample mean >= 10
sum(sample_means >= 10) / length(sample_means)


Hypothesis Testing
As we might know, when we infer something from data, we make an inference
based on a collection of samples rather than the true population. The main
question that comes from it is: can we trust the result from our data to make a
general assumption of the population? This is the main goal of hypothesis testing.

There are several steps that we should follow to properly conduct a hypothesis
test. The four key steps involved are:
 State the Hypotheses, form our null hypothesis and alternative hypothesis.
Null Hypothesis (H0): This is a statement of no effect or no difference in
the population or data samples.
Alternative Hypothesis (H1 or Ha): This is a statement that there is an
effect or a difference in the population.
 Formulate an analysis plan and set the criteria for decision (set our
significance level). The significance level varies depending on our use
case, but the default value is 0.05.
 Calculate the test statistic and p-value. Perform a statistical test that
suits our data. The probability of observing a result at least as extreme
as ours, assuming the null hypothesis is true, is known as the p-value.
 Check the resulting p-value and make a decision. If the p-value is
smaller than our significance level, then we reject the null hypothesis in
favour of our alternative hypothesis. If the p-value is higher than our
significance level, then we fail to reject the null hypothesis.
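A minimal sketch of these four steps in R, using simulated data (the normal sample and the 0.05 level are assumptions for illustration):

# Step 2: significance level
alpha <- 0.05

# Simulated sample; Step 1: H0: mean = 0 vs H1: mean != 0
set.seed(42)
x <- rnorm(30, mean = 0.4, sd = 1)

# Step 3: compute the test statistic and p-value
res <- t.test(x, mu = 0)

# Step 4: decide using the p-value
if (res$p.value < alpha) "Reject H0" else "Fail to reject H0"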


One Sample T-Testing

The one-sample t-test approach collects a large amount of data and tests it on
random samples. This test is used to compare the mean of the sample with a
hypothesized population mean.
Syntax: t.test(x, mu)
Parameters:
 x: represents numeric vector of data
 mu: represents true value of the mean
Example
# Defining sample vector
x <- rnorm(100)

# One Sample T-Test
t.test(x, mu = 5)

Output:
One Sample t-test

data: x
t = -49.504, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
-0.1910645 0.2090349
sample estimates:
mean of x
0.008985172
 Data: The dataset ‘x’ was used for the test.
 The determined t-value is -49.504.
 Degrees of Freedom (df): The t-test has 99 degrees of freedom.
 The p-value is less than 2.2e-16, which indicates that there is substantial
evidence against the null hypothesis.
 Alternative hypothesis: The true mean is not equal to five, according to the
alternative hypothesis.
 95 percent confidence interval: (-0.1910645, 0.2090349) is the confidence
interval’s value. This range denotes the values that, with 95% confidence,
correspond to the genuine population mean.
Two Sample T-Testing
In two-sample t-testing, two sample vectors are compared. If var.equal = TRUE,
the test assumes that the variances of both samples are equal; otherwise the
Welch approximation is used (the default).


Syntax: t.test(x, y)
Parameters:
 x and y: Numeric vectors
Example:
# Defining sample vector
x <- rnorm(100)
y <- rnorm(100)

# Two Sample T-Test
t.test(x, y)
Output:

Welch Two Sample t-test
data: x and y
t = -1.0601, df = 197.86, p-value = 0.2904
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4362140 0.1311918
sample estimates:
mean of x mean of y
-0.05075633 0.10175478
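For the pooled-variance variant mentioned above (a one-line sketch, reusing the x and y vectors from the example):

# Two-sample t-test assuming equal variances (pooled variance)
t.test(x, y, var.equal = TRUE)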

Wilcoxon Signed-Rank Test in R

This test can be divided into two parts:
 One-Sample Wilcoxon Signed Rank Test
 Paired Samples Wilcoxon Test
One-Sample Wilcoxon Signed Rank Test
The one-sample Wilcoxon signed-rank test is a non-parametric alternative to a
one-sample t-test when the data cannot be assumed to be normally distributed.
It’s used to determine whether the median of the sample is equal to a known
standard value i.e. a theoretical value.
Syntax: wilcox.test(x, mu = 0, alternative = "two.sided")

Parameters:
 x: a numeric vector containing your data values
 mu: the theoretical mean/median value. Default is 0 but you can change it.
 alternative: the alternative hypothesis. Allowed value is one of "two.sided"
(default), "greater" or "less".


Example
# R program to illustrate
# one-sample Wilcoxon signed-rank test

# The data set
set.seed(1234)
myData <- data.frame(
  name = paste0(rep("R_", 10), 1:10),
  weight = round(rnorm(10, 30, 2), 1)
)

# One-sample Wilcoxon test
result <- wilcox.test(myData$weight, mu = 25,
                      alternative = "less")

# Printing the results
print(result)
Output:

Wilcoxon signed rank exact test

data: myData$weight
V = 55, p-value = 1
alternative hypothesis: true location is less than 25

Paired Samples Wilcoxon Test in R

The paired-samples Wilcoxon test is a non-parametric alternative to the paired
t-test, used to compare paired data when they are not normally distributed.

Syntax: wilcox.test(x, y, paired = TRUE, alternative = "two.sided")

Parameters:
 x, y: numeric vectors
 paired: a logical value specifying that we want to compute a paired
Wilcoxon test
 alternative: the alternative hypothesis. Allowed value is one of "two.sided"
(default), "greater" or "less".
Example
# R program to illustrate
# Paired Samples Wilcoxon Test


# The data set

# Weight of the rabbits before treatment
before <- c(190.1, 190.9, 172.7, 213, 231.4,
            196.9, 172.2, 285.5, 225.2, 113.7)

# Weight of the rabbits after treatment
after <- c(392.9, 313.2, 345.1, 393, 434,
           227.9, 422, 383.9, 392.3, 352.2)

# Create a data frame
myData <- data.frame(
  group = rep(c("before", "after"), each = 10),
  weight = c(before, after)
)

# Paired Samples Wilcoxon Test
result <- wilcox.test(weight ~ group,
                      data = myData,
                      paired = TRUE,
                      alternative = "less")

# Printing the results
print(result)

Output:

Wilcoxon signed rank test

data: weight by group
V = 55, p-value = 1
alternative hypothesis: true location shift is less than 0
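Note: in newer versions of R the formula interface of wilcox.test() no longer accepts the paired argument. An equivalent direct call on the two vectors (the factor levels sort alphabetically, so "after" forms the first group) would be:

# Same paired test without the formula interface
result <- wilcox.test(after, before, paired = TRUE, alternative = "less")
print(result)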

Paired t-test
The paired t-test is used to check whether there is a significant difference
between two population means when the data are in the form of matched pairs.
Syntax: t.test(x, y, paired = TRUE, alternative = "two.sided")
where
 x,y: numeric vectors


 paired: a logical value specifying that we want to compute a paired t-test
 alternative: the alternative hypothesis. Allowed value is one of "two.sided"
(default), "greater" or "less".
Example:
# Define the datasets
before <- c(39, 43, 41, 32, 37, 40, 42, 40, 37, 38)
after <- c(42, 45, 42, 43, 40, 44, 40, 43, 41, 40)

# Perform the paired t-test
t.test(x = before, y = after, paired = TRUE, alternative = "greater")

Output:
Paired t-test
data: before and after
t = -2.9876, df = 9, p-value = 0.9924
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-5.002085 Inf
sample estimates:
mean of the differences
-3.1
Chi-Square test
The Chi-Square test is a statistical method to determine if two categorical
variables have a significant association between them. Both variables should be
from the same population and they should be categorical, like Yes/No,
Male/Female, Red/Green, etc.
Syntax:

chisq.test(data)
Parameters:
 data: data is a table containing count values of the variables in the table.
Example:
# Load the library.
library("MASS")

# Create a data frame from the main data set.
car.data <- data.frame(Cars93$AirBags, Cars93$Type)


# Create a table with the needed variables.
car.data <- table(Cars93$AirBags, Cars93$Type)
print(car.data)

# Perform the Chi-Square test.
print(chisq.test(car.data))
Output:
Compact Large Midsize Small Sporty Van
Driver & Passenger 2 4 7 0 3 0
Driver only 9 7 11 5 8 3
None 5 0 4 16 3 6

Pearson's Chi-squared test

data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723

Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
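The warning arises because some cells have small expected counts; these can be inspected directly from the test object (a quick check, using the car.data table defined above):

# Expected counts under independence; the chi-squared approximation
# becomes doubtful when many of these fall below 5
chisq.test(car.data)$expected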
Advantages of Hypothesis Testing:
 Objectivity: Hypothesis testing provides a structured and objective
approach to decision-making in statistical analysis.
 Inference: Hypothesis testing enables researchers to make inferences about
population parameters based on sample data.
 Standardization: The use of standardized procedures in hypothesis testing
allows for consistency across different studies and ensures that statistical
analyses are conducted in a systematic manner.
 Decision-Making: Hypothesis testing provides a clear framework for
decision-making.
 Scientific Rigor: By setting up null and alternative hypotheses and
applying statistical tests, hypothesis testing contributes to the scientific
rigor of research.
Disadvantages of Hypothesis Testing:
 Assumptions: Many hypothesis tests rely on assumptions about the data. If
these assumptions are violated, the results may be unreliable.

 Sensitivity to Sample Size: Small sample sizes can lead to less reliable results.
The power of a test (the ability to detect a true effect) increases with larger
sample sizes, and small samples may fail to detect real differences.
 Risk of Errors: There is always a risk of Type I and Type II errors; the
balance between these errors depends on the chosen significance level and
statistical power.
 Limited Scope: Hypothesis testing typically focuses on specific hypotheses and
may not provide a complete picture of the data.

Proportion Test
Proportion testing is commonly used to analyze categorical data, especially
when working with binary outcomes or proportions.
Syntax:
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"),
conf.level = 0.95, correct = TRUE)
where
 x: a vector of counts of successes, a one-dimensional table with two
entries, or a two-dimensional table (or matrix) with 2 columns, giving the
counts of successes and failures, respectively.
 n: a vector of counts of trials; ignored if x is a matrix or a table.
 p: a vector of probabilities of success. The length of p must be the same
as the number of groups specified by x, and its elements must be greater
than 0 and less than 1.
 alternative: a character string specifying the alternative hypothesis; must
be one of "two.sided" (default), "greater" or "less". You can specify just the
initial letter. Only used for testing the null that a single proportion equals a
given value, or that two proportions are equal; ignored otherwise.
 conf.level: confidence level of the returned confidence interval. Must be
a single number between 0 and 1. Only used when testing the null that a
single proportion equals a given value, or that two proportions are equal;
ignored otherwise.
 correct: a logical indicating whether Yates' continuity correction should
be applied where possible.
Example:
smokers <- c( 83, 90, 129, 70 )
patients <- c( 86, 93, 136, 82 )
prop.test(smokers, patients)

Output:
4-sample test for equality of proportions without continuity correction

data: smokers out of patients
X-squared = 12.6, df = 3, p-value = 0.005585
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.9651163 0.9677419 0.9485294 0.8536585

One-Proportion Z-Test in R Programming

The one-proportion z-test is used to compare an observed proportion to a
theoretical one when there are only two categories. For example, suppose we
have a population of mice containing half males and half females (p = 0.5 = 50%).
Some of these mice (n = 160) have developed spontaneous cancer, including 95
males and 65 females. We want to know whether cancer affects more males than
females. In this problem:

 The number of successes (males with cancer) is 95
 The observed proportion (po) of males is 95/160
 The observed proportion (q) of females is 1 − po
 The expected proportion (pe) of males is 0.5 (50%)
 The number of observations (n) is 160
The Formula for One-Proportion Z-Test
The test statistic (also known as the z-statistic) can be calculated as follows:

z = (po − pe) / √(po · q / n)

where,
 po: the observed proportion
 q: 1 − po
 pe: the expected proportion
 n: the sample size
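A small sketch computing this statistic by hand for the mouse example (note, as an aside, that prop.test() uses the expected proportion pe rather than po in the denominator, so its X-squared can differ slightly):

# Values from the mouse example above
po <- 95 / 160  # observed proportion of males
pe <- 0.5       # expected proportion under H0
q  <- 1 - po
n  <- 160

z <- (po - pe) / sqrt(po * q / n)  # test statistic
z
2 * pnorm(-abs(z))                 # two-sided p-value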
Implementation in R
In R, the functions used for performing a one-proportion test are binom.test()
and prop.test().
Syntax:
binom.test(x, n, p = 0.5, alternative = "two.sided")
prop.test(x, n, p = NULL, alternative = "two.sided", correct = TRUE)
Parameters:
 x = number of successes and failures in the data set.
 n = size of the data set.

 p = probabilities of success. It must be in the range of 0 to 1.
 alternative = a character string specifying the alternative hypothesis.
 correct = a logical indicating whether Yates' continuity correction should
be applied where possible.
Example:
# Using prop.test()
prop.test(x = 95, n = 160, p = 0.8, correct = FALSE)
Output:
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.8
X-squared = 42.539, df = 1, p-value = 6.928e-11
alternative hypothesis: true p is not equal to 0.8
95 percent confidence interval:
0.5163169 0.6667870
sample estimates:
p
0.59375
 It returns the p-value, which is 6.928e-11
 the alternative hypothesis
 a 95% confidence interval
 the estimated probability of success, 0.59
# Using binom.test()
binom.test(x = 25, n = 100, p = 0.15)
Output:
Exact binomial test

data: 25 and 100
number of successes = 25, number of trials = 100, p-value = 0.007633
alternative hypothesis: true probability of success is not equal to 0.15
95 percent confidence interval:
0.1687797 0.3465525
sample estimates:
probability of success
0.25
Two-Proportions Z-Test in R Programming
A two-proportion z-test allows us to compare two proportions to see if they are
the same. For example, let there be two groups of individuals:

Group A, with lung cancer: n = 500
Group B, healthy individuals: n = 500

The number of smokers in each group is as follows:
 Group A, with lung cancer: n = 500, 490 smokers, pA = 490/500 = 0.98 (98%)
 Group B, healthy individuals: n = 500, 400 smokers, pB = 400/500 = 0.80 (80%)
In this setting:
 The overall proportion of smokers is p = (490 + 400) / (500 + 500) = 0.89 (89%)
 The overall proportion of non-smokers is q = 1 − p = 0.11 (11%)
So we want to know whether the proportions of smokers are the same in the two
groups of individuals.

The Formula for Two-Proportion Z-Test
The test statistic (also known as the z-statistic) can be calculated as follows:

z = (pA − pB) / √(p · q · (1/nA + 1/nB))

where,
 pA: the proportion observed in group A with size nA
 pB: the proportion observed in group B with size nB
 p and q: the overall proportions
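A small sketch computing this statistic by hand for the smokers example (the values are taken from the setting above):

nA <- 500; nB <- 500
pA <- 490 / nA; pB <- 400 / nB

p <- (490 + 400) / (nA + nB)  # overall proportion of smokers
q <- 1 - p                    # overall proportion of non-smokers

z <- (pA - pB) / sqrt(p * q * (1 / nA + 1 / nB))
z
2 * pnorm(-abs(z))            # two-sided p-value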
In R, the function used for performing a z-test is prop.test().
Syntax:
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"),
correct = TRUE)

Parameters:
 x = number of successes and failures in data set.
 n = size of data set.
 p = probabilities of success. It must be in the range of 0 to 1.
 alternative = a character string specifying the alternative hypothesis.
 correct = a logical indicating whether Yates’ continuity correction should
be applied where possible.
Example:
# prop Test in R
prop.test(x = c(342, 290),n = c(400, 400))
Output:

2-sample test for equality of proportions with continuity correction
data: c(342, 290) out of c(400, 400)
X-squared = 19.598, df = 1, p-value = 9.559e-06
alternative hypothesis: two.sided
95 percent confidence interval:
0.07177443 0.18822557
sample estimates:
prop 1 prop 2
0.855 0.725
Errors in Hypothesis Testing
Errors in hypothesis testing concern the wrongful acceptance or rejection of a
particular hypothesis. There are two types of errors.
Type I Error:
A type I error appears when the null hypothesis (H0) of an experiment is true, but
still, it is rejected. It is stating something which is not present; a false hit. A type
I error is often called a false positive (an event that shows that a given condition
is present when it is absent). It is denoted by alpha (α).
Type II Error
A type II error appears when the null hypothesis is false but mistakenly fails to
be rejected. It is failing to state what is present; a miss. A type II error is also
known as a false negative (where a real hit is erroneously observed as a miss)
in an experiment checking for a condition with a final outcome of true or false.
A type II error occurs when a true alternative hypothesis is not accepted. It is
denoted by beta (β).
Type I and Type II Errors Example
Example 1: Let us consider a null hypothesis: a man is not guilty of a crime.
Then in this case:
 Type I error (False Positive): he is condemned for the crime even though he
is not guilty and did not commit it.
 Type II error (False Negative): he is found not guilty when he actually did
commit the crime, letting the guilty one go free.

Example for Type I error in R

# Set the parameters
alpha <- 0.05
sample_size <- 30
num_simulations <- 10000

# Set the seed for reproducibility
set.seed(123)

# Initialize the counter for false positives
false_positives <- 0

# Perform the simulations
for (i in 1:num_simulations) {


  # Generate two samples from the same normal
  # distribution (null hypothesis is true)
  sample1 <- rnorm(sample_size, mean = 0, sd = 1)
  sample2 <- rnorm(sample_size, mean = 0, sd = 1)

# Conduct a t-test
test_result <- t.test(sample1, sample2)

  # Check if the p-value is less than the alpha level
  if (test_result$p.value < alpha) {
false_positives <- false_positives + 1
}
}

# Calculate the Type I error rate
type1_error_rate <- false_positives / num_simulations

# Print the Type I error rate
cat("Type I Error Rate:", type1_error_rate)
Output
> # Print the Type I error rate
> cat("Type I Error Rate:", type1_error_rate)
Type I Error Rate: 0.0481

Example for Type II error in R

# Install and load required packages
if (!require(pwr))
  install.packages("pwr")
library(pwr)

# Parameters
effect_size <- 0.5 # The difference between null and alternative hypotheses
sample_size <- 100 # The number of observations in each group
sd <- 15           # The standard deviation
alpha <- 0.05      # The significance level

# Calculate Type II Error
pwr_result <- pwr.t.test(n = sample_size, d = effect_size / sd,
                         sig.level = alpha, type = "two.sample",
                         alternative = "two.sided")


type_II_error <- 1 - pwr_result$power

# Print Type II Error
print(type_II_error)

Output
> # Print Type II Error
> print(type_II_error)
[1] 0.9436737

Power of a Hypothesis Test

The probability of not committing a Type II error is called the power of a
hypothesis test.
Factors That Affect Power
The power of a hypothesis test is affected by these factors:

Effect size (ES): the difference between the hypothesized value of a parameter
and its true value. A larger effect size increases statistical power.
Sample size (n). Other things being equal, the greater the sample size, the greater
the power of the test.
Significance level (α). The lower the significance level, the lower the power of
the test. If you reduce the significance level (e.g., from 0.05 to 0.01), the region
of acceptance gets bigger. As a result, you are less likely to reject the null
hypothesis. This means you are less likely to reject the null hypothesis when it is
false, so you are more likely to make a Type II error. In short, the power of the
test is reduced when you reduce the significance level; and vice versa.
The "true" value of the parameter being tested. The greater the difference
between the "true" value of a parameter and the value specified in the null
hypothesis, the greater the power of the test. That is, the greater the effect size,
the greater the power of the test.
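The effect of sample size on power can be seen directly with base R's power.t.test() (a sketch; the effect size of 0.5 standard deviations and the other settings are assumed values for illustration):

# Power of a two-sample t-test at two sample sizes, same effect size
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = 0.05)$power  # larger n, higher power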

Analysis of Variance(ANOVA)
ANOVA, also known as Analysis of Variance, is used to investigate relations
between categorical variables and continuous variables in the R Programming
Language. It is a type of hypothesis test that compares group means by
partitioning the variance.
ANOVA test involves setting up:
Null Hypothesis: The default assumption, or null hypothesis, is that there is no
meaningful relationship or impact between the variables. The null hypothesis is
commonly written as H0.


Alternate Hypothesis: The opposite of the null hypothesis is the alternative
hypothesis. It implies that there is a significant relationship, difference, or link
among the population's variables. Alternative hypotheses are sometimes referred
to as H1 or HA.

Syntax in R:
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts =
NULL, …)
Arguments
 formula: a formula specifying the model.
 data: a data frame in which the variables specified in the formula will be
found. If missing, the variables are searched for in the standard way.
 projections: logical flag; should the projections be returned?
 qr: logical flag; should the QR decomposition be returned?
 contrasts: a list of contrasts to be used for some of the factors in the
formula. These are not used for any Error term, and supplying contrasts for
factors only in the Error term will give a warning.
 …: arguments to be passed to lm, such as subset or na.action.

One-way ANOVA: When there is a single categorical independent variable (also
known as a factor) and a single continuous dependent variable, a one-way
ANOVA is employed. It seeks to ascertain whether there are any notable
variations in the dependent variable's means across the levels of the independent
variable. Eg., in a one-way ANOVA we might test the effects of 3 types of
fertilizer on crop yield.
Here a one-way ANOVA test is performed between the disp attribute, a
continuous attribute, and the gear attribute, a categorical attribute, using the
mtcars dataset that ships with base R. The steps are:

 Set up the null hypothesis and alternate hypothesis
 H0: all group means are equal (there is no difference between average
displacement for different gears)
 H1: not all means are equal.
# Installing the package
install.packages("dplyr")

# Loading the package
library(dplyr)


head(mtcars)
mtcars_aov <- aov(mtcars$disp~factor(mtcars$gear))
summary(mtcars_aov)

Output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

Here we print the first six records of our dataset (the default for head()) to get an idea of the data.

Df Sum Sq Mean Sq F value Pr(>F)
factor(mtcars$gear) 2 280221 140110 20.73 2.56e-06 ***
Residuals 29 195964 6757
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Df: the model's degrees of freedom.
 Sum Sq: the sums of squares, which represent the variability that the
model is able to account for.
 Mean Sq: The variance explained by each component is represented by the
mean squares.
 F-value: It is the measure used to compare the mean squares both within
and between groups.
 Pr(>F): The F-statistics p-value, which denotes the factors’ statistical
significance.
 Residuals: the deviations of observations from their group means, together
with their summary statistics.
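Since a significant F-value only says that not all group means are equal, a common follow-up (not part of these notes) is Tukey's honest significant difference test on the fitted aov object, which compares each pair of gear groups:

# Pairwise comparisons of mean displacement between gear groups
TukeyHSD(mtcars_aov)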
Two-way ANOVA: When there are two categorical independent variables
(factors) and one continuous dependent variable, two-way ANOVA is used as an
extension of one-way ANOVA. You can evaluate both the direct impacts of each
independent variable and how they interact with one another on the dependent
variable. Eg., In the two-way ANOVA, we add an additional independent


variable: planting density. We test the effects of 3 types of fertilizer and 2 different
planting densities on crop yield.
A two-way ANOVA test is performed using the mtcars dataset between the disp
attribute, a continuous attribute, and the gear and am attributes, both categorical
attributes.

 Set up the null hypothesis and alternate hypothesis
 H0: all group means are equal (there is no difference between average
displacement across the gear and transmission groups)
 H1: not all means are equal
# Installing the packages
install.packages("dplyr")
install.packages("lattice")  # provides histogram()

# Loading the packages
library(dplyr)
library(lattice)

# Variance in mean within group and between group
histogram(~ disp | factor(gear), data = mtcars, subset = (am == 0),
          xlab = "gear", ylab = "disp", main = "Automatic")
histogram(~ disp | factor(gear), data = mtcars, subset = (am == 1),
          xlab = "gear", ylab = "disp", main = "Manual")
Output: (two lattice histograms of disp by gear, one panel set for automatic cars and one for manual cars)


The histograms show the distribution of displacement across gear values. Here
the categorical variables are gear and am, on which the factor function is used,
and the continuous variable is disp.

Calculate the test statistic using the aov function:

mtcars_aov2 <- aov(mtcars$disp ~ factor(mtcars$gear) *
                     factor(mtcars$am))
summary(mtcars_aov2)
Output:

Df Sum Sq Mean Sq F value Pr(>F)
factor(mtcars$gear) 2 280221 140110 20.695 3.03e-06 ***
factor(mtcars$am) 1 6399 6399 0.945 0.339
Residuals 28 189565 6770
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
