0% found this document useful (0 votes)

38 views21 pages

Unit4 R

The document discusses statistical testing and modeling. It covers topics like hypotheses formulation, statistical tests, sampling distributions, and hypothesis testing. Statistical tests discussed include t-tests, ANOVA, and Wilcoxon signed-rank test. Steps for statistical modeling like data collection, model selection, fitting and evaluation are also covered.

Uploaded by

Professor

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

38 views21 pages

Unit4 R

Uploaded by

Professor

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

UNIT-4
Statistical Testing
Statistical tests are mathematical tools for analysing quantitative data generated
in a research study and making inference. Here are the general steps involved in
statistical testing:
Formulate Hypotheses:
Null Hypothesis (H0): This is a statement of no effect or no difference in the
population or data samples.
Alternative Hypothesis (H1 or Ha): This is a statement that there is an effect or a
difference in the population.
Select the Appropriate Test:
Choose a statistical test based on the nature of your data and the type of
comparison you are making (e.g., t-test, chi-square test, ANOVA, etc.).
Collect and Prepare Data:
Ensure that your sample is representative and meets the assumptions of the
chosen test. Clean and organize the data for analysis.
Calculate Test Statistic:
Compute the test statistic based on the formula associated with the chosen
statistical test.
Determine the Critical Region:
Identify the critical region or critical values for the test statistic based on the
chosen significance level.
Make a Decision:
Compare the calculated test statistic with the critical value(s). If the test statistic
falls in the critical region, reject the null hypothesis. If it falls outside the critical
region, fail to reject the null hypothesis.
Draw Conclusions:
Based on your decision, draw conclusions about the null hypothesis. .
Report Results:
Clearly communicate the results of the statistical test, including the test statistic,
p-value (if applicable), and any relevant confidence intervals.
Consider Limitations:
Discuss any limitations or assumptions made during the analysis.
Statistical Modelling
Statistical modelling is a powerful technique used in data analysis to uncover
patterns, relationships, and trends within datasets. By applying statistical methods
and models, researchers and analysts can gain insights, make predictions, and
support decision-making processes. Key steps are

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

 Data Collection and preparation- Gather and preprocess data

 Model Selection- Choose appropriate statistical model
 Model Fitting- Use techniques like square estimation, Bayesian
interference to estimate the parameters.
 Model Evaluation-Access the performance using metrics as goodness to fit
measures, prediction accuracy and diagnostic tests.
 Model Interpretation- Interpret Results and make conclusions.

Eg., Linear Regression, logistic regression, time series analysis.

Sampling Distributions in R
 A sampling distribution is a statistic that is arrived out through repeated
sampling from a larger population
 It describes a range of possible outcomes that of a statistic, such as the
mean or mode of some variable, as it truly exists a population.
 The majority of data analyzed by researchers are actually drawn from
samples(a part of the pool of data), and not populations(entire pool of data).
Steps to Calculate Sampling Distributions in R:
Step 1: Here, first we have to define a number of samples(n=1000).
n<-1000
Step 2: Next we create a vector(sample_means) of length ‘n’ with Null(NA)
values [ rep() function is used to replicate the values in the vector
Syntax: rep(value_to_be_replicated,number_of_times)
Step 3: Later we filled the created sample_means null vector with sample means
from the considered population using the mean() function which are having a
sample mean of 10(mean) and standard deviation of 10(sd) of 20 samples(n)
using rnorm() which is used to generate normal distributions.

Syntax: mean(x, trim = 0)

Syntax: rnorm(n, mean, sd)

Step 4: To check the created samples we used head() which returns the first six
samples of the dataframe (vector,list etc,.).

Syntax:head(data_frame,no_of_rows_be_returned) #By default second argument

is set to 6 in R.

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

Step 5: Finally to visualize the sample_mean dataset we plotted a histogram ( for

better visualization ) using hist() function in R.
Syntax:hist(v,main,xlab,ylab,col)
Step 6: Finally we found the probability of generated sample means which are
having mean greater than or equal to 10.
Example:
# define number of samples
n < -1000

# create empty vector of length n

sample_means = rep(NA, n)

# fill empty_vector with means

for(i in 1: n){
sample_means[i] = mean(rnorm(20, mean=10, sd=10))
}
head(sample_means)

# create histogram to visualize

hist(sample_means, main="Sampling Distribution",
xlab="Sample Means", ylab="Frequency", col="blue")

# To cross check find mean and sd of sample

mean(sample_means)

sd(sample_means)

# To find probability
sum(sample_means >= 10)/length(sample_means)

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

Hypothesis Testing
As we might know, when we infer something from data, we make an inference
based on a collection of samples rather than the true population. The main
question that comes from it is: can we trust the result from our data to make a
general assumption of the population? This is the main goal of hypothesis testing.

There are several steps that we should do to properly conduct a hypothesis testing.
The Four Key steps involved are
 State the Hypotheses, form our null hypothesis and alternative hypothesis.
Null Hypothesis (H0): This is a statement of no effect or no difference in
the population or data samples.
Alternative Hypothesis (H1 or Ha): This is a statement that there is an
effect or a difference in the population.
 Formulate an analysis plan and set the criteria for decision(Set our
significance level). The significance level varies depending on our use
case, but the default value is 0.05.
 Calculate the Test statistic and P-value. Perform a statistical test that
suits our data. The probability is known as the p-value.
 Check the resulting p-Value and Make a Decision. If the p-Value is
smaller than our significance level, then we reject the null hypothesis in
favour of our alternative hypothesis. If the p-Value is higher than our
significance level, then we go with our null hypothesis.

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

One Sample T-Testing

One sample T-Testing approach collects a huge amount of data and tests it on
random samples. This test is used to test the mean of the sample with the
population.
Syntax: t.test(x, mu)
Parameters:
 x: represents numeric vector of data
 mu: represents true value of the mean
Example
# Defining sample vector
x <- rnorm(100)

# One Sample T-Test

t.test(x, mu = 5)

Output:
One Sample t-test

data: x
t = -49.504, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
-0.1910645 0.2090349
sample estimates:
mean of x
0.008985172
 Data: The dataset ‘x’ was used for the test.
 The determined t-value is -49.504.
 Degrees of Freedom (df): The t-test has 99 degrees of freedom.
 The p-value is 2.2e-16, which indicates that there is substantial evidence
refuting the null hypothesis.
 Alternative hypothesis: The true mean is not equal to five, according to the
alternative hypothesis.
 95 percent confidence interval: (-0.1910645, 0.2090349) is the confidence
interval’s value. This range denotes the values that, with 95% confidence,
correspond to the genuine population mean.
Two Sample T-Testing
In two sample T-Testing, the sample vectors are compared. If var. equal = TRUE,
the test assumes that the variances of both the samples are equal.

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

Syntax: t.test(x, y)
Parameters:
 x and y: Numeric vectors
Example:
# Defining sample vector
x <- rnorm(100)
y <- rnorm(100)

# Two Sample T-Test

t.test(x, y)
Output:

Welch Two Sample t-test

data: x and y
t = -1.0601, df = 197.86, p-value = 0.2904
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4362140 0.1311918
sample estimates:
mean of x mean of y
-0.05075633 0.10175478

Wilcoxon Signed-Rank Test in R

This test can be divided into two parts:
 One-Sample Wilcoxon Signed Rank Test
 Paired Samples Wilcoxon Test
One-Sample Wilcoxon Signed Rank Test
The one-sample Wilcoxon signed-rank test is a non-parametric alternative to a
one-sample t-test when the data cannot be assumed to be normally distributed.
It’s used to determine whether the median of the sample is equal to a known
standard value i.e. a theoretical value.
Syntax: wilcox.test(x, mu = 0, alternative = “two.sided”)

Parameters:
 x: a numeric vector containing your data values
 mu: the theoretical mean/median value. Default is 0 but you can change it.
 alternative: the alternative hypothesis. Allowed value is one of “two.sided”
(default), “greater” or “less”.

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

Example
# R program to illustrate
# one-sample Wilcoxon signed-rank test

# The data set

set.seed(1234)
myData = data.frame(
name = paste0(rep("R_", 10), 1:10),
weight = round(rnorm(10, 30, 2), 1)
)

# One-sample wilcoxon test

wilcox.test(myData$weight, mu = 25,
alternative = "less")

# Printing the results

print(result)
Output:

Wilcoxon signed rank exact test

data: myData$weight
V = 55, p-value = 1
alternative hypothesis: true location is less than 25

Paired Samples Wilcoxon Test in R

The paired samples Wilcoxon test is a non-parametric alternative to paired t-test
used to compare paired data. It’s used when data are not normally distributed.

Syntax: wilcox.test(x, y, paired = TRUE, alternative = “two.sided”)

Parameters:
 x, y: numeric vectors
 paired: a logical value specifying that we want to compute a paired
Wilcoxon test
 alternative: the alternative hypothesis. Allowed value is one of “two.sided”
(default), “greater” or “less”.
Example
# R program to illustrate
# Paired Samples Wilcoxon Test

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

# The data set

# Weight of the rabbit before treatment
before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)

# Weight of the rabbit after treatment

after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)

# Create a data frame

myData <- data.frame(
group = rep(c("before", "after"), each = 10),
weight = c(before, after)
)

# Paired Samples Wilcoxon Test

result = wilcox.test(weight ~ group,
data = myData,
paired = TRUE,
alternative = "less")

# Printing the results

print(result)

Output:

Wilcoxon signed rank test

data: weight by group

V = 55, p-value = 1
alternative hypothesis: true location shift is less than 0

Paired t-test
Paired test is used to check whether there is a significant difference between two
population means when their data is in the form of matched pairs.
Syntax: t.test(x, y, paired = TRUE, alternative = "two.sided")
where
 x,y: numeric vectors

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

 paired: a logical value specifying that we want to compute a paired t-test

 alternative: the alternative hypothesis. Allowed value is one of “two.sided”
(default), “greater” or “less”.
Example:
# Define the datasets

before <- c(39,43,41,32,37,40,42,40,37,38)

after <- c(42,45,42,43,40,44,40,43,41,40)

# Perform the paired t-test

t.test(x=before,y=after,paired = TRUE,alternative = "greater")

Output:
Paired t-test
data: before and after
t = -2.9876, df = 9, p-value = 0.9924
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-5.002085 Inf
sample estimates:
mean of the differences
-3.1
Chi-Square test
Chi-Square test is a statistical method to determine if two categorical variables
have a significant correlation between them. Both those variables should be from
same population and they should be categorical like − Yes/No, Male/Female,
Red/Green etc.
Syntax:

chisq.test(data)
Parameters:
 data: data is a table containing count values of the variables in the table.
Example:
# Load the library.
library("MASS")

# Create a data frame from the main data set.

car.data <- data.frame(Cars93$AirBags, Cars93$Type)

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

# Create a table with the needed variables.

car.data = table(Cars93$AirBags, Cars93$Type)
print(car.data)

# Perform the Chi-Square test.

print(chisq.test(car.data))
Output:
Compact Large Midsize Small Sporty Van
Driver & Passenger 2 4 7 0 3 0
Driver only 9 7 11 5 8 3
None 5 0 4 16 3 6

Pearson's Chi-squared test

data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723

Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
Advantages of Hypothesis Testing:

 Objectivity:Hypothesis testing provides a structured and objective

approach to decision-making in statistical analysis.
 Inference:Hypothesis testing enables researchers to make inferences about
population parameters based on sample data.
 Standardization:The use of standardized procedures in hypothesis testing
allows for consistency across different studies and ensures that statistical
analyses are conducted in a systematic manner.
 Decision-Making:Hypothesis testing provides a clear framework for
decision-making
 Scientific Rigor:By setting up null and alternative hypotheses and
applying statistical tests, hypothesis testing contributes to the scientific
rigor of research.
Disadvantages of Hypothesis Testing:

Assumptions:Many hypothesis tests rely on assumptions about the data, If these

assumptions are violated, the results may be unreliable.

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

Sensitivity to Sample Size:Small sample sizes can lead to less reliable results.
The power of a test (the ability to detect a true effect) increases with larger sample
sizes, and small samples may fail to detect real differences.
Risk of Errors:The balance between these errors depends on the chosen
significance level and statistical power.
Limited Scope:Hypothesis testing typically focuses on specific hypotheses and
may not provide a complete picture of the data.

Proportion Test
Proportion testing is commonly used to ananlyze categorical data, especially
when working with binary outcomes or proportions.
Syntax:
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"),
conf.level = 0.95, correct = TRUE)
where
 x->a vector of counts of successes, a one-dimensional table with two
entries, or a two-dimensional table (or matrix) with 2 columns, giving the
counts of successes and failures, respectively.
 n->a vector of counts of trials; ignored if x is a matrix or a table.
 p->a vector of probabilities of success. The length of p must be the same
as the number of groups specified by x, and its elements must be greater
than 0 and less than 1.
 Alternative->a character string specifying the alternative hypothesis, must
be one of "two.sided" (default), "greater" or "less". You can specify just the
initial letter. Only used for testing the null that a single proportion equals a
given value, or that two proportions are equal; ignored otherwise.
 conf.level->confidence level of the returned confidence interval. Must be
a single number between 0 and 1. Only used when testing the null that a
single proportion equals a given value, or that two proportions are equal;
ignored otherwise.
 Correct->a logical indicating whether Yates' continuity correction should
be applied where possible.
Example:
smokers <- c( 83, 90, 129, 70 )
patients <- c( 86, 93, 136, 82 )
prop.test(smokers, patients)

output:
4-sample test for equality of proportions without continuity correction

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

data: smokers out of patients

X-squared = 12.6, df = 3, p-value = 0.005585
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.9651163 0.9677419 0.9485294 0.8536585

One-Proportion Z-Test in R Programming

The One proportion Z-test is used to compare an observed proportion to a

theoretical one when there are only two categories. For example, we have a
population of mice containing half male and half females (p = 0.5 = 50%). Some
of these mice (n = 160) have developed spontaneous cancer, including 95 males
and 65 females. We want to know, whether cancer affects more males than
females? So in this problem:

 The number of successes (male with cancer) is 95

 The observed proportion (po) of the male is 95/160
 The observed proportion (q) of the female is 1 – po
 The expected proportion (pe) of the male is 0.5 (50%)
 The number of observations (n) is 160
The Formula for One-Proportion Z-Test
The test statistic (also known as z-test) can be calculated as follow:

where,
 po: the observed proportion
 q: 1 – po
 pe: the expected proportion n: the sample size
Implementation in R
In R Language, the function used for performing a z-test is binom.test() and
prop.test().
Syntax:
binom.test(x, n, p = 0.5, alternative = “two.sided”)
prop.test(x, n, p = NULL, alternative = “two.sided”, correct = TRUE)
Parameters:
 x = number of successes and failures in data set.
 n = size of data set.

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

 p = probabilities of success. It must be in the range of 0 to 1.

 alternative = a character string specifying the alternative hypothesis.
 correct = a logical indicating whether Yates’ continuity correction should
be applied where possible.
Example:
# Using prop.test()
prop.test(x = 95, n = 160, p = 0.8, correct = FALSE)
o/p:
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.8
X-squared = 42.539, df = 1, p-value = 6 . 928e-11
alternative hypothesis: true p is not equal to 0.8
95 percent confidence interval:
0.5163169 0.6667870
sample estimates:
p
0.59375
 It returns p-value which is 6.928462
 alternative hypothesis
 a 95% confidence intervals.
 probability of success is 0.59
# Using binom.test()
binom.test(x =25, n = 100, p = 0.15)
o/p:
Exact binomial test

data: 25 and 100

number of successes = 25, number of trials = 100, p-value = 0.007633
alternative hypothesis: true probability of success is not equal to 0.15
95 percent confidence interval:
0.1687797 0.3465525
sample estimates:
probability of success
0.25
Two-Proportions Z-Test in R Programming
A two-proportion z-test allows us to compare two proportions to see if they are
the same. For example, let there be two groups of individuals:

Group A with lung cancer: n = 500

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

Group B, healthy individuals: n = 500

The number of smokers in each group is as follows:
 Group A with lung cancer: n = 500, 490 smokers, pA = 490/500 = 98
 Group B, healthy individuals: n = 500, 400 smokers, pB = 400/500 = 80
In this setting:
 The overall proportion of smokers is p = frac(490+400) 500 + 500 = 89
 The overall proportion of non-smokers is q = 1 – p = 11
So we want to know, whether the proportions of smokers are the same in the two
groups of individuals.

The Formula for Two-Proportion Z-Test

The test statistic (also known as z-test) can be calculated as follow:

where, pA: the proportion observed in group A with size nA pB: the proportion
observed in group B with size nB p and q: the overall proportions
In R, the function used for performing a z-test is prop.test().
Syntax:
prop.test(x, n, p = NULL, alternative = c(“two.sided”, “less”, “greater”),
correct = TRUE)

Parameters:
 x = number of successes and failures in data set.
 n = size of data set.
 p = probabilities of success. It must be in the range of 0 to 1.
 alternative = a character string specifying the alternative hypothesis.
 correct = a logical indicating whether Yates’ continuity correction should
be applied where possible.
Example:
# prop Test in R
prop.test(x = c(342, 290),n = c(400, 400))
Output:

2-sample test for equality of proportions with continuity correction

data: c(342, 290) out of c(400, 400)
X-squared = 19.598, df = 1, p-value = 9.559e-06
alternative hypothesis: two.sided
95 percent confidence interval:
0.07177443 0.18822557
sample estimates:

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

prop 1 prop 2
0.855 0.725
Errors in Hypothesis Testing
Errors in Hypothesis Testing is the estimate of the approval or rejection of a
particular hypothesis. There are two types of errors.
Type I Error:
A type I error appears when the null hypothesis (H0) of an experiment is true, but
still, it is rejected. It is stating something which is not present or a false hit. A type
I error is often called a false positive (an event that shows that a given condition
is present when it is absent). It is denoted by alpha (α).
Type II Error
A type II error appears when the null hypothesis is false but mistakenly fails to
be refused. It is losing to state what is present and a miss. A type II error is also
known as false negative (where a real hit was rejected by the test and is observed
as a miss), in an experiment checking for a condition with a final outcome of true
or false.A type II error is assigned when a true alternative hypothesis is not
acknowledged. It is denoted by beta(β)
Type I and Type II Errors Example
Example 1: Let us consider a null hypothesis – A man is not guilty of a crime.

Then in this case:

Type I error (False Positive) Type II error (False Negative)
He is condemned to crime, though he He is condemned not guilty when the
is not guilty or committed the crime. court actually does commit the crime
by letting the guilty one go free.

Example for Type 1 error in R

# Set the parameters
alpha <- 0.05
sample_size <- 30
num_simulations <- 10000

# Set the seed for reproducibility

set.seed(123)

# Initialize the counter for false positives

false_positives <- 0

# Perform the simulations

for (i in 1:num_simulations) {

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

# Generate two samples from the same normal

# distribution (null hypothesis is true)
sample1 <- rnorm(sample_size, mean = 0, sd = 1)
sample2 <- rnorm(sample_size, mean = 0, sd = 1)

# Conduct a t-test
test_result <- t.test(sample1, sample2)

# Check if the p-value is less than the alpha level

if (test_result$p.value < alpha) {
false_positives <- false_positives + 1
}
}

# Calculate the Type I error rate

type1_error_rate <- false_positives / num_simulations

# Print the Type I error rate

cat("Type I Error Rate:", type1_error_rate)
Output
> # Print the Type I error rate
> cat("Type I Error Rate:", type1_error_rate)
Type I Error Rate: 0.0481

Example for Type II error in R

# Install and load required packages
if (!require(pwr))
install.packages("pwr")
library(pwr)

# Parameters
effect_size <-
0.5 # The difference between null and alternative hypotheses
sample_size <- 100 # The number of observations in each group
sd <- 15 # The standard deviation
alpha <- 0.05 # The significance level

# Calculate Type II Error

pwr_result <-pwr.t.test(n = sample_size,d = effect_size / sd,sig.level = alpha,type
= "two.sample",alternative = "two.sided")

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

type_II_error <- 1 - pwr_result$power

# Print Type II Error

print(type_II_error)

Output
> # Print Type II Error
> print(type_II_error)
[1] 0.9436737

Power of a Hypothesis Test

The probability of not committing a Type II error is called the power of a
hypothesis test.
Factors That Affect Power
The power of a hypothesis test is affected by these factors.

Effect size(ES): The difference between the hypothesized value of a parameter

and its true value. A larger effect size increases statistical power.
Sample size (n). Other things being equal, the greater the sample size, the greater
the power of the test.
Significance level (α). The lower the significance level, the lower the power of
the test. If you reduce the significance level (e.g., from 0.05 to 0.01), the region
of acceptance gets bigger. As a result, you are less likely to reject the null
hypothesis. This means you are less likely to reject the null hypothesis when it is
false, so you are more likely to make a Type II error. In short, the power of the
test is reduced when you reduce the significance level; and vice versa.
The "true" value of the parameter being tested. The greater the difference
between the "true" value of a parameter and the value specified in the null
hypothesis, the greater the power of the test. That is, the greater the effect size,
the greater the power of the test.

Analysis of Variance(ANOVA)
ANOVA also known as Analysis of variance is used to investigate relations
between categorical variables and continuous variables in the R Programming
Language. It is a type of hypothesis testing for population variance.
ANOVA test involves setting up:
Null Hypothesis: The default assumption, or null hypothesis, is that there is no
meaningful relationship or impact between the variables. The null hypothesis is
commonly written as H0.

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

Alternate Hypothesis: The opposite of the null hypothesis is the alternative

hypothesis. It implies that there is a significant relationship, difference, or link
among the population’s variables. Alternative hypotheses are sometimes referred
to as H1 or HA.

Syntax in R:
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts =
NULL, …)
Arguments
 Formula-A formula specifying the model.
 Data-A data frame in which the variables specified in the formula will be
found. If missing, the variables are searched for in the standard way.
 Projections-Logical flag: should the projections be returned?
 Qr-Logical flag: should the QR decomposition be returned?
 Contrasts-A list of contrasts to be used for some of the factors in the
formula. These are not used for any Error term, and supplying contrasts for
factors only in the Error term will give a warning.
 …-Arguments to be passed to lm, such as subset or na.action.

One-way ANOVA: One-way When there is a single categorical independent

variable (also known as a factor) and a single continuous dependent variable, an
ANOVA is employed. It seeks to ascertain whether there are any notable
variations in the dependent variable’s means across the levels of the independent
variable. Eg., In the one-way ANOVA, we test the effects of 3 types of fertilizer
on crop yield.
One-way ANOVA test is performed using mtcars dataset which comes
preinstalled with dplyr package between disp attribute, a continuous
attribute and gear attribute, a categorical attribute. here are some steps.

 Setup Null Hypothesis and Alternate Hypothesis

 H0 = mu = mu01 = mu02(There is no difference between average
displacement for different gears)
 H1 = Not all means are equal.
# Installing the package
install.packages("dplyr")

# Loading the package

library(dplyr)

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

head(mtcars)
mtcars_aov <- aov(mtcars$disp~factor(mtcars$gear))
summary(mtcars_aov)

Output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

Here we will print top 5 record of our dataset to get an idea about our dataset.

Df Sum Sq Mean Sq F value Pr(>F)

factor(mtcars$gear) 2 280221 140110 20.73 2.56e-06 ***
Residuals 29 195964 6757
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Df: The model’s degrees of freedom.

 Sum Sq: The sums of squares, which represent the variability that the
model is able to account for.
 Mean Sq: The variance explained by each component is represented by the
mean squares.
 F-value: It is the measure used to compare the mean squares both within
and between groups.
 Pr(>F): The F-statistics p-value, which denotes the factors’ statistical
significance.
 Residuals: Relative deviations from the group mean, are often known as
residuals and their summary statistics.
Two-way ANOVA: When there are two categorical independent variables
(factors) and one continuous dependent variable, two-way ANOVA is used as an
extension of one-way ANOVA. You can evaluate both the direct impacts of each
independent variable and how they interact with one another on the dependent
variable. Eg., In the two-way ANOVA, we add an additional independent

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

variable: planting density. We test the effects of 3 types of fertilizer and 2 different
planting densities on crop yield.
A two-way ANOVA test is performed using mtcars dataset which comes
preinstalled with dplyr package between disp attribute, a continuous attribute and
gear attribute, a categorical attribute, am attribute, a categorical attribute.

 Setup Null Hypothesis and Alternate Hypothesis

 H0 = mu0 = mu01 = mu02(There is no difference between average
displacement for different gear)
 H1 = Not all means are equal
# Installing the package
install.packages("dplyr")

# Loading the package

library(dplyr)

# Variance in mean within group and between group

histogram(mtcars$disp~mtcars$gear, subset = (mtcars$am == 0),
xlab = "gear", ylab = "disp", main = "Automatic")
histogram(mtcars$disp~mtcars$gear, subset = (mtcars$am == 1),
xlab = "gear", ylab = "disp", main = "Manual")
Output:

INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

The histogram shows the mean values of gear with respect to displacement. Hear
categorical variables are gear and am on which factor function is used and
continuous variable is disp.

Calculate test statistics using aov function

mtcars_aov2 <- aov(mtcars$disp~factor(mtcars$gear) *
factor(mtcars$am))
summary(mtcars_aov2)
Output:

Df Sum Sq Mean Sq F value Pr(>F)

factor(mtcars$gear) 2 280221 140110 20.695 3.03e-06 ***
factor(mtcars$am) 1 6399 6399 0.945 0.339
Residuals 28 189565 6770
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

INNAHAI ANUGRAHAM

R Programming Unit 4
No ratings yet
R Programming Unit 4
26 pages
Data Visualization Notes Ou
No ratings yet
Data Visualization Notes Ou
125 pages
Coloring Pages Spring and Summer 2
No ratings yet
Coloring Pages Spring and Summer 2
4 pages
T Test
No ratings yet
T Test
21 pages
Module2 BDA
No ratings yet
Module2 BDA
44 pages
Cyber Security Full Notes
No ratings yet
Cyber Security Full Notes
81 pages
Advanced Statistical Methods Using R Notes
No ratings yet
Advanced Statistical Methods Using R Notes
55 pages
DEV Lab Manual
No ratings yet
DEV Lab Manual
27 pages
Hypothesis Testing
No ratings yet
Hypothesis Testing
8 pages
Statistics
No ratings yet
Statistics
28 pages
Statistical Hypothesis Testing
No ratings yet
Statistical Hypothesis Testing
20 pages
Unit4 R
No ratings yet
Unit4 R
21 pages
Data Science Statistics Notes
No ratings yet
Data Science Statistics Notes
8 pages
Chapter 4 - STAT1204 A
No ratings yet
Chapter 4 - STAT1204 A
10 pages
Stat 362 Study Guide
No ratings yet
Stat 362 Study Guide
88 pages
BA - Advanced Statistical Method Using R
No ratings yet
BA - Advanced Statistical Method Using R
13 pages
Hypothesis Tests in R
No ratings yet
Hypothesis Tests in R
25 pages
Hypothesis
No ratings yet
Hypothesis
16 pages
Lab6 - Hypothesis Testing and Confidence Intervals in R
No ratings yet
Lab6 - Hypothesis Testing and Confidence Intervals in R
3 pages
STAT359 Study Guide
No ratings yet
STAT359 Study Guide
7 pages
Ed Aaaaaaa
No ratings yet
Ed Aaaaaaa
7 pages
Data Analytics Module 1 Lesson 6 Summary Notes
No ratings yet
Data Analytics Module 1 Lesson 6 Summary Notes
17 pages
Hypothesis
No ratings yet
Hypothesis
27 pages
Lec01 Twopopulationtests
No ratings yet
Lec01 Twopopulationtests
33 pages
Hypothesis Testing
No ratings yet
Hypothesis Testing
29 pages
Type I and Type II Errors Type I Error
No ratings yet
Type I and Type II Errors Type I Error
7 pages
NEP 5th Sem Syllabus
No ratings yet
NEP 5th Sem Syllabus
17 pages
Research Methodology and Biostatistics Part II 2
No ratings yet
Research Methodology and Biostatistics Part II 2
45 pages
Lecture 1: Course Introduction, Review and Paired-Samples T-Test
No ratings yet
Lecture 1: Course Introduction, Review and Paired-Samples T-Test
13 pages
BA - Advanced Statistical Method Using R (P2)
No ratings yet
BA - Advanced Statistical Method Using R (P2)
12 pages
Business Analytics & Machine Learning: Regression Analysis
No ratings yet
Business Analytics & Machine Learning: Regression Analysis
58 pages
BRM Unit 3 & 5 Data Analysis
No ratings yet
BRM Unit 3 & 5 Data Analysis
50 pages
DV Unit 1&2 Notes
No ratings yet
DV Unit 1&2 Notes
50 pages
RP Notes Unit 4 - Distribution Fucntions
No ratings yet
RP Notes Unit 4 - Distribution Fucntions
5 pages
DA Unit II - II
No ratings yet
DA Unit II - II
47 pages
20ma402 Ps Unit III DCM
No ratings yet
20ma402 Ps Unit III DCM
77 pages
Unit 2-Evaluating Analytical Data
No ratings yet
Unit 2-Evaluating Analytical Data
21 pages
Descriptive Statistics
No ratings yet
Descriptive Statistics
7 pages
Statistics Cheat Sheet
100% (3)
Statistics Cheat Sheet
23 pages
Psychology Statistic Note
No ratings yet
Psychology Statistic Note
13 pages
AD3411 - 6 To11
No ratings yet
AD3411 - 6 To11
15 pages
Inferential Statistics For Data Science
100% (1)
Inferential Statistics For Data Science
10 pages
Corporate Administration Unit 1 Notes
0% (1)
Corporate Administration Unit 1 Notes
5 pages
Modelling in R
No ratings yet
Modelling in R
47 pages
Quantitative Methods and Business Statistics For Decision Making (MSA606)
No ratings yet
Quantitative Methods and Business Statistics For Decision Making (MSA606)
63 pages
Notes On Applied Statistics
No ratings yet
Notes On Applied Statistics
16 pages
With Answers
100% (1)
With Answers
24 pages
304BA AdvancedStatisticalMethodsUsingR
No ratings yet
304BA AdvancedStatisticalMethodsUsingR
31 pages
Intro To Probability and Statistics
No ratings yet
Intro To Probability and Statistics
147 pages
Lecture Notes Statistics
100% (2)
Lecture Notes Statistics
117 pages
Energy Deficiency Impairs Resistance Training Gain
No ratings yet
Energy Deficiency Impairs Resistance Training Gain
13 pages
Analysing and Presenting Data: Practical Hints: Daniele CEI, Giorgio MATTEI
No ratings yet
Analysing and Presenting Data: Practical Hints: Daniele CEI, Giorgio MATTEI
53 pages
Statistics - The Big Picture
No ratings yet
Statistics - The Big Picture
4 pages
Article-Funa and Prudente-IJI 2023
No ratings yet
Article-Funa and Prudente-IJI 2023
16 pages
Quantifying Test-Retest Reliability Using The Interalass Correlation Coefficient and The Sem
No ratings yet
Quantifying Test-Retest Reliability Using The Interalass Correlation Coefficient and The Sem
10 pages
Group Comparision
No ratings yet
Group Comparision
49 pages
Sources of Validity Evidence PDF
No ratings yet
Sources of Validity Evidence PDF
14 pages
Android 14 Assignment
No ratings yet
Android 14 Assignment
11 pages
React Interview Questions
No ratings yet
React Interview Questions
12 pages
SC&RP - Unit 5
No ratings yet
SC&RP - Unit 5
36 pages
CS Unit-1
No ratings yet
CS Unit-1
23 pages
Davies Et Al., 2014
No ratings yet
Davies Et Al., 2014
22 pages
Paulino, Octávio Moura Et Al (2025) PAI Profiles (Logistic Regression)
No ratings yet
Paulino, Octávio Moura Et Al (2025) PAI Profiles (Logistic Regression)
14 pages
Carvalho Marques Pinto Maroco 2017 Mindfulness
No ratings yet
Carvalho Marques Pinto Maroco 2017 Mindfulness
14 pages
Mohammad Abu El-Magd Research No.212 2016
No ratings yet
Mohammad Abu El-Magd Research No.212 2016
34 pages
Dwyer, D. B., Falkai, P., & Koutsouleris, N. Machine Learning Approaches For Clinical Psychology and Psychiatry
No ratings yet
Dwyer, D. B., Falkai, P., & Koutsouleris, N. Machine Learning Approaches For Clinical Psychology and Psychiatry
30 pages
Review of Hypothesis Testing and Basic Tests 1. 2-2 2. 2-15: 2-1 © 2006 A. Karpinski
No ratings yet
Review of Hypothesis Testing and Basic Tests 1. 2-2 2. 2-15: 2-1 © 2006 A. Karpinski
70 pages
Teacher Attitudes Towards The Inclusion of Students With Support Needs
No ratings yet
Teacher Attitudes Towards The Inclusion of Students With Support Needs
10 pages
Fundamentals of Datascience
No ratings yet
Fundamentals of Datascience
80 pages
Effectof Principals Technology Leadershipon Teachers Technology Integration
No ratings yet
Effectof Principals Technology Leadershipon Teachers Technology Integration
19 pages
Normalized Gain - What Is It and When and How Should I Use It
No ratings yet
Normalized Gain - What Is It and When and How Should I Use It
6 pages
Sample Size Justification
No ratings yet
Sample Size Justification
28 pages
Using Multi Sensory Approach Based Activities For DYSGRAPHIC PRIMARY
No ratings yet
Using Multi Sensory Approach Based Activities For DYSGRAPHIC PRIMARY
24 pages
2021-Sb-Psycharakis Etal
No ratings yet
2021-Sb-Psycharakis Etal
16 pages
Psychoneuroendocrinology: GH Graf, X. Li, D. Kwon, DW Belsky, CS Widom
No ratings yet
Psychoneuroendocrinology: GH Graf, X. Li, D. Kwon, DW Belsky, CS Widom
7 pages
PALUCCI VIEIRA Et Al.2019 Match Running Skill-Related and Tactics U11 To PRO Football
No ratings yet
PALUCCI VIEIRA Et Al.2019 Match Running Skill-Related and Tactics U11 To PRO Football
16 pages
The Psychology of Passion - A Meta-Analytical Review of A Decade of Research
No ratings yet
The Psychology of Passion - A Meta-Analytical Review of A Decade of Research
25 pages
Protocolo ACT para Duelos Por Rupturas Amorosas
No ratings yet
Protocolo ACT para Duelos Por Rupturas Amorosas
19 pages
Test On Variables: in Surveys, The Foolish Ask Questions, Wise Cannot Answers
No ratings yet
Test On Variables: in Surveys, The Foolish Ask Questions, Wise Cannot Answers
24 pages
Comparison of Sprint Timing Methods On.93802
No ratings yet
Comparison of Sprint Timing Methods On.93802
6 pages
Assessment of Economic Impact of The Development
No ratings yet
Assessment of Economic Impact of The Development
35 pages
AD3491 - Unit 4 - Analysis of Variance Important Questions 2 Marks With Answer - 3-9
No ratings yet
AD3491 - Unit 4 - Analysis of Variance Important Questions 2 Marks With Answer - 3-9
7 pages
Country Differences in The Relationship Between Leadership and Employee Engagement: A Meta-Analysis
No ratings yet
Country Differences in The Relationship Between Leadership and Employee Engagement: A Meta-Analysis
14 pages
XhomaraN. KarabinaM. ConferencePaper June10-122021
No ratings yet
XhomaraN. KarabinaM. ConferencePaper June10-122021
16 pages
Lewandowski 2007 PersonalityGoesALongWay PersonalRelationships
No ratings yet
Lewandowski 2007 PersonalityGoesALongWay PersonalRelationships
16 pages
Unix ZVM Android14
No ratings yet
Unix ZVM Android14
9 pages
2487 9752 2 PB
No ratings yet
2487 9752 2 PB
9 pages
HTML Interview Questions
No ratings yet
HTML Interview Questions
8 pages
Unit2 R
No ratings yet
Unit2 R
19 pages
Week 5 - Hypothesis, Effect Size, and Power: Problem Set 5.1
No ratings yet
Week 5 - Hypothesis, Effect Size, and Power: Problem Set 5.1
3 pages
Manual Therapy Versus Therapeutic Exercise in Non-Specific Chronic Neck Pain: A Randomized Controlled Trial
No ratings yet
Manual Therapy Versus Therapeutic Exercise in Non-Specific Chronic Neck Pain: A Randomized Controlled Trial
10 pages
Assessment of Non
No ratings yet
Assessment of Non
2 pages
6 Sem B.com - Income Tax-Orientaion - 2.4.24
No ratings yet
6 Sem B.com - Income Tax-Orientaion - 2.4.24
10 pages
Statistical Computing by Using R
100% (1)
Statistical Computing by Using R
11 pages
Unit 5 Winding Up of Companies
No ratings yet
Unit 5 Winding Up of Companies
3 pages
Statistics For Dummies
From Everand
Statistics For Dummies
Deborah J. Rumsey
4/5 (28)
Psychology Statistics For Dummies
From Everand
Psychology Statistics For Dummies
Martin Dempster
5/5 (1)
Introduction To Business Statistics Through R Software: Software
From Everand
Introduction To Business Statistics Through R Software: Software
Editor IJSMI
No ratings yet
Introduction To Non Parametric Methods Through R Software
From Everand
Introduction To Non Parametric Methods Through R Software
Editor IJSMI
No ratings yet
Chi Squared for Beginners
From Everand
Chi Squared for Beginners
Stephanie Glen
No ratings yet
Process Performance Models: Statistical, Probabilistic & Simulation
From Everand
Process Performance Models: Statistical, Probabilistic & Simulation
Vishnuvarthanan Moorthy
No ratings yet
Exercises of Statistical Inference
From Everand
Exercises of Statistical Inference
Simone Malacrida
No ratings yet
Random Sample Consensus: Robust Estimation in Computer Vision
From Everand
Random Sample Consensus: Robust Estimation in Computer Vision
Fouad Sabry
No ratings yet
Sample Size for Analytical Surveys, Using a Pretest-Posttest-Comparison-Group Design
From Everand
Sample Size for Analytical Surveys, Using a Pretest-Posttest-Comparison-Group Design
Joseph George Caldwell
No ratings yet

Unit4 R

Uploaded by

Unit4 R

Uploaded by

BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLGE

 Data Collection and preparation- Gather and preprocess data

Eg., Linear Regression, logistic regression, time series analysis.

Syntax: mean(x, trim = 0)

Syntax: rnorm(n, mean, sd)

Syntax:head(data_frame,no_of_rows_be_returned) #By default second argument

Step 5: Finally to visualize the sample_mean dataset we plotted a histogram ( for

# create empty vector of length n

# fill empty_vector with means

# create histogram to visualize

# To cross check find mean and sd of sample

One Sample T-Testing

# One Sample T-Test

# Two Sample T-Test

Welch Two Sample t-test

Wilcoxon Signed-Rank Test in R

# The data set

# One-sample wilcoxon test

# Printing the results

Wilcoxon signed rank exact test

Paired Samples Wilcoxon Test in R

Syntax: wilcox.test(x, y, paired = TRUE, alternative = “two.sided”)

# The data set

# Weight of the rabbit after treatment

# Create a data frame

# Paired Samples Wilcoxon Test

# Printing the results

Wilcoxon signed rank test

data: weight by group

 paired: a logical value specifying that we want to compute a paired t-test

before <- c(39,43,41,32,37,40,42,40,37,38)

# Perform the paired t-test

t.test(x=before,y=after,paired = TRUE,alternative = "greater")

# Create a data frame from the main data set.

# Create a table with the needed variables.

# Perform the Chi-Square test.

Pearson's Chi-squared test

 Objectivity:Hypothesis testing provides a structured and objective

Assumptions:Many hypothesis tests rely on assumptions about the data, If these

data: smokers out of patients

One-Proportion Z-Test in R Programming

The One proportion Z-test is used to compare an observed proportion to a

 The number of successes (male with cancer) is 95

 p = probabilities of success. It must be in the range of 0 to 1.

data: 25 and 100

Group A with lung cancer: n = 500

Group B, healthy individuals: n = 500

The Formula for Two-Proportion Z-Test

2-sample test for equality of proportions with continuity correction

Then in this case:

Example for Type 1 error in R

# Set the seed for reproducibility

# Initialize the counter for false positives

# Perform the simulations

# Generate two samples from the same normal

# Check if the p-value is less than the alpha level

# Calculate the Type I error rate

# Print the Type I error rate

Example for Type II error in R

# Calculate Type II Error

type_II_error <- 1 - pwr_result$power

# Print Type II Error

Power of a Hypothesis Test

Effect size(ES): The difference between the hypothesized value of a parameter

Alternate Hypothesis: The opposite of the null hypothesis is the alternative

One-way ANOVA: One-way When there is a single categorical independent

 Setup Null Hypothesis and Alternate Hypothesis

# Loading the package

Df Sum Sq Mean Sq F value Pr(>F)

 Df: The model’s degrees of freedom.

 Setup Null Hypothesis and Alternate Hypothesis

# Loading the package

# Variance in mean within group and between group

Calculate test statistics using aov function

Df Sum Sq Mean Sq F value Pr(>F)

You might also like