Stat Doc Pract 6,7,8
Example:
1) Find the probability of winning exactly 19 times out of 25 coin tosses.
dbinom(19, 25, 0.5)
#dbinom(x, size, prob)
0.005277991
Examples:
2) Find the probability of getting 10 or fewer heads from 25 tosses of a coin.
pbinom(10,25,0.5)
0.2121781
Examples:
3) Find the probability of getting more than 10 heads from 25 tosses of a coin.
pbinom(10,25,0.5,lower.tail=FALSE)
0.7878219
qbinom(): This function takes a probability value and returns the smallest number x such that the cumulative probability P(X <= x) of a binomial random variable X is at least that value (the quantile function of the binomial distribution).
Example:
Find the 25th quantile of a binomial distribution with 25 trials and probability of success on each trial
= 0.5
qbinom(0.25,25,0.5)
11
This says that 11 is the smallest number of successes x for which P(X <= x) >= 0.25 in a binomial experiment with 25 trials and success probability 0.5.
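To see this definition at work, we can check the cumulative probabilities on either side of 11 (a quick sketch, using only the functions introduced above):

```r
# P(X <= 10) is still below 0.25, while P(X <= 11) reaches it,
# so the 0.25 quantile is 11
pbinom(10, 25, 0.5)    # 0.2121781, below 0.25
pbinom(11, 25, 0.5)    # at or above 0.25
qbinom(0.25, 25, 0.5)  # 11
```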
qbinom(0.25,25,0.5, lower.tail=FALSE)
14
rbinom(): This function generates the required number of random values from a binomial distribution with a given number of trials and probability of success.
Example:
rbinom(10,25,0.5)
8 14 10 12 10 14 16 7 13 12
# P(X <= x): pbinom(x, size, prob)
# P(X > x): pbinom(x, size, prob, lower.tail=FALSE)
cat("probability of winning exactly 19 times out of 25 tosses",dbinom(19,25,0.5),"\n")
cat("probability of more than 10 heads from 25 tosses",pbinom(10,25,0.5,lower.tail=FALSE),"\n")
cat("binomial quantile for the probability 0.25",qbinom(0.25,25,0.5),"\n")
cat("random observations",rbinom(10,25,0.5),"\n")
OUTPUT:
probability of winning exactly 19 times out of 25 tosses 0.005277991
probability of more than 10 heads from 25 tosses 0.7878219
binomial quantile for the probability 0.25 11
random observations 8 14 10 12 10 14 16 7 13 12
3. Raju flips a fair coin 5 times. What is the probability that the coin lands on heads more than 2
times?
#find the probability of more than 2 successes during 5 trials where the
#probability of success on each trial is 0.5
pbinom(2, size=5, prob=.5, lower.tail=FALSE)
# [1] 0.5
4. Suppose a bowler scores a strike on 30% of his attempts when he bowls. If he bowls 10
times, what is the probability that he scores 4 or fewer strikes?
#find the probability of 4 or fewer successes during 10 trials where the
#probability of success on each trial is 0.3
pbinom(4, size=10, prob=0.3)
# [1] 0.8497317
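The cumulative call above can be cross-checked by summing the individual binomial probabilities with dbinom (a quick sketch, not part of the original solution):

```r
# P(X <= 4) as a sum of P(X = 0), ..., P(X = 4)
sum(dbinom(0:4, size = 10, prob = 0.3))  # 0.8497317, same as pbinom(4, 10, 0.3)
```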
Examples:
Find the 10th quantile of a binomial distribution with 10 trials and probability of success on
each trial = 0.4
qbinom(0.10, size=10, prob=0.4)
# [1] 2
Find the 40th quantile of a binomial distribution with 30 trials and probability of success on
each trial = 0.25
qbinom(0.40, size=30, prob=0.25)
# [1] 7
This says that 7 is the smallest number of successes k for which P(X <= k) >= 0.40 in a binomial experiment with 30 trials and success probability 0.25.
Generate a vector that shows the number of successes of 10 binomial experiments with 100 trials, where the probability of success on each trial is 0.3.
results <- rbinom(10, size=100, prob=.3)
results
# [1] 31 29 28 30 35 30 27 39 30 28
Normal Distribution is a probability distribution used in statistics that tells how the data values are distributed.
It is the most important probability distribution in statistics because of its advantages in real case scenarios. For example, the height of a population, shoe size, IQ level, and many more.
R provides the functions dnorm(x, mean, sd), pnorm(q, mean, sd), qnorm(p, mean, sd) and rnorm(n, mean, sd) for the normal distribution,
where,
x, q are vectors of numbers.
p is a vector of probabilities.
n is the number of random values to generate.
mean is the mean value of the sample data. Its default value is zero.
sd is the standard deviation. Its default value is one.
dnorm : The function dnorm returns the value of the probability density function (pdf) of the
normal distribution given a certain random variable x, a population mean μ and population standard
deviation σ.
Example:
#find the value of the standard normal distribution pdf at x=0
dnorm(x=0, mean=0, sd=1)
# [1] 0.3989423
#by default, R uses mean=0 and sd=1
dnorm(x=0)
# [1] 0.3989423
#find the value of the normal distribution pdf at x=10 with mean=20 and sd=5
dnorm(x=10, mean=20, sd=5)
# [1] 0.01079819
pnorm : The function pnorm returns the value of the cumulative distribution function (cdf) of the normal distribution given a certain random variable q, a population mean μ and a population standard deviation σ.
The syntax for using pnorm is as follows:
pnorm(q, mean, sd)
Example:
Suppose the height of males at a certain school is normally distributed with a mean of μ=70 inches
and a standard deviation of σ = 2 inches. Approximately what percentage of males at this school are
taller than 74 inches?
#find percentage of males that are taller than 74 inches in a population with
#mean = 70 and sd = 2
pnorm(74, mean=70, sd=2, lower.tail=FALSE)
# [1] 0.02275013
Example:
Suppose the weight of a certain species of otters is normally distributed with a mean of μ = 30 lbs and a standard deviation of σ = 5 lbs. Approximately what percentage of this species of otters weigh less than 22 lbs?
#find percentage of otters that weigh less than 22 lbs in a population with
#mean = 30 and sd = 5
pnorm(22, mean=30, sd=5)
# [1] 0.05479929
qnorm : The function qnorm returns the value of the inverse cumulative distribution function (the quantile function) of the normal distribution given a certain probability p, a population mean μ and a population standard deviation σ.
#find the Z-score of the 99th percentile of the standard normal distribution
qnorm(.99, mean=0, sd=1)
# [1] 2.326348
#find the Z-score of the 95th percentile of the standard normal distribution
qnorm(.95)
# [1] 1.644854
rnorm : The function rnorm generates a vector of normally distributed random variables given a vector length n, a population mean μ and a population standard deviation σ. The syntax for using rnorm is as follows:
rnorm(n, mean, sd)
#generate a vector of 1000 normally distributed random variables with mean=50 and sd=15
narrowDistribution <- rnorm(1000, mean = 50, sd = 15)
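As a quick sanity check on the generated vector, its sample mean and standard deviation should land close to the requested parameters (a sketch; the set.seed call is assumed here only to make the check reproducible and is not part of the original code):

```r
set.seed(1)  # assumed seed, only so the check is reproducible
narrowDistribution <- rnorm(1000, mean = 50, sd = 15)

mean(narrowDistribution)  # close to 50
sd(narrowDistribution)    # close to 15
```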
Date:-_________
Sign Test
A sign test is used to decide whether a binomial distribution has equal chances of success and failure.
Example
A soft drink company has invented a new drink, and would like to find out if it will be as popular as the existing favorite
drink. For this purpose, its research department arranges 18 participants for taste testing. Each participant tries both
drinks in random order before giving his or her opinion.
Problem
It turns out that 5 of the participants like the new drink better, and the rest prefer the old one. At
.05 significance level, can we reject the notion that the two drinks are equally popular?
Solution
The null hypothesis is that the drinks are equally popular. Here we apply the binom.test function. As the p-value turns out to be 0.09625, and is greater than the .05 significance level, we do not reject the null hypothesis.
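The call behind this result is the exact binomial test on 5 successes out of 18 trials (a minimal sketch; binom.test tests p = 0.5 by default):

```r
# exact two-sided binomial test of H0: success probability = 0.5
binom.test(5, 18)
```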
Answer
At .05 significance level, we do not reject the notion that the two drinks are equally popular.
Problem
Without assuming the data to have normal distribution, test at .05 significance level if the barley
yields of 1931 and 1932 in data set immer have identical data distributions.
Solution
The null hypothesis is that the barley yields of the two sample years are identical populations. To
test the hypothesis, we apply the wilcox.test function to compare the matched samples. For the
paired test, we set the "paired" argument as TRUE. As the p-value turns out to be 0.005318, and is
less than the .05 significance level, we reject the null hypothesis.
> wilcox.test(immer$Y1, immer$Y2, paired=TRUE)
Wilcoxon signed rank test with continuity correction
data: immer$Y1 and immer$Y2
V = 368.5, p-value = 0.005318
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(immer$Y1, immer$Y2, paired = TRUE) :
cannot compute exact p-value with ties
Answer
At .05 significance level, we conclude that the barley yields of 1931 and 1932 from the data
set immer are nonidentical populations.
Kruskal-Wallis Test
A collection of data samples are independent if they come from unrelated populations and the
samples do not affect each other. Using the Kruskal-Wallis Test, we can decide whether the
population distributions are identical without assuming them to follow the normal distribution.
Example
In the built-in data set named airquality, the daily air quality measurements in New York, May to September 1973, are recorded. The ozone densities are presented in the data frame column Ozone.
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
Problem
Without assuming the data to have normal distribution, test at .05 significance level if the
monthly ozone density in New York has identical data distributions from May to September 1973.
Solution
The null hypothesis is that the monthly ozone densities are identical populations. To test the hypothesis, we apply the kruskal.test function to compare the independent monthly data. The p-value turns out to be nearly zero (6.901e-06). Hence we reject the null hypothesis.
> kruskal.test(Ozone ~ Month, data = airquality)
Kruskal-Wallis rank sum test
data: Ozone by Month
Kruskal-Wallis chi-squared = 29.267, df = 4, p-value = 6.901e-06
Answer
At .05 significance level, we conclude that the monthly ozone density in New York from May to
September 1973 are nonidentical populations.
Chi-Square Test
The chi-square test is used to determine whether two categorical variables are significantly correlated. Here we use the Cars93 data set from the "MASS" package; part of its structure, as printed by str(Cars93), is shown below:
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
$ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
$ Length : int 177 195 180 193 186 189 200 216 198 206 ...
$ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
NULL
The above result shows the data set has many Factor variables which can be considered categorical variables. For our model we will consider the variables "AirBags" and "Type". Here we aim to find out whether there is any significant correlation between the type of car sold and the type of air bags it has. If a correlation is observed, we can estimate which types of cars sell better with which types of air bags.
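A minimal sketch of the test itself, assuming the Cars93 data set from the "MASS" package shown above:

```r
library(MASS)  # provides the Cars93 data set

# contingency table of air-bag type against car type
car.data <- table(Cars93$AirBags, Cars93$Type)
chisq.test(car.data)
```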
Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
Conclusion
The result shows a p-value of less than 0.05, which indicates a strong correlation.
Date:-________
Two-sample t-test
Welch’s t-test is used to compare the means between two independent groups when it is not
assumed that the two groups have equal variances.
To perform Welch’s t-test in R, we can use the t.test() function, which uses the following syntax:
Syntax: t.test(x, y, alternative = c("two.sided", "less", "greater"))
Example 01:
A teacher wants to compare the exam scores of 12 students who used an exam prep booklet to
prepare for some exam vs. 12 students who did not. The following vectors show the exam scores for
the students in each group:
booklet <- c(90, 85, 88, 89, 94, 91, 79, 83, 87, 88, 91, 90)
no_booklet <- c(67, 90, 71, 95, 88, 83, 72, 66, 75, 86, 93, 84)
Solution: Before we perform a Welch’s t-test, we can first create boxplots to visualize the
distribution of scores for each group:
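A minimal sketch of such boxplots (the original figure is not reproduced here):

```r
booklet <- c(90, 85, 88, 89, 94, 91, 79, 83, 87, 88, 91, 90)
no_booklet <- c(67, 90, 71, 95, 88, 83, 72, 66, 75, 86, 93, 84)

# side-by-side boxplots of the two groups
boxplot(booklet, no_booklet,
        names = c("booklet", "no booklet"),
        main = "Exam scores by group")
```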
To formally test whether or not the mean scores between the groups are significantly different,
we can perform Welch’s t-test:
t.test(booklet, no_booklet)
##
## Welch Two Sample t-test
##
## data: booklet and no_booklet
## t = 2.2361, df = 14.354, p-value = 0.04171
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3048395 13.8618272
## sample estimates:
## mean of x mean of y
## 87.91667 80.83333
From the output we can see that the t test-statistic is 2.2361 and the corresponding p-value is
0.04171.
Since this p-value is less than .05, we can reject the null hypothesis and conclude that there is a
statistically significant difference in mean exam scores between the two groups.
We want to know if the average weight of the mice differs from 25 g.
set.seed(1234)
my_data <- data.frame(
name = paste0("M_", 1:10),
weight = round(rnorm(10, 20, 2), 1)
)
head(my_data, 10)
name weight
1 M_1 17.6
2 M_2 20.6
3 M_3 22.2
4 M_4 15.3
5 M_5 20.9
6 M_6 21.0
7 M_7 18.9
8 M_8 18.9
9 M_9 18.9
10 M_10 18.2
summary(my_data$weight)
# One-sample t-test
res <- t.test(my_data$weight, mu = 25)
res
##
## One Sample t-test
##
## data: my_data$weight
## t = -9.0783, df = 9, p-value = 7.953e-06
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
## 17.81718 20.68282
## sample estimates:
## mean of x
## 19.25
The p-value of the test is 7.953e-06, which is less than the significance level alpha = 0.05. We can conclude that the mean weight of the mice is significantly different from 25 g (p-value = 7.953e-06).
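As a sketch, the t statistic behind this p-value can be reproduced by hand from its definition t = (x̄ − μ0)/(s/√n), using the weights shown above:

```r
weight <- c(17.6, 20.6, 22.2, 15.3, 20.9, 21.0, 18.9, 18.9, 18.9, 18.2)

t_stat <- (mean(weight) - 25) / (sd(weight) / sqrt(length(weight)))
t_stat  # about -9.078, matching t.test(weight, mu = 25)

# two-sided p-value from the t distribution with n - 1 = 9 degrees of freedom
2 * pt(abs(t_stat), df = length(weight) - 1, lower.tail = FALSE)  # about 7.95e-06
```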
One-sample z-test
To perform a z-test in R, we can use the z.test() function from the BSDA package, which uses the following syntax:
z.test(x, y=NULL, alternative="two.sided", mu=0, sigma.x=NULL,
sigma.y=NULL, conf.level=.95)
where:
x: values for the first sample
y: values for the second sample (if performing a two-sample z-test)
alternative: the alternative hypothesis ('greater', 'less', 'two.sided')
mu: mean under the null or mean difference (in two-sample case)
sigma.x: population standard deviation of first sample
sigma.y: population standard deviation of second sample
conf.level: confidence level to use
Example 04: Suppose the IQ in a certain population is normally distributed with a mean of μ = 100 and a standard deviation of σ = 15. A scientist wants to know if a new medication affects IQ levels, so she recruits 20 patients to use it for one month and records their IQ levels at the end of the month.
Solution: The following code shows how to perform a one-sample z-test in R to determine if the new medication causes a significant difference in IQ levels. Here the null hypothesis is that the medication has no effect (i.e., μ0 = μ = 100).
library(BSDA)
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
##
## Orange
data = c(88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
105, 109, 109, 109, 110, 112, 112, 113, 114, 115)
#perform one sample z-test
z.test(data, mu=100, sigma.x=15)
##
## One-sample z-Test
##
## data: data
## z = 0.90933, p-value = 0.3632
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
## 96.47608 109.62392
## sample estimates:
## mean of x
## 103.05
The test statistic for the one sample z-test is 0.90933 and the corresponding p-value is 0.3632.
Since this p-value is not less than .05, we do not have sufficient evidence to reject the null
hypothesis. Thus, we conclude that the new medication does not significantly affect IQ level.
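As a sketch, the z statistic and p-value can be reproduced by hand from the definition z = (x̄ − μ)/(σ/√n), without the BSDA package:

```r
data <- c(88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
          105, 109, 109, 109, 110, 112, 112, 113, 114, 115)

z <- (mean(data) - 100) / (15 / sqrt(length(data)))
z  # about 0.90933

# two-sided p-value from the standard normal distribution
2 * pnorm(abs(z), lower.tail = FALSE)  # about 0.3632
```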