R-Prog Unit-5


STATISTICS WITH R PROGRAMMING Unit - V

UNIT-V: Probability Distributions, Normal Distribution, Binomial Distribution, Poisson Distributions, Other Distributions, Basic Statistics, Correlation and Covariance, T-Tests, ANOVA.

BINOMIAL DISTRIBUTION:- The binomial distribution is a discrete probability distribution. It describes the number of successes in n independent trials of an experiment. Each trial is assumed to have only two outcomes, either success or failure. If the probability of a successful trial is p, then the probability of having x successful outcomes in an experiment of n independent trials is as follows.

$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$, x = 0, 1, 2, ..., n

R has four built-in functions for working with the binomial distribution. They are described below.
• dbinom(x, size, prob) :- This function gives the probability of exactly x successes, P(X = x), at each point.
• pbinom(q, size, prob) :- This function gives the cumulative probability P(X ≤ q) of an event.
• qbinom(p, size, prob) :- This function takes a probability value and gives the smallest number whose cumulative probability is at least that value.
• rbinom(n, size, prob) :- This function generates n random values from the binomial distribution with the given size and prob.
Following is the description of the parameters used −
• x and q are vectors of numbers (counts of successes).
• p is a vector of probabilities.
• n is the number of observations.
• size is the number of trials.
• prob is the probability of success of each trial.
Examples:
• rbinom(n=1,size=10,prob=0.4) - It generates 1 random number from the binomial distribution: the number of successes in 10 independent trials with probability 0.4.
• rbinom(n=5,size=10,prob=0.4) - It generates 5 random numbers from the binomial distribution, each the number of successes in 10 independent trials with probability 0.4.
• rbinom(n=5,size=1,prob=0.4) - Setting size to 1 turns each number into a Bernoulli random variable, which can take only the value 1 (success) or 0 (failure).
• To visualize the binomial distribution we randomly generate 10,000 experiments, each with 10 trials and success probability 0.3, and plot the counts with the ggplot2 package.
library(ggplot2)
b <- data.frame(success=rbinom(n=10000,size=10,prob=0.3))
ggplot(b,aes(x=success))+geom_bar()
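The four binomial functions described above can be checked against each other. The following is a minimal sketch; the values n = 10 and p = 0.4 and all variable names are chosen only for illustration.

```r
# Illustrative values (assumptions, not from the notes' problems)
n <- 10
p <- 0.4
x <- 0:n

# Probability of exactly x successes, from the binomial formula and from dbinom
pmf_manual <- choose(n, x) * p^x * (1 - p)^(n - x)
pmf_r <- dbinom(x, size = n, prob = p)

# pbinom is the running sum of dbinom
cdf_r <- pbinom(x, size = n, prob = p)

# qbinom inverts pbinom: smallest x with P(X <= x) >= 0.5
q_med <- qbinom(0.5, size = n, prob = p)

# rbinom draws random counts of successes; set.seed makes the draw repeatable
set.seed(1)
draws <- rbinom(5, size = n, prob = p)
```

Here pmf_manual and pmf_r should agree, and cumsum(pmf_r) should reproduce cdf_r.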

Problem: Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
Solution: This is a binomial experiment in which the number of trials is equal to 5, the number of
successes is equal to 2, and the probability of success on a single trial is 1/6 or about 0.167.
Therefore, the binomial probability is:
$b(2; 5, 0.167) = \binom{5}{2} (0.167)^2 (0.833)^3$
$b(2; 5, 0.167) = 0.161$
R Code:
> dbinom(2, size=5, prob=0.167)
[1] 0.1612

1 P .Jyothi, IT Dept , PACE ITS



Problem: In a restaurant seventy percent of people order Chinese food and thirty percent order Italian food. A group of three persons enters the restaurant. Find the probability of at least two of them ordering Italian food.
Solution:-
The probability of ordering Chinese food is 0.7 and the probability of ordering Italian food is 0.3. If at least two of them order Italian food, then either two or three of them order Italian food.

Probability for two ordering Italian food,
$P(X=2) = \binom{3}{2}(0.3)^2(0.7)^1 = 3 \times 0.09 \times 0.7 = 0.189$
Probability for all three ordering Italian food,
$P(X=3) = \binom{3}{3}(0.3)^3(0.7)^0 = 1 \times 0.027 \times 1 = 0.027$
Hence, the probability for at least two persons ordering Italian food is,
$P(X \ge 2) = P(X=2) + P(X=3) = 0.189 + 0.027 = 0.216$
R code:-
> dbinom(2,size=3,prob=0.3)+
+ dbinom(3,size=3,prob=0.3)
[1] 0.216
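An "at least" probability such as this one is an upper tail, so it can also be obtained from pbinom instead of adding individual dbinom terms. A short sketch (variable names are illustrative):

```r
# P(X >= 2) for 3 trials with success probability 0.3, computed three ways
p_direct <- dbinom(2, size = 3, prob = 0.3) + dbinom(3, size = 3, prob = 0.3)

# Upper tail: P(X > 1) is the same event as P(X >= 2) for a count
p_upper <- pbinom(1, size = 3, prob = 0.3, lower.tail = FALSE)

# Complement of the lower tail P(X <= 1)
p_compl <- 1 - pbinom(1, size = 3, prob = 0.3)

print(c(p_direct, p_upper, p_compl))
```

Note that pbinom(q, ..., lower.tail = FALSE) gives P(X > q), so the argument is 1, not 2.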

Cumulative Binomial Probability:- A cumulative binomial probability refers to the probability that the
binomial random variable falls within a specified range (e.g., is greater than or equal to a stated lower
limit and less than or equal to a stated upper limit).

Problem: What is the probability of obtaining 45 or fewer heads in 100 tosses of a coin?

Solution: To solve this problem, we compute 46 individual probabilities (for x = 0, 1, ..., 45), using the binomial formula. The sum of all these probabilities is the answer we seek.
Thus,
b(x ≤ 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + . . . + b(x = 45; 100, 0.5)
= 0.184
R code:-
> pbinom(45,size=100,prob=0.5)
[1] 0.1841008

Problem: Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.
Solution:
Since only one out of five possible answers is correct, the probability of answering a question correctly at random is 1/5 = 0.2.
• The probability of having exactly 4 correct answers by random attempts is found as follows.
> dbinom(4, size=12, prob=0.2)
[1] 0.1329
• To find the probability of having four or fewer correct answers by random attempts, we apply the function dbinom with x = 0,…,4 and add the results.
> dbinom(0, size=12, prob=0.2) + dbinom(1, size=12, prob=0.2) +
+ dbinom(2, size=12, prob=0.2) + dbinom(3, size=12, prob=0.2) +
+ dbinom(4, size=12, prob=0.2)
[1] 0.9274
• Alternatively, we can use the cumulative probability function for the binomial distribution, pbinom.
> pbinom(4, size=12, prob=0.2)
[1] 0.92744
Answer:- The probability of four or fewer questions answered correctly at random in a twelve-question multiple choice quiz is 92.7%.

Problem: Fit an appropriate binomial distribution to the following data and calculate the theoretical (expected) frequencies.
x: 0 1 2 3 4 5
f: 2 14 20 34 22 8
Solution:
Here n = 5, N = ∑ fi = 100
Mean = ∑ xi fi / ∑ fi = 284/100 = 2.84
np = 2.84
p = 2.84/5 = 0.568
q = 1 − p = 0.432

$p(r) = \binom{5}{r} (0.568)^r (0.432)^{5-r}$, r = 0, 1, 2, 3, 4, 5

The theoretical distribution and the expected frequencies are calculated as follows:
r    p(r)      N * p(r)                 Expected frequency
0    0.0150    100 * 0.0150 = 1.50      2
1    0.0989    100 * 0.0989 = 9.89      10
2    0.2601    100 * 0.2601 = 26.01     26
3    0.3420    100 * 0.3420 = 34.20     34
4    0.2248    100 * 0.2248 = 22.48     22
5    0.0591    100 * 0.0591 = 5.91      6
Total                                   100
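R code:-
The code below uses fitdist() from the fitdistrplus package. Note that it fits a negative binomial ("nbinom") distribution to the six frequency values by maximum likelihood, so its estimates are not the binomial parameters n = 5 and p = 0.568 obtained by hand above.
> library(fitdistrplus)
> x <- 0:5
> f <- c(2,14,20,34,22,8)
> df <- data.frame(x,f)
> fitbin <- fitdist(df$f,"nbinom")
> summary(fitbin)
Fitting of the distribution ' nbinom ' by maximum likelihood
Parameters :
     estimate  Std. Error
size  2.192416   1.441296
mu   16.664004   4.886713
Loglikelihood: -22.387 AIC: 48.774 BIC: 48.35752
Correlation matrix:
             size           mu
size 1.0000000000 0.0003165092
mu   0.0003165092 1.0000000000
> plot(fitbin)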
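The hand calculation in the solution (estimate p from the mean, then multiply the binomial probabilities by N) can be reproduced with base R alone. This is a sketch of that method-of-moments fit, not of what fitdist() does; the variable names are illustrative.

```r
x <- 0:5                       # number of successes
f <- c(2, 14, 20, 34, 22, 8)   # observed frequencies
N <- sum(f)                    # total frequency, 100
n <- 5                         # number of trials per observation

# Mean = sum(x * f) / N, and mean = n * p, so p = mean / n
p_hat <- sum(x * f) / (N * n)

# Theoretical probabilities and expected frequencies
prob <- dbinom(x, size = n, prob = p_hat)
expected <- N * prob

fit_table <- data.frame(x = x, observed = f,
                        p = round(prob, 4),
                        expected = round(expected, 2))
print(fit_table)
```

The expected column can then be compared with the observed frequencies, for example with a chi-square goodness-of-fit test.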

Poisson Distribution :- The Poisson distribution is the probability distribution of the number of independent event occurrences in an interval. If λ is the mean number of occurrences per interval, then the probability of having x occurrences within a given interval is:

$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$, x = 0, 1, 2, ...


Examples:
1. The number of defective electric bulbs manufactured by a reputed company.
2. The number of telephone calls per minute at a switch board
3. The number of cars passing a certain point in one minute.
4. The number of printing mistakes per page in a large text.

R has four built-in functions for working with the Poisson distribution. They are described below.
• dpois(x, lambda, log = FALSE) :- This function gives the probability of exactly x occurrences, P(X = x), at each point.
• ppois(q, lambda, lower.tail = TRUE, log.p = FALSE) :- This function gives the cumulative probability P(X ≤ q), or the upper tail P(X > q) when lower.tail = FALSE.
• qpois(p, lambda, lower.tail = TRUE, log.p = FALSE) :- This function takes a probability value and gives the smallest number whose cumulative probability is at least that value.
• rpois(n, lambda) :- This function generates n random values from the Poisson distribution with mean lambda.
Following is the description of the parameters used −
• x and q are vectors of numbers (counts of occurrences).
• p is a vector of probabilities.
• n is the number of observations.
• lambda is the mean number of occurrences per interval.

Problem:- If there are twelve cars crossing a bridge per minute on average, find the probability of having
seventeen or more cars crossing the bridge in a particular minute.
Solution:-
The probability of having sixteen or fewer cars crossing the bridge in a particular minute is given by the function ppois.
> ppois(16, lambda=12) # lower tail
[1] 0.89871
Hence the probability of having seventeen or more cars crossing the bridge in a minute is in the upper tail of the probability distribution.
> ppois(16, lambda=12, lower.tail=FALSE) # upper tail
[1] 0.10129
Answer:- If there are twelve cars crossing a bridge per minute on average, the probability of
having seventeen or more cars crossing the bridge in a particular minute is 10.1%.

Problem:- The average number of homes sold by the Acme Realty company is 2 homes per day. What is the
probability that exactly 3 homes will be sold tomorrow?
Solution: This is a Poisson experiment in which we know the following:
• μ = 2; since 2 homes are sold per day, on average.
• x = 3; since we want to find the likelihood that 3 homes will be sold tomorrow.
• e = 2.71828; since e is a constant equal to approximately 2.71828.
We plug these values into the Poisson formula as follows:
$P(x; \mu) = \frac{e^{-\mu} \mu^x}{x!}$
$P(3; 2) = \frac{(2.71828^{-2})(2^3)}{3!} = \frac{(0.13534)(8)}{6} = 0.180$
R Code:-
> dpois(3,lambda = 2)
[1] 0.180447
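As a check, the Poisson formula can be evaluated directly in R and compared with dpois. A minimal sketch (variable names are illustrative):

```r
mu <- 2   # average number of homes sold per day
x  <- 3   # number of homes we ask about

# Poisson formula: e^(-mu) * mu^x / x!
p_formula <- exp(-mu) * mu^x / factorial(x)

# Same probability from the built-in function
p_dpois <- dpois(x, lambda = mu)

print(c(p_formula, p_dpois))
```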

Cumulative Poisson Probability:- A cumulative Poisson probability refers to the probability that the
Poisson random variable is greater than some specified lower limit and less than some specified upper
limit.
Problem:-Suppose the average number of lions seen on a 1-day safari is 5.
What is the probability that tourists will see fewer than four lions on the next
1-day safari?
Solution: This is a Poisson experiment in which we know the following:
• μ = 5; since 5 lions are seen per safari, on average.
• x = 0, 1, 2, or 3; since we want to find the likelihood that tourists will see fewer than 4 lions; that is, we want the probability that they will see 0, 1, 2, or 3 lions.
• e = 2.71828; since e is a constant equal to approximately 2.71828.
To solve this problem, we need to find the probability that tourists will see 0, 1, 2, or 3 lions. Thus, we need to calculate the sum of four probabilities: P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5). To compute this sum, we use the Poisson formula:
$P(x \le 3; 5) = P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5)$
$P(x \le 3; 5) = \frac{e^{-5} 5^0}{0!} + \frac{e^{-5} 5^1}{1!} + \frac{e^{-5} 5^2}{2!} + \frac{e^{-5} 5^3}{3!}$
$P(x \le 3; 5) = \frac{(0.006738)(1)}{1} + \frac{(0.006738)(5)}{1} + \frac{(0.006738)(25)}{2} + \frac{(0.006738)(125)}{6}$
$P(x \le 3; 5) = 0.0067 + 0.03369 + 0.084224 + 0.140375$
$P(x \le 3; 5) = 0.2650$
Thus, the probability of seeing no more than 3 lions is 0.2650.
R Code:-
> ppois(3,lambda = 5)
[1] 0.2650259
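The cumulative value returned by ppois is just the sum of the individual dpois terms used in the hand calculation. A short sketch (variable names are illustrative):

```r
# Individual probabilities of seeing 0, 1, 2 or 3 lions when the mean is 5
p_terms <- dpois(0:3, lambda = 5)

# Their sum equals the cumulative probability P(X <= 3)
p_sum <- sum(p_terms)
p_cum <- ppois(3, lambda = 5)

# The complementary event: four or more lions
p_four_or_more <- ppois(3, lambda = 5, lower.tail = FALSE)

print(round(p_terms, 6))
print(c(p_sum, p_cum, p_four_or_more))
```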

Normal Distribution:- A continuous random variable X follows a normal distribution with mean μ and variance σ² if its probability density function is

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$, on the domain $-\infty < x < \infty$.
Standard Normal Distribution
It is the distribution that occurs when a normal random variable has a mean of zero and a standard deviation of one.
The normal random variable of a standard normal distribution is called a standard score or a z score.
Every normal random variable X can be transformed into a z score via the following equation:
Z = (X - μ) / σ
where X is a normal random variable, μ is the mean, and σ is the standard deviation, yielding the standard normal density
$\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$

Standard Normal Curve:- One way of figuring out how data are distributed is to plot them in a graph. If the data are distributed symmetrically around the mean, you may come up with a bell curve. A bell curve has a small percentage of the points in both tails and the bigger percentage in the inner part of the curve. The standard normal distribution has this bell shape, with the following properties:
• mean = median = mode
• symmetry about the center
• 50% of values less than the mean and 50% greater than the mean

R functions:
• dnorm(x, mean = 0, sd = 1, log = FALSE) :- This function gives the height of the probability density function at each point x.
• pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) :- This function gives the cumulative probability P(X ≤ q), or the upper tail P(X > q) when lower.tail = FALSE.
• qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) :- This function takes a probability value and gives the number whose cumulative probability matches that value.
• rnorm(n, mean = 0, sd = 1) :- This function generates n random values from the normal distribution with the given mean and sd.

Procedure to find probability using positive Z-score table
Here Area(z) is the table value, that is, the area under the standard normal curve between 0 and z (z taken as positive).

Case      Region                                                        Probability
Case 1    Area between 0 and any z score                                Area(z)
Case 2    Area in any tail                                              0.5 − Area(z)
Case 3    Area between two z-scores on the same side of the mean        |Area(z2) − Area(z1)|
Case 4    Area between two z-scores on opposite sides of the mean       Area(z1) + Area(z2)
Case 5    Area to the left of a positive Z score                        0.5 + Area(z)
Case 6    Area to the right of a negative Z score                       0.5 + Area(z)
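The six table cases above can be reproduced with pnorm. In the sketch below, area() is a hypothetical helper that mimics the positive Z-score table (area between 0 and z), and the z values 1 and 2 are chosen only for illustration.

```r
# Table value: area between 0 and z under the standard normal curve
area <- function(z) pnorm(abs(z)) - 0.5

z1 <- 1
z2 <- 2

case1 <- pnorm(z1) - pnorm(0)               # between 0 and z1
case2 <- pnorm(z1, lower.tail = FALSE)      # tail beyond z1
case3 <- pnorm(z2) - pnorm(z1)              # between z1 and z2, same side
case4 <- pnorm(z2) - pnorm(-z1)             # between -z1 and z2, opposite sides
case5 <- pnorm(z1)                          # left of a positive z
case6 <- pnorm(-z1, lower.tail = FALSE)     # right of a negative z

print(round(c(case1, case2, case3, case4, case5, case6), 4))
```

With pnorm there is no need to split the area at the mean; the cases matter only when reading a printed table.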

Problem:- X is a normally distributed variable with mean μ = 30 and standard deviation σ = 4. Find
a) P(x < 40)
b) P(x > 21)
c) P(30 < x < 35)
Solution:
a) For x = 40,
z = (x − µ) / σ
⇒ z = (40 − 30) / 4
= 2.5 (= z1 say)
Hence P(x < 40) = P(z < 2.5)
= 0.5 + A(z1) = 0.5 + 0.4938 = 0.9938
b) For x = 21,
z = (x − µ) / σ
⇒ z = (21 − 30) / 4
= −2.25 (= −z1 say)
Hence P(x > 21) = P(z > −2.25)
= 0.5 + A(z1) = 0.5 + 0.4878 = 0.9878
c) For x = 30,
z = (x − µ) / σ
⇒ z = (30 − 30) / 4 = 0 and
for x = 35,
z = (x − µ) / σ
⇒ z = (35 − 30) / 4
= 1.25
Hence P(30 < x < 35) = P(0 < z < 1.25)
= [area to the left of z = 1.25] − [area to the left of 0]
= 0.8944 − 0.5 = 0.3944
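The same three probabilities can be obtained in R without converting to z scores, because pnorm accepts the mean and standard deviation directly. A minimal sketch (variable names are illustrative):

```r
mu <- 30
sigma <- 4

# a) P(x < 40): lower tail
p_a <- pnorm(40, mean = mu, sd = sigma)

# b) P(x > 21): upper tail
p_b <- pnorm(21, mean = mu, sd = sigma, lower.tail = FALSE)

# c) P(30 < x < 35): difference of two cumulative probabilities
p_c <- pnorm(35, mean = mu, sd = sigma) - pnorm(30, mean = mu, sd = sigma)

print(round(c(a = p_a, b = p_b, c = p_c), 4))
```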

Problem:- The length of life of an instrument produced by a machine has a normal distribution with a mean of 12 months and standard deviation of 2 months. Find the probability that an instrument produced by this machine will last
a) less than 7 months.
b) between 7 and 12 months.
Solution:
a) P(x < 7)
For x = 7,
z = (x − µ) / σ
⇒ z = (7 − 12) / 2
= −2.5 (= −z1 say)
Hence P(x < 7) = P(z < −2.5)
= 0.5 − A(z1) = 0.5 − 0.4938 = 0.0062
b) P(7 < x < 12)
For x = 12,
z = (x − µ) / σ
⇒ z = (12 − 12) / 2
= 0
Hence P(7 < x < 12) = P(−2.5 < z < 0)
= A(2.5) = 0.4938

Problem:- The Tahoe Natural Coffee Shop morning customer load follows a normal distribution with mean 45 and standard deviation 8. Determine the probability that the number of customers tomorrow will be less than 42.

Solution:-
We first convert the raw score to a z-score. We have
z = (x − µ) / σ
⇒ z = (42 − 45) / 8 = −0.375
Next, we use the table to find the probability. The table gives 0.3520. (We have rounded the z-score to −0.38.)
We can conclude that
P(x < 42) = P(z < −0.38)
= 0.352
That is, there is about a 35% chance that there will be fewer than 42 customers tomorrow.
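The two table-based problems above (instrument life and coffee shop customers) can be checked with pnorm. Because pnorm does not round the z-score to two decimals, the coffee shop answer comes out slightly different from the table value. Variable names are illustrative.

```r
# Instrument life: mean 12 months, sd 2 months
p_less_7  <- pnorm(7, mean = 12, sd = 2)                        # P(x < 7)
p_7_to_12 <- pnorm(12, mean = 12, sd = 2) - pnorm(7, mean = 12, sd = 2)

# Coffee shop: mean 45 customers, sd 8
p_less_42 <- pnorm(42, mean = 45, sd = 8)                       # exact z = -0.375
p_table   <- pnorm(-0.38)                                       # z rounded as in the table

print(round(c(p_less_7, p_7_to_12, p_less_42, p_table), 4))
```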

Example: Histogram of a sample with the fitted normal curve overlaid.
> x <- c(92,117,109,85,117,107,82,83,119,113,101,106,101,84,126,69,82,79,84,100,104,111,109,92,93,107,
81,118,81,133,111,82,120,103,115,89,74,110,83,110,96,102,108,110,140,106,111,98,98,99,74,101,107,104,
128,87,95,109,104,91,83,98,99,103,126,123,85,98,93,100)
> # histogram of counts; h stores the bin midpoints
> h <- hist(x, col = "blue")
> m <- mean(x)
> s <- sd(x)
> xf <- seq(min(x), max(x), length = 70)
> dis <- dnorm(xf, m, s)
> # rescale the density to counts: bin width * number of observations
> dis <- dis * diff(h$mids[1:2]) * length(x)
> lines(xf, dis, col = "red", lwd = 3)

Problem:- Assume that the test scores of a college entrance exam fit a normal distribution. Furthermore, the mean test score is 72, and the standard deviation is 15.2. What is the percentage of students scoring 84 or more in the exam?
Solution:-
We apply the function pnorm of the normal distribution with mean 72 and standard deviation
15.2. Since we are looking for the percentage of students scoring higher than 84, we are interested in
the upper tail of the normal distribution.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492

Correlation:- A correlation is a relationship between two variables. Typically, we take x to be the independent variable. We take y to be the dependent variable. Data is represented by a collection of ordered pairs (x,y).

The strength of the linear relationship is measured by the correlation coefficient r:
$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
This will always be a number between -1 and 1 (inclusive).
• If r is close to 1, we say that the variables are positively correlated. This means there is likely a strong linear relationship between the two variables, with a positive slope.
• If r is close to -1, we say that the variables are negatively correlated. This means there is likely a strong linear relationship between the two variables, with a negative slope.
• If r is close to 0, we say that the variables are not correlated. This means that there is likely no linear relationship between the two variables; however, the variables may still be related in some other way.
To run a correlation test we type:
> cor.test(var1, var2, method = "method")
The default method is "pearson" so you may omit this if that is what you want. If you type "kendall" or "spearman" then you will get the appropriate significance test.


Problem:- The local ice cream shop keeps track of how much ice cream they sell versus the temperature on that day. Here are their figures for the last 12 days:

Temperature (°C): 14.2 16.4 11.9 15.2 18.5 22.1 19.4 25.1 23.4 18.1 22.6 17.2
Ice cream sales ($): 215 325 185 332 406 522 412 614 544 421 445 408

Solution:-

Formula for correlation coefficient:
$r = \frac{\mathrm{cov}(x, y)}{s_x s_y}$, where $\mathrm{cov}(x, y) = \frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})$ and $s_x$, $s_y$ are the sample standard deviations.

R Code:-
> temp <- c(14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2)
> sales <- c(215,325,185,332,406,522,412,614,544,421,445,408)
> corr_coeff <- cor(temp,sales)
> corr_coeff
[1] 0.9575066
> cov(temp,sales)
[1] 484.0932
> # Scatter plot with a line of best fit added
> plot(temp, sales, pch=16, col="red")
> abline(lm(sales~temp), col="blue")
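The correlation coefficient for the ice cream data can also be built up from the covariance and the standard deviations, and cor.test adds a significance test. A short sketch (the names r_manual and ct are illustrative):

```r
temp  <- c(14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2)
sales <- c(215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408)

# r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(temp, sales) / (sd(temp) * sd(sales))

# Built-in Pearson correlation
r_builtin <- cor(temp, sales)

# Significance test of H0: true correlation is 0
ct <- cor.test(temp, sales, method = "pearson")

print(c(r_manual, r_builtin))
print(ct$p.value)
```

A small p-value here indicates that a correlation this strong would be unlikely if temperature and sales were unrelated.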


T-test for single mean:- One-sample t-test is used to compare the mean of a population to a specified theoretical mean (μ0).
Let X represent a sample of n values with sample mean x̄ and sample standard deviation s.
The comparison of the observed mean x̄ to the theoretical value μ0 is performed with the formula below:
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
To evaluate whether the difference is statistically significant, you first have to read in the t-test table the critical value of Student's t distribution corresponding to the significance level alpha of your choice (for example 5%). The degrees of freedom (df) used in this test are: df = n−1

Problem:- A professor wants to know if her introductory statistics class has a good grasp of basic math. Six students are chosen at random from the class and given a math proficiency test. The professor wants the class to be able to score above 70 on the test. The six students get scores of 62, 92, 75, 68, 83, and 95. Can the professor have 90 percent confidence that the mean score for the class on the test would be above 70?
Solution:-
• Null hypothesis H0: μ = 70
• Alternative hypothesis Ha: μ > 70
• Level of significance: α = 0.10 (90 percent confidence), one-tailed test
First, compute the sample mean and standard deviation:
$\bar{x} = \frac{62 + 92 + 75 + 68 + 83 + 95}{6} = \frac{475}{6} = 79.17$
$s = 13.17$
The test statistic is
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{79.17 - 70}{13.17 / \sqrt{6}} = \frac{9.17}{5.38} = 1.71$ (calculated value of t)
To test the hypothesis, the computed t-value of 1.71 is compared to the critical value in the t-table. For a one-tailed test at α = 0.10 with 5 df the table value is 1.476. The calculated value of t is more than the table value of t, so the null hypothesis is rejected: the professor can have 90 percent confidence that the class mean is above 70.
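R code:-
> x <- c(62,92,75,68,83,95)
> t.test(x,alternative="two.sided",mu=70)
One Sample t-test
data: x
t = 1.7053, df = 5, p-value = 0.1489
alternative hypothesis: true mean is not equal to 70
95 percent confidence interval:
65.34888 92.98446
sample estimates:
mean of x
79.16667
The output above is for the two-sided test. Since t is positive, the one-sided p-value for Ha: μ > 70 is half of the two-sided value, 0.1489/2 = 0.0745, which is below α = 0.10.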
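Since the professor's question is one-sided (is the mean above 70?) and asks for 90 percent confidence, the test can also be run directly with alternative = "greater" and conf.level = 0.90. This is a sketch of that variant; the names res and t_crit are illustrative.

```r
x <- c(62, 92, 75, 68, 83, 95)

# One-sided, one-sample t-test of H0: mu = 70 against Ha: mu > 70
res <- t.test(x, alternative = "greater", mu = 70, conf.level = 0.90)

# One-tailed critical value at alpha = 0.10 with n - 1 = 5 degrees of freedom
t_crit <- qt(0.90, df = length(x) - 1)

print(res)
print(t_crit)
```

H0 is rejected when res$statistic exceeds t_crit, or equivalently when res$p.value is below 0.10.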

Problem:- A sample of 26 bulbs gives a mean life of 990 hours with S.D. of 20 hours. The manufacturer claims that the mean life of bulbs is 1000 hours. Does the sample meet the standard?
Solution: Here n = 26,
Sample mean x̄ = 990 hours
S.D. s = 20 hours
Population mean µ = 1000 hours
df = n − 1 = 26 − 1 = 25
• Null Hypothesis H0: The sample meets the standard, i.e. µ = 1000 hours
• Alternative Hypothesis HA: µ ≠ 1000 hours
• Level of significance: α = 0.05
• The test statistic is
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{990 - 1000}{20 / \sqrt{26}} = -2.55$, so |t| = 2.55 (calculated value of t)
The two-tailed table value of t with 25 df at the 5% level is 2.060 (the one-tailed value is 1.708).
The calculated value of |t| is more than the table value of t, so the null hypothesis is rejected at the 5% level.
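When only summary statistics are available, as in the bulb problem, t.test cannot be used because it needs the raw data, but the statistic, critical value and p-value can be computed directly. A minimal sketch using the formula t = (x̄ − μ0)/(s/√n); the variable names are illustrative.

```r
n    <- 26     # sample size
xbar <- 990    # sample mean (hours)
s    <- 20     # sample standard deviation (hours)
mu0  <- 1000   # claimed population mean (hours)

# Test statistic
t_stat <- (xbar - mu0) / (s / sqrt(n))

# Two-tailed critical value at the 5% level with n - 1 degrees of freedom
t_crit <- qt(1 - 0.05 / 2, df = n - 1)

# Two-tailed p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

print(c(t = t_stat, critical = t_crit, p = p_value))
```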

Paired comparisons (Paired t-test):- Sometimes data come from non-independent samples. An example might be testing "before and after" of cosmetics or consumer products. We could use a single random sample and do "before and after" tests on each person. A hypothesis test based on these data would be called a paired comparisons test. Since the observations come in pairs, we can study the difference, d, between the samples. The difference between each pair of measurements is called di.

Test statistic:- With a sample of n pairs of measurements, forming a simple random sample from a normally distributed population, the mean of the differences, d̄, is tested using the following implementation of t:
$t = \frac{\bar{d} - \mu_d}{S / \sqrt{n}}$
where μd is the hypothesized mean difference (0 under the null hypothesis) and S is the standard deviation of the differences.

Problem :- The blood pressure of 5 women before and after intake of a certain drug is given below. Test whether there is a significant change in blood pressure at the 1% level of significance.
Before: 110 120 125 132 125
After: 120 118 125 136 121

Solution: Let µ be the mean of the population of differences.
• Null Hypothesis H0: µ1 = µ2, i.e. no change in B.P.
• Alternative Hypothesis HA: µ1 ≠ µ2, i.e. there is a change in B.P.
• Level of significance: α = 0.01
• Computation: The differences di (before minus after) are −10, 2, 0, −4, 4
$\bar{d} = \frac{-10 + 2 + 0 - 4 + 4}{5} = \frac{-8}{5} = -1.6$
$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2 = \frac{1}{4} \sum_{i=1}^{5} (d_i - \bar{d})^2$
$= \frac{1}{4} [(-10 + 1.6)^2 + (2 + 1.6)^2 + (0 + 1.6)^2 + (-4 + 1.6)^2 + (4 + 1.6)^2] = \frac{123.20}{4} = 30.8$
$S = \sqrt{30.8} = 5.55$
• Test statistic: The test statistic is t, which is calculated as
$t = \frac{\bar{d} - \mu}{S / \sqrt{n}} = \frac{-1.6 - 0}{5.55 / \sqrt{5}} = -0.645$
Calculated |t| value is 0.645.
The tabulated two-tailed t0.01 with 5 − 1 = 4 degrees of freedom is 4.604.
Since calculated |t| < t0.01, we accept the null hypothesis and conclude that there is no significant change in blood pressure.
R code:-
> x <- c(110,120,125,132,125)
> y <- c(120,118,125,136,121)
> t.test(x,y,paired=TRUE)
Paired t-test
data: x and y
t = -0.64466, df = 4,
p-value = 0.5543
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.490956 5.290956
sample estimates:
mean of the differences
-1.6

T-test for difference of two population means :-


With a two-sample t-test, we compare the population means to each other and again look at the difference. We expect that x̄ − ȳ would be close to μ1 − μ2. The test statistic will use both sample means, sample standard deviations, and sample sizes for the test.
A two-sample t-test follows these steps:
• Write the null and alternative hypotheses.
• State the level of significance and find the critical value. The critical value, from the Student's t-distribution, has n1 + n2 − 2 degrees of freedom when the pooled standard deviation S below is used.
• Compute the test statistic.
• Compare the test statistic to the critical value and state a conclusion.

$t = \frac{\bar{x} - \bar{y}}{S \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$
where
$S^2 = \frac{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2}{n_1 + n_2 - 2}$ or $S^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}$
(in the second form, $s_1^2$ and $s_2^2$ are the sample variances computed with divisors n1 and n2).

Problem:- Two horses A and B were tested according to the time (in seconds) to run a particular track with the following results.
Horse A: 28 30 32 33 33 29 34
Horse B: 29 30 30 24 27 29
Test whether the two horses have the same running capacity.

Solution:- Given n1 = 7 and n2 = 6
We first compute the sample means and the sums of squared deviations.
$\bar{x}$ = Mean of the first sample $= \frac{1}{7}(28 + 30 + 32 + 33 + 33 + 29 + 34) = \frac{1}{7}(219) = 31.286$
$\bar{y}$ = Mean of the second sample $= \frac{1}{6}(29 + 30 + 30 + 24 + 27 + 29) = \frac{1}{6}(169) = 28.16$

x      x − x̄     (x − x̄)²    y      y − ȳ     (y − ȳ)²
28     -3.286    10.8        29     0.84      0.7056
30     -1.286    1.6538      30     1.84      3.3856
32     0.714     0.51        30     1.84      3.3856
33     1.714     2.94        24     -4.16     17.3056
33     1.714     2.94        27     -1.16     1.3456
29     -2.286    5.226       29     0.84      0.7056
34     2.714     7.366
219              31.4358     169              26.8336

Now, $S^2 = \frac{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2}{n_1 + n_2 - 2} = \frac{31.4358 + 26.8336}{7 + 6 - 2} = \frac{58.2694}{11} = 5.30$
Therefore $S = \sqrt{5.30} = 2.3$

• Null Hypothesis H0: µ1 = µ2
• Alternative Hypothesis HA: µ1 ≠ µ2
• Level of significance: α = 0.05
• Computation:
$t = \frac{\bar{x} - \bar{y}}{S \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{31.286 - 28.16}{(2.3) \sqrt{\frac{1}{7} + \frac{1}{6}}} = 2.443$
The tabulated t0.05 with 7 + 6 − 2 = 11 degrees of freedom at the 5% level of significance is 2.2.
Since calculated t > t0.05, we reject the null hypothesis and conclude that the two horses do not have the same running capacity.
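The pooled two-sample test for the two horses can be run in R with t.test and var.equal = TRUE, which uses the pooled variance with n1 + n2 − 2 degrees of freedom. Because R carries full precision, its t value differs slightly from the hand calculation with rounded intermediate numbers. The names A, B and res are illustrative.

```r
A <- c(28, 30, 32, 33, 33, 29, 34)   # times for horse A (seconds)
B <- c(29, 30, 30, 24, 27, 29)       # times for horse B (seconds)

# Pooled variance: sums of squared deviations over n1 + n2 - 2
S2 <- (sum((A - mean(A))^2) + sum((B - mean(B))^2)) / (length(A) + length(B) - 2)

# Two-sample t-test assuming equal variances (pooled S)
res <- t.test(A, B, var.equal = TRUE)

# Two-tailed critical value at the 5% level
t_crit <- qt(0.975, df = length(A) + length(B) - 2)

print(res)
print(c(S2 = S2, critical = t_crit))
```

Without var.equal = TRUE, t.test runs the Welch test, which does not pool the variances and uses a different number of degrees of freedom.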

ANOVA:- (ANALYSIS OF VARIANCE)
When we have only two samples we can use the t-test to compare the means of the samples, but with more than two samples repeated t-tests become unreliable, because the chance of a false positive grows with the number of comparisons. If we only compare two means, then the t-test (independent samples, equal variances) gives the same result as the ANOVA. ANOVA is performed with the F-test.

Null hypothesis H0: There are no differences among the mean values of the groups being compared (i.e., the group means are all equal).
H0: µ1 = µ2 = µ3 = … = µk
Alternative hypothesis H1 (the conclusion if H0 is rejected): Not all group means are equal (i.e., at least one group mean is different from the rest).

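ANOVA one-way classification:-
Let $X_{ij}$ be the i-th observation in the j-th treatment (sample), $T_j$ the total of the j-th treatment, $n_j$ the number of observations in it, N the total number of observations and k the number of treatments.

Step 1: Grand total of all observations
$T = \sum_i \sum_j X_{ij}$
Step 2: Correction factor
$cf = \frac{T^2}{N} = \frac{T^2}{rs}$ (N = rs when the data are arranged in r rows and s columns)
Step 3: Total sum of squares
$TSS = S_T^2 = \sum_i \sum_j X_{ij}^2 - cf$
Step 4: Treatment sum of squares
$TrSS = S_{Tr}^2 = \sum_j \frac{T_j^2}{n_j} - cf$
Step 5: Error sum of squares
$ESS = S_E^2 = TSS - TrSS$

Source of variation            d.f.     Sum of squares      Mean sum of squares       F-test
Treatment (between samples)    k − 1    TrSS                MS_Tr = TrSS / (k − 1)    F_cal = MS_Tr / MS_E
Error                          N − k    ESS = TSS − TrSS    MS_E = ESS / (N − k)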
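The five steps of the one-way ANOVA calculation can be followed in R and compared with the built-in aov function. The data below are hypothetical (three treatments with four observations each) and are used only to illustrate the arithmetic; all variable names are illustrative.

```r
# Hypothetical data: three treatments, four observations each
values <- c(8, 10, 12, 10,  14, 16, 15, 15,  9, 11, 10, 10)
group  <- factor(rep(c("A", "B", "C"), each = 4))

N <- length(values)   # total number of observations
k <- nlevels(group)   # number of treatments

T_total <- sum(values)                 # Step 1: grand total
cf   <- T_total^2 / N                  # Step 2: correction factor
TSS  <- sum(values^2) - cf             # Step 3: total sum of squares
Tj   <- tapply(values, group, sum)     # treatment totals
nj   <- tapply(values, group, length)  # observations per treatment
TrSS <- sum(Tj^2 / nj) - cf            # Step 4: treatment sum of squares
ESS  <- TSS - TrSS                     # Step 5: error sum of squares

# F statistic from the mean sums of squares
F_cal <- (TrSS / (k - 1)) / (ESS / (N - k))

# The same table from R's aov function
fit <- summary(aov(values ~ group))[[1]]
print(fit)
print(c(TrSS = TrSS, ESS = ESS, F = F_cal))
```

The Sum Sq and F value columns printed by summary(aov(...)) should match TrSS, ESS and F_cal.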


STATISTICS WITH R PROGRAMMING Unit - VI

Unit-6:- Linear Models, Simple Linear Regression, -Multiple Regression Generalized Linear
Models,Logistic Regression, - Poisson Regression- other Generalized Linear Models-Survival
Analysis,Nonlinear Models - Splines, Decision, Random Forests.
Regression:- Regression analysis is a very widely used statistical tool to establish a relationship model between two variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The other variable is called the response variable, whose value is derived from the predictor variable.

Linear Regression:- In Linear Regression these two variables are related through an equation, where the exponent (power) of both these variables is 1. Mathematically a linear relationship represents a straight line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is − y = ax + b
Following is the description of the parameters used −
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients.
lm() Function:- This function creates the relationship model between the predictor and the response variable.
Syntax: lm(formula,data)
Following is the description of the parameters used −
• formula is a symbolic description of the relation between x and y (for example y ~ x).
• data is the data frame containing the variables named in the formula.

Example:-
> height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
> weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
> relation <- lm(weight~height)
> print(relation)
Call:
lm(formula = weight ~ height)
Coefficients:
(Intercept) height
-38.4551 0.6746
> plot(height,weight,col = "blue",main = "Height & Weight Regression")
> abline(relation,col="orange")
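Once the model is fitted, it can be used to predict the response for a new predictor value with predict(). The sketch below predicts the weight for a height of 170, a value chosen only for illustration, and checks the result against the equation y = ax + b built from the fitted coefficients.

```r
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Fit weight as a linear function of height
relation <- lm(weight ~ height)

# New data must use the same variable name as the predictor in the formula
new_person <- data.frame(height = 170)
predicted <- predict(relation, new_person)

# Same prediction from the coefficients: intercept + slope * height
b <- coef(relation)[["(Intercept)"]]
a <- coef(relation)[["height"]]
by_hand <- b + a * 170

print(c(predicted = unname(predicted), by_hand = by_hand))
```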

Advantages / Limitations of Linear Regression Model :
• Linear regression gives its best results when the relationships between the independent variables and the dependent variable are almost linear.
• Linear regression is often inappropriately used to model non-linear relationships.
• Linear regression is limited to predicting numeric output.
• A lack of explanation about what has been learned can be a problem.
Regression equation of x on y
$x - \bar{x} = r \frac{\sigma_x}{\sigma_y} (y - \bar{y})$
Regression equation of y on x
$y - \bar{y} = r \frac{\sigma_y}{\sigma_x} (x - \bar{x})$
where
$r = \frac{\frac{1}{n} \sum xy - \bar{x}\bar{y}}{\sqrt{\frac{1}{n} \sum (x - \bar{x})^2} \sqrt{\frac{1}{n} \sum (y - \bar{y})^2}}$

Multiple Regression :- Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable.
The general mathematical equation for multiple regression (with two predictors) is – x1 = a0 + a1x2 + a2x3
Following is the description of the parameters used −
• x1 is the response variable.
• a0, a1, a2, ..., an are the coefficients.
• x2, x3, ... are the predictor variables.
The normal equations for estimating a0, a1 and a2 are
$\sum x_1 = n a_0 + a_1 \sum x_2 + a_2 \sum x_3$
$\sum x_1 x_2 = a_0 \sum x_2 + a_1 \sum x_2^2 + a_2 \sum x_2 x_3$
$\sum x_1 x_3 = a_0 \sum x_3 + a_1 \sum x_2 x_3 + a_2 \sum x_3^2$
We create the regression model using the lm() function in R. The model determines the value of the
coefficients using the input data. Next we can predict the value of the response variable for a given set of
predictor variables using these coefficients.

lm() Function :- This function creates the relationship model between the predictors and the response variable.
Syntax :- lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used −
• formula is a symbolic description of the relation between the response variable and the predictor variables.
• data is the data frame containing the variables named in the formula.
Example
> lm(mpg~disp+hp+wt,data=mtcars)

Call:
lm(formula = mpg ~ disp + hp + wt, data = mtcars)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
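The coefficients reported by lm() are the solution of the normal equations. In matrix form the normal equations are (X'X) a = X'y, where X has a column of ones followed by the predictor columns. The sketch below solves them for the mtcars model above; the names X, y and a_hat are illustrative.

```r
# Design matrix: a column of ones (intercept) plus the three predictors
X <- cbind(1, mtcars$disp, mtcars$hp, mtcars$wt)
y <- mtcars$mpg

# Normal equations (X'X) a = X'y, solved for the coefficient vector a
a_hat <- solve(crossprod(X), crossprod(X, y))

# Coefficients from lm() for comparison
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)

print(cbind(normal_equations = as.numeric(a_hat), lm = unname(coef(fit))))
```

lm() itself uses a QR decomposition rather than forming X'X, which is numerically more stable, but for this data set both routes give the same coefficients.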

Create Equation for Regression Model
Based on the above intercept and coefficient values, we create the mathematical equation.
Y = a + b1*x1 + b2*x2 + b3*x3, where x1 = disp, x2 = hp and x3 = wt
or
Y = 37.1055 + (-0.000937)*x1 + (-0.0312)*x2 + (-3.8009)*x3

Logistic Regression : The Logistic Regression is a regression model in which the response variable
(dependent variable) has categorical values such as True/False or 0/1. It actually measures the
probability of a binary response as the value of response variable based on the mathematical equation
relating it with the predictor variables.
The general mathematical equation for logistic regression is −
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
 y is the response variable.
 x is the predictor variable.
 a and b are the coefficients which are numeric constants.


The function used to create the logistic regression model is the glm() function.
Syntax :- glm(formula,data,family)
Following is the description of the parameters used −
• formula is a symbolic description of the relationship between the variables.
• data is the data set giving the values of these variables.
• family is an R object that specifies the details of the model. Its value is binomial for logistic regression.
For example, in the built-in data set mtcars, the data column am represents the transmission type of
the automobile model (0 = automatic, 1 = manual). With the logistic regression equation, we can model
the probability of a manual transmission in a vehicle based on its engine horsepower and weight data.
> am.glm = glm(formula=am ~ hp + wt, data=mtcars, family=binomial)
> am.glm
Call:  glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Coefficients:
(Intercept)           hp           wt
   18.86630      0.03626     -8.08348

Degrees of Freedom: 31 Total (i.e. Null);  29 Residual
Null Deviance:     43.23
Residual Deviance: 10.06    AIC: 16.06
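Once the model is fitted, predict() can turn it into a probability for a particular vehicle. The sketch below assumes a hypothetical car with 120 hp and a weight of 2.8 (wt in mtcars is measured in units of 1000 lbs); the names newcar and p_manual are illustrative.

```r
# Refit the logistic model from the example above
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Hypothetical vehicle: 120 hp, weight 2.8 (1000 lbs)
newcar <- data.frame(hp = 120, wt = 2.8)

# type = "response" returns a probability instead of the log-odds
p_manual <- predict(am.glm, newdata = newcar, type = "response")
print(p_manual)
```

Without type = "response", predict() returns the linear predictor a + b1x1 + b2x2, which still has to be passed through the logistic function.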

Poisson Regression:- Poisson regression involves regression models in which the response variable is a count rather than a fractional number, for example the number of births or the number of wins in a football match series. The values of the response variable are assumed to follow a Poisson distribution.
The general mathematical equation for Poisson regression is −
log(y) = a + b1x1 + b2x2 + ... + bnxn
Following is the description of the parameters used −
• y is the response variable.
• a and b1, ..., bn are the numeric coefficients.
• x1, ..., xn are the predictor variables.
The function used to create the Poisson regression model is the glm() function.
We have the in-built data set "warpbreaks" which describes the effect of wool type (A or B) and tension
(low, medium or high) on the number of warp breaks per loom. Let's consider "breaks" as the response
variable which is a count of number of breaks. The wool "type" and "tension" are taken as predictor
variables.
> output <-glm(formula = breaks ~ wool+tension,data = warpbreaks,
+ family = poisson)
> print(summary(output))
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6871  -1.6503  -0.4269   1.1902   4.2616

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 297.37  on 53  degrees of freedom
Residual deviance: 210.39  on 50  degrees of freedom
AIC: 493.06

Number of Fisher Scoring iterations: 4
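Because the model is linear in log(y), the coefficients are easier to read after exponentiating them. The following sketch shows one way to do that and to predict an expected count; the names rate_ratios, newloom and expected_breaks are illustrative and not part of the notes.

```r
# Refit the Poisson model from the example above
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# exp(coefficient) is a multiplicative effect on the expected count,
# relative to the baseline levels (wool A, tension L)
rate_ratios <- exp(coef(output))
print(round(rate_ratios, 3))

# Expected number of breaks for wool B at high tension
newloom <- data.frame(
  wool    = factor("B", levels = levels(warpbreaks$wool)),
  tension = factor("H", levels = levels(warpbreaks$tension))
)
expected_breaks <- predict(output, newdata = newloom, type = "response")
print(expected_breaks)
```

A value of exp(coefficient) below 1 means that level lowers the expected number of breaks compared with the baseline.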


CURVE FITTING:- Curve fitting is the process of constructing a curve, or mathematical function, that
has the best fit to a series of data points, possibly subject to constraints.

Type of curve, its equation, and the corresponding normal equations:

Fitting of a straight line: y = a + bx
Σy = na + bΣx
Σxy = aΣx + bΣx^2

Fitting of a second degree polynomial: y = a + bx + cx^2
Σy = na + bΣx + cΣx^2
Σxy = aΣx + bΣx^2 + cΣx^3
Σx^2y = aΣx^2 + bΣx^3 + cΣx^4

Exponential curve: y = a.b^x
Apply log on both sides
log y = log a + x log b
Y = A + Bx
ΣY = nA + BΣx
ΣxY = AΣx + BΣx^2
where Y = log y, A = log a and B = log b

Exponential curve: y = a.e^(bx)
Apply log on both sides
log y = log a + bx log e
Y = A + Bx
ΣY = nA + BΣx
ΣxY = AΣx + BΣx^2
where Y = log y, A = log a and B = b log e (with natural logarithms, log e = 1 and B = b)

Power curve: y = a.x^b
Apply log on both sides
log y = log a + b log x
Y = A + bX
ΣY = nA + bΣX
ΣXY = AΣX + bΣX^2
where Y = log y, X = log x and A = log a
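For the curve y = a.x^b, the normal equations in Y = log y and X = log x can be solved in matrix form with solve(). The data below are generated exactly from y = 2x^1.5 and are purely illustrative (they do not come from the notes), so the fit should recover a = 2 and b = 1.5.

```r
# Illustrative data generated exactly from y = 2 * x^1.5 (not from the notes)
x <- c(1, 2, 3, 4, 5)
y <- 2 * x^1.5

# Linearize: Y = A + b*X with Y = log(y), X = log(x), A = log(a)
X <- log(x)
Y <- log(y)
n <- length(x)

# Normal equations in matrix form:
#   [ n       sum(X)   ] [A]   [ sum(Y)   ]
#   [ sum(X)  sum(X^2) ] [b] = [ sum(X*Y) ]
lhs <- matrix(c(n, sum(X), sum(X), sum(X^2)), nrow = 2)
rhs <- c(sum(Y), sum(X * Y))
sol <- solve(lhs, rhs)

a <- exp(sol[1])  # back-transform A = log(a)
b <- sol[2]
print(c(a = a, b = b))
```

Natural logarithms are used here; any base works as long as the same base is used for the back-transformation of A.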

Problem:- Fit a straight line to the following data


x 0 1 2 3 4
y 1 1.8 3.3 4.5 6.3
Solution:-
The straight line is y = a + bx.
The two normal equations are
Σy = na + bΣx
Σxy = aΣx + bΣx^2

x    x^2    y      xy
0    0      1      0
1    1      1.8    1.8
2    4      3.3    6.6
3    9      4.5    13.5
4    16     6.3    25.2

Σx = 10, Σx^2 = 30, Σy = 16.9, Σxy = 47.1
Substituting the values with n = 5, we get
5a + 10b = 16.9 .......... (1)
10a + 30b = 47.1 .......... (2)
Solving (1) and (2):
Multiply eq (1) by 2
10a + 20b = 33.8 .......... (3)
Subtract (3) from (2)
10a + 30b = 47.1
(-) 10a + 20b = 33.8
10b = 13.3
Therefore b = 1.33. Substituting in (1) gives 5a + 13.3 = 16.9, so a = 0.72.
Thus the equation of the straight line y = a + bx is
y = 0.72 + 1.33x
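The normal equations for this data can also be solved numerically. The sketch below builds the two equations from the sums and solves them with solve(), then runs lm(), which performs the same least-squares fit; the variable names are illustrative.

```r
# Data from the straight-line problem
x <- c(0, 1, 2, 3, 4)
y <- c(1, 1.8, 3.3, 4.5, 6.3)
n <- length(x)

# Normal equations:  sum(y)   = n*a      + b*sum(x)
#                    sum(x*y) = a*sum(x) + b*sum(x^2)
lhs <- matrix(c(n, sum(x), sum(x), sum(x^2)), nrow = 2)
rhs <- c(sum(y), sum(x * y))
ab <- solve(lhs, rhs)  # ab[1] is a, ab[2] is b
print(ab)

# lm() performs the same least-squares fit
print(coef(lm(y ~ x)))
```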

Problem:- Fit a parabola to the following data


x 1 2 3 4 5
y 10 12 8 10 14

Solution:-
The second degree polynomial is y = a + bx + cx^2.
The three normal equations are
Σy = na + bΣx + cΣx^2
Σxy = aΣx + bΣx^2 + cΣx^3
Σx^2y = aΣx^2 + bΣx^3 + cΣx^4

x    x^2    x^3    x^4    y     xy    x^2y
1    1      1      1      10    10    10
2    4      8      16     12    24    48
3    9      27     81     8     24    72
4    16     64     256    10    40    160
5    25     125    625    14    70    350

Σx = 15, Σx^2 = 55, Σx^3 = 225, Σx^4 = 979, Σy = 54, Σxy = 168, Σx^2y = 640
Substituting the values with n = 5, we get
5a + 15b + 55c = 54 .......... (1)
15a + 55b + 225c = 168 .......... (2)
55a + 225b + 979c = 640 .......... (3)
Eliminate a from (1) and (2): multiply eq (1) by 3 and subtract it from (2)
15a + 55b + 225c = 168
(-) 15a + 45b + 165c = 162
10b + 60c = 6 .......... (4)
Eliminate a from (1) and (3): multiply eq (1) by 11 and subtract it from (3)
55a + 225b + 979c = 640
(-) 55a + 165b + 605c = 594
60b + 374c = 46 .......... (5)
Solve (4) and (5): multiply eq (4) by 6 and subtract it from (5)
60b + 374c = 46
(-) 60b + 360c = 36
14c = 10
c = 0.714
Substitute c = 10/14 in equation (4)
10b + 60(10/14) = 6
b = -3.686
Substitute b and c in equation (1)
5a + 15(-3.686) + 55(0.714) = 54
a = 14.00
Thus the equation of the polynomial y = a + bx + cx^2 is
y = 14 - 3.686x + 0.714x^2
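The sums and the fitted parabola for this data can be cross-checked in R. This is a minimal sketch with illustrative variable names; lm() with an I(x^2) term performs the same least-squares fit as the three normal equations.

```r
# Data from the parabola problem
x <- c(1, 2, 3, 4, 5)
y <- c(10, 12, 8, 10, 14)

# The sums that enter the three normal equations
sums <- c(sx = sum(x), sx2 = sum(x^2), sx3 = sum(x^3), sx4 = sum(x^4),
          sy = sum(y), sxy = sum(x * y), sx2y = sum(x^2 * y))
print(sums)

# I(x^2) makes lm() treat x^2 as an ordinary predictor
fit <- lm(y ~ x + I(x^2))
abc <- unname(coef(fit))  # a, b, c
print(round(abc, 3))
```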

Problem:- Fit a curve of the type y = a.e^(bx) to the following data


x 0 1 2 3
y 1.05 2.1 3.85 8.3
Solution:-
The exponential curve is y = a.e^(bx).
Apply logarithm (base 10) on both sides
log y = log a + bx log e
Y = A + Bx, where Y = log y, A = log a and B = b log e
The two normal equations are
ΣY = nA + BΣx
ΣxY = AΣx + BΣx^2

x    y       Y = log y    xY        x^2
0    1.05    0.0212       0         0
1    2.1     0.3222       0.3222    1
2    3.85    0.5855       1.1710    4
3    8.3     0.9191       2.7573    9

Σx = 6, Σy = 15.3, ΣY = 1.8480, ΣxY = 4.2505, Σx^2 = 14
With n = 4, the equations are
4A + 6B = 1.8480 .......... (1)
6A + 14B = 4.2505 .......... (2)
Solve (1) and (2): multiply (1) by 3 and (2) by 2, then subtract
12A + 28B = 8.5010
(-) 12A + 18B = 5.5440
10B = 2.9570
B = 0.2957
Substitute B in (1)
4A + 6(0.2957) = 1.8480
A = 0.0185
a = antilog(A) = 10^0.0185 = 1.04
b = B / log e = 0.2957 / 0.4343 = 0.681
Therefore the exponential curve is y = (1.04).e^(0.681x)
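The same fit can be sketched in R by regressing the logarithm of y on x. The sketch assumes natural logarithms, so the slope of the fitted line is b itself and no conversion factor is needed; the variable names are illustrative.

```r
# Data from the exponential-curve problem
x <- c(0, 1, 2, 3)
y <- c(1.05, 2.1, 3.85, 8.3)

# With natural logs, log(y) = log(a) + b*x, so the slope is b directly
fit <- lm(log(y) ~ x)
a <- exp(coef(fit)[[1]])  # back-transform the intercept log(a)
b <- coef(fit)[[2]]
print(c(a = a, b = b))

# Compare the fitted curve with the observed values
print(round(cbind(x, y, fitted = a * exp(b * x)), 3))
```

Comparing the fitted column with y is a quick sanity check on any hand calculation: a correct curve should pass close to every data point.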


Survival analysis: Survival analysis is generally defined as a set of methods for analyzing data where
the outcome variable is the time until the occurrence of an event of interest. The event can be death,
occurrence of a disease, marriage, divorce, etc.
In survival analysis, there is a special data structure for right-censored survival data. To use it, one must first load the “survival” package, which is included in the main R distribution:
library(survival)
The basic syntax for survival analysis in R is −
Surv(time,event)
survfit(formula)
Following is the description of the parameters used −
• time is the follow-up time until the event occurs.
• event indicates whether the event of interest occurred (the status indicator).
• formula is a model formula with a Surv object as the response and the predictor variables on the right-hand side.

Next, define the survival times “tt” and the censoring indicator “status”, where “status = 1” indicates
that the time is an observed event, and “status = 0” indicates that it is censored. Then the “Surv”
function binds them into a single object. In the following example, time 6 is right censored, while the
others are observed event times,
> tt <- c(2, 5, 6, 7, 8)
> status <- c( 1, 1, 0, 1, 1)
> Surv(tt, status) # Create a survival data structure
[1] 2  5  6+ 7  8
Example:-
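A minimal sketch of how the Surv object can be passed to survfit(), using the five observations defined above. By default survfit() computes the Kaplan-Meier estimate of the survival curve; the object name km is illustrative.

```r
library(survival)

tt     <- c(2, 5, 6, 7, 8)
status <- c(1, 1, 0, 1, 1)

# "~ 1" fits a single survival curve with no grouping variable
km <- survfit(Surv(tt, status) ~ 1)

# One row per observed event time, with the estimated survival probability
km_summary <- summary(km)
print(km_summary)
```

The censored time 6 does not produce a row of its own, but it reduces the number at risk for the later event times.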

Nonlinear Models
Decision trees:- A decision tree is a graph that represents choices and their results in the form of a tree. The nodes in the graph represent an event or choice and the edges represent the decision rules or conditions. It is mostly used in Machine Learning and Data Mining applications using R.
Examples:
• Predicting whether an email is spam or not spam
• Predicting whether a tumor is cancerous
• Predicting whether a loan is a good or bad credit risk, based on the factors in each case
The package "party" has the function ctree(), which is used to create and analyze decision trees.
Syntax : ctree(formula, data)
Following is the description of the parameters used −
• formula is a formula describing the predictor and response variables.
• data is the name of the data set used.

Example:
library(party)
mydata <- iris  # assumption: the built-in iris data set, whose columns match the formula
model2 <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = mydata)
plot(model2)


Advantages of Decision Trees

• Simple to understand and interpret.
• Requires little data preparation.
• Works with both numerical and categorical data.
• Possible to validate a model using statistical tests, which gives confidence that it will work on new data sets.
• Robust.
• Scales to big data.

Limitations of Decision Trees

• Learning a globally optimal tree is NP-hard.
• Easy to overfit the tree.
• Large trees can become complex.

Random Forests:- In the random forest approach, a large number of decision trees are created. Every
observation is fed into every decision tree. The most common outcome for each observation is used as
the final output.
The package "randomForest" has the function randomForest() which is used to create and analyze
random forests.
Syntax :- randomForest(formula, data)
Following is the description of the parameters used −
• formula is a formula describing the predictor and response variables.
• data is the name of the data set used.
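A minimal sketch of a random forest on the built-in iris data set. It assumes the randomForest package is installed (the block is skipped otherwise); the choice of iris, the seed, and ntree = 500 are illustrative and not taken from the notes.

```r
# Assumes the randomForest package is installed; skipped otherwise
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)  # forests are random; fix the seed for repeatability
  rf <- randomForest::randomForest(Species ~ ., data = iris,
                                   ntree = 500, importance = TRUE)
  print(rf)                            # confusion matrix and out-of-bag error
  print(randomForest::importance(rf))  # variable importance scores
}
```

The printed out-of-bag error is the internal estimate of the generalization error, and importance() reports which variables mattered most for the classification.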

Advantages
⚫ It is one of the most accurate learning algorithms available. For many data sets, it produces
a highly accurate classifier.
⚫ It runs efficiently on large databases.
⚫ It can handle thousands of input variables without variable deletion.
⚫ It gives estimates of what variables are important in the classification.
⚫ It generates an internal unbiased estimate of the generalization error as the forest building
progresses.
⚫ It has an effective method for estimating missing data and maintains accuracy when a large
proportion of the data are missing.

Disadvantages
⚫ Random forests have been observed to overfit for some datasets with noisy
classification/regression tasks.
⚫ For data including categorical variables with different number of levels, random forests are
biased in favor of those attributes with more levels. Therefore, the variable importance
scores from random forest are not reliable for this type of data.

Splines: A linear spline is a continuous function formed by connecting linear segments. The points
where the segments connect are called the knots of the spline.
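One common way to fit a linear spline with known knots is to add a "hinge" term pmax(x - k, 0) for each knot k to an ordinary linear model. The sketch below uses made-up data consisting of three straight pieces joined at x = 2 and x = 4; the data and the knot positions are assumptions for illustration only.

```r
# Illustrative data (not from the notes): three straight pieces
x <- c(0, 1, 2, 3, 4, 5, 6)
y <- c(0, 1, 2, 2, 2, 4, 6)
knots <- c(2, 4)

# Each hinge term is zero before its knot and linear after it, so the
# fitted function stays continuous but can change slope at each knot
fit <- lm(y ~ x + pmax(x - knots[1], 0) + pmax(x - knots[2], 0))
spline_coef <- unname(coef(fit))
print(round(spline_coef, 3))
```

The coefficient of each hinge term is the change in slope at that knot, so the slope of a segment is the sum of the x coefficient and all hinge coefficients for knots to its left.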
