R-Prog Unit-5

Uploaded by

jyothi

STATISTICS WITH R PROGRAMMING Unit - V

UNIT-V: Probability Distributions, Normal Distribution- Binomial Distribution- Poisson

Distributions, Other Distribution, Basic Statistics, Correlation and Covariance, T-Tests, ANOVA.

BINOMIAL DISTRIBUTION:- The binomial distribution is a discrete probability distribution. It
describes the number of successes in n independent trials of an experiment. Each trial is assumed to have only two
outcomes, either success or failure. If the probability of a successful trial is p, then the probability of
having x successful outcomes in an experiment of n independent trials is as follows.

$P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}, \quad x = 0, 1, \dots, n$

R has four built-in functions for working with the binomial distribution. They are described below.
• dbinom(x, size, prob) :- This function gives the probability mass function, i.e. the probability of exactly x successes.
• pbinom(x, size, prob) :- This function gives the cumulative probability P(X ≤ x). It is a single
value representing the probability.
• qbinom(p, size, prob) :- This function takes the probability value p and gives the smallest number whose
cumulative probability is at least p.
• rbinom(n, size, prob) :- This function generates n random values from the binomial distribution
with the given size and prob.
Following is the description of the parameters used −
• x is a vector of numbers.
• p is a vector of probabilities.
• n is number of observations.
• size is the number of trials.
• prob is the probability of success of each trial.
Examples:
• rbinom(n=1,size=10,prob=0.4) - It generates 1 random number from the binomial
distribution, i.e. the number of successes in 10 independent trials with success probability 0.4.
• rbinom(n=5,size=10,prob=0.4) - It generates 5 random numbers from the binomial distribution,
each being the number of successes in 10 independent trials with success probability 0.4.
• rbinom(n=5,size=1,prob=0.4) - Setting size to 1 turns each number into a Bernoulli random
variable, which can take only the value 1 (success) or 0 (failure).
• To visualize the binomial distribution we randomly generate 10,000
experiments, each with 10 trials and success probability 0.3.
> library(ggplot2)   # provides ggplot(), aes() and geom_bar()
> b <- data.frame(success=rbinom(n=10000,size=10,prob=0.3))
> ggplot(b,aes(x=success))+geom_bar()

Problem: Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
Solution: This is a binomial experiment in which the number of trials is equal to 5, the number of
successes is equal to 2, and the probability of success on a single trial is 1/6 or about 0.167.
Therefore, the binomial probability is:
b(2; 5, 0.167) = 5C2 * (0.167)^2 * (0.833)^3
b(2; 5, 0.167) = 0.161
R Code:
> dbinom(2, size=5, prob=0.167)
[1] 0.1612
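
The value 0.167 is only a rounded form of 1/6. As a quick check (a sketch, not part of the original solution), the same probability can be evaluated with the exact fraction and compared against the binomial formula written out by hand. The variable names are illustrative.

```r
# Exact success probability for rolling a four with a fair die
p <- 1/6

# Probability of exactly 2 fours in 5 tosses, using the built-in function
p_builtin <- dbinom(2, size = 5, prob = p)

# The same probability from the formula nCx * p^x * (1 - p)^(n - x)
p_formula <- choose(5, 2) * p^2 * (1 - p)^3

print(round(p_builtin, 4))               # 0.1608
print(all.equal(p_builtin, p_formula))   # TRUE
```

With the exact fraction the answer is about 0.1608, slightly below the 0.1612 obtained from the rounded probability.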

P. Jyothi, IT Dept, PACE ITS


Problem: In a restaurant, seventy percent of people order Chinese food and thirty percent order Italian food. A
group of three persons enters the restaurant. Find the probability that at least two of them order Italian food.
Solution:-
The probability of ordering Chinese food is 0.7 and the probability of ordering Italian food is 0.3. Now, if
at least two of them are ordering Italian food then it implies that either two or three will order Italian
food.

Probability for two ordering Italian food,

P(X=2) = 3C2 (0.3)^2 (0.7)^1
= 3×0.09×0.7
= 0.189
Probability for all three ordering Italian food,
P(X=3) = 3C3 (0.3)^3 (0.7)^0
= 1×0.027×1
= 0.027
Hence, the probability for at least two persons ordering Italian food is,
P(X ≥ 2) = P(X=2)+P(X=3) = 0.189+0.027=0.216
R code:-
> dbinom(2,size=3,prob=0.3)+
+ dbinom(3,size=3,prob=0.3)
[1] 0.216
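
An "at least" probability can also be read from the upper tail of the cumulative distribution instead of adding individual terms. The following is a minimal sketch of that alternative; the variable names are illustrative.

```r
# P(X >= 2) equals P(X > 1); lower.tail = FALSE returns P(X > q)
p_at_least_two <- pbinom(1, size = 3, prob = 0.3, lower.tail = FALSE)

# Same result by summing the individual probabilities for x = 2 and x = 3
p_by_sum <- sum(dbinom(2:3, size = 3, prob = 0.3))

print(p_at_least_two)   # 0.216
print(p_by_sum)         # 0.216
```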

Cumulative Binomial Probability:- A cumulative binomial probability refers to the probability that the
binomial random variable falls within a specified range (e.g., is greater than or equal to a stated lower
limit and less than or equal to a stated upper limit).

Problem:What is the probability of obtaining 45 or fewer heads in 100 tosses of a coin?

Solution: To solve this problem, we compute 46 individual probabilities, using the binomial
formula. The sum of all these probabilities is the answer we seek.
Thus,
b(x ≤ 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + . . . + b(x = 45; 100, 0.5)
= 0.184
R code:-
> pbinom(45,size=100,prob=0.5)
[1] 0.1841008

Problem: Suppose there are twelve multiple choice questions in an English class quiz. Each question has five
possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a
student attempts to answer every question at random.
Solution:
Since only one out of five possible answers is correct, the probability of
answering a question correctly at random is 1/5=0.2.
• The probability of having exactly 4 correct answers by
random attempts is found as follows.
> dbinom(4, size=12, prob=0.2)
[1] 0.1329
• To find the probability of having four or fewer correct answers by random attempts, we
apply the function dbinom with x = 0,…,4.
> dbinom(0, size=12, prob=0.2) + dbinom(1, size=12, prob=0.2) +
+ dbinom(2, size=12, prob=0.2) + dbinom(3, size=12, prob=0.2) +
+ dbinom(4, size=12, prob=0.2)
[1] 0.9274
• Alternatively, we can use the cumulative probability function for the binomial
distribution, pbinom.


> pbinom(4, size=12, prob=0.2)

[1] 0.92744
Answer:- The probability of four or fewer questions being answered correctly at random in a twelve
question multiple choice quiz is 92.7%.

Problem: Fit an appropriate binomial distribution and calculate the theoretical distribution.
x: 0 1 2 3 4 5
f: 2 14 20 34 22 8
Solution:
Here n = 5, N = ∑ fi = 100
Mean = ∑ xi fi / ∑ fi = 284/100 = 2.84
np = 2.84
p = 2.84/5 = 0.568
q = 1 − p = 0.432

p(r) = 5Cr (0.568)^r (0.432)^(5−r), r = 0,1,2,3,4,5

The expected (theoretical) frequencies are calculated as follows:
r      p(r)      N * p(r)               Expected frequency
0      0.0147    100 * 0.0147 = 1.47    1
1      0.097     100 * 0.097  = 9.7     10
2      0.258     100 * 0.258  = 25.8    26
3      0.342     100 * 0.342  = 34.2    34
4      0.226     100 * 0.226  = 22.6    23
5      0.060     100 * 0.060  = 6       6
Total                                   100
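
The hand calculation can be reproduced directly with dbinom. The sketch below estimates p from the observed mean and then computes p(r) and N * p(r) for each r; because R works with unrounded probabilities, individual entries may differ slightly in the last digits from the hand-rounded table. The variable names are illustrative.

```r
n <- 5                           # trials per experiment
x <- 0:5                         # possible numbers of successes
f <- c(2, 14, 20, 34, 22, 8)     # observed frequencies
N <- sum(f)                      # total number of experiments (100)

# Estimate p from the sample mean: mean = n * p
p_hat <- sum(x * f) / (N * n)    # 0.568

# Theoretical probabilities and expected frequencies
probs <- dbinom(x, size = n, prob = p_hat)
expected <- N * probs

fit_table <- data.frame(r = x, p_r = round(probs, 4),
                        expected = round(expected, 2), observed = f)
print(fit_table)
```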
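R code:-
> library(fitdistrplus)   # provides fitdist()
> x <- 0:5
> f <- c(2,14,20,34,22,8)
> df <- data.frame(x,f)
> # Note: this call fits a negative binomial ("nbinom") to the frequencies f by maximum
> # likelihood; it does not reproduce the binomial fit computed by hand above.
> fitbin <- fitdist(df$f,"nbinom")
> summary(fitbin)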
Fitting of the distribution ' nbinom ' by maximum likelihood
Parameters :
      estimate Std. Error
size  2.192416   1.441296
mu   16.664004   4.886713
Loglikelihood: -22.387 AIC: 48.774 BIC: 48.35752
Correlation matrix:
             size           mu
size 1.0000000000 0.0003165092
mu   0.0003165092 1.0000000000

> plot(fitbin)

Poisson Distribution :- The Poisson distribution is the probability distribution of independent
event occurrences in an interval. If λ is the mean occurrence per interval, then the probability of
having x occurrences within a given interval is:

$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots$


Examples:
1. The number of defective electric bulbs manufactured by a reputed company.
2. The number of telephone calls per minute at a switch board
3. The number of cars passing a certain point in one minute.
4. The number of printing mistakes per page in a large text.

R has four built-in functions for working with the Poisson distribution. They are described below.
• dpois(x, lambda, log = FALSE) :- This function gives the probability mass function, i.e. the
probability of exactly x occurrences.
• ppois(q, lambda, lower.tail = TRUE, log.p = FALSE) :- This function gives the cumulative
probability P(X ≤ q). With lower.tail = FALSE it gives P(X > q).
• qpois(p, lambda, lower.tail = TRUE, log.p = FALSE) :- This function takes the probability value p
and gives the smallest number whose cumulative probability is at least p.
• rpois(n, lambda) :- This function generates n random values from the Poisson distribution
with mean lambda.
Following is the description of the parameters used −
• x is a vector of non-negative integer values.
• q is a vector of quantiles.
• p is a vector of probabilities.
• n is the number of random values to generate.
• lambda is the mean number of occurrences per interval.

Problem:- If there are twelve cars crossing a bridge per minute on average, find the probability of having
seventeen or more cars crossing the bridge in a particular minute.
Solution:-
The probability of having sixteen or fewer cars crossing the bridge in a
particular minute is given by the function ppois.
> ppois(16, lambda=12) # lower tail
[1] 0.89871
Hence the probability of having seventeen or more cars crossing the
bridge in a minute is in the upper tail of the probability distribution.
> ppois(16, lambda=12, lower.tail=FALSE) # upper tail
[1] 0.10129
Answer:- If there are twelve cars crossing a bridge per minute on average, the probability of
having seventeen or more cars crossing the bridge in a particular minute is 10.1%.

Problem:- The average number of homes sold by the Acme Realty company is 2 homes per day. What is the
probability that exactly 3 homes will be sold tomorrow?
Solution: This is a Poisson experiment in which we know the following:
• μ = 2; since 2 homes are sold per day, on average.
• x = 3; since we want to find the likelihood that 3 homes will be sold tomorrow.
• e = 2.71828; since e is a constant equal to approximately 2.71828.
We plug these values into the Poisson formula as follows:


P(x; μ) = (e^-μ) (μ^x) / x!

P(3; 2) = (2.71828^-2) (2^3) / 3!
= (0.13534) (8) / 6
= 0.180
R Code:-
> dpois(3,lambda = 2)
[1] 0.180447

Cumulative Poisson Probability:- A cumulative Poisson probability refers to the probability that the
Poisson random variable is greater than some specified lower limit and less than some specified upper
limit.
Problem:- Suppose the average number of lions seen on a 1-day safari is 5.
What is the probability that tourists will see fewer than four lions on the next
1-day safari?
Solution: This is a Poisson experiment in which we know the following:
• μ = 5; since 5 lions are seen per safari, on average.
• x = 0, 1, 2, or 3; since we want to find the likelihood that
tourists will see fewer than 4 lions; that is, we want the
probability that they will see 0, 1, 2, or 3 lions.
• e = 2.71828; since e is a constant equal to approximately 2.71828.
To solve this problem, we need to find the probability that tourists will see 0, 1, 2, or 3 lions. Thus,
we need to calculate the sum of four probabilities: P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5). To compute
this sum, we use the Poisson formula:
P(x ≤ 3; 5) = P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5)
P(x ≤ 3; 5) = [ (e^-5)(5^0) / 0! ] + [ (e^-5)(5^1) / 1! ] + [ (e^-5)(5^2) / 2! ] + [ (e^-5)(5^3) / 3! ]
P(x ≤ 3; 5) = [ (0.006738)(1) / 1 ] + [ (0.006738)(5) / 1 ] + [ (0.006738)(25) / 2 ] + [ (0.006738)(125) / 6 ]
P(x ≤ 3; 5) = [ 0.0067 ] + [ 0.03369 ] + [ 0.084224 ] + [ 0.140375 ]
P(x ≤ 3; 5) = 0.2650
Thus, the probability of seeing no more than 3 lions is 0.2650.
R Code:-
> ppois(3,lambda = 5)
[1] 0.2650259
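
To see that ppois is simply the running sum of dpois, the four individual terms can be computed and added in R. This is an illustrative sketch; the variable names are not from the original notes.

```r
lambda <- 5

# Individual Poisson probabilities P(0; 5), P(1; 5), P(2; 5), P(3; 5)
terms <- dpois(0:3, lambda)
print(round(terms, 6))

# Their sum equals the cumulative probability P(X <= 3)
p_sum <- sum(terms)
p_cdf <- ppois(3, lambda)
print(p_sum)
print(all.equal(p_sum, p_cdf))   # TRUE
```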

Normal Distribution:- A continuous random variable X follows a normal distribution
with mean μ and variance σ^2 if its probability density function is

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}$, on the domain $-\infty < x < \infty$.
Standard Normal Distribution
It is the distribution that occurs when a normal random variable has a mean of zero and a standard
deviation of one.
The normal random variable of a standard normal distribution is called a standard score or a z score.
Every normal random variable X can be transformed into a z score via the following equation:
Z = (X - μ) / σ
where X is a normal random variable, μ is the mean, and σ is the standard deviation,
yielding the standard normal density

$\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2 / 2}$

Standard Normal Curve:- One way of figuring out how data are
distributed is to plot them in a graph. If the data are symmetrically distributed
around the mean, you may come up with a bell curve. A bell curve has a small percentage
of the points on both tails and the bigger percentage on the inner part of
the curve. The standard normal curve has the following properties:

• mean = median = mode
• symmetry about the center
• 50% of values less than the mean and 50% greater than the mean

R functions:
• dnorm(x, mean = 0, sd = 1, log = FALSE) :- This function gives the height of the probability density
function at each point x.
• pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) :- This function gives the cumulative
probability P(X ≤ q). With lower.tail = FALSE it gives P(X > q).
• qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) :- This function takes the probability
value p and gives the number whose cumulative probability matches p.
• rnorm(n, mean = 0, sd = 1) :- This function generates n random values from the normal
distribution with the given mean and sd.
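
A short sketch of the four functions in use may help fix their roles. The numbers chosen here (1.96, 0.975, mean 30, sd 4, the seed) are only illustrative.

```r
# Density at the mean of the standard normal: 1 / sqrt(2 * pi)
d0 <- dnorm(0)

# Cumulative probability P(Z <= 1.96), about 0.975
p196 <- pnorm(1.96)

# qnorm inverts pnorm: the z value with 97.5% of the area to its left
q975 <- qnorm(0.975)

# Five random draws from a normal distribution with mean 30 and sd 4
set.seed(1)   # arbitrary seed, only to make the draw reproducible
sample5 <- rnorm(5, mean = 30, sd = 4)

print(c(d0 = d0, p196 = p196, q975 = q975))
print(sample5)
```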

Procedure to find probability using positive Z-score table

Case 1: Area between 0 and any z score = Area(z)
Case 2: Area in any tail = 0.5 – Area(z)
Case 3: Area between two z-scores on the same side of the mean = |Area(z2) - Area(z1)|
Case 4: Area between two z-scores on opposite sides of the mean = Area(z1) + Area(z2)
Case 5: Area to the left of a positive z score = 0.5 + Area(z)
Case 6: Area to the right of a negative z score = 0.5 + Area(z)


Problem:- X is a normally distributed variable with mean μ = 30 and standard deviation σ = 4. Find
a) P(x < 40)
b) P(x > 21)
c) P(30 < x < 35)
Solution:
a) For x = 40,
z = (x − µ) / σ
⇒ z = (40 – 30) / 4
= 2.5 (= z1 say)
Hence P(x < 40) = P(z < 2.5)
= 0.5 + A(z1) = 0.9938
b) For x = 21,
z = (x − µ) / σ
⇒ z = (21 - 30) / 4
= -2.25 (= -z1 say)
Hence P(x > 21) = P(z > -2.25)
= 0.5 + A(z1) = 0.9878
c) For x = 30,
z = (x − µ) / σ
⇒ z = (30 - 30) / 4 = 0 and
for x = 35,
z = (x − µ) / σ
⇒ z = (35 - 30) / 4
= 1.25
Hence P(30 < x < 35) = P(0 < z < 1.25)
= [area to the left of z = 1.25] - [area to the left of 0]
= 0.8944 - 0.5 = 0.3944
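
The three table look-ups can be checked with pnorm, which accepts the mean and standard deviation directly so no manual z conversion is needed. A minimal sketch with illustrative variable names:

```r
mu <- 30
sigma <- 4

# a) P(x < 40): area to the left of 40
p_a <- pnorm(40, mean = mu, sd = sigma)

# b) P(x > 21): area to the right of 21
p_b <- pnorm(21, mean = mu, sd = sigma, lower.tail = FALSE)

# c) P(30 < x < 35): difference of two left-tail areas
p_c <- pnorm(35, mean = mu, sd = sigma) - pnorm(30, mean = mu, sd = sigma)

print(round(c(a = p_a, b = p_b, c = p_c), 4))   # 0.9938 0.9878 0.3944
```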

Problem:- The length of life of an instrument produced by a machine has a normal distribution with a mean of 12
months and standard deviation of 2 months. Find the probability that an instrument
produced by this machine will last
a) less than 7 months.
b) between 7 and 12 months.
Solution:
a) P(x < 7)
For x = 7,
z = (x − µ) / σ
⇒ z = (7 – 12) / 2
= -2.5 (= -z1 say)
Hence P(x < 7) = P(z < -2.5)
= 0.5 - A(z1) = 0.0062
b) P(7 < x < 12)
For x = 12,
z = (x − µ) / σ
⇒ z = (12 – 12) / 2
= 0
Hence P(7 < x < 12) = P(-2.5 < z < 0)
= 0.5 - 0.0062 = 0.4938

Problem:-The Tahoe Natural Coffee Shop morning customer load follows a normal
distribution with mean 45 and standard deviation 8. Determine the probability that the
number of customers tomorrow will be less than 42.


Solution:-
We first convert the raw score to a z-score. We have
z = (x − µ) / σ
⇒ z = (42 − 45) / 8 = −0.375
Next, we use the table to find the probability. The table gives 0.3520. (We have rounded the z-score
to -0.38.)
We can conclude that
P(x < 42) = P(z < -0.38)
= 0.352
That is, there is about a 35% chance that there will be fewer than 42 customers tomorrow.
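
Because pnorm does not require rounding the z-score to two decimals, it gives a slightly different value from the table. The sketch below shows both; the variable names are illustrative.

```r
# Exact computation with the unrounded z-score (-0.375)
p_exact <- pnorm(42, mean = 45, sd = 8)

# Table-style computation with the z-score rounded to -0.38
p_table <- pnorm(-0.38)

print(round(c(exact = p_exact, table = p_table), 4))
```

Both values round to about 35%, so the conclusion is unchanged.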

Example: Overlaying a fitted normal curve on a histogram
> x <- c(92,117,109,85,117,107,82,83,119,113,101,106,101,84,126,69,82,79,84,100,104,111,109,92,93,107,
81,118,81,133,111,82,120,103,115,89,74,110,83,110,96,102,108,110,140,106,111,98,98,99,74,101,107,104,
128,87,95,109,104,91,83,98,99,103,126,123,85,98,93,100)

> h <- hist(x,col = "blue")               # histogram; h stores the bin midpoints

> m <- mean(x)
> s <- sd(x)
> xf <- seq(min(x),max(x),length=70)      # grid of x values for the curve
> dis <- dnorm(xf,m,s)                    # normal density at each grid point
> dis <- dis*diff(h$mids[1:2])*length(x)  # scale density to counts: bin width * n
> lines(xf,dis,col="red",lwd=3)

Problem:- Assume that the test scores of a college entrance exam fit a normal distribution. Furthermore,
the mean test score is 72, and the standard deviation is 15.2. What is the percentage of students scoring
84 or more in the exam?
Solution:-
We apply the function pnorm of the normal distribution with mean 72 and standard deviation
15.2. Since we are looking for the percentage of students scoring higher than 84, we are interested in
the upper tail of the normal distribution.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492
Answer:- The percentage of students scoring 84 or more in the exam is 21.5%.

Correlation:- A correlation is a relationship between two variables.
Typically, we take x to be the independent variable. We take y to be the
dependent variable. Data is represented by a collection of ordered pairs
(x,y).

The strength of the linear relationship is measured by the correlation coefficient r.
This will always be a number between -1 and 1 (inclusive).

• If r is close to 1, we say that the variables are positively correlated. This means there is likely a strong
linear relationship between the two variables, with a positive slope.
• If r is close to -1, we say that the variables are negatively correlated. This means there is likely a strong
linear relationship between the two variables, with a negative slope.
• If r is close to 0, we say that the variables are not correlated. This means that there is likely no linear
relationship between the two variables; however, the variables may still be related in some other way.
To run a correlation test we type:
> cor.test(var1, var2, method = "method")
The default method is "pearson", so you may omit this argument if that is what you want. If you type "kendall" or
"spearman" then you will get the corresponding rank-based significance test.


Problem:- The local ice cream shop keeps track of how much ice cream they sell versus the temperature
on that day. Here are their figures for the last 12 days:

Temperature (°C):     14.2  16.4  11.9  15.2  18.5  22.1  19.4  25.1  23.4  18.1  22.6  17.2
Ice cream sales ($):  215   325   185   332   406   522   412   614   544   421   445   408

Solution:-

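Formula for correlation coefficient:

$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$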
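
The Pearson coefficient is the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. As a sketch, it can be evaluated term by term for the ice cream data and compared with the built-in cor function. The names dx, dy, r_manual and r_builtin are illustrative.

```r
temp  <- c(14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2)
sales <- c(215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408)

# Deviations of each observation from its mean
dx <- temp - mean(temp)
dy <- sales - mean(sales)

# Correlation coefficient from the deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in Pearson correlation for comparison
r_builtin <- cor(temp, sales)

print(r_manual)
print(all.equal(r_manual, r_builtin))   # TRUE
```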

R Code:-
> temp <- c(14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2)
> sales <- c(215,325,185,332,406,522,412,614,544,421,445,408)
> corr_coeff <- cor(temp,sales)
> corr_coeff
[1] 0.9575066
> cov(temp,sales)
[1] 484.0932
> # Draw the scatter plot and add a line of best fit
> plot(temp, sales, pch=16,col="red")
> abline(lm(sales~temp),col="blue")


T-test for single mean:- The one-sample t-test is used to compare the mean of a population to a
specified theoretical mean (μ0).
Let X represent a sample of n values with sample mean $\bar{x}$ and sample standard deviation s.
The comparison of the observed sample mean $\bar{x}$ to the theoretical value μ0 is performed with
the formula below:

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

To evaluate whether the difference is statistically significant, you first have to read from the t test
table the critical value of Student’s t distribution corresponding to the significance level alpha of your
choice (for example 5%). The degrees of freedom (df) used in this test are: df = n−1

Problem:-: A professor wants to know if her introductory statistics class has a good grasp of basic
math. Six students are chosen at random from the class and given a math proficiency test. The professor
wants the class to be able to score above 70 on the test. The six students get scores of 62, 92, 75, 68, 83,
and 95. Can the professor have 90 percent confidence that the mean score for the class on the test would
be above 70?
Solution:-
• Null hypothesis H0: μ = 70
• Alternative hypothesis Ha: μ > 70
• Level of significance: α = 0.10 (90 percent confidence, one-tailed test)
First, compute the sample mean and standard deviation:

$\bar{x} = \frac{62 + 92 + 75 + 68 + 83 + 95}{6} = \frac{475}{6} = 79.17$

$s = 13.17$

The test statistic is

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{79.17 - 70}{13.17 / \sqrt{6}} = \frac{9.17}{5.38} = 1.71$ (calculated value of t)

To test the hypothesis, the computed t-value of 1.71 is compared to the critical value in the t-table.
With 5 df and α = 0.10 (one-tailed) the critical value is 1.476. The calculated value of t is more than
the table value of t, so the null hypothesis is rejected.
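R code:-
> x <- c(62,92,75,68,83,95)
> # Two-sided test; for the one-sided alternative (mean > 70) halve the p-value, since t > 0
> t.test(x,alternative="two.sided",mu=70)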
One Sample t-test
data: x
t = 1.7053, df = 5, p-value = 0.1489
alternative hypothesis: true mean is not equal to 70
95 percent confidence interval:
65.34888 92.98446
sample estimates:
mean of x
79.16667
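
Since the professor's question is one-sided (is the mean above 70?), the test can also be run with alternative = "greater" and a 90% confidence level. The following is a sketch under that reading of the problem; the variable names are illustrative.

```r
x <- c(62, 92, 75, 68, 83, 95)

# One-sided test of H0: mu = 70 against Ha: mu > 70
res <- t.test(x, alternative = "greater", mu = 70, conf.level = 0.90)

t_stat <- unname(res$statistic)         # same t value as the two-sided test
p_one_sided <- res$p.value              # half of the two-sided p-value here
t_crit <- qt(0.90, df = length(x) - 1)  # one-tailed critical value at alpha = 0.10

print(c(t = t_stat, critical = t_crit, p = p_one_sided))
reject <- t_stat > t_crit
print(reject)   # TRUE: H0 is rejected at the 10% level
```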

Problem:- A sample of 26 bulbs gives a mean life of 990 hours with a S.D. of 20 hours. The manufacturer
claims that the mean life of bulbs is 1000 hours. Does the sample meet the standard?
Solution: Here n = 26,
Sample mean x̅ = 990 hours
S.D. s = 20 hours
Population mean µ = 1000 hours
df = n-1 = 26-1 = 25
• Null Hypothesis H0: The sample meets the standard, i.e. µ = 1000 hours
• Alternative Hypothesis HA: µ not equal to 1000 hours
• Level of significance: α = 0.05
• The test statistic is

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{990 - 1000}{20 / \sqrt{26}} = -2.55$, so |t| = 2.55 (calculated value of t)

The two-tailed table value of t with 25 df at the 5% level is 2.060.
The calculated value of |t| is more than the table value of t, so the null hypothesis is rejected at the 5% level.
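
When only summary statistics are available, t.test cannot be used directly, but the statistic and critical value are easy to compute by hand in R. This sketch assumes s / sqrt(n) as the standard error and a two-tailed test at the 5% level; the variable names are illustrative.

```r
n    <- 26     # sample size
xbar <- 990    # sample mean (hours)
s    <- 20     # sample standard deviation (hours)
mu0  <- 1000   # claimed population mean (hours)

# Test statistic t = (xbar - mu0) / (s / sqrt(n))
t_stat <- (xbar - mu0) / (s / sqrt(n))

# Two-tailed critical value and p-value with n - 1 degrees of freedom
t_crit  <- qt(0.975, df = n - 1)
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

print(round(c(t = t_stat, critical = t_crit, p = p_value), 4))
```

Here |t| is about 2.55 against a critical value of about 2.06, so the null hypothesis is rejected at the 5% level.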

Paired comparisons( Paired t-test ):- Sometimes data comes from non independent samples. An
example might be testing "before and after" of cosmetics or consumer products. We could use a single
random sample and do "before and after" tests on each person. A hypothesis test based on these data
would be called a paired comparisons test. Since the observations come in pairs, we can study the
difference, d, between the samples. The difference between each pair of measurements is called di.

Test statistic:- With a sample of n pairs of measurements, forming a simple random sample from a
normally distributed population, the mean of the differences, $\bar{d}$, is tested using the following
implementation of t.

$t = \frac{\bar{d} - \mu}{S / \sqrt{n}}$

Problem :- The blood pressure of 5 women before and after intake of a certain drug are
given below: Test whether there is significant change in blood pressure at 1% level of
significance.
Before 110 120 125 132 125
After 120 118 125 136 121


Solution: Let µ be the mean of the population of differences.

• Null Hypothesis H0: µ1 = µ2, i.e. no change in B.P.
• Alternative Hypothesis HA: µ1 ≠ µ2, i.e. there is a change in B.P.
• Level of significance: α = 0.01
• Computation: The differences di (before − after drug) are
-10, 2, 0, -4, 4

$\bar{d} = \frac{-10 + 2 + 0 - 4 + 4}{5} = \frac{-8}{5} = -1.6$

$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (d_i - \bar{d})^2 = \frac{1}{4} \sum_{i=1}^{5} (d_i - \bar{d})^2$

$= \frac{1}{4} \left[ (-10 + 1.6)^2 + (2 + 1.6)^2 + (0 + 1.6)^2 + (-4 + 1.6)^2 + (4 + 1.6)^2 \right] = \frac{123.20}{4} = 30.8$

$S = \sqrt{30.8} = 5.55$
• Test statistic: The test statistic is t, which is calculated as

$t = \frac{\bar{d} - \mu}{S / \sqrt{n}} = \frac{-1.6 - 0}{5.55 / \sqrt{5}} = -0.645$

Calculated |t| value is 0.645
The tabulated two-tailed t at the 1% level with 5-1 = 4 degrees of freedom is 4.604.
Since calculated |t| is less than the tabulated t, we accept the null hypothesis and conclude that there is no significant
change in blood pressure.
R code:-
> x <- c(110,120,125,132,125)
> y <- c(120,118,125,136,121)
> t.test(x,y,paired=TRUE)
Paired t-test
data: x and y
t = -0.64466, df = 4, p-value = 0.5543
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.490956 5.290956
sample estimates:
mean of the differences
-1.6

T-test for difference of two population means :-

With a two-sample t-test, we compare the population means to each other and again look at the
difference. We expect that $\bar{x} - \bar{y}$ would be close to μ1 – μ2. The test statistic will use both sample means,
sample standard deviations, and sample sizes for the test.
A two-sample t-test follows these steps:
• Write the null and alternative hypotheses.
• State the level of significance and find the critical value. For the pooled-variance statistic below, the
critical value from Student’s t-distribution has n1 + n2 - 2 degrees of freedom.
• Compute the test statistic.
• Compare the test statistic to the critical value and state a conclusion.

$t = \frac{\bar{x} - \bar{y}}{S \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$

where

$S^2 = \frac{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2}{n_1 + n_2 - 2}$ or $S^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}$

Problem:- Two horses A and B were tested according to the time (in seconds) to run a particular track
with the following results.
Horse A 28 30 32 33 33 29 34
Horse B 29 30 30 24 27 29
Test whether the two horses have the same running capacity.

Solution:- Given n1 = 7 and n2 = 6

We first compute the sample means and the sums of squared deviations.
$\bar{x}$ = mean of the first sample = (28 + 30 + 32 + 33 + 33 + 29 + 34) / 7 = 219 / 7 = 31.286
$\bar{y}$ = mean of the second sample = (29 + 30 + 30 + 24 + 27 + 29) / 6 = 169 / 6 = 28.16

x      x - x̄     (x - x̄)^2    y      y - ȳ     (y - ȳ)^2
28     -3.286    10.8         29     0.84      0.7056
30     -1.286    1.6538       30     1.84      3.3856
32     0.714     0.51         30     1.84      3.3856
33     1.714     2.94         24     -4.16     17.3056
33     1.714     2.94         27     -1.16     1.3456
29     -2.286    5.226        29     0.84      0.7056
34     2.714     7.366
219              31.4359      169              26.8336

Now,

$S^2 = \frac{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2}{n_1 + n_2 - 2} = \frac{31.4358 + 26.8336}{7 + 6 - 2} = 5.30$

Therefore $S = \sqrt{5.30} = 2.3$

• Null Hypothesis H0: µ1 = µ2
• Alternative Hypothesis HA: µ1 ≠ µ2
• Level of significance: α = 0.05
• Computation:

$t = \frac{\bar{x} - \bar{y}}{S \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{31.286 - 28.16}{(2.3) \sqrt{\frac{1}{7} + \frac{1}{6}}} = 2.443$

The tabulated t0.05 with 7+6-2 = 11 degrees of freedom at the 5% level of significance is 2.2.
Since calculated t > t0.05, we reject the null hypothesis and conclude that the two horses do not have
the same running capacity.
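
The pooled two-sample test for the horse data can be run in R with t.test and var.equal = TRUE, which uses the n1 + n2 - 2 degrees of freedom of the hand calculation. This is a sketch with illustrative variable names; because R does not round intermediate values, its t statistic (about 2.44) differs slightly from the hand-computed 2.443.

```r
horse_a <- c(28, 30, 32, 33, 33, 29, 34)
horse_b <- c(29, 30, 30, 24, 27, 29)

# Pooled-variance (equal variance) two-sample t-test, two-sided
res <- t.test(horse_a, horse_b, var.equal = TRUE)

t_stat <- unname(res$statistic)
df     <- unname(res$parameter)   # n1 + n2 - 2 = 11
t_crit <- qt(0.975, df = df)      # two-tailed critical value at the 5% level

print(res)
print(c(t = t_stat, df = df, critical = t_crit))
```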

ANOVA:- (ANALYSIS OF VARIANCE)

When we have only two samples we can use the t-test to compare the means of the samples,
but it might become unreliable in case of more than two samples. If we only compare two means, then
the t-test (independent samples) will give the same results as the ANOVA. ANOVA is performed with the F-test.

Null hypothesis H0: There are no differences among the mean values of the groups being compared
(i.e., the group means are all equal):
H0: µ1 = µ2 = µ3 = …= µk
Alternative hypothesis H1 (the conclusion if H0 is rejected):
Not all group means are equal (i.e., at least one group mean is different from the rest).

ANOVA one-way classification:-

Step 1: Grand total of all observations

$T = \sum_i \sum_j X_{ij}$

Step 2: Correction factor

$cf = \frac{T^2}{N} = \frac{T^2}{rs}$

Step 3: Total sum of squares

$TSS = S_T^2 = \sum_i \sum_j X_{ij}^2 - cf$

Step 4: Treatment sum of squares

$TrSS = S_{Tr}^2 = \sum_j \frac{T_j^2}{n_j} - cf$, where T_j is the total of the j-th treatment and n_j is the number of observations in it

Step 5: Error sum of squares
ESS = S_E^2 = TSS - TrSS

ANOVA table:
Source of variation            d.f.    Sum of squares        Mean sum of squares    F-test
Treatment (between samples)    k-1     TrSS                  TrSS / (k-1)           Fcal = [TrSS / (k-1)] / [ESS / (N-k)]
Error                          N-k     ESS = TSS - TrSS      ESS / (N-k)
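
The five steps can be traced in R and compared with the built-in aov function. The sketch below uses R's built-in PlantGrowth data set (plant weight for three treatment groups), which is an assumption made for illustration and not data from these notes. The per-group count n_j is used in the treatment sum of squares.

```r
data(PlantGrowth)
x <- PlantGrowth$weight          # observations
g <- PlantGrowth$group           # treatment labels
N <- length(x)                   # total number of observations
k <- nlevels(g)                  # number of treatments

T_total <- sum(x)                          # Step 1: grand total
cf      <- T_total^2 / N                   # Step 2: correction factor
TSS     <- sum(x^2) - cf                   # Step 3: total sum of squares
Tj      <- tapply(x, g, sum)               # treatment totals
nj      <- tapply(x, g, length)            # observations per treatment
TrSS    <- sum(Tj^2 / nj) - cf             # Step 4: treatment sum of squares
ESS     <- TSS - TrSS                      # Step 5: error sum of squares

# F statistic: ratio of the treatment and error mean squares
F_cal <- (TrSS / (k - 1)) / (ESS / (N - k))

# Same analysis with the built-in function
fit   <- aov(weight ~ group, data = PlantGrowth)
F_aov <- summary(fit)[[1]][["F value"]][1]

print(c(manual = F_cal, aov = F_aov))
```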


STATISTICS WITH R PROGRAMMING Unit - VI

Unit-6:- Linear Models, Simple Linear Regression, -Multiple Regression Generalized Linear
Models,Logistic Regression, - Poisson Regression- other Generalized Linear Models-Survival
Analysis,Nonlinear Models - Splines, Decision, Random Forests.
Regression:- Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables. One of these variables is called the predictor variable, whose value is gathered
through experiments. The other variable is called the response variable, whose value is derived from the
predictor variable.

Linear Regression:- In linear regression these two variables are related through an equation in which the
exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal
to 1, creates a curve.
The general mathematical equation for a linear regression is − y = ax + b
Following is the description of the parameters used −
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients.
lm() Function:- This function creates the relationship model between the predictor and the response
variable.
Syntax: lm(formula,data)
Following is the description of the parameters used −
• formula is a symbolic description of the relation between x and y.
• data is the data frame containing the variables to which the formula will be applied.

Example:-
> height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
> weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
> relation <- lm(weight~height)
> print(relation)
Call:
lm(formula = weight ~ height)
Coefficients:
(Intercept)       height
   -38.4551       0.6746
> # Plot the predictor (height) on the x-axis and the response (weight) on the y-axis
> plot(height,weight,col = "blue",main = "Height & Weight Regression")
> abline(relation,col="orange")   # add the fitted regression line

Advantages / Limitations of Linear Regression Model :

• Linear regression implements a statistical model that, when relationships between the
independent variables and the dependent variable are almost linear, shows optimal results.
• Linear regression is often inappropriately used to model non-linear relationships.
• Linear regression is limited to predicting numeric output.
• A lack of explanation about what has been learned can be a problem.
Regression equation of x on y

$x - \bar{x} = r \frac{\sigma_x}{\sigma_y} (y - \bar{y})$

Regression equation of y on x

$y - \bar{y} = r \frac{\sigma_y}{\sigma_x} (x - \bar{x})$

where

$r = \frac{\frac{1}{n} \sum xy - \bar{x}\,\bar{y}}{\sqrt{\frac{1}{n} \sum (x - \bar{x})^2} \, \sqrt{\frac{1}{n} \sum (y - \bar{y})^2}}$

Multiple Regression :- Multiple regression is an extension of simple linear regression. It is used when
we want to predict the value of a variable based on the value of two or more other variables. The
variable we want to predict is called the dependent variable.
The general mathematical equation for multiple regression is – x1 = a0 + a1x2 + a2x3
Following is the description of the parameters used −
• x1 is the response variable.
• a0, a1, a2, ..., an are the coefficients.
• x2, x3, ..., xn are the predictor variables.
The normal equations for estimating a0, a1 and a2 are:

$\sum x_1 = n a_0 + a_1 \sum x_2 + a_2 \sum x_3$

$\sum x_1 x_2 = a_0 \sum x_2 + a_1 \sum x_2^2 + a_2 \sum x_2 x_3$

$\sum x_1 x_3 = a_0 \sum x_3 + a_1 \sum x_2 x_3 + a_2 \sum x_3^2$

We create the regression model using the lm() function in R. The model determines the value of the
coefficients using the input data. Next we can predict the value of the response variable for a given set of
predictor variables using these coefficients.

lm() Function :- This function creates the relationship model between the predictors and the response
variable.
Syntax :- lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used −
• formula is a symbolic description of the relation between the response variable and the predictor
variables.
• data is the data frame containing the variables to which the formula will be applied.
Example
> lm(mpg~disp+hp+wt,data=mtcars)

Call:
lm(formula = mpg ~ disp + hp + wt, data = mtcars)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891

Create Equation for Regression Model

Based on the above intercept and coefficient values, we create the mathematical equation.
Y = a + b1*x1 + b2*x2 + b3*x3, where x1 = disp, x2 = hp and x3 = wt
or
Y = 37.1055+(-0.000937)*x1+(-0.0312)*x2+(-3.8009)*x3
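
The equation can be applied by hand or through predict(). The sketch below does both for one hypothetical car (disp = 221, hp = 102, wt = 2.91 are made-up illustrative values) and confirms that the two routes agree.

```r
model <- lm(mpg ~ disp + hp + wt, data = mtcars)
b <- coef(model)   # named vector: (Intercept), disp, hp, wt

# Hypothetical new observation, for illustration only
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)

# Prediction from the regression equation written out term by term
y_manual <- b["(Intercept)"] + b["disp"] * new_car$disp +
            b["hp"] * new_car$hp + b["wt"] * new_car$wt

# Prediction from the fitted model object
y_predict <- predict(model, newdata = new_car)

print(unname(y_manual))
print(unname(y_predict))
```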

Logistic Regression :- Logistic regression is a regression model in which the response variable
(dependent variable) has categorical values such as True/False or 0/1. It models the
probability of a binary response as a function of the predictor variables, based on the mathematical
equation relating the two.
The general mathematical equation for logistic regression is −
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
Following is the description of the parameters used −
• y is the response variable.
• x1, x2, x3, ... are the predictor variables.
• a and b1, b2, b3, ... are the coefficients, which are numeric constants.


The function used to create the regression model is the glm() function.
Syntax :- glm(formula,data,family)
Following is the description of the parameters used −
• formula is the symbolic description of the relationship between the variables.
• data is the data set giving the values of these variables.
• family is an R object that specifies the details of the model. Its value is binomial for logistic
regression.
For example, in the built-in data set mtcars, the data column am represents the transmission type of
the automobile model (0 = automatic, 1 = manual). With the logistic regression equation, we can model
the probability of a manual transmission in a vehicle based on its engine horsepower and weight data.
> am.glm = glm(formula=am ~ hp + wt, data=mtcars, family=binomial)
> am.glm
Call: glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Coefficients:
(Intercept)           hp           wt
   18.86630      0.03626     -8.08348

Degrees of Freedom: 31 Total (i.e. Null); 29 Residual

Null Deviance: 43.23
Residual Deviance: 10.06 AIC: 16.06
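To show how the fitted model gives a probability, here is a minimal sketch. The car described in new_car (120 hp, weight 2.8) is a hypothetical input chosen only for illustration. The manual line applies the logistic equation y = 1/(1+e^-(a+b1x1+b2x2)) directly to the fitted coefficients.

```r
# Refit the logistic model from the example above
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Hypothetical car: 120 horsepower, weight 2.8 (in 1000 lbs)
new_car <- data.frame(hp = 120, wt = 2.8)

# Probability of a manual transmission from predict()
p <- predict(am.glm, newdata = new_car, type = "response")

# Same probability from the logistic equation
eta <- sum(coef(am.glm) * c(1, new_car$hp, new_car$wt))  # a + b1*hp + b2*wt
p_manual <- 1 / (1 + exp(-eta))

print(p)
print(p_manual)
```

With type = "response", predict() returns the probability itself. Without it, the value returned is the linear predictor a + b1x1 + b2x2.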

Poisson Regression:- Poisson regression is used when the response variable is a count (a non-negative
whole number) and not a fractional number, for example the number of births or the number of wins in
a football match series. The values of the response variable are assumed to follow a Poisson distribution.
The general mathematical equation for Poisson regression is −
log(y) = a + b1x1 + b2x2 + ... + bnxn
Following is the description of the parameters used −
• y is the response variable.
• a and b are the numeric coefficients.
• x is the predictor variable.
The function used to create the Poisson regression model is the glm() function.
We have the in-built data set "warpbreaks" which describes the effect of wool type (A or B) and tension
(low, medium or high) on the number of warp breaks per loom. Let's consider "breaks" as the response
variable which is a count of number of breaks. The wool "type" and "tension" are taken as predictor
variables.
> output <-glm(formula = breaks ~ wool+tension,data = warpbreaks,
+ family = poisson)
> print(summary(output))
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6871  -1.6503  -0.4269   1.1902   4.2616

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 297.37 on 53 degrees of freedom
Residual deviance: 210.39 on 50 degrees of freedom
AIC: 493.06

Number of Fisher Scoring iterations: 4
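Because the model is linear in log(y), the coefficients are easier to read after exponentiation. The sketch below is an illustration under that assumption and not part of the original notes. exp(coef) gives the multiplicative change in the expected number of breaks, and new_loom is a hypothetical wool/tension combination.

```r
# Refit the Poisson model from the example above
fit <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Multiplicative effects on the expected count (rate ratios)
rate_ratios <- exp(coef(fit))
print(round(rate_ratios, 3))

# Hypothetical loom: wool B at high tension
new_loom <- data.frame(
  wool    = factor("B", levels = levels(warpbreaks$wool)),
  tension = factor("H", levels = levels(warpbreaks$tension))
)

# Expected number of breaks for that loom
expected_breaks <- predict(fit, newdata = new_loom, type = "response")
print(expected_breaks)
```

A rate ratio below 1 (as for woolB, tensionM and tensionH here) means fewer expected breaks than for the baseline of wool A at low tension.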


CURVE FITTING:- Curve fitting is the process of constructing a curve, or mathematical function, that
has the best fit to a series of data points, possibly subject to constraints.

Type of curve, equation and normal equations

Fitting of a straight line: $y = a + bx$
Normal equations:
$\sum y = na + b\sum x$
$\sum xy = a\sum x + b\sum x^2$

Fitting of a second degree polynomial: $y = a + bx + cx^2$
Normal equations:
$\sum y = na + b\sum x + c\sum x^2$
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$
$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$

Exponential curve: $y = a b^x$
Apply log on both sides:
$\log y = \log a + x\log b$
$Y = A + Bx$
Normal equations:
$\sum Y = nA + B\sum x$
$\sum xY = A\sum x + B\sum x^2$
where Y = log y, A = log a and B = log b

Exponential curve: $y = a e^{bx}$
Apply natural log on both sides:
$\log y = \log a + bx$
$Y = A + bx$
Normal equations:
$\sum Y = nA + b\sum x$
$\sum xY = A\sum x + b\sum x^2$
where Y = log y and A = log a (natural logarithms, so that log e = 1)

Power curve: $y = a x^b$
Apply log on both sides:
$\log y = \log a + b\log x$
$Y = A + bX$
Normal equations:
$\sum Y = nA + b\sum X$
$\sum XY = A\sum X + b\sum X^2$
where Y = log y, X = log x and A = log a
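The log transformation for the curve y = a.x^b can be sketched in R as follows. The data here are synthetic and noise-free, generated from a = 2 and b = 1.5 purely for illustration, so the fit should recover those two values up to floating-point error.

```r
# Synthetic data from y = 2 * x^1.5 (illustration only, not from the notes)
x <- c(1, 2, 3, 4, 5)
y <- 2 * x^1.5

# Transform: Y = log y, X = log x, so that Y = A + b*X
X <- log(x)
Y <- log(y)
n <- length(x)

# Normal equations in matrix form:
#   sum(Y)   = n*A      + b*sum(X)
#   sum(X*Y) = A*sum(X) + b*sum(X^2)
M   <- matrix(c(n, sum(X), sum(X), sum(X^2)), nrow = 2, byrow = TRUE)
rhs <- c(sum(Y), sum(X * Y))
sol <- solve(M, rhs)

A <- sol[1]
b <- sol[2]
a <- exp(A)   # back-transform, since A = log a

cat("a =", a, " b =", b, "\n")
```

The same pattern works for the other transformed curves: only the definitions of X and Y and the back-transformation change.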

Problem:- Fit a straight line to the following data

x 0 1 2 3 4
y 1 1.8 3.3 4.5 6.3
Solution:-
Straight line is y = a+bx
The two normal equations are
$\sum y = na + b\sum x$
$\sum xy = a\sum x + b\sum x^2$
x    x^2   y     xy
0    0     1     0
1    1     1.8   1.8
2    4     3.3   6.6
3    9     4.5   13.5
4    16    6.3   25.2

$\sum x = 10$, $\sum x^2 = 30$, $\sum y = 16.9$, $\sum xy = 47.1$
Substituting the values (n = 5), we get
5a+10b = 16.9.........................(1)
10a+30b = 47.1 ...................(2)
Solving (1) and (2), we get
Multiply eq (1) with 2
10a+20b = 33.8....................... (3)
Subtract (2) from (3)
10a + 20b = 33.8
(-)10a + 30b = 47.1
0 -10b = -13.3
Therefore b = 1.33. Now substitute in (1): 5a + 13.3 = 16.9, so a = 0.72.
Thus the equation of the straight line is y = a + bx
y = 0.72+1.33x
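The normal equations for this problem can also be solved numerically. The following is a minimal sketch, not part of the original notes. It builds the sums from the data table, solves the 2x2 system with solve(), and cross-checks the result with lm(), which performs the same least-squares fit.

```r
# Data from the problem
x <- c(0, 1, 2, 3, 4)
y <- c(1, 1.8, 3.3, 4.5, 6.3)
n <- length(x)

# Normal equations:
#   sum(y)   = n*a      + b*sum(x)
#   sum(x*y) = a*sum(x) + b*sum(x^2)
M   <- matrix(c(n, sum(x), sum(x), sum(x^2)), nrow = 2, byrow = TRUE)
rhs <- c(sum(y), sum(x * y))

ab <- solve(M, rhs)
names(ab) <- c("a", "b")
print(ab)

# Cross-check with the built-in least-squares fit
fit <- lm(y ~ x)
print(coef(fit))
```

Printing sum(x * y) is a quick way to check the xy column of a hand-computed table.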

Problem:- Fit a parabola to the following data

x 1 2 3 4 5
y 10 12 8 10 14

Solution:-
The second degree polynomial is $y = a + bx + cx^2$
The three normal equations are
$\sum y = na + b\sum x + c\sum x^2$
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$
$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$
x    x^2   x^3   x^4   y    xy   x^2y
1    1     1     1     10   10   10
2    4     8     16    12   24   48
3    9     27    81    8    24   72
4    16    64    256   10   40   160
5    25    125   625   14   70   350
$\sum x = 15$, $\sum x^2 = 55$, $\sum x^3 = 225$, $\sum x^4 = 979$, $\sum y = 54$, $\sum xy = 168$, $\sum x^2 y = 640$
Substituting the values (n = 5), we get
5a+15b+55c = 54........................... (1)
15a+55b+225c = 168................... (2)
55a+225b+979c = 640 ................ (3)
Solving (1) and (2), we get
Multiply eq (1) with 3 and subtract (2)
15a+45b+165c = 162
(-)15a+55b+225c = 168
0 -10b-60c = -6
10b + 60c = 6 ........................ (4)

Solving (1) and (3), we get

Multiply eq (1) with 11 and subtract (3)
55a+165b+605c = 594
(-)55a+225b+979c = 640
0 -60b-374c = -46
60b + 374c = 46........(5)

Solve (4) and (5)

Multiply eq (4) with 6 and subtract from (5)
60b + 374c = 46
(-)60b + 360c = 36
14c = 10
c = 10/14 = 0.714
Substitute in equation (4)
10b + 60(10/14) = 6
10b = 6 - 42.857 = -36.857
b = -3.686
Substitute c and b values in equation (1)
5a+15b+55c = 54
5a+15(-3.686)+55(0.714) = 54
5a = 54 + 55.29 - 39.27 = 70.0 (approximately)
a = 14
Thus the equation of the polynomial is $y = a + bx + cx^2$
$y = 14 - 3.686x + 0.714x^2$
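The three normal equations can be written compactly with a design matrix. The sketch below is an illustration, not part of the original notes. The columns of X are 1, x and x^2, so t(X) %*% X contains exactly the sums n, Σx, Σx^2, Σx^3 and Σx^4, and t(X) %*% y contains Σy, Σxy and Σx^2y. The result is cross-checked with lm().

```r
# Data from the problem
x <- c(1, 2, 3, 4, 5)
y <- c(10, 12, 8, 10, 14)

# Design matrix with columns 1, x, x^2
X <- cbind(1, x, x^2)

# Normal equations: (X'X) %*% c(a, b, c) = X'y
XtX <- t(X) %*% X
Xty <- t(X) %*% y
abc <- as.vector(solve(XtX, Xty))
names(abc) <- c("a", "b", "c")
print(round(abc, 3))

# Cross-check with lm(); I(x^2) adds the squared term to the formula
fit <- lm(y ~ x + I(x^2))
print(round(coef(fit), 3))
```

Printing XtX and Xty is a convenient way to verify the column totals of a hand-computed table before solving.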

Problem:- Fit a curve of the type y=aebx to the following data

x 0 1 2 3
y 1.05 2.1 3.85 8.3
Solution:- The exponential curve is $y = a e^{bx}$
Apply natural logarithm on both sides
$\ln y = \ln a + bx$
$Y = A + bx$, where Y = ln y and A = ln a
The two normal equations are
$\sum Y = nA + b\sum x$
$\sum xY = A\sum x + b\sum x^2$
x    y      Y = ln y   xY       x^2
0    1.05   0.0488     0        0
1    2.1    0.7419     0.7419   1
2    3.85   1.3481     2.6962   4
3    8.3    2.1163     6.3489   9

$\sum x = 6$, $\sum y = 15.3$, $\sum Y = 4.2551$, $\sum xY = 9.7870$, $\sum x^2 = 14$

The equations are (n = 4)
4A + 6b = 4.2551 …….(1)
6A + 14b = 9.7870 …….(2)
Solve (1) and (2) equations
Multiply (1) with 3 and (2) with 2 then subtract them
12A+18b = 12.7653
(-)12A+ 28b = 19.5740
-10b = -6.8087
b = 0.681
Substitute b value in (1)
4A + 6(0.681) = 4.2551
A = 0.042
a = e^A
= e^0.042 = 1.04
Therefore the exponential curve is $y = (1.04)e^{0.681x}$
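The same fit can be reproduced in R by regressing the natural logarithm of y on x. This is a minimal sketch and not part of the original notes. It relies on the fact that log() in R is the natural logarithm, so the slope of the fitted line is b itself and a is recovered with exp().

```r
# Data from the problem
x <- c(0, 1, 2, 3)
y <- c(1.05, 2.1, 3.85, 8.3)

# Linearise: ln y = ln a + b*x, then fit a straight line
fit <- lm(log(y) ~ x)

A <- coef(fit)[["(Intercept)"]]  # A = ln a
b <- coef(fit)[["x"]]            # slope is b because natural logs are used
a <- exp(A)

cat("a =", round(a, 3), " b =", round(b, 3), "\n")

# Fitted values on the original scale, next to the observed y
print(cbind(x, y, fitted = round(a * exp(b * x), 2)))
```

If base-10 logarithms are used instead (log10 in R), the slope equals b*log10(e), so it must be divided by log10(e) to obtain b.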


Survival analysis: Survival analysis is generally defined as a set of methods for analyzing data where
the outcome variable is the time until the occurrence of an event of interest. The event can be death,
occurrence of a disease, marriage, divorce, etc.
In R, there is a special structure for right-censored survival data. To use it, one must first
load the “survival” package, which is included in the main R distribution:
library(survival)
The basic syntax for creating survival analysis in R is −
Surv(time,event)
survfit(formula)
Following is the description of the parameters used −
• time is the follow up time until the event occurs.
• event indicates the status of occurrence of the expected event.
• formula is a model formula with a Surv object as the response and the predictor variables on the
right-hand side.

Next, define the survival times “tt” and the censoring indicator “status”, where “status = 1” indicates
that the time is an observed event, and “status = 0” indicates that it is censored. Then the “Surv”
function binds them into a single object. In the following example, time 6 is right censored, while the
others are observed event times,
> tt <- c(2, 5, 6, 7, 8)
> status <- c( 1, 1, 0, 1, 1)
> Surv(tt, status) # Create a survival data structure
[1] 2 5 6+ 7 8
Example:-
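A minimal sketch of survfit() on the small data set defined above, added for illustration and not taken from the original notes. The formula Surv(tt, status) ~ 1 has no predictors, so survfit() returns a single Kaplan-Meier survival curve for all five subjects.

```r
library(survival)

# Survival times and censoring indicator (1 = event observed, 0 = censored)
tt     <- c(2, 5, 6, 7, 8)
status <- c(1, 1, 0, 1, 1)

# Kaplan-Meier estimate of the survival function
km <- survfit(Surv(tt, status) ~ 1)

# One row per event time: number at risk, number of events, survival estimate
print(summary(km))

# Survival probability at every recorded time, including the censored time 6
print(data.frame(time = km$time, survival = km$surv))
```

The censored observation at time 6 does not lower the survival estimate, but it reduces the number at risk for the later event times.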

Nonlinear Models
Decision trees:- A decision tree is a graph that represents choices and their results in the form of a
tree. The nodes in the graph represent an event or choice, and the edges of the graph represent the
decision rules or conditions. It is mostly used in Machine Learning and Data Mining applications using R.
Examples:
• Predicting whether an email is spam or not spam.
• Predicting whether a tumor is cancerous.
• Predicting whether a loan is a good or bad credit risk based on the factors in each of these.
The package "party" has the function ctree(), which is used to create and analyze decision trees.
Syntax : ctree(formula, data)
Following is the description of the parameters used −
• formula is a formula describing the predictor and response variables.
• data is the name of the data set used.

Example:
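library(party)
# mydata is not defined in the notes; the column names match the built-in iris data set (assumption)
mydata <- iris
model2<-ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=mydata)
plot(model2)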


Advantages of Decision Trees

• Simple to understand and interpret.
• Requires little data preparation.
• Works with both numerical and categorical data.
• Possible to validate a model using statistical tests, which gives confidence that it will work on
new data sets.
• Robust.
• Scales to big data.

Limitations of Decision Trees

• Learning a globally optimal tree is NP-hard.
• Easy to overfit the tree.
• Large trees can become complex.

Random Forests:- In the random forest approach, a large number of decision trees are created. Every
observation is fed into every decision tree. The most common outcome for each observation is used as
the final output.
The package "randomForest" has the function randomForest() which is used to create and analyze
random forests.
Syntax :- randomForest(formula, data)
Following is the description of the parameters used −
• formula is a formula describing the predictor and response variables.
• data is the name of the data set used.
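The randomForest() call described above can be sketched as follows. This example is an illustration and not part of the original notes. It assumes the randomForest package has been installed, uses the built-in iris data set, and the values ntree = 200 and the seed are arbitrary choices. The requireNamespace() guard only skips the example when the package is missing.

```r
# Assumes install.packages("randomForest") has been run; skipped otherwise
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)

  set.seed(123)   # the trees are built from random samples, so fix the seed

  # Grow 200 trees to classify Species from the four measurements
  rf <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

  print(rf)               # includes the out-of-bag (OOB) error estimate
  print(importance(rf))   # variable importance measures

  # Final output for three example rows: the most common vote across the trees
  print(predict(rf, newdata = iris[c(1, 51, 101), ]))
}
```

The out-of-bag error printed with the model is computed from the observations each tree did not see during training, so no separate test set is needed for a first estimate.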

Advantages
⚫ It is one of the most accurate learning algorithms available. For many data sets, it produces
a highly accurate classifier.
⚫ It runs efficiently on large databases.
⚫ It can handle thousands of input variables without variable deletion.
⚫ It gives estimates of what variables are important in the classification.
⚫ It generates an internal unbiased estimate of the generalization error as the forest building
progresses.
⚫ It has an effective method for estimating missing data and maintains accuracy when a large
proportion of the data are missing.

Disadvantages
⚫ Random forests have been observed to overfit for some datasets with noisy
classification/regression tasks.
⚫ For data including categorical variables with different number of levels, random forests are
biased in favor of those attributes with more levels. Therefore, the variable importance
scores from random forest are not reliable for this type of data.

Splines: A linear spline is a continuous function formed by connecting linear segments. The points
where the segments connect are called the knots of the spline.
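A linear spline with one knot can be fitted with lm() by adding a term that is zero to the left of the knot. The sketch below is an illustration under stated assumptions and not part of the original notes. The data are synthetic (slope 2 up to x = 5, slope 0.5 afterwards) and the knot position is assumed to be known.

```r
# Synthetic piecewise-linear data with a change of slope at x = 5
x <- 0:10
y <- ifelse(x <= 5, 2 * x, 10 + 0.5 * (x - 5))
knot <- 5

# pmax(x - knot, 0) is 0 left of the knot and (x - knot) right of it,
# so the two linear segments are joined continuously at the knot
fit <- lm(y ~ x + pmax(x - knot, 0))

slope_left  <- coef(fit)[[2]]        # slope before the knot
slope_right <- sum(coef(fit)[2:3])   # slope after the knot
cat("slope before knot:", slope_left, " slope after knot:", slope_right, "\n")

# Predictions just before, at and just after the knot
new_x <- data.frame(x = c(4.5, 5, 5.5))
print(predict(fit, newdata = new_x))
```

More knots can be handled by adding one pmax(x - k, 0) term per knot k. The splines package shipped with R offers smoother alternatives such as bs().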
