Math10282 Ex05 - An R Session


MATH10282 Introduction to Statistics - Examples Class 5 - An R Session


Introduction
In this third R session, we will see how to perform calculations using
various discrete and continuous distributions and how to superimpose a plot of
a parametric pdf onto a histogram. All the data sets required are the same
as for the second R session, but they have been copied into the Example Sheets>Week
5 folder on Blackboard for your convenience.
You are advised first to work through and reproduce the examples in
the following notes before trying the Exercises at the end.

Probability Distributions in R
You can carry out a variety of calculations involving parametric probability
distributions in R. Some of the common distributions available are
Distribution          R name
Discrete:
  Binomial            binom
  Poisson             pois
  Geometric           geom
  Negative Binomial   nbinom

Continuous:
  Uniform             unif
  Normal              norm
  Exponential         exp
  Gamma               gamma
  Chi-square          chisq
  Beta                beta
  Student t           t

Some of these will be familiar at the moment; others will become familiar
as you study more probability and statistics. R has four particular functions
available for each distribution. These are

Name                           Description
dname(x= , other arguments)    Density or probability mass function
pname(q= , other arguments)    Cumulative distribution function
qname(p= , other arguments)    Quantile function
rname(n= , other arguments)    Random deviates

i.e. you prefix the R name of the distribution with one of the letters "d", "p", "q"
or "r", depending on what you would like to calculate. You have to specify the
values of the parameters of the distribution in the call to the function if you
are changing them from any preset default values. Functions may also have
other arguments with preset values; you can use "help" in R to check
these.
When calculating the pdf or pmf you need to specify either a scalar or
vector of values, using the argument x = ..., for which the calculation is to be
performed. The cdf calculates P(X ≤ q), so you need to give either a scalar or
vector of x-values using q = .... The quantile (or inverse cdf) function requires
you to list the probabilities for which you want the calculation(s) performed,
as a scalar or vector, using p = .... When generating random observations you
need to say how many you want using the n = ... argument.
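As a brief illustration of the four prefixes described above, the sketch below applies each of them to the Poisson distribution (R name pois). The rate lambda = 4 and the variable names are arbitrary choices made for this example only; they do not come from the notes.

```r
# Illustrative only: the rate lambda = 4 is an arbitrary choice.
lambda <- 4

# d: probability mass function, P(X = 2)
p_equal_2 <- dpois(x = 2, lambda = lambda)

# p: cumulative distribution function, P(X <= 2)
p_at_most_2 <- ppois(q = 2, lambda = lambda)

# q: quantile function, the smallest k with P(X <= k) >= 0.5
median_x <- qpois(p = 0.5, lambda = lambda)

# r: three random deviates (the values change from run to run)
draws <- rpois(n = 3, lambda = lambda)

print(c(p_equal_2, p_at_most_2, median_x))
print(draws)
```

Each argument (x, q, p, n) also accepts a vector where that makes sense, e.g. dpois(x = 0:2, lambda = 4) returns three probabilities at once.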

Examples of their use


(i) Consider the Binomial distribution with parameters n = 50 and p =
0.6. Let X ∼ Bi(50, 0.6).
To plot the probability mass function of X:

> xb<-seq(from=0, to=50, by=1)
> pxb<-dbinom(x=xb, size=50, prob=0.6)
> plot(xb, pxb)

Each plotted point corresponds to P(X = k) for k = 0, 1, ..., 50.
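A quick sanity check on the pmf values is that they sum to 1 over the whole support. The sketch below also draws the pmf with vertical spikes (type = "h"), which some people find easier to read for a discrete distribution; this is an optional alternative to the point plot, and the axis labels are chosen here for illustration.

```r
xb <- seq(from = 0, to = 50, by = 1)
pxb <- dbinom(x = xb, size = 50, prob = 0.6)

# The probabilities over the whole support 0, 1, ..., 50 should sum to 1
total <- sum(pxb)
print(total)

# type = "h" draws a vertical spike at each k instead of a plotting symbol
plot(xb, pxb, type = "h", xlab = "k", ylab = "P(X = k)")
```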


To calculate P(X = 35):

> dbinom(x=35, 50, 0.6)
[1] 0.04154667

To calculate a cumulative probability, such as P(X ≤ 25):

> pbinom(q=25, 50, 0.6)
[1] 0.09780736

To find P(20 ≤ X ≤ 30):

> pbinom(q=30, 50, 0.6)-pbinom(q=19, 50, 0.6)
[1] 0.5521499
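Note that the lower cdf term is evaluated at 19, not 20, because P(20 ≤ X ≤ 30) = P(X ≤ 30) − P(X ≤ 19) for an integer-valued X. One way to convince yourself of this is to sum the pmf over k = 20, ..., 30 directly; the variable names below are chosen for this sketch only.

```r
# Interval probability from the cdf: P(X <= 30) - P(X <= 19)
via_cdf <- pbinom(q = 30, size = 50, prob = 0.6) -
  pbinom(q = 19, size = 50, prob = 0.6)

# The same probability as a direct sum of the pmf over k = 20, ..., 30
via_pmf <- sum(dbinom(x = 20:30, size = 50, prob = 0.6))

print(c(via_cdf, via_pmf))
```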

We can also use the inverse cumulative distribution function, e.g.

> qbinom(p=0.5, 50, 0.6)
[1] 30
> qbinom(p=0.25, 50, 0.6)
[1] 28

> pbinom(q=30, 50, 0.6)
[1] 0.5535236
> pbinom(q=29, 50, 0.6)
[1] 0.4389651
> pbinom(q=28, 50, 0.6)
[1] 0.3298617
> pbinom(q=27, 50, 0.6)
[1] 0.2339830

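The qbinom function calculates the smallest number of successes such
that the probability of at most this many successes is greater than or
equal to p. For example, with p = 0.25 the answer is 28, because
P(X ≤ 27) = 0.2339830 < 0.25 while P(X ≤ 28) = 0.3298617 ≥ 0.25.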
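The definition of the quantile function can be checked by a direct search over the cdf. The following is a minimal sketch for p = 0.25; the names k, cdf and smallest_k are introduced here for illustration.

```r
p <- 0.25
k <- 0:50

# cdf values P(X <= k) for every possible number of successes
cdf <- pbinom(q = k, size = 50, prob = 0.6)

# Smallest k whose cumulative probability is at least p
smallest_k <- min(k[cdf >= p])

print(smallest_k)
print(qbinom(p = p, size = 50, prob = 0.6))
```

Both printed values should agree with the qbinom(p=0.25, 50, 0.6) result shown in the notes.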
Generate a random sample of size n = 5 from the Bi(50, 0.6) distribution:

> rbinom(n=5, 50, 0.6)
[1] 33 30 27 30 29

Generate a random sample of size n = 8 from the Bi(1, 0.6) distribution:

> rbinom(n=8, 1, 0.6)
[1] 1 1 1 0 0 0 1 0

(ii) Consider now the standard Normal N(µ = 0, σ² = 1) distribution. To
calculate the height of the pdf at x = 0 we use:

> dnorm(x=0)
[1] 0.3989423

The default settings are for the mean=0 and sd=1, so there is no need
to mention these parameters in the call to the function dnorm above.
(This is also the case with the functions pnorm, qnorm and rnorm.)
If we want to calculate the height of the pdf at x = 0 for the N(4, 10²)
distribution then we use:

> dnorm(x=0, mean=4, sd=10)
[1] 0.03682701
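Note that dnorm takes the standard deviation (sd = 10), not the variance (10² = 100). The value above can be reproduced from the Normal pdf formula f(x) = exp(−(x − µ)² / (2σ²)) / (σ √(2π)); the sketch below does this by hand, with variable names chosen for illustration.

```r
x <- 0
mu <- 4
sigma <- 10  # standard deviation, so the variance is sigma^2 = 100

# Normal pdf written out explicitly
by_formula <- exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))

# The same value from the built-in density function
by_dnorm <- dnorm(x = x, mean = mu, sd = sigma)

print(c(by_formula, by_dnorm))
```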

If we wanted a plot of the pdf curve for this distribution then we use:

> xn<-seq(from=-26, to=34, by=0.2)
> yxn<-dnorm(xn, mean=4, sd=10)
> plot(xn, yxn, type="l")

Specifying the plot to be of type "l" means a line plot is produced.
This means that no symbols are plotted at each of the co-ordinates
but rather successive points are joined together by lines to create a
graphic of the density curve. You can add a title using the main="..."
argument of the plot function.
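For instance, a titled version of the density plot might look like the sketch below. The title and axis label text are made up for this example; xlab and ylab are standard arguments of plot.

```r
xn <- seq(from = -26, to = 34, by = 0.2)
yxn <- dnorm(xn, mean = 4, sd = 10)

# Line plot of the density curve with a title and axis labels
plot(xn, yxn, type = "l",
     main = "pdf of the N(4, 10^2) distribution",
     xlab = "x", ylab = "f(x)")
```

The range −26 to 34 covers the mean plus or minus three standard deviations, which is where almost all of the probability lies.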
We can calculate cumulative distribution function values as follows:

> pnorm(q=4, mean=4, sd=10)
[1] 0.5
> pnorm(q=c(4, 8, 12), mean=4, sd=10)
[1] 0.5000000 0.6554217 0.7881446
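These values agree with what you would obtain from standard Normal tables after standardising, since P(X ≤ q) = Φ((q − µ)/σ). The sketch below checks this numerically; the names q, z, direct and standardised are introduced here for illustration.

```r
q <- c(4, 8, 12)

# cdf of N(4, 10^2) evaluated directly
direct <- pnorm(q = q, mean = 4, sd = 10)

# Standardise first, then use the standard Normal cdf (mean = 0, sd = 1)
z <- (q - 4) / 10
standardised <- pnorm(q = z)

print(direct)
print(standardised)
```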

The quantile or inverse cdf function is used as follows:

> qnorm(p=0.975)
[1] 1.959964
> qnorm(p=0.975, mean=4, sd=10)
[1] 23.59964
> qnorm(p=0.5, mean=4, sd=10)
[1] 4
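Because qnorm is the inverse of pnorm, feeding a quantile back into the cdf should return the original probability. A minimal check, using variable names chosen for this sketch:

```r
p <- c(0.5, 0.975)

# Quantiles of the N(4, 10^2) distribution at the probabilities in p
quantiles <- qnorm(p = p, mean = 4, sd = 10)

# Applying the cdf to those quantiles should recover p
recovered <- pnorm(q = quantiles, mean = 4, sd = 10)

print(quantiles)
print(recovered)
```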

We can generate one or more random observations from a specified
Normal distribution, e.g. n = 5 from the N(4, 10²) distribution.

> rnorm(n=5, mean=4, sd=10)
[1] 8.245934 10.212281 -7.596698 31.125897 18.880339
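Your own output will differ from the values shown, since a new sample is drawn on every call. If you want results that can be reproduced, one option is to fix the state of the random number generator with set.seed before sampling. In the sketch below the seed value 2024 is an arbitrary choice and is not part of the original notes.

```r
# Fix the random number generator state (2024 is an arbitrary seed)
set.seed(2024)
first <- rnorm(n = 5, mean = 4, sd = 10)

# Resetting the same seed reproduces exactly the same sample
set.seed(2024)
second <- rnorm(n = 5, mean = 4, sd = 10)

print(first)
print(identical(first, second))
```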

Plotting a pdf on a histogram


Given the above discussion, this is very easy to do using the "dname" and
"lines" functions. For example, to superimpose a Normal pdf we would first use the
dnorm function to create a set of co-ordinates through which the line will
pass and then use the graphical lines function to draw the line on the plot.
This can be illustrated for some simulated data as follows:

> # Simulate n=100 observations from the Ga(shape=200, scale=2) distribution.
> # N.B. in R, the Gamma pdf with shape=a and scale=s is
> #   f(y) = const * y^(a-1) * exp(-y/s).
> xsim<-rgamma(n=100, shape=200, scale=2)
> # Draw the histogram on the density scale; specify the number of bins
> # required as well if you wish.
> hist(xsim, freq=FALSE, xlim=c(280, 500))
> xv<-seq(from=280, to=500, by=0.5)
> yvg<-dgamma(x=xv, shape=200, scale=2)
> yvn<-dnorm(x=xv, mean=mean(xsim), sd=sd(xsim))
> # Add a line (the gamma pdf) to the histogram plot (the currently active
> # plot) which passes through all the (xv[i], yvg[i]) pairs.
> # The argument lty=1 specifies a solid line.
> lines(xv, yvg, lty=1)
> # Add a Normal pdf to the histogram plot. The argument lty=2 specifies
> # a dashed line.
> lines(xv, yvn, lty=2)
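The histogram must be drawn on the density scale rather than the frequency scale for the superimposed pdf to be comparable with it: on the density scale the bars have total area 1, just like a pdf. The sketch below checks this using the object returned by hist; the seed and the use of plot = FALSE are choices made for this illustration only.

```r
set.seed(1)  # arbitrary seed, only so that the sketch is reproducible
xsim <- rgamma(n = 100, shape = 200, scale = 2)

# plot = FALSE returns the histogram object without drawing it
h <- hist(xsim, plot = FALSE)

# Total area of the bars on the density scale: sum of height * width
area <- sum(h$density * diff(h$breaks))
print(area)
```

With frequencies instead of densities the bar heights would be counts, and a pdf drawn on the same axes would appear as an almost flat line near zero.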

Exercises
(a) (i) Let the discrete random variable X ∼ Bi(n = 100, p = 0.3).
Use the appropriate binom functions to calculate the values of
P(X = 33), P(28 ≤ X ≤ 34) and P(X < 38). Now calculate the
same probabilities using a Normal approximation to the Binomial
distribution with a continuity correction. Compare your two sets
of results.
(ii) Let X ∼ Po(10). In R, calculate and plot the pmf of X for the
x-values {0, 1, 2, ..., 25}. Calculate P(X < 15), P(X ≥ 8) and
P(6 ≤ X ≤ 16). Find the lower quartile, median and upper
quartile of the distribution.
(iii) Let X ∼ Ex(0.2). In R, calculate and plot the pdf of X (a line
plot) for x ∈ (0, 25). Calculate P(X < 12), P(X > 3) and
P(4 < X < 20). Find the 20th, 50th and 80th percentiles
of the distribution.
(iv) Let X ∼ N(20, 7²). In R, calculate and plot the pdf of X for x ∈
(0, 40). Calculate P(X < 17), P(X > 25) and P(13 < X < 27).
Find the 5th, 10th, 90th and 95th percentiles of this Normal
distribution. Can you find these percentiles as functions of the
corresponding percentiles from the standard Normal distribution?

(b) We will use the same data as for R Session 2, i.e. the data sets
simdat1, geyser and anorexia.

(i) For data simdat1, superimpose an appropriate Normal pdf onto
the graph of the histogram of the data. Comment on the goodness-of-fit
of the Normal distribution.
Note that the data in simdat1 are actually a random sample of
size n = 100 from a N(10.0, 2²) distribution.
(ii) Repeat (b)(i) above for the geyser data. Comment on the usefulness
of the Normal distribution as a probability model for these
data.
(iii) For the anorexia data, use superimposed Normal pdfs to investigate
the Normality of the data within each of the three treatment
groups.
