Unit 3
Structure
3.1 Introduction
Objectives
3.7 Summary
3.1 INTRODUCTION
In the first two units of this block you learnt a few methods for exploring and describing
numerical data. The examples you saw there showed that numerical observations on a
characteristic might need careful exploration for any useful and correct inference. In
general, when you decide to collect data on a characteristic, you have a specific
purpose; you want either to verify a hypothesis or to estimate a quantity for a
certain purpose. For example, you may want to know if the district of Jalandhar
produces more wheat in a year than the district of Faridkot in the state of Punjab. You
may have a hypothesis that in general, Jalandhar district produces more wheat per year
than Faridkot district. When you have such a specific question or hypothesis, you will
decide to collect the appropriate information or data, in this particular instance, on the
annual wheat production in the two districts of Punjab. You may find a secondary
source, such as a publication from an agency or a department of the government which
has already collected this information and has tabulated the data. Of course, you need
to be sure about the reliability of the source from where you have obtained the required
data. When you actually look at the data on the amount of annual wheat production in
the two districts for, say, the past ten years, what do you think you will find? Do you
think the amount produced in any district will be the same from year to year? This is
very unlikely as there are a very large number of factors that influence the amount of
wheat produced and these factors do not remain constant over the 10 years. Also it is
very unlikely that you will find Jalandhar district producing more wheat than the
district of Faridkot in each of the 10 years. In such a situation how does one compare
the two districts in respect of wheat production?
Firstly, it is clear from the above discussion that a realistic model will be to assume that
the annual production of wheat in a district is a random variable (result of a random
experiment introduced in the previous unit). The ten year observations may be thought
of as the outcome of ten repetitions of a random experiment for a district. Secondly, we
need to develop measures to compare the two districts with respect to the annual
production of wheat when this is thought of as the outcome of a random experiment.
In certain situations or for certain problems, you may find that there is no reliable
secondary source where data on the appropriate characteristic are available, and that
you have to take measurements or make observations, if necessary after conducting a
controlled experiment. Since, in practice, it is never possible to control or eliminate the
influence of all the factors, this experiment may also have to be modelled as a random
experiment. This need arises, for instance, when a scientist claims to have developed a
variety of wheat which yields more per acre than the existing varieties and you want to
verify this claim. To summarise, you notice the following:
1) Given a specific purpose, data are collected on an appropriate characteristic.
If there is no secondary source of data, you may need to either conduct a controlled
experiment, or conduct a survey, to obtain data on the appropriate characteristic.
We start this unit by defining the notion of a random variable and its probability
distribution in Section 3.2. The notions of discrete and continuous random variables are
introduced next, followed by the notions of expectation and variance. You will see that
to compare random variables or to draw inferences about them in a practical
application, their probability distributions are important. Also important are certain
measures of the distribution such as the mean and variance. These are also discussed in
Sec. 3.2. In the next four sections we discuss the four most commonly used
probability distributions in detail.
Here is a list of what you should be able to do by the end of this unit.
Objectives
After reading this unit, you should be able to
• specify when a variable is a random variable and classify it as discrete or continuous;
• find the probability distribution of discrete and continuous random variables, calculate
  the mean and variance of these distributions, and use these measures to make
  judgements about the real-life situation;
• describe the following distributions:
  a) binomial distribution
  b) Poisson distribution
  c) uniform distribution
  d) normal distribution
Probability Distributions
Then
Let us denote by X(aj) the number of heads obtained when the outcome of our
experiment is aj, where j = 1, 2, ..., 8. You can easily check that
X(a1) = 3, X(a2) = X(a3) = X(a4) = 2,
X(a5) = X(a6) = X(a7) = 1, X(a8) = 0.
Do you agree that X maps the elements of the sample space S to the values
0, 1, 2, 3? That is, X is a function from the sample space S to the set N* = {0, 1, 2, 3}.
Also note that corresponding to each value, there is always some sample point or a set
of sample points. For example, the set of sample points corresponding to the value '0'
is the single point a8, whereas for 1 the set is {a5, a6, a7}. That means, corresponding to
each value of X, there is a subset of the sample space S.
Now recall from Unit 2 that an event is a subset of a sample space. Thus we
note that each value of X is associated with an event.
You can, therefore, make the following identification of events corresponding to the
values associated by X. Denote the event corresponding to '0' as [X = 0], and likewise for
the other values. Then
[X = 0] = {a8},
[X = 1] = {a5, a6, a7},
[X = 2] = {a2, a3, a4}, [X = 3] = {a1}.
Assuming that all the sample points are equally likely, we assign a probability of 1/8 to
each of the sample points.
From here, using the laws of probability, you can calculate the probabilities as follows:
P[X = 0] = P{a8} = 1/8,
P[X = 1] = P{a5, a6, a7} = P{a5} + P{a6} + P{a7} = 3/8,
P[X = 2] = 3/8 and P[X = 3] = 1/8,
where we read P[X = j] as "probability that X equals j."
Because of this representation, instead of working with an abstract space, we can work
with a set of numbers, and this simplifies our problems considerably.
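The passage from sample points to numbers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the outcomes of three tosses are written as strings such as "HHT", all eight are taken as equally likely, and the names `outcomes`, `number_of_heads` and `pmf` are hypothetical, not part of the unit.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of three coin tosses: 8 equally likely sample points (assumed).
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

def number_of_heads(outcome):
    """The random variable X: maps a sample point to its number of heads."""
    return outcome.count("H")

# Count how many sample points are mapped to each value of X.
counts = Counter(number_of_heads(outcome) for outcome in outcomes)

# Each sample point has probability 1/8, so P[X = x] = (count of points) / 8.
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(counts.items())}

for x, probability in pmf.items():
    print(f"P[X = {x}] = {probability}")
```

Once `pmf` is known, the individual sample points are no longer needed; the set of numbers 0, 1, 2, 3 with their probabilities carries all the information about X.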
Now let us consider another example.
Example 2: You are sitting in a plane waiting for its take off. The pilot announces a
delay until some incoming planes land. Suppose you want to find the following:
i) How long will it be before take-off?
ii) How many incoming planes are there?
Let us discuss the random variables for (i) and (ii) of the above example.
We first take (i). In this case we want to find the 'duration of time' before the plane takes
off. Note that the variable takes values continuously along a line as given below, say
from time duration 'a' to time duration 'b'. No values in between a and b are left out.
In other words there is no break in the values assumed by this random variable.
Fig.2
Now let us consider (ii). Here the random variable is the number of planes. This
variable can only take the values 0 or 1 or 2 etc., as shown in Fig. 3. There is no
continuity (see Fig. 3), since only non-negative integer values can be assumed.
The above examples show that random variables can be of different types. There are
mainly two types of random variables:
1) discrete random variable
2) continuous random variable
The random variable shown in (ii) of Example 2 is discrete and that of (i) is continuous.
In the next subsection we shall discuss discrete random variables. Before that, why
don't you try an exercise?
E1) Suppose you take a 50-question multiple-choice exam, guessing every answer,
and are interested in the number of correct answers obtained. Then
a) What is the random variable you will consider for this situation?
b) What values might this random variable have?
c)
3.2.1
We first formally define a discrete random variable and familiarise you with some of the
properties/aspects of a discrete r.v.
Definition 1: A random variable X is said to be discrete if the number of values that X
can take is finite or countably infinite. These values may be listed as x0, x1, ..., where,
say, x0 < x1 < ...; these x's are called the jump points. They need not be equidistant.
Now let us consider the events associated with the values assigned by X.
Let the events be denoted by [X = xj], j = 0, 1, 2, .... Then, as stated earlier, we can
assign probabilities to these events. We denote by P[X = xj] the probability of the event
[X = xj]. For further simplification we denote the probability for each j as
P[X = xj] = pj, j = 0, 1, 2, 3, ....
From the properties of a random variable and by the definition of probability it follows that
i) pj ≥ 0 for all j, and
ii) p0 + p1 + p2 + ... = 1.
Define the function p by
p(x) = pi, if x = xi, i = 0, 1, 2, ...
p(x) = 0, otherwise.
Then p is called the probability mass function (p.m.f.) of the random variable X. The
collection of pairs (xi, pi), i = 0, 1, 2, ... is called the probability distribution of X.
For example, suppose X is the r.v. denoting the number of heads obtained in three tosses
of a coin. Then the probability mass function p is given by the following table:
x       0      1      2      3
p(x)    1/8    3/8    3/8    1/8
Suppose you have a random variable X assuming values x1, x2, ... with probabilities
p1, p2, ..., respectively. You may also visualise this as an illustration of a frequency
distribution. In fact, a probability distribution tells you how the total probability "one" is
distributed over the possible values of the random variable.
Now let us see the graphical representation of this distribution.
Graphically, along the horizontal axis, plot the various possible values xi of a random
variable and on each value erect a vertical line with height proportional to the
corresponding probability pi. (See Fig.4) Recall that in the bar diagram of a frequency
distribution, the observed frequencies are graphed along the vertical axis; the total
frequency (which is the same as the total number of repetitions of the random
experiment) is thus distributed over the possible outcomes.
Fig. 4
Example 3: Recall the problem of Sunil, the newspaper boy, which was presented
in Section 1, Unit 2. When Sunil mentioned his dilemma about his 10 irregular
customers to his sister Sunita, who is doing a course in statistics at the local college, she
advised him, as a first step, to start maintaining a record for each of the 10 customers,
showing for each day whether the customer has taken the newspaper from him or not.
Following her advice, Sunil had, at the end of two months, 61 sets of observations, each
set corresponding to a day. Each set of observations was written as a sequence made up of
two numbers, 1's and 0's, a 1 at position i showing that customer i has bought the newspaper
on that day and a 0 in that position showing that customer i has not bought the
newspaper on that day. You will also find that the kth sequence (corresponding to the
kth day) is actually an observed sample point of the sample space corresponding to the
random experiment performed on day k with these 10 customers. Sunil has repeated
this random experiment with his 10 customers for 61 days! When Sunil reflected over
the mass of observations he had made, it suddenly occurred to him that his record may be
excessive for solving his problem. Do you also get the same idea as Sunil did? If not,
think about it for a while now, and then read ahead.
Sunil reasoned as follows: After all, his daily gain will only depend on how many
newspapers he is able to sell on a particular day, irrespective of who among the 10 buys.
Therefore, it is enough for his purpose to note down the number of 1's that appear in the
kth sequence corresponding to day k, to calculate his gain for that day. So why does he
have to maintain a sequence? Just the total number of customers buying on a day
should serve the purpose. Do you agree with Sunil? When Sunil showed his diary to
Sunita and mentioned to her his new idea, she appreciated his line of thought and told
him that the variable he wanted to consider, namely the total number of 1's in a
sequence, would be called a random variable by a statistician as the sequences were the
observed results of a random experiment. Sunil could forget about the sample space
containing the sequences, provided he knew the probability distribution of the random
variable chosen by him, namely the number of sales on a day to his irregular customers.
We shall consider the problem of finding the probability distribution of this random
Definition 3: For a discrete random variable X, its expected value (or mean), denoted
as E(X), is defined as:
E(X) = x0 p0 + x1 p1 + x2 p2 + ...
where x0, x1, x2, ... are the values assumed by X and p0, p1, p2, ... are the probabilities of these
values.
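As a small illustration of this definition, the sketch below computes E(X) for the number of heads in three tosses of a fair coin, using the probabilities 1/8, 3/8, 3/8, 1/8 found earlier. The function name `expected_value` is a hypothetical helper chosen for this sketch.

```python
from fractions import Fraction

def expected_value(values, probabilities):
    """Sum of (value x probability) over all values of a discrete random variable."""
    return sum(x * p for x, p in zip(values, probabilities))

# Number of heads in three tosses of a fair coin.
values = [0, 1, 2, 3]
probabilities = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]

mean_heads = expected_value(values, probabilities)
print(mean_heads)  # 3/2, i.e. 1.5 heads on average
```

Note that 1.5 is not itself a possible value of X; the expected value is a long-run average, not a value the variable must take.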
Expected value is a fundamental idea in the study of probability distributions. For many
years, the concept has been put to considerable practical use in the insurance industry,
and in the last twenty years, it has been widely used by many others who must make
decisions under conditions of uncertainty.
Example 4: The Director of a breast cancer screening clinic wants to know how many
women will be screened on any one day. Past daily records of the clinic indicate that
the number of women screened daily ranges from 100 to 115. The following table
shows the number of days on which each level, from 100 to 115, was reached during
the last 100 days.
Number of women    Number of days this    Probability
screened           level was observed     (relative frequency)
100                 1                     0.01
101                 2                     0.02
102                 3                     0.03
103                 5                     0.05
104                 6                     0.06
105                 7                     0.07
106                 9                     0.09
107                10                     0.10
108                12                     0.12
109                11                     0.11
110                 9                     0.09
111                 8                     0.08
112                 6                     0.06
113                 5                     0.05
114                 4                     0.04
115                 2                     0.02
Total             100                     1.00
Let us see how the Director can use this past record to get information about the
long-run pattern of daily patient screenings.
We hope you have understood the real-life problem. We will try to model this in
statistical language. For that we shall first describe the 'random variable' of interest in
this problem.
The random variable here is the number of patients screened on any given day. Note
that this is a discrete random variable, which can assume only non-negative integer
values with positive probability. The past record of the clinic indicates that the values of
this random variable range between 100 and 115 patients daily. These values are given
in the 1st column of the table. The 2nd column contains the number of days each value
was observed. For example, the value '103' occurred on 5 days. The last column gives
the probability/relative frequency with which a particular value is observed. How do you
calculate these probabilities? Note that the total number of days is 100 and that the
value '100' was observed on only one day.
Probability that the value '100' is observed = 1/100 = 0.01
In this manner you can calculate the other probabilities. (The thumb rule is 'divide each
value in the middle column by 100'). This is how the last column is obtained. Notice
that the sum of the values in the last column is one. The relative frequencies are taken
as probabilities. This is the statistician's empirical approach to assigning probabilities.
Now plot the 'observed values' (i.e. numbers in the 1st column) against the
probabilities in a graph. Then you get the graph given in Fig.5.
Fig.5: Probability distribution for the discrete random variable 'number screened' (horizontal axis: daily number of women screened, 100 to 115)
Mean = (Σ xi pi) / (Σ pi)
where xi's are the observed values in the first column and pi's are the frequencies
in the second column. That is,
Mean = (100 × 1 + 101 × 2 + ... + 115 × 2) / 100
or, equivalently,
Mean = 100 × (1/100) + 101 × (2/100) + ... + 115 × (2/100) = 108.02
So, essentially what we got is that 'if we multiply each number in the 1st column by the
corresponding number in the 3rd column and add up the products' we will get the mean of
the probability distribution. This mean equals the expected value.
The above computation tells us that the expected value of the discrete random variable
'number screened' is about 108 women. What does this mean? It means that over a long
period of time, the number of daily screenings should average about 108.02. This does
not mean that exactly 108 women will visit the clinic every day; it only says that over the long
run, on average, about 108 women visit the clinic per day. This value is a long-run average.
Now, based on this expected value (or the mean), the Director can decide on what
resources/infrastructure are required to get ready for dealing with the expected number of
people.
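The calculation in this example can be checked with a short script. It is a minimal sketch that takes the observed values 100 to 115 and the day counts from the table, converts the counts into relative frequencies, and sums value times probability; the variable names are illustrative only.

```python
from fractions import Fraction

# Observed values (1st column) and number of days each was observed (2nd column).
values = list(range(100, 116))
days = [1, 2, 3, 5, 6, 7, 9, 10, 12, 11, 9, 8, 6, 5, 4, 2]

total_days = sum(days)

# Empirical probabilities: relative frequency of each value (3rd column).
probabilities = [Fraction(d, total_days) for d in days]

# Expected value: multiply each value by its probability and add up.
mean_screened = sum(x * p for x, p in zip(values, probabilities))

print(total_days)            # total number of days in the record
print(float(mean_screened))  # long-run average number screened per day
```

Using exact fractions avoids rounding noise, so the sum of the probabilities is exactly one and the mean can be compared directly with the value quoted in the example.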
***
3.2.2
The variables we discussed in the last section, such as 'number of women screened',
'number of heads obtained', etc., take on values 0, 1, 2, 3, .... These are discrete
random variables. We saw that the values of a discrete random variable are graphed as
separated points and probabilities as lengths of vertical line segments (see Fig.4). We
also saw that a probability distribution of such a random variable contains all possible
values of the random variable, so the sum of all the probabilities must be 1.
On the contrary, a continuous random variable can take on any value in an interval.
For example, it can take all values x in the interval 0 ≤ x ≤ 1, such as
0, 0.01, 0.0002, ..., 0.98, 0.99, 1.00. As we have seen in the discrete case, we have
to assign probabilities to the values of the variable. How do we do this? Since the
possible values of X are uncountable, we cannot really speak of the ith value of X, hence
p(xi) becomes meaningless. So, what we shall do is to replace xi by any interval of the
type (x_{i-1}, x_i), where 0 ≤ x_{i-1} < x_i ≤ 1, and define probability for such intervals.
To define this we consider a function f defined on this interval 0 ≤ x ≤ 1, which assumes
non-negative values such that the total area under the graph of this function and over
the interval is 1. Then we define the probability of any interval (x_{i-1}, x_i) as the area
over this interval and under the graph of f. (See Fig.6(b).) At present you don't have to
worry about these functions. Wherever the need arises we will specify these functions.
The above discussion may appear a little vague to you. You need not worry about this
at this stage. The ideas will be clear to you once we discuss particular cases of
continuous random variables and their probability distributions.
Now let us compare the graphs of probability distributions drawn for a discrete and
continuous random variable.
Fig.6: (a) discrete random variable, (b) continuous random variable
Note that in Fig.6(b) we did not put vertical lines because a vertical distance tells us
what the height of the curve is, and that is not what we want. We want areas under the
portion of curves. You will see later that there are tables available for certain
continuous random variables where these areas are given for different segments. For
example, the probability of the interval 2.2 ≤ x ≤ 4 is
P[2.2 ≤ x ≤ 4] = 0.32
and that of 0
It is important to remember that a line segment has no area (zero area), because a line
has no width. Thus, the vertical line segment at 4 in the distribution of Figure 6 (b) has
an area of zero. This means the probability of a single value 4 is zero. In general, the
probability for an exact (single) value of a continuous random variable is zero.
Consequently, the probability of an interval is the same whether the endpoints are
included or not - because the endpoints have probability zero.
Now we formalise the above discussion and make the following definition.
Note: Those who are familiar with the mathematical concept of integration can easily
see that the above mentioned area is given by the integral of the function, i.e. the
probability of an interval from a to b is the integral of f(x) from x = a to x = b.
We shall now see how we can use the graph of the distribution of a continuous random
variable to study real-life problems.
Fig.7
How can she use the graph of distribution, to find the following: What is the chance that
a participant selected at random will require
i)
{ 0,
3
for2<x<4
elsewhere
and
Similarly,
P[X = 1] = P[{TTH, THT, HTT}]
= P[TTH] + P[THT] + P[HTT]
= 3pq^2,
P[X = 2] = P[HTH] + P[THH] + P[HHT] = 3p^2 q
and
P[X = 3] = p^3
In fact, the coefficient in P[X = r], r = 0, 1, 2, 3, is the number of ways, or
combinations, in which three tosses of a coin can yield r heads and 3 − r tails.
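Now you recall from your school mathematics that the number of combinations of n
objects taken r at a time is calculated by the formula
C(n, r) = n! / ((n − r)! r!)
Using this, the probabilities for three tosses can be written as
P[X = 0] = C(3, 0) p^0 q^(3−0) = q^3
P[X = 1] = C(3, 1) p^1 q^(3−1) = 3pq^2
P[X = 2] = C(3, 2) p^2 q^(3−2) = 3p^2 q
P[X = 3] = C(3, 3) p^3 q^0 = p^3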
This suggests that the probability P[X = r] = p_r for a given r can be calculated using
the formula
p_r = C(n, r) p^r q^(n−r)          (3)
where r = number of successes
n = number of trials made
p = probability of success in a trial
q = 1 − p = probability of failure in a trial.
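One way to convince yourself of this formula is to compare it with a brute-force count over all sequences of heads and tails. The sketch below does this for n = 3 and n = 5; it assumes independent tosses with the same success probability p, and the value p = 1/3 and both function names are arbitrary choices made for the illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n trials: C(n, r) p^r q^(n - r)."""
    q = 1 - p
    return comb(n, r) * p**r * q**(n - r)

def enumerated_probability(r, n, p):
    """Same probability, found by adding up every sequence with exactly r heads."""
    q = 1 - p
    total = 0
    for sequence in product("HT", repeat=n):
        heads = sequence.count("H")
        if heads == r:
            # Independent tosses: multiply p for each head and q for each tail.
            total += p**heads * q**(n - heads)
    return total

p = Fraction(1, 3)  # arbitrary success probability chosen for the check
for n in (3, 5):
    for r in range(n + 1):
        print(n, r, binomial_pmf(r, n, p), enumerated_probability(r, n, p))
```

The enumeration visits 2^n sequences, so it is practical only for small n; the formula gives the same answer without listing them.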
Why don't you check this formula for n=5, i.e. tossing of a coin 5 times. For example,
try this exercise.
E7) In the experiment of tossing a coin 5 times find the probability of getting 3 heads
and 2 tails. Verify that this probability is given by the formula given in Equation
(3).
Let us now sum up the points we have observed in the example, above.
i) each trial has two possible outcomes, viz., a 'success' and a 'failure';
Binomial Distribution
P[X = r] = p_r = C(n, r) p^r q^(n−r)
where r = exact number of successes
n = number of trials made
p = probability of success on a trial
q = 1 − p = probability of failure on a trial and
C(n, r) = n! / ((n − r)! r!) (The earlier example illustrates how we got this formula for p_r).
Such a random variable X is called a binomial random variable and its probability
distribution is called the binomial distribution and is given by Eqn.(3).
Problem 2: A sales representative calls on four potential clients. The probability that
she will obtain an order from each of them is 1/2, and whether or not she obtains an order
from one of them is statistically independent of whether or not she obtains an order
from any of the others. What is the probability distribution of the number of orders she
will receive?
Solution: We note that there are two mutually exclusive events (obtaining an order or
no order) each time she makes a call, and the probability of an order is 1/2 each time.
Also the outcomes of the calls are statistically independent. Therefore this is a situation
where there are four Bernoulli trials and where the probability of a success (an order)
is 1/2. Hence P[X = r] = C(4, r)(1/2)^r (1/2)^(4−r) = C(4, r)/16 for r = 0, 1, 2, 3, 4.
Thus, the probability of no orders is 1/16, of one order is 1/4, of two orders is 3/8, of
three orders is 1/4, and of four orders is 1/16.
Problem 3: It has been claimed that in 60% of all solar heat installations, the utility
bill is reduced by at least one-third. Accordingly, what are the probabilities that the
utility bill will be reduced by one-third in
i) four of five installations?
ii) at least four of the five installations?
Solution: Here the random variable follows the binomial distribution with p = 0.6
and n = 5.
To find (i), we have to calculate P[X = 4], which is given by
P[X = 4] = C(5, 4)(0.6)^4(0.4)
= 0.259
Now to find (ii), we have to find the probability that X is at least 4. This probability is
the sum of the probabilities that X = 4 and X = 5, because 'at least 4' means '4 or more'.
Thus we have to find P[X = 4] + P[X = 5].
P[X = 5] = C(5, 5)(0.6)^5
= 0.078
Hence P[X = 4] + P[X = 5] = 0.259 + 0.078 = 0.337.
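The arithmetic of this problem can be reproduced with a few lines of Python. This is a sketch under the problem's own assumptions (five independent installations, success probability 0.6); `binomial_pmf` is a hypothetical helper implementing the binomial formula.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 5, 0.6

p_four = binomial_pmf(4, n, p)    # (i) exactly four of five installations
p_five = binomial_pmf(5, n, p)    # all five installations
p_at_least_four = p_four + p_five  # (ii) at least four of five

print(round(p_four, 3), round(p_five, 3), round(p_at_least_four, 3))
```

Adding the two probabilities is justified because the events "exactly 4" and "exactly 5" cannot happen together.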
By now you must have got some idea for recognising those situations where we can
apply binomial formula. If we can apply binomial distribution to study a situation, then
we say that the situation can be modelled by binomial distribution.
home in the morning hours, he would certainly buy from Sunil.
Given the situation above, do you still think that the number of sales on a day can
be modelled as a binomially distributed random variable? Give reasons for your
answer.
Once we have the probability distribution, we naturally ask what is the 'expected
value'. We shall see that now.
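E(X) = x0 p0 + x1 p1 + ... + xn pn
where x0, x1, ... are the values assumed by X and p0, p1, ... are the probabilities
associated with these values. For a binomial random variable, xj = j and
pj = C(n, j) p^j (1 − p)^(n−j), i.e.
E(X) = Σ (j = 0 to n) j C(n, j) p^j (1 − p)^(n−j)
     = np Σ (j = 1 to n) C(n − 1, j − 1) p^(j−1) (1 − p)^(n−j)
Those who are familiar with binomial expansion can recognise that the second
expression on the R.H.S. is [p + (1 − p)]^(n−1). Therefore we have
E(X) = np [p + (1 − p)]^(n−1) = np.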
Let us do a problem.
Problem 4: An oil exploration firm plans to drill six holes. It is believed that the
probability that each hole will yield oil is 0.1. Since the holes are in quite different
locations, the outcome of drilling one hole is statistically independent of that of drilling
any of the other holes.
(a) If the firm will be able to stay in business only if two or more holes produce oil,
what is the probability of its staying in business?
(b) Give the expected value of the number of holes that result in oil.
Solution: (a) If the firm can stay in business only if two or more holes produce oil, it
follows that the probability that it will stay in business equals 1 minus the probability
that the number of holes resulting in oil is 0 or 1. Each hole drilled can be viewed as a
Bernoulli trial where the probability of success is 0.1. Thus, the probability that the
number of successes is 0 or 1 equals:
P(0 or 1) = P(0) + P(1)
= [6!/(0! 6!)](0.9)^6 + [6!/(1! 5!)](0.1)(0.9)^5
= 0.531 + 0.354 = 0.885.
Consequently, the probability that the firm will be able to stay in business is
1 − 0.885 = 0.115.
(b) The expected value of the number of holes yielding oil is 6 × 0.1 = 0.6,
since n = 6 and p = 0.1.
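Both parts of this problem can be checked numerically. The sketch below assumes the same model as the solution (six independent holes, success probability 0.1) and uses a hypothetical helper `binomial_pmf`; it also confirms that summing r × P[X = r] gives the same answer as the shortcut np.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 6, 0.1

# (a) Staying in business needs two or more successes: 1 - P(0) - P(1).
p_zero_or_one = binomial_pmf(0, n, p) + binomial_pmf(1, n, p)
p_stay_in_business = 1 - p_zero_or_one

# (b) Expected number of holes yielding oil, from the definition and from np.
mean_by_sum = sum(r * binomial_pmf(r, n, p) for r in range(n + 1))
mean_by_formula = n * p

print(round(p_zero_or_one, 4), round(p_stay_in_business, 4))
print(mean_by_sum, mean_by_formula)
```

Carried to more decimal places, the probability of staying in business comes out near 0.1143; the figure 0.115 in the solution arises from rounding the intermediate sum to 0.885 before subtracting.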
A problem with the binomial distribution is that if the number of trials 'n' is very large
and probability 'p' is very small, computation of P[X = r] is cumbersome.
The distribution which we introduce in the next section may be useful in such a
situation.
Poisson: a nineteenth-century French mathematician.
The number of arrivals in a time interval does not depend on what happened in
previous time intervals.
3) It is extremely unlikely that there will be more than one arrival in a very short
interval of time. That means that it is practically impossible for more than one customer to get
through the revolving entrance door in a fraction of a second.
Under these assumptions we find the required probability. For this we make use of the
following formula, known as the Poisson formula, given by
P(x) = (λt)^x e^(−λt) / x!
where λ is the Greek letter lambda, which denotes the average arrival rate per unit of
time, t is the number of units of time and x is the number of arrivals in t units of time.
Also we know that λ = 72 arrivals per hour is a constant for this situation. Since in the
question λ is given per hour, to standardise the unit we have to express t in hours,
i.e. 60 minutes = 1 hour
Therefore t = 3 minutes = 3/60 = 1/20 hours
Then
λt = 72 × 1/20 = 3.6
To find P(4), we use Table 2, given in the Appendix. This table shows P(x) for
selected values of λ.
From the table, we get
P(4) = 0.191
What does this value 0.191 specify? This tells us that if the arrivals are arrivals of
customers at a bank, there is a 19.1% chance that exactly four customers will arrive in the
next 3 minutes.
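If the table is not at hand, the same probability can be computed directly. The sketch below assumes the Poisson formula in its usual form, P(x) = (λt)^x e^(−λt) / x!, with λ = 72 arrivals per hour and t = 3 minutes expressed in hours; `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial

def poisson_pmf(x, rate, t):
    """Probability of exactly x arrivals in t units of time at the given average rate."""
    mean = rate * t  # average number of arrivals in the whole interval
    return mean**x * exp(-mean) / factorial(x)

rate_per_hour = 72
t_hours = 3 / 60  # 3 minutes, converted to hours to match the unit of the rate

p_four = poisson_pmf(4, rate_per_hour, t_hours)
print(round(p_four, 3))
```

The only care needed is that the rate and the time interval are expressed in the same unit before they are multiplied.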
If we vary the values of x and t, we can get different probabilities. This gives the
probability distribution which is called Poisson probability distribution.
In the above discussion we saw that the Poisson formula is applicable only if certain
conditions are specified. We re-state the formula now.
Poisson Formula
The Poisson Formula is given by
P(x) = (λt)^x e^(−λt) / x!
1)
2)
3) It rarely happens that there will be more than one occurrence in a very short
time interval.
Poisson Distribution
A distribution having probabilities given by the Poisson Formula is called a Poisson
distribution.
Now let us see some situations where we can apply Poisson distribution . Here is an
example.
Problem 5: Calls at a telephone switch board occur at an average rate of six calls per
10 minutes. Suppose the operator leaves for a 5-minute coffee break, what is the
probability that exactly two calls come in (and so go unanswered)while the operator is
away?
Solution: Here you can check that the conditions 1, 2 and 3 of the Poisson formula are
satisfied in this case. Therefore we can use the formula. Note that here λ = 6/10 calls
per minute and t = 5 minutes, so that λt = 3. Hence the required probability P(2) is given by
P(2) = 3^2 e^(−3) / 2! = 0.224
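The value of P(2) in this problem can be verified as follows. The sketch assumes the usual Poisson formula with mean λt = (6/10) × 5 = 3 calls during the break; the helper `poisson_pmf` and the cut-off of 50 terms used for the sanity check are choices made for this illustration.

```python
from math import exp, factorial

def poisson_pmf(x, mean):
    """Probability of exactly x occurrences when the average number is `mean`."""
    return mean**x * exp(-mean) / factorial(x)

calls_per_minute = 6 / 10
break_minutes = 5
mean_calls = calls_per_minute * break_minutes  # average calls during the break

p_two_calls = poisson_pmf(2, mean_calls)
print(round(p_two_calls, 3))

# Sanity check: the probabilities over x = 0, 1, 2, ... add up to (almost exactly) one.
total_probability = sum(poisson_pmf(x, mean_calls) for x in range(50))
print(total_probability)
```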
ii) can we use the Poisson formula to find this? If yes calculate the probability.
In the above exercise we have seen that the difference between the two calculations is
very small.
The Poisson formula can be used to approximate the binomial probability of r successes
in n binomial trials in the situations where n is large and probability of success 'p' is
small.
For instance, suppose we are interested in the number of road accidents in a metropolitan
city or the daily number of machine breakdowns in a workshop etc., during a specified
interval of time. Suppose this interval is divided into a large number of subintervals,
each so small that at most one occurrence can happen within it. Thus we may look upon
each subinterval as a trial. Each
trial leads to a "success" if the occurrence happens during that subinterval and to a
"failure" if the occurrence does not happen. Assume that the occurrences are
independent of each other. Hence, the total number of occurrences can be considered
to be distributed binomially, the total number of trials being equal to the number of
subintervals, which we have ensured to be large; also, the length of each subinterval
being small, the probability of an arrival (success) is likely to be small.
Thus we have seen that there are situations where both binomial and Poisson are
applied. The rule of thumb followed by most statisticians is that if n ≥ 20 and
p ≤ 0.05, then the Poisson formula can be used to calculate binomial probability.
It is clear that the Poisson calculation is simpler than the binomial calculation. An
advantage of the Poisson distribution, if it is applicable, is that it has only one
parameter, λ, whereas the binomial distribution has two parameters, n and p;
consequently, Poisson probabilities can be tabulated more compactly than binomial
probabilities. For example, the Poisson probability P(3) is the same for n = 200,
p = 0.01 as it is for n = 100, p = 0.02, and for any other pair of n and p values whose
product is λ = np = 2.
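How good the approximation is can be seen by computing both sides. The sketch below compares the exact binomial probability of 3 successes with the Poisson value for mean np = 2, for the two (n, p) pairs mentioned above; the helper names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(r, n, p):
    """Exact binomial probability of r successes in n trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def poisson_pmf(x, mean):
    """Poisson probability of x occurrences with the given mean."""
    return mean**x * exp(-mean) / factorial(x)

# Both pairs have the same product np = 2, so they share one Poisson value.
poisson_value = poisson_pmf(3, 2)
binomial_200 = binomial_pmf(3, 200, 0.01)
binomial_100 = binomial_pmf(3, 100, 0.02)

print(round(poisson_value, 4))
print(round(binomial_200, 4), round(binomial_100, 4))
```

The two binomial values differ slightly from each other and from the single Poisson value; the larger n and smaller p pair lies closer to it, which is what the rule of thumb leads one to expect.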
By now you must have got a fairly good idea where the Poisson formula can be used.
In all the situations we have considered so far, we have calculated the probability over an
interval of time. But there are situations where we need to calculate probability over a
region (or space) or something else as our physical reference. In the following example
we give such a situation and illustrate how to use the Poisson distribution to calculate the
probability.
Example 5: During the Second World War, South London was hit by v-3 rockets. Later a study
was conducted on which regions were not affected by the rocket hits. Let us see how they
used the Poisson distribution for this study.
They took λ as the average number of hits per unit area (note that earlier in the formula
λ was the average rate per unit time). In place of the variable 't' they used the variable 'v',
the number of units of area, and x denotes the number of hits. Then they assumed that all the
conditions required for the Poisson formula hold in this case. With these assumptions,
they calculated the probability using the formula
P(x) = (λv)^x e^(−λv) / x!
According to the problem stated, they have to calculate the probability of 'no hit' per unit
area. That is, x = 0 and v = 1, so that λv = λ. Now, to calculate λ, what they did
was this: they divided the area into 576 areas of equal size (the number 576 was chosen based
on some other study), and they found that there were 537 hits.
Therefore the average number of hits per unit area is λ = 537/576 = 0.9323
Then the required probability is
P(0) = e^(−0.9323) = 0.3936
This means that if we take one region, then the probability that the region is not hit by
the rocket is 0.3936. Hence, out of 576 regions, the number of regions not hit by the
rocket is given by
576 × 0.3936 ≈ 226
Now, the actual number got from the record was that there are 229 regions not hit by the
rocket. This number is quite close to 226. This shows that the values got using the Poisson
formula are very close to the actual values.
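The figures in this example can be reproduced with a short calculation. The sketch assumes the Poisson probability of zero hits, e^(−λ), with λ = 537/576 taken from the example; the variable names are illustrative.

```python
from math import exp

regions = 576              # equal-sized areas into which the region was divided
hits = 537                 # total number of hits recorded
observed_unhit_regions = 229  # regions with no hit, from the record

hits_per_region = hits / regions  # the Poisson parameter (average hits per area)

# Poisson probability of x = 0 hits in one region: e^(-mean).
p_no_hit = exp(-hits_per_region)

expected_unhit_regions = regions * p_no_hit

print(round(hits_per_region, 4))
print(round(expected_unhit_regions, 1), observed_unhit_regions)
```

Without intermediate rounding the expected count comes out a little under 227, which is still within about three regions of the 229 actually observed.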
Thus we saw that the Poisson distribution is very effective in studying various real-life
problems where the occurrence is very rare.
One of the main disadvantages of this distribution is that it is applicable only in
situations where the outcomes are independent i.e. each outcome is independent of what
happened previously.
In the next section we shall discuss another standard distribution.
Example 6: A train is likely to arrive at a station at any time between 6.10 p.m. and
6.40 p.m. The time the train reaches, measured in minutes, after 6 p.m. is a random
variable X. Here X can take any value between 10 and 40 minutes. Therefore the
sample space is the interval (10,40). It is reasonable to assume that the likelihood for X
taking any value between 10 and 40 is equal. So if we take subintervals of equal
lengths, then the probability will be the same. The distribution corresponding to this r.v.
is uniform over the interval (10,40).
***
Example 7: An office fire drill is scheduled for a particular day, and the fire alarm is
likely to ring at any time between 9 a.m. and 5 p.m. The time the fire alarm starts,
measured in minutes after 9 a.m., is therefore a random variable which is equally likely to take
any value between 0 and 480 (8 hours = 8 × 60 = 480 minutes). The distribution
corresponding to this r.v. is uniform.
***
Now, why don't you look for such sample spaces on your own? Try this exercise now.
E13) Verify whether the following situations can be described by uniform distribution
or not?
a)
b)
Next we will see how we can define (calculate) the probabilities for this distribution. As
we have seen in Sec. 3.2, in the case of a continuous distribution, the probabilities are
calculated using a function called the 'probability density function' (p.d.f.). The p.d.f. for
the uniform distribution over an interval (a, b) is given as follows.
$$f(x) = \begin{cases} \dfrac{1}{b-a}, & a < x < b \\ 0, & \text{otherwise} \end{cases}$$
Now let us see how we calculate different probabilities for this distribution. As stated in
Sec. 3.2, for a continuous r.v., we calculate the probability of an interval rather than a
point. For example, what will be P[c < X < d] where a < c < d < b? We have seen
that it is given by the area above this interval and under the graph. The area is shown in
Fig.9.
So, essentially it is the area of the rectangle with length d − c and height 1/(b − a), i.e.
$$P[c < X < d] = (d - c) \times \frac{1}{b-a}$$
For example, if we take the fire-drill situation in Example 7, let us find the probability that the
alarm rings between 1 p.m. and 2 p.m. Here the p.d.f. is
$$f(x) = \begin{cases} \dfrac{1}{480}, & 0 < x < 480 \\ 0, & \text{otherwise} \end{cases}$$
To find the required probability, you have to find the time elapsed in minutes between 9
a.m. and 1 p.m. and between 9 a.m. and 2 p.m.
For 1 p.m. this is 4 × 60 = 240 minutes.
Similarly, for 2 p.m., it is 5 × 60 = 300 minutes.
Therefore you have to calculate the probability P[240 < X < 300]. This is given by the
area under the p.d.f. between 240 and 300.
This area is the rectangle with base 60 (= 300 − 240) and height 1/480, so
$$P[240 < X < 300] = 60 \times \frac{1}{480} = 0.125$$
That is, there is a 12.5% chance that the alarm sounds between 1 p.m. and 2 p.m. [Some
of you may think that this fact was rather obvious from the statement of the problem
itself. But we have given this situation as an illustrative example. There are situations
which are complicated, where we can easily calculate the probability using this
distribution.]
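The rectangle-area rule for the uniform distribution is easy to express in code. The following is a minimal sketch; the helper name `uniform_interval_prob` is hypothetical, and it assumes the interval (c, d) lies inside (a, b).

```python
def uniform_interval_prob(c, d, a, b):
    """P[c < X < d] for X uniform on (a, b), assuming a <= c <= d <= b."""
    if not (a <= c <= d <= b):
        raise ValueError("need a <= c <= d <= b")
    # Area of the rectangle: base (d - c) times height 1 / (b - a).
    return (d - c) / (b - a)

# Fire alarm: X is the time in minutes after 9 a.m., uniform on (0, 480).
# 1 p.m. corresponds to 240 minutes and 2 p.m. to 300 minutes.
p_alarm = uniform_interval_prob(240, 300, 0, 480)
print(p_alarm)  # 0.125
```

Checking the interval first keeps the function from silently returning a value greater than 1 or less than 0 when the arguments are given in the wrong order.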
Next we state the expected value of this distribution: for X uniform over (a, b), E(X) = (a + b)/2, the mid-point of the interval.
2) The probability that the train is late, but less than 16 minutes late.
Next we shall discuss another continuous distribution which is widely used in statistical
problems.
A random variable X is said to have a normal distribution if its p.d.f. is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$
where $\mu$ is a real number lying between $-\infty$ and $\infty$ and $\sigma$ is a real number lying
between 0 and $\infty$.
The function f(x) may look rather formidable to you at first sight. At this stage we just
ask you to notice that it involves two parameters, $\mu$ and $\sigma$. Corresponding to each pair
($\mu$, $\sigma$), we get a distribution. Therefore there is a whole family of distributions, each
one specified by a particular pair of values for $\mu$ and $\sigma$.
The most important characteristic of this distribution is that the graph of the pdf, f(x), for a
particular value of $\mu$ and $\sigma$ is bell-shaped as shown in Fig.11.
The probability density function is also symmetrical about the mean $\mu$. The word
symmetrical means that the two halves of the curve are mirror images (see Fig.11). In
Fig.11 you can note that if we place a mirror on the dashed vertical line (which occurs at
75 in Fig.11) then the mirror image of the portion on the left is the same as the portion
on the right side.
Both $\mu$ and $\sigma$ have a 'nice' interpretation. We have already said that the pdf is
symmetric about $\mu$, so it is no surprise that $\mu$ is the mean of the distribution. The
other constant, $\sigma^2$, dictates how spread out and flat the 'bell-shape' is, and in fact $\sigma^2$ is
the variance of the normal distribution.
As an illustration, Fig.12 shows the normal pdfs for the following values of $\mu$ and $\sigma$:
A: $\mu = 10$, $\sigma = 1$
B: $\mu = 10$, $\sigma = 2$
C: $\mu = 10$, $\sigma = 3$
D: $\mu = 15$, $\sigma = 1$
Fig.12
Pdfs A, B and C all have the mean 10 and so they are all centred at x = 10. Of these
three curves, C has the largest variance and so is the most 'spread out'. Curve B has a
smaller variance and so is less spread out, and curve A has the smallest variance and so
is the most 'squeezed in'. Curves A and D have the same variance and so they have
exactly the same shape, but they have different means so they are centred at x = 10 and
x = 15 respectively.
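To see numerically how the standard deviation controls the shape, the sketch below evaluates the normal pdf at the centre of each of the four curves. The function name `normal_pdf` and the dictionary `curves` are our own; the parameter pairs are the ones shown for curves A to D (mean 10 with standard deviations 1, 2 and 3, and mean 15 with standard deviation 1).

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# (mean, standard deviation) for the four curves discussed above
curves = {"A": (10, 1), "B": (10, 2), "C": (10, 3), "D": (15, 1)}

for name, (mu, sigma) in curves.items():
    peak = normal_pdf(mu, mu, sigma)  # the pdf is highest at x = mu
    print(name, "centred at", mu, "peak height", round(peak, 4))
```

The larger the standard deviation, the lower the peak, because the total area under each curve must remain 1. Curves A and D have the same peak height and differ only in where they are centred.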
Some notation
Fig.13
Notice that most of the area under the standard normal curve lies between -3 and +3.
Calculating Probabilities
The normal distribution is continuous and so the probability that the random variable X
lies in the interval (a, b) is calculated by obtaining the area under the pdf
curve between a and b.
Fig.14
Unfortunately there are no 'nice' formulae for calculating such areas. But there are
tables available from which we can find the area. Statistical software is also
available with which we can calculate the area.
Because the number of possible values for $\mu$ and $\sigma$ is unlimited, the number of different
normal distributions is unlimited. However, probabilities for every normal distribution
can be obtained from a table of probabilities for the standard normal distribution.
We shall first discuss how to use the table for calculating probabilities for a standard
normal distribution. Then we shall discuss how to use this to find the probability for
any normal distribution.
In the following example we illustrate how we use the table to calculate different
probabilities.
ii) Similarly,
P[−0.34 < Z < 0.62] = F(0.62) − F(−0.34)
= F(0.62) − [1 − F(0.34)], by the identity F(−z) = 1 − F(z)
= 0.7324 − (1 − 0.6331)
= 0.3655.
iii) From the previous unit (Unit 2), you have already learnt that the probability of the
complement of an event is one minus the probability of the event, that is,
P[Z > a] = 1 − P[Z ≤ a].
Hence we have
P[Z > 0.85] = 1 − P[Z ≤ 0.85]
= 1 − F(0.85)
= 0.1977.
v)
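Table look-ups such as those in parts ii) and iii) can also be reproduced with software. The sketch below is one way to do it in Python, using the error function from the standard library; the function name `std_normal_cdf` is our own, and it plays the role of F(z) in the working above.

```python
import math

def std_normal_cdf(z):
    """F(z) = P[Z <= z] for a standard normal variable Z, computed via erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# ii) P[-0.34 < Z < 0.62] = F(0.62) - F(-0.34)
p_between = std_normal_cdf(0.62) - std_normal_cdf(-0.34)

# iii) P[Z > 0.85] = 1 - F(0.85)
p_above = 1.0 - std_normal_cdf(0.85)

print(round(p_between, 4))
print(round(p_above, 4))
```

Because the table values are rounded to four decimal places before being subtracted, the software result for part ii) may differ from the table-based answer in the last digit.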
In the following exercise we ask you to find certain probabilities using the normal
distribution table.
E16) If a random variable has the standard normal distribution, find the probability that
it will take on a value
i) less than 1.50
ii) less than - 1.20
iii) greater than - 1.75
E17) A filling machine is set to pour 952 ml (millilitres) of oil into bottles. The
amounts of fill are normally distributed with a mean of 952 ml. and a standard
deviation of 4 ml. Use the standard normal table to find the probability that a
bottle contains oil between 952 and 9$6 ml.
Next we shall see how to use the standard normal probability table to calculate
probabilities for any normal distribution.
Standardising
Any normal random variable X, which has mean $\mu$ and variance $\sigma^2$, can be standardised
as follows.
Take the variable X, and
i) subtract its mean, $\mu$, and then
ii) divide by its standard deviation, $\sigma$.
We will call the result Z, so
$$Z = \frac{X - \mu}{\sigma}$$
For example, suppose, as earlier, that X is an individual's IQ score and that it has a
normal distribution with mean $\mu = 100$ and standard deviation $\sigma = 15$. To standardise
X, we subtract 100 and divide by 15:
$$Z = \frac{X - 100}{15}$$
In this way every value of X has a corresponding value of Z. For instance, when
X = 130, Z = (130 − 100)/15 = 2 and when X = 90, Z = (90 − 100)/15 = −0.67.
Calculating probabilities
With reference to the problem of IQ score, suppose we want to find the probability that
an individual's IQ score is less than 85, i.e. $P[X < 85]$. The corresponding area under
the pdf $N(100, 15^2)$ is shown in Fig.15.
We cannot use normal tables directly because these give $N(0,1)$ probabilities. Instead,
we will convert the statement $X < 85$ into an equivalent statement which involves the
standardised score, $Z = \frac{X - 100}{15}$, because we know it has a standard normal distribution.
We start with $X < 85$. To turn X into Z we must standardise X, but to ensure that we
preserve the meaning of the statement we must treat the other side of the inequality in
exactly the same way. (Otherwise we will end up calculating the probability of another
statement, not $X < 85$.) 'Standardising' both sides gives
$$\frac{X - 100}{15} < \frac{85 - 100}{15}$$
The left hand side is now a standard normal random variable and so we can call it Z,
and we have
$$Z < -1$$
$P[Z < -1]$ is just a standard normal probability and so we can look it up in Table 1 in
the usual way, which gives 0.1587. We get that $P[X < 85] = 0.1587$.
This process of rewriting a probability statement about X in terms of Z is not difficult
if you systematically write down what you are doing at each stage. We would lay
out the working we have just done for $P[X < 85]$ as follows.
X has a normal distribution with mean 100 and standard deviation 15. Let us find the
probability that X is less than 85.
$$P[X < 85] = P\left[\frac{X - 100}{15} < \frac{85 - 100}{15}\right] = P[Z < -1] = 0.1587$$
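The standardising step and the table look-up for the IQ example can be sketched in Python as follows. The helper names `standardise` and `std_normal_cdf` are our own, and the error function from the standard library is used here in place of the printed table.

```python
import math

def standardise(x, mu, sigma):
    """Convert a value x of X ~ N(mu, sigma^2) into the corresponding z value."""
    return (x - mu) / sigma

def std_normal_cdf(z):
    """P[Z <= z] for a standard normal variable Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 100, 15            # IQ scores: mean 100, standard deviation 15

z = standardise(85, mu, sigma)
p = std_normal_cdf(z)          # P[X < 85] = P[Z < z]

print(z)            # -1.0
print(round(p, 4))  # 0.1587
```

The same two functions reproduce the z values quoted earlier for IQ scores of 130 and 90.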
Problem 6: For each of these write down the equivalent standard normal probability.
a) The number of people who visit a historic monument in a week is normally
distributed with a mean of 10,500 and a standard deviation of 600. Consider the
probability that fewer than 9000 people visit in a week.
b) The number of cheques processed by a bank each day is normally distributed with
a mean of 30,100 and a standard deviation of 2450. Consider the probability that
the bank processes more than 32,000 cheques in a day.
Solution:
a) Here we want to find the standard normal probability corresponding to the
probability $P[X < 9000]$. We have
$$P[X < 9000] = P\left[\frac{X - 10500}{600} < \frac{9000 - 10500}{600}\right] = P[Z < -2.5]$$
b) Here we want to find the standard normal probability corresponding to the
probability $P[X > 32000]$. We have
$$P[X > 32000] = P\left[\frac{X - 30100}{2450} > \frac{32000 - 30100}{2450}\right] = P[Z > 0.78]$$
Note: Probabilities like $P[a < X < b]$ can be calculated in the same way. The only
difference is that when X is standardised, similar operations must be applied to both a
and b. That is, $a < X < b$ becomes
$$\frac{a - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{b - \mu}{\sigma}$$
which is
$$\frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma}$$
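The rule for an interval can be sketched in the same way: standardise both end points and take the difference of the two standard normal probabilities. The function names below are our own, and the interval 85 to 130 for the IQ score is chosen only as an illustration; it is not a problem from the text.

```python
import math

def std_normal_cdf(z):
    """P[Z <= z] for a standard normal variable Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_interval_prob(a, b, mu, sigma):
    """P[a < X < b] for X ~ N(mu, sigma^2), obtained by standardising a and b."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return std_normal_cdf(z_b) - std_normal_cdf(z_a)

# Illustrative question: probability that an IQ score lies between 85 and 130,
# i.e. P[-1 < Z < 2] after standardising with mean 100 and s.d. 15.
p_iq = normal_interval_prob(85, 130, 100, 15)
print(round(p_iq, 4))  # 0.8186
```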
3.7 SUMMARY
In this unit we have covered the following points
1) A random variable is a variable that takes on different numerical values according
to chance outcomes
2)
3)
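A probability distribution gives the probabilities with which the random variable
takes on the various values in its range.
4) The p.d.f. of the uniform distribution over the interval (a, b) is
$$f(x) = \begin{cases} \dfrac{1}{b-a}, & \text{if } a < x < b \\ 0, & \text{elsewhere} \end{cases}$$
and, for a < c < d < b,
$$P[c < X < d] = \frac{d - c}{b - a}$$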
function defined by
E1) a) If X denotes the number of correct answers, then X is the random variable for
this situation.
b)
c) P[X = 40] means the probability that the number of correct answers is 40.
E3) Let X denote the amount you win or lose. Then X takes the values Rs. 50, 0 or −10
(a loss of Rs. 10). The probability that both the marbles are green is 1/9, i.e.
P[X = 50] = 1/9. The probability that both the marbles are red is 4/9, i.e.
P[X = −10] = 4/9.
The probability that the marbles are of different colours is 4/9, i.e. P[X = 0] =
4/9.
Thus the probability distribution is as given in the following table.
Amount (in Rs.) won (+) or lost (−):    50     0      −10
Probability:                            1/9    4/9    4/9
This means that he can expect that on an average 2.5 cars will be sold per day in the
long run (or, more precisely, 5 cars will be sold over 2 days).
Fig.16
ii) The area under the graph and above the interval [2,4] is the area of the
rectangle shown in Fig.16 which is given by
Area = 2 × 1/2 = 1.
∴ f defines a probability density function of X.
$$P[X = 3] = 10\,p^3 q^2.$$
If we apply the formula in Equation (3) to find P[X = 3], then we substitute
r = 3, n = 5 in the formula, and we get
$$P[X = 3] = C(5, 3)\,p^3 q^2$$
E8) This situation follows binomial distribution with n=4 and p = #$ = $.The
random variable X is the number of seeds that germinate. We have to calculate the
probability that exactly two of the four seeds will germinate, that is, P[X = 2]. By
applying the binomial formula, we get
E9) If Xi denotes the random variable that the ith customer buys the paper on a given
day, then the Xi's may not be identically distributed. Therefore the Xi's may not be
binomially distributed. But if the customers have the same business
activities or the same kind of habits or working nature, then we can expect that the Xi's
will be identically distributed. In such a situation we can expect that the Xi's will
follow the binomial distribution.
E11) Since the problem deals with the receipt of bad cheques, which is an event with
rare occurrence over an interval of time (a day, in this case), we can apply the Poisson
distribution.
Since on an average 6 bad cheques are received per day, $\lambda = 6$.
Substituting $\lambda = 6$ and $x = 4$ in the Poisson formula, we get
$$P[X = 4] = \frac{6^4 e^{-6}}{4!} = \frac{1296 \times (0.0025)}{24} = 0.135$$
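As a check on the hand calculation, the same probability can be evaluated without rounding the exponential. This is only a sketch; the variable names are our own, and the values 6 and 4 are those used in the solution.

```python
import math

rate = 6   # average number of bad cheques received per day
x = 4      # number of bad cheques whose probability is required

p_four = math.exp(-rate) * rate ** x / math.factorial(x)
print(round(p_four, 4))
```

The hand calculation rounds the exponential term to 0.0025, so its answer differs from the unrounded value in the third decimal place.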
E12) Note that here the experiment or trial is 'checking the machine for its functioning'.
There are 20 trials and each trial is identically distributed with probability 0.2.
i) The trials are also independent. Therefore we can apply the binomial formula.
We are required to calculate P[X = 3]. Then
Also note that we can make the subintervals so small that at most one
machine goes out of service. Thus conditions (2) and (3) are satisfied. Therefore we
can apply the Poisson formula, and we get
$$P[X = 3] = \frac{(0.4)^3 e^{-0.4}}{3!}$$
E14) i) The mean, 10, is the centre point of a line segment whose length is the range,
1.8 kg. Hence, the line segment extends ½ × (1.8) = 0.9 kg to the left and to
the right of 10, i.e. from 9.1 to 10.9 kg. Hence the smallest weight is 9.1 kg and
the largest weight is 10.9 kg.
ii) We are required to calculate P[9 < X < 10.5].
= 0.833.
That is, the required probability is 0.833.
Fig.17
To find the required probability we note that X can take values in the interval
(5.28, 5.40). Hence the required probability is
$$P[5.28 < X < 5.40] = 0.12 \times \frac{1}{32} = \frac{0.03}{8} = 0.00375$$
E16) a) 0.9332
b) 0.1151
c) 0.9599
i)
E18) Let the time of arrival in minutes past 1800 hrs be X. Then X follows the normal
distribution $N(10, 10^2)$.
a) The required probability is P[X < 181. The standard probability
corresponding to this is
= F(0.01)
= 0.5040.
b)
3.9 APPENDIX
Cumulative Standard Normal Probabilities P[Z < a] where Z ~ N(0,1).