CHAPTER ONE
Review of Probability Theory
The outcome of an experiment may be a simple choice between two possibilities; it may be the result of a direct measurement or count; or it may be an answer obtained after extensive measurements and calculations.
A sample space is said to be a discrete sample space if it has finitely many or a countable infinity
of elements. If the elements (points) of a sample space constitute a continuum (for example, all the points on a line, or all the points in a plane), the sample space is said to be a continuous sample space. In statistics, any subset of a sample space is called an event.
Discuss unions, intersections and complements of events. Also discuss Venn diagrams and tree diagrams.
A1. 0 ≤ P(A) ≤ 1 for each event A in S.
A2. P(S) = 1.
A3. If events A1, A2, A3, … are mutually exclusive, P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + ⋯; we call this axiom the special addition rule. The general addition rule for the probability of any two events A and B in S is given as
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)  (prove this as homework).
Conditional probability: if A and B are any two events in S and P(A) ≠ 0, the conditional probability of B given A is
P(B|A) = P(A ∩ B)/P(A).
General multiplication rule of probability: if A and B are any two events in S, then
P(A ∩ B) = P(A)·P(B|A) = P(B)·P(A|B).
Bayes’ Theorem
The general multiplication rules are useful in solving many problems in which the ultimate
outcome of an experiment depends on the outcomes of various intermediate stages. Suppose, for
instance, that an assembly plant receives its voltage regulators from three different suppliers,
60% from supplier B1, 30% from supplier B2, and 10% from supplier B3. Also suppose that
95% of the voltage regulators from B1, 80% of those from B2, and 65% of those from B3
perform according to specifications. We would like to know the probability that any one voltage
regulator received by the plant will perform according to specification.
If A denotes the event that a voltage regulator received by the plant performs according to
specifications, and B1, B2, and B3 are the events that it comes from the respective suppliers, we
can write A = (A ∩ B1) ∪ (A ∩ B2) ∪ (A ∩ B3), where B1, B2, and B3 are mutually exclusive events of which one must occur. It follows that A ∩ B1, A ∩ B2, and A ∩ B3 are also mutually exclusive, and we get
P(A) = P(B1)·P(A|B1) + P(B2)·P(A|B2) + P(B3)·P(A|B3) = 0.6(0.95) + 0.3(0.80) + 0.1(0.65) = 0.875.
[Tree diagram for the example dealing with the three suppliers of voltage regulators: branches from the start to B1, B2, B3 with probabilities 0.6, 0.3, 0.1, and from each Bi to A with probabilities 0.95, 0.80, 0.65, respectively.]
If B1, B2, …, Bn are mutually exclusive events of which one must occur, then
P(A) = Σ_{i=1}^{n} P(Bi)·P(A|Bi)  (rule of total probability),
and, for r = 1, 2, …, n,
P(Br|A) = P(Br)·P(A|Br) / Σ_{i=1}^{n} P(Bi)·P(A|Bi)  (Bayes' theorem).
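As an illustrative check (not part of the original notes), the total-probability computation and the corresponding Bayes posterior probabilities for the three suppliers can be reproduced in a few lines of Python:

    # Rule of total probability and Bayes' theorem for the voltage-regulator example.
    priors = {"B1": 0.60, "B2": 0.30, "B3": 0.10}        # P(Bi): share of each supplier
    likelihoods = {"B1": 0.95, "B2": 0.80, "B3": 0.65}   # P(A | Bi): meets specifications

    # P(A) = sum_i P(Bi) * P(A | Bi)
    p_A = sum(priors[b] * likelihoods[b] for b in priors)
    print(f"P(A) = {p_A:.3f}")                           # 0.875

    # Bayes' theorem: P(Bi | A) = P(Bi) * P(A | Bi) / P(A)
    for b in priors:
        print(f"P({b} | A) = {priors[b] * likelihoods[b] / p_A:.3f}")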
In most statistical problems we are concerned with one number or a few numbers that are
associated with the outcomes of experiments. In the inspection of a manufactured product we
may be interested only in the number of defectives; in the analysis of a road test we may be
interested only in the average speed and the average fuel consumption; and in the study of the
performance of a rotary switch we may be interested only in its lubrication, electrical current,
and humidity. All these numbers are associated with situations involving an element of chance; in other words, they are values of random variables.
In the study of random variables we are usually interested in their probability distributions,
namely, in the probabilities with which they take on the various values in their range.
Random Variables: The random variable may be thought of as a function defined over the
elements of the sample space. This is how we define the random variables in general; they are
functions defined over the elements of a sample space. A random variable is any function that assigns a numerical value to each possible outcome. The numerical value should be chosen to
quantify an important characteristic of the outcome. Random variables are denoted by capital
letters X, Y and so on, to distinguish them from their possible values given in lower case x, y.
To denote the values of a probability distribution, we shall use such symbols as f(x), g(x), h(y),
and so on. Strictly speaking, it is the function f(x) = P(X=x) which assigns probability to each
possible outcome x that is called the probability distribution.
Random variables are usually classified according to the number of values they can assume. In
this chapter we shall limit our discussion to discrete and continuous univariate random variables.
For example, f(x) = 1/6 for x = 1, 2, …, 6 gives the probability distribution for the number of points we roll with a balanced die. Of course,
not every function defined for the values of a random variable can serve as a probability
distribution. Since the values of probability distributions are probabilities and one value of a
random variable must always occur, it follows that if f(x) is a probability distribution, then
f(x) ≥ 0 for all x, and Σx f(x) = 1.   (1)
The probability distribution of a discrete random variable X is a list of the possible values of X
together with their probabilities
f(x) = P(X = x) for each possible value x of X.   (2)
The probability distribution always satisfies the conditions (1). As we see later in this chapter,
there are many problems in which we are interested not only in the probability f(x) that the value
of a random variable X is x, but also in the probability F(x) that the value of a random variable is less
than or equal to x. We refer to the function F(x) as the cumulative distribution function or just
distribution function of the random variable.
F(x) = P(X ≤ x) = Σ_{t ≤ x} f(t)  for −∞ < x < ∞.   (3)
Many statistical problems deal with the situations referred to as repeated trials. For example, we
may want to know the probability that one of five rivets will rupture in a tensile test, the
probability that nine of ten DVD players will run at least 1,000 hours, the probability that 45 of
300 drivers stopped at a roadblock will be wearing seat belts, or the probability that 66 of 200
television viewers (interviewed by a rating service) will recall what products were advertised on a
given program. To borrow from the language of games of chance, we might say that in each of
these examples we are interested in the probability of getting x successes in n trials, or in other
words, x successes and n-x failures in n attempts.
There are common features to each of the examples. They are all composed of a series of trials
which we refer to as Bernoulli trials if the following assumptions hold.
There are only two possible outcomes for each trial (arbitrarily called success and failure,
without inferring that a success is necessarily desirable).
The probability of success is the same for each trial.
The outcomes from different trials are independent.
If these assumptions cannot be met, the theory we develop does not apply. In the problems we study in
this section, we add the additional assumption that the number of trials is fixed in advance.
Let X be the random variable that equals the number of successes in n trials. To obtain
probabilities concerning X, we proceed as follows: If p and 1-p are the probabilities of success
and failure on any one trial, then the probability of getting x successes and n-x failures, in some
specific order, is
p^x·(1 − p)^(n−x) = p^x·q^(n−x), where q = 1 − p.   (4)
Clearly, in this product of p’s and q’s there is one factor p for each success, one factor q for each
failure. The x factors p and n-x factors q are all multiplied together by virtue of the generalized
multiplication rule for more than two independent events. Since this probability applies to any
point of the sample space that represents x successes and n-x failures in any specific order, we
have only to count how many points of this kind there are and then multiply p^x·q^(n−x) by this count.
Clearly, the number of ways in which we can select the x trials on which there is to be a success is C(n, x) = n!/[x!·(n−x)!], the number of combinations of x objects selected from a set of n objects, and we thus arrive at
b(x; n, p) = C(n, x)·p^x·(1 − p)^(n−x)  for x = 0, 1, 2, …, n.   (5)
This probability distribution is called the binomial distribution because for x = 0, 1, …, n the values of the probabilities are the successive terms of the binomial expansion of [(1 − p) + p]^n; for the same reason, the combinatorial quantities C(n, x) are referred to as binomial coefficients.
Actually, the preceding equation defines a family of probability distributions, with each member
characterized by a given value of the parameter p and the number of trials n.
If n is large, the calculation of binomial probabilities can become quite tedious. Many statistical
software programs have binomial distribution commands, as do some statistical calculators.
Otherwise it is convenient to refer to special tables of cumulative binomial probabilities,
B(x; n, p) = Σ_{k=0}^{x} b(k; n, p).   (6)
We tabulate the cumulative probabilities B(x; n, p) rather than the values of b(x; n, p), because the cumulative probabilities are the values needed more often in statistical applications. Note, however, that the values of b(x; n, p) can be obtained by subtracting adjacent entries in the table, because the two cumulative probabilities B(x; n, p) and B(x − 1; n, p) differ by the single term b(x; n, p):
b(x; n, p) = B(x; n, p) − B(x − 1; n, p).   (7)
Example: A manufacturer of fax machines claims that only 10% of his machines require repairs
within the warranty period of 12 months. If 5 of 15 of his machines required repairs within the first year, does this tend to support or refute the claim?
Solution: Let us first find the probability that 5 or more of 15 of the fax machines will require
repairs within a year when the probability that any one will require repairs within a year is 0.1.
Using the table (or direct calculation) with n = 15 and p = 0.1, we get P(X ≥ 5) = 1 − B(4; 15, 0.1) = 1 − 0.9873 = 0.0127.
Since this probability is very small, it would seem reasonable to reject the fax machine
manufacturer’s claim.
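A quick numerical check of this tail probability, sketched in Python with scipy.stats (assumed to be available; not part of the original notes):

    from scipy.stats import binom

    n, p = 15, 0.10                  # 15 machines, claimed 10% repair rate
    # P(X >= 5) = 1 - B(4; 15, 0.1), the probability of 5 or more repairs
    p_tail = 1 - binom.cdf(4, n, p)
    print(round(p_tail, 4))          # about 0.0127 -- small, so the claim looks doubtful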
A probability distribution that is not symmetric is said to be a skewed distribution. It is said to be a positively skewed distribution if the tail is on the right, and it is said to be negatively skewed if the tail is on the left.
Proof :
Now
The binomial distribution reduces to the Bernoulli distribution when n=1. Sometimes the
Bernoulli distribution is called the point binomial.
A random variable X is defined to have a discrete uniform distribution if its density is
f(x) = 1/N  for x = 1, 2, …, N, where N is a positive integer.   (8)
Proof:
A random variable X is defined to have a Bernoulli distribution if the discrete density function of
X is given by
f(x) = p^x·(1 − p)^(1−x) = p^x·q^(1−x)  for x = 0, 1, where 0 ≤ p ≤ 1 and q = 1 − p.   (9)
[Figure: bar chart of the Bernoulli density, with mass q at x = 0 and mass p at x = 1.]
Proof:
Suppose that we are interested in the number of defectives in a sample of n units drawn without
replacement from a lot containing N units, of which a are defective. Let the sample be drawn in
such a way that at each successive drawing, whatever units are left in the lot have the same
chance of being selected. The probability that the first drawing will yield a defective unit is a/N, but for the second drawing it is (a − 1)/(N − 1) or a/(N − 1), depending on whether or not the first unit drawn was defective. Thus, the trials are not independent, the fourth assumption underlying the
binomial distribution is not met, and the binomial distribution does not apply. Note that the
binomial distribution would apply if we do sampling with replacement, namely, if each unit
selected for the sample is replaced before the next one is drawn.
To solve the problem of sampling without replacement, let us proceed as follows: The x successes (defectives) can be chosen in C(a, x) ways, the n − x failures (nondefectives) can be chosen in C(N − a, n − x) ways, and hence, x successes and n − x failures can be chosen in C(a, x)·C(N − a, n − x) ways. Also, n objects can be chosen from a set of N objects in C(N, n) ways, and if we consider all the possibilities as equally likely, it follows that for sampling without replacement the probability of getting "x successes in n trials" is
h(x; n, a, N) = C(a, x)·C(N − a, n − x)/C(N, n)  for x = 0, 1, 2, …, n,   (10)
where x cannot exceed a and n − x cannot exceed N − a. This equation defines the hypergeometric
distribution, whose parameters are the sample size n, the lot size (population size) N, and the
number of successes in the lot a.
Example: A shipment of 20 digital voice recorders contains 5 that are defective. If 10 of them are randomly chosen for inspection, what is the probability that 2 of the 10 will be defective?
Solution: h(2; 10, 5, 20) = C(5, 2)·C(15, 8)/C(20, 10) = (10 × 6435)/184756 ≈ 0.348.
In this example, n was not small compared to N, and if we had made the mistake of using the binomial distribution with n = 10 and p = 5/20 to calculate the probability of 2 defectives, the result would have been 0.282, which is much too small. However, when n is small compared to
N, less than N/10, the composition of the lot is not seriously affected by drawing the sample
without replacement, and the binomial distribution with the parameters n and p=a/N will yield a
good approximation.
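The comparison between the exact hypergeometric probability and its binomial approximation can be verified numerically; the following sketch assumes scipy.stats is available:

    from scipy.stats import hypergeom, binom

    N, a, n = 20, 5, 10                  # lot size, defectives in lot, sample size
    # Hypergeometric: exactly 2 defectives in the sample (sampling without replacement)
    p_hyper = hypergeom.pmf(2, N, a, n)  # scipy's argument order: (k, lot size, defectives, draws)
    # Binomial approximation with p = a/N (only adequate when n is small relative to N)
    p_binom = binom.pmf(2, n, a / N)
    print(round(p_hyper, 3), round(p_binom, 3))   # about 0.348 versus 0.282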
Proof:
using
When we first defined the variance of a probability distribution, it may have occurred to reader
that the formula looked exactly like the one which we use in physics to define second moments,
or moments of inertia. Indeed, it is customary in statistics to define the kth moment about the origin as
μ′k = E[X^k] = Σx x^k·f(x),   (11)
and the kth moment about the mean as
μk = E[(X − μ)^k] = Σx (x − μ)^k·f(x).   (12)
Thus, the mean is the first moment about the origin, and the variance is the second moment
about the mean. Higher moments are often used in statistics to give further descriptions of
probability distributions. For instance, the third moment about the mean (divided by σ³ to make this measure independent of the scale of measurement) is used to describe the symmetry or skewness of a distribution; the fourth moment about the mean (divided by σ⁴) is, similarly, used
to describe its peakedness, or kurtosis.
For the second moment about the mean we thus have μ2 = σ². If σ² is large, there is a correspondingly higher probability of getting values farther away from the mean. Formally, the
idea is expressed by the following theorem.
Chebyshev's Theorem (Theorem 5): If a probability distribution has mean μ and standard deviation σ, the probability of getting a value which deviates from μ by at least kσ is at most 1/k². Symbolically,
P(|X − μ| ≥ kσ) ≤ 1/k².
To prove this theorem, consider any probability distribution f(x) having mean μ and variance σ², and write
σ² = Σx (x − μ)²·f(x) = Σ_{R1} (x − μ)²·f(x) + Σ_{R2} (x − μ)²·f(x) + Σ_{R3} (x − μ)²·f(x),
where R1 is the region where x ≤ μ − kσ, R2 the region where μ − kσ < x < μ + kσ, and R3 the region where x ≥ μ + kσ. But x − μ ≤ −kσ in the region R1 and x − μ ≥ kσ in the region R3, so that in either case |x − μ| ≥ kσ; hence, in both regions (x − μ)² ≥ k²σ². If we now drop the (nonnegative) middle sum over R2 and replace (x − μ)² in each remaining sum by k²σ², a number less than or equal to (x − μ)², we obtain the inequality
σ² ≥ k²σ²·[Σ_{R1} f(x) + Σ_{R3} f(x)],  that is,  Σ_{R1} f(x) + Σ_{R3} f(x) ≤ 1/k²,
and the left-hand side is precisely P(|X − μ| ≥ kσ).
To obtain an alternative form of Chebyshev's theorem, note that the event |X − μ| < kσ is the complement of the event |X − μ| ≥ kσ; hence, the probability of getting a value which deviates from μ by less than kσ is at least 1 − 1/k², that is, P(|X − μ| < kσ) ≥ 1 − 1/k².
Example: The number of customers who visit a car dealer’s showroom on a Saturday morning is
a random variable with mean and standard deviation . With what probability can
we assert that there will be more than 8 but fewer than 28 customers?
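Since the numerical mean and standard deviation are left blank in the notes, the following sketch assumes, purely for illustration, a mean of 18 customers and a standard deviation of 2.5; these assumed values place 8 and 28 exactly k = 4 standard deviations from the mean, so Chebyshev's theorem guarantees a probability of at least 1 − 1/16 = 15/16:

    # Chebyshev bound P(|X - mu| < k*sigma) >= 1 - 1/k**2.
    # The mean and standard deviation below are assumed only for illustration;
    # they are not given in the original text.
    mu, sigma = 18.0, 2.5
    lower, upper = 8.0, 28.0
    k = min(mu - lower, upper - mu) / sigma   # standard deviations to the nearer endpoint
    bound = 1 - 1 / k**2
    print(k, bound)                           # k = 4.0, bound = 0.9375 (= 15/16)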
The Poisson distribution often serves as a model for counts which do not have a natural upper
bound. The Poisson distribution, with mean λ > 0, has probabilities given by
f(x; λ) = λ^x·e^(−λ)/x!  for x = 0, 1, 2, ….   (13)
When n is large and p is small, binomial probabilities are often approximated by means of the Poisson distribution with mean λ = np. Before we justify this approximation, let us point out
that x=0, 1, … means that there is a countable infinity of probabilities, and this requires that we
modify the third axiom of probability.
Axiom 3′: If A1, A2, … is a finite or infinite sequence of mutually exclusive events in the sample space S, then
P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + ⋯.
To verify that P(S) = 1 for this formula, we make use of Axiom 3′ and write
P(S) = Σ_{x=0}^{∞} f(x; λ) = Σ_{x=0}^{∞} λ^x·e^(−λ)/x! = e^(−λ)·Σ_{x=0}^{∞} λ^x/x!.
Since the sum on the right is the Maclaurin series for e^λ, it follows that P(S) = e^(−λ)·e^λ = 1. Let us now show that when n → ∞ and p → 0, while np = λ remains constant, the limiting form of the binomial distribution is λ^x·e^(−λ)/x! for x = 0, 1, 2, ….
First let us substitute λ/n for p into the formula for the binomial distribution and simplify the
resulting expression; thus, we get
The distribution at which we arrived is called the Poisson distribution. An acceptable rule of thumb is to use this approximation if n ≥ 20 and p ≤ 0.05; if n ≥ 100, the approximation is generally excellent so long as np ≤ 10.
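To see the rule of thumb in action, the sketch below compares binomial and Poisson probabilities using scipy.stats; the values n = 100 and p = 0.02 (so λ = np = 2) are chosen only for illustration:

    from scipy.stats import binom, poisson

    n, p = 100, 0.02                 # n large, p small (illustrative values)
    lam = n * p                      # Poisson mean lambda = np
    for x in range(6):
        print(x, round(binom.pmf(x, n, p), 4), round(poisson.pmf(x, lam), 4))
    # The two columns agree to two or three decimal places, as the rule of thumb suggests.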
Proof:
Poisson processes
In general, a random process is a physical process that is wholly or in part controlled by some
sort of chance mechanism. It may be a sequence of repeated flips of a coin, measurements of the
quality of manufactured products coming off an assembly line, the vibrations of airplane wings,
the noise in a radio signal, or any one of numerous other phenomena. What characterizes such
processes is their time dependence, namely, the fact that certain events do or do not take place
(depending on chance) at regular intervals of time or throughout continuous intervals of time.
In this section we shall be concerned with processes taking place over continuous intervals of
time or space, such as the occurrence of imperfections on a continuously produced bolt of cloth,
the recording of radiation by means of a Geiger counter, the arrival of telephone calls at a
switchboard, or the passing by cars over an electronic counting device. We will now show that
the mathematical model which we can use to describe many situations like these is that of
Poisson distribution. To find the probability of x successes during a time interval of length T, we
divide the interval into n equal parts of length Δt, so that T = n·Δt, and we assume that:
1. The probability of a success during a very small interval of time Δt is α·Δt.
2. The probability of more than one success during such a small interval Δt is negligible.
3. The probability of a success during such an interval does not depend on what happened prior to it.
This means that the assumptions underlying the binomial distribution are satisfied, and the probability of x successes during the time interval T is given by the corresponding Poisson probability with the parameter λ = n·(α·T/n) = αT.
Since λ = αT is the mean of this Poisson distribution, note that α is the average (mean) number of successes per unit time.
Example: If a bank receives on the average number 6 bad checks per day, what are the
probabilities that it will receive
Solution a)
b) Here so
, where values of and are obtained from Poisson Table.
Two other families of discrete distributions that play important roles in statistics are the
geometric (or Pascal) and negative binomial distributions. The reason that we consider the two
together is twofold; first, the geometric distribution is a special case of the negative binomial
distribution, and second, the sum of independent and identically distributed geometric random
variables is negative binomial distributed.
Geometric Distribution
Suppose that in a sequence of trials we are interested in the number of the trials on which the
first success occurs. The three assumptions for Bernoulli trials are satisfied but the extra
assumption underlying the binomial distribution is not; in other words, n is not fixed.
Clearly, if the first success is to come on the x-th trial, it has to be preceded by x − 1 failures, and if the probability of a success is p, the probability of x − 1 failures in x − 1 trials is (1 − p)^(x−1). Then, if we multiply this expression by the probability p of a success on the x-th trial, we find that the probability of getting the first success on the x-th trial is p·(1 − p)^(x−1) for x = 1, 2, 3, …. Equivalently, if X denotes the number of failures preceding the first success, its density is
g(x; p) = p·(1 − p)^x = p·q^x  for x = 0, 1, 2, …,   (14)
and this is the form used in what follows.
Proof:
The geometric distribution is well named since the values that the geometric density assumes are
the terms of a geometric series. Also the mode of the geometric density is necessarily 0. A
geometric density possesses one other interesting property, which is given in the following
theorem.
Theorem 8 (memoryless property): If the random variable X has a geometric distribution with parameter p, then, for nonnegative integers i and j, P[X ≥ i + j | X ≥ i] = P[X ≥ j].
Proof:
Example: Consider a sequence of independent, repeated Bernoulli trials with p equal to the
probability of success on an individual trial. Let the random variable X represent the number of
trials required before the first success; then X has the geometric density. To see this note that the
first success will occur on trial x+1 if this (x+1)st trial results in a success and the first x trials
resulted in failures; but, by independence, x successive failures followed by a success has probability q^x·p. In the language of this example, the above theorem states that the probability
that at least i+j trials are required before the first success, given that there have been i successive
failures, is equal to the unconditional probability that at least j trials are needed before the first
success. That is the fact that one has already observed i successive failures does not change the
distribution of the number of trials required to obtain the first success.
A random variable X that has a geometric distribution is often referred to as a discrete waiting-
time random variable. It represents how long (in terms of the number of failures) one has to wait
for a success.
Before leaving the geometric distribution, we note that some authors define the geometric distribution by assuming 1 (instead of 0) is the smallest mass point. The density then has the form f(x) = p·q^(x−1) for x = 1, 2, …; the mean is 1/p, the variance is q/p², and the moment generating function is p·e^t/(1 − q·e^t).
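A small numerical illustration of Theorem 8, using the number-of-failures form of the geometric distribution; the values of p, i, and j below are chosen only for illustration:

    import numpy as np

    p, q = 0.3, 1 - 0.3
    i, j = 4, 3

    # Exact check of Theorem 8 with X = number of failures before the first success,
    # for which P(X >= k) = q**k.
    print(round(q**(i + j) / q**i, 6), round(q**j, 6))   # both equal q**j = 0.343

    # Monte Carlo check: simulate failure counts and compare the two probabilities.
    rng = np.random.default_rng(0)
    x = rng.geometric(p, size=200_000) - 1   # numpy counts trials, so subtract 1 to get failures
    cond = (x[x >= i] >= i + j).mean()       # estimate of P(X >= i+j | X >= i)
    print(round(cond, 3), round((x >= j).mean(), 3))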
A random variable X with discrete density
f(x) = C(r + x − 1, x)·p^r·q^x  for x = 0, 1, 2, …,   (15)
where the parameters r = 1, 2, … and 0 < p ≤ 1 (with q = 1 − p), is defined to have a negative binomial density.
Proof:
The negative binomial distribution, like the Poisson, has the non-negative integers for its mass points;
hence, the negative binomial distribution is potentially a model for a random experiment where a
count of some sort is of interest. Indeed, the negative binomial distribution has been applied in
population counts, in health and accident statistics, in communications, and in other counts as well.
Example: Consider a sequence of independent, repeated Bernoulli trials with p equal to the
probability of success on an individual trial. Let the random variable X represent the number of
failures prior to the rth success; then X has the negative binomial density, as the following
argument shows: The last trial must result in a success, having probability p; among the first x + r − 1 trials there must be r − 1 successes and x failures, and the probability of this is C(x + r − 1, x)·p^(r−1)·q^x. Multiplying the two factors gives the negative binomial density (15).
An immediate generalization of the binomial distribution arises when each trial can have more
than two possible outcomes. This happens, for example, when a manufactured product is
classified as superior, average, or poor, when a student’s performance is graded as A, B,C, D and
F, or when an experiment is judged successful, unsuccessful or inconclusive. To treat this kind of
problem in general, let us consider the case where there are n independent trials, with each trial
permitting k mutually exclusive outcomes whose respective probabilities are p1, p2, …, pk (with p1 + p2 + ⋯ + pk = 1).
Referring to the outcomes as being of the first kind, the second kind, …, and the k-th kind, we
shall be interested in the probability of getting x1 outcomes of the first kind, x2 outcomes of the second kind, …, and xk outcomes of the k-th kind, with x1 + x2 + ⋯ + xk = n.
Using arguments similar to those which we employed in deriving the equation for the binomial
distribution, it can be shown that the desired probability is given by
f(x1, x2, …, xk) = [n!/(x1!·x2!⋯xk!)]·p1^(x1)·p2^(x2)⋯pk^(xk)   (16)
for xi = 0, 1, …, n, with the xi subject to the restriction x1 + x2 + ⋯ + xk = n. The joint probability distribution whose values
are given by these probabilities is called the multinomial distribution; it owes its name to the fact
that for the various values of the xi the probabilities are given by the corresponding terms of the
multinomial expansion of (p1 + p2 + ⋯ + pk)^n.
Example: The probabilities that the light bulb of a certain kind slide projector will last fewer
than 40 hours of continuous use, anywhere from 40 to 80 hours of continuous use, or more than
80 hours of continuous use, are 0.3, 0.5 and 0.2. Find the probability that among eight such bulbs
2 will last fewer than 40 hours, 5 will last anywhere from 40 to 80 hours, and 1 will last more
than 80 hours.
Solution: Substituting into (16), we get f(2, 5, 1) = [8!/(2!·5!·1!)]·(0.3)²·(0.5)⁵·(0.2) = 168 × 0.09 × 0.03125 × 0.2 ≈ 0.0945.
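The arithmetic can be checked directly from (16); the short Python sketch below recomputes the multinomial probability:

    from math import factorial

    # Multinomial probability for the slide-projector bulbs: n = 8 trials,
    # outcome probabilities (0.3, 0.5, 0.2), counts (2, 5, 1).
    n = 8
    probs = (0.3, 0.5, 0.2)
    counts = (2, 5, 1)

    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)            # n! / (x1! x2! x3!) = 168

    prob = coeff
    for pi, xi in zip(probs, counts):
        prob *= pi**xi
    print(round(prob, 4))                 # about 0.0945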
Other discrete models, not treated in detail here, include censored random variables, the beta-binomial distribution, and the logarithmic distribution.
Continuous sample spaces and continuous random variables arise when we deal with quantities
that are measured on a continuous scale; for instance, when we measure the speed of a car, the amount of alcohol in a person's blood, the efficiency of a solar collector, or the tensile strength of
a new alloy.
The problem of defining probabilities in connection with continuous sample spaces and
continuous random variables involves some complications. To illustrate the nature of these
complications, let us consider the following situation: Suppose we want to know the probability
that if an accident occurs on a freeway whose length is 200 miles, it will happen at some given
location or, perhaps, some particular stretch of the road. The outcomes of this experiment can be
looked upon as a continuum of points, namely, those on the continuous interval from 0 to 200.
Suppose the probability that the accident occurs on any interval of length L is L/200, with L
measured in miles. In this section several parametric families of univariate probability functions
are presented.
If the probability density of a random variable X is given by
f(x) = 1/(b − a)  for a ≤ x ≤ b, and f(x) = 0 elsewhere,   (17)
where the parameters a and b satisfy −∞ < a < b < ∞, then the random variable X is defined to be uniformly distributed over the interval [a, b], and the distribution given by (17) is called the uniform distribution.
Theorem 10: If the random variable X is uniformly distributed over [a, b], then E[X] = (a + b)/2 and var[X] = (b − a)²/12.
Proof:
To illustrate how a physical situation might give rise to a uniform distribution, suppose that a
wheel of a locomotive has the radius r and that x is the location of a point on its circumference
measured along the circumference from some reference point 0. When the brakes are applied,
some points will make sliding contact with the rail, and heavy wear will occur at that point. For
repeated application of the brakes, it would seem reasonable to assume that x is a value of a
random variable having the uniform distribution with a = 0 and b = 2πr. If this assumption were
incorrect, that is, if some set of points on the wheel made contact more often than others, the
wheel would eventually exhibit “flat spots” or wear out of round.
Among the special probability densities we shall study in this chapter, the normal distribution is
by far the most important. It was studied first in the 18th century when scientists observed an
astonishing degree of regularity in errors of measurement. They found that the patterns
(distributions) they observed were closely approximated by a continuous distribution which they
referred to as the “normal curve of errors” and attributed to the laws of chance.
A random variable X has a normal distribution with mean μ and variance σ² if its probability density is
f(x) = [1/(σ·√(2π))]·exp[−(x − μ)²/(2σ²)]  for −∞ < x < ∞.   (18)
The cumulative probabilities F(z) = P(Z ≤ z) of the standard normal distribution correspond to the area under the standard normal density to the left of z, as shown by the shaded area in the figure.
To find the probability that a random variable having the standard normal distribution will take on a value between a and b, we use the equation P(a < Z < b) = F(b) − F(a), as shown by the shaded area in the figure below.
We also sometimes make use of the identity F(−z) = 1 − F(z), which holds for all symmetric distributions centered around 0, as the reader will be asked to verify.
Also, if we want to find the probability that a random variable having the normal distribution with mean μ and variance σ² will take on a value between a and b, we have only to calculate the probability that a random variable having the standard normal distribution will take on a value between (a − μ)/σ and (b − μ)/σ, since Z = (X − μ)/σ has the standard normal distribution.
If the normal random variable has mean 0 and variance 1, it is called a standard or normalized
normal random variable with density
f(z) = [1/√(2π)]·e^(−z²/2)  for −∞ < z < ∞.   (19)
We wish to show that A = 1, where A denotes the integral of this density over the whole real line. This is most easily done by showing that A² is 1 and then reasoning that A = 1 since A is positive. We may put
Proof:
and we have
The integral, together with the factor 1/√(2π), is necessarily 1 since it is the area under a normal density.
Hence
thus justifying our use of the symbols μ and σ² for the parameters.
Example: Suppose that an instructor assumes that a student’s final score is the value of a
normally distributed random variable. If the instructor decides to award a grade of A to those
students whose score exceeds , a B to those students whose score falls between and
, a C if a score falls between and , a D if a score falls between and ,
and an F if the score falls below , then the proportions of each grade given can be
calculated. For example, since
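Because the grade cutoffs are left blank in the notes, the sketch below assumes, only for illustration, symmetric cutoffs at μ ± σ/2 and μ ± 3σ/2 and computes the resulting grade proportions from the standard normal cumulative distribution function:

    from scipy.stats import norm

    # Assumed standardized cutoffs between F, D, C, B, A (not taken from the original text).
    z_cuts = [-1.5, -0.5, 0.5, 1.5]
    probs = [norm.cdf(z_cuts[0])]
    probs += [norm.cdf(b) - norm.cdf(a) for a, b in zip(z_cuts, z_cuts[1:])]
    probs += [1 - norm.cdf(z_cuts[-1])]
    for grade, pr in zip("FDCBA", probs):
        print(grade, round(pr, 3))        # roughly 7%, 24%, 38%, 24%, 7%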
Two other families of distributions that play important roles in statistics are the (negative)
exponential and gamma distributions, which are defined in this subsection. The reason that the
two are considered together is twofold; first, the exponential is a special case of the gamma, and,
second, the sum of identically independently distributed exponential random variables is gamma
distributed.
A random variable X is said to have an exponential distribution with parameter λ > 0 if its density is
f(x) = λ·e^(−λx)  for x > 0, and f(x) = 0 elsewhere.   (20)
More generally, if the density of X is
f(x) = [λ^r/Γ(r)]·x^(r−1)·e^(−λx)  for x > 0, and f(x) = 0 elsewhere,   (21)
where r > 0 and λ > 0, then X is defined to have a gamma distribution. Γ(r) is a value of the gamma function, defined by Γ(r) = ∫₀^∞ x^(r−1)·e^(−x) dx. Integration by parts shows that Γ(r) = (r − 1)·Γ(r − 1) for any r > 1 and hence, that Γ(r) = (r − 1)! when r is a positive integer.
Graphs of several gamma distributions are shown in figure below and they exhibit the fact that
these distributions are positively skewed. In fact, the skewness decreases as r increases for any
fixed value of .
[Figure: gamma densities for r = 1, 2, and 4 (with fixed λ), plotted over 0 ≤ x ≤ 3; the skewness decreases as r increases.]
Theorem 12: If the random variable X has a gamma distribution with parameters r and λ, then E[X] = r/λ, var[X] = r/λ², and the moment generating function is m_X(t) = (1 − t/λ)^(−r) for t < λ.
Proof:
Hence
The exponential distribution is the special case of the gamma distribution for r = 1. Thus, the expected value and variance of the exponential distribution are 1/λ and 1/λ², respectively.
The exponential distribution has been used as a model for lifetimes of various things. When we
introduced the Poisson distribution, we spoke of certain happenings, for example, particle
emissions, occurring in time. The length of the time interval between successive happenings can
be shown to have an exponential distribution provided that the number of happening in fixed
time interval has a Poisson distribution. Also, if we assume again that the number of happenings
in a fixed time interval is Poisson distributed, the length of time between time 0 and the instant
when the rth happening occurs can be shown to have a gamma distribution. So a gamma random
variable can be thought of as a continuous waiting-time random variable. It is the time one has to
wait for the rth happening. Recall that the geometric and negative binomial random variables
were discrete waiting-time random variables. In a sense, they are discrete analogs of the negative
exponential and gamma distributions, respectively.
If the random variable X has a gamma distribution with parameters r and λ, where r is a positive integer, then its cumulative distribution function can be written
F(x) = 1 − Σ_{j=0}^{r−1} e^(−λx)·(λx)^j/j!  for x > 0.   (22)
For λ = 1, the function given in (22) is called the incomplete gamma function and has been extensively tabulated.
Theorem 13 (memoryless property): If the random variable X has an exponential distribution with parameter λ, then P[X > a + b | X > a] = P[X > b] for all a, b > 0.
Proof:
Let X represent the lifetime of a given component; then, in words, Theorem 13 states that the
conditional probability that the component will last a+b time units given that it has lasted a time
units is the same as its initial probability of lasting b time units. Another way of saying this is to
say that an “old” functioning component has the same lifetime distribution as a “new”
functioning component or that the component is not subject to fatigue or to wear.
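A numerical illustration of Theorem 13; the rate λ and the times a and b below are assumed values, used only to demonstrate the identity:

    from scipy.stats import expon

    lam = 0.5                              # assumed rate, only for illustration
    a, b = 2.0, 3.0
    X = expon(scale=1 / lam)               # scipy parameterizes by scale = 1/lambda

    lhs = X.sf(a + b) / X.sf(a)            # P(X > a+b | X > a), sf = survival function
    rhs = X.sf(b)                          # P(X > b)
    print(round(lhs, 6), round(rhs, 6))    # equal: an "old" component is as good as new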
A family of probability densities of continuous random variables taking on values in the interval (0, 1) is the family of beta distributions. A random variable X has a beta distribution with parameters a and b if its density is
f(x) = [Γ(a + b)/(Γ(a)·Γ(b))]·x^(a−1)·(1 − x)^(b−1)  for 0 < x < 1, and f(x) = 0 elsewhere, where a > 0 and b > 0.   (23)
Remark: The beta distribution reduces to the uniform distribution over (0, 1) if a=b=1.
B(a, b) = ∫₀¹ x^(a−1)·(1 − x)^(b−1) dx = Γ(a)·Γ(b)/Γ(a + b).   (24)
The moment generating function for the beta distribution does not have a simple form; however
the moments are readily found by using their definition.
Theorem 14: If the random variable X has a beta distribution with parameters a and b, then E[X] = a/(a + b) and var[X] = a·b/[(a + b)²·(a + b + 1)].
Proof:
Hence, E[X] = a/(a + b) and var[X] = E[X²] − (E[X])² = a·b/[(a + b)²·(a + b + 1)].
The family of beta densities is a two-parameter family of densities that is positive on the interval (0, 1) and can assume quite a variety of different shapes; consequently, the beta distribution can be used to model an experiment for which one of those shapes is appropriate.
Closely related to the exponential distribution is the Weibull distribution, whose probability
density is given by
f(x) = α·β·x^(β−1)·e^(−α·x^β)  for x > 0, and f(x) = 0 elsewhere, where α > 0 and β > 0.   (25)
To demonstrate this relationship we evaluate the probability that a random variable having the
Weibull distribution will take on a value less than a, namely, the integral
P(X ≤ a) = ∫₀^a α·β·x^(β−1)·e^(−α·x^β) dx. We get
F(a) = 1 − e^(−α·a^β),   (26)
and, making the substitution y = α·x^β, it can be seen that y is a value of a random variable having an exponential distribution. The graphs of several Weibull distributions are shown in the figure below.
The mean of the Weibull distribution having the parameters α and β may be obtained by evaluating the integral μ = ∫₀^∞ x·α·β·x^(β−1)·e^(−α·x^β) dx, which gives μ = α^(−1/β)·Γ(1 + 1/β).
CHAPTER TWO
Common Multivariate Distributions
The purpose of this chapter is to introduce the concepts of k-dimensional distribution functions,
conditional distributions, joint and conditional expectation, and independence of random
variables.
In the study of many random experiments, there are, or can be, more than one random variable of
interest; hence we are compelled to extend our definitions of the distribution and density function
of one random variable to those of several random variables. As in the univariate case we will
first define, the cumulative distribution function. Although it is not as convenient to work with as
density functions, it does exist for any set of k random variables. Density functions for jointly
discrete and jointly continuous random variables will be given in subsequent sections.
Let X1, X2, …, Xk be k random variables all defined on the same probability space. The joint cumulative distribution function of X1, X2, …, Xk, denoted by F_{X1,…,Xk}(x1, …, xk), is defined as
F_{X1,…,Xk}(x1, …, xk) = P[X1 ≤ x1, X2 ≤ x2, …, Xk ≤ xk].
Thus a joint cumulative distribution function is a function with domain Euclidean k-space and counterdomain the interval [0, 1]. If k = 2, the joint
cumulative distribution function is a function of two variables, and so its domain is just the xy
plane.
Example 1: Consider the experiment of tossing two tetrahedra (regular four-sided polyhedron)
each with sides labeled 1 to 4. Let X denote the number on the downturned face of the first
tetrahedron and Y the larger of the downturned numbers. The goal is to find , the joint
cumulative distribution function of X and Y. Observe first that the random variables X and Y jointly take on only the values (x, y) with x = 1, 2, 3, 4, y = 1, 2, 3, 4, and y ≥ x.
[Figure 2.1: the 16 equally likely sample points (x1, x2), with x1 = 1, …, 4 on the first tetrahedron and x2 = 1, …, 4 on the second.]
The sample space for this experiment is displayed in Figure 2.1. The 16 sample points are
assumed to be equally likely. Our objective is to find F_{X,Y}(x, y) for each point (x, y). As an example, let (x, y) = (2, 3), and find F_{X,Y}(2, 3) = P[X ≤ 2, Y ≤ 3]. Now the event {X ≤ 2, Y ≤ 3} corresponds to the six sample points (x1, x2) with x1 ≤ 2 and x2 ≤ 3 in Figure 2.1; hence F_{X,Y}(2, 3) = 6/16. Similarly F_{X,Y}(x, y) can be found for other values of x and y in the following table.
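The value F_{X,Y}(2, 3) = 6/16 can be confirmed by enumerating the sample space; the following sketch is only an illustrative check:

    from itertools import product
    from fractions import Fraction

    # Enumerate the 16 equally likely outcomes of tossing two tetrahedra and compute
    # F_{X,Y}(2, 3) = P[X <= 2, Y <= 3] for X = first number, Y = larger number.
    outcomes = list(product(range(1, 5), repeat=2))
    hits = [(a, b) for a, b in outcomes if a <= 2 and max(a, b) <= 3]
    print(Fraction(len(hits), len(outcomes)))   # 3/8, i.e. 6/16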
i)
.
ii) If , then
Remark:
Example 2: The joint discrete density function of X and Y in Example 1 above is given as
(x, y):    (1,1)  (1,2)  (1,3)  (1,4)  (2,2)  (2,3)  (2,4)  (3,3)  (3,4)  (4,4)
f(x, y):   1/16   1/16   1/16   1/16   2/16   1/16   1/16   3/16   1/16   4/16
(and f(x, y) = 0 for all other (x, y)).
Or in another tabular form as
. ///
If X and Y are jointly discrete random variables, then f_X(x) = Σy f_{X,Y}(x, y) and f_Y(y) = Σx f_{X,Y}(x, y) are called marginal discrete density functions. More generally, let X_{i1}, …, X_{im} be any subset of the jointly discrete random variables X1, …, Xk; then f_{X_{i1},…,X_{im}}(x_{i1}, …, x_{im}) is also called a marginal density.
Example 3: From Example 2, if we note that X takes the values 1, 2, 3, 4; Y takes the values 1, 2, 3, 4; and Y is greater than or equal to X, the values of (x, y) are the pairs with 1 ≤ x ≤ y ≤ 4. The joint density and the marginal densities f_X(x) and f_Y(y) can be displayed in a two-way table:
y\x        1      2      3      4    | f_Y(y)
 4        1/16   1/16   1/16   4/16  |  7/16
 3        1/16   1/16   3/16    0    |  5/16
 2        1/16   2/16    0      0    |  3/16
 1        1/16    0      0      0    |  1/16
---------------------------------------------
f_X(x)    4/16   4/16   4/16   4/16  |   1
The multinomial distribution is a (k + 1)-parameter family of distributions, the parameters being n and p1, p2, …, pk. The remaining cell probability p_{k+1} = 1 − p1 − p2 − ⋯ − pk is, like q in the binomial distribution, exactly determined by p1, …, pk. A particular case of a multinomial distribution is obtained by putting, for example, n = 3, k = 2, p1 = 0.2, and p2 = 0.3, to get
f(x1, x2) = [3!/(x1!·x2!·(3 − x1 − x2)!)]·(0.2)^(x1)·(0.3)^(x2)·(0.5)^(3−x1−x2),  xi = 0, 1, …, 3, x1 + x2 ≤ 3.   (1)
We might observe that if have the multinomial distribution given (1), then the marginal
distribution of Xi is a binomial distribution with parameters n and pi. This observation can be
verified by recalling the experiment of repeated, independent trials.
Random variables X1, …, Xk are said to be jointly continuous with joint probability density function f_{X1,…,Xk}(x1, …, xk) if
F_{X1,…,Xk}(x1, …, xk) = ∫_{−∞}^{xk} ⋯ ∫_{−∞}^{x1} f_{X1,…,Xk}(u1, …, uk) du1 ⋯ duk   (2)
for all (x1, …, xk).
As in the one-dimensional case, a joint probability density function has two properties:
(i) f_{X1,…,Xk}(x1, …, xk) ≥ 0;
(ii) ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f_{X1,…,Xk}(x1, …, xk) dx1 ⋯ dxk = 1.
A one-dimensional probability density function was used to find probabilities. For example, for a random variable X, P(X ∈ B) = ∫_B f_X(x) dx; that is, the area under f_X(x) over the set B gave P(X ∈ B). In the two-dimensional case, volume gives probabilities. For instance, let (X1, X2) be jointly continuous random variables with joint probability density function f_{X1,X2}(x1, x2), and let R be some region in the x1x2 plane; then P[(X1, X2) falls in the region R] is given by the volume under f_{X1,X2}(x1, x2) over the region R. In particular, if R = {(x1, x2): a1 < x1 ≤ b1, a2 < x2 ≤ b2}, then
P[a1 < X1 ≤ b1, a2 < X2 ≤ b2] = ∫_{a2}^{b2} ∫_{a1}^{b1} f_{X1,X2}(x1, x2) dx1 dx2.
A joint probability density function is defined as any nonnegative integrand satisfying (2) and
hence is not uniquely defined.
for k =1.
Probability of events defined in terms of the random variables can be obtained by integrating the
joint probability density function over the indicated region; for example
which is the volume under the surface z = x + y over the indicated region in the xy plane.
If X and Y are jointly continuous random variables, then f_X(x) and f_Y(y) are called marginal probability density functions. More generally, let X_{i1}, …, X_{im} be any subset of the jointly continuous random variables X1, …, Xk. Then f_{X_{i1},…,X_{im}}(x_{i1}, …, x_{im}) is called a marginal density of the m-dimensional random variable (X_{i1}, …, X_{im}). If X1, …, Xk are jointly continuous random variables, then any marginal probability density function can be found by integrating the joint density over the remaining variables:
f_{X_{i1},…,X_{im}}(x_{i1}, …, x_{im}) = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f_{X1,…,Xk}(x1, …, xk) ∏_{j ∉ {i1,…,im}} dx_j.   (3)
For instance, f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy, and f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx.
In the preceding section we defined the joint distribution and joint density functions of several
random variables; in this section we define conditional distributions and the related concept of
stochastic independence. Most definitions will be given first for only two random variables and
later extended to k random variables.
Let X and Y be jointly discrete random variables with joint discrete density function f_{X,Y}(x, y). The conditional discrete density function of Y given X = x, denoted by f_{Y|X}(y|x), is defined to be
f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x)  if f_X(x) > 0.   (4)
Similarly, f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y) if f_Y(y) > 0.
Since X and Y are discrete, they have mass points, say x1, x2, … for X and y1, y2, … for Y. If f_X(x) > 0, then x = xi for some i, and f_{Y|X}(y|xi) = P[Y = y | X = xi]. The numerator of the right-hand side of (4) is P[X = xi, Y = y]; so
f_{Y|X}(y|xi) = P[X = xi, Y = y]/P[X = xi] = P[Y = y | X = xi].
f_{Y|X}(y|x) is called a conditional discrete density function and hence should possess the properties of a discrete density function. To see that it does, consider x as some fixed mass point of X. Then f_{Y|X}(y|x) is a function with argument y, and to be a discrete density function it must be nonnegative and, if summed over the possible values (mass points) of Y, must sum to 1. Nonnegativity is clear, and Σy f_{Y|X}(y|x) = [Σy f_{X,Y}(x, y)]/f_X(x) = f_X(x)/f_X(x) = 1, so f_{Y|X}(y|x) is indeed a density.
The conditional cumulative distribution of Y given X=x can be defined for two jointly discrete
random variables by recalling the close relationship between discrete density functions and
cumulative distribution functions. If X and Y are jointly discrete random variables, the conditional cumulative distribution of Y given X = x, denoted by F_{Y|X}(y|x), is defined to be
F_{Y|X}(y|x) = P[Y ≤ y | X = x] = Σ_{yj ≤ y} f_{Y|X}(yj|x)  for f_X(x) > 0.
Example 6: Return to the experiment of tossing two tetrahedra. Let X denote the number on the downturned face of the first and Y the larger of the downturned numbers. What is the density of Y given that X = 2?
f_{Y|X}(y|2) = f_{X,Y}(2, y)/f_X(2), so f_{Y|X}(2|2) = (2/16)/(4/16) = 1/2, f_{Y|X}(3|2) = 1/4, f_{Y|X}(4|2) = 1/4, and f_{Y|X}(y|2) = 0 for other y.
Example 8: Suppose 12 cards are drawn without replacement from an ordinary deck of playing
cards. Let X1 be the number of aces drawn, X2 be the number of 2s, X3 be the number of 3s, and
X4 be the number of 4s. The joint density of these four random variables is given by
f(x1, x2, x3, x4) = C(4, x1)·C(4, x2)·C(4, x3)·C(4, x4)·C(36, 12 − x1 − x2 − x3 − x4)/C(52, 12),
where xi = 0, 1, …, 4 and x1 + x2 + x3 + x4 ≤ 12.
Let X and Y be jointly continuous random variables with joint probability density function f_{X,Y}(x, y). The conditional probability density function of Y given X = x, denoted by f_{Y|X}(y|x), is defined to be
f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x)  if f_X(x) > 0.   (5)
Similarly, f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y) if f_Y(y) > 0.
f_{Y|X}(y|x) is called a (conditional) probability density function and hence should possess the properties of a probability density function. It is clearly nonnegative, and ∫_{−∞}^{∞} f_{Y|X}(y|x) dy = [∫_{−∞}^{∞} f_{X,Y}(x, y) dy]/f_X(x) = f_X(x)/f_X(x) = 1.
The density f_{Y|X}(y|x) is the density of the random variable Y given that x is the value of the random variable X. In the conditional density f_{Y|X}(y|x), x is fixed and could be thought of as a parameter. Consider f_{Y|X}(y|x0), that is, the density of Y given that X was observed to be x0. Now f_{X,Y}(x, y) plots as a surface over the xy plane. A plane perpendicular to the xy plane which intersects the xy plane on the line x = x0 will intersect the surface in the curve f_{X,Y}(x0, y). The area under this curve is ∫_{−∞}^{∞} f_{X,Y}(x0, y) dy = f_X(x0); dividing by f_X(x0) rescales this curve into the conditional density f_{Y|X}(y|x0), whose area is 1.
Example 9: Suppose .
. Note that
2.2.3 Independence
When we define the conditional probability of two events in chapter one, we also defined
independence of events. We have now defined the conditional distribution of random variables;
so we should define independence of random variables as well.
Random variables X1, …, Xk are defined to be (stochastically) independent if and only if
F_{X1,…,Xk}(x1, …, xk) = F_{X1}(x1)·F_{X2}(x2)⋯F_{Xk}(xk)  for all (x1, …, xk),   (6)
or, equivalently, when the densities exist,
f_{X1,…,Xk}(x1, …, xk) = f_{X1}(x1)·f_{X2}(x2)⋯f_{Xk}(xk)  for all (x1, …, xk).   (7)
We saw that independence of events was closely related to conditional probability; likewise independence of random variables is closely related to conditional distributions of random variables. For example, suppose X and Y are two independent random variables; then f_{X,Y}(x, y) = f_X(x)·f_Y(y) by definition of independence; however, f_{X,Y}(x, y) = f_{Y|X}(y|x)·f_X(x) by definition of conditional density, which implies that f_{Y|X}(y|x) = f_Y(y); that is, the conditional density of Y given x is the unconditional density of Y. So to show that two random variables are not independent, it suffices to show that f_{Y|X}(y|x) depends on x.
Example 10: Let X be the number on the downturned face of the first tetrahedron and Y the larger of the two downturned numbers in the experiment of tossing two tetrahedra. Are X and Y independent? Obviously not, since, for example, f_{Y|X}(2|2) = 1/2 while f_Y(2) = 3/16, so the conditional density of Y depends on x.
2.3 Expectation
When we introduced the concept of expectation for univariate random variables in Chapter one,
we first defined the mean and variance as particular expectations and then defined the
expectation of a general function of random variable. Here, we will commence with the
definition of the expectation of a general function of a k-dimensional random variable. The
definition will be given for only those k-dimensional random variables which have densities.
2.3.1 Definition
Let (X1, …, Xk) be a k-dimensional random variable with density f_{X1,…,Xk}(·), and let g(·, …, ·) be a function of k variables. The expectation of g(X1, …, Xk) is defined as
E[g(X1, …, Xk)] = Σ g(x1, …, xk)·f_{X1,…,Xk}(x1, …, xk)   (8)
if the random variable is discrete, where the summation is over all possible values of (x1, …, xk), and
E[g(X1, …, Xk)] = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} g(x1, …, xk)·f_{X1,…,Xk}(x1, …, xk) dx1 ⋯ dxk   (9)
if the random variable is continuous.
In order for the above to be defined, it is understood that the sum and multiple integral,
respectively, exist.
E[g1(X1)] = Σ_{x1} ⋯ Σ_{xk} g1(x1)·f_{X1,…,Xk}(x1, …, xk) = Σ_{x1} g1(x1)·f_{X1}(x1),
using the fact that the marginal density is obtained from the joint density by summing (or integrating) over the remaining variables.
Theorem4:If ,then
.
We might note that the “expectation” in the notation of Theorem 3 has two different interpretations; one is that the expectation is taken over the joint distribution of X1, …, Xk, and the other is that the expectation is taken over the marginal distribution of the variables that actually appear in the function. What Theorem 3 really
says is that these two expectations are equivalent, and hence we are justified in using the same
notation for both.
Example 13: Consider the experiment of tossing two tetrahedra. Let X be the number on the first
and Y the larger of the two numbers. We gave the joint discrete density function of X and Y in
Example 2.
E[Y] = Σy y·f_Y(y) = 1·(1/16) + 2·(3/16) + 3·(5/16) + 4·(7/16) = 50/16 = 3.125.
E[X] = 1·(4/16) + 2·(4/16) + 3·(4/16) + 4·(4/16) = 2.5.
E[X + Y] = Σx Σy (x + y)·f_{X,Y}(x, y) = E[X] + E[Y] = 2.5 + 3.125 = 5.625.
Example 15: Let the three – dimensional random variable ( ) have the density
( )= ( ).
(i) E [3
(ii) E , and
(iii) E .
E[
d .
Let X and Y be any two random variables defined on the same probability space. The covariance of X and Y, denoted by cov[X, Y], is defined as
cov[X, Y] = E[(X − E[X])·(Y − E[Y])] = E[XY] − E[X]·E[Y],   (10)
provided that the indicated expectation exists. And the correlation coefficient, denoted by ρ[X, Y], of the random variables X and Y is defined to be
ρ[X, Y] = cov[X, Y]/(√var[X]·√var[Y]).   (11)
Example 16: Find ρ[X, Y] for X, the number on the first, and Y, the larger of the two numbers, in the experiment of tossing two tetrahedra. We would expect that ρ[X, Y] is positive, since when X is large, Y tends to be large too. We calculated E[X] = 2.5 and E[Y] = 3.125 in Example 13. Now E[XY] = Σx Σy x·y·f_{X,Y}(x, y) = 135/16 = 8.4375, so cov[X, Y] = 8.4375 − (2.5)(3.125) = 0.625. Also var[X] = 1.25 and var[Y] = 0.859375; hence ρ[X, Y] = 0.625/√(1.25 × 0.859375) ≈ 0.60, which is positive, as expected.
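The moments used in Examples 13 and 16 can be recomputed directly from the joint density table; the following sketch is an illustrative check of the covariance and correlation:

    from fractions import Fraction as F

    # Joint density of (X, Y) from Example 2 (X = first number, Y = larger number).
    f = {(1, 1): F(1, 16), (1, 2): F(1, 16), (1, 3): F(1, 16), (1, 4): F(1, 16),
         (2, 2): F(2, 16), (2, 3): F(1, 16), (2, 4): F(1, 16),
         (3, 3): F(3, 16), (3, 4): F(1, 16), (4, 4): F(4, 16)}

    EX  = sum(x * p for (x, y), p in f.items())
    EY  = sum(y * p for (x, y), p in f.items())
    EXY = sum(x * y * p for (x, y), p in f.items())
    EX2 = sum(x * x * p for (x, y), p in f.items())
    EY2 = sum(y * y * p for (x, y), p in f.items())

    cov = EXY - EX * EY
    rho = float(cov) / (float(EX2 - EX**2) * float(EY2 - EY**2)) ** 0.5
    print(EX, EY, cov, round(rho, 3))     # 5/2, 25/8, 5/8, 0.603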
Let (X, Y) be a two-dimensional random variable and g(·,·) a function of two variables. The conditional expectation of g(X, Y) given X = x, denoted by E[g(X, Y)|X = x], is defined to be
E[g(X, Y)|X = x] = ∫_{−∞}^{∞} g(x, y)·f_{Y|X}(y|x) dy   (12)
if (X, Y) are jointly continuous, and
E[g(X, Y)|X = x] = Σy g(x, y)·f_{Y|X}(y|x)   (13)
if (X, Y) are jointly discrete, where the summation is over all possible values of Y.
Example 18: In the experiment of tossing two tetrahedra with X, the number on the first, and Y, the larger of the two numbers, we found in Example 6 that f_{Y|X}(2|2) = 1/2, f_{Y|X}(3|2) = 1/4, and f_{Y|X}(4|2) = 1/4.
Hence E[Y | X = 2] = 2·(1/2) + 3·(1/4) + 4·(1/4) = 2.75.
for 0<x<1.
As we stated above, E[Y|X = x] is, in general, a function of x; let us denote it by h(x), that is, h(x) = E[Y|X = x]. Now we can evaluate the expectation of h(X), a function of X, and we will have
E[h(X)] = ∫ E[Y|X = x]·f_X(x) dx = ∫∫ y·f_{Y|X}(y|x)·f_X(x) dy dx = ∫∫ y·f_{X,Y}(x, y) dy dx = E[Y].
This gives us
E[Y] = E[E[Y|X]].   (14)
Thus we have proved, for jointly continuous random variables X and Y, this simple yet very useful result; the companion result for variances is the following theorem.
Theorem 6: var[Y] = E[var[Y|X]] + var[E[Y|X]].
Proof:
Let us note in words what the two theorems say. Equation (14) states that the mean of Y is the
mean or expectation of the conditional mean of Y, and Theorem 6 states that the variance of Y is
the mean or expectation of the conditional variance of Y, plus the variance of the conditional
mean of Y.
We will conclude this subsection with one further theorem. The proof can be routinely obtained
from definition of conditional expectation and left as an exercise. Also, the theorem can be
generalized to more than two dimensions.
We will use our definition of the expectation of a function of several variables to define joint
moments and the joint moment generating function.
The joint raw moments of X1, …, Xk are defined by E[X1^(r1)·X2^(r2)⋯Xk^(rk)], where the ri's are 0 or any positive integer; the joint moments about the means are defined by E[(X1 − E[X1])^(r1)⋯(Xk − E[Xk])^(rk)].
Remark: If ri = rj = 1 and all other rm's are 0, then that particular joint moment about the means is the covariance of Xi and Xj.
The joint moment generating function of (X1, …, Xk) is defined by
m(t1, …, tk) = E[exp(t1·X1 + ⋯ + tk·Xk)]   (15)
if the expectation exists for all values of t1, …, tk such that −h < tj < h for some h > 0, j = 1, …, k.
Remark: m_{Xj}(tj) = m(0, …, 0, tj, 0, …, 0); that is, the marginal moment generating functions can be obtained from the joint moment generating function.
We have already defined independence and expectation; in this section we will relate the two
concepts.
Theorem 8: If X and Y are independent and g1(·) and g2(·) are two functions, each of a single argument, then E[g1(X)·g2(Y)] = E[g1(X)]·E[g2(Y)].
Proof: We will give the proof for jointly continuous random variables.
The converse of the above corollary is not always true; that is, E[XY] = E[X]·E[Y] (zero covariance) does not always imply that X and Y are independent, as the following example shows.
Example 10: Let U be a random variable which is uniformly distributed over the interval (0,1).
Define X and Y as suitable functions of U. X and Y are clearly not independent, since if a value of X is known, then U is one of two values, and so the conditional distribution of Y is not the same as the marginal distribution of Y.
Theorem 9: Two jointly distributed random variables X and Y are independent if and only if m_{X,Y}(t1, t2) = m_{X,Y}(t1, 0)·m_{X,Y}(0, t2) for all t1, t2 for which −h < ti < h, i = 1, 2, for some h > 0.
Proof: [Recall that m_{X,Y}(t1, 0) = m_X(t1) is the moment generating function of X. Also note that m_{X,Y}(0, t2) = m_Y(t2).] X and Y independent imply that the joint moment generating function factors into the product of the marginal moment generating functions by Theorem 8, taking g1(X) = e^(t1·X) and g2(Y) = e^(t2·Y). The proof in the other direction will be omitted.
Remark: Both Theorems 8 and 9 can be generalized from two random variables to k random
variables.
Corollary: |ρ[X, Y]| ≤ 1, with equality if and only if one random variable is a linear function of the other with probability 1.
One of the important multivariate densities is the multivariate normal density, which is a
generalization of the normal distribution for a unidimensional random variable. In this section we
discuss a special case, the case of the bivariate normal. In our discussion we will include the joint
density, marginal densities, conditional densities, conditional means and variances, covariance,
and the moment generating function. This section, then, will give an example of many of the
concepts defined in the preceding sections of this chapter.
Let the two-dimensional random variable (X, Y) have the joint probability density function
f(x, y) = [1/(2π·σX·σY·√(1 − ρ²))]·exp{−[((x − μX)/σX)² − 2ρ·((x − μX)/σX)·((y − μY)/σY) + ((y − μY)/σY)²]/[2(1 − ρ²)]}   (16)
for −∞ < x < ∞ and −∞ < y < ∞, where σX > 0, σY > 0, and −1 < ρ < 1; then (X, Y) is said to have a bivariate normal distribution. The probability that a point (X, Y) will lie in any region R of the xy plane is obtained by integrating the density over that region:
P[(X, Y) ∈ R] = ∫∫_R f(x, y) dy dx.   (17)
The density might, for example, represent the distribution of hits on a vertical target, where x and
y represent the horizontal and vertical deviations from the central lines. And in fact the
distribution closely approximates the distribution of this as well as many other bivariate
populations encountered in practice.
We must first show that the function actually represents a density by showing that its integral
over the whole plane is 1, that is,
The density is, of course, positive. To simplify the integral, we shall substitute
u = (x − μX)/σX  and  v = (y − μY)/σY.   (18)
So that it becomes
And if we substitute
both of which are 1, as we have seen in studying the univariate normal distribution.
may be reduced to a form involving only the parameter ρ by making the substitution in (18).
To obtain the moments of X and Y, we shall find their joint moment generating function, which is
given by
Theorem 11: The moment generating function of the bivariate normal distribution is
m_{X,Y}(t1, t2) = exp[μX·t1 + μY·t2 + (σX²·t1² + 2ρ·σX·σY·t1·t2 + σY²·t2²)/2].   (19)
and on completing the square first on u and then on v, we find this expression becomes
which, if we substitute
becomes
The moments may be obtained by evaluating the appropriate derivatives of m_{X,Y}(t1, t2) at t1 = 0, t2 = 0. Thus, ∂m/∂t1 at (0, 0) gives E[X] = μX, and ∂²m/∂t1² at (0, 0) gives E[X²] = μX² + σX²; hence the variance of X is σX². Similarly, on differentiating with respect to t2, one finds the mean and variance of Y to be μY and σY². We can also obtain joint moments E[X^r·Y^s] by differentiating r times with respect to t1 and s times with respect to t2 and then putting t1 and t2 equal to 0. The covariance of X and Y is
cov[X, Y] = E[XY] − μX·μY = ρ·σX·σY, so the parameter ρ is the correlation coefficient of X and Y.
Theorem 12: If (X, Y) has a bivariate normal distribution, then X and Y are independent if and only if X and Y are uncorrelated.
Proof: X and Y are uncorrelated if and only if cov[X, Y] = ρ·σX·σY = 0 or, equivalently, if and only if ρ = 0. When ρ = 0, the joint density f(x, y) factors into the product of two univariate normal densities, so ρ = 0 implies that X and Y are independent. We know that, in general, independence of X and Y implies that X and Y are uncorrelated.
Theorem 13: If (X, Y) has a bivariate normal distribution, then the marginal distributions of X and Y are univariate normal distributions; that is, X is normally distributed with mean μX and variance σX², and Y is normally distributed with mean μY and variance σY².
Proof: The marginal density of one of the variables, X for example, is by definition
f_X(x) = ∫_{−∞}^{∞} f(x, y) dy,
and, again substituting v = (y − μY)/σY and completing the square on v, one finds that f_X(x) is the univariate normal density with mean μX and variance σX². Similarly the marginal density of Y may be found to be the univariate normal density with mean μY and variance σY².
Theorem 14: If (X, Y) has a bivariate normal distribution, then the conditional distribution of X given Y = y is normal with mean μX + ρ·(σX/σY)·(y − μY) and variance σX²·(1 − ρ²). Also, the conditional distribution of Y given X = x is normal with mean μY + ρ·(σY/σX)·(x − μX) and variance σY²·(1 − ρ²).
Proof: The conditional distributions are obtained from the joint and marginal distributions. Thus,
the conditional density of X for fixed values of Y is
f_{X|Y}(x|y) = f(x, y)/f_Y(y), which simplifies to a univariate normal density with mean μX + ρ·(σX/σY)·(y − μY) and variance σX²·(1 − ρ²).   (20)
As we already noted, the mean value of a random variable in a conditional distribution is called a
regression curve when regarded as a function of the fixed variable in the conditional distribution.
Thus the regression for X on Y = y in (20) is μX + ρ·(σX/σY)·(y − μY), which is a linear function of y in
the present case. For bivariate distributions in general, the mean of X in the conditional density
of X given Y = y will be some function of y, say g(y), and the equation x = g(y), when plotted in the xy plane, gives the regression curve for X. It is simply a curve which gives the location of the mean of X for various values of y in the conditional density of X given Y = y.
As the title of this section indicates, we are interested in finding the distributions of functions of random variables. More precisely, for given random variables, say X1, X2, …, Xn, and given functions of the n given random variables, say g1(·), …, gk(·), we want, in general, to find the joint distribution of Y1, …, Yk, where Yj = gj(X1, …, Xn). If the joint density of the random variables X1, …, Xn is given, then, theoretically at least, we can find the joint distribution of Y1, …, Yk. This follows since the joint cumulative distribution function of Y1, …, Yk satisfies the following:
F_{Y1,…,Yk}(y1, …, yk) = P[Y1 ≤ y1, …, Yk ≤ yk] = P[g1(X1, …, Xn) ≤ y1, …, gk(X1, …, Xn) ≤ yk],   (21)
and the probability on the right can, in principle, be computed from the given joint distribution of X1, …, Xn.   (22)
Similarly, for a single function Y = g(X1, …, Xn), an expectation can be computed in either of two ways:
E[Y] = ∫_{−∞}^{∞} y·f_Y(y) dy   (23)
and
E[Y] = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} g(x1, …, xn)·f_{X1,…,Xn}(x1, …, xn) dx1 ⋯ dxn.   (24)
In practice, one would naturally select that method which makes the calculations easier. One
might suspect that (23) gives the better method of the two since it involves only a single integral
whereas (24) involves a multiple integral. On the other hand, (23) involves the density of Y, a
density that may have to be obtained before integration can proceed.
and
Now
And ,
using the fact that Y has a gamma distribution with parameters r = 1/2 and λ = 1/2.
and
Proof: This follows from the properties of expectation.
The following theorem gives a result that is somewhat related to the above theorem inasmuch as
its proof, which is left as an exercise, is similar.
Theorem 16: Let X1, …, Xn and Y1, …, Ym be two sets of random variables, and let a1, …, an and b1, …, bm be two sets of constants; then
E[Σi ai·Xi] = Σi ai·E[Xi],   (25)
and
cov[Σi ai·Xi, Σj bj·Yj] = Σi Σj ai·bj·cov[Xi, Yj].   (26)
In the above subsection the mean and variance of the sum and difference of two random
variables were obtained. It was found that the mean and variance of the sum or difference of
random variables X and Y could be expressed in terms of the means, variances, and covariance of
X and Y. We consider now the problem of finding the first two moments of the product and
quotient of X and Y.
Theorem 17: Let X and Y be two random variables for which var[XY] exists; then
E[XY] = E[X]·E[Y] + cov[X, Y],
and var[XY] can be expressed in terms of the means, variances, covariance, and higher-order joint moments of X and Y.
Proof:
Note that the mean of the product can be expressed in terms of the means and covariance of X
and Y but the variance of the product requires higher order moments.
In general, there are no simple exact formulas for the mean and variance of the quotient of two
random variables in terms of moments of the two random variables; however, there are
approximate formulas which are sometimes useful.
Remark:
and
If the joint distribution of the random variables X1, …, Xn is given, then, theoretically, the joint distribution of the k random variables Y1, …, Yk can be determined, where Yj = gj(X1, …, Xn) for given functions g1(·), …, gk(·). By definition the cumulative distribution function of Y1, …, Yk is F_{Y1,…,Yk}(y1, …, yk) = P[Y1 ≤ y1, …, Yk ≤ yk]. But for each (y1, …, yk) the event {Y1 ≤ y1, …, Yk ≤ yk} = {g1(X1, …, Xn) ≤ y1, …, gk(X1, …, Xn) ≤ yk}. This latter event is described in terms of the given functions g1, …, gk and the given random variables X1, …, Xn. Since the joint distribution of X1, …, Xn is assumed given, presumably the probability of that event can be calculated and consequently F_{Y1,…,Yk}(y1, …, yk) determined. The above described technique for deriving the joint distribution of Y1, …, Yk will be called the cumulative distribution function technique.
An important special case arises if k = 1; then there is only one function, say Y = g(X1, …, Xn), of the given random variables for which one needs to derive the distribution.
Example 12: Let there be only one given random variable, say X, which has a standard normal distribution. Suppose the distribution of Y = X² is desired.
F_Y(y) = P[X² ≤ y] = P[−√y ≤ X ≤ √y] = 2Φ(√y) − 1  for y > 0,
which can be recognized as the cumulative distribution function of a gamma distribution with parameters r = 1/2 and λ = 1/2.
Proof: .
Similarly, P[Y1 > y] = P[X1 > y, …, Xn > y], since Y1 = min(X1, …, Xn) is greater than y if and only if every Xi > y. And if X1, …, Xn are independent, then P[Y1 > y] = ∏_{i=1}^{n} [1 − F_{Xi}(y)], so that F_{Y1}(y) = 1 − ∏_{i=1}^{n} [1 − F_{Xi}(y)].
Proof: .
Example 13: suppose that the life of a certain light bulb is exponentially distributed with mean
100 hours. If 10 such light bulbs are installed simultaneously, what is the distribution of the life
of the light bulb that fails first, and what is its expected life? Let Xi denote the life of the ith light bulb; then Y1 = min(X1, …, X10) is the life of the light bulb that fails first. Assume that the Xi's are independent.
Now P[Xi > y] = e^(−y/100) for y > 0;
so P[Y1 > y] = ∏_{i=1}^{10} e^(−y/100) = e^(−y/10), which is the survival function of an exponential distribution with mean 10; that is, Y1 is exponentially distributed with mean 10 hours, and its expected life is 10 hours.
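A simulation sketch (illustrative only) confirming that the first failure time is exponential with mean 10 hours:

    import numpy as np

    # The minimum of 10 independent exponential lifetimes with mean 100 hours
    # should itself be exponential with mean 10 hours.
    rng = np.random.default_rng(1)
    lifetimes = rng.exponential(scale=100.0, size=(100_000, 10))
    first_failure = lifetimes.min(axis=1)
    print(round(first_failure.mean(), 2))                              # close to 10
    print(round((first_failure > 10).mean(), 3), round(np.exp(-1), 3)) # both near e^-1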
Let X and Y be jointly distributed continuous random variables with density f_{X,Y}(x, y), and let Z = X + Y. Then
f_Z(z) = ∫_{−∞}^{∞} f_{X,Y}(x, z − x) dx, and, in particular, if X and Y are independent, f_Z(z) = ∫_{−∞}^{∞} f_X(x)·f_Y(z − x) dx.
Proof:
by making the
substitution . Now
. ///
Proof:
Hence,
///
Example 14: Suppose that X and Y are independent and identically distributed, each with density f(x) = 1 for 0 < x < 1 and 0 elsewhere. Note that since both X and Y assume values between 0 and 1, Z = X + Y assumes values between 0 and 2.
Let X and Y be jointly distributed continuous random variables with density f_{X,Y}(x, y), and let Z = XY and U = X/Y; then
f_Z(z) = ∫_{−∞}^{∞} (1/|x|)·f_{X,Y}(x, z/x) dx
and
f_U(u) = ∫_{−∞}^{∞} |y|·f_{X,Y}(u·y, y) dy.
Proof:
proof:
Hence,
Example 15: Suppose X and Y are independent random variables, each uniformly distributed
over the interval (0,1). Let Z=XY and U=X/Y.
[Figure 2.3: the region of integration in the (u, y) plane for Example 15, bounded in part by the curve y = 1/u, with both axes marked at 1.]
There is another method of determining the distribution of functions of random variables which
we shall find to be particularly useful in certain instances. This method is built around the
concept of the moment generating function and will be called the generating function technique.
The statement of the problem remains the same. For given random variables X1, …, Xn with given density f_{X1,…,Xn}(·) and given functions g1(·), …, gk(·), find the joint distribution of Y1 = g1(X1, …, Xn), …, Yk = gk(X1, …, Xn). Now the joint moment generating function of Y1, …, Yk, if it exists, is
m_{Y1,…,Yk}(t1, …, tk) = E[e^(t1·Y1 + ⋯ + tk·Yk)] = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} e^(t1·g1(x1,…,xn)+⋯+tk·gk(x1,…,xn))·f_{X1,…,Xn}(x1, …, xn) dx1 ⋯ dxn.   (27)
If, after the integration of (27) is performed, the resulting function of t1, …, tk can be recognized as the joint moment generating function of some known joint distribution, it will follow that Y1, …, Yk has that joint distribution by virtue of the fact that a moment generating function, when
it exists, is unique and uniquely determines its distribution function.
The most useful application of the moment generating function technique is to find the distribution of sums of independent random variables.
Example 16: Suppose X has a normal distribution with mean 0 and variance 1. Let Y = X², and find the distribution of Y.
m_Y(t) = E[e^(t·X²)] = ∫_{−∞}^{∞} [1/√(2π)]·e^(t·x² − x²/2) dx = (1 − 2t)^(−1/2)  for t < 1/2,
which is the moment generating function of a gamma distribution with parameters r = 1/2 and λ = 1/2; hence Y = X² has that gamma distribution.
Example 17: Let X1 and X2 be two independent standard normal random variables. Let Y1 = X1 + X2 and Y2 = X1 − X2. Find the joint distribution of Y1 and Y2.
We note that Y1 and Y2 are independent random variables (by Theorem 9) and each has a normal
distribution with mean 0 and variance 2.
In the above example we were able to manipulate expectations and avoid performing an
integration to find the desired joint moment generating function. In the following example the
integration will have to be performed.
Example 18: Let X1 and X2 be two independent standard normal random variables. Let Y = X1² + X2². Then
m_Y(t) = E[e^(t·(X1² + X2²))] = (1 − 2t)^(−1)  for t < 1/2;
hence, Y has a gamma distribution with parameters r = 1 and λ = 1/2, that is, an exponential distribution with mean 2. ///
Theorem 20: If X1, …, Xn are independent random variables and the moment generating function of each, m_{Xi}(t), exists for all −h < t < h for some h > 0, let Y = X1 + X2 + ⋯ + Xn; then
m_Y(t) = ∏_{i=1}^{n} m_{Xi}(t)  for −h < t < h.
Proof:
using Theorem 9.
Example 19: Suppose that X1, …, Xn are independent Bernoulli random variables; that is, P[Xi = 1] = p and P[Xi = 0] = q = 1 − p. Then m_{Xi}(t) = q + p·e^t.
So m_Y(t) = ∏_{i=1}^{n} (q + p·e^t) = (q + p·e^t)^n,
the moment generating function of a binomial random variable; hence Y = X1 + ⋯ + Xn has a binomial distribution with parameters n and p.
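A quick simulation check of Example 19, comparing the distribution of a sum of Bernoulli variables with the binomial probabilities; the values n = 10 and p = 0.3 are chosen only for illustration:

    import numpy as np
    from scipy.stats import binom

    n, p = 10, 0.3
    rng = np.random.default_rng(2)
    y = rng.binomial(1, p, size=(50_000, n)).sum(axis=1)     # sums of Bernoulli trials
    for x in range(4):
        print(x, round((y == x).mean(), 3), round(binom.pmf(x, n, p), 3))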
Example 20: Suppose that X1, …, Xn are independent Poisson distributed random variables, Xi having parameter λi. Then m_{Xi}(t) = exp[λi·(e^t − 1)],
and hence m_Y(t) = ∏_{i=1}^{n} exp[λi·(e^t − 1)] = exp[(λ1 + ⋯ + λn)·(e^t − 1)],
which is again the moment generating function of a Poisson distributed random variable having parameter λ1 + ⋯ + λn.
Example 21: Assume that X1, …, Xn are independent and identically distributed exponential random variables with parameter λ; then m_{Xi}(t) = λ/(λ − t) for t < λ.
So m_Y(t) = [λ/(λ − t)]^n,
which is the moment generating function of a gamma distribution with parameters n and λ; hence, Y = X1 + ⋯ + Xn has a gamma distribution with parameters n and λ.
; then , and .
Hence ,
2.5.4 Transformations
a) Distribution of Y = g(X)
Example 24: Let X be a random variable with uniform distribution over the interval (0, 1) and let
. The density of Y is desired.
so and therefore .
Theorem 21: Suppose X is a continuous random variable with probability density function f_X(x), and let S = {x: f_X(x) > 0} denote its support. Set Y = g(X). Assume that:
i) y = g(x) defines a one-to-one transformation of S onto its image T = g(S).
ii) The derivative of x = g⁻¹(y) with respect to y is continuous and nonzero for y in T, where g⁻¹(y) is the inverse function of g(x); that is, g⁻¹(y) is that x for which g(x) = y.
Then Y = g(X) is a continuous random variable with density
f_Y(y) = |d g⁻¹(y)/dy|·f_X(g⁻¹(y))  for y in T, and f_Y(y) = 0 elsewhere.
Proof: The above is a standard theorem from calculus on the change of variable in a definite integral, so we will only sketch the proof. Consider the case where S is an interval, and suppose that g(·) is a monotone increasing function over S; that is, g(x1) < g(x2) whenever x1 < x2, which is true if and only if g⁻¹(y1) < g⁻¹(y2) whenever y1 < y2. For y in T,
F_Y(y) = P[g(X) ≤ y] = P[X ≤ g⁻¹(y)] = F_X(g⁻¹(y)),
and differentiating with respect to y gives f_Y(y) = f_X(g⁻¹(y))·(d/dy)g⁻¹(y).
By Theorem 21,
Example 25: Suppose X has the Pareto density and the distribution of
is desired.
f_Y(y) = Σi |d gi⁻¹(y)/dy|·f_X(gi⁻¹(y)),   (28)
where the sum runs over the pieces of the support on which the transformation y = g(x) is one-to-one, with inverse gi⁻¹ on the i-th piece.
Example 26: Let X be a continuous random variable with density f_X(x), and let Y = X². Note that if the support of X is an interval containing both negative and positive points, then y = x² is not one-to-one. However, if the support is decomposed into the pieces (−∞, 0) and [0, ∞), then y = x² defines a one-to-one transformation on each piece, with inverses −√y and √y. By (28),
f_Y(y) = [f_X(−√y) + f_X(√y)]/(2√y)  for y > 0.
In particular, if ,
Then ; or if
Then
And
.
Suppose now that we are given the joint probability density function f_{X1,…,Xn}(x1, …, xn) of the n-dimensional continuous random variable (X1, …, Xn). Let Y1 = g1(X1, …, Xn), …, Yk = gk(X1, …, Xn) be the functions of interest; if k < n, introduce additional functions Yk+1 = gk+1(X1, …, Xn), …, Yn = gn(X1, …, Xn). This use of possibly introducing additional random variables makes the transformation y1 = g1(x1, …, xn), …, yn = gn(x1, …, xn) a transformation from an n-dimensional space to an n-dimensional space. Henceforth we will assume that we are seeking the joint distribution of Y1, …, Yn (rather than the joint distribution of Y1, …, Yk) when we have given the joint probability density of X1, …, Xn.
We will state our results first for n = 2 and later generalize to n > 2. Let f_{X1,X2}(x1, x2) be given. Set Y1 = g1(X1, X2) and Y2 = g2(X1, X2). We want to find the joint distribution of Y1 and Y2 for known functions g1(·,·) and g2(·,·). Now suppose that y1 = g1(x1, x2) and y2 = g2(x1, x2) defines a one-to-one transformation which maps a set A onto a set B. Then x1 and x2 can be expressed in terms of y1 and y2; so we can write, say, x1 = g1⁻¹(y1, y2) and x2 = g2⁻¹(y1, y2). Note that A is a subset of the x1x2 plane and B is a subset of the y1y2 plane. The determinant
J = (∂x1/∂y1)·(∂x2/∂y2) − (∂x1/∂y2)·(∂x2/∂y1)
will be called the Jacobian of the transformation and will be denoted by J. The above discussion permits us to state Theorem 22.
Theorem 22: Let X1 and X2 be jointly continuous random variables with density function f_{X1,X2}(x1, x2). Set Y1 = g1(X1, X2) and Y2 = g2(X1, X2). Assume that:
(i) y1 = g1(x1, x2), y2 = g2(x1, x2) defines a one-to-one transformation of A = {(x1, x2): f_{X1,X2}(x1, x2) > 0} onto a set B.
(ii) The first partial derivatives of x1 = g1⁻¹(y1, y2) and x2 = g2⁻¹(y1, y2) are continuous over B.
(iii) The Jacobian of the transformation is nonzero for (y1, y2) in B.
Then the joint density of Y1 and Y2 is given by
f_{Y1,Y2}(y1, y2) = |J|·f_{X1,X2}(g1⁻¹(y1, y2), g2⁻¹(y1, y2))  for (y1, y2) in B, and 0 elsewhere.
Example 28: Suppose that X1 and X2 are independent random variables, each uniformly distributed over the interval (0, 1). Then f_{X1,X2}(x1, x2) = 1 for 0 < x1 < 1 and 0 < x2 < 1. Let Y1 and Y2 be given functions of X1 and X2; then the joint density of Y1 and Y2 follows from Theorem 22 once the inverse transformation and its Jacobian are written down.
Example 29: Let X1 and X2 be two independent standard normal random variables. Let Y2 = X1/X2, and introduce Y1 = X1 as an auxiliary variable. Then x1 = y1, x2 = y1/y2, the Jacobian is J = −y1/y2², and
f_{Y1,Y2}(y1, y2) = (|y1|/y2²)·(1/2π)·exp[−(y1² + y1²/y2²)/2].
To find the marginal distribution of Y2, we must integrate out y1; that is,
f_{Y2}(y2) = ∫_{−∞}^{∞} f_{Y1,Y2}(y1, y2) dy1 = [1/(π·y2²)]·∫₀^∞ y1·exp[−y1²·(1 + 1/y2²)/2] dy1 = 1/[π·(1 + y2²)],
a Cauchy density. That is, the ratio of two independent standard normal random variables has a Cauchy distribution.
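A simulation sketch (illustrative only) of this conclusion, comparing the empirical distribution of the ratio of two independent standard normals with the Cauchy cumulative distribution function:

    import numpy as np
    from scipy.stats import cauchy

    rng = np.random.default_rng(3)
    x1, x2 = rng.standard_normal(200_000), rng.standard_normal(200_000)
    ratio = x1 / x2
    # Compare a few empirical cumulative probabilities with the standard Cauchy cdf.
    for t in (-2.0, -0.5, 0.0, 1.0, 3.0):
        print(t, round((ratio <= t).mean(), 3), round(cauchy.cdf(t), 3))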