Review of Probability Theory


Statistical Theory Of Distribution

CHAPTER ONE
Review of Probability Theory

1.1 Probability Concepts

In statistics, a set of all possible outcomes of an experiment is called a sample space. An


experiment may consist of the simple process of noting whether a switch is turned on or off; it may
consist of determining the time it takes a car to accelerate to 30 miles per hour; or it may consist
of the complicated process of finding the mass of an electron.

The outcome of an experiment may be a simple choice between two possibilities: it may be the
result of a direct measurement or count, or it may be an answer obtained after extensive
measurements and calculations.

Discuss finite and infinite sample space.

A sample space is said to be a discrete sample space if it has finitely many or a countable infinity
of elements. If the elements (points) of a sample space constitute a continuum (for example, all
the points on a line, or all the points in a plane), the sample space is said to be a continuous sample
space. In statistics, any subset of a sample space is called an event.

Discuss unions, intersections and complements of events. Also discuss Venn diagrams and tree diagrams.

Probability is a measure of uncertainty. The axioms of probability theory are

A1. P(A) ≥ 0 for every event A in S.

A2. P(S) = 1.

A3. If events A1, A2, … are mutually exclusive, P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …; we call this axiom the special addition
rule. The general addition rule for the probability of any two events A and B in S is given as
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (prove this as homework).

The rule of complement: if A is any event in S, then P(A′) = 1 − P(A) (prove this as homework).


Conditional probability: if A and B are any two events in S and P(A) ≠ 0, the conditional
probability of B given A is P(B|A) = P(A ∩ B)/P(A).

General multiplication rule of probability: if A and B are any two events in S, then P(A ∩ B) = P(A)·P(B|A) = P(B)·P(A|B).

A and B are two independent events if and only if P(A ∩ B) = P(A)·P(B).

Bayes’ Theorem
The general multiplication rules are useful in solving many problems in which the ultimate
outcome of an experiment depends on the outcomes of various intermediate stages. Suppose, for
instance, that an assembly plant receives its voltage regulators from three different suppliers,
60% from supplier B1, 30% from supplier B2, and 10% from supplier B3. Also suppose that
95% of the voltage regulators from B1, 80% of those from B2, and 65% of those from B3
perform according to specifications. We would like to know the probability that any one voltage
regulator received by the plant will perform according to specification.

If A denotes the event that a voltage regulator received by the plant performs according to
specifications, and B1, B2, and B3 are the events that it comes from the respective suppliers, we
can write

A = (A ∩ B1) ∪ (A ∩ B2) ∪ (A ∩ B3),

where B1, B2, and B3 are mutually exclusive events of which one must occur. It follows that
A∩B1, A∩B2, and A∩B3 are also mutually exclusive. We get

P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) = P(B1)P(A|B1) + P(B2)P(A|B2) + P(B3)P(A|B3),

since A∩B1, A∩B2, and A∩B3 are mutually exclusive.

→ P(A) = 0.6(0.95) + 0.3(0.80) + 0.1(0.65) = 0.875

[Tree diagram for the example dealing with the three suppliers of voltage regulators: branches of probability 0.6, 0.3, and 0.1 lead to B1, B2, and B3, and from these, branches of probability 0.95, 0.80, and 0.65 lead to A.]

If B1, B2, …, Bn are mutually exclusive events, of which one must occur, then

 P(A) = Σ from i=1 to n of P(Bi) P(A|Bi)   (Rule of elimination/Total probability)

 P(Br|A) = P(Br) P(A|Br) / [Σ from i=1 to n of P(Bi) P(A|Bi)]   (Bayes’ Theorem), r=1, 2, …, n
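As a numerical illustration of these two rules, the voltage-regulator example above can be coded directly. This is a minimal Python sketch; the dictionary layout and variable names are just one possible arrangement, not part of the original notes.

```python
# Total probability and Bayes' theorem for the voltage-regulator example.
priors = {"B1": 0.60, "B2": 0.30, "B3": 0.10}        # P(Bi): share received from each supplier
likelihoods = {"B1": 0.95, "B2": 0.80, "B3": 0.65}   # P(A|Bi): proportion meeting specifications

# Rule of elimination (total probability): P(A) = sum_i P(Bi) P(A|Bi)
p_A = sum(priors[b] * likelihoods[b] for b in priors)
print(f"P(A) = {p_A:.3f}")                           # 0.875, as in the text

# Bayes' theorem: P(Br|A) = P(Br) P(A|Br) / P(A)
for b in priors:
    posterior = priors[b] * likelihoods[b] / p_A
    print(f"P({b}|A) = {posterior:.4f}")
```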

1.2 Discrete Probability Distributions

In most statistical problems we are concerned with one number or a few numbers that are
associated with the outcomes of experiments. In the inspection of a manufactured product we
may be interested only in the number of defectives; in the analysis of a road test we may be
interested only in the average speed and the average fuel consumption; and in the study of the
performance of a rotary switch we may be interested only in its lubrication, electrical current,
and humidity. All these numbers are associated with situations involving an element of chance;
in other words, they are values of random variables.

In the study of random variables we are usually interested in their probability distributions,
namely, in the probabilities with which they take on the various values in their range.

Random Variables: The random variable may be thought of as a function defined over the
elements of the sample space. This is how we define the random variables in general; they are
functions defined over the elements of a sample space. A random variable is any function that
assigns a numerical value to each possible outcome. The numerical value should be chosen to


quantify an important characteristic of the outcome. Random variables are denoted by capital
letters X, Y and so on, to distinguish them from their possible values given in lower case x, y.

To denote the values of a probability distribution, we shall use such symbols as f(x), g(x), h(y),
and so on. Strictly speaking, it is the function f(x) = P(X=x) which assigns probability to each
possible outcome x that is called the probability distribution.

Random variables are usually classified according to the number of values they can assume. In
this chapter we shall limit our discussion to discrete and continuous univariate random variables.

Wherever possible, we try to express probability distributions by means of equations; otherwise,


we must give a table that actually exhibits the correspondence between the values of the random
variable and associated probabilities. For instance,

f(x) = 1/6 for x = 1, 2, 3, 4, 5, 6

gives the probability distribution for the number of points we roll with a balanced die. Of course,
not every function defined for the values of a random variable can serve as a probability
distribution. Since the values of probability distributions are probabilities and one value of a
random variable must always occur, it follows that if f(x) is a probability distribution, then

f(x) ≥ 0 for all x, and Σ over all x of f(x) = 1.       (1)

The probability distribution of a discrete random variable X is a list of the possible values of X
together with their probabilities

f(x) = P(X = x) for each possible value x.       (2)

The probability distribution always satisfies the conditions (1). As we shall see later in this chapter,
there are many problems in which we are interested not only in the probability f(x) that the value
of a random variable is x, but also in the probability F(x) that the value of a random variable is less
than or equal to x. We refer to the function F(x) as the cumulative distribution function, or just the
distribution function, of the random variable:

F(x) = P(X ≤ x) = Σ over t ≤ x of f(t).       (3)


1.2.1 Binomial Distribution

Many statistical problems deal with the situations referred to as repeated trials. For example, we
may want to know the probability that one of five rivets will rupture in a tensile test, the
probability that nine of ten DVD players will run at least 1,000 hours, the probability that 45 of
300 drivers stopped at a roadblock will be wearing seat belts, or the probability that 66 of 200
television viewers (interviewed by rating service) will recall what products were advertised on a
given program. To borrow from the language of games of chance, we might say that in each of
these examples we are interested in the probability of getting x successes in n trials, or in other
words, x successes and n-x failures in n attempts.

There are common features to each of the examples. They are all composed of a series of trials
which we refer to as Bernoulli trials if the following assumptions hold.

 There are only two possible outcomes for each trial (arbitrarily called success and failure,
without inferring that a success is necessarily desirable).
 The probability of success is the same for each trial.
 The outcomes from different trials are independent.

If these assumptions cannot be met, the theory we develop does not apply. In the problems we study in
this section, we add the additional assumption that the number of trials is fixed in advance.

 There are a fixed number n of Bernoulli trials conducted.

Let X be the random variable that equals the number of successes in n trials. To obtain
probabilities concerning X, we proceed as follows: If p and 1 − p are the probabilities of success
and failure on any one trial, then the probability of getting x successes and n − x failures, in some
specific order, is

p^x (1 − p)^(n−x).       (4)

Clearly, in this product of p’s and q’s there is one factor p for each success, one factor q for each
failure. The x factors p and n-x factors q are all multiplied together by virtue of the generalized
multiplication rule for more than two independent events. Since this probability applies to any
point of the sample space that represents x successes and n-x failures in any specific order, we


have only to count how many points of this kind there are, and then multiply p^x (1 − p)^(n−x) by this number.
Clearly, the number of ways in which we can select the x trials on which there is to be a success

is C(n, x) = n!/[x!(n − x)!], the number of combinations of x objects selected from a set of n objects, and we thus

arrive at the following result:

b(x; n, p) = C(n, x) p^x (1 − p)^(n−x) for x = 0, 1, 2, …, n.       (5)

This probability distribution is called the binomial distribution because for x = 0, 1, …, n the
values of the probabilities are the successive terms of the binomial expansion of (q + p)^n; for

the same reason, the combinatorial quantities C(n, x) are referred to as binomial coefficients.

Actually, the preceding equation defines a family of probability distributions, with each member
characterized by a given value of the parameter p and the number of trials n.

If n is large, the calculation of binomial probabilities can become quite tedious. Many statistical
software programs have binomial distribution commands, as do some statistical calculators.
Otherwise it is convenient to refer to special tables.

The binomial table gives the values of

B(x; n, p) = Σ from k=0 to x of b(k; n, p)       (6)

for n = 2 to 25 and p = 0.05, 0.10, 0.15, …, 0.90, 0.95.

We tabulate the cumulative probabilities rather than the values of b(x; n, p), because the cumulative values
are the ones needed more often in statistical applications. Note, however, that the
values of b(x; n, p) can be obtained by subtracting adjacent entries in the table below, because the
two cumulative probabilities B(x; n, p) and B(x − 1; n, p) differ by the single term b(x; n, p):

b(x; n, p) = B(x; n, p) − B(x − 1; n, p).       (7)

Example: A manufacturer of fax machines claims that only 10% of his machines require repairs
within the warranty period of 12 months. If 5 of 15 of his machines required repairs within the
first year, does this tend to support or refute the claim?


Solution: Let us first find the probability that 5 or more of 15 of the fax machines will require
repairs within a year when the probability that any one will require repairs within a year is 0.1.
Using the table below, we get

P(X ≥ 5) = 1 − B(4; 15, 0.1) = 1 − 0.9873 = 0.0127.
Since this probability is very small, it would seem reasonable to reject the fax machine
manufacturer’s claim.
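The table lookup can be reproduced in software. A minimal sketch, assuming scipy is available, evaluating the cumulative binomial probability of equation (6):

```python
from scipy.stats import binom

n, p = 15, 0.10
# P(X >= 5) = 1 - B(4; 15, 0.1)
prob = 1 - binom.cdf(4, n, p)
print(f"P(X >= 5) = {prob:.4f}")   # roughly 0.0127, small enough to question the claim
```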

Table 1. Binomial Distribution Table

If p = 0.5, the equation for the binomial distribution is b(x; n, 0.5) = C(n, x)(1/2)^n, and since

C(n, x) = C(n, n − x), it follows that b(x; n, 0.5) = b(n − x; n, 0.5). For any n, the binomial distribution with
p = 0.5 is a symmetric distribution. A probability distribution that has a long tail on one side is


said to be a skewed distribution. It is said to be positively skewed if the tail is on the right,
and negatively skewed if the tail is on the left.

Theorem 1: If X has a binomial distribution with parameters n and p, then E[X] = np and Var[X] = np(1 − p) = npq.

Proof: The moment generating function of X is

M_X(t) = Σ from x=0 to n of e^(tx) C(n, x) p^x q^(n−x) = (q + pe^t)^n.

Now M′_X(0) = np gives E[X] = np, and M″_X(0) = np + n(n − 1)p^2 gives E[X^2], so that Var[X] = E[X^2] − (E[X])^2 = np(1 − p).

The binomial distribution reduces to the Bernoulli distribution when n = 1. Sometimes the
Bernoulli distribution is called the point binomial.

1.2.2 Discrete Uniform Distribution

Each member of the family of discrete density functions

f(x; N) = 1/N for x = 1, 2, …, N       (8)

[Figure: mass function of the discrete uniform distribution, height 1/N at each of the points 1, 2, …, N.]

where the parameter N ranges over the positive integers, is defined to have a discrete uniform distribution. A random variable
X having the mass function given in (8) is called a discrete uniform random variable.

Theorem 2: If X has a discrete uniform distribution, then E[X] = (N + 1)/2 and Var[X] = (N^2 − 1)/12.

Proof: E[X] = Σ from x=1 to N of x/N = (N + 1)/2, and E[X^2] = Σ from x=1 to N of x^2/N = (N + 1)(2N + 1)/6, so that

Var[X] = (N + 1)(2N + 1)/6 − (N + 1)^2/4 = (N^2 − 1)/12. (Show the moment generating function.)

1.2.3 Bernoulli Distribution

A random variable X is defined to have a Bernoulli distribution if the discrete density function of
X is given by

f(x; p) = p^x (1 − p)^(1−x) for x = 0, 1       (9)

[Figure: mass function of the Bernoulli distribution, mass q at 0 and mass p at 1.]

where the parameter p satisfies 0 ≤ p ≤ 1, and q = 1 − p.

Theorem 3: If X has a Bernoulli distribution, then E[X] = p and Var[X] = pq.

Proof: E[X] = 0·q + 1·p = p and E[X^2] = 0^2·q + 1^2·p = p, so Var[X] = p − p^2 = pq.

1.2.4 Hyper-geometric Distribution

Suppose that we are interested in the number of defectives in a sample of n units drawn without
replacement from a lot containing N units, of which a are defective. Let the sample be drawn in
such a way that at each successive drawing, whatever units are left in the lot have the same


chance of being selected. The probability that the first drawing will yield a defective unit is

a/N, but for the second drawing it is (a − 1)/(N − 1) or a/(N − 1), depending on whether or not the first unit drawn

was defective. Thus, the trials are not independent; the assumption of independent trials underlying the
binomial distribution is not met, and the binomial distribution does not apply. Note that the
binomial distribution would apply if we did the sampling with replacement, namely, if each unit
selected for the sample were replaced before the next one is drawn.

To solve the problem of sampling without replacement, let us proceed as follows: The x
successes (defectives) can be chosen in C(a, x) ways, the n − x failures (non-defectives) can be chosen
in C(N − a, n − x) ways, and hence x successes and n − x failures can be chosen in C(a, x)·C(N − a, n − x) ways.

Also, n objects can be chosen from a set of N objects in C(N, n) ways, and if we consider all the
possibilities as equally likely, it follows that for sampling without replacement the probability of
getting “x successes in n trials” is

h(x; n, a, N) = C(a, x) C(N − a, n − x) / C(N, n),       (10)

where x cannot exceed a and n − x cannot exceed N − a. This equation defines the hypergeometric
distribution, whose parameters are the sample size n, the lot size (population size) N, and the
number of successes in the lot a.

Example: A shipment of 20 digital voice recorders contains 5 that are defective. If 10 of them
are randomly chosen for inspection, what is the probability that 2 of the 10 will be defective?
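A computational sketch of this example, assuming scipy is available; it evaluates the hypergeometric probability of equation (10) and, for comparison, the binomial value 0.282 discussed below.

```python
from scipy.stats import hypergeom, binom

N, a, n = 20, 5, 10          # lot size, defectives in lot, sample size
# scipy's parameter order is (M = population size, n = successes in population, N = sample size)
p_hyper = hypergeom.pmf(2, N, a, n)
p_binom = binom.pmf(2, n, a / N)     # the (inappropriate) binomial approximation
print(f"hypergeometric: {p_hyper:.3f}")   # about 0.348
print(f"binomial:       {p_binom:.3f}")   # about 0.282, too small since n is not small relative to N
```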

In this example, n was not small compared to N, and if we had made the mistake of using the
binomial distribution with n = 10 and p = 5/20 to calculate the probability of 2 defectives, the
result would have been 0.282, which is much too small. However, when n is small compared to
N, less than N/10, the composition of the lot is not seriously affected by drawing the sample

without replacement, and the binomial distribution with the parameters n and p=a/N will yield a
good approximation.

In general, it can be shown that h(x; n, a, N) approaches b(x; n, p) with p = a/N when N → ∞,

and a good rule of thumb is to use the binomial distribution as an approximation to the
hypergeometric distribution if n/N ≤ 0.1.

Theorem 4: If X has a hypergeometric distribution, then E[X] = na/N and Var[X] = (na/N)(1 − a/N)(N − n)/(N − 1).

Proof: E[X] = Σ over x of x C(a, x) C(N − a, n − x)/C(N, n) = na/N,

using x C(a, x) = a C(a − 1, x − 1) and Σ over x of C(a − 1, x − 1) C(N − a, n − x) = C(N − 1, n − 1). A similar computation for E[X(X − 1)] gives the variance.

When we first defined the variance of a probability distribution, it may have occurred to the reader
that the formula looked exactly like the one which we use in physics to define second moments,
or moments of inertia. Indeed, it is customary in statistics to define the kth moment about the
origin as

μ′_k = E[X^k] = Σ over x of x^k f(x)       (11)

and the kth moment about the mean as


μ_k = E[(X − μ)^k] = Σ over x of (x − μ)^k f(x)       (12)

Thus, the mean is the first moment about the origin, and the variance is the second moment
about the mean. Higher moments are often used in statistics to give further descriptions of
probability distributions. For instance, the third moment about the mean (divided by σ^3 to make
this measure independent of the scale of measurement) is used to describe the symmetry, or
skewness, of a distribution; the fourth moment about the mean (divided by σ^4) is, similarly, used
to describe its peakedness, or kurtosis.

For the second moment about the mean we thus have μ_2 = σ^2 = E[(X − μ)^2]. If σ^2 is large, there is a
correspondingly higher probability of getting values far away from the mean. Formally, the
idea is expressed by the following theorem.

Chebyshev’s Theorem (Theorem 5): If a probability distribution has mean μ and standard deviation σ, the
probability of getting a value which deviates from μ by at least kσ is at most 1/k^2.

Symbolically, P(|X − μ| ≥ kσ) ≤ 1/k^2.

To prove this theorem, consider any discrete probability distribution f(x), having mean μ and variance
σ^2. Write

σ^2 = Σ over x of (x − μ)^2 f(x) = Σ over R1 of (x − μ)^2 f(x) + Σ over R2 of (x − μ)^2 f(x) + Σ over R3 of (x − μ)^2 f(x),

where R1 is the region for which x ≤ μ − kσ, R2 is the region for which

μ − kσ < x < μ + kσ, and R3 is the region for which x ≥ μ + kσ. Since (x − μ)^2 f(x) cannot be negative, the
above sum over R2 is nonnegative, and without it the sum of the summations over R1 and R3 is
less than or equal to σ^2, that is,

Σ over R1 of (x − μ)^2 f(x) + Σ over R3 of (x − μ)^2 f(x) ≤ σ^2.

But x − μ ≤ −kσ in the region R1 and x − μ ≥ kσ in the region R3, so that in either case
|x − μ| ≥ kσ; hence, in both regions (x − μ)^2 ≥ k^2σ^2. If we now replace (x − μ)^2 in each sum
by k^2σ^2, a number less than or equal to (x − μ)^2, we obtain the inequality

k^2σ^2 Σ over R1 of f(x) + k^2σ^2 Σ over R3 of f(x) ≤ σ^2, or Σ over R1 of f(x) + Σ over R3 of f(x) ≤ 1/k^2.

Since Σ over R1 of f(x) + Σ over R3 of f(x) represents the probability assigned to the region R1 ∪ R3, namely

P(|X − μ| ≥ kσ), this completes the proof of the theorem.

To obtain an alternative form of Chebyshev’s theorem, note that the event |X − μ| ≥ kσ is the
complement of the event |X − μ| < kσ; hence, the probability of getting a value which deviates

from μ by less than kσ is at least 1 − 1/k^2.

Example: The number of customers who visit a car dealer’s showroom on a Saturday morning is
a random variable with mean and standard deviation . With what probability can
we assert that there will be more than 8 but fewer than 28 customers?

Solution: Let X be the number of customers. Since
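The numerical values of the mean and standard deviation did not survive in the extracted text. The sketch below therefore assumes, purely for illustration, μ = 18 and σ = 2.5, and applies the alternative form of Chebyshev's theorem to the interval (8, 28).

```python
# Chebyshev bound: P(|X - mu| < k*sigma) >= 1 - 1/k**2
mu, sigma = 18.0, 2.5        # assumed values; the original figures are missing from the text
low, high = 8, 28

# the interval (8, 28) is symmetric about the assumed mu = 18, so k = (high - mu) / sigma
k = (high - mu) / sigma
bound = 1 - 1 / k**2
print(f"k = {k:.1f}, so P({low} < X < {high}) >= {bound:.4f}")   # k = 4 gives at least 15/16
```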

1.2.5 The Poisson Approximation To The Binomial Distribution

The Poisson distribution often serves as a model for counts which do not have a natural upper
bound. The Poisson distribution with mean λ > 0 has probabilities given by

f(x; λ) = e^(−λ) λ^x / x! for x = 0, 1, 2, …       (13)

When n is large and p is small, binomial probabilities are often approximated by means of the
Poisson distribution with mean λ = np. Before we justify this approximation, let us point out
that x = 0, 1, … means that there is a countable infinity of probabilities, and this requires that we
modify the third axiom of probability.

Axiom 3′. If A1, A2, … is a finite or infinite sequence of mutually exclusive events in the sample
space S, then

P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …

To verify that P(S) = 1 for the Poisson distribution, we make use of Axiom 3′ and write


P(S) = Σ from x=0 to ∞ of e^(−λ) λ^x/x! = e^(−λ) Σ from x=0 to ∞ of λ^x/x!;

since the infinite series in the expression on the right is the Maclaurin series for e^λ, it follows that
P(S) = e^(−λ) e^λ = 1. Let us now show that when n → ∞ and p → 0, while np = λ remains constant, the limiting form of the binomial

distribution is f(x; λ) = e^(−λ) λ^x/x! for x = 0, 1, 2, …

First let us substitute λ/n for p into the formula for the binomial distribution and simplify the
resulting expression; thus, we get

b(x; n, λ/n) = C(n, x)(λ/n)^x (1 − λ/n)^(n−x) = [n(n−1)⋯(n−x+1)/x!] (λ/n)^x (1 − λ/n)^(n−x).

If we let n → ∞, we find that n(n−1)⋯(n−x+1)/n^x → 1 and that (1 − λ/n)^(n−x) → e^(−λ).

Hence, the binomial distribution f(x; n, p) approaches (λ^x/x!) e^(−λ) for x = 0, 1, 2, …

The distribution at which we arrived is called the Poisson distribution. An acceptable rule of
thumb is to use this approximation if n ≥ 20 and p ≤ 0.05; the approximation is
generally excellent so long as n ≥ 100 and np ≤ 10.

We obtain the probabilities of the Poisson distribution from the table of cumulative probabilities F(x; λ) = Σ from k=0 to x of e^(−λ)λ^k/k! by f(x; λ) = F(x; λ) − F(x − 1; λ).

Theorem 6: Let X be a Poisson random variable with parameter λ; then E[X] = λ and Var[X] = λ.

Proof: The moment generating function is M_X(t) = Σ from x=0 to ∞ of e^(tx) e^(−λ)λ^x/x! = e^(λ(e^t − 1)); differentiating and setting t = 0 gives E[X] = λ and E[X^2] = λ + λ^2, so Var[X] = λ.


Poisson processes
In general, a random process is a physical process that is wholly or in part controlled by some
sort of chance mechanism. It may be a sequence of repeated flips of a coin, measurements of the
quality of manufactured products coming off an assembly line, the vibrations of airplane wings,
the noise in a radio signal, or any one of numerous other phenomena. What characterizes such
processes is their time dependence, namely, the fact that certain events do or do not take place
(depending on chance) at regular intervals of time or throughout continuous intervals of time.

In this section we shall be concerned with processes taking place over continuous intervals of
time or space, such as the occurrence of imperfections on a continuously produced bolt of cloth,
the recording of radiation by means of a Geiger counter, the arrival of telephone calls at a
switchboard, or the passing of cars over an electronic counting device. We will now show that
the mathematical model which we can use to describe many situations like these is that of the
Poisson distribution. To find the probability of x successes during a time interval of length T, we
divide the interval into n equal parts of length Δt, so that T = nΔt, and we assume that:

a) The probability of a success during a very small interval of time Δt is given by αΔt.

b) The probability of more than one success during such a small time interval is
negligible.
c) The probability of a success during such a time interval does not depend on what
happened prior to that time.

This means that the assumptions underlying the binomial distribution are satisfied, and the
probability of x successes during the time interval T is given by the corresponding Poisson
probability with the parameter

λ = αT.

Since λ is the mean of this Poisson distribution, note that α is the average (mean) number of
successes per unit time.


Example: If a bank receives on the average 6 bad checks per day, what are the
probabilities that it will receive

a) 4 bad checks on any given day?


b) 10 bad checks over any 2 consecutive days?

Solution: a) Here λ = 6, so f(4; 6) = e^(−6) 6^4/4! ≈ 0.134.

b) Here λ = αT = 6(2) = 12, so f(10; 12) = F(10; 12) − F(9; 12) ≈ 0.347 − 0.242 = 0.105, where the values of F(10; 12) and F(9; 12) are obtained from the Poisson table.
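Both parts can be checked by evaluating equation (13) directly; a sketch assuming scipy is available:

```python
from scipy.stats import poisson

# a) lambda = 6 bad checks per day
print(f"P(X = 4  | lam = 6)  = {poisson.pmf(4, 6):.4f}")    # about 0.134

# b) two consecutive days: lambda = alpha * T = 6 * 2 = 12
print(f"P(X = 10 | lam = 12) = {poisson.pmf(10, 12):.4f}")  # about 0.105
```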

1.2.6 Geometric and Negative Binomial Distribution

Two other families of discrete distributions that play important roles in statistics are the
geometric and negative binomial (or Pascal) distributions. The reason that we consider the two
together is twofold: first, the geometric distribution is a special case of the negative binomial
distribution, and second, the sum of independent and identically distributed geometric random
variables is negative binomial distributed.

Geometric Distribution
Suppose that in a sequence of Bernoulli trials we are interested in the number of failures preceding the
first success. The three assumptions for Bernoulli trials are satisfied but the extra
assumption underlying the binomial distribution is not; in other words, n is not fixed.

Clearly, if the first success is to come on trial x + 1, it has to be preceded by x failures, and
if the probability of a success is p, the probability of x failures in x trials is (1 − p)^x = q^x. Then,
if we multiply this expression by the probability p of a success on trial x + 1, we find that the
probability that exactly x failures precede the first success is given by

f(x; p) = pq^x for x = 0, 1, 2, …       (14)

Theorem 7: If the random variable X has a geometric distribution, then E[X] = q/p and Var[X] = q/p^2.


Proof:

The geometric distribution is well named since the values that the geometric density assumes are
the terms of a geometric series. Also the mode of the geometric density is necessarily 0. A
geometric density possesses one other interesting property, which is given in the following
theorem.

Theorem 8: If the random variable X has a geometric distribution with parameter p, then P(X ≥ i + j | X ≥ i) = P(X ≥ j) for nonnegative integers i and j.

Proof:

Example: Consider a sequence of independent, repeated Bernoulli trials with p equal to the
probability of success on an individual trial. Let the random variable X represent the number of


trials required before the first success; then X has the geometric density. To see this, note that the
first success will occur on trial x + 1 if this (x + 1)st trial results in a success and the first x trials
resulted in failures; but, by independence, x successive failures followed by a success has
probability q^x p. In the language of this example, the above theorem states that the probability
that at least i + j trials are required before the first success, given that there have been i successive
failures, is equal to the unconditional probability that at least j trials are needed before the first
success. That is, the fact that one has already observed i successive failures does not change the
distribution of the number of additional trials required to obtain the first success.
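Theorem 8 can be verified numerically from the density f(x) = pq^x. In the sketch below, p, i, and j are arbitrary illustrative values, not values taken from the notes.

```python
# Memorylessness of the geometric distribution f(x) = p * q**x, x = 0, 1, 2, ...
p, q = 0.3, 0.7
i, j = 2, 3                      # arbitrary illustrative values

def tail(k):
    """P(X >= k) = q**k for the geometric distribution starting at 0."""
    return q ** k

conditional = tail(i + j) / tail(i)     # P(X >= i+j | X >= i)
print(f"P(X >= i+j | X >= i) = {conditional:.6f}")
print(f"P(X >= j)            = {tail(j):.6f}")   # the two agree: both equal q**j
```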

A random variable X that has a geometric distribution is often referred to as a discrete waiting-
time random variable. It represents how long (in terms of the number of failures) one has to wait
for a success.

Before leaving the geometric distribution, we note that some authors define the geometric
distribution by assuming 1 (instead of 0) is the smallest mass point, letting X count the trial on which the first success occurs. The density then has the form

f(x; p) = pq^(x−1) for x = 1, 2, 3, …; the mean is 1/p, the variance is q/p^2, and the moment generating function is pe^t/(1 − qe^t) for t < −ln q.

Negative Binomial Distribution


A random variable X with density

f(x; r, p) = C(r + x − 1, x) p^r q^x for x = 0, 1, 2, …       (15)

where the parameters r = 1, 2, … and 0 < p ≤ 1, is defined to have a negative binomial density.

Theorem 9: If the random variable X has a negative binomial distribution, then E[X] = rq/p and Var[X] = rq/p^2.

Proof:

The negative binomial distribution, like the Poisson, has the non-negative integers for its mass points;
hence, the negative binomial distribution is potentially a model for a random experiment where a


count of some sort is of interest. Indeed, the negative binomial distribution has been applied to
population counts, to health and accident statistics, to communications, and to other counts as well.

Example: Consider a sequence of independent, repeated Bernoulli trials with p equal to the
probability of success on an individual trial. Let the random variable X represent the number of
failures prior to the rth success; then X has the negative binomial density, as the following
argument shows: The last trial must result in a success, having probability p; among the first
x + r − 1 trials there must be r − 1 successes and x failures, and the probability of this is

C(x + r − 1, r − 1) p^(r−1) q^x,

which, when multiplied by p, gives the desired result.

A random variable X having a negative binomial distribution is often referred to as a discrete


waiting-time random variable. It represents how long (in terms of the number of failures) one
waits for the rth success.

Example: The negative binomial distribution is of importance in the consideration of inverse

binomial sampling. Suppose a proportion p of individuals in a population possesses a certain
characteristic (for example, blood type O). If individuals in the population are sampled until exactly r
individuals with the characteristic are found, then the number of individuals sampled in excess of r
has a negative binomial distribution.
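Inverse binomial sampling is easy to simulate. The sketch below, assuming numpy and scipy are available, uses an illustrative proportion p and target r (not values from the notes) and compares the simulated frequencies with the negative binomial density of equation (15).

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(0)
p, r = 0.4, 3                 # illustrative: proportion with the characteristic, number required
n_runs = 50_000

def failures_before_rth_success():
    """Sample individuals one at a time until r have the characteristic; return the excess sampled."""
    successes = failures = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

counts = np.array([failures_before_rth_success() for _ in range(n_runs)])
for x in range(5):
    print(f"x = {x}: simulated {np.mean(counts == x):.4f}, "
          f"negative binomial {nbinom.pmf(x, r, p):.4f}")
```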

1.2.7 The Multinomial Distribution

An immediate generalization of the binomial distribution arises when each trial can have more
than two possible outcomes. This happens, for example, when a manufactured product is
classified as superior, average, or poor, when a student’s performance is graded as A, B,C, D and
F, or when an experiment is judged successful, unsuccessful or inconclusive. To treat this kind of
problem in general, let us consider the case where there are n independent trials, with each trial
permitting k mutually exclusive outcomes whose respective probabilities are
p1, p2, …, pk (with p1 + p2 + ⋯ + pk = 1).


Referring to the outcomes as being of the first kind, the second kind, …, and the k-th kind, we
shall be interested in the probability of getting x1 outcomes of the first kind, x2
outcomes of the second kind, …, and xk outcomes of the k-th kind, with x1 + x2 + ⋯ + xk = n.

Using arguments similar to those which we employed in deriving the equation for the binomial
distribution, it can be shown that the desired probability is given by

f(x1, x2, …, xk) = [n!/(x1! x2! ⋯ xk!)] p1^(x1) p2^(x2) ⋯ pk^(xk),       (16)

for xi = 0, 1, …, n, subject to the restriction x1 + x2 + ⋯ + xk = n. The joint probability distribution whose values
are given by these probabilities is called the multinomial distribution; it owes its name to the fact
that for the various values of the xi the probabilities are given by the corresponding terms of the
multinomial expansion of (p1 + p2 + ⋯ + pk)^n.

Example: The probabilities that the light bulb of a certain kind of slide projector will last fewer
than 40 hours of continuous use, anywhere from 40 to 80 hours of continuous use, or more than
80 hours of continuous use are 0.3, 0.5, and 0.2, respectively. Find the probability that among eight such bulbs
2 will last fewer than 40 hours, 5 will last anywhere from 40 to 80 hours, and 1 will last more
than 80 hours.

Solution: f(2, 5, 1) = [8!/(2! 5! 1!)] (0.3)^2 (0.5)^5 (0.2)^1 = 168 × 0.09 × 0.03125 × 0.2 ≈ 0.0945.
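The same value can be obtained from scipy's multinomial distribution; a minimal sketch:

```python
from scipy.stats import multinomial

n = 8
p = [0.3, 0.5, 0.2]                        # < 40 h, 40-80 h, > 80 h
prob = multinomial.pmf([2, 5, 1], n=n, p=p)
print(f"P(2, 5, 1) = {prob:.4f}")          # about 0.0945
```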

1.2.8 Other Discrete Distributions

Censored random variable

Beta-binomial distribution


Logarithmic distribution

1.3 Continuous Distributions

Continuous sample spaces and continuous random variables arise when we deal with quantities
that are measured on a continuous scale; for instance, when we measure the speed of a car, the
amount of alcohol in a person’s blood, the efficiency of a solar collector, or the tensile strength of
a new alloy.

The problem of defining probabilities in connection with continuous sample spaces and
continuous random variables involves some complications. To illustrate the nature of these
complications, let us consider the following situation: Suppose we want to know the probability
that if an accident occurs on a freeway whose length is 200 miles, it will happen at some given
location or, perhaps, some particular stretch of the road. The outcomes of this experiment can be
looked upon as a continuum of points, namely, those on the continuous interval from 0 to 200.
Suppose the probability that the accident occurs on any interval of length L is L/200, with L
measured in miles. In this section several parametric families of univariate probability functions
are presented.

1.3.1 Uniform (Rectangular) Distribution

If the probability density function of a random variable X is given by

f(x; a, b) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise,       (17)

where the parameters a and b satisfy −∞ < a < b < ∞, then the random variable X is defined
to be uniformly distributed over the interval [a, b], and the distribution given by (17) is called the
uniform distribution.

Theorem 10: If the random variable X is uniformly distributed over [a, b], then E[X] = (a + b)/2 and Var[X] = (b − a)^2/12.
Proof:


The cumulative distribution function of a uniform random variable is given by F(x) = 0 for x < a, F(x) = (x − a)/(b − a) for a ≤ x ≤ b, and F(x) = 1 for x > b.

To illustrate how a physical situation might give rise to a uniform distribution, suppose that a
wheel of a locomotive has the radius r and that x is the location of a point on its circumference
measured along the circumference from some reference point 0. When the brakes are applied,
some points will make sliding contact with the rail, and heavy wear will occur at that point. For
repeated application of the brakes, it would seem reasonable to assume that x is a value of a
random variable having the uniform distribution with a=0 and . If this assumption were
incorrect, that is, if some set of points on the wheel made contact more often than others, the
wheel would eventually exhibit “flat spots” or wear out of round.

1.3.2 Normal Distribution

Among the special probability densities we shall study in this chapter, the normal distribution is
by far the most important. It was studied first in the 18th century when scientists observed an
astonishing degree of regularity in errors of measurement. They found that the patterns
(distributions) they observed were closely approximated by a continuous distribution which they
referred to as the “normal curve of errors” and attributed to the laws of chance.

The equation of the normal probability density, whose parameters are μ and σ (σ > 0), is

f(x; μ, σ) = [1/(σ√(2π))] e^(−(x−μ)²/(2σ²)) for −∞ < x < ∞,       (18)

whose graph (shaped like the cross section of a bell) is symmetric about x = μ.


Since the normal probability density function cannot be integrated in closed form between every pair of limits a and b,
probabilities relating to normal distributions are usually obtained from a special table. This table
pertains to the standard normal distribution, namely, the normal distribution with
μ = 0 and σ = 1, and its entries are the values of

F(z) = P(Z ≤ z) = ∫ from −∞ to z of (1/√(2π)) e^(−t²/2) dt.

The cumulative probability F(z) corresponds to the area under the standard normal density to
the left of z.

To find the probability that a random variable having the standard normal distribution will take
on a value between a and b, we use the equation

P(a < Z < b) = F(b) − F(a).


We also sometimes make use of the identity F(−z) = 1 − F(z), which holds for all distributions that are
symmetric about 0, as the reader will be asked to verify.

Also, if we want to find the probability that a random variable having the normal distribution
with mean μ and variance σ² will take on a value between a and b, we have only to calculate
the probability that a random variable having the standard normal distribution will take on a
value between (a − μ)/σ and (b − μ)/σ. That is, to find probabilities concerning X, we convert its
values to z-scores using

z = (x − μ)/σ.
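In place of the standard normal table, the z-score conversion can be done in software. A sketch assuming scipy is available; the mean, standard deviation, and interval below are illustrative values only.

```python
from scipy.stats import norm

mu, sigma = 100.0, 15.0          # illustrative parameters
a, b = 90.0, 120.0               # illustrative interval

# P(a < X < b) = F((b - mu)/sigma) - F((a - mu)/sigma)
z_a, z_b = (a - mu) / sigma, (b - mu) / sigma
prob = norm.cdf(z_b) - norm.cdf(z_a)
print(f"P({a} < X < {b}) = {prob:.4f}")

# equivalently, scipy can take mu and sigma directly
print(norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma))
```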

If the normal random variable has mean 0 and variance 1, it is called a standard or normalized
normal random variable, with density

φ(z) = (1/√(2π)) e^(−z²/2) for −∞ < z < ∞.       (19)

Since φ(z) is given to be a density function, it is implied that ∫ from −∞ to ∞ of φ(z) dz = 1, but we should

satisfy ourselves that this is true. The verification is somewhat troublesome because the
indefinite integral of this particular density function does not have a simple functional
expression. Suppose that we represent the area under the curve by A; then

A = ∫ from −∞ to ∞ of (1/√(2π)) e^(−z²/2) dz.

We wish to show that A = 1, and this is most easily done by showing that A² is 1 and then
reasoning that A = 1 since A is positive. We may put

A² = [∫ (1/√(2π)) e^(−x²/2) dx] [∫ (1/√(2π)) e^(−y²/2) dy] = (1/2π) ∫∫ e^(−(x²+y²)/2) dx dy.


In this integral we change the variables to polar coordinates by the substitutions
x = r cos θ and y = r sin θ, and the integral becomes

A² = (1/2π) ∫ from 0 to 2π ∫ from 0 to ∞ of e^(−r²/2) r dr dθ = (1/2π) ∫ from 0 to 2π of 1 dθ = 1.

Theorem 11: If the random variable X has a normal distribution with parameters μ and σ, then its moment generating function is M_X(t) = e^(μt + σ²t²/2), and E[X] = μ, Var[X] = σ².

Proof:

M_X(t) = ∫ from −∞ to ∞ of e^(tx) [1/(σ√(2π))] e^(−(x−μ)²/(2σ²)) dx = e^(μt) ∫ [1/(σ√(2π))] e^(−[(x−μ)² − 2σ²t(x−μ)]/(2σ²)) dx.

If we complete the square inside the bracket, it becomes (x − μ − σ²t)² − σ⁴t², and we have

M_X(t) = e^(μt + σ²t²/2) ∫ from −∞ to ∞ of [1/(σ√(2π))] e^(−(x−μ−σ²t)²/(2σ²)) dx.

The integral together with the factor 1/(σ√(2π)) is necessarily 1, since it is the area under a normal

distribution with mean μ + σ²t and variance σ².

Hence M_X(t) = e^(μt + σ²t²/2).

On differentiating twice and substituting t = 0, we find M′_X(0) = μ and M″_X(0) = σ² + μ², so that E[X] = μ and Var[X] = σ²,

thus justifying our use of the symbols μ and σ² for the parameters.


Example: Suppose that an instructor assumes that a student’s final score is the value of a
normally distributed random variable. If the instructor decides to award a grade of A to those
students whose score exceeds μ + σ, a B to those students whose score falls between μ and
μ + σ, a C if a score falls between μ − σ and μ, a D if a score falls between μ − 2σ and μ − σ,
and an F if the score falls below μ − 2σ, then the proportions of each grade given can be
calculated. For example, since

P(X > μ + σ) = P(Z > 1) = 1 − F(1) = 0.1587,

one would expect 15.87 percent of the students to receive A’s.
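The 15.87 percent figure, and the remaining grade proportions under the σ-based cutoffs assumed above, can be checked with the standard normal distribution (a sketch assuming scipy):

```python
from scipy.stats import norm

# proportions of grades under cutoffs at mu + sigma, mu, mu - sigma, mu - 2*sigma
# (the cutoffs are assumed; only the 15.87% figure for A's is stated in the text)
grade_A = norm.sf(1)                      # P(Z > 1) = 0.1587
grade_B = norm.cdf(1) - norm.cdf(0)
grade_C = norm.cdf(0) - norm.cdf(-1)
grade_D = norm.cdf(-1) - norm.cdf(-2)
grade_F = norm.cdf(-2)
for name, value in [("A", grade_A), ("B", grade_B), ("C", grade_C), ("D", grade_D), ("F", grade_F)]:
    print(f"{name}: {value:.4f}")
```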

1.3.3 Exponential and Gamma Distributions

Two other families of distributions that play important roles in statistics are the (negative)
exponential and gamma distributions, which are defined in this subsection. The reason that the
two are considered together is twofold; first, the exponential is a special case of the gamma, and,
second, the sum of independent and identically distributed exponential random variables is gamma
distributed.

If a random variable X has a density given by

f(x; λ) = λ e^(−λx) for x > 0, and 0 otherwise,       (20)

where λ > 0, then X is defined to have an exponential distribution. If a random variable X has

density given by

f(x; r, λ) = [λ^r / Γ(r)] x^(r−1) e^(−λx) for x > 0, and 0 otherwise,       (21)

where r > 0 and λ > 0, then X is defined to have a gamma distribution. Γ(r) is a value of the
gamma function, defined by Γ(r) = ∫ from 0 to ∞ of y^(r−1) e^(−y) dy. Integration by parts shows that

Γ(r) = (r − 1)Γ(r − 1) for any r > 1 and hence that Γ(r) = (r − 1)! when r is a positive integer.


Graphs of several gamma distributions are shown in the figure below, and they exhibit the fact that
these distributions are positively skewed. In fact, the skewness decreases as r increases for any
fixed value of λ.

[Figure: graphs of some gamma probability densities with λ = 1, for r = 1, 2, and 4.]

Theorem 12: If the random variable X has a gamma distribution with parameters r and λ, then its moment generating function is

M_X(t) = [λ/(λ − t)]^r for t < λ,

and E[X] = r/λ, Var[X] = r/λ².

Proof:

M_X(t) = ∫ from 0 to ∞ of e^(tx) [λ^r/Γ(r)] x^(r−1) e^(−λx) dx = [λ^r/Γ(r)] ∫ from 0 to ∞ of x^(r−1) e^(−(λ−t)x) dx = [λ/(λ − t)]^r for t < λ.

Hence M′_X(0) = r/λ = E[X] and M″_X(0) = r(r + 1)/λ², so Var[X] = r(r + 1)/λ² − (r/λ)² = r/λ².

The exponential distribution is the special case of the gamma distribution with r = 1. Thus, the
expected value and variance of the exponential distribution are 1/λ and 1/λ², respectively.

The exponential distribution has been used as a model for lifetimes of various things. When we
introduced the Poisson distribution, we spoke of certain happenings, for example, particle
emissions, occurring in time. The length of the time interval between successive happenings can
be shown to have an exponential distribution, provided that the number of happenings in a fixed
time interval has a Poisson distribution. Also, if we assume again that the number of happenings
in a fixed time interval is Poisson distributed, the length of time between time 0 and the instant
when the rth happening occurs can be shown to have a gamma distribution. So a gamma random


variable can be thought of as a continuous waiting-time random variable. It is the time one has to
wait for the rth happening. Recall that the geometric and negative binomial random variables
were discrete waiting-time random variables. In a sense, they are discrete analogs of the negative
exponential and gamma distributions, respectively.
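The waiting-time interpretation can be illustrated by simulation: summing r independent exponential inter-arrival times gives a gamma-distributed waiting time. A sketch assuming numpy, with λ and r chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, r, n_runs = 2.0, 3, 100_000          # illustrative rate and happening index

# exponential inter-arrival times with rate lam; the r-th arrival time is their sum
inter_arrivals = rng.exponential(scale=1 / lam, size=(n_runs, r))
rth_arrival = inter_arrivals.sum(axis=1)

# compare with the gamma moments E[X] = r/lam and Var[X] = r/lam**2
print(f"simulated mean {rth_arrival.mean():.3f}, theoretical {r / lam:.3f}")
print(f"simulated var  {rth_arrival.var():.3f}, theoretical {r / lam**2:.3f}")
```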

If the random variable X has a gamma distribution with parameters r and λ, where r is a positive
integer, then the cumulative distribution function can be written

F(x) = 1 − Σ from k=0 to r−1 of e^(−λx) (λx)^k / k! for x > 0.       (22)

For λ = 1, the function F(x) given in (22) is called the incomplete gamma function and has been extensively
tabulated.

Theorem 13: If the random variable X has an exponential distribution with parameter λ, then P(X > a + b | X > a) = P(X > b) for a, b > 0.

Proof:

Let X represent the lifetime of a given component; then, in words, Theorem 13 states that the
conditional probability that the component will last a+b time units given that it has lasted a time
units is the same as its initial probability of lasting b time units. Another way of saying this is to
say that an “old” functioning component has the same lifetime distribution as a “new”
functioning component or that the component is not subject to fatigue or to wear.

1.3.4 Beta Distribution

A family of probability densities of continuous random variables taking on values in the interval
(0, 1) is the family of beta distributions.

If a random variable X has a density given by

f(x; a, b) = [1/B(a, b)] x^(a−1) (1 − x)^(b−1) for 0 < x < 1, and 0 otherwise,       (23)

where a > 0 and b > 0, then X is defined to have a beta distribution. The function

B(a, b) = ∫ from 0 to 1 of x^(a−1) (1 − x)^(b−1) dx = Γ(a)Γ(b)/Γ(a + b)

is called the beta function.


Remark: The beta distribution reduces to the uniform distribution over (0, 1) if a=b=1.

The cumulative distribution function of a beta-distributed random variable is

F(x) = ∫ from 0 to x of [1/B(a, b)] t^(a−1) (1 − t)^(b−1) dt for 0 < x < 1       (24)

(with F(x) = 0 for x ≤ 0 and F(x) = 1 for x ≥ 1). It is often called the incomplete beta function and has been tabulated.

The moment generating function for the beta distribution does not have a simple form; however
the moments are readily found by using their definition.

Theorem 14: If the random variable X has a beta distribution with parameters a and b, then E[X] = a/(a + b) and Var[X] = ab/[(a + b)²(a + b + 1)].

Proof: E[X^k] = ∫ from 0 to 1 of x^k [1/B(a, b)] x^(a−1)(1 − x)^(b−1) dx = B(a + k, b)/B(a, b).

Hence E[X] = B(a + 1, b)/B(a, b) = a/(a + b), and E[X²] = B(a + 2, b)/B(a, b) = a(a + 1)/[(a + b)(a + b + 1)], so that Var[X] = a(a + 1)/[(a + b)(a + b + 1)] − a²/(a + b)² = ab/[(a + b)²(a + b + 1)].

The family of beta densities is a two-parameter family of densities that is positive on the interval
(0, 1) and can assume quite a variety of different shapes; consequently, the beta distribution
can be used to model an experiment for which one of these shapes is appropriate.
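The moments in Theorem 14 are easy to confirm numerically; a sketch assuming scipy, with a and b chosen only for illustration:

```python
from scipy.stats import beta

a, b = 2.0, 5.0                          # illustrative shape parameters
mean_formula = a / (a + b)
var_formula = a * b / ((a + b) ** 2 * (a + b + 1))
print(f"formula : mean {mean_formula:.4f}, var {var_formula:.4f}")
print(f"scipy   : mean {beta.mean(a, b):.4f}, var {beta.var(a, b):.4f}")
```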

1.3.5 The Weibull Distribution

Closely related to the exponential distribution is the Weibull distribution, whose probability
density is given by

f(x) = αβ x^(β−1) e^(−αx^β) for x > 0, and 0 otherwise,       (25)

where α > 0 and β > 0. To demonstrate this relationship we evaluate the probability that a random variable having the
Weibull distribution will take on a value less than a, namely, the integral

∫ from 0 to a of αβ x^(β−1) e^(−αx^β) dx.


Making the change of variable y = x^β, we get

∫ from 0 to a^β of α e^(−αy) dy = 1 − e^(−αa^β),       (26)

and it can be seen that y is a value of a random variable having an exponential distribution with parameter α. Graphs of Weibull densities for various values of α and β illustrate the flexibility of this family of distributions.

The mean of the Weibull distribution having the parameters α and β may be obtained by
evaluating the integral

μ = ∫ from 0 to ∞ of x · αβ x^(β−1) e^(−αx^β) dx.

Making the change of variable u = αx^β, we get

μ = α^(−1/β) Γ(1 + 1/β).

Using a similar method to determine the variance of this distribution, one obtains

σ² = α^(−2/β) { Γ(1 + 2/β) − [Γ(1 + 1/β)]² }.


CHAPTER TWO
Common Multivariate Distributions

The purpose of this chapter is to introduce the concepts of k-dimensional distribution functions,
conditional distributions, joint and conditional expectation, and independence of random
variables.

2.1 Joint Distribution Functions

In the study of many random experiments, there are, or can be, more than one random variable of
interest; hence we are compelled to extend our definitions of the distribution and density function
of one random variable to those of several random variables. As in the univariate case we will
first define, the cumulative distribution function. Although it is not as convenient to work with as
density functions, it does exist for any set of k random variables. Density functions for jointly
discrete and jointly continuous random variables will be given in subsequent sections.

2.1.1 Cumulative Distribution Function

Let X1, X2, …, Xk be k random variables, all defined on the same probability space. The joint
cumulative distribution function of X1, X2, …, Xk, denoted by F_{X1,…,Xk}(x1, …, xk), is defined as

F_{X1,…,Xk}(x1, …, xk) = P(X1 ≤ x1, X2 ≤ x2, …, Xk ≤ xk).

Thus a joint cumulative distribution function is a
function with domain Euclidean k-space and counterdomain the interval [0, 1]. If k = 2, the joint
cumulative distribution function is a function of two variables, and so its domain is just the xy
plane.

Example 1: Consider the experiment of tossing two tetrahedra (regular four-sided polyhedra),
each with sides labeled 1 to 4. Let X denote the number on the downturned face of the first
tetrahedron and Y the larger of the two downturned numbers. The goal is to find F_{X,Y}(x, y), the joint
cumulative distribution function of X and Y. Observe first that the random variables X and Y
jointly take on only the values (x, y) with x = 1, 2, 3, 4 and y = x, x + 1, …, 4.


[Figure 2.1: the sample space for the experiment of tossing two tetrahedra; the 16 points (x1, x2), x1 = 1, …, 4 and x2 = 1, …, 4, are assumed to be equally likely.]

The sample space for this experiment is displayed in Figure 2.1. The 16 sample points are
assumed to be equally likely. Our objective is to find F_{X,Y}(x, y) for each point (x, y). As an
example, let (x, y) = (2, 3), and find F_{X,Y}(2, 3). Now the event {X ≤ 2, Y ≤ 3}
corresponds to the encircled sample points in Figure 2.1; hence F_{X,Y}(2, 3) = P(X ≤ 2, Y ≤ 3) = 6/16.
Similarly, F_{X,Y}(x, y) can be found for other values of x and y, as in the following table.

F_{X,Y}(x, y)    x < 1    1 ≤ x < 2    2 ≤ x < 3    3 ≤ x < 4    x ≥ 4
y ≥ 4              0        4/16         8/16         12/16        1
3 ≤ y < 4          0        3/16         6/16         9/16         9/16
2 ≤ y < 3          0        2/16         4/16         4/16         4/16
1 ≤ y < 2          0        1/16         1/16         1/16         1/16
y < 1              0        0            0            0            0
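The same joint cumulative distribution function can be produced by enumerating the 16 equally likely sample points; a short Python sketch:

```python
from fractions import Fraction
from itertools import product

# sample space: (x1, x2), the downturned faces of the two tetrahedra
points = list(product(range(1, 5), range(1, 5)))      # 16 equally likely points
prob = Fraction(1, 16)

def F(x, y):
    """Joint cdf of X = x1 and Y = max(x1, x2)."""
    return sum(prob for (x1, x2) in points if x1 <= x and max(x1, x2) <= y)

print(F(2, 3))          # 3/8, i.e. 6/16, matching the table entry
print(F(4, 4))          # 1
```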

Properties of a bivariate cumulative distribution function

i) F(−∞, y) = lim as x → −∞ of F(x, y) = 0 for all y, F(x, −∞) = lim as y → −∞ of F(x, y) = 0 for all x, and F(+∞, +∞) = lim as x, y → ∞ of F(x, y) = 1.

ii) If x1 < x2 and y1 < y2, then P(x1 < X ≤ x2, y1 < Y ≤ y2) = F(x2, y2) − F(x2, y1) − F(x1, y2) + F(x1, y1) ≥ 0.

iii) F(x, y) is right continuous in each argument; that is, lim as h → 0+ of F(x + h, y) = lim as h → 0+ of F(x, y + h) = F(x, y).

Any function satisfying properties i) to iii) is defined to be a bivariate cumulative distribution

function without reference to any random variables. If F_{X,Y}(x, y) is the joint cumulative
distribution function of X and Y, then the cumulative distribution functions F_X(x) and F_Y(y) are
called marginal cumulative distribution functions.

Remark: F_X(x) = lim as y → ∞ of F_{X,Y}(x, y) and F_Y(y) = lim as x → ∞ of F_{X,Y}(x, y); that is, knowledge of the joint

cumulative distribution function of X and Y implies knowledge of the two marginal cumulative
distribution functions. The converse of this remark is not generally true.

2.1.2 Joint Density Functions For Discrete Random Variables

If X1, X2, …, Xk are random variables defined on the same probability space, then (X1, X2, …, Xk)

is called a k-dimensional random variable.

The k-dimensional random variable (X1, X2, …, Xk) is defined to be a k-dimensional discrete

random variable if it can assume values only at a countable number of points in k-
dimensional real space. We also say that the random variables X1, …, Xk are jointly discrete
random variables. If (X1, …, Xk) is a k-dimensional discrete random variable, then the joint
discrete density function of X1, …, Xk, denoted by f_{X1,…,Xk}(x1, …, xk), is defined to be

f_{X1,…,Xk}(x1, …, xk) = P(X1 = x1, …, Xk = xk) for (x1, …, xk), a value of (X1, …, Xk), and

is defined to be 0 otherwise.

Remark: Σ over all values (x1, …, xk) of f_{X1,…,Xk}(x1, …, xk) = 1.

Example 2: The joint discrete density function of X and Y in Example 1 above is given as

(x, y)          (1,1)  (1,2)  (1,3)  (1,4)  (2,2)  (2,3)  (2,4)  (3,3)  (3,4)  (4,4)
f_{X,Y}(x, y)   1/16   1/16   1/16   1/16   2/16   1/16   1/16   3/16   1/16   4/16
Or in another tabular form as

y \ x     1      2      3      4
  4      1/16   1/16   1/16   4/16
  3      1/16   1/16   3/16   0
  2      1/16   2/16   0      0
  1      1/16   0      0      0


Theorem 1: If X and Y are jointly discrete random variables, then knowledge of F_{X,Y}(x, y) is

equivalent to knowledge of f_{X,Y}(x, y). Also, the statement extends to k-dimensional discrete random
variables.

Proof: Let (x1, y1), (x2, y2), … be the possible values of (X, Y). If f_{X,Y}(x, y) is given, then

F_{X,Y}(x, y) = Σ f_{X,Y}(xi, yi), where the summation is over all i for which xi ≤ x and yi ≤ y.
Conversely, if F_{X,Y}(x, y) is given, then for (xj, yj), a possible value of (X, Y),

f_{X,Y}(xj, yj) = F_{X,Y}(xj, yj) − lim h→0+ F_{X,Y}(xj − h, yj) − lim h→0+ F_{X,Y}(xj, yj − h) + lim h→0+ F_{X,Y}(xj − h, yj − h). ///

If X and Y are jointly discrete random variables, then f_X(x) and f_Y(y) are called marginal discrete
density functions. More generally, let X_{i1}, …, X_{im} be any subset of the jointly discrete random
variables X1, …, Xk; then f_{X_{i1},…,X_{im}}(x_{i1}, …, x_{im}) is also called a marginal density.

Example 3: From Example 2, if we note that X has values 1, 2, 3, and 4, Y has values 1, 2, 3, and 4,
and Y is greater than or equal to X, the values of (x, y) are (1,1), (1,2), (1,3), (1,4), (2,2), (2,3), (2,4), (3,3), (3,4), and (4,4). For instance,

f_X(1) = Σ over y of f_{X,Y}(1, y) = 1/16 + 1/16 + 1/16 + 1/16 = 4/16.

Similarly, all other values of f_X(x) could be obtained; f_X(x) = 4/16 for x = 1, 2, 3, 4. Also

f_Y(4) = Σ over x of f_{X,Y}(x, 4) = 1/16 + 1/16 + 1/16 + 4/16 = 7/16.

Similarly f_Y(3) = 5/16 and f_Y(2) = 3/16, which together with f_Y(1) = 1/16 give

the marginal discrete density function of Y. In another tabular form,

y \ x      1      2      3      4      f_Y(y)
  4       1/16   1/16   1/16   4/16   7/16
  3       1/16   1/16   3/16   0      5/16
  2       1/16   2/16   0      0      3/16
  1       1/16   0      0      0      1/16
f_X(x)    4/16   4/16   4/16   4/16   1


The multinomial distribution is a (k + 1)-parameter family of distributions, the parameters being n
and p1, p2, …, pk. The remaining cell probability p_{k+1}, like q in the binomial distribution, is exactly determined by
p_{k+1} = 1 − p1 − p2 − ⋯ − pk. A particular case of a multinomial distribution is obtained by putting, for example, n = 3,
k = 2, p1 = 0.2, and p2 = 0.3, to get

f(x1, x2) = [3!/(x1! x2! (3 − x1 − x2)!)] (0.2)^(x1) (0.3)^(x2) (0.5)^(3−x1−x2).

The density of the multinomial distribution is given by

f(x1, …, xk) = [n!/(x1! ⋯ xk! x_{k+1}!)] p1^(x1) ⋯ pk^(xk) p_{k+1}^(x_{k+1}), xi = 0, 1, …, n,       (1)

where x_{k+1} = n − x1 − ⋯ − xk and p_{k+1} = 1 − p1 − ⋯ − pk.

We might observe that if X1, …, Xk have the multinomial distribution given in (1), then the marginal
distribution of Xi is a binomial distribution with parameters n and pi. This observation can be
verified by recalling the experiment of repeated, independent trials.

2.1.3 Joint Density Functions For Continuous Random Variables

The k-dimensional random variable (X1, …, Xk) is defined to be a k-dimensional continuous

random variable if and only if there exists a nonnegative function f_{X1,…,Xk}(x1, …, xk) such that

F_{X1,…,Xk}(x1, …, xk) = ∫ from −∞ to x1 ⋯ ∫ from −∞ to xk of f_{X1,…,Xk}(u1, …, uk) duk ⋯ du1       (2)

for all (x1, …, xk).

As in the one-dimensional case, a joint probability density function has two properties:

(i) f_{X1,…,Xk}(x1, …, xk) ≥ 0;

(ii) ∫ ⋯ ∫ f_{X1,…,Xk}(x1, …, xk) dx1 ⋯ dxk = 1.

A unidimensional probability density function was used to find probabilities. For example, for X

a continuous random variable with probability density f_X(x), P(a < X < b) = ∫ from a to b of f_X(x) dx; that

is, the area under f_X(x) over the interval (a, b) gave P(a < X < b); and more generally,

P(X ∈ B) = ∫ over B of f_X(x) dx; that is, the area under f_X(x) over the set B gave P(X ∈ B).


In the two-dimensional case, volume gives probabilities. For instance, let (X1, X2) be jointly continuous
random variables with joint probability density function f_{X1,X2}(x1, x2), and let R be some region
in the x1x2 plane; then P[(X1, X2) falls in the region R] is given by the volume under f_{X1,X2}(x1, x2)
over the region R. In particular, if R = {(x1, x2): a1 < x1 ≤ b1, a2 < x2 ≤ b2}, then

P(a1 < X1 ≤ b1, a2 < X2 ≤ b2) = ∫ from a2 to b2 ∫ from a1 to b1 of f_{X1,X2}(x1, x2) dx1 dx2.

A joint probability density function is defined as any nonnegative integrand satisfying (2) and
hence is not uniquely defined.

Example 4: Consider the bivariate function

f_{X,Y}(x, y) = k(x + y) for 0 < x < 1 and 0 < y < 1 (a unit square), and 0 otherwise. Can the constant k be selected so

that f_{X,Y}(x, y) will be a joint probability density function? If k is positive, f_{X,Y}(x, y) ≥ 0, and

∫ from 0 to 1 ∫ from 0 to 1 of k(x + y) dx dy = k(1/2 + 1/2) = k = 1 for k = 1.

Probabilities of events defined in terms of the random variables can be obtained by integrating the
joint probability density function over the indicated region; for example,

P[(X, Y) ∈ R] = ∫∫ over R of (x + y) dx dy,

which is the volume under the surface z = x + y over the region R in the xy plane.

Theorem 2: If X and Y are jointly continuous random variables, then knowledge of F_{X,Y}(x, y) is

equivalent to knowledge of f_{X,Y}(x, y). Also, the statement extends to k-dimensional continuous
random variables.

Proof: For a given f_{X,Y}(x, y), F_{X,Y}(x, y) is obtained for any (x, y) by

F_{X,Y}(x, y) = ∫ from −∞ to y ∫ from −∞ to x of f_{X,Y}(u, v) du dv. For a given F_{X,Y}(x, y), f_{X,Y}(x, y) can be obtained by

f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y)/∂x ∂y at the points (x, y) where F_{X,Y}(x, y) is differentiable.


If X and Y are jointly continuous random variables, then f_X(x) and f_Y(y) are called marginal
probability density functions. More generally, let X_{i1}, …, X_{im} be any subset of the jointly
continuous random variables X1, …, Xk. Then f_{X_{i1},…,X_{im}}(x_{i1}, …, x_{im}) is called a marginal density of the
m-dimensional random variable (X_{i1}, …, X_{im}). If X1, …, Xk are jointly continuous random
variables, then any marginal probability density function can be found by integrating the joint density
over the remaining k − m variables, each from −∞ to ∞:

f_{X_{i1},…,X_{im}}(x_{i1}, …, x_{im}) = ∫ ⋯ ∫ f_{X1,…,Xk}(x1, …, xk), integrated over the other variables.       (3)

Example 5: Consider the joint probability density f_{X,Y}(x, y) = x + y for 0 < x < 1, 0 < y < 1, and 0 otherwise. Then

f_X(x) = ∫ from 0 to 1 of (x + y) dy = x + 1/2 for 0 < x < 1.

Or, by symmetry, f_Y(y) = y + 1/2 for 0 < y < 1.


2.2 Conditional Distribution and Stochastic Independence

In the preceding section we defined the joint distribution and joint density functions of several
random variables; in this section we define conditional distributions and the related concept of
stochastic independence. Most definitions will be given first for only two random variables and
later extended to k random variables.

2.2.1 Conditional Distribution Functions for Discrete Random Variables

Let X and Y be jointly discrete random variables with joint discrete density function f_{X,Y}(x, y).
The conditional discrete density function of Y given X = x, denoted by f_{Y|X}(y|x), is defined to be

f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x)       (4)

if f_X(x) > 0, where f_X(x) is the marginal density of X evaluated at x.

Similarly, f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y) if f_Y(y) > 0.

Since X and Y are discrete, they have mass points, say x1, x2, … for X and y1, y2, … for Y. If
f_X(x) > 0, then x = xi for some i, and f_X(x) = P(X = xi). The numerator of the right-hand side
of (4) is P(X = x, Y = y); so f_{Y|X}(y|x) = P(Y = y | X = x).

f_{Y|X}(y|x) is called a conditional discrete density function and hence should possess the
properties of a discrete density function. To see that it does, consider x as some fixed mass
point of X. Then f_{Y|X}(y|x) is a function with argument y, and to be a discrete density function it
must be nonnegative and, if summed over the possible values (mass points) of Y, must sum to 1. Clearly f_{Y|X}(y|x) ≥ 0, and

Σ over y of f_{Y|X}(y|x) = Σ over y of f_{X,Y}(x, y)/f_X(x) = f_X(x)/f_X(x) = 1, so f_{Y|X}(y|x) is indeed a density; it

tells us how the values of Y are distributed for a given value x of X.

The conditional cumulative distribution of Y given X = x can be defined for two jointly discrete
random variables by recalling the close relationship between discrete density functions and
cumulative distribution functions. If X and Y are jointly discrete random variables, the conditional
cumulative distribution of Y given X = x, denoted by F_{Y|X}(y|x), is defined to be

F_{Y|X}(y|x) = P(Y ≤ y | X = x) = Σ over yj ≤ y of f_{Y|X}(yj|x)

for f_X(x) > 0.

Example 6: Return to the experiment of tossing two tetrahedra. Let X denote the number on the
downturned face of the first and Y the larger of the downturned numbers. What is the density of
Y given that X = 2? From the joint density of Example 2 and the marginal f_X(2) = 4/16,

f_{Y|X}(2|2) = (2/16)/(4/16) = 1/2, f_{Y|X}(3|2) = (1/16)/(4/16) = 1/4, f_{Y|X}(4|2) = (1/16)/(4/16) = 1/4,

and f_{Y|X}(y|2) = 0 for other values of y.

Let (X1, …, Xk) be a k-dimensional discrete random variable, and let (X_{i1}, …, X_{ir}) and

(X_{j1}, …, X_{js}) be two disjoint subsets of the random variables X1, …, Xk. The conditional density of the r-
dimensional random variable (X_{i1}, …, X_{ir}), given that the value of (X_{j1}, …, X_{js}) is (x_{j1}, …, x_{js}), is
defined to be

f(x_{i1}, …, x_{ir} | x_{j1}, …, x_{js}) = f_{X_{i1},…,X_{ir},X_{j1},…,X_{js}}(x_{i1}, …, x_{ir}, x_{j1}, …, x_{js}) / f_{X_{j1},…,X_{js}}(x_{j1}, …, x_{js}).

Example 7: Let X1, X2, X3, X4 be jointly discrete random variables. Take (X1, X2) and (X3, X4) as the two subsets; then

f_{X1,X2|X3,X4}(x1, x2 | x3, x4) = f_{X1,X2,X3,X4}(x1, x2, x3, x4) / f_{X3,X4}(x3, x4).

Example 8: Suppose 12 cards are drawn without replacement from an ordinary deck of playing
cards. Let X1 be the number of aces drawn, X2 the number of 2s, X3 the number of 3s, and
X4 the number of 4s. The joint density of these four random variables is given by

f_{X1,X2,X3,X4}(x1, x2, x3, x4) = C(4, x1) C(4, x2) C(4, x3) C(4, x4) C(36, 12 − x1 − x2 − x3 − x4) / C(52, 12), where

xi = 0, 1, 2, 3, 4, subject to the restriction that x1 + x2 + x3 + x4 ≤ 12. There are a large number of conditional

densities associated with this density; an example is

f_{X1|X2,X3,X4}(x1 | x2, x3, x4) = C(4, x1) C(36, 12 − x1 − x2 − x3 − x4) / C(40, 12 − x2 − x3 − x4),

where xi = 0, 1, 2, 3, 4 and x1 + x2 + x3 + x4 ≤ 12.

2.2.2 Conditional Distribution Functions for Continuous Random Variables

Let X and Y be jointly continuous random variables with joint probability density function
f_{X,Y}(x, y). The conditional probability density function of Y given X = x, denoted by f_{Y|X}(y|x), is
defined to be

f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x)       (5)

if f_X(x) > 0, where f_X(x) is the marginal probability density of X, and is undefined at points

where f_X(x) = 0.

Similarly, f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y) if f_Y(y) > 0.

f_{Y|X}(y|x) is called a (conditional) probability density function and hence should possess the
properties of a probability density function. It is clearly nonnegative, and

∫ from −∞ to ∞ of f_{Y|X}(y|x) dy = [∫ from −∞ to ∞ of f_{X,Y}(x, y) dy] / f_X(x) = f_X(x)/f_X(x) = 1.

The density f_{Y|X}(y|x) is the density of the random variable Y given that x is the value of the
random variable X. In the conditional density f_{Y|X}(y|x), x is fixed and could be thought of as a
parameter. Consider f_{Y|X}(y|x0), that is, the density of Y given that X was observed to be x0. Now
f_{X,Y}(x, y) plots as a surface over the xy plane. A plane perpendicular to the xy plane which


intersects the xy plane on the line x = x0 will intersect the surface in the curve f_{X,Y}(x0, y). The area
under this curve is

∫ from −∞ to ∞ of f_{X,Y}(x0, y) dy = f_X(x0).

Hence if we divide f_{X,Y}(x0, y) by f_X(x0), we obtain a density, which is precisely f_{Y|X}(y|x0).

Again, the conditional cumulative distribution can be defined in the natural way. If X and Y are
jointly continuous, then the conditional cumulative distribution of Y given X = x is defined as

F_{Y|X}(y|x) = ∫ from −∞ to y of f_{Y|X}(t|x) dt for all x such that f_X(x) > 0.

Example 9: Suppose f_{X,Y}(x, y) = x + y for 0 < x < 1, 0 < y < 1, and 0 otherwise. Then, for 0 < x < 1,

f_{Y|X}(y|x) = (x + y)/(x + 1/2) for 0 < y < 1. Note that ∫ from 0 to 1 of (x + y)/(x + 1/2) dy = (x + 1/2)/(x + 1/2) = 1.

2.2.3 Independence

When we defined the conditional probability of two events in Chapter One, we also defined
independence of events. We have now defined the conditional distribution of random variables;
so we should define independence of random variables as well.

Let (X1, …, Xk) be a k-dimensional discrete random variable. X1, …, Xk are defined to be

stochastically independent if and only if

P(X1 = x1, …, Xk = xk) = P(X1 = x1) P(X2 = x2) ⋯ P(Xk = xk) for all (x1, …, xk).       (6)

More generally, let (X1, …, Xk) be a k-dimensional discrete or continuous random variable with joint

discrete or continuous probability density f_{X1,…,Xk}(x1, …, xk). X1, …, Xk are stochastically independent
if and only if

f_{X1,…,Xk}(x1, …, xk) = f_{X1}(x1) f_{X2}(x2) ⋯ f_{Xk}(xk) for all (x1, …, xk).       (7)

We saw that independence of events was closely related to conditional probability; likewise,
independence of random variables is closely related to conditional distributions of random
variables. For example, suppose X and Y are two independent random variables; then
f_{X,Y}(x, y) = f_X(x) f_Y(y) by the definition of independence; however, f_{X,Y}(x, y) = f_{Y|X}(y|x) f_X(x)
by the definition of conditional density, which implies that f_{Y|X}(y|x) = f_Y(y); that is, the
conditional density of Y given x is the unconditional density of Y. So to show that two random
variables are not independent, it suffices to show that f_{Y|X}(y|x) depends on x.

Example 10: Let X be the number on the downturned face of the first tetrahedron and Y the
larger of the two downturned numbers in the experiment of tossing two tetrahedra. Are X and Y
independent? Obviously not, since f_{Y|X}(y|x) depends on x; for instance, f_{Y|X}(2|2) = 1/2, whereas f_Y(2) = 3/16.

Example 11: Let f_{X,Y}(x, y) = x + y for 0 < x < 1, 0 < y < 1, and 0 otherwise. Are X and Y independent? No, since

f_{Y|X}(y|x) = (x + y)/(x + 1/2) for 0 < y < 1 depends on x and hence cannot equal f_Y(y).

Example 12: Let . X and Y independent since

2.3 Expectation

When we introduced the concept of expectation for univariate random variables in Chapter one,
we first defined the mean and variance as particular expectations and then defined the
expectation of a general function of random variable. Here, we will commence with the
definition of the expectation of a general function of a k-dimensional random variable. The
definition will be given for only those k-dimensional random variables which have densities.

2.3.1 Definition

Let (X1, …, Xk) be a k-dimensional random variable with density f_{X1,…,Xk}(x1, …, xk). The expected

value of a function g(x1, …, xk) of the k-dimensional random variable, denoted by E[g(X1, …, Xk)],
is defined to be

E[g(X1, …, Xk)] = Σ g(x1, …, xk) f_{X1,…,Xk}(x1, …, xk)       (8)


if the random variable is discrete, where the summation is over all possible values of
(x1, …, xk), and

E[g(X1, …, Xk)] = ∫ ⋯ ∫ g(x1, …, xk) f_{X1,…,Xk}(x1, …, xk) dx1 ⋯ dxk       (9)

if the random variable is continuous.

In order for the above to be defined, it is understood that the sum and the multiple integral,
respectively, exist.

Theorem 3: In particular, if g(x1, …, xk) = xi, then E[g(X1, …, Xk)] = E[Xi] = μ_{Xi}.


Proof: Assume that (X1, …, Xk) is continuous. Then

E[g(X1, …, Xk)] = ∫ ⋯ ∫ xi f_{X1,…,Xk}(x1, …, xk) dx1 ⋯ dxk = ∫ xi f_{Xi}(xi) dxi = E[Xi],

using the fact that the marginal density f_{Xi}(xi) is obtained from the joint density by integrating out the other variables.

Similarly, the following theorem can be proved.

Theorem 4: If (X1, …, Xk) is a k-dimensional random variable, then
E[X1 + X2 + ⋯ + Xk] = E[X1] + E[X2] + ⋯ + E[Xk].
We might note that the “expectation” E[Xi] in the notation of Theorem 3 has two different
interpretations: one is that the expectation is taken over the joint distribution of (X1, …, Xk), and the
other is that the expectation is taken over the marginal distribution of Xi. What Theorem 3 really
says is that these two expectations are equivalent, and hence we are justified in using the same
notation for both.

Example 13: Consider the experiment of tossing two tetrahedra. Let X be the number on the first
and Y the larger of the two numbers. We gave the joint discrete density function of X and Y in
Example 2.

E[XY] = Σ_x Σ_y x y f_{X,Y}(x, y)
      = 1·1·(1/16) + 1·2·(1/16) + 1·3·(1/16) + 1·4·(1/16) + 2·2·(2/16) + 2·3·(1/16)
        + 2·4·(1/16) + 3·3·(3/16) + 3·4·(1/16) + 4·4·(4/16) = 135/16.

E(X+Y) = Σ_x Σ_y (x + y) f_{X,Y}(x, y) = 45/8.

E[X] = 5/2, and E[Y] = 25/8; hence E(X+Y) = E[X] + E[Y].
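A short enumeration check of Example 13 is given below; it is only an illustrative sketch (numpy and the variable names are my own choices), since the 16 outcomes can be listed directly.

```python
import numpy as np
from itertools import product

# Enumerate the 16 equally likely outcomes of two fair tetrahedra.
outcomes = list(product(range(1, 5), repeat=2))
X = np.array([a for a, b in outcomes], dtype=float)          # number on the first
Y = np.array([max(a, b) for a, b in outcomes], dtype=float)  # larger of the two

print((X * Y).mean())        # E[XY]  = 135/16 = 8.4375
print((X + Y).mean())        # E[X+Y] = 45/8   = 5.625
print(X.mean(), Y.mean())    # E[X] = 2.5 and E[Y] = 3.125; their sum equals E[X+Y]
```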

Example 14: Suppose f_{X,Y}(x, y) = (x + y) I_(0,1)(x) I_(0,1)(y).

E[XY] = ∫₀¹ ∫₀¹ xy(x + y) dx dy = 1/3,

E(X+Y) = ∫₀¹ ∫₀¹ (x + y)(x + y) dx dy = 7/6,

E[X] = E[Y] = ∫₀¹ ∫₀¹ x(x + y) dx dy = 7/12.

Example 15: Let the three-dimensional random variable (X₁, X₂, X₃) have the density

( )= ( ).

Suppose we want to find

(i) E [3
(ii) E , and
(iii) E .

For (i) we have = and obtain

E[

d .

For (ii), we get E .


And for (iii) we get E

2.3.2 Covariance and Correlation Coefficient

Let X and Y be any two random variables defined on the same probability space. The covariance of
X and Y, denoted by cov[X, Y], is defined as

cov[X, Y] = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y   (10)

provided that the indicated expectation exists. The correlation coefficient of the random variables
X and Y, denoted by ρ_{X,Y}, is defined to be

ρ_{X,Y} = cov[X, Y] / (σ_X σ_Y)   (11)

provided that cov[X, Y], σ_X, and σ_Y exist, with σ_X > 0 and σ_Y > 0. Both the covariance and


correlation coefficient of random variables X and Y are measures of a linear relationship of X and
Y. The correlation coefficient removes, in a sense, the individual variability of each X and Y by
dividing the covariance by the product of the standard deviations, and thus the correlation
coefficient is a better measure of the linear relationship of X and Y than is the covariance. Also,
the correlation coefficient is unitless and lies in the interval [-1, 1].

Example 16: Find ρ_{X,Y} for X, the number on the first, and Y, the larger of the two numbers, in the
experiment of tossing two tetrahedra. We would expect that ρ_{X,Y} is positive, since when X is large,
Y tends to be large too. We calculated E[XY] in Example 13 and obtained E[XY] = 135/16.

Thus cov[X, Y] = 135/16 − (5/2)(25/8) = 5/8. Now var[X] = 5/4 and var[Y] = 55/64;

hence σ_X σ_Y = √(275/256). So ρ_{X,Y} = (5/8)/√(275/256) = 2/√11 ≈ 0.60.
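The same enumeration verifies Example 16 numerically (again a sketch with names of my own choosing; population standard deviations are appropriate here because the full sample space is listed).

```python
import numpy as np
from itertools import product

outcomes = list(product(range(1, 5), repeat=2))
X = np.array([a for a, b in outcomes], dtype=float)
Y = np.array([max(a, b) for a, b in outcomes], dtype=float)

cov = (X * Y).mean() - X.mean() * Y.mean()     # 5/8 = 0.625
rho = cov / (X.std() * Y.std())                # np.std uses the population form by default
print(cov, rho)                                # 0.625 and about 0.603 = 2/sqrt(11)
```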

Example 17: Find ρ_{X,Y} for X and Y if f_{X,Y}(x, y) = (x + y) I_(0,1)(x) I_(0,1)(y). We saw that

E[XY] = 1/3 in Example 14. Now E[X] = E[Y] = 7/12; hence

cov[X, Y] = 1/3 − (7/12)² = −1/144. Finally, var[X] = var[Y] = 5/12 − (7/12)² = 11/144, so

ρ_{X,Y} = (−1/144)/(11/144) = −1/11. Does a negative correlation coefficient seem

right?

2.3.3 Conditional Expectations

Let (X, Y) be a two-dimensional random variable and g(·,·) a function of two variables. The
conditional expectation of g(X, Y) given X = x, denoted by E[g(X, Y)|X = x], is defined to be

E[g(X, Y)|X = x] = ∫_{-∞}^{∞} g(x, y) f_{Y|X}(y|x) dy   (12)

if (X, Y) are jointly continuous, and

E[g(X, Y)|X = x] = Σ_y g(x, y) f_{Y|X}(y|x)   (13)

if (X, Y) are jointly discrete, where the summation is over all possible values of Y.

In particular, if g(x, y) = y, we have defined E[Y|X = x]. Both E[Y|X = x] and var[Y|X = x], defined below,

are functions of x. Note that this definition can be generalized to more than two dimensions. For
example, let (X₁, …, X_k) be a k-dimensional continuous random variable with
density f_{X₁,…,X_k}; then the conditional expectation of a function of (X₁, …, X_k) given
X₂ = x₂, …, X_k = x_k is defined analogously.

Example 18: In the experiment of tossing two tetrahedra with X, the number on the first, and Y,
the larger of the two numbers, we found that f_{Y|X}(y|x) = x/4 for y = x and f_{Y|X}(y|x) = 1/4 for y = x + 1, …, 4.

Hence E[Y|X = x] = x(x/4) + Σ_{y=x+1}^{4} y/4, so that E[Y|X = 1] = 5/2, E[Y|X = 2] = 11/4, E[Y|X = 3] = 13/4, and E[Y|X = 4] = 4.

Example 19: For f_{X,Y}(x, y) = (x + y) I_(0,1)(x) I_(0,1)(y), we found that

f_{Y|X}(y|x) = (x + y)/(x + 1/2) for 0 < y < 1, for each 0 < x < 1. Hence

E[Y|X = x] = ∫₀¹ y (x + y)/(x + 1/2) dy = (x/2 + 1/3)/(x + 1/2) = (3x + 2)/(6x + 3)

for 0 < x < 1.
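A numerical check of Example 19 at a single point, say x = 0.25 (an arbitrary value of my choosing), can be done with scipy; this is only a sketch of the calculation, not part of the original text.

```python
from scipy import integrate

x = 0.25
f_joint = lambda y: x + y                        # f(x, y) = x + y on the unit square
f_X = integrate.quad(f_joint, 0, 1)[0]           # marginal f_X(x) = x + 1/2
cond_mean = integrate.quad(lambda y: y * f_joint(y) / f_X, 0, 1)[0]
print(cond_mean, (3 * x + 2) / (6 * x + 3))      # both about 0.6111
```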

As we stated above, E[Y|X = x] is, in general, a function of x; let us denote it by h(x), that is,
h(x) = E[Y|X = x]. Now we can evaluate the expectation of h(X), a function of X, and will have
E[h(X)] = E[E[Y|X]]. This gives us

E[E[Y|X]] = ∫ E[Y|X = x] f_X(x) dx = ∫ ∫ y f_{Y|X}(y|x) f_X(x) dy dx = ∫ ∫ y f_{X,Y}(x, y) dy dx = E[Y].

Thus we have proved, for jointly continuous random variables X and Y, the following simple yet
very useful theorem.

Theorem 5: Let (X, Y) be a two-dimensional random variable; then E[g(Y)] = E[E[g(Y)|X]],
and in particular

E[Y] = E[E[Y|X]].   (14)

The variance of Y given X = x is defined by var[Y|X = x] = E[(Y − E[Y|X = x])² | X = x] = E[Y²|X = x] − (E[Y|X = x])².


Theorem 6: var[Y] = E[var[Y|X]] + var[E[Y|X]].

Proof: var[Y] = E[Y²] − (E[Y])² = E[E[Y²|X]] − (E[E[Y|X]])² = E[var[Y|X] + (E[Y|X])²] − (E[E[Y|X]])² = E[var[Y|X]] + var[E[Y|X]].

Let us note in words what the two theorems say. Equation (14) states that the mean of Y is the
mean or expectation of the conditional mean of Y, and Theorem 6 states that the variance of Y is
the mean or expectation of the conditional variance of Y, plus the variance of the conditional
mean of Y.
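Theorems 5 and 6 can be checked on the two-tetrahedra example by enumeration; the sketch below (my own construction, not part of the text) conditions on each value of X and recombines the pieces.

```python
import numpy as np
from itertools import product

outcomes = np.array([(a, max(a, b)) for a, b in product(range(1, 5), repeat=2)], dtype=float)
X, Y = outcomes[:, 0], outcomes[:, 1]

cond_mean = np.array([Y[X == x].mean() for x in range(1, 5)])   # E[Y|X=x], x = 1,...,4
cond_var = np.array([Y[X == x].var() for x in range(1, 5)])     # var[Y|X=x]

# Since X is uniform on {1,2,3,4}, averaging over x with equal weights is the E over X.
print(Y.mean(), cond_mean.mean())                     # both 3.125
print(Y.var(), cond_var.mean() + cond_mean.var())     # both 0.859375
```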

We will conclude this subsection with one further theorem. The proof can be routinely obtained
from the definition of conditional expectation and is left as an exercise. Also, the theorem can be
generalized to more than two dimensions.

Theorem 7: Let (X, Y) be a two-dimensional random variable and g₁(·) and g₂(·) functions of
one variable. Then
(i) E[g₁(X) g₂(Y) | X = x] = g₁(x) E[g₂(Y) | X = x];
(ii) E[g₁(X) + g₂(Y) | X = x] = g₁(x) + E[g₂(Y) | X = x].

2.3.4 Joint Moment Generating Function and Moments

We will use our definition of the expectation of a function of several variables to define joint
moments and the joint moment generating function.

The joint raw moments of X₁, …, X_k are defined by E[X₁^{r₁} X₂^{r₂} ⋯ X_k^{r_k}], where the rᵢ's are 0 or any
positive integer; the joint moments about the means are defined by
E[(X₁ − μ_{X₁})^{r₁} (X₂ − μ_{X₂})^{r₂} ⋯ (X_k − μ_{X_k})^{r_k}].

Remark: If rᵢ = rⱼ = 1 and all other rₘ's are 0, then that particular joint moment about the
means becomes E[(Xᵢ − μ_{Xᵢ})(Xⱼ − μ_{Xⱼ})], which is just the covariance between Xᵢ and Xⱼ.

The joint moment generating function of (X₁, …, X_k) is defined by

m(t₁, …, t_k) = E[exp(t₁X₁ + t₂X₂ + ⋯ + t_kX_k)]   (15)

if the expectation exists for all values of t₁, …, t_k such that −h < tⱼ < h for some h > 0,
j = 1, …, k.

The rth raw moment of Xⱼ may be obtained from m(t₁, …, t_k) by differentiating it r times with
respect to tⱼ and then taking the limit as all the t's approach 0. Also, E[Xᵢʳ Xⱼˢ] can be obtained by
differentiating the joint moment generating function r times with respect to tᵢ, and s times with
respect to tⱼ, and then taking the limit as all the t's approach 0. Similarly, other joint raw moments
can be generated.

Remark: m_{Xⱼ}(tⱼ) = m(0, …, 0, tⱼ, 0, …, 0); that is, the marginal moment generating functions can be obtained from the
joint moment generating function by setting the other arguments equal to 0.

2.3.5 Independence and Expectation

We have already defined independence and expectation; in this section we will relate the two
concepts.

Theorem 8: If X and Y are independent and g₁(·) and g₂(·) are two functions, each of a single
argument, then E[g₁(X) g₂(Y)] = E[g₁(X)] E[g₂(Y)].
Proof: We will give the proof for jointly continuous random variables:

E[g₁(X) g₂(Y)] = ∫ ∫ g₁(x) g₂(y) f_{X,Y}(x, y) dx dy = ∫ ∫ g₁(x) g₂(y) f_X(x) f_Y(y) dx dy = E[g₁(X)] E[g₂(Y)].

Corollary: If X and Y are independent, then cov[X, Y] = 0.

Proof: Take g₁(x) = x − μ_X and g₂(y) = y − μ_Y; by Theorem 8, cov[X, Y] = E[(X − μ_X)(Y − μ_Y)] = E[X − μ_X] E[Y − μ_Y] = 0.


The converse of the above corollary is not always true; that is, cov[X, Y] = 0 does not always
imply that X and Y are independent, as the following example shows.

Example 10: Let U be a random variable which is uniformly distributed over the interval (0,1).
Define X = sin 2πU and Y = cos 2πU. X and Y are clearly not independent, since if a value of X
is known, then U is one of two values, and so the conditional distribution of Y given X is not the same
as the marginal distribution of Y; yet cov[X, Y] = E[sin 2πU cos 2πU] − E[sin 2πU] E[cos 2πU] = 0.

Theorem 9: Two jointly distributed random variables X and Y are independent if and only if
m_{X,Y}(t₁, t₂) = m_X(t₁) m_Y(t₂) for all t₁, t₂ for which −h < tᵢ < h, i = 1, 2, for some h > 0.
Proof: [Recall that m_X(t₁) = m_{X,Y}(t₁, 0) is the moment generating function of X. Also note that
m_Y(t₂) = m_{X,Y}(0, t₂).] X and Y independent imply that the joint moment generating function factors into
the product of the marginal moment generating functions by Theorem 8, taking
g₁(x) = e^{t₁x} and g₂(y) = e^{t₂y}. The proof in the other direction will be omitted.

Remark: Both Theorems 8 and 9 can be generalized from two random variables to k random
variables.

2.3.6 Cauchy-Schwarz Inequality

Theorem 10: Let X and Y have finite second moments; then

(E[XY])² ≤ E[X²] E[Y²], with equality if and only if P[Y = cX] = 1 for some constant c.
Proof: The existence of E[XY] follows from the existence of the
expectations E[X²] and E[Y²]. Define
h(t) = E[(tX − Y)²] = t² E[X²] − 2t E[XY] + E[Y²]. Now h(t) is a quadratic function in t which is greater than or equal to 0. If h(t) > 0 for all t, then
the roots of h(t) are not real; so the discriminant satisfies 4(E[XY])² − 4E[X²]E[Y²] < 0, or (E[XY])² < E[X²]E[Y²].
If h(t) = 0 for some t, say t₀, then E[(t₀X − Y)²] = 0, which implies P[t₀X = Y] = 1.


Corollary: ρ²_{X,Y} ≤ 1, with equality if and only if one random variable is a linear function of the other with
probability 1.

Proof: Rewrite the Cauchy-Schwarz inequality as (E[UV])² ≤ E[U²] E[V²], and set

U = X − μ_X and V = Y − μ_Y.
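A simulation sketch (my own, using an arbitrary correlated pair) illustrating Theorem 10 and its corollary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)         # deliberately correlated with x

lhs = np.mean(x * y) ** 2
rhs = np.mean(x ** 2) * np.mean(y ** 2)
print(lhs <= rhs)                            # True: (E[XY])^2 <= E[X^2]E[Y^2]
print(abs(np.corrcoef(x, y)[0, 1]) <= 1)     # True: |rho| <= 1
```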

2.4 Bivariate Normal Distribution

One of the important multivariate densities is the multivariate normal density, which is a
generalization of the normal distribution for a unidimensional random variable. In this section we
discuss a special case, the case of the bivariate normal. In our discussion we will include the joint
density, marginal densities, conditional densities, conditional means and variances, covariance,
and the moment generating function. This section, then, will give an example of many of the
concepts defined in the preceding sections of this chapter.

2.4.1 Density Function

Let the two-dimensional random variable (X, Y) have the joint probability density function

f(x, y) = [1/(2π σ_X σ_Y √(1 − ρ²))] exp{ −[1/(2(1 − ρ²))] [((x − μ_X)/σ_X)² − 2ρ((x − μ_X)/σ_X)((y − μ_Y)/σ_Y) + ((y − μ_Y)/σ_Y)²] }   (16)

for −∞ < x < ∞ and −∞ < y < ∞, where μ_X, μ_Y, σ_X, σ_Y, and ρ are constants such that

σ_X > 0, σ_Y > 0, and −1 < ρ < 1. Then the random variable (X, Y) is
defined to have a bivariate normal distribution.

Figure 2.2: Curve of bivariate normal distribution.

The density in (16) may be represented by a bell-shaped surface as in Figure 2.2.


Any plane parallel to the xy plane that cuts the surface will intersect it in an elliptic curve, while any
plane perpendicular to the xy plane will cut the surface in a curve of the normal form.


The probability that a point (X, Y) will lie in any region R of the xy plane is obtained by integrating the
density over that region:

P[(X, Y) ∈ R] = ∫∫_R f(x, y) dy dx.   (17)

The density might, for example, represent the distribution of hits on a vertical target, where x and
y represent the horizontal and vertical deviations from the central lines. In fact, the bivariate normal
closely approximates the distribution of hits as well as of many other bivariate
populations encountered in practice.

We must first show that the function actually represents a density by showing that its integral
over the whole plane is 1, that is,

∫_{-∞}^{∞} ∫_{-∞}^{∞} f(x, y) dx dy = 1.

The density is, of course, positive. To simplify the integral, we shall substitute

u = (x − μ_X)/σ_X   and   v = (y − μ_Y)/σ_Y,   (18)

so that it becomes

[1/(2π√(1 − ρ²))] ∫ ∫ exp{−(u² − 2ρuv + v²)/(2(1 − ρ²))} du dv.

On completing the square on u in the exponent, we have

[1/(2π√(1 − ρ²))] ∫ ∫ exp{−[(u − ρv)² + (1 − ρ²)v²]/(2(1 − ρ²))} du dv,

and if we substitute w = (u − ρv)/√(1 − ρ²), the integral may be written as the product of two simple integrals,

[(1/√(2π)) ∫ e^{−w²/2} dw] [(1/√(2π)) ∫ e^{−v²/2} dv],

both of which are 1, as we have seen in studying the univariate normal distribution.
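The normalization can also be confirmed numerically; the sketch below integrates the density in (16) over a wide box for one arbitrary parameter choice (the values mu_x = 1, mu_y = -2, sigma_x = 2, sigma_y = 0.5, rho = 0.6 are mine, purely for illustration).

```python
import numpy as np
from scipy import integrate

mu_x, mu_y, sx, sy, rho = 1.0, -2.0, 2.0, 0.5, 0.6

def f(y, x):                                   # bivariate normal density of (16)
    zx, zy = (x - mu_x) / sx, (y - mu_y) / sy
    q = (zx**2 - 2 * rho * zx * zy + zy**2) / (1 - rho**2)
    return np.exp(-q / 2) / (2 * np.pi * sx * sy * np.sqrt(1 - rho**2))

total, _ = integrate.dblquad(f, mu_x - 8 * sx, mu_x + 8 * sx,
                             mu_y - 8 * sy, mu_y + 8 * sy)
print(total)                                   # approximately 1
```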

Remark: The cumulative bivariate normal distribution,

F(x, y) = ∫_{-∞}^{x} ∫_{-∞}^{y} f(s, t) dt ds,

may be reduced to a form involving only the parameter ρ by making the substitution in (18).

2.4.2 Moment Generating Function and Moment

To obtain the moments of X and Y, we shall find their joint moment generating function, which is
given by m(t₁, t₂) = E[exp(t₁X + t₂Y)].

Theorem 11: The moment generating function of the bivariate normal distribution is

m(t₁, t₂) = exp{ μ_X t₁ + μ_Y t₂ + ½(σ_X² t₁² + 2ρσ_Xσ_Y t₁t₂ + σ_Y² t₂²) }.

Proof: Let us again substitute for x and y in terms of u and v to obtain

(19)

The combined exponents in the integrand may be written

and on completing the square first on u and then on v, we find this expression becomes

which, if we substitute

becomes

and the integral in (19) may be written


since the double integral is equal to unity.

The moments may be obtained by evaluating the appropriate derivative of m(t₁, t₂) at t₁ = 0, t₂ = 0.
Thus,

∂m/∂t₁ |_{t₁=t₂=0} = μ_X   and   ∂²m/∂t₁² |_{t₁=t₂=0} = μ_X² + σ_X².

Hence the variance of X is E[X²] − (E[X])² = σ_X².

Similarly, on differentiating with respect to t₂, one finds the mean and variance of Y to be μ_Y and
σ_Y². We can also obtain joint moments E[XʳYˢ] by differentiating m(t₁, t₂) r times with
respect to t₁ and s times with respect to t₂ and then putting t₁ and t₂ equal to 0. The covariance of X
and Y is

cov[X, Y] = ∂²m/∂t₁∂t₂ |_{t₁=t₂=0} − μ_Xμ_Y = ρσ_Xσ_Y.

Hence, the parameter ρ is the correlation coefficient of X and Y.
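These moment calculations can be reproduced symbolically by differentiating the moment generating function of Theorem 11; the sympy sketch below is mine and simply automates the derivatives.

```python
import sympy as sp

t1, t2, rho = sp.symbols('t1 t2 rho')
mx, my = sp.symbols('mu_x mu_y')
sx, sy = sp.symbols('sigma_x sigma_y', positive=True)

m = sp.exp(mx*t1 + my*t2
           + sp.Rational(1, 2)*(sx**2*t1**2 + 2*rho*sx*sy*t1*t2 + sy**2*t2**2))
at0 = {t1: 0, t2: 0}

EX  = sp.diff(m, t1).subs(at0)          # mu_x
EX2 = sp.diff(m, t1, 2).subs(at0)       # mu_x**2 + sigma_x**2
EXY = sp.diff(m, t1, t2).subs(at0)      # mu_x*mu_y + rho*sigma_x*sigma_y
print(sp.simplify(EX), sp.simplify(EX2 - EX**2), sp.simplify(EXY - mx*my))
```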

Theorem 12: If (X, Y) has a bivariate normal distribution, then X and Y are independent if and
only if X and Y are uncorrelated.
Proof: X and Y are uncorrelated if and only if ρ = 0 or, equivalently, if and only if cov[X, Y] = 0. If ρ = 0, the
joint density f(x, y) becomes the product of two univariate normal densities; so ρ = 0
implies that X and Y are independent. We know that, in general, independence of X and Y implies that X
and Y are uncorrelated.

2.4.3 Marginal and Conditional Densities

Theorem 13: If (X, Y) has a bivariate normal distribution, then the marginal distributions of X
and Y are univariate normal distributions; that is, X is normally distributed with mean μ_X and
variance σ_X², and Y is normally distributed with mean μ_Y and variance σ_Y².
Proof: The marginal density of one of the variables, X for example, is by definition

f_X(x) = ∫_{-∞}^{∞} f(x, y) dy,

and, again substituting as in (18) and completing the square on v, one finds that the integral over v
reduces to a univariate normal integral equal to 1. The substitutions then show at once that

f_X(x) = [1/(√(2π) σ_X)] exp{−(x − μ_X)²/(2σ_X²)},

the univariate normal density. Similarly, the marginal density of Y may be found to be normal with mean μ_Y and variance σ_Y².

Theorem 14: If (X, Y) has a bivariate normal distribution, then the conditional distribution of X
given Y = y is normal with mean μ_X + ρ(σ_X/σ_Y)(y − μ_Y) and variance σ_X²(1 − ρ²). Also, the

conditional distribution of Y given X = x is normal with mean μ_Y + ρ(σ_Y/σ_X)(x − μ_X) and variance

σ_Y²(1 − ρ²).
Proof: The conditional distributions are obtained from the joint and marginal distributions. Thus,
the conditional density of X for fixed values of Y is


and, after substituting, the expression may be put in the form

f_{X|Y}(x|y) = [1/(√(2π) σ_X √(1 − ρ²))] exp{ −[x − μ_X − ρ(σ_X/σ_Y)(y − μ_Y)]² / (2σ_X²(1 − ρ²)) },   (20)

which is a univariate normal density with mean μ_X + ρ(σ_X/σ_Y)(y − μ_Y) and with variance

σ_X²(1 − ρ²). The conditional density of Y given X = x may be obtained by interchanging x and y throughout (20).

As we already noted, the mean value of a random variable in a conditional distribution is called a
regression curve when regarded as a function of the fixed variable in the conditional distribution.
Thus the regression of X on Y = y in (20) is μ_X + ρ(σ_X/σ_Y)(y − μ_Y), which is a linear function of y in

the present case. For bivariate distributions in general, the mean of X in the conditional density
of X given Y = y will be some function of y, say g(y), and the equation x = g(y), when plotted in
the xy plane, gives the regression curve for X. It is simply a curve which gives the location of the
mean of X for various values of y in the conditional density of X given Y = y.
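A simulation sketch (parameter values are my own, chosen only for illustration) showing that the conditional mean of Y given X near a point x₀ falls on the regression line of Theorem 14:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
sx, sy, rho = 2.0, 0.5, 0.6
cov = np.array([[sx**2, rho*sx*sy], [rho*sx*sy, sy**2]])
x, y = rng.multivariate_normal(mu, cov, size=500_000).T

x0 = 2.0
band = np.abs(x - x0) < 0.05                        # condition on X near x0
print(y[band].mean())                               # about -1.85
print(mu[1] + rho * (sy / sx) * (x0 - mu[0]))       # regression line value: -1.85
```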

2.5 Distributions of Functions of Random Variables

As the title of this section indicates, we are interested in finding the distributions of functions of
random variables. More precisely, for given random variables, say X₁, …, X_n, and given
functions of the n given random variables, say g₁(·, …, ·), …, g_k(·, …, ·), we want, in general, to
find the joint distribution of Y₁, …, Y_k, where Yⱼ = gⱼ(X₁, …, X_n), j = 1, …, k. If the joint density
of the random variables X₁, …, X_n is given, then, theoretically at least, we can find the joint
distribution of Y₁, …, Y_k. This follows since the joint cumulative distribution function of
(Y₁, …, Y_k) satisfies the following:


F_{Y₁,…,Y_k}(y₁, …, y_k) = P[g₁(X₁, …, X_n) ≤ y₁, …, g_k(X₁, …, X_n) ≤ y_k]

for fixed (y₁, …, y_k), which is the probability of an event described in terms of X₁, …, X_n, and
theoretically such a probability can be determined by integrating or summing the joint density
over the region corresponding to the event. The problem is that, in general, one cannot easily
evaluate the desired probability for each (y₁, …, y_k). One of the important problems of statistical
inference, the estimation of parameters, provides us with an example of a problem in which it is
useful to be able to find the distribution of a function of jointly distributed random variables. In this section
three techniques for finding the distribution of functions of random variables will be presented.
These are (i) the cumulative distribution function technique, (ii) the moment generating function
technique, and (iii) the transformation technique.

2.5.1 Expectations of Functions of Random Variables


a) Expectation two ways

An expectation of a function of a set of random variables can be obtained in two
different ways. To illustrate, consider a function of just one random variable, say X. Let g(·) be
the function, and set Y = g(X). Since Y is a random variable, E[Y] is defined (if it exists), and so is E[g(X)]
(if it exists). For instance, if X and Y are continuous random variables, then by
definition

E[Y] = ∫ y f_Y(y) dy   (21)

and

E[g(X)] = ∫ g(x) f_X(x) dx,   (22)

but Y = g(X), so it seems reasonable that E[Y] = E[g(X)]. This can, in fact, be proved,

although we will not bother to do it. Thus we have two ways of calculating the expectation of
Y = g(X); one is to average Y with respect to the density of Y, and the other is to average g(X)
with respect to the density of X.

In general, for given random variables X₁, …, X_n, let Y = g(X₁, …, X_n); then

E[Y] = E[g(X₁, …, X_n)], where (for jointly continuous random variables)

E[Y] = ∫ y f_Y(y) dy   (23)


and

E[g(X₁, …, X_n)] = ∫ ⋯ ∫ g(x₁, …, x_n) f_{X₁,…,X_n}(x₁, …, x_n) dx₁ ⋯ dx_n.   (24)

In practice, one would naturally select the method which makes the calculations easier. One
might suspect that (23) gives the better method of the two since it involves only a single integral,
whereas (24) involves a multiple integral. On the other hand, (23) involves the density of Y, a
density that may have to be obtained before the integration can proceed.

Example 11: Let X be a standard normal random variable, and let Y = X². Computing the expectation both ways,

E[g(X)] = E[X²] = ∫_{-∞}^{∞} x² (1/√(2π)) e^{−x²/2} dx = 1,

and

E[Y] = ∫₀^{∞} y f_Y(y) dy = r/λ = 1,

using the fact that Y has a gamma distribution with parameters r = 1/2 and λ = 1/2.
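A Monte Carlo sketch of Example 11 (my own check, not part of the text): simulate X, square it, and compare with the stated gamma distribution, which for r = 1/2 and λ = 1/2 is the chi-square distribution with one degree of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.standard_normal(1_000_000) ** 2

print(y.mean())                                    # about 1, as computed both ways above
# KS test against chi-square with 1 degree of freedom; the p-value should not be near zero.
print(stats.kstest(y, 'chi2', args=(1,)).pvalue)
```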

b) Sums of random variables

A simple, yet important, function of several random variables is their sum.

Theorem 15: For random variables X₁, …, X_n,

E[X₁ + ⋯ + X_n] = Σᵢ E[Xᵢ]   and   var[X₁ + ⋯ + X_n] = Σᵢ var[Xᵢ] + 2 ΣΣ_{i<j} cov[Xᵢ, Xⱼ].

Proof: That E[Σ Xᵢ] = Σ E[Xᵢ] follows from the linearity property of expectation.


Corollary: If X₁, …, X_n are uncorrelated random variables, then var[X₁ + ⋯ + X_n] = Σᵢ var[Xᵢ].

The following theorem gives a result that is somewhat related to the above theorem inasmuch as
its proof, which is left as an exercise, is similar.

Theorem 16: Let X₁, …, X_n and Y₁, …, Y_m be two sets of random variables, and let
a₁, …, a_n and b₁, …, b_m be two sets of constants; then

cov[Σᵢ aᵢXᵢ, Σⱼ bⱼYⱼ] = Σᵢ Σⱼ aᵢbⱼ cov[Xᵢ, Yⱼ].   (25)

Corollary: If X₁, …, X_n are random variables and a₁, …, a_n are constants, then

var[Σᵢ aᵢXᵢ] = Σᵢ aᵢ² var[Xᵢ] + 2 ΣΣ_{i<j} aᵢaⱼ cov[Xᵢ, Xⱼ].   (26)

In particular, if X₁, …, X_n are independent and identically distributed random variables with

mean μ_X and variance σ_X², and if X̄_n = (1/n) Σᵢ Xᵢ, then E[X̄_n] = μ_X and var[X̄_n] = σ_X²/n.

If X₁ and X₂ are two random variables, then var[X₁ ± X₂] = var[X₁] + var[X₂] ± 2cov[X₁, X₂].
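The two-variable case of (26) can be checked empirically; in the sketch below (my own construction, arbitrary constants a and b), the identity holds exactly for the sample moments as long as the same divisor (ddof = 0) is used throughout.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal((3, 1_000_000))
x, y = z[0] + 0.5 * z[2], z[1] - 0.3 * z[2]      # correlated through a shared term
a, b = 2.0, -1.5

lhs = np.var(a * x + b * y)
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * np.cov(x, y, ddof=0)[0, 1]
print(lhs, rhs)                                   # identical up to floating-point error
```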

c) Product and quotient

In the above subsection the mean and variance of the sum and difference of two random
variables were obtained. It was found that the mean and variance of the sum or difference of
random variables X and Y could be expressed in terms of the means, variances, and covariance of
X and Y. We consider now the problem of finding the first two moments of the product and
quotient of X and Y.

Theorem 17: Let X and Y be two random variables for which var[XY] exists; then

E[XY] = μ_X μ_Y + cov[X, Y]

and

var[XY] = μ_Y² var[X] + μ_X² var[Y] + 2μ_Xμ_Y cov[X, Y] − (cov[X, Y])²
          + E[(X − μ_X)²(Y − μ_Y)²] + 2μ_Y E[(X − μ_X)²(Y − μ_Y)] + 2μ_X E[(X − μ_X)(Y − μ_Y)²].


Proof: Write X = μ_X + (X − μ_X) and Y = μ_Y + (Y − μ_Y), and calculate E[XY] and E[(XY)²] to get the desired results.

If X and Y are independent, then E[XY] = μ_X μ_Y and

var[XY] = μ_Y² σ_X² + μ_X² σ_Y² + σ_X² σ_Y².

Note that the mean of the product can be expressed in terms of the means and covariance of X
and Y but the variance of the product requires higher order moments.

In general, there are no simple exact formulas for the mean and variance of the quotient of two
random variables in terms of moments of the two random variables; however, there are
approximate formulas which are sometimes useful.

Remark: Useful approximations are

E[X/Y] ≈ μ_X/μ_Y − (1/μ_Y²) cov[X, Y] + (μ_X/μ_Y³) var[Y]

and

var[X/Y] ≈ (μ_X/μ_Y)² [var[X]/μ_X² − 2cov[X, Y]/(μ_Xμ_Y) + var[Y]/μ_Y²].

2.5.2 Cumulative Distribution Function Technique


a) Description of technique

If the joint distribution of random variables X₁, …, X_n is given, then, theoretically, the joint
distribution of random variables Y₁, …, Y_k can be determined, where Yⱼ = gⱼ(X₁, …, X_n)
for given functions g₁, …, g_k. By definition, the cumulative distribution
function of (Y₁, …, Y_k) is F_{Y₁,…,Y_k}(y₁, …, y_k) = P[Y₁ ≤ y₁, …, Y_k ≤ y_k]. But for each (y₁, …, y_k) the
event {Y₁ ≤ y₁, …, Y_k ≤ y_k} = {g₁(X₁, …, X_n) ≤ y₁, …, g_k(X₁, …, X_n) ≤ y_k}. This latter event is described in
terms of the given functions g₁, …, g_k and the given random variables X₁, …, X_n.
Since the joint distribution of X₁, …, X_n is assumed given, presumably the probability of the event
can be calculated and consequently F_{Y₁,…,Y_k}
determined. The above-described technique for deriving the joint distribution of Y₁, …, Y_k will be
called the cumulative distribution function technique.


An important special case arises if k = 1; then there is only one function, say Y = g(X₁, …, X_n), of the
given random variables for which one needs to derive the distribution.

Example 12: Let there be only one given random variable, say X, which has a standard normal
distribution. Suppose the distribution of Y = X² is desired. For y > 0,

F_Y(y) = P[X² ≤ y] = P[−√y ≤ X ≤ √y] = Φ(√y) − Φ(−√y) = 2Φ(√y) − 1,

which can be recognized as the cumulative distribution function of a gamma distribution with
parameters r = 1/2 and λ = 1/2.

b) Distribution of minimum and maximum

Let X₁, …, X_n be n given random variables. Define Y₁ = min{X₁, …, X_n} and

Y_n = max{X₁, …, X_n}. The distributions of Y₁ and Y_n are desired.

F_{Y_n}(y) = P[Y_n ≤ y] = P[X₁ ≤ y, …, X_n ≤ y], since the largest of the Xᵢ's is less than or equal to y if and only if all the
Xᵢ's are less than or equal to y. Now, if the Xᵢ's are assumed independent, then

P[X₁ ≤ y, …, X_n ≤ y] = ∏_{i=1}^{n} F_{Xᵢ}(y), so the distribution of Y_n can be expressed in terms of the marginal

distributions of X₁, …, X_n. If in addition it is assumed that all the Xᵢ have the same
cumulative distribution, say F(·), then F_{Y_n}(y) = [F(y)]ⁿ.

We have proved Theorem 18.

Theorem 18: If X₁, …, X_n are independent random variables and Y_n = max{X₁, …, X_n}, then

F_{Y_n}(y) = ∏_{i=1}^{n} F_{Xᵢ}(y). If X₁, …, X_n are independent and identically distributed with common cumulative

distribution function F(·), then F_{Y_n}(y) = [F(y)]ⁿ.

Corollary: If X₁, …, X_n are independent and identically distributed continuous random variables
with common probability density function f(·) and cumulative distribution function F(·), then

f_{Y_n}(y) = n[F(y)]^{n−1} f(y).

Proof: f_{Y_n}(y) = dF_{Y_n}(y)/dy = n[F(y)]^{n−1} f(y).

Similarly,

F_{Y₁}(y) = P[Y₁ ≤ y] = 1 − P[Y₁ > y] = 1 − P[X₁ > y, …, X_n > y],

since Y₁ is greater than y if and only if every Xᵢ > y. And if X₁, …, X_n are independent, then

F_{Y₁}(y) = 1 − ∏_{i=1}^{n} [1 − F_{Xᵢ}(y)].

If further it is assumed that X₁, …, X_n are identically distributed with common cumulative

distribution function F(·), then F_{Y₁}(y) = 1 − [1 − F(y)]ⁿ,

and we have proved Theorem 19.

Theorem 19: If X₁, …, X_n are independent random variables and Y₁ = min{X₁, …, X_n}, then

F_{Y₁}(y) = 1 − ∏_{i=1}^{n} [1 − F_{Xᵢ}(y)]. If X₁, …, X_n are independent and identically distributed with common cumulative

distribution function F(·), then F_{Y₁}(y) = 1 − [1 − F(y)]ⁿ.
Corollary: If X₁, …, X_n are independent and identically distributed continuous random variables
with common probability density function f(·) and cumulative distribution function F(·), then

f_{Y₁}(y) = n[1 − F(y)]^{n−1} f(y).

Proof: f_{Y₁}(y) = dF_{Y₁}(y)/dy = n[1 − F(y)]^{n−1} f(y).

Example 13: Suppose that the life of a certain light bulb is exponentially distributed with mean
100 hours. If 10 such light bulbs are installed simultaneously, what is the distribution of the life
of the light bulb that fails first, and what is its expected life? Let Xᵢ denote the life of the ith light bulb;
then Y₁ = min{X₁, …, X₁₀} is the life of the light bulb that fails first. Assume that the Xᵢ's are
independent.

Now F_{Xᵢ}(x) = (1 − e^{−x/100}) I_(0,∞)(x);

so f_{Y₁}(y) = 10[1 − F(y)]⁹ f(y) = 10 e^{−9y/100} (1/100) e^{−y/100} = (1/10) e^{−y/10} I_(0,∞)(y), which is an

exponential distribution with parameter λ = 1/10; hence E[Y₁] = 1/λ = 10.
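A simulation sketch of Example 13 (my own, with numpy): draw 10 exponential lifetimes with mean 100 repeatedly and average the minimum.

```python
import numpy as np

rng = np.random.default_rng(5)
lifetimes = rng.exponential(scale=100, size=(1_000_000, 10))   # 10 bulbs, mean 100 hours
first_failure = lifetimes.min(axis=1)

print(first_failure.mean())    # about 10: the minimum is exponential with mean 100/10
```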

c) Distribution of sum and difference of two random variables

Let X and Y be jointly distributed continuous random variables with density f_{X,Y}(x, y), and let
Z = X + Y and D = X − Y. Then

f_Z(z) = ∫_{-∞}^{∞} f_{X,Y}(x, z − x) dx   and   f_D(d) = ∫_{-∞}^{∞} f_{X,Y}(x, x − d) dx.

Proof: F_Z(z) = P[X + Y ≤ z] = ∫_{-∞}^{∞} ∫_{-∞}^{z−x} f_{X,Y}(x, y) dy dx = ∫_{-∞}^{∞} ∫_{-∞}^{z} f_{X,Y}(x, u − x) du dx, by making the
substitution u = x + y. Now

f_Z(z) = dF_Z(z)/dz = ∫_{-∞}^{∞} f_{X,Y}(x, z − x) dx. ///

If X and Y are independent continuous random variables and Z = X + Y, then

f_Z(z) = ∫_{-∞}^{∞} f_X(x) f_Y(z − x) dx.

Proof: f_{X,Y}(x, z − x) = f_X(x) f_Y(z − x) by independence.

Hence,

f_Z(z) = ∫_{-∞}^{∞} f_X(x) f_Y(z − x) dx.


///

Example 14: Suppose that X and Y are independent and identically distributed with density
f(x) = I_(0,1)(x). Note that since both X and Y assume values between 0 and 1, Z = X + Y
assumes values between 0 and 2. The convolution formula gives the triangular density f_Z(z) = z for 0 < z ≤ 1 and f_Z(z) = 2 − z for 1 < z < 2.
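A histogram check of Example 14 (an illustrative sketch, not part of the original): the empirical density of X + Y should match the triangular density.

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.uniform(size=1_000_000) + rng.uniform(size=1_000_000)

hist, edges = np.histogram(z, bins=40, range=(0, 2), density=True)
mid = (edges[:-1] + edges[1:]) / 2
triangular = np.where(mid <= 1, mid, 2 - mid)
print(np.max(np.abs(hist - triangular)))   # small, consistent with the triangular density
```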

d) Distribution of product and quotient

Let X and Y be jointly distributed continuous random variables with density f_{X,Y}(x, y), and let
Z = XY and U = X/Y; then

f_Z(z) = ∫_{-∞}^{∞} (1/|x|) f_{X,Y}(x, z/x) dx   and   f_U(u) = ∫_{-∞}^{∞} |y| f_{X,Y}(uy, y) dy.

Proof:

which, on making the substitution


Hence,

Example 15: Suppose X and Y are independent random variables, each uniformly distributed
over the interval (0,1). Let Z = XY and U = X/Y. Then

f_Z(z) = ∫_z^1 (1/x) dx = −ln z   for 0 < z < 1,

and

f_U(u) = (1/2) I_(0,1](u) + (1/(2u²)) I_(1,∞)(u)   (see Figure 2.3).

Note that E[X/Y] does not exist (it is infinite), quite different from E[X]/E[Y] = 1.

Figure 2.3: sketch of the curve y = 1/u over the unit interval of the u axis.


2.5.3 Moment Generating Function Technique


a. Description of techniques

There is another method of determining the distribution of functions of random variables which
we shall find to be particularly useful in certain instances. This method is built around the
concept of the moment generating function and will be called the generating function technique.

The statement of the problem remains the same. For given random variables X₁, …, X_n with given
density f_{X₁,…,X_n} and given functions g₁, …, g_k, find the joint
distribution of Y₁ = g₁(X₁, …, X_n), …, Y_k = g_k(X₁, …, X_n). Now the joint moment generating
function of Y₁, …, Y_k, if it exists, is

m_{Y₁,…,Y_k}(t₁, …, t_k) = E[exp(t₁Y₁ + ⋯ + t_kY_k)] = ∫ ⋯ ∫ exp[Σⱼ tⱼ gⱼ(x₁, …, x_n)] f_{X₁,…,X_n}(x₁, …, x_n) dx₁ ⋯ dx_n.   (27)

If, after the integration in (27) is performed, the resulting function of (t₁, …, t_k) can be recognized
as the joint moment generating function of some known joint distribution, it will follow that
(Y₁, …, Y_k) has that joint distribution, by virtue of the fact that a moment generating function, when
it exists, is unique and uniquely determines its distribution function.

The most useful application of the moment generating function technique is in finding the
distribution of sums of independent random variables.

Example 16: Suppose X has a normal distribution with mean 0 and variance 1. Let Y = X², and
find the distribution of Y.

m_Y(t) = E[e^{tX²}] = ∫_{-∞}^{∞} e^{tx²} (1/√(2π)) e^{−x²/2} dx = (1 − 2t)^{−1/2}

for t < 1/2,

which we recognize as the moment generating function of a gamma with parameters r = 1/2 and λ = 1/2

(it is also called a chi-square distribution with one degree of freedom).

Example 17: Let X₁ and X₂ be two independent standard normal random variables. Let
Y₁ = X₁ + X₂ and Y₂ = X₁ − X₂. Find the joint distribution of Y₁ and Y₂.

m_{Y₁,Y₂}(t₁, t₂) = E[exp(t₁(X₁ + X₂) + t₂(X₁ − X₂))] = E[exp((t₁ + t₂)X₁)] E[exp((t₁ − t₂)X₂)]
                 = exp[(t₁ + t₂)²/2] exp[(t₁ − t₂)²/2] = exp(t₁²) exp(t₂²).

We note that Y₁ and Y₂ are independent random variables (by Theorem 9) and each has a normal
distribution with mean 0 and variance 2.

In the above example we were able to manipulate expectations and avoid performing an
integration to find the desired joint moment generating function. In the following example the
integration will have to be performed.

Example 18: Let X1 and X2 be two independent standard normal random variables. Let

, and find the distribution of Y.


for t<1/2,

which is the moment generating function of a gamma distribution with parameters

; hence,

///

b. Distribution of sums of independent random variables

Theorem 20: If X₁, …, X_n are independent random variables and the moment generating function
of each, m_{Xᵢ}(t), exists for all −h < t < h for some h > 0, let Y = X₁ + ⋯ + X_n; then

m_Y(t) = ∏_{i=1}^{n} m_{Xᵢ}(t)   for −h < t < h.

Proof: m_Y(t) = E[e^{tY}] = E[e^{t(X₁+⋯+X_n)}] = E[∏ᵢ e^{tXᵢ}] = ∏ᵢ E[e^{tXᵢ}] = ∏ᵢ m_{Xᵢ}(t),

using Theorem 9.

Example 19: Suppose that X₁, …, X_n are independent Bernoulli random variables; that is,
P[Xᵢ = 1] = p and P[Xᵢ = 0] = q = 1 − p. Then m_{Xᵢ}(t) = E[e^{tXᵢ}] = pe^t + q.

So m_Y(t) = ∏_{i=1}^{n} (pe^t + q) = (pe^t + q)ⁿ,

the moment generating function of a binomial random variable; hence Y = X₁ + ⋯ + X_n has a binomial
distribution with parameters n and p.
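A quick empirical check of Example 19 (a sketch; the parameter values n = 12 and p = 0.3 are arbitrary choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, p = 12, 0.3
y = rng.binomial(1, p, size=(200_000, n)).sum(axis=1)   # sum of n Bernoulli(p) variables

emp = np.bincount(y, minlength=n + 1) / y.size           # empirical pmf of the sum
print(np.max(np.abs(emp - stats.binom.pmf(np.arange(n + 1), n, p))))   # small
```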

Example 20: Suppose that X₁, …, X_n are independent Poisson distributed random variables,
Xᵢ having parameter λᵢ. Then

m_{Xᵢ}(t) = exp[λᵢ(e^t − 1)],

and hence

m_Y(t) = ∏_{i=1}^{n} exp[λᵢ(e^t − 1)] = exp[(λ₁ + ⋯ + λ_n)(e^t − 1)],

which is again the moment generating function of a Poisson distributed random variable, having
parameter λ₁ + ⋯ + λ_n.

Example 21: Assume that X₁, …, X_n are independent and identically distributed exponential
random variables with parameter λ; then m_{Xᵢ}(t) = λ/(λ − t) for t < λ.

So m_Y(t) = ∏_{i=1}^{n} λ/(λ − t) = [λ/(λ − t)]ⁿ,

which is the moment generating function of a gamma distribution with parameters n and λ;
hence,

f_Y(y) = [λⁿ/Γ(n)] y^{n−1} e^{−λy} I_(0,∞)(y),

the density of a gamma distribution with parameters n and λ.

Example 22: Assume that X₁, …, X_n are independent random variables with

Xᵢ ~ N(μᵢ, σᵢ²); then m_{Xᵢ}(t) = exp(μᵢt + σᵢ²t²/2), and Y = X₁ + ⋯ + X_n.

Hence m_Y(t) = ∏_{i=1}^{n} exp(μᵢt + σᵢ²t²/2) = exp[(Σμᵢ)t + (Σσᵢ²)t²/2],

which is the moment generating function of a normal random variable; so Y ~ N(Σμᵢ, Σσᵢ²).

2.5.4 Transformations
a) Distribution of Y = g(X)

A random variable X may be transformed by some function g(·) to define a new random

variable Y = g(X). The density of Y, f_Y(y), will be determined by the transformation together with
the density of X. First, if X is a discrete random variable with mass points x₁, x₂, …, then

the distribution of Y = g(X) is determined directly by the laws of probability. If X takes on the

values x₁, x₂, … with probabilities f_X(x₁), f_X(x₂), …, then the probability that Y takes on a
given value, say yⱼ, is

f_Y(yⱼ) = P[Y = yⱼ] = Σ f_X(xᵢ), where the summation is over those xᵢ for which g(xᵢ) = yⱼ.

Example 23: Suppose X takes on the values 0, 1, 2, 3, 4, and 5 with probabilities

f_X(0), f_X(1), …, f_X(5). If Y = g(X), note that Y can take on the
values g(0), g(1), …, g(5); then f_Y at each of these values is found by summing the probabilities of the
x values that g maps there.

Second, if X is a continuous random variable, then the cumulative distribution function of

Y = g(X) can be found by integrating f_X over the appropriate region; that is,

F_Y(y) = P[g(X) ≤ y] = ∫_{{x: g(x) ≤ y}} f_X(x) dx.

This is just the cumulative distribution function technique.

Example 24: Let X be random variable with uniform distribution over the interval (0,1) and let
. The density of Y is desired.

Now for 0<y<1;

so and therefore .

Application of the cumulative distribution function technique to find the density of Y = g(X), as

in the above example, produces the transformation technique, the result of which is given in the
following theorem.

Theorem 21: Suppose X is a continuous random variable with probability density function
f_X(x). Set 𝒳 = {x: f_X(x) > 0}. Assume that:
i) y = g(x) defines a one-to-one transformation of 𝒳 onto 𝒴.
ii) The derivative of x = g⁻¹(y) with respect to y is continuous and nonzero for y ∈ 𝒴,
where g⁻¹(y) is the inverse function of g(x); that is, g⁻¹(y) is that x for which g(x) = y.

Then Y = g(X) is a continuous random variable with density

f_Y(y) = |d g⁻¹(y)/dy| f_X(g⁻¹(y)) I_𝒴(y).

Proof: The above is a standard theorem from calculus on the change of variable in a definite
integral, so we will only sketch the proof. Consider the case that 𝒳 is an interval. Let us suppose that
g(·) is a monotone increasing function over 𝒳; that is, g(x₁) < g(x₂) whenever x₁ < x₂, which is true if and only if

d g⁻¹(y)/dy > 0. For y ∈ 𝒴,

F_Y(y) = P[g(X) ≤ y] = P[X ≤ g⁻¹(y)] = F_X(g⁻¹(y)),

and hence f_Y(y) = f_X(g⁻¹(y)) d g⁻¹(y)/dy by the chain rule of differentiation. On the

other hand, if g(·) is a monotone decreasing function over 𝒳, so that F_Y(y) = P[X ≥ g⁻¹(y)] = 1 − F_X(g⁻¹(y)) and

d g⁻¹(y)/dy < 0, then f_Y(y) = −f_X(g⁻¹(y)) d g⁻¹(y)/dy, and therefore in either case f_Y(y) = f_X(g⁻¹(y)) |d g⁻¹(y)/dy|.

Example 25: Suppose X has a beta distribution with parameters a and b. What is the distribution of
Y = −ln X?

y = −ln x defines a one-to-one transformation of 𝒳 = {x: 0 < x < 1} onto

𝒴 = {y: y > 0}. x = g⁻¹(y) = e^{−y}, so d g⁻¹(y)/dy = −e^{−y}, which is continuous and nonzero for y ∈ 𝒴.

By Theorem 21,

f_Y(y) = |−e^{−y}| f_X(e^{−y}) = [1/B(a, b)] e^{−ay} (1 − e^{−y})^{b−1} I_(0,∞)(y).

In particular, if b = 1, then B(a, b) = 1/a; so f_Y(y) = a e^{−ay} I_(0,∞)(y), an exponential distribution

with parameter a.
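The conclusion of Example 25 for b = 1 is easy to check by simulation; the sketch below (mine, with a = 3 chosen arbitrarily) transforms beta(a, 1) draws and tests them against an exponential distribution with rate a.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
a = 3.0
y = -np.log(rng.beta(a, 1, size=200_000))                 # Y = -ln X with X ~ Beta(a, 1)

# Exponential with parameter a corresponds to scipy's expon with scale = 1/a.
# The KS p-value should not be near zero if the derivation is correct.
print(stats.kstest(y, 'expon', args=(0, 1 / a)).pvalue)
```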

Example 25: Suppose X has the Pareto density and the distribution of

is desired.


The condition that g be a one-to-one transformation of 𝒳 onto 𝒴 is unnecessarily restrictive.

For the transformation Y = g(X), each point in 𝒳 will correspond to just one point in 𝒴; but to a
point in 𝒴 there may correspond more than one point in 𝒳, which says that the transformation is
not one-to-one, and consequently Theorem 21 is not directly applicable. If, however, 𝒳 can be
decomposed into a finite (or even countable) number of disjoint sets, say 𝒳₁, 𝒳₂, …, 𝒳_m, so that
y = g(x) defines a one-to-one transformation of each 𝒳ᵢ onto 𝒴ᵢ, then the density of
Y = g(X) can be found. Let gᵢ⁻¹(y) denote the inverse of g over 𝒳ᵢ for i = 1, …, m. Then the
density of Y = g(X) is given by

f_Y(y) = Σ_{i=1}^{m} |d gᵢ⁻¹(y)/dy| f_X(gᵢ⁻¹(y)) I_{𝒴ᵢ}(y).   (28)

Example 26: Let X be a continuous random variable with density f_X, and let Y = X².
Note that if 𝒳 is an interval containing both negative and positive points, then y = x² is not
one-to-one. However, if 𝒳 is decomposed into
𝒳₁ = {x: x < 0} and 𝒳₂ = {x: x ≥ 0}, then y = x² defines a one-to-one transformation on each 𝒳ᵢ. Note that
g₁⁻¹(y) = −√y and g₂⁻¹(y) = √y. By (28),

f_Y(y) = (1/(2√y)) [f_X(−√y) + f_X(√y)] I_(0,∞)(y).
In particular, if ,

Then ; or if

Then

b) Discrete random variables

Suppose that the discrete density function f_{X₁,…,X_n} of the n-dimensional discrete
random variable (X₁, …, X_n) is given. Let x⁽¹⁾, x⁽²⁾, … denote the mass points of (X₁, …, X_n); that is,
the points at which f_{X₁,…,X_n} is positive.

Suppose that the joint density of Y₁ = g₁(X₁, …, X_n), …, Y_k = g_k(X₁, …, X_n) is desired. It can be

observed that Y₁, …, Y_k are jointly discrete and

f_{Y₁,…,Y_k}(y₁, …, y_k) = Σ f_{X₁,…,X_n}(x₁, …, x_n), where the summation is over those mass points (x₁, …, x_n) for which

(g₁(x₁, …, x_n), …, g_k(x₁, …, x_n)) = (y₁, …, y_k).

Example 27: Let (X₁, X₂, X₃) have the joint discrete density function given by

  (x₁, x₂, x₃):   (0,0,0)  (0,0,1)  (0,1,1)  (1,0,1)  (1,1,0)  (1,1,1)
  f(x₁, x₂, x₃):    1/8      3/8      1/8      1/8      1/8      1/8

Find the joint density of Y₁ and Y₂.

And
.

c) Continuous random variables

Suppose now that we are given the joint probability density function f_{X₁,…,X_n} of the n-
dimensional continuous random variable (X₁, …, X_n). Let 𝒳 = {(x₁, …, x_n): f_{X₁,…,X_n}(x₁, …, x_n) > 0}.

Again assume that the joint density of the random variables

Y₁ = g₁(X₁, …, X_n), …, Y_k = g_k(X₁, …, X_n) is desired, where k is some integer satisfying 1 ≤ k ≤ n. If k < n, we will introduce
additional, new random variables Y_{k+1} = g_{k+1}(X₁, …, X_n), …, Y_n = g_n(X₁, …, X_n) for
judiciously selected functions g_{k+1}, …, g_n; then we will find the joint distribution of
Y₁, …, Y_n.

This device of possibly introducing additional random variables makes the transformation
y₁ = g₁(x₁, …, x_n), …, y_n = g_n(x₁, …, x_n) a transformation from an n-dimensional space to an n-
dimensional space. Henceforth we will assume that we are seeking the joint distribution of
Y₁, …, Y_n (rather than the joint distribution of Y₁, …, Y_k) when
we are given the joint probability density of X₁, …, X_n.

We will state our results first for n = 2 and later generalize to n > 2. Let f_{X₁,X₂}(x₁, x₂) be given. Set
𝒳 = {(x₁, x₂): f_{X₁,X₂}(x₁, x₂) > 0}. We want to find the joint distribution of Y₁ = g₁(X₁, X₂) and
Y₂ = g₂(X₁, X₂) for known functions g₁(·,·) and g₂(·,·). Now suppose that y₁ = g₁(x₁, x₂) and
y₂ = g₂(x₁, x₂) define a one-to-one transformation which maps 𝒳 onto 𝒴. x₁ and x₂ can then be
expressed in terms of y₁ and y₂; so we can write, say, x₁ = g₁⁻¹(y₁, y₂) and x₂ = g₂⁻¹(y₁, y₂).
Note that 𝒳 is a subset of the x₁x₂ plane and 𝒴 is a subset of the y₁y₂ plane. The determinant

J = det [ ∂x₁/∂y₁  ∂x₁/∂y₂ ; ∂x₂/∂y₁  ∂x₂/∂y₂ ]

will be called the Jacobian of the transformation and will be denoted by J. The above discussion
permits us to state Theorem 22.

Theorem 22: Let X₁ and X₂ be jointly continuous random variables with density function
f_{X₁,X₂}(x₁, x₂). Set 𝒳 = {(x₁, x₂): f_{X₁,X₂}(x₁, x₂) > 0}. Assume that:
(i) y₁ = g₁(x₁, x₂), y₂ = g₂(x₁, x₂) defines a one-to-one transformation of
𝒳 onto 𝒴.
(ii) The first partial derivatives of x₁ = g₁⁻¹(y₁, y₂) and x₂ = g₂⁻¹(y₁, y₂) are
continuous over 𝒴.
(iii) The Jacobian of the transformation is nonzero for (y₁, y₂) ∈ 𝒴.
Then the joint density of Y₁ = g₁(X₁, X₂) and Y₂ = g₂(X₁, X₂) is given by

f_{Y₁,Y₂}(y₁, y₂) = |J| f_{X₁,X₂}(g₁⁻¹(y₁, y₂), g₂⁻¹(y₁, y₂)) I_𝒴((y₁, y₂)).

Example 28: Suppose that X₁ and X₂ are independent random variables, each uniformly
distributed over the interval (0,1). Then f_{X₁,X₂}(x₁, x₂) = I_(0,1)(x₁) I_(0,1)(x₂). Let


; then , and

Example 29: Let X1 and X2 be two independent standard normal random variables. Let
. Then .

To find the marginal distribution of, say, Y₂, the ratio of the two variables, we must integrate out y₁; that is,

f_{Y₂}(y₂) = ∫_{-∞}^{∞} f_{Y₁,Y₂}(y₁, y₂) dy₁ = 1/[π(1 + y₂²)],

a Cauchy density. That is, the ratio of two independent standard normal random variables has a
Cauchy distribution.
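A final simulation sketch (my own, not part of the original text) of Example 29: the ratio of two independent standard normal samples should be indistinguishable from a Cauchy(0, 1) sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x1, x2 = rng.standard_normal((2, 200_000))
ratio = x1 / x2

# KS test against the standard Cauchy; the p-value should not be near zero.
print(stats.kstest(ratio, 'cauchy').pvalue)
```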
