Statistics and Applications
Contents

1 Definition and basic properties
  1.1 Events
  1.2 Frequencies
  1.3 Definition of probability
  1.4 Direct consequences
  1.5 Some inequalities
8 Combinatorics
12 Distribution functions
15 Statistical testing
  15.1 Looking up probabilities for the standard normal in a table
  15.2 Two sample testing
16 Statistical estimation
  16.1 An example
  16.2 Estimation of variance and standard deviation
  16.3 Maximum Likelihood estimation
  16.4 Estimation of parameter for geometric random variables
17 Linear Regression
  17.1 The case where the exact linear model is known
  17.2 When α and β are not known
  17.3 Where the formula for the estimates of α and β come from
  17.4 Expectation and variance of β̂
  17.5 How precise are our estimates
  17.6 Multiple factors and/or polynomial regression
  17.7 Other applications
1 Definition and basic properties

1.1 Events
Imagine that we throw a die which has 4 sides. The outcome of this experiment will be
one of the four numbers: 1, 2, 3 or 4. The set of all possible outcomes in this case is:
Ω = {1, 2, 3, 4}.
Ω is called the outcome space or sample space. Before doing the experiment we don't know
what the outcome will be. Each possible outcome has a certain probability to occur.
Let B be the event that inflation is higher in a year from now. This corresponds to the
outcomes 01 and 11. We thus view the event B as the set:
B = {01, 11}.
Recall that the intersection A ∩ B of two sets A and B is the set consisting of all elements
contained in both A and B. In our example, the intersection of A and B is equal to
A ∩ B = {11}. Let C designate the event that unemployment goes up and that inflation
goes up at the same time. This corresponds to the outcome 11. Thus, C is identified with
the set {11}. In other words, C = A ∩ B. The general rule which we must remember is:
For any events A and B, if C designates the event that A and B both occur at
the same time, then C = A ∩ B.
Let D be the event that unemployment or inflation will be up in a year from now. (By "or"
we mean that at least one of them is up.) This corresponds to the outcomes 01, 10, 11.
Thus D gets identified with the set:
D = {01, 10, 11}.
Recall that the union A ∪ B of two sets A and B is defined to be the set consisting of all
elements which are in A or in B. We see in our example that D = A ∪ B. This is true in
general. We must thus remember the following rule:
For any events A and B, if D designates the event that A or B occurs, then
D = A ∪ B.
1.2 Frequencies
Assume that we have a six-sided die. In this case the outcome space is
Ω = {1, 2, 3, 4, 5, 6}.
The event "even" in this case is the set
{2, 4, 6}
whilst the event "odd" is equal to
{1, 3, 5}.
Instead of throwing the die only once, we throw it several times. As a result, instead of
just a number, we get a sequence of numbers. When throwing the six-sided die I obtained
the sequence:
1, 4, 3, 5, 2, 6, 3, 4, 5, 3, . . .
When repeating the same experiment which consists in throwing the die a couple of times,
we are likely to obtain another sequence. The sequence we observe is a random sequence.
In this example we observe one 3 within the first 5 trials and three 3s within the
first 10 trials. We write
n3
for the number of times we observe a 3 among the first n trials. In our example thus: for
n = 5 we have n3 = 1 whilst for n = 10 we find n3 = 3.
Let A be an event. We denote by nA the number of times A occurred up to time n.
Take for example A to be the event even. In the above sequence within the first 5 trials
we obtained 2 even numbers. Thus for n = 5 we have that nA = 2. Within the first 10
trials we found 4 even numbers. Thus, for n = 10 we have nA = 4. The proportion of
even numbers nA /n for the first 5 trials is equal to 2/5 = 40%. For the first 10 trials, this
proportion is 4/10 = 40%.
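The counting above is easy to mechanize. The following is a minimal sketch (the helper name `proportion` is ours, not from the text) that computes nA and the proportion nA/n for the event "even" on the ten throws listed above.

```python
# Sketch: computing the proportion n_A / n for the event "even"
# on the observed die sequence from the text.

def proportion(trials, event):
    """Return n_A / n, where n_A counts trials whose outcome lies in `event`."""
    n_A = sum(1 for outcome in trials if outcome in event)
    return n_A / len(trials)

sequence = [1, 4, 3, 5, 2, 6, 3, 4, 5, 3]
even = {2, 4, 6}

print(proportion(sequence[:5], even))   # 2 even numbers in 5 trials -> 0.4
print(proportion(sequence, even))       # 4 even numbers in 10 trials -> 0.4
```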
1.3 Definition of probability
The basic definition of probability which we use is based on frequencies. For our definition
of probability we need an assumption about the world surrounding us:
Let A designate an event. When we repeat the same random experiment independently
many times, we observe that in the long run the proportion of times A occurs tends to
stabilize. Whenever we repeat this experiment, the proportion nA/n in the long run
tends to be the same number. A more mathematical way of formulating this is to say
that nA/n converges to a number only depending on A, as n tends to infinity. This is our
basic assumption.
Assumption: As we keep repeating the same random experiment under the same conditions,
with each trial independent of the previous ones, we find that the proportion nA/n tends
to a number which only depends on A, as n → ∞.
We are now ready to give our definition of probability:
Definition 1.1 Let A be an event. Assume that we repeat the same random experiment
under exactly the same conditions independently many times. Let nA designate the number
of times the event A occurred within the n first repeats of the experiment. We define the
probability of the event A to be the real number:
P(A) := lim_{n→∞} nA/n.
Thus, P(A) designates the probability of the event A. Take for example a four-sided,
perfectly symmetric die. Because of symmetry, each side must have the same probability.
In the long run we will see a fourth of the times a 1, a fourth of the times a 2, a fourth
of the times a 3 and a fourth of the times a 4. Thus, for the symmetric die the probability
of each side is 0.25.
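The stabilization of nA/n can be checked empirically. A minimal simulation sketch (the seed and sample sizes are our choices, not from the text): we simulate a fair four-sided die and watch the proportion of 1s approach 0.25 as n grows.

```python
import random

# Sketch: empirical check that n_A / n stabilizes. We simulate a fair
# four-sided die and track the proportion of 1s; by the frequency
# definition of probability it should approach P({1}) = 0.25.

random.seed(0)

def running_proportion(n_throws, face=1):
    count = 0
    for _ in range(n_throws):
        if random.randint(1, 4) == face:
            count += 1
    return count / n_throws

for n in (100, 10_000, 200_000):
    print(n, running_proportion(n))
```

With more throws the printed proportion fluctuates less and less around 0.25, which is exactly the assumption the definition of probability rests on.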
1.4 Direct consequences
From our definition of probability there are several useful facts, which follow immediately:
1. For any event A, we have that:
P(A) ≥ 0.
2. For any event A, we have that:
P(A) ≤ 1.
3. Let Ω designate the state space. Then:
P(Ω) = 1.
Let us prove these elementary facts:
1. By definition nA/n ≥ 0. However, the limit of a sequence which is ≥ 0 is also ≥ 0.
Since P(A) is by definition equal to the limit of the sequence nA/n, we find that
P(A) ≥ 0.
2. By definition nA ≤ n. It follows that nA/n ≤ 1. The limit of a sequence which
is always less or equal to one must also be less or equal to one. Thus, P(A) =
lim_{n→∞} nA/n ≤ 1.
3. By definition nΩ = n. Thus:
P(Ω) = lim_{n→∞} nΩ/n = lim_{n→∞} n/n = lim_{n→∞} 1 = 1.
The next two theorems are essential for solving many problems:
Theorem 1.1 Let A and B be disjoint events. Then:
P(A ∪ B) = P(A) + P(B).
Proof. Let C be the event C = A ∪ B. C is the event that A or B has occurred. Because
A and B are disjoint, A and B can not occur at the same time. Thus, when we count up
to time n how many times C has occurred, we find that this is exactly equal to the number
of times A has occurred plus the number of times B has occurred. In other words,
nC = nA + nB.    (1.1)
From this it follows that:
P(C) = lim_{n→∞} nC/n = lim_{n→∞} (nA + nB)/n = lim_{n→∞} (nA/n + nB/n).
We know that the sum of limits is equal to the limit of the sum. Applying this to the
right side of the last equality above yields:
lim_{n→∞} (nA/n + nB/n) = lim_{n→∞} nA/n + lim_{n→∞} nB/n = P(A) + P(B).
Let us give an example which might help us to understand why equation 1.1 holds.
Imagine we are using a 6-sided die. Let A be the event that we observe a 2 or a 3. Thus
A = {2, 3}. Let B be the event that we observe a 1 or a 5. Thus, B = {1, 5}. The two
events A and B are disjoint: it is not possible to observe at the same time A and B since
A B = . Assume that we throw the die 10 times and obtain the sequence of numbers:
1, 3, 4, 6, 3, 4, 2, 5, 1, 2.
We have seen the event A four times: at the second, fifth, seventh and tenth trial. The
event B is observed at the first, eighth and ninth trials. C = A ∪ B = {2, 3, 1, 5}
is observed at the trials number 2, 5, 7, 10 and 1, 8, 9. We thus find in this case that
nA = 4, nB = 3 and nC = 7, which confirms equation 1.1.
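The count check above can be sketched in a few lines; the counting relies only on A ∩ B = ∅, so every occurrence of C is an occurrence of exactly one of A or B.

```python
# Sketch: verifying equation 1.1 (n_C = n_A + n_B for disjoint A, B)
# on the ten throws listed in the text.

throws = [1, 3, 4, 6, 3, 4, 2, 5, 1, 2]
A = {2, 3}
B = {1, 5}
C = A | B          # A and B are disjoint, so C = A ∪ B = {1, 2, 3, 5}

n_A = sum(x in A for x in throws)
n_B = sum(x in B for x in throws)
n_C = sum(x in C for x in throws)

print(n_A, n_B, n_C)      # 4 3 7
assert n_C == n_A + n_B   # equation 1.1 holds because A ∩ B is empty
```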
Example 1.2 Assume that we are flipping a fair coin with sides 0 and 1. Let Xi designate
the number which we obtain when we flip the coin for the i-th time. Let A be the
event that we observe right at the beginning the number 111. In other words:
A = {X1 = 1, X2 = 1, X3 = 1}.
Let B designate the event that we observe the number 101 when we read our random
sequence starting from the second trial. Thus:
B = {X2 = 1, X3 = 0, X4 = 1}.
Assume that we want to calculate the probability to observe that at least one of the two
events A or B holds. In other words we want to calculate the probability of the event
C = A B.
Note that A and B can not both occur at the same time. The reason is that for A to hold
it is necessary that X3 = 1, and for B to hold it is necessary that X3 = 0. X3, however,
can not be equal to 0 and to 1 at the same time. Thus, A and B are disjoint events, so
we are allowed to use theorem 1.1. We find, applying theorem 1.1, that:
P (A B) = P (A) + P (B).
With a fair coin, each 3-digit number has the same probability. There are 8 such 3-digit
numbers, so each one has probability 1/8. It follows that P(A) = 1/8 and P(B) = 1/8.
Thus
P(A ∪ B) = 1/8 + 1/8 = 1/4 = 25%.
The next theorem is useful for any pair of events A and B and not just disjoint events:
Theorem 1.2 Let A and B be two events. Then:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. Let C = A ∪ B. Let D = B − A, that is, D consists of all the elements that are in
B, but not in A. We have by definition that C = D ∪ A and that D and A are disjoint.
Thus we can apply theorem 1.1 and find:
P(C) = P(A) + P(D).    (1.2)
On the other hand, B = D ∪ (A ∩ B), where D and A ∩ B are disjoint, so theorem 1.1
also gives:
P(B) = P(D) + P(A ∩ B).    (1.3)
Solving equation 1.3 for P(D) and plugging the result into equation 1.2 yields:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).    (1.4)
A similar formula holds for three events A, B and C. Write A ∪ B ∪ C = A ∪ D, where
D = B ∪ C. By theorem 1.2, the probability of this union is equal to:
P(A ∪ D) = P(A) + P(D) − P(A ∩ D) = P(A) + P(B ∪ C) − P(A ∩ (B ∪ C)).    (1.5)
By theorem 1.2 again,
P(B ∪ C) = P(B) + P(C) − P(B ∩ C).    (1.6)
Furthermore, A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C). But this is the union of two events,
and hence theorem 1.2 applies:
P((A∩B) ∪ (A∩C)) = P(A∩B) + P(A∩C) − P((A∩B) ∩ (A∩C)) = P(A∩B) + P(A∩C) − P(A∩B∩C).    (1.7)
Combining now equations 1.4, 1.5, 1.6 with 1.7, we find:
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A∩B) − P(A∩C) − P(B∩C) + P(A∩B∩C).
Often it is easier to calculate the probability of a complement than the probability of the
event itself. In such a situation, the following theorem is useful:
Theorem 1.4 Let A be an event and let A^c denote its complement. Then:
P(A) = 1 − P(A^c).
Proof. Note that the events A and A^c are disjoint. Furthermore, by definition, A ∪ A^c = Ω.
Recall that for the sample space Ω we have that P(Ω) = 1. We can thus apply theorem
1.1 and find that:
1 = P(Ω) = P(A ∪ A^c) = P(A) + P(A^c).
This implies that:
P(A) = 1 − P(A^c),
which finishes this proof.
1.5 Some inequalities
If A ⊂ B, then every occurrence of A is also an occurrence of B, so nA ≤ nB and hence
nA/n ≤ nB/n.
Taking limits on both sides, we get
lim_{n→∞} nA/n ≤ lim_{n→∞} nB/n.
Hence
P(A) ≤ P(B).
Imagine the following situation: in a population there are two illnesses, a and b. We
assume that 20% suffer from b, 15% suffer from a, whilst 10% suffer from both. Let A be
the event that a person suffers from a and let B be the event that a person suffers from b.
If a patient comes to a doctor and says that he suffers from illness b, how likely is he to
have illness a also? (We assume that the patient has been tested for b but not yet tested
for a.) We note that half the population group suffering from b also suffers from a. Hence,
when the doctor meets such a patient suffering from b, there is a chance of 1 out of 2
that the person also suffers from a. This is called the conditional probability of A given
B and is denoted by P(A|B). The formula we used is 10%/20% = P(A ∩ B)/P(B).
Definition 2.1 Let A, B be two events. Then we define the probability of A conditional
on the event B, and write P(A|B), for the number:
P(A|B) := P(A ∩ B)/P(B).
Definition 2.2 Let A, B be two events. We say that A and B are independent of each
other iff
P(A ∩ B) = P(A) · P(B).
Note that A and B are independent of each other if and only if P(A|B) = P(A). In other
words, A and B are independent of each other if and only if the realization of one of the
events does not affect the conditional probability of the other.
Assume that we perform two random experiments independently of each other, in the
sense that the two experiments do not interact. That is, the experiments have no influence
on each other. Let A denote an event related to the first experiment, and let B denote
an event related to the second experiment. We saw in class that in this situation the
equation P(A ∩ B) = P(A) · P(B) must hold. And thus, A and B are independent in the
sense of the above definition. To show this we used an argument where we simulated the
two random experiments by picking marbles from two bags.
There are also many cases, where events related to a same experiment are independent,
in the sense of the above definition. For example for a fair die, the events A = {1, 2} and
B = {2, 4, 6} are independent.
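The die example can be checked by exact enumeration. A small sketch using exact fractions (so no rounding hides the equality):

```python
from fractions import Fraction

# Sketch: checking that A = {1, 2} and B = {2, 4, 6} are independent
# for a fair six-sided die, by exact enumeration of outcomes.

omega = {1, 2, 3, 4, 5, 6}
A = {1, 2}
B = {2, 4, 6}

def prob(event):
    """Each outcome is equally likely, so P(E) = |E ∩ Ω| / |Ω|."""
    return Fraction(len(event & omega), len(omega))

print(prob(A))        # 1/3
print(prob(B))        # 1/2
print(prob(A & B))    # 1/6
print(prob(A & B) == prob(A) * prob(B))   # True -> A and B are independent
```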
There can also be more than two independent events at a time:
Definition 2.3 Let A1, A2, . . . , An be a finite collection of events. We say that A1, A2, . . . , An
are all independent of each other iff
P(∩_{i∈I} Ai) = ∏_{i∈I} P(Ai)
for every subset I ⊂ {1, 2, . . . , n}.
The next example is very important for the test on Wednesday.
Example 2.1 Assume we flip the same coin independently three times. Let the coin be
biased, so that side 1 has probability 60% and side 0 has probability 40%. What is the
probability to observe the number 101? (By this we mean: what is the probability to first
get a 1, then a 0 and eventually, at the third trial, a 1 again?)
To solve this problem let A1, resp. A3, be the event that at the first, resp. third, trial we
get a one. Let A2 be the event that at the second trial we get a zero. Observing a 101
is thus equal to the event A := A1 ∩ A2 ∩ A3. Because the trials are performed in an
independent manner, it follows that the events A1, A2, A3 are independent of each other.
Thus we have that:
P(A1 ∩ A2 ∩ A3) = P(A1) · P(A2) · P(A3).
We have that:
P(A1) = 60%, P(A2) = 40%, P(A3) = 60%.
It follows that:
P(A1 ∩ A2 ∩ A3) = 60% · 40% · 60% = 0.144.
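Example 2.1 can also be checked by simulation. A minimal sketch (the seed and sample size are our choices): we flip the biased coin in triples and count how often the pattern 1, 0, 1 appears.

```python
import random

# Sketch: simulating the biased coin (P(1) = 0.6, P(0) = 0.4) to check
# that the probability of the pattern 1, 0, 1 in three independent
# flips is 0.6 * 0.4 * 0.6 = 0.144.

random.seed(1)

def flip():
    return 1 if random.random() < 0.6 else 0

n = 200_000
hits = sum(1 for _ in range(n) if (flip(), flip(), flip()) == (1, 0, 1))
print(hits / n)   # close to 0.144
```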
2.1 Law of total probability
Furthermore, if B and B^c both have probabilities that are not equal to zero, then
P(A) = P(A|B) · P(B) + P(A|B^c) · P(B^c).    (2.2)
Indeed, by the definition of conditional probability, the right side above is equal to
(P(A ∩ B)/P(B)) · P(B) + (P(A ∩ B^c)/P(B^c)) · P(B^c) = P(A ∩ B) + P(A ∩ B^c) = P(A).
Consider a person drawn at random from this town. Each person is equally likely to be
drawn. Let W be the event that the person is a woman and let B be the event that the
person is blond. The law of total probability can be written as
P (B) = P (B|W )P (W ) + P (B|W c )P (W c ).
(2.3)
In our case, the conditional probability of blond given woman is P(B|W) = 0.9.
On the other hand, W^c is the event to draw a man, and P(B|W^c) is the conditional
probability to be blond given that the person is a man. In our case, P(B|W^c) = 0.2.
So, when we put the numerical values into equation 2.3, we find
P (B) = 0.9P (W ) + 0.2P (W c ).
(2.4)
Here P(W) is the probability that the chosen person is a woman. This is then the
percentage of women in this population. Similarly, P(W^c) is the proportion of men. In
other words, equation 2.4 can be read as follows: the total proportion of blonds in the
population is the weighted average of the proportions of blonds among the female and
the male populations.
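Equation 2.4 is a one-line computation once P(W) is known. The surviving text does not state P(W), so the 50/50 split below is an assumption made purely to illustrate the weighted average:

```python
# Sketch of equation 2.4. The text gives P(B|W) = 0.9 and P(B|W^c) = 0.2,
# but not P(W); the value below is an assumption for illustration only.

p_blond_given_woman = 0.9   # P(B|W), from the example
p_blond_given_man = 0.2     # P(B|W^c), from the example
p_woman = 0.5               # assumed, not stated in the text

p_blond = (p_blond_given_woman * p_woman
           + p_blond_given_man * (1 - p_woman))
print(round(p_blond, 2))    # weighted average of 0.9 and 0.2 -> 0.55
```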
2.2
Bayes rule
Bayes rule is useful when one would like to calculate a conditional probability of A given
B, but one is given the opposite, that is the probability of B given A. Let us next state
Bayes rule:
Lemma 2.2 Let A and B be events both having non-zero probabilities. Then
P(A|B) = P(B|A) · P(A) / P(B).    (2.5)
We have that P(C) = P(C|M)P(M) + P(C|M^c)P(M^c), which we can plug into 2.6 to
find
P(M|C) = P(C|M) · P(M) / (P(C|M)P(M) + P(C|M^c)P(M^c)).
In the present numerical example, we find
P(M|C) = 0.3 · P(M) / (0.3 P(M) + 0.1 P(M^c)),
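The formula above only needs a prior P(M) to produce a number. The surviving text does not give P(M), so the prior below is an assumption chosen only to illustrate the Bayes computation:

```python
# Sketch of the Bayes formula above. The text gives P(C|M) = 0.3 and
# P(C|M^c) = 0.1; the prior P(M) below is an assumption for illustration.

p_C_given_M = 0.3        # from the text
p_C_given_not_M = 0.1    # from the text
p_M = 0.5                # assumed prior, not stated in the text

p_M_given_C = (p_C_given_M * p_M
               / (p_C_given_M * p_M + p_C_given_not_M * (1 - p_M)))
print(round(p_M_given_C, 2))   # 0.15 / 0.20 -> 0.75
```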
3 Expectation
Imagine a firm which every year makes a profit. It is not known in advance what the profit
of the firm is going to be. This means that the profit is random: we can assign to each
possible outcome a certain probability. Assume that from year to year the probabilities
for the profit of our firm do not change. Assume also that from one year to the next the
profits are independent. What is the long term average yearly profit equal to?
For this let us look at a specific model. Assume the firm could make 1, 2, 3 or 4 million
profit with the following probabilities:

x           1     2     3     4
P(X = x)    0.1   0.4   0.3   0.2
(The model here is not very realistic since there are only a few possible outcomes. We
chose it merely to be able to illustrate our point). Let Xi denote the profit in year i.
Hence, we have that X, X1 , X2 ,... are i.i.d. random variables.
To calculate the long term average yearly profit, consider the following. In 10% of the
years in the long run we get 1 million. If we take a period of n years, where n is large,
we thus find that in about 0.1n years we make 1 million. In 40% of the years we make
2 million in the long run. Hence, in a period of n years, this means that in about 0.4n
years we make 2 million. This corresponds to an amount of money equal to about 0.4n
times 2 million. Similarly, for n large, the money made during the years where we earned
3 million is about 3 · 0.3n, whilst for the years where we made 4 million we get 4 · 0.2n.
The total during this n-year period is thus about
(1 · 0.1 + 2 · 0.4 + 3 · 0.3 + 4 · 0.2) · n = (1 · P(X = 1) + 2 · P(X = 2) + 3 · P(X = 3) + 4 · P(X = 4)) · n = 2.6n.
Hence, in the long run the yearly average profit is 2.6 million. This long term average is
called expected value or expectation and is denoted by E[X]. Let us formalize this concept:
In general if X denotes the outcome of a random experiment, then we call X a random
variable.
Definition 3.1 Let us consider a random experiment with a finite number of possible
outcomes, where the state space is
Ω = {x1, x2, . . . , xs}.
(In the profit example above, we would have Ω = {1, 2, 3, 4}.) Let X denote the outcome
of this random experiment. For x ∈ Ω, let px denote the probability that the outcome of
our random experiment is x. That is:
px := P(X = x).
(In the last example above, we have for example p1 = 0.1 and p2 = 0.4.) We define the
expected value E[X]:
E[X] := ∑_{x∈Ω} x · px.
In other words, to calculate the expected value of a random variable, we simply multiply
the probabilities with the corresponding values and then take the sum over all possible
outcomes. Let us see yet another example for expectation.
Example 3.1 Let X denote the value which we obtain when we throw a fair coin with
side 0 and side 1. Then we find that:
E[X] = 0.5 · 1 + 0.5 · 0 = 0.5.
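The definition translates directly into code. A minimal sketch (the helper name `expectation` is ours) that computes ∑ x · px from a probability table, applied to the profit model and to the fair 0/1 coin:

```python
# Sketch: computing E[X] = sum over x of x * p_x from a probability table.

def expectation(table):
    """table maps outcome x -> p_x = P(X = x); probabilities must sum to 1."""
    assert abs(sum(table.values()) - 1.0) < 1e-9
    return sum(x * p for x, p in table.items())

profit = {1: 0.1, 2: 0.4, 3: 0.3, 4: 0.2}   # the firm-profit model
coin = {0: 0.5, 1: 0.5}                     # the fair 0/1 coin

print(round(expectation(profit), 2))   # 2.6
print(expectation(coin))               # 0.5
```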
When we keep repeating the same random experiment independently and under the same
conditions, in the long run the average value which we observe converges to the expectation.
This is what we saw in the firm/profit example above. Let us formalize this. This fact is
actually a theorem, which is called the Law of Large Numbers. This theorem goes as
follows:
Theorem 3.1 Assume we repeat the same random experiment under the same conditions
independently many times. Let Xi denote the (random variable) which is the outcome of
the i-th experiment. Then:
lim_{n→∞} (X1 + X2 + . . . + Xn)/n = E[X1].    (3.1)
This simply means that in the long run, the average is going to be equal to the expectation.
Proof. Let Ω denote the state space of the random variables Xi:
Ω = {x1, x2, . . . , xs}.
By regrouping the same terms together, we find:
X1 + X2 + . . . + Xn = x1 · n_{x1} + x2 · n_{x2} + . . . + xs · n_{xs}.
(Remember that n_{xi} denotes the number of times we observe the value xi in the finite
sequence X1, X2, . . . , Xn.) Thus:
lim_{n→∞} (X1 + X2 + . . . + Xn)/n = lim_{n→∞} ( x1 · n_{x1}/n + . . . + xs · n_{xs}/n ).
By definition,
P(X1 = xi) = lim_{n→∞} n_{xi}/n.
Since the limit of a sum is the sum of the limits, we find
lim_{n→∞} ( x1 · n_{x1}/n + . . . + xs · n_{xs}/n ) = x1 · lim_{n→∞} n_{x1}/n + . . . + xs · lim_{n→∞} n_{xs}/n
= x1 · P(X1 = x1) + . . . + xs · P(X1 = xs) = E[X1].
So, we can now generalize our firm profit example. Imagine for this that the profit a firm
makes every month is random. Imagine also that the earnings from month to month are
independent of each other and also have the same probabilities. In this case we can
view the sequence of monthly earnings as a sequence of repeats of the same random
experiment. Because of theorem 3.1, in the long run the average monthly income will be
equal to the expectation.
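The Law of Large Numbers is easy to watch in action. A minimal simulation sketch for the profit model (seed and sample sizes are our choices): the running average of i.i.d. draws approaches E[X] = 1·0.1 + 2·0.4 + 3·0.3 + 4·0.2 = 2.6.

```python
import random

# Sketch: the Law of Large Numbers (theorem 3.1) for the profit model.
# The running average of i.i.d. profits approaches E[X] = 2.6.

random.seed(2)

values = [1, 2, 3, 4]
probs = [0.1, 0.4, 0.3, 0.2]

def sample(n):
    """Draw n i.i.d. yearly profits from the model."""
    return random.choices(values, weights=probs, k=n)

for n in (100, 10_000, 200_000):
    xs = sample(n)
    print(n, sum(xs) / n)   # average gets closer and closer to 2.6
```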
Let us next give a few useful lemmas in connection with expectation. The first lemma deals
with the situation where we take an i.i.d. sequence of random outcomes X1, X2, X3, . . .
and multiply each one of them with a constant a. Let Yi denote the number Xi multiplied
by a: hence Yi := aXi. Then the long term average of the Xi's multiplied by a is equal to
the long term average of the Yi's. Let us state this fact in a formal way:
Lemma 3.1 Let X denote the outcome of a random experiment. (Thus X is a so-called
random variable.) Let a be a real (non-random) number. Then:
E[aX] = aE[X].
Proof. Let us repeat the same experiment independently many times. Let Xi denote
the outcome of the i-th trial. Let Yi be equal to Yi := aXi. Then, by the law of large
numbers, we have that
lim_{n→∞} (Y1 + . . . + Yn)/n = E[Y1] = E[aX1].
However:
lim_{n→∞} (Y1 + . . . + Yn)/n = lim_{n→∞} (aX1 + . . . + aXn)/n = lim_{n→∞} a · (X1 + . . . + Xn)/n
= a · lim_{n→∞} (X1 + . . . + Xn)/n = a · E[X1].
However:
lim_{n→∞} (Z1 + . . . + Zn)/n = lim_{n→∞} (X1 + Y1 + X2 + Y2 + . . . + Xn + Yn)/n
= lim_{n→∞} ((X1 + . . . + Xn) + (Y1 + . . . + Yn))/n
= lim_{n→∞} (X1 + . . . + Xn)/n + lim_{n→∞} (Y1 + . . . + Yn)/n = E[X1] + E[Y1].
This proves that E[X1 + Y1] = E[X1] + E[Y1] and finishes this proof. It is very important
to note that for the above result we do not need X and Y to be independent of each
other.
Lemma 3.3 Let X, Y denote the outcomes of two independent random experiments.
Then:
E[X · Y] = E[X] · E[Y].
Proof. We assume that X takes values in a countable set ΩX, whilst Y takes on values
from the countable set ΩY. We have that
E[XY] = ∑_{x∈ΩX, y∈ΩY} x · y · P(X = x, Y = y).    (3.2)
By independence, P(X = x, Y = y) = P(X = x) · P(Y = y), so the double sum factors:
E[XY] = ( ∑_{x∈ΩX} x · P(X = x) ) · ( ∑_{y∈ΩY} y · P(Y = y) ) = E[X] · E[Y].
In some problems we are only interested in the expectation of a random variable. For
example, consider insurance policies for mobile telephones sold by a big phone company.
Say Xi is the amount which will be paid during the coming year to the i-th customer
due to his/her phone breaking down. It seems reasonable to assume that the Xi s are
independent of each other. (We assume no phone viruses). We also assume that they all
follow the same random model. So, by the Law of Large Numbers we have that for n
large, the average is approximately equal to the expectation:
(X1 + X2 + . . . + Xn)/n ≈ E[Xi].
Hence, when n is really large, there is no risk involved for the phone company: they
know how much they will have to pay in total. On a per customer basis, they will have to
spend an amount very close to E[X1]. In other words, they only need one real number
from the probability model for the claims: that is the expectation E[Xi ]. Now, in many
other applications knowing only the expected value will not be enough: we will also need
a measure of the dispersion. This means that we will also want to know how much on
average the variables fluctuate from their long term average E[X1 ].
Let us give an example. Matzinger as a child used to walk with his mother every day on the shores
of Lake Geneva. Now, there is a place where there is a scale to measure the height of the water. So,
hydrologists measure the water level and then analyze this data. Assume that Xi denotes the water level
on a specific day in year i. (We assume that we always measure on the same day of the year, like for
example on the first of January.) For the current discussion we assume that the model does not change
over time (no global warming). We furthermore assume that from one year to the next the values are
independent. Say the random model would be given as follows:
x           4     5     6     7     8     9
P(X = x)    1/6   1/6   1/6   1/6   1/6   1/6
How much does the water level fluctuate on average from year to year? Note that the long term average,
that is, the expectation, is equal to
E[Xi] = 4 · (1/6) + 5 · (1/6) + 6 · (1/6) + 7 · (1/6) + 8 · (1/6) + 9 · (1/6) = 6.5.
Now, when the water level is 6 or 7, then we are 0.5 away from the long term average of 6.5. In such a
year i, we will say that the fluctuation fi is 0.5. In other words, we measure for each year i how far we
are from E[Xi]. This observed fluctuation in year i is then equal to
fi := |Xi − 6.5| = |Xi − E[Xi]|.
In our model, fi = 0.5 happens with a probability of 1/3, that is, in the long run, in one third of the
years. When the water level is either at 8 or 5, then we are 1.5 away from the long term average of 6.5.
This also has a probability of 1/3. Finally, with water levels of 4 or 9, we are 2.5 away from the long
term average, and again this will happen in a third of the years in the long run. So, the long term average
fluctuation, if this model holds, will always tend to be about
E[fi] = E[|Xi − E[Xi]|] = 2.5 · (1/3) + 1.5 · (1/3) + 0.5 · (1/3) = 1.5
after many years. To understand why, simply consider the fluctuations f1, f2, f3, . . .. By the Law of Large
Numbers applied to them, we get that for n large, the average fluctuation is approximately equal to its
expectation:
(f1 + f2 + . . . + fn)/n ≈ E[fi] = E[|Xi − E[Xi]|].    (4.1)
So, no matter what, after many years we will always know what the average fluctuation is approximately
equal to: the expression on the right side of 4.1,
E[|X − E[X]|],    (4.2)
is a measure of the dispersion (around the expectation) in our model. It should be obvious
why this dispersion is important: if it is small people of Geneva will be safe. If it is big,
they will often have to deal with flooding. So, in some sense, we can view the value given
in 4.2 as a measure of risk: if the dispersion is 0, then there is no risk and the random
number is not random but always equal to the fixed value E[X1 ]!
In modern statistics, however, one most often considers a number which represents the
same idea but can be slightly different from 4.2. The number we will use most often is
not the average fluctuation, but instead the square root of the average squared fluctuation.
This number is called the standard deviation of a random variable. We usually denote it
by σ, so
σX := √(E[(X − E[X])²]).
The long term average squared fluctuation of a random variable X is also called the variance,
and will be denoted by VAR[X], so that
VAR[X] := E[(X − E[X])²].
With this definition, the standard deviation is simply the square root of the variance:
σX = √(VAR[X]).
In most cases, σX and our other measure of dispersion given by E[|X − E[X]|] are of
comparable size.
Let us go back to our example. The variance is the average squared fluctuation. We thus get:
VAR[Xi] = E[fi²] = 2.5² · (1/3) + 1.5² · (1/3) + 0.5² · (1/3) ≈ 2.92.
So, we see the average fluctuation size was E[|Xi − E[Xi]|] = 1.5, whilst the standard deviation
σXi = √2.92 ≈ 1.71 is (only) about 14% bigger.
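The water-level computations above can be sketched in a few lines of exact arithmetic over the six equally likely levels:

```python
# Sketch: expectation, mean absolute fluctuation, variance and standard
# deviation for the uniform water-level model x in {4, ..., 9}, each 1/6.

levels = [4, 5, 6, 7, 8, 9]
p = 1 / 6

E = sum(x * p for x in levels)
mean_abs_dev = sum(abs(x - E) * p for x in levels)   # E[|X - E[X]|]
var = sum((x - E) ** 2 * p for x in levels)          # VAR[X]
sigma = var ** 0.5                                   # standard deviation

print(round(E, 2))              # 6.5
print(round(mean_abs_dev, 2))   # 1.5
print(round(var, 2))            # 2.92
print(round(sigma, 2))          # 1.71
```

Note how close the two dispersion measures are here: σ ≈ 1.71 versus E[|X − E[X]|] = 1.5.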
Now, the standard deviation is most often used for determining the order of magnitude of
a random imprecision. So, we don't care about knowing that number absolutely exactly:
instead we just want the order of magnitude. In other words, in most applications,
E[|Xi − E[Xi]|] and σXi are sufficiently close to each other that it does not matter which
one of the two we take! But it will turn out that the standard deviation allows for certain
calculations which the other measure of dispersion in 4.2 does not allow for. So, we will
work more often with the standard deviation than with the other.
4.1
for any random variable to be further than 2 standard deviations from its expected value
could be as much as 25% but never more:
P(|Z − E[Z]| ≥ 2σZ) ≤ 0.25.
The above inequality holds for any random variable, so it represents in some sense the
worst case. Inequality ?? will be proven in our section on Chebyshev.
For normal variables, the probability to be further than two standard deviations is much
smaller: it is about 0.05. Now, we will see in the section on the central limit theorem that
any sum of many independent random contributions is approximately normal as soon as
they follow about the same model. Now, 0.05 is much smaller than 0.25. In real life, in
many cases, one will be in between these two possibilities. This rule of thumb is extremely
useful when analyzing data and trying to get the big picture!
Now, E[X] is a constant, and constants can be taken out of the expectation. This implies
that
E[X · E[X]] = E[X] · E[X] = E[X]².    (5.2)
On the other hand, the expectation of a constant is the constant itself. Thus, since E[X]²
is a constant, we find:
E[E[X]²] = E[X]².    (5.3)
Using equations 5.2 and 5.3 with 5.1, we find
E[(X − E[X])²] = E[X²] − 2E[X]² + E[X]² = E[X²] − E[X]².
This finishes the proof that VAR[X] = E[X²] − E[X]².
Lemma 5.3 Let X and Y be the outcomes of two random experiments, which are independent of each other. Then:
V AR[X + Y ] = V AR[X] + V AR[Y ].
Proof. We have:
VAR[X + Y] = E[((X + Y) − E[X + Y])²] = E[(X + Y − E[X] − E[Y])²]
= E[((X − E[X]) + (Y − E[Y]))²]
= E[(X − E[X])² + 2(X − E[X])(Y − E[Y]) + (Y − E[Y])²]
= E[(X − E[X])²] + 2E[(X − E[X])(Y − E[Y])] + E[(Y − E[Y])²].
Since X and Y are independent, we have that (X − E[X]) is also independent from
(Y − E[Y]). Thus, we can use lemma 3.3, which says that the expectation of a product
equals the product of the expectations in case the variables are independent. We find:
E[(X − E[X])(Y − E[Y])] = E[X − E[X]] · E[Y − E[Y]].
Furthermore:
E[X − E[X]] = E[X] − E[E[X]] = E[X] − E[X] = 0.
Thus
E[(X − E[X])(Y − E[Y])] = 0.
Applying this to the above formula for VAR[X + Y], we get:
VAR[X + Y] = E[(X − E[X])²] + 2E[(X − E[X])(Y − E[Y])] + E[(Y − E[Y])²]
= E[(X − E[X])²] + E[(Y − E[Y])²] = VAR[X] + VAR[Y].
This finishes our proof.
5.1
We mentioned that most of the time, any random variable takes values no further than two
times its standard deviation from its expectation. We can apply this and our calculation
for variance to understand how insurance, hedging of investments, and even statistical
estimation work. Let X1, X2, . . . be a sequence of random variables which all follow the
same model and are independent of each other. Let Z be the sum of n such variables:
same model and are independent of each other. Let Z be the sum of n such variables:
Z = X1 + X 2 + . . . + Xn
We find that
E[Z] = E[X1 + X2 + . . . + Xn] = E[X1] + E[X2] + . . . + E[Xn] = nE[X1].
Similarly, we can use the fact that the variance of a sum of independent variables is the
sum of the variances to find:
VAR[Z] = VAR[X1 + X2 + . . . + Xn] = VAR[X1] + VAR[X2] + . . . + VAR[Xn] = nVAR[X1].
Using the last equation above with the fact that the standard deviation is the square root
of the variance, we find:
σZ = √(VAR[Z]) = √(n · VAR[X1]) = √n · σX1.
So, again, it is all based on the following two equations, which hold when the Xi's are
independent and follow the same model:
σ_{X1+...+Xn} = σX1 · √n
E[X1 + . . . + Xn] = nE[X1].
So for example with n = 1000000, we get
σZ = 1000 · σX1
whilst
E[Z] = 1000000 · E[X1],
so σZ becomes negligible compared to E[Z]. So, if we think that most of the time a
variable is within two standard deviations of its expectation, we find that Z typically lies
within 2 · 1000 · σX1 of 1000000 · E[X1]. So, compared to the order of magnitude of Z, the
fluctuation becomes almost negligible!
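The √n-versus-n effect can be observed directly. A minimal simulation sketch (the helper name `relative_spread`, the seed, and the sizes are our choices): we estimate σZ/E[Z] for sums of fair 0/1 coin flips and see it shrink like 1/√n.

```python
import random

# Sketch: for i.i.d. X_i, E[Z] grows like n while sigma_Z grows only
# like sqrt(n), so the relative fluctuation of Z = X_1 + ... + X_n
# shrinks like 1/sqrt(n). Checked empirically with fair 0/1 coin flips.

random.seed(3)

def relative_spread(n, repeats=200):
    """Estimate sigma_Z / E[Z] for Z = sum of n fair 0/1 coin flips."""
    sums = [sum(random.randint(0, 1) for _ in range(n)) for _ in range(repeats)]
    mean = sum(sums) / repeats
    sd = (sum((s - mean) ** 2 for s in sums) / repeats) ** 0.5
    return sd / mean

r100 = relative_spread(100)       # roughly 1 / sqrt(100)  = 0.1
r10k = relative_spread(10_000)    # roughly 1 / sqrt(10000) = 0.01
print(r100, r10k)
```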
Two random variables are dependent when their joint distribution is not simply the product
of their marginal distributions. But the degree of dependence can vary from strong
dependence to loose dependence. One measure of the degree of dependence of random
variables is the covariance. For random variables X and Y we define the covariance as
follows:
COV[X, Y] = E[(X − E[X])(Y − E[Y])]
Lemma:
For random variables X and Y there is also another equivalent formula for the covariance:
COV[X, Y] = E[XY] − E[X]E[Y]
Proof:
E[(X E[X])(Y E[Y ])] = E[XY Y E[X] XE[Y ] + E[X]E[Y ]]
= E[XY ] E[X]E[Y ]] E[X]E[Y ]] + E[X]E[Y ]]
= E[XY ] E[X]E[Y ]]
Lemma:
For independent random variables X and Y,
COV [X, Y ] = 0
Proof:
COV[X, Y] = E[XY] − E[X]E[Y].
For independent X and Y, E[XY] = E[X]E[Y]. Hence COV[X, Y] = 0.
Lemma:
COV [X, X] = V AR[X]
Proof:
COV[X, X] = E[X²] − E[X]² = VAR[X]
Lemma:
Assume that a is a constant and let X and Y be two random variables. Then
COV [X + a, Y ] = COV [X, Y ]
Proof:
COV[X + a, Y] = E[(X + a − E[X + a])(Y − E[Y])]
= E[YX + Ya − Y·E[X + a] − X·E[Y] − a·E[Y] + E[Y]E[X + a]]
= E[XY] + a·E[Y] − E[Y]E[X + a] − E[X]E[Y] − a·E[Y] + E[Y]E[X + a]
= E[XY] − E[X]E[Y]
= COV[X, Y]
Lemma:
Let a be a constant and let X and Y be random variables. Then
COV [aX, Y ] = aCOV [X, Y ]
Proof:
COV[aX, Y] = E[(aX − E[aX])(Y − E[Y])]
= E[aXY − Y·E[aX] − aX·E[Y] + E[aX]E[Y]]
= aE[XY] − aE[X]E[Y] − aE[X]E[Y] + aE[X]E[Y]
= aE[XY] − aE[X]E[Y]
= a(E[XY] − E[X]E[Y])
= aCOV[X, Y]
Lemma:
For any random variables X, Y and Z we have:
COV [Z + X, Y ] = COV [Z, Y ] + COV [X, Y ]
Proof:
COV[Z + X, Y] = E[(X + Z − E[X + Z])(Y − E[Y])]
= E[YX + YZ − Y·E[X + Z] − X·E[Y] − Z·E[Y] + E[X + Z]E[Y]]
= E[YX] + E[YZ] − E[Y]E[X + Z] − E[X]E[Y] − E[Z]E[Y] + E[X + Z]E[Y]
since E[A + B] = E[A] + E[B], we get
= E[YX] + E[YZ] − E[Y]E[X] − E[Y]E[Z] − E[X]E[Y] − E[Z]E[Y] + E[X]E[Y] + E[Z]E[Y]
= E[YX] + E[YZ] − E[X]E[Y] − E[Z]E[Y]
= E[YX] − E[X]E[Y] + E[YZ] − E[Z]E[Y]
= COV[X, Y] + COV[Z, Y]
Note that COV[X, Y] = COV[Y, X].
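The covariance rules proved above can also be checked numerically. In this sketch the particular distributions of X and Z, and the construction of Y, are arbitrary illustration choices; the printed differences should all be zero up to floating-point error, because the identities hold exactly even for sample covariances:

```python
import random

# Numerical sanity check of the covariance rules proved above, on simulated data.
def cov(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(50000)]
z = [rng.gauss(0, 1) for _ in range(50000)]
y = [xi + zi for xi, zi in zip(x, z)]  # y is correlated with both x and z

# COV[X+a, Y] = COV[X, Y]; COV[aX, Y] = a*COV[X, Y]; COV[Z+X, Y] = COV[Z, Y] + COV[X, Y]
print(round(cov([xi + 3 for xi in x], y) - cov(x, y), 6))
print(round(cov([2 * xi for xi in x], y) - 2 * cov(x, y), 6))
print(round(cov([zi + xi for zi, xi in zip(z, x)], y) - (cov(z, y) + cov(x, y)), 6))
```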
6.1
Correlation
Lemma 7.1 Assume that a > 0 is a constant and let X be a random variable taking on
only non-negative values, i.e. P(X ≥ 0) = 1. Then,
P(X ≥ a) ≤ E[X]/a.
Proof. To simplify the notation, we assume that the variable takes on only integer values.
The result remains valid otherwise. We have that
E[X] = 0·P(X = 0) + 1·P(X = 1) + 2·P(X = 2) + 3·P(X = 3) + . . .   (7.1)
Note that the sum on the right side of the above equation contains only non-negative
terms. If we leave out some of these terms, the value can only decrease or stay equal. We
are going to keep only the terms x·P(X = x) for x ≥ a. This way equation 7.1 becomes
E[X] ≥ xa·P(X = xa) + (xa + 1)·P(X = xa + 1) + (xa + 2)·P(X = xa + 2) + . . .   (7.2)
where xa denotes the smallest natural number which is larger or equal to a. Note that
xa + i ≥ a for any natural number i. With this we obtain that the right side of 7.2 is
larger-equal than
a·(P(X = xa) + P(X = xa + 1) + P(X = xa + 2) + . . .) = a·P(X ≥ a),
and hence
E[X] ≥ a·P(X ≥ a).
The last inequality above implies:
P(X ≥ a) ≤ E[X]/a.
The inequality given in the last lemma is called the Markov inequality. It is very useful: in
many real world situations it is difficult to estimate all the probabilities (the probability
distribution) for a random variable. However, it might be easier to estimate the expectation, since that is just one number. If we know the expectation of a random variable, we
can at least get upper bounds on the probability of being far away from the expectation.
Let us next present the Chebycheff inequality:
Lemma 7.2 If X is a random variable with expectation E[X] and variance VAR[X] and
a > 0 is a non-random number, then
P(|X − E[X]| ≥ a) ≤ VAR[X]/a².
Proof.
Note that |X − E[X]| ≥ a implies (X − E[X])² ≥ a² and vice versa. Hence,
P(|X − E[X]| ≥ a) = P((X − E[X])² ≥ a²).   (7.3)
Let Y := (X − E[X])². The variable Y is non-negative, so the Markov inequality applies to it and gives
P(Y ≥ a²) ≤ E[Y]/a² = E[(X − E[X])²]/a² = VAR[X]/a².   (7.4)
Using the last chain of inequalities above with equality 7.3 we find
P(|X − E[X]| ≥ a) ≤ VAR[X]/a².
Let us consider one more example. Assume the total expected claim at the end of next
year for an insurance company is 1'000'000$. What is the risk that the insurance company
has to pay more than 5'000'000$ as total claim at the end of next year? The answer goes
as follows: let Z be the total claim at the end of next year. By the Markov inequality, we find
P(Z ≥ 5'000'000) ≤ E[Z]/5'000'000 = 1/5 = 20%.
Hence, we know that the probability of having to pay more than five million is at most
20%. To derive this, the only information needed was the expectation of Z.
When the standard deviation is also available, one can usually get better bounds using
the Chebycheff inequality. Assume in the example above that the expected total claim is
as before, but let the standard deviation of the total claim be one million. Then we have
VAR[Z] = (1'000'000)².
Note that for Z to be above 5'000'000 we need Z − E[Z] to be above 4'000'000. Hence,
P(Z ≥ 5'000'000) = P(Z − E[Z] ≥ 4'000'000) ≤ P(|Z − E[Z]| ≥ 4'000'000).
Using Chebycheff, we get
P(|Z − 1'000'000| ≥ 4'000'000) ≤ VAR[Z]/(4'000'000)² = 1/16 = 0.0625.
It follows that the probability that the total claim is above five million is less than 6.25
percent. This is a lot less than the bound we had found using Markov's inequality.
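The two bounds of the insurance example can be reproduced in a few lines, reading the figures as an expectation of one million, a threshold of five million, and a standard deviation of one million:

```python
# The Markov and Chebycheff bounds from the insurance example, computed directly.
expectation = 1_000_000
threshold = 5_000_000
sd = 1_000_000

markov = expectation / threshold                    # P(Z >= t) <= E[Z]/t
chebyshev = sd**2 / (threshold - expectation) ** 2  # P(|Z - E[Z]| >= t - E[Z]) <= VAR[Z]/(t - E[Z])^2

print(markov, chebyshev)  # 0.2 0.0625
```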
8 Combinatorics
P(E) = (number of outcomes in E)/(total number of outcomes) = |E|/s,
where s denotes the total number of outcomes. Indeed, if X denotes the random outcome,
all the probabilities P(X = xt), t = 1, . . . , s, are equal. Thus,
P(X = x1) = 1/s.
Now if:
E = {y1, . . . , yj},
we find that:
P(E) = P(X ∈ E) = P({X = y1} ∪ . . . ∪ {X = yj}) = Σ_{i=1}^{j} P(X = yi) = j/s.
Example 8.1 Assume we first throw a coin with a side 0 and a side 1. Then we throw a
four-sided die. Eventually we throw the coin again. For example we could get the number
031. How many different numbers are there which we could get? The answer is: first we
have two possibilities. For the second choice we have four, and eventually we have again
two. Thus, m1 = 2, m2 = 4, m3 = 2. This implies that the total number of possibilities is:
m1·m2·m3 = 2·4·2 = 16.
Recall that the product of all natural numbers which are less or equal to k, is denoted by
k!. k! is called k-factorial.
Lemma 8.2 There are
k!
possibilities to put k different objects in a linear order. Thus there are k! permutations of
k elements.
To realize why the last lemma above holds we use lemma 8.1. To place k different objects
in a row we first choose the first object which we will place down. For this we have k
possibilities. For the second object, there remain k − 1 objects to choose from. For the
third, there are k − 2 possibilities to choose from. And so on and so forth. This then
gives that the total number of possibilities is equal to k·(k − 1)·. . .·2·1.
Lemma 8.3 There are:
n!/(n − k)!
possibilities to pick k out of n different objects, when the order in which we pick them
matters.
For the first object, we have n possibilities. For the second object we pick, we have n − 1
remaining objects to choose from. For the last object which we pick (that is the k-th
which we pick), we have n − k + 1 remaining objects to choose from. Thus the total
number of possibilities is equal to:
n·(n − 1)·. . .·(n − k + 1)
which is equal to:
n!/(n − k)!.
The number n!/(n − k)! is also equal to the number of words of length k written with an
n-letter alphabet, when we require that the words never contain twice the same letter.
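One can verify the count n!/(n − k)! by enumerating such words directly; the alphabet "abcde" and the values n = 5, k = 3 are arbitrary illustration choices:

```python
from itertools import permutations
from math import factorial

# Words of length k over an n-letter alphabet with no letter repeated:
# itertools.permutations enumerates exactly these words, so the count
# should match n!/(n-k)!.
n, k = 5, 3
words = list(permutations("abcde", k))
print(len(words), factorial(n) // factorial(n - k))  # 60 60
```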
Lemma 8.4 There are:
n!/(k!(n − k)!)
subsets of size k in a set of n elements.
The reason why the last lemma holds is the following: there are k! ways of putting a
given subset of size k into different orders. Thus, there are k! times more ways to pick k
elements, than there are subsets of size k.
Lemma 8.5 There are:
2n
subsets of any size in a set of size n.
The reason why the last lemma above holds is the following: we can identify the subsets
with binary vectors with n entries. For example, let n = 5. Let the set we consider be
{1, 2, 3, 4, 5}. Take the binary vector:
(1, 1, 1, 0, 0).
This vector would correspond to the subset containing the first three elements of the set,
thus to the subset:
{1, 2, 3}.
So, for every nonzero entry in the vector we pick the corresponding element in the set. It
is clear that this correspondence between subsets of a set of size n and binary vectors of
dimension n is one to one. Thus, there are as many subsets as there are binary
vectors of length n. The total number of binary vectors of dimension n however is 2^n.
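Both subset counts are easy to confirm by enumeration; the choice n = 5 is only for illustration:

```python
from itertools import combinations

# All subsets of a set of size n, grouped by size: the size-k count is
# n!/(k!(n-k)!) and the counts over all k sum to 2^n.
n = 5
counts = [len(list(combinations(range(1, n + 1), k))) for k in range(n + 1)]
print(counts)             # [1, 5, 10, 10, 5, 1]
print(sum(counts), 2**n)  # 32 32
```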
9.1
Bernoulli variable
Let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be
the probability of side 0. Let X designate the random number we obtain when we flip
this coin. Thus, with probability p the random variable X takes on the value 1 and with
probability 1 − p it takes on the value 0. The random variable X is called a Bernoulli
variable with parameter p. It is named after the famous Swiss mathematician Bernoulli.
For a Bernoulli variable X with parameter p we have:
E[X] = p,
VAR[X] = p(1 − p).
Let us show this:
E[X] = 1·p + 0·(1 − p) = p.
For the variance we find:
VAR[X] = E[X²] − (E[X])² = 1²·p + 0²·(1 − p) − (E[X])² = p − p² = p(1 − p).
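A quick numerical check of these two formulas, computing E[X] and E[X²] directly from the two-point definition of the expectation (the value p = 0.3 is an arbitrary example):

```python
# Direct check of E[X] = p and VAR[X] = p(1-p) for a Bernoulli variable:
# the expectation is a sum over the two outcomes 0 and 1, weighted by
# their probabilities.
def bernoulli_moments(p):
    e = 1 * p + 0 * (1 - p)         # E[X]
    e2 = 1**2 * p + 0**2 * (1 - p)  # E[X^2]
    return e, e2 - e**2             # (E[X], VAR[X])

e, var = bernoulli_moments(0.3)
print(e, round(var, 6))  # 0.3 0.21
```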
9.2 Binomial variable
Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be
the probability of side 0. We toss this coin independently n times and count the number
of 1s observed. The number Z of 1s observed after n coin-tosses is equal to
Z := X1 + X2 + . . . + Xn
where Xi designates the result of the i-th toss. (Hence the Xi's are independent Bernoulli
variables with parameter p.) The random variable Z is called a binomial variable with
parameters n and p. For the binomial random variable with parameter p we find:
E[Z] = np
VAR[Z] = np(1 − p)
For k ≤ n, we have:
P(Z = k) = (n choose k)·p^k·(1 − p)^{n−k}.
We can now generalize to n trials and a number k ≤ n. There are n choose k possible
outcomes for which among the first n coin tosses there appear exactly k ones. Each of
these outcomes has probability:
p^k·(1 − p)^{n−k}.
This gives then:
P(Z = k) = (n choose k)·p^k·(1 − p)^{n−k}.
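A short sketch confirming basic properties of this probability mass function (the parameters n = 10, p = 0.25 are illustrative): the probabilities over k = 0, . . . , n sum to 1, the mean computed from them equals np, and the variance equals np(1 − p):

```python
from math import comb

# P(Z = k) = C(n, k) * p^k * (1-p)^(n-k)
def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.25
probs = [binom_pmf(n, p, k) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(probs))
var = sum(k**2 * q for k, q in enumerate(probs)) - mean**2
print(round(sum(probs), 12), round(mean, 12), round(var, 12))  # 1.0 2.5 1.875
```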
9.3 Geometric variable
Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be
the probability of side 0. We toss this coin independently n many times. Let Xi designate
the result of the i-th coin-toss. Let T designate the number of trials it takes until we first
observe a 1. For example, if we have:
X1 = 0, X2 = 0, X3 = 1
we would have that T = 3. If we observe on the other hand:
X1 = 0, X2 = 1
we have that T = 2. T is a random variable. As we are going to show, we have:
For k > 0, we have P(T = k) = p(1 − p)^{k−1}.
E[T] = 1/p
VAR[T] = (1 − p)/p²
A random variable T for which P(T = k) = p(1 − p)^{k−1}, k ∈ N, is called a geometric
random variable with parameter p. Let us next prove the above statements: For T to be
equal to k we need to observe k − 1 times a zero, followed by a one. Thus:
P(T = k) = P(X1 = 0, X2 = 0, . . . , X_{k−1} = 0, X_k = 1) =
P(X1 = 0)·P(X2 = 0)·. . .·P(X_{k−1} = 0)·P(X_k = 1) = (1 − p)^{k−1}·p.
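A simulation sketch of the waiting time T agrees with these formulas; the parameter p = 0.25, the sample size and the seed are arbitrary illustration choices:

```python
import random

# Simulate the waiting time T for the first 1 when flipping a p-coin,
# and compare with E[T] = 1/p and P(T = k) = p*(1-p)^(k-1).
def sample_T(p, rng):
    k = 1
    while rng.random() >= p:  # each flip shows 1 with probability p
        k += 1
    return k

rng = random.Random(42)
p = 0.25
samples = [sample_T(p, rng) for _ in range(100000)]
print(round(sum(samples) / len(samples), 2))      # close to 1/p = 4
print(round(samples.count(3) / len(samples), 3))  # close to p*(1-p)^2 = 0.140625
```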
Let us calculate the expectation of T. We find:
E[T] = Σ_{k=1}^∞ k·p·(1 − p)^{k−1} = p·Σ_{k=1}^∞ k·(1 − p)^{k−1}.
To evaluate the last sum, consider the function
f(x) := Σ_{k=1}^∞ k·x^{k−1}.
We have that
f(x) = Σ_{k=1}^∞ d(x^k)/dx = d(Σ_{k=1}^∞ x^k)/dx = d(x/(1 − x))/dx = 1/(1 − x) + x/(1 − x)² = 1/(1 − x)².   (9.1)
Hence
Σ_{k=1}^∞ k·(1 − p)^{k−1} = f(1 − p) = 1/p².
Thus,
E[T] = p·Σ_{k=1}^∞ k·(1 − p)^{k−1} = p·(1/p²) = 1/p.
For the variance we first calculate
E[T²] = Σ_{k=1}^∞ k²·p·(1 − p)^{k−1} = p·Σ_{k=1}^∞ k²·(1 − p)^{k−1}.
To evaluate this sum, consider
g(x) := Σ_{k=1}^∞ k²·x^{k−1}.
We find:
g(x) = Σ_{k=1}^∞ d(k·x^k)/dx = d(x·Σ_{k=1}^∞ k·x^{k−1})/dx = d(x/(1 − x)²)/dx = (1 + x)/(1 − x)³.
Hence
E[T²] = p·g(1 − p) = p·(2 − p)/p³ = (2 − p)/p².
Now,
VAR[T] = E[T²] − (E[T])² = (2 − p)/p² − 1/p² = (1 − p)/p².

10
So far we have only been studying discrete random variables. Let us see how continuous
random variables are defined.
Definition 10.1 Let X be a number generated by a random experiment. (Such a random number is also called a random variable.) X is a continuous random variable if there
exists a non-negative piecewise continuous function
f : R → R+, x ↦ f(x)
such that for any interval I = [i1, i2] ⊂ R we have that:
P(X ∈ I) = ∫_I f(x) dx.
The function f (.) is called the density function of X or simply the density of X.
Note that the notation ∫_I f(x) dx stands for:
∫_I f(x) dx = ∫_{i1}^{i2} f(x) dx.   (10.1)
Recall also that integrals like the one appearing in equation 10.1 are defined to be equal
to the area under the curve f(.) and above the interval I.
Remark 10.1 Let f (.) be a piecewise continuous function from R into R. Then, there
exists a continuous random variable X such that f (.) is the density of X, if and only if
all of the following conditions are satisfied:
1. f is everywhere non-negative.
2. ∫_R f(x) dx = 1.
Let us next give some important examples of continuous random variables:
The uniform variable in the interval I = [i1, i2], where i1 < i2. The density f(.)
is equal to 1/(i2 − i1) everywhere in the interval I. Anywhere outside the interval I,
f(.) is equal to zero.
The standard normal variable has density:
f(x) := (1/√(2π))·e^{−x²/2}.
A standard normal random variable is often denoted by N (0, 1).
Let μ ∈ R, σ > 0 be given numbers. The density of the normal variable with
expectation μ and standard deviation σ is defined to be equal to:
f(x) := (1/(σ√(2π)))·e^{−(x−μ)²/(2σ²)}.
11
If X is normal, then
Z := (X − E[X])/√(VAR[X])
is a standard normal.
For the last point above note that for any random variable X (not necessarily normal) we
have that if Z = (X − E[X])/σX, then Z has expectation zero and standard deviation 1.
This is a simple, straightforward calculation:
E[Z] = E[(X − E[X])/σX] = (1/σX)·(E[X] − E[E[X]])   (11.1)
but since E[E[X]] = E[X], equality 11.1 implies that E[Z] = 0. Also
VAR[Z] = VAR[(X − E[X])/σX] = VAR[X]/σX² = 1.
Now if X is normal then we saw that Z = (X − E[X])/σX is also normal, since Z is
just obtained from X by multiplying and adding constants. But Z has expectation 0 and
standard deviation 1 and hence it is standard normal.
One can use normal variables to model financial processes and many other quantities. Let us
consider an example. Assume that a portfolio consists of three stocks. Let Xi denote the
value of stock number i in one year from now. We assume that the three stocks in the
portfolio are all independent of each other and normally distributed, so that μi = E[Xi]
and σi = √(VAR[Xi]) for i = 1, 2, 3. Let
μ1 = 100, μ2 = 110, μ3 = 120
and let
σ1 = 10, σ2 = 20, σ3 = 20.
The value of the portfolio after one year is Z = X1 + X2 + X3 and E[Z] = E[X1] +
E[X2] + E[X3] = 330. Question: What is the probability that the value of the portfolio after a
year is above 360?
Answer: We have that
VAR[Z] = VAR[X1] + VAR[X2] + VAR[X3] = 100 + 400 + 400 = 900
and hence
σZ = √(VAR[Z]) = 30.
We write
P(Z ≥ 360) = P((Z − E[Z])/σZ ≥ (360 − E[Z])/σZ).   (11.2)
Note that
(360 − E[Z])/σZ = 1
and also (Z − E[Z])/σZ is standard normal. Using this in equation 11.2, we find that the
probability that the portfolio after a year is above 360 is equal to
P(Z ≥ 360) = P(N(0, 1) ≥ 1) = 1 − Φ(1),
where Φ(1) = P(N(0, 1) ≤ 1) = 0.8413 can be found in a table for the standard normal.
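The table lookup can also be replaced by the error function from the standard library; this sketch recomputes the portfolio probability above, expressing Φ through `math.erf` (a standard identity):

```python
from math import erf, sqrt

# P(Z > 360) for Z normal with E[Z] = 330 and sigma_Z = 30 equals
# 1 - Phi((360 - 330)/30) = 1 - Phi(1).
def phi(z):  # standard normal distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, level = 330, 30, 360
print(round(1 - phi((level - mu) / sigma), 4))  # 0.1587
```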
12
Distribution functions
Taking the derivative on all sides of the above system of equations we find that:
dFX(s)/ds = fX(s).
In other words, for a continuous random variable X, the derivative of the distribution
function is equal to the density of X. Hence, in this case, the distribution function is
differentiable and thus also continuous. Another implication is: the distribution function
uniquely determines the density function of X. This implies that the distribution function
uniquely determines all the probabilities of events which can be defined in terms of X.
Assume next that the random variable X has a finite state space:
X = {s1 , s2 , . . . , sr }
such that s1 < s2 < . . . < sr . Then, the distribution function FX is a step function. Left
of s1 , we have that FX is equal to zero. Right of sr it is equal to one. Between si and
si+1, that is on the interval [si, si+1[, the distribution function is constantly equal to:
Σ_{j≤i} P(X = sj).
Thus
P(Y ≤ s) = FX(s).
This shows that the distribution function FY of Y is equal to FX. Taking the
derivative with respect to s of both FY(s) and FX(s) yields:
fY(s) = fX(s).
Hence, X and Y have the same density function. This finishes the proof.
13
Definition 13.1 Let X be a continuous random variable with density function fX (.).
Then, we define the expectation E[X] of X to be:
E[X] := ∫_R s·fX(s) ds.
Next we are going to prove that the law of large numbers also holds for continuous random
variables.
Theorem 13.1 Let X1 , X2 , . . . be a sequence of i.i.d. continuous random variables all
with the same density function fX(.). Then,
lim_{n→∞} (X1 + X2 + . . . + Xn)/n = E[X1].
Proof. Let ε > 0 be a fixed number. Let us approximate the continuous variables Xi by
discrete variables Xiε. For this we let Xiε be the largest integer multiple of ε which is
still smaller-equal to Xi. In this way, we always get that
|Xi − Xiε| < ε.
This implies that:
|(X1 + X2 + . . . + Xn)/n − (X1ε + X2ε + . . . + Xnε)/n| < ε.
However the variables Xiε are discrete. So for them the law of large numbers has already
been proven and we find:
lim_{n→∞} (X1ε + X2ε + . . . + Xnε)/n = E[X1ε].
We have that
E[Xiε] = Σ_{z∈Z} εz·P(Xiε = εz).   (13.1)
However, by definition:
P(Xiε = εz) = P(Xi ∈ [εz, ε(z + 1)[).
The expression on the right side of the last equality is equal to
∫_{εz}^{ε(z+1)} fX(s) ds.
Thus
E[Xiε] = Σ_{z∈Z} εz·∫_{εz}^{ε(z+1)} fX(s) ds.
As ε tends to zero, the expression on the left side of the last equality above tends to:
∫_R s·fX(s) ds.
This implies that by taking ε fixed and sufficiently small, we have that, for large enough n,
the fraction
(X1 + X2 + . . . + Xn)/n
is as close as we want to
∫_R s·fX(s) ds.
The linearity of expectation holds in the same way as for discrete random variables.
This is the content of the next lemma.
Lemma 13.1 Let X and Y be two continuous random variables and let a be a number.
Then
E[X + Y ] = E[X] + E[Y ]
and
E[aX] = aE[X]
Proof. The proof goes like in the discrete case: the only thing used for the proof in the
discrete case is the law of large numbers. Since the law of large numbers also holds in
the continuous case, the exact same proof holds for the continuous case.
14
The Central Limit Theorem (CLT) is one of the most important theorems in probability.
Roughly speaking it says that if we build the sum of many independent random variables,
no matter what these little contributions are, we will always get approximately a normal
distribution. This is very important in everyday life, because often you have
situations where a lot of little independent things add up. So, you end up observing
something which is approximately a normal random variable. For example, when you
make a measurement you are most of the time in this situation. That is, when you don't
make one big measurement error. In that case, you have a lot of little imprecisions which
add up to give you your measurement error. Most of the time, these imprecisions can be
seen as close to being independent of each other. This then implies: unless you make one
big error, you will always end up having your measurement-error being close to a normal
variable.
Let X1 , X2 , X3 , . . . be a sequence of independent, identically distributed random variables.
(This means that they are the outcome of the same random experiment repeated several
times independently.) Let μ denote the expectation, μ := E[X1], and let σ denote the
standard deviation, σ := √(VAR[X1]). Let Z denote the sum
Z := X1 + X2 + X3 + . . . + Xn.
Then, by the calculation rules we learned for expectation and variance it follows that:
E[Z] = nμ
and the standard deviation σZ of Z is equal to:
σZ = σ√n.
When you subtract from a random variable its mean and divide by the standard deviation
then you always get a new variable with zero expectation and variance equal to one. Thus
the standardized sum:
(Z − nμ)/(σ√n)
has expectation zero and standard deviation 1. The central limit theorem says that on
top of this, for large n, the expression
(Z − nμ)/(σ√n)
is close to being a standard normal variable. Let us now formulate the central limit
theorem:
theorem:
Theorem 14.1 Let
X1, X2, X3, . . .
be a sequence of independent, identically distributed random variables with expectation μ
and standard deviation σ. Then, for large n, the standardized sum
(X1 + . . . + Xn − nμ)/(σ√n)
is close to being a standard normal random variable.
That the expression
(X1 + . . . + Xn − nμ)/(σ√n)
is close to a standard normal random variable N(0, 1) means that for every z ∈ R we
have that:
P(Z ≤ z)
is close to
P(N(0, 1) ≤ z).
In other words, as n goes to infinity, P(Z ≤ z) converges to P(N(0, 1) ≤ z). Let us give
a more precise version of the CLT than what we have done so far:
Theorem 14.2 Let
X1 , X2 , X3 , . . .
be a sequence of independent, identically distributed random variables. Let E[X1] = μ and
√(VAR[X1]) = σ. Then, for any z ∈ R, we have that:
lim_{n→∞} P((X1 + . . . + Xn − nμ)/(σ√n) ≤ z) = P(N(0, 1) ≤ z).
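A simulation sketch of Theorem 14.2: we take the four-sided die from the earlier examples (μ = 2.5, σ² = 1.25), with n = 50 summands and 20000 repetitions as arbitrary choices, and compare the empirical distribution of the standardized sums with Φ:

```python
import random
from math import erf, sqrt

# CLT check: standardized sums of i.i.d. uniform {1,...,4} die throws
# should have distribution function close to Phi.
def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

rng = random.Random(7)
n, mu, sigma = 50, 2.5, sqrt(1.25)
standardized = [
    (sum(rng.randint(1, 4) for _ in range(n)) - n * mu) / (sigma * sqrt(n))
    for _ in range(20000)
]
for z in (-1.0, 0.0, 1.0):
    frac = sum(1 for s in standardized if s <= z) / len(standardized)
    print(z, round(frac, 3), round(phi(z), 3))  # empirical vs. Phi(z)
```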
15 Statistical testing
following the procedure above we find that the Xi's are i.i.d. and that P(Xi = 1) =
p where p designates the true percentage of people in Atlanta who smoke. Then also
E[Xi ] = p. The total number of people in our survey who smoke Z, can now be expressed
as
Z := X1 + X2 + . . . + X100 .
Let P50%(.) designate the probability given that the true percentage which smoke is really
50%. Testing if 50% in Atlanta smoke can now be described as follows:
Calculate the probability:
P50%(X1 + . . . + X100 ≥ 70).
If the above probability is smaller than α = 0.05 we reject the hypothesis that 50%
of the population smokes in Atlanta (we reject it on the α = 0.05 level). Otherwise,
we keep the hypothesis. When we keep the hypothesis, this means that the result
of our survey does not constitute strong evidence against the hypothesis: the result
of the survey does not contradict the hypothesis.
Note that we could also have done the test on the α = 0.1 level. In that case we would
reject the hypothesis if that probability is smaller than 0.1.
Next we are explaining how we can calculate approximately the probability P50%(Z ≥ 70),
using the CLT. Simply note that, by basic algebra, the inequality
Z ≥ 70
is equivalent to
Z − nμ ≥ 70 − nμ
which is itself equivalent to:
(Z − nμ)/(σ√n) ≥ (70 − nμ)/(σ√n).
Equivalent inequalities must also have the same probability. Hence:
P(Z ≥ 70) = P((Z − nμ)/(σ√n) ≥ (70 − nμ)/(σ√n)).   (15.1)
By the CLT, the expression
(Z − nμ)/(σ√n)
is close to being a standard normal random variable N(0, 1). Thus, the probability on
the right side of equality 15.1 is approximately equal to
P(N(0, 1) ≥ (70 − nμ)/(σ√n)).   (15.2)
If the probability in expression 15.2 is smaller than 0.05 then we reject the hypothesis
that 50% of the Atlanta population smokes (on the α = 0.05 level). We can look up the
probability that the standard normal N(0, 1) is smaller than the number (70 − nμ)/(σ√n)
in a table. We have tables for the standard normal variable N(0, 1).
15.1 Looking up probabilities for the standard normal in a table
Let z ∈ R. Let Φ(z) denote the probability that a standard normal variable is smaller-equal
than z. Thus:
Φ(z) := P(N(0, 1) ≤ z) = ∫_{−∞}^{z} (1/√(2π))·e^{−x²/2} dx.
For example, let z > 0 be a number. Say we want to find the probability
P(N(0, 1) ≥ z).   (15.3)
The table for the standard normal gives the values of Φ(z) for z > 0, thus we have to
try to express probability 15.3 in terms of Φ(z). For this note that:
P(N(0, 1) ≥ z) = 1 − P(N(0, 1) < z).
Furthermore, P(N(0, 1) < z) is equal to P(N(0, 1) ≤ z) = Φ(z). Thus we find that:
P(N(0, 1) ≥ z) = 1 − Φ(z).
Let us next explain how, if z < 0, we can find the probability:
P(N(0, 1) ≤ z).
Note that N(0, 1) is symmetric around the origin. Thus,
P(N(0, 1) ≤ z) = P(N(0, 1) ≥ |z|).
This brings us back to the previously studied case. We find
P(N(0, 1) ≤ z) = 1 − Φ(|z|).
Eventually let z > 0 again. What is the probability:
P(−z ≤ N(0, 1) ≤ z)
equal to? For this problem note that
P(−z ≤ N(0, 1) ≤ z) = 1 − P(N(0, 1) ≥ z) − P(N(0, 1) ≤ −z).
Thus, we find that:
P(−z ≤ N(0, 1) ≤ z) = 1 − (1 − Φ(z)) − (1 − Φ(z)) = 2Φ(z) − 1.
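These three lookup rules are easy to verify numerically, again writing Φ via the standard error function (z = 1.5 is an arbitrary test value):

```python
from math import erf, sqrt

# Numeric check of the three table-lookup identities derived above.
def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

z = 1.5
upper = 1 - phi(z)          # P(N(0,1) >= z)
lower = phi(-z)             # P(N(0,1) <= -z), equal to upper by symmetry
two_sided = 2 * phi(z) - 1  # P(-z <= N(0,1) <= z)
print(round(upper, 4), round(lower, 4), round(two_sided, 4))  # 0.0668 0.0668 0.8664
```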
15.2 Two sample testing
Let us give an example to introduce this subject. Assume that we are testing a new fuel
for a certain type of rocket. We would like to know if the new fuel gives a different initial
velocity to the rocket. The initial velocity with the old fuel is denoted by X whilst Y is
the initial velocity with the new fuel. We fire the rocket five times with the old fuel and
measure each time the initial velocity. We find:
X1 = 100, X2 = 102, X3 = 97, X4 = 100, X5 = 101
(15.4)
(here Xi denotes the initial velocity measured whilst firing the rocket for the i-th time
with the old fuel). Then we fire the rocket five times with the new fuel. Every time we
measure the initial velocity. We find
Y1 = 101, Y2 = 103, Y3 = 99, Y4 = 102, Y5 = 100
(15.5)
We compute the sample averages:
X̄ := (X1 + X2 + X3 + X4 + X5)/5 = 100
Ȳ := (Y1 + Y2 + Y3 + Y4 + Y5)/5 = 101.
When we measure the initial velocities we find different values even when we use the same
fuel. The reason is that our measurement instruments are not very precise, so we get the
true value plus a measurement error. The model is as follows:
Xi = μX + ε^X_i
and
Yi = μY + ε^Y_i.
Furthermore ε^X_1, ε^X_2, . . . are i.i.d. random errors and so are ε^Y_1, ε^Y_2, . . .. We assume that
the measurement instrument is well calibrated so that
E[ε^X_i] = E[ε^Y_i] = 0
for all i = 1, 2, . . .. Here μX and μY are unknown constants (in our example μX is the
initial speed when we use the old fuel whilst μY is the initial speed when we use the new
fuel). We find that
E[Xi] = E[μX + ε^X_i] = E[μX] + E[ε^X_i] = μX + 0 = μX,
and similarly
E[Yi] = μY.
So our testing problem can be described as follows: we want to figure out, based on our
data 15.4 and 15.5, if the second fuel gives a different initial speed than the old fuel. We
observed
Ȳ − X̄ = 1 > 0.
This means that in the second sample, obtained with the new fuel, the initial speed is
higher by one unit on average than the initial speed in the first sample obtained with the
old fuel. But is this evidence enough to conclude that the new fuel provides higher initial
speed, or could this difference just be due to the measurement errors? As a matter of
fact, since we make measurement errors, it could be that, even if the second fuel does
not provide higher initial speed (i.e. μX = μY), due to the random errors and bad
luck the second average is higher than the first. In our present setting we can never
be absolutely sure, but we try to see if there is statistically significant evidence for
arguing that μX and μY are not equal.
The exact method to do this depends on whether we know the standard deviation of the
errors or not and if they are identical for the two samples. We will need the expectation
and standard deviation of the means. This is what we calculate in the next paragraph.
Expectation and standard deviation of the means. Let the standard deviations of
the errors be denoted by
σX := √(VAR[ε^X_i]),   σY := √(VAR[ε^Y_i]).
Let Z := Ȳ − X̄. We find that the standard deviation of Z is given by
σZ = √(VAR[Ȳ − X̄]) = √(VAR[Ȳ] + VAR[X̄])   (15.6)
= √(σX²/n + σY²/n).   (15.7)
If σX = σY (which should be the case when we use the same measurement instrument),
then equation 15.7 can be rewritten as
σZ = √(σ²/n + σ²/n) = σ·√(2/n),   (15.8)
where σ = σX = σY. If the two samples would have different sizes, we would find by a
similar calculation
σZ = √(σX²/n1 + σY²/n2)   (15.9)
where n1 is the size of the first sample and n2 is the size of the second sample. Furthermore
we have for the expectation
E[Ȳ − X̄] = E[Ȳ] − E[X̄]
= E[(Y1 + Y2 + . . . + Yn)/n] − E[(X1 + . . . + Xn)/n]
= E[Y1 + . . . + Yn]/n − E[X1 + . . . + Xn]/n
= (E[Y1] + E[Y2] + . . . + E[Yn])/n − (E[X1] + E[X2] + . . . + E[Xn])/n
= E[Y1] − E[X1] = μY − μX.
To summarize, we found that
E[Ȳ − X̄] = μY − μX.   (15.10)
A simplified method. Let us first explain a rough method, to convey the idea in a
simple way. Up to a small detail, this method is the same as what is really used in practice.
At this stage we are ready to explain how we could proceed to know if we have strong
evidence for the case μY − μX ≠ 0. We are going to use the rule of thumb which says
that in most cases, for most variables, the values we typically observe are within a distance
of at most 2 times the standard deviation from the expected value. We apply this rule
to Z = Ȳ − X̄. If there were no difference between the new and old
fuel, then μY − μX would be equal to zero and hence E[Z] = 0 (see equation 15.10). We
can then check if the value we observe for Z is within 2 times the standard deviation σZ.
Thus in our case we check if the value 1 is within 2 times the standard deviation σZ. This
is the same as checking if
(Ȳ − X̄)/σZ   (15.11)
is not more than 2 in absolute value. If it exceeds 2, we would think that μY is probably not
equal to μX. In that case, we say that we reject the hypothesis that μX = μY. The
expression 15.11 is called the test statistic. What we did here is check if the value taken by
the test statistic is within the interval [−c_cr, c_cr], where we took c_cr = 2. The number c_cr is
called the critical value for our test. If we do not know σX and σY, we estimate them and
replace them by their estimates in the formulas 15.7, 15.8, 15.9. (To see how to estimate
a standard deviation go to subsection 16.2.) We then use that value for the test statistic
instead of 15.11.
The method described here differs from the one really used only as far as the critical
value is concerned. However, even with the way the test is usually done in practice, the
critical value will not be very far from 2. Let us next explain in detail the different
methods used in practice. They depend on whether the standard deviation is known or
not. Also, to perform a statistical test in a precise way, we need to specify the level of
confidence for the test. The higher the level of confidence, the bigger the critical value
will be. Let us see the details in the next paragraphs:
The case with identical, known standard deviation. Assume that the standard
deviations σX and σY are known to us and identical. This is typically the case when the
measurement instruments used for both samples are identical. In this case, we denote by
σ the value σ = σX = σY. If we work often with the same measurement instruments, we will
know from experience the typical size σ of the measurement error. Assume here that
the measurement errors are normal. Then, the test statistic
(Ȳ − X̄)/σZ = (Y1 + . . . + Yn − X1 − . . . − Xn)/(nσZ)   (15.12)
is also normal. As a matter of fact, as can be seen in 15.12, the test statistic can be
written as a sum of independent normal variables divided by a constant. We know that
sums of independent normal variables are again normal. Furthermore dividing a normal
by a constant gets you a normal again. If μX = μY, then the expectation of the test
statistic is zero:
E[(Ȳ − X̄)/σZ] = E[Ȳ − X̄]/σZ = (μY − μX)/σZ = 0.
Similarly the variance of the test statistic is one. This can be seen from:
VAR[(Ȳ − X̄)/σZ] = VAR[Ȳ − X̄]/σZ² = 1.
Hence, if μX = μY, then the test statistic is a normal variable with expectation 0 and
variance 1. In other words, the test statistic is a standard normal variable. So, in this
case, the critical value c_cr at a confidence level p is the number c_cr > 0 satisfying
P(−c_cr ≤ N(0, 1) ≤ c_cr) = p.
By symmetry around the origin, this implies (see subsection 15.1) that
Φ(c_cr) = (1 + p)/2.   (15.13)
In our rocket example, assume the measurement errors have known standard deviation
σ = 3. With n = 5, the test statistic takes the value
(Ȳ − X̄)/σZ = 1/(σ·√(2/n)) = 1/(3·√(2/5)) ≈ 1/1.8 ≈ 0.55.
The value of the test statistic lies within the interval [−c_cr, c_cr], where c_cr = 1.96. Hence
in this situation we cannot reject the hypothesis that μX = μY on the 95%-confidence
level. In other words, we do not have enough statistical evidence to reject
the idea that μX = μY. This means that our data does not seem to imply that the new
fuel is better or worse than the old. Note that this does not necessarily mean that μX
and μY must be identical. It could be that the difference is so small that it gets masked
by our measurement errors.
The way the test was done is called a two-sided test. If we would be interested just in
knowing if the new fuel is better then we would do a one-sided test. (It could be that
a company might change to a new fuel but only if it is proven to be better. In that case,
the only interesting thing is to know if the new fuel is better and not if it is different). In
the case of a one-sided test the confidence interval would be (−∞, c_cr] where the critical
value c_cr is determined by
P(N(0, 1) ≤ c_cr) = Φ(c_cr) = p
on the confidence level p. Here, as before, Φ(.) designates the distribution function of a
standard normal variable.
In this example, we assumed the measurement errors to be normal. If this is not the case,
but we have many measurements, the above method still applies due to the Central Limit
Theorem.
Case when the standard deviations are known but not equal or different sample
sizes. If in the two samples the standard deviations are different (because of different
measurement instruments maybe), then all the above remains the same except that we
use a different formula for Z . The formula then used for Z is formula 15.7. The same
goes when the samples have different sizes from each other. The formula used in that case
is 15.9.
The case with unknown, but equal standard deviation. Assume that σ = σX =
σY, but σ is unknown to us. Then instead of σZ, we use an estimate for σZ. For this note
that
σZ = √(σX²/n + σY²/n).   (15.14)
We will estimate σX² and σY² and plug the values into formula 15.14 instead of the real
values. The estimates we use (see subsection 16.2) are
s²X := ((X1 − X̄)² + (X2 − X̄)² + . . . + (Xn − X̄)²)/(n − 1)
for σX², and the analogous estimate s²Y for σY².
Our test statistic is obtained by replacing σZ by its estimate in the previously used test statistic. Hence the test statistic for the case of unknown standard deviation is:

(Ȳ − X̄)/√((s_X² + s_Y²)/n).    (15.15)

The distribution of the test statistic is no longer normal. It is slightly modified. One can prove that when μX = μY and the measurement errors are normal, the test statistic has a Student t-distribution with 2n − 2 degrees of freedom. So our testing procedure is almost as before, only we have to find the critical value c_cr in a different table. This time we have to find it in a table for the t-distribution with 2n − 2 degrees of freedom. So, if we test on the confidence level p, then c_cr is defined to be the number such that

P(−c_cr ≤ T_{2n−2} ≤ c_cr) = p.
We reject the hypothesis μX = μY on the level p if the test statistic 15.15 takes a value outside [−c_cr, c_cr]. Let us get back to our rocket example. Say we want to test μX = μY on the level p = 95%. We find

s_X² = (0 + 2² + 3² + 0 + 1)/4 = 14/4 = 3.5

and

s_Y² = (0 + 2² + 2² + 1 + 1)/4 = 10/4 = 2.5.

With n = 5 the test statistic takes on the value

(Ȳ − X̄)/√((s_X² + s_Y²)/n) = 1/√(6/5) = 1/√1.2 ≈ 0.91.
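The computation above is easy to check numerically. Below is a small Python sketch (the variable names are our own; the critical value 2.306 for the t-distribution with 8 degrees of freedom comes from a standard table):

```python
import math

# Squared deviations from the sample means, as in the rocket example
dev2_X = [0, 2**2, 3**2, 0, 1]   # (X_i - X̄)², sums to 14
dev2_Y = [0, 2**2, 2**2, 1, 1]   # (Y_i - Ȳ)², sums to 10
n = 5

# Sample variances: divide by n - 1
s2_X = sum(dev2_X) / (n - 1)     # 14/4 = 3.5
s2_Y = sum(dev2_Y) / (n - 1)     # 10/4 = 2.5

# Test statistic (Ȳ - X̄)/sqrt((s_X² + s_Y²)/n), with Ȳ - X̄ = 1
t_stat = 1 / math.sqrt((s2_X + s2_Y) / n)
print(round(t_stat, 2))          # 0.91

# Critical value of t with 2n - 2 = 8 degrees of freedom, 95% level
c_cr = 2.306
print(abs(t_stat) > c_cr)        # False: we cannot reject μX = μY
```

Since 0.91 lies well inside [−2.306, 2.306], the conclusion of the text is confirmed.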
16 Statistical estimation

16.1 An example
Imagine that we want to measure the distance d between two points y and z. Every time we repeat the measurement we make a measurement error. In order to improve the precision we make several measurements and then take the average value measured. Let Xi designate measurement number i and εi the error number i. We have that:

Xi = d + εi.

We assume that the measurement errors are i.i.d. such that

E[εi] = 0 and Var[εi] = σ².

The standard deviation σ of the measurement instrument is supposed to be known to us. Imagine that we make 4 measurements and find, in meters, the four values:

100, 102, 99, 101.

We see that the distance d must be around 100.5 meters. However, the exact value of the distance d remains unknown to us, since each of the four measurements above contains an error. So, we can only estimate what the true distance is equal to. Typically we take the average of the measurements as estimate for d. We write d̂ for our estimate of d. In case we decide to use the average of our measurements as estimate for d, we have that:

d̂ = (X1 + X2 + X3 + X4)/4.

The advantage of taking four measurements of the same distance instead of only one is that the probability of a large error is reduced. The errors in the different measurements tend to even each other out when we compute the average. As a matter of fact, assume we make n measurements and then take the average. In this case:

d̂ := (X1 + … + Xn)/n.
We find:

E[d̂] = (1/n)(E[X1] + … + E[Xn]) = (1/n)(nE[X1]) = E[X1] = E[d + εi] = E[d] + E[εi] = d + 0 = d.

An estimator whose expectation is equal to the true value we want to estimate is called an unbiased estimator.
Let us calculate:

Var[d̂] = Var[(X1 + … + Xn)/n] = (1/n²)(Var[X1] + … + Var[Xn]) = (1/n²)(n Var[X1]) = Var[X1]/n.

Thus, the standard deviation of d̂ is equal to

√(Var[X1]/n) = σ/√n.

The standard deviation of the average d̂ is thus √n times smaller than the standard deviation of the error when we make one measurement. This justifies taking several measurements and taking the average, since it reduces the size of a typical error by a factor √n.
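This √n effect can be seen in a quick simulation — a sketch assuming, for illustration, normal errors with σ = 2 and averages of n = 25 measurements:

```python
import random
import statistics

random.seed(0)

sigma = 2.0    # standard deviation of a single measurement error (assumed)
n = 25         # measurements averaged per experiment
reps = 20000   # number of repeated experiments

# Each experiment: average n random errors; collect the averages
averages = [
    statistics.fmean(random.gauss(0, sigma) for _ in range(n))
    for _ in range(reps)
]

# The empirical standard deviation of the averages should be
# close to sigma/sqrt(n) = 2/5 = 0.4
print(statistics.stdev(averages))
```

The printed value comes out very close to 0.4, i.e. √25 = 5 times smaller than σ.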
When we make a measurement and give an estimate of what the distance is, it is important that we know the order of magnitude of the error. Imagine for example that the order of magnitude of the error is 100 meters. The situation would then be: our estimate of the distance is 101 meters, and the precision of this estimate is plus/minus 100 meters. In this case our estimate of the distance is almost useless because of the huge imprecision. This is why we always try to give the precision of the estimate. Since the errors are random, theoretically even very large errors are always possible. Very large errors, however, have small probability. Hence one tries to give an upper bound on the size of the error which holds with a given probability. Typically one uses the probabilities 95% or 99%. The type of statement one wishes to make is for example: our estimate for the distance is 101 meters; furthermore, with 95% probability the true distance is within 2 meters of our estimate. In this case the interval [99, 103] is called the 95% confidence interval for d. With 95% probability, d should lie within this interval. More precisely, we look for a real number a > 0 such that:

P(d̂ − a ≤ d ≤ d̂ + a) = 95%

or equivalently:

P(−a ≤ d̂ − d ≤ a) = 95%.

Hence we are looking for a number a such that:

95% = P(−a ≤ (X1 + … + Xn)/n − d ≤ a) = P(−a ≤ (d + ε1 + … + d + εn − nd)/n ≤ a)
    = P(−a ≤ (ε1 + … + εn)/n ≤ a).
Now, we assume either that the errors εi are normal, or that n is big enough so that the sum ε1 + … + εn is approximately normal due to the central limit theorem. Dividing by σ/√n, we get:

95% = P(−a√n/σ ≤ (ε1 + … + εn)/(σ√n) ≤ a√n/σ) = P(−a√n/σ ≤ N(0, 1) ≤ a√n/σ).

We thus find the number b > 0 from the table for the standard normal random variable such that:

95% = P(−b ≤ N(0, 1) ≤ b).

Hence:

95% = Φ(b) − (1 − Φ(b)) = 2Φ(b) − 1,

where Φ(·) designates the distribution function of the standard normal variable. Then, we find a > 0 by solving:

b = a√n/σ.
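Putting the pieces together for the four measurements above — a sketch assuming, for illustration, that the known standard deviation of the instrument is σ = 2, and using b ≈ 1.96 from the normal table:

```python
import math

measurements = [100, 102, 99, 101]   # the four distance measurements, in meters
n = len(measurements)
sigma = 2.0                          # assumed known std of one measurement (hypothetical)

d_hat = sum(measurements) / n        # estimate of the distance: 100.5

# For 95%: b with P(-b ≤ N(0,1) ≤ b) = 0.95, i.e. b ≈ 1.96
b = 1.96
a = b * sigma / math.sqrt(n)         # half-width of the confidence interval

print(d_hat)                         # 100.5
print((d_hat - a, d_hat + a))        # ≈ (98.54, 102.46)
```

With these assumed numbers the 95% confidence interval for d is roughly [98.5, 102.5] meters.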
16.2 Estimation of variance and standard deviation
Assume that we are in the same situation as in the previous subsection. The only difference is that instead of trying to determine the distance, we want to find out how precise our measurement instrument is. In other words, we try to determine the standard deviation σ = √(Var[εi]). For this we make several measurements of the distance between two points y and z. We choose the points so that we know the distance d between them. Again, if Xi designates the i-th measurement, we have Xi = d + εi. Define the random variable Zi in the following way:

Zi := (Xi − d)² = εi².

Thus:

E[Zi] = Var[εi].

We have argued that if we have a number of independent copies of the same random variable, a good way to estimate the expectation is to take the average. Thus to estimate the expectation E[Zi], we take the average:

Ê[Zi] := (Z1 + … + Zn)/n.
Hence our estimate for σ is

σ̂ := √((Z1 + … + Zn)/n) = √(((X1 − d)² + … + (Xn − d)²)/n).
If the distance d should not be known, we simply take an estimate d̂ for d instead. In that case our estimate for σ is

σ̂ = √(((X1 − d̂)² + … + (Xn − d̂)²)/(n − 1)),

where

d̂ := (X1 + … + Xn)/n.

(Note that instead of dividing by n in the case that d is unknown, we usually divide by n − 1. This is a little detail which I am not going to explain. For large n, it is not important, since then n/(n − 1) is close to 1.)
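Both estimates can be sketched in a few lines (the measurement values and the known distance d = 100 are hypothetical, chosen only for illustration):

```python
import math

measurements = [100, 102, 99, 101]   # hypothetical repeated measurements, in meters
n = len(measurements)

# Case 1: the true distance d is known (calibration against a known distance)
d = 100                              # hypothetical known distance
sigma_known_d = math.sqrt(sum((x - d) ** 2 for x in measurements) / n)

# Case 2: d unknown — use the average d̂ instead and divide by n - 1
d_hat = sum(measurements) / n
sigma_unknown_d = math.sqrt(sum((x - d_hat) ** 2 for x in measurements) / (n - 1))

print(sigma_known_d)    # sqrt(6/4)  ≈ 1.22
print(sigma_unknown_d)  # sqrt(5/3)  ≈ 1.29
```

Note how the second case divides by n − 1 = 3 rather than n = 4, exactly as in the remark above.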
16.3 Maximum Likelihood estimation
Imagine the following situation: we have two 6-sided dice. Let X designate the number we obtain when we throw the first die. Let Y designate the number we obtain when we throw the second one. Assume that the first die is regular whilst the second is skewed. We have:

(P(X = 1), P(X = 2), …, P(X = 6)) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

(Note that 1/6 ≈ 0.167.) Assume furthermore that:

(P(Y = 1), …, P(Y = 6)) = (0.01, 0.3, 0.2, 0.1, 0.1, 0.29).

Imagine that we are playing the following game: I choose from a bag one of the two dice. Then I throw it and get a number between 1 and 6. I don't tell you which die I used, but I tell you the number obtained. You have to guess which die I used based on the number which I tell you. (This guessing is what statisticians call estimating.) For example, I tell you that I obtained the number 1. With the first die, the probability to obtain a 1 is about 0.167, whilst with the second die it is 0.01. The probability to obtain a 1 is thus much smaller with the second die. Having obtained a one thus makes us think that it is likelier that the die used is the first die. Our guess will thus be the first die. Of course you could be wrong, but based on what you know, the first die appears to be likelier.
If on the other hand, after throwing the die, we obtain a 2, we guess that it was the second die which got used. The reason is that with the second die a 2 has a probability of 0.3, which is larger than the probability to see a 2 with the first die. Again, our guess might be wrong, but when we observe a 2, the second die seems likelier. The method of guessing described here is called Maximum likelihood estimation. It consists of guessing (estimating) the possibility which makes the observed result most likely. In other words, we choose the possibility for which the probability of the observed outcome is highest.

Let us look at it in a slightly more abstract way. Let I designate the first die and II the second. For x = 1, 2, …, 6, let P(x, I) designate the probability that the number we obtain by throwing the first die equals x. Thus:

P(x, I) := P(X = x).

Let P(x, II) designate the probability that the number we obtain by throwing the second die equals x. Thus:

P(x, II) := P(Y = x).

For example, P(1, I) is the probability that the first die gives a 1 and P(1, II) is the probability that the second die gives a 1, whilst P(2, II) designates the probability that the second die gives a 2.

Let θ be a (non-random) variable which can take one out of two values: I or II. Statisticians call θ the parameter. In this example, guessing which die we are using is the same as trying to figure out if θ equals I or II. We consider the probability function P(·, ·) with two entries:

(x, θ) ↦ P(x, θ).

Formally, what we did can be described as follows: given that we observe an outcome x, we take the θ which maximizes P(x, θ) as our guess for which die was used. Our maximum likelihood estimate of θ is the θ maximizing P(x, θ), where x is the observed outcome. This is a general method, and can be used in many different settings. Let us give another example of maximum likelihood estimation, based on the same principle.
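The die-guessing rule can be written down directly (the function name ml_estimate is our own):

```python
# Probabilities for each face: die I (fair) and die II (skewed), as in the text
P = {
    "I":  [1/6] * 6,
    "II": [0.01, 0.3, 0.2, 0.1, 0.1, 0.29],
}

def ml_estimate(x):
    """Return the parameter value (die) that maximizes P(x, θ)."""
    return max(P, key=lambda die: P[die][x - 1])

print(ml_estimate(1))  # 'I'  — a 1 is likelier with the fair die
print(ml_estimate(2))  # 'II' — a 2 is likelier with the skewed die
```

For each observed face, the function simply picks the die under which that face has the higher probability, which is exactly the maximum likelihood rule described above.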
16.4 Estimation of parameter for geometric random variables

Imagine that we observe n = 5 independent geometric random variables T1, …, T5, all with the same unknown parameter p, and that the observed values are 6, 7, 5, 8, 8. By independence, the probability of this outcome,

P(T1 = 6, T2 = 7, …, T5 = 8),    (16.1)

is equal to

P(T1 = 6) · P(T2 = 7) · … · P(T5 = 8).

For a geometric random variable T with parameter p we have that:

P(T = k) = p(1 − p)^(k−1).

Thus the probability 16.1 is equal to:

p(1−p)^5 · p(1−p)^6 · … · p(1−p)^7 = exp(ln p + 5 ln(1−p) + … + ln p + 7 ln(1−p)).    (16.2)
We want to find the p maximizing the last expression. This is the same as maximizing the expression:

ln p + 5 ln(1−p) + … + ln p + 7 ln(1−p),

since exp(·) is an increasing function. To find the maximum, we take the derivative according to p and set it equal to 0. This gives:

0 = n/p − ((T1 + … + Tn) − n)/(1 − p),

which after rearranging leads to:

1/p = (T1 + T2 + … + Tn)/n = (6 + 7 + 5 + 8 + 8)/5.    (16.3)

Our estimate p̂ of p is the p which maximizes expression 16.2. This is the p which satisfies equation 16.3. Thus our estimate is:

p̂ := 1/((T1 + T2 + … + Tn)/n) = 1/((6 + 7 + 5 + 8 + 8)/5) = 5/34.
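A quick numerical check that p̂ = 5/34 indeed maximizes expression 16.2 (equivalently, its logarithm):

```python
import math

observations = [6, 7, 5, 8, 8]   # observed values of T1, ..., T5
n = len(observations)

# Closed-form maximum likelihood estimate: 1/p̂ = (T1 + ... + Tn)/n
p_hat = n / sum(observations)     # 5/34

def log_likelihood(p):
    """Logarithm of expression 16.2: n·ln(p) + Σ(k_i - 1)·ln(1 - p)."""
    return n * math.log(p) + sum(k - 1 for k in observations) * math.log(1 - p)

print(p_hat)  # ≈ 0.147

# Sanity check: p̂ beats nearby values of p
assert log_likelihood(p_hat) > log_likelihood(p_hat - 0.01)
assert log_likelihood(p_hat) > log_likelihood(p_hat + 0.01)
```

The log-likelihood is concave in p here, so beating the neighbouring values is consistent with p̂ being the global maximizer.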
17 Linear Regression

17.1 The case where the exact linear model is known
Imagine a situation where you have a chain of shops. The shops can have different sizes and the profit seems to be, to some extent, a function of the size. The chain owns n shops.
Let xi denote the size of the i-th shop and Yi its profit. Now you assume that there are two constants α and β so that the following relationship holds:

Yi = α + βxi + εi,

where we assume that ε1, ε2, … are i.i.d. random variables with expectation zero:

E[εi] = 0.

Let σ denote the standard deviation of the variables εi. Often it will also be assumed that the variables εi are normal. Now, we have that the expected profit is equal to

E[Yi] = E[α + βxi + εi] = E[α] + E[βxi] + E[εi] = α + βxi.

In other words, the expected profit is a linear function of the size: E[Y] = α + βx, where x is the size and Y is the profit of a shop. So, if you draw a curve representing expected profit as a function of size, you would get a straight line.
Say for your chain of shops you would have the relationship Yi = 3 + 4xi + εi. So, in this case α = 3 and β = 4. Say you own a shop of size 5. Then for that shop, the expected profit given that the size is 5 would be E[Y | x = 5] = 3 + 4·5 = 23. Here we denote by E[Y | x = 5] the expectation given that the size is 5. Now why would the profit of that shop be random? Well, very simple: it could be that you own many shops, but this one shop with size 5 is going to open next month. So, nobody knows in advance the exact profit. One can forecast it, give maybe a confidence interval, but nobody knows in advance the exact value! Hence, the profit is behaving like a random variable. If you are told to predict (estimate) what the profit will be, you will give the expected value α + 5β = 23. Of course, this requires that you know the constants α and β. Now, if you know the standard deviation σ of ε, then you can also give a confidence interval. First, using Matzinger's rule of thumb, you could simply say that most probably the profit of the shop will be within two standard deviations of the expected profit. In our case, thus, we could say that typically the profit is 23 ± 2σ and hence most likely to be between 23 − 2σ and 23 + 2σ. If for example σ = 3, then most likely the profit for our shop will be between 17 and 29. Now, this is using a rule of thumb which says that random variables typically take values not further than twice the standard deviation from their expectation most of the time. But this is not very precise. So, we could actually give a confidence interval. That is, we could give an interval so that with, for example, 95% probability the profit will be in that interval. If we assume that the errors are normal, then we have that ε/σ is standard normal. Hence, (Y − α − βx)/σ = ε/σ is standard normal. Hence,
P(−c ≤ (Y − α − βx)/σ ≤ c) = P(−c ≤ N(0, 1) ≤ c).

The above allows us to give an interval (think of confidence interval) so that the profit of the new shop will be in that interval with a given probability. For example, for 95%-confidence, the interval is going to be

[α + 5β − c0.95·σ, α + 5β + c0.95·σ] = [23 − c0.95·σ, 23 + c0.95·σ],

where c0.95 denotes the constant so that a standard normal is between plus/minus that constant with 0.95 probability. We have seen how to calculate such a constant.
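For the shop of size 5 with α = 3, β = 4 and σ = 3, the 95% interval can be computed as follows (c0.95 ≈ 1.96 from the normal table):

```python
alpha, beta = 3, 4     # known model constants from the example
sigma = 3              # known standard deviation of the error ε
x = 5                  # size of the shop

expected_profit = alpha + beta * x   # 23

# 95% level: c with P(-c ≤ N(0,1) ≤ c) = 0.95, i.e. c ≈ 1.96
c = 1.96
interval = (expected_profit - c * sigma, expected_profit + c * sigma)
print(interval)   # ≈ (17.12, 28.88)
```

Note that this exact 95% interval, roughly [17.1, 28.9], is only slightly wider than the rule-of-thumb interval [17, 29] from above.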
Imagine next a situation where α and β are known, and σ is not known. Then we want to estimate σ based on our data. Note that Yi = α + βxi + εi and hence:

εi = Yi − α − βxi.    (17.1)

When the data (Yi, xi) is known and α, β are known as well, then we can figure out the value of the εi's using formula 17.1. Note that σ designates the standard deviation of the errors εi. But in previous
chapters we have learned how to estimate a standard deviation. So, this is what we are going to do, using the εi's to estimate the standard deviation σ:

σ̂ := √((ε1² + … + εn²)/n).    (17.2)
Let us give an example. Say we have five shops and, as before, α = 3 and β = 4. The data for the shops is given in the table below:

xi | 1 | 2  | 3  | 4  | 6
Yi | 8 | 10 | 17 | 17 | 27

Now, for example, ε1 = Y1 − 3 − 4x1 = 8 − 3 − 4 = 1. So for each i = 1, 2, …, 5 we can calculate the corresponding εi. We get the values:

xi | 1 | 2  | 3 | 4  | 6
εi | 1 | −1 | 2 | −2 | 0

Hence, by formula 17.2:

σ̂ := √((1² + (−1)² + 2² + (−2)² + 0²)/5) = √2 ≈ 1.41.    (17.3)
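The residuals and the estimate 17.2 for this table can be checked as follows:

```python
import math

alpha, beta = 3, 4                 # known constants from the example
xs = [1, 2, 3, 4, 6]               # shop sizes
ys = [8, 10, 17, 17, 27]           # profits
n = len(xs)

# Residual errors ε_i = Y_i - α - β·x_i  (formula 17.1)
eps = [y - alpha - beta * x for x, y in zip(xs, ys)]
print(eps)                          # [1, -1, 2, -2, 0]

# Estimate of σ (formula 17.2)
sigma_hat = math.sqrt(sum(e ** 2 for e in eps) / n)
print(sigma_hat)                    # sqrt(2) ≈ 1.41
```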
We can now use Matzinger's rule of thumb, which says that a random variable, most of the time, takes values not further than two standard deviations from its expectation. So, that tells us that for our shop, the profit should be within 23 ± 2σ̂ = 23 ± 2.82. So, typically the profit would be in the interval [20.1716, 25.8284].
The above interval is just to have a rough idea of which area the profit will most likely be in. For a more precise approach with an explicit confidence level γ, we would take the interval:

[23 − c_γ·σ, 23 + c_γ·σ],    (17.4)

where c_γ is the constant so that a standard normal is with probability γ between −c_γ and +c_γ:

γ = P(−c_γ ≤ N(0, 1) ≤ c_γ).

Now, if we do not know the standard deviation, we can replace the true standard deviation by its estimate σ̂. The coefficient c_γ from the normal table then has to be replaced by a coefficient t^n_{γ/2} from the Student t-table. So, the confidence interval, if we have to estimate the standard deviation, becomes (for a shop of size x0):

[α + βx0 − t^n_{γ/2}·σ̂, α + βx0 + t^n_{γ/2}·σ̂].    (17.5)

17.2 When α and β are not known
If α and β are not known, then we estimate them using least squares. The estimates are given by the two following equations:

ȳ = α̂ + β̂x̄

and

β̂ = Σ_{i=1}^n (xi − x̄)yi / Σ_{i=1}^n (xi − x̄)².

Put the value of β̂ from the second equation into the first to calculate α̂.
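Applied to the shop data from the example table above (xi = 1, 2, 3, 4, 6 and Yi = 8, 10, 17, 17, 27), the two formulas give estimates close to the true α = 3 and β = 4 used to generate the data:

```python
xs = [1, 2, 3, 4, 6]                # shop sizes from the example table
ys = [8, 10, 17, 17, 27]            # profits
n = len(xs)

x_bar = sum(xs) / n                 # 3.2
y_bar = sum(ys) / n                 # 15.8

# β̂ = Σ(x_i - x̄)·y_i / Σ(x_i - x̄)²
beta_hat = (sum((x - x_bar) * y for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# Plug β̂ into ȳ = α̂ + β̂·x̄ to get α̂
alpha_hat = y_bar - beta_hat * x_bar

print(round(beta_hat, 3), round(alpha_hat, 3))   # ≈ 3.797 and 3.649
```

The estimates do not exactly equal 3 and 4 because of the random errors in the data, which is precisely the point of the discussion that follows.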
Now, in principle, all the things we did in the last subsection, where α and β were known, will be done here. The difference is mainly that instead of α and β we use the estimates α̂ and β̂ instead. But then we act as if the estimates were the true values. (For the confidence interval there will be a small adjustment.) In other words, given some real data (xi, Yi) for i = 1, 2, …, n, you could estimate α and β. Then forget that your estimates α̂ and β̂ are only estimates. Act as if they were the true α and β and do everything we did in the section above. In this way, you can figure out how to estimate the standard deviation, get a confidence interval, and so on and so forth. Let us summarize:
1. To estimate the expected profit of a new shop of size x0, we used α + βx0 in the previous section. Now, however, α and β are not known. So, we simply take the estimates α̂ and β̂ and act as if they were the true values. Our estimate for the expected profit of a shop of size x0, when α and β are not known, is:

Ê[Y | x0] := α̂ + β̂x0.
2. To estimate the standard deviation, we had used the εi's, which are equal to εi = Yi − α − βxi. Now α and β are not known here, so we replace them by their respective estimates. So our estimated random errors are

ε̂i := Yi − α̂ − β̂xi.

For estimating the standard deviation σ, we now simply replace εi by the estimate ε̂i in formula 17.2. Hence, the estimated σ is defined to be:

σ̂ := √((ε̂1² + ε̂2² + … + ε̂n²)/(n − 2)).    (17.6)
3. Let us see how we give a rough confidence interval using Matzinger's rule of thumb. (That rule of thumb is: mostly, variables take values not further than two times the standard deviation from their expectation.) So, in the formula α + βx0 ± 2σ we simply replace α, β, σ by their respective estimates. So the rough confidence interval for the profit of a shop with size x0 would be

[α̂ + β̂x0 − 2σ̂, α̂ + β̂x0 + 2σ̂],

where σ̂ is our estimate given in 17.6.
4. For an exact confidence interval we take the same as in 17.5, but replacing again α, β and σ by their respective estimates. (Here for estimating σ we take 17.6.) Also, there is an additional factor equal to

√(1 + 1/n + (x0 − x̄)²/Σ_{i}(xi − x̄)²).

This factor is needed because we have additional uncertainty, since we do not know α + βx0, but only have an estimate for it. Also, for large n, this factor becomes close to 1. So, all this being said, our confidence interval on the γ-confidence level is

[α̂ + β̂x0 − t^{n−2}_{γ/2}·σ̂·√(1 + 1/n + (x0 − x̄)²/Σ_{i}(xi − x̄)²),
 α̂ + β̂x0 + t^{n−2}_{γ/2}·σ̂·√(1 + 1/n + (x0 − x̄)²/Σ_{i}(xi − x̄)²)].
|yi a bxi |.
i=1
Note that in the above sum, the yi s and the xi s are given numbers, so we only
need to find a and b minimizing the above expression. Now, absolute values are a
mess to calculate with. So, instead, we will take the sum of the distances square:
2
d (a, b) :=
n
X
i=1
62
(yi a bxi )2
and find a and b minimizing d²(a, b). This will yield very nice explicit formulas. To find those formulas we simply take the derivative according to a and according to b and set them equal to 0. This yields:

d/da Σ_{i=1}^n (yi − a − bxi)² = −2 Σ_{i=1}^n (yi − a − bxi).

Setting the expression on the right side of the last equation above equal to 0, we find:

ȳ = a + bx̄,

where

ȳ := (y1 + … + yn)/n and x̄ := (x1 + … + xn)/n.
Then, we take the derivative according to b and set it equal to 0:

d/db Σ_{i=1}^n (yi − a − bxi)² = −2 Σ_{i=1}^n xi(yi − a − bxi).

So, setting the expression on the right side of the last equation above equal to 0 yields:

Σ_{i=1}^n xi·yi − a Σ_{i=1}^n xi − b Σ_{i=1}^n xi·xi = 0.    (17.7)
Let x′i := xi − x̄ denote the centered sizes, so that Σ_{i=1}^n x′i = 0. Replacing a by ȳ − bx̄ in 17.7 and rewriting everything in terms of the x′i's, the term with a drops out since Σ x′i = 0. Hence we have

Σ_{i=1}^n x′i·yi − b Σ_{i=1}^n (x′i)² = 0,

and therefore

b = Σ_{i=1}^n x′i·yi / Σ_{i=1}^n (x′i)² = Σ_{i=1}^n (xi − x̄)yi / Σ_{i=1}^n (xi − x̄)².
We have now found a system of two equations for a and b, which determines which straight line y = a + bx gets closest to the data points (x1, y1), (x2, y2), …, (xn, yn). By closest, we mean that the sum of the squared vertical distances between the points and the line should be minimal. So, the system of two equations is:

ȳ = a + bx̄    (17.8)

b = Σ_{i=1}^n (xi − x̄)yi / Σ_{i=1}^n (xi − x̄)².    (17.9)

Solving the above system of two equations in a and b yields the straight line y = a + bx which is closest (in our sense of the sum of squared distances) to our points. We will use these values for a and b, which minimize the sum of squared distances, as our estimates for α and β. An explanation why this is a good idea can be found below in the subsection entitled: how precise are our estimates. So, we have that the estimates α̂ and β̂ are the only solution to 17.8 and 17.9. Hence, they are given by the following two equations:

ȳ = α̂ + β̂x̄

and
β̂ = Σ_{i=1}^n (xi − x̄)yi / Σ_{i=1}^n (xi − x̄)².

17.4 Expectation and variance of β̂

Let us start from the formula

β̂ = Σ_{i=1}^n (xi − x̄)yi / Σ_{i=1}^n (xi − x̄)².
We are going to take the variance on both sides of the last equation above, and use the fact that the xi's are constants and not random. Recall that constants which multiply a random variable can be taken out of the variance after squaring. This leads to

Var[β̂] = Var[Σ_{i=1}^n (xi − x̄)yi / Σ_{i=1}^n (xi − x̄)²]
        = Σ_{i=1}^n (xi − x̄)² Var[yi] / (Σ_{i=1}^n (xi − x̄)²)²
        = Var[yi] / Σ_{i=1}^n (xi − x̄)².

So, we get finally:

Var[β̂] = σ² / Σ_{i=1}^n (xi − x̄)²,

and

σ_β̂ = σ / √(Σ_{i=1}^n (xi − x̄)²).    (17.10)
Similarly, for the expectation of β̂ we find:

E[β̂] = E[Σ_{i=1}^n (xi − x̄)yi / Σ_{i=1}^n (xi − x̄)²]
      = Σ_{i=1}^n (xi − x̄)E[yi] / Σ_{i=1}^n (xi − x̄)²
      = Σ_{i=1}^n (xi − x̄)(α + βxi) / Σ_{i=1}^n (xi − x̄)²
      = α·Σ_{i=1}^n (xi − x̄)/Σ_{i=1}^n (xi − x̄)² + β·Σ_{i=1}^n (xi − x̄)xi/Σ_{i=1}^n (xi − x̄)²
      = β,

where in the last step we used that Σ_{i=1}^n (xi − x̄) = 0 and Σ_{i=1}^n (xi − x̄)xi = Σ_{i=1}^n (xi − x̄)².
In other words, the expectation of the estimator β̂ is β itself. This has a very important application: β̂ is a random quantity itself, since it depends on the εi's, which we have assumed to be random. Now, for any random variable Z, we measure the approximate average distance from its expectation (= dispersion) by the standard deviation of the variable. So, how far β̂ is from E[β̂] = β on average, when we keep repeating the experiment, is given by σ_β̂. But the distance between β̂ and β is the estimation error of our estimate. So, in other words, the average size of the estimation error (when we estimate β) is given by σ_β̂, for which we have a closed expression given in equation 17.10 above.
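For the shop data, with the estimate σ̂ = √2 standing in for σ, equation 17.10 gives:

```python
import math

sigma = math.sqrt(2)               # estimated σ from the shop example
xs = [1, 2, 3, 4, 6]               # shop sizes
x_bar = sum(xs) / len(xs)

# Standard deviation of β̂ from equation 17.10
sigma_beta = sigma / math.sqrt(sum((x - x_bar) ** 2 for x in xs))
print(sigma_beta)                  # ≈ 0.368
```

So, for this data, the typical estimation error when estimating β is roughly 0.37, which is consistent with our least-squares estimate β̂ ≈ 3.8 lying within about one such error of the true β = 4.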
17.5 How precise are our estimates

17.6 Multiple factors and or polynomial regression

17.7 Other applications