Course Notes For Unit 6 of The Udacity Course ST101 Introduction To Statistics PDF
Contents:
Regression
Correlation
Monty Hall Problem (Optional)
Answers
Regression
You should remember from the very beginning of this course that we can have data
sets with more than one dimension. For example, the size of a house relative to its
price:
In the first unit we saw different ways of presenting the data, like scatter-plots or bar
charts, but we didn’t look at what might be thought of as the ‘Holy Grail’ of statistics,
fitting a line to the data points:
By the end of this unit you will be able to fit a line to data points like these, and you
will even be able to state what the residual error is in that fit. This will allow you not
only to understand the data, but also to make predictions about points that you have
never seen before.
Lines
How do we specify a line? Well, suppose we have a horizontal axis, x, and a vertical
axis, y. A straight line is commonly described by a functional relationship between x
and y of the form:
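y = bx + a
where b is the slope of the line and a is the value of y at which it crosses the y-axis.
For example, take the line y = 2x, i.e. where b = 2 and a = 0.
If x = 0 then y = 0. This means that the line must go through the origin. If x = 2 then
y = 4. We now have two points and we can draw our line: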
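Since two points are enough to pin down a straight line, the coefficients can be recovered from them directly. The following is a minimal Python 3 sketch; the helper name line_through is hypothetical (it is not part of the course material), and it assumes the two points have different x-values.

```python
def line_through(p1, p2):
    """Return (b, a) for the line y = b*x + a through two points.

    Hypothetical helper for illustration; assumes p1 and p2 have different x-values.
    """
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)   # slope: change in y per unit change in x
    a = y1 - b * x1             # intercept: value of y when x = 0
    return b, a


# The points (0, 0) and (2, 4) quoted above give b = 2 and a = 0.
print(line_through((0, 0), (2, 4)))
```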
What about the line y = -x + 4, i.e. where b = -1 and a = 4? Which of these lines best
matches this function?
Find Coefficients Quiz
What are the coefficients a and b for the line shown below:
y = bx + a
Linear Regression
If we have 2-dimensional data, for example the age of a person and that person’s
income, linear regression is a technique for trying to fit a line that best describes that
data:
In linear regression we are given data (having more than 1-dimension), and we
attempt to find the best line to fit the data. To put this differently, we are trying to
identify the parameters a and b for the function:
y = bx + a
We are trying to find the line that is the best-fit to our data, and the word ‘best’ is
interesting in this context. Obviously it is impossible to draw a straight line that
passes through every point in this data set:
This is what is known as non-linear data - the data points go up and down. Data
often looks like this, even when the relationship between x and y is linear. This is
because there is usually what is known as noise in the data. Noise is a random
element in the data that we cannot explain.
In trying to find the ‘best-fit’ to the data, we are trying to find a line that minimises
the differences between the data points and the line in the y-direction:
The reason for this is that we are assuming that our data results from some unknown
linear function plus noise:
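y = bx + a + noise
The best-fit line is the one that minimises the sum of the squared differences (the
quadratic loss) over all of the data points $(x_i, y_i)$:
$$\sum_i (b x_i + a - y_i)^2$$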
In the case of the blue line, there is no loss for the three points that it passes through,
but a fairly substantial loss for the other three points:
If the distance along the y-axis between the blue line and the data points is c, then the
error, e, for the blue line is:
$$e = 3c^2$$
since there are three points, and we are using the quadratic distance.
Similarly for the red line, it suffers no loss for the three points that it passes through,
but a fairly substantial loss for the other three points, and the error, e , for the red line
is also:
$$e = 3c^2$$
In the case of the green line there are errors for all six points, but the error is only half
as big in each case:
Now, the error is:
$$e = 6 \left(\frac{c}{2}\right)^2 = \frac{6}{4} c^2 = \frac{3}{2} c^2$$
This means that the total quadratic error for the green line is half as big as it is for
either the blue or the red lines. This is because, when we use the quadratic, larger
errors count much, much more than smaller errors. Clearly, in this case, the green line
would be the best fit for these data points.
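The comparison above can be reproduced numerically. The following Python 3 sketch uses made-up data (an assumption, since the plotted points are not listed in these notes): three points at height 0 and three at height c, with horizontal candidate lines at heights 0, c and c/2 standing in for the blue, red and green lines.

```python
def quadratic_error(b, a, xs, ys):
    """Sum of squared vertical distances between the line y = b*x + a and the data."""
    return sum((b * x + a - y) ** 2 for x, y in zip(xs, ys))


c = 2.0                                  # hypothetical offset between the two rows of points
xs = [1, 2, 3, 4, 5, 6]
ys = [0.0, c, 0.0, c, 0.0, c]            # illustrative data, not taken from the course figure

blue = quadratic_error(0.0, 0.0, xs, ys)      # passes through the three points at height 0
red = quadratic_error(0.0, c, xs, ys)         # passes through the three points at height c
green = quadratic_error(0.0, c / 2, xs, ys)   # runs midway between the two rows

print(blue, red, green)   # 3*c**2, 3*c**2 and 1.5*c**2
```

With these stand-in values the middle line has half the total quadratic error of either of the other two, which matches the calculation above.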
Will the parameters a and b be negative or positive for the data shown below?
y = bx + a
Regression Formula
The function y = bx + a is the Holy Grail of linear regression, and much of statistics is
concerned with how to use the data to determine the value of b, and the value of a. If
we can do this with the data, then we have solved the problem of fitting the best line.
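Suppose that we have N data points:
x: x1 x2 x3 … xN
y: y1 y2 y3 … yN
The slope, b, is then given by:
$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
where $\bar{x}$ and $\bar{y}$ are the mean values of x and y.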
Previously, we’ve used μ for the mean, but now that we have more than one variable
we are going to use the bar-notation.
This formula shouldn’t be entirely unfamiliar. When we calculated the variance, we
used $(x_i - \bar{x})^2$, whereas now that we have 2-dimensional data we are using
$(x_i - \bar{x})(y_i - \bar{y})$.
Now that we know b, and we know that:
y = bx + a
It turns out that this function is also true for the average values, $\bar{x}$ and $\bar{y}$. We can
calculate the value for a using:
$$a = \bar{y} - b\bar{x}$$
x y
6 7
2 3
1 2
-1 0
Now, with this data set we notice that the y-value is always exactly 1 larger than the
x-value, so:
y = x + 1
Let’s see what we get when we use the formula. The first thing we do is to calculate
the means:
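$$\bar{x} = \frac{6 + 2 + 1 - 1}{4} = \frac{8}{4} = 2$$
$$\bar{y} = \frac{7 + 3 + 2 + 0}{4} = \frac{12}{4} = 3$$
Then we calculate b:
$$b = \frac{(6-2)(7-3) + (2-2)(3-3) + (1-2)(2-3) + (-1-2)(0-3)}{(6-2)^2 + (2-2)^2 + (1-2)^2 + (-1-2)^2}$$
$$b = \frac{16 + 0 + 1 + 9}{16 + 0 + 1 + 9} = 1$$
Now we can calculate a using:
$$a = \bar{y} - b\bar{x} = 3 - 1 \times 2 = 1$$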
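The same calculation can be written as a short Python 3 sketch. The function names mean and fit_line are illustrative choices rather than anything defined in the course, and the code assumes that the x-values are not all identical (otherwise the denominator is zero).

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Return (b, a) of the least-squares line y = b*x + a."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    b = numerator / denominator
    a = y_bar - b * x_bar
    return b, a


xs = [6, 2, 1, -1]
ys = [7, 3, 2, 0]
b, a = fit_line(xs, ys)
print(b, a)           # 1.0 1.0

# Predict y for an x-value that is not in the data set.
print(b * 10 + a)     # 11.0
```

Once b and a are known, predicting y for an unseen x is just a matter of evaluating the line, as the last line of the sketch does.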
Regression Quiz
Calculate the values for a and b for the following data set:
x y
4 7
3 9
7 1
2 11
Correlation
This section is all about correlation. Correlation is a measure of how closely two
variables are related. We can calculate something called the correlation coefficient, r,
which quantifies this relationship and always lies in the range:
-1 ≤ r ≤ 1
If the data is completely linear, without noise, as shown below, then we can fit a line
exactly to the data and the correlation will be 1:
The correlation coefficient can also be -1 in the case where the data is still perfectly
aligned to the line, but where there is a negative relationship between the two
variables as shown:
Correlation From Regression Quiz
Suppose that we ran linear regression on our data, and we found that:
b = 4
a = -3
y = 4x - 3
What does this tell us about the correlation coefficient, r?
r is positive
r is negative
r=0
We can’t tell
Correlation Formula
Well, one way to compute the correlation coefficient is very similar to the way that
we calculated the value of b when we looked at linear regression:
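$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$
Where: $\bar{x} = \frac{1}{N}\sum_i x_i$ and $\bar{y} = \frac{1}{N}\sum_i y_i$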
Now this is probably the most complex formula that you have encountered so far in
this class, but you will see it is related to a lot of the stuff we have seen before, like
variance and so on, and that using it is not that difficult in practice.
If we look at the denominator:
$$\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$$
We notice that the term within the square root is the product of the non-normalised
variance of x and the non-normalised variance of y.
If we now look at the numerator:
$$\sum_i (x_i - \bar{x})(y_i - \bar{y})$$
This is sometimes called the covariance, because it calculates the variance over two
co-occurring variables. However, you will have noticed that, once again, there is a
missing normaliser. What has happened is that the normalisers on the numerator and
denominator cancel each other out.
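To see why the normalisers cancel, we can write the correlation coefficient with the 1/N factors included. This is a short worked derivation using the same symbols as the correlation formula:

```latex
r = \frac{\frac{1}{N}\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\frac{1}{N}\sum_i (x_i - \bar{x})^2 \cdot \frac{1}{N}\sum_i (y_i - \bar{y})^2}}
  = \frac{\frac{1}{N}\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\frac{1}{N}\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}
  = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}
```

The two factors of 1/N under the square root combine to give a single 1/N outside it, which cancels the 1/N in the numerator.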
Computing Correlation
x: 3 4 5
y: 7 8 9
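$$\bar{x} = \frac{1}{3}(3 + 4 + 5) = \frac{12}{3} = 4$$
$$\bar{y} = \frac{1}{3}(7 + 8 + 9) = \frac{24}{3} = 8$$
$x_i - \bar{x}$: -1 0 1
$y_i - \bar{y}$: -1 0 1
$$r = \frac{(-1)(-1) + 0 \cdot 0 + 1 \cdot 1}{\sqrt{\left((-1)^2 + 0^2 + 1^2\right)\left((-1)^2 + 0^2 + 1^2\right)}} = \frac{2}{\sqrt{2 \cdot 2}} = \frac{2}{2} = 1$$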
Let’s look at a different dataset:
x: 3 4 5
y: 2 5 8
$$\bar{x} = \frac{1}{3}(3 + 4 + 5) = \frac{12}{3} = 4$$
$$\bar{y} = \frac{1}{3}(2 + 5 + 8) = \frac{15}{3} = 5$$
Guess r Quiz
Before we do the calculation, let’s see what your intuition says about the value for r in
this case. Which of the following do you think is correct?
r=1
r=3
r=2
r=0
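$x_i - \bar{x}$: -1 0 1
$y_i - \bar{y}$: -3 0 3
$$r = \frac{(-1)(-3) + 0 \cdot 0 + 1 \cdot 3}{\sqrt{\left((-1)^2 + 0^2 + 1^2\right)\left((-3)^2 + 0^2 + 3^2\right)}} = \frac{6}{\sqrt{2 \cdot 18}} = \frac{6}{6} = 1$$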
Which is as we expected.
Reverse Order
Let’s see what happens if we change the order of the y-values as shown:
x: 3 4 5
y: 8 5 2
Since only the order has been changed and the actual values are unchanged, the mean
values will be the same as before:
$$\bar{x} = \frac{1}{3}(3 + 4 + 5) = \frac{12}{3} = 4$$
$$\bar{y} = \frac{1}{3}(8 + 5 + 2) = \frac{15}{3} = 5$$
Now, since the y-values decrease as the x-values increase we would expect the
correlation coefficient to be negative. Let’s see what happens when we calculate the
value of r.
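$x_i - \bar{x}$: -1 0 1
$y_i - \bar{y}$: 3 0 -3
And once again, we just plug the values into our formula:
$$r = \frac{(-1)(3) + 0 \cdot 0 + 1 \cdot (-3)}{\sqrt{\left((-1)^2 + 0^2 + 1^2\right)\left(3^2 + 0^2 + (-3)^2\right)}} = \frac{-6}{\sqrt{2 \cdot 18}} = \frac{-6}{6} = -1$$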
Let’s try something a little bit tricky. Let’s change the y-values to:
x: 3 4 5
y: 8 5 8
What seems to be happening is that the y-values start to decrease as the x-values
increase, but then they start to increase again. While there may be a relationship
between x and y, it doesn’t appear to be a linear one, and we would expect the
correlation to be zero.
$$\bar{x} = \frac{1}{3}(3 + 4 + 5) = \frac{12}{3} = 4$$
$$\bar{y} = \frac{1}{3}(8 + 5 + 8) = \frac{21}{3} = 7$$
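$x_i - \bar{x}$: -1 0 1
$y_i - \bar{y}$: 1 -2 1
And if we plug the values into our formula we get the correlation coefficient:
$$r = \frac{(-1)(1) + 0 \cdot (-2) + 1 \cdot 1}{\sqrt{\left((-1)^2 + 0^2 + 1^2\right)\left(1^2 + (-2)^2 + 1^2\right)}} = \frac{0}{\sqrt{2 \cdot 6}} = 0$$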
Of course, we didn’t actually need to calculate the denominator. Once we knew the
numerator was zero we also knew that r = 0.
Final Example
x: 3 4 5
y: 8 3 7
Clearly, this doesn’t look very correlated, although it looks more correlated than the
data in the last example since the final y-value hasn’t increased by as much. We might
therefore expect that the correlation coefficient will be less than 1, and if we plot the
data on a graph, we see that it should also be negative:
As usual, the first thing we do is to calculate the mean values for x and y:
$$\bar{x} = \frac{1}{3}(3 + 4 + 5) = \frac{12}{3} = 4$$
$$\bar{y} = \frac{1}{3}(8 + 3 + 7) = \frac{18}{3} = 6$$
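$x_i - \bar{x}$: -1 0 1
$y_i - \bar{y}$: 2 -3 1
And if we plug the values into our formula we get the correlation coefficient:
$$r = \frac{(-1)(2) + 0 \cdot (-3) + 1 \cdot 1}{\sqrt{\left((-1)^2 + 0^2 + 1^2\right)\left(2^2 + (-3)^2 + 1^2\right)}} = \frac{-1}{\sqrt{2 \cdot 14}} \approx -0.189$$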
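The hand calculations in this section can be checked with a few lines of Python 3. This is a minimal sketch: the function name correlation is an illustrative choice, and the code assumes that neither variable is constant (otherwise the denominator is zero).

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r, computed from the un-normalised sums."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sqrt(sum((x - x_bar) ** 2 for x in xs)
                       * sum((y - y_bar) ** 2 for y in ys))
    return numerator / denominator


# The five sets of y-values used in the worked examples, all paired with x = 3, 4, 5.
xs = [3, 4, 5]
for ys in ([7, 8, 9], [2, 5, 8], [8, 5, 2], [8, 5, 8], [8, 3, 7]):
    print(ys, round(correlation(xs, ys), 3))
```

The printed values should match the results worked out by hand for each data set.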
You now understand the basics of correlation coefficients. As we said earlier, the
correlation coefficient, r, always lies between -1 and 1.
Correlation is a very powerful tool. For any data set with multiple variables, such as
salary versus age, you can now tell how closely the variables relate to each other
using the relatively simple formula:
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$
Monty Hall Problem (Optional)
In the game there are three doors. Behind one door is a car. Monty knows which door,
but he won’t tell you where the car is.
You get to choose a door. Say, for the sake of argument that you pick door number 2.
If the car is behind door 2, then you win the car, otherwise you win nothing. So far, so
good.
Now comes the interesting bit. Obviously, at least one of the two remaining doors
doesn’t have a car behind it. Now, Monty knows which door has the car, and he
reveals one of the doors that you didn’t choose that doesn’t have the car.
Monty then asks you if you want to switch from the door you have chosen to the other
closed door. Do you want to change your choice in the hope that you will increase
your chance of winning the car?
What makes this problem interesting is that when Monty opened the door, he really
didn’t tell you anything you didn’t already know. You already knew in advance that
one of the two doors didn’t contain the car. The fact that Monty has just opened a
door should give you zero information about which of the remaining doors the car is
behind.
If you chose door 2, and Monty then opened door 3, all you now know is that door 3
doesn’t contain the car. Why should door 1 now be more likely than door 2?
Given that you chose door 2, and Monty then revealed that door 3 didn’t contain the
car, what are the probabilities, P1, P2 and P3 that the car is behind each door?
Of course it is easy to see that P3 = 0. We already knew that the car wasn’t behind
door 3. But don’t worry if you weren’t able to figure out the other two probabilities;
the answer to this problem is entirely non-intuitive.
There are three possible true locations for the car, and we know that each of these
possible locations has a probability of 1/3.
When we choose a door, Monty will open one of the other doors. We can construct
the truth table as shown:
Let’s say, for the sake of argument that we have chosen door number 2. Now, we
know that on the first row of the truth table, if the car is behind door number 1 then
the probability that Monty will open door number 1 is zero. On the second line, the
probability that Monty will open door 2 is also zero because we have chosen door
number 2.
So the only option remaining is for Monty to open door 3. This has a probability of 1,
but the (unnormalised) posterior probability must take into account the prior probability
that the car is behind door 1, P(1) = 1/3, so the probability becomes 1 × 1/3 = 1/3.
Similarly for the case where the car is behind door number 3, the probability that
Monty will open door number 3 is zero, as is the probability that he will open the door
that we have chosen (door number 2). The only remaining choice is door number 1.
If the car is behind door number 2, and we have chosen door number 2, then there is a
fifty-fifty chance that Monty will open door number 1 or door number 3. The (unnormalised)
posterior probability for each of these doors is therefore 1/2 × 1/3 = 1/6.
Now we have our truth table based on the fact that we chose door number 2.
Let’s say that Monty opens door number 3. Now every other case, in which he might
have opened door number 1 or door number 2, has zero probability. The only events
that remain are highlighted here:
Now we just need to normalise these values so that the total probability adds up to 1.
Currently the sum of the probabilities is 1/3 + 1/6 = 1/2. If we divide each individual
probability by this sum we get the normalised probabilities: 2/3 for door number 1 and
1/3 for door number 2. These are the true posterior probabilities given that we chose
door number 2, and Monty opened door number 3.
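The normalisation step can be checked with exact fractions. The following Python 3 sketch is one way to encode the truth table; the function name p_monty_opens and the variable names are illustrative assumptions, not part of the course material.

```python
from fractions import Fraction

chosen, opened = 2, 3
prior = Fraction(1, 3)                    # the car is equally likely to be behind each door


def p_monty_opens(door, car, chosen):
    """Probability that Monty opens `door`, given the car's location and our choice."""
    if door == chosen or door == car:
        return Fraction(0)                # Monty never opens our door or the car's door
    if car == chosen:
        return Fraction(1, 2)             # two empty doors to pick from at random
    return Fraction(1)                    # only one door left that he can open


# Unnormalised values: prior times the probability of what Monty actually did.
joint = {car: prior * p_monty_opens(opened, car, chosen) for car in (1, 2, 3)}
total = sum(joint.values())               # 1/3 + 1/6 + 0 = 1/2
posterior = {car: p / total for car, p in joint.items()}

for car in (1, 2, 3):
    print("P%d = %s" % (car, posterior[car]))
```

Using Fraction rather than floating-point numbers keeps the results exact, so the output can be compared directly with the values 2/3, 1/3 and 0.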
Clearly we should switch our choice every time, as this will double our chances of
winning the car!
Simulation
Write a function simulate() that runs 1000 iterations of a simulation of the Monty Hall
problem and so empirically verifies the probabilities that we have just calculated.
The function should count how many times you win the car in a variable, K, and then
return K divided by the number of iterations, N.
Python’s random module includes the function randint(). This generates random integers in
the range specified by the function arguments (both end points included), so:
randint(1,3)
will return a random integer in the range 1 to 3 (which is exactly what you will need
in order to pick a random door).
You will also have to simulate the actions of Monty Hall. Sometimes these actions
will be deterministic, at other times they will be stochastic (random). Recall that we
saw both types of action when we constructed the truth-table.
Once the “Monty-simulator” has picked a door, flip your choice to the remaining door.
If that matches the true location then you should increment K, otherwise not.
When you run this 1000 times, the output from your function should be approximately
equal to 2/3.
The assignment will require some real knowledge of Python and is therefore
considered to be a challenging problem. Good luck.
Answers
Find Coefficients Quiz
a = 1
b = 0.25
Regression Quiz
$$\bar{x} = 4$$
$$\bar{y} = 7$$
$$b = \frac{-28}{14} = -2$$
a = 15
Correlation From Regression Quiz
r is positive (r always has the same sign as b)
Guess r Quiz
r=1
Door Chance Quiz
P1 = 0.667
P2 = 0.333
P3 = 0
Simulation
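from random import randint

N = 1000

def simulate(N):
    K = 0                                   # number of games won by switching
    for i in range(N):
        TrueLoc = randint(1, 3)             # door hiding the car
        guess = randint(1, 3)               # our initial choice
        if TrueLoc == guess:
            # Monty picks at random between the two empty doors.
            monty = randint(1, 3)
            while monty == TrueLoc:
                monty = randint(1, 3)
        else:
            # Only one door is neither our guess nor the car.
            monty = 6 - TrueLoc - guess
        # Switch to the remaining closed door (the door numbers sum to 6).
        switch = 6 - guess - monty
        if switch == TrueLoc:
            K = K + 1
    return float(K) / float(N)

print(simulate(N))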
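As a follow-up, it can be instructive to compare the "switch" strategy with the "stay" strategy side by side. The Python 3 sketch below is a variation on the answer above rather than part of the course solution: the function names play and win_rate, the fixed seed and the number of games are all illustrative assumptions.

```python
import random


def play(switch, rng):
    """Play one game; return True if the final choice wins the car."""
    car = rng.randint(1, 3)
    guess = rng.randint(1, 3)
    # Monty opens a door that is neither our guess nor the car.
    monty = rng.choice([d for d in (1, 2, 3) if d != guess and d != car])
    if switch:
        # Move to the one door that is neither our guess nor the opened door.
        guess = 6 - guess - monty
    return guess == car


def win_rate(switch, n=20000, seed=0):
    rng = random.Random(seed)     # fixed seed so that repeated runs give the same numbers
    wins = sum(play(switch, rng) for _ in range(n))
    return wins / n


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

The two win rates should come out close to 1/3 and 2/3 respectively, in line with the claim that switching doubles the chance of winning.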