AdvStats - W1 - Descriptive Stats
Descriptive Statistics
Economics
University of Manchester
Numerical data summaries
LOCATION
‘central value’ - mean and median
DISPERSION/SPREAD
how concentrated are data values around central
location – variance and standard deviation
CORRELATION AND REGRESSION
Relationships between variables
Building block of analysis of “effects”
Notation for Variables
VARIABLE
A “variable” is simply a label, with description, for an event of
interest (X,Y,Z, etc)
For example:
“Let X = your weekly expenditure on food”
“Let P = the price of a litre of petrol”
“Let Y = your household income”
X1 = ....... ; X2 = ....... ; X3 = ....... ; X4 = ....... ; etc.
Summation: “adding up”
Dummy SUBSCRIPT: $\sum_{i=1}^{n} X_i = \sum_{j=1}^{n} X_j$
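A minimal Python sketch of the dummy-subscript idea: the name of the index does not change the sum. The data values are made up purely for illustration.

```python
# Hypothetical sample values, made up for illustration only.
X = [4.0, 7.0, 1.0, 8.0]
n = len(X)

# Sum over the index i = 1, ..., n (Python lists are 0-based, hence X[i - 1]).
total_i = sum(X[i - 1] for i in range(1, n + 1))

# The same sum written with the index called j instead.
total_j = sum(X[j - 1] for j in range(1, n + 1))

print(total_i, total_j)  # both sums are 20.0
```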
Measures of location
Suppose we have sample data: X1, X2, …, Xn
$\sum_{i=1}^{n} w_i X_i = (w_1 X_1 + w_2 X_2 + \cdots + w_n X_n)$; with $\sum_{i=1}^{n} w_i = 1$
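As a rough illustration of the weighted sum, the sketch below checks that the weights add up to one and then forms w1 X1 + ... + wn Xn. The function name `weighted_sum` and the data are assumptions made for this example. With equal weights 1/n, the result is the ordinary sample mean.

```python
import math

def weighted_sum(weights, values):
    """Return w_1*X_1 + ... + w_n*X_n, requiring the weights to sum to 1."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))

# Hypothetical data, made up for illustration only.
X = [10.0, 20.0, 30.0, 40.0]

# Equal weights w_i = 1/n reproduce the sample mean.
equal = [1 / len(X)] * len(X)
mean_equal = weighted_sum(equal, X)

# Unequal weights pull the result towards the heavily weighted values.
unequal = [0.1, 0.2, 0.3, 0.4]
mean_unequal = weighted_sum(unequal, X)

print(mean_equal, mean_unequal)
```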
Measures of dispersion I
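(Mean) deviation: $x_i = X_i - \bar{X}$
Mean Squared Deviation (MSD)
$\frac{1}{n}\sum_{i=1}^{n} x_i^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$
$= \frac{1}{n}\left[ (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2 \right]$
$= \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2$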
• Note: MSD ≥ 0
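The two forms of the MSD (average squared deviation, and average of squares minus the squared mean) can be checked numerically. This is only a sketch; the function names and the data are assumptions for the example.

```python
def mean(values):
    """Sample mean: the sum of the values divided by n."""
    return sum(values) / len(values)

def msd_from_deviations(values):
    """MSD as the average of the squared deviations from the mean."""
    x_bar = mean(values)
    return sum((v - x_bar) ** 2 for v in values) / len(values)

def msd_shortcut(values):
    """MSD as the average of the squared values minus the squared mean."""
    x_bar = mean(values)
    return sum(v ** 2 for v in values) / len(values) - x_bar ** 2

# Hypothetical data, made up for illustration only.
X = [2.0, 4.0, 4.0, 6.0]

msd_a = msd_from_deviations(X)
msd_b = msd_shortcut(X)
print(msd_a, msd_b)  # both forms give 2.0 for this data
```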
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n-1}\sum_{i=1}^{n} x_i^2$
Note:
1 Divisor of n – 1, not n, so that $s^2$ is an unbiased estimator of the population variance
2 Both $s^2$ and $s$ ≥ 0
3 Sample mean and $s$ are measured in the units of the original data
4 Variance is measured in squared units
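A short sketch of the n − 1 divisor in practice. The helper `sample_variance` and the data are assumptions for this example. Python's standard `statistics.variance` also divides by n − 1, while `statistics.pvariance` divides by n, so the two serve as cross-checks.

```python
import math
import statistics

def sample_variance(values):
    """Sample variance s^2 with divisor n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((v - x_bar) ** 2 for v in values) / (n - 1)

# Hypothetical data, made up for illustration only.
X = [2.0, 4.0, 4.0, 6.0]

s2 = sample_variance(X)          # squared units of the data
s = math.sqrt(s2)                # same units as the data
msd = statistics.pvariance(X)    # divisor n, for comparison

print(s2, s, msd)
```

Because the divisor is smaller, s2 is always at least as large as the MSD computed from the same data.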
Regression
Summarise the data/scatter with line of best fit
Data (Xi, Yi), i = 1, 2, …, n
LINEAR relationship between Y and X?
What is the equation of the line of BEST FIT?
[Figure: scatter plot; horizontal axis: Years of FE]
Line of best fit I
(Xi, Yi), i = 1, 2, …, 6
Line of best fit II
Z=a+bX
Line of best fit III
Z = a + bX
Choose a particular value of X in the sample and get the corresponding Z value from the line that has been drawn.
[Figure: plot against X showing the line Z = a + bX and an observed Y value]
Line of best fit IV
Z = a + bX
Residual = Y − Z = Y − a − bX
[Figure: plot against X showing the line Z = a + bX, an observed Y value and its residual]
Line of best fit V
Z=a+bX
Line of best fit VI
Z=a+bX
$\sum_{i=1}^{n} (Y_i - Z_i)^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$
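The sum of squared residuals for a given line can be computed directly. The sketch below uses a hypothetical helper `sum_squared_residuals`, made-up data, and an arbitrary choice of a and b (not a fitted line).

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum over i of (Y_i - a - b*X_i)^2 for the line Z = a + bX."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, made up for illustration only.
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 4.0, 8.0]

# An arbitrary candidate line Z = 1 + 1.5 X.
ssr = sum_squared_residuals(1.0, 1.5, X, Y)
print(ssr)  # residuals are 0.5, 1, -1.5, 1, so the sum of squares is 4.5
```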
Line of best fit VII
Z = a + bX
W = c + dX
Could repeat this for any other line; e.g., W = c + dX
Residual = Y − W = Y − c − dX
[Figure: plot against X showing the lines Z = a + bX and W = c + dX and an observed Y value]
Line of best fit VIII
Z=a+bX
W=c+dX
$\sum_{i=1}^{n} (Y_i - W_i)^2 = \sum_{i=1}^{n} (Y_i - c - dX_i)^2$
Line of best fit IX
Z=a+bX
W=c+dX
$\sum_{i=1}^{n} (Y_i - a - bX_i)^2$ versus $\sum_{i=1}^{n} (Y_i - c - dX_i)^2$
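Comparing two candidate lines comes down to comparing their sums of squared residuals: the line with the smaller sum lies closer to the data in the least-squares sense. The following sketch uses made-up data and two arbitrary lines. Neither is claimed to be the line of best fit, which is the line with the smallest sum over all possible intercepts and slopes.

```python
def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum of squared vertical distances between the data and a line."""
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, made up for illustration only.
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 4.0, 8.0]

# Two arbitrary candidate lines: Z = 1 + 1.5 X and W = 2 + 1 X.
ssr_z = sum_squared_residuals(1.0, 1.5, X, Y)
ssr_w = sum_squared_residuals(2.0, 1.0, X, Y)

better = "Z" if ssr_z < ssr_w else "W"
print(ssr_z, ssr_w, better)  # 4.5 6.0 Z
```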
Line of best fit: summary
$\sum_{i=1}^{n} (Y_i - a - bX_i)^2 = \sum_{i=1}^{n} Y_i^2 + na^2 + b^2 \sum_{i=1}^{n} X_i^2 - 2a \sum_{i=1}^{n} Y_i - 2b \sum_{i=1}^{n} Y_i X_i + 2ab \sum_{i=1}^{n} X_i$
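One standard way to find the values of a and b that make this sum as small as possible is to differentiate with respect to each and set the derivatives to zero. The derivation below is a sketch of that argument; the symbol S(a, b) is introduced here only as shorthand for the sum of squared residuals.

```latex
\begin{align*}
S(a,b) &= \sum_{i=1}^{n} (Y_i - a - bX_i)^2 \\
\frac{\partial S}{\partial a} &= -2\sum_{i=1}^{n} (Y_i - a - bX_i) = 0
  \;\Rightarrow\; a = \bar{Y} - b\bar{X} \\
\frac{\partial S}{\partial b} &= -2\sum_{i=1}^{n} X_i (Y_i - a - bX_i) = 0
  \;\Rightarrow\; \sum_{i=1}^{n} X_i Y_i = a\sum_{i=1}^{n} X_i + b\sum_{i=1}^{n} X_i^2 \\
b &= \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}
   = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\end{align*}
```

The last line follows by substituting the expression for a into the second condition and collecting the terms in b.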
The regression equation
Regression line (line of best fit) has the mathematical
form:
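$Z_i = \hat{Y}_i = a + bX_i$
$a = \bar{Y} - b\bar{X}$,
$b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \sum_{i=1}^{n} w_i y_i$
where $x_i = X_i - \bar{X}$, $y_i = Y_i - \bar{Y}$ and $w_i = x_i / \sum_{j=1}^{n} x_j^2$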
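The intercept and slope formulas translate directly into code. The sketch below is an illustration only: the function name `fit_line` and the data are assumptions. It also checks that the slope can be written as a weighted sum of the deviations of Y from its mean, with weights equal to each X deviation divided by the sum of squared X deviations.

```python
import math

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; slope is undefined")
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, made up for illustration only.
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 4.0, 8.0]

a, b = fit_line(X, Y)

# Same slope as a weighted sum of the deviations of Y from its mean.
x_bar = sum(X) / len(X)
y_bar = sum(Y) / len(Y)
sxx = sum((x - x_bar) ** 2 for x in X)
weights = [(x - x_bar) / sxx for x in X]
b_weights = sum(w * (y - y_bar) for w, y in zip(weights, Y))

print(a, b, b_weights)  # approximately 1.5, 1.4, 1.4
```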
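Correlation coefficient: $r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}} = b \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}}$
• NB: $-1 \le r \le 1$
[Figure: scatter plots of Y against X illustrating different values of r, including 0 < r < 1 and r = 0]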
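A small numerical check of the correlation formula and its link to the regression slope b. The function name `correlation` and the data are assumptions made for this sketch.

```python
import math

def correlation(xs, ys):
    """Sample correlation: sum(x*y) / sqrt(sum(x^2) * sum(y^2)) in deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, made up for illustration only.
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 4.0, 8.0]

r = correlation(X, Y)

# Cross-check against r = b * sqrt(sum(x^2) / sum(y^2)).
x_bar = sum(X) / len(X)
y_bar = sum(Y) / len(Y)
sxx = sum((x - x_bar) ** 2 for x in X)
syy = sum((y - y_bar) ** 2 for y in Y)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx
r_from_b = b * math.sqrt(sxx / syy)

print(r, r_from_b)
```

Since the square-root factor is positive, r always has the same sign as the slope b.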
Regression and correlation:
calculations and interpretation
These calculations are illustrated in the last four slides
[Figure: scatter plot of Salary (vertical axis) against Sales (horizontal axis)]
Example continued
r = 0.53
Regression line:
[Figure: scatter plot of Salary (vertical axis) against Sales (horizontal axis)]