
AdvStats - W1 - Descriptive Stats

This document provides an overview of descriptive statistics concepts including: 1) Measures of central tendency like the mean and median are used to describe the central or typical value in a data set. Measures of dispersion like variance and standard deviation describe how concentrated data values are around the central value. 2) Correlation and regression analyze relationships between variables and form the basis of analyzing effects. The line of best fit minimizes the sum of squared residuals to best capture the linear relationship between variables. 3) Ordinary least squares regression chooses coefficients a and b that generate the regression line Zi = a + bXi which best predicts the dependent variable Y based on the independent variable X.

Uploaded by

Pedro Fernandez

Advanced Statistics

Descriptive Statistics

Economics
University of Manchester

Numerical data summaries
LOCATION
‘central value’ - mean and median
DISPERSION/SPREAD
how concentrated are data values around central location – variance and standard deviation
CORRELATION AND REGRESSION
Relationships between variables
Building block of analysis of “effects”

Notation for Variables
VARIABLE
A “variable” is simply a label, with description, for an event of
interest (X,Y,Z, etc)

For example:
“Let X = your weekly expenditure on food”
“Let P = the price of a litre of petrol”
“Let Y = your household income”
X1 = ......; X2 = ......; X3 = ......; X4 = ......; etc.

Xi: in general, denotes the i-th observation on the variable X.

Note the use of “subscript i”.
If we have X1, X2, ..., Xn, this is a (random) sample of n observations on X.
Summation: “adding up”

We use the following definitions and notation to signify summing up:

Greek “SIGMA”, \(\Sigma\):

\[ \sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n \]

Dummy SUBSCRIPT:

\[ \sum_{i=1}^{n} X_i = \sum_{j=1}^{n} X_j \]
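As a minimal sketch of the summation notation above, assuming an invented sample of four observations (the values and the names `X`, `total_i` and `total_j` are illustrative and not taken from the slides), the sum can be written in Python as follows. It also shows that the subscript is a dummy: renaming i to j leaves the total unchanged.

```python
# Hypothetical sample of n = 4 observations on X (values invented for illustration).
X = [12.0, 7.5, 9.0, 11.5]
n = len(X)

# Sigma notation: sum over i = 1, ..., n of X_i.
# Python lists are indexed from 0, so X_i corresponds to X[i - 1].
total_i = sum(X[i - 1] for i in range(1, n + 1))

# The subscript is a dummy: using j instead of i gives the same sum.
total_j = sum(X[j - 1] for j in range(1, n + 1))

print(total_i, total_j)
```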
Measures of location
Suppose we have sample data: X1, X2, …, Xn

Sample (arithmetic) mean:

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

Weighted mean sometimes used (e.g., price indices):

\[ \sum_{i=1}^{n} w_i X_i = (w_1 X_1 + w_2 X_2 + \dots + w_n X_n), \quad \text{with } \sum_{i=1}^{n} w_i = 1 \]
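The two measures of location above can be sketched in a few lines of Python. The sample values and the weights below are invented for illustration; the only requirement carried over from the slide is that the weights sum to 1.

```python
# Hypothetical sample (values invented for illustration).
X = [2.0, 4.0, 9.0]
n = len(X)

# Sample (arithmetic) mean: (1/n) * sum of X_i.
mean = sum(X) / n

# Hypothetical weights; they must sum to 1.
w = [0.5, 0.25, 0.25]
assert abs(sum(w) - 1.0) < 1e-12

# Weighted mean: sum of w_i * X_i.
weighted_mean = sum(w_i * x_i for w_i, x_i in zip(w, X))

# With equal weights w_i = 1/n the weighted mean reduces to the sample mean.
equal_weighted = sum((1.0 / n) * x_i for x_i in X)

print(mean, weighted_mean, equal_weighted)
```

The last line of the calculation illustrates that the arithmetic mean is the special case of the weighted mean in which every observation gets weight 1/n.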
Measures of dispersion I
(Mean) deviation: \(x_i = X_i - \bar{X}\)
Mean Squared Deviation (MSD):

\[ \frac{1}{n} \sum_{i=1}^{n} x_i^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
\[ = \frac{1}{n} \left[ (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2 \right] \]
\[ = \left( \frac{1}{n} \sum_{i=1}^{n} X_i^2 \right) - (\bar{X})^2 \]

• Note: MSD ≥ 0

• Manipulations in Worked Exercises Q1
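A small numerical check of the MSD, using an invented three-point sample (the data and variable names are assumptions made for illustration). It computes the MSD directly from the deviations and again from the "mean of squares minus squared mean" form, and the two should agree up to rounding.

```python
# Hypothetical sample (values invented for illustration).
X = [2.0, 4.0, 9.0]
n = len(X)
x_bar = sum(X) / n

# Deviations from the mean: x_i = X_i - X-bar. They always sum to zero.
x = [X_i - x_bar for X_i in X]

# MSD computed directly as the mean of the squared deviations.
msd_direct = sum(x_i ** 2 for x_i in x) / n

# MSD computed as (mean of the X_i squared) minus (X-bar squared).
msd_shortcut = sum(X_i ** 2 for X_i in X) / n - x_bar ** 2

print(sum(x), msd_direct, msd_shortcut)
```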


Measures of dispersion II
Sample variance:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n-1} \sum_{i=1}^{n} x_i^2 \]

Sample standard deviation: \(s = +\sqrt{s^2}\)

Note:
1 Divisor of n – 1, not n, for statistical reasons
2 Both s² and s ≥ 0
3 Sample mean and s are measured in the units of the original data
4 Variance measured in squared units
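The sample variance and standard deviation can be sketched as below, again on an invented sample. The cross-check uses Python's standard `statistics` module, whose `variance` and `stdev` functions use the n – 1 divisor; the data values are assumptions made for illustration.

```python
import math
import statistics

# Hypothetical sample (values invented for illustration).
X = [2.0, 4.0, 9.0]
n = len(X)
x_bar = sum(X) / n

# Sample variance: divisor n - 1, not n.
s2 = sum((X_i - x_bar) ** 2 for X_i in X) / (n - 1)

# Sample standard deviation: positive square root of the variance.
s = math.sqrt(s2)

# Mean squared deviation (divisor n), for comparison with s2.
msd = sum((X_i - x_bar) ** 2 for X_i in X) / n

print(s2, s, msd)
print(statistics.variance(X), statistics.stdev(X))
```

Because the divisor is smaller, s² is larger than the MSD whenever the data are not all identical; the two are related by s² = MSD × n / (n – 1).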
Regression
Summarise the data/scatter with line of best fit
Data (Xi, Yi), i = 1, 2, …, n
LINEAR relationship between Y and X?
What is the equation of the line of BEST FIT?

[Figure: scatter plot “Salary against education”; vertical axis: Salary (000s), horizontal axis: Years of FE]
Line of best fit I

Consider the following scatter (Xi, Yi), i = 1, 2, …, 6
Line of best fit II

Z=a+bX

Draw any straight line through the points: depicted by the green line, Z=a+bX, where a and b define the intercept and slope, respectively.
Line of best fit III

Z=a+bX

Choose a particular value of X in the sample and get the corresponding Z value from the line that has been drawn.

Now compare this Z value with the actual Y value in the sample associated with the chosen X.
Line of best fit IV

Z=a+bX

The difference is called a residual.

Residual = Y – Z
= Y - a - bX
Line of best fit V

Z=a+bX

Can construct such a residual for each X value in the sample.

Then square the residual values.

Sum these squares to get the “sum of squared residuals” associated with the line Z=a+bX
Line of best fit VI

Z=a+bX

This sum of squared residuals is defined by:

\[ \sum_{i=1}^{n} (Y_i - Z_i)^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2 \]
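The construction on the preceding slides (residual for each observation, square, then sum) can be sketched as a short Python function. The six data points and the candidate line Z = 1 + 1·X are invented for illustration, and the function name `sum_squared_residuals` is a hypothetical helper, not something defined in the slides.

```python
# Hypothetical scatter of six points (values invented for illustration).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [2.0, 3.0, 5.0, 4.0, 6.0, 7.0]


def sum_squared_residuals(a, b, X, Y):
    """Return the sum over i of (Y_i - a - b*X_i)^2 for the line Z = a + bX."""
    # One residual per observation: actual Y minus the Z value on the line.
    residuals = [Y_i - (a + b * X_i) for X_i, Y_i in zip(X, Y)]
    # Square each residual and add them up.
    return sum(e ** 2 for e in residuals)


# Sum of squared residuals for an arbitrary candidate line Z = 1 + 1*X.
ssr_z = sum_squared_residuals(1.0, 1.0, X, Y)
print(ssr_z)
```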
Line of best fit VII

Z=a+bX

W=c+dX

Could repeat this for any other line; e.g., W=c+dX

This will give rise to a different set of residuals:

Residual = Y - W
= Y - c - dX
Line of best fit VIII

Z=a+bX

W=c+dX

And, therefore, a sum of squared residuals:

\[ \sum_{i=1}^{n} (Y_i - W_i)^2 = \sum_{i=1}^{n} (Y_i - c - dX_i)^2 \]
Line of best fit IX

Z=a+bX

W=c+dX

Which line is better?

Choose the line which has the smaller sum of squared residuals. That is, choose Z = a+bX if

\[ \sum_{i=1}^{n} (Y_i - a - bX_i)^2 < \sum_{i=1}^{n} (Y_i - c - dX_i)^2 \]
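The comparison of two candidate lines can be sketched as follows. The data and both lines (Z = 1 + 1·X and W = 3 + 0.5·X) are invented for illustration, and `sum_squared_residuals` is a hypothetical helper name.

```python
# Hypothetical scatter of six points (values invented for illustration).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [2.0, 3.0, 5.0, 4.0, 6.0, 7.0]


def sum_squared_residuals(intercept, slope, X, Y):
    """Sum of squared residuals of the data around the line intercept + slope*X."""
    return sum((Y_i - intercept - slope * X_i) ** 2 for X_i, Y_i in zip(X, Y))


# Two arbitrary candidate lines: Z = a + bX and W = c + dX.
a, b = 1.0, 1.0
c, d = 3.0, 0.5

ssr_z = sum_squared_residuals(a, b, X, Y)
ssr_w = sum_squared_residuals(c, d, X, Y)

# Prefer the line with the smaller sum of squared residuals.
better = "Z" if ssr_z < ssr_w else "W"
print(ssr_z, ssr_w, better)
```

On this invented data the line Z is preferred, but neither candidate need be the best possible line; that requires minimising over all values of the intercept and slope.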
Line of best fit: summary

Choose that line (i.e., choose a and b) which minimises

\[ \sum_{i=1}^{n} (Y_i - a - bX_i)^2 = \sum_{i=1}^{n} Y_i^2 + na^2 + b^2 \sum_{i=1}^{n} X_i^2 - 2a \sum_{i=1}^{n} Y_i - 2b \sum_{i=1}^{n} Y_i X_i + 2ab \sum_{i=1}^{n} X_i \]

This is called Ordinary Least Squares Regression or, “regressing Y on X”.
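As a quick numerical check of the expansion of the sum of squared residuals, the sketch below evaluates both sides for an invented data set and an arbitrary choice of a and b. Note that the constant term contributes n·a² because a² is added once for each of the n observations. All values and names are assumptions made for illustration.

```python
# Hypothetical scatter of six points (values invented for illustration).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [2.0, 3.0, 5.0, 4.0, 6.0, 7.0]
n = len(X)

# Arbitrary intercept and slope at which to evaluate both sides.
a, b = 3.0, 0.5

# Left-hand side: sum of squared residuals computed directly.
direct = sum((Y_i - a - b * X_i) ** 2 for X_i, Y_i in zip(X, Y))

# Right-hand side: the expanded form, term by term.
expanded = (
    sum(Y_i ** 2 for Y_i in Y)
    + n * a ** 2
    + b ** 2 * sum(X_i ** 2 for X_i in X)
    - 2 * a * sum(Y)
    - 2 * b * sum(Y_i * X_i for X_i, Y_i in zip(X, Y))
    + 2 * a * b * sum(X)
)

print(direct, expanded)
```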
The regression equation
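Regression line (line of best fit) has the mathematical form:

\[ Z_i = \hat{Y}_i = a + bX_i \]

The intercept and slope are given by (using calculus):

\[ a = \bar{Y} - b\bar{X}, \]

\[ b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \sum_{i=1}^{n} w_i y_i \]

Here \(y_i = Y_i - \bar{Y}\) and \(w_i = x_i / \sum_{j=1}^{n} x_j^2\).

Derived in Worked Exercises Q2.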
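The intercept and slope formulas above can be applied directly, as in the following minimal sketch. The six data points are invented for illustration and the helper `ssr` is hypothetical; the code also compares the fitted line with one arbitrary alternative to illustrate that the least squares line has the smaller sum of squared residuals.

```python
# Hypothetical scatter of six points (values invented for illustration).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [2.0, 3.0, 5.0, 4.0, 6.0, 7.0]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# Deviations from the means: x_i = X_i - X-bar, y_i = Y_i - Y-bar.
x = [X_i - x_bar for X_i in X]
y = [Y_i - y_bar for Y_i in Y]

# Slope: sum of x_i*y_i divided by sum of x_i squared.
sum_xx = sum(x_i ** 2 for x_i in x)
b = sum(x_i * y_i for x_i, y_i in zip(x, y)) / sum_xx

# Equivalent weighted form: b = sum of w_i*y_i with w_i = x_i / sum of x_j squared.
w = [x_i / sum_xx for x_i in x]
b_weighted = sum(w_i * y_i for w_i, y_i in zip(w, y))

# Intercept: a = Y-bar - b * X-bar.
a = y_bar - b * x_bar


def ssr(intercept, slope):
    """Sum of squared residuals of the data around intercept + slope*X."""
    return sum((Y_i - intercept - slope * X_i) ** 2 for X_i, Y_i in zip(X, Y))


print(a, b, ssr(a, b), ssr(1.0, 1.0))
```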


Sample correlation
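Gives an INDEX of the LINEAR relationship between observed Y and X.
Defined as

\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}} = b \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}} \]

• NB: \(-1 \le r \le 1\)

[Figure: small scatter plots of Y against X illustrating different values of r, including 0 < r < 1 and r = 0]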
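A minimal sketch of the sample correlation, again on an invented data set. It computes r from the deviations and then checks the link to the regression slope, r = b × sqrt(Σx² / Σy²); the data values and variable names are assumptions made for illustration.

```python
import math

# Hypothetical scatter of six points (values invented for illustration).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [2.0, 3.0, 5.0, 4.0, 6.0, 7.0]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# Sums of squares and cross-products of the deviations from the means.
sum_xy = sum((X_i - x_bar) * (Y_i - y_bar) for X_i, Y_i in zip(X, Y))
sum_xx = sum((X_i - x_bar) ** 2 for X_i in X)
sum_yy = sum((Y_i - y_bar) ** 2 for Y_i in Y)

# Sample correlation coefficient.
r = sum_xy / math.sqrt(sum_xx * sum_yy)

# Regression slope, and the same r recovered from it.
b = sum_xy / sum_xx
r_from_b = b * math.sqrt(sum_xx / sum_yy)

print(r, r_from_b)
```

Since the square root is positive, r always has the same sign as the slope b.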
Regression and correlation:
calculations and interpretation
These calculations are illustrated in the last four slides.

But in practice use Excel, for example.

EXAMPLE: Data on salary (£000) and education (years of FE):

r = 0.83
Regression line has +ve slope (higher salary with more years FE); regression line fits fairly well.

Line of best fit:

\[ \hat{Y}_i = 16.3 + 2.86 X_i \]

Salary (on average) estimated to be higher by £2,860 for each additional year of FE;

Salary estimated to be £16,300 when no FE (Xi = 0).
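The interpretation of the fitted line can be made concrete with a short sketch that evaluates the line of best fit quoted above (intercept 16.3, slope 2.86, salary in £000). The function name `predicted_salary` and the chosen years of FE are hypothetical; only the two coefficients come from the slide.

```python
def predicted_salary(years_fe):
    """Fitted salary in £000 from the line Y-hat = 16.3 + 2.86 * X."""
    return 16.3 + 2.86 * years_fe


# Fitted salary with no FE: the intercept, i.e. £16,300.
no_fe = predicted_salary(0)

# Fitted salary with three years of FE (an arbitrary illustrative value).
three_years = predicted_salary(3)

# The difference between consecutive years is the slope: £2,860 per extra year.
extra_per_year = predicted_salary(4) - predicted_salary(3)

print(no_fe, three_years, extra_per_year)
```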
Example
67 industrial firms, cross-section data:
CEO salary (in 1990, thousand US$)
Firm sales (in 1990, million US$)

[Figure: scatter plot “CEO Salary versus Firm Sales”; vertical axis: Salary, horizontal axis: Sales]
Example continued
r = 0.53
Regression line:

\[ \hat{Y}_i = 930.3 + 0.025 X_i \]

Xi : firm sales (million US$)
Yi : CEO salary (thousand US$)
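Because the two variables are measured in different units, it can help to spell out what the slope means. The sketch below uses only the coefficients and units stated above; the function name and the sales figure of 20,000 (million US$) are hypothetical choices for illustration.

```python
def predicted_ceo_salary(sales_million_usd):
    """Fitted CEO salary in thousand US$ from Y-hat = 930.3 + 0.025 * X."""
    return 930.3 + 0.025 * sales_million_usd


# The slope is in thousand US$ of salary per million US$ of sales,
# so one extra million of sales goes with about 0.025 * 1000 = 25 US$ more salary.
slope_usd_per_million_sales = 0.025 * 1000

# Fitted salary (thousand US$) for a hypothetical firm with sales of 20,000 million US$.
example_salary = predicted_ceo_salary(20000.0)

print(slope_usd_per_million_sales, example_salary)
```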

