
StatiF 1 Slides

Uploaded by

ishanschneider00

STATISTICS I

WHAT IS STATISTICS?

Statistics is a way to get information from data

[Diagram, arrows numbered 1 to 4: Theory (knowledge): population/model, described by parameters. Practice (information): sample/data, described by statistics.]

Important concepts:
- population = group of all items of interest
- parameter = descriptive measure of a population
- sample = data drawn from the population [subset]
- statistic = descriptive measure of a sample


Descriptive statistics (sample data → information)

- collection: surveys → representative sample
- summarise (data reduction): percentages, averages, variances
- presentation: tables, figures and graphs

Probability theory (population → sample)

- what is the probability that we shall observe a particular outcome?
- parameters are supposed to be known
- deductive statistics (general → specific)

Inferential statistics (statistics → parameters)

- what can be said after we have observed a sample?
- parameters are unknown (although constant)!
- inductive statistics (specific → general)
TYPES OF VARIABLES

Variable:
- some characteristic of a population or sample
- it varies per item/object

"At the Olympic games of 2002, skater nr 14, Jochem Uytdehaage, finished first at the 10 km in
12:58:92" [new world record]

QUALITATIVE (categorical)
* arithmetic calculations are meaningless
- nominal (labels/names [Latin: nomen])
  * no numbers, no ordering, only encoding
  - e.g. nr. 14 in the quote
  - e.g. colour, nationality, political preference
- ordinal (labels with natural ordering/rank)
  * no numbers, although they are ordered
  * no measure for differences
  - e.g. first place in the quote
  - e.g. preferences, quality of scientific journals

QUANTITATIVE (numerical)

* arithmetic calculations are valid
- discrete (can only assume a limited number of values)
  - e.g. year 2002 in the quote
  - e.g. number of kids, number of correct answers (also as proportion!)
- continuous (can take on any value)
  - e.g. time 12:58:92 in the quote
  - e.g. weight: 76.8 or 76.823 or 76.823195 kg

The type of variable determines which statistical techniques are valid!
Example 1: watching TV (n=50)
Data: {C, C, P, P, C, C, P, C, C, P, P, P, C, P, P, P, C, C, C, C, C, P, I, C, P,
C, C, I, P, C, C, C, P, P, I, C, P, I, P, C, P, C, P, C, P, C, C, P, C, I, P, P, C}
Frequency Table
Channel         Frequency   Relative frequency (%)
Public          20          40
Commercial      25          50
International    5          10
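
As a minimal sketch, the relative frequencies in the table can be recomputed from the frequencies alone. The variable names are illustrative and the counts are copied from the table, not recounted from the raw data listing.

```python
# Frequencies copied from the frequency table (n = 50).
frequencies = {"Public": 20, "Commercial": 25, "International": 5}

n = sum(frequencies.values())

# Relative frequency = class frequency divided by the number of observations.
relative = {channel: count / n for channel, count in frequencies.items()}

for channel, share in relative.items():
    print(f"{channel:<15}{frequencies[channel]:>4}{share:>8.0%}")
```

The relative frequencies always add up to 1 (100%), which is a quick check on any frequency table.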

[Pie chart: Public 40%, Commercial 50%, International 10%]
Example 2: Statement “Statistics is fun” (n=16)
5 categories: strongly disagree, disagree, neutral, agree, strongly agree

Answer Frequency
strongly disagree 1
disagree 3
neutral 3
agree 5
strongly agree 4

[Bar chart: Opinion about "Statistics is fun"; x-axis: Opinion (strongly disagree to strongly agree), y-axis: Frequency (bars of height 1, 3, 3, 5, 4)]
DESCRIPTIVE STATISTICS

DISCRETE: X="number of kids" (of 50 employees)

Raw data: {2, 1, 3, 0, 1, 3, 4, 3, 0, 2, 2, 0, 2, 1, 5, 2, 0, 2, 0,
           2, 0, 1, 0, 1, 2, 4, 0, 0, 3, 0, 0, 3, 2, 1, 5, 0, 1, 0,
           2, 2, 1, 2, 3, 1, 4, 2, 0, 0, 1, 5}

  i     x_i   f_i   rf_i   f_i·x_i   rf_i·x_i
  1     0     15    0.30     0       0.00
  2     1     10    0.20    10       0.20
  3     2     13    0.26    26       0.52
  4     3      6    0.12    18       0.36
  5     4      3    0.06    12       0.24
  6     5      3    0.06    15       0.30
  total       50    1.00    81       1.62

$f_i$: frequency of the $i$-th class
$rf_i = f_i / n$: relative frequency
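
The frequency table can be rebuilt from the raw data. The sketch below is one possible way to do it with the standard library; `kids` and the other names are illustrative, not part of the slides.

```python
from collections import Counter

# Raw data: number of kids of 50 employees, copied from the slide.
kids = [2, 1, 3, 0, 1, 3, 4, 3, 0, 2, 2, 0, 2, 1, 5, 2, 0, 2, 0,
        2, 0, 1, 0, 1, 2, 4, 0, 0, 3, 0, 0, 3, 2, 1, 5, 0, 1, 0,
        2, 2, 1, 2, 3, 1, 4, 2, 0, 0, 1, 5]

n = len(kids)
counts = Counter(kids)  # maps each value x_i to its frequency f_i

print(" x_i  f_i  rf_i  f_i*x_i")
for x in sorted(counts):
    f = counts[x]
    print(f"{x:>4} {f:>4} {f / n:>5.2f} {f * x:>8}")

# Column total of f_i * x_i; dividing by n gives the sample mean.
total = sum(f * x for x, f in counts.items())
print("total f_i*x_i:", total, "mean:", total / n)
```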
Measures of Central Location (sample):

Mean/Average (arithmetic and unweighted): $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i \cdot x_i$
- $\bar{x} = \frac{0 + 10 + 26 + 18 + 12 + 15}{50} = \frac{81}{50} = 1.62$

  i     x_i   f_i   rf_i·x_i
  1     0     15    0
  2     1     10    0.2
  3     2     13    0.52
  4     3      6    0.36
  5     4      3    0.24
  6     5      3    0.3
  total       50    1.62

Median $M$ = middle value if $n$ is odd; mean of the two middle values if $n$ is even
- $M = \frac{1\ (\text{25th obs.}) + 2\ (\text{26th obs.})}{2} = 1.5$
- $M$ is robust: suppose a typo turns one 5 into 50 → then $\bar{x} = \frac{81 - 5 + 50}{50} = 2.52$, while $M$ stays 1.5

Mode [French] = most frequent observation/class
- mode = 0
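
A short sketch of the three location measures, using Python's standard `statistics` module on the frequency table for "number of kids". The typo scenario (one 5 recorded as 50) follows the robustness remark on the slide; variable names are illustrative.

```python
import statistics

# Frequency table from the slides: value x_i with frequency f_i.
values = [0, 1, 2, 3, 4, 5]
freqs = [15, 10, 13, 6, 3, 3]

# Expand the table back into the 50 individual observations.
kids = [x for x, f in zip(values, freqs) for _ in range(f)]

mean = statistics.mean(kids)
median = statistics.median(kids)  # n is even: mean of 25th and 26th ordered values
mode = statistics.mode(kids)

# Robustness check: one observation of 5 is mistyped as 50.
typo = kids.copy()
typo[typo.index(5)] = 50
mean_typo = statistics.mean(typo)
median_typo = statistics.median(typo)

print(mean, median, mode)
print(mean_typo, median_typo)
```

The mean shifts noticeably after the typo, whereas the median does not move, because the two middle observations are unchanged.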
Histogram

The sum of deviations is always zero: $\sum_{i=1}^{k} f_i \cdot (x_i - \bar{x}) = 0$
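
Why the deviations sum to zero can be seen in a few lines. The derivation below is a sketch that only uses the definition of the sample mean, $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i$, and the fact that the frequencies add up to $n$.

```latex
\begin{align*}
\sum_{i=1}^{k} f_i \,(x_i - \bar{x})
  &= \sum_{i=1}^{k} f_i x_i \;-\; \bar{x} \sum_{i=1}^{k} f_i \\
  &= n\,\bar{x} \;-\; \bar{x}\, n \\
  &= 0
\end{align*}
```

For the "number of kids" data: $15(0 - 1.62) + 10(1 - 1.62) + 13(2 - 1.62) + 6(3 - 1.62) + 3(4 - 1.62) + 3(5 - 1.62) = -30.5 + 30.5 = 0$.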
Types of histograms

[Figure: three relative-frequency histograms]

- symmetric: $\bar{x} = M$
- skewed to the right (positively skewed): $\bar{x} > M$
- skewed to the left (negatively skewed): $\bar{x} < M$
Comparison Location Measures

- mean/average: $\bar{x}$
  * a lot of theory available & efficient usage of the data
  * sensitive to extreme observations

- median: $M$
  * less sensitive to extreme observations
  * less efficient

- mode:
  * used infrequently
  * sometimes it's the only measure available (e.g. nominal data)
Geometric Mean [read yourself]:
Location measure for % change (e.g. returns):

Ex. Wordonline (Tiscali)

prices: 40 50 35 21
$r_1 = \frac{50 - 40}{40} = 25\%$, $r_2 = -30\%$, $r_3 = -40\%$

$(1 + r_g)^3 = (1 + r_1)(1 + r_2)(1 + r_3)$
$\Rightarrow r_g = \sqrt[3]{(1 + r_1)(1 + r_2)(1 + r_3)} - 1 = \sqrt[3]{0.525} - 1 \approx -0.193$
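
The geometric mean return can be reproduced from the four prices. This is a minimal sketch with illustrative names; it also checks that compounding the first price at $r_g$ for three periods gives back the last price.

```python
import math

prices = [40, 50, 35, 21]

# Period returns r_t = (p_t - p_{t-1}) / p_{t-1}
returns = [(new - old) / old for old, new in zip(prices, prices[1:])]

# Compound growth factor (1 + r1)(1 + r2)(1 + r3)
growth = math.prod(1 + r for r in returns)

# Geometric mean return: the constant rate with the same compound growth.
r_g = growth ** (1 / len(returns)) - 1

print([f"{r:.0%}" for r in returns])
print(f"growth factor: {growth:.3f}, geometric mean return: {r_g:.3f}")
print(f"check: {prices[0] * (1 + r_g) ** len(returns):.2f}")
```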
Measures of Variability (sample):

Range = largest − smallest observation
- $R = 5\ (\text{max.}) - 0\ (\text{min.}) = 5$

Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i \times (x_i - \bar{x})^2$ ← "mean squared deviations"
- $s^2 = \frac{15 \times (0 - 1.62)^2 + \dots + 3 \times (5 - 1.62)^2}{50 - 1} \approx 2.2$

Standard deviation: $s = \sqrt{s^2}$ (crude approximation: $R/4$)
- $s = \sqrt{2.2} \approx 1.48$ (check: $R/4 = 5/4 = 1.25$)
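
The variability measures for "number of kids" can be checked with a few lines. The sketch below works directly on the frequency table and divides by $n - 1$ as in the variance formula; names are illustrative.

```python
import math

# Frequency table: value x_i with frequency f_i.
values = [0, 1, 2, 3, 4, 5]
freqs = [15, 10, 13, 6, 3, 3]

n = sum(freqs)
mean = sum(f * x for x, f in zip(values, freqs)) / n

# Sum of squared deviations, weighted by the class frequencies.
squared_deviations = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))

variance = squared_deviations / (n - 1)   # sample variance s^2
sd = math.sqrt(variance)                  # sample standard deviation s
data_range = max(values) - min(values)    # R

print(f"s^2 = {variance:.2f}, s = {sd:.2f}, R = {data_range}, R/4 = {data_range / 4}")
```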
CONTINUOUS DATA: X = weight (kg) of 199 individuals
- $crf_i = \frac{\#\,\text{observations} \le x}{n}$: cumulative rel. freq. of the $i$-th class

  i     from .. till ..    m_i    f_i   rf_i   crf_i
  1      50 < x ≤ 60        55     10   0.05   0.05
  2      60 < x ≤ 70        65     38   0.19   0.24
  3      70 < x ≤ 80        75     71   0.36   0.60
  4      80 < x ≤ 90        85     48   0.24   0.84
  5      90 < x ≤ 100       95     26   0.13   0.97
  6     100 < x ≤ 110      105      6   0.03   1.00
  total                           199   1.00

Average (suppose observations are evenly spread out): $\bar{x} \approx \frac{1}{n} \sum_{i=1}^{k} f_i \times m_i$
- $\bar{x} \approx \frac{1}{199} (10 \times 55 + 38 \times 65 + \dots + 6 \times 105) = \frac{15{,}525}{199} \approx 78.0$ (kg)

Median class:
- 70-80
Modal class:
- 70-80
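
For grouped data the same quantities follow from the class midpoints and frequencies. The sketch below is one way to compute the approximate mean, the cumulative relative frequencies, and the median and modal class; names are illustrative.

```python
from itertools import accumulate

# Class midpoints m_i and frequencies f_i from the weight table.
midpoints = [55, 65, 75, 85, 95, 105]
freqs = [10, 38, 71, 48, 26, 6]

n = sum(freqs)

# Approximate mean: every observation is replaced by its class midpoint.
grouped_mean = sum(f * m for m, f in zip(midpoints, freqs)) / n

# Cumulative relative frequencies crf_i.
crf = [c / n for c in accumulate(freqs)]

# Median class: first class whose cumulative relative frequency reaches 50%.
median_index = next(i for i, c in enumerate(crf) if c >= 0.5)

# Modal class: class with the highest frequency.
modal_index = max(range(len(freqs)), key=freqs.__getitem__)

print(f"mean ~ {grouped_mean:.1f} kg")
print("crf:", [round(c, 2) for c in crf])
print("median class midpoint:", midpoints[median_index])
print("modal class midpoint:", midpoints[modal_index])
```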
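Variance (approximation): $s^2 \approx \frac{1}{n-1} \sum_{i=1}^{k} f_i \times (m_i - \bar{x})^2$
- $s^2 \approx \frac{10 \times (55 - 78)^2 + \dots + 6 \times (105 - 78)^2}{199 - 1} = \frac{26{,}591}{198} \approx 134$ (kg)$^2$

Standard deviation: $s = \sqrt{s^2}$
- $s = \sqrt{134} \approx 11.6$ (kg)

Coefficient of variation (unit-less relative variability): $cv = \frac{s}{\bar{x}}$
- $cv = \frac{11.6}{78} \approx 14.9\%$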
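
A sketch of the grouped-data variance, standard deviation and coefficient of variation. It uses the unrounded mean rather than 78, so intermediate values differ slightly from the slide but round to the same results; names are illustrative.

```python
import math

# Class midpoints m_i and frequencies f_i from the weight table.
midpoints = [55, 65, 75, 85, 95, 105]
freqs = [10, 38, 71, 48, 26, 6]

n = sum(freqs)
mean = sum(f * m for m, f in zip(midpoints, freqs)) / n

# Approximate variance: deviations are measured from the class midpoints.
variance = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs)) / (n - 1)
sd = math.sqrt(variance)

# Coefficient of variation: standard deviation relative to the mean (no unit).
cv = sd / mean

print(f"s^2 ~ {variance:.0f} kg^2, s ~ {sd:.1f} kg, cv ~ {cv:.1%}")
```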
(Relative frequency) histogram

[Figure: relative-frequency histogram; x-axis: Weight (50 to 110), y-axis: rel. freq. (0 to 0.4)]

What is the total (shaded) area of the bars, if each bar has a width of 1 unit?
Ogive / Cumulative frequency polygon

[Figure: ogive; x-axis: Weight (50 to 110), y-axis: cum. rel. freq. (0 to 1); plotted values: 5%, 24%, 60%, 84%, 97%, 100%]

What is the median weight?
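
One common way to read a median off an ogive is linear interpolation between the plotted points. The sketch below assumes the class boundaries are 50, 60, ..., 110 and that the cumulative relative frequencies are plotted at the upper class boundaries; both are assumptions, and `ogive_quantile` is a hypothetical helper, not something defined in the slides.

```python
# Assumed class boundaries and the cumulative relative frequencies at each boundary.
bounds = [50, 60, 70, 80, 90, 100, 110]
crf = [0.00, 0.05, 0.24, 0.60, 0.84, 0.97, 1.00]


def ogive_quantile(bounds, crf, p):
    """Value at which the ogive reaches cumulative share p (linear interpolation)."""
    for i in range(1, len(bounds)):
        if crf[i] >= p:
            lower, upper = bounds[i - 1], bounds[i]
            # Fraction of the class needed to move from crf[i-1] up to p.
            share = (p - crf[i - 1]) / (crf[i] - crf[i - 1])
            return lower + share * (upper - lower)
    raise ValueError("p must lie between 0 and 1")


median_weight = ogive_quantile(bounds, crf, 0.5)
print(f"median weight ~ {median_weight:.1f} kg")
```

Under these assumptions the median falls inside the 70-80 class, in line with the median class found earlier.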
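Interpreting the Standard Deviation

Empirical rule: if histogram is bell-shaped, then

- $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the observations
- $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the observations
- $(\bar{x} - 3s, \bar{x} + 3s)$ contains almost all of the observations

[Figure: bell-shaped curve; axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$]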

Example: see the 199 weights with $\bar{x} = 78$ and $s = 11.6$

- $(\bar{x} - s, \bar{x} + s) = (78 - 11.6, 78 + 11.6) = (66.4, 89.6)$
- $(\bar{x} - 2s, \bar{x} + 2s) = (78 - 23.2, 78 + 23.2) = (54.8, 101.2)$
- $(\bar{x} - 3s, \bar{x} + 3s) = (78 - 34.8, 78 + 34.8) = (43.2, 112.8)$

What % has a weight between $(66.4, 89.6) \approx (65, 90)$? Answer?

→ Empirical rule seems to yield an accurate approximation
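
The comparison with the empirical rule can be sketched numerically. The code below assumes class boundaries 50, 60, ..., 110 and observations spread evenly within each class (the same assumption used for the grouped mean); `count_between` is a hypothetical helper introduced for illustration.

```python
mean, sd = 78, 11.6

# Intervals mean +/- k standard deviations for k = 1, 2, 3.
intervals = {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# Assumed class boundaries and the class frequencies from the weight table.
bounds = [50, 60, 70, 80, 90, 100, 110]
freqs = [10, 38, 71, 48, 26, 6]


def count_between(low, high):
    """Approximate number of observations in (low, high), assuming an even spread per class."""
    total = 0.0
    for left, right, f in zip(bounds, bounds[1:], freqs):
        overlap = max(0, min(high, right) - max(low, left))
        total += f * overlap / (right - left)
    return total


share = count_between(65, 90) / sum(freqs)

for k, (low, high) in intervals.items():
    print(f"k={k}: ({low:.1f}, {high:.1f})")
print(f"share between 65 and 90: {share:.1%}")
```

Under these assumptions the share between 65 and 90 kg comes out close to the 68% that the empirical rule suggests.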
PERCENTILES AND BOXPLOT

$p$-th percentile = value such that
(i) at most $p$% of the data is smaller
(ii) at most $(100 - p)$% of the data is greater

Location: $L_p = (n + 1) \frac{p}{100}$

Q1 (first quartile) = 25th percentile
Q2 (second quartile) = 50th percentile (median)
Q3 (third quartile) = 75th percentile

[Diagram: Q1, Q2 and Q3 split the ordered data into four parts of 25% each]

Interquartile range (IQR) = Q3 − Q1
Ex. 25 test marks (0-100) with $\bar{x} = 47.72$ (sorted, read column by column)
23 34 42 52 58
27 37 42 53 63
30 39 42 55 66
33 40 48 57 77
33 40 48 58 96

Q1 = 35.5 [26·25/100 = 6.5: 34 + 0.5·(37 − 34)]
Q2 = 42 [26·50/100 = 13th observation]
Q3 = 57.5 [26·75/100 = 19.5: 57 + 0.5·(58 − 57)]

IQR = 57.5 − 35.5 = 22 (compare with $s$)

Outliers: observations less than $Q_1 - 1.5 \times IQR$
          observations greater than $Q_3 + 1.5 \times IQR$

In our example: $35.5 - 1.5 \times 22 = 2.5$ and $57.5 + 1.5 \times 22 = 90.5$ ⇒ 96 outlier
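
The quartile and outlier calculations can be sketched as follows. The `percentile` function is a hypothetical helper that implements the location rule $L_p = (n + 1)\,p/100$ with linear interpolation between neighbouring observations, as in the worked example.

```python
# The 25 test marks from the example.
marks = sorted([23, 34, 42, 52, 58, 27, 37, 42, 53, 63, 30, 39, 42, 55, 66,
                33, 40, 48, 57, 77, 33, 40, 48, 58, 96])


def percentile(sorted_data, p):
    """p-th percentile using location L_p = (n + 1) * p / 100."""
    n = len(sorted_data)
    location = (n + 1) * p / 100          # 1-based position, may be fractional
    location = min(max(location, 1), n)   # keep the position inside the data
    lower = int(location)                 # observation at or just below L_p
    fraction = location - lower
    if fraction == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)


q1, q2, q3 = (percentile(marks, p) for p in (25, 50, 75))
iqr = q3 - q1

# Fences: observations beyond them are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [m for m in marks if m < lower_fence or m > upper_fence]

print(q1, q2, q3, iqr)
print(lower_fence, upper_fence, outliers)
```

Other software may use a different location rule and return slightly different quartiles. Python's `statistics.quantiles(marks, n=4)` with its default method follows the same $(n + 1)$ convention.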
BOXPLOT

1. Box: from Q1 till Q3 with a line at Q2

2. Whiskers: extend to the most extreme values that are not outliers, i.e. the extreme values in $(Q_1 - 1.5 \times IQR, Q_3 + 1.5 \times IQR)$

3. Outside whiskers: ○ or a similar marker for outliers

[Figure: boxplot; axis from 0 to 120]
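
The three construction steps translate into five numbers plus a list of outliers. The sketch below applies them to the 25 test marks, taking the quartiles from the worked example as given; names are illustrative.

```python
marks = [23, 27, 30, 33, 33, 34, 37, 39, 40, 40, 42, 42, 42,
         48, 48, 52, 53, 55, 57, 58, 58, 63, 66, 77, 96]

# Quartiles taken from the worked example.
q1, q2, q3 = 35.5, 42, 57.5
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 2: whiskers end at the most extreme observations inside the fences.
inside = [m for m in marks if lower_fence < m < upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Step 3: everything outside the fences is drawn as a separate marker.
outliers = [m for m in marks if m <= lower_fence or m >= upper_fence]

print(f"box: {q1} to {q3}, line at {q2}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Note that the upper whisker stops at the largest non-outlier (77), not at the fence value 90.5.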
LINEAR RELATIONSHIP BETWEEN 2 VARIABLES

[Figure: scatter plots of Y against X, one panel per case]
- perfect positive relation: Correlation = 1.0
- positive relation: Correlation = 0.5
- strong positive relation: Correlation = 0.9
- strong negative relation: Correlation = –0.9
- no (linear) relationship: Correlation = 0.0 (two panels)

Karl Pearson
Born: March 27, 1857 (London)
Died: April 27, 1936
Noted for: Pearson's correlation coefficient
Numerical Example: Advertising versus Sales

X = "advertising" (×100,000 €)    Y = "sales" (×1,000,000 €)
2                                 3
3                                 6
4.5                               8
5.5                               10
7                                 11
$\bar{x} = 4.4$ & $s_X = 1.98$        $\bar{y} = 7.6$ & $s_Y = 3.21$

Sample covariance (dependent on unit of measurement):

$\text{cov}(X, Y) = s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{5-1} \times 24.8 = 6.2$

Sample correlation (independent of unit of measurement):

$r_{XY} = \frac{\text{cov}(X, Y)}{s_X s_Y} = \frac{6.2}{1.98 \times 3.21} \approx 0.98$

Theorem: $-1 \le r_{XY} \le 1$
  i       (x_i − x̄)   (y_i − ȳ)   (x_i − x̄)(y_i − ȳ)
  1       –2.4        –4.6        11.04
  2       –1.4        –1.6         2.24
  3        0.1         0.4         0.04
  4        1.1         2.4         2.64
  5        2.6         3.4         8.84
  Total                           24.8
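
The covariance and correlation of the advertising example can be reproduced step by step. This is a minimal sketch with illustrative names that mirrors the table of deviations.

```python
import math

advertising = [2, 3, 4.5, 5.5, 7]   # X, in 100,000 euro
sales = [3, 6, 8, 10, 11]           # Y, in 1,000,000 euro

n = len(advertising)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n

# Products of deviations, one per observation (last column of the table).
cross_products = [(x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales)]

cov_xy = sum(cross_products) / (n - 1)
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in advertising) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in sales) / (n - 1))

# Correlation: covariance rescaled by both standard deviations.
r_xy = cov_xy / (s_x * s_y)

print([round(c, 2) for c in cross_products])
print(f"cov = {cov_xy:.2f}, s_X = {s_x:.2f}, s_Y = {s_y:.2f}, r = {r_xy:.2f}")
```

Changing the unit of X (say, to euro instead of 100,000 euro) would multiply the covariance by 100,000 but leave the correlation unchanged.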
Regression according to Least Squares Method

- $y_i = b_0 + b_1 x_i + \text{residual} = \hat{y}_i + r_i$, so that

  actual value ($y_i$) = predicted/fitted value ($\hat{y}_i$) + residual ($r_i$)

- choose $b_0$ and $b_1$ such that $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} r_i^2$ is minimal

Theorem: $b_1 = \frac{\text{cov}(X, Y)}{s_X^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$

$b_1 = \frac{6.2}{1.98^2} \approx 1.58$ & $b_0 = 7.6 - 1.58 \times 4.4 \approx 0.65$ ⇒ $\hat{y}_i = 0.65 + 1.58 x_i$
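
The least-squares coefficients follow directly from the theorem. The sketch below computes them for the advertising data without rounding the intermediate values, so the results carry more digits than the rounded figures above; names are illustrative.

```python
advertising = [2, 3, 4.5, 5.5, 7]   # X
sales = [3, 6, 8, 10, 11]           # Y

n = len(advertising)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n

cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in advertising) / (n - 1)

# Theorem: slope = cov(X, Y) / s_X^2, intercept = y_bar - slope * x_bar
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in advertising]
residuals = [y - y_hat for y, y_hat in zip(sales, fitted)]
sse = sum(r ** 2 for r in residuals)   # the quantity that least squares minimises

print(f"y_hat = {b0:.4f} + {b1:.4f} x")
print("residuals:", [round(r, 3) for r in residuals], "SSE:", round(sse, 3))
```

A useful check on any least-squares fit with an intercept is that the residuals sum to zero.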
Scatter diagram:

[Figure: Sales versus Advertising; x-axis: Advertising (0 to 8), y-axis: Sales (0 to 12); data points at sales 3, 6, 8, 10, 11 with fitted line y = 1.5796x + 0.6497]

Forecast ($x = 6$): $\hat{y} = 0.65 + 1.58 \times 6 = 10.13$

