StatiF 1 Slides
WHAT IS STATISTICS?
[Diagram, arrows numbered 1 to 4: Theory: population/model (parameters, knowledge); Practice: sample/data (statistics, information)]
Important concepts:
• population = group of all items of interest
• parameter = descriptive measure of a population
• sample = data drawn from the population [subset]
• statistic = descriptive measure of a sample
Variable:
• some characteristic of a population or sample
• it varies per item/object
"At the 2002 Olympic Games, skater no. 14, Jochem Uytdehaage, finished first in the 10 km in
12:58:92" [new world record]
QUALITATIVE (categorical)
* arithmetic calculations are meaningless
- nominal (labels/names [Latin: nomen])
  * no numbers, no ordering, only encoding
  - e.g. no. 14 in the quote above
  - e.g. colour, nationality, political preference
- ordinal (labels with a natural ordering/rank)
  * no numbers, although they are ordered
  * no measure for differences
  - e.g. first place in the quote above
  - e.g. preferences, quality of scientific journals
QUANTITATIVE (numerical)
* arithmetic calculations are valid
- discrete (can only assume a countable number of values)
  - e.g. year 2002 in the quote above
  - e.g. number of kids,
    number of correct answers (also as proportion!)
- continuous (can take on any value in an interval)
  - e.g. time 12:58:92 in the quote above
  - e.g. weight: 76.8 or 76.823 or 76.823195 kg
Example 1: watching TV (n=50)
Data: {C, C, P, P, C, C, P, C, C, P, P, P, C, P, P, P, C, C, C, C, C, P, I, C, P,
C, C, I, P, C, C, C, P, P, I, C, P, I, P, C, P, C, P, C, P, C, C, P, C, I, P, P, C}
Frequency Table
Channel          Frequency   Relative frequency (%)
Public               20              40
Commercial           25              50
International         5              10
Total                50             100
[Pie chart: Public 40%, Commercial 50%, International 10%]
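The relative frequencies in the table are the counts divided by n. A minimal Python sketch of that step, assuming the three counts from the frequency table; the names `frequencies` and `relative_frequencies` are illustrative and do not come from the slides:

```python
# Counts per channel, copied from the frequency table (n = 50).
frequencies = {"Public": 20, "Commercial": 25, "International": 5}


def relative_frequencies(freq):
    """Divide every count by the total number of observations."""
    n = sum(freq.values())
    return {category: count / n for category, count in freq.items()}


rel_freq = relative_frequencies(frequencies)
for category, share in rel_freq.items():
    # Print the count and the relative frequency as a percentage.
    print(f"{category:<15}{frequencies[category]:>3}{share:>7.0%}")
```

Because every share is a count divided by the same total, the shares always add up to 1 (100%).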
Example 2: Statement “Statistics is fun” (n=16)
5 categories: strongly disagree, disagree, neutral, agree, strongly agree
Answer Frequency
strongly disagree 1
disagree 3
neutral 3
agree 5
strongly agree 4
[Bar chart: Frequency (0 to 6) by Opinion: strongly disagree 1, disagree 3, neutral 3, agree 5, strongly agree 4]
DESCRIPTIVE STATISTICS
$i$      $x_i$   $f_i$   $rf_i$   $f_i x_i$   $rf_i x_i$
1        0      15     0.30       0         0.00
2        1      10     0.20      10         0.20
3        2      13     0.26      26         0.52
4        3       6     0.12      18         0.36
5        4       3     0.06      12         0.24
6        5       3     0.06      15         0.30
total           50     1.00      81         1.62
$f_i$: frequency of the $i$-th class
$rf_i = \frac{f_i}{n}$: relative frequency of the $i$-th class
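Measures of Central Location (sample):
Mean/Average (arithmetic and unweighted): $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i$
- $\bar{x} = \frac{0 + 10 + 26 + 18 + 12 + 15}{50} = \frac{81}{50} = 1.62$
$i$      $x_i$   $f_i$   $rf_i x_i$
1        0      15     0
2        1      10     0.2
3        2      13     0.52
4        3       6     0.36
5        4       3     0.24
6        5       3     0.3
total           50     1.62
Median: $M = \begin{cases} \text{middle value} & n \text{ odd} \\ \text{mean of the two middle values} & n \text{ even} \end{cases}$
- $M = \frac{1 \text{ (25th obs.)} + 2 \text{ (26th obs.)}}{2} = 1.5$
$M$ is robust: suppose a typo $5 \to 50$, then $\bar{x} = \frac{81 - 5 + 50}{50} = 2.52$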
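The mean and median of the frequency table, and the effect of the typo, can be checked numerically. A minimal sketch, assuming the table values $x_i$ and $f_i$ shown earlier; the helper names `frequency_mean` and `expand` are illustrative only:

```python
from statistics import median

values = [0, 1, 2, 3, 4, 5]      # x_i
counts = [15, 10, 13, 6, 3, 3]   # f_i


def frequency_mean(values, counts):
    """Mean from a frequency table: (1/n) * sum of f_i * x_i."""
    n = sum(counts)
    return sum(f * x for x, f in zip(values, counts)) / n


def expand(values, counts):
    """Rebuild the sorted list of raw observations from the table."""
    return [x for x, f in zip(values, counts) for _ in range(f)]


data = expand(values, counts)
mean_x = frequency_mean(values, counts)   # 81 / 50
median_x = median(data)                   # mean of the 25th and 26th observation

# Typo scenario: the last observation (a 5) is recorded as 50.
data_typo = data[:-1] + [50]
mean_typo = sum(data_typo) / len(data_typo)
median_typo = median(data_typo)

print(mean_x, median_x)        # mean and median of the original data
print(mean_typo, median_typo)  # the mean shifts, the median does not
```

Only one observation changes, yet the mean moves from 1.62 to 2.52 while the median stays at 1.5.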
Histogram
The sum of deviations is always zero: $\sum_{i=1}^{k} f_i (x_i - \bar{x}) = 0$
Types of histograms
[Figure: three relative-frequency histograms, each marking the mean $\bar{x}$ and the median $M$; one panel is labelled $\bar{x} > M$]
Comparison Location Measures
• mean/average: $\bar{x}$
  - a lot of theory available & efficient use of the data
  - sensitive to extreme observations
• median: $M$
  - less sensitive to extreme observations
  - less efficient
• mode:
  - used infrequently
  - sometimes it is the only measure available (e.g. nominal data)
Geometric Mean [self-study]:
Location measure for % change (e.g. returns):
$(1 + r_g)^3 = (1 + r_1)(1 + r_2)(1 + r_3)$
$r_g = \sqrt[3]{(1 + r_1)(1 + r_2)(1 + r_3)} - 1 = \sqrt[3]{0.525} - 1 \approx -0.193$
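The geometric mean return can be sketched in a few lines of Python. The three returns below are hypothetical (they do not appear in the slides) and were chosen only so that the growth factors multiply to 0.525, the product used above; `geometric_mean_return` is an illustrative name:

```python
import math


def geometric_mean_return(returns):
    """Constant per-period return that gives the same total growth."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


# Hypothetical returns (assumption, not from the slides):
# growth factors 1.5 * 0.7 * 0.5 = 0.525.
returns = [0.50, -0.30, -0.50]

r_g = geometric_mean_return(returns)
r_arithmetic = sum(returns) / len(returns)

print(f"geometric mean return:  {r_g:.3f}")
print(f"arithmetic mean return: {r_arithmetic:.3f}")
```

With these assumed returns the arithmetic mean differs from the geometric mean. Only the geometric mean, compounded over three periods, reproduces the total growth factor.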
Measures of Variability (sample):
Range = largest - smallest observation
- $R = 5 \text{ (max.)} - 0 \text{ (min.)} = 5$
Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2$ ("mean squared deviations")
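A short sketch of the range and the sample variance for the frequency table with $x_i = 0, \dots, 5$. It also checks that the frequency-weighted deviations from the mean sum to zero. The variable names are illustrative:

```python
values = [0, 1, 2, 3, 4, 5]      # x_i
counts = [15, 10, 13, 6, 3, 3]   # f_i

n = sum(counts)
mean_x = sum(f * x for x, f in zip(values, counts)) / n

# Range: largest minus smallest value that was actually observed.
observed = [x for x, f in zip(values, counts) if f > 0]
data_range = max(observed) - min(observed)

# Frequency-weighted deviations from the mean sum to zero (up to rounding).
sum_of_deviations = sum(f * (x - mean_x) for x, f in zip(values, counts))

# Sample variance: squared deviations divided by n - 1.
sample_variance = sum(f * (x - mean_x) ** 2 for x, f in zip(values, counts)) / (n - 1)

print(data_range, round(sum_of_deviations, 10), round(sample_variance, 2))
```

The variance divides by $n - 1$, not $n$, following the formula above.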
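CONTINUOUS DATA: X = weight (kg) of 199 individuals
- $crf_i = \frac{\#\,\text{observations} \le x}{n}$: cumulative rel. freq. of the $i$-th class (with $x$ the upper limit of the class)
$i$      from .. till ..   $m_i$   $f_i$   $rf_i$   $crf_i$
1        50 - 60           55      10     0.05     0.05
2        60 - 70           65      38     0.19     0.24
3        70 - 80           75      71     0.36     0.60
4        80 - 90           85      48     0.24     0.84
5        90 - 100          95      26     0.13     0.97
6        100 - 110        105       6     0.03     1.00
total                             199     1.00
Average (suppose observations are evenly spread out within each class): $\bar{x} \approx \frac{1}{n} \sum_{i=1}^{k} f_i \times m_i$
- $\bar{x} \approx \frac{1}{199} (10 \times 55 + 38 \times 65 + \dots + 6 \times 105) = \frac{15525}{199} \approx 78.0$ (kg)
Median class:
- 70-80
Modal class:
- 70-80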
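The grouped-data quantities can be reproduced as follows. This is a sketch under one assumption: the class limits 50-60, ..., 100-110 are inferred from the midpoints and the median class 70-80, since the limits themselves are not legible in the table. All names are illustrative:

```python
from itertools import accumulate

# Assumed class limits (inferred from the midpoints 55, 65, ..., 105).
classes = [(50, 60), (60, 70), (70, 80), (80, 90), (90, 100), (100, 110)]
counts = [10, 38, 71, 48, 26, 6]   # f_i

n = sum(counts)
midpoints = [(low + high) / 2 for low, high in classes]   # m_i

rel_freq = [f / n for f in counts]                      # rf_i
cum_rel_freq = [c / n for c in accumulate(counts)]      # crf_i

# Approximate mean: every observation is replaced by its class midpoint.
grouped_mean = sum(f * m for f, m in zip(counts, midpoints)) / n

# Median class: first class whose cumulative relative frequency reaches 50%.
median_class = classes[next(i for i, crf in enumerate(cum_rel_freq) if crf >= 0.5)]
# Modal class: class with the highest frequency.
modal_class = classes[counts.index(max(counts))]

print(round(grouped_mean, 1), median_class, modal_class)
```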
Variance (approximation): $s^2 \approx \frac{1}{n-1} \sum_{i=1}^{k} f_i \times (m_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2}$
- $s = \sqrt{134} \approx 11.6$ (kg)
Coefficient of variation (unit-less relative variability): $cv = \frac{s}{\bar{x}}$
- $cv = \frac{11.6}{78} = 14.9\%$
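A minimal sketch of the approximate variance, standard deviation and coefficient of variation for the weight data, using the class midpoints $m_i$ and frequencies $f_i$ from the table. The variable names are illustrative:

```python
import math

midpoints = [55, 65, 75, 85, 95, 105]   # m_i
counts = [10, 38, 71, 48, 26, 6]        # f_i

n = sum(counts)
grouped_mean = sum(f * m for f, m in zip(counts, midpoints)) / n

# Approximate sample variance: squared deviations of the midpoints, over n - 1.
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(counts, midpoints)
) / (n - 1)

grouped_sd = math.sqrt(grouped_variance)
cv = grouped_sd / grouped_mean   # unit-less

print(f"s^2 = {grouped_variance:.0f}, s = {grouped_sd:.1f} kg, cv = {cv:.1%}")
```

The sketch uses the unrounded mean and standard deviation throughout, so its results agree with the slide values only after rounding.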
(Relative frequency) histogram
[Histogram: rel. freq. (0 to 0.4) against Weight, classes from 50 to 110 in steps of 10]
What is the total (shaded) area of the bars, if each bar has a width of 1 unit?
Ogive / Cumulative frequency polygon
[Ogive: cum. rel. freq. (0 to 1) against Weight (50 to 110), with marked values 5%, 24%, 60%, 84%, 97% and 100%]
Interpreting the Standard Deviation
[Figure: axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$]
→ Empirical rule seems to yield an accurate approximation
PERCENTILES AND BOXPLOT
Location: $L_p = (n + 1) \frac{p}{100}$
Quartiles: $Q_1$ ($p = 25$), $Q_2$ ($p = 50$, the median), $Q_3$ ($p = 75$)
Example: 25 test marks (0-100) with $\bar{x} = 47.72$ (sorted, read column by column)
23 34 42 52 58
27 37 42 53 63
30 39 42 55 66
33 40 48 57 77
33 40 48 58 96
IQR = 57.5 - 35.5 = 22 (compare with $s$)
In our example: $35.5 - 1.5 \times 22 = 2.5$ and $57.5 + 1.5 \times 22 = 90.5$, so 96 is an outlier
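The quartiles and the outlier check for the 25 test marks can be sketched as below. The location rule $L_p = (n + 1) p / 100$ is the one used in these slides; interpolating linearly between neighbouring observations when $L_p$ is not a whole number is an assumption, though it reproduces 35.5 and 57.5. The function name `percentile` is illustrative:

```python
marks = sorted([
    23, 27, 30, 33, 33, 34, 37, 39, 40, 40, 42, 42, 42,
    48, 48, 52, 53, 55, 57, 58, 58, 63, 66, 77, 96,
])


def percentile(sorted_data, p):
    """p-th percentile at location L_p = (n + 1) * p / 100 (1-based)."""
    n = len(sorted_data)
    location = (n + 1) * p / 100
    position = int(location)          # whole part of L_p
    fraction = location - position    # fractional part of L_p
    if position < 1:
        return sorted_data[0]
    if position >= n:
        return sorted_data[-1]
    below = sorted_data[position - 1]
    above = sorted_data[position]
    # Assumed: linear interpolation between the two neighbouring observations.
    return below + fraction * (above - below)


q1, q2, q3 = (percentile(marks, p) for p in (25, 50, 75))
iqr = q3 - q1

# Observations beyond 1.5 * IQR from the quartiles count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in marks if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, lower_fence, upper_fence, outliers)
```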
BOXPLOT
2. Whiskers: extend to the most extreme values that are not outliers, i.e. the extreme values in
$(Q_1 - 1.5 \times IQR,\ Q_3 + 1.5 \times IQR)$
[Boxplot on an axis from 0 to 120]
LINEAR RELATIONSHIP BETWEEN 2 VARIABLES
[Scatter plots of Y against X: "perfect positive relation" (Correlation=1.0), "positive relation" (Correlation=0.5), two further untitled panels, and "no (linear) relationship" (Correlation=0.0, two panels)]
Karl Pearson
Born: March 27, 1857 (London)
Died: April 27, 1936
Noted for: Pearson's correlation coefficient
Numerical Example: Advertising versus Sales
Theorem: $-1 \le r_{XY} \le 1$
$i$      $(x_i - \bar{x})$   $(y_i - \bar{y})$   $(x_i - \bar{x})(y_i - \bar{y})$
1            -2.4               -4.6                 11.04
2            -1.4               -1.6                  2.24
3             0.1                0.4                  0.04
4             1.1                2.4                  2.64
5             2.6                3.4                  8.84
Total                                                24.8
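Covariance and correlation can be computed directly from the deviations in the table. A minimal sketch, assuming the sample versions of both quantities (division by $n - 1$); the raw advertising and sales figures are not in the text, so only the tabulated deviations are used, and the variable names are illustrative:

```python
import math

dx = [-2.4, -1.4, 0.1, 1.1, 2.6]   # x_i - mean of x
dy = [-4.6, -1.6, 0.4, 2.4, 3.4]   # y_i - mean of y

n = len(dx)

# Sample covariance: sum of the products of deviations, divided by n - 1.
covariance = sum(a * b for a, b in zip(dx, dy)) / (n - 1)

# Sample standard deviations of X and Y.
s_x = math.sqrt(sum(a * a for a in dx) / (n - 1))
s_y = math.sqrt(sum(b * b for b in dy) / (n - 1))

# Pearson correlation coefficient: covariance scaled by both standard deviations.
r_xy = covariance / (s_x * s_y)

print(round(covariance, 2), round(s_x, 2), round(s_y, 2), round(r_xy, 3))
```

Scaling by both standard deviations makes $r_{XY}$ unit-less and keeps it between -1 and 1.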
Regression according to Least Squares Method
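• $y_i = b_0 + b_1 x_i + r_i$, with fitted value $\hat{y}_i = b_0 + b_1 x_i$ and residual $r_i$, so that $r_i = y_i - \hat{y}_i$
• choose $b_0$ and $b_1$ such that $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} r_i^2$ is minimal
Theorem: $b_1 = \frac{\operatorname{cov}(X, Y)}{s_X^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$
$b_1 = \frac{6.2}{1.98^2} \approx 1.58$ and $b_0 = 7.6 - 1.58 \times 4.4 \approx 0.65$, so $\hat{y}_i = 0.65 + 1.58 x_i$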
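The least-squares coefficients can be checked with a short sketch. It uses the deviations of the advertising/sales example and the means 4.4 and 7.6 that appear in the calculation of $b_0$. Rebuilding the raw observations as mean plus deviation is an assumption made here only to compute residuals, and the variable names are illustrative:

```python
dx = [-2.4, -1.4, 0.1, 1.1, 2.6]   # x_i - mean of x
dy = [-4.6, -1.6, 0.4, 2.4, 3.4]   # y_i - mean of y
x_bar, y_bar = 4.4, 7.6            # means used in the calculation of b0

n = len(dx)
covariance = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
variance_x = sum(a * a for a in dx) / (n - 1)

# Least-squares slope and intercept.
b1 = covariance / variance_x
b0 = y_bar - b1 * x_bar

# Assumption: raw observations rebuilt as mean + deviation.
xs = [x_bar + a for a in dx]
ys = [y_bar + b for b in dy]

fitted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, fitted)]

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"sum of residuals = {sum(residuals):.10f}")
```

Because the line passes through $(\bar{x}, \bar{y})$, the residuals sum to zero up to rounding.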
Scatter diagram:
[Scatter diagram: Sales (0 to 12) against Advertising (0 to 8), with fitted line y = 1.5796x + 0.6497]