
Statistics 03-3


Statistics

Frequency tables

A frequency table displays the frequency distribution of a numerical variable.

For each value or each class i in P, we compute the following quantities on the sample $x_1, \dots, x_n$:
▶ the frequencies $f_i$
▶ the relative frequencies $f_i / n$
▶ the cumulative frequencies $\sum_{j \le i} f_j$
▶ the cumulative relative frequencies $\sum_{j \le i} f_j / n$
▶ possibly the densities (= relative frequency / amplitude).
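The quantities listed above can be computed in a few lines. The following is a minimal sketch, assuming half-open classes of the form [a, b[ given by a sorted list of break points; the function name `frequency_table` and the small sample are hypothetical and are not taken from the course material.

```python
from bisect import bisect_right

def frequency_table(sample, breaks):
    """Return one row per class [breaks[i], breaks[i+1][ with its statistics."""
    n = len(sample)
    counts = [0] * (len(breaks) - 1)
    for x in sample:
        if not breaks[0] <= x < breaks[-1]:
            raise ValueError(f"{x} falls outside the classes")
        # bisect_right finds the class whose left end is <= x < right end.
        counts[bisect_right(breaks, x) - 1] += 1

    rows, cumulative = [], 0
    for i, f in enumerate(counts):
        cumulative += f
        width = breaks[i + 1] - breaks[i]  # amplitude of the class
        rows.append({
            "class": (breaks[i], breaks[i + 1]),
            "freq": f,
            "rel_freq": f / n,
            "cum_freq": cumulative,
            "cum_rel_freq": cumulative / n,
            "density": f / n / width,
        })
    return rows

# Hypothetical sample (not the energy-use data).
sample = [1, 3, 3, 4, 6, 7, 7, 7, 9, 12]
table = frequency_table(sample, breaks=[0, 5, 10, 15])
for row in table:
    print(row)
```

The last cumulative relative frequency is always 1, which is a quick check that no observation was dropped.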

Statistics - Fall semester 1/ 29



Energy Use - Frequency table


Energy Use by Country in 2012 (kg of oil equivalent per capita).

Class             Freq.   Rel. freq.   Cum. freq.   Cum. rel. freq.
[0, 2000[           92      0.568           92          0.568
[2000, 4000[        39      0.241          131          0.809
[4000, 6000[        17      0.105          148          0.914
[6000, 8000[         8      0.049          156          0.963
[8000, 10000[        2      0.012          158          0.975
[10000, 12000[       1      0.006          159          0.981
[12000, 14000[       0      0.000          159          0.981
[14000, 16000[       1      0.006          160          0.988
[16000, 18000[       1      0.006          161          0.994
[18000, 20000[       1      0.006          162          1.000

Data source: data.worldbank.org.



Energy Use - Cumulative Distribution

[Figure: two panels plotted against energy use (0 to 20000). Left: cumulative relative frequencies. Right: empirical distribution function Fn(x).]


Histograms

The histogram is the graphical translation of a frequency table by means of rectangles.
The area of each rectangle is proportional to the class frequency (or relative frequency). In fact, the height of each rectangle is equal to the density,

$$\text{density} = \frac{\text{relative frequency}}{\text{amplitude}},$$

where the amplitude is the width of the class.

Unlike barplots for categorical variables, the variable axis of a histogram is ordered.
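To see why the height must be the density when classes have different widths, the sketch below computes the rectangle heights for the energy-use classes of unequal width used in this deck (class limits and frequencies as tabulated there) and checks that the rectangle areas sum to 1. Variable names are illustrative only.

```python
# Class limits and frequencies of the energy-use table with unequal classes.
breaks = [0, 2000, 4000, 6000, 8000, 12000, 20000]
freqs = [92, 39, 17, 8, 3, 3]

n = sum(freqs)
widths = [hi - lo for lo, hi in zip(breaks, breaks[1:])]

# Height of each rectangle = relative frequency / amplitude.
densities = [f / n / w for f, w in zip(freqs, widths)]

# Area of each rectangle = height * width = relative frequency.
areas = [d * w for d, w in zip(densities, widths)]

for lo, hi, d in zip(breaks, breaks[1:], densities):
    print(f"[{lo}, {hi}[  height = {d:.6f}")
print("total area =", round(sum(areas), 12))
```

Plotting the raw frequencies instead would give the two wide classes the same height as a narrow class with 3 observations, which overstates how much data they contain per unit of energy use.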


Energy Use - histogram


[Figure: histogram of energy use. x-axis: Use (0 to 20000); y-axis: Frequency.]


Energy Use - different class sizes

Energy Use by Country in 2012 (kg of oil equivalent per capita).

Class             Freq.   Rel. freq.   Density    Cumul. freq.   Cumul. rel. freq.
[0, 2000[           92      0.568      0.000284        92             0.568
[2000, 4000[        39      0.241      0.000120       131             0.809
[4000, 6000[        17      0.105      0.000052       148             0.914
[6000, 8000[         8      0.049      0.000025       156             0.963
[8000, 12000[        3      0.019      0.000005       159             0.981
[12000, 20000[       3      0.019      0.000002       162             1.000


Energy Use - histogram - different class sizes


[Figure: two histograms of energy use with unequal class widths, x-axis: Use (0 to 20000). Correct: y-axis is Density. Incorrect: y-axis is Frequency.]


Histogram/barplot for discrete variables


Drugs intake during the two days preceding the interview
(1977-78 Australian Health Survey)
[Figure: two panels showing the frequency of each number of medicines taken. x-axis: Medicine (0 to 8); y-axis: Frequency (0 to 2000).]


Kernel density estimation


Kernel density estimation can be viewed as a generalization
and improvement over histograms.

Consider the following data on wing spans [meters] of aircraft built from 1956 to 1984.

7.88, 9.14, 10.51, 11.12, 11.70, 14.00, 14.25, 14.49, 28.47, 29.20, 36.60, 47.57.

The above data are a subset of a complete dataset in Bowman and Azzalini (1997). They are available in package sm in R.

From histogram with fixed breaks.....

Figure source: http://www.mvstat.net/tduong/research/seminars/seminar-2001-05/

.....to “histogram” with blocks centered on data.....

Figure source: http://www.mvstat.net/tduong/research/seminars/seminar-2001-05/

....to kernel density estimation

Figure source: http://www.mvstat.net/tduong/research/seminars/seminar-2001-05/
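The step from blocks centered on the data to a kernel density estimate can be sketched as follows: each observation contributes a smooth bump, and the estimate is the average of these bumps. This is a minimal sketch that assumes a Gaussian kernel; the kernel choice and the two bandwidth values h are assumptions made for illustration and are not taken from the slides.

```python
import math

# Wing spans [meters] listed earlier in this deck.
wing_spans = [7.88, 9.14, 10.51, 11.12, 11.70, 14.00, 14.25, 14.49,
              28.47, 29.20, 36.60, 47.57]

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Kernel density estimate at x: average of kernels centered on each datum."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

def integrate(f, lo, hi, steps=5000):
    """Trapezoidal rule, used only to check that the estimate integrates to 1."""
    dx = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    total += sum(f(lo + i * dx) for i in range(1, steps))
    return total * dx

small_h, large_h = 1.0, 10.0  # hypothetical bandwidths
area = integrate(lambda x: kde(x, wing_spans, 2.0), -100.0, 150.0)
print("area under the estimate:", round(area, 6))
print("estimate at 20 m, h = 1 :", kde(20.0, wing_spans, small_h))
print("estimate at 20 m, h = 10:", kde(20.0, wing_spans, large_h))
```

No aircraft in the sample has a wing span near 20 m, so a small bandwidth gives an estimate close to zero there, while a large bandwidth spreads mass into the gap.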

Undersmoothing vs oversmoothing

Figures source: http://www.mvstat.net/tduong/research/seminars/seminar-2001-05/

Energy use - density

[Figure: kernel density estimate of energy use. x-axis: 0 to 20000; y-axis: Density. N = 162, Bandwidth = 595.7.]


Energy use - bandwidth comparison

[Figure: four kernel density estimates of energy use (N = 162), with Bandwidth = 100, 300, 600 and 1500. y-axis: Density.]


Boxplots (box-and-whisker plots)


The boxplot is a box-like representation. On a graduated axis (either horizontal or vertical):
1. Draw a box between the quartiles Q1 = Q(0.25) and Q3 = Q(0.75).
2. Draw a line in the box for the median M = Q(0.5).
3. Compute the lower bound LB = Q1 − 1.5 · (Q3 − Q1).
4. Compute the upper bound UB = Q3 + 1.5 · (Q3 − Q1).
5. The ends of the whiskers are the lowest datum still within 1.5 IQR of the lower quartile (LW), that is, the smallest observation ≥ LB, and the highest datum still within 1.5 IQR of the upper quartile (UW), that is, the largest observation ≤ UB. Here IQR = Q3 − Q1 is the interquartile range.
6. Represent the data smaller than LW or larger than UW by a symbol.
The width of the box is arbitrary.

Length of last call - boxplot computations

We already have Q1 = 49, M = 60 and Q3 = 328 (see page ??).
We compute

LB = Q1 − 1.5 · (Q3 − Q1) = 49 − 1.5 · (328 − 49) = −369.5,

and

UB = Q3 + 1.5 · (Q3 − Q1) = 328 + 1.5 · (328 − 49) = 746.5.

We therefore have the lower whisker LW = 1 and the upper whisker UW = 537 (see data on page ??).
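The bounds and whiskers above can be reproduced with a short sketch. The quartiles Q1 = 49 and Q3 = 328 are the values quoted in the text and are passed in directly, since the quantile definition Q(p) is given elsewhere in the course; the list `calls` is hypothetical and only stands in for the call-length data, which are not shown in this section.

```python
def boxplot_stats(data, q1, q3):
    """Bounds, whisker ends and flagged points for a boxplot, given the quartiles."""
    iqr = q3 - q1
    lb = q1 - 1.5 * iqr  # lower bound LB
    ub = q3 + 1.5 * iqr  # upper bound UB
    inside = [x for x in data if lb <= x <= ub]
    lw, uw = min(inside), max(inside)  # whisker ends are actual data points
    outliers = sorted(x for x in data if x < lw or x > uw)
    return {"LB": lb, "UB": ub, "LW": lw, "UW": uw, "outliers": outliers}

# Hypothetical call lengths; Q1 and Q3 are the values quoted in the text.
calls = [1, 30, 49, 55, 58, 60, 120, 250, 328, 537, 1500]
stats = boxplot_stats(calls, q1=49, q3=328)
print(stats)
```

Note that the whiskers stop at observed values (1 and 537 here), not at the bounds LB and UB themselves.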


Length of last call - boxplot

[Figure: boxplot of the length of the last call. Axis: last call (0 to 1500).]


Temperature in Montreal - boxplot


[Figure: one boxplot per month (Jan to Dec) of the average temperature in Montreal, QC, Canada (1942−2009). Axis: −10 to 20.]

Data source: climate.weatheroffice.gc.ca.



Violin plots

A violin plot is a combination of a boxplot and a kernel density plot. The (same) rotated kernel density plot is added on each side of a boxplot.


Energy use - boxplot vs violin plot


[Figure: boxplot and violin plot of energy use, side by side. Axis: 0 to 15000.]


Temperature in Montreal - violin plot


[Figure: one violin plot per month (Jan to Dec) of the average temperature in Montreal, QC, Canada (1942−2009). Axis: −10 to 20.]


Old Faithful geyser eruptions times [min] - density

[Figure: kernel density estimate of Old Faithful eruption times. x-axis: 1 to 6; y-axis: Density. N = 272, Bandwidth = 0.3348.]

Data source: Azzalini and Bowman (1990), available in R.

Old Faithful eruptions times - boxplot vs violin plot


[Figure: boxplot and violin plot of Old Faithful eruption times, side by side. Axis: 1.5 to 5.0.]


Q-Q plots

A Q-Q plot (quantile-quantile plot) is a graphical method to compare two probability distributions by plotting their quantiles against each other.
The two distributions to be compared can be either theoretical or empirical. Most often, one is interested in comparing an empirical distribution (from the sample) to a theoretical one.
In this case, the ordered sample is plotted against the quantiles $F^{-1}\left(\frac{k - 0.5}{n}\right)$ for $k = 1, \dots, n$, where $F$ is the cumulative distribution function of the theoretical distribution.
If the two distributions being compared are similar, the points in the Q-Q plot will approximately lie on the 45-degree line.
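The coordinates of a Q-Q plot against a theoretical distribution can be computed as in the sketch below, which pairs the ordered sample with the quantiles F⁻¹((k − 0.5)/n). It assumes a standard normal F for concreteness; the function name `normal_qq_points` and the sample values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Pairs (theoretical quantile, ordered observation) for a normal Q-Q plot."""
    n = len(sample)
    std_normal = NormalDist()  # F is the standard normal CDF (mean 0, sd 1)
    theoretical = [std_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))

# Hypothetical sample.
sample = [0.3, -1.2, 0.8, 2.1, -0.4]
points = normal_qq_points(sample)
for t, s in points:
    print(f"{t:8.3f} {s:8.3f}")
```

Plotting these pairs and adding the line y = x gives the visual check described above; for unstandardized data the points of a normal sample still fall on a straight line, but not necessarily the 45-degree one.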


Reminder (Prob. I) - simulation

Simulation procedure

We typically have no time to play a large number of card games, so we use a computer. To this end, we generate values from a random variable with a U(0, 1) distribution: these values are called random numbers.
Starting from U ∼ U(0, 1), we can in principle simulate any random variable with CDF $F$ by means of the $F^{-1}$ transformation (see tutorial, $Y = F^{-1}(U)$).

Remark
In fact, the computer makes use of a so-called pseudo-random number generator: an algorithm that produces a sequence of numbers which are (only) pseudo-random. Namely, the generator yields a sequence of numbers that, in practice, is very similar to a sample drawn from U(0, 1). The way in which this algorithm works is beyond the scope of this course: let's simply say that you can use statistical software to achieve the task.
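As an illustration of the transformation Y = F⁻¹(U), the sketch below simulates an exponential random variable, for which F(y) = 1 − exp(−λy) and hence F⁻¹(u) = −ln(1 − u)/λ. The exponential distribution, the rate λ = 2 and the seed are assumptions chosen for the example; they do not come from the course material.

```python
import math
import random

def exponential_inverse_cdf(u, rate):
    """F^{-1}(u) for the exponential distribution with the given rate."""
    return -math.log(1.0 - u) / rate

# A fixed seed makes the pseudo-random sequence reproducible.
rng = random.Random(2017)
rate = 2.0

# rng.random() plays the role of U ~ U(0, 1); it returns values in [0, 1).
draws = [exponential_inverse_cdf(rng.random(), rate) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)
print("sample mean:", sample_mean, "theoretical mean:", 1.0 / rate)
```

Running the script twice gives exactly the same draws, which illustrates the remark that the numbers are only pseudo-random.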

La Vecchia, S110015 Statistics, Spring Semester 2017

Q-Q plots interpretation

[Figure: two normal Q-Q plots, Sample Quantiles against Theoretical Quantiles. Left: normal data vs normal quantiles. Right: t3 distributed data vs normal quantiles.]


Energy use - Normal Q-Q plot


[Figure: Normal Q-Q Plot of energy use. x-axis: Theoretical Quantiles (−2 to 2); y-axis: Sample Quantiles (0 to 15000).]


More......

▶ Bee plot: http://www.cbs.dtu.dk/~eklund/beeswarm/
▶ Gapminder: www.gapminder.org/
▶ iNZight: www.stat.auckland.ac.nz/~wild/iNZight/index.php
▶ Worldmapper: http://www.worldmapper.org/
▶ Junk charts: http://junkcharts.typepad.com/
