Probability & Statistics B: Review of Simple Data Summaries
(Statistical Inference) Determine the median GPA for this college of 10,000
students.
1. INTRODUCTION
Example 3
A college has 10,000 students. You randomly pick 9 students, and their GPAs
are:
(Statistical Inference) You then randomly pick another student. What is the
probability that her GPA is higher than 3.5?
Summarizing data, both graphically and numerically, is central to a good
statistical analysis.
2. GRAPHICAL SUMMARIES
Discrete data
We can also present the distribution of discrete (countable) data through
appropriate frequency tables and graphs.
Example 6
The table below gives the distribution of the number of claims per
policyholder for different policyholders in a general insurance portfolio.
Caution is required with how the data are grouped in relevant categories, or
ranges, when frequency tables are considered.
Graphical summaries help identify the main characteristics of the data, e.g.
whether your data look consistent with a Normal distribution.
Continuous data
All graphs should:
2527 1787 3770 5701 2310 1652 822 918 2770 4891
3061 2126 1729 4618 3469
In practice, grouping depends on the nature of the variable and the aims of
the analysis.
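As a rough illustration of how the choice of classes shapes a frequency table, the sketch below groups the 15 values listed above into equal-width classes. The class width of 1000 and the helper name `frequency_table` are assumptions made for this example, not choices stated in these notes.

```python
from collections import Counter

payouts = [2527, 1787, 3770, 5701, 2310, 1652, 822, 918, 2770, 4891,
           3061, 2126, 1729, 4618, 3469]

def frequency_table(data, width):
    """Count observations in classes [k*width, (k+1)*width)."""
    counts = Counter((x // width) * width for x in data)
    return {lower: counts[lower] for lower in sorted(counts)}

# The width of 1000 is an assumed, illustrative choice.
table = frequency_table(payouts, width=1000)
for lower, freq in table.items():
    print(f"{lower:>5} to under {lower + 1000:<5} {freq}")
```

Rerunning with a different `width` (say 500 or 2000) gives a noticeably different picture of the same data, which is why the grouping deserves care.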
Two basic criteria used:
2527 1787 3770 5701 2310 1652 822 918 2770 4891
3061 2126 1729 4618 3469
A stem-and-leaf plot retains the values of the data, while also showing the
shape of the distribution.
20 21 21 25 25 27 29 30 32 35 36
38 39 45 50 51 52 52 55 56 63 68
69 71 71 73 73 73 74 76
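Assuming the display described above is a stem-and-leaf plot, the following minimal sketch builds one for the 30 values just listed, using the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

salaries = [20, 21, 21, 25, 25, 27, 29, 30, 32, 35, 36,
            38, 39, 45, 50, 51, 52, 52, 55, 56, 63, 68,
            69, 71, 71, 73, 73, 73, 74, 76]

def stem_and_leaf(data):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    stems = defaultdict(list)
    for x in sorted(data):
        stems[x // 10].append(x % 10)
    return dict(stems)

plot = stem_and_leaf(salaries)
# Print every stem in range, including any that have no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Every original value can be read back from the display (stem 4 with leaf 5 is 45), and the row lengths show the shape of the distribution.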
Male: 45 49 55 58 60 68 69 69 75 78 90
Female: 39 40 42 44 44 45 51 53 53 59 60 72
A dotplot shows data values as dots along a continuous horizontal axis (in
ascending order).
It is good for showing clusters of points, gaps (where there are no
observations) and atypical observations or outliers.
Example 11
The following data show the salaries (in 1000s) of a sample of 30 employees
from a company:
20 21 21 25 25 27 29 30 32 35 36
38 39 45 50 51 52 52 55 56 63 68
69 71 71 73 73 73 74 76
Create a dotplot.
Example 11: dotplot
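The original figure is not reproduced here. As a stand-in, the sketch below draws a plain-text dotplot of the Example 11 salaries, stacking one dot per repeated value. The helper `text_dotplot` is a hypothetical name, and a plotting library would normally be used instead.

```python
from collections import Counter

salaries = [20, 21, 21, 25, 25, 27, 29, 30, 32, 35, 36,
            38, 39, 45, 50, 51, 52, 52, 55, 56, 63, 68,
            69, 71, 71, 73, 73, 73, 74, 76]

def text_dotplot(data):
    """Return text rows (top row first); one column per integer from min to max."""
    counts = Counter(data)
    lo, hi = min(data), max(data)
    rows = []
    for level in range(max(counts.values()), 0, -1):
        rows.append("".join("." if counts[x] >= level else " "
                            for x in range(lo, hi + 1)))
    return rows

rows = text_dotplot(salaries)
width = len(rows[0])
print("\n".join(rows))
print("-" * width)  # horizontal axis
lo_label, hi_label = str(min(salaries)), str(max(salaries))
print(lo_label + hi_label.rjust(width - len(lo_label)))
```

The blank stretches in the bottom row correspond to gaps in the data (for example between 39 and 45), and the stacked dots mark repeated salaries.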
3. NUMERICAL SUMMARIES
Graphs usually provide simple and easily interpreted descriptions of the
structure of the frequency distribution of studied data.
Often we want to obtain further summarization of the data. These are given
by numerical summaries and are useful because:
These parameters take the form of numerical expressions that determine the
location, dispersion and general form of the distribution.
1100 1500 1600 2000 3500 4000 4800 4800 5000 5100
c) $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
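Property c) says the deviations from the sample mean always cancel out. A quick numerical check on the ten claim values listed above, using exact fractions so that rounding plays no part (the variable names are illustrative only):

```python
from fractions import Fraction

claims = [1100, 1500, 1600, 2000, 3500, 4000, 4800, 4800, 5000, 5100]

mean = Fraction(sum(claims), len(claims))   # exact sample mean
deviations = [x - mean for x in claims]     # x_i minus the mean
total_deviation = sum(deviations)

print(mean, total_deviation)
```

The same cancellation holds for any data set, since the sum of the $x_i$ equals $n\bar{x}$ by the definition of the mean.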
where $n = n_1 + n_2 + \cdots + n_k$.
Example 14
A college has three schools: School of Engineering, School of Mathematical
Sciences, and School of Management.
Sample median = $\frac{1}{2}\left( x_{(n/2)} + x_{(n/2+1)} \right)$ if $n$ is even,
where $x_{(j)}$ denotes the $j$-th smallest observation.
Example 15
You are given the following general insurance claim data:
1100 1500 1600 2000 3500 4000 4800 4800 5000 5100
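The task set for Example 15 is not shown here. Assuming it asks for the sample median of these ten claims, a minimal sketch follows; it also covers the odd-$n$ case, where the median is the single middle order statistic. The function name `sample_median` is hypothetical.

```python
def sample_median(data):
    """Middle order statistic for odd n; average of the two middle ones for even n."""
    x = sorted(data)
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]
    # For even n, x[mid - 1] and x[mid] are the (n/2)-th and (n/2 + 1)-th smallest.
    return (x[mid - 1] + x[mid]) / 2

claims = [1100, 1500, 1600, 2000, 3500, 4000, 4800, 4800, 5000, 5100]
median_claim = sample_median(claims)
print(median_claim)
```

With $n = 10$ the two middle values are the 5th and 6th smallest, 3500 and 4000, so the median is 3750.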
a) With frequency data, the median can be calculated with the help of the
cumulative frequencies in a suitable table.
c) The median is more robust than the mean, i.e. it is far less affected by
extreme values in the data set.
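To see the robustness in numbers, the sketch below compares the mean and median of the ten claim values from Example 15 with those of a contaminated copy. The value 51000 is invented purely for illustration and is not part of the original data.

```python
import statistics

claims = [1100, 1500, 1600, 2000, 3500, 4000, 4800, 4800, 5000, 5100]

# Hypothetical contamination: the largest claim is replaced by an invented extreme value.
contaminated = claims[:-1] + [51000]

for label, data in [("original", claims), ("contaminated", contaminated)]:
    print(label, "mean:", statistics.mean(data), "median:", statistics.median(data))
```

The mean more than doubles (from 3340 to 7930) while the median stays at 3750, because the median depends only on the middle order statistics.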
Example 16
AT Chocolate Factory, the monthly salaries are:
1 CEO – $100,000
1100 1500 1600 2000 3500 4000 4800 4800 5000 5100
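A boxplot is presented as a box with lines at $Q_1$, the median and $Q_3$, and
whiskers which extend to the minimum and the maximum.
Usually the whiskers extend no further than 1.5 times the interquartile range
(IQR) beyond the box, and observations outside that range are plotted individually.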
20 21 21 25 25 27 29 30 32 35 36
38 39 45 50 51 52 52 55 56 63 68
69 71 71 73 73 73 74 76
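The quantities needed to draw a boxplot of the values just listed can be computed as in the sketch below. Quartile conventions differ between textbooks and software, so the `"inclusive"` method of Python's `statistics.quantiles` is an assumption here and may not match the definition used in this course.

```python
import statistics

salaries = [20, 21, 21, 25, 25, 27, 29, 30, 32, 35, 36,
            38, 39, 45, 50, 51, 52, 52, 55, 56, 63, 68,
            69, 71, 71, 73, 73, 73, 74, 76]

# Assumed quartile convention: linear interpolation, "inclusive" method.
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# 1.5 x IQR rule: whiskers stop at the most extreme points inside the fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [x for x in salaries if lower_fence <= x <= upper_fence]
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

print("five-number summary:", min(salaries), q1, q2, q3, max(salaries))
print("IQR:", iqr)
print("whisker ends:", min(inside), max(inside))
print("outliers:", outliers)
```

Under this convention no salary falls outside the fences, so the whiskers reach the minimum and maximum and no points are flagged as outliers.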
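$s^2 = \frac{1}{n-1}\left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left( \sum_{i=1}^{n} x_i \right)^2 \right]$
$s^2 = \frac{n}{n-1}\left[ \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2 \right]$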
The sample standard deviation
The sample standard deviation is simply the square root of the sample
variance.
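$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$s = \sqrt{\frac{n}{n-1}\left[ \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n}\sum_{i=1}^{n} x_i \right)^2 \right]}$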
Example 20
You are given the following general insurance claim data:
1100 1500 1600 2000 3500 4000 4800 4800 5000 5100
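The question for Example 20 is not reproduced here. Assuming it asks for the sample variance and standard deviation of these claims, the sketch below computes $s^2$ twice, once from the definition and once from the sum and sum-of-squares shortcut, to show that the two agree. Exact fractions are used so the comparison is not blurred by rounding.

```python
import math
from fractions import Fraction

claims = [1100, 1500, 1600, 2000, 3500, 4000, 4800, 4800, 5000, 5100]
n = len(claims)
mean = Fraction(sum(claims), n)

# Definition: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in claims) / (n - 1)

# Shortcut: only the sum and the sum of squares of the data are needed.
sum_x = sum(claims)
sum_x2 = sum(x * x for x in claims)
s2_shortcut = (sum_x2 - Fraction(sum_x ** 2, n)) / (n - 1)

s = math.sqrt(s2_definition)  # sample standard deviation
print(float(s2_definition), float(s2_shortcut), s)
```

Here the sum is 33,400 and the sum of squares is 135,360,000, giving $s^2 = 23{,}804{,}000 / 9 \approx 2{,}644{,}889$ and $s \approx 1626.3$.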
$s^2 = \frac{n}{n-1}\left[ \frac{1}{n}\sum_{i=1}^{k} f_i x_i^2 - \left( \frac{1}{n}\sum_{i=1}^{k} f_i x_i \right)^2 \right]$
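A minimal sketch of the frequency-data formula follows. The frequency table in it is invented for illustration only (it is not the table from these notes), and `grouped_variance` is a hypothetical helper name.

```python
from fractions import Fraction

def grouped_variance(values, freqs):
    """Sample variance from distinct values x_i and their frequencies f_i."""
    n = sum(freqs)
    mean = Fraction(sum(f * x for x, f in zip(values, freqs)), n)
    mean_sq = Fraction(sum(f * x * x for x, f in zip(values, freqs)), n)
    return Fraction(n, n - 1) * (mean_sq - mean ** 2)

# Invented illustrative data: number of claims and how many policies had that many.
claim_counts = [0, 1, 2, 3]
policies = [5, 3, 1, 1]

s2 = grouped_variance(claim_counts, policies)

# Expanding the table back into raw observations gives an independent check.
expanded = [x for x, f in zip(claim_counts, policies) for _ in range(f)]
print(s2, float(s2), len(expanded))
```

For this invented table, $n = 10$, the mean is 0.8, the mean of squares is 1.6, and $s^2 = \frac{10}{9}(1.6 - 0.64) = 16/15$.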
Example 21
The table below gives the distribution of the number of claims per
policyholder for different policyholders in a general insurance portfolio.
c) For any real number $c$, the sum of squares $\sum_{i=1}^{n} (x_i - c)^2$ takes its
minimum value when $c = \bar{x}$.
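One way to see why the mean minimises this sum of squares is to expand around $\bar{x}$; the cross term vanishes because the deviations from the mean sum to zero. A short derivation, offered as a sketch:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - c)^2
  &= \sum_{i=1}^{n} \bigl[(x_i - \bar{x}) + (\bar{x} - c)\bigr]^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2
     + 2(\bar{x} - c)\sum_{i=1}^{n} (x_i - \bar{x})
     + n(\bar{x} - c)^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - c)^2 .
\end{aligned}
```

The last term is non-negative and equals zero only when $c = \bar{x}$, so that is where the minimum is attained.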
The sample variance
Some properties of the sample variance:
where $n = n_1 + n_2 + \cdots + n_k$.
The interquartile range (IQR)
The IQR is more resistant to extreme observations in the data than the sample
variance:
IQR = $Q_3 - Q_1$
i.e. it is the length of the interval containing the central half of the
observations.
Example 22
The following data show the salaries (in 1000s) of a sample of 30 employees
from a company:
20 21 21 25 25 27 29 30 32 35 36
38 39 45 50 51 52 52 55 56 63 68
69 71 71 73 73 73 74 76
Stock A: 12 15 17 14 13 13
The converse is not true! $\beta_1 = 0$ does not necessarily imply that data are
symmetric.
Example 24 (From Example 7)
The amounts of betting payouts (in pounds) from a particular company in a
given year are shown below:
2527 1787 3770 5701 2310 1652 822 918 2770 4891
3061 2126 1729 4618 3469
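The question for Example 24 and the course's exact formulas for skewness and kurtosis are not shown here. As a hedged sketch, the code below uses the common moment-based versions (divisor $n$), with kurtosis measured relative to the Normal distribution so that the Normal scores 0. These definitions and the function names are assumptions and may differ from the ones used in the notes.

```python
payouts = [2527, 1787, 3770, 5701, 2310, 1652, 822, 918, 2770, 4891,
           3061, 2126, 1729, 4618, 3469]

def central_moment(data, k):
    """k-th sample central moment, using divisor n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def skewness(data):
    """Moment coefficient of skewness m3 / m2**1.5 (assumed definition)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """m4 / m2**2 - 3, so the Normal distribution scores 0 (assumed definition)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

print("skewness:", skewness(payouts))
print("excess kurtosis:", excess_kurtosis(payouts))
```

For these payouts the skewness comes out positive, which matches the long right tail created by the few large values such as 5701 and 4891.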
More peaked than the Normal distribution, with heavy tails: kurtosis > 0
Less peaked than the Normal distribution, with light tails: kurtosis < 0