
Topic3 Descriptive Statistics

This document provides an overview of descriptive statistics used to summarize datasets. It discusses how to summarize categorical data through frequency distributions, tables, pie charts, and bar charts. For quantitative data, it describes measures of central tendency like the mean, median, and mode, as well as measures of variability such as standard deviation, range, and interquartile range. Percentiles and how to identify outliers are also covered.

Uploaded by

Alfred Wong

Descriptive Statistics

Linh Nghiem
MATH1905
Overview

• In this topic, we will look at some graphical and numerical measures to summarize variables in a dataset.

• Reading: Chapters 2 and 3 of Ross


Summarizing Categorical Data
Frequency distribution

Favorite Sport

Football Others Others Football Baseball Basketball Others Basketball

Tennis Soccer Basketball Soccer Others Soccer Football Basketball

Others Football Basketball Football Others Baseball Football Soccer

Others Football Basketball Football Baseball Soccer Basketball Basketball


Tabular summary
Favorite Sport

Sport         # Students (Frequency)   Percent Frequency (%)
Football               8                     25
Basketball             8                     25
Baseball               3                      9.375
Tennis                 1                      3.125
Soccer                 5                     15.625
Others                 7                     21.875
Total                 32                    100
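The tabular summary can be reproduced with a short Python sketch. It is only an illustration: the variable names are made up here, and the responses are typed in from the Favorite Sport listing.

```python
from collections import Counter

# Responses typed in from the Favorite Sport listing (32 students).
responses = (
    "Football Others Others Football Baseball Basketball Others Basketball "
    "Tennis Soccer Basketball Soccer Others Soccer Football Basketball "
    "Others Football Basketball Football Others Baseball Football Soccer "
    "Others Football Basketball Football Baseball Soccer Basketball Basketball"
).split()

counts = Counter(responses)       # frequency of each category
total = sum(counts.values())      # number of students
percents = {sport: 100 * k / total for sport, k in counts.items()}

# Print one row per category, most frequent first.
for sport, k in counts.most_common():
    print(f"{sport:<12}{k:>3}{percents[sport]:>9.3f}")
```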
Pie chart

[Pie chart: Favorite Sport. Football 25%, Basketball 25%, Baseball 9.375%, Tennis 3.125%, Soccer 15.625%, Others 21.875%.]
Bar chart
[Bar chart: Favorite Sport. Vertical axis: number of students (frequency); horizontal axis: Football, Basketball, Baseball, Tennis, Soccer, Others.]
Grouped bar chart
[Grouped bar chart: Favourite Sport by Gender. Vertical axis: number of students; horizontal axis: Football, Basketball, Baseball, Tennis, Soccer, Others; one bar each for Women and Men.]
Summarizing Quantitative Data
Main descriptions

• Location:
- Mean, median and mode
- Relative standing: quartiles, percentiles
• Variability:
- Standard deviation
- Range and interquartile range
• Shape:
- Symmetry and skewness
- Uni-modal and multi-modal
Mean

• Average of all the elements

• Sample mean:

\[
\bar{x} = \frac{1}{n}(x_1 + x_2 + \dots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i,
\]

with n = sample size.


Example

Hours Worked Last Week


40 20 40 35 0
50 0 25 30 40
40 40 40 40 10
40 62 37 40 45
40 30 10 42 40

\[
\bar{x} = \frac{40 + 20 + 40 + \dots + 42 + 40}{25} = 34.24
\]
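A minimal Python sketch of the sample-mean formula. It uses the 25 values as they appear in the sorted listing of this dataset, which is an assumption about the listing the reported mean is based on; the variable names are illustrative.

```python
# Hours worked last week, taken from the sorted listing of this dataset
# (assumption: that listing is the one the reported mean refers to).
hours = [0, 0, 10, 10, 20, 25, 30, 30, 35, 37,
         40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
         42, 45, 50, 60, 62]

n = len(hours)            # sample size
x_bar = sum(hours) / n    # sample mean: (1/n) * sum of the x_i
print(n, round(x_bar, 2))
```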
Median

• The middle observation

- Sort the observations in ascending order
- If the number of observations is odd, the median = middle observation
- If the number of observations is even, the median = average of the middle two observations
Median

Sorted data (by rows)


0 0 10 10 20
25 30 30 35 37
40 40 40 40 40
40 40 40 40 40
42 45 50 60 62

We have n = 25 observations, so the median is the 13th observation in the sorted data; i.e., median = 40.
Median

Student marks in a quiz


40 20 80 48 76 46
45 65 55 50 42 64

Sorted marks
20 40 42 45 46 48
50 55 64 65 76 80

We have n = 12 observations, so the two middle observations are the 6th and the 7th in the sorted data.

\[
\text{Median} = \frac{48 + 50}{2} = 49
\]
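The odd and even cases can be sketched in Python as follows. The function name `median` and the variable names are illustrative; the hours are entered in sorted order.

```python
def median(values):
    """Middle observation of the sorted data; the average of the
    two middle observations when the count is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # odd n: the single middle observation
    return (s[mid - 1] + s[mid]) / 2   # even n: average of the middle two

hours = [0, 0, 10, 10, 20, 25, 30, 30, 35, 37,
         40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
         42, 45, 50, 60, 62]
marks = [40, 20, 80, 48, 76, 46, 45, 65, 55, 50, 42, 64]

print(median(hours))   # n = 25, the 13th sorted observation
print(median(marks))   # n = 12, average of the 6th and 7th
```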
Mode

• The most frequent value(s) in the dataset, if any

- E.g., hours-worked data: mode = 40
- E.g., student quiz data: no mode (every mark occurs exactly once)
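A small Python sketch of finding the mode. The helper name `modes` and the convention of returning an empty list when every value occurs only once are assumptions made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or an empty list when
    every value occurs exactly once (treated here as 'no mode')."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, k in counts.items() if k == top)

hours = [40, 20, 40, 35, 0, 50, 0, 25, 30, 40, 40, 40, 40, 40, 10,
         40, 62, 37, 40, 45, 40, 30, 10, 42, 40]
marks = [40, 20, 80, 48, 76, 46, 45, 65, 55, 50, 42, 64]

print(modes(hours))
print(modes(marks))
```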
Relative standing

• A measure of relative standing describes the location of a particular value relative to the rest of the distribution of the data.

• Given a set of data and a proportion p between 0 and 1, the (100p)th percentile is the value that divides the data so that (100p)% of the data values are below it.

[Figure: the 20th percentile, with 20% of the data below it and 80% above.]
Percentiles
• First, second, third quartiles: p = .25, .50, .75 respectively.
- Median = 50th percentile = second quartile.
- Denoted as Q1, Q2, and Q3 respectively.

[Figure: the first quartile Q1 (25th percentile), second quartile Q2 (50th percentile, the median), and third quartile Q3 (75th percentile) split the sorted data into four parts of 25% each.]
Calculating percentiles
• If we have n observations, the location of the pth percentile (with p now expressed in percent, from 0 to 100) is given by

\[
L_p = (n + 1)\,\frac{p}{100}
\]

• Use linear interpolation if L_p is not an integer.


• Example: What is the 31st percentile of the work hours data?
Sorted data (by rows)
0 0 10 10 20
25 30 30 35 37
40 40 40 40 40
40 40 40 40 40
42 45 50 60 62
• n = 25, p = 31, so L_31 = (25 + 1) × 31/100 = 8.06
• Hence, the 31st percentile lies 0.06 of the distance between the 8th and the 9th observations in the sorted data, which are 30 and 35 respectively.
• Then the 31st percentile is 30 + 0.06 × (35 − 30) = 30.3.
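The location rule and the interpolation step can be sketched in Python. The function name `percentile` is illustrative, and clamping locations that fall outside 1..n is an assumption added here; the slides do not cover that case.

```python
import math

def percentile(values, p):
    """p-th percentile (p in percent) located at L_p = (n + 1) * p / 100,
    with linear interpolation between neighbouring sorted observations."""
    s = sorted(values)
    n = len(s)
    loc = (n + 1) * p / 100        # 1-based location L_p
    loc = min(max(loc, 1), n)      # assumption: clamp locations outside 1..n
    lower = math.floor(loc)        # position of the observation just below
    frac = loc - lower             # fractional part of the location
    if lower == n:
        return float(s[-1])
    return s[lower - 1] + frac * (s[lower] - s[lower - 1])

hours = [0, 0, 10, 10, 20, 25, 30, 30, 35, 37,
         40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
         42, 45, 50, 60, 62]

print(round(percentile(hours, 31), 2))   # 0.06 of the way from 30 to 35
```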
Calculating percentiles

• The above method is only an approximation. In practice, there are many other formulas for computing percentiles on the data.
- The quantile() function in R has 9 options for type, each corresponding to one distinct way of calculating quantiles and leading to a (slightly) different result.

• The concept is more important than the actual computation.
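The same point can be illustrated in Python rather than R. The standard-library function `statistics.quantiles` offers two methods: "exclusive" places the quartiles at (n + 1) × p, like the location formula used in these slides, while "inclusive" treats the minimum and maximum as the 0th and 100th percentiles. The quiz marks are used as sample data; this is only a sketch of how two conventions can disagree.

```python
import statistics

marks = [40, 20, 80, 48, 76, 46, 45, 65, 55, 50, 42, 64]

# Quartiles under two different conventions.
q_exclusive = statistics.quantiles(marks, n=4, method="exclusive")
q_inclusive = statistics.quantiles(marks, n=4, method="inclusive")

print(q_exclusive)   # [42.75, 49.0, 64.75]
print(q_inclusive)   # [44.25, 49.0, 64.25]
```

Both methods agree on the median here but give slightly different first and third quartiles.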


Measure of variability

• Variance: measures the spread of the observations around the mean; always non-negative.

Sample variance

\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
    = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)
\]

• Standard deviation: square root of the variance.

- Has the same unit as the observations.
Example

Student marks in a quiz


40 20 80 48 76 46
45 65 55 50 42 64

\[
\bar{x} = \frac{40 + 20 + 80 + \dots + 64}{12} = 52.58
\]

\[
s^2 = \frac{1}{12 - 1}\left\{(40 - 52.58)^2 + (20 - 52.58)^2 + \dots + (64 - 52.58)^2\right\} = 277.3561
\]

\[
s = \sqrt{277.3561} = 16.65
\]
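A Python sketch that evaluates both forms of the sample-variance formula on the quiz marks and checks that they agree. The variable names are illustrative.

```python
import math

marks = [40, 20, 80, 48, 76, 46, 45, 65, 55, 50, 42, 64]
n = len(marks)
x_bar = sum(marks) / n

# Definition: squared deviations from the mean, divided by n - 1.
s2_def = sum((x - x_bar) ** 2 for x in marks) / (n - 1)

# Shortcut: sum of squares minus n * mean^2, divided by n - 1.
s2_short = (sum(x * x for x in marks) - n * x_bar ** 2) / (n - 1)

s = math.sqrt(s2_def)   # standard deviation, in the same unit as the marks
print(round(s2_def, 4), round(s2_short, 4), round(s, 2))
```

The two expressions give the same variance up to floating-point rounding, about 277.3561, with a standard deviation of about 16.65.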
Range and interquartile range
• Range: difference between maximum and minimum
• Interquartile range (IQR): difference between third and first quartile.

Sorted marks
20 40 42 45 46 48
50 55 64 65 76 80

Range = 80 - 20 = 60
\[
L_{75} = (12 + 1) \times \frac{75}{100} = 9.75, \qquad Q_3 = 64 + (65 - 64) \times 0.75 = 64.75
\]

\[
L_{25} = (12 + 1) \times \frac{25}{100} = 3.25, \qquad Q_1 = 42 + (45 - 42) \times 0.25 = 42.75
\]

IQR = Q3 − Q1 = 64.75 − 42.75 = 22
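The range and IQR calculation can be sketched in Python using the same (n + 1) × p/100 location rule. The helper name `percentile` is illustrative, and it assumes the location falls strictly between 1 and n, which holds for the quartiles of this dataset.

```python
import math

def percentile(values, p):
    """Percentile at location (n + 1) * p / 100 with linear interpolation.
    Assumes the location lies strictly between 1 and n."""
    s = sorted(values)
    loc = (len(s) + 1) * p / 100
    lower = math.floor(loc)
    return s[lower - 1] + (loc - lower) * (s[lower] - s[lower - 1])

marks = [40, 20, 80, 48, 76, 46, 45, 65, 55, 50, 42, 64]

data_range = max(marks) - min(marks)   # maximum minus minimum
q1 = percentile(marks, 25)
q3 = percentile(marks, 75)
iqr = q3 - q1                          # third quartile minus first quartile
print(data_range, q1, q3, iqr)
```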


Outliers
• Broadly speaking, an outlier is an observation that is far from the
majority of other observations in the data

• A common rule (suggested by Tukey) flags as an outlier any point that is either:
- Smaller than Q1 − 1.5 × IQR
- Bigger than Q3 + 1.5 × IQR

• Outliers can contain information, so don't remove them automatically.


Student marks in a quiz
40 5 80 50 76 46
45 65 55 50 42 64

Q1 = 42.75, Q3 = 64.75, IQR = 22
Q1 − 1.5 × IQR = 9.75
Q3 + 1.5 × IQR = 97.75

The mark 5 is below 9.75, so it is flagged as an outlier; no mark is above 97.75.
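Tukey's rule can be sketched in Python on the marks above. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which uses the (n + 1) × p location rule; the variable names are illustrative.

```python
import statistics

marks = [40, 5, 80, 50, 76, 46, 45, 65, 55, 50, 42, 64]

# Quartiles with the (n + 1) * p location rule.
q1, _, q3 = statistics.quantiles(marks, n=4, method="exclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag points outside the fences; flagged points are candidates to
# inspect, not values to delete automatically.
outliers = [x for x in marks if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```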
Outliers

• The median is more robust to outliers than the mean.


• The interquartile range is more robust to outliers than the variance/
standard deviation.

Student marks in a quiz (original)
40 20 80 48 76 46
45 65 55 50 42 64
Mean = 52.58, Median = 49, SD = 16.65, IQR = 22

Student marks in a quiz (with the mark 20 replaced by 5)
40 5 80 48 76 46
45 65 55 50 42 64
Mean = 51.33, Median = 49, SD = 19.62, IQR = 22
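The comparison can be sketched in Python. The second dataset is assumed here to differ from the first only by replacing the mark 20 with 5, since that change reproduces the reported mean of 51.33 and SD of 19.62; the helper name `summary` is illustrative.

```python
import statistics

original = [40, 20, 80, 48, 76, 46, 45, 65, 55, 50, 42, 64]
# Assumption: only the mark 20 is replaced (by 5) in the second dataset.
modified = [5 if x == 20 else x for x in original]

def summary(values):
    """Mean, median, standard deviation and IQR of a list of values."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "mean": round(statistics.mean(values), 2),
        "median": med,
        "sd": round(statistics.stdev(values), 2),   # sample SD (n - 1 divisor)
        "iqr": q3 - q1,
    }

print(summary(original))
print(summary(modified))
```

The mean and SD move when the extreme mark changes, while the median and IQR stay the same.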
Histogram

• A histogram plots the frequency of data falling into defined intervals.

• Constructing histograms:
- Determine the minimum and maximum values of the data
- Divide the range into non-overlapping, contiguous, and roughly equal intervals
- Count the frequency or relative frequency in each interval
- Plot the intervals on the horizontal axis and the (relative) frequency on the vertical axis.
• There is no consensus rule for choosing the number of intervals.
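The construction steps can be sketched in plain Python. The function name `histogram`, the choice of 4 intervals, and the use of the quiz marks as data are all assumptions made for this illustration.

```python
def histogram(values, n_bins):
    """Count how many values fall into n_bins equal-width, contiguous intervals."""
    lo, hi = min(values), max(values)   # step 1: minimum and maximum
    width = (hi - lo) / n_bins          # step 2: equal widths (assumes hi > lo)
    counts = [0] * n_bins
    for x in values:                    # step 3: frequency per interval
        # Each interval includes its left edge; the maximum joins the last one.
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

marks = [40, 20, 80, 48, 76, 46, 45, 65, 55, 50, 42, 64]
edges, counts = histogram(marks, n_bins=4)   # 4 intervals is an arbitrary choice
rel = [c / len(marks) for c in counts]       # relative frequencies

for i, c in enumerate(counts):               # step 4: a crude text plot
    print(f"{edges[i]:5.1f} to {edges[i + 1]:5.1f}  {'#' * c}  {rel[i]:.2f}")
```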
Histogram: Examples
Example
Usual Travel Time to Work (Minutes)
Source: 2009 American Community Survey

Travel time      Number of workers (thousands)   Relative frequency (%)
<10 Minutes               18,565                        14.0
10-14 Minutes             19,328                        14.6
15-19 Minutes             20,775                        15.7
20-24 Minutes             19,559                        14.7
25-29 Minutes              8,040                         6.1
30-34 Minutes             17,874                        13.5
35-44 Minutes              8,321                         6.3
45-59 Minutes              9,834                         7.4
60+ Minutes               10,378                         7.8
Total                    132,674                       100.0
Histogram
[Histogram of usual travel time to work. Vertical axis: Percent Frequency; horizontal axis: Travel Time (Minutes).]


Symmetry and skewness
Symmetry and skewness
Shape          Relationship      Example shown
Symmetric      Mean ≈ Median     Mean = Median ≈ 0
Right-skewed   Mean > Median     Mean = 0.95 > Median = 0.68
Left-skewed    Mean < Median     Mean = 12.8 < Median = 13.2

[Figure: three histograms (vertical axis: Frequency), one for each shape.]
Unimodal, bimodal, and multimodal
Boxplot

A boxplot (or box-and-whisker plot) provides a summary of continuous data based on the five-number summary: minimum, Q1, median, Q3, and maximum.
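The five-number summary behind a boxplot can be computed with a short Python sketch. The function name is illustrative, the quiz marks serve as sample data, and the quartiles use the "exclusive" method of `statistics.quantiles`, which follows the (n + 1) × p location rule; other conventions would give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return min(values), q1, med, q3, max(values)

marks = [40, 20, 80, 48, 76, 46, 45, 65, 55, 50, 42, 64]
print(five_number_summary(marks))
```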
Boxplot
Side-by-side boxplots are useful for comparing the distribution of one quantitative variable across the categories of a qualitative variable.

[Figure: side-by-side boxplots of fuel consumption on highways for different classes of car. Vertical axis: hwy; horizontal axis: class (2seater, compact, midsize, minivan, pickup, subcompact, suv).]
Summarizing Data for Two Variables
Cross-tabulation

Quality Rating and Prices for 300 LA Restaurants

Quality Rating   $10-19   $20-29   $30-39   $40-49   Total
Good               42       40        2        0       84
Very Good          34       64       46        6      150
Excellent           2       14       28       22       66
Total              78      118       76       28      300
Joint and marginal percentages

Quality Rating and Prices for 300 LA Restaurants (%)

Quality Rating   $10-19   $20-29   $30-39   $40-49   Total
Good              14.0     13.3      0.7      0.0     28.0
Very Good         11.3     21.3     15.3      2.0     50.0
Excellent          0.7      4.7      9.3      7.3     22.0
Total             26.0     39.3     25.3      9.3    100
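Joint and marginal percentages follow directly from the counts in the cross-tabulation: each count, row total, or column total is divided by the grand total of 300. A minimal Python sketch, with illustrative variable names:

```python
prices = ["$10-19", "$20-29", "$30-39", "$40-49"]
counts = {
    "Good":      [42, 40, 2, 0],
    "Very Good": [34, 64, 46, 6],
    "Excellent": [2, 14, 28, 22],
}
total = sum(sum(row) for row in counts.values())   # grand total

# Joint percentages: each cell as a share of the grand total.
joint = {r: [100 * c / total for c in row] for r, row in counts.items()}

# Marginal percentages: row totals and column totals as shares of the grand total.
row_marginal = {r: 100 * sum(row) / total for r, row in counts.items()}
col_marginal = [100 * sum(row[j] for row in counts.values()) / total
                for j in range(len(prices))]

for r, row in joint.items():
    cells = "".join(f"{v:8.1f}" for v in row)
    print(f"{r:<10}{cells}{row_marginal[r]:8.1f}")
print(f"{'Total':<10}" + "".join(f"{v:8.1f}" for v in col_marginal))
```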
Cross-tabulation: Row Percentages

Quality Rating and Prices for 300 LA Restaurants (row percentages: each count divided by its row total)

Quality Rating   $10-19   $20-29   $30-39   $40-49   Total
Good              50.0     47.6      2.4      0.0     100
Very Good         22.7     42.7     30.7      4.0     100
Excellent          3.0     21.2     42.4     33.3     100
Cross-tabulation: Column Percentages

Quality Rating and Prices for 300 LA Restaurants (column percentages: each count divided by its column total)

Quality Rating   $10-19   $20-29   $30-39   $40-49
Good              53.8     33.9      2.6      0.0
Very Good         43.6     54.2     60.5     21.4
Excellent          2.6     11.9     36.8     78.6
Total            100      100      100      100
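Row and column percentages differ only in the divisor: the row total for the former, the column total for the latter. A Python sketch using the restaurant counts, with illustrative variable names:

```python
counts = {
    "Good":      [42, 40, 2, 0],
    "Very Good": [34, 64, 46, 6],
    "Excellent": [2, 14, 28, 22],
}
n_cols = 4

# Row percentages: share of each price band within a quality rating.
row_pct = {r: [100 * c / sum(row) for c in row] for r, row in counts.items()}

# Column percentages: share of each quality rating within a price band.
col_totals = [sum(row[j] for row in counts.values()) for j in range(n_cols)]
col_pct = {r: [100 * c / col_totals[j] for j, c in enumerate(row)]
           for r, row in counts.items()}

for r in counts:
    print(r, [round(v, 1) for v in row_pct[r]], [round(v, 1) for v in col_pct[r]])
```

For example, 50% of the restaurants rated Good are in the $10-19 band, while 53.8% of the restaurants in the $10-19 band are rated Good.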
Covariance and correlation
• Covariance describes how two quantitative variables change in relation to each other.

• E.g., for two stocks A and B, we want to see how their returns move with each other.
- A positive covariance suggests that when the return on A increases (decreases), the return on B tends to increase (decrease) as well.
- A negative covariance suggests that when the return on A increases (decreases), the return on B tends to decrease (increase).
Covariance and correlation

• For two quantitative variables X and Y,

\[
\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
                   = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)
\]

with x̄ and ȳ the sample means of X and Y respectively.

• Correlation is covariance standardised by the standard deviations.

\[
r_{XY} = \frac{\mathrm{Cov}(X, Y)}{s_x s_y}
\]
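The second form of the covariance formula follows from expanding the product and using the facts that the x_i sum to n x̄ and the y_i sum to n ȳ. A short derivation, added here as a worked step:

```latex
\begin{align*}
\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
  &= \sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i
     - \bar{x}\sum_{i=1}^{n} y_i + n\bar{x}\bar{y} \\
  &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} \\
  &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}.
\end{align*}
```

Setting Y = X in this identity gives the corresponding shortcut for the sample variance.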
Example: Rates of return (%) for two stocks X and Y
Scatterplots
Covariance and correlation

• −1 ≤ rXY ≤ 1 (this is a consequence of the Cauchy–Schwarz inequality)

• Correlation measures the strength of the linear relationship between X and Y
- A positive (negative) correlation implies a positive (negative) association
- rXY = +1 (−1) indicates a perfect positive (negative) linear relationship
- rXY = 0 implies no linear relationship
Correlation

• Correlation is unaffected by the scale/unit of measurement of either variable.

Height (cm)   Weight (kg)   Height (m)   Weight (lb)
151.76        47.82         1.5176       105.204
139.7         36.49         1.397         80.278
136.52        31.86         1.3652        70.092
156.85        53.04         1.5685       116.688
145.41        41.28         1.4541        90.816
163.83        62.99         1.6383       138.578
149.22        38.24         1.4922        84.128

r = 0.96 for height in cm against weight in kg, and r = 0.96 for height in m against weight in lb.
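The invariance can be checked numerically with a Python sketch of the sample covariance and correlation formulas, applied to the table's values. The function names are illustrative.

```python
import math

def covariance(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance standardised by the two sample standard deviations."""
    s_x = math.sqrt(covariance(x, x))   # Cov(X, X) is the sample variance
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)

height_cm = [151.76, 139.7, 136.52, 156.85, 145.41, 163.83, 149.22]
weight_kg = [47.82, 36.49, 31.86, 53.04, 41.28, 62.99, 38.24]
height_m = [1.5176, 1.397, 1.3652, 1.5685, 1.4541, 1.6383, 1.4922]
weight_lb = [105.204, 80.278, 70.092, 116.688, 90.816, 138.578, 84.128]

r_metric = correlation(height_cm, weight_kg)
r_mixed = correlation(height_m, weight_lb)
print(round(r_metric, 2), round(r_mixed, 2))

# Covariance, unlike correlation, does change with the units.
print(covariance(height_cm, weight_kg), covariance(height_m, weight_lb))
```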
Correlation does not imply causation
Summary

• We can summarise data using tabular, graphical, and numerical measures.
- Many variations of the same measures are possible.
- Many other descriptive statistics are possible, e.g., trimmed mean, coefficient of skewness, kurtosis, etc.
- It is important to know the pros and cons of each measure.

• These measures are statistics, because we compute them on samples (observed data).
- A central question is whether these statistics represent the corresponding quantities in the population well.
- Answering this question requires concepts from probability, which is the next topic of the course.
