Descriptive Statistics SV

The document discusses various descriptive statistics including measures of central tendency like mean, median, and mode as well as measures of dispersion. It provides examples and definitions of each statistic and discusses when different measures of central tendency are more appropriate to use than others such as the median being less influenced by outliers than the mean. The document also includes examples of how different statistics can be applied to real world scenarios like analyzing salaries and home prices.

Uploaded by

wexoutletbrand

Descriptive Statistics

A set of numbers that ‘describe’ a set of data


Measures of Central Tendency
(the various averages)
Some ‘central’ aspect of the data

Measures of Dispersion
How ‘spread-out’ or ‘dispersed’ the data is
Measures of Central Tendency
Mean (the Arithmetic mean)

Median

Mode
Mean
The mean of the following set of observations,

5, 0.9, 0.2, 2, 1

is (5 + 0.9 + 0.2 + 2 + 1) / 5 = 9.1 / 5 = 1.82
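As a quick check of the arithmetic above, here is a minimal Python sketch. Python is an assumption made for illustration; the slides themselves work in Excel. It computes the mean by hand and again with the standard library's statistics.mean.

```python
from statistics import mean

observations = [5, 0.9, 0.2, 2, 1]

# Mean = sum of the observations divided by how many there are.
manual_mean = sum(observations) / len(observations)

# The standard library gives the same result.
library_mean = mean(observations)

print(round(manual_mean, 2), round(library_mean, 2))
```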
Median
The Median of a set of ordered observations is
the middle number, which divides the data into two
equal halves. With an even number of observations,
it is the average of the two middle numbers.

=MEDIAN(number1, number2, …)
Mean versus Median
When is the Median a better summary description
of data than the Mean?
Let's take a small seven-employee firm with the
following salaries:
$28,000
$33,000
$33,000
$34,000
$37,000
$40,000
$400,000

What is the ‘typical’ salary in this group?
Mean ≈ $86,400
Median = $34,000
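The comparison can be reproduced with a short Python sketch (again an illustration, not part of the original slides). It also drops the largest salary, an extra step added here, to show how far a single extreme value moves each measure.

```python
from statistics import mean, median

salaries = [28_000, 33_000, 33_000, 34_000, 37_000, 40_000, 400_000]

mean_salary = mean(salaries)      # pulled upward by the single 400,000 salary
median_salary = median(salaries)  # the 4th of the 7 ordered values

# Drop the largest salary to see how much one extreme value moves each measure.
without_top = sorted(salaries)[:-1]
mean_without_top = mean(without_top)
median_without_top = median(without_top)

print(f"mean: {mean_salary:,.0f}  median: {median_salary:,.0f}")
print(f"without the top salary: mean {mean_without_top:,.0f}, median {median_without_top:,.0f}")
```

Removing one salary moves the mean by more than 50,000 dollars, while the median moves by only 500.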
Mean versus Median
The Mean is influenced to a greater extent by
extreme observations (outliers)

Income and price data generally follow this pattern
Case Study: Outliers

AI thinks people with autism are tall, thin,
blond male adolescents.
Does AI have to eliminate outliers?
Mode
The mode is the most frequently occurring
value in a set of data.

=MODE.SNGL(number1, number2, …)
Mode of your responses about organic products

Responses: 4, 3, 5, 3, 3, 2, 3

Mode is 3
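A minimal Python sketch of the same count, assuming the seven responses listed on the slide are the full data set. statistics.mode plays the role of Excel's MODE.SNGL here, and multimode (Python 3.8+) reports ties if there are any.

```python
from collections import Counter
from statistics import mode, multimode

responses = [4, 3, 5, 3, 3, 2, 3]

counts = Counter(responses)       # value -> number of occurrences
most_common = mode(responses)     # single most frequent value
all_modes = multimode(responses)  # every value tied for most frequent

print(counts[most_common], most_common, all_modes)
```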
What is the most popular
pie flavor in the US?
Mode
Not a very relevant descriptive statistic when the data is essentially
continuous.
Daily exchange rate, Dollar to Euro

Date        Rate
1-Jan-16    0.920945
2-Jan-16    0.920555
4-Jan-16    0.920725
5-Jan-16    0.926355
6-Jan-16    0.931012
7-Jan-16    0.929196
8-Jan-16    0.921235
9-Jan-16    0.917684
11-Jan-16   0.91533
12-Jan-16   0.918274
13-Jan-16   0.923063
…           …
Case Study
Occupancy

Time Flow Counter-Flow
8:05 114 64
8:10 141 71
8:15 145 68
8:20 101 70
8:25 113 53
8:30 117 65
8:35 141 72
8:40 918 73
8:45 1000 54
8:50 134 60
8:55 137 72
9:00 129 57
9:05 132 73
9:10 114 63
9:15 143 55
9:20 124 53
9:25 114 73
9:30 136 62
9:35 111 62
Case Study

Based on the data provided and using basic descriptive statistics, can you
explain why people perceive the London Tube as always crowded, with more
than 1,000 passengers at a time? The London government insists that the
average occupancy is between 130 and 150 people.

How would you analyze and present the data to portray a more realistic
view of what is actually happening? More importantly, what would you
suggest to avoid this perception?

How do you think we can calculate the actual number of people travelling,
instead of using predictions based on historical records?
Case Study Solution

• Crush capacity: 1,000+ people, but the average occupancy is 130 (the
average, not the occupancy at quiet moments)
• You can make an educated guess at passenger numbers or distance traveled
based on data proxies: return journeys, use of a connecting service, or use
of the Wi-Fi network
• Rush-hour trains are full, but there are only a few of them. Depending on
the range you choose, you can get a mean closer to 250
• What about the trains running counter to the flow of commuters?
• The two trains at peak time have as many passengers as the other 20+ in
that range
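One possible way to explore these points, sketched in Python using only the "Flow" column from the occupancy table above. The 500-passenger cut-off that separates the packed trains from the rest is a hypothetical choice made for this illustration, not something given in the case.

```python
from statistics import mean, median

# "Flow" column from the occupancy table (8:05 to 9:35, one reading every 5 minutes).
flow = [114, 141, 145, 101, 113, 117, 141, 918, 1000, 134,
        137, 129, 132, 114, 143, 124, 114, 136, 111]

peak_threshold = 500  # hypothetical cut-off, chosen only for this sketch
peak = [x for x in flow if x > peak_threshold]
off_peak = [x for x in flow if x <= peak_threshold]

mean_all = mean(flow)               # pulled up by the two packed trains
median_all = median(flow)           # what a "typical" train looks like
mean_off_peak = mean(off_peak)      # average once the two peak trains are set aside
peak_share = sum(peak) / sum(flow)  # share of all passengers carried by the peak trains

print(f"mean {mean_all:.1f}, median {median_all}, off-peak mean {mean_off_peak:.1f}")
print(f"{len(peak)} of {len(flow)} trains carry {peak_share:.0%} of the passengers")
```

On these rows the median falls inside the 130 to 150 range quoted by the government, while nearly half of all passengers ride one of the two packed trains. That is one way to reconcile the official average with what commuters experience.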
Real Estate
• Mean: Real estate agents calculate the mean price of houses in a particular
area so they can inform their clients of what they can expect to spend on a
house.
• Median: Real estate agents also calculate the median price of houses to
gain a better idea of the “typical” home price, since the median is less
influenced by outliers (like multi-million-dollar homes) compared to the
mean.
• Mode: Real estate agents also calculate the mode of the number of
bedrooms per house so they can inform their clients on how many
bedrooms they can expect to have in houses in a particular area.
Advertising
• Mean: Marketers often calculate the mean revenue earned per advertisement so
they can understand how much money their company is making on each ad.
• Median: Marketers also calculate the median revenue earned per advertisement
so they can understand how well the median ad performs.
• Mode: Marketers also calculate the mode of the type of ad used (e.g. newspaper,
TV, radio, digital) so they can know which type of ads their company uses most
often.
Histogram
[Figure: histogram of CEO Salaries (in thousands); vertical axis: Frequency]

What is the “n” of this distribution?
Do you see any trends?
Is 525 an outlier?
What is your best bet on the mean (average) and median?
Histogram
[Figure: histogram of CEO Salaries (in thousands); vertical axis: Frequency]

Skewed to the right
Mean = 404.17
Median = 350
Mean > Median
Histogram
[Figure: histogram of Student Scores (out of 100); vertical axis: Frequency]

Skewed to the left
Mean = 74
Median = 79
Mean < Median
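The link between skew direction and the mean-median comparison can be sketched in Python. Both data sets below are invented for illustration and are not the data behind the histograms. The helper skew_hint is a hypothetical name, and comparing the mean to the median is only a rule of thumb, not a formal skewness measure.

```python
from statistics import mean, median

# Hypothetical data, invented for illustration (not the data behind the slides).
right_skewed = [150, 200, 225, 250, 300, 325, 350, 400, 900, 1050]  # long tail of high values
left_skewed = [35, 50, 70, 75, 78, 80, 82, 85, 90, 95]              # long tail of low values

def skew_hint(values):
    """Return a rough skew direction from the mean-median comparison."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "roughly symmetric"

for name, data in [("right_skewed", right_skewed), ("left_skewed", left_skewed)]:
    print(name, mean(data), median(data), skew_hint(data))
```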


What type of histogram does this data result in?

• Coffee sales during the day
• Housing prices
• Grades on math test
• Sales of airplane tickets
- You are the Director of Sales of a retail store. You want to
provide insights to your CEO on the ROI of your
marketing campaigns and investment.

- You are particularly concerned about the North region, where
you have a new manager, and you want to make sure you diversify
your investment across age brackets, as this has involved
significant financial investment in the last 6 months.
Excel Exercise
Using the OrderList Sales file:

- Calculate the mean (Average) and median for Total Sales
- What is the approximate distribution of Total Sales?
- How is the North region doing? (Use measures of central tendency)
- Create 3 age groups: 21-30, 31-40, 41-50. How is the company doing if you
analyze by age group? (Filter and use the Between function, then cut and
paste the data)
- What other insights and recommendations can you provide?
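For readers who prefer code to Excel filters, the region and age-group steps could be sketched in Python as below. Everything about the data is an assumption: the OrderList Sales file is not shown in the slides, so the rows, the field names (region, age, total_sales) and the helpers age_group and summarize are hypothetical stand-ins.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical rows standing in for the OrderList Sales file; the real
# column names and values are not shown in the slides.
orders = [
    {"region": "North", "age": 24, "total_sales": 120.0},
    {"region": "North", "age": 28, "total_sales": 80.0},
    {"region": "North", "age": 35, "total_sales": 200.0},
    {"region": "North", "age": 38, "total_sales": 260.0},
    {"region": "North", "age": 45, "total_sales": 90.0},
    {"region": "North", "age": 47, "total_sales": 110.0},
    {"region": "North", "age": 52, "total_sales": 500.0},  # outside the three brackets
    {"region": "South", "age": 29, "total_sales": 150.0},
    {"region": "South", "age": 33, "total_sales": 300.0},
]

def age_group(age):
    """Map an age to one of the three brackets from the exercise, or None."""
    if 21 <= age <= 30:
        return "21-30"
    if 31 <= age <= 40:
        return "31-40"
    if 41 <= age <= 50:
        return "41-50"
    return None

def summarize(rows):
    """Return {age bracket: (mean, median)} of total sales for the given rows."""
    by_group = defaultdict(list)
    for row in rows:
        group = age_group(row["age"])
        if group is not None:
            by_group[group].append(row["total_sales"])
    return {g: (mean(v), median(v)) for g, v in sorted(by_group.items())}

north = [row for row in orders if row["region"] == "North"]
north_summary = summarize(north)
print(north_summary)
```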
• 3-flavor gourmet jam stand vs 24-flavor stand
• 30% used the voucher for the 3-flavor
jam stand versus only 3% for the 24-flavor stand

What is our conclusion?

• It turns out… further research showed that it
didn’t matter
• Publication bias: interesting findings get
published. Non-findings or failures to replicate
these findings probably face a higher
publication burden
Measures of Dispersion / Spread
Firm 1              Firm 2
$34,500             $35,800
$30,700             $25,500
$32,900             $31,600
$36,000             $41,700
$34,100             $35,300
$33,800             $33,800
$32,500             $30,800

Mean = $33,500      Mean = $33,500
Median = $33,800    Median = $33,800
Measures of Dispersion / Spread
[Figure: Salaries in Firm 1 and Salaries in Firm 2, plotted on a common axis from $23,000 to $43,000]

Measures of Dispersion / Spread

The ‘Range’ measure

= Maximum of data - Minimum of data


Measures of Dispersion / Spread
Range of salaries in Firm 1
= Maximum Salary - Minimum Salary
= $36,000 - $30,700
= $5,300

Range of salaries in Firm 2
= Maximum Salary - Minimum Salary
= $41,700 - $25,500
= $16,200
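The two range calculations can be reproduced with a short Python sketch. The helper name value_range is hypothetical; the salary figures are the ones from the Firm 1 / Firm 2 table.

```python
firm_1 = [34_500, 30_700, 32_900, 36_000, 34_100, 33_800, 32_500]
firm_2 = [35_800, 25_500, 31_600, 41_700, 35_300, 33_800, 30_800]

def value_range(values):
    """Range = maximum of the data minus minimum of the data."""
    return max(values) - min(values)

range_1 = value_range(firm_1)  # 36,000 - 30,700
range_2 = value_range(firm_2)  # 41,700 - 25,500

print(range_1, range_2)
```

Both firms share the same mean and median, so the range is the first measure here that tells them apart.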
Salaries at a small firm
Firm 2
$35,800
$25,500
$31,600
$41,700
$35,300
$33,800
$30,800

Mean = $33,500
Median = $33,800
Salaries at a small firm
[Figure: salaries on an axis from $23,000 to $43,000, with the Minimum, Maximum and Range marked]

Standard Deviation
[Figure: salaries on an axis from $23,000 to $43,000, with the Mean marked]

Standard Deviation

Excel Command (population standard deviation)

=STDEV.P(number1, number2,…)
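STDEV.P returns the population standard deviation, that is, the square root of the average squared deviation from the mean. A minimal Python sketch of that calculation follows, applied to the two firms' salaries from the earlier table. The helper population_sd is a hypothetical name, and statistics.pstdev is used only as a cross-check.

```python
from math import sqrt
from statistics import mean, pstdev

firm_1 = [34_500, 30_700, 32_900, 36_000, 34_100, 33_800, 32_500]
firm_2 = [35_800, 25_500, 31_600, 41_700, 35_300, 33_800, 30_800]

def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    mu = mean(values)
    return sqrt(sum((x - mu) ** 2 for x in values) / len(values))

sd_1 = population_sd(firm_1)
sd_2 = population_sd(firm_2)

# Cross-check against the standard library's population standard deviation.
print(round(sd_1, 1), round(pstdev(firm_1), 1))
print(round(sd_2, 1), round(pstdev(firm_2), 1))
```

Firm 2's standard deviation comes out roughly three times Firm 1's, even though the two firms have identical means and medians.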
Excel Exercise

Using the Brazil Disidio Salary file, find:

- The standard deviation of Annual Salaries
- What does that number represent?
Understanding the Standard Deviation measure…

Rule of Thumb
For roughly bell-shaped (normal) data, approximately 68% of the
data lie within one standard deviation, and approximately 95% lie
within 2 standard deviations from the mean

Since about 95% of the data span roughly 4 standard deviations, we can
estimate the standard deviation by finding the range and dividing it by 4.
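The rule of thumb can be checked on simulated data. The sketch below is an illustration under stated assumptions: the heights are randomly generated from a normal distribution (mean 168, standard deviation 12, fixed seed) rather than taken from any real class, and share_within is a hypothetical helper. The last lines apply the range / 4 shortcut to the Firm 2 salaries from the dispersion slides.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Simulated, roughly bell-shaped data (hypothetical heights in cm).
heights = [rng.gauss(168, 12) for _ in range(10_000)]

mu, sd = mean(heights), pstdev(heights)

def share_within(values, center, spread, k):
    """Fraction of values lying within k * spread of center."""
    return sum(abs(x - center) <= k * spread for x in values) / len(values)

within_1 = share_within(heights, mu, sd, 1)  # expected to be near 0.68
within_2 = share_within(heights, mu, sd, 2)  # expected to be near 0.95

# Range / 4 shortcut applied to the Firm 2 salaries.
firm_2 = [35_800, 25_500, 31_600, 41_700, 35_300, 33_800, 30_800]
rough_sd = (max(firm_2) - min(firm_2)) / 4

print(f"within 1 SD: {within_1:.1%}, within 2 SD: {within_2:.1%}")
print(f"range / 4 estimate for Firm 2: {rough_sd:,.0f}")
```

For Firm 2 the shortcut gives 4,050, while the population standard deviation of the same seven salaries is about 4,641. The estimate is in the right ballpark but is only a rough guide, especially for small samples.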
How is Variation Expressed in Written Format?
• In a research paper, an average is normally followed by a ‘±’ and another
number
– The average height of all the students was 168 ± 12 cm.
• The first number is the average, and the second number is the standard
deviation.
– This tells the reader that the majority of the students' heights lie within
12 cm above or below the average of 168 cm.
– So, if you randomly picked a student from the classroom, there is a good probability
(roughly 68% by the rule of thumb) that their height would be between 156 and 180 cm.
More Practice?

Mean_Median_Mode_Std dev tutorial


Case Study

Brazilian disidio is an initiative that forces companies to standardize
increments and creates inequality among people within the same salary band.

During the 2 PYs, the government implemented a mandatory increase of 13%
in Year 1 and 7% in Year 2 in your Brazil Business Unit.
Case Study

1) Download the file Case Disidio Brazil
2) Find the mean, median, mode and standard deviation of the salaries.
3) What do you observe? Provide 2-3 highlights
4) This year a funded increase of $270,000 has been approved. How
would you distribute this amount? Make a recommendation to the
company
Possible approaches:
1) Focus on the lowest salaries
2) Reduce the number of outliers
3) Improve equality within bands
4) Focus on areas or types of roles
5) Focus on top performers
Case Study

Reflection and actions

1) Apply the changes proposed and once again, find the mean, median,
mode, standard deviation of the salaries
2) What changed? Do you consider you made the right decision? Why?
3) What other pieces of data would you require to make a better decision?
