
BStats 1

This document discusses key concepts in business statistics including: 1. It introduces classification of data, methods of data collection, and the differences between primary and secondary data. 2. It describes measures of central tendency including mean, median, and mode, and measures of variation. Examples are provided to demonstrate calculating these measures from data sets. 3. Various applications of statistics in business are listed such as financial decisions, performance management, and surveys. Statistical methods like descriptive analysis, correlation, and hypothesis testing are also outlined.

Uploaded by

Agam Prakash

Business Statistics -1

Gayatri V Singh, PhD,


AMITY University

1
Today’s Highlights

• Introduction
• Classification of Data
• Method of Data Collection
• Primary and Secondary Data
• Measures of Central Tendency
• Measures of Variation
2
What is Statistics?
It is the SCIENCE of
collecting,
organizing,
presenting,
analyzing, and
interpreting data (quantitative or qualitative)
for the purpose of assisting in making a more
effective decision.
3
Applications
• For empirical inquiry
• Financial Decisions
• How is the economy doing?
• The impact of technology at work
• Compensation survey
• Performance management
• Employee Satisfaction Survey
• Training feedback evaluation
• Human Resource Accounting
• HR Budgeting

4
Statistical Methods

[Diagram: Statistics, with the labels Descriptive, Analysis, Univariate, Multivariate]

5
Statistical Methods Contd…

Descriptive
• Frequency Distribution
• Measurement of Central Tendency
• Measurement of Dispersion
• Graphical Presentation of Data

Analysis/Inferential
• Correlation and Regression Analysis
• Estimation Theory
• Hypothesis Testing
• Decision Theory
• Operations Research

6
Data Classification
Data
• Quantitative or Numerical
– Discrete: Number of students
– Continuous: Temperature, Time taken for exam
• Qualitative or Attribute
– Nominal: Gender
– Ordinal: Education, Rank of a performance

7
Data Collection Method

8
Types of Data
• Primary Data: Data that are collected
afresh and for the first time, and are therefore
original in character

• Secondary Data: Data that have already been
collected by someone else and have
already been passed through the statistical
process
9
Secondary Data –Sources
• County health departments
• Vital Statistics – birth, death certificates
• Hospital, clinic, school nurse records
• Private and foundation databases
• City and county governments
• Surveillance data from state government programs
• Federal agency statistics - Census, NIH, etc.
Secondary Data – Advantages
• No need to reinvent the wheel.
– If someone has already found the data, take
advantage of it.
• It will save you money.
– Even if you have to pay for access, often it is
cheaper in terms of money than collecting your
own data. (more on this later.)
• It will save you time.
– Primary data collection is very time consuming.
(More on this later, too!)
Secondary Data – Advantages
• It may be very accurate.
– Especially when a government agency has
collected the data, incredible amounts of time and
money went into it. It’s probably highly accurate.
• It has great exploratory value.
– Useful for exploring research questions and
formulating hypotheses to test.
Secondary Data – Limitations
• When was it collected? For how long?
– May be out of date for what you want to analyze.
– May not have been collected long enough for
detecting trends.
– E.g. Have new anticorruption laws impacted
Russia’s government accountability ratings?
Secondary Data – Limitations
• Is the data set complete?
– There may be missing information on some
observations
– Unless such missing information is caught and
corrected for, analysis will be biased.
Primary Data - Source

• Surveys
• Focus groups
• Questionnaires
• Personal interviews
• Experiments and observational study
Primary Data - Limitations
• Do you have the time and money for:
– Designing your collection instrument?
– Selecting your population or sample?
– Pretesting/piloting the instrument to work out
sources of bias?
– Administration of the instrument?
– Entry/collation of data?
Primary Data - Limitations

• Researcher error
– Sample bias
– Other confounding factors
• Uniqueness
– May not be able to compare to other populations
Data Collection Choice
• What you must ask yourself:
– Will the data answer my research question?

• To answer that
– You must first decide what your research question
is
– Then you need to decide what data/variables are
needed to scientifically answer the question
Data collection choice

• If the data exist in secondary form, then use
them to the extent you can, keeping in mind
their limitations.
• But if they do not, and you are able to fund
primary collection, then primary collection is
the method of choice.
Central Tendency
Mean, Median, Mode are measures of
central tendency. We can use them to
describe a set of data.

20
Measures of Central Tendency

• Mean = the calculated average of all the values
in a given data set
• Median = the central value of a data set
arranged in order
• Mode = the value which occurs with most
frequency in a given data set

21
Central Tendencies

Mean: (Average)
The sum of a set of numbers divided by how many numbers
are in the set.

To find the mean of the data set:

2, 1, 8, 0, 2, 4, 3, 4

22
Median: (Middle Value)

The number in the middle of a set of numbers that are
arranged in order from least to greatest. With an even
count, it is the average of the two middle numbers.

To find the median of the data set:

2, 1, 8, 0, 2, 4, 3, 4

23
Mode: (Most frequent value)

The number that occurs most often in a set of numbers.


There can be 1 mode, more than 1 mode, or no mode.

To find the mode of the data set:


2, 1, 8, 0, 2, 4, 3, 4
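
The three definitions above can be checked on this data set with a short Python sketch. It assumes Python 3.8 or later and uses only the standard `statistics` module; the variable names are illustrative and not part of the slides.

```python
import statistics

data = [2, 1, 8, 0, 2, 4, 3, 4]

mean = statistics.mean(data)        # sum of the values divided by how many there are
median = statistics.median(data)    # average of the two middle values, since n is even
modes = statistics.multimode(data)  # every value tied for the highest frequency

print(mean, median, modes)
```

Because 2 and 4 both occur twice, this data set has more than one mode, which is why `multimode` is used here rather than `mode`.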

24
Central Tendencies
18, 21, 8, 12, 26

Find the mean, median, mode and range of these numbers.

8, 12, 18, 21, 26

Mean = (8 + 12 + 18 + 21 + 26) / 5 = 85 / 5 = 17

Median = 18 (the middle value of 8, 12, 18, 21, 26)

Mode = NO MODE

Range = 26 - 8 = 18

25
Central Tendencies

80, 60, 76, 60, 90, 80, 70, 60

Find the mean, median, mode and range of these numbers.

60, 60, 60, 70, 76, 80, 80, 90

Mean = 576 / 8 = 72

Median = (70 + 76) / 2 = 73
(the average of the two middle values of 60, 60, 60, 70, 76, 80, 80, 90)

Mode = 60

Range = 90 - 60 = 30
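
Both worked examples can be reproduced with one small helper. This is only a sketch: the function name `summarize` is hypothetical, and treating a set in which every value occurs equally often as having "no mode" is an assumption chosen to match the first example.

```python
import statistics

def summarize(values):
    """Return the mean, median, modes, and range of a list of numbers."""
    modes = statistics.multimode(values)
    # If every distinct value is tied for most frequent, report no mode.
    if len(modes) == len(set(values)):
        modes = []
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": modes,
        "range": max(values) - min(values),
    }

first = summarize([18, 21, 8, 12, 26])
second = summarize([80, 60, 76, 60, 90, 80, 70, 60])
print(first)
print(second)
```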
26
Mean for grouped data
Calculating the Mean: If there are large amounts of data, it is
easier if it is displayed in a frequency table.

Example 1.
The number of goals scored by Premier League teams over a weekend was
recorded in a table. Calculate the mean and the mode.

Goals x    Frequency, f    fx
0          2               0
1          4               4
2          8               16
3          3               9
4          2               8
5          1               5
           ∑f = 20         ∑fx = 42

Mean = ∑fx / ∑f = 42 / 20 = 2.1

Mode = 2 (the number of goals with the highest frequency, 8)
27
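
The frequency-table calculation can be sketched in Python as follows. The dictionary layout (goals mapped to frequency) is an illustrative choice, not something the slides prescribe.

```python
# Goals scored (x) mapped to the number of teams with that score (f).
frequency = {0: 2, 1: 4, 2: 8, 3: 3, 4: 2, 5: 1}

total_f = sum(frequency.values())                    # sum of f
total_fx = sum(x * f for x, f in frequency.items())  # sum of f * x
mean = total_fx / total_f
mode = max(frequency, key=frequency.get)             # value with the highest frequency

print(total_f, total_fx, mean, mode)
```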
Grouped Data
Large quantities of data can be much more easily viewed and managed if
placed in groups in a frequency table. Grouped data does not enable
exact values for the mean, median and mode to be calculated. Alternate
methods of analysing the data have to be employed.

Example 1.
During 3 hours at Heathrow airport 55 aircraft arrived late. The number of
minutes they were late is shown in the grouped frequency table below.

Data is grouped into 6 class intervals of width 10.

minutes late    frequency
0 - 10          27
10 - 20         10
20 - 30         7
30 - 40         5
40 - 50         4
50 - 60         2
28
Grouped Data
Example 1.
During 3 hours at Heathrow airport 55 aircraft arrived late. The number of
minutes they were late is shown in the grouped frequency table below.

minutes late    frequency, f    midpoint (x)    fx
0 - 10          27              5               135
10 - 20         10              15              150
20 - 30         7               25              175
30 - 40         5               35              175
40 - 50         4               45              180
50 - 60         2               55              110
                ∑f = 55                         ∑fx = 925

Mean estimate = 925/55 = 16.8 minutes
29
Grouped Data
The Modal Class

The modal class is simply the class interval of highest frequency.

Example 1.
During 3 hours at Heathrow airport 55 aircraft arrived late. The number of
minutes they were late is shown in the grouped frequency table below.

minutes late    frequency
0 - 10          27
10 - 20         10
20 - 30         7
30 - 40         5
40 - 50         4
50 - 60         2

Modal class = 0 - 10

30
Grouped Data
The Median Class Interval

Example 1.
During 3 hours at Heathrow airport 55 aircraft arrived late. The number of
minutes they were late is shown in the grouped frequency table below.

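minutes late    frequency
0 - 10          27
10 - 20         10
20 - 30         7
30 - 40         5
40 - 50         4
50 - 60         2

(55 + 1)/2 = 28, so the median is the 28th data value.

The first class holds data values 1 to 27 and the second holds
values 28 to 37, so the 28th data value is in the 10 - 20 class
31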
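
The three grouped-data steps for Example 1 (mean estimate from midpoints, modal class, median class) can be collected into one Python sketch. The function name `grouped_summary` and the tuple layout are hypothetical; the median position uses (n + 1)/2 as on the slide.

```python
# Each class is (lower limit, upper limit, frequency), from the Heathrow example.
classes = [(0, 10, 27), (10, 20, 10), (20, 30, 7),
           (30, 40, 5), (40, 50, 4), (50, 60, 2)]

def grouped_summary(classes):
    n = sum(f for _, _, f in classes)
    # Mean estimate: treat every value in a class as sitting at the class midpoint.
    mean_estimate = sum((lo + hi) / 2 * f for lo, hi, f in classes) / n
    # Modal class: the interval with the highest frequency.
    modal = max(classes, key=lambda c: c[2])
    # Median class: first interval whose cumulative frequency reaches (n + 1) / 2.
    position = (n + 1) / 2
    cumulative = 0
    median_class = None
    for lo, hi, f in classes:
        cumulative += f
        if cumulative >= position:
            median_class = (lo, hi)
            break
    return mean_estimate, (modal[0], modal[1]), median_class

mean_estimate, modal_class, median_class = grouped_summary(classes)
print(round(mean_estimate, 1), modal_class, median_class)
```

The same function should work unchanged on the laps table in Example 2, since it only relies on the class limits and frequencies.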
Grouped Data
Example 2.
A group of University students took part in a sponsored race. The number of
laps completed is given in the table below. Use the information to:
(a) Calculate an estimate for the mean number of laps.
(b) Determine the modal class.
(c) Determine the class interval containing the median.

Data is grouped into 8 class intervals of width 5.

number of laps    frequency (f)
1 - 5             2
6 - 10            9
11 - 15           15
16 - 20           20
21 - 25           17
26 - 30           25
31 - 35           2
36 - 40           1

32
Grouped Data
Example 2.
A group of University students took part in a sponsored race. The number of
laps completed is given in the table below. Use the information to:
(a) Calculate an estimate for the mean number of laps.
(b) Determine the modal class.
(c) Determine the class interval containing the median.

number of laps    frequency (f)    midpoint (x)    fx
1 - 5             2                3               6
6 - 10            9                8               72
11 - 15           15               13              195
16 - 20           20               18              360
21 - 25           17               23              391
26 - 30           25               28              700
31 - 35           2                33              66
36 - 40           1                38              38
                  ∑f = 91                          ∑fx = 1828

Mean estimate = 1828/91 = 20.1 laps
33
Grouped Data
Example 2.
A group of University students took part in a sponsored race. The number of
laps completed is given in the table below. Use the information to:
(a) Calculate an estimate for the mean number of laps.
(b) Determine the modal class.
(c) Determine the class interval containing the median.

number of laps    frequency (f)
1 - 5             2
6 - 10            9
11 - 15           15
16 - 20           20
21 - 25           17
26 - 30           25
31 - 35           2
36 - 40           1

Modal class = 26 - 30

34
Grouped Data
Example 2.
A group of University students took part in a sponsored race. The number of
laps completed is given in the table below. Use the information to:
(a) Calculate an estimate for the mean number of laps.
(b) Determine the modal class.  
(c) Determine the class interval containing the median.

number of laps    frequency (f)
1 - 5             2
6 - 10            9
11 - 15           15
16 - 20           20
21 - 25           17
26 - 30           25
31 - 35           2
36 - 40           1
                  ∑f = 91

(91 + 1)/2 = 46, so the median is the 46th data value.

The cumulative frequencies are 2, 11, 26, 46, so the
46th data value is in the 16 - 20 class
35
Mode = L + [(fm - f1) / ((fm - f1) + (fm - f2))] x h

where:
L is the lower class boundary of the modal class
fm is the frequency of the modal class
f1 is the frequency of the class before the modal class
f2 is the frequency of the class after the modal class
h is the size of the modal class, i.e. the difference between
the upper and lower class boundaries of the modal class.

The modal class is the class with the maximum frequency.
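
As a rough illustration, the grouped-mode formula can be written as a small Python function and applied to the laps table from Example 2. The function name `grouped_mode` is hypothetical, and the class boundaries 25.5 and 30.5 for the 26 - 30 class are an assumption (the slides do not state them).

```python
def grouped_mode(L, fm, f1, f2, h):
    """Interpolated mode for grouped data.

    L:  lower class boundary of the modal class
    fm: frequency of the modal class
    f1: frequency of the class before the modal class
    f2: frequency of the class after the modal class
    h:  size of the modal class
    """
    return L + (fm - f1) / ((fm - f1) + (fm - f2)) * h

# Laps example: modal class 26 - 30, assumed boundaries 25.5 and 30.5 (h = 5).
mode_laps = grouped_mode(L=25.5, fm=25, f1=17, f2=2, h=5)
print(round(mode_laps, 2))
```

Note that the whole sum (fm - f1) + (fm - f2) sits in the denominator; dividing by (fm - f1) alone would give a different result.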

36
As for Median (grouped data):

Median = L + (h / f) x (n/2 - c)

where:
L is the lower class boundary of the median class
h is the size of the median class, i.e. the difference between
the upper and lower class boundaries of the median class
f is the frequency of the median class
c is the cumulative frequency of the class before the median
class
n/2 is the total number of observations divided by 2
(or the summation of f divided by 2)
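
A minimal Python sketch of this formula, applied to the Heathrow table from Example 1, might look like the following. The function name `grouped_median` is hypothetical, the class limits are used as boundaries, and the median class is located with n/2 as in the formula (the earlier slide located it with (n + 1)/2, which picks the same class here).

```python
def grouped_median(classes):
    """classes: list of (lower boundary, upper boundary, frequency), in order."""
    n = sum(f for _, _, f in classes)
    c = 0  # cumulative frequency of the classes before the median class
    for lower, upper, f in classes:
        if c + f >= n / 2:
            h = upper - lower  # size of the median class
            return lower + (h / f) * (n / 2 - c)
        c += f

heathrow = [(0, 10, 27), (10, 20, 10), (20, 30, 7),
            (30, 40, 5), (40, 50, 4), (50, 60, 2)]
print(grouped_median(heathrow))
```

Here n/2 = 27.5 and c = 27, so the estimate lands half a unit into the 10 - 20 class.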
37
Definition
• Measures of dispersion are descriptive
statistics that describe how similar a set of
scores are to each other
– The more similar the scores are to each other, the
lower the measure of dispersion will be
– The less similar the scores are to each other, the
higher the measure of dispersion will be
– In general, the more spread out a distribution is,
the larger the measure of dispersion will be

38
Measures of Dispersion
• Which of the distributions of scores
has the larger dispersion?

[Figure: two bar charts of scores, one above the other, each with x-axis 1 to 10 and y-axis 0 to 125]

The upper distribution has more dispersion
because the scores are more spread out.
That is, they are less similar to each other.


39
Measurement of Variability
• A certain amount of variability will naturally
occur when a control is tested repeatedly.
• Variability is affected by operator technique,
environmental conditions, and the
performance characteristics of the assay
method.
• The goal is to differentiate variability
due to chance from variability due to
error.
40
Measures of Variability
• There are several terms that describe the
dispersion or variability of the data around
the mean:
• Range
• Quartile Deviation
• Mean Deviation
• Standard Deviation (Variance)
• Coefficient of Variation
• Coefficient of Skewness
• Coefficient of Kurtosis
41
The Range
• The range is defined as the difference
between the largest score in the set of data
and the smallest score in the set of data, XL -
XS
• What is the range of the following data:
4 8 1 6 6 2 9 3 6 9
• The largest score (XL) is 9; the smallest score
(XS) is 1; the range is XL - XS = 9 - 1 = 8

42
When To Use the Range
• The range is used when
– you have ordinal data or
– you are presenting your results to people with
little or no knowledge of statistics
• The range is rarely used in scientific work as it
is fairly insensitive
– It depends on only two scores in the set of data, XL
and XS
– Two very different sets of data can have the same
range:
1 1 1 1 9 vs 1 3 5 7 9
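
The insensitivity of the range can be seen by comparing it with a measure that uses every score. The sketch below uses the population standard deviation (listed among the measures of variability in these slides) from Python's `statistics` module; the variable names are illustrative.

```python
import statistics

a = [1, 1, 1, 1, 9]
b = [1, 3, 5, 7, 9]

range_a = max(a) - min(a)
range_b = max(b) - min(b)

# Population standard deviation, which depends on every score, not just two.
sd_a = statistics.pstdev(a)
sd_b = statistics.pstdev(b)

print(range_a, range_b, round(sd_a, 2), round(sd_b, 2))
```

Both sets have a range of 8, yet their standard deviations differ, which is the point the slide makes about relying on only XL and XS.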
43
The Semi-Interquartile Range
• The semi-interquartile range (or SIR) is defined
as the difference of the first and third
quartiles divided by two
– The first quartile is the 25th percentile
– The third quartile is the 75th percentile
• SIR = (Q3 - Q1) / 2

44
Quartiles
The two values which are a quarter of
the way into the data from either end:

- The one 25% of the way through the data is
called Q1, the first or lower quartile.
- The one 75% of the way through the data is
called Q3, the third or upper quartile.
- The median is the second quartile.
Example
1, 33, 42, 65, 77, 89, 35, 22, 64, 24, 75, 4, 2, 46, 6, 21
In order: 1, 2, 4, 6, 21, 22, 24, 33, 35, 42, 46, 64, 65, 75, 77, 89

There are 16 numbers, so the median is item
no. 17/2 = no. 8.5 = the average of no. 8 and no. 9: (33 + 35) / 2 = 34.
The first quartile, Q1, is item no. n/4 = no. 4 of the ordered data = 6.
And Q3 = ?
Should the position be n/4 or (n+1)/4?
Calculating exactly

X           f     CF
0 < 20      15    15
20 < 40     60    75
40 < 100    25    100

First Quartile
n/4 = 25th item
This is in the group 20 < 40
Lower limit (l) is 20
Width of group (i) is 20
Frequency of group (f) is 60
CF of previous group (F) is 15

Formula is:
$Q_1 = l_{Q_1} + i \times \frac{n/4 - F}{f}$
$Q_1 = 20 + 20 \times \frac{25 - 15}{60} = 20 + 20 \times \frac{10}{60} = 20 + 3.333 = 23.333$

This means that 25% of the data is below 23.333

Third Quartile
3n/4 = 75th item
This is in the group 20 < 40
Lower limit (l) is 20
Width of group (i) is 20
Frequency of group (f) is 60
CF of previous group (F) is 15

Formula is:
$Q_3 = l_{Q_3} + i \times \frac{3n/4 - F}{f}$
$Q_3 = 20 + 20 \times \frac{75 - 15}{60} = 20 + 20 \times \frac{60}{60} = 20 + 20 = 40$

So 25% of the data is above this point.
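
The grouped quartile calculation can be sketched as one Python function that handles both Q1 and Q3. The name `grouped_quantile` and the tuple layout are hypothetical; the position is taken as a fraction of n (n/4 and 3n/4), matching the worked example rather than any (n + 1) convention.

```python
def grouped_quantile(groups, fraction):
    """groups: list of (lower limit, upper limit, frequency), ascending.
    fraction: 0.25 for Q1, 0.75 for Q3."""
    n = sum(f for _, _, f in groups)
    target = fraction * n  # item position, e.g. n/4 for Q1
    F = 0                  # cumulative frequency of the previous groups
    for lower, upper, f in groups:
        if F + f >= target:
            i = upper - lower  # width of the group
            return lower + i * (target - F) / f
        F += f

groups = [(0, 20, 15), (20, 40, 60), (40, 100, 25)]
q1 = grouped_quantile(groups, 0.25)
q3 = grouped_quantile(groups, 0.75)
print(round(q1, 3), q3)
```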


SIR Example
• What is the SIR for the data below?
2, 4, 6, 8, 10, 12, 14, 20, 30, 60
• 25 % of the scores are below 5
– 5 is the first quartile (25th %tile)
• 25 % of the scores are above 25
– 25 is the third quartile (75th %tile)
• SIR = (Q3 - Q1) / 2 = (25 - 5) / 2 = 10
49
When To Use the SIR
• The SIR is often used with skewed data as it is
insensitive to the extreme scores

50
Variance
• Variance is defined as the average of the
squared deviations:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

51
Standard Deviation

• When the deviation scores are squared in
variance, their unit of measure is squared as well
– E.g. If people’s weights are measured in pounds,
then the variance of the weights would be expressed
in pounds² (or squared pounds)
• Since squared units of measure are often
awkward to deal with, the square root of
variance is often used instead
– The standard deviation is the square root of variance
52
Standard Deviation
• Standard deviation = $\sqrt{\text{variance}}$
• Variance = $(\text{standard deviation})^2$

53
Computational Formula
• When calculating variance, it is often easier to use
a computational formula which is algebraically
equivalent to the definitional formula:
$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} = \frac{\sum (X - \mu)^2}{N}$
$\sigma^2$ is the population variance, $X$ is a score, $\mu$ is the
population mean, and $N$ is the number of scores
54
Computational Formula Example

X         X²         X - μ     (X - μ)²
9         81         2         4
8         64         1         1
6         36         -1        1
5         25         -2        4
8         64         1         1
6         36         -1        1
∑ = 42    ∑ = 306    ∑ = 0     ∑ = 12
55
Computational Formula Example
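$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} = \frac{306 - \frac{42^2}{6}}{6} = \frac{306 - 294}{6} = \frac{12}{6} = 2$

$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{12}{6} = 2$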
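
The agreement between the definitional and computational formulas can be checked with a few lines of Python on the six scores from the table. This is a sketch with illustrative variable names.

```python
scores = [9, 8, 6, 5, 8, 6]
N = len(scores)
mu = sum(scores) / N  # population mean

# Definitional formula: average of the squared deviations from the mean.
var_definitional = sum((x - mu) ** 2 for x in scores) / N

# Computational formula: needs only the sum of X and the sum of X squared.
sum_x = sum(scores)
sum_x2 = sum(x ** 2 for x in scores)
var_computational = (sum_x2 - sum_x ** 2 / N) / N

sd = var_definitional ** 0.5  # standard deviation = square root of variance

print(var_definitional, var_computational, round(sd, 3))
```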
56
Calculation of the standard
deviation of grouped data
Ages       f    Mid-pt x    x – mean    (x – mean)²    (x – mean)² f
30 - 34    4    32          –9          81             324
35 - 39    5    37          –4          16             80
40 - 44    2    42          1           1              2
45 - 49    9    47          6           36             324

∑f = 20

Mean = 820 / 20 = 41
∑(x – mean)² f = 730


Calculation of the standard
deviation of grouped data
∑f = n = 20
∑(x – mean)² f = 730

$s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}} = \sqrt{\frac{730}{20 - 1}} = \sqrt{38.42} = 6.20$
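
The grouped standard deviation can be reproduced with the following Python sketch. It uses the class midpoints and the n - 1 divisor shown in the calculation; the variable names are illustrative.

```python
import math

# (class midpoint, frequency) for the age groups 30-34, 35-39, 40-44, 45-49.
groups = [(32, 4), (37, 5), (42, 2), (47, 9)]

n = sum(f for _, f in groups)
mean = sum(x * f for x, f in groups) / n
sum_sq = sum((x - mean) ** 2 * f for x, f in groups)  # sum of (x - mean)^2 * f
s = math.sqrt(sum_sq / (n - 1))                       # sample standard deviation

print(n, mean, sum_sq, round(s, 2))
```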
Measure of Skew
• Skew is a measure of symmetry in the
distribution of scores
Positive Skew Negative Skew

59
Measure of Skew
• The following formula can be used to
determine skew:
$s_3 = \frac{\frac{\sum (X - \bar{X})^3}{N}}{\left(\sqrt{\frac{\sum (X - \bar{X})^2}{N}}\right)^3}$

60
Measure of Skew

• If s3 < 0, then the distribution has a negative
skew
• If s3 > 0 then the distribution has a positive
skew
• If s3 = 0 then the distribution is symmetrical
• The more different s3 is from 0, the greater
the skew in the distribution
61
Kurtosis
• Kurtosis measures whether the scores are
spread out more or less than they would be
in a normal (Gaussian) distribution

Leptokurtic (s4 > 3) Platykurtic (s4 < 3)

62
Kurtosis
• When the distribution is normally distributed,
its kurtosis equals 3 and it is said to be
mesokurtic
• When the distribution is less spread out than
normal, its kurtosis is greater than 3 and it is
said to be leptokurtic
• When the distribution is more spread out than
normal, its kurtosis is less than 3 and it is said
to be platykurtic
63
Measure of Kurtosis

• The measure of kurtosis is given by:

$s_4 = \frac{\frac{\sum (X - \bar{X})^4}{N}}{\left(\frac{\sum (X - \bar{X})^2}{N}\right)^2}$
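
Skew and kurtosis, read as the third and fourth central moments divided by the matching power of the variance, can be sketched in Python as below. The function names are hypothetical, the moments divide by N (population form), and the list `right_tailed` is made-up data used only to show a positive skew.

```python
def central_moment(values, k):
    """Average of the k-th powers of the deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skew(values):
    # Third central moment divided by the standard deviation cubed.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    # Fourth central moment divided by the variance squared.
    return central_moment(values, 4) / central_moment(values, 2) ** 2

scores = [9, 8, 6, 5, 8, 6]      # data from the variance example
right_tailed = [1, 1, 1, 1, 9]   # hypothetical data with one high score

print(skew(scores), kurtosis(scores), skew(right_tailed) > 0)
```

For the six scores the skew is 0 (symmetrical) and the kurtosis is below 3, which the slides would label platykurtic.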
64
Coefficient of Variation
• The Coefficient of Variation (CV) is the Standard
Deviation (SD) expressed as a percentage of
the mean
– Also known as Relative Standard Deviation
(RSD)
• CV % = (SD ÷ mean) x 100
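
A short Python sketch of the CV formula follows. The function name is hypothetical, the population standard deviation is assumed (the slides do not say which form to use), and the data are the six scores from the variance example.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (population SD assumed)."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

scores = [9, 8, 6, 5, 8, 6]  # data reused from the variance example
cv = coefficient_of_variation(scores)
print(round(cv, 1))
```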

65
Measures of Central Tendency &
Dispersion
• Central Tendency
– Mean
– Median
– Mode
• Dispersion
– Range
– Semi Inter-Quartile Range
– Variance/ Standard Deviation
[Figure: two distributions labelled "Smaller variation" and "Larger variation"]
66
