
Descriptive Statistics 1

This document provides an overview of key concepts in descriptive statistics including measures of central tendency (mean, median, mode), measures of variability (range, interquartile range, variance, standard deviation, coefficient of variation), and z-scores. It defines each measure and explains how to calculate and interpret them. The document is intended to introduce students to fundamental statistical techniques for describing and summarizing datasets.

Uploaded by

Dr Engineer
Copyright
© © All Rights Reserved

Descriptive Statistics

Business Statistics
Measures of Central Tendency

Central tendency refers to the "middle" value of data.
Three types of measure:
Mean
Median
Mode
Arithmetic Mean (AM)

The arithmetic mean (or just the mean) is the most common measure of central tendency.
The mean of a set of observations is their
average.
Arithmetic mean is equal to the sum of all
observations divided by the number of
observations in the set.
Arithmetic Mean (Contd.)
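For a sample of size n, the sample mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
For a population of size N, the population mean:
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]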
Arithmetic Mean (Contd.)
Applicable for interval and ratio data.
Not applicable for nominal or ordinal data.
Affected by each value in the data set,
including extreme values (also known as
outliers).
[Figure: two dot plots on an axis from 0 to 10]
Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
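The effect of a single extreme value on the mean can be checked directly. The sketch below uses the two five-value data sets from this slide; the function name `arithmetic_mean` is only an illustrative choice.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]  # same data, largest value replaced by 10

print(arithmetic_mean(without_outlier))  # 3.0
print(arithmetic_mean(with_outlier))     # 4.0
```

Changing one observation from 5 to 10 moves the mean from 3 to 4, which is the sensitivity to outliers described above.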
Weighted Arithmetic Mean
Considers the importance of each value.
Weighted mean,
\[ \bar{x}_w = \frac{\sum_{i=1}^{n} (w_i x_i)}{\sum_{i=1}^{n} w_i} \]
where, w_i = weight assigned to each observation
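A minimal sketch of the weighted mean follows. The scores and weights are hypothetical values chosen for illustration; they do not come from the slides.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


scores = [80, 90, 70]   # hypothetical observations
weights = [2, 3, 5]     # hypothetical importance of each observation

print(weighted_mean(scores, weights))  # (160 + 270 + 350) / 10 = 78.0
```

With equal weights the result reduces to the ordinary arithmetic mean.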


Median

Median is an observation (or a point between two observations) in the center of the dataset.
50% of data lie above the median and 50% of data lie below it.
Median position = (n + 1)/2 position in the ordered data
Note that (n + 1)/2 is not the value of the median, only the position of the median in the ordered data.
Median

Procedure for finding the Median:


Arrange the observations in an ordered array.
If there is an odd number of terms, the median is
the middle term of the ordered array.
If there is an even number of terms, the median
is the average of the middle two terms.
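The procedure above can be sketched in a few lines. The second data set is a hypothetical even-sized example added for illustration.

```python
def median(values):
    """Middle term of the ordered array, or the average of the middle two."""
    ordered = sorted(values)          # step 1: arrange in an ordered array
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd number of terms: the middle term
        return ordered[middle]
    # even number of terms: average of the middle two terms
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([1, 2, 3, 4, 10]))  # 3
print(median([7, 1, 5, 3]))      # ordered: 1, 3, 5, 7, so (3 + 5) / 2 = 4.0
```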
Median

Applicable for ordinal, interval, and ratio data.


Not applicable for nominal data.
Unaffected by extremely large and extremely
small values.
[Figure: two dot plots on an axis from 0 to 10; Median = 3 in both]
Mode
The most frequently occurring value in a data set.
Applicable to all levels of data measurement (nominal,
ordinal, interval, and ratio).
Not affected by extreme values.
There may be no mode, single mode (uni-modal), two
modes (bi-modal) or several modes (multi-modal).

[Figure: dot plot on an axis from 0 to 14; Mode = 9]
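The no-mode, single-mode, and multi-mode cases can be told apart by counting occurrences. This is a sketch under one assumption that the slides do not state: a data set in which no value repeats is treated as having no mode. The sample lists are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([3, 5, 8, 11]))     # []      (no mode)
print(modes([2, 9, 9, 5]))      # [9]     (uni-modal)
print(modes([1, 1, 2, 2, 3]))   # [1, 2]  (bi-modal)
```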
Quartiles
Quartiles split the ranked data into four segments with
an equal number of values per segment.

25% 25% 25% 25%

Q1 Q2 Q3
The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger.
Q2 is the same as the median (50% are smaller, 50% are
larger).
Only 25% of the observations are greater than the third
quartile, Q3.
Quartiles
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position:
Q1 = (n+1)/4
Second quartile position:
Q2 = (n+1)/2 (the median position)
Third quartile position:
Q3 = 3(n+1)/4
where n is the number of observed values
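The position formulas can be turned into values as follows. The slides do not say what to do when a position is fractional, so this sketch assumes linear interpolation between the two neighbouring observations; the nine data values are hypothetical.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position; fractional positions are interpolated."""
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    # assumed rule: move the fractional distance towards the next value
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def quartiles(values):
    """Return (Q1, Q2, Q3) using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 observations")
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # hypothetical, n = 9
print(quartiles(data))  # positions 2.5, 5, 7.5 give (12.5, 16, 19.5)
```

Statistical packages use several slightly different quartile conventions, so results from other tools may differ a little from this rule.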
Percentiles

Measures of central tendency that divide a group of data into 100 parts.
At least n% of the data lie below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile.
Example: the 90th percentile indicates that at least 90% of the data lie below it, and at most 10% of the data lie above it.
Percentiles
Organize the data into an ascending ordered
array.
Calculate the percentile location:
\[ i = \frac{P}{100}(n) \]
where P is the percentile of interest and n is the number of observations.
Determine the percentile's location and its value.
If i is a whole number, the percentile is the average of the values at the i and (i+1) positions.
If i is not a whole number, the percentile is at the position given by the whole-number part of i, plus 1, in the ordered array.
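The location rule above can be sketched as follows. The eight data values are hypothetical, and the function assumes an integer percentile P strictly between 0 and 100 so that the whole-number test can be done exactly in integer arithmetic.

```python
def percentile(values, p):
    """Value of the p-th percentile (integer p, 0 < p < 100) by the location rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)              # ascending ordered array
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)  # i = p * n / 100, kept in integers
    if remainder == 0:
        # i is a whole number: average the values at positions i and i + 1
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: take the value at position (whole part of i) + 1
    return ordered[whole]


data = [14, 12, 19, 23, 5, 13, 28, 17]   # hypothetical, n = 8
print(percentile(data, 30))  # i = 2.4, so the 3rd ordered value: 13
print(percentile(data, 50))  # i = 4, so (14 + 17) / 2 = 15.5
```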
Percentiles
Applicable for ordinal, interval, and ratio data
Not applicable for nominal data

Note that:
25th percentile = Q1 (first quartile)
50th percentile = median = Q2 (Second quartile)
75th percentile = Q3 (third quartile)
Measures of Variability
Measures of variability describe the spread or
the dispersion of a set of data.

Common Measures of Variability


Range
Inter-quartile Range
Variance
Standard Deviation
Coefficient of Variation
Measures of Variability

Same center,
different variation
Range
Simplest measure of variation.
Difference between the largest and the
smallest values in a set of data:
\[ \text{Range} = x_{\text{largest}} - x_{\text{smallest}} \]

Example:
[Figure: dot plot on an axis from 0 to 14]
Range = 14 - 1 = 13
Range
Disadvantages:
Ignores the way in which data are distributed
[Figure: two dot plots on an axis from 7 to 12 with differently distributed data; Range = 12 - 7 = 5 in both]

Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
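The outlier sensitivity shown above is easy to reproduce. This sketch rebuilds the two 25-value data sets from the slide; the name `data_range` is an illustrative choice.

```python
def data_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = base[:-1] + [120]   # same data, last value replaced by 120

print(data_range(base))          # 5 - 1 = 4
print(data_range(with_outlier))  # 120 - 1 = 119
```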


Interquartile Range
Can eliminate some outlier problems by using the interquartile range
Eliminate some high- and low-valued observations and calculate the range from the remaining values
Interquartile range = 3rd quartile - 1st quartile = Q3 - Q1
Interquartile Range

Example:
x_minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, x_maximum = 70
(25% of the observations lie in each of the four segments)

Interquartile range = 57 - 30 = 27
Variance
The variance of a set of observations is the average squared deviation of the data points from their mean.
Sample variance:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Population variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
Standard Deviation
The standard deviation of a set of
observations is the (positive) square root of
the variance of the set.
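Sample standard deviation:
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Population standard deviation:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]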
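The sample and population formulas differ only in the divisor (n - 1 versus N). A minimal sketch, using the hypothetical data 1, 2, 3, 4, 5:

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [1, 2, 3, 4, 5]   # mean 3, squared deviations 4 + 1 + 0 + 1 + 4 = 10
print(sample_variance(data))                   # 10 / 4 = 2.5
print(population_variance(data))               # 10 / 5 = 2.0
print(math.sqrt(sample_variance(data)))        # sample standard deviation
print(math.sqrt(population_variance(data)))    # population standard deviation
```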
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Can be used to compare two or more sets of
data measured in different units
\[ CV = \frac{s}{\bar{x}} \cdot 100\% \]
Comparing Coefficient of Variation
Stock A:
Average price last year = $50
Standard deviation = $7
\[ CV_A = \frac{S}{\bar{X}} \cdot 100\% = \frac{\$7}{\$50} \cdot 100\% = 14\% \]
Stock B:
Average price last year = $100
Standard deviation = $10
\[ CV_B = \frac{S}{\bar{X}} \cdot 100\% = \frac{\$10}{\$100} \cdot 100\% = 10\% \]
Stock B is more variable than stock A, but stock B is less variable relative to its price.
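The comparison of the two stocks can be reproduced with a short sketch; the function name is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(7, 50)     # Stock A
cv_b = coefficient_of_variation(10, 100)   # Stock B

print(cv_a)  # 14.0
print(cv_b)  # 10.0
print("Stock B is less variable relative to its price:", cv_b < cv_a)
```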
Z Scores

A measure of distance from the mean (for example, a Z-score of 2.0 means that a value is 2.0 standard deviations from the mean)
The difference between a value and the mean, divided by the standard deviation
\[ Z = \frac{X - \bar{X}}{S} \]
Z Scores
(continued)

Example:
If the mean is 14.0 and the standard deviation is 3.0, what is the Z score for the value 18.5?
\[ Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5 \]
The value 18.5 is 1.5 standard deviations above the mean
(A negative Z-score would mean that a value is less than the mean)
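The worked example can be checked with a one-line function. The second call uses a hypothetical value below the mean to show a negative Z score.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


print(z_score(18.5, 14.0, 3.0))  # 1.5, above the mean
print(z_score(11.0, 14.0, 3.0))  # -1.0, below the mean (hypothetical value)
```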
Problem
Firm A is chosen from an industry (group of
firms that produce the same, or similar,
products) where the mean rate of return of
firms is 10%, the standard deviation being 5%.
Firm B is chosen from another industry where
the mean rate of return of firms is 12%, the
standard deviation being 6%. If Firm A's rate of
return is 16% and Firm B's rate of return is
18%, which of the two is more profitable
compared to its industry?
Chebyshev Rule

Regardless of how the data are distributed, at least (1 - 1/k^2) x 100% of the values will fall within k standard deviations of the mean (for k > 1)
Examples:
At least (1 - 1/2^2) x 100% = 75% of the values lie within k = 2 standard deviations, \(\mu \pm 2\sigma\)
At least (1 - 1/3^2) x 100% = 89% of the values lie within k = 3 standard deviations, \(\mu \pm 3\sigma\)
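The Chebyshev bound is a simple function of k. A minimal sketch, with an illustrative function name:

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the rule is stated for k > 1")
    return (1 - 1 / k ** 2) * 100


print(chebyshev_lower_bound(2))            # 75.0
print(round(chebyshev_lower_bound(3), 1))  # 88.9, shown as 89% on the slide
```

The bound holds for any distribution, which is why it is weaker than the figures given for bell-shaped data.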
The Empirical Rule

If the data distribution is approximately bell-shaped, then the interval:
\(\mu \pm 1\sigma\) contains about 68% of the values in the population.
The Empirical Rule
\(\mu \pm 2\sigma\) contains about 95% of the values in the population or the sample
\(\mu \pm 3\sigma\) contains about 99.7% of the values in the population or the sample
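The 68%, 95%, and 99.7% figures are rounded values for an exactly normal distribution. As a quick check, this sketch computes them with `statistics.NormalDist` from the Python standard library (available from Python 3.8); real data that are only approximately bell-shaped will match only approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))  # roughly 68.27, 95.45, 99.73
```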
Problem
A cold drink bottling plant fills bottles of 500 ml
capacity with mean of 500 ml and standard deviation
of 5 ml. At least what percentage of bottles would
contain cold drink between 490 and 510 ml?

Suppose the time between applying for a credit card and getting the credit card is approximately bell-shaped and its average has been estimated to be 8 days with a standard deviation of about 2 days. Approximately what fraction of people get the credit card within 4 days of applying?
Measures of Shape

Skewness
Absence of symmetry
Extreme values in one side of a distribution
Kurtosis
Peakedness of a distribution
Leptokurtic: high and thin
Mesokurtic: normal shape
Platykurtic: flat and spread out
Box and Whisker Plots
Graphic display of a distribution
Reveals skewness and outliers
Skewness

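[Figure: three distribution curves with the positions of the mean, median, and mode marked]
Negatively skewed (left-skewed): Mean < Median < Mode
Symmetric (not skewed): Mean = Median = Mode
Positively skewed (right-skewed): Mode < Median < Mean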
Problem
The mean of some price quote data is 5.5056 and
the median is 3.92. From this information, what
can you deduce about the symmetry or skewness
of the distribution?

Suppose a frequency distribution is skewed with a median of $75.00 and a mode of $80.00. Which of the following is a possible value for the mean of the distribution?
(a) $64.00 (b) $78.00 (c) $90.00
Kurtosis
Peakedness of a distribution
Leptokurtic: high and thin
Mesokurtic: normal in shape
Platykurtic: flat and spread out
[Figure: three curves labelled Leptokurtic, Mesokurtic, and Platykurtic]
Box and Whisker Plot
Five summary measures are used:
Median, Q2
First quartile, Q1
Third quartile, Q3
Minimum value in the data set
Maximum value in the data set
The Box
Median (Vertical line across the box)
First Quartile
Third Quartile
The Whisker
Lower inner fence = smallest observation within Q1 - 1.5 IQR
Upper inner fence = largest observation within Q3 + 1.5 IQR
Outer Fences
Lower outer fence = Q1 - 3.0 IQR
Upper outer fence = Q3 + 3.0 IQR
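The fence limits follow directly from Q1, Q3, and the IQR. The sketch below uses Q1 = 30 and Q3 = 57 from the interquartile range example. The classification labels are an assumption: the slides mention a "suspected outlier" and an "outlier" but do not define them, so this sketch treats a point between the inner and outer limits as suspected and a point beyond the outer limits as an outlier.

```python
def fence_limits(q1, q3):
    """Inner (1.5 IQR) and outer (3.0 IQR) fence limits around the box."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3.0 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }


def classify(value, limits):
    """Assumed labelling of a point relative to the fence limits."""
    if value < limits["lower_outer"] or value > limits["upper_outer"]:
        return "outlier"
    if value < limits["lower_inner"] or value > limits["upper_inner"]:
        return "suspected outlier"
    return "within inner fences"


limits = fence_limits(30, 57)   # IQR = 27
print(limits)                   # -51.0, -10.5, 97.5, 138.0
print(classify(70, limits))     # within inner fences
print(classify(110, limits))    # suspected outlier (hypothetical value)
print(classify(150, limits))    # outlier (hypothetical value)
```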
Box and Whisker Plot

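[Figure: box and whisker plot. The box runs from Q1 to Q3 (width IQR) with Q2 marked inside it. The whiskers extend to the smallest observation within Q1 - 1.5 IQR and the largest observation within Q3 + 1.5 IQR. The left and right inner fences and outer fences are marked, together with one suspected outlier and one outlier.]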
Example of Raw Data

Sample of daily production in Yards of 30 carpet looms


16.2 15.8 15.8 15.8 16.3 15.6
15.7 16.0 16.2 16.1 16.8 16.0
16.4 15.2 15.9 15.9 15.9 16.8
15.4 15.7 15.9 16.0 16.3 16.0
16.4 16.6 15.6 15.6 16.9 16.3
Organizing Data

Data array: A sequence of data in ascending or descending order.
Frequency distribution: grouping data into some defined classes.
Cumulative distribution: how many observations lie above or below a certain value?
Presenting Data in Array

Data array of daily production in Yards of 30 carpet looms


15.2 15.7 15.9 16.0 16.2 16.4
15.4 15.7 15.9 16.0 16.3 16.6
15.6 15.8 15.9 16.0 16.3 16.8
15.6 15.8 15.9 16.1 16.3 16.8
15.6 15.8 16.0 16.2 16.4 16.9
Frequency Distribution
A frequency distribution is a list or a table containing class groupings (ranges within which the data fall) and the corresponding frequencies with which data fall within each grouping or category.
A relative frequency distribution presents
frequencies in terms of fractions or percentages.
The classes in the frequency distribution are all-
inclusive and mutually exclusive.
Frequency Distribution Example

Data array of average inventory (in days) for 20 convenience stores
2.0 3.8 4.1 4.7 5.5
3.4 4.0 4.2 4.8 5.5
3.4 4.1 4.3 4.9 5.5
3.8 4.1 4.7 4.9 5.5
Frequency Distribution Example
Frequency distribution of average inventory (in
days) for 20 convenience stores (6 classes)
Class Frequency
2.0 - 2.5 1
2.6 - 3.1 0
3.2 - 3.7 2
3.8 - 4.3 8
4.4 - 4.9 5
5.0 - 5.5 4
Frequency Distribution
Each class grouping has the same width
Determine the width of each interval by
\[ \text{Width of interval} = \frac{x^{*}_{\max} - x_{\min}}{k} \]
where,
\(x^{*}_{\max}\) = Next unit value after largest value in data
\(x_{\min}\) = Smallest value in data
k = Total number of class intervals
Usually at least 5 but no more than 15 groupings
Class boundaries never overlap
Frequency Distribution
A Step-by-step Example:

Sample of daily production in Yards of 30 carpet looms


16.2 15.8 15.8 15.8 16.3 15.6
15.7 16.0 16.2 16.1 16.8 16.0
16.4 15.2 15.9 15.9 15.9 16.8
15.4 15.7 15.9 16.0 16.3 16.0
16.4 16.6 15.6 15.6 16.9 16.3
Frequency Distribution
Step 1: Select number of classes.
no. of classes = 6
Step 2: Determine width of a class interval.
width of a class interval = (17.0 - 15.2)/6 = 0.3
Step 3: Generate class boundaries.
class boundaries: 15.2, 15.5, 15.8, 16.1, 16.4, 16.7, 17.0
Step 4: Count observations and assign to classes
Frequency Distribution

Class Frequency
15.2 - 15.4 2
15.5 - 15.7 5
15.8 - 16.0 11
16.1 - 16.3 6
16.4 - 16.6 3
16.7 - 16.9 3
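Steps 1 to 4 can be reproduced in code for the 30 carpet-loom observations. This is a sketch rather than a general binning routine: it assumes every value lies inside the class range and converts values to tenths of a yard so that class membership is decided with integer arithmetic instead of floating-point comparisons.

```python
production = [
    16.2, 15.8, 15.8, 15.8, 16.3, 15.6,
    15.7, 16.0, 16.2, 16.1, 16.8, 16.0,
    16.4, 15.2, 15.9, 15.9, 15.9, 16.8,
    15.4, 15.7, 15.9, 16.0, 16.3, 16.0,
    16.4, 16.6, 15.6, 15.6, 16.9, 16.3,
]

classes = 6                                   # step 1
width = round((17.0 - 15.2) / classes, 1)     # step 2: 0.3
lowest = 15.2                                 # step 3: first class boundary


def frequency_distribution(values, lowest, width, classes):
    """Count the observations in each class (step 4), working in tenths."""
    counts = [0] * classes
    for value in values:
        offset = round(value * 10) - round(lowest * 10)
        counts[offset // round(width * 10)] += 1
    return counts


frequencies = frequency_distribution(production, lowest, width, classes)
print(width)        # 0.3
print(frequencies)  # [2, 5, 11, 6, 3, 3]
```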
Relative Frequency Distribution

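Class Frequency Relative frequency Percentage
15.2 - 15.4 2 2/30 = 0.07 7
15.5 - 15.7 5 5/30 = 0.17 17
15.8 - 16.0 11 11/30 = 0.37 37
16.1 - 16.3 6 6/30 = 0.20 20
16.4 - 16.6 3 3/30 = 0.10 10
16.7 - 16.9 3 3/30 = 0.10 10
Total 30 1.00 100
Note: the rounded entries add up to 1.01 (101%); the unrounded relative frequencies add up to exactly 1.00.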
The Histogram
A graph of the data in a frequency distribution is
called a histogram
The class boundaries (or class midpoints) are
shown on the horizontal axis
The vertical axis is either frequency, relative
frequency, or percentage
Bars of the appropriate heights are used to
represent the number of observations within
each class
Histogram Example

[Figure: histogram with no gaps between bars. Vertical axis: Frequency (0 to 12). Horizontal axis: Production level in yards, with class boundaries 15.2, 15.5, 15.8, 16.1, 16.4, 16.7, 17.0.]
Frequency Polygon

Used to represent frequency distributions graphically.
Sketches outline of the data more clearly.
The polygon becomes increasingly smooth and curve-like as we increase the number of classes and the number of observations.
Frequency Polygon Example

[Figure: frequency polygon. Vertical axis: Frequency (0 to 12). Horizontal axis: Production level in yards, marked at 15.0, 15.3, 15.6, 15.9, 16.2, 16.5, 16.8, 17.1.]
Cumulative Frequency Distribution
Enables us to see how many observations lie above or below a certain value.
Less-than type and more-than type.
A graph of a cumulative frequency distribution is called an ogive.
The ogive for a less-than type cumulative frequency distribution slopes up and to the right.
Cumulative Frequency Distribution

Class Cumulative frequency Cumulative relative frequency
Less than 15.2 0 0.00
Less than 15.5 2 0.07
Less than 15.8 7 0.23
Less than 16.1 18 0.60
Less than 16.4 24 0.80
Less than 16.7 27 0.90
Less than 17.0 30 1.00
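The less-than cumulative table can be built from the class frequencies with a running sum. A minimal sketch, using the class boundaries and frequencies of the carpet-loom example:

```python
from itertools import accumulate

boundaries = [15.2, 15.5, 15.8, 16.1, 16.4, 16.7, 17.0]
frequencies = [2, 5, 11, 6, 3, 3]

# No observation lies below the first boundary, so the running sum starts at 0.
cumulative = [0] + list(accumulate(frequencies))
total = cumulative[-1]

for boundary, count in zip(boundaries, cumulative):
    print(f"Less than {boundary}: {count:2d}  {count / total:.2f}")
```

A more-than type table would be obtained in the same way by subtracting each cumulative count from the total.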
Ogive Example
[Figure: ogive. Vertical axis: Cumulative Relative Frequency (0 to 1.0). Horizontal axis: Production level in yards (15.2 to 17.0). Annotations on the plot: "Q: How many looms made less than 16.5 yards?" and "Approximate value of the 15th loom = 16.0".]
Pie Chart

Pie chart is a simple descriptive display often used to present frequencies for categorical data.
May be used for nominal or ordinal type data.
The total area of the pie (circular in shape)
represents 100% of the quantity of interest.
The arc length of each sector (and consequently
its central angle and area), is proportional to the
quantity it represents.
Pie Chart (Example)
Example: A job satisfaction survey.
What are your feelings on your current job?
Categories Response (%)
Happy with career 33%
Enjoy job, but it is not on my career path 19%
Job is OK, but it is not on my career path 19%
Do not like my job, but it is on my career path 6%
My job just pays the bill 23%
Pie Chart (Example)
[Figure: pie chart titled "Job satisfaction survey" with one sector per response category: Happy with career (33%), Enjoy job, but it is not on my career path (19%), Job is OK, but it is not on my career path (19%), Do not like my job, but it is on my career path (6%), My job just pays the bill (23%).]
Bar Chart

A bar chart is a chart with rectangular bars with lengths proportional to the values that they represent.
Often used to display categorical data.
May be horizontal or vertical.
Used to display values that were taken over time or under different conditions, usually on small data sets.
Bar Chart Example
Investment type Amount (in thousands $) Percentage (%)
Stocks 46.5 42.27
Bonds 32.0 29.09
CD 15.5 14.09
Savings 16.0 14.55

[Figure: horizontal bar chart titled "Investor's Portfolio" with bars for Stocks, Bonds, CD, and Savings. Horizontal axis: Amount in $1000's (0 to 50).]
