
Fundamentals of Business Statistics
Descriptive Statistics
Dr. P.K.Viswanathan
Data versus Information

When managers are bewildered by a plethora of data that makes no sense on the surface, they look for methods to classify the data so that it conveys meaning. The idea is to help them draw the right conclusions. This session covers the nitty-gritty of arranging data into information.

Raw Data
Meaning of Raw Data:
Raw data represent numbers and facts in the original format in which the data have been collected. You need to convert raw data into information for managerial decision making.

Information is Key
Large and massive raw data tend to bewilder you so much
that the overall patterns are obscured. You cannot see the
wood for the trees. This implies that the raw data must be
processed to give you useful information.

Raw Data → Process → Information

Frequency Distribution
In simple terms, a frequency distribution is a summarized table in which raw data are arranged into classes and frequencies.

A frequency distribution classifies raw data into information. It is the most widely used data reduction technique in descriptive statistics.

When you are looking for a pattern that would help you understand the characteristic you measure in a problem situation, a frequency distribution comes to your rescue.
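
As a rough illustration of arranging raw data into classes and frequencies, the sketch below groups the ten mutual fund returns used in the Range example of this deck. The function name, the starting value of 9 and the class width of 3 are assumptions chosen for the illustration, not values from the slides.

```python
import math

def frequency_distribution(values, start, class_width):
    """Count how many values fall into each class [lower, lower + class_width)."""
    counts = {}
    for value in values:
        # Index of the class that contains this value (0 for the first class).
        index = math.floor((value - start) / class_width)
        counts[index] = counts.get(index, 0) + 1
    table = []
    # Walk every class between the lowest and highest one, including empty ones.
    for index in range(min(counts), max(counts) + 1):
        lower = start + index * class_width
        table.append((lower, lower + class_width, counts.get(index, 0)))
    return table

# Percentage returns of 10 mutual funds (data from the Range example).
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

# start=9 and class_width=3 are arbitrary illustrative choices.
table = frequency_distribution(returns, start=9, class_width=3)
for lower, upper, frequency in table:
    print(f"{lower} to under {upper}: {frequency}")
```

A different class width would give a different table, so the choice of classes is part of the analysis rather than a fixed rule.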

HISTOGRAM

A histogram (also known as a frequency histogram) is a snapshot of the frequency distribution.

A histogram is a graphical representation of the frequency distribution in which the X-axis represents the classes and the Y-axis represents the frequencies as bars.

A histogram depicts the pattern of the distribution emerging from the characteristic being measured.
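
A minimal text-only sketch of the idea follows. The class frequencies are an assumption for illustration: they come from grouping the ten mutual fund returns of the Range example into classes of width 3 starting at 9. The bars are drawn sideways, so the classes run down the page instead of along the X-axis.

```python
def text_histogram(class_frequencies, symbol="#"):
    """Return one text bar per class; bar length equals the class frequency."""
    lines = []
    for label, frequency in class_frequencies:
        lines.append(f"{label:>8} | {symbol * frequency} ({frequency})")
    return lines

# Illustrative classes and frequencies (assumed, see the note above).
class_frequencies = [("9-12", 5), ("12-15", 4), ("15-18", 0), ("18-21", 1)]

histogram_lines = text_histogram(class_frequencies)
for line in histogram_lines:
    print(line)
```

For a real chart, a plotting library would normally be used instead; the text version only shows how frequencies map to bar lengths.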

Role of Histogram in Practice

What is Central Tendency?
Whenever you measure things of the same kind, a fairly
large number of such measurements will tend to cluster
around the middle value. Such a value is called a measure
of "Central Tendency". The other terms that are used
synonymously are "Measures of Location", or "Statistical
Averages".

Measures of Central Tendency
As a manager, you need summary measures of central tendency to draw meaningful conclusions in your functional area of operation. The most widely used measures of central tendency are the arithmetic mean, median, and mode.

Arithmetic Mean
Arithmetic Mean (called mean) is defined as the sum of all
observations in a data set divided by the total number of
observations. For example, consider a data set containing
the following observations:

In symbolic form, the mean is given by

$$\bar{X} = \frac{\sum X}{n}$$

$\bar{X}$ = arithmetic mean

$\sum X$ = sum of all X values in the data set

$n$ = total number of observations (sample size)

Arithmetic Mean - Example
The inner diameters of a particular grade of tire, based on 5 sample measurements, are as follows (figures in millimeters):

565, 570, 572, 568, 585

Applying the formula $\bar{X} = \frac{\sum X}{n}$, we get mean = (565 + 570 + 572 + 568 + 585)/5 = 2860/5 = 572.

Caution: The arithmetic mean is affected by extreme values or fluctuations in sampling. It is not the best average to use when the data set contains extreme values (very high or very low values).
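
The tire example and the caution can be checked with a short sketch. The value 885 is a hypothetical extreme measurement introduced only to show how a single outlier pulls the mean; it is not part of the original data.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

# Inner diameters in millimeters (data from the example above).
diameters_mm = [565, 570, 572, 568, 585]
sample_mean = arithmetic_mean(diameters_mm)

# Hypothetical: the last reading is replaced by an extreme value of 885.
with_extreme_value = [565, 570, 572, 568, 885]
mean_with_extreme_value = arithmetic_mean(with_extreme_value)

print(sample_mean)              # mean of the original five readings
print(mean_with_extreme_value)  # mean after one extreme reading
```

One extreme reading moves the mean well above every other observation, which is why the mean is a poor summary for such data.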

Median
The median is the middle-most observation when you arrange data in ascending order of magnitude. The median is such that 50% of the observations are above it and 50% of the observations are below it.

The median is a very useful measure for ranked data in the context of consumer preferences and ratings. It is not affected by extreme values (greater resistance to outliers).

$$\text{Median} = \left(\frac{n+1}{2}\right)\text{th value of ranked data}$$

n = number of observations in the sample

Median - Example
Marks obtained by 7 students in a Computer Science exam are given below. Compute the median.

45 40 60 80 90 65 55

Arranging the data after ranking (here in descending order) gives

90 80 65 60 55 45 40

Median = (n+1)/2 th value in this set = (7+1)/2 th observation = 4th observation = 60.
Hence, median = 60 for this problem.
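
The position rule can be sketched as follows. The handling of an even number of observations (averaging the two middle values) is the usual convention and is an assumption here, since the slides only show an odd-sized example; the four-value list is likewise made up for illustration.

```python
from statistics import median

def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole number, so pick that observation.
        return ranked[(n + 1) // 2 - 1]
    # Even n (assumed convention): average the two middle observations.
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2

# Marks of 7 students (data from the example above).
marks = [45, 40, 60, 80, 90, 65, 55]
marks_median = median_by_position(marks)

# Hypothetical even-sized data set.
even_median = median_by_position([40, 45, 55, 60])

print(marks_median, median(marks))  # both approaches agree
print(even_median)
```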

Mode
The mode is the value that occurs most often; it has the maximum frequency of occurrence. The mode is also resistant to outliers.

The mode is a very useful measure when, for example, you want to keep in inventory the most popular shirt collar size during the festival season.

Caution: In some real-life problems there will be more than one mode (bimodal or multi-modal data). In these cases the mode cannot be uniquely determined.

Mode - Example

The lives, in hours, of 10 flashlight batteries are as follows. Find the mode.

340 350 340 340 320 340 330 330 340 350

340 occurs five times. Hence, mode = 340.
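
A small sketch of the battery example, using Python's standard library. The second list is a made-up bimodal data set added only to illustrate the caution about non-unique modes.

```python
from collections import Counter
from statistics import multimode

# Battery life in hours (data from the example above).
battery_life_hours = [340, 350, 340, 340, 320, 340, 330, 330, 340, 350]

frequencies = Counter(battery_life_hours)  # value -> number of occurrences
modes = multimode(battery_life_hours)      # list of all most-frequent values

# Hypothetical bimodal data: two values tie for the highest frequency.
bimodal_modes = multimode([1, 1, 2, 2, 3])

print(frequencies[340], modes)
print(bimodal_modes)
```

`multimode` returns a list, so a result with more than one entry signals that the mode is not unique.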

Comparison of Mean, Median, Mode

| Mean | Median | Mode |
|---|---|---|
| Defined as the arithmetic average of all observations in the data set. | Defined as the middle value in the data set arranged in ascending or descending order. | Defined as the most frequently occurring value in the distribution; it has the largest frequency. |
| Requires measurement on all observations. | Does not require measurement on all observations. | Does not require measurement on all observations. |
| Uniquely and comprehensively defined. | Cannot be determined under all conditions. | Not uniquely defined for multi-modal situations. |

Comparison of Mean, Median, Mode Cont.

| Mean | Median | Mode |
|---|---|---|
| Affected by extreme values. | Not affected by extreme values. | Not affected by extreme values. |
| Can be treated algebraically. That is, means of several groups can be combined. | Cannot be treated algebraically. That is, medians of several groups cannot be combined. | Cannot be treated algebraically. That is, modes of several groups cannot be combined. |
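
To illustrate what "means of several groups can be combined" looks like in practice, the sketch below splits the five tire measurements from the Arithmetic Mean example into two groups. The split itself is an arbitrary assumption; the point is that group means and group sizes alone recover the overall mean, while group medians do not recover the overall median.

```python
from statistics import mean, median

def combined_mean(group_means, group_sizes):
    """Overall mean from group means, weighting each by its group size."""
    weighted_total = sum(m * n for m, n in zip(group_means, group_sizes))
    return weighted_total / sum(group_sizes)

# Arbitrary split of the tire data (565, 570, 572, 568, 585) into two groups.
group_a = [565, 570, 572]
group_b = [568, 585]

overall_from_groups = combined_mean(
    [mean(group_a), mean(group_b)],
    [len(group_a), len(group_b)],
)
overall_direct = mean(group_a + group_b)

# Medians for comparison: the group medians do not determine the overall median.
group_medians = (median(group_a), median(group_b))
overall_median = median(group_a + group_b)

print(overall_from_groups, overall_direct)
print(group_medians, overall_median)
```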

Measures of Dispersion
In simple terms, measures of dispersion indicate how large the spread of the distribution is around the central tendency. They answer the question: "What is the magnitude of departure from the average value for different groups having identical averages?"

Range
Range is the simplest of all measures of dispersion. It is calculated as the difference between the maximum and minimum values in the data set.

$$\text{Range} = X_{\text{Maximum}} - X_{\text{Minimum}}$$

Range - Example

The following data represent the percentage return on investment per annum for 10 mutual funds. Calculate the range.

12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

Range = $X_{\text{Maximum}} - X_{\text{Minimum}}$ = 18 - 9 = 9

Caution: If either component of the range, the maximum value or the minimum value, is an extreme value, the range should not be used.
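
A short sketch of the calculation and of the caution. The return of 45 is a hypothetical extreme value added only to show how strongly a single observation can change the range.

```python
def value_range(values):
    """Difference between the largest and smallest observation."""
    return max(values) - min(values)

# Percentage returns of 10 mutual funds (data from the example above).
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]
returns_range = value_range(returns)

# Hypothetical: one additional fund with an extreme return of 45.
range_with_extreme_value = value_range(returns + [45])

print(returns_range)
print(range_with_extreme_value)
```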

Inter-Quartile Range (IQR)
IQR = range computed on the middle 50% of the observations after eliminating the highest and lowest 25% of observations in a data set that is arranged in ascending order. The IQR is less affected by outliers.

IQR = Q3 - Q1

Interquartile Range-Example
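The following data represent the percentage return on investment per annum for 9 mutual funds. Calculate the interquartile range.

Data set: 12, 14, 11, 18, 10.5, 12, 14, 11, 9
Arranging in ascending order, the data set becomes
9, 10.5, 11, 11, 12, 12, 14, 14, 18

The quartile values used here are consistent with the (n+1) position rule. Q1 is the (n+1)/4 = 2.5th value, halfway between the 2nd and 3rd values: (10.5 + 11)/2 = 10.75. Q3 is the 3(n+1)/4 = 7.5th value, halfway between the 7th and 8th values: (14 + 14)/2 = 14.

IQR = Q3 - Q1 = 14 - 10.75 = 3.25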
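
The result can be reproduced with Python's standard library. Quartile conventions differ between tools; the `method="exclusive"` option is chosen here because it matches the Q1 = 10.75 and Q3 = 14 shown in the example, which is an inference from the numbers and not something the slides state.

```python
from statistics import quantiles

# Percentage returns of 9 mutual funds (data from the example above).
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

# n=4 asks for quartiles; "exclusive" uses positions based on (n + 1).
q1, q2, q3 = quantiles(returns, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3)
print(iqr)
```

Other conventions (for example the "inclusive" method) can give slightly different quartiles for the same data, so it is worth stating which rule was used.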

Standard Deviation
Standard deviation forms the cornerstone of inferential statistics.

To define standard deviation, you need to define another term called variance. In simple terms, standard deviation is the square root of variance.

Key Formulas

Important Terms with Notations

Sample Variance:
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

Sample Standard Deviation:
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

Population Variance:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

Population Standard Deviation:
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

Where $\bar{X} = \frac{\sum X}{n}$ (Sample Mean) and $\mu = \frac{\sum X}{N}$ (Population Mean)
n = Number of observations in the sample (Sample Size)
N = Number of observations in the population (Population Size)

Remarks

1. $S^2$ is an unbiased estimator of $\sigma^2$.
2. $\bar{X} = \frac{\sum X}{n}$ is an unbiased estimator of $\mu = \frac{\sum X}{N}$.
3. The divisor n-1 is always used while calculating sample variance to ensure the property of being unbiased.
4. Standard deviation is always the square root of variance.

Example for Standard Deviation
The following data represent the percentage return
on investment for 10 mutual funds per annum.
Calculate the sample standard deviation.

12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

Solution for the Example

Solution for the Example Cont.
From the Microsoft Excel spreadsheet in the previous slide, it is easy to see that:

Mean = $\bar{X} = \frac{\sum X}{n}$ = 12.28 (in column A and row 14, 12.28 is seen).

Sample Variance = $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$ = 6.33 (in column D and row 14, 6.33 is seen).

Sample Standard Deviation = $S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$ = 2.52 (in column D and row 15, 2.52 is seen).
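
The same figures can be reproduced without a spreadsheet. The sketch below applies the sample formulas (divisor n - 1) step by step to the ten returns from the example and cross-checks against Python's `statistics` module; the variable names are illustrative.

```python
import math
from statistics import stdev, variance

# Percentage returns of 10 mutual funds (data from the example).
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

n = len(returns)
sample_mean = sum(returns) / n

# Sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - sample_mean) ** 2 for x in returns]
sample_variance = sum(squared_deviations) / (n - 1)

# Standard deviation is the square root of the variance.
sample_std_dev = math.sqrt(sample_variance)

print(round(sample_mean, 2), round(sample_variance, 2), round(sample_std_dev, 2))

# Library cross-check: variance() and stdev() also use the n - 1 divisor.
print(round(variance(returns), 2), round(stdev(returns), 2))
```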

Coefficient of Variation (Relative Dispersion)
Coefficient of Variation (CV) is defined as the ratio of standard deviation to mean.
In symbolic form:

$$CV = \frac{S}{\bar{X}}$$ for the sample data and $$CV = \frac{\sigma}{\mu}$$ for the population data.

Coefficient of Variation - Example
Consider two salespersons working in the same territory. Their sales performance in selling PCs is given below. Comment on the results.

| | Sales Person 1 | Sales Person 2 |
|---|---|---|
| Mean sales (one-year average) | 50 units | 75 units |
| Standard deviation | 5 units | 25 units |

Interpretation for the Example
The CV is 5/50 = 0.10 or 10% for Sales Person 1 and 25/75 = 0.33 or 33% for Sales Person 2.

The moral of the story is "don't get carried away by absolute numbers". Look at the scatter. Even though Sales Person 2 has achieved a higher average, his performance is not consistent and seems erratic.
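
A brief sketch of the comparison, using the means and standard deviations from the example. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Ratio of standard deviation to mean (relative dispersion)."""
    return std_dev / mean_value

# Figures from the example: (standard deviation, mean) in units sold.
cv_person_1 = coefficient_of_variation(5, 50)
cv_person_2 = coefficient_of_variation(25, 75)

print(f"Sales Person 1: {cv_person_1:.0%}")
print(f"Sales Person 2: {cv_person_2:.0%}")
```

Because the CV has no units, it allows a comparison of consistency between the two salespersons even though their average sales differ.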

The Empirical Rule
• The empirical rule approximates the variation of data in a bell-shaped distribution.
• Approximately 68% of the data in a bell-shaped distribution lies within one standard deviation of the mean, or µ ± 1σ.
• Approximately 95% of the data in a bell-shaped distribution lies within two standard deviations of the mean, or µ ± 2σ.
• Approximately 99.7% of the data in a bell-shaped distribution lies within three standard deviations of the mean, or µ ± 3σ.

[Figure: bell-shaped curves showing 68% of the data within µ ± 1σ, 95% within µ ± 2σ, and 99.7% within µ ± 3σ]
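
The three percentages can be checked numerically under one assumption: that "bell-shaped" is modelled as an exactly normal distribution. The sketch uses the standard normal curve (mean 0, standard deviation 1) from Python's `statistics` module; real data will only follow these figures approximately.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1 (an assumption
# standing in for "bell-shaped").
standard_normal = NormalDist(mu=0, sigma=1)

coverage = {}
for k in (1, 2, 3):
    # Share of the distribution lying between mean - k*sigma and mean + k*sigma.
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```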

The Five Number Summary

The five numbers that help describe the center, spread and shape of data are:
• Xsmallest
• First Quartile (Q1)
• Median (Q2)
• Third Quartile (Q3)
• Xlargest

Five Number Summary and The Boxplot
• The Boxplot: A graphical display of the data based on the five-number summary:
Xsmallest -- Q1 -- Median -- Q3 -- Xlargest

Example:

[Figure: boxplot with 25% of the data in each of the four segments between Xsmallest, Q1, Median, Q3 and Xlargest]

Five Number Summary: Shape of Boxplots
• If data are symmetric around the median, then the box and central line are centered between the endpoints.

[Figure: symmetric boxplot marked Xsmallest, Q1, Median, Q3, Xlargest]

• A boxplot can be shown in either a vertical or horizontal orientation.

Distribution Shape and The Boxplot

[Figure: three boxplots, each marked Q1, Q2, Q3, for left-skewed, symmetric and right-skewed distributions]

Boxplot Example
Below is a boxplot for the following data:

0 2 2 2 3 3 4 5 5 9 27

Xsmallest = 0, Q1 = 2, Q2 = 3, Q3 = 5, Xlargest = 27

[Figure: boxplot with markers at 0, 2, 3, 5 and 27]
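
The five-number summary for this data set can be sketched as below. The quartile rule (`method="exclusive"`, based on n + 1 positions) is an assumption chosen because it reproduces the Q1 = 2, Q2 = 3 and Q3 = 5 marked on the boxplot; other quartile conventions may differ slightly.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) for the data."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return (min(values), q1, q2, q3, max(values))

# Data from the boxplot example above.
data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 9, 27]
summary = five_number_summary(data)

print(summary)
```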

Box plot example showing an outlier
• The boxplot below of the same data shows the
outlier value of 27 plotted separately
• A value is considered an outlier if it is more than 1.5
times the interquartile range below Q1 or above Q3

[Figure: Example Boxplot Showing An Outlier; horizontal axis "Sample Data" from 0 to 30]
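
The 1.5 × IQR rule stated above can be sketched for the same data. The quartiles are computed with the (n + 1) position rule (`method="exclusive"`), an assumption that matches Q1 = 2 and Q3 = 5 from the boxplot example; the term "fence" for the cut-off values is common usage and does not appear in the slides.

```python
from statistics import quantiles

# Data from the boxplot example.
data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 9, 27]

q1, _, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Cut-off values: 1.5 times the IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence)
print(outliers)
```

Note that 9 stays inside the upper cut-off, so only 27 is plotted separately.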

Descriptive Statistics- CardioGood Fitness
The market research team at AdRight is assigned the task to identify
the profile of the typical customer for each treadmill product offered
by CardioGood Fitness. The market research team decides to
investigate whether there are differences across the product lines
with respect to customer characteristics. The team decides to collect
data on individuals who purchased a treadmill at a CardioGood
Fitness retail store during the prior three months. The data are stored
in the CardioGoodFitness.csv file.

The team identifies the following customer variables to study: product purchased (TM195, TM498, or TM798); gender; age, in years; education, in years; relationship status (single or partnered); annual household income ($); average number of times the customer plans to use the treadmill each week; average number of miles the customer expects to walk/run each week; and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape.

Perform descriptive analytics to create a customer profile for each CardioGood Fitness treadmill product line.
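
One possible starting point for the exercise is sketched below, under clearly stated assumptions. The column names `Product`, `Age` and `Miles` are hypothetical, since the slides do not show the header of CardioGoodFitness.csv, and the three rows are invented placeholders that only exercise the code; they are not data from the file.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median

def profile_by_group(rows, group_column, numeric_columns):
    """Summarize each numeric column separately for every group value."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column in numeric_columns:
            grouped[row[group_column]][column].append(float(row[column]))
    profile = {}
    for group, columns in grouped.items():
        profile[group] = {
            column: {
                "n": len(values),
                "mean": mean(values),
                "median": median(values),
                "min": min(values),
                "max": max(values),
            }
            for column, values in columns.items()
        }
    return profile

# Invented placeholder rows with assumed column names (not the real file).
placeholder_csv = io.StringIO(
    "Product,Age,Miles\n"
    "TM195,20,60\n"
    "TM195,30,80\n"
    "TM798,40,200\n"
)
profile = profile_by_group(csv.DictReader(placeholder_csv), "Product", ["Age", "Miles"])

for product, columns in profile.items():
    print(product, columns)
```

To run it on the actual data, the placeholder could be replaced by opening CardioGoodFitness.csv with `csv.DictReader` and passing the file's real column names. Categorical variables such as gender or relationship status would need frequency counts instead of means.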
