Descriptive Statistics PDF
Statistics
Descriptive Statistics
Dr. P.K.Viswanathan
Data versus Information
Raw Data
Meaning of Raw Data:
Raw data represent numbers and facts in the original
format in which the data have been collected. You need to
convert the raw data into information for managerial
decision making.
Information is Key
Large and massive raw data tend to bewilder you so much
that the overall patterns are obscured. You cannot see the
wood for the trees. This implies that the raw data must be
processed to give you useful information.
Raw Data → Process → Information
Frequency Distribution
In simple terms, a frequency distribution is a summarized
table in which raw data are arranged into classes and
frequencies.
When you are looking for a pattern that would help you
understand the characteristic you measure in a problem
situation, a frequency distribution comes to your rescue.
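As a rough illustration of arranging raw data into classes and frequencies, here is a minimal Python sketch. The nine values are the mutual fund returns that appear later in this deck; the class start (9), class width (3), and number of classes (4) are assumptions chosen only for the example.

```python
def frequency_distribution(values, start, width, n_classes):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)   # position of the class holding v
        if 0 <= index < n_classes:          # values outside all classes are skipped
            counts[index] += 1
    return [((start + i * width, start + (i + 1) * width), c)
            for i, c in enumerate(counts)]

# Raw data: percentage returns of nine mutual funds (from this deck)
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

# Assumed classes: 9 to under 12, 12 to under 15, 15 to under 18, 18 to under 21
table = frequency_distribution(returns, start=9, width=3, n_classes=4)
for (lower, upper), freq in table:
    print(f"{lower:>3} to under {upper:<3} frequency = {freq}")
```

A different class width would give a different table, so the choice of classes is part of the analysis rather than a fixed rule.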
HISTOGRAM
A histogram depicts the pattern of the distribution emerging from the characteristic
being measured.
Role of Histogram in Practice
What is Central Tendency?
Whenever you measure things of the same kind, a fairly
large number of such measurements will tend to cluster
around the middle value. Such a value is called a measure
of "Central Tendency". The other terms that are used
synonymously are "Measures of Location", or "Statistical
Averages".
Measures of Central Tendency
As a manager, you need the summary measures of central
tendency to draw meaningful conclusions in your functional
area of operation. The most widely used measures of central
tendency are Arithmetic Mean, Median, and Mode.
Arithmetic Mean
Arithmetic Mean (called mean) is defined as the sum of all
observations in a data set divided by the total number of
observations. For example, consider a data set containing
the following observations:
$\bar{X}$ = Arithmetic Mean $= \frac{\sum X}{n}$
Arithmetic Mean - Example
The inner diameters of a particular grade of tire, based on 5
sample measurements, are as follows (figures in millimeters):
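The five measurements from the slide are not reproduced in this text, so the sketch below uses hypothetical diameters purely to show the calculation: sum the observations and divide by their count.

```python
def arithmetic_mean(observations):
    """Sum of all observations divided by the number of observations."""
    if not observations:
        raise ValueError("mean of an empty data set is undefined")
    return sum(observations) / len(observations)

# Hypothetical inner diameters in millimeters (NOT the slide's data)
diameters_mm = [565, 570, 572, 568, 575]

mean_mm = arithmetic_mean(diameters_mm)
print(mean_mm)  # 2850 / 5 = 570.0
```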
Median
Median is the middlemost observation when you arrange data in
ascending order of magnitude. Median is such that 50% of the
observations are above the median and 50% of the observations
are below the median.
Median - Example
Marks obtained by 7 students in a Computer Science
exam are given below. Compute the median.
45 40 60 80 90 65 55
Arranged in order of magnitude: 90 80 65 60 55 45 40
The median is the 4th of the 7 ordered values, that is, 60.
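The same calculation can be sketched in Python using the seven marks from the slide. Averaging the two middle values when the count is even is a common convention assumed here; the slide covers only the odd case.

```python
def median(observations):
    """Middlemost value of the sorted data (mean of the two middle values if n is even)."""
    if not observations:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(observations)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

marks = [45, 40, 60, 80, 90, 65, 55]
median_mark = median(marks)
print(median_mark)  # 60
```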
Mode
Mode is that value which occurs most often. It has the
maximum frequency of occurrence. Mode also has
resistance to outliers.
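A minimal sketch of finding the mode, using hypothetical data that includes one extreme value to show that the mode is unaffected by it. Returning every value that ties for the highest frequency is a design choice of this sketch.

```python
from collections import Counter

def modes(observations):
    """Return all values that share the maximum frequency, in sorted order."""
    if not observations:
        raise ValueError("mode of an empty data set is undefined")
    counts = Counter(observations)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

# Hypothetical data; 30 is a deliberate outlier
sizes = [7, 8, 8, 9, 8, 10, 9, 8, 30]
print(modes(sizes))  # [8]
```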
Mode - Example
Comparison of Mean, Median, Mode
Mean: Defined as the arithmetic average of all observations in the data set.
Median: Defined as the middle value in the data set arranged in ascending or descending order.
Mode: Defined as the most frequently occurring value in the distribution; it has the largest frequency.
Comparison of Mean, Median, Mode (Cont.)
Measures of Dispersion
In simple terms, measures of dispersion indicate how large
the spread of the distribution is around the central
tendency. They answer the question "What is
the magnitude of departure from the average value for
different groups having identical averages?".
Range
Range is the simplest of all measures of dispersion. It is
calculated as the difference between maximum and minimum
value in the data set.
Range = $X_{\text{Maximum}} - X_{\text{Minimum}}$
Range - Example
Example for Computing Range
Range = $X_{\text{Maximum}} - X_{\text{Minimum}}$ = 18 - 9 = 9
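The data set behind this example is not reproduced in the text. As an assumption, the sketch below uses the nine mutual fund returns from the interquartile range example in this deck, which have the same maximum (18) and minimum (9).

```python
def data_range(observations):
    """Difference between the maximum and minimum values."""
    if not observations:
        raise ValueError("range of an empty data set is undefined")
    return max(observations) - min(observations)

# Assumed data: the nine mutual fund returns used elsewhere in this deck
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

spread = data_range(returns)
print(spread)  # 18 - 9 = 9
```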
Inter-Quartile Range (IQR)
IQR = Range computed on the middle 50% of the observations
after eliminating the highest and lowest 25% of observations in
a data set that is arranged in ascending order. IQR is less
affected by outliers.
IQR = Q3 - Q1
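Interquartile Range - Example
The following data represent the percentage return on
investment for 9 mutual funds per annum. Calculate the
interquartile range.
Data Set: 12, 14, 11, 18, 10.5, 12, 14, 11, 9
Arranging in ascending order, the data set becomes
9, 10.5, 11, 11, 12, 12, 14, 14, 18
With n = 9, Q1 lies at position (n+1)/4 = 2.5, midway between 10.5 and 11, so Q1 = 10.75.
Q3 lies at position 3(n+1)/4 = 7.5, between 14 and 14, so Q3 = 14.
IQR = Q3 - Q1 = 14 - 10.75 = 3.25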
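One way to check this result is with Python's standard library. `statistics.quantiles` with its default "exclusive" method reproduces the quartiles quoted in this example; other quartile conventions (several exist) can give slightly different values.

```python
import statistics

returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

# n=4 splits the data into quartiles; the default method is "exclusive"
q1, q2, q3 = statistics.quantiles(returns, n=4)
iqr = q3 - q1

print(q1, q3, iqr)  # 10.75 14.0 3.25
```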
Standard Deviation
Standard deviation forms the cornerstone for Inferential
Statistics.
Key Formulas
Important Terms with Notations
Sample Variance: $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Sample Standard Deviation: $S = \sqrt{S^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
Population Variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Population Standard Deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Sample Mean: $\bar{X} = \frac{\sum X}{n}$
Population Mean: $\mu = \frac{\sum X}{N}$
n = Number of observations in the sample (Sample Size)
N = Number of observations in the Population (Population Size)
Remarks
1. $S^2$ is an unbiased estimator of $\sigma^2$
2. $\bar{X}$ is an unbiased estimator of $\mu$
3. The divisor n-1 is always …
4. Standard deviation is always the square root of variance
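To make the difference between the two divisors concrete, here is a small sketch that implements the sample (n-1) and population (N) variance formulas directly. The data are the nine mutual fund returns from the interquartile range example, reused here as an assumption for illustration.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def population_variance(xs):
    """Sum of squared deviations from the population mean, divided by N."""
    if not xs:
        raise ValueError("population variance of an empty data set is undefined")
    big_n = len(xs)
    mu = sum(xs) / big_n
    return sum((x - mu) ** 2 for x in xs) / big_n

returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

s2 = sample_variance(returns)
s = math.sqrt(s2)                      # standard deviation = square root of variance
sigma2 = population_variance(returns)  # treats the same values as a whole population

print(round(s2, 2), round(s, 2), round(sigma2, 2))
```

For the same data, the sample variance is always the larger of the two because it divides the same sum of squares by a smaller number.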
Example for Standard Deviation
The following data represent the percentage return
on investment for 10 mutual funds per annum.
Calculate the sample standard deviation.
Solution for the Example
Solution for the Example Cont.
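From the spreadsheet of Microsoft Excel in the previous slide, it is
easy to see that
Mean = $\bar{X} = \frac{\sum X}{n}$ = 12.28 (in column A and row 14, 12.28 is seen).
Sample Variance = $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$ = 6.33 (in column D and row 14, 6.33 is seen).
Sample Standard Deviation = $S = \sqrt{S^2} = \sqrt{6.33} \approx 2.52$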
Coefficient of Variation (Relative Dispersion)
Coefficient of Variation (CV) is defined as the ratio of Standard
Deviation to Mean.
In symbolic form, $CV = \frac{S}{\bar{X}}$
Coefficient of Variation - Example
Consider two Sales Persons working in the same territory.
The sales performance of these two in the context of
selling PCs is given below. Comment on the results.
Interpretation for the Example
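The CV is 5/50 = 0.10 or 10% for Sales Person 1 and
25/75 = 0.33 or 33% for Sales Person 2. A lower CV means less
variability relative to the mean, so the performance of Sales
Person 1 is the more consistent of the two.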
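The two ratios can be reproduced with a short sketch. It assumes that the numerators (5 and 25) are the standard deviations and the denominators (50 and 75) are the means, as the definition of CV implies; the underlying table is not reproduced in this text.

```python
def coefficient_of_variation(std_dev, mean):
    """Ratio of standard deviation to mean (relative dispersion)."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean

# Assumed inputs, read off the ratios quoted in the interpretation
cv_person1 = coefficient_of_variation(std_dev=5, mean=50)
cv_person2 = coefficient_of_variation(std_dev=25, mean=75)

print(f"Sales Person 1: {cv_person1:.0%}")  # 10%
print(f"Sales Person 2: {cv_person2:.0%}")  # 33%
```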
The Empirical Rule
• The empirical rule approximates the variation of
data in a bell-shaped distribution
• Approximately 68% of the data in a bell-shaped
distribution lies within 1 standard deviation of the
mean, or μ ± 1σ
• Approximately 95% of the data in a bell-shaped distribution lies
within two standard deviations of the mean, or μ ± 2σ
• Approximately 99.7% of the data in a bell-shaped distribution lies
within three standard deviations of the mean, or μ ± 3σ
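These percentages can be checked against an exact normal distribution, which is the idealized bell shape. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); real data that are only roughly bell-shaped will match these figures only approximately.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.1%}")
```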
The Five Number Summary
Five Number Summary and The Boxplot
• The Boxplot: A graphical display of the data
based on the five-number summary:
Xsmallest -- Q1 -- Median -- Q3 -- Xlargest
Example:
Five Number Summary: Shape of Boxplots
• If data are symmetric around the median, then the box and
central line are centered between the endpoints
Distribution Shape and The Boxplot
[Figure: three boxplots, each with Q1, Q2, and Q3 marked]
Boxplot Example
Below is a Boxplot for the following data:
0 2 2 2 3 3 4 5 5 9 27
Xsmallest = 0, Q1 = 2, Q2 = 3, Q3 = 5, Xlargest = 27
Box plot example showing an outlier
• The boxplot below of the same data shows the
outlier value of 27 plotted separately
• A value is considered an outlier if it is more than 1.5
times the interquartile range below Q1 or above Q3
[Figure: boxplot of the sample data on an axis running from 0 to 30]
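The five-number summary and the 1.5 × IQR rule can be sketched for the eleven sample values from the boxplot example. The quartiles come from `statistics.quantiles` with its default "exclusive" method, which is an assumption about the quartile convention; for this data it gives Q1 = 2, Q2 = 3, and Q3 = 5.

```python
import statistics

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4)  # default "exclusive" method
    return ordered[0], q1, q2, q3, ordered[-1]

def outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 9, 27]

summary = five_number_summary(data)
flagged = outliers(data)
print(summary)  # (0, 2.0, 3.0, 5.0, 27)
print(flagged)  # [27]
```

Here the IQR is 3, so the upper limit is 5 + 4.5 = 9.5: the value 9 stays inside it and only 27 is flagged.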
Descriptive Statistics - CardioGood Fitness
The market research team at AdRight is assigned the task of identifying
the profile of the typical customer for each treadmill product offered
by CardioGood Fitness. The market research team decides to
investigate whether there are differences across the product lines
with respect to customer characteristics. The team decides to collect
data on individuals who purchased a treadmill at a CardioGood
Fitness retail store during the prior three months. The data are stored
in the CardioGoodFitness.csv file.
The team identifies the following customer variables to study:
product purchased, TM195, TM498, or TM798; gender; age, in years;
education, in years; relationship status, single or partnered; annual
household income ($); average number of times the customer plans
to use the treadmill each week; average number of miles the
customer expects to walk/run each week; and self-rated fitness on a
1-to-5 scale, where 1 is poor shape and 5 is excellent shape.
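A first pass at profiling customers by product line might look like the sketch below. The column names `Product` and `Age` are assumptions (the actual headers of CardioGoodFitness.csv are not given here), and the inline rows are hypothetical stand-ins so that the example runs without the file.

```python
import csv
import io
import statistics
from collections import defaultdict

def age_summary_by_product(rows):
    """Group rows by product and report count, mean age, and median age."""
    ages = defaultdict(list)
    for row in rows:
        ages[row["Product"]].append(float(row["Age"]))  # assumed column names
    return {
        product: {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
        for product, values in sorted(ages.items())
    }

# Hypothetical rows standing in for the real file
sample_csv = """Product,Age
TM195,18
TM195,24
TM498,25
TM498,31
TM798,28
TM798,30
TM798,41
"""

summary = age_summary_by_product(csv.DictReader(io.StringIO(sample_csv)))
for product, stats in summary.items():
    print(product, stats)

# With the real file (adjust the column names if they differ):
# with open("CardioGoodFitness.csv", newline="") as f:
#     summary = age_summary_by_product(csv.DictReader(f))
```

The same grouping can be repeated for income, usage, miles, or fitness to compare the three product lines on each variable.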