Unit 03 Descriptive Analysis and Visual Exploration
Analytics
UNIT 03: DESCRIPTIVE ANALYSIS AND VISUAL EXPLORATION - 1
Lesson Objectives
• Data Modifications in Excel
• Creating distributions with Data
• Measures of Location and Variation
• Analyzing Distributions
• Measures of Association between variables
Part 1 : All About Structured Data: Database, Tables, Rows, Columns
Columns | Variables | Fields
Rows | Observations | Records
Modifying Data in Excel
Using Excel’s Sort Function
Top Selling Automobiles Data Sorted by March 2010 Sales
Modifying Data in Excel
• Sorting and Filtering Data in Excel
• Find the number of Toyota models that were among the top 20 selling in 2011
Example - Using Excel’s Filter function to see the sales of models made by Toyota.
o Step 1: Select cells A1:F21
o Step 2: Click the DATA tab in the Ribbon
o Step 3: Click Filter in the Sort & Filter group
o Step 4: Click on the Filter Arrow in column B, next to Manufacturer
o Step 5: Uncheck (Select All) to deselect all choices, then select only the check box for Toyota
Top Selling Automobiles Data Filtered to Show
Only Automobiles Manufactured by Toyota
Modifying Data in Excel
Illustration (contd.)
• Step 3: Click Conditional Formatting in the Styles group
• Step 4: Select Highlight Cells Rules, and click Less Than from the dropdown menu
• Step 5: Enter 0% in the Format cells that are LESS THAN: box
• Step 6: Click OK
Using Conditional Formatting in Excel to Highlight
Automobiles with Declining Sales from March 2010
Using Conditional Formatting in Excel to Generate
Data Bars for the Top Selling Automobiles Data
Data Exploration: Creating
Distributions from Data
Creating Distributions from Data
Frequency distributions for qualitative/categorical data
Sample Data
Coke Coke
Diet Coke Sprite
Pepsi Pepsi
Diet Coke Coke
Coke Pepsi
Coke Sprite
Frequency Distribution of Soft Drink Purchases
Creating a Frequency Distribution for Soft Drinks
Data in Excel
SoftDrinks.xlsx
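As a rough Python counterpart to the Excel approach, the sketch below tallies a frequency distribution with collections.Counter. It uses only the 12 purchases visible on the sample-data slide, not the full SoftDrinks.xlsx file, so the counts differ from the 50-purchase distribution. The variable names are illustrative.

```python
from collections import Counter

# The 12 purchases visible on the sample-data slide (not the full data set).
purchases = [
    "Coke", "Coke",
    "Diet Coke", "Sprite",
    "Pepsi", "Pepsi",
    "Diet Coke", "Coke",
    "Coke", "Pepsi",
    "Coke", "Sprite",
]

# Counter maps each category (bin) to the number of times it occurs.
frequency = Counter(purchases)

# Print the bins from most to least frequent.
for drink, count in frequency.most_common():
    print(f"{drink:<10} {count}")
```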
Creating Distributions from Data
Relative frequency and Percent frequency
distributions
• Relative frequency distribution: A tabular summary of data showing the relative frequency for each bin (the frequency of the bin divided by n, the total number of observations).
• Percent frequency distribution: A tabular summary showing the percent frequency for each bin (the relative frequency × 100).
• Both are used to provide estimates of the relative likelihoods of different values of a random variable.
Relative Frequency and Percent Frequency Distributions of Soft Drink Purchases
Bins         Frequency   Relative Frequency   Percent Frequency (%)
Coke         19          0.38                 38
Diet Coke    8           0.16                 16
Pepsi        13          0.26                 26
Dr. Pepper   5           0.10                 10
Sprite       5           0.10                 10
Total        50          1.00                 100
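The relative and percent frequencies in the soft drink table can be reproduced from the frequencies alone. The following is a minimal Python sketch; the frequencies come from the table, and the dictionary names are illustrative.

```python
# Frequencies taken from the soft drink purchases table.
frequency = {
    "Coke": 19,
    "Diet Coke": 8,
    "Pepsi": 13,
    "Dr. Pepper": 5,
    "Sprite": 5,
}

n = sum(frequency.values())  # total number of observations

# Relative frequency = frequency of the bin / n
relative = {drink: count / n for drink, count in frequency.items()}

# Percent frequency = relative frequency x 100
percent = {drink: 100 * value for drink, value in relative.items()}

for drink in frequency:
    print(f"{drink:<11} {frequency[drink]:>3} {relative[drink]:.2f} {percent[drink]:.0f}")
```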
Creating Distributions from Data
Frequency distributions for quantitative data
Min =MIN(A2:D6)
Max =MAX(A2:D6)
Creating a Histogram for the Audit Time Data using Data
Analysis ToolPak in Excel
Completed Histogram for the Audit Time Data using Data
Analysis ToolPak in Excel
Creating Distributions from Data
A histogram provides information about the
shape, or form, of a distribution.
Histograms Showing Distributions with Different
Levels of Skewness
Skewness of the audit data?
Creating Distributions from Data
Cumulative frequency distribution: A variation of the
frequency distribution that provides another tabular
summary of quantitative data. It shows the number of
observations with values less than or equal to the upper
limit of each bin.
Cumulative Frequency, Cumulative Relative
Frequency, and Cumulative Percent Frequency
Distributions for the Audit Time Data
Descriptive Analysis
MEASURES OF LOCATION (CENTRAL TENDENCY) AND
VARIATION
All About Data: Descriptive Statistics
• For any numerical dataset, we would generally calculate measures of central tendency (measures of location). The most commonly used measures are:
o Mean – AKA average
o Median – The middle value of your data when the numbers are listed in order from smallest to largest
o Mode – The value that occurs most often in your data
• Measure the variability:
o Min (Minimum) – Smallest value
o Max (Maximum) – Largest value
o Range – Difference between the largest and smallest values (Max – Min)
o Standard Deviation
• Together, these measures indicate the overall size, typical values, and variability of the data.
All About Data: Descriptive Statistics
• Continue to explore the shape and variability of your data
• Interquartile range – Similar to the range, but instead of
the difference between the smallest and largest values, it is
the difference between the 75th percentile and the 25th
percentile (the values below which approximately 75% and
25% of the observations fall, respectively)
Measures of Location
Measures of Location
• Mean/Arithmetic mean
• Average value for a variable.
• The sample mean is denoted by $\bar{x}$.
Sample mean, $\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
o n = sample size
o $x_1$ = value of variable x for the first observation
o $x_2$ = value of variable x for the second observation
o $x_n$ = value of variable x for the nth observation
Data on Home Sales in Cincinnati, Ohio, Suburb
Computation of Sample Mean
Illustration: Computation of the mean home selling
price for the sample of 12 home sales:
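$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_{12}}{12}$
$= \frac{138{,}000 + 254{,}000 + \cdots + 456{,}250}{12}$
$= \frac{2{,}639{,}250}{12}$
$= 219{,}937.50$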
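The same computation can be checked outside Excel. Below is a minimal Python sketch that sums the 12 selling prices from the home sales data and divides by the sample size; the variable names are illustrative.

```python
# Selling prices ($) for the 12 home sales, in the order listed in the data.
home_prices = [
    138000, 254000, 186000, 257500, 108000, 254000,
    138000, 298000, 199500, 208000, 142000, 456250,
]

n = len(home_prices)          # sample size
total = sum(home_prices)      # sum of all x_i
sample_mean = total / n       # x-bar = (sum of x_i) / n

print(total, sample_mean)
```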
Measures of Location
• Median: Value in the middle when the data are arranged
in ascending order.
Computation of Sample Median
Illustration - When the number of observations is odd
• Consider the class size data for a sample of five college classes:
46 54 42 46 32
• Arrange the class size data in ascending order
32 42 46 46 54
• Middlemost value in the data set = 46.
• Median is 46.
Computation of Sample Median
Illustration - When the number of observations is even
• Consider the data on home sales in Cincinnati, Ohio, Suburb:
Home Sale   Selling Price ($)
1           138000
2           254000
3           186000
4           257500
5           108000
6           254000
7           138000
8           298000
9           199500
10          208000
11          142000
12          456250
Data: HomeSales.xlsx
Computation of Sample Median
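Illustration (contd.) - When the number of observations is even
• Arrange the home sales data in ascending order
108,000 138,000 138,000 142,000 186,000 199,500
208,000 254,000 254,000 257,500 298,000 456,250
• Median = average of the two middle values = (199,500 + 208,000) / 2 = 203,750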
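The odd and even cases can be sketched in one short Python function. This is an illustrative helper (the name `median` is ours, not from the slides) applied to the class size data and the home sales data.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle pair


class_sizes = [46, 54, 42, 46, 32]
home_prices = [
    138000, 254000, 186000, 257500, 108000, 254000,
    138000, 298000, 199500, 208000, 142000, 456250,
]

print(median(class_sizes))   # odd number of observations
print(median(home_prices))   # even number of observations
```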
Measures of Location
• Mode: Value that occurs most frequently in a data set.
• Consider the class size data:
32 42 46 46 54
• Observe - 46 is the only value that occurs more than once.
• Mode is 46.
• Multimodal data - Data contain at least two modes.
• Bimodal data - Data contain exactly two modes.
Calculating the Mean, Median, and Modes for the Home
Sales Data using Excel
MODE.SNGL (returns a single mode) vs MODE.MULTI (returns every mode as an array)
Data: HomeSales.xlsx
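For comparison, Python's standard statistics module draws a similar single-mode versus all-modes distinction. The sketch below is only loosely analogous to MODE.SNGL and MODE.MULTI and is not a claim about how Excel computes them.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]
home_prices = [
    138000, 254000, 186000, 257500, 108000, 254000,
    138000, 298000, 199500, 208000, 142000, 456250,
]

# One mode: 46 is the only class size that occurs more than once.
single_mode = statistics.mode(class_sizes)

# All modes: the home sales data are bimodal (two prices occur twice each).
all_modes = statistics.multimode(home_prices)

print(single_mode, all_modes)
```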
Measures of Location
• Geometric mean: nth root of the product of n values
• Used in analyzing growth rates in financial data.
• Sample geometric mean:
• $\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n}$
Illustration - Consider the percentage annual returns and growth
factors for the mutual fund data over the past 10 years.
Mutual Fund Data (Years 8 to 10)
Year   Return (%)   Growth Factor
8      26.5         1.265
9      15.1         1.151
10     2.1          1.021
Data: MutualFundReturns.xlsx
• We will determine the mean rate of growth for the fund over the 10-
year period.
Computation of Geometric Mean
Solution:
• Product of the growth factors:
(0.779)(1.287)(1.109)(1.049)(1.158)(1.055)(0.630)(1.265)(1.151)(1.021) ≈ 1.334
• Geometric mean of the growth factors:
$\bar{x}_g = \sqrt[10]{1.334} \approx 1.029$
• Conclude that annual returns grew at an average annual rate
of (1.029 – 1) × 100%, or 2.9%.
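The solution above can be verified with a few lines of Python. This is a minimal sketch using the ten growth factors listed in the solution; Excel's GEOMEAN function is the spreadsheet counterpart.

```python
import math

# Growth factors for the 10 years, as listed in the solution.
growth_factors = [0.779, 1.287, 1.109, 1.049, 1.158,
                  1.055, 0.630, 1.265, 1.151, 1.021]

n = len(growth_factors)
product = math.prod(growth_factors)           # product of the n values
geometric_mean = product ** (1 / n)           # nth root of the product
average_growth_pct = (geometric_mean - 1) * 100

print(f"product = {product:.4f}")
print(f"geometric mean = {geometric_mean:.3f}")
print(f"average annual growth = {average_growth_pct:.1f}%")
```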
Calculating the Geometric Mean for the
Mutual Fund Data Using Excel
Measures of Variability
Measures of Variability
• Range: Found by subtracting the smallest value from the largest value in a data set.
Illustration: Consider the data on home sales in Cincinnati, Ohio, Suburb:
Home Sale   Selling Price ($)
1           138000
2           254000
3           186000
4           257500
5           108000
6           254000
7           138000
8           298000
9           199500
10          208000
11          142000
12          456250
Computation of Range
Illustration (contd.):
• Largest home sales price - $456,250
• Smallest home sales price - $108,000
• Range = Largest value – Smallest value
= $456,250 – $108,000
= $348,250
• Drawback: Range is based on only two of the observations
and thus is highly influenced by extreme values.
Measures of Variability
• Variance: Measure of variability that utilizes all the data.
• It is based on the deviation about the mean, which is the
difference between the value of each observation (xi) and
the mean.
• The deviations about the mean are squared while
computing the variance.
• Sample variance, $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• Population variance, $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Computation of Deviations and Squared Deviations about the Mean
for the Class Size Data
Number of Students   Mean Class   Deviation about the    Squared Deviation about
in Class ($x_i$)     Size         Mean ($x_i - \bar{x}$)   the Mean ($(x_i - \bar{x})^2$)
46                   44           2                      4
54                   44           10                     100
42                   44           -2                     4
46                   44           2                      4
32                   44           -12                    144
Total                             0                      256
$\sum (x_i - \bar{x}) = 0$ and $\sum (x_i - \bar{x})^2 = 256$
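Dividing the total squared deviation by n – 1 gives the sample variance, and its square root gives the sample standard deviation. The following is a minimal Python sketch for the class size data; Excel's VAR.S and STDEV.S are the spreadsheet counterparts.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
mean = sum(class_sizes) / n                              # 44
squared_deviations = [(x - mean) ** 2 for x in class_sizes]
sum_squared = sum(squared_deviations)                    # 256

sample_variance = sum_squared / (n - 1)                  # divide by n - 1
sample_std = sample_variance ** 0.5                      # square root of the variance

# Cross-check against the standard library's sample variance.
library_variance = statistics.variance(class_sizes)

print(sample_variance, sample_std)
```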
Computation of Coefficient of Variation
Illustration:
• Consider the customer store visit size data:
46 54 42 46 32
• Mean, $\bar{x}$ = 44
• Standard deviation, s = 8
• Coefficient of variation = $\frac{s}{\bar{x}} \times 100\% = \frac{8}{44} \times 100\% = 18.2\%$
Measures of Variation:
Comparing Coefficients of Variation
Stock A:
• Average price last year = $50
• Standard deviation = $5
$CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
The scatter around the mean, relative to the size of the mean, is 10%.
Stock B:
• Average price last year = $100
• Standard deviation = $5
$CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price.
COPYRIGHT ©2013 PEARSON EDUCATION, INC. PUBLISHING AS PRENTICE HALL
Measures of Variation:
Comparing Coefficients of Variation
(continued)
Stock A:
• Average price last year = $50
• Standard deviation = $5
$CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
Stock C:
• Average price last year = $8
• Standard deviation = $2
$CV_C = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$2}{\$8} \cdot 100\% = 25\%$
Stock C has a much smaller standard deviation but a much higher coefficient of variation.
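The coefficient of variation comparisons above reduce to one division each. Below is a minimal Python sketch; the function name and the dictionary layout are illustrative, while the standard deviations and means are the ones given on the slides.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    return std_dev / mean * 100


# (standard deviation, mean) pairs taken from the slides.
cases = {
    "Data with mean 44, s = 8": (8, 44),
    "Stock A": (5, 50),
    "Stock B": (5, 100),
    "Stock C": (2, 8),
}

cv = {name: coefficient_of_variation(s, m) for name, (s, m) in cases.items()}

for name, value in cv.items():
    print(f"{name}: {value:.1f}%")
```

Stocks A and B share a standard deviation of $5 yet differ in relative variability, while Stock C has the smallest standard deviation and the largest coefficient of variation.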
Calculating Variability Measures for the Home
Sales Data in Excel
HomeSales.xlsx
Analyzing Distributions
Analyzing Distributions
• Percentile: Value of a variable at which a specified
(approximate) percentage of observations are below that
value.
• The pth percentile tells us the point in the data where:
• Approximately p percent of the observations have values less
than the pth percentile;
• Approximately (100 – p) percent of the observations have
values greater than the pth percentile.
Img source: Percentiles (mathsisfun.com)
Analyzing Distributions
• Steps to calculate the pth percentile:
1. Arrange the data in ascending order (smallest to largest value).
2. Compute k = (n + 1) × p, where p is our percentile value written as a decimal (for example, 0.85 for the 85th percentile).
3. Divide k into its integer component, i, and its decimal component, d.
a. If d = 0, find the value in position k of the sorted data. This is the pth percentile.
Analyzing Distributions
3b. If d > 0, the percentile is between the values in positions i
and i + 1 in the sorted data. To find this percentile, we must
interpolate between these two values.
i. Calculate the difference between the values in positions i and i + 1
in the sorted data set. We define this difference between the two
values as m.
ii. Multiply this difference by d: t = m × d.
iii. To find the pth percentile, add t to the value in position i of the
sorted data.
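For example, if k = 2.75, then i = 2 and d = 0.75, so the percentile lies 75% of the way from the value in position 2 to the value in position 3.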
Analyzing Distributions
Illustration: To determine the 85th percentile for the home sales data
1. Arrange the data in ascending order.
108,000 138,000 138,000 142,000 186,000 199,500
208,000 254,000 254,000 257,500 298,000 456,250
2. Compute k = (n + 1) × p = (12 + 1) × 0.85 = 11.05.
3. Dividing 11.05 into the integer and decimal components gives us
i = 11 and d = 0.05.
• Because d > 0, interpolate between the values in the 11th and 12th positions in the sorted data.
Analyzing Distributions
Illustration (contd.): To determine the 85th percentile for
the home sales data
• The value in the 11th position is 298,000, and the value in the 12th position is 456,250.
i. m = 456,250 – 298,000 = 158,250
ii. t = m × d = 158,250 × 0.05 = 7,912.5
iii. pth percentile = 298,000 + 7,912.5 = 305,912.5
• $305,912.50 represents the 85th percentile of the home sales data.
Excel function: =PERCENTILE.EXC(array, k)
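The percentile steps can be sketched as a small Python function. This is an illustrative implementation of the k = (n + 1) × p procedure described above, intended to mirror the slide's steps; the function name and the rounding guard are our own assumptions.

```python
def percentile_exclusive(values, p):
    """pth percentile (p as a decimal, e.g. 0.85) using k = (n + 1) * p."""
    ordered = sorted(values)            # step 1: ascending order
    n = len(ordered)
    k = round((n + 1) * p, 9)           # step 2 (rounding guards against float noise)
    i = int(k)                          # step 3: integer component
    d = k - i                           #         decimal component
    if i < 1 or i > n or (d > 0 and i == n):
        raise ValueError("percentile is outside the range this method covers")
    if d == 0:
        return ordered[i - 1]           # 3a: value in position k (1-based)
    m = ordered[i] - ordered[i - 1]     # 3b-i: difference between positions i and i + 1
    t = m * d                           # 3b-ii
    return ordered[i - 1] + t           # 3b-iii


home_prices = [
    138000, 254000, 186000, 257500, 108000, 254000,
    138000, 298000, 199500, 208000, 142000, 456250,
]

p85 = percentile_exclusive(home_prices, 0.85)
print(round(p85, 2))
```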
Analyzing Distributions
• Quartiles:
• When the data is divided into four equal parts:
• Each part contains approximately 25% of the observations.
• Division points are referred to as quartiles.
• $Q_1$ = first quartile, or 25th percentile: =PERCENTILE.EXC(array,0.25)
• $Q_2$ = second quartile, or 50th percentile (also the median): =PERCENTILE.EXC(array,0.50)
• $Q_3$ = third quartile, or 75th percentile: =PERCENTILE.EXC(array,0.75)
Application: Interquartile Range (IQR)
The IQR is the difference between the third quartile and the first quartile (Q3 – Q1). It
measures the spread of the middle half of the data.
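As a hedged cross-check outside Excel, Python's statistics.quantiles with method="exclusive" uses the same (n + 1) × p positions. The sketch below computes the three quartiles and the IQR for the home sales data.

```python
import statistics

home_prices = [
    138000, 254000, 186000, 257500, 108000, 254000,
    138000, 298000, 199500, 208000, 142000, 456250,
]

# n=4 asks for the three cut points that divide the data into four parts.
q1, q2, q3 = statistics.quantiles(home_prices, n=4, method="exclusive")

iqr = q3 - q1   # spread of the middle half of the data

print(q1, q2, q3, iqr)
```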
Analyzing Distributions
• z-score:
• Measures the relative location of a value in the data set.
• Helps to determine how far a particular value is from the mean relative to
the data set’s standard deviation.
• Standardized value
• If $x_1, x_2, \ldots, x_n$ is a sample of n observations:
• $z_i = \frac{x_i - \bar{x}}{s}$
• $z_i$ = z-score for $x_i$
• $\bar{x}$ = sample mean
• s = sample standard deviation
z-Scores for the Class Size Data
HomeSales.xlsx
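A minimal Python sketch of the z-score calculation for the class size data (mean 44, sample standard deviation 8) follows; the variable names are illustrative.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]

mean = statistics.mean(class_sizes)    # sample mean (44)
s = statistics.stdev(class_sizes)      # sample standard deviation (8)

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    print(f"{x}: z = {z:+.2f}")
```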
Analyzing Distributions
• Identifying outliers:
• Outliers: Extreme values in a data set.
• They can be identified using standardized
values (z-scores).
• Any data value with a z-score less than –3
or greater than +3 is treated as an outlier.
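Applying the ±3 rule to the home sales data is a short exercise. In the Python sketch below, the largest price ($456,250) has a z-score of about 2.49, so this rule flags no home sale as an outlier; the variable names are illustrative.

```python
import statistics

home_prices = [
    138000, 254000, 186000, 257500, 108000, 254000,
    138000, 298000, 199500, 208000, 142000, 456250,
]

mean = statistics.mean(home_prices)
s = statistics.stdev(home_prices)      # sample standard deviation

z_scores = [(x - mean) / s for x in home_prices]

# Flag any value whose z-score is below -3 or above +3.
outliers = [x for x, z in zip(home_prices, z_scores) if abs(z) > 3]

print(f"largest z-score = {max(z_scores):.2f}")
print("outliers:", outliers)
```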
Analyzing Distributions
• Box plot: Graphical summary of the distribution of data.
• Developed from the quartiles for a data set.
• By using the interquartile range, IQR = Q3 – Q1, limits are located for the whiskers.
• The limits for the box plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
Box Plot for the Home Sales Data (figure labels: Outlier, Q3, Q1)
Quartile 1   $139,000.00
Quartile 3   $256,625.00
IQR          $117,625.00
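The whisker limits follow directly from the two quartiles. The Python sketch below starts from the Quartile 1 and Quartile 3 values shown for the home sales data and flags any price outside the 1.5(IQR) limits; the variable names are illustrative.

```python
home_prices = [
    138000, 254000, 186000, 257500, 108000, 254000,
    138000, 298000, 199500, 208000, 142000, 456250,
]

# Quartiles as reported for the home sales data.
q1 = 139_000.00
q3 = 256_625.00

iqr = q3 - q1                      # interquartile range
lower_limit = q1 - 1.5 * iqr       # 1.5(IQR) below Q1
upper_limit = q3 + 1.5 * iqr       # 1.5(IQR) above Q3

# Values beyond the limits are drawn as outliers on the box plot.
outliers = [x for x in home_prices if x < lower_limit or x > upper_limit]

print(iqr, lower_limit, upper_limit, outliers)
```

Only the $456,250 sale exceeds the upper limit, which matches the single outlier marked on the box plot.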
Box Plots Comparing Home Sale Prices in Different
Communities
Data:
HomeComparison3.xlsx
Analyzing Distributions: Normal Distribution
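• The Empirical Rule: For data with a bell-shaped (approximately normal) distribution, approximately 68% of the values fall within 1 standard deviation of the mean, approximately 95% within 2 standard deviations, and almost all (about 99.7%) within 3 standard deviations.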
Measures of Association
Between Two Variables
Measures of Association Between Two
Variables
• Scatter Charts: Useful graph for analyzing the relationship
between two variables.
• Covariance: Descriptive measure of the linear association
between two variables.
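• Sample covariance for a sample of size n with the observations
$(x_1, y_1), (x_2, y_2)$, and so on:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• Population covariance, $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$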
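The sample covariance formula translates almost line for line into Python. The sketch below is illustrative only: the function name is ours, and the x and y values are made-up pairs, not the bottled water data. Excel's COVARIANCE.S is the spreadsheet counterpart.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 observations")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Made-up illustrative pairs (NOT the bottled water data).
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]

s_xy = sample_covariance(xs, ys)                 # positive: y tends to rise with x
s_xy_reversed = sample_covariance(xs, ys[::-1])  # negative: y tends to fall with x

print(s_xy, s_xy_reversed)
```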
Measures of Association Between Two
Variables
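• Correlation coefficient: Measures the strength of the linear relationship between
two variables.
• Not affected by the units of measurement for x and y.
• Sample correlation coefficient denoted by $r_{xy}$.
• $r_{xy} = \frac{s_{xy}}{s_x s_y}$
• $s_{xy}$ = sample covariance = $\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• $s_x$ = sample standard deviation of x = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
• $s_y$ = sample standard deviation of y = $\sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$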
Interpretation of Correlation Coefficient
• –1 ≤ r ≤ +1
r value   Relationship between the x and y variables
< 0       Negative linear
Near 0    No linear relationship
> 0       Positive linear
Data for Bottled Water Sales at Queensland
Amusement Park for a Sample of 14 Summer Days
BottledWater.xlsx
Chart Showing the Positive Linear Relation Between Sales and
High Temperatures
[Scatter chart: Bottled Water Sales (cases) on the vertical axis vs. High Temperature (°F) on the horizontal axis]
Sample Covariance Calculations for Daily High
Temperature and Bottled Water Sales at Queensland Amusement Park
Scatter Diagrams and Associated Covariance Values for Different
Variable Relationships
Computation of Correlation Coefficient
Illustration - To determine the sample correlation coefficient for
bottled water sales at Queensland Amusement Park:
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{12.8}{(4.36)(3.15)} = 0.93$
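The arithmetic in this illustration is a single division. Below is a minimal Python sketch using the three summary values from the slide (covariance 12.8 and standard deviations 4.36 and 3.15); the function name is illustrative.

```python
def correlation_from_summary(s_xy, s_x, s_y):
    """Sample correlation coefficient from the covariance and the two standard deviations."""
    return s_xy / (s_x * s_y)


# Summary values from the bottled water illustration.
r_xy = correlation_from_summary(12.8, 4.36, 3.15)

print(round(r_xy, 2))
```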
Example of Nonlinear Relationship Producing a Correlation
Coefficient Near Zero
$r_{xy} = -0.007$
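A correlation near zero rules out a linear relationship, not every relationship. The Python sketch below uses made-up data (not the data behind the slide's figure) in which y is exactly x squared, so the two variables are perfectly related, yet the correlation coefficient is zero. The function name is illustrative.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r_xy = s_xy / (s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


# Made-up illustrative data: a symmetric range of x values.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curved = [x ** 2 for x in xs]       # perfect, but nonlinear, relationship
ys_linear = [2 * x + 1 for x in xs]    # perfect linear relationship

r_curved = correlation(xs, ys_curved)
r_linear = correlation(xs, ys_linear)

print(r_curved, r_linear)
```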
Calculating Covariance and Correlation Coefficient for Bottled Water Sales
Using Excel
All About The Data
Ask Questions:
• What is this dataset about?
• What are the different rows and variables?
• What are we trying to determine?
• Is there one variable that’s the end result or being impacted by other variables?
Understand the data:
• Mean, Median, Mode, Min, Max, Variance, Standard Deviation
• Look at the relationships between variables (covariance, correlation)
• Visualize the data and its relationships (histogram, skew, scatterplot)
Draw your hypothesis
Source: Khan Academy
End of Part 1