Unit 03 Descriptive Analysis and Visual Exploration

ANLC751: Managerial Analytics
UNIT 03: DESCRIPTIVE ANALYSIS AND VISUAL EXPLORATION - 1
Lesson Objectives
• Data Modifications in Excel
• Creating distributions with Data
• Measures of Location and Variation
• Analyzing Distributions
• Measures of Association between variables
Part 1: All About Structured Data: Database, Tables, Rows, Columns
Columns | Variables | Fields
Rows | Observations | Records

• The data we deal with are mainly two-dimensional tables
• Made up of columns and rows
• A column is referred to as a variable; it is usually a characteristic recorded for each record
• A row refers to a specific record or instance
• There can be multiple related tables that hold different sets of information; together they can be referred to as a database
Data Exploration: Modifying Data in Excel
SOME EXPLORATORY TECHNIQUES
MODIFYING DATA IN EXCEL

• Projects often involve so much data that it is difficult to analyze all of the data at once
• We will look at methods for summarizing and manipulating data to make the data more manageable and to develop insights.
Exploratory Techniques
Top 20 Selling Automobiles in United States in March 2011
Rank (by March 2011 Sales)  Manufacturer  Model           Sales (March 2011)  Sales (March 2010)
1                           Honda         Accord          33616               29120
2                           Nissan        Altima          32289               24649
3                           Toyota        Camry           31464               36251
4                           Honda         Civic           31213               22463
5                           Toyota        Corolla/Matrix  30234               29623
6                           Ford          Fusion          27566               22773
7                           Hyundai       Sonata          22894               18935
8                           Hyundai       Elantra         19255               8225
9                           Toyota        Prius           18605               11786
10                          Chevrolet     Cruze/Cobalt    18101               10316
11                          Chevrolet     Impala          18063               15594
12                          Nissan        Sentra          17851               8721
13                          Ford          Focus           17178               19500
14                          Volkswagen    Jetta           16969               9196
15                          Chevrolet     Malibu          15551               17750
16                          Mazda         3               12467               11353
17                          Nissan        Versa           11075               13811
18                          Subaru        Outback         10498               7619
19                          Kia           Soul            10028               5106
20                          Ford          Fiesta          9787                0
Data: Top20Cars.xlsx
Top 20 Selling Automobiles Data entered into Excel with Percent Change in Sales from 2010
Modifying Data in Excel

• Sorting and Filtering Data in Excel
• Illustration - To sort the automobiles by March 2010 sales
o Step 1: Select cells A1:F21
o Step 2: Click the DATA tab in the Ribbon
o Step 3: Click Custom Sort in the Sort & Filter group
o Step 4: Select the check box for My data has headers
o Step 5: In the first Sort by dropdown menu, select Sales (March 2010)
o Step 6: In the Order dropdown menu, select Largest to Smallest
o Step 7: Click OK
Using Excel’s Sort Function
Top Selling Automobiles Data Sorted by March 2010 Sales
Modifying Data in Excel
• Sorting and Filtering Data in Excel
• Find the number of Toyota models that were among the top 20 selling in March 2011

Example - Using Excel’s Filter function to see the sales of models made by Toyota.
o Step 1: Select cells A1:F21
o Step 2: Click the DATA tab in the Ribbon
o Step 3: Click Filter in the Sort & Filter group
o Step 4: Click on the Filter Arrow in column B, next to Manufacturer
o Step 5: Select only the check box for Toyota. You can easily deselect all choices first by unchecking (Select All)
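As a rough cross-check outside Excel, the sort and filter steps can be sketched in Python. This is only an illustration, not part of the course material: the tuple layout and the variable names (`cars`, `by_2010_sales`, `toyota`) are assumptions, and the rows are copied from the Top 20 table.

```python
# Each row: (rank, manufacturer, model, sales_march_2011, sales_march_2010),
# copied from the Top 20 table in this unit.
cars = [
    (1, "Honda", "Accord", 33616, 29120),
    (2, "Nissan", "Altima", 32289, 24649),
    (3, "Toyota", "Camry", 31464, 36251),
    (4, "Honda", "Civic", 31213, 22463),
    (5, "Toyota", "Corolla/Matrix", 30234, 29623),
    (6, "Ford", "Fusion", 27566, 22773),
    (7, "Hyundai", "Sonata", 22894, 18935),
    (8, "Hyundai", "Elantra", 19255, 8225),
    (9, "Toyota", "Prius", 18605, 11786),
    (10, "Chevrolet", "Cruze/Cobalt", 18101, 10316),
    (11, "Chevrolet", "Impala", 18063, 15594),
    (12, "Nissan", "Sentra", 17851, 8721),
    (13, "Ford", "Focus", 17178, 19500),
    (14, "Volkswagen", "Jetta", 16969, 9196),
    (15, "Chevrolet", "Malibu", 15551, 17750),
    (16, "Mazda", "3", 12467, 11353),
    (17, "Nissan", "Versa", 11075, 13811),
    (18, "Subaru", "Outback", 10498, 7619),
    (19, "Kia", "Soul", 10028, 5106),
    (20, "Ford", "Fiesta", 9787, 0),
]

# Custom Sort: order by Sales (March 2010), largest to smallest.
by_2010_sales = sorted(cars, key=lambda row: row[4], reverse=True)

# Filter: keep only the rows whose manufacturer is Toyota.
toyota = [row for row in cars if row[1] == "Toyota"]

print("Top seller in March 2010:", by_2010_sales[0][2])
print("Toyota models in the top 20:", [row[2] for row in toyota])
```

Counting the filtered rows answers the question posed above about how many Toyota models made the list.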
Top Selling Automobiles Data Filtered to Show Only Automobiles Manufactured by Toyota
Modifying Data in Excel
• Conditional Formatting of Data in Excel: Makes it easy to identify data that satisfy certain conditions in a data set.

Illustration - To identify the automobile models for which sales had decreased from March 2010 to March 2011.
• Step 1: Starting with the original data shown in Cars, select cells F1:F21
• Step 2: Click on the HOME tab in the Ribbon
Modifying Data in Excel
Illustration (contd.)
• Step 3: Click Conditional Formatting in the Styles group
• Step 4: Select Highlight Cells Rules, and click Less Than from the dropdown menu
• Step 5: Enter 0% in the Format cells that are LESS THAN: box
• Step 6: Click OK
Using Conditional Formatting in Excel to Highlight Automobiles with Declining Sales from March 2010
Using Conditional Formatting in Excel to Generate Data Bars for the Top Selling Automobiles Data
Data Exploration: Creating Distributions from Data
Creating Distributions from Data
Frequency distributions for qualitative/categorical data

• Frequency distribution: A summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes, typically referred to as bins when dealing with distributions.
Sample Data
Data from a Qualitative Sample of 50 Soft Drink Purchases
Data: SoftDrinks.xlsx
Coke        Coke
Diet Coke   Sprite
Pepsi       Pepsi
Diet Coke   Coke
Coke        Pepsi
Coke        Sprite
Dr. Pepper  Dr. Pepper
Diet Coke   Pepsi
Pepsi       Pepsi
Diet Coke   Pepsi
Coke        Coke
Dr. Pepper  Coke
Sprite      Diet Coke
Coke        Pepsi
Diet Coke   Pepsi
Coke        Pepsi
Coke        Coke
Diet Coke   Dr. Pepper
Coke        Sprite
Coke        Coke
Coke        Coke
Sprite      Pepsi
Coke        Dr. Pepper
Coke        Pepsi
Diet Coke   Pepsi
Frequency Distribution of Soft Drink Purchases

• The frequency distribution summarizes information about the popularity of the five soft drinks
Creating a Frequency Distribution for Soft Drinks Data in Excel
SoftDrinks.xlsx
Creating Distributions from Data
Relative frequency and percent frequency distributions
• Relative frequency distribution: A tabular summary of data showing the relative frequency for each bin.
• Percent frequency distribution: Summarizes the percent frequency of the data for each bin.
• Used to provide estimates of the relative likelihoods of different values of a random variable.
Relative Frequency and Percent Frequency Distributions of Soft Drink Purchases
Bins        Frequency  Relative Frequency  Percent Frequency (%)
Coke        19         0.38                38
Diet Coke   8          0.16                16
Pepsi       13         0.26                26
Dr. Pepper  5          0.10                10
Sprite      5          0.10                10
Total       50
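The three distributions for the soft drink sample can be reproduced with a short Python sketch. It is offered only as an illustration alongside the Excel approach; the list `purchases` repeats the 50 sample values from this unit, and the variable names are assumptions.

```python
from collections import Counter

# The 50 soft drink purchases from the sample data (order is irrelevant for counting).
purchases = [
    "Coke", "Coke", "Diet Coke", "Sprite", "Pepsi", "Pepsi", "Diet Coke", "Coke",
    "Coke", "Pepsi", "Coke", "Sprite", "Dr. Pepper", "Dr. Pepper", "Diet Coke",
    "Pepsi", "Pepsi", "Pepsi", "Diet Coke", "Pepsi", "Coke", "Coke", "Dr. Pepper",
    "Coke", "Sprite", "Diet Coke", "Coke", "Pepsi", "Diet Coke", "Pepsi", "Coke",
    "Pepsi", "Coke", "Coke", "Diet Coke", "Dr. Pepper", "Coke", "Sprite", "Coke",
    "Coke", "Coke", "Coke", "Sprite", "Pepsi", "Coke", "Dr. Pepper", "Coke",
    "Pepsi", "Diet Coke", "Pepsi",
]

n = len(purchases)

# Frequency: number of observations in each bin (here, each soft drink).
frequency = Counter(purchases)

# Relative frequency = frequency / n; percent frequency = 100 * frequency / n.
relative = {drink: count / n for drink, count in frequency.items()}
percent = {drink: 100 * count / n for drink, count in frequency.items()}

for drink, count in frequency.most_common():
    print(f"{drink:<11}{count:>3}{relative[drink]:>7.2f}{percent[drink]:>6.0f}")
```

The printed rows should line up with the table above, listed from most to least frequent.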
Creating Distributions from Data
Frequency distributions for quantitative data

• Three steps are necessary to define the classes for a frequency distribution with quantitative data:
1. Determine the number of nonoverlapping bins (groups).
2. Determine the width of each bin.
3. Determine the bin limits.
$\text{Approximate bin width} = \dfrac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of bins}}$
Creating Distributions from Data
Example: Year-End Audit Times (Days)
Find the number of times the audit duration was greater than 30 days
Year-End Audit Times (in Days)
12 14 19 18
15 15 18 17
20 27 22 23
22 21 33 28
14 18 16 13

Frequency, Relative Frequency, and Percent Frequency Distributions for the Audit Time Data (AuditData.xlsx)
Class-Interval (days)  Bin  Frequency  Relative Frequency  Percent Frequency
10-14                  14   4          0.20                20
15-19                  19   8          0.40                40
20-24                  24   5          0.25                25
25-29                  29   2          0.10                10
30-34                  34   1          0.05                5
Using Excel to Generate a Frequency Distribution for Audit Times Data
Year-End Audit Times (in Days)
12 14 19 18
15 15 18 17
20 27 22 23
22 21 33 28
14 18 16 13

Min =MIN(A2:D6)
Max =MAX(A2:D6)

Class-Interval (days)  Bin  Frequency                  Relative Frequency  Percent Frequency
10-14                  14   =FREQUENCY(A2:D6,B12:B16)  =C12/$C$17          =D12*100
15-19                  19   =FREQUENCY(A2:D6,B12:B16)  =C13/$C$17          =D13*100
20-24                  24   =FREQUENCY(A2:D6,B12:B16)  =C14/$C$17          =D14*100
25-29                  29   =FREQUENCY(A2:D6,B12:B16)  =C15/$C$17          =D15*100
30-34                  34   =FREQUENCY(A2:D6,B12:B16)  =C16/$C$17          =D16*100
Total                       =SUM(C12:C16)              =SUM(D12:D16)       =SUM(E12:E16)

Resulting frequencies:
Class-Interval (days)  Bin  Frequency
10-14                  14   4
15-19                  19   8
20-24                  24   5
25-29                  29   2
30-34                  34   1
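The three steps for building the bins, and the counts that FREQUENCY returns, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the choice of five bins and a first lower limit of 10 mirrors the class intervals shown in this unit, and all variable names are hypothetical.

```python
import math

# Year-end audit times (in days) from the data above.
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

# Step 1: number of bins (assumed to be 5, as in the table).
num_bins = 5

# Step 2: approximate bin width = (largest - smallest) / number of bins,
# rounded up to a convenient whole number of days.
approx_width = (max(audit_times) - min(audit_times)) / num_bins
bin_width = math.ceil(approx_width)

# Step 3: bin limits. Assumption: the first class starts at 10,
# so the upper class limits become 14, 19, 24, 29, 34.
first_lower_limit = 10
upper_limits = [first_lower_limit + bin_width * (i + 1) - 1 for i in range(num_bins)]

# Count the values greater than the previous upper limit and
# less than or equal to the current one.
frequencies = []
previous_upper = first_lower_limit - 1
for upper in upper_limits:
    frequencies.append(sum(1 for t in audit_times if previous_upper < t <= upper))
    previous_upper = upper

n = len(audit_times)
relative = [f / n for f in frequencies]
percent = [100 * f / n for f in frequencies]

# Number of audits that took more than 30 days.
over_30 = sum(1 for t in audit_times if t > 30)

print(approx_width, bin_width, upper_limits, frequencies, over_30)
```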
Creating Distributions from Data
• Histogram: A common graphical presentation of quantitative data
• Constructed by placing the variable of interest on the horizontal axis and the selected frequency measure (absolute frequency, relative frequency, or percent frequency) on the vertical axis.
• The frequency measure of each class is shown by drawing a rectangle whose base is determined by the class limits on the horizontal axis and whose height is the corresponding frequency measure.
Histogram for the Audit Time Data
Creating a Histogram for the Audit Time Data using Data Analysis ToolPak in Excel
Completed Histogram for the Audit Time Data using Data Analysis ToolPak in Excel
Creating Distributions from Data
• Histogram provides information about the shape, or form, of a distribution.
• Skewness: Lack of symmetry
• Important characteristic of the shape of a distribution
Histograms Showing Distributions with Different Levels of Skewness
Skewness of audit data?
Creating Distributions from Data
• Cumulative frequency distribution: A variation of the frequency distribution that provides another tabular summary of quantitative data.
• Uses the number of classes, class widths, and class limits developed for the frequency distribution.
• Shows the number of data items with values less than or equal to the upper class limit of each class.
Cumulative Frequency, Cumulative Relative Frequency, and Cumulative Percent Frequency Distributions for the Audit Time Data
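Since the cumulative table itself is not reproduced here, a small Python sketch can show how the cumulative columns follow from the audit time frequencies given earlier in this unit (4, 8, 5, 2, 1). The variable names are assumptions made for illustration.

```python
from itertools import accumulate

# Upper class limits and frequencies from the audit time frequency distribution.
upper_limits = [14, 19, 24, 29, 34]
frequencies = [4, 8, 5, 2, 1]
n = sum(frequencies)

# Running totals: number of audits taking at most each upper class limit.
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / n for c in cumulative]
cumulative_percent = [100 * c / n for c in cumulative]

for limit, c, rel, pct in zip(upper_limits, cumulative,
                              cumulative_relative, cumulative_percent):
    print(f"Less than or equal to {limit}: {c:>2}  {rel:.2f}  {pct:.0f}%")
```

The last row always equals the sample size (and 100%), which is a quick sanity check on the table.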
Descriptive Analysis
MEASURES OF LOCATION (CENTRAL TENDENCY) AND
VARIATION
All About Data: Descriptive Statistics
• For any numerical dataset we would generally calculate the central tendency (measures of location). The most commonly used measures are
o Mean – AKA average
o Median – The middle value of your data when the numbers are listed in order from smallest to largest
o Mode – The value that occurs most often in your data
• Measure the variability
o Min (Minimum) – Smallest value
o Max (Maximum) – Largest value
o Range – Largest value minus smallest value (Max – Min)
o Standard Deviation
• This should give you an indication of the overall data size, values, and variability of the data.
All About Data: Descriptive Statistics
• Continue to explore the shape and variability of your data
• Interquartile range – similar to the range, but instead of calculating the difference between the smallest and largest values, you calculate the difference between the 25th percentile and the 75th percentile (the values below which approximately 25% and 75% of the observations fall, respectively)
Measures of Location
Measures of Location
• Mean/Arithmetic mean
• Average value for a variable.
• The sample mean is denoted by $\bar{x}$.
Sample mean, $\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
o $n$ = sample size
o $x_1$ = value of variable x for the first observation
o $x_2$ = value of variable x for the second observation
o $x_n$ = value of variable x for the nth observation
Data on Home Sales in Cincinnati, Ohio, Suburb
Computation of Sample Mean
Illustration: Computation of the mean home selling price for the sample of 12 home sales:
$\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{x_1 + x_2 + \cdots + x_{12}}{12} = \dfrac{138{,}000 + 254{,}000 + \cdots + 456{,}250}{12} = \dfrac{2{,}639{,}250}{12} = 219{,}937.50$
Measures of Location
• Median: Value in the middle when the data are arranged in ascending order.
• Middle value, for an odd number of observations
• Average of two middle values, for an even number of observations
Computation of Sample Median
Illustration - When the number of observations is odd
• Consider the class size data for a sample of five college classes:
46 54 42 46 32
• Arrange the class size data in ascending order
32 42 46 46 54
• Middlemost value in the data set = 46.
• Median is 46.
Computation of Sample Median
Illustration - When the number of observations is even
• Consider the data on home sales in Cincinnati, Ohio, Suburb (HomeSales.xlsx):
Home Sale  Selling Price ($)
1          138000
2          254000
3          186000
4          257500
5          108000
6          254000
7          138000
8          298000
9          199500
10         208000
11         142000
12         456250
Computation of Sample Median
Illustration (contd.) - When the number of observations is even
• Arrange the data in ascending order:
1. 108,000
2. 138,000
3. 138,000
4. 142,000
5. 186,000
6. 199,500 (middle value)
7. 208,000 (middle value)
8. 254,000
9. 254,000
10. 257,500
11. 298,000
12. 456,250
• Median = average of two middle values = $\dfrac{199{,}500 + 208{,}000}{2} = 203{,}750$
Measures of Location
• Mode: Value that occurs most frequently in a data set.
• Consider the class size data:
32 42 46 46 54
• Observe - 46 is the only value that occurs more than once.
• Mode is 46.
• Multimodal data - Data contain at least two modes.
• Bimodal data - Data contain exactly two modes.

Calculating the Mean, Median, and Modes for the Home Sales Data using Excel
MODE.SNGL vs MODE.MULTI
HomeSales.xlsx
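For comparison with the Excel functions, the same three measures can be computed with Python's standard `statistics` module. This is a sketch for illustration only; `home_prices` repeats the 12 selling prices from this unit, and the pairing of `statistics.mode` with MODE.SNGL and `statistics.multimode` with MODE.MULTI is a loose analogy, not a claim that they behave identically in every case.

```python
import statistics

# Selling prices ($) for the 12 home sales.
home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

mean_price = statistics.mean(home_prices)      # sum of values / n
median_price = statistics.median(home_prices)  # average of the two middle values (n is even)

# mode() returns a single value; multimode() returns every most-frequent value.
single_mode = statistics.mode(home_prices)
all_modes = statistics.multimode(home_prices)

print(mean_price, median_price, single_mode, all_modes)
```

Two prices each occur twice, so this data set is bimodal and the single-mode result reports only one of them.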
Measures of Location
• Geometric mean: nth root of the product of n values
• Used in analyzing growth rates in financial data.
• Sample geometric mean:
• $\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n}$
Illustration - Consider the percentage annual returns and growth factors for the mutual fund data over the past 10 years.

Percentage Annual Returns and Growth Factors for the Mutual Fund Data
Data: MutualFundReturns.xlsx
Growth Factor = (Return % / 100) + 1
Year  Return (%)  Growth Factor
1     -22.1       0.779
2     28.7        1.287
3     10.9        1.109
4     4.9         1.049
5     15.8        1.158
6     5.5         1.055
7     -37.0       0.630
8     26.5        1.265
9     15.1        1.151
10    2.1         1.021

• We will determine the mean rate of growth for the fund over the 10-year period.
Computation of Geometric Mean
Solution:
• Product of the growth factors:
• (0.779)(1.287)(1.109)(1.049)(1.158)(1.055)(0.630)(1.265)(1.151)(1.021) ≈ 1.334
• Geometric mean of the growth factors:
$\bar{x}_g = \sqrt[10]{1.334} = 1.029$
• Conclude that annual returns grew at an average annual rate of (1.029 – 1) × 100%, or 2.9%.
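The geometric mean calculation can be checked with a few lines of Python. This is a minimal sketch using the ten annual returns listed in this unit; the variable names are assumptions.

```python
import math

# Annual returns (%) for the mutual fund over the 10 years.
returns_pct = [-22.1, 28.7, 10.9, 4.9, 15.8, 5.5, -37.0, 26.5, 15.1, 2.1]

# Growth factor = (return % / 100) + 1.
growth_factors = [r / 100 + 1 for r in returns_pct]

# Geometric mean = nth root of the product of the n growth factors.
product = math.prod(growth_factors)
geometric_mean = product ** (1 / len(growth_factors))

# Average annual growth rate implied by the geometric mean.
average_rate_pct = (geometric_mean - 1) * 100

print(round(product, 4), round(geometric_mean, 3), round(average_rate_pct, 1))
```

Note that the arithmetic mean of the ten returns would overstate the growth actually achieved, which is why the geometric mean is used for growth rates.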
Calculating the Geometric Mean for the Mutual Fund Data Using Excel
Measures of Variability
Measures of Variability
• Range: Found by subtracting the smallest value from the largest value in a data set.
Illustration: Consider the data on home sales in Cincinnati, Ohio, Suburb:
Home Sale  Selling Price ($)
1          138000
2          254000
3          186000
4          257500
5          108000
6          254000
7          138000
8          298000
9          199500
10         208000
11         142000
12         456250
Computation of Range
Illustration (contd.):
• Largest home sales price - $456,250
• Smallest home sales price - $108,000
• Range = Largest value – Smallest value
= $456,250 – $108,000
= $348,250
• Drawback: Range is based on only two of the observations
and thus is highly influenced by extreme values.

Measures of Variability
• Variance: Measure of variability that utilizes all the data.
• It is based on the deviation about the mean, which is the difference between the value of each observation ($x_i$) and the mean.
• The deviations about the mean are squared while computing the variance.
• Sample variance, $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
• Population variance, $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$
Computation of Deviations and Squared Deviations about the Mean for the Class Size Data
Number of Students in Class ($x_i$)  Mean class size ($\bar{x}$)  Deviation about the Mean ($x_i - \bar{x}$)  Squared Deviation about the Mean ($(x_i - \bar{x})^2$)
46                                   44                           2                                           4
54                                   44                           10                                          100
42                                   44                           -2                                          4
46                                   44                           2                                           4
32                                   44                           -12                                         144
Total                                                             $\sum (x_i - \bar{x}) = 0$                  $\sum (x_i - \bar{x})^2 = 256$

• Computation of Sample Variance: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = \dfrac{256}{4} = 64$
• Computation of Sample Standard Deviation: $s = \sqrt{s^2} = \sqrt{64} = 8$
Measures of Variability
• Standard deviation: Positive square root of the variance
• Measured in the same units as the original data.
• For sample, $s = \sqrt{s^2}$
• For population, $\sigma = \sqrt{\sigma^2}$
• Coefficient of variation:
• $\left(\dfrac{\text{Standard deviation}}{\text{Mean}} \times 100\right)\%$
• Measures the standard deviation relative to the mean.
• Expressed as a percentage.
Computation of Coefficient of Variation
Illustration:
• Consider the class size data:
46 54 42 46 32
• Mean, $\bar{x}$ = 44
• Standard deviation, s = 8
• Coefficient of variation = $\dfrac{8}{44} \times 100\% = 18.2\%$
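The variance, standard deviation, and coefficient of variation for the class size data can be traced step by step in Python. The sketch below follows the sample formulas (dividing by n - 1) and cross-checks them against the standard `statistics` module; the variable names are assumptions.

```python
import math
import statistics

class_sizes = [46, 54, 42, 46, 32]
n = len(class_sizes)

# Mean, deviations about the mean, and squared deviations.
mean = sum(class_sizes) / n
deviations = [x - mean for x in class_sizes]
squared_deviations = [d ** 2 for d in deviations]

# Sample variance divides the sum of squared deviations by n - 1.
sample_variance = sum(squared_deviations) / (n - 1)
sample_std_dev = math.sqrt(sample_variance)

# Coefficient of variation: standard deviation relative to the mean, in percent.
coefficient_of_variation = sample_std_dev / mean * 100

# Cross-check with the library implementations of the sample formulas.
library_variance = statistics.variance(class_sizes)
library_std_dev = statistics.stdev(class_sizes)

print(mean, sample_variance, sample_std_dev, round(coefficient_of_variation, 1))
```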
Measures of Variation: Comparing Coefficients of Variation
Stock A:
• Average price last year = $50
• Standard deviation = $5
• $CV_A = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\% = \dfrac{\$5}{\$50} \cdot 100\% = 10\%$
• The scatter around the mean, relative to the size of the mean, is 10%
Stock B:
• Average price last year = $100
• Standard deviation = $5
• $CV_B = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\% = \dfrac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price
COPYRIGHT ©2013 PEARSON EDUCATION, INC. PUBLISHING AS PRENTICE HALL
Measures of Variation: Comparing Coefficients of Variation (continued)
Stock A:
• Average price last year = $50
• Standard deviation = $5
• $CV_A = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\% = \dfrac{\$5}{\$50} \cdot 100\% = 10\%$
Stock C:
• Average price last year = $8
• Standard deviation = $2
• $CV_C = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\% = \dfrac{\$2}{\$8} \cdot 100\% = 25\%$
Stock C has a much smaller standard deviation but a much higher coefficient of variation
Calculating Variability Measures for the Home Sales Data in Excel
HomeSales.xlsx
Analyzing Distribution
Analyzing Distributions
• Percentile: Value of a variable at which a specified
(approximate) percentage of observations are below that
value.
• The pth percentile tells us the point in the data where:
• Approximately p percent of the observations have values less
than the pth percentile;
• Approximately (100 – p) percent of the observations have
values greater than the pth percentile.

Img source: Percentiles (mathsisfun.com)
Analyzing Distributions
• Steps to calculate the pth percentile:
1. Arrange the data in ascending order (smallest to largest value).
2. Compute k = (n + 1) × p, where p is the percentile expressed as a decimal (for example, 0.85 for the 85th percentile).
3. Divide k into its integer component, i, and its decimal component, d.
a. If d = 0, find the value in position k of the sorted data. This is the pth percentile.

(contd.)
Analyzing Distributions
3b. If d > 0, the percentile is between the values in positions i and i + 1 in the sorted data. To find this percentile, we must interpolate between these two values.
i. Calculate the difference between the values in positions i and i + 1 in the sorted data set. We define this difference between the two values as m.
ii. Multiply this difference by d: t = m × d.
iii. To find the pth percentile, add t to the value in position i of the sorted data.
For example, if k = 2.75, then i = 2 and d = 0.75.
Analyzing Distributions
Illustration: To determine the 85th percentile for the home sales data
1. Arrange the data in ascending order.
108,000 138,000 138,000 142,000 186,000 199,500
208,000 254,000 254,000 257,500 298,000 456,250
2. Compute k = (n + 1) × p = (12 + 1) × 0.85 = 11.05.
3. Dividing 11.05 into the integer and decimal components gives us i = 11 and d = 0.05.
• Since d > 0, interpolate between the values in the 11th and 12th positions in the sorted data.
Analyzing Distributions
Illustration (contd.): To determine the 85th percentile for the home sales data
• The value in the 11th position is 298,000, and
• The value in the 12th position is 456,250.
i. m = 456,250 – 298,000 = 158,250
ii. t = m × d = 158,250 × 0.05 = 7,912.5
iii. 85th percentile = 298,000 + 7,912.5 = 305,912.5
• $305,912.50 represents the 85th percentile of the home sales data.
Excel function: =PERCENTILE.EXC(array,k), where Excel's k argument is the percentile as a decimal (here 0.85), not the position k computed above.
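The percentile steps can be written as a small Python function. This is a sketch of the k = (n + 1) × p procedure described in this unit, not a reproduction of Excel's internals; the function name `percentile_exclusive` and its error handling are assumptions made for illustration.

```python
import math

def percentile_exclusive(data, p):
    """Return the p-th percentile (p as a decimal, e.g. 0.85) using k = (n + 1) * p."""
    values = sorted(data)              # Step 1: ascending order
    n = len(values)
    k = (n + 1) * p                    # Step 2: position in the sorted data
    i = math.floor(k)                  # Step 3: integer component
    d = k - i                          #         decimal component
    # Assumption: positions outside 1..n cannot be interpolated, so reject them.
    if i < 1 or i > n or (i == n and d > 0):
        raise ValueError("percentile position falls outside the data")
    if d == 0:
        return values[i - 1]           # 3a: the value in position k
    m = values[i] - values[i - 1]      # 3b-i: gap between positions i and i + 1
    t = m * d                          # 3b-ii
    return values[i - 1] + t           # 3b-iii

home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

p85 = percentile_exclusive(home_prices, 0.85)
p50 = percentile_exclusive(home_prices, 0.50)
print(round(p85, 2), p50)
```

Because k is computed in floating point, results are best compared after rounding, as in the print statement.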
Analyzing Distributions
• Quartiles:
• When the data is divided into four equal parts:
• Each part contains approximately 25% of the observations.
• Division points are referred to as quartiles.
• Q1 = first quartile, or 25th percentile: =PERCENTILE.EXC(array,0.25)
• Q2 = second quartile, or 50th percentile (also the median): =PERCENTILE.EXC(array,0.50)
• Q3 = third quartile, or 75th percentile: =PERCENTILE.EXC(array,0.75)
Application: Interquartile Range (IQR)
The IQR is the difference between the third quartile and the first quartile (Q3 – Q1). It measures the spread of the middle half of the data.
Analyzing Distributions
• z-score:
• Measures the relative location of a value in the data set.
• Helps to determine how far a particular value is from the mean relative to the data set’s standard deviation.
• Standardized value
• If $x_1, x_2, \ldots, x_n$ is a sample of n observations:
• $z_i = \dfrac{x_i - \bar{x}}{s}$
• $z_i$ = z-score for $x_i$
• $\bar{x}$ = sample mean
• $s$ = sample standard deviation
z-Scores for the Class Size Data
• For class size data, $\bar{x}$ = 44 and s = 8.
• For observations with a value > mean, z-score > 0.
• For observations with a value < mean, z-score < 0.
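As the z-score table for the class size data is not reproduced here, the following Python sketch shows how each z-score would be obtained from the sample mean and sample standard deviation. The variable names are assumptions.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]

mean = statistics.mean(class_sizes)   # sample mean (44)
s = statistics.stdev(class_sizes)     # sample standard deviation (8)

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    print(f"{x}: z = {z:+.2f}")
```

Values above the mean get positive z-scores and values below it get negative ones, matching the two bullets above.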
Calculating z-Scores for the Home Sales Data in Excel
HomeSales.xlsx
Analyzing Distributions
• Identifying outliers:
• Outliers: Extreme values in a data set.
• They can be identified using standardized values (z-scores).
• Any data value with a z-score less than –3 or greater than +3 is an outlier.
Analyzing Distributions
• Box plot: Graphical summary of the distribution of data.
• Developed from the quartiles for a data set.
• By using the interquartile range, IQR = Q3 – Q1, limits are located for the whiskers
• The limits for the box plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
Box Plot for the Home Sales Data (figure labels: Q1, Q3, Outlier)
Quartile 1  $139,000.00
Quartile 3  $256,625.00
IQR         $117,625.00
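The quartiles, IQR, and whisker limits for the home sales data can be verified with Python's `statistics.quantiles`. This is an illustrative sketch: `method="exclusive"` is chosen on the assumption that it corresponds to the (n + 1) × p positions used for percentiles in this unit, and the variable names are hypothetical.

```python
import statistics

home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

# Three cut points dividing the data into four parts: Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(home_prices, n=4, method="exclusive")

iqr = q3 - q1

# Limits for the whiskers: 1.5 * IQR below Q1 and above Q3.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values beyond the limits are flagged as outliers on the box plot.
outliers = [x for x in home_prices if x < lower_limit or x > upper_limit]

print(q1, q3, iqr, lower_limit, upper_limit, outliers)
```

Only the highest selling price falls outside the limits, which is consistent with the single outlier marked on the box plot.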
Box Plots Comparing Home Sale Prices in Different Communities
Data: HomeComparison3.xlsx
Analyzing Distributions

• The Empirical Rule
• The central limit theorem is one reason the normal distribution is so common. If your data follow a normal distribution, you can use the empirical rule, which states:
• Approximately 68% of the data will fall within 1 standard deviation of the mean
• Approximately 95% of the data will fall within 2 standard deviations of the mean
• Approximately 99.7% of the data will fall within 3 standard deviations of the mean
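The percentages in the empirical rule come from areas under the normal curve, which can be checked with `statistics.NormalDist` from the Python standard library. This is only a numerical illustration; the dictionary name `within` is an assumption.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, probability in within.items():
    print(f"Within {k} standard deviation(s): {probability:.4f}")
```

The same proportions hold for any normal distribution, because measuring distance in standard deviations standardizes the scale.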
Analyzing Distributions: Normal Distribution
• The Empirical Rule

Measures of Association Between Two Variables
Measures of Association Between Two Variables
• Scatter Charts: Useful graph for analyzing the relationship between two variables.
• Covariance: Descriptive measure of the linear association between two variables.
• Sample covariance for a sample of size n with the observations $(x_1, y_1), (x_2, y_2)$, and so on:
$s_{xy} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• Population covariance, $\sigma_{xy} = \dfrac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
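Measures of Association Between Two Variables
• Correlation coefficient: Measures the linear relationship between two variables.
• Not affected by the units of measurement for x and y.
• Sample correlation coefficient denoted by $r_{xy}$.
• $r_{xy} = \dfrac{s_{xy}}{s_x s_y}$
• $s_{xy}$ = sample covariance = $\dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• $s_x$ = sample standard deviation of x = $\sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
• $s_y$ = sample standard deviation of y = $\sqrt{\dfrac{\sum (y_i - \bar{y})^2}{n - 1}}$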
Interpretation of Correlation Coefficient
• –1 ≤ r ≤ +1
r value  Relationship between the x and y variables
< 0      Negative linear
Near 0   No linear relationship
> 0      Positive linear
Data for Bottled Water Sales at Queensland Amusement Park for a Sample of 14 Summer Days
BottledWater.xlsx
Chart Showing the Positive Linear Relation Between Sales and High Temperatures
[Scatter chart: Bottled Water Sales (cases). Vertical axis: Sales (cases). Horizontal axis: High Temperature (F).]
Sample Covariance Calculations for Daily High Temperature and Bottled Water Sales at Queensland Amusement Park
Scatter Diagrams and Associated Covariance Values for Different Variable Relationships
(a) $s_{xy}$ Positive: x and y are positively linearly related
(b) $s_{xy}$ Approximately 0: x and y are not linearly related
(c) $s_{xy}$ Negative: x and y are negatively linearly related
Computation of Correlation Coefficient
Illustration - To determine the sample correlation coefficient for bottled water sales at Queensland Amusement Park:
$r_{xy} = \dfrac{s_{xy}}{s_x s_y} = \dfrac{12.8}{(4.36)(3.15)} = 0.93$
• There is a very strong positive linear relationship between high temperature and sales.
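The sample covariance and correlation formulas can be sketched as plain Python functions. The function names are assumptions, and the small `x`/`y` lists are hypothetical values chosen only to make the arithmetic easy to follow; they are NOT the BottledWater.xlsx data, which is not listed in this unit. The last lines simply redo the division from the illustration using its summary values.

```python
import math

def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def sample_correlation(x, y):
    """r_xy = s_xy / (s_x * s_y)."""
    return sample_covariance(x, y) / (sample_std_dev(x) * sample_std_dev(y))

# Hypothetical data for illustration only (a perfectly linear relationship).
x = [1, 2, 3]
y = [2, 4, 6]
print(sample_covariance(x, y), sample_correlation(x, y))

# Reproducing the illustration's arithmetic from its summary statistics.
r_bottled_water = 12.8 / (4.36 * 3.15)
print(round(r_bottled_water, 2))
```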
Example of Nonlinear Relationship Producing a Correlation Coefficient Near Zero
$r_{xy}$ = –0.007
Calculating Covariance and Correlation Coefficient for Bottled Water Sales Using Excel
All About The Data
• Ask Questions:
o What is this dataset about?
o What are the different rows and variables?
o What are we trying to determine?
o Is there one variable that’s the end result or being impacted by other variables?
• Understand the data
o Mean, Median, Mode, Min, Max, Variance, Standard Deviation
o Look at the relationship between variables: covariance, correlation
o Visualize the data and its relationships (histogram, skew, scatterplot)
• Draw your hypothesis
Source: Khan Academy
End of Part 1
