Module 3B - Visualizing Relationships Among Variables

Uploaded by

Kristian Uy

Visualization of Relationships between Variables
Business Analytics
Data Analysis and Decision Making (7e)

S. Christian Albright
Wayne L. Winston

Chapter 3 Finding Relationships among Variables


© 2020 Cengage. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved
learning management system for classroom use.
Introduction
• The primary interest in data analysis is usually in relationships
between variables.
• The most useful numerical summary measure is correlation.
• The most useful graph is a scatterplot.
• To break down a numerical variable by a categorical variable, it is useful to
create side-by-side box plots.
• In Excel®, a pivot table breaks down one variable by others so that all sorts of
relationships can be uncovered very quickly.
• The diagram in the file Data Analysis Taxonomy gives you the big
picture of which analyses are appropriate for which data types and
which tools are best for performing the various analyses.
Data Analysis Taxonomy
3B-1 Relationships among Categorical
Variables
• The most meaningful way to examine relationships between two
categorical variables is with counts and corresponding charts of the
counts.
• You can find counts of the categories of either variable separately, as well as
counts of the joint categories of the two variables.
• Corresponding percentages of totals and charts help tell the story.
• It is customary to display all such counts in a table called a crosstabs
(or cross-tabulations). This is also sometimes called a contingency
table.
Examples of Cross-tabulations or Contingency Tables

Location    No College   4-Year Degree   Advanced Degree   TOTAL
Urban           15            12                8            35
Suburban         8            15                9            32
Rural            6             8                7            21
TOTAL           29            35               24            88

Gender    Facebook   Instagram   Twitter   TOTAL
Male         112         16         12      140
Female        64         18         18      100
TOTAL        176         34         30      240
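As a rough illustration of how the totals and percentages in such a table arise, the following Python sketch rebuilds the Gender-by-platform table from its counts. The counts are copied from the table; the variable names are illustrative only.

```python
# Counts copied from the Gender-by-platform table above.
counts = {
    "Male": {"Facebook": 112, "Instagram": 16, "Twitter": 12},
    "Female": {"Facebook": 64, "Instagram": 18, "Twitter": 18},
}
platforms = ["Facebook", "Instagram", "Twitter"]

# Marginal totals: sum across each row and down each column.
row_totals = {gender: sum(row.values()) for gender, row in counts.items()}
col_totals = {p: sum(row[p] for row in counts.values()) for p in platforms}
grand_total = sum(row_totals.values())

# Each count expressed as a percentage of its row total.
row_pct = {
    gender: {p: round(100 * row[p] / row_totals[gender], 1) for p in platforms}
    for gender, row in counts.items()
}

for gender in counts:
    print(gender, row_totals[gender], row_pct[gender])
print("Column totals:", col_totals, "Grand total:", grand_total)
```

In percentage terms, 80% of the males and 64% of the females fall in the Facebook column, a difference that is harder to see from the raw counts because the two groups differ in size.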
Example 3.1: Relationship between Smoking and
Drinking
• Objective: To use crosstabs to explore the
relationship between smoking and drinking.
• Solution: Use the count function to
determine the total number of observations.
• Categories have been coded “N,” “O,” “H,”
“S,” and “D” for “Non,” “Occasional,”
“Heavy,” “Smoker,” and “Drinker.”

Example 3.1: Relationship between Smoking and
Drinking
• To create the crosstabs, enter the
category headings in Excel® and use the
COUNTIFS function to fill the table with
counts of joint categories.
• Next, sum across rows and down
columns to get totals.
• Then express the counts as percentages
of row totals and as percentages of column totals.
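The same steps can be sketched outside Excel. The Python below is a minimal sketch, not the textbook's solution: the records are made up, the two-letter codes (such as "NS" for nonsmoker and "HD" for heavy drinker) are an assumed combination of the letters listed earlier, and `countifs` is a hypothetical helper that only mimics what COUNTIFS does with two criteria.

```python
# Hypothetical coded records (smoking status, drinking status); not the textbook data.
records = [
    ("NS", "ND"), ("NS", "OD"), ("NS", "ND"), ("OS", "OD"), ("OS", "HD"),
    ("HS", "HD"), ("HS", "HD"), ("NS", "OD"), ("OS", "OD"), ("HS", "OD"),
]
smoking_levels = ["NS", "OS", "HS"]
drinking_levels = ["ND", "OD", "HD"]


def countifs(rows, smoking, drinking):
    """Count rows that match both criteria, in the spirit of Excel's COUNTIFS."""
    return sum(1 for s, d in rows if s == smoking and d == drinking)


def pct(count, total):
    """Percentage rounded to one decimal; 0.0 when the total is zero."""
    return round(100 * count / total, 1) if total else 0.0


# Step 1: joint counts for every pair of categories.
table = {s: {d: countifs(records, s, d) for d in drinking_levels} for s in smoking_levels}

# Step 2: totals across rows and down columns.
row_totals = {s: sum(table[s].values()) for s in smoking_levels}
col_totals = {d: sum(table[s][d] for s in smoking_levels) for d in drinking_levels}

# Step 3: counts as percentages of row totals and of column totals.
pct_of_row = {s: {d: pct(table[s][d], row_totals[s]) for d in drinking_levels} for s in smoking_levels}
pct_of_col = {s: {d: pct(table[s][d], col_totals[d]) for d in drinking_levels} for s in smoking_levels}

for s in smoking_levels:
    print(s, table[s], pct_of_row[s])
```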

Example 3.1: Relationship between Smoking and
Drinking
• Create a side-by-side bar graph for the percentages of the rows.
• Create a side-by-side bar graph for the percentages of the columns.

Example 3.1: Relationship between Smoking and
Drinking
• Counts versus percentages
❑There is no single correct way to display the data in a
crosstab.
❑Showing the counts as percentages of row totals or
percentages of column totals usually makes any
relationships stand out more clearly.

3B-2 Relationships among Categorical Variables
and a Numerical Variable
• The comparison problem is one of the most important problems in data analysis.
It occurs whenever you want to compare a numerical measure across two or more
subpopulations.
• Examples
• The subpopulations are males and females, and the numerical measure is
salary.
• The subpopulations are different regions of the country, and the numerical
measure is the cost of living.
• The subpopulations are different days of the week, and the numerical
measure is the number of customers going to a particular fast-food chain.
Stacked and Unstacked Formats
• There are two possible data formats, stacked and unstacked.
• The data are stacked if there are two “long” variables, such as Gender
and Salary. The idea is that the male salaries are stacked in with the
female salaries.
• This is the format you will see in the vast majority of situations.
• You will occasionally see data in unstacked format, when there are two
“short” variables, such as Male Salary and Female Salary.
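To make the two formats concrete, here is a small Python sketch that converts between them. The salary figures are invented for illustration, and `unstack` and `stack` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical stacked data: two "long" columns, Gender and Salary.
stacked = [
    ("Male", 52000), ("Female", 61000), ("Male", 47000),
    ("Female", 58000), ("Male", 66000),
]


def unstack(rows):
    """Turn (category, value) pairs into one "short" list of values per category."""
    columns = defaultdict(list)
    for category, value in rows:
        columns[category].append(value)
    return dict(columns)


def stack(columns):
    """Turn one list per category back into (category, value) pairs."""
    return [(category, value) for category, values in columns.items() for value in values]


unstacked = unstack(stacked)
print(unstacked)         # one column per gender, possibly of different lengths
print(stack(unstacked))  # back to two long columns (row order may differ)
```

Note that the unstacked columns need not have the same length, which is one reason the stacked format is usually easier to work with.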
Stacked and Unstacked Formats
Example 3.2: Baseball Salaries
• Objective: To learn methods in Excel for breaking down baseball
salaries by various categorical variables.
• Solution: Data set contains the same 2018 baseball data examined
previously, as well as several extra categorical variables.
• Create summary measures by selecting One-Variable Summary from
the Summary Statistics dropdown list.
• Next, click the Format button and choose Stacked. Then choose the Cat
variable you want to categorize by and the Val variable you want to
summarize.
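The idea behind this step, summarizing a value variable separately for each category of a stacked data set, can be sketched in Python as follows. The salaries are hypothetical (in $ millions) and are not the 2018 data; `summarize_by_category` is an illustrative name, not a StatTools function.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical stacked data: (Position, Salary in $ millions); not the 2018 data set.
rows = [
    ("Pitcher", 0.6), ("Pitcher", 1.2), ("Pitcher", 8.0),
    ("Catcher", 0.9), ("Catcher", 2.1),
    ("First Baseman", 0.7), ("First Baseman", 5.5), ("First Baseman", 12.0),
]


def summarize_by_category(rows):
    """Group the value variable by the category variable and summarize each group."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {
        category: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "stdev": stdev(values),  # sample standard deviation; needs >= 2 values
        }
        for category, values in groups.items()
    }


summary = summarize_by_category(rows)
for position, stats in summary.items():
    print(position, {name: round(value, 2) for name, value in stats.items()})
```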
Example 3.2: Baseball Salaries
• There are a lot of numbers to digest; it is difficult to get a clear picture of
differences across positions.
• It is more enlightening to see a graphical summary of this information.
• Side-by-side box plots are our favorite graphical way of comparing the
distribution of a numerical variable across categories of some categorical
variable.
Example 3.2: Baseball Salaries
• If you have Excel® 2016 or later, you can create box plots of Salary by Position
very easily.
• First, select the two columns of data, Position and Salary.
• Then select a box plot from the Statistical charts list on the Insert ribbon.
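For readers who want to see what a box plot summarizes, the Python sketch below computes the quartiles, whisker ends, and outliers for one group. It rests on assumptions: the salaries are made up, the quartiles use the "inclusive" method of `statistics.quantiles` (Excel offers both inclusive and exclusive calculations, so its numbers can differ slightly), and outliers are flagged with the common 1.5 × IQR rule.

```python
from statistics import mean, quantiles

# Hypothetical salaries in $ millions for one position; not the textbook data.
salaries = [0.5, 0.6, 0.7, 0.9, 1.2, 2.0, 3.5, 5.0, 20.0]


def box_plot_summary(values):
    """Return the quantities a box plot displays, using the 1.5 * IQR outlier rule."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
        "mean": mean(data),
    }


summary = box_plot_summary(salaries)
print(summary)
```

In this made-up group the mean exceeds the median, the upper whisker is longer than the lower one, and the only outlier is on the high side, which is the pattern of a distribution skewed to the right.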

Some Conclusions:
- The salaries for all positions are
skewed to the right (mean greater
than the median, longer whisker,
and outliers to the right).
- As a whole, first basemen tend to
be the highest-paid players,
followed by outfielders and third
basemen. The designated hitters
also make a lot, but there are only
eight of them in the data set.
Some Conclusions:
- As a whole, pitchers don’t make as
much as first basemen and third
basemen, but there are a lot of
pitchers who are high-earning
outliers.
- Except for a few notable
exceptions, catchers receive the
lowest salaries.
Example 3.2: Baseball Salaries
• Categorize the positions as
Pitcher (Yes) or Non-pitcher (No).
• Create a box plot for the Salary of
Pitchers and Non-pitchers.
Pitchers make somewhat less
than other players, although there
are many outliers in each group.

Example 3.2: Baseball Salaries
• Categorize the teams as
Yankees and Non-Yankees.
• Create a box plot for the Salary
of Yankees and Non-Yankees.
The Yankees’ payroll is indeed much
larger than the payrolls for the rest of
the teams. In fact, it is so large that
Alex Rodriguez’s $32 million is
considered only a mild outlier relative
to the rest of the team.

3B-3 Relationships Among Numeric Variables
• To study relationships among numeric variables, use the scatterplot and two
summary measures for numerical variables: correlation and covariance.
• These measures can be applied to any variables that are displayed
numerically.
• However, they are appropriate only for truly numerical variables, not for
categorical variables that have been coded numerically.
• In general, don’t use correlations that involve coded categorical variables
such as 0–1 dummies. The methods from the previous section are more
appropriate.
Scatterplots
• A scatterplot is a scatter of points, where each point denotes the
values of an observation for two selected variables.
• It is a graphical method for detecting relationships between two numerical
variables.
• The two variables are often labeled generically as X and Y, so a scatterplot is
sometimes called an X-Y chart.
• The purpose of a scatterplot is to make a relationship (or the lack of it)
apparent.
Example 3.3: Golf Stats on the PGA Tour
• Objective: To use scatterplots to search for relationships in the golf data.
• Solution: Data set includes an observation (stats) for each of the top 200
earners on the PGA Tour.

Example 3.3: Golf Stats on the PGA Tour
• In StatTools, designate a StatTools data set for a particular year.
• Next, select Scatterplot from the Summary Graphs dropdown list
and then select at least one X variable and at least one Y variable.

Example 3.3: Golf Stats on the PGA Tour

The scatterplots indicate the possibly surprising result that age
is practically unrelated to the number of events played and earnings.
Each scatterplot is basically a shapeless swarm of points, and a
shapeless swarm always indicates “no relationship.”
Example 3.3: Golf Stats on the PGA Tour

The scatterplots confirm what you would expect. Specifically,
players who play in more events tend to earn more, although there are
a number of exceptions to this pattern. Also, players who make more
36-hole cuts tend to earn more.
Example 3.3: Golf Stats on the PGA Tour

The scatterplots indicate almost no relationship between earnings and
the two components of driving, length (yards per drive) and accuracy
(percentage of fairways hit). At least in 2011, neither driving length nor
driving accuracy seems to have much effect on earnings.
Example 3.3: Golf Stats on the PGA Tour

In contrast, there is a reasonably strong upward relationship
between greens hit in regulation and earnings. You would probably
expect players who hit a lot of greens in regulation to earn more, and this
appears to be the case. Finally, there is a downward relationship between
putting average and earnings.
Trend Lines in Scatterplots

• Once you have a scatterplot, Excel® enables you to superimpose one
of several trend lines on the scatterplot.
• A trend line is a line or curve that “fits” the scatter as well as
possible.
• This could be a straight line, or it could be one of several types of
curves.
Trend Lines in Scatterplots
• To add trend lines, right-click on any
point in the chart, select Add Trendline,
and fill out the resulting dialog box.
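For the straight-line case, the trend line is the least-squares line through the points. The sketch below shows that calculation in Python under stated assumptions: the (x, y) pairs are invented, and `fit_trend_line` is a hypothetical helper, not an Excel or StatTools function.

```python
def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("x values must not all be equal")
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_trend_line(x, y)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```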

Correlation and Covariance
• Correlation and covariance measure the strength and direction of a
linear relationship between two numerical variables.
• The relationship is “strong” if the points in a scatterplot cluster
tightly around some straight line.
• If this straight line rises from left to right, the relationship is
positive and the measures will be positive numbers.
• If it falls from left to right, the relationship is negative and the
measures will be negative numbers.
Correlation and Covariance
• The two numerical variables must be “paired” variables.
• They must have the same number of observations, and
the values for any observation should be naturally
paired.
Covariance
• Covariance is essentially an average of products of deviations from
means.

• Excel® has a built-in COVAR function.
• Covariance has a serious limitation as a descriptive measure because
it is very sensitive to the units in which X and Y are measured.
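The unit sensitivity is easy to demonstrate. The Python sketch below computes a covariance as an average of products of deviations from means, then repeats the calculation after converting one variable from meters to centimeters. The data are made up, and the divisor n − 1 (the sample version) is an assumption; Excel's COVAR divides by n, which changes the value slightly but not the point being made.

```python
def covariance(xs, ys):
    """Sample covariance: sum of products of deviations from means, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical paired data: height and weight for four people.
heights_m = [1.60, 1.70, 1.80, 1.90]
weights_kg = [55.0, 68.0, 74.0, 90.0]
heights_cm = [100 * h for h in heights_m]  # same heights, different units

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
print(round(cov_m, 2), round(cov_cm, 2))
```

The relationship between the two variables is identical in both calculations, yet the second covariance is 100 times the first simply because the heights are expressed in centimeters.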

Correlation
• Correlation is a unitless quantity that is unaffected by the measurement
scale.

• The correlation is always between -1 and +1.
• The closer it is to either of these two extremes, the closer the points in a
scatterplot are to a straight line.
• Excel® has a built-in CORREL function.
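A short Python sketch can illustrate both properties: the value stays between -1 and +1, and it does not change when a variable is rescaled. The data are hypothetical and `correlation` is an illustrative helper; it divides the sum of products of deviations by the square root of the product of the two sums of squared deviations, which is the usual definition of the correlation coefficient.

```python
import math


def correlation(xs, ys):
    """Correlation of paired data: a unitless value between -1 and +1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("each variable must vary")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical paired data: height and weight for four people.
heights_m = [1.60, 1.70, 1.80, 1.90]
weights_kg = [55.0, 68.0, 74.0, 90.0]
heights_cm = [100 * h for h in heights_m]  # same heights, different units

r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)
r_line = correlation([1, 2, 3], [6, 4, 2])  # points exactly on a falling line

print(round(r_m, 3), round(r_cm, 3), round(r_line, 3))
```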
Correlation and Covariance

• Three important points about scatterplots, correlations, and covariances:
• A correlation is a single-number summary of a scatterplot. It never
conveys as much information as the full scatterplot.
• You are usually on the lookout for large correlations, those near −1 or
+1.
• Do not even try to interpret covariances numerically except possibly to
check whether they are positive or negative. For interpretive purposes,
concentrate on correlations.
Example 3.3: Golf Stats on the PGA Tour
• Objective: To use correlations to understand relationships in the
golf data.
• Solution: Create a table of correlations by selecting Correlation and
Covariance from the Summary Statistics dropdown list.
• Fill in the resulting dialog box and check Correlations.

Example 3.3: Golf Stats on the PGA Tour
• You can learn more about a correlation by creating the corresponding
scatterplot.

3B-4 Pivot Tables
• The pivot table is an Excel® tool that allows you to break data down
by categories.
• Sometimes pivot tables are used to display tables of counts, often
called crosstabs or contingency tables.
• However, crosstabs typically list only counts, whereas pivot tables can
list counts, sums, averages, and other summary measures.
Example 3.4: Customer Orders at Elecmart
• Objective: To use pivot tables
to break down the customer
order data by a number of
categorical variables.
• Solution: Data set contains
data on 400 customer orders
during several months for
Elecmart company.

Example 3.4: Customer Orders at Elecmart
• Create a pivot table by
clicking the PivotTable
button on the Insert ribbon.
• The top section of the
dialog box allows you to
specify the table or range
that contains the data. The
bottom section allows you
to select the location where
you want the results to be
placed.

Example 3.4: Customer
Orders at Elecmart

• This produces a blank pivot table.

Example 3.4: Customer Orders at Elecmart
• Assuming any cell inside this blank pivot table is selected, the PivotTable Tools
“super tab” is visible.
• This super tab has two ribbons, Analyze and Design.

Example 3.4: Customer Orders at
Elecmart
• Finally, the PivotTable Fields pane is visible.
• The pane indicates that a pivot table has four areas.
• These are for Filters, Rows, Columns, and Values. They
correspond to the four areas in a blank pivot table
where you can put fields.

Example 3.4: Customer Orders at Elecmart

• A Rows field has categories that go down the left side of a pivot table.
• A Columns field has categories that go across the top of a pivot table.
• A Filters field lets you filter the whole pivot table by its categories.
• A Values field contains the data you want to summarize.
• Typically (but not always), you will place categorical variables in the
Filters, Rows, and/or Columns areas, and you will place numeric
variables in the Values area.
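The roles of the four areas can be mimicked in a few lines of Python. This is only a sketch of the idea, not how Excel implements pivot tables: the orders are invented, the Gender field name is an assumption, and `pivot` is a hypothetical helper whose parameters are named after the four areas.

```python
from collections import defaultdict

# Hypothetical orders; field names follow the example, values are made up.
orders = [
    {"Time": "Morning", "Region": "South", "Gender": "Female", "Total Cost": 120.0},
    {"Time": "Morning", "Region": "West", "Gender": "Male", "Total Cost": 80.0},
    {"Time": "Afternoon", "Region": "South", "Gender": "Male", "Total Cost": 200.0},
    {"Time": "Afternoon", "Region": "South", "Gender": "Female", "Total Cost": 50.0},
    {"Time": "Evening", "Region": "West", "Gender": "Female", "Total Cost": 310.0},
    {"Time": "Morning", "Region": "South", "Gender": "Male", "Total Cost": 40.0},
]


def pivot(records, rows, columns, values, aggregate=sum, filters=None):
    """Summarize `values` for each (rows, columns) pair, after applying `filters`."""
    filters = filters or {}
    cells = defaultdict(list)
    for record in records:
        # Filters area: keep only records whose fields match the requested categories.
        if all(record[field] == wanted for field, wanted in filters.items()):
            cells[(record[rows], record[columns])].append(record[values])
    table = defaultdict(dict)
    for (row_key, col_key), cell_values in cells.items():
        # Values area: one summary number per row/column combination.
        table[row_key][col_key] = aggregate(cell_values)
    return dict(table)


sum_by_time_region = pivot(orders, rows="Time", columns="Region", values="Total Cost")
female_only = pivot(orders, "Time", "Region", "Total Cost", filters={"Gender": "Female"})

print(sum_by_time_region)
print(female_only)
```

Passing a different `aggregate`, such as `len` or `max`, changes the summary function without changing the layout, which is analogous to changing the summarizing function of a Values field.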

Example 3.4: Customer Orders at Elecmart
• Check the Time, Region, and
Total Cost boxes in the upper
half of the PivotTable Fields
pane.
• Choose from three layouts:
Compact, Outline, or Tabular,
available from the Report
Layout list on the Design
ribbon.

Hiding Categories (Filtering)
• When you place a categorical field in the Rows, Columns, or Filters
area, all its categories show by default.
• It is often useful to filter out, or hide, some of these categories. This
lets you focus on the categories of most interest.

Sorting
• It is easy to sort in a pivot table, either by the numbers in the Values
area or by the labels in a Rows or Columns field.
• To sort by the numbers in the Values area, right-click any number and select
Sort.
• To sort on the labels of a Rows or Columns field, right-click any of the
categories and select Sort.
• You can also click the dropdown arrow for the field and get the dialog box that allows
both sorting and filtering.
Pivot Charts
• It is easy to accompany pivot tables with pivot charts.
• These charts adapt automatically to the underlying pivot table.
• To create a pivot chart, click anywhere inside the pivot table, select the
PivotChart button on the Analyze/Options ribbon, and select a chart type.

Multiple Variables in the Values Area
• More than a single variable can be placed in the Values area.
• Also, a given variable in the Values area can be summarized by more
than one summarizing function.

Summarizing by Count
• The variable in the Values area can be summarized by the Count
function.
• This is useful when you want to know, for example, how many of the orders
were placed by females in the South.
• Right-click any number in the pivot table, select Value Field Settings, and select
the Count function.
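Summarizing by count amounts to tallying how many records fall in each combination of categories. A minimal Python sketch, using made-up (Gender, Region) pairs rather than the Elecmart data:

```python
from collections import Counter

# Hypothetical (Gender, Region) pairs, one per order; not the Elecmart data.
orders = [
    ("Female", "South"), ("Male", "South"), ("Female", "West"),
    ("Female", "South"), ("Male", "West"), ("Female", "South"),
]

# Tally the number of orders in each Gender/Region combination.
counts = Counter(orders)

print(counts[("Female", "South")])  # orders placed by females in the South
print(counts[("Male", "NorthEast")])  # a combination with no orders counts as 0
```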
Grouping
• Categories in a Rows or Columns
variable can be grouped.
• Suppose you want to summarize
Sum of Total Cost by Date.
• Starting with a blank pivot table,
check both Date and Total Cost in
the PivotTable Fields pane.
• Then right-click any date and
select Group.
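Grouping dates replaces many individual date categories with a few coarser ones. The Python sketch below sums a cost field by month; the dates and costs are invented, grouping by month is just one choice among those a Group dialog typically offers, and `sum_by_month` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (Date, Total Cost) pairs; not the Elecmart data.
orders = [
    (date(2024, 3, 4), 120.0),
    (date(2024, 3, 18), 80.0),
    (date(2024, 4, 2), 200.0),
    (date(2024, 4, 25), 50.0),
    (date(2024, 5, 9), 310.0),
]


def sum_by_month(rows):
    """Group individual dates into (year, month) buckets and sum the costs in each."""
    totals = defaultdict(float)
    for order_date, total_cost in rows:
        totals[(order_date.year, order_date.month)] += total_cost
    return dict(sorted(totals.items()))


monthly_totals = sum_by_month(orders)
print(monthly_totals)
```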
Example 3.5: Frozen Lasagna Dinners
• Objective: To use pivot tables to explore which demographic variables help to
distinguish lasagna triers from nontriers.
• Solution: Data set contains data on over 800 potential customers being
tracked by a frozen lasagna company.
• Set up a pivot table that shows counts of triers and nontriers for different
categories of the variables.
