Lecture 5. Visualization 2

Uploaded by

Berke Al
Exploratory Data Analysis with Tableau

• Cluster analysis in Tableau


• Revisit the concept of linear regression and hypothesis testing
• Linear regression in Tableau
• Set Actions, Groups, and Calculated Fields
• Parameters
• Time-series and predictive analysis
• Use data visualization in auditing: a case study
• Tableau is a data visualization tool that is very powerful at
simplifying data and identifying patterns. It can perform
some statistical analyses, especially those related to
visualization.
• Tableau is not a statistical modeling tool. To conduct
complex statistical analyses, you need to connect Tableau
to other software or programming languages
(R, Python, MATLAB, etc.).
Cluster Analysis
• Clustering is an unsupervised approach to finding previously undiscovered
natural groupings in the data
• Tableau uses the K-means algorithm for clustering, one of the simplest and
most popular machine learning algorithms
• The K-means algorithm looks for a fixed number (k) of clusters in a
dataset. The basic steps are:
• First, we initialize k points, called means, randomly.
• We assign each item to its closest mean and update that mean’s
coordinates to the average of the items assigned to that mean's
cluster so far.
• We repeat the process for a given number of iterations to optimize the
clusters (minimize within-cluster distance and maximize between-cluster
distance), and at the end, we have our clusters
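The steps above can be sketched in a few lines of Python. This is only an illustration of the basic K-means loop, not Tableau's implementation: the function name, the sample data, the fixed iteration count, and the choice to draw the initial means from the data points themselves are all assumptions made for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: initialize k means by picking k distinct data points at random
    # (an assumption for this sketch; other initializations are possible).
    means = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 2a: assign each point to its closest mean
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
            for p in points
        ]
        # Step 2b: move each mean to the average of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return means, labels

# Hypothetical two-variable data with two visibly separate groups.
data = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.2), (8.3, 7.9), (7.9, 8.1)]
means, labels = kmeans(data, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning; only which points share a label matters.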
Example Using Tableau for Cluster Analysis
Create clusters using World Economic Indicators data, available from
Blackboard

The objective
• As life expectancy increases around the world, and as older people
remain more active, senior tourism can be a lucrative market for
companies that know how to find and appeal to potential customers.
The World Indicators sample data set contains the kind of data that
might help companies identify the countries or regions where there
are enough of the right kind of customers.
We can show clustering in any graph we choose; let’s show it on a map.

1. Open the World Indicators sample data source in Tableau Desktop.


2. Double-click Country/Region in the Data pane.
3. Tableau automatically creates a map view, with a mark in each country/region.
4. On the Marks card, change the mark type to Map. You should then see a map projection
where all countries/regions are filled with a solid color:
The next step is to identify the fields that you will use as variables for clustering.
Here are the fields you choose:
Field: Reason for inclusion

Life Expectancy Female and Life Expectancy Male: Where people are living longer, there are more
likely to be people who are interested in traveling later in life.

Population Urban: It is easier to market services in areas with greater population density.

Population 65+: The target population is older residents with the time and funds to travel.

TourismPerCapita: This is a measure that you must create as a named calculated field. The
formula is:
SUM([Tourism Outbound])/SUM([Population Total])
Tourism Outbound aggregates the money (in US dollars) that residents of
a country/region spend annually on international travel. But this total
must be divided by the population of each country/region to determine
the average amount each resident spends on international travel.
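To make the aggregation in the TourismPerCapita formula concrete, here is a small Python sketch that sums both fields per country/region first and only then divides. The rows and the function name are hypothetical and are not taken from the World Indicators data.

```python
# Hypothetical rows: (country, year, tourism_outbound_usd, population_total).
rows = [
    ("A", 2011, 2_000_000.0, 1_000_000),
    ("A", 2012, 3_000_000.0, 1_500_000),
    ("B", 2011, 500_000.0, 2_000_000),
]

def tourism_per_capita(rows):
    """SUM(outbound) / SUM(population), computed once per country."""
    totals = {}
    for country, _year, outbound, population in rows:
        spent, people = totals.get(country, (0.0, 0))
        totals[country] = (spent + outbound, people + population)
    return {c: spent / people for c, (spent, people) in totals.items()}

per_capita = tourism_per_capita(rows)
print(per_capita)
```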
1. Drag these five fields from the Data pane to Detail on the Marks card.
2. Click to open the Analytics pane.
3. Drag Cluster from the Analytics pane and drop it in the view:
4. Tableau then displays the Clusters dialog box and adds the measures in
the view to the list of variables.
It also updates the view by adding clusters to Color. In this case, Tableau finds two
distinct clusters, and is unable to assign certain countries/regions (colored
reddish-pink) to either cluster.
You may then decide that two clusters isn't enough: you don't have the resources to
set up shop in half the countries/regions in the world. So you type 4 in
the Number of Clusters field in the Clusters dialog box.
Look at the statistics behind the clusters.

• Close the Clusters dialog box by clicking the X in its upper-right corner
• Click the Clusters field on the Marks card and choose Describe Clusters.
The table at the bottom of the Models tab in the Describe Clusters dialog box shows the average value
for each variable in each cluster.
One cluster has the highest life expectancy (both male and female), the highest concentration of urban
population, and the highest expenditure for international tourism. The only variable for which this
cluster does not have the highest value is Population 65+. The cluster's label may differ depending on
how you place the variables: Tableau does not know your criteria and simply groups observations
based on their distance from each other. Let’s say it’s Cluster 4.
• You could attempt to pick out the Cluster 4 countries/regions from the map, but there is an easier way.
Close the Describe Clusters dialog box and then click Cluster 4 on the Color legend and choose Keep
Only.
• Choose Text Table from Show Me. You now see a list of the countries/regions in Cluster 4.

This list is not the end of the process. You might try clustering again with a somewhat different set of
variables and maybe a different number of clusters, or you might add some countries/regions to the list
and remove others, based on other factors. Clustering is a trial-and-error process of discovery.
Create a group from cluster results
1. If you drag a cluster to the Data pane, it becomes a group dimension
in which the individual members (Cluster 1, Cluster 2, etc.) contain
the marks that the cluster algorithm has determined are more similar
to each other than they are to other marks.
2. After you drag a cluster group to the Data pane, you can use it in
other worksheets.
3. Drag Clusters from the Marks card to the Data pane to create a
Tableau group.
4. After you create a group from clusters, the group and the original
clusters are separate and distinct. Editing the clusters does not affect
the group, and editing the group does not affect the cluster results.

Refit saved clusters


1. When you save a Clusters field as a group, it is saved with its analytic model. You can use your cluster groups in
other worksheets and workbooks; however, they don't automatically refresh.
2. If the underlying data changes, you can use the Refit option to refresh and recompute the data for a saved
clusters group.
The ordinary least squares regression (OLS)
The least-squares regression line is the unique line such that the sum of the
squared vertical distances from the data points to the line is the smallest
possible.
The vertical distances between the data points and the line are the residuals,
the sample estimates of the error terms.
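As a minimal sketch of this definition, the Python below computes the least-squares line for one independent variable from the standard closed-form formulas and then the vertical distances (residuals). The data values and function name are made up for illustration.

```python
def ols_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(r ** 2 for r in residuals)  # the quantity the line minimizes
print(b0, b1, sse)
```

Shifting or tilting the line in any direction increases `sse`, which is what "smallest possible" means here.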
The Assumptions of OLS
1. The error terms are normally distributed
2. The population mean of the error terms is zero
3. The variances of the error terms are equal
• A violation is called heteroscedasticity
4. The error terms are independent of each other
5. No independent variable is correlated with the error term
• No endogeneity
6. No independent variable is a perfect linear function of other
independent variables
• No perfect multicollinearity
Reason for Rejecting H0
[Figure: sampling distribution of the sample mean, centered at the H0 value µ = 0, with the value 2
marked on the Sample Mean axis.]
It is unlikely that we would get a sample mean of this value if in fact this were the population
mean. Therefore, we reject the null hypothesis that µ = 0.
Standard Normal (Z) Distribution
• Problem: Unlimited number of possible normal distributions
($-\infty < \mu < \infty$, $\sigma > 0$)
• Solution: Standardize the random variable to have mean 0
and standard deviation 1

$$Y \sim N(\mu, \sigma) \;\Rightarrow\; Z = \frac{Y - \mu}{\sigma} \sim N(0, 1)$$

• Probabilities of certain ranges of values and specific percentiles of interest can be obtained
through the standard normal (Z) distribution
P-value (aka Observed Significance Level)
• P-value: a measure of the strength of evidence the sample
data provide against the null hypothesis:
$$p = P(Z \ge z_{\text{obs}})$$

The smaller the P-value, the stronger the evidence.
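The standardization and the upper-tail P-value can be checked numerically with Python's standard library. The observed value, mean, and standard deviation below are hypothetical numbers chosen for the example; the helper names are not from the lecture.

```python
from statistics import NormalDist

def z_score(y, mu, sigma):
    """Standardize y: Z = (Y - mu) / sigma."""
    return (y - mu) / sigma

def upper_tail_p_value(z_obs):
    """p = P(Z >= z_obs) under the standard normal distribution."""
    return 1.0 - NormalDist(mu=0.0, sigma=1.0).cdf(z_obs)

# Hypothetical numbers: observed value 2, H0 mean 0, standard deviation 1.
z_obs = z_score(y=2.0, mu=0.0, sigma=1.0)
p = upper_tail_p_value(z_obs)
print(z_obs, round(p, 4))
```

A value two standard deviations above the H0 mean leaves only about 2% of the distribution in the upper tail, which is why such a sample would count as fairly strong evidence against H0.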


Regression Using Tableau
• Let’s use Colleagescoreboard_Cleandata.xls again
• Drag “Average SAT” to Columns and “Completion Rate” to Rows
• You will see a single dot on the graph because both measures are
aggregated (Sum in this case)
• Disaggregate the data to look at it at the individual
institution level
• You can focus on the data by removing the requirement to show 0 on
the axes
• We can then fit a trend line through the data
In addition to a simple linear model, Tableau can also fit a trend line based on various other
models (still linear regressions in the coefficients, but the relation between X and Y is no longer linear):
Logarithmic
Y = b0 + b1 * ln(X)
Note that ln(X) is defined only for positive X, so negative values are
filtered before model estimation.
Exponential
Y = exp(b0) * exp(b1 * X)
In practice, before estimating the model, transform it to linear
form by taking the natural log of both sides:
ln(Y) = b0 + b1 * X
Because a logarithm is not defined for negative numbers,
negative values of Y are filtered before model estimation.
Power
With the power model type, the formula is:
Y = b0 * X^b1
Again, first transform both sides by natural log and then
estimate the model, resulting in this formula:
ln(Y) = ln(b0) + b1 * ln(X)
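The three transformations can be sketched in Python by reusing an ordinary least-squares fit on the transformed variables. This is an illustration under stated assumptions, not Tableau's code: the function names are hypothetical, and the sketch drops every value whose logarithm is undefined (zero as well as negative).

```python
import math

def ols_line(xs, ys):
    """Least-squares intercept and slope for one independent variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def fit_logarithmic(xs, ys):
    # Y = b0 + b1 * ln(X): regress Y on ln(X), keeping only X > 0.
    pairs = [(math.log(x), y) for x, y in zip(xs, ys) if x > 0]
    return ols_line([u for u, _ in pairs], [v for _, v in pairs])

def fit_exponential(xs, ys):
    # ln(Y) = b0 + b1 * X: regress ln(Y) on X, keeping only Y > 0.
    pairs = [(x, math.log(y)) for x, y in zip(xs, ys) if y > 0]
    return ols_line([u for u, _ in pairs], [v for _, v in pairs])

def fit_power(xs, ys):
    # ln(Y) = ln(b0) + b1 * ln(X): regress ln(Y) on ln(X), then undo the log on b0.
    pairs = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    ln_b0, b1 = ols_line([u for u, _ in pairs], [v for _, v in pairs])
    return math.exp(ln_b0), b1

# Hypothetical data generated exactly from each model, so the fit recovers it.
xs = [1.0, 2.0, 3.0, 4.0]
log_fit = fit_logarithmic(xs, [1 + 2 * math.log(x) for x in xs])
exp_fit = fit_exponential(xs, [math.exp(0.5 + 0.3 * x) for x in xs])
pow_fit = fit_power(xs, [3 * x ** 2 for x in xs])
print(log_fit, exp_fit, pow_fit)
```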
Linear Regression in Tableau
• Regression with a single independent variable is straightforward
in Tableau (Trend Line)
• It’s not easy to run regressions with multiple independent variables in Tableau
• If you have already estimated the regression model (for example,
in Excel), you can create a calculated field to show the predicted
value of Y
(Y = b0 + b1*X1 + b2*X2)
• To estimate a regression model with multiple independent variables, you have
to use an external tool that Tableau can connect to, such as the
R programming language
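The predicted-value calculation is just the fitted equation applied row by row. A minimal Python sketch follows; the coefficient values are hypothetical stand-ins for numbers you would have estimated elsewhere (for example, in Excel).

```python
# Hypothetical coefficients from a model estimated outside Tableau.
B0, B1, B2 = 1.0, 2.0, 3.0

def predicted_y(x1, x2, b0=B0, b1=B1, b2=B2):
    """Y = b0 + b1*X1 + b2*X2 for one row of data."""
    return b0 + b1 * x1 + b2 * x2

# Hypothetical rows of (X1, X2) values.
rows = [(4.0, 5.0), (0.0, 0.0)]
predictions = [predicted_y(x1, x2) for x1, x2 in rows]
print(predictions)
```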
Some Advanced Techniques in Tableau
Set actions: update the values in the graph in response to the user’s actions,
which allows the audience to interact directly with the graph. We will
work on an example together.
Groups: combines dimension members. For example, “CA” and “California”
are redefined as one group.
Calculated fields: allow you to create new data from data that already
exists in your data source. You can create a simple calculation or use one
of the many functions available in Tableau.
Parameters: A parameter is a workbook variable such as a number, date, or
string that can replace a constant value in a calculation, filter, or
reference line.
General steps for set actions
1. Create one or more sets. The sets you create will be associated with the
data source that is currently selected.
2. Create a set action that uses one of the sets you created. You can create
multiple set actions for different purposes.
3. Depending on the behavior you want to make available to users for their
analysis, you might want to create a calculated field that uses the set.
4. Build a visualization that uses a set referenced by a set action. For
example, if you create a calculated field that uses the set, build the view
using that calculated field. Or, drag the set to Color in the Marks card.
5. Test the set action and adjust its settings as needed to get the behavior
you want your audience to experience.
Set Action Example: Proportional Brushing
This example uses the Sample - Superstore data source that comes with Tableau.
1. Connect to Sample - Superstore data in Tableau Desktop.
2. In a new sheet, drag Sales measure to Columns, drag Segment dimension to Rows.
3. In another blank sheet, drag Sales measure to Columns, drag Sub-Category dimension to
Rows.
4. Create a set for the Segment dimension named Segment Set.
5. In the sheet that shows Sales by Sub-Category, drag Segment Set onto Color in the Marks
card.
6. Create a new dashboard. Drag both sheets into the dashboard.
7. On the Dashboard menu, select Actions. Click Add Action, and then select Change Set
Values.
8. Configure the action.
9. Click OK to save your changes and return to the view.
10. Test the set action by clicking the marks for each segment.
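What the Sub-Category sheet shows after a click can be sketched in Python: each bar is split into the part that belongs to the segments currently in the set and the part that does not. The order rows and the function name are hypothetical; only the field names follow the example above.

```python
from collections import defaultdict

# Hypothetical order rows: (segment, sub_category, sales).
orders = [
    ("Consumer", "Chairs", 300.0),
    ("Corporate", "Chairs", 100.0),
    ("Consumer", "Phones", 50.0),
    ("Home Office", "Phones", 150.0),
]

def brushed_sales(orders, segment_set):
    """Per sub-category sales split into IN / OUT of the selected segments."""
    result = defaultdict(lambda: {"in": 0.0, "out": 0.0})
    for segment, sub_category, sales in orders:
        key = "in" if segment in segment_set else "out"
        result[sub_category][key] += sales
    return dict(result)

# Clicking the Consumer bar would put only "Consumer" in the set.
split = brushed_sales(orders, segment_set={"Consumer"})
print(split)
```

Changing `segment_set` plays the role of the set action: the totals per sub-category stay the same, while the colored proportion changes.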
Create A Group
In the Data pane, right-click a field and select Create > Group.
In the Create Group dialog box, select several members that you want to group, and then
click Group.
Create a calculated field
1. In Tableau, select Analysis > Create Calculated Field.
2. In the Calculation Editor that opens:
• Enter a name for the calculated field. In this example, the field is called Discount Ratio.
• Enter a formula. This example uses the following formula:
IIF([Sales] != 0, [Discount]/[Sales], 0)
This formula checks whether sales is not equal to zero. If true, it returns the discount ratio
(Discount/Sales); if false, it returns zero.
To see a list of available functions, click the triangle icon on the right side of the Calculation Editor.

When finished, click OK. The new calculated field is added to Measures in the Data pane because it returns a
number. An equal sign (=) appears next to the data type icon. All calculated fields have equal signs (=) next to
them in the Data pane.
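For readers more used to a programming language, the logic of the Discount Ratio formula corresponds roughly to the Python sketch below. The function name and sample values are hypothetical, and the sketch ignores how Tableau treats missing (NULL) values.

```python
def discount_ratio(sales, discount):
    """Return discount / sales, or 0 when sales is zero (avoids dividing by zero)."""
    return discount / sales if sales != 0 else 0

# Hypothetical rows of (sales, discount).
rows = [(200.0, 50.0), (0.0, 10.0)]
ratios = [discount_ratio(sales, discount) for sales, discount in rows]
print(ratios)
```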
Parameters
• A Parameter is a place-holder for a single global value, such as a
number, date, or string.
• For example, you may have a filter to show the top 10 products by profit. You
can replace the fixed value “10” in the filter by a dynamic parameter so you can
quickly look at the top 15, 20, and 30 products.
• The value of a Tableau parameter is global, so if the value is
changed, every view and calculation in the workbook that references
the parameter will use the new value.
• You can use the parameter in filters, calculations, reference lines,
controls, etc.
• Creating a Tableau Parameter is similar to creating a calculated field.
Use parameter in a filter
1. Using the Superstore example that comes with Tableau, create a
sheet showing sales by customer
2. Drag Customer Name to the Filters area and first set the filter to
show the top 10 customers by sales
3. Edit the filter and select “create a new parameter…”
4. Define the new parameter; the parameter control will
automatically show in the sheet
5. Sort the chart by sales to make it more intuitive
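The idea of replacing the fixed "10" with a parameter can be illustrated in Python as a function argument. The customer names, sales figures, and function name are hypothetical.

```python
def top_customers(sales_by_customer, top_n):
    """Return the top_n (customer, sales) pairs, largest sales first."""
    ranked = sorted(sales_by_customer.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

# Hypothetical sales totals per customer.
sales_by_customer = {"Ann": 500.0, "Bob": 1200.0, "Cy": 300.0, "Dee": 900.0}

# The same filter logic, driven by different parameter values.
top_2 = top_customers(sales_by_customer, top_n=2)
top_3 = top_customers(sales_by_customer, top_n=3)
print(top_2, top_3)
```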
Time series and predictive analysis
• Visualize time series data to spot trends and patterns
• Tableau’s forecasting function runs several different models by default
and selects the best one, automatically accounting for data issues such
as seasonality. Forecasting in Tableau uses a technique known as
exponential smoothing, which iteratively forecasts future values of a
time series from weighted averages of past values.
• Note that when showing multiple data series in the same graph, you
can choose which one to bring to the front
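The phrase "weighted averages of past values" can be made concrete with the simplest form of exponential smoothing. The Python sketch below has no trend or seasonal component and is not Tableau's forecasting model; the series, the smoothing weight `alpha`, and the function name are assumptions for illustration.

```python
def simple_exponential_smoothing(series, alpha):
    """One-step-ahead forecast: each update blends the newest value with the old level."""
    level = series[0]  # start from the first observation
    for value in series[1:]:
        # Recent values get weight alpha; older values decay by (1 - alpha) each step.
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical monthly values.
series = [10.0, 12.0, 14.0]
forecast = simple_exponential_smoothing(series, alpha=0.5)
print(forecast)
```

A larger `alpha` makes the forecast follow the latest observations more closely; a smaller one smooths more heavily.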
XKCD Webcomic:
Curve-Fitting
https://xkcd.com/2048/
Combine Tables in Tableau
• Tables from the same database, the same workbook (for Excel), or the same
directory (for text files) are considered to be from the same
database.
• Combining tables that are from the same database requires only a
single connection in the data source. Typically, joining tables from
the same database yields better performance.
• Cross-database joins require that you first set up a multi-connection
data source—that is, you create a new connection to each database
before you join tables.
Join Type: Result

Inner: Keep values that have matches in both tables.

Left: Keep all values from the left table and corresponding matches from the
right table.
When a value in the left table doesn't have a corresponding match in the
right table, you see a null value in the data grid.

Right: Keep all values from the right table and corresponding matches from the
left table.
When a value in the right table doesn't have a corresponding match in
the left table, you see a null value in the data grid.

Full outer: Keep all values from both tables.
When a value from either table doesn't have a match with the other
table, you see a null value in the data grid.

Union: Union is not a type of join; it combines two or more tables by appending
rows of data from one table to another. Ideally, the tables should have
the same number of fields, and those fields should have matching names and
data types.
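The join types in the table can be illustrated with a small Python sketch that joins two lists of rows on one shared key and fills unmatched sides with `None` (standing in for null). The tables, the column names, and the `join` and `union` helpers are hypothetical, and the sketch assumes the key is the only column the two tables share.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; how is inner, left, right, or full."""
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    rows, matched_right = [], set()
    for lrow in left:
        matches = [i for i, rrow in enumerate(right) if rrow[key] == lrow[key]]
        for i in matches:
            matched_right.add(i)
            rows.append({**lrow, **right[i]})
        if not matches and how in ("left", "full"):
            # Left row without a match: right-side columns become null.
            rows.append({**lrow, **{c: None for c in right_cols}})
    if how in ("right", "full"):
        for i, rrow in enumerate(right):
            if i not in matched_right:
                # Right row without a match: left-side columns become null.
                rows.append({key: rrow[key], **{c: None for c in left_cols}, **rrow})
    return rows

def union(first, second):
    """Append the rows of one table to another (not a join)."""
    return first + second

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 2, "amount": 50}, {"id": 3, "amount": 70}]

inner = join(customers, orders, "id", "inner")
left = join(customers, orders, "id", "left")
right = join(customers, orders, "id", "right")
full = join(customers, orders, "id", "full")
print(len(inner), len(left), len(right), len(full))
```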
Mismatch in Joins
• If the values in the join fields do not match, there will be no data after the join
• Mismatches are often caused by differences in format of the string
values or date values in the fields
• You can often resolve mismatches between the fields in your join by
using a calculation
Example: one table has two columns, first name and last name; another table has
one column, name.
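The first name / last name case can be resolved by building a combined name on one side before matching. In Tableau this would be done with a calculation on the join field; the Python below only illustrates the idea, with hypothetical rows and a hypothetical helper name.

```python
# Hypothetical tables with mismatched name fields.
people = [
    {"first name": "Ada", "last name": "Lovelace"},
    {"first name": "Alan", "last name": "Turing"},
]
accounts = [{"name": "Ada Lovelace", "balance": 100}]

def full_name(row):
    """Build 'First Last' so it can be compared with the single name column."""
    return f"{row['first name'].strip()} {row['last name'].strip()}"

# Inner join on the calculated name.
matched = [
    {**person, **account}
    for person in people
    for account in accounts
    if full_name(person) == account["name"]
]
print(matched)
```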
The Audit Case
• The case and raw data are on Blackboard
• Let’s connect the data to Tableau and combine the tables by defining
the relationships
• The data are in text format
• You need the info from exhibit 3 in the case for detailed data structure
• We can then focus on the questions we try to answer
• Case Requirements
• Key attributes
• Do we have all the key attributes we want to analyze or do we need to create
some calculated fields?
First step: connect data source and create relations
Exhibit 3
ERD, Tableau Table Joins, Data Dictionary, and Check Figures

Entity-Relationship-Diagram

Tableau Joins
