Module Data Analysis
Learning Objectives
Understand how your data is structured
Assess and validate the quality of data: is it already in a standardized format? What types of
data do we have? Is it well documented? What restrictions apply?
Understand how to create data about your data
Who is it for:
Data analysts & Data Owners
If a friend asked you to go hiking with them, what questions might you ask to be better prepared?
You might want to know how long the trail is, and how well-kept. You might ask if there was
elevation gain and how evenly distributed. Or if there are outliers like water crossings or boulder
scrambles. These questions will help you understand the general structure of what you’re engaging
with and aid your preparation.
Similarly, exploratory data analysis is a set of techniques designed to get you familiar with a dataset,
fast. Usually utilizing visualizations, you can quickly understand some of the key features. It’s an
iterative process where answers to your questions lead to new questions and eventually new
answers.
Some questions you might ask during exploratory data analysis include:
Essentially, exploratory data analysis should occur before you have enough information to begin
hypothesis testing, helping you to arrive there faster. It also allows the data to suggest a model for
future testing. The stated goals may sound intimidating, but it can be as simple as dropping the data
into Tableau and slicing & dicing around looking for something interesting.
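For readers who prefer code to a drag-and-drop tool, a first pass at getting familiar with a dataset can be sketched in a few lines of Python. This is only a minimal illustration: the trail data, the column names, and the profile helper are all made up for the example, and a real dataset would normally be loaded from a file or database.

```python
from statistics import mean, median

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"trail": "Ridge Loop", "length_km": 8.5, "gain_m": 420},
    {"trail": "Lake Path", "length_km": 3.2, "gain_m": 60},
    {"trail": "Summit Push", "length_km": 14.0, "gain_m": 1250},
    {"trail": "Creek Walk", "length_km": None, "gain_m": 35},
]


def profile(rows, column):
    """Summarize one numeric column: count, missing values, range, center."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


length_profile = profile(rows, "length_km")
gain_profile = profile(rows, "gain_m")
print(length_profile)
print(gain_profile)
```

A summary like this is also a simple form of data about your data: it records how complete each column is and what range of values to expect before any deeper analysis starts.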
Joins vs Unions
If you are working with multiple sources of data (it might be as simple as a couple of different Excel tables or
as complicated as different databases), you will have to identify how these sources of data should interact
with each other. In just a few words: joins work with columns (horizontally) and unions work with rows
(vertically). Here is the explanation in more detail:
A great additional resource on joins and unions in Tableau:
https://www.thedataschool.co.uk/diego-parkertheinformationlab-co-uk/joins-and-unions/
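The difference between the two can also be sketched in plain Python, independent of any particular tool. The tables, column names, and values below are hypothetical, and the join shown is the "keep every order" (left join) variant; other join types treat unmatched rows differently.

```python
# Hypothetical tables represented as lists of dicts (one dict per row).
orders_jan = [
    {"order_id": 1, "customer_id": "C1", "amount": 120},
    {"order_id": 2, "customer_id": "C2", "amount": 75},
]
orders_feb = [
    {"order_id": 3, "customer_id": "C1", "amount": 40},
]
customers = [
    {"customer_id": "C1", "region": "North"},
    {"customer_id": "C2", "region": "South"},
]

# Union: same columns, more rows (the tables are stacked vertically).
all_orders = orders_jan + orders_feb

# Join: same rows, more columns (matched horizontally on a shared key).
# Orders without a matching customer would get region=None (a left join).
region_by_customer = {c["customer_id"]: c["region"] for c in customers}
orders_with_region = [
    {**order, "region": region_by_customer.get(order["customer_id"])}
    for order in all_orders
]

print(len(all_orders))
print(orders_with_region[0])
```

The union only makes sense because both order tables share the same columns, and the join only makes sense because both sides share the customer_id key. Checking those two conditions is usually the first step when combining sources.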
Data dredging
Data dredging or data fishing is the process of (usually programmatically) identifying relationships
between variables. This can be an incredibly powerful technique to uncover elements that interact in
unexpected ways.
However, in the world of big data, often so many variables are available to pair together that
eventually some will return a false positive. Evaluating two variables over the same period of time
allows application of a statistical technique to determine a correlation coefficient, a value between
negative 1 and 1 which expresses whether their two curves correlate (and how strongly, and
positively or negatively). To oversimplify, when there is a change in one line, does it correlate with a
similar change in the other line?
Correlation is the existence of a relationship between two or more variables: as one variable goes
up or down, another variable tends to go up or down with it.
Correlation does not mean that one variable is causing the other to change.
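As a rough sketch of how a correlation coefficient is calculated, the snippet below computes the Pearson coefficient, which is one common choice (the text above does not name a specific technique). The monthly figures and variable names are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly figures over the same six months.
ad_spend = [10, 12, 15, 18, 22, 25]
sales = [100, 110, 128, 150, 170, 195]
returns = [30, 28, 25, 21, 18, 15]

print(round(pearson(ad_spend, sales), 3))    # strongly positive, close to 1
print(round(pearson(ad_spend, returns), 3))  # strongly negative, close to -1
```

A value near 1 or -1 only says the two lines move together or in opposite directions. It says nothing about which one, if either, is driving the other.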
With large datasets offering a near-infinite number of lines to choose from, random chance will
eventually produce two lines which appear to correlate but in reality are
unrelated. These are known as spurious correlations, and pulling from publicly available datasets can
produce humorous (and instructive!) examples:
http://tylervigen.com/spurious-correlations
Data dredging is a useful tool, but use it with an understanding of its pitfalls. Metrics that seem to
correlate should be investigated within the context of the business rather than taken at face value.
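The false-positive problem can be demonstrated with a small simulation. The sketch below generates series of pure random noise, so no real relationship exists between any of them, and then dredges every pair for the strongest correlation. The series count, series length, and seed are arbitrary assumptions chosen for the illustration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n_series, n_points = 100, 10

# Every series is independent random noise: no real relationships exist.
series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]

# Dredge: score every possible pair and keep the strongest correlation.
best_r, best_pair = max(
    (abs(pearson(series[i], series[j])), (i, j))
    for i, j in combinations(range(n_series), 2)
)
n_pairs = n_series * (n_series - 1) // 2
print(f"checked {n_pairs} pairs; strongest |r| = {best_r:.2f} for series {best_pair}")
```

With thousands of pairs and only a handful of data points per series, the strongest pair will typically look convincing even though it is pure chance. That is why a dredged correlation is a prompt for investigation, not a finding.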
Analytical tools
Average types
We looked at this during the exploratory phase, but now we understand the data better. Outliers
can have a large impact on the mean in particular—you might consider excluding them. Does the
mean change dramatically if you slice by any particular dimension? Are the mean and median far
apart, indicating a skewed distribution? Start slicing the data in a few ways and see what you can
discover.
Further reading: https://betterexplained.com/articles/how-to-analyze-data-using-the-average/
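To make the effect of an outlier concrete, here is a small sketch using Python's built-in statistics module. The order values are hypothetical, with one deliberately extreme entry at the end.

```python
from statistics import mean, median

# Hypothetical order values; the last one is an outlier.
order_values = [42, 45, 47, 50, 51, 53, 980]

mean_all = mean(order_values)
median_all = median(order_values)

# The same figures with the outlier excluded.
trimmed = order_values[:-1]
mean_trimmed = mean(trimmed)
median_trimmed = median(trimmed)

print(f"with outlier:    mean={mean_all:.1f}, median={median_all}")
print(f"without outlier: mean={mean_trimmed:.1f}, median={median_trimmed}")
```

One extreme value pulls the mean far away from the typical order, while the median barely moves. A large gap between the two is a quick signal that the distribution is skewed and worth a closer look.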
Linear regression
Linear regression is a common and powerful statistical tool that models the relationship
between a dependent variable and an explanatory variable. Most relevant to business scenarios, it
allows you to forecast the dependent variable based on a future state of the explanatory variable.
Although it is used frequently, it is more complex than can be captured here, and it is a technique
worth investing time in learning.
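As a minimal sketch of the idea, the snippet below fits a straight line with ordinary least squares on one explanatory variable and uses it to forecast. The figures, the variable names, and the planned spend of 30 are assumptions made up for the example. A real analysis would also check how well the line fits and whether a straight line is appropriate at all.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: explanatory variable (ad spend), dependent variable (sales).
ad_spend = [10, 12, 15, 18, 22, 25]
sales = [100, 110, 128, 150, 170, 195]

slope, intercept = fit_line(ad_spend, sales)

# Forecast the dependent variable for a planned future ad spend of 30.
planned_spend = 30
forecast = slope * planned_spend + intercept
print(f"sales = {slope:.2f} * ad_spend + {intercept:.2f}")
print(f"forecast at ad_spend={planned_spend}: {forecast:.1f}")
```

Note that the forecast extends the line beyond the range of the observed data, which is exactly where such a model is least reliable.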
Applied analytics
Applying statistical rigor is highly effective when it can be used, but the ambiguity inherent to daily
work frequently causes conditions to fall short of ideal. Analysts are often forced to find a point
where their work is not perfect, just “good enough.”
The business is your partner in any analysis. It’s often better to engage them early and often in your
work; they may provide context that will help you figure out what to test next, or prevent you from
puzzling over a strange trend or outlier for days.
Despite how it may appear, data analysis is a creative pursuit. Getting a diverse group of stakeholders in the
same room with juicy data to discuss may inspire insights any individual would be incapable of
gleaning alone. Analytical rigor is important, but business is messy and sometimes needs a messy
solution.
Business decisions can be thought of as one-way or two-way doors. If a decision is
irreversible, or so taxing to undo that reversing it would be realistically infeasible, it’s a one-way door. If it’s a decision
you can try but at any point quickly abandon, it’s a two-way door (a two-way door might still be
heavy to open).
Speed matters, and sometimes recommending a strategy you feel reasonably good about is faster
than investing more time into the upfront analysis. Knowing when you hit that point of diminishing
returns as well as the context within the business will give you a gauge of when to keep analysing
versus bringing your findings to the decision-maker.