04 Exploring+Data+Visually Combined Lms

The document discusses exploring data visually through data visualization. It covers key aspects of the data exploration process including understanding the available data, determining what questions you want to answer, choosing appropriate visualization methods, and interpreting what you see in the visualizations. The purpose of data visualization is to gain insights and new perspectives from the data. An iterative process is recommended to fully explore the data through creating and comparing multiple visualizations from different angles.

Uploaded by

Rami Haidar

“The greatest value of a picture is when it forces us to notice what we never expected to see.”
John W. Tukey, Exploratory Data Analysis (1977)

Exploring Data Visually and Graph Types

MSBA 325
Fouad Zablith, PhD
Purpose

https://www.theguardian.com/data
Process

Things to consider when you explore your data visually:

• What data do you have?
• What do you want to know about your data?
• What visualization methods should you use?
• What do you see, and does it make sense?

Exploring data visually is an iterative process.


What Data Do You Have?

• If your dataset is only a handful of observations, this limits what you can find in your data and which visualization methods are useful; you won’t see much.
• If you have a lot of data, what you see when you visualize one aspect of it can lead to curiosity about other dimensions, which in turn leads to different graphics.
What Data Do You Have?

It is very important to get the data before forming the visual.

Getting the data that you need is the hardest and most time-consuming part.

Programming or click-and-play applications can be helpful to manage data.

Things to consider:

• What do values represent?
• Where is the data from?
• How are variables measured?


What Do You Want to Know About Your Data?

When you have a dataset with thousands or millions of observations, it can be challenging to
figure out what to look at first.

This is where the phrase “drowning in data” comes from. You stare at a bunch of numbers on
your computer screen, and values start to blur together the longer you stare. Soon all you see is
a blob of data that feels suffocating, but wait; there’s hope.

Take a step back. Breathe.

To avoid drowning in data, you learn to swim. When you learn to swim, you start at the shallow
end and work your way toward the deep end.
What Do You Want to Know About Your Data?

Your answer doesn’t need to be complex or profound. The more specific you are, the more
direction you get.

For example, if you have time series data, you might want to know if something has improved
or gotten worse over the past decade.
What Visualization Methods Should You Use?

When exploring, it is more important to see your data from different angles and to drill down to what matters for your project than to settle on a single chart type.

• Make multiple charts, compare all your variables, and see if there are interesting bits that are worth a closer look.
• Try different scales, colors, shapes, sizes, and geometries, and you might find a graphic worth pursuing further.
What Visualization Methods Should You Use

You can go beyond the norm: the figure shows an interactive exploration of article deletions on Wikipedia.

http://notabilia.net
What Visualization Methods Should You Use

• If you are designing a dashboard that provides the status of a system at a glance, you must visualize the data in a way that is straightforward.
• If the goal is to encourage reflection or to evoke emotions, efficiency might not be your main concern.
• Traditional visualizations such as bar graphs and line charts can be made easily and read quickly.
What Do You See and Does it Make Sense?

After you visualize your data, there are certain things to look for.
Types of Visuals
Do We Always Need to Visualize Graphically?

• When you have just one or two numbers to share, use the numbers themselves to highlight their importance.
• The fact that you have numbers doesn’t mean that you need to use a graph.

For example, in the adjacent figure, the graph does little to help interpret the numbers.
Tables

• Very useful when communicating to a mixed audience, since each member will look for their particular row of interest.
• It is easier to communicate multiple units of measure with a table than with a graph.
• In order to place the emphasis on the data, use light borders and let the design of the table fade into the background.
A Special Case of Tables: Heatmap

A heatmap is a way to visualize data in tabular format where you color cells to show
the relative magnitude of the numbers.
For example, in the heatmap above, the higher the saturation of blue, the higher the number.

This makes it faster to pick out the tails (the lowest and highest numbers).
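The shading rule behind such a heatmap can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table values are made up, the helper name `shade` is hypothetical, and each cell’s intensity is simply its value rescaled between the smallest and largest number in the table.

```python
# Hypothetical table of numbers (rows x columns); values are made up for illustration.
table = [
    [12, 45, 30],
    [60, 5, 22],
    [38, 51, 17],
]

cells = [value for row in table for value in row]
lo, hi = min(cells), max(cells)


def shade(value, lo=lo, hi=hi):
    """Return a color intensity in [0, 1]: 0 for the lowest number, 1 for the highest."""
    return (value - lo) / (hi - lo)


# One intensity per cell, rounded for display.
intensities = [[round(shade(value), 2) for value in row] for row in table]

# The tails are the cells whose intensity is 0.0 (lowest) and 1.0 (highest).
for row in intensities:
    print(row)
print("lowest:", lo, "highest:", hi)
```

A plotting library would map each intensity to a color saturation; the rescaling step is what makes the lowest and highest cells stand out.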
Visualizing Categorical Data

The bar graph, of course, is one of the most common ways to show categorical data.

For example, the results of a survey of approximately 2,200 people about how they use the
Internet, social networking sites such as Facebook and Twitter, and whether politics was a
regular occurrence on those sites can be reported using a bar graph.

The figure shows the results for four of the fifty questions.
Visualizing Categorical Data

The same results can be presented using a different scale and shapes.

The figure shows the same poll results with squares sized by area.
Visualizing Categorical Data

The differences among categories don’t look as dramatic in the symbol plots as they do in the bar graphs.

For example, the bar for Google looks much longer than the rest in the search engine bar graph, but the square for Google, while bigger, does not differ from the other squares by the same magnitude.
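A short calculation suggests why squares look less dramatic than bars. The sketch below uses made-up shares (not the poll results) and hypothetical option names: a bar’s length is proportional to the value, while a square’s side is proportional to the square root of the value when area encodes the number.

```python
import math

# Hypothetical survey shares (percent); not the actual poll results.
shares = {"Option A": 64, "Option B": 16}

# Bar graph: length is proportional to the value.
bar_ratio = shares["Option A"] / shares["Option B"]

# Symbol plot: area is proportional to the value, so the side is the square root.
sides = {name: math.sqrt(value) for name, value in shares.items()}
side_ratio = sides["Option A"] / sides["Option B"]

print("bar length ratio:", bar_ratio)
print("square side ratio:", side_ratio)
```

With these numbers one bar is four times as long as the other, while the corresponding square is only twice as wide and twice as tall.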
Bar Chart Note
Because of how our eyes compare the relative end points of the
bars, bar charts must have a zero baseline.

Non-zero Baseline vs. Zero Baseline
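The effect of a non-zero baseline can be checked with simple arithmetic. The following sketch uses two made-up values and a hypothetical helper, `apparent_ratio`, to compare how much taller one bar is drawn than the other when the axis starts at zero and when it starts higher.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights (b to a) when the axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)


# Hypothetical values for two bars.
a, b = 90.0, 100.0

true_ratio = apparent_ratio(a, b)                      # axis starts at zero
truncated_ratio = apparent_ratio(a, b, baseline=80.0)  # axis starts at 80

print(round(true_ratio, 2))  # how the values actually compare
print(truncated_ratio)       # how the drawn bars compare
```

Here the second value is about 11% larger than the first, yet with the axis starting at 80 its bar is drawn twice as tall.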


Graph Axis vs. Data Labels

In making this decision, consider the level of specificity needed:

• If you want your audience to focus on big-picture trends, think about preserving the axis but deemphasizing it by making it grey.
• If the specific numerical values are important, it may be better to label the data points directly. In this latter case, it’s usually best to omit the axis to avoid the inclusion of redundant information.

Always consider how you want your audience to use the visual and construct it accordingly.
Types of Graphs: Bars

In general, the bars should be wider than the white space between the bars.
Types of Graphs: Bars

Vertical Bar Charts

• Vertical bar charts can be single series, two series, or multiple series.
Note: as you add more series of data, it becomes more difficult to focus on one at a time.
Types of Graphs: Bars

Stacked Vertical Bar Chart

• Meant to allow you to compare totals across categories and also see the
subcomponent pieces within a given category.

• It is hard to compare the subcomponents across the various categories once you get
beyond the bottom series because you no longer have a consistent baseline to use to
compare.
Types of Graphs: Bars

Waterfall Chart

The waterfall chart can be used to pull apart the pieces of a stacked bar chart to focus
on one at a time, or to show a starting point, increases and decreases, and the
resulting ending point.

Example: Imagine that you are an HR business partner and want to understand and
communicate how employee headcount has changed over the past year for the client
group you support.
Types of Graphs: Bars
Waterfall Chart Example

• The first column shows the employee headcount at the beginning of the year.
• The final column represents employee headcount at the end of the year, after the
additions and deductions have been applied to the beginning of year headcount.
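The arithmetic behind a waterfall chart like this one can be sketched as a running total. The headcount figures and category names below are hypothetical, chosen only to illustrate the idea; `accumulate(..., initial=...)` assumes Python 3.8 or later.

```python
from itertools import accumulate

# Hypothetical headcount figures; the category names are illustrative only.
start = 100
changes = [
    ("Hires", 30),
    ("Transfers in", 8),
    ("Transfers out", -12),
    ("Exits", -10),
]

# Running total: beginning headcount, then the total after each change.
running = list(accumulate((delta for _, delta in changes), initial=start))

# Each floating bar spans from the total before the change to the total after it.
bars = []
for (label, delta), before, after in zip(changes, running, running[1:]):
    bars.append((label, min(before, after), max(before, after)))

end = running[-1]

print("start:", start)
for label, bottom, top in bars:
    print(label, bottom, top)
print("end:", end)
```

The first and last columns are drawn from zero to `start` and `end`; the columns in between float at the heights stored in `bars`.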
Types of Graphs: Bars

Horizontal Bar Chart

• Very useful for categorical data


• Extremely easy to read
• Can be single series, two series, or multiple series
Types of Graphs: Bars

Stacked Horizontal Bar Chart

• Can be used to show the totals across different categories but also give information about the subcomponents.
• Can be structured to show either absolute values or sum to 100%.
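The difference between the two structures is a single normalization step. The sketch below uses hypothetical counts and a hypothetical helper, `to_percent`, to show the same data once as absolute totals and once rescaled so every bar sums to 100%.

```python
# Hypothetical counts per category and subcomponent.
counts = {
    "Team A": {"Agree": 30, "Neutral": 10, "Disagree": 10},
    "Team B": {"Agree": 12, "Neutral": 6, "Disagree": 2},
}


def to_percent(parts):
    """Rescale the subcomponents of one bar so that they sum to 100."""
    total = sum(parts.values())
    return {name: 100 * value / total for name, value in parts.items()}


# Absolute view: bar lengths differ, so totals can be compared across categories.
absolute_totals = {category: sum(parts.values()) for category, parts in counts.items()}

# 100% view: every bar has the same length, so shares can be compared instead.
percent = {category: to_percent(parts) for category, parts in counts.items()}

print(absolute_totals)
print(percent)
```

In this made-up example the two teams differ in size but have the same share of "Agree" answers, which only the 100% view makes obvious.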


Types of Graphs: Bars

Stacked Horizontal Bar Chart

For example, a stacked horizontal bar chart can work well for visualizing survey data
collected along a Likert scale.
Parts of a Whole

When you put categories together, the sum of the parts can equal a whole.

This is when the pie chart comes into the picture.

Returning to the example of the survey on Internet usage, the figure shows breakdowns of
awareness of targeted advertising online.
Parts of a Whole

OR
Subcategories

Subcategories, the categories within categories (within categories), are often more revealing
than the main categories.

As you drill down, there can be higher variability.

Showing subcategories can make it easier to browse your data, because you can visually jump
to the areas that you care most about.

You can use a treemap with the survey data.


Example: http://linked.aub.edu.lb/apps/charts/courses_concepts_treemap.php
Subcategories

The figure shows the proportion of people in the survey who said they were the parent or
guardian of a child younger than 18 living in the household.

The plot looks like one column from a stacked bar graph.

The bigger a section, the more people gave that answer.

In this case, a mosaic plot is better than a treemap.


Subcategories

As shown in the figure, you can introduce another dimension.

More area means a higher percentage.


Subcategories

You can keep going and bring in a third variable.

The orientation of education and parenting is the same, but you can also see e-mail usage.

• Notice the vertical split on the subsection in the figure.
• You could keep on adding variables, but as you can see, as the plot grows, it becomes more challenging to read, so proceed with caution.
What to Look for

With categorical data, you often look for the minimum and maximum right away.

• This gives you a sense of the range of the dataset, and is easily found with a quick sorting of values.
• After that, look at the distribution of the parts. Are most values high? Low?
• Look for structure and patterns.
• If a couple of categories have the same value or highly differing ones, it’s worth asking why.
Visualizing Time Series Data

When you visualize time series data, your goal is to see what has passed, what is different,
what is the same, and by how much it changed.
Visualizing Time Series Data

The bar chart is a straightforward way to look at data over time, except instead of categories
on one of the axes, you use time.

The figure shows the unemployment rate in the United States from 1948 to 2012, according to
the Bureau of Labor Statistics.
Visualizing Time Series Data

A dot plot can be used in the same way, as shown in the figure.
The data and axes are the same, but the visual cue is different.
Visualizing Time Series Data

Like bar charts, dots put focus on each value, and trends can be harder to see.

When you connect the sparse dots with a line, the focus of the plot shifts again.
Visualizing Time Series Data

If you care more about an overall trend than you do about the more specific monthly
variability, you can fit a smoothing curve to the dots, as shown in the figure (instead of
connecting every dot).
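One simple way to get such a curve is a centered moving average. This is a stand-in chosen for brevity, and the figure may well use a different smoother (for example, a LOESS fit); the monthly values and the function name `moving_average` are hypothetical.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks near the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical monthly rates (percent).
monthly = [5.0, 5.6, 4.9, 5.8, 6.1, 5.7, 6.6]
trend = moving_average(monthly, window=3)

for raw, smooth in zip(monthly, trend):
    print(raw, round(smooth, 2))
```

A wider window gives a smoother curve that follows the overall trend more and the month-to-month variability less.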
Cycles

A number of factors feed into the economy and affect the unemployment rate, so there
aren’t regular intervals in between significant increases.

However, a lot of things repeat themselves at regular intervals. Examples:
• More people travel during the summer months.
• More people leave work around 5 in the afternoon and head home.
• More accidents occur on Saturday than on any other day of the week.
Cycles

Flight data from the Bureau of Transportation Statistics shows a similar cycle, as shown in the
figure.

The chart shows a weekly cycle, with the fewest flights on Saturdays and typically the most
flights on Fridays.
Cycles

Because the data repeats itself, it makes sense to compare like days of the week to each other.
For example, compare all Mondays.

In order to do this, you can split the days into weekly segments so that you can directly
compare cycles, as shown in the figure, with both the line chart and star plot.
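Splitting a daily series by day of the week can be sketched as a simple grouping step. The daily counts below are made up (two weeks starting on Sunday, 1 January 2012) and are not the Bureau of Transportation Statistics data; the variable names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical daily flight counts (in thousands) for two weeks,
# starting on Sunday, 1 January 2012.
start = date(2012, 1, 1)
counts = [14, 20, 19, 19, 20, 21, 12,
          15, 20, 18, 19, 21, 22, 13]

# Group each day's value under its weekday so like days can be compared.
by_weekday = defaultdict(list)
for offset, n in enumerate(counts):
    day = start + timedelta(days=offset)
    by_weekday[DAY_NAMES[day.weekday()]].append(n)

mondays = by_weekday["Monday"]  # for example, compare all Mondays
averages = {name: sum(v) / len(v) for name, v in by_weekday.items()}

print(mondays)
print(min(averages, key=averages.get), max(averages, key=averages.get))
```

Each list in `by_weekday` holds one value per week, which is the series you would draw as one line (or one spoke of a star plot) per weekday.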
Cycles

The figure shows the data in a familiar calendar format. The first
column is Sunday, the second is Monday, and so on, to Saturday at
the end.

With the calendar heat map, along with seeing cycles as you scan
top to bottom, it’s easy to see specific days in rows and columns, so
it’s easier to reference what day of the year each value is for.
A disadvantage of the calendar is that color is the visual cue, and it can be hard to see small differences.
Cycles

[Figure annotations: Memorial Day (May 30), Independence Day (July 4), Labor Day (September 5), Thanksgiving, Christmas (December 25)]
Types of Graphs: Slopegraph

Slopegraphs are useful when you have two time periods or points of comparison and
want to quickly show relative increases and decreases or differences across various
categories between the two data points.

• In addition to the points, the lines that connect them give you the visual increase or
decrease in rate of change (via the slope or direction).
Types of Graphs: Slopegraph

Imagine that you are analyzing and communicating data from a recent employee feedback survey.

To show the relative change in survey categories from 2014 to 2015, the slopegraph might look something like the adjacent figure.


Types of Graphs: Slopegraph

We can also draw attention to the single category that decreased over time from
the preceding example.
Visualizing Spatial Data

There is a natural hierarchy to spatial data that allows, and often requires, you to explore at different granularities.

The most obvious way to explore spatial data is with maps, which place values within a geographic coordinate system.

The figure shows some options.
Visualizing Spatial Data

If you care only about individual locations, you can place dots on a map.
Visualizing Spatial Data

The figure uses bubbles for the airports, sized by the number of outgoing flights.

With the addition of area as a visual cue, you see not just where the busiest airports are, but also how busy they are relative to each other.
Visualizing Spatial Data

Rather than separate locations, you might want to explore connections between locations.
The brighter a line is, the more flights went to and from those two airports.
Visualizing Spatial Data

But there’s more you can take away from this data by splitting it into categories.

For example, map flights by airline, and you see the data with a new dimension.
Regions

To maintain the privacy of individuals and to keep personal addresses hidden, it’s common to
aggregate spatial data before releasing it.

Choropleth maps are the most common way to visualize regional data in a spatial context.
Regions

The map shows the unemployment rate by county during August 2012.

You can see high rates on the West Coast and in the Southeast and lower unemployment in the Midwest.
Regions

There are also times when you’re more interested in the aggregates.

For example, the figure shows all recorded UFO sightings between 1906 and 2007, according to the National UFO Reporting Center.
Regions

The figure shows the same data, but as a filled contour map.
A color scale is used to show sightings density, where white means more sightings and black means none, and varying shades of red are for everything in between.
Cartograms

A challenge with mapping regions, the choropleth map in particular, is that larger regions
always get more visual attention regardless of the data.

Cartograms are one way to remedy this. Location is somewhat preserved, but geographical
areas and boundaries are not.

For example, the figure shows the UFO sighting data as a cartogram. Notice the shrinking of
Texas and swelling of California.
Cartograms

The upside of cartograms is that areas fill the appropriate amount of space, but the trade-off is
less geographic accuracy.

When your data is for larger regions, with a wide range of sizes, this trade-off is worth it, but when regions are uniform in size, a choropleth map is most likely a better fit.

What to Look for

Spatial data is a lot like categorical data, but with a geographic component.

• You should know the range of the data to start with.
• Then look for regional patterns.


Multiple Variables

A Few Variables

The dot plots discussed previously placed time on the horizontal axis and a variable on the
vertical axis.

A scatter plot replaces time with a different variable, so you have two variables plotted against
each other, as shown in the figure.
Multiple Variables

A Few Variables

This statistical relationship between variables is called correlation.

The correlation strength can vary, as shown in the figure.


Multiple Variables

A Few Variables

For a more defined view of how two variables are related, you can fit a line through the points,
as shown in the figure.
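The slides do not say how the line is fitted. A common choice, assumed here, is an ordinary least-squares line, and the strength of the linear relationship can be summarized with Pearson’s correlation coefficient. The data points and the function names `fit_line` and `pearson_r` below are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson_r(xs, ys):
    """Pearson correlation: close to +1 or -1 is strong, close to 0 is weak."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


# Hypothetical paired observations of two variables.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
r = pearson_r(xs, ys)
print(round(slope, 2), round(intercept, 2), round(r, 2))
```

The fitted line only summarizes how the two variables move together in this sample; it says nothing about whether one causes the other.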
Don’t mix up causal and correlative relationships.

They look the same when you visualize them, but the former is more difficult to prove than the latter.

http://www.tylervigen.com/spurious-correlations
Multiple Variables
A Few Variables

The figure shows two ways to incorporate a third variable in a scatter plot.

In the scatter plot on the left, the area of a circle represents assists per game.

The scatter plot on the right uses color to show the same thing. The darker the shade, the more assists per game.
Multiple Variables
A Few Variables

In the figure, you see assist leaders closer toward the right corner of higher usage percentage
and points, but there’s high variability and there isn’t a clear trend.
Multiple Variables
A Few Variables

The figure shows the same values on the axes, usage percentage and points per game, but uses
area for rebounds and color for assists.
Multiple Variables
Many Variables

A heat map, as shown in the figure, can be used to translate a table to a set of colors.

It shows the same basketball player data, in addition to several other variables (number of games played, field goal percentage, and three-point percentage).

Each row represents a player, and darker shades represent relatively higher values.
Multiple Variables
Many Variables

With players sorted alphabetically, it’s hard to see patterns, but if you sort by a column, say,
points per game, as shown in the figure, relationships are easier to see.
Multiple Variables

Many Variables

Parallel coordinates plots also arrange variables horizontally, but instead of using color like a
heat map, you use vertical position, as shown in the figure.
On each vertical axis, the highest value is plotted at the top and the lowest at the bottom.

Then lines are drawn left to right, positioned by the variables of each observation.
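The positioning step of a parallel coordinates plot can be sketched as a per-variable rescaling. The observations and names below are hypothetical, and the sketch assumes every variable takes at least two distinct values (otherwise the rescaling would divide by zero).

```python
# Hypothetical observations; player and variable names are illustrative only.
players = {
    "Player 1": {"points": 30.0, "assists": 4.0, "rebounds": 10.0},
    "Player 2": {"points": 20.0, "assists": 10.0, "rebounds": 4.0},
    "Player 3": {"points": 10.0, "assists": 7.0, "rebounds": 7.0},
}
variables = ["points", "assists", "rebounds"]  # left-to-right order of the axes


def axis_positions(rows, variables):
    """Rescale each variable to [0, 1]: 0 is the bottom of its axis, 1 is the top."""
    lows = {v: min(row[v] for row in rows.values()) for v in variables}
    highs = {v: max(row[v] for row in rows.values()) for v in variables}
    return {
        name: [(row[v] - lows[v]) / (highs[v] - lows[v]) for v in variables]
        for name, row in rows.items()
    }


# One polyline per observation: its vertical position on each axis.
lines = axis_positions(players, variables)
for name, positions in lines.items():
    print(name, positions)
```

Lines that cross between two neighboring axes hint at an inverse relationship between those variables, while roughly parallel lines hint at a positive one.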
Multiple Variables

Many Variables

The figure shows a few more relationships.

When there aren’t clear relationships across the board, it can be hard to see patterns.
There’s high variability from player to player in the figure, so you end up with a jumble of lines.
Multiple Variables

Many Variables

If you highlight players who averaged five assists or more and gray out everyone else, it’s easier to see how these types of players perform in other categories.
Multiple Variables

Many Variables

Whereas the heat map and parallel coordinates plot provide an overview of the data, you might
also want to look at individual data points more closely.

Star plots present data separately.

That is, you represent each row of data with its own plot.
Multiple Variables

Using Multiple Views

It’s also often useful to look at data with different views at the same time.
Multiple Variables
Using Multiple Views

A scatter plot matrix can show similar relationships.


Multiple Variables

Using Multiple Views

If you have multiple variables that might be categorical, temporal, and spatial, it is often better to use multiple charts.

The figure explores the time series component of the flight data. Each line represents flight volume for an airline.
Distributions

Imagine there are 100 adults in a room. These 100 people have different heights, as shown in
the figure.

They range from 4 feet 10 inches to 6½ feet tall, and the average height for the group is 5 feet 4 inches.

It’s hard to determine how many people there are in various height ranges without counting each dot.
Distributions

You can get a better idea if you sort everyone from shortest to tallest, as shown in the figure.
The median line at 64 inches is in the middle, where 50 people are shorter and 50 people are taller.
Distributions

A better way to see the distribution is to group the heights into categories or bins, such as those between 4 feet and 4½ feet.

But the dot plot can take a lot of space, especially if you have many more heights to show.

Instead of dots, you could use bars (a histogram).
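The counting behind a histogram can be sketched as follows. The heights are made up, and the helper `bin_counts` and its parameters are hypothetical: each bin covers `bin_width` inches and is identified by its lower edge.

```python
from collections import Counter


def bin_counts(values, bin_width, start):
    """Count values per bin; the bin [lo, lo + bin_width) is keyed by its lower edge lo."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)
        counts[start + index * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical heights in inches.
heights = [58, 60, 61, 63, 64, 64, 65, 66, 67, 70, 72, 78]

narrow = bin_counts(heights, bin_width=6, start=54)   # half-foot bins
wide = bin_counts(heights, bin_width=12, start=54)    # one-foot bins

print(narrow)
print(wide)
```

Each count becomes the height of one bar. Changing `bin_width` changes how much detail of the distribution is visible, so it is worth trying more than one.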


Distributions

As shown in the figure, you can visualize distributions with varying levels of granularity.

Some views, such as the median, show only summary statistics, whereas other views, such as the histogram, show the distribution in greater detail.
Distributions

The box plot, as shown in the figure, is an overview visualization that provides a general sense
of distribution.

The box in the middle is defined by the lower and upper quartiles.

• The lower quartile represents where one-quarter of the values are lower.
• The upper quartile represents where one-quarter of the values are higher.
Distributions

The range in between the upper and lower quartiles is called the interquartile range.

• The outer lines are the lower and upper fences, defined by subtracting 1½ times the interquartile range from the lower quartile and adding it to the upper quartile, respectively.
• If the maximum and minimum values are within the upper and lower fences, the outer lines are only drawn to the extremes.
• Otherwise, dots are used to represent any points that fall outside the upper and lower fences and are considered outliers.
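These quantities can be computed directly. The sketch below uses made-up heights and Python’s `statistics.quantiles` (Python 3.8 or later) with the "inclusive" method; quartile conventions differ between tools, so other software may give slightly different values. It also assumes the common convention that each outer line stops at the most extreme value still inside the fence. The function name `box_plot_stats` is hypothetical.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Quartiles, interquartile range, fences, whisker ends, and outliers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": min(inside),    # outer line drawn to the lowest value inside the fence
        "whisker_high": max(inside),   # outer line drawn to the highest value inside the fence
        "outliers": outliers,          # drawn as individual dots
    }


# Hypothetical heights in inches, including one unusually tall person.
heights = [58, 60, 62, 63, 64, 64, 65, 66, 68, 80]
stats = box_plot_stats(heights)
for key, value in stats.items():
    print(key, value)
```

In this example the tallest value falls above the upper fence, so it would be drawn as a dot while the upper line stops at the next tallest height.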
Distributions

You can also use multiple box plots to compare distributions.


Distributions

The figure shows how the same height data can be represented with different bins.
Distributions
You can also use multiple histograms to compare distributions.

Returning to the flight data, the figure shows the distributions of arrival delays for major airlines.

Delays of more than 15 minutes are highlighted in orange.

Regardless of the type of visualization you use to explore distributions, look for peaks and valleys, range, and the spread of your data, which tell you a lot more than the mean and median.
Summary
Take Home Points
Visualization can be a great tool to explore your data.

The key to getting the most out of your data is less about finding the right software than about learning how to use the tools you have and knowing what questions to ask:

- Consider what data you have and what you can get
- Where the data is from
- How it was derived
- What all the variables mean
- Let that extra information guide your visual exploration.

Even if your goal is to visualize data for presentation, exploration can lead to unexpected
insights, which makes for better graphics.
