Exploring Data Visually
MSBA 325
Fouad Zablith, PhD
Purpose
https://www.theguardian.com/data
Process
If your dataset has only a handful of observations, that limits both what you can find in the
data and which visualization methods are useful.
If you have a lot of data, what you see when you visualize one aspect of it can lead to a
curiosity about other dimensions, which in turn leads to different graphics.
What Data Do You Have?
Getting the data that you need is the hardest and most time-consuming part.
Things to consider:
When you have a dataset with thousands or millions of observations, it can be challenging to
figure out what to look at first.
This is where the phrase “drowning in data” comes from. You stare at a bunch of numbers on
your computer screen, and values start to blur together the longer you stare. Soon all you see is
a blob of data that feels suffocating, but wait; there’s hope.
To avoid drowning in data, you learn to swim. When you learn to swim, you start at the shallow
end and work your way toward the deep end.
What Do You Want to Know About Your Data?
Your answer doesn’t need to be complex or profound. The more specific you are, the more
direction you get.
For example, if you have time series data, you might want to know if something has improved
or gotten worse over the past decade.
What Visualization Methods Should You Use?
Rather than searching for one perfect chart, it is more important to see your data from
different angles and to drill down to what matters for your project.
Make multiple charts, compare all your variables, and see if there are interesting bits that are
worth a closer look.
Try different scales, colors, shapes, sizes, and geometries, and you might find a graphic worth
pursuing further.
You can also go beyond the norm: the figure shows an interactive exploration of article
deletions on Wikipedia (http://notabilia.net).
If you are designing a dashboard that provides the status of a system at a glance, visualize
the data in a straightforward way.
If the goal is to encourage reflection or to evoke emotions, efficiency might not be your main
concern.
Traditional visualizations such as bar graphs and line charts can be made easily and read
quickly.
What Do You See and Does It Make Sense?
After you visualize your data, there are certain things to look for.
Do We Always Need to Visualize Graphically?
• When you have just one or two numbers to share, use the numbers themselves
to highlight their importance.
• The fact that you have numbers doesn’t mean that you need to use a graph.
Types of Visuals
Tables
• Very useful when communicating to a mixed audience, since each member will
look for their particular row of interest.
• In order to place the emphasis on the data, use light borders and let the design
of the table fade into the background.
A Special Case of Tables: Heatmap
A heatmap is a way to visualize data in tabular format where you color cells to show
the relative magnitude of the numbers.
For example, in the heatmap shown, the higher the saturation of blue, the higher the
number.
This makes the process of picking out the tails (the lowest number and highest number)
faster.
Visualizing Categorical Data
The bar graph, of course, is one of the most common ways to show categorical data.
For example, the results of a survey of approximately 2,200 people about how they use the
Internet, social networking sites such as Facebook and Twitter, and whether politics was a
regular occurrence on those sites can be reported using a bar graph.
The figure shows the results for four of the fifty questions.
The same results can be presented using a different scale and shapes.
The figure shows the same poll results with squares sized by area.
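A minimal sketch of sizing squares by area, which may help explain why differences look less dramatic in a symbol plot: a value four times larger gets four times the area but only twice the side length. The function name `square_side`, the `max_side` parameter, and the sample values are illustrative assumptions, not figures from the survey.

```python
import math


def square_side(value, max_value, max_side=100.0):
    """Return the side of a square whose AREA is proportional to value."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    if value < 0:
        raise ValueError("value must be non-negative")
    # Area scales with value, so the side scales with the square root.
    return max_side * math.sqrt(value / max_value)


# Hypothetical poll percentages (assumed for illustration only).
shares = {"A": 80, "B": 20, "C": 5}
largest = max(shares.values())
sides = {name: square_side(v, largest) for name, v in shares.items()}
```

Here category A is four times category B, yet its square is only twice as wide.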
The differences among categories don’t look as dramatic in the symbol plots as they do in the
bar graphs.
For example, in the search engine bar graph the bar for Google looks much longer than the rest,
whereas the square for Google looks bigger than the other squares, but not by the same
magnitude.
Bar Chart Note
Because of how our eyes compare the relative end points of the
bars, bar charts must have a zero baseline.
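The effect of a non-zero baseline can be checked with a little arithmetic. The sketch below compares the ratio of two bar lengths when the axis starts at zero and when it is truncated; `apparent_ratio` and the numbers are hypothetical and only illustrate the point.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths for a and b when the axis starts at baseline."""
    if b <= baseline:
        raise ValueError("b must be above the baseline")
    return (a - baseline) / (b - baseline)


# Hypothetical values: 55 versus 50.
true_ratio = apparent_ratio(55, 50)                      # axis starts at 0
truncated_ratio = apparent_ratio(55, 50, baseline=45)    # axis starts at 45
```

With a zero baseline the first bar is 10% longer; with the axis cut at 45 it is drawn twice as long, even though the data are unchanged.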
Always consider how you want your audience to use the visual and
construct it accordingly.
Types of Graphs: Bars
In general, the bars should be wider than the white space between the bars.
• Vertical bar charts can be single series, two series, or multiple series.
Note: as you add more series of data, it becomes more difficult to focus on one at a
time.
Types of Graphs: Bars
• Stacked vertical bar charts are meant to let you compare totals across categories and also
see the subcomponent pieces within a given category.
• It is hard to compare the subcomponents across the various categories once you get
beyond the bottom series because you no longer have a consistent baseline to use to
compare.
Waterfall Chart
The waterfall chart can be used to pull apart the pieces of a stacked bar chart to focus
on one at a time, or to show a starting point, increases and decreases, and the
resulting ending point.
Example: Imagine that you are an HR business partner and want to understand and
communicate how employee headcount has changed over the past year for the client
group you support.
Waterfall Chart Example
• The first column shows the employee headcount at the beginning of the year.
• The final column represents employee headcount at the end of the year, after the
additions and deductions have been applied to the beginning of year headcount.
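As a rough sketch of the headcount example, the positions of the bars in a waterfall chart can be derived from a starting value and a list of changes. The function `waterfall_steps`, the labels, and all numbers below are hypothetical; they are not the figures from the chart.

```python
def waterfall_steps(start, changes):
    """Return (label, bottom, top) for each bar of a waterfall chart."""
    bars = [("Start", 0, start)]
    running = start
    for label, delta in changes:
        new_total = running + delta
        # Each floating bar spans from the old running total to the new one.
        bars.append((label, min(running, new_total), max(running, new_total)))
        running = new_total
    bars.append(("End", 0, running))
    return bars


# Hypothetical additions and deductions over one year.
changes = [("Hires", 30), ("Transfers in", 8),
           ("Transfers out", -12), ("Exits", -10)]
bars = waterfall_steps(100, changes)
```

The first and last bars rest on the zero baseline, while the bars in between float at the level of the running total.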
Types of Graphs: Bars
• Stacked horizontal bar charts can be used to show the totals across different categories
but also give information about the subcomponents.
For example, a stacked horizontal bar chart can work well for visualizing survey data
collected along a Likert scale.
Parts of a Whole
When you put categories together, the sum of the parts can equal a whole.
Returning to the example of the survey on Internet usage, the figure shows breakdowns of
awareness of targeted advertising online.
Subcategories
Subcategories, the categories within categories (within categories), are often more revealing
than the main categories.
Showing subcategories can make it easier to browse your data, because you can visually jump
to the areas that you care most about.
Example: http://linked.aub.edu.lb/apps/charts/courses_concepts_treemap.php
The figure shows the proportion of people in the survey who said they were the parent or
guardian of a child younger than 18 living in the household.
The plot looks like one column from a stacked bar graph.
The bigger a section, the more people who gave that answer.
The orientation of education and parenting is the same, but you can also see e-mail usage.
With categorical data, you often look for the minimum and maximum right away.
This gives you a sense of the range of the dataset, and is easily found with a quick sorting of
values.
After that, look at the distribution of the parts. Are most values high? Low?
If a couple of categories have the same value or highly differing ones, it’s worth asking why.
Visualizing Time Series Data
When you visualize time series data, your goal is to see what has passed, what is different,
what is the same, and by how much it changed.
The bar chart is a straightforward way to look at data over time, except instead of categories
on one of the axes, you use time.
The figure shows the unemployment rate in the United States from 1948 to 2012, according to
the Bureau of Labor Statistics.
A dot plot can be used in the same way, as shown in the figure.
The data and axes are the same, but the visual cue is different.
Like bar charts, dots put focus on each value, and trends can be harder to see.
When you connect the sparse dots with a line, the focus of the plot shifts again.
If you care more about the overall trend than about the month-to-month variability, you can
fit a smoothing curve to the dots instead of connecting every dot, as shown in the figure.
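The smoothing method behind the figure is not stated here, so the sketch below swaps in a centered moving average as a simple stand-in for a smoothing curve. The function name `moving_average`, the window size, and the sample rates are assumptions for illustration.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks near the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical monthly rates (not Bureau of Labor Statistics data).
rates = [5.0, 7.0, 6.0, 8.0, 4.0]
smooth = moving_average(rates, window=3)
```

A wider window gives a smoother curve and hides more of the short-term variability.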
Cycles
A number of factors feed into the economy and affect the unemployment rate, so there
aren’t regular intervals in between significant increases.
However, many things do repeat at regular intervals. Examples:
• More people travel during the summer months.
• More people leave work around 5 in the afternoon and head home.
• More accidents occur on Saturday than any other day of the week.
Flight data from the Bureau of Transportation Statistics shows a similar cycle, as shown in the
figure.
The chart shows a weekly cycle, with the fewest flights on Saturdays and typically the most
flights on Fridays.
Because the data repeats itself, it makes sense to compare like days of the week to each other.
For example, compare all Mondays.
In order to do this, you can split the days into weekly segments so that you can directly
compare cycles, as shown in the figure, with both the line chart and star plot.
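One possible way to split daily data into weekly segments is to group the values by day of the week. In this sketch the function `by_weekday`, the start date, and the daily counts are all hypothetical; they are not the Bureau of Transportation Statistics figures.

```python
from collections import defaultdict
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def by_weekday(start, counts):
    """Group consecutive daily counts by weekday name, beginning at start."""
    groups = defaultdict(list)
    for offset, count in enumerate(counts):
        day = start + timedelta(days=offset)
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        groups[DAY_NAMES[day.weekday()]].append(count)
    return dict(groups)


# Hypothetical daily flight counts (thousands) for two weeks from a Sunday.
counts = [15, 20, 19, 19, 20, 21, 13,
          16, 20, 19, 20, 20, 22, 12]
groups = by_weekday(date(2011, 1, 2), counts)
```

Each list can then be plotted as its own line, or compared directly, for example all Mondays against each other.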
The figure shows the data in a familiar calendar format. The first
column is Sunday, the second is Monday, and so on, to Saturday at
the end.
With the calendar heat map, along with seeing cycles as you scan
top to bottom, it’s easy to see specific days in rows and columns, so
it’s easier to reference what day of the year each value is for.
Holidays labeled in the figure: Memorial Day (May 30), Independence Day (July 4), Labor Day
(September 5), Thanksgiving, and Christmas (December 25).
Types of Graphs: Slopegraph
Slopegraphs are useful when you have two time periods or points of comparison and
want to quickly show relative increases and decreases or differences across various
categories between the two data points.
• In addition to the points, the lines that connect them give you the visual increase or
decrease in rate of change (via the slope or direction).
We can also draw attention to the single category that decreased over time from
the preceding example.
Visualizing Spatial Data
If you care only about individual locations, you can place dots on a map.
The figure uses bubbles for the airports, sized by the number of outgoing flights.
With area added as a visual cue, you see not only where the busiest airports are, but
also how busy they are relative to each other.
Rather than separate locations, you might want to explore connections between locations.
The brighter a line is, the more flights went between those two airports.
But there’s more you can take away from this data by splitting it into categories.
For example, map flights by airline, and you see the data with a new dimension.
Regions
To maintain the privacy of individuals and to keep personal addresses hidden, it’s common to
aggregate spatial data before releasing it.
Choropleth maps are the most common way to visualize regional data in a spatial context.
You can see high unemployment rates on the West Coast and in the Southeast, and lower rates
in the Midwest.
There are also times when you’re more interested in the aggregates.
For example, the figure shows all recorded UFO sightings between 1906 and 2007, according to
the National UFO Reporting Center.
The figure shows the same data, but as a filled contour map.
A color scale is used to show sightings density, where white means more sightings and black
means none, and varying shades of red are for everything in between.
Cartograms
A challenge with mapping regions, the choropleth map in particular, is that larger regions
always get more visual attention regardless of the data.
Cartograms are one way to remedy this. Location is somewhat preserved, but geographical
areas and boundaries are not.
For example, the figure shows the UFO sighting data as a cartogram. Notice the shrinking of
Texas and swelling of California.
The upside of cartograms is that areas fill the appropriate amount of space, but the trade-off is
less geographic accuracy.
When your data covers regions with a wide range of sizes, this trade-off is worth it, but
when regions are uniform in size, a choropleth map is most likely a better fit.
Spatial data is a lot like categorical data, but with a geographic component.
A Few Variables
The dot plots discussed previously placed time on the horizontal axis and a variable on the
vertical axis.
A scatter plot replaces time with a different variable, so you have two variables plotted against
each other, as shown in the figure.
Multiple Variables
For a more defined view of how two variables are related, you can fit a line through the points,
as shown in the figure.
Don’t mix up causal and correlative relationships.
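Fitting a line through the points of a scatter plot is commonly done with ordinary least squares; whether the figure uses exactly this method is an assumption. The sketch below uses a hypothetical `fit_line` helper and made-up points. The fitted slope only describes how the two variables move together; it says nothing about cause.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```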
The figure shows two ways to incorporate a third variable in a scatter plot.
In the scatter plot on the left, the area of a circle represents assists per game.
The scatter plot on the right uses color to show the same thing. The darker the shade, the more
assists per game.
In the figure, assist leaders tend to sit toward the corner with higher usage percentage
and points, but there’s high variability and there isn’t a clear trend.
The figure shows the same values on the axes, usage percentage and points per game, but uses
area for rebounds and color for assists.
Many Variables
A heat map, as shown in the figure, can be used to translate a table to a set of colors.
It shows the same basketball player data, in addition to several other variables (number of
games played, field goal percentage, and three-point percentage).
Each row represents a player, and darker shades represent relatively higher values.
With players sorted alphabetically, it’s hard to see patterns, but if you sort by a column, say,
points per game, as shown in the figure, relationships are easier to see.
Parallel coordinates plots also arrange variables horizontally, but instead of using color like a
heat map, you use vertical position, as shown in the figure.
On each vertical axis, the highest value is plotted at the top and the lowest at the bottom.
Then lines are drawn left to right, positioned by the variables of each observation.
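Placing the highest value at the top and the lowest at the bottom of each axis amounts to rescaling every variable to a common range. A minimal sketch, assuming min-max scaling to the interval from 0 to 1; the function `scale_columns` and the player rows are hypothetical.

```python
def scale_columns(rows):
    """Min-max scale each column to [0, 1]; constant columns map to 0.5."""
    columns = list(zip(*rows))
    scaled_cols = []
    for col in columns:
        lo, hi = min(col), max(col)
        if hi == lo:
            # No spread in this variable: put every point mid-axis.
            scaled_cols.append([0.5] * len(col))
        else:
            scaled_cols.append([(v - lo) / (hi - lo) for v in col])
    # Transpose back so each row is again one observation.
    return [list(row) for row in zip(*scaled_cols)]


# Hypothetical players: (points, rebounds, assists) per game.
rows = [(30.0, 5.0, 10.0), (20.0, 10.0, 5.0), (10.0, 15.0, 0.0)]
scaled = scale_columns(rows)
```

Each scaled row gives the vertical positions of one line across the axes; the same scaled values could also drive the shades of a heat map.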
When there aren’t clear relationships across the board, it can be hard to see patterns.
There’s high variability from player to player in the figure, so you end up with a jumble of lines.
If you highlight players who averaged five assists or more and gray out everyone else, it’s easier
to see how these types of players perform in other categories.
Whereas the heat map and parallel coordinates plot provide an overview of the data, you might
also want to look at individual data points more closely.
That is, you represent each row of data with its own plot.
It’s also often useful to look at data with different views at the same time.
Using Multiple Views
If you have multiple variables that might be categorical, temporal, and spatial, it is often better
to use multiple charts.
Imagine there are 100 adults in a room. These 100 people have different heights, as shown in
the figure.
They range from 4 feet 10 inches to 6½ feet tall, and the average height for the group is
5 feet 4 inches.
It’s hard to determine how many people there are in various height ranges without counting
each dot.
Distributions
You can get a better idea if you sort everyone from shortest to tallest, as shown in the figure.
The median line at 64 inches is in the middle, where 50 people are shorter and 50 people are
taller.
A better way to see the distribution is to group people into height categories, or bins,
such as those between 4 feet and 4½ feet.
However, the dot plot can take a lot of space, especially if you have many more heights to
show.
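Grouping values into bins can be sketched in a few lines. The function `bin_counts`, the half-foot bin width, and the sample heights below are assumptions for illustration; they are not the 100 heights from the figure.

```python
from collections import Counter


def bin_counts(values, bin_width, start):
    """Count values per bin; each bin covers [lower edge, lower edge + bin_width)."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter()
    for v in values:
        if v < start:
            raise ValueError("value below the first bin")
        k = int((v - start) // bin_width)
        counts[start + k * bin_width] += 1
    # Keys are the lower edges of the non-empty bins, in increasing order.
    return dict(sorted(counts.items()))


# Hypothetical heights in inches, binned by half a foot starting at 4 feet.
heights_in = [58, 60, 61, 63, 64, 64, 66, 70, 71, 78]
bins = bin_counts(heights_in, bin_width=6, start=48)
```

Changing `bin_width` changes the shape of the resulting chart, so it is worth trying more than one value.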
The box plot, as shown in the figure, is an overview visualization that provides a general sense
of distribution.
The box in the middle is defined by the lower and upper quartiles.
The lower quartile represents where one-quarter of the values are lower.
The upper quartile represents where one-quarter of the values are higher.
The range in between the upper and lower quartiles is called the interquartile range.
The outer lines are the lower and upper fences, defined by subtracting 1½ times the
interquartile range from the lower quartile and adding it to the upper quartile, respectively.
If the maximum and minimum values are within the upper and lower fences, the outer lines are
drawn at the maximum and minimum instead.
Otherwise, dots are used to represent any points that fall outside the upper and lower fences
and are considered outliers.
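The quartiles, fences, and outliers described above can be computed with Python's standard library. This is a sketch under assumptions: quartile conventions differ between tools (the "inclusive" method is used here), drawing the outer lines at the most extreme values inside the fences is one common convention, and the function name and sample data are hypothetical.

```python
import statistics


def box_plot_stats(values):
    """Return quartiles, fences, outer-line ends, and outliers for a box plot."""
    # Quartile definitions vary between tools; "inclusive" is one option.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        # Outer lines end at the most extreme values inside the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


# Hypothetical data with one unusually large value.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In this sample, 30 lies above the upper fence and would be drawn as a separate dot.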
The figure shows how the same height data can be represented with different bins.
You can also use multiple histograms to compare distributions.
The key to getting the most out of your data is less about finding the right software than
about learning how to use the tools you have and knowing what questions to ask:
- Consider what data you have and what you can get
- Where the data is from
- How it was derived
- What all the variables mean
Take Home Points
Even if your goal is to visualize data for presentation, exploration can lead to unexpected
insights, which makes for better graphics.