
DATA VISUALIZATION

Data visualization is part art and part science. The challenge is to get the art right
without getting the science wrong.
Data visualization is a powerful tool that transforms raw data into graphical
representations, making complex information more accessible and
understandable. However, not all visualizations are created equal. When done
poorly, data visualizations can be ugly, bad, or outright wrong, leading to
confusion, misinterpretation, or even deliberate misinformation. Let's explore
these categories:

1. Ugly Visualizations
Ugly visualizations are those that are visually unappealing or difficult to interpret
due to poor design choices. This might include:
Cluttered charts: Overloading a chart with too much information, colors, or text.
Poor color schemes: Using colors that clash, are hard to distinguish, or are not
colorblind-friendly.
Misaligned or inconsistent elements: Axes, labels, and chart elements that are not
aligned properly can make a chart look unprofessional and hard to read.

2. Bad Visualizations
Bad visualizations might not necessarily be ugly, but they fail in accurately
conveying the intended message or data. Common issues include:
Inappropriate chart types: Using a pie chart for data that doesn’t add up to 100%,
or using a 3D chart that distorts the data.
Misleading scales: Manipulating the scale of the axes to exaggerate or downplay trends in the data.
Excessive use of effects: Adding shadows, gradients, or 3D effects that distract
from the actual data.
3. Wrong Visualizations
Wrong visualizations are the most problematic because they present data
inaccurately or deceptively, leading to false conclusions. These errors can be
intentional or accidental:
Incorrect data representation: Using the wrong data or misinterpreting the data
source.
Cherry-picking data: Showing only a selective part of the data to support a biased
narrative.
Manipulating visual cues: Altering the size of bars, angles of pie slices, or other
visual elements to mislead the viewer.

Conclusion
Understanding the difference between ugly, bad, and wrong visualizations is
crucial for anyone working with data. Good data visualization requires careful
attention to design principles, ethical considerations, and a clear understanding of
the audience and context. By learning from poor examples, one can develop the
skills needed to create visualizations that are not only attractive but also accurate
and effective in communicating the intended message.
IMPORTANCE OF DATA VISUALIZATION
Data visualization is a method of presenting data in a more informative, visual way. The technique is used to understand trends and patterns in data and to support effective business decisions.
Data visualization is a critical aspect of data science because it allows data
scientists to present complex data in an easily understandable and digestible
format. Here are a few reasons why data visualization is so important in data
science:
1. Helps with data exploration: Data visualization enables data scientists to
explore the data and identify patterns, trends, and outliers quickly. By
visualizing data, one can identify relationships between variables that may
not be apparent when looking at data in a tabular format.
2. Communicates insights: Data visualization helps data scientists
communicate insights to stakeholders in an easy-to-understand format. A
well-designed visualization can help non-technical stakeholders
understand complex data and make informed decisions based on that data.
3. Aids in decision-making: Data visualization is critical to decision-making
because it enables stakeholders to see trends and patterns that may not be
immediately apparent in a raw data set. By presenting data in a visual
format, data scientists can help stakeholders make informed decisions
quickly.
4. Facilitates storytelling: Data visualization is an essential component of data
storytelling. By creating compelling visualizations, data scientists can
create a narrative around the data that resonates with stakeholders.
Overall, data visualization is critical to data science because it helps data
scientists explore data, communicate insights, aid decision-making, and
facilitate storytelling. Without data visualization, it would be challenging to
gain insights from complex data sets and make informed decisions.

HUMAN PERCEPTION
Human perception plays a critical role in data visualization, as it directly impacts
how effectively information is communicated and understood. The human brain
is wired to process visual information more efficiently than raw data, making the
design of visualizations crucial for conveying complex data clearly and
accurately. Understanding the principles of human perception allows for the
creation of visualizations that are not only aesthetically pleasing but also
functional and intuitive.

Visual Encoding and Perceptual Accuracy


One of the key aspects of human perception in data visualization is how we
interpret visual encodings—elements like position, length, size, shape, and color
that represent data values. For example, position and length are more accurately
perceived than area or color. Bar charts, which rely on length and position, are
often easier to interpret than pie charts, which depend on angle and area.
Understanding these perceptual strengths helps in choosing the most effective
visual encodings for a given dataset.
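To make this concrete, here is a minimal sketch in Python with matplotlib (using made-up values) that encodes the same four numbers once by length in a bar chart and once by angle in a pie chart; the ranking is typically easier to read from the bar lengths:

    import matplotlib.pyplot as plt

    # Made-up example values for four categories
    labels = ["A", "B", "C", "D"]
    values = [23, 27, 25, 25]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

    # Bar chart: encodes values as position/length, which we judge accurately
    ax1.bar(labels, values)
    ax1.set_title("Length encoding (bar)")

    # Pie chart: encodes the same values as angle/area, which we judge less accurately
    ax2.pie(values, labels=labels)
    ax2.set_title("Angle encoding (pie)")

    plt.tight_layout()
    plt.show()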

Gestalt Principles
The Gestalt principles of visual perception—such as proximity, similarity, and
continuity—are fundamental in how we group and interpret visual elements. For
instance, objects that are close to each other are perceived as related, while those
that are similar in color or shape are seen as part of the same group. These
principles can guide the design of visualizations to ensure that the intended
relationships and patterns in the data are perceived correctly by the viewer.

Color Perception
Color plays a significant role in data visualization, but it must be used with care
due to the variability in how different people perceive colors. Factors such as
color blindness, cultural differences, and the context in which colors are used can
affect interpretation. Therefore, it’s essential to choose color schemes that are
accessible, consistent, and appropriate for the data being presented. Sequential
color schemes are good for ordered data, while diverging schemes are useful for
highlighting deviations from a central point.
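As a small sketch with matplotlib's built-in colormaps (the names "viridis" and "RdBu" are common choices, used here only as examples), a sequential map suits ordered data while a diverging map highlights deviations from a midpoint:

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.default_rng(0).normal(size=(10, 10))  # synthetic values around 0

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

    # Sequential colormap: lightness changes monotonically, good for ordered data
    im1 = ax1.imshow(np.abs(data), cmap="viridis")
    ax1.set_title("Sequential (viridis)")
    fig.colorbar(im1, ax=ax1)

    # Diverging colormap: neutral midpoint, good for deviations from a center
    im2 = ax2.imshow(data, cmap="RdBu", vmin=-3, vmax=3)
    ax2.set_title("Diverging (RdBu)")
    fig.colorbar(im2, ax=ax2)

    plt.tight_layout()
    plt.show()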

Cognitive Load
Cognitive load refers to the amount of mental effort required to process
information. In data visualization, the goal is to minimize cognitive load by
reducing unnecessary complexity and focusing on clarity. This can be achieved
by avoiding clutter, using intuitive layouts, and providing clear labels and
legends. A well-designed visualization leverages human perception to make data
interpretation as effortless as possible.

Conclusion
Incorporating principles of human perception into data visualization design
enhances the viewer's ability to quickly and accurately interpret data. By aligning
visual encodings with perceptual strengths, applying Gestalt principles, using
color effectively, and minimizing cognitive load, data visualizations become
powerful tools for communication and decision-making.

METHODOLOGY IN DATA VISUALIZATION


The methodology in data visualization involves a systematic approach to
transforming raw data into effective visual representations that communicate
insights clearly and accurately. This process typically includes several key stages:
defining objectives, understanding the data, choosing visual encodings, designing
the visualization, and refining the output based on feedback.

1. Defining Objectives
The first step in the methodology is to clearly define the objectives of the
visualization. This involves understanding the purpose of the
visualization—whether it's to explore data, explain findings, persuade an
audience, or support decision-making. Identifying the target audience is
also crucial, as different audiences may require different levels of detail or
types of visual representation.

2. Understanding the Data


Before creating a visualization, it's essential to thoroughly understand the
data. This includes exploring the dataset to identify key variables, patterns,
and potential anomalies. Data cleaning is often necessary to handle missing
values, outliers, and inconsistencies. Understanding the data helps in
selecting the most appropriate type of visualization and ensures that the
final output accurately reflects the underlying information.
3. Choosing Visual Encodings
Visual encoding involves selecting the right visual elements (e.g., position,
size, shape, color) to represent the data. The choice of encoding depends
on the nature of the data and the type of insights one wishes to convey. For
example, bar charts are suitable for comparing quantities, while line charts
are ideal for visualizing trends over time. The goal is to use encodings that
align with human perceptual strengths, ensuring that the visualization is
both intuitive and informative.

4. Designing the Visualization


Designing the visualization is where the data and visual encodings come
together. This stage involves creating a clear and aesthetically pleasing
layout, using appropriate color schemes, labeling axes and data points, and
ensuring that the visualization tells a coherent story. Good design
minimizes cognitive load, making it easier for viewers to understand the
data without unnecessary effort.

5. Refining and Iterating


The final stage is refining the visualization based on feedback and iteration.
This may involve making adjustments to improve clarity, remove clutter,
or enhance the visual appeal. User feedback is invaluable in this stage, as
it provides insights into how well the visualization communicates its
intended message. Iteration helps in fine-tuning the visualization to better
meet the needs of the audience.

Conclusion
A structured methodology in data visualization ensures that the final product is
not only visually appealing but also effective in communicating the intended
message. By following these steps—defining objectives, understanding the data,
choosing the right visual encodings, designing carefully, and refining through
feedback—one can create powerful visualizations that turn complex data into
actionable insights.
SEVEN STAGES OF DATA VISUALIZATION
The seven stages of data visualization represent a structured approach to
transforming raw data into a visual format that is both informative and easy to
understand. These stages guide the entire process, from data collection to the final
presentation, ensuring that the visualization effectively communicates the
intended message. Here’s an overview of each stage:

1. Acquire
The first stage involves collecting the data that will be visualized. This
could come from various sources such as databases, spreadsheets, APIs, or
manual data entry. The key is to gather all relevant data needed to address
the questions or objectives driving the visualization. This stage may also
include data extraction and integration from multiple sources.

2. Parse
Once the data is acquired, the next step is parsing it, which involves
structuring and organizing the data into a format that is suitable for analysis
and visualization. This might involve converting raw data into a
standardized format, categorizing variables, and ensuring that the data is
clean and ready for further processing. Parsing helps in understanding the
relationships within the data and preparing it for subsequent stages.

3. Filter
In this stage, unnecessary or irrelevant data is filtered out to focus on the
most pertinent information. This could involve removing outliers, focusing
on specific data subsets, or excluding noise that might obscure the
visualization's key insights. Filtering ensures that the visualization remains
focused on the main message and is not cluttered with extraneous details.
4. Mine
Mining involves analyzing the data to uncover patterns, trends,
correlations, or insights that may not be immediately apparent. This stage
often employs statistical methods, algorithms, or exploratory data analysis
techniques to extract meaningful information from the data. The findings
from this stage will guide the design of the visualization.

5. Represent
Representation is the stage where the data is mapped onto visual forms
such as charts, graphs, maps, or other graphical elements. The choice of
representation depends on the nature of the data and the type of insights to
be conveyed. This stage is critical as it transforms abstract data into a visual
format that can be more easily interpreted by the human brain.

6. Refine
The refine stage involves fine-tuning the visualization for clarity,
aesthetics, and accuracy. This may include adjusting scales, colors, labels,
and layouts to enhance readability and ensure that the visualization
communicates its message effectively. Refinement also involves
eliminating any elements that might cause confusion or misinterpretation.

7. Interact
The final stage is adding interactivity to the visualization, if applicable.
Interactive visualizations allow users to explore the data more deeply by
hovering, clicking, zooming, or filtering. This stage is particularly
important in complex visualizations where users may need to investigate
specific details or view data from different perspectives.

Conclusion
The seven stages of data visualization provide a comprehensive framework for
creating visual representations of data that are clear, accurate, and effective. By
following these stages—acquire, parse, filter, mine, represent, refine, and
interact—one can systematically transform raw data into powerful visual insights
that support better understanding and decision-making.
DATA VISUALIZATION TOOLS
Data visualization tools are essential for transforming raw data into meaningful
and interactive visual representations. These tools enable users to create a variety
of charts, graphs, and dashboards to analyze and communicate insights
effectively. Here are some of the most popular and widely used data visualization
tools:

1. Tableau
Tableau is renowned for its powerful and user-friendly interface, allowing
users to create interactive and shareable dashboards. It supports a wide
range of data sources and provides various visualization options, from
simple bar charts to complex scatter plots. Tableau’s drag-and-drop
functionality and robust data blending capabilities make it suitable for both
beginners and advanced users.

2. Microsoft Power BI
Power BI integrates seamlessly with Microsoft products and provides a
comprehensive suite of tools for data visualization and business
intelligence. It offers a variety of data connectors, real-time data access,
and customizable reports. Power BI’s user-friendly interface and
integration with Excel make it a popular choice for businesses looking to
leverage their data for strategic decision-making.

3. D3.js
D3.js (Data-Driven Documents) is a JavaScript library for creating
dynamic and interactive visualizations on the web. Unlike other tools,
D3.js requires coding knowledge and offers high flexibility in terms of
design and functionality. It allows for custom visualizations and is highly
suitable for developers looking to create unique, interactive visual
representations.
4. Matplotlib
Matplotlib is a widely used Python library for creating static, animated, and
interactive plots and graphs. It is highly customizable and is often used in
scientific and engineering applications. Matplotlib integrates well with
other Python libraries, such as NumPy and Pandas, making it a powerful
tool for data analysis and visualization.
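A minimal sketch of Matplotlib in use together with NumPy, on synthetic data:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 100)
    y = np.sin(x)

    plt.plot(x, y, label="sin(x)")  # a simple line plot
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()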

5. Google Data Studio


Google Data Studio offers an easy-to-use interface for creating interactive
dashboards and reports. It integrates well with other Google products like
Google Analytics and Google Sheets. Its collaborative features and ease of
sharing make it a popular choice for teams and organizations.

Conclusion
Choosing the right data visualization tool depends on factors such as the
complexity of the data, the desired level of interactivity, and user expertise. Tools
like Tableau and Power BI are excellent for general business needs, while D3.js
and Matplotlib cater to more specialized or technical requirements. Google Data
Studio is a good option for those who need seamless integration with Google’s
ecosystem.

MAPPING DATA ONTO AESTHETICS


The key insight is the following: all data visualizations map data values into
quantifiable features of the resulting graphic. We refer to these features as
aesthetics.

Aesthetics And Types of Data


A critical component of every graphical element is of course its position, which
describes where the element is located. In standard 2D graphics, we describe
positions by an x and y value, but other coordinate systems and one- or three-
dimensional visualizations are possible. Next, all graphical elements have a
shape, a size, and a color.
All aesthetics fall into one of two groups: those that can represent continuous data
and those that cannot. Continuous data values are values for which arbitrarily fine
intermediates exist. Between any two durations, say 50 seconds and 51 seconds,
there are arbitrarily many intermediates, such as 50.5 seconds, 50.51 seconds,
50.50001 seconds, and so on.

We’ll consider the types of data we may want to represent in our visualization.
You may think of data as numbers, but numerical values are only two out of
several types of data we may encounter. In addition to continuous and discrete
numerical values, data can come in the form of discrete categories, in the form of
dates or times, and as text. When data is numerical we also call it quantitative and
when it is categorical we call it qualitative. Variables holding qualitative data are
factors, and the different categories are called levels.

Scales Map Data Values onto Aesthetics


To map data values onto aesthetics, we need to specify which data values
correspond to which specific aesthetics values. For example, if our graphic has
an x axis, then we need to specify which data values fall onto particular positions
along this axis. Similarly, we may need to specify which data values are
represented by particular shapes or colors. This mapping between data values and
aesthetics values is created via scales. A scale defines a unique mapping between
data and aesthetics. Importantly, a scale must be one-to-one, such that for each
specific data value there is exactly one aesthetics value and vice versa. If a scale
isn’t one-to-one, then the data visualization becomes ambiguous.

Let’s put things into practice. We can take a dataset of temperature normals and
map temperature onto the y axis, day of the year onto the x axis, and location
onto color, and visualize these aesthetics with solid lines. The result is a
standard line plot showing the temperature normals at the four locations as they
change during the year. This is a fairly standard visualization for a temperature
curve and likely the visualization most data scientists would intuitively choose
first. However, it is up to us which variables we map onto which scales.
For example, instead of mapping temperature onto the y axis and location onto
color, we can do the opposite.
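A minimal sketch of this idea with matplotlib, using made-up monthly temperature values for four hypothetical locations (the actual dataset referenced above is not reproduced here). The first panel maps temperature onto the y axis and location onto color; the second panel swaps the mappings, placing location on the y axis and encoding temperature as color:

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up monthly temperature normals (deg F) for four hypothetical locations
    months = np.arange(1, 13)
    temps = {
        "Location A": 40 + 30 * np.sin((months - 4) / 12 * 2 * np.pi),
        "Location B": 55 + 20 * np.sin((months - 4) / 12 * 2 * np.pi),
        "Location C": 70 + 10 * np.sin((months - 4) / 12 * 2 * np.pi),
        "Location D": 75 + 5 * np.sin((months - 4) / 12 * 2 * np.pi),
    }

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(7, 7))

    # Mapping 1: month -> x, temperature -> y, location -> line color
    for name, values in temps.items():
        ax1.plot(months, values, label=name)
    ax1.set_xlabel("Month")
    ax1.set_ylabel("Temperature (F)")
    ax1.legend()

    # Mapping 2 (the swap): month -> x, location -> y, temperature -> color
    vmin = min(v.min() for v in temps.values())
    vmax = max(v.max() for v in temps.values())
    for i, (name, values) in enumerate(temps.items()):
        sc = ax2.scatter(months, [i] * len(months), c=values, cmap="viridis",
                         marker="s", s=200, vmin=vmin, vmax=vmax)
    ax2.set_yticks(range(len(temps)), labels=list(temps))
    ax2.set_xlabel("Month")
    fig.colorbar(sc, ax=ax2, label="Temperature (F)")

    plt.tight_layout()
    plt.show()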

VISUALIZING AMOUNTS
The main emphasis in visualizing amounts will be on the magnitude of the
quantitative values. The standard visualization in this scenario is the bar plot,
which has several variations, including simple bars as well as grouped and
stacked bars. Alternatives to the bar plot are the dot plot and the heatmap.
Bar Plots
Data is commonly visualized with vertical bars. Consider, for example, the highest-grossing movies for a given weekend. For each movie, we draw a bar
that starts at zero and extends all the way to the dollar value for that movie’s
weekend gross. This visualization is called a bar plot or bar chart.

[Figure: Highest-grossing movies for the weekend of December 22–24, 2017, displayed as a bar plot.]
One problem we commonly encounter with vertical bars is that the labels
identifying each bar take up a lot of horizontal space. To save horizontal space,
we could place the bars closer together and rotate the labels, but the resulting
plot is awkward and difficult to read. Whenever the labels are too long to place
horizontally, they also don’t look good rotated.

[Figure: Highest-grossing movies for the weekend of December 22–24, 2017, displayed as a bar plot with rotated axis tick labels.]
The better solution for long labels is usually to swap the x and y axes, so that the
bars run horizontally. After swapping the axes, we obtain a compact figure in
which all visual elements, including all text, are horizontally oriented. As a result,
the figure is much easier to read.

[Figure: Highest-grossing movies for the weekend of December 22–24, 2017, displayed as a horizontal bar plot.]
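A minimal sketch of a sorted horizontal bar plot with matplotlib; the titles and gross values below are placeholders, not the actual weekend figures:

    import matplotlib.pyplot as plt

    # Placeholder labels and values; real weekend grosses are not reproduced here
    movies = ["Movie with a fairly long title A", "Movie B",
              "Another long title C", "Movie D", "Movie E"]
    gross = [71.6, 50.0, 22.0, 15.0, 10.0]  # millions of dollars (made up)

    # Sort so the longest bar is on top, then draw horizontal bars
    order = sorted(range(len(gross)), key=lambda i: gross[i])
    plt.barh([movies[i] for i in order], [gross[i] for i in order])
    plt.xlabel("Weekend gross (million USD)")
    plt.tight_layout()
    plt.show()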
Some plotting programs arrange bars by default in alphabetical order of the labels,
and other similarly arbitrary arrangements are possible. In general, the resulting
figures are more confusing and less intuitive than figures where bars are arranged
in order of their size.

[Figure: Highest-grossing movies for the weekend of December 22–24, 2017, displayed as a horizontal bar plot.]
We should only rearrange bars, however, when there is no natural ordering to the
categories the bars represent. Whenever there is a natural ordering (i.e., when our
categorical variable is an ordered factor), we should retain that ordering in the
visualization. For example, the figure below shows the median annual income in the
US by age groups. In this case, the bars should be arranged in order of increasing
age. Sorting by bar height while shuffling the age groups makes no sense.

[Figure: 2016 median US annual household income versus age group. The 45-to-54-year age group has the highest median income.]
[Figure: 2016 median US annual household income versus age group, sorted by income.]
Grouped and Stacked Bars
If we are interested in two categorical variables at the same time, we can
visualize the dataset with a grouped bar plot. In a grouped bar plot, we draw a group of
bars at each position along the x axis, determined by one categorical variable, and
then we draw bars within each group according to the other categorical variable.
Grouped bar plots show a lot of information at once, and they can be confusing.
It is difficult to compare median incomes across age groups for a given racial
group. This figure is only appropriate if we are primarily interested in the
differences in income levels among racial groups, separately for specific age
groups.
[Figure: 2016 median US annual household income versus age group and race. Age groups are shown along the x axis, and for each age group there are four bars, corresponding to the median income of Asian, white, Hispanic, and black people, respectively.]
If we care more about the overall pattern of income levels among racial groups,
it may be preferable to show race along the x axis and show ages as distinct bars
within each racial group.

[Figure: 2016 median US annual household income versus age group and race.]
The encoding by position is easy to read while the encoding by bar color requires
more mental effort, as we have to mentally match the colors of the bars against
the colors in the legend. We can avoid this added mental effort by showing four
separate regular bar plots rather than one grouped bar plot.
[Figure: 2016 median US annual household income versus age group and race. Instead of displaying this data as a grouped bar plot, the data is shown as four separate regular bar plots.]
Instead of drawing groups of bars side-by-side, it is sometimes preferable to stack
bars on top of each other. Stacking is useful when the sum of the amounts
represented by the individual stacked bars is in itself a meaningful amount. So,
while it would not make sense to stack the median income values shown earlier,
stacking is appropriate when the individual bars represent counts. For example, in a
dataset of people, we can either count men and women separately or we can count
them together. If we stack a bar representing a count of women on top of a bar
representing a count of men, then the combined bar height represents the total
count of people regardless of gender.

[Figure: Numbers of female and male passengers on the Titanic traveling in 1st, 2nd, and 3rd class.]
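A minimal sketch of stacked counts with matplotlib; the passenger counts below are placeholders, not the actual Titanic numbers:

    import matplotlib.pyplot as plt

    classes = ["1st", "2nd", "3rd"]
    men = [180, 170, 500]    # placeholder counts
    women = [140, 100, 200]  # placeholder counts

    # Stack the female counts on top of the male counts; the total bar height
    # is then the total passenger count per class
    plt.bar(classes, men, label="male")
    plt.bar(classes, women, bottom=men, label="female")
    plt.ylabel("Passenger count")
    plt.legend()
    plt.show()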
Dot Plots and Heatmaps
One important limitation of bars is that they need to start at zero, so that the bar
length is proportional to the amount shown. For some datasets, this can be
impractical or may obscure key features. In this case, we can indicate amounts by
placing dots at the appropriate locations along the x or y axis. The figure below
demonstrates this visualization approach for a dataset of life expectancies in 25
countries in the Americas. The citizens of these countries have life expectancies
between 60 and 81 years, and each individual life expectancy value is shown with
a blue dot at the appropriate location along the x axis. By limiting the axis range
to the interval from 60 to 81 years, the figure focuses attention on the
differences among the countries.

[Figure: Life expectancies of countries in the Americas, for the year 2007.]


If we had used bars instead of dots, we’d have made a much less compelling
figure. Because the bars are so long in this figure, and they all have nearly the
same length, the eye is drawn to the middle of the bars rather than to their
endpoints, and the figure fails to convey its message.

[Figure: Life expectancies of countries in the Americas, for the year 2007, shown as bars.]
Regardless of whether we use bars or dots, however, we need to pay attention to
the ordering of the data values. If we instead ordered them alphabetically, we’d
end up with a disordered cloud of points that is confusing and fails to convey a
clear message.

[Figure: Life expectancies of countries in the Americas.]


As an alternative to mapping data values onto positions via bars or dots, we can
map data values onto colors. Such a figure is called a heatmap. We need to pay
attention to the ordering of the categorical data values when making heatmaps.

[Figure: Internet adoption over time, for select countries. Color represents the percent of internet users for the respective country and year. Countries were ordered by percent internet users in 2016.]
[Figure: Internet adoption over time, for select countries. Countries were ordered by the year in which their internet usage first exceeded 20%.]
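A minimal heatmap sketch with matplotlib, using synthetic percentages in place of the real internet-adoption data and ordering rows by their final-year value:

    import numpy as np
    import matplotlib.pyplot as plt

    countries = ["Country A", "Country B", "Country C", "Country D"]
    years = np.arange(2000, 2017)

    # Synthetic adoption percentages, one row per country
    rng = np.random.default_rng(1)
    values = np.cumsum(rng.uniform(0, 8, size=(len(countries), len(years))), axis=1)
    values = np.clip(values, 0, 100)

    # Order rows by the value in the final year, as discussed above
    order = np.argsort(values[:, -1])

    fig, ax = plt.subplots()
    im = ax.imshow(values[order], aspect="auto", cmap="Blues")
    ax.set_xticks(range(0, len(years), 4), labels=[str(y) for y in years[::4]])
    ax.set_yticks(range(len(countries)), labels=[countries[i] for i in order])
    fig.colorbar(im, ax=ax, label="Internet users (%)")
    plt.show()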
VISUALIZING DISTRIBUTIONS: HISTOGRAMS AND DENSITY
PLOTS
Visualizing a Single Distribution
We can visualize a distribution by drawing filled rectangles whose heights
correspond to the counts and whose widths correspond to the width of the age bins
(Figure 7-1). Such a visualization is called a histogram.
Because histograms are generated by binning the data, their exact visual
appearance depends on the choice of the bin width. Most visualization programs
that generate histograms will choose a bin width by default, but chances are that
bin width is not the most appropriate one for any histogram you may want to
make. It is therefore critical to always try different bin widths to verify that the
resulting histogram reflects the underlying data accurately.
If the bin width is too small, then the histogram becomes overly peaky and
visually busy and the main trends in the data may be obscured. On the other hand,
if the bin width is too large, then smaller features in the distribution of the data
may disappear.
In a density plot, we attempt to visualize the underlying probability distribution
of the data by drawing an appropriate continuous curve. This curve needs to be
estimated from the data, and the most commonly used method for this estimation
procedure is called kernel density estimation. In kernel density estimation, we
draw a continuous curve (the kernel) with a small width (controlled by a
parameter called bandwidth) at the location of each data point, and then we add
up all these curves to obtain the final density estimate. The most widely used
kernel is a Gaussian kernel (i.e., a Gaussian bell curve), but there are many other
choices.
The bandwidth parameter behaves similarly to the bin width in histograms. If the
bandwidth is too small, then the density estimate can become overly peaky and
visually busy and the main trends in the data may be obscured. On the other hand,
if the bandwidth is too large, then smaller features in the distribution of the data
may disappear.
Density plots tend to be quite reliable and informative for large datasets but can
be misleading for datasets of only a few points.
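A minimal sketch of both effects with NumPy, SciPy, and matplotlib, on synthetic data; SciPy's bw_method argument plays the role of the bandwidth parameter described above:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(0)
    ages = np.concatenate([rng.normal(30, 8, 600), rng.normal(60, 10, 400)])

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

    # Histograms: try several bin counts, since the default is rarely best
    for bins in (10, 30, 100):
        ax1.hist(ages, bins=bins, histtype="step", label=f"{bins} bins")
    ax1.legend()
    ax1.set_title("Histogram, varying bin count")

    # Kernel density estimates: the bandwidth controls the smoothness
    xs = np.linspace(ages.min(), ages.max(), 200)
    for bw in (0.1, 0.3, 1.0):
        kde = gaussian_kde(ages, bw_method=bw)
        ax2.plot(xs, kde(xs), label=f"bw_method={bw}")
    ax2.legend()
    ax2.set_title("KDE, varying bandwidth")

    plt.tight_layout()
    plt.show()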

Kernel density estimates have one pitfall that we need to be aware of: they have
a tendency to produce the appearance of data where none exists, in particular in
the tails.

[Figure: Kernel density estimates can extend the tails of the distribution into areas where no data exists and no data is even possible.]
Visualizing Multiple Distributions at the Same Time
One option is a stacked histogram, where we draw the histogram bars for one
category on top of the bars for another category, in a different color.

There are two key problems here. First, it is never entirely clear where exactly
the bars begin. Second, the bar heights for the female counts cannot be directly
compared to each other, because the bars all start at a different height.
We can try to address these problems by having all bars start at zero and making
the bars partially transparent.

This approach generates new problems. Now it appears that there are actually
three different groups, not just two, and we’re still not entirely sure where each
bar starts and ends. Overlapping histograms don’t work well because a
semitransparent bar drawn on top of another tends to not look like a semitransparent
bar but instead like a bar drawn in a different color.
Overlapping density plots don’t typically have the problem that overlapping
histograms have, because the continuous density lines help the eye keep the
distributions separate.
When we want to visualize exactly two distributions, we can also make two
separate histograms, rotate them by 90 degrees, and have the bars in one
histogram point in the opposite direction of the other. This trick is commonly
employed when visualizing age distributions, and the resulting plot is usually
called an age pyramid.
This trick does not work when there are more than two distributions we want to
visualize at the same time. For multiple distributions, histograms tend to become
confusing, whereas density plots work well as long as the distributions are
somewhat distinct and contiguous.
VISUALIZING PROPORTIONS
A Case for Pie Charts
A pie chart breaks a circle into slices such that the area of each slice is
proportional to the fraction of the total it represents. The same procedure can be
performed on a rectangle, and the result is a stacked bar chart. Depending on
whether we slice the bar vertically or horizontally, we obtain vertically stacked
bars or horizontally stacked bars.

Stacked bars, on the other hand, can work for side-by-side comparisons of
multiple conditions or in a time series, and side-by-side bars are preferred when
we want to directly compare the individual fractions to each other.
A Case for Side-by-Side Bars
When we visualize a dataset of company market shares over time with pie charts, it is difficult to see specific trends.
It appears that the market share of company A is growing and the one of company
E is shrinking, but beyond this one observation we can’t tell what’s going on. In
particular, it is unclear how exactly the market shares of the different companies
compare within each year.
The picture becomes a little clearer when we switch to stacked bars (Figure 10-5).
Now the trends of a growing market share for company A and a shrinking market
share for company E are clearly visible. However, the relative market shares of
the five companies within each year are still hard to compare.

For this dataset, side-by-side bars are the best choice. This visualization highlights that both
companies A and B have increased their market share from 2015 to 2017 while
both companies D and E have reduced theirs. It also shows that market shares
increase sequentially from company A to E in 2015 and similarly decrease in
2017.
A Case for Stacked Bars and Stacked Densities
To visualize how the proportion of women in the Rwandan parliament has
changed over time, we can draw a sequence of stacked bar graphs. This figure
provides an immediate visual representation of the changing proportions over
time. To help the reader see exactly when the majority turned female, a dashed
horizontal line at 50% has been drawn.

If we want to visualize how proportions change in response to a continuous
variable, we can switch from stacked bars to stacked densities. Stacked densities
can be thought of as the limiting case of infinitely many, infinitely small stacked
bars arranged side-by-side. The densities in stacked density plots are typically
obtained from kernel density estimation.
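A minimal sketch of stacked proportions with matplotlib, using made-up percentages in place of the parliament data and a dashed line marking the 50% threshold:

    import numpy as np
    import matplotlib.pyplot as plt

    years = np.arange(1997, 2017)
    # Made-up share of women, rising over time
    women = np.linspace(0.2, 0.6, len(years))
    men = 1 - women

    # Stack the two proportions; together they always sum to 100%
    plt.stackplot(years, women * 100, men * 100, labels=["women", "men"])
    plt.axhline(50, linestyle="--", color="black")  # majority threshold
    plt.ylabel("Share of parliament (%)")
    plt.legend(loc="upper left")
    plt.show()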

Visualizing Proportions Separately as Parts of the Total


We can make a separate plot for each part and, in each plot, show the respective
part relative to the whole.
The overall age distribution in the dataset is shown as the shaded gray areas, and
the age distributions for each health status are shown in blue. This figure
highlights that in absolute terms, the number of people with excellent or good
health declines past ages 30–40, while the number of people with fair health
remains approximately constant across all ages.
VISUALIZING ASSOCIATIONS AMONG TWO OR MORE
QUANTITATIVE VARIABLES
To plot the relationship of just two such variables, such as the height and weight,
we will normally use a scatterplot. If we want to show more than two variables at
once, we may opt for a bubble chart, a scatterplot matrix, or a correlogram.
Scatterplots
Consider a dataset of birds, which contains information such as the head length (measured from the tip
of the bill to the back of the head), the skull size (head length minus bill length),
and the body mass of each bird.
In this plot, head length is shown along the y axis and body mass along the x axis,
and each bird is represented by one dot. (Note the terminology: we say that we
plot the variable shown along the y axis against the variable shown along the x
axis.) The dots form a dispersed cloud (hence the term scatterplot).
We can color the points in the scatterplot by the sex of the bird.

Since we are already using the x position for body mass, the y position for head
length, and the dot color for bird sex, we need another aesthetic to which we can
map skull size. One option is to use the size of the dots, resulting in a
visualization called a bubble chart.

As an alternative to a bubble chart, it may be preferable to show an
all-against-all matrix of scatterplots, where each individual plot shows two data dimensions.
Correlograms
It is more useful to quantify the amount of association between pairs of variables
and visualize these quantities rather than the raw data. One common way to do
this is to calculate correlation coefficients. The correlation coefficient r is a
number between –1 and 1.
A value of r = 0 means there is no association whatsoever, and a value of either 1
or –1 indicates a perfect association. The sign of the correlation coefficient
indicates whether the variables are correlated (larger values in one variable
coincide with larger values in the other) or anticorrelated (larger values in one
variable coincide with smaller values in the other).

Visualizations of correlation coefficients are called correlograms.
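A minimal correlogram sketch with NumPy and matplotlib, using synthetic variables that are deliberately correlated, anticorrelated, or independent:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    n = 200
    x = rng.normal(size=n)
    data = np.column_stack([
        x,                                    # variable 1
        x + rng.normal(scale=0.5, size=n),    # correlated with variable 1
        -x + rng.normal(scale=0.5, size=n),   # anticorrelated with variable 1
        rng.normal(size=n),                   # independent
    ])

    r = np.corrcoef(data, rowvar=False)  # correlation matrix, entries in [-1, 1]

    fig, ax = plt.subplots()
    im = ax.imshow(r, cmap="RdBu", vmin=-1, vmax=1)  # diverging map centered at 0
    names = ["v1", "v2", "v3", "v4"]
    ax.set_xticks(range(4), labels=names)
    ax.set_yticks(range(4), labels=names)
    fig.colorbar(im, ax=ax, label="correlation r")
    plt.show()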


Dimension Reduction
Dimension reduction relies on the key insight that most high-dimensional datasets
consist of multiple correlated variables that convey overlapping information.
Such datasets can be reduced to a smaller number of key dimensions without loss
of much critical information.
Consider a dataset of multiple physical traits of people, including quantities such
as each person’s height and weight, the lengths of their arms and legs, the
circumferences of their waist, hips, and chest, etc. We can understand intuitively
that all these quantities will relate first and foremost to the overall size of each
person. All else being equal, a larger person will be taller, weigh more, have
longer arms and legs, and have larger waist, hip, and chest circumferences. The
next important dimension is going to be the person’s sex. Male and female
measurements are substantially different for persons of comparable size.
The most widely used dimension reduction technique is principal components analysis (PCA). PCA
introduces a new set of variables, called principal components (PCs), by linear
combination of the original variables in the data, standardized to zero mean and
unit variance. The first component captures the largest possible amount of
variation in the data and subsequent components capture less.
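A minimal PCA sketch with scikit-learn on synthetic body-measurement-style data; the standardization step mirrors the zero-mean, unit-variance requirement mentioned above:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(3)
    size = rng.normal(size=300)                      # latent "overall size" factor
    height = 170 + 10 * size + rng.normal(size=300)
    weight = 70 + 12 * size + rng.normal(size=300)
    arm = 60 + 4 * size + rng.normal(size=300)
    X = np.column_stack([height, weight, arm])

    # Standardize to zero mean and unit variance, then project onto the PCs
    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=2)
    scores = pca.fit_transform(X_std)

    # The first component should capture most of the shared "size" variation
    print(pca.explained_variance_ratio_)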

Paired Data
A special case of multivariate quantitative data is paired data: data where there are two
or more measurements of the same quantity under slightly different conditions.
For paired data, we need to choose visualizations that highlight any differences
between the paired measurements.
An excellent choice in this case is a simple scatterplot on top of a diagonal line
marking x = y. In such a plot, if the only difference between the two measurements
of each pair is random noise, then all points in the sample will be scattered
symmetrically around this line.

In a slopegraph, we draw individual measurements as dots arranged into two
columns and indicate pairings by connecting the paired dots with a line. The slope
of each line highlights the magnitude and direction of change.
Slopegraphs have one important advantage over scatterplots: they can be used to
compare more than two measurements at a time.
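A minimal slopegraph sketch with matplotlib, using made-up paired measurements at two time points:

    import matplotlib.pyplot as plt

    # Made-up paired measurements, e.g. one value per subject in 2015 and 2017
    before = [3.1, 4.0, 2.5, 5.2, 3.8]
    after = [3.6, 3.7, 3.0, 4.8, 4.5]

    # One line per pair; the slope shows the magnitude and direction of change
    for b, a in zip(before, after):
        plt.plot([0, 1], [b, a], marker="o", color="gray")
    plt.xticks([0, 1], ["2015", "2017"])
    plt.ylabel("Measured value")
    plt.show()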
VISUALIZING TIME SERIES AND OTHER FUNCTIONS OF AN
INDEPENDENT VARIABLE
For time series data, we can arrange the points in order of increasing time and define a predecessor
and successor for each data point. We frequently want to visualize this temporal
order, and we do so with line graphs. Line graphs are not limited to time series.
Individual Time Series

The dots are spaced evenly along the x axis, and there is a defined order among
them. Each dot has exactly one left and one right neighbor (except the leftmost
and rightmost points, which have only one neighbor each). We can visually
emphasize this order by connecting neighboring points with lines. Such a plot is
called a line graph.
Some people object to drawing lines between points because the lines do not
represent observed data. In particular, if there are only a few observations spaced
far apart, had observations been made at intermediate times they would probably
not have fallen exactly onto the lines shown. Thus, in a sense, the lines correspond
to made-up data.
Without dots, the figure places more emphasis on the overall trend in the data and
less on individual observations. A figure without dots is also visually less busy.
In general, the denser the time series, the less important it is to show individual
observations with dots.

We can also fill the area under the curve with a solid color. This visualization is
only valid if the y axis starts at zero, so that the height of the shaded area at each
time point represents the data value at that time point.
Multiple Time Series and Dose–Response Curves
We often have multiple time courses that we want to show at once. In this case,
we have to be more careful in how we plot the data, because the figure can become
confusing or difficult to read.

The figure below represents an acceptable visualization of the preprints dataset.
However, the separate legend creates unnecessary cognitive load. We can reduce
this cognitive load by labeling the lines directly.

Time Series of Two or More Response Variables


If we have more than one response variable, we can visualize such data as two
separate line graphs stacked on top of each other. This plot directly shows the two
variables of interest, and it is straightforward to interpret. However, because the
two variables are shown as separate line graphs, drawing comparisons between
them can be difficult.
We can plot the two variables against each other, drawing a path that leads from
the earliest time point to the latest. Such a visualization is called a connected
scatterplot, because we are technically making a scatterplot of the two variables
against each other and then are connecting neighboring points.

We can use gradual darkening of the color to indicate direction. Alternatively,
one could draw arrows along the path.
When drawing a connected scatterplot, it is important that we indicate both the
direction and the temporal scale of the data. Without such hints, the plot can turn
into a meaningless scribble.
VISUALIZING TRENDS
There are two fundamental approaches to determining a trend: we can either
smooth the data by some method, such as a moving average, or we can fit a curve
with a defined functional form and then draw the fitted curve.
Smoothing
The act of smoothing produces a function that captures key patterns in the data
while removing irrelevant minor detail or noise. Financial analysts usually
smooth stock market data by calculating moving averages.
To generate a moving average, we take a time window, say the first 20 days in the
time series, calculate the average price over these 20 days, then move the time
window by one day, so it now spans the 2nd to 21st days. We then calculate the
average over these 20 days, move the time window again, and so on. The result
is a new time series consisting of a sequence of averaged prices.
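A minimal sketch of a 20-day moving average with pandas, using a synthetic price series in place of real stock data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    days = pd.date_range("2023-01-01", periods=250, freq="B")  # business days
    price = pd.Series(100 + np.cumsum(rng.normal(size=250)), index=days)

    # Rolling 20-day window average; the first 19 values are NaN, which is the
    # "shorter curve" limitation discussed below
    smooth = price.rolling(window=20).mean()

    plt.plot(price.index, price, color="lightgray", label="daily price")
    plt.plot(smooth.index, smooth, color="black", label="20-day moving average")
    plt.legend()
    plt.show()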

The moving average is the most simplistic approach to smoothing, and it has
some obvious limitations. First, it results in a smoothed curve that is shorter than
the original curve. Parts are missing at either the beginning or the end or both.
Second, even with a large averaging window, a moving average is not necessarily
that smooth. It may exhibit small bumps and wiggles even though larger-scale
smoothing has been achieved.
An alternative is locally estimated scatterplot smoothing (LOESS), which fits
low-degree polynomials to subsets of the data.
LOESS is a very popular smoothing approach because it tends to produce results
that look right to the human eye. However, it requires the fitting of many separate
regression models. This makes it slow for large datasets, even on modern
computing equipment.
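A minimal LOESS sketch using the lowess function from statsmodels, on synthetic noisy data; the frac parameter controls how much of the data each local fit uses:

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(5)
    x = np.linspace(0, 10, 200)
    y = np.sin(x) + rng.normal(scale=0.4, size=200)

    # lowess returns an array of (x, smoothed y) pairs, sorted by x
    smoothed = lowess(y, x, frac=0.2)

    plt.scatter(x, y, s=10, color="lightgray")
    plt.plot(smoothed[:, 0], smoothed[:, 1], color="black")
    plt.show()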
A spline is a piecewise polynomial function that is highly flexible yet always
looks smooth. When working with splines, we will encounter the term knot. The
knots in a spline are the endpoints of the individual spline segments. If we fit a
spline with k segments, we need to specify k + 1 knots. While spline fitting is
computationally efficient, in particular if the number of knots is not too large,
splines have their own downsides.
The smoothing method may be referred to as a generalized additive model
(GAM), which is a superset of all these types of smoothers. It is important to be
aware that the output of the smoothing feature is dependent on the specific GAM
model that is fit.

[Figure: (a) LOESS smoother. (b) Cubic regression splines with 5 knots. (c) Thin-plate regression spline with 3 knots. (d) Gaussian process spline with 6 knots.]
Showing Trends with a Defined Functional Form
Whenever possible, it is preferable to fit a curve with a specific functional
form that is appropriate for the data and that uses parameters with clear
meaning.

[Figure: The dashed black line represents an exponential fit to the data.]
[Figure: The dashed black line represents the exponential fit, and the solid black line represents a linear fit to log-transformed data.]
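A minimal sketch of this trick with NumPy: fitting a straight line to log-transformed values is equivalent to fitting an exponential to the original values (synthetic data):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(6)
    x = np.linspace(0, 10, 50)
    y = 2.0 * np.exp(0.3 * x) * rng.lognormal(sigma=0.1, size=50)  # noisy exponential

    # Linear fit to log-transformed data: log(y) = log(a) + b * x
    b, log_a = np.polyfit(x, np.log(y), 1)
    fit = np.exp(log_a) * np.exp(b * x)

    plt.scatter(x, y, s=10)
    plt.plot(x, fit, color="black", linestyle="--",
             label=f"exponential fit: a={np.exp(log_a):.2f}, b={b:.2f}")
    plt.legend()
    plt.show()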
Detrending and Time-Series Decomposition
For any time series with a prominent long-term trend, it may be useful to
remove this trend to specifically highlight any notable deviations. This
technique is called detrending.
We detrend housing prices by dividing the actual price index at each time point
by the respective value in the long-term trend.

A division of the untransformed values is equivalent to a subtraction of the
log-transformed values. The resulting detrended house prices show the
housing bubbles more clearly, as the detrending emphasizes the unexpected
movements in a time series.
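A minimal detrending sketch with pandas on a synthetic price index; here the long-term trend is approximated by a wide centered rolling mean, purely for illustration:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    t = pd.date_range("1990-01-01", periods=360, freq="MS")  # monthly index
    trend = np.exp(np.linspace(4.0, 5.0, 360))               # long-term growth
    bubble = 1 + 0.3 * np.exp(-((np.arange(360) - 220) / 25) ** 2)  # a "bubble"
    index = pd.Series(trend * bubble * rng.lognormal(sigma=0.01, size=360), index=t)

    # Approximate the long-term trend, then divide it out
    longterm = index.rolling(window=60, center=True, min_periods=1).mean()
    detrended = index / longterm  # equals 1 where the index follows the trend

    detrended.plot()
    plt.axhline(1, linestyle="--", color="black")
    plt.ylabel("Price index / long-term trend")
    plt.show()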

VISUALIZING GEOSPATIAL DATA


Geospatial data can tell us where people with specific attributes (such as income, age, or educational
attainment) live, or where man-made objects (e.g., bridges, roads, buildings)
have been constructed. In all these cases, it can be helpful to visualize the data
in their proper geospatial context, i.e., to show the data on a realistic map or
alternatively as a map-like diagram.
Projections
Longitude, latitude, and altitude are specified relative to a reference system
called the datum. The datum specifies properties such as the shape and size of
the earth, as well as the location of zero longitude, latitude, and altitude. One
widely used datum is the World Geodetic System (WGS) 84, which is used by
the Global Positioning System (GPS).
The challenge in map making is that we need to take the spherical surface of
the earth and flatten it out so we can display it on a map. This process is called
projection.
Consider the United States, which consists of the lower 48 (the 48 contiguous
states), Alaska, and Hawaii (Figure 15-4). While the lower 48 alone are
reasonably easy to project onto a map, Alaska and Hawaii are so distant from
the lower 48 that projecting all 50 states onto one map becomes awkward.

The projection provides a reasonable representation of the relative shapes, areas,
and locations of the 50 states, but we notice some issues. First, Alaska seems
weirdly stretched out compared to how it usually appears, and the map is dominated
by ocean/empty space. It would be preferable to zoom in further, so that the lower
48 states take up a larger proportion of the map area.
Layers
To show geospatial data in context, we usually create maps consisting of multiple
layers showing different types of information.
At the bottom, we have the terrain layer, which shows hills, valleys, and water.
The next layer shows the road network. On top of the road layer, we place a layer
indicating the locations of individual wind turbines. This layer also
contains the two rectangles highlighting the majority of the wind turbines.
Finally, the top layer adds the locations and names of cities.

Choropleth Mapping
We can color individual regions in a map according to the data dimension we
want to display. Such maps are called choropleth maps.
We take the population number for each county in the US, divide it by the
county’s surface area, and then draw a map where the color of each county
corresponds to the ratio between population number and area.
We use light colors to represent low population densities and dark colors to
represent high densities, so that high-density metropolitan areas stand out as
dark colors on a background of light colors. We tend to associate darker colors
with higher intensities when the background color of the figure is light.
As a general principle, when figures are meant to be printed on white paper,
light-colored background areas will typically work better. For online viewing
or on a dark background, dark-colored background areas may be preferable.

Choropleths work best when the coloring represents a density (i.e., some
quantity divided by surface area).
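A minimal choropleth sketch with geopandas, assuming a hypothetical file counties.geojson with a population column per county polygon (the file name and column names are placeholders):

    import geopandas as gpd
    import matplotlib.pyplot as plt

    # Hypothetical input: one polygon per county with a "population" column
    counties = gpd.read_file("counties.geojson")

    # Density = population divided by surface area (a projected CRS is assumed,
    # so .area is in square map units; both assumptions are illustrative)
    counties["density"] = counties["population"] / counties.geometry.area

    # Light-to-dark sequential colormap so dense areas stand out
    counties.plot(column="density", cmap="Blues", legend=True)
    plt.show()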
Cartograms
Some states take up a comparatively large area but are sparsely populated,
while others take up a small area yet have a large number of inhabitants. What
if we deformed the states so their size was proportional to their number of
inhabitants? Such a modified map is called a cartogram.
As an alternative to a cartogram with distorted shapes, we can also draw a
much simpler cartogram heatmap, where each state is represented by a colored
square.
