Introduction to Data Visualization
Data visualization is part art and part science. The challenge is to get the art right
without getting the science wrong.
Data visualization is a powerful tool that transforms raw data into graphical
representations, making complex information more accessible and
understandable. However, not all visualizations are created equal. When done
poorly, data visualizations can be ugly, bad, or outright wrong, leading to
confusion, misinterpretation, or even deliberate misinformation. Let's explore
these categories:
1. Ugly Visualizations
Ugly visualizations are those that are visually unappealing or difficult to interpret
due to poor design choices. This might include:
Cluttered charts: Overloading a chart with too much information, colors, or text.
Poor color schemes: Using colors that clash, are hard to distinguish, or are not
colorblind-friendly.
Misaligned or inconsistent elements: Axes, labels, and chart elements that are not
aligned properly can make a chart look unprofessional and hard to read.
2. Bad Visualizations
Bad visualizations might not necessarily be ugly, but they fail in accurately
conveying the intended message or data. Common issues include:
Inappropriate chart types: Using a pie chart for data that doesn’t add up to 100%,
or using a 3D chart that distorts the data.
Misleading scales: Manipulating the scale of the axes to exaggerate or downplay differences in the data.
Excessive use of effects: Adding shadows, gradients, or 3D effects that distract
from the actual data.
3. Wrong Visualizations
Wrong visualizations are the most problematic because they present data
inaccurately or deceptively, leading to false conclusions. These errors can be
intentional or accidental:
Incorrect data representation: Using the wrong data or misinterpreting the data
source.
Cherry-picking data: Showing only a selective part of the data to support a biased
narrative.
Manipulating visual cues: Altering the size of bars, angles of pie slices, or other
visual elements to mislead the viewer.
Conclusion
Understanding the difference between ugly, bad, and wrong visualizations is
crucial for anyone working with data. Good data visualization requires careful
attention to design principles, ethical considerations, and a clear understanding of
the audience and context. By learning from poor examples, one can develop the
skills needed to create visualizations that are not only attractive but also accurate
and effective in communicating the intended message.
IMPORTANCE OF DATA VISUALIZATION
Data visualization is a method of presenting data in a visual and more informative way. It is used to understand trends and patterns in data and to support effective business decisions.
Data visualization is a critical aspect of data science because it allows data
scientists to present complex data in an easily understandable and digestible
format. Here are a few reasons why data visualization is so important in data
science:
1. Helps with data exploration: Data visualization enables data scientists to
explore the data and identify patterns, trends, and outliers quickly. By
visualizing data, one can identify relationships between variables that may
not be apparent when looking at data in a tabular format.
2. Communicates insights: Data visualization helps data scientists
communicate insights to stakeholders in an easy-to-understand format. A
well-designed visualization can help non-technical stakeholders
understand complex data and make informed decisions based on that data.
3. Aids in decision-making: Data visualization is critical to decision-making
because it enables stakeholders to see trends and patterns that may not be
immediately apparent in a raw data set. By presenting data in a visual
format, data scientists can help stakeholders make informed decisions
quickly.
4. Facilitates storytelling: Data visualization is an essential component of data
storytelling. By creating compelling visualizations, data scientists can
create a narrative around the data that resonates with stakeholders.
Overall, data visualization is critical to data science because it helps data scientists explore data, communicate insights, support decision-making, and facilitate storytelling. Without data visualization, it would be challenging to gain insights from complex data sets and make informed decisions. Before designing visualizations, however, it helps to understand how humans perceive visual information.
HUMAN PERCEPTION
Human perception plays a critical role in data visualization, as it directly impacts
how effectively information is communicated and understood. The human brain
is wired to process visual information more efficiently than raw data, making the
design of visualizations crucial for conveying complex data clearly and
accurately. Understanding the principles of human perception allows for the
creation of visualizations that are not only aesthetically pleasing but also
functional and intuitive.
Gestalt Principles
The Gestalt principles of visual perception—such as proximity, similarity, and
continuity—are fundamental in how we group and interpret visual elements. For
instance, objects that are close to each other are perceived as related, while those
that are similar in color or shape are seen as part of the same group. These
principles can guide the design of visualizations to ensure that the intended
relationships and patterns in the data are perceived correctly by the viewer.
Color Perception
Color plays a significant role in data visualization, but it must be used with care
due to the variability in how different people perceive colors. Factors such as
color blindness, cultural differences, and the context in which colors are used can
affect interpretation. Therefore, it’s essential to choose color schemes that are
accessible, consistent, and appropriate for the data being presented. Sequential
color schemes are good for ordered data, while diverging schemes are useful for
highlighting deviations from a central point.
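As a concrete illustration, here is a minimal Matplotlib sketch (the data and the specific colormap names, viridis and coolwarm, are illustrative assumptions, not prescriptions) contrasting a sequential colormap for ordered values with a diverging colormap for deviations around a central point:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up 2D data: ordered values and deviations around zero
rng = np.random.default_rng(0)
ordered = rng.random((10, 10))            # values in [0, 1], ordered data
deviations = ordered - ordered.mean()     # centered data with positive/negative deviations

fig, axes = plt.subplots(1, 2, figsize=(8, 3.5))

# Sequential colormap: good for ordered data (low -> high)
im0 = axes[0].imshow(ordered, cmap="viridis")
axes[0].set_title("Sequential (viridis)")
fig.colorbar(im0, ax=axes[0])

# Diverging colormap: good for deviations from a central point (here, zero)
im1 = axes[1].imshow(deviations, cmap="coolwarm", vmin=-0.5, vmax=0.5)
axes[1].set_title("Diverging (coolwarm)")
fig.colorbar(im1, ax=axes[1])

plt.tight_layout()
plt.show()
```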
Cognitive Load
Cognitive load refers to the amount of mental effort required to process
information. In data visualization, the goal is to minimize cognitive load by
reducing unnecessary complexity and focusing on clarity. This can be achieved
by avoiding clutter, using intuitive layouts, and providing clear labels and
legends. A well-designed visualization leverages human perception to make data
interpretation as effortless as possible.
Conclusion
Incorporating principles of human perception into data visualization design
enhances the viewer's ability to quickly and accurately interpret data. By aligning
visual encodings with perceptual strengths, applying Gestalt principles, using
color effectively, and minimizing cognitive load, data visualizations become
powerful tools for communication and decision-making.
METHODOLOGY OF DATA VISUALIZATION
1. Defining Objectives
The first step in the methodology is to clearly define the objectives of the
visualization. This involves understanding the purpose of the
visualization—whether it's to explore data, explain findings, persuade an
audience, or support decision-making. Identifying the target audience is
also crucial, as different audiences may require different levels of detail or
types of visual representation.
Conclusion
A structured methodology in data visualization ensures that the final product is
not only visually appealing but also effective in communicating the intended
message. By following these steps—defining objectives, understanding the data,
choosing the right visual encodings, designing carefully, and refining through
feedback—one can create powerful visualizations that turn complex data into
actionable insights.
SEVEN STAGES OF DATA VISUALIZATION
The seven stages of data visualization represent a structured approach to
transforming raw data into a visual format that is both informative and easy to
understand. These stages guide the entire process, from data collection to the final
presentation, ensuring that the visualization effectively communicates the
intended message. Here’s an overview of each stage:
1. Acquire
The first stage involves collecting the data that will be visualized. This
could come from various sources such as databases, spreadsheets, APIs, or
manual data entry. The key is to gather all relevant data needed to address
the questions or objectives driving the visualization. This stage may also
include data extraction and integration from multiple sources.
2. Parse
Once the data is acquired, the next step is parsing it, which involves
structuring and organizing the data into a format that is suitable for analysis
and visualization. This might involve converting raw data into a
standardized format, categorizing variables, and ensuring that the data is
clean and ready for further processing. Parsing helps in understanding the
relationships within the data and preparing it for subsequent stages.
3. Filter
In this stage, unnecessary or irrelevant data is filtered out to focus on the
most pertinent information. This could involve removing outliers, focusing
on specific data subsets, or excluding noise that might obscure the
visualization's key insights. Filtering ensures that the visualization remains
focused on the main message and is not cluttered with extraneous details.
4. Mine
Mining involves analyzing the data to uncover patterns, trends,
correlations, or insights that may not be immediately apparent. This stage
often employs statistical methods, algorithms, or exploratory data analysis
techniques to extract meaningful information from the data. The findings
from this stage will guide the design of the visualization.
5. Represent
Representation is the stage where the data is mapped onto visual forms
such as charts, graphs, maps, or other graphical elements. The choice of
representation depends on the nature of the data and the type of insights to
be conveyed. This stage is critical as it transforms abstract data into a visual
format that can be more easily interpreted by the human brain.
6. Refine
The refine stage involves fine-tuning the visualization for clarity,
aesthetics, and accuracy. This may include adjusting scales, colors, labels,
and layouts to enhance readability and ensure that the visualization
communicates its message effectively. Refinement also involves
eliminating any elements that might cause confusion or misinterpretation.
7. Interact
The final stage is adding interactivity to the visualization, if applicable.
Interactive visualizations allow users to explore the data more deeply by
hovering, clicking, zooming, or filtering. This stage is particularly
important in complex visualizations where users may need to investigate
specific details or view data from different perspectives.
Conclusion
The seven stages of data visualization provide a comprehensive framework for
creating visual representations of data that are clear, accurate, and effective. By
following these stages—acquire, parse, filter, mine, represent, refine, and
interact—one can systematically transform raw data into powerful visual insights
that support better understanding and decision-making.
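To make the stages concrete, here is a rough Python sketch of the acquire, parse, filter, mine, and represent stages using pandas and Matplotlib. The file name sales.csv and its columns (date, region, revenue) are hypothetical stand-ins for whatever data source a project actually uses:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Acquire: load the raw data (hypothetical file and column names)
raw = pd.read_csv("sales.csv")

# Parse: structure the data - convert types and categorize variables
raw["date"] = pd.to_datetime(raw["date"])
raw["region"] = raw["region"].astype("category")

# Filter: keep only the subset relevant to the question at hand
recent = raw[raw["date"] >= "2020-01-01"]

# Mine: aggregate to uncover the pattern of interest (monthly revenue by region)
monthly = (recent
           .groupby([pd.Grouper(key="date", freq="MS"), "region"], observed=True)["revenue"]
           .sum()
           .unstack("region"))

# Represent + Refine: map the result onto a line chart and label it clearly
ax = monthly.plot(figsize=(8, 4))
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
ax.set_title("Monthly revenue by region")
plt.tight_layout()
plt.show()
```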
DATA VISUALIZATION TOOLS
Data visualization tools are essential for transforming raw data into meaningful
and interactive visual representations. These tools enable users to create a variety
of charts, graphs, and dashboards to analyze and communicate insights
effectively. Here are some of the most popular and widely used data visualization
tools:
1. Tableau
Tableau is renowned for its powerful and user-friendly interface, allowing
users to create interactive and shareable dashboards. It supports a wide
range of data sources and provides various visualization options, from
simple bar charts to complex scatter plots. Tableau’s drag-and-drop
functionality and robust data blending capabilities make it suitable for both
beginners and advanced users.
2. Microsoft Power BI
Power BI integrates seamlessly with Microsoft products and provides a
comprehensive suite of tools for data visualization and business
intelligence. It offers a variety of data connectors, real-time data access,
and customizable reports. Power BI’s user-friendly interface and
integration with Excel make it a popular choice for businesses looking to
leverage their data for strategic decision-making.
3. D3.js
D3.js (Data-Driven Documents) is a JavaScript library for creating
dynamic and interactive visualizations on the web. Unlike other tools,
D3.js requires coding knowledge and offers high flexibility in terms of
design and functionality. It allows for custom visualizations and is highly
suitable for developers looking to create unique, interactive visual
representations.
4. Matplotlib
Matplotlib is a widely used Python library for creating static, animated, and
interactive plots and graphs. It is highly customizable and is often used in
scientific and engineering applications. Matplotlib integrates well with
other Python libraries, such as NumPy and Pandas, making it a powerful
tool for data analysis and visualization.
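As a quick illustration of that integration, here is a minimal sketch that plots NumPy-generated data held in a pandas DataFrame with Matplotlib (the example data is made up):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy array of x values and a pandas DataFrame built from it
x = np.linspace(0, 10, 200)
df = pd.DataFrame({"x": x, "sin": np.sin(x), "cos": np.cos(x)})

# Matplotlib accepts NumPy arrays and pandas columns directly
fig, ax = plt.subplots()
ax.plot(df["x"], df["sin"], label="sin(x)")
ax.plot(df["x"], df["cos"], label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()
plt.show()
```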
Conclusion
Choosing the right data visualization tool depends on factors such as the
complexity of the data, the desired level of interactivity, and user expertise. Tools
like Tableau and Power BI are excellent for general business needs, while D3.js
and Matplotlib cater to more specialized or technical requirements. Other tools, such as Google Data Studio, are a good option for those who need seamless integration with Google's ecosystem.
MAPPING DATA ONTO AESTHETICS
We'll consider the types of data we may want to represent in our visualization. You may think of data as numbers, but continuous and discrete numerical values are only two of several types of data we may encounter. Data can also come in the form of discrete categories, dates or times, and text. When data is numerical we also call it quantitative, and when it is categorical we call it qualitative. Variables holding qualitative data are factors, and the different categories are called levels.
Let's put things into practice. We can take a dataset of daily temperature normals for four locations, map temperature onto the y axis, day of the year onto the x axis, and location onto color, and visualize these aesthetics with solid lines. The result is a standard line plot showing the temperature normals at the four locations as they change during the year. This is a fairly standard visualization for a temperature curve and likely the one most data scientists would intuitively choose first. However, it is up to us which variables we map onto which scales.
For example, instead of mapping temperature onto the y axis and location onto
color, we can do the opposite.
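As a sketch of this mapping, the following code generates made-up temperature normals for four hypothetical locations and maps day of the year to x, temperature to y, and location to color; swapping the roles of temperature and location would be a different, equally valid mapping:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up temperature normals for four hypothetical locations
days = np.arange(1, 366)
locations = {"Location A": 15, "Location B": 5, "Location C": 25, "Location D": 10}
frames = []
for name, mean_temp in locations.items():
    temp = mean_temp + 10 * np.sin(2 * np.pi * (days - 100) / 365)
    frames.append(pd.DataFrame({"day": days, "temperature": temp, "location": name}))
data = pd.concat(frames)

# Map day of year onto x, temperature onto y, and location onto color
fig, ax = plt.subplots()
for name, group in data.groupby("location"):
    ax.plot(group["day"], group["temperature"], label=name)
ax.set_xlabel("Day of the year")
ax.set_ylabel("Temperature (°C)")
ax.legend()
plt.show()
```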
VISUALIZING AMOUNTS
The main emphasis in visualizing amounts will be on the magnitude of the
quantitative values. The standard visualization in this scenario is the bar plot,
which has several variations, including simple bars as well as grouped and
stacked bars. Alternatives to the bar plot are the dot plot and the heatmap.
Bar Plots
Amounts are commonly visualized with vertical bars. For example, given a dataset of weekend box-office results, we draw, for each movie, a bar that starts at zero and extends all the way to the dollar value for that movie's weekend gross. This visualization is called a bar plot or bar chart.
Figure: 2016 median US annual household income versus age group. The 45-to-54-year age group has the highest median income.
Figure: 2016 median US annual household income versus age group, sorted by income.
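A minimal Matplotlib sketch of a simple bar plot, using made-up income values (not the actual 2016 figures) and showing how the same bars can be reordered by value:

```python
import matplotlib.pyplot as plt

# Made-up values standing in for median income by age group (not the real 2016 figures)
age_groups = ["15-24", "25-34", "35-44", "45-54", "55-64", "65+"]
income = [30000, 55000, 65000, 70000, 62000, 40000]

fig, ax = plt.subplots()
ax.bar(age_groups, income)          # bars start at zero by default
ax.set_xlabel("Age group")
ax.set_ylabel("Median income (USD)")
plt.show()

# The same bars sorted by value, which often reads better for unordered categories
pairs = sorted(zip(age_groups, income), key=lambda p: p[1], reverse=True)
labels, values = zip(*pairs)
fig, ax = plt.subplots()
ax.bar(labels, values)
ax.set_xlabel("Age group")
ax.set_ylabel("Median income (USD)")
plt.show()
```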
Grouped and Stacked Bars
If we are interested in two categorical variables at the same time, we can visualize the dataset with a grouped bar plot. In a grouped bar plot, we draw a group of bars at each position along the x axis, determined by one categorical variable, and then we draw bars within each group according to the other categorical variable.
Grouped bar plots show a lot of information at once, and they can be confusing.
It is difficult to compare median incomes across age groups for a given racial
group. This figure is only appropriate if we are primarily interested in the
differences in income levels among racial groups, separately for specific age
groups.
Figure: 2016 median US annual household income versus age group and race. Age groups are shown along the x axis, and for each age group there are four bars, corresponding to the median income of Asian, white, Hispanic, and black people, respectively.
If we care more about the overall pattern of income levels among racial groups,
it may be preferable to show race along the x axis and show ages as distinct bars
within each racial group.
Figure: 2016 median US annual household income versus age group and race, with race shown along the x axis.
The encoding by position is easy to read while the encoding by bar color requires
more mental effort, as we have to mentally match the colors of the bars against
the colors in the legend. We can avoid this added mental effort by showing four
separate regular bar plots rather than one grouped bar plot.
Figure: 2016 median US annual household income versus age group and race, shown as four separate regular bar plots instead of one grouped bar plot.
Instead of drawing groups of bars side by side, it is sometimes preferable to stack bars on top of each other. Stacking is useful when the sum of the amounts represented by the individual stacked bars is in itself a meaningful amount. So, while it would not make sense to stack the median income values shown above, stacking is appropriate when the individual bars represent counts. For example, in a dataset of people, we can either count men and women separately or we can count them together. If we stack a bar representing a count of women on top of a bar representing a count of men, then the combined bar height represents the total count of people regardless of gender.
Figure: Numbers of female and male passengers on the Titanic traveling in 1st, 2nd, and 3rd class.
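A sketch of grouped versus stacked bars in Matplotlib, using approximate Titanic-style counts purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Approximate passenger counts by class and sex, for illustration only
classes = ["1st", "2nd", "3rd"]
female = np.array([144, 106, 216])
male = np.array([179, 171, 493])

x = np.arange(len(classes))
width = 0.35

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))

# Grouped bars: one bar per sex within each class
axes[0].bar(x - width / 2, female, width, label="female")
axes[0].bar(x + width / 2, male, width, label="male")
axes[0].set_title("Grouped")

# Stacked bars: the total height is the (meaningful) total passenger count per class
axes[1].bar(x, female, width, label="female")
axes[1].bar(x, male, width, bottom=female, label="male")
axes[1].set_title("Stacked")

for ax in axes:
    ax.set_xticks(x)
    ax.set_xticklabels(classes)
    ax.set_xlabel("Class")
    ax.set_ylabel("Passengers")
    ax.legend()

plt.tight_layout()
plt.show()
```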
Dot Plots and Heatmaps
One important limitation of bars is that they need to start at zero, so that the bar
length is proportional to the amount shown. For some datasets, this can be
impractical or may obscure key features. In this case, we can indicate amounts by
placing dots at the appropriate locations along the x or y axis. The figure below
demonstrates this visualization approach for a dataset of life expectancies in 25
countries in the Americas. The citizens of these countries have life expectancies
between 60 and 81 years, and each individual life expectancy value is shown with
a blue dot at the appropriate location along the x axis. By limiting the axis range to the interval from 60 to 81 years, the figure emphasizes the differences among countries, which would be compressed if the axis had to start at zero.
Figure: Life expectancies of countries in the Americas, for the year 2007, shown as bars.
Regardless of whether we use bars or dots, however, we need to pay attention to
the ordering of the data values. If we ordered them alphabetically rather than by value, we'd
end up with a disordered cloud of points that is confusing and fails to convey a
clear message.
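A small Matplotlib sketch of a dot plot with values sorted in increasing order, using made-up life expectancy numbers; note that the axis does not need to start at zero:

```python
import matplotlib.pyplot as plt

# Made-up life expectancies (years) for a handful of countries, for illustration
countries = ["Country A", "Country B", "Country C", "Country D", "Country E"]
life_exp = [78.6, 72.9, 80.7, 68.8, 75.5]

# Sort by value so the dots form a clear, readable progression
pairs = sorted(zip(countries, life_exp), key=lambda p: p[1])
labels, values = zip(*pairs)

fig, ax = plt.subplots()
ax.plot(values, list(range(len(labels))), "o")   # dots instead of bars
ax.set_yticks(list(range(len(labels))))
ax.set_yticklabels(labels)
ax.set_xlim(60, 85)                              # the axis need not start at zero
ax.set_xlabel("Life expectancy (years)")
plt.tight_layout()
plt.show()
```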
Figure: Internet adoption over time, for select countries. Color represents the percent of internet users for the respective country and year. Countries were ordered by percent internet users in 2016.
Figure: Internet adoption over time, for select countries. Countries were ordered by the year in which their internet usage first exceeded 20%.
VISUALIZING DISTRIBUTIONS: HISTOGRAMS AND DENSITY PLOTS
Visualizing a Single Distribution
We can visualize a single distribution by binning the observations (for example, ages) and drawing filled rectangles whose heights correspond to the counts and whose widths correspond to the width of the bins. Such a visualization is called a histogram.
Because histograms are generated by binning the data, their exact visual
appearance depends on the choice of the bin width. Most visualization programs
that generate histograms will choose a bin width by default, but chances are that
bin width is not the most appropriate one for any histogram you may want to
make. It is therefore critical to always try different bin widths to verify that the
resulting histogram reflects the underlying data accurately.
If the bin width is too small, then the histogram becomes overly peaky and
visually busy and the main trends in the data may be obscured. On the other hand,
if the bin width is too large, then smaller features in the distribution of the data
may disappear.
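The effect of bin width is easy to see by plotting the same sample at several widths. A minimal Matplotlib sketch with made-up, skewed data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up skewed sample standing in for a variable such as age
rng = np.random.default_rng(1)
ages = rng.gamma(shape=2.0, scale=15.0, size=1000)

# The same data binned three ways: too narrow, reasonable, too wide
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for ax, width in zip(axes, [1, 5, 25]):
    bins = np.arange(0, ages.max() + width, width)
    ax.hist(ages, bins=bins)
    ax.set_title(f"bin width = {width}")
    ax.set_xlabel("Age")
axes[0].set_ylabel("Count")
plt.tight_layout()
plt.show()
```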
In a density plot, we attempt to visualize the underlying probability distribution
of the data by drawing an appropriate continuous curve. This curve needs to be
estimated from the data, and the most commonly used method for this estimation
procedure is called kernel density estimation. In kernel density estimation, we
draw a continuous curve (the kernel) with a small width (controlled by a
parameter called bandwidth) at the location of each data point, and then we add
up all these curves to obtain the final density estimate. The most widely used
kernel is a Gaussian kernel (i.e., a Gaussian bell curve), but there are many other
choices.
The bandwidth parameter behaves similarly to the bin width in histograms. If the
bandwidth is too small, then the density estimate can become overly peaky and
visually busy and the main trends in the data may be obscured. On the other hand,
if the bandwidth is too large, then smaller features in the distribution of the data
may disappear.
Density plots tend to be quite reliable and informative for large datasets but can
be misleading for datasets of only a few points.
Kernel density estimates have one pitfall that we need to be aware of: they have
a tendency to produce the appearance of data where none exists, in particular in
the tails.
Kernel density estimates can extend the tails of the distribution into areas where
no data exists and no data is even possible.
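A sketch of kernel density estimation with SciPy's gaussian_kde, using made-up nonnegative data; varying the bandwidth factor shows the peaky-versus-oversmoothed trade-off, and evaluating the estimate past zero shows the tail artifact described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Made-up nonnegative data (e.g., ages); the KDE will still leak into negative values
rng = np.random.default_rng(2)
ages = rng.gamma(shape=2.0, scale=15.0, size=500)

# Evaluate the density on a grid that extends past the observed data range
grid = np.linspace(-20, ages.max() + 20, 400)

fig, ax = plt.subplots()
for bw in [0.05, 0.2, 0.8]:                  # bw_method scales the default bandwidth
    kde = gaussian_kde(ages, bw_method=bw)
    ax.plot(grid, kde(grid), label=f"bandwidth factor = {bw}")

ax.axvline(0, color="gray", linestyle="--")  # density left of 0 is an artifact of the kernel
ax.set_xlabel("Age")
ax.set_ylabel("Density")
ax.legend()
plt.show()
```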
Visualizing Multiple Distributions at the Same Time
One common approach is the stacked histogram, where we draw the histogram bars for one category on top of the bars for another category, in a different color.
There are two key problems here. First, it is never entirely clear where exactly
the bars begin. Second, the bar heights for the female counts cannot be directly
compared to each other, because the bars all start at a different height.
We could try to address these problems by having all bars start at zero and making the bars partially transparent.
This approach generates new problems. Now it appears that there are actually
three different groups, not just two, and we’re still not entirely sure where each
bar starts and ends. Overlapping histograms don't work well because a semitransparent bar drawn on top of another tends to not look like a semitransparent bar but instead like a bar drawn in a different color.
Overlapping density plots don’t typically have the problem that overlapping
histograms have, because the continuous density lines help the eye keep the
distributions separate.
When we want to visualize exactly two distributions, we can also make two
separate histograms, rotate them by 90 degrees, and have the bars in one
histogram point in the opposite direction of the other. This trick is commonly
employed when visualizing age distributions, and the resulting plot is usually
called an age pyramid.
This trick does not work when there are more than two distributions we want to
visualize at the same time. For multiple distributions, histograms tend to become
confusing, whereas density plots work well as long as the distributions are
somewhat distinct and contiguous.
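A sketch of overlapping density plots for two groups, using made-up age samples and SciPy's gaussian_kde; the solid outlines help keep the two distributions visually separate:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Made-up age samples for two groups (e.g., female and male passengers)
rng = np.random.default_rng(3)
female = rng.normal(28, 12, 300).clip(0, 80)
male = rng.normal(32, 14, 450).clip(0, 80)

grid = np.linspace(0, 80, 300)

fig, ax = plt.subplots()
for sample, label in [(female, "female"), (male, "male")]:
    density = gaussian_kde(sample)(grid)
    ax.fill_between(grid, density, alpha=0.4, label=label)  # semitransparent fill
    ax.plot(grid, density)                                   # solid outline keeps the curves distinct
ax.set_xlabel("Age")
ax.set_ylabel("Density")
ax.legend()
plt.show()
```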
VISUALIZING PROPORTIONS
A Case for Pie Charts
A pie chart breaks a circle into slices such that the area of each slice is
proportional to the fraction of the total it represents. The same procedure can be
performed on a rectangle, and the result is a stacked bar chart. Depending on
whether we slice the bar vertically or horizontally, we obtain vertically stacked
bars or horizontally stacked bars.
Stacked bars, on the other hand, can work for side-by-side comparisons of
multiple conditions or in a time series, and side-by-side bars are preferred when
we want to directly compare the individual fractions to each other.
A Case for Side-by-Side Bars
Consider, for example, a dataset of market shares for five companies (A through E) over three years. When we visualize this dataset with pie charts, it is difficult to see specific trends.
It appears that the market share of company A is growing and the one of company
E is shrinking, but beyond this one observation we can’t tell what’s going on. In
particular, it is unclear how exactly the market shares of the different companies
compare within each year.
The picture becomes a little clearer when we switch to stacked bars.
Now the trends of a growing market share for company A and a shrinking market
share for company E are clearly visible. However, the relative market shares of
the five companies within each year are still hard to compare.
For directly comparing the individual fractions, side-by-side bars are the best choice. This visualization highlights that both
companies A and B have increased their market share from 2015 to 2017 while
both companies D and E have reduced theirs. It also shows that market shares
increase sequentially from company A to E in 2015 and similarly decrease in
2017.
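A sketch of the side-by-side bar layout in Matplotlib, using made-up market-share numbers for five hypothetical companies that mirror the pattern described above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up market shares (in %) for five hypothetical companies over three years
companies = ["A", "B", "C", "D", "E"]
shares = {2015: [17, 18, 20, 22, 23],
          2016: [20, 20, 20, 20, 20],
          2017: [23, 22, 20, 18, 17]}

x = np.arange(len(companies))
width = 0.25

fig, ax = plt.subplots()
for i, (year, values) in enumerate(shares.items()):
    # Offset each year's bars so the three years sit side by side within each company
    ax.bar(x + (i - 1) * width, values, width, label=str(year))
ax.set_xticks(x)
ax.set_xticklabels(companies)
ax.set_xlabel("Company")
ax.set_ylabel("Market share (%)")
ax.legend()
plt.show()
```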
A Case for Stacked Bars and Stacked Densities
To visualize how the proportion of women in the Rwandan parliament has
changed over time, we can draw a sequence of stacked bar graphs. This figure
provides an immediate visual representation of the changing proportions over
time. To help the reader see exactly when the majority turned female, a dashed horizontal line at 50% can be drawn.
VISUALIZING ASSOCIATIONS
If we are already using the x position for body mass, the y position for head length, and the dot color for bird sex, we need another aesthetic to which we can map skull size. One option is to use the size of the dots, resulting in a visualization called a bubble chart.
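A minimal bubble-chart sketch in Matplotlib with made-up bird measurements; dot size is scaled from the skull-size values purely for readability:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up bird measurements: body mass (g), head length (mm), skull size (mm), sex
rng = np.random.default_rng(4)
n = 60
body_mass = rng.normal(75, 8, n)
head_length = 40 + 0.2 * body_mass + rng.normal(0, 1.5, n)
skull_size = 28 + 0.1 * body_mass + rng.normal(0, 1.0, n)
sex = rng.choice(["F", "M"], n)

fig, ax = plt.subplots()
for s, color in [("F", "tab:orange"), ("M", "tab:blue")]:
    mask = sex == s
    # x = body mass, y = head length, color = sex, dot size = skull size
    ax.scatter(body_mass[mask], head_length[mask],
               s=(skull_size[mask] - skull_size.min() + 1) * 10,
               c=color, alpha=0.6, label=s)
ax.set_xlabel("Body mass (g)")
ax.set_ylabel("Head length (mm)")
ax.legend(title="Sex")
plt.show()
```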
Paired Data
A special case of multivariate quantitative data is paired data: data where there are two or more measurements of the same quantity under slightly different conditions.
For paired data, we need to choose visualizations that highlight any differences
between the paired measurements.
An excellent choice in this case is a simple scatterplot on top of a diagonal line
marking x = y. In such a plot, if the only difference between the two measurements
of each pair is random noise, then all points in the sample will be scattered
symmetrically around this line.
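A sketch of a paired-data scatterplot with the diagonal x = y reference line, using made-up before/after measurements:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up paired measurements: the same quantity measured under two conditions
rng = np.random.default_rng(5)
before = rng.normal(50, 10, 40)
after = before + rng.normal(0, 3, 40)   # mostly random noise around the first measurement

fig, ax = plt.subplots()
lims = [min(before.min(), after.min()) - 5, max(before.max(), after.max()) + 5]
ax.plot(lims, lims, color="gray", linestyle="--")   # diagonal x = y reference line
ax.scatter(before, after)
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel("Measurement under condition 1")
ax.set_ylabel("Measurement under condition 2")
ax.set_aspect("equal")
plt.show()
```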
VISUALIZING TIME SERIES
Now consider a scatterplot of values observed over time. The dots are spaced evenly along the x axis, and there is a defined order among them. Each dot has exactly one left and one right neighbor (except the leftmost and rightmost points, which have only one neighbor each). We can visually emphasize this order by connecting neighboring points with lines. Such a plot is called a line graph.
Some people object to drawing lines between points because the lines do not
represent observed data. In particular, if there are only a few observations spaced
far apart, had observations been made at intermediate times they would probably
not have fallen exactly onto the lines shown. Thus, in a sense, the lines correspond
to made-up data.
Without dots, the figure places more emphasis on the overall trend in the data and
less on individual observations. A figure without dots is also visually less busy.
In general, the denser the time series, the less important it is to show individual
observations with dots.
We can also fill the area under the curve with a solid color. This visualization is
only valid if the y axis starts at zero, so that the height of the shaded area at each
time point represents the data value at that time point.
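A sketch of the same made-up time series drawn three ways in Matplotlib: dots connected by lines, a line only, and a filled area with the y axis starting at zero:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up monthly counts standing in for a generic time series
months = np.arange(1, 25)
rng = np.random.default_rng(6)
counts = 100 + 10 * months + rng.normal(0, 30, months.size)

fig, axes = plt.subplots(1, 3, figsize=(11, 3), sharey=True)

axes[0].plot(months, counts, "o-")          # dots connected by lines
axes[0].set_title("Dots and lines")

axes[1].plot(months, counts)                # line only: emphasizes the overall trend
axes[1].set_title("Line only")

axes[2].fill_between(months, counts)        # filled area: the y axis must start at zero
axes[2].set_ylim(bottom=0)
axes[2].set_title("Filled area")

for ax in axes:
    ax.set_xlabel("Month")
axes[0].set_ylabel("Count")
plt.tight_layout()
plt.show()
```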
Multiple Time Series and Dose–Response Curves
We often have multiple time courses that we want to show at once. In this case,
we have to be more careful in how we plot the data, because the figure can become
confusing or difficult to read.
One option for showing how two variables evolve together over time is a connected scatterplot, in which successive observations are joined by a path and a gradual darkening of the color indicates direction. Alternatively, one could draw arrows along the path.
When drawing a connected scatterplot, it is important that we indicate both the
direction and the temporal scale of the data. Without such hints, the plot can turn
into a meaningless scribble.
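A sketch of a connected scatterplot in Matplotlib, using made-up data; the points darken over time to indicate direction, and a colorbar provides the temporal scale:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up pair of time-dependent variables (e.g., two indicators observed over time)
t = np.linspace(0, 4 * np.pi, 60)
x = np.cos(t) + 0.05 * t
y = np.sin(t) + 0.05 * t

fig, ax = plt.subplots()
ax.plot(x, y, color="lightgray", zorder=1)        # the connecting path
# Darkening color encodes the passage of time (direction of the path)
sc = ax.scatter(x, y, c=t, cmap="Blues", zorder=2)
fig.colorbar(sc, ax=ax, label="Time")
ax.set_xlabel("Variable 1")
ax.set_ylabel("Variable 2")
ax.set_title("Connected scatterplot")
plt.show()
```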
VISUALIZING TRENDS
There are two fundamental approaches to determining a trend: we can either
smooth the data by some method, such as a moving average, or we can fit a curve
with a defined functional form and then draw the fitted curve.
Smoothing
The act of smoothing produces a function that captures key patterns in the data
while removing irrelevant minor detail or noise. Financial analysts usually
smooth stock market data by calculating moving averages.
To generate a moving average, we take a time window, say the first 20 days in the
time series, calculate the average price over these 20 days, then move the time
window by one day, so it now spans the 2nd to 21st days. We then calculate the
average over these 20 days, move the time window again, and so on. The result
is a new time series consisting of a sequence of averaged prices.
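A sketch of the moving-average computation with pandas, using a made-up price series; note that the smoothed curves are undefined for the first window-length of days:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up daily "prices": a random walk starting at 100
rng = np.random.default_rng(7)
days = pd.date_range("2023-01-01", periods=250, freq="D")
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, days.size)), index=days)

# 20-day and 50-day moving averages; values are missing at the start of each window
ma20 = prices.rolling(window=20).mean()
ma50 = prices.rolling(window=50).mean()

fig, ax = plt.subplots()
ax.plot(prices.index, prices, color="lightgray", label="daily price")
ax.plot(ma20.index, ma20, label="20-day moving average")
ax.plot(ma50.index, ma50, label="50-day moving average")
ax.set_ylabel("Price")
ax.legend()
plt.show()
```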
The moving average is the most simplistic approach to smoothing, and it has
some obvious limitations. First, it results in a smoothed curve that is shorter than
the original curve. Parts are missing at either the beginning or the end or both.
Second, even with a large averaging window, a moving average is not necessarily
that smooth. It may exhibit small bumps and wiggles even though larger-scale
smoothing has been achieved.
A more sophisticated smoothing approach is locally estimated scatterplot smoothing (LOESS), which fits low-degree polynomials to subsets of the data.
LOESS is a very popular smoothing approach because it tends to produce results
that look right to the human eye. However, it requires the fitting of many separate
regression models. This makes it slow for large datasets, even on modern
computing equipment.
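A LOESS sketch using the lowess function from statsmodels on made-up data; the frac parameter controls how much of the data each local fit uses:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Made-up noisy scatter data
rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.1 * x + rng.normal(0, 0.4, x.size)

# frac controls the fraction of the data used in each local regression
smoothed = lowess(y, x, frac=0.3)   # returns an array of (x, fitted y) pairs

fig, ax = plt.subplots()
ax.scatter(x, y, s=10, alpha=0.5)
ax.plot(smoothed[:, 0], smoothed[:, 1], color="black", linewidth=2, label="LOESS")
ax.legend()
plt.show()
```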
A spline is a piecewise polynomial function that is highly flexible yet always
looks smooth. When working with splines, we will encounter the term knot. The
knots in a spline are the endpoints of the individual spline segments. If we fit a
spline with k segments, we need to specify k + 1 knots. While spline fitting is
computationally efficient, in particular if the number of knots is not too large,
splines have their own downsides.
The smoothing method may be referred to as a generalized additive model
(GAM), which is a superset of all these types of smoothers. It is important to be
aware that the output of the smoothing feature is dependent on the specific GAM
model that is fit.
Figure: Four different smoothers applied to the same data: (a) LOESS smoother; (b) cubic regression splines with 5 knots; (c) thin-plate regression spline with 3 knots; (d) Gaussian process spline with 6 knots.
Showing Trends with a Defined Functional Form
Whenever possible, it is preferable to fit a curve with a specific functional
form that is appropriate for the data and that uses parameters with clear
meaning.
Figure: The dashed black line represents the exponential fit, and the solid black line represents a linear fit to log-transformed data.
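A sketch of fitting a defined functional form, here an exponential, by fitting a straight line to log-transformed, made-up data with NumPy:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data that grows roughly exponentially
rng = np.random.default_rng(9)
x = np.linspace(0, 10, 50)
y = 5 * np.exp(0.3 * x) * rng.lognormal(0, 0.1, x.size)

# Fit a straight line to log-transformed data: log(y) = log(a) + b * x
b, log_a = np.polyfit(x, np.log(y), 1)
fitted = np.exp(log_a) * np.exp(b * x)

fig, ax = plt.subplots()
ax.scatter(x, y, s=15, alpha=0.6)
ax.plot(x, fitted, color="black", label=f"y = {np.exp(log_a):.1f} * exp({b:.2f} x)")
ax.set_yscale("log")                 # exponential growth appears linear on a log scale
ax.legend()
plt.show()
```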
Detrending and Time-Series Decomposition
For any time series with a prominent long-term trend, it may be useful to
remove this trend to specifically highlight any notable deviations. This
technique is called detrending.
We detrend housing prices by dividing the actual price index at each time point
by the respective value in the long-term trend.
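A sketch of this detrending step with made-up price-index data; dividing by the long-term trend leaves a series that fluctuates around 1:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up housing price index: exponential long-term growth plus a boom-and-bust cycle
years = np.arange(1980, 2021)
trend = 100 * 1.04 ** (years - 1980)                 # hypothetical 4% annual growth
cycle = 1 + 0.25 * np.sin((years - 1980) / 6.0)      # deviations around the trend
prices = pd.Series(trend * cycle, index=years)
long_term = pd.Series(trend, index=years)

# Detrend: divide the actual index by the long-term trend at each time point
detrended = prices / long_term

fig, axes = plt.subplots(2, 1, figsize=(6, 6), sharex=True)
axes[0].plot(prices.index, prices, label="price index")
axes[0].plot(long_term.index, long_term, linestyle="--", label="long-term trend")
axes[0].legend()
axes[1].plot(detrended.index, detrended)
axes[1].axhline(1.0, color="gray", linestyle="--")   # values above 1 are above-trend prices
axes[1].set_ylabel("Ratio to trend")
axes[1].set_xlabel("Year")
plt.show()
```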
Choropleth Mapping
We can display a data dimension on a map by coloring individual regions according to that dimension. Such maps are called choropleth maps.
We take the population number for each county in the US, divide it by the
county’s surface area, and then draw a map where the color of each county
corresponds to the ratio between population number and area.
We use light colors to represent low population densities and dark colors to represent high densities, so that high-density metropolitan areas stand out as dark colors on a background of light colors. We tend to associate darker colors with higher intensities when the background color of the figure is light.
As a general principle, when figures are meant to be printed on white paper,
light-colored background areas will typically work better. For online viewing
or on a dark background, dark-colored background areas may be preferable.
Choropleths work best when the coloring represents a density (i.e., some quantity divided by surface area).
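A hedged GeoPandas sketch of such a choropleth; the file us_counties.shp and its population column are hypothetical, and the CRS choice is one reasonable equal-area option:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Hypothetical shapefile of US counties with a "population" attribute
counties = gpd.read_file("us_counties.shp")

# Project to an equal-area CRS so that .area returns meaningful surface areas
counties = counties.to_crs("EPSG:5070")    # CONUS Albers equal-area projection
counties["density"] = counties["population"] / (counties.geometry.area / 1e6)  # people per km^2

# Choropleth: color each county by population density; darker means denser
ax = counties.plot(column="density", cmap="Greys", legend=True, figsize=(9, 6))
ax.set_axis_off()
plt.show()
```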
Cartograms
Some states take up a comparatively large area but are sparsely populated,
while others take up a small area yet have a large number of inhabitants. What
if we deformed the states so their size was proportional to their number of inhabitants? Such a modified map is called a cartogram.
As an alternative to a cartogram with distorted shapes, we can also draw a
much simpler cartogram heatmap, where each state is represented by a colored
square.