Chapter 01
Introduction to Data Visualization
Tiết Gia Hồng [email protected]
Data visualization
• Definition 1
“The use of computer-supported, interactive, visual
representations of abstract data to amplify cognition” [1]
• Definition 2
“Computer-based visualization systems provide visual
representations of datasets designed to help people carry
out tasks more effectively”
Tamara Munzner, 2014
[1] Card, Mackinlay, and Shneiderman, Readings in Information Visualization: Using Vision to Think, 1999
What’s the difference here?
Illustration
•A visual representation (a picture or diagram) that is
used to make some subject more pleasing or easier to
understand
Visualization
•Visualization communicates data.
•From abstract (invisible) things to visible things
•Visualization is about MAPPING.
Visualization
• Visualization
▪ The static or interactive visual representation of
abstract or spatial data to reinforce human
cognition
Examples
Data visualization
•Data visualization communicates information
using graphical representations
▪ For example, statistical graphs and plots
Information visualization
•Information visualization is the study of visual
representation of abstract data to reinforce
human cognition
Scientific visualization
•Scientific visualization is primarily concerned
with the visualization of three-dimensional
phenomena, where the emphasis is on the realistic
rendering of volumes, surfaces, illumination
sources, etc., perhaps with a dynamic (time)
component
Figure: Active brain visualization (Edward Tejnil, NASA/Ames)
Algorithm visualization
•Algorithm visualization shows all the states of
the data structures during the execution of an
algorithm
▪ Usually uses animation
•For example, visualizing a cryptography algorithm
helps the user see every step of the procedure
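The idea of showing every state of a data structure can be sketched in a few lines. The example below is only an illustration: bubble sort is chosen here as an assumption (the slides do not name an algorithm), and the functions `bubble_sort_states` and `render` are hypothetical helpers. Each recorded state would become one frame of an animation.

```python
def bubble_sort_states(values):
    """Return the list's initial state plus one snapshot after every swap."""
    data = list(values)
    states = [list(data)]  # frame 0: the unsorted input
    n = len(data)
    for i in range(n - 1):
        for j in range(n - 1 - i):
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
                states.append(list(data))  # snapshot after each swap
    return states


def render(state):
    """Draw one state as text bars, one row per element."""
    return "\n".join(f"{v:2d} " + "#" * v for v in state)


if __name__ == "__main__":
    for frame, state in enumerate(bubble_sort_states([3, 1, 2])):
        print(f"frame {frame}")
        print(render(state))
```

Storing snapshots rather than drawing inside the sort keeps the algorithm and its display separate, so the same frames could feed a text view or an animated chart.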
Software visualization
•Software visualization is the practice of
creating visual tools to map software elements
and/or display various aspects of the source code
of a software system
▪ Shows architecture, runtime behavior, and the
development process
Why is data visualization important?
Examples
•1854 cholera outbreak
on Broad Street in London
▪ Killed more than 600
people
•John Snow, a physician
▪ Used data visualization
to show that cholera
deaths clustered around
the Broad Street water pump
Visualization workflow
• Start: What data do I have? (best data)
• What do I want to know? (formulate our question)
• What vis method should I use? (charts)
• Does it make sense? If not, try again
Visualization workflow
• What data do I have?
▪ Often the most difficult task, because data sets are hard
to collect
o EX: collect student grade records to study grade
distributions of all CS courses
▪ Never design a visual before getting the needed
data
Visualization workflow
• What do you want to know?
▪ Data vis helps us understand and explore data sets.
However, we must specify what we actually want to
know.
▪ EX: the journalist wanted to know
“How big a city would have to be to house the world’s 7
billion people if it were as dense as … ”
=> How would we answer this question?
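One way to answer the journalist's question is plain arithmetic: area equals population divided by density. The sketch below assumes that reading of the question. The density value is a hypothetical placeholder, not the figure for any real city, and the function names are illustrative only.

```python
import math

WORLD_POPULATION = 7_000_000_000  # the 7 billion people from the question


def city_area_km2(population, density_per_km2):
    """Area needed to house `population` at the given density."""
    return population / density_per_km2


def square_side_km(area_km2):
    """Side length of a square with the given area, for easier comparison."""
    return math.sqrt(area_km2)


# Hypothetical density, for illustration only (not a real city's figure).
example_density = 10_000  # people per square kilometre
area = city_area_km2(WORLD_POPULATION, example_density)
print(f"area: {area:,.0f} km^2, square side: {square_side_km(area):.0f} km")
```

Once the number is known, the visualization question becomes how to show that area, for example as a square drawn over a familiar map.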
Visualization workflow
• Does it make sense?
▪ See something interesting, such as a pattern, trend,
or outlier
▪ Ask ourselves these questions
o Does what we see make any sense?
o Why does it make sense?
▪ Try to explain this “sense” and/or explore further
▪ If our visuals do not reveal anything
important/interesting
o Ask ourselves: what went wrong?
Some simple charts
•Software tools (e.g., Excel) offer many different
graphics for charting.
•The table on the right shows the grade distribution of a
class
▪ Data values are categorized into 8 categories:
o A
o AB
o B
o BC
o C
o CD
o D
o F
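Before any chart can be drawn, the grades have to be tallied per category. The sketch below assumes eight letter grades, with C sitting between BC and CD. The sample grades are made up for illustration and are not the class data from the slide.

```python
from collections import Counter

# Assumed category order: the eight letter grades, best to worst.
GRADE_ORDER = ["A", "AB", "B", "BC", "C", "CD", "D", "F"]


def grade_distribution(grades):
    """Count how many times each grade occurs, in a fixed category order."""
    counts = Counter(grades)
    return [(g, counts.get(g, 0)) for g in GRADE_ORDER]


def text_bar_chart(distribution):
    """Render the counts as a simple horizontal bar chart in text."""
    return "\n".join(f"{g:<2} | " + "#" * c + f" {c}" for g, c in distribution)


# Hypothetical grades, for illustration only.
sample = ["A", "B", "B", "AB", "C", "F", "B", "BC", "A", "CD"]
print(text_bar_chart(grade_distribution(sample)))
```

Keeping categories with a zero count in the output matters: an empty category is still information, and dropping it would distort the shape of the distribution.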
Some simple charts
• Line charts are the most commonly used.
• Scale is usually a tricky issue
Some simple charts
• An area chart is just a different form of a line chart.
▪ Scale and large areas of the same color can be problems
Some simple charts
•The bar chart in its various forms is another form
of a line chart.
Some simple charts
•The point/ bubble chart is yet a different form of a
line chart
Some simple charts
• The pie chart is different from the line chart.
▪ Hard to see thin slices
▪ Where is D? The use of colors can
be a serious problem!
Some simple charts
•The doughnut chart is a variation of the pie
chart.
Some simple charts
•The radar chart combines the pie chart and the
line chart
Multiple series
•Challenges occur when building a chart for two
or more series:
▪ Occlusion can be a big problem
▪ The meaning of each series must be consistent
Multiple series
• Lines are always the easiest.
• But too many lines can be a problem
Multiple series
• Bars share the problems of the lines.
• A very busy bar chart can be difficult to read.
Multiple series
•Pie charts are for a single series. Use a doughnut
chart for multiple series
Multiple series
•Sometimes a radar chart can be effective for
comparison
Multiple series
• This is an area chart
▪ Occlusion problem
Multiple series
•One may also stack bars together. What does this
mean? It depends.
Multiple series
• A typical bar chart is very busy
Multiple series
•A typical line chart is also busy and difficult to
read
Multiple series
• Scatter plots are no better
Multiple series
• Radar charts are busy too
Multiple series
• Now try the scaled bar chart. Isn’t it easier to read?
Multiple series
• Here is a variation
Mosaic plot
•A mosaic plot (Mekko chart) is a plot for visualizing
the relationship between two or more categorical
variables
•It may be considered a combination of a 100%
stacked column chart and a 100% stacked
horizontal bar chart, each of which uses a different
variable
Mosaic plot
• Consider a table first
• Divide a square into bars, and each bar into tiles,
according to the proportions in the table
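The division of the square can be sketched as follows. This is a minimal illustration under stated assumptions: column widths follow the totals of the first variable, and tile heights follow the share of the second variable within each column. The table values and the function name `mosaic_tiles` are hypothetical, and every column is assumed to have a nonzero total.

```python
def mosaic_tiles(table):
    """Lay out a two-way table as tiles inside the unit square.

    table maps each column category to a dict of row category -> count.
    Returns tuples (column, row, x, y, width, height).
    """
    grand_total = sum(sum(rows.values()) for rows in table.values())
    tiles = []
    x = 0.0
    for col, rows in table.items():
        col_total = sum(rows.values())
        width = col_total / grand_total  # share of the first variable
        y = 0.0
        for row, count in rows.items():
            height = count / col_total  # share of the second, within this column
            tiles.append((col, row, x, y, width, height))
            y += height
        x += width
    return tiles


# Hypothetical counts, for illustration only.
example = {
    "Group 1": {"yes": 30, "no": 10},
    "Group 2": {"yes": 20, "no": 40},
}
for tile in mosaic_tiles(example):
    print(tile)
```

With this layout the area of each tile equals that cell's share of the grand total, which is what lets the eye compare cells directly.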
Mosaic plot
• This is again a 2D table with two factors
▪ Hair color
▪ Eye color
Mosaic plot
•The shading is designed to visualize the
differences between the observed frequency and
the expected frequency
•The measure of difference used is often the
Pearson standardized residuals, which can be
computed in most systems
•Some systems may report the p-value. A small
p-value (e.g., < 0.05) indicates strong evidence
against the null hypothesis (i.e., observed =
expected)
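The residuals behind the shading can be computed by hand. The sketch below assumes the common definition of the Pearson residual, (observed - expected) / sqrt(expected), with expected counts taken from the row and column totals under independence. Some systems apply a further adjustment, so treat this as one plausible reading. The table values are hypothetical.

```python
import math


def pearson_residuals(observed):
    """Pearson residuals for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(observed):
        res_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            res_row.append((o - e) / math.sqrt(e))
        residuals.append(res_row)
    return residuals


def chi_square(observed):
    """Pearson chi-square statistic: the sum of squared residuals."""
    return sum(r * r for row in pearson_residuals(observed) for r in row)


# Hypothetical counts, for illustration only.
table = [[30, 10], [20, 40]]
print(pearson_residuals(table))
print(chi_square(table))
```

A positive residual means the cell holds more observations than independence would predict, a negative one fewer. Turning the chi-square statistic into a p-value needs the chi-square distribution, which the Python standard library does not provide.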
Treemap
•Treemaps are very useful for showing large
amounts of hierarchically structured data
•The root receives a big rectangle
•The rectangle of a node is split into smaller
rectangles, one for each of that node’s child nodes
•Each rectangle has an area proportional to the
amount of data it represents
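The splitting rule can be sketched with the simple "slice-and-dice" layout, which alternates the split direction at each level of the tree. This is one assumed layout among several, and the slides do not say which one they use. The tree, its sizes, and the function names are hypothetical, and every internal node is assumed to have a positive total size.

```python
def node_size(node):
    """Size of a leaf, or the sum of the sizes below an internal node."""
    children = node.get("children", [])
    if not children:
        return node["size"]
    return sum(node_size(child) for child in children)


def treemap(node, x, y, w, h, depth=0, out=None):
    """Assign a rectangle (x, y, w, h) to every node, in pre-order."""
    if out is None:
        out = []
    out.append((node["name"], x, y, w, h))
    children = node.get("children", [])
    total = node_size(node)
    offset = 0.0  # fraction of the parent already used
    for child in children:
        frac = node_size(child) / total
        if depth % 2 == 0:  # even depth: children sit side by side
            treemap(child, x + offset * w, y, w * frac, h, depth + 1, out)
        else:  # odd depth: children are stacked
            treemap(child, x, y + offset * h, w, h * frac, depth + 1, out)
        offset += frac
    return out


# Hypothetical tree, for illustration only.
tree = {"name": "root", "children": [
    {"name": "a", "size": 2},
    {"name": "b", "children": [{"name": "b1", "size": 1},
                               {"name": "b2", "size": 1}]},
]}
for rect in treemap(tree, 0, 0, 4, 2):
    print(rect)
```

Slice-and-dice is easy to follow but tends to produce long, thin rectangles for deep or unbalanced trees.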
Treemap
• Consider the following tree
▪ Ignore the “node size” issue
Parallel coordinates
•A data table has multiple rows, each of which has
multiple attributes
•Suppose each row has attributes (x1, x2, …, xn).
Then n vertical lines are drawn, one for each
attribute
•A parallel coordinate plot maps each row in the
data table to a line
▪ Each attribute value of a row is represented by a point
on the vertical line corresponding to that attribute
Parallel coordinates
•The values of a parallel coordinate plot are always
normalized
•For each attribute along the horizontal axis, the lowest
value in that column is set to 0% and the highest value
is set to 100% along the vertical axis
•Therefore, do not compare the “heights” across columns,
because the scales of the columns are completely independent
•Each column in the plot shows only the relative position
of each value within that column of the table
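The normalization can be sketched as a per-column rescaling. The example assumes min-max scaling, where the smallest value in a column maps to 0% and the largest to 100%. The rows are hypothetical, and a column whose values are all equal is mapped to 0% here as an arbitrary choice.

```python
def normalize_columns(rows):
    """Rescale every column of a numeric table to the range 0..100 percent."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            100.0 * (v - lo) / (hi - lo) if hi != lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Hypothetical table: each row is one item, each column one attribute.
rows = [[1, 200], [3, 100], [2, 150]]
for original, scaled in zip(rows, normalize_columns(rows)):
    print(original, "->", scaled)
```

Each row of the result gives the heights at which that row's line crosses the vertical axes. Because every column is scaled on its own, equal heights on two axes do not mean equal raw values.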
Parallel coordinates
• Consider the following table
• For each food type, we want to plot a profile of
how the carbohydrates are distributed
Multiple series
•When you have a set of numbers, the first thing
you have to do is calculate the following:
▪ Mean: m
▪ Standard deviation: s
▪ Standard error of the mean: $s/\sqrt{n}$, where n is the
sample size
▪ Median
▪ Quartiles
▪ Confidence interval at 95%: $\alpha = 0.05$
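These quantities can be computed with Python's standard `statistics` module. The sketch below assumes the sample standard deviation (dividing by n - 1) and the module's default quartile method. Other tools may use a different quartile convention and give slightly different values. The data are made up for illustration.

```python
import math
import statistics


def summarize(values):
    """Basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    q1, _, q3 = statistics.quantiles(values, n=4)  # default method
    return {
        "n": n,
        "mean": mean,
        "stdev": s,
        "sem": s / math.sqrt(n),  # standard error of the mean
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical data, for illustration only.
for name, value in summarize([2, 4, 4, 4, 5, 5, 7, 9]).items():
    print(f"{name}: {value}")
```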
Confidence Intervals
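•The confidence interval of mean m, standard
deviation s, and sample size n at confidence level
95% is
$m \pm t_{0.025,\,n-1} \cdot s/\sqrt{n}$, which is approximately $m \pm 1.96 \cdot s/\sqrt{n}$ for large n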
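A 95% confidence interval for the mean can be sketched as below. This assumes the large-sample normal approximation, m ± 1.96 · s/√n. For small samples the multiplier 1.96 would normally be replaced by a Student t critical value, which the standard library does not provide. The data and the function name are hypothetical.

```python
import math
import statistics


def confidence_interval_95(values, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(values)
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = z * s / math.sqrt(n)  # z times the standard error
    return m - half_width, m + half_width


# Hypothetical data, for illustration only.
low, high = confidence_interval_95([2, 4, 4, 4, 5, 5, 7, 9])
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

In a chart, this interval is what an error bar around the mean would show.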
Box-Whisker Plot
•The five-number summary is a set of descriptive
statistics that provide summary information of a
dataset
•It consists of the following five numbers:
▪ Minimum
▪ First quartile
▪ Median
▪ Third quartile
▪ Maximum
•The box-whisker plot is a visual display of the
five-number summary
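The five numbers can be computed directly before any plot is drawn. The sketch below uses the "inclusive" quartile method of Python's `statistics.quantiles`. That is an assumption, since quartile conventions differ between tools and the slides do not name one. The data are made up.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical data, for illustration only.
print(five_number_summary([5, 1, 4, 2, 3]))
```

In the plot, the box spans the first to the third quartile with a line at the median, and the whiskers extend toward the minimum and maximum.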
Box-Whisker Plot
• The following shows a typical box-whisker plot.
• It could be horizontal or vertical.
• We may also add the mean value.
End