Communication AND Data Visualization: Data Science For Marketing
IMS Information Management School
5 COMMUNICATION AND DATA VISUALIZATION
5.1 Communication in Data Science
Top-barriers faced by Data Scientists
§“Lack of management/financial support”
§“Lack of clear questions to answer”
§“Results not used by decision makers”
§“Explaining Data Science to others”
Source: “Data Science and the art of persuasion” (Harvard Business Review, 2019)
It’s not enough to have the technical skills
As a Data Scientist you also need to:
§Explain effectively how you come to a result
§Justify the rationale for your approach
§Convince the audience that your results should be used
§Reason how your results can improve business in a particular way
Reports Presentations
Presentations
§Storytelling:
§Set the context/background for the audience to understand the
story’s relevance
§Use an interesting narrative to get the message across
§Convey the message in business terms
§Highlight the business impact and opportunity
§At the end, summarize the highlights and present the call to
action
§Data visualization:
§Good visualizations are essential to convey a good story
§But visualizations (or annotations) should not tell the story itself
(that is a job for the presenter to tell interactively during the
presentation)
Presentation tips
§Be punctual and respect the predefined timings (if any)
§Dress appropriately for the audience
§Adapt to the rhythm of the presentation (e.g., if the CEO engages in
a discussion, you may set the timings aside)
§Avoid using technical terms if the audience is mostly people from the
business side
§Use large fonts in slides (minimum 20 points)
§Avoid slides with long texts. If necessary, use only visualizations and 2
or 3 words
§Avoid spelling and grammar mistakes
§Focus on the message you want to convey, not on the slides
§Never, ever:
§ Ramble on and on without a specific idea to transmit
§ Turn your back to the audience
§ Read verbatim off the slides or from your notes
5.2 Introduction to data visualization
How humans gather and process information
SYSTEM I SYSTEM II
Human memory
§Sensory (aka iconic)
§Short-term (aka working memory)
§Long-term
Why visualize data
§Displays a large amount of information in a small space
§Makes users think about the substance, not the methodology or
technology
§Makes complex information easier to interpret
§Encourages the detection of patterns, anomalies and tendencies
§Makes comparisons easier
Anscombe’s quartet
“Graphics can be more precise and revealing than conventional
statistical computation” (Tufte, 2007)
dataset I        dataset II       dataset III      dataset IV
x      y         x      y         x      y         x      y
10.0   8.04      10.0   9.14      10.0   7.46      8.0    6.58
8.0    6.95      8.0    8.14      8.0    6.77      8.0    5.76
13.0   7.58      13.0   8.74      13.0   12.74     8.0    7.71
9.0    8.81      9.0    8.77      9.0    7.11      8.0    8.84
11.0   8.33      11.0   9.26      11.0   7.81      8.0    8.47
14.0   9.96      14.0   8.10      14.0   8.84      8.0    7.04
6.0    7.24      6.0    6.13      6.0    6.08      8.0    5.25
4.0    4.26      4.0    3.10      4.0    5.39      19.0   12.50
12.0   10.84     12.0   9.13      12.0   8.15      8.0    5.56
7.0    4.82      7.0    7.26      7.0    6.42      8.0    7.91
5.0    5.68      5.0    4.74      5.0    5.73      8.0    6.89

Summary statistics shared by the four datasets:
N = 11
x̄ = 9.0
ȳ = 7.50
l.reg: y = 3.0 + 0.5x
var(x) = 10
var(y) = 3.75
corr(x,y) = 0.816
R² = 0.67
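The shared summary statistics can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library. The names `QUARTET`, `summarize` and `SUMMARY` are illustrative and not part of the course material. It assumes that var(x) and var(y) on the slide are population variances (dividing by N), since that convention reproduces var(x) = 10.

```python
from math import sqrt

# Anscombe's quartet, in the row order of the table above.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics listed on the slide for one dataset."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx                      # least-squares slope
    corr = sxy / sqrt(sxx * syy)           # Pearson correlation
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / n,                  # population variance (assumption)
        "var_y": syy / n,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "corr": corr,
        "r2": corr ** 2,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four rows printed are practically identical, even though the datasets have very different shapes, which is why the numbers alone are not enough.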
Anscombe’s quartet visualization
[Figure: four scatter plots, one per dataset (panels I, II, III and IV), each showing y against x with the x-axis from 0 to 20 and the y-axis from 0 to 14]
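A 2×2 grid like the one on this slide could be drawn roughly as follows. This is a sketch that assumes matplotlib is installed. The axis limits are taken from the slide, and the gray line is the regression line y = 3.0 + 0.5x reported for all four datasets. The names `QUARTET`, `fig` and `axes` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view interactively
import matplotlib.pyplot as plt

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One panel per dataset, with shared axes so the panels are comparable.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
for ax, (name, (x, y)) in zip(axes.flat, QUARTET.items()):
    ax.scatter(x, y)
    # Regression line shared by the four datasets: y = 3.0 + 0.5x
    ax.plot([0, 20], [3.0, 13.0], color="gray", linewidth=1)
    ax.set_title(name)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_xlim(0, 20)
    ax.set_ylim(0, 14)
fig.tight_layout()
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk.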
Data visualization in CRISP-DM
It is used in all phases, but in particular:
§Data Understanding:
§ Exploratory Data Analysis
§Modeling:
§ Assess models’ performance
§Evaluation:
§ Assess models’ generalization
§ Communicate results
§Deployment:
§ Communicate results
§ Monitor performance
5.3 Display mediums
Chart selection diagram – just some examples
Distribution
Histogram
§One variable - numeric
§Changing the number of bins
can affect distribution
visualization
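The effect of the number of bins can be seen by drawing the same variable several times. The sketch below assumes matplotlib is installed and uses synthetic data: the `ages` values are hypothetical, generated from two normal distributions only to have something with two peaks.

```python
import random

import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view interactively
import matplotlib.pyplot as plt

random.seed(42)
# Hypothetical numeric variable: a mix of two groups of ages.
ages = ([random.gauss(30, 5) for _ in range(300)]
        + [random.gauss(55, 6) for _ in range(200)])

bin_counts = [5, 20, 80]
fig, axes = plt.subplots(1, len(bin_counts), figsize=(12, 3))
results = {}
for ax, bins in zip(axes, bin_counts):
    # ax.hist returns the bar heights, the bin edges and the drawn patches.
    counts, edges, _ = ax.hist(ages, bins=bins)
    results[bins] = counts
    ax.set_title(f"{bins} bins")
    ax.set_xlabel("Age")
axes[0].set_ylabel("Count")
fig.tight_layout()
```

With few bins the two groups may blur together, while with many bins the shape becomes noisy. The data is the same in all three panels.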
Density plot
§One variable - numeric
§Used typically when there is a
high number of data points
Count plot
§One variable - categorical
§Used to understand
distribution among categories
and cardinality issues
Scatter plot
§Two numerical variables
§Most familiar way to visualize
bivariate distributions
Joy plot
§One numerical variable by a
categorical variable
§Great way to visualize
distribution (density curves) of
large number of groups in
relation to each other
Joy plot
§Multiple numerical variables by
a categorical variable
§Great way to visualize
distribution (density curves) of
large number of groups in
relation to each other
3D area plot
§Three numerical variables
§Advanced method to explore
the distribution of three
variables
Pair plots
§Multiple variables, from
multiple types
§Used to plot multiple pairwise
bivariate distributions
§Very helpful in the data
understanding phase
Relationship
Scatter plot – three variables
§Two numerical variables, by a
categorical variable
§Color represents the categorical
variable
Scatter plot – four variables
§Two numerical variables, by
two categorical variables
§One of the categorical variables
is represented by the color
§The other categorical variable is
represented by the marker
§Usually not very helpful if
there are too many data points
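Encoding one categorical variable as color and another as marker could look like the sketch below. It assumes matplotlib is installed. The `customers` records and the `segment` and `subscribed` categories are hypothetical and generated at random for illustration.

```python
import random

import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view interactively
import matplotlib.pyplot as plt

random.seed(0)
colors = {"retail": "tab:blue", "business": "tab:orange"}  # 1st categorical variable
markers = {"yes": "o", "no": "x"}                          # 2nd categorical variable

# Hypothetical records: (age, balance, segment, subscribed)
customers = [
    (random.uniform(18, 80), random.uniform(0, 5000),
     random.choice(list(colors)), random.choice(list(markers)))
    for _ in range(60)
]

fig, ax = plt.subplots()
for segment, color in colors.items():
    for subscribed, marker in markers.items():
        points = [(age, balance) for age, balance, seg, sub in customers
                  if seg == segment and sub == subscribed]
        if not points:
            continue  # skip empty combinations
        xs, ys = zip(*points)
        ax.scatter(xs, ys, color=color, marker=marker,
                   label=f"{segment}, subscribed: {subscribed}")
ax.set_xlabel("Age")
ax.set_ylabel("Balance")
ax.legend()
```

Even with 60 points the four combinations start to overlap, which illustrates why this encoding stops being helpful with many data points.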
Bubble plot – three to five variables
§Three numerical variables
§One of the variables defines
the size
§Just as in scatter plots, it is
possible to add two categorical
variables (color and marker)
§Usually not very helpful if
there are too many data points
Facet grid – three to six variables
§Grid of scatter plots, bubble
plots, or line plots
§Helpful when there are too many
data points: the plots are broken
down by categorical variables
§Very helpful in the data
understanding phase
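A facet grid can be sketched with plain matplotlib subplots, one panel per category and shared axes so the panels stay comparable. The `records` data and the `jobs` categories below are hypothetical, and matplotlib is assumed to be installed.

```python
import random

import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view interactively
import matplotlib.pyplot as plt

random.seed(1)
jobs = ["admin", "technician", "retired"]
# Hypothetical records: (job, age, balance)
records = [(random.choice(jobs), random.uniform(18, 80), random.uniform(0, 5000))
           for _ in range(300)]

# One scatter plot per category of the faceting variable.
fig, axes = plt.subplots(1, len(jobs), sharex=True, sharey=True, figsize=(12, 3))
for ax, job in zip(axes, jobs):
    xs = [age for j, age, _ in records if j == job]
    ys = [balance for j, _, balance in records if j == job]
    ax.scatter(xs, ys, s=10)
    ax.set_title(f"job = {job}")
    ax.set_xlabel("Age")
axes[0].set_ylabel("Balance")
fig.tight_layout()
```

Libraries such as seaborn offer a ready-made facet grid. The loop above only shows the idea behind it.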
Composition
Pie chart – Sorted bar plot
§Pie chart:
§ Good to show shares/proportions
§ Not helpful with too many categories
§Sorted bar plot:
§ Helpful when there are many categories
§ If a category requires highlighting, just color the category bar
Stacked bar plot – 100%
§One numerical variable with
the proportion by two
categorical variables
§Useful to depict proportions of
components with
subcomponents
§Useful to depict changes
over time (x as a time
dimension) when only relative
differences matter. However,
not useful if x has too many
periods
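A 100% stacked bar plot first turns the values into shares within each bar and then stacks them. The sketch below assumes matplotlib is installed and reuses the subscription counts by marital status shown on the Tables slide of this deck. The names `counts`, `shares` and `bottom` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view interactively
import matplotlib.pyplot as plt

# Counts by marital status and subscription outcome (from the Tables slide).
counts = {
    "Divorced": {"no": 4136, "yes": 476},
    "Married": {"no": 22396, "yes": 2532},
    "Single": {"no": 9948, "yes": 1620},
    "Unknown": {"no": 68, "yes": 12},
}
categories = list(counts)

# Share (%) of each outcome within each marital status.
shares = {
    outcome: [100 * counts[c][outcome] / sum(counts[c].values()) for c in categories]
    for outcome in ("no", "yes")
}

fig, ax = plt.subplots()
bottom = [0.0] * len(categories)
for outcome, values in shares.items():
    ax.bar(categories, values, bottom=bottom, label=f"Subscribed - {outcome.upper()}")
    bottom = [b + v for b, v in zip(bottom, values)]  # next series stacks on top
ax.set_ylabel("Share of customers (%)")
ax.set_ylim(0, 100)
ax.legend()
```

Every bar reaches 100%, so only the relative differences are visible: the small "Unknown" group looks as tall as the "Married" group.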
Stacked bar plot
§One numerical variable by two
categorical variables
§Useful to depict values of
components with
subcomponents
§Useful to depict changes
over time (x as a time
dimension) when both relative
and absolute differences
matter. However, not useful if
x has too many periods
Area plot – 100%
§One time-variable vs one
numerical variable with the
proportion by one categorical
variable
§Useful to depict changes
over time (x as a time
dimension) when only relative
differences matter
Area plot
§One time-variable vs one
numerical variable by one
categorical variable
§Useful to depict changes
over time (x as a time
dimension) when both relative
and absolute differences matter
Comparison
Tables
Work best for:
§Comparison among items
§Look up individual values
§Data must be precise

Marital     Subscribed - NO   Subscribed - YES
Divorced              4,136                476
Married              22,396              2,532
Single                9,948              1,620
Unknown                  68                 12
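When precise values have to be looked up, a plainly formatted table is often enough. The following standard-library sketch renders the counts above as aligned text with a thousands separator. The function name `render_table` and the column widths are arbitrary choices made for this illustration.

```python
header = ("Marital", "Subscribed - NO", "Subscribed - YES")
rows = [
    ("Divorced", 4136, 476),
    ("Married", 22396, 2532),
    ("Single", 9948, 1620),
    ("Unknown", 68, 12),
]


def render_table(header, rows):
    """Return the table as text: labels left-aligned, numbers right-aligned."""
    lines = [f"{header[0]:<10}{header[1]:>17}{header[2]:>18}"]
    for label, no, yes in rows:
        # The "," in the format spec inserts the thousands separator.
        lines.append(f"{label:<10}{no:>17,}{yes:>18,}")
    return "\n".join(lines)


table = render_table(header, rows)
print(table)
```

Right-aligning the numbers keeps the digits in columns, which makes comparing magnitudes across rows easier.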
Column plot
§Values by categorical variable,
with each series being a
category
§Useful to make comparisons
with few categories
Column plot with facet
§Useful to make comparisons
with multiple categorical
variables, with many categories
§Very helpful in the data
understanding phase
Line plot
§Useful to make comparisons
over time
§Not suitable for a high number
of categories
Other types
Heatmap
source: https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python
Treemap
source: https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python
Geospatial data visualizations
source: https://i.stack.imgur.com/QVoiv.png
5.4
Do’s and don’ts
[Figure: column chart of monthly Budget and Actual values; y-axis from €30000,00 to €75000,00, months on the x-axis]
Applications encourage users to create poorly designed
graphs, but sometimes graphs are deliberately made to deceive
Adapted from Few (2008)
Do’s and don’ts
[Figure: the same column chart of monthly Budget and Actual values, with the y-axis now running from €0,00 to €80000,00]
Starting the y-axis at 0 puts the
differences in proportion and reflects
the full values more accurately
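The effect of the y-axis baseline can be reproduced with a short sketch. It assumes matplotlib is installed and uses the monthly Actual and Budget figures that appear in the data table of a later slide in this section. The axis limits mirror the two versions of the chart, and the variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view interactively
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
actual = [53000, 58000, 61500, 51000, 57000, 61000,
          56000, 60000, 70000, 62000, 64000, 67000]
budget = [50000, 52000, 60000, 50500, 56500, 62000,
          55500, 61500, 71500, 64000, 67000, 71500]

positions = range(len(months))
width = 0.4
fig, (truncated, zero_based) = plt.subplots(1, 2, figsize=(12, 4))
for ax in (truncated, zero_based):
    ax.bar([p - width / 2 for p in positions], actual, width, label="Actual")
    ax.bar([p + width / 2 for p in positions], budget, width, label="Budget")
    ax.set_xticks(list(positions))
    ax.set_xticklabels(months)
    ax.legend()
truncated.set_ylim(30000, 75000)  # truncated axis exaggerates the differences
truncated.set_title("y-axis starts at 30 000")
zero_based.set_ylim(0, 80000)     # bar heights are proportional to the values
zero_based.set_title("y-axis starts at 0")

# How much taller the largest Actual bar looks than the smallest one.
true_ratio = max(actual) / min(actual)
apparent_ratio = (max(actual) - 30000) / (min(actual) - 30000)
print(f"true ratio: {true_ratio:.2f}, apparent ratio: {apparent_ratio:.2f}")
```

On the truncated axis the tallest Actual bar is drawn about 1.9 times as tall as the shortest, while the values themselves differ by a factor of about 1.37.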
Do’s and don’ts
[Figure: the same column chart (y-axis from €0,00 to €80000,00) with the background removed]
Removing the background makes the graph
easier to read and the colors easier to tell apart
Do’s and don’ts
[Figure: the same column chart of monthly Actual and Budget values (y-axis from €0,00 to €80000,00) without 3D effects]
Most of the time 3D effects are unjustified. 3D effects create a
disparity of depth, the occlusion hides information, the
perspective creates distortion, and tilted text isn’t legible
Do’s and don’ts
[Figure: column chart of monthly Actual and Budget values; y-axis from €0 to €80 000, shown with a thousands separator and no decimals]
Horizontal and vertical grid lines should be used only when strictly useful for improving readability
Due to the scale of the values, removing unnecessary decimals and adding the thousands separator improves readability
Since months do not need a separation and the grid lines are displayed in the y-axis, all tick marks can be removed
Do’s and don’ts
[Figures: further versions of the monthly chart, with abbreviated month labels (Jan to Dec), the year 2020 under the x-axis, and the y-axis from 0 to 80 000]
Do’s and don’ts
[Figure: column chart “Sales €”, Actual vs. Budget, Jan to Dec 2020, y-axis from 0 to 80 000, with a data label on every bar and a data table under the x-axis]

Month   Actual    Budget
Jan     €53 000   €50 000
Feb     €58 000   €52 000
Mar     €61 500   €60 000
Apr     €51 000   €50 500
May     €57 000   €56 500
Jun     €61 000   €62 000
Jul     €56 000   €55 500
Aug     €60 000   €61 500
Sep     €70 000   €71 500
Oct     €62 000   €64 000
Nov     €64 000   €67 000
Dec     €67 000   €71 500

Be careful with the Data-Ink ratio. Ink that is used to display anything
that isn’t data or does not improve readability should be reduced to a
minimum
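Reducing non-data ink usually comes down to a handful of styling calls. The sketch below assumes matplotlib is installed and uses the Actual and Budget values from this slide. The choice of which spines to hide, the light horizontal grid, and the helper name `thousands` are illustrative decisions rather than the exact settings used to produce the slides.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view interactively
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
actual = [53000, 58000, 61500, 51000, 57000, 61000,
          56000, 60000, 70000, 62000, 64000, 67000]
budget = [50000, 52000, 60000, 50500, 56500, 62000,
          55500, 61500, 71500, 64000, 67000, 71500]


def thousands(value, _position):
    """Format 80000 as '80 000': no decimals, space as thousands separator."""
    return f"{value:,.0f}".replace(",", " ")


positions = range(len(months))
width = 0.4
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar([p - width / 2 for p in positions], actual, width, label="Actual")
ax.bar([p + width / 2 for p in positions], budget, width, label="Budget")
ax.set_xticks(list(positions))
ax.set_xticklabels(months)
ax.set_ylim(0, 80000)

# Remove ink that does not carry data.
for side in ("top", "right", "left"):
    ax.spines[side].set_visible(False)
ax.tick_params(axis="both", length=0)                  # no tick marks
ax.yaxis.grid(True, color="lightgray", linewidth=0.5)  # light horizontal grid only
ax.set_axisbelow(True)                                 # grid behind the bars
ax.yaxis.set_major_formatter(FuncFormatter(thousands))
ax.set_title("Sales € (2020)")
ax.legend(frameon=False)
```

No data labels or data table are drawn: the y-axis and grid already let the reader estimate the values, so repeating every number would only add ink.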
Do’s and don’ts
[Figure: column chart for Jan to Dec 2020, y-axis from 0 to 80 000]
Use colors consistently. If sales are blue and the budget is gray on a graph, the sales and budget colors must be the same on all graphs
Use contrasting colors, and consider printing requirements and the cultural significance of the colors. For example, red is often used to represent bad things or a warning
Do’s and don’ts
[Figure: line chart “Sales €”, Actual vs. Budget, Jan to Dec 2020, y-axis from 0 to 80 000]
With the graph transformed into lines, the y-axis line better defines the beginning of the graph
Since this is a time-based graph, lines are better at revealing changes in patterns over time than bars
Do’s and don’ts
[Figure: line chart “Sales €”, Actual vs. Budget, Jan to Dec 2020, y-axis from 45 000 to 75 000]
Do’s and don’ts
[Figure: line chart “Sales €”, Jan to Dec 2020, y-axis from 45 000 to 75 000, with the Budget and Actual labels placed next to the lines instead of in a legend]
Do’s and don’ts
[Figure: chart for Jan to Dec 2020 with the y-axis from -6 000 to 6 000]
Do’s and don’ts
Expressing the variation as a percentage provides a quicker understanding.
As a percentage, the y-axis label can be removed
The title should reflect the subject under analysis
[Figure: before and after, from a difficult-to-read graph (the original column chart of monthly Budget and Actual values, y-axis from €30000,00 to €75000,00) to an easy-to-read graph (monthly variation shown as a percentage)]
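The percentage variation can be computed directly from the Actual and Budget values shown earlier in this section. The sketch below uses only the Python standard library and assumes the variation is expressed relative to the budget, that is (Actual - Budget) / Budget. The slides do not state the formula, so treat this as one reasonable reading.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
actual = [53000, 58000, 61500, 51000, 57000, 61000,
          56000, 60000, 70000, 62000, 64000, 67000]
budget = [50000, 52000, 60000, 50500, 56500, 62000,
          55500, 61500, 71500, 64000, 67000, 71500]

# Absolute and relative difference between Actual and Budget per month.
variance_abs = [a - b for a, b in zip(actual, budget)]
variance_pct = [100 * (a - b) / b for a, b in zip(actual, budget)]  # assumption: relative to budget

for month, diff, pct in zip(months, variance_abs, variance_pct):
    print(f"{month}: {diff:+6d}  {pct:+5.1f}%")
```

A single series of signed percentages answers "are we above or below budget, and by how much?" without the reader having to compare two sets of bars or lines.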
Data Science for Marketing
© 2021-2024 Nuno António (Rev. 2024-08-28)
Accreditations and Certifications
Instituto Superior de Estatística e Gestão da Informação
Universidade Nova de Lisboa