Usage of Visualization Techniques in Data Science Workflows: Johanna Schmidt
Johanna Schmidt¹
¹ VRVis Zentrum für Virtual Reality und Visualisierung Forschungs-GmbH, Vienna, Austria
[email protected]
Abstract: The increasing interest in data science and data analytics has led to a growing interest in data visualization and
exploratory visual data analysis. However, there is still a clear gap between new developments in visualization
research and the visualization techniques currently applied in data analytics workflows. Most of the commonly
used tools provide basic charting options, but more advanced visualization techniques have hardly been
integrated as features yet. This especially applies to interactive exploratory data analysis, which has already
been described as the "Interactive Visualization Gap" in the literature. In this paper we present a study on the
usage of visualization techniques in common data science tools. The results of the study confirm that the gap
still exists. For example, we found hardly any support for advanced techniques for temporal data visualization or
radial visualizations in the evaluated tools and applications. At the same time, interviews with professional data
analysts confirm a strong interest in learning and applying new tools and techniques. Users are especially
interested in techniques that can support their exploratory analysis workflow. Based on these findings and our own
experience with data science projects, we present suggestions and considerations towards a better integration
of visualization techniques in current data science workflows.
area plot Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
bubble chart Y Y Y Y Y Y Y Y Y Y Y Y Y Y
bar chart Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
pie chart Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
donut chart Y Y Y Y Y Y Y Y Y Y Y Y Y Y
Multi-d.
parallel coordinates Y Y Y Y Y Y Y Y Y
radar chart Y E Y Y Y Y Y Y Y Y Y
scatter plot matrix Y E Y Y Y Y Y
Flow
Sankey diagram Y Y Y Y Y Y Y Y E
Alluvial diagram Y Y E Y Y Y Y Y E
Matrix
chord diagram Y Y Y Y E
heatmap Y Y Y Y Y Y Y Y Y Y Y Y Y
arc diagram E Y
Temporal data
polar area diagram Y Y Y Y Y Y Y Y Y Y Y
Gantt chart Y Y Y Y Y
circle view
theme river Y E Y Y Y Y Y Y Y
data vases
horizon graphs E Y Y Y
time nets
people garden
Hierarchical
tree diagram Y E Y Y Y Y Y Y
sunburst chart Y E Y Y Y Y
treemap Y E Y Y Y Y Y Y Y Y
contour plot Y Y Y Y Y Y Y
crop circles Y Y
Table 1: Featured visualization techniques. This table illustrates which visualization techniques are currently featured by the evaluated tools and applications. A Y in a table
cell shows that the corresponding tool or application features this technique. E means that the technique is featured via an extension or plugin.
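The Y/E marking scheme described in the caption can be sketched as a small data structure. The following is a minimal illustration only: the tool names `tool_a`, `tool_b`, and `tool_c` and all cell values are hypothetical placeholders, not results of the study, and the helper names `support_summary` and `unsupported` are assumptions introduced for this example.

```python
from collections import Counter

# Support levels as used in Table 1:
# "Y" = featured as core functionality, "E" = featured via extension or plugin.
CORE, EXTENSION = "Y", "E"

# Hypothetical placeholder entries (NOT the study's data).
# An empty dict means that no evaluated tool features the technique.
feature_matrix = {
    "scatter plot matrix": {"tool_a": CORE, "tool_b": EXTENSION},
    "arc diagram": {"tool_b": EXTENSION, "tool_c": CORE},
    "time nets": {},
}


def support_summary(matrix):
    """Count, per technique, how many tools offer core vs. extension support."""
    summary = {}
    for technique, tools in matrix.items():
        counts = Counter(tools.values())
        summary[technique] = {
            "core": counts[CORE],
            "extension": counts[EXTENSION],
        }
    return summary


def unsupported(matrix):
    """Return the techniques that no tool features at all, sorted by name."""
    return sorted(t for t, tools in matrix.items() if not tools)


if __name__ == "__main__":
    for technique, counts in support_summary(feature_matrix).items():
        print(technique, counts)
    print("not featured by any tool:", unsupported(feature_matrix))
```

Keeping core and extension support as separate counts mirrors the distinction the table draws between built-in features and community-provided plugins.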
• Basic charts: scatter plot, line plot, area plot, bubble chart, bar chart, pie chart, donut chart
• Multi-dimensional data: parallel coordinates (Inselberg, 2009), radar chart (Chambers et al., 1983), scatter plot matrix (Hartigan, 1975)
• Flow charts: Sankey diagram (Riehmann et al., 2005), Alluvial diagram (Rosvall and Bergstrom, 2010)
• Matrix data: chord diagram (Telea and Ersoy, 2010), heatmap (Wilkinson and Friendly, 2009), arc diagram (Wattenberg, 2002)
• Temporal data: polar area diagram, Gantt chart, circle view (Keim et al., 2004), theme river (Havre et al., 2002), data vases (Thakur and Rhyne, 2009), horizon graphs (Heer et al., 2009), time nets (Kim et al., 2010), people garden (Xiong and Donath, 1999)
• Hierarchical data: tree diagram, sunburst chart (Stasko et al., 2000), treemap (Shneiderman, 1992), contour plot (Kubota et al., 2007), crop circles (Wang and Parsia, 2006)

We selected the following tools and applications for our study:
• Open source: Python Plotly³, Python Seaborn⁴, R GGPlot2⁵, Vega-Lite⁶, D3⁷, Google Charts⁸, Chart.js⁹, Apexcharts¹⁰, dygraphs¹¹, Bokeh¹², RAWGraphs¹³, .Net LiveCharts¹⁴, Qt Charts¹⁵
• Commercial: Microsoft Power BI¹⁶, Tableau¹⁷, SAS Visual Analytics¹⁸, Highcharts¹⁹, Quadrigram²⁰, Matlab²¹

³ https://plot.ly/python/
⁴ https://seaborn.pydata.org/
⁵ https://ggplot2.tidyverse.org/
⁶ https://vega.github.io/vega-lite/
⁷ https://d3js.org/
⁸ https://developers.google.com/chart/
⁹ https://www.chartjs.org/
¹⁰ https://apexcharts.com/
¹¹ http://dygraphs.com/
¹² https://bokeh.pydata.org/en/latest/
¹³ https://rawgraphs.io/
¹⁴ https://lvcharts.net/
¹⁵ https://doc.qt.io/qt-5/qtcharts-index.html
¹⁶ https://powerbi.microsoft.com/
¹⁷ https://www.tableau.com/
¹⁸ https://www.sas.com
¹⁹ https://www.highcharts.com/
²⁰ http://www.quadrigram.com/
²¹ https://www.mathworks.com/products/matlab.html

Together with computer science students attending a course on visualization in data science, we evaluated the usage of the selected techniques in the different tools and applications. As a result of our study, we were able to create a matrix of tools and applications against the selected visualization techniques, in which a cell is marked if a tool or application features the specific visualization technique. The results of the study can be seen in Table 1. In this table the selected visualization techniques are listed as rows. If a tool or application, listed as columns, features a visualization technique, the corresponding cell is marked with a Y (for "yes"). A visualization technique is considered to be featured if it has been included in the basic functionalities of the tool or application. For example, a scatter plot matrix could also be created by placing several scatter plots side-by-side, but we only consider the technique to be featured if there exists a core functionality creating this visualization. If a visualization technique is provided via extensions or plugins, we placed an E (for "extension") in the table cell.

Not surprisingly, basic chart types like scatter plots and bar charts are well supported by all evaluated tools and applications. Among the more advanced visualization techniques, multi-dimensional techniques like parallel coordinates and radar charts are already widely used and known, and are therefore included in many of the tools. The same applies to scatter plot matrices and heatmaps. Techniques for hierarchical data are also well supported, especially by the open source tools that were evaluated in the study.

Visualization techniques for temporal data are not available in the majority of the tools and applications. This is most probably due to the fact that temporal data (e.g., time-series data) is a very specific data type which is used only for specific tasks. Users usually use their own tools for these purposes. Therefore, techniques for temporal data have not yet been included in common tools and applications, as these tools usually try to address a broader range of data scientists and data analysts. There are some visualization techniques which have not been integrated into any tool or application yet, like time nets, data vases, or people garden.

From a tools and applications point of view, Python Plotly and D3 notably provide the most features among all the tested open source tools. There are other tools that are targeted towards very special functionalities, like dygraphs for scientific plots, which therefore only feature a very limited range of visualization techniques. Other libraries which are intended to be used in web-based applications (e.g., Chart.js or Google Charts) feature only visualization techniques
that will most likely be needed in a web-based context.

Open source tools, especially R GGPlot2, benefit a lot from input from the community, since many advanced visualization techniques are only featured via extensions. In the group of commercial tools it can be observed that Tableau, Microsoft Power BI, and Highcharts feature most of the evaluated visualization techniques.

3 LESSONS LEARNED

We consider further exchange with the field of data science as a valuable and important goal for the visualization community. Previous research efforts and our own study on the usage of visualization techniques in data science revealed that the gap between new developments in visualization research and their application "in the wild" still exists. We therefore identified the following suggestions towards a better integration of visualization in data science workflows:
• Consider the programming environments currently in use in data science. Data scientists use tools they already know and that have proved useful in their workflows. Depending on the existing skills, either programming tools or fully-featured applications are preferred. However, interview studies revealed that data scientists are very interested in exploring and integrating alternatives in their workflow. The visualization community should seize this opportunity, and should also make the change as easy as possible. This requires providing new visualization techniques in the programming environments currently used by data scientists. Such an integration can involve providing extensions to well-known visualization packages, or providing command-line support for existing environments. A better integration of interactive visualization tools will especially be helpful for the Wrangle and Profile stages of the data science workflow.
• Document and report the benefits of using new tools. Data scientists stated in interviews that one of the main obstacles to considering new visualization techniques is that they do not have enough time to get familiar with new tools. The easier it is to access new tools (e.g., by providing them in well-known programming environments), the easier it is for data scientists to try these new opportunities. Documenting the benefits of using new tools also includes proper documentation of the features, user guides, getting-started guides, and example datasets and galleries.
• Integrate provenance in visualization. Exploratory data analysis in particular (the Profile stage of the data science workflow) is an undirected process that very often requires starting from scratch again. In this process data scientists need to keep track of their findings and of the steps they have already tried out. We therefore consider the integration of provenance mechanisms in visualization applications as an important goal. In many cases data scientists use notebook-style environments (e.g., Jupyter²²) to keep track of their decisions and actions. The integration of visualization techniques in existing notebook-style environments will therefore also push their usage in data exploration.
• Support for collaboration. Similar to the need for keeping track of recent activities, data scientists need to communicate results and analysis stages to stakeholders, colleagues from other business units, customers, and other data scientists. This needs to be considered when creating new tools and applications. Data scientists need to be able to capture the state of an analysis (e.g., by storing the current state), so that they can later pick up their work where they left off, or pass on the results.
• Provide guidance in visualization. Data scientists will also benefit from guidelines suggesting suitable visualizations to be applied for certain data types or to solve certain tasks. Some suggestions for the usage of charts have been proposed outside the visualization community. Support for natural language queries has already been included in some data analysis tools (e.g., "Ask Data" by Tableau²³). Findings from studies on color and shape perception have already been considered by many data science applications. Proposing certain visualization techniques during the analysis supports data scientists in their Profile and Report workflow stages. We therefore consider further research on the interpretation and usage of visualization techniques, and on a better understanding of phenomena like visual comparison or visual clutter, an important goal.
• Consider the data science workflow stages. The workflow of data scientists can be categorized into the five stages of Discover, Wrangle, Profile, Model, and Report. When designing new visualization techniques, reflect on the stage of the workflow in which the visualization technique should primarily be used. Every stage requires different types of visualizations. For example, data wrangling in the Wrangle stage requires a focus on data flaws like missing data or outliers, while Model requires visualizations to understand the created models. Neither stage is yet supported by the visualizations available in current data science tools. The most demanding stage in terms of visualization design is the Profile stage, where data scientists need to explore the data to understand its structure. For this stage current data science tools mostly fail to provide suitable visualizations. The data exploration process also requires a high degree of interactivity and inter-connectivity between different visualizations, which is not supported by all data science tools yet. In the Report stage mostly simple and easy-to-understand visualizations are needed, since here the results of the data analysis stage have to be presented to a broader audience. The use cases in this stage can mostly be covered by employing basic charts, which are already well supported by current data science tools.

²² https://jupyter.org/
²³ https://www.tableau.com/products/new-features/ask-data

4 CONCLUSION

This paper advocates for a better exchange between the two research fields of data science and visualization. Visual interfaces can provide substantial support for users working with data. However, the "Interactive Visualization Gap" for exploratory data analysis still exists. This is also confirmed by the study presented in this paper on the usage of visualization techniques in common data science tools. On the other hand, interviews with data scientists reveal a great interest in applying new techniques to get new insights into their datasets. We therefore suggest different strategies for a better integration of visualization techniques in common data science workflows.

ACKNOWLEDGEMENTS

VRVis is funded by BMVIT, BMDW, Styria, SFG and Vienna Business Agency in the scope of COMET – Competence Centers for Excellent Technologies (854174), which is managed by FFG.

REFERENCES

Alspaugh, S., Zokaei, N., Liu, A., Jin, C., and Hearst, M. A. (2019). Futzing and Moseying: Interviews with Professional Data Analysts on Exploration Practices. IEEE Transactions on Visualization and Computer Graphics, 25(1):22–31.
Barlas, P., Lanning, I., and Heavey, C. (2015). A survey of open source data science tools. International Journal of Intelligent Computing and Cybernetics, 8:232–261.
Batch, A. and Elmqvist, N. (2018). The Interactive Visualization Gap in Initial Exploratory Data Analysis. IEEE Transactions on Visualization and Computer Graphics, 24(1):278–287.
Behrisch, M., Streeb, D., Stoffel, F., Seebacher, D., Matejek, B., Weber, S. H., Mittelstaedt, S., Pfister, H., and Keim, D. (2018). Commercial Visual Analytics Systems-Advances in the Big Data Analytics Field. IEEE Transactions on Visualization and Computer Graphics.
Blei, D. M. and Smyth, P. (2017). Science and Data Science. Proceedings of the National Academy of Sciences, 114(33):8689–8692.
Chambers, J., Cleveland, W., Kleiner, B., and Tukey, P. (1983). Graphical Methods for Data Analysis. Wadsworth.
Chapman, C. (2019). A Complete Overview of the Best Data Visualization Tools. https://www.toptal.com/designers/data-visualization/data-visualization-tools. [accessed 2019-07-10].
Gartner (2019). Magic Quadrant for Analytics and Business Intelligence Platforms. https://solutionsreview.com/business-intelligence/thoughtspot-magic-quadrant-for-analytics-and-business-intelligence-platforms/. [accessed 2019-07-09].
Harger, J. R. and Crossno, P. J. (2012). Comparison of Open Source Visual Analytics Toolkits. Proceedings of SPIE - The International Society for Optical Engineering, 8294.
Harris, H. D., Murphy, S. P., and Vaisman, M. (2013). Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O’Reilly Media.
Hartigan, J. A. (1975). Printer graphics for clustering. Journal of Statistical Computation and Simulation, 4(3):187–273.
Havre, S., Hetzler, E., Whitney, P., and Nowell, L. (2002). Themeriver: visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics, 8(1):9–20.
Hayashi, C. (1998). What is Data Science ? Fundamental Concepts and a Heuristic Example. In Data Science, Classification, and Related Methods, pages 40–51. Springer Japan.
Heer, J., Kong, N., and Agrawala, M. (2009). Sizing the Horizon: The Effects of Chart Size and Layering on the Graphical Perception of Time Series Visualizations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’09, pages 1303–1312, Boston, MA, USA. ACM.
Holtz, Y. and Healy, C. (2017). The Chartmaker Directory - Data Story. https://www.data-to-viz.com/#story. [accessed 2019-10-25].
Inselberg, A. (2009). Parallel Coordinates: Visual Multidimensional Geometry and Its Applications. Springer-Verlag, Berlin, Heidelberg.
Kandel, S., Paepcke, A., Hellerstein, J. M., and Heer, J. (2012). Enterprise Data Analysis and Visualization: An Interview Study. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917–2926.
Keim, D. A., Schneidewind, J., and Sips, M. (2004). CircleView: A New Approach for Visualizing Time-related Multidimensional Data Sets. In Proceedings of the Working Conference on Advanced Visual Interfaces, AVI ’04, pages 179–182, Gallipoli, Italy. ACM.
Kim, M., Zimmermann, T., DeLine, R., and Begel, A. (2018). Data Scientists in Software Teams: State of the Art and Challenges. IEEE Transactions on Software Engineering, 44(11):1024–1038.
Kim, N. W., Card, S. K., and Heer, J. (2010). Tracing Genealogical Data with TimeNets. In Proceedings of the International Conference on Advanced Visual Interfaces, AVI ’10, pages 241–248, Roma, Italy. ACM.
Kirk, A. (2019). The Chartmaker Directory. http://chartmaker.visualisingdata.com/. [accessed 2019-08-13].
Kubota, H., Nishida, T., and Sumi, Y. (2007). Visualization of Contents Archive by Contour Map Representation. In New Frontiers in Artificial Intelligence, pages 19–32, Berlin, Heidelberg. Springer Berlin Heidelberg.
Liu, J., Boukhelifa, N., and Eagan, J. R. (2019). Understanding the Role of Alternatives in Data Analysis Practices. IEEE Transactions on Visualization and Computer Graphics (Early Access).
Liu, J., Tang, T., Wang, W., Xu, B., Kong, X., and Xia, F. (2018). A survey of scholarly data visualization. IEEE Access, 6:19205–19221.
McNabb, L. and Laramee, R. S. (2017). Survey of Surveys (SoS) - Mapping The Landscape of Survey Papers in Information Visualization. Computer Graphics Forum, 36:589–617.
Meeks, E. (2019). 2019 Annual Data Visualization Survey Results. https://medium.com/nightingale/2019-annual-data-visualization-survey-results-334d3523073f. [accessed 2019-11-05].
Parsons, M. A., Godøy, Ø., LeDrew, E., de Bruin, T. F., Danis, B., Tomlinson, S., and Carlson, D. (2011). A conceptual framework for managing very diverse data for complex, interdisciplinary science. Journal of Information Science, 37(6):555–569.
Rees, D. and Laramee, R. S. (2019). A Survey of Information Visualization Books. Computer Graphics Forum, 38(1):610–646.
Riehmann, P., Hanfler, M., and Froehlich, B. (2005). Interactive Sankey Diagrams. In Proceedings of the IEEE Symposium on Information Visualization, INFOVIS ’05, pages 31–, Minneapolis, MN, USA.
Rost, L. C. (2016). What I Learned Recreating One Chart Using 24 Tools. https://source.opennews.org/articles/what-i-learned-recreating-one-chart-using-24-tools/. [accessed 2019-07-05].
Rosvall, M. and Bergstrom, C. T. (2010). Mapping Change in Large Networks. PLOS ONE, 5(1):1–7.
Shneiderman, B. (1992). Tree Visualization with Tree-maps: 2-d Space-filling Approach. ACM Transactions on Graphics, 11(1):92–99.
Stasko, J., Catrambone, R., Guzdial, M., and McDonald, K. (2000). An Evaluation of Space-filling Information Visualizations for Depicting Hierarchical Structures. International Journal of Human-Computer Studies - Empirical evaluation of information visualizations, 53(5):663–694.
Telea, A. C. and Ersoy, O. (2010). Image-based Edge Bundles: Simplified Visualization of Large Graphs. In Proceedings of the 12th Eurographics / IEEE - VGTC Conference on Visualization, EuroVis ’10, pages 843–852, Bordeaux, France.
Thakur, S. and Rhyne, T.-M. (2009). Data Vases: 2D and 3D Plots for Visualizing Multiple Time Series. In Advances in Visual Computing, pages 929–938, Berlin, Heidelberg. Springer Berlin Heidelberg.
Wang, T. D. and Parsia, B. (2006). CropCircles: Topology Sensitive Visualization of OWL Class Hierarchies. In Proceedings of The Semantic Web, ISWC ’06, pages 695–708. Springer Berlin Heidelberg.
Wattenberg, M. M. (2002). Arc diagrams: visualizing structure in strings. In Proceedings of the IEEE Symposium on Information Visualization, INFOVIS ’02, pages 110–116, Boston, MA, USA.
Wilkinson, L. and Friendly, M. (2009). The History of the Cluster Heat Map. The American Statistician, 63(2):179–184.
Xiong, R. and Donath, J. (1999). PeopleGarden: Creating Data Portraits for Users. In Proceedings of the 12th Annual ACM Symposium on User Interface Software and Technology, UIST ’99, pages 37–44, Asheville, NC, USA. ACM.
Zhang, L., Stoffel, A., Behrisch, M., Mittelstadt, S., Schreck, T., Pompl, R., Weber, S. H., Last, H., and Keim, D. (2012). Visual analytics for the big data era – A comparative review of state-of-the-art commercial systems. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, VAST ’12, pages 173–182, Seattle, WA, USA.