Data Visualization With The Programming Language R
October 2021 © The Authors. Published by Portland Press Limited under the Creative Commons Attribution License 4.0 (CC BY-NC-ND) 1
Data Visualization
Figure 1. R allows reproducible data visualizations. (a) My first published R data visualization – a cluster dendrogram from a
proteomic data set. First published in 2014 and reproduced in 2021 (Alsagaby et al, 2014, https://doi.org/10.1021/pr5002803).
(b) A violin plot showing neutralizing antibody titres against five SARS-CoV-2 strains reproduced from shared data and code
from Wall et al (2021, https://doi.org/10.1016/S0140-6736(21)01290-3). (c) Phylogenetic tree of human kinase domains
inspired by visualization by Manning et al (2002, https://doi.org/10.1126/science.1075762). All made with R: code and data
required to make these visualizations are available from https://rforbiochemists.blogspot.com/ or
https://github.com/brennanpincardiff/
allows you to easily create lots of different types of plot
as you explore your data, again in a reproducible way.
The violin plot in Figure 1b is from a paper published
in 2021 in the Lancet by Emma Wall and others from
the David Bauer laboratory working in the Francis
Crick Institute in London. They shared their data and
R script which generated their graphs and statistical
analysis through Github and so I was able to reproduce
their graph in just 20 minutes. This is the benefit of
learning and using programming to analyse your data

Box 1. Steps to making your first graph in R
• Download R
• Download R-Studio
• Open R-Studio
• File > New File > R Script
• Go to R for Biochemists blog
• Select a Starting point for yourself
• On that page, cut and paste the script into your new file
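
As a rough idea of what such a starting script can look like, here is a minimal sketch. It is not one of the scripts from the R for Biochemists blog: it assumes only base R, uses the built-in ToothGrowth data set as stand-in data, and the output file name first_graph.pdf is an arbitrary choice for illustration.

```r
# A first graph in base R (illustrative sketch, not a script from the blog).
# ToothGrowth ships with R, so nothing needs to be downloaded or installed.
data(ToothGrowth)

# Look at the data before plotting: mean tooth length at each dose
mean_len <- tapply(ToothGrowth$len, ToothGrowth$dose, mean)
print(mean_len)

# Write the plot to a file so that re-running the script
# always regenerates the same figure (file name is an assumption)
out_file <- "first_graph.pdf"
pdf(out_file, width = 6, height = 4.5)
boxplot(len ~ dose, data = ToothGrowth,
        xlab = "Vitamin C dose (mg/day)",
        ylab = "Tooth length",
        main = "My first graph in R")
dev.off()
```

Pasting this into a new R Script and running it should create the PDF in your working directory. To see the plot in the R-Studio Plots pane instead, leave out the pdf() and dev.off() lines.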
Box 2. Ellie Davis’ R journey
Figure 3. Three data visualizations made while participating in Tidy Tuesday – an R data visualization showcase. (a) A volcanic
activity timeline inspired by @ijeamaka_a. (b) Illustrating the importance of numbers in password strength. Across a range of
password types, inclusion of numbers increases password strength. (c) Showing the proportion of female culprits in Scooby
Doo shows from the 1960s to the 2020s. Tidy Tuesday site: https://github.com/rfordatascience/tidytuesday. All made with R: code and
data required to make these visualizations are available from https://rforbiochemists.blogspot.com/ or
https://github.com/brennanpincardiff/
made some changes to the data that you don’t appreciate
the first time you look at things. I’ve written a few blog
posts on R for Biochemists about how I have done this.

Engaging with the wider R community

The R community extends well beyond science and I
am inspired by it regularly. By attending a local R User
group, I met colleagues from local companies, the NHS
and other universities. I learned some really important
computer programming lessons including the value
of version control, how to create my own package and
more about machine learning. We learned together, gave
talks, ran SatRday One Day Conferences and online
workshops. I have seen colleagues in the User group use
their knowledge of R to move from academia to industry,
specifically into data science consultancy and finance.

I have found the R community friendly and
inclusive. As a result of COVID-19, there are many
online workshops. It is possible to engage with learners
and conferences from all over the world. There are
over 200 active R-Ladies groups (@RLadiesGlobal)
in many different countries that you can join for free.
The R-Forwards group (@R_Forwards) aims to create
teaching resources that reflect the diverse community
of R users. The broader data community can be
excellent too. As examples, you can find @BlackInData,
@R_LGBTQ and @QueerinAI on Twitter. I follow
all of these. I recommend reaching out to those who
interest and inspire you to find your own supportive
group. This may be in person or online as suits you.

Tidy Tuesday is a data visualization showcase
that runs every Tuesday. They share data on Github.
If you search Twitter or other social media
platforms for #TidyTuesday, you will see inspiring data
visualizations. I have been inspired by @ijeamaka_a,
@dokatox and @juliasilge among others. Many of the people
posting visualizations include their code so you can
try to reproduce their work. Sometimes it requires an
investment of time. When I participate, I always learn
something. Here are three data visualizations I made
while participating in Tidy Tuesday (Figure 3). The
topics vary: volcanic activity, password strength and
the gender of Scooby Doo culprits!

Making protein schematics with R – drawProteins

Over my research and teaching career, I’ve made many
schematics of proteins. These are not always regarded
as data visualization but they are. My tool of choice
was PowerPoint. In 2018, I decided to write R code
to allow me to reproducibly and programmatically
generate protein schematics. Moving from data analysis
to creating my own functions and publishing these as
a package was my next step in becoming a computer
programmer. I had to learn about code development,
unit testing and continuous integration. I engaged with
the Bioconductor community and published my package
called drawProteins. Figure 4 shows a little of what is
possible with the package. I’ve used drawProteins to
visualize modifications of the Spike protein S1 from
various SARS-CoV-2 coronavirus strains using data
downloaded from UniProt, a comprehensive, high-
quality and freely accessible resource of protein sequence
and functional information. drawProteins can also be
used to visualize multiple proteins and post-translational
modifications such as protein phosphorylation
sites. These protein visualizations can be generated
programmatically and in a reproducible way.

Final words

I hope that I have persuaded you that creating data
visualizations is vital for scientists and is a transferable

sophisticated type in biochemistry, molecular
biology and business. If you can learn to summarize data
into inspiring visualizations, this skill will be useful for
your future career no matter where it leads. Please be
aware that all the data visualizations shown in the figures
were made with R. You can reproduce these figures
using the code and data which are available at the R for
Biochemists blog and Github.

Anyone interested in learning more about the
Society’s programming courses and training work
Dr Paul Brennan is an educator and biochemist with over 25 years’ experience of teaching and research. He
works in the Centre for Medical Education at Cardiff University. As well as programming, he also facilitates
biomedical teaching and has a leadership role in equality, diversity and inclusion. Email: BrennanP@cardiff.ac.uk.
University home page: https://www.cardiff.ac.uk/people/view/122818-brennan-paul. Twitter:
brennanpcardiff. Github: brennanpincardiff