Data Visualization

Data visualization with the programming language R

Paul Brennan (Cardiff University, UK)

Data visualization is an extremely valuable skill in science, finance and journalism. Learning to program will help reproducible data analysis and will increase the different types of visualization that can be generated. The statistical programming language R is a very useful programming language. The R community is friendly, supportive and very diverse including students, academics, health scientists, journalists and professional data scientists. An experience of R or another programming language such as Python or JavaScript will improve your science and your employment opportunities in and outside of research. Programming is a useful skill in education, finance, journalism and other areas too.

Downloaded from http://portlandpress.com/biochemist/article-pdf/doi/10.1042/bio_2021_174/921287/bio_2021_174.pdf by guest on 01 October 2021

Introduction

Data visualization is a vital part of sharing our research. Common data visualizations start with bar charts and line graphs: images for your supervisor and presentations for your department. Then perhaps you need to summarize your data for your thesis and hopefully some papers. As such, making figures is an important part of every scientist’s career and good visualizations will improve your data and your presentations will have more impact.

Data visualization is a valuable skill outside of research and the knowledge and technical ability to make effective visualizations is a very useful transferable skill. Many colleagues have won their next job, inside and outside academia, because of effective data analysis and visualization skills.

Our task is to create data visualizations that really communicate our science with the minimum of distractions. My aim is always to make figures that can be understood by themselves requiring as little reading of text as possible. There are some great good practice guides for improving your data visualization skills. One of the pioneers in visualization is Edward Tufte who published a beautiful book called The Visual Display of Quantitative Information in 1983. He was one of the first to identify key principles for data visualization. These include avoiding distortion, presenting many numbers in a small space and encouraging the eye to compare different pieces of data. He also wrote about data-ink maximization and chart junk. For a more modern analysis, the book Visualization Analysis and Design by Tamara Munzner provides a detailed and systematic analysis of data visualization. The book provides lots of good examples. Focussing on science, Nature Blogs have gathered a collection of pieces that were published in Nature Methods entitled Points of View. These are described as practical advice on visualizing scientific data. They are bite sized and I found them very readable.

Learning to program to create visualizations

I recommend learning how to program to generate good-quality data visualizations. My favourite tool is the statistical programming language R. Python is also good and I have used JavaScript in the past. These programming languages will allow you to make your work more reproducible. This is great for your future self as you move towards writing and publishing projects that are months and years long. Reproducible data analysis and visualization is good practice for the whole of biochemistry and molecular biology.

By using a programming tool such as R, you will probably have access to a wider variety of visualization methods. I first learned R, in 2014, to allow me to make heat maps and cluster analysis and it immediately helped me analyse a proteomic data set. The book Visualize This by Nathan Yau provided me with code that worked immediately and I was hooked. Writing this article, I went back to the original data from 2014 and I was able to reproduce that cluster diagram from my code (Figure 1a). This would be much more difficult using non-programmatic workflows, for example, those based on Excel. Did I mention that R is open source and free? This makes me happy too. Figure 1b and c show other examples of visualizations that R can produce: a violin plot and a phylogenetic tree.

Violin plots are a good way to represent data but are not very commonly used in biochemistry, where the bar plot is more common. There have been calls to move away from bar plots or dynamite plots as they are sometimes called, as they are not great ways to summarize data or reflect variation. Programming

October 2021 © The Authors. Published by Portland Press Limited under the Creative Commons Attribution License 4.0 (CC BY-NC-ND)

Figure 1. R allows reproducible data visualizations. (a) My first published R data visualization – a cluster dendrogram from a proteomic data set. First published in 2014 and reproduced in 2021 (Alsagaby et al, 2014, https://doi.org/10.1021/pr5002803). (b) A violin plot showing neutralizing antibody titres against five SARS-CoV-2 strains reproduced from shared data and code from Wall et al (2021, https://doi.org/10.1016/S0140-6736(21)01290-3). (c) Phylogenetic tree of human kinase domains inspired by visualization by Manning et al (2002, https://doi.org/10.1126/science.1075762). All made with R: code and data required to make these visualizations are available from https://rforbiochemists.blogspot.com/ or https://github.com/brennanpincardiff/
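Figure 1a is a cluster dendrogram. As a rough idea of how such a diagram can be drawn, here is a minimal sketch in base R. It is not the code used for the figure: the abundance values are simulated, and the sample and protein names are invented for illustration.

```r
# Simulated abundance values stand in for a real proteomic data set:
# 6 hypothetical samples (rows) x 20 hypothetical proteins (columns).
set.seed(1)
abundance <- matrix(
  rnorm(6 * 20),
  nrow = 6,
  dimnames = list(paste0("sample_", 1:6), paste0("protein_", 1:20))
)

# Shift the last three samples so that there are two groups to find.
abundance[4:6, ] <- abundance[4:6, ] + 5

# Euclidean distance between samples, then hierarchical clustering.
sample_dist <- dist(abundance)
clusters <- hclust(sample_dist, method = "complete")

# Draw the cluster dendrogram.
plot(clusters, main = "Cluster dendrogram (simulated data)",
     xlab = "Sample", sub = "")

# Cut the tree into two groups to see which samples cluster together.
groups <- cutree(clusters, k = 2)
print(groups)
```

Because the script fixes the random seed and contains every step, running it again should redraw the same tree, which is the kind of reproducibility the article describes.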

allows you to easily create lots of different types of plot as you explore your data, again in a reproducible way. The violin plot in Figure 1b is from a paper published in 2021 in the Lancet by Emma Wall and others from the David Bauer laboratory working in the Francis Crick Institute in London. They shared their data and R script which generated their graphs and statistical analysis through Github and so I was able to reproduce their graph in just 20 minutes. This is the benefit of learning and using programming to analyse your data – easy reproducibility and sharing – good practice for your science.

Box 1. Steps to making your first graph in R

• Download R
• Download R-Studio
• Open R-Studio
• File > New File > R Script
• Go to R for Biochemists blog
• Select a Starting point for yourself
• On that page, cut and paste the script into your new file
• Run the script line by line and see your first R data viz appear
• Change the code and repeat
• Learn and try to have fun
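As a rough illustration of the kind of short script Box 1 has in mind, the sketch below draws a protein standard curve using only base R. It is not taken from the R for Biochemists blog; the numbers and column names are invented for illustration. It imports the data, tidies it, summarizes it and then plots it.

```r
# Step 1: import. A real script would usually read a file, for example
# read.csv("my_data.csv"). Here invented values are read from text so
# that the example runs on its own.
csv_text <- "protein_ug,absorbance_1,absorbance_2
0,0.05,0.06
5,0.31,0.29
10,0.54,0.57
20,1.02,1.05
40,1.98,2.03"
raw <- read.csv(text = csv_text)

# Step 2: tidy. Reshape to one row per measurement.
tidy <- data.frame(
  protein_ug = rep(raw$protein_ug, times = 2),
  replicate  = rep(c(1, 2), each = nrow(raw)),
  absorbance = c(raw$absorbance_1, raw$absorbance_2)
)

# Step 3: summarize. Mean absorbance for each amount of protein.
means <- aggregate(absorbance ~ protein_ug, data = tidy, FUN = mean)

# Step 4: visualize. Plot every measurement and add a fitted line.
plot(absorbance ~ protein_ug, data = tidy,
     xlab = "Protein (micrograms)", ylab = "Absorbance",
     main = "Protein standard curve (invented data)")
fit <- lm(absorbance ~ protein_ug, data = tidy)
abline(fit)

print(means)
print(coef(fit))
```

Running the script line by line in R-Studio, as Box 1 suggests, shows what each step adds. Changing a value in the data and running it again is an easy first experiment.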
First steps in learning R

I recommend the R-Studio integrated development environment (IDE) for R. It made R much easier to use and really helped my learning. You can write scripts which record your workflow. These can be saved, shared and opened later. You can keep a record of everything you do. For bioinformatics, there is a large collection of R packages – packages are collections of R functions that facilitate data analysis. One important collection is held at Bioconductor, which collects and develops R tools for the analysis and comprehension of high-throughput genomic data and biologic data.

Learning R does take some time. Regular practice makes the difference – every day at the beginning if possible. There is a free online book entitled R for Data Science (https://r4ds.had.co.nz/) that can help. This book promotes an opinionated set of packages (groups of functions) that work together in R called the TidyVerse. They represent a good framework for data analysis and have a lot of free resources available online. A typical data analysis workflow involves:
1. Importing the data
2. Tidying our data (also known as wrangling or munging)
3. Transforming or summarizing
4. Visualizing

If you want to try to make your first data visualization in R, see Box 1 for how to start. There are plenty of sample scripts on the R for Biochemists blog site for you to explore. You could consider signing up for an online course with the Biochemical Society.

Perhaps you have a data set of your own that you want to play with and a visualization that you want to try to make. This will help to inspire you to keep learning. Keep trying and perhaps share your results. I’m not saying that it is easy. Sometimes learning programming is quite frustrating. However, solving the challenges that arise can be very rewarding. My recommendation is regular practice for a couple of months and you will see a difference in your work.

Teaching R to biochemists and molecular biologists

As part of the Training Theme panel for the Biochemical Society, we created an online course focussed around teaching R to biochemists and molecular biologists. It is called R for Biochemists 101. We created this 5 years ago and over that time, over 500 learners have engaged with the course. For a first person experience of this see Box 2 written by Ellie Davis. In R for Biochemists, I have focussed on real data that has been generated by researchers in the biomedical or life sciences. I usually find this more interesting than data about airlines or Star Wars. I think it makes the analysis and visualizations more relevant to biochemists. During the course learners draw a protein standard curve, an enzyme kinetics plot and a volcano plot. We calculate some kinetics constants. We also extract data from images and draw some maps.

In parallel, I also created a blog site called R for Biochemists which has been going since 2015. There are many practice example R scripts on the site and you are welcome to explore this at your leisure. Perhaps the most interesting place to start is the visual index. Since it began, the blog has had over 150,000 visits. The most popular page is a script that shows how to make a volcano plot (Figure 2a). It has had >10,000 views from 16 different countries. It has been interesting to share teaching materials in this way. Figure 2 shows three of the most popular visualizations that have been viewed – the volcano plot, an LD50 plot and a graph from some flow cytometry data.

Last year, we developed another R course entitled R for Biochemists 201. The first run of this course will be in November 2021 and future runs are planned. The material builds on R for Biochemists 101. R for Biochemists 201 will teach participants the key concepts of tidy data, about good coding practice

Box 2.  Ellie Davis’ R journey

• My journey with R began with the R for Biochemists


101 online course and has taken several guises; I
have completed the course as a learner, managed
the course as an administrator and, most recently,
contributed to a content update. Not so very
long before my own adventures with R, I believed
any sort of programming to be a completely
unattainable skillset for me, even if its value has



always been very apparent. I don’t think I’m alone
in this; words like “script” and “code” are daunting
for many, and their associated skills reserved for
an exclusive group. I have since discovered that
this isn’t the case. In fact, the R community is
welcoming and supportive with a culture founded
on open source and collaboration. While working
with Paul on the maintenance of R101 and the
development of R201, I have taken great pleasure
in seeing this culture continued in the comment
sections; learners are encouraged to use the wealth
of resources available to support their training, and
it is wonderful to see them share what they have
found with their fellow learners. These same learners
are often complete beginners to programming,
and yet they are quickly able to harness its power
to visualize and analyse their data. I really enjoyed
reading learner comments, especially their plans
to apply newly learned code to a complicated data
set. Sometimes, they have been sitting on the data
for a while due to feeling unsure of how to handle
it and now they feel empowered. That’s why it’s
important to break down the walls to computational
skills and continue to equip our community with the
tools they need to maximize the potential of their
research.
• Ellie Davis worked in the Biochemical Society
Conference Office from 2018–2021, managing the
Society’s Training from 2020 and has now moved to
Historic England as a Training Adviser.

such as developing reproducible workflows and how to create more complex data visualizations in R. Please get in contact with the Biochemical Society if you would like to join the next run of R101 or R201. The Biochemical Society also runs online training in Python and provides training in a wide variety of biochemical areas. If you would like to get involved in designing and delivering training please make contact.

Lots of journals now require authors to share their data. One of the exercises I recommend is to try to reproduce a figure the authors have made from their data. This is a great way to play with data in the public domain. Having a target figure to try to reproduce is a good test of your understanding of the data and the message of the paper. This approach is often quite a challenge because very few authors will supply the code they used to make the data. Sometimes authors have

Figure 2. Three of the most viewed data visualizations from R for Biochemists blog. (a) A volcano plot. (b) A drug dose–response curve and LD50 calculation. (c) A simple flow cytometry plot. All made with R: code and data required to make these visualizations are available from https://rforbiochemists.blogspot.com/ or https://github.com/brennanpincardiff/
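Figure 2a is a volcano plot, which sets the size of each change (log2 fold change) against its statistical significance (-log10 of the p-value). The sketch below is a minimal base R version, not the script from the R for Biochemists blog. The data are simulated, and the cut-offs (a fold change of more than 2 in either direction and p < 0.05) are illustrative assumptions.

```r
# Simulated results for 500 hypothetical proteins.
set.seed(42)
n <- 500
results <- data.frame(
  log2_fold_change = rnorm(n, mean = 0, sd = 1.5),
  p_value = runif(n)
)

# The y axis of a volcano plot is -log10(p), so small p-values sit high up.
results$neg_log10_p <- -log10(results$p_value)

# Flag points that pass both illustrative cut-offs.
fc_cutoff <- 1      # |log2 fold change| > 1, i.e. more than 2-fold
p_cutoff <- 0.05
results$changed <- abs(results$log2_fold_change) > fc_cutoff &
  results$p_value < p_cutoff

# Draw the plot, colouring the flagged points.
plot(results$log2_fold_change, results$neg_log10_p,
     col = ifelse(results$changed, "red", "grey50"), pch = 16,
     xlab = "log2 fold change", ylab = "-log10 p-value",
     main = "Volcano plot (simulated data)")
abline(h = -log10(p_cutoff), lty = 2)
abline(v = c(-fc_cutoff, fc_cutoff), lty = 2)

# Count how many points were flagged.
print(table(results$changed))
```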


Figure 3. Three data visualizations made while participating in Tidy Tuesday – an R data visualization showcase. (a) A volcanic activity time line inspired by @ijeamaka_a. (b) Illustrating the importance of numbers in password strength. Across a range of password types, inclusion of numbers increases password strength. (c) Showing the proportion of female culprits in Scooby Doo shows from 1960s to 2020s. Tidy Tuesday site: https://github.com/rfordatascience/tidytuesday. All made with R: code and data required to make these visualizations are available from https://rforbiochemists.blogspot.com/ or https://github.com/brennanpincardiff/



Figure 4. Showcasing drawProteins. (a) The Hexsticker for the Bioconductor package drawProteins. (b) Schematic of SARS-CoV S1 protein variants. Protein and variant information from Uniprot (https://www.uniprot.org/uniprot/P0DTC2). All made with R: code and data required to make these visualizations are available from https://rforbiochemists.blogspot.com/ or https://github.com/brennanpincardiff/

made some changes to the data that you don’t appreciate the first time you look at things. I’ve written a few blog posts on R for Biochemists about how I have done this.

Engaging with the wider R community

The R community extends well beyond science and I am inspired by it regularly. By attending a local R User group, I met colleagues from local companies, the NHS and other universities. I learned some really important computer programming lessons including the value of version control, how to create my own package and more about machine learning. We learned together, gave talks, ran SatRday One Day Conferences and online workshops. I have seen colleagues in the User group use their knowledge of R to move from academia to industry, specifically into data science consultancy and finance.

I have found the R community friendly and inclusive. As a result of COVID-19, there are many online workshops. It is possible to engage with learners and conferences from all over the world. There are over 200 active R-Ladies groups (@RLadiesGlobal) in many different countries that you can join for free. The R-Forwards group (@R_Forwards) aims to create teaching resources that reflect the diverse community of R users. The broader data community can be excellent too. As examples, you can find @BlackInData, @R_LGBTQ and @QueerinAI on Twitter. I follow all of these. I recommend reaching out to those who interest and inspire you to find your own supportive group. This may be in person or online as suits you.

Tidy Tuesday is a data visualization showcase that runs every Tuesday. They share data on Github. If you search Twitter or other social media platforms for #TidyTuesday, you will see inspiring data visualizations. I have been inspired by @ijeamaka_a, @dokatox, @juliasilge among others. Many of the people posting visualizations include their code so you can try to reproduce their work. Sometimes it requires an investment of time. When I participate, I always learn something. Here are three data visualizations I made while participating in Tidy Tuesday (Figure 3). The topics vary: volcanic activity, password strength and the gender of Scooby Doo culprits!

Making protein schematics with R – drawProteins

Over my research and teaching career, I’ve made many schematics of proteins. These are not always regarded as data visualization but they are. My tool of choice was PowerPoint. In 2018, I decided to write R code to allow me to reproducibly and programmatically generate protein schematics. Moving from data analysis to creating my own functions and publishing these as a package was my next step in becoming a computer programmer. I had to learn about code development, unit testing and continuous integration. I engaged with the Bioconductor community and published my package called drawProteins. Figure 4 shows a little of what is possible with the package. I’ve used drawProteins to visualize Spike protein S1 modifications from various SARS-CoV-2 coronavirus strains using data downloaded from UniProt, a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. drawProteins can also be

used to visualize multiple proteins and post-translational modifications such as protein phosphorylation sites. These protein visualizations can be generated programmatically and in a reproducible way.

Final words

I hope that I have persuaded you that creating data visualizations is vital for scientists and is a transferable skill. If you can do it with a programming language it will help you in the long term: the months, years and decades ahead. We generate lots of data of increasingly sophisticated type in biochemistry, molecular biology and business. If you can learn to summarize data into inspiring visualizations, this skill will be useful for your future career no matter where it leads. Please be aware that all the data visualizations shown in the figures were made with R. You can reproduce these figures using the code and data which are available at the R for Biochemists blog and Github.

Anyone interested in learning more about the Society’s programming courses and training work should visit www.biochemistry.org/events-and-training/training or contact the Training Team at conferences@biochemistry.org. ■
Further Reading

Resources for starting to learn R


• This website and the published book is a great starting point for learning R: https://r4ds.had.co.nz/ or Wickham, H. and Grolemund, G. (2017) R for Data Science. O'Reilly Media, Sebastopol
• There are lots of biochemistry-inspired R scripts to try on the R for Biochemists blog site: https://rforbiochemists.blogspot.com/
• More general R information is available through the R-Studio website: https://www.rstudio.com/ and https://www.rstudio.com/resources/webinars
• Biochemical Society Training Courses (https://biochemistry.org/events-and-training/training/): R for Biochemists 101 and, coming in 2021, R for Biochemists 201.
Learning more about data visualization
• This collection is very interesting and well worth a read: Evanko, S. (2013) Data visualization: A view of every Points of View column. Methagora http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html
• This book inspired me to learn R: Yau, N. (2011) Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, Wiley
• Inspirational talks and visualizations: VizBi website https://vizbi.org/ Lots of videos, talks and posters about visualizing biological data.
• A detailed theoretical knowledge: Munzner, T. (2015) Visualization Analysis & Design, CRC Press, Boca Raton
• Edward Tufte information and books https://www.edwardtufte.com/tufte/books_vdqi
R packages and inspiration
• TidyVerse website https://www.tidyverse.org/
• Bioconductor home page https://www.bioconductor.org/
• Tidy Tuesday site https://github.com/rfordatascience/tidytuesday
• Tidy Tuesday Twitter links: https://twitter.com/search?q=%23tidytuesday&lang=en
• drawProteins on Bioconductor https://www.bioconductor.org/packages/release/bioc/html/drawProteins.html
• R Open Sci – open tools for open science https://ropensci.org/

Dr Paul Brennan is an educator and biochemist with over 25 years' experience of teaching and research. He works in the Centre for Medical Education at Cardiff University. As well as programming, he also facilitates biomedical teaching and has a leadership role in equality, diversity and inclusion. Email: BrennanP@cardiff.ac.uk. University home page: https://www.cardiff.ac.uk/people/view/122818-brennan-paul. Twitter: brennanpcardiff. Github: brennanpincardiff
