Introduction to R and Python Programming Languages
Alex Emmons, PhD (BTEP)
Learning Objectives
1. Learn about popular programming languages in bioinformatics
2. Compare advantages and disadvantages of Python and R
3. Discuss what you will need to learn to use these languages
4. Discuss learning resources
Choosing a programming language
What is a programming language?
A programming language is a formal language that specifies a set of
instructions for a computer to perform specific tasks. It is used to write
software programs and applications, and to control and manipulate
computer systems. (GeeksforGeeks)
Key features of programming languages include:
Syntax
Data types
Variables
Operators
Control structures
Libraries
Paradigms (programming styles / philosophies)
(Source: GeeksforGeeks)
Examples include C++, C#, Perl, Java, Ruby, Python, Julia, and R.
More on paradigms, here.
Why learn programming?
Do all molecular scientists need to learn a programming language?
Absolutely not.
BUT
We are in a big data era, and learning to code can be extremely
beneficial, especially if you do not have access to expensive licensed
software or to bioinformatics analysts who can analyze the data for you.
Which programming language should I learn?
1. Bash
Most bioinformatics work involves understanding specific software
applications and running them in a pipeline, usually with some form of
Bash scripting. Bash is therefore important for processing biological
data, though it is arguably not a formal programming language.
2. Python or R
Depending on your goals, you may lean toward one programming
language over another. For example:
Interested in statistics and data visualization? R may be for you.
Interested in software development and machine learning? A more
general language like Python may be a better fit.
Check out this video!
What is R?
released in 1993
a computational language and environment for statistical computing and
graphics.
complex statistical functions easily accessible
easy to get started, but more difficult to learn
Key features:
open-source
extensible (packages on CRAN (> 19,000), GitHub, and Bioconductor)
wide community
Maintained by a network of collaborators - The R Core Team
Check out more on The R Project for Statistical Computing website.
What is Python?
developed as early as 1991
high-level, popular, general-purpose programming language that has a
readable and easy to learn syntax
Key features:
easy to read
easy to learn
interpreted
multi-platform
wide community
open source libraries (> 300,000)
Two major versions (Python 2 and Python 3); Python 2 reached end of life
in 2020, so new work should use Python 3
Not as easy as R to just start analyzing data
Check out more at https://www.python.org/.
Also, check out this primer for biologists.
Advantages of R and Python
R Programming:
Data visualization (base R and ggplot2), with additional packages that enhance these, especially for -omics data
More packages for data science / bioinformatics (Bioconductor)
Report generation (R Markdown, Quarto)
More popular among scientists and academics (i.e., non-programmers)
Python:
More consistent syntax (generally a right way to do something)
Large data manipulation (generally more efficient)
Shines in machine learning (scikit-learn)
Report generation (Jupyter Notebook)
More popular among software developers and across multiple domains
Image from Toward Data Science, Python vs R: The Basics, author Sidney Kung
What do you need to know to
learn R or Python?
Installation
If you intend to use R or Python through Biowulf, no installation is necessary.
R:
Use this guide.
Python:
You can download directly from https://www.python.org/downloads/.
How do we execute our code?
With both R and Python, code is executed
interactively line by line from the command line
interactively in an IDE
as a script submitted from the command line or in an IDE
For Python, to get started from the command line (start the interpreter, then exit it):
    python
    quit()
For R, to get started from the command line (start R, then quit):
    R
    q()
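The third option above, running code as a script, might look like the following minimal sketch. The file name hello.py and the function greeting are hypothetical names chosen for illustration.

```python
# Contents of a hypothetical file named hello.py

def greeting(name):
    """Build the message that the script prints."""
    return f"Hello, {name}!"


if __name__ == "__main__":
    # This branch runs when the file is executed as a script,
    # for example with: python hello.py
    print(greeting("world"))
```

The same lines could also be typed one at a time into the interactive interpreter or run from an IDE; saving them as a script simply makes the analysis repeatable.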
What is an IDE?
An IDE is an integrated development environment. IDEs generally include features such as:
Console
File access
Environment / variable view
Data view
Plotting window
History
Autocomplete
Debugging
Markdown
IDEs make coding easier. They increase productivity and facilitate project management.
IDEs for R and Python
R:
RStudio
VS Code*
Python:
JupyterLab / Jupyter Notebook* (can be used with C++, Julia, GNU Octave, R, Ruby, and Scheme)
Spyder
IPython
Google Colab
* Can be used with both R and Python
Elements of programming with Python or R
libraries / modules
syntax
variables
functions
data types (dictionaries and tuples in Python)
loops and conditionals
Libraries
R packages can be found at:
CRAN
METACRAN (to search for packages)
Bioconductor
GitHub
Python packages can be found at:
Python Package Index (PyPI)
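Once a library is available, it is loaded with an import statement. The sketch below uses only modules from Python's standard library so that it runs without installing anything; packages installed from PyPI are imported the same way. The example inputs are illustrative.

```python
import math                      # import a whole module
import statistics as stats       # import a module under a shorter alias
from collections import Counter  # import a single name from a module

root = math.sqrt(16)               # square root
average = stats.mean([1, 2, 3, 4])  # arithmetic mean
base_counts = Counter("GATTACA")   # count each character in a string

print(root, average, base_counts["A"])
```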
Bioconductor
A repository for R packages related to biological data analysis, primarily
bioinformatics and computational biology.
A great place to search for -omics packages and pipelines.
Released every 6 months; each release works with a specific version of R.
included packages are “mutually compatible, traceable, and
guaranteed to function for the associated version of R”
Package types: Software, annotation, experimental data, workflows
Bioinformatics related python packages
Biopython
Bioconda
Conda, a package and environment management system, was created for
Python but can now be used for any language.
scverse
R Syntax
more functional
built around functions (function_name())
Case sensitive
white space insensitive (rules for line continuation)
<- or = assignment operators
# used for comments
keywords or words with special meaning (?Reserved)
for example, if, else, repeat, while, function, for, in, next, and break are used for control-flow
statements and declaring user-defined functions.
statement grouping with {}
indexing starts at 1; a negative index (-) removes values
Getting help with help() or ? (e.g., ?print)
Paths use /; \ is an escape character
Python Syntax
more object oriented (. is the attribute-access operator and cannot be used in
variable names)
= assignment operator
# used for comments
35 reserved words in current Python 3 releases (33 before Python 3.7); list them with help("keywords")
lists use brackets [], dictionaries use {}
indentation is important - it defines blocks of code (4 spaces per level by convention)
indexing starts at 0; a negative index (-) counts from the end
Getting help with help() (e.g., help(print))
Paths use /; \ is an escape character
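Several of the Python syntax points above can be seen in one short sketch. The fruit names reuse the example from later slides; the counts are illustrative values only.

```python
import keyword

fruit = ["apples", "bananas", "cantaloupe"]  # a list uses []
counts = {"apples": 3, "bananas": 5}         # a dictionary uses {}

first = fruit[0]   # indexing starts at 0
last = fruit[-1]   # a negative index counts from the end

# Indentation defines which lines belong to the for loop.
upper = []
for name in fruit:
    upper.append(name.upper())

# A backslash starts an escape sequence, so "\t" is one tab character.
tab = "\t"
raw = r"\t"        # a raw string keeps the backslash and the letter t

# The reserved words of the running interpreter.
reserved = keyword.kwlist

print(first, last, upper, len(tab), len(raw), len(reserved))
```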
Compare the code
A syntax comparison from Dataquest: https://www.dataquest.io/blog/python-vs-r/.
R code can be run from Python with the rpy2 library. Python code can be executed from R using
the reticulate package.
Variables
Essentially named storage that can be manipulated.
Rules for R variables:
1. Avoid spaces or special characters EXCEPT '_' and '.'
2. No numbers or underscores at the beginning of an object name.
3. Avoid common names with special meanings (See ?Reserved) or assigned to existing functions (These will
auto complete).
4. Case sensitive
Rules for Python variables:
1. Contain only alphanumeric characters and underscores
2. Must start with a letter or the underscore character
3. Cannot start with a number
4. Case sensitive
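The Python naming rules can be checked in code. This is a minimal sketch: is_valid_name is a hypothetical helper written for illustration, built on the standard str.isidentifier() method and the keyword module, and it also rejects reserved words.

```python
import keyword


def is_valid_name(name):
    """Return True if `name` can be used as a Python variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Candidate names, chosen to exercise each rule.
candidates = ["sample_1", "_tmp", "1sample", "my.var", "for", "Sample_1"]
results = {name: is_valid_name(name) for name in candidates}

for name, ok in results.items():
    print(name, ok)
```

Here sample_1 and Sample_1 are both valid, and because names are case sensitive they would refer to two different variables. The name my.var is fine in R but not in Python.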
Functions
Used to perform specific tasks.
R:
product <- function(a, b) {
  c <- a * b
  c
}
product(5, 7)
[1] 35
Python:
def product(a, b):
    c = a * b
    return c

print(product(5, 7))
35
Code example from https://www.r-bloggers.com/2017/05/r-vs-python-different-similarities-and-similar-differences/
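As a small extension of the Python function above, arguments can be passed by position or by name, and a parameter can be given a default value. The default of 1 for b is an assumption added for illustration and is not part of the original example.

```python
def product(a, b=1):
    """Return a multiplied by b; b defaults to 1 when it is omitted."""
    return a * b


result_positional = product(5, 7)   # arguments matched by position
result_keyword = product(a=5, b=7)  # arguments matched by name
result_default = product(5)         # b falls back to its default

print(result_positional, result_keyword, result_default)
```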
Data Types
R:
Data types: integer, double (numeric), character, and logical. (Also, complex
and raw)
Data structures: vectors, lists, data frames, matrices, factors.
x <- c(1, 2, 3)
typeof(x)
## [1] "double"
class(x)
## [1] "numeric"
is.vector(x)
## [1] TRUE
Python:
Data types: integers, floats, complex numbers, strings, and Booleans (True, False). The separate
long type existed only in Python 2.
Data structures: NumPy arrays, tuples, lists, dictionaries, pandas data frames.
import numpy as np
x = [1, 2, 3]
x = np.array(x)
print(type(x))
<class 'numpy.ndarray'>
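The built-in Python types and two of the data structures listed above can be inspected without any extra packages. This is a minimal sketch; all values are illustrative.

```python
# One example value for each built-in data type.
values = [1, 2.5, 3 + 4j, "gene", True]
type_names = [type(v).__name__ for v in values]

# A tuple is an ordered sequence that cannot be changed after creation.
point = (1, 2)
try:
    point[0] = 10
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# A dictionary maps keys to values and can be extended.
counts = {"apples": 3, "bananas": 5}
counts["cantaloupe"] = 1
banana_count = counts["bananas"]

print(type_names, tuple_is_mutable, banana_count)
```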
Loops and conditionals
Loops - used to iterate over a sequence
R:
fruit <- c('apples', 'bananas', 'cantaloupe')

for (i in fruit) {
  print(i)
}
[1] "apples"
[1] "bananas"
[1] "cantaloupe"
Python:
fruit = ['apples', 'bananas', 'cantaloupe']

for i in fruit:
    print(i)
apples
bananas
cantaloupe
Conditionals - code is executed based on conditions
R:
x <- 3
y <- 5

if (x < y) {
  print(paste(x, 'is less than', y))
} else {
  print(paste(x, 'is not less than', y))
}
[1] "3 is less than 5"
Python:
x = 3
y = 5

if x < y:
    print(x, 'is less than', y)
else:
    print(x, 'is not less than', y)
3 is less than 5
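Loops and conditionals are usually combined. The sketch below reuses the fruit list and labels each name by its length; the length cutoff of 7 and the labels are arbitrary choices made for illustration.

```python
fruit = ["apples", "bananas", "cantaloupe"]

labels = []
# enumerate() yields each position (starting at 0) together with the item.
for position, name in enumerate(fruit):
    if len(name) < 7:
        size = "short"
    elif len(name) == 7:
        size = "medium"
    else:
        size = "long"
    labels.append(f"{position}: {name} is {size}")

for label in labels:
    print(label)
```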
Resources to learn
BTEP and Others
Check the NIH Bioinformatics Calendar for upcoming events including
courses or lessons on Python and R.
Past BTEP courses
Class documentation
Video Archive
NIH library
NIAID Bioinformatics Resources
Dataquest and Coursera
Dataquest - great for learning programming skills
Coursera - great for learning more specific skills
Click here for license information.
Books and other resources:
See this list for introductory R material.
A Primer for Computational Biology, Shawn T. O’Neil
An Introduction to R and Python for Data Analysis: A Side-By-Side
Approach (requires VPN)
Sources
1. https://www.datacamp.com/blog/python-vs-r-for-data-science-whats-the-difference
2. https://shiring.github.io/r_vs_python/2017/01/22/R_vs_Py_post
3. https://realpython.com/python-ides-code-editors-guide/
4. https://medium.com/@hamza_33678/programming-for-bioinformatics-r-vs-python-52969a1f7a49
5. https://towardsdatascience.com/python-vs-r-the-basics-d754c45c1596
6. https://www.dataquest.io/blog/python-vs-r/
7. Learning Python for Data Science: What to Learn and Why, Cindy Sheffield, NIH Library