100% found this document useful (1 vote)

3K views24 pages

Bioinformatics With Python Cookbook - Sample Chapter

Chapter No. 1 Python and the Surrounding Software Ecology Learn how to use modern Python bioinformatics libraries and applications to do cutting-edge research in computational biology For more information: http://bit.ly/1TQEqK8

Uploaded by

Packt Publishing

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

3K views24 pages

Bioinformatics With Python Cookbook - Sample Chapter

Uploaded by

Packt Publishing

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Fr

Bioinformatics with Python Cookbook

Using the hands-on recipes in this book, you'll be able to do practical research and analysis in
computational biology with Python. We cover modern, next-generation sequencing libraries
and explore real-world examples on how to handle real data. The main focus of the book is the
practical application of bioinformatics, but we also cover modern programming techniques and
frameworks to deal with the ever increasing deluge of bioinformatics data.

What this book will do

for you...
Gain a deep understanding of Python's

fundamental bioinformatics libraries and be

exposed to the most important data science
tools in Python
Process genome-wide data with Biopython
Analyze and perform quality control on

next-generation sequencing datasets using

libraries such as PyVCF or PySAM
Use DendroPy and Biopython for

phylogenetic analysis

Inside the Cookbook...

A straightforward and easy-to-follow format
A selection of the most important tasks

and problems

Carefully organized instructions for solving

Perform population genetics analysis on

large datasets

the problem efficiently

Clear explanations of what you did

Apply the solution to other situations

genomic features with simuPOP

$ 54.99 US
35.99 UK

community experience distilled

Tiago Antao

Simulate complex demographies and

P U B L I S H I N G

Bioinformatics with Python Cookbook

If you are either a computational biologist or a Python programmer, you will probably relate to the
expression "explosive growth, exciting times". Python is arguably the main programming language for
big data, and the deluge of data in biology, mostly from genomics and proteomics, makes bioinformatics
one of the most exciting fields in data science.

pl
e

Quick answers to common problems

Bioinformatics with
Python Cookbook
Learn how to use modern Python bioinformatics libraries and
applications to do cutting-edge research in computational biology

Prices do not include

local sales tax or VAT
where applicable

P U B L I S H I N G

Visit [Link] for books, eBooks,

code, downloads, and PacktLib.

Tiago Antao

In this package, you will find:

The author biography

A preview chapter from the book, Chapter 1 'Python and the Surrounding
Software Ecology'
A synopsis of the books content
More information on Bioinformatics with Python Cookbook

About the Author

Tiago Antao is a bioinformatician. He is currently studying the genomics of the mosquito
Anopheles gambiae, the main vector of malaria. Tiago was originally a computer scientist
who crossed over to computational biology with an MSc in bioinformatics from the Faculty of
Sciences of the University of Porto, Portugal. He holds a PhD in the spread of drug resistant
malaria from the Liverpool School of Tropical Medicine, UK. Tiago is one of the coauthors
of Biopythona major bioinformatics packagewritten on Python. He has also developed
Lositan, a Jython-based selection detection workbench.
In his postdoctoral career, he has worked with human datasets at the University of
Cambridge, UK, and with the mosquito whole genome sequence data at the University
of Oxford, UK. He is currently working as a Sir Henry Wellcome fellow at the Liverpool
School of Tropical Medicine.

Preface
Whether you are reading this book as a computational biologist or a Python programmer, you
will probably relate to the "explosive growth, exciting times" expression. The recent growth of
Python is strongly connected with its status as the main programming language for big data.
On the other hand, the deluge of data in biology, mostly from genomics and proteomics makes
bioinformatics one of the forefront applications of data science. There is a massive need for
bioinformaticians to analyze all this data; of course, one of the main tools is Python. We will
not only talk about the programming language, but also the whole community and software
ecology behind it. When you choose Python to analyze your data, you will also get an extensive
set of libraries, ranging from statistical analysis to plotting, parallel programming, machine
learning, and bioinformatics. However, when you choose Python, you expect more than
this; the community has a tradition of providing good documentation, reliable libraries, and
frameworks. It is also friendly and supportive of all its participants.
In this book, we will present practical solutions to modern bioinformatics problems using
Python. Our approach will be hands-on, where we will address important topics, such as
next-generation sequencing, genomics, population genetics, phylogenetics, and proteomics
among others. At this stage, you probably know the language reasonably well and are aware
of the basic analysis methods in your field of research. You will dive directly into relevant
complex computational biology problems and learn how to tackle them with Python. This is
not your first Python book or your first biology lesson; this is where you will find reliable and
pragmatic solutions to realistic and complex problems.

What this book covers

Chapter 1, Python and the Surrounding Software Ecology, tells you how to set up a modern
bioinformatics environment with Python. This chapter discusses how to deploy software using
Docker, interface with R, and interact with the IPython Notebook.
Chapter 2, Next-generation Sequencing, provides concrete solutions to deal with next-generation
sequencing data. This chapter teaches you how to deal with large FASTQ, BAM, and VCF files. It
also discusses data filtering.

Preface
Chapter 3, Working with Genomes, not only deals with high-quality referencessuch as the
human genomebut also discusses how to analyze other low-quality references typical in
non-model species. It introduces GFF processing, teaches you how to analyze genomic
feature information, and discusses how to use gene ontologies.
Chapter 4, Population Genetics, describes how to perform population genetics analysis of
empirical datasets. For example, on Python, we will perform Principal Components Analysis,
compute FST, or Structure/Admixture plots.
Chapter 5, Population Genetics Simulation, covers simuPOP, an extremely powerful
Python-based forward-time population genetics simulator. This chapter shows you how
to simulate different selection and demographic regimes. It also briefly discusses the
coalescent simulation.
Chapter 6, Phylogenetics, uses complete sequences of recently sequenced Ebola viruses
to perform real phylogenetic analysis, which includes tree reconstruction and sequence
comparisons. This chapter discusses recursive algorithms to process tree-like structures.
Chapter 7, Using the Protein Data Bank, focuses on processing PDB files, for example,
performing the geometric analysis of proteins. This chapter takes a look at protein visualization.
Chapter 8, Other Topics in Bioinformatics, talks about how to analyze data made
available by the Global Biodiversity Information Facility (GBIF) and how to use Cytoscape,
a powerful platform to visualize complex networks. This chapter also looks at how to work
with geo-referenced data and map-based services.
Chapter 9, Python for Big Genomics Datasets, discusses high-performance programming
techniques necessary to handle big datasets. It briefly discusses cluster usage and code
optimization platforms (such as Numba or Cython).

Python and
the Surrounding
Software Ecology
In this chapter, we will cover the following recipes:

Installing the required software with Anaconda

Installing the required software with Docker

Interfacing with R via rpy2

Performing R magic with IPython

Introduction
We will start by installing the required software. This will include the Python distribution,
some fundamental Python libraries, and external bioinformatics software. Here, we will also
be concerned with the world outside Python. In bioinformatics and Big Data, R is also a major
player; therefore, you will learn how to interact with it via rpy2 a Python/R bridge. We will also
explore the advantages that the IPython framework can give us in order to efficiently interface
with R. This chapter will set the stage for all the computational biology that we will perform in
the rest of the book.

Python and the Surrounding Software Ecology

As different users have different requirements, we will cover two different approaches on
how to install the software. One approach is using the Anaconda Python ([Link]
[Link]/anaconda/) distribution and another approach to install the software
via Docker (a server virtualization method based on containers sharing the same operating
system kernel[Link] We will also provide some help on how to use
the standard Python installation tool, pip, if you use the standard Python distribution. If you
have a different Python environment that you are comfortable with, feel free to continue using
it. If you are using a Windows-based OS, you are strongly encouraged to consider changing
your operating system or use Docker via boot2docker.

Installing the required software with

Anaconda
Before we get started, we need to install some prerequisite software. The following sections
will take you through the software and the steps needed to install them. An alternative way
to start is to use the Docker recipe, after which everything will be taken care for you via a
Docker container.
If you are already using a different Python version, you are encouraged to continue using
your preferred version, although you will have to adapt the following instructions to suit
your environment.

Getting ready
Python can be run on top of different environments. For instance, you can use Python inside
the JVM (via Jython) or with .NET (with IronPython). However, here, we are concerned not only
with Python, but also with the complete software ecology around it; therefore, we will use the
standard (CPython) implementation as that the JVM and .NET versions exist mostly to interact
with the native libraries of these platforms. A potentially viable alternative will be to use the
PyPy implementation of Python (not to be confused with PyPi: the Python Package index).
An important decision is whether to choose the Python 2 or 3. Here, we will support both
versions whenever possible, but there are a few issues that you should be aware of. The first
issue is if you work with Phylogenetics, you will probably have to go with Python 2 because
most existing Python libraries do not support version 3. Secondly, in the short term, Python 2,
is generally better supported, but (save for the aforementioned Phylogenetics topic) Python
3 is well covered for computational biology. Finally, if you believe that you are in this for the
long run, Python 3 is the place to be. Whatever is your choice, here, we will support both
options unless clearly stated otherwise. If you go for Python 2, use 2.7 (or newer if it has been
released). With Python 3, use at least 3.4.

Chapter 1
If you were starting with Python and bioinformatics, any operating system will work, but here,
we are mostly concerned with the intermediate to advanced usage. So, while you can probably
use Windows and Mac OS X, most heavy-duty analysis will be done on Linux (probably on a
Linux cluster). Next-generation sequencing data analysis and complex machine learning are
mostly performed on Linux clusters.
If you are on Windows, you should consider upgrading to Linux for your bioinformatics work
because many modern bioinformatics software will not run on Windows. Mac OS X will be
fine for almost all analyses, unless you plan to use a computer cluster, which will probably
be Linux-based.
If you are on Windows or Mac OS X and do not have easy access to Linux, do not worry.
Modern virtualization software (such as VirtualBox and Docker) will come to your rescue,
which will allow you to install a virtual Linux on your operating system. If you are working
with Windows and decide that you want to go native and not use Anaconda, be careful with
your choice of libraries; you are probably safer if you install the 32-bit version for everything
(including Python itself).
Remember, if you are on Windows, many tools will be unavailable to you.
Bioinformatics and data science are moving at breakneck speed; this
is not just hype, it's a reality. If you install the default packages of your
software framework, be sure not to install old versions. For example,
if you are a Debian/Ubuntu Linux user, it's possible that the default
matplotlib package of your distribution is too old. In this case, it's
advised to either use a recent conda or pip package instead.

The software developed for this book is available at [Link]

bioinf-python. To access it, you will need to install Git. Alternatively, you can download the
ZIP file that GitHub makes available (however, getting used to Git may be a good idea because
lots of scientific computing software are being developed with it).
Before you install the Python stack properly, you will need to install all the external non-Python
software that you will be interoperating with. The list will vary from chapter to chapter and all
chapter-specific packages will be explained in their respective chapters. Some less common
Python libraries may also be referred to in their specific chapters.
If you are not interested on a specific chapter (that is perfectly fine), you can skip the related
packages and libraries.
Of course, you will probably have many other bioinformatics applications aroundsuch as
bwa or GATK for next-generation sequencing, but we will not discuss these because we do
not interact with them directly (although we might interact with their outputs).

Python and the Surrounding Software Ecology

You will need to install some development compilers and libraries (all free). On Ubuntu,
consider installing the build-essential (apt-get it) package, and on Mac, consider Xcode
([Link]
In the following table, you will find the list of the most important Python software. We strongly
recommend the installation of the IPython Notebook (now known as Project Jupyter). While
not strictly mandatory, it's becoming a fundamental cornerstone for scientific computing
with Python:
Name

Usage
General

URL
[Link]

Purpose

IPython
NumPy

General

[Link]

Numerical Python

SciPy

General

[Link]

Scientific computing

matplotlib

General

[Link]

Visualization

Biopython

General

Bioinformatics

PyVCF

NGS

PySAM

NGS

simuPOP

Population Genetics

DendroPY

Phylogenetics

scikit-learn

General

PyMOL

Proteomics

[Link]
wiki/Main_Page
[Link]
[Link]/en/
latest/
[Link]
[Link]/en/
latest/
[Link]
[Link]/
[Link]
org/DendroPy/
[Link]
org/stable/
[Link]

rpy2

R integration

R interface

pygraphviz

General

Reportlab

General

[Link]
net/
[Link]
[Link]/
[Link]

seaborn

General

Visualization/Stats

Cython

Big Data

[Link]
edu/~mwaskom/software/
seaborn/
[Link]

Numba

Big Data

[Link]
org/

High performance

General

VCF processing

SAM/BAM processing

Genetics Simulation
Phylogenetics
Machine learning
Molecular
visualization

Graph library
Visualization

High performance

Chapter 1
Note that the list of available software for Python in general and bioinformatics in particular
is constantly increasing. For example, we recommend you to keep an eye on projects such as
Blaze (data analysis) or Bokeh (visualization).

How to do it
Here are the steps to perform the installation:
1. Start by downloading the Anaconda distribution from [Link]
downloads. You can either choose the Python Version 2 or 3. At this stage, this is
not fundamental because Anaconda will let you use the alternative version if you
need it. You can accept all the installation defaults, but you may want to make sure
that conda binaries are in your PATH (do not forget to open a new window so that the
PATH is updated).

If you have another Python distribution, but still decide to try Anaconda, be
careful with your PYTHONPATH and existing Python libraries. It's probably
better to unset your PYTHONPATH. As much as possible, uninstall all other
Python versions and installed Python libraries.

2. Let's go ahead with libraries. We will now create a new conda environment called
bioinformatics with Biopython 1.65, as shown in the following command:
conda create -n bioinformatics biopython biopython=1.65 python=2.7

If you want Python 3 (remember the reduced phylogenetics functionality, but

more future proof), run the following command:
conda create -n bioinformatics biopython=1.65 python=3.4

3. Let's activate the environment, as follows:

source activate bioinformatics

4. Also, install the core packages, as follows:

conda install scipy matplotlib ipython-notebook binstar pip
conda install pandas cython numba scikit-learn seaborn

5. We still need pygraphivz, which is not available on conda. Therefore, we need to

use pip:
pip install pygraphviz

6. Now, install the Python bioinformatics packages, apart from Biopython (you only need
to install those that you plan to use):

This is available on conda:

conda install -c

[Link]

pysam

conda install -c [Link] simuPOP

Python and the Surrounding Software Ecology

This is available via pypi:

pip install pyvcf
pip install dendropy

If you need to interoperate with R, of course, you will need to install it; either
download it from the R website at [Link] or use
the R provided by your operating system distribution.

On a recent Debian/Ubuntu Linux distribution, you can just run the following
command as root:
apt-get r-bioc-biobase r-cran-ggplot2

This will install Bioconductor: the main R suite for bioinformatics and
ggplot2a popular plotting library in R. Of course, this will indirectly
take care of installing R.

8. Alternatively, If you are not on Debian/Ubuntu Linux, do not have root, or prefer to
install in your home directory, after downloading and installing R manually, run the
following command in R:
source("[Link]
biocLite()

This will install Bioconductor (for detailed instructions, refer to http://

[Link]/install/). To install ggplot2, just run the
following command in R:
[Link]("ggplot2")
[Link]("gridExtra")

9. Finally, you will need to install rpy2, the R-to-Python bridge. Back at the command
line, under the conda bioinformatics environment, run the following command:
pip install rpy2

There's more
There is no requirement to use Anaconda; you can easily install all this software on another
Python distribution. Make sure that you have pip installed and install all conda packages
with it, instead. You may need to install more compilers (for example, Fortran) and libraries
because installation via pip will rely on compilation more than conda. However, as you also
need pip for some packages under conda, you will need some compilers and C development
libraries with conda, anyway. If you are on Python 3, you will probably have to perform pip3
and run Python as python3 (as python/pip will call Python 2 by default on most systems).
In order to isolate your environment, you may want to consider using virtualenv (http://
[Link]/en/latest/dev/virtualenvs/). This allows you to create
a bioninformatics environment similar to the one on conda.
6

Chapter 1

See also

The Anaconda ([Link] Python

distribution is commonly used, especially because of its intelligent package
manager: conda. Although conda was developed by the Python community,
it's actually language agnostic.

The software installation and package maintenance was never Python's strongest point
(hence, the popularity of conda to address this issue). If you want to know the currently
recommended installation policies for the standard Python distribution (and avoid old
and deprecated alternatives), refer to [Link]

You have probably heard of the IPython Notebook; if not, visit their page at
[Link]

Installing the required software with Docker

Docker is the most widely used framework that implements operating system-level
virtualization. This technology allows you to have an independent container: a layer that is
lighter than a virtual machine, but still allows you to compartmentalize software. This mostly
isolates all processes, making it feel like each container is a virtual machine.
Docker works quite well at both extremes of the development spectrum: it's an expedient way
to set up the content of this book for learning purposes and may be your platform to deploy
your applications in complex environments. This recipe is an alternative to the previous recipe.
However, for long-term development environments, something along the lines of the previous
recipe is probably your best route, although it can entail a more laborious initial setup.

Getting ready
If you are on Linux, the first thing you have to do is to install Docker. The safest solution is to
get the latest version from [Link] While your Linux distribution may
have a Docker package, it may be too old and buggy (remember the "advancing at breakneck
speed" thingy?).
If you are on Windows or Mac, do not despair; boot2docker ([Link]
is here to save you. Boot2docker will install VirtualBox and Docker for you, which allows you
to run Docker containers in a virtual machine. Note that a fairly recent computer (well, not
that recent, as the technology was introduced in 2006) is necessary to run our 64-bit virtual
machine. If you have any problems, reboot your machine and make sure that on the BIOS,
VT-X or AMD-V is enabled. At the very least, you will need 6 GB of memory, preferably more.
Note that this will require a very large download from the Internet, so be sure that you have a
big network pipe. Also, be ready to wait for a long time.

Python and the Surrounding Software Ecology

How to do it
These are the steps to be followed:
1. Use the following command on the Linux shell or in boot2docker:
docker build -t bio
[Link]

If you want the Python 3 version, replace the 2 with 3 versions on the URL.
After a fairly long wait, all should be ready.
Note that on Linux, you will either require to have root privileges or be added
to the Docker Unix group.

2. Now, you are ready to run the container, as follows:

docker run -ti -p 9875:9875 -v YOUR_DIRECTORY:/data bio

3. Replace YOUR_DIRECTORY with a directory on your operating system. This

will be shared between your host operating system and the Docker container.
YOUR_DIRECTORY will be seen in the container on /data and vice versa.

The -p 9875:9875 will expose the container TCP port 9875 on the host
computer port 9875.

4. If you are using boot2docker, the final configuration step will be to run the following
command in the command line of your operating system, not in boot2docker:
VBoxManage controlvm boot2docker-vm natpf1
"name,tcp,[Link],9875,,9875"

On Windows, this binary will probably be in C:\Program Files\

Oracle\VirtualBox.
On a native Docker installation, you do not need to do anything.

5. If you now start your browser pointing at [Link] you should be

able to get the IPython Notebook server running. Just choose the Welcome notebook
to start!

Chapter 1

See also

Docker is the most widely used containerization software and has seen
enormous growth in usage in recent times. You can read more about it
at [Link]

You will find a paper on arXiv, which introduces Docker with a focus on reproducible
research at [Link]

Interfacing with R via rpy2

If there is some functionality that you need and cannot find it in a Python library, your first port
of call is to check whether it's implemented in R. For statistical methods, R is still the most
complete framework; moreover, some bioinformatics functionalities are also only available in
R, most probably offered as a package belonging to the Bioconductor project.
The rpy2 provides provides a declarative interface from Python to R. As you will see, you will be
able to write very elegant Python code to perform the interfacing process.
In order to show the interface (and try out one of the most common R data structures, the
data frame, and one of the most popular R libraries: ggplot2), we will download its metadata
from the Human 1000 genomes project ([Link] As this is not
a book on R, we do want to provide any interesting and functional examples.

Getting ready
You will need to get the metadata file from the 1000 genomes sequence index. Please check
[Link]
[Link] and download the [Link] file. If you are using notebooks,
open the 00_Intro/Interfacing_R [Link] and just execute the wget

command on top.
This file has information about all FASTQ files in the project (we will use data from the Human
1000 genomes project in the chapters to come). This includes the FASTQ file, the sample
ID, and the population of origin and important statistical information per lane, such as the
number of reads and number of DNA bases read.

Python and the Surrounding Software Ecology

How to do it
Take a look at the following steps:
1. We start by importing rpy2 and reading the file, using the read_delim R function:
import [Link] as robjects
read_delim = robjects.r('[Link]')
seq_data = read_delim('[Link]', header=True,
stringsAsFactors=False)
#In R:
# [Link] <- [Link]('[Link]', header=TRUE,
# stringsAsFactors=FALSE)

The first thing that we do after importing is accessing the [Link] R

function that allows you to read files.
Note that the R language specification allows you to put dots in the names of
objects. Therefore, we have to convert a function name to read_delim.

2. Then, we call the function proper; note the following highly declarative features. First,
most atomic objectssuch as stringscan be passed without conversion. Second,
argument names are converted seamlessly (barring the dot issue). Finally, objects
are available in the Python namespace (but objects are actually not available in the R
namespace; more about this later). For reference, I have included the corresponding
R code. I hope it's clear that it's an easy conversion.

The seq_data object is a data frame. If you know basic R or the Python
pandas library, you are probably aware of this type of data structure; if not,
then this is essentially a table: a sequence of rows where each column has
the same type. Let's perform a basic inspection of this data frame as follows:
print('This dataframe has %d columns and %d rows' %
(seq_data.ncol, seq_data.nrow))
print(seq_data.colnames)
#In R:
# print(colnames([Link]))
# print(nrow([Link]))
# print(ncol([Link]))

Again, note the code similarity. You can even mix styles using the
following code:
my_cols = [Link](seq_data)
print(my_cols)

Chapter 1

You can call R functions directly; in this case, we will call ncol if they do not
have dots in their name; however, be careful. This will display an output, not
26 (the number of columns), but [26] which is a vector composed of the
element 26. This is because by default, most operations in R return vectors.
If you want the number of columns, you have to perform my_cols[0].
Also, talking about pitfalls, note that R array indexing starts with 1, whereas
Python starts with 0.

3. Now, we need to perform some data cleanup. For example, some columns should be
interpreted as numbers, but they are read as strings:
as_integer = robjects.r('[Link]')
match = [Link]
my_col = match('BASE_COUNT', seq_data.colnames)[0]
print(seq_data[my_col - 1][:3])
seq_data[my_col - 1] = as_integer(seq_data[my_col - 1])
print(seq_data[my_col - 1][:3])

The match function is somewhat similar to the index method in Python lists.
As expected, it returns a vector so that we can extract the 0 element. It's also
1-indexed, so we subtract one when working on Python. The as_integer
function will convert a column to integers. The first print will show strings
(values surrounded by "), whereas the second print will show numbers.

4. We will need to massage this table a bit more; details can be found on a notebook,
but here we will finalize with getting the data frame to R (remember that while it's
an R object, it's actually visible on the Python namespace only):
[Link]('[Link]', seq_data)

This will create a variable in the R namespace called [Link] with the
content of the data frame from the Python namespace. Note that after this
operation, both objects will be independent (if you change one, it will not be
reflected on the other).
While you can perform plotting on Python, R has default built-in
plotting functionalities (which we will ignore here). It also has a
library called ggplot2 that implements the Grammar of Graphics
(a declarative language to specify statistical charts).

Python and the Surrounding Software Ecology

5. We will finalize our R integration example with a plot using ggplot2. This is particularly
interesting, not only because you may encounter R code using ggplot2, but also
because the drawing paradigm behind the Grammar of Graphics is really revolutionary
and may be an alternative that you may want to consider instead of the more standard
plotting libraries, such as matplotlib ggplot2 is so pervasive that rpy2 provides a
Python interface to it:
import [Link].ggplot2 as ggplot2

6. With regards to our concrete example based on the Human 1000 genomes
project, we will first plot a histogram with the distribution of center names,
where all sequencing lanes were generated. The first thing that we need to
do is to output the chart to a PNG file. We call the R png() function as follows:
[Link]('[Link]')

We will now use ggplot to create a chart, as shown in the following command:
from [Link] import SignatureTranslatedFunction
[Link] = SignatureTranslatedFunction([Link],
init_prm_translate={'axis_text_x': '[Link].x'})
bar = [Link](seq_data) + ggplot2.geom_bar() +
ggplot2.aes_string(x='CENTER_NAME') +
[Link](axis_text_x=ggplot2.element_text(angle=90,
hjust=1))
[Link]()
dev_off = robjects.r('[Link]')
dev_off()

The second line is a bit uninteresting, but is an important boilerplate code.

One of the R functions that we will call has a parameter with a dot in its
name. As Python function calls cannot have this, we map the [Link].x
R parameter name to the axis_text_x Python name in the function
theme. We monkey patch it (that is, we replace [Link] with a
patched version of itself).

8. We then draw the chart itself. Note the declarative nature of ggplot2 as we add
features to the chart. First, we specify the seq_data data frame, then we will
use a histogram bar plot called geom_bar, followed by annotating the X variable
(CENTER_NAME).
9. Finally, we rotate the text of the x axis by changing the theme.

We finalize by closing the R printing device. If you are in an IPython console,

you will want to visualize the PNG image as follows:
from [Link] import Image
Image(filename='[Link]')

Chapter 1

This chart produced is as follows:

Figure 1: The ggplot2-generated histogram of center names responsible for sequencing

lanes of human genomic data of the 1000 genomes project

10. As a final example, we will now do a scatter plot of read and base counts for all the
sequenced lanes for Yoruban (YRI) and Utah residents with ancestry from Northern
and Western Europe (CEU) of the Human 1000 genomes project (the summary of the
data of this project, which we will use thoroughly, can be seen in the Working with
modern sequence formats recipe in Chapter 2, Next-generation Sequencing). We are
also interested in the difference among the different types of sequencing (exome,
high, and low coverage). We first generate a data frame with just YRI and CEU lanes
and limit the maximum base and read counts:
robjects.r('yri_ceu <- [Link][[Link]$POPULATION %in%
c("YRI", "CEU") & [Link]$BASE_COUNT < 2E9 &
[Link]$READ_COUNT < 3E7, ]')
robjects.r('yri_ceu$POPULATION <[Link](yri_ceu$POPULATION)')
robjects.r('yri_ceu$ANALYSIS_GROUP <[Link](yri_ceu$ANALYSIS_GROUP)')

Python and the Surrounding Software Ecology

The last two lines convert the POPULATION and ANALYSIS_GROUPS to

factors, a concept similar to categorical data.

11. We are now ready to plot:

yri_ceu = robjects.r('yri_ceu')
scatter = [Link](yri_ceu) + ggplot2.geom_point() + \
ggplot2.aes_string(x='BASE_COUNT', y='READ_COUNT',
shape='factor(POPULATION)', col='factor(ANALYSIS_GROUP)')
[Link]()

Hopefully, this example (refer to the following screenshot) makes the power of
the Grammar of Graphics approach clear. We will start by declaring the data
frame and the type of chart in use (the scatter plot implemented by geom_
point). Note how easy it is to express that the shape of each point depends
on the POPULATION variable and the color on the ANALYSIS_GROUP:

Figure 2: The ggplot2-generated scatter plot with base and read counts for all sequencing lanes read; the color and
shape of each dot reflects categorical data (population and the type of data sequenced)
14

Chapter 1
12. Finally, when you think about Python and R, you probably think about pandas: the
R-inspired Python library designed with data analysis and modeling in mind. One of
the fundamental data structures in pandas is (surprise) the data frame. It's quite
easy to convert backward and forward between R and pandas, as follows:
import [Link] as pd_common
pd_yri_ceu = pd_common.load_data('yri_ceu')
del pd_yri_ceu['PAIRED_FASTQ']
no_paired = pd_common.convert_to_r_dataframe(pd_yri_ceu)
[Link]('[Link]', no_paired)
robjects.r("print(colnames([Link]))")

We start by importing the necessary conversion module. We then convert the

R data frame (note that we are converting the yri_ceu in the R namespace,
not the one on the Python namespace). We delete the column that indicates
the name of the paired FASTQ file on the pandas data frame and copy it back
to the R namespace. If you print the column names of the new R data frame,
you will see that PAIRED_FASTQ is missing.
As this book enters production, the [Link] module is being
deprecated (although it's still available).

In the interests of maintaining the momentum of the book, we will not delve into pandas
programming (there are plenty of books on this), but I recommend that you take a look
at it, not only in the context of interfacing with R, but also as a very good library for data
management of complex datasets.

There's more
It's worth repeating that the advancements on the Python software ecology are occurring at
a breakneck pace. This means that if a certain functionality is not available today, it might
be released sometime in the near future. So, if you are developing a new project, be sure to
check for the very latest developments on the Python front before using a functionality from
an R package.
There are plenty of R packages for bioinformatics in the Bioconductor project (http://
[Link]/). This should probably be your first port of call in the R world for
bioinformatics functionalities. However, note that there are many R bioinformatics packages
that are not on Bioconductor, so be sure to search the wider R packages on CRAN (refer to the
Comprehensive R Archive Network at [Link]
There are plenty of plotting libraries for Python. matplotlib is the most common library, but
you also have a plethora of other choices. In the context of R, it's worth noting that there is a
ggplot2-like implementation for Python based on the Grammar of Graphics description language
for charts and this is calledsurprise-surpriseggplot! ([Link]

Python and the Surrounding Software Ecology

See also

There are plenty of tutorials and books on R; check the R web page
([Link] for documentation.

For Bioconductor, check the documentation at [Link]

[Link]/home/R_BioCondManual.

If you work with NGS, you might also want to check High Throughput Sequence
Analysis with Bioconductor at [Link]
home/ht-seq.

The rpy library documentation is your Python gateway to R at [Link]

[Link]/.

The Grammar of Graphics is described in a book aptly named The Grammar of

Graphics, Leland Wilkinson, Springer.

In terms of data structures, similar functionality to R can be found on the pandas

library. You can find some tutorials at [Link] The book, Python for Data Analysis, Wes McKinney,
O'Reilly Media, is also an alternative to consider.

Performing R magic with IPython

You have probably heard of, and maybe used, the IPython Notebook. If not, then I strongly
recommend you try it as it's becoming the standard for reproducible science. Among many
other features, IPython provides a framework of extensible commands called magics, which
allows you to extend the language in many useful ways.
There are magic functions to deal with R. As you will see in our example, it makes R
interfacing much more declarative and easy. This recipe will not introduce any new R
functionalities, but hopefully, it will make clear how IPython can be an important productivity
boost for scientific computing in this regard.

Getting ready
You will need to follow the previous getting ready steps of the rpy2 recipe. You will also need
IPython. You can use the standard command line or any of the IPython consoles, but the
recommended environment is the notebook.
If you are using our notebooks, open the 00_Intro/R_magic.ipynb notebook. A notebook
is more complete than the recipe presented here with more chart examples. For brevity here,
we concentrate only on the fundamental constructs to interact with R using magics.

Chapter 1

How to do it
This recipe is an aggressive simplification of the previous one because it illustrates the
conciseness and elegance of R magics:
1.

The first thing you need to do is load R magics and ggplot2:

import [Link].ggplot2 as ggplot2
%load_ext [Link]

Note that the % starts an IPython-specific directive.

Just as a simple example, you can write on your IPython prompt:

%R print(c(1, 2))

See how easy it's to execute the R code without using the robjects
package. Actually, rpy2 is being used to look under the hood, but it has
been made transparent.

2. Let's read the [Link] file that was downloaded in the previous recipe:
%%R
[Link] <- [Link]('[Link]', header=TRUE,
stringsAsFactors=FALSE)
[Link]$READ_COUNT <- [Link]([Link]$READ_COUNT)
[Link]$BASE_COUNT <- [Link]([Link]$BASE_COUNT)

Note that you can specify that the whole IPython cell should be interpreted
as R code (note the double %%). As you can see, there is no need for a
function parameter name translation or (alternatively) explicitly call the
robjects.r to execute a code.

3. We can now transfer a variable to the Python namespace (where we could have done
Python-based operations):
seq_data = %R [Link]

4. Let's put this data frame back in the R namespace, as follows:

%R -i seq_data
%R print(colnames(seq_data))

The -i argument informs the magic system that the variable that follows on
the Python space is to be copied in the R namespace. The second line just
shows that the data frame is indeed available in R. We actually did not do
anything with the data frame in the Python namespace, but this serves as
an example on how to inject an object back into R.

Python and the Surrounding Software Ecology

5. The R magic system also allows you to reduce code as it changes the behavior of the
interaction of R with IPython. For example, in the ggplot2 code of the previous recipe,
you do not need to use the .png and [Link] R functions, as the magic system will
take care of this for you. When you tell R to print a chart, it will magically appear in
your notebook or graphical console. For example, the histogram plotting code from
the previous recipe is now simply:
%%R
bar <- ggplot(seq_data) + aes(factor(CENTER_NAME)) +
geom_bar() + theme([Link].x = element_text(angle = 90,
hjust = 1))
print(bar)

R magics makes interaction with R particularly easy. This is true if you think about how
cumbersome multiple language integration tends to be.
The notebook has a few more examples, especially with chart printing, but the core of R-magic
interaction is explained before.

See also

For basic instructions on IPython magics, see this notebook at [Link]

[Link]/github/ipython/ipython/blob/1.x/examples/notebooks/
Cell%[Link]

A list of default extensions is available at [Link]

dev/config/extensions/

A list of third-party magic extensions can be found at [Link]

ipython/ipython/wiki/Extensions-Index
Downloading the example code
You can download the example code files from your account at
[Link] for all the Packt Publishing books
you have purchased. If you purchased this book elsewhere, you
can visit [Link] and register
to have the files e-mailed directly to you.

Get more information Bioinformatics with Python Cookbook

Where to buy this book

You can buy Bioinformatics with Python Cookbook from the Packt Publishing website.
Alternatively, you can buy the book from Amazon, [Link], Computer Manuals and most internet
book retailers.
Click here for ordering and shipping details.

[Link]

Stay Connected:

Martin Jones - Advanced Python For Biologists (2016) PDF
100% (1)
Martin Jones - Advanced Python For Biologists (2016) PDF
232 pages
Python For Biologists
100% (1)
Python For Biologists
227 pages
Bioinformatics For Biologists PDF
96% (23)
Bioinformatics For Biologists PDF
394 pages
Supercomputing & Computational Biology: Presented by
100% (2)
Supercomputing & Computational Biology: Presented by
26 pages
(Methods in Molecular Biology 1525) Jonathan M. Keith (Eds.) - Bioinformatics - Volume I - Data, Sequence Analysis, and Evolution-Humana Press (2017)
100% (3)
(Methods in Molecular Biology 1525) Jonathan M. Keith (Eds.) - Bioinformatics - Volume I - Data, Sequence Analysis, and Evolution-Humana Press (2017)
489 pages
Basic Bioinformatics - S. Ignacimuthu
100% (4)
Basic Bioinformatics - S. Ignacimuthu
232 pages
Advances in Bioinformatics (Springer, 2021)
50% (2)
Advances in Bioinformatics (Springer, 2021)
446 pages
Bioinformatics Overview and Resources
100% (2)
Bioinformatics Overview and Resources
104 pages
Computational Biology
100% (1)
Computational Biology
291 pages
Practical Protein Bioinformatics PDF
No ratings yet
Practical Protein Bioinformatics PDF
111 pages
Bioinformatics A Practical Guide To Next Generation Sequencing Data
100% (2)
Bioinformatics A Practical Guide To Next Generation Sequencing Data
349 pages
Algorithms For Next-Generation Sequencing Data (3319598244)
100% (3)
Algorithms For Next-Generation Sequencing Data (3319598244)
356 pages
Computational Systems Biology in Medicine and Biotechnology
No ratings yet
Computational Systems Biology in Medicine and Biotechnology
493 pages
Mastering Bioinformatics and Computational Biology - Unraveling The Complexities of Life Through Data-Driven Discovery
100% (1)
Mastering Bioinformatics and Computational Biology - Unraveling The Complexities of Life Through Data-Driven Discovery
216 pages
Bioinformatics Overview and Tools
100% (5)
Bioinformatics Overview and Tools
33 pages
Bioinformatics For Biologists
No ratings yet
Bioinformatics For Biologists
394 pages
Computational Methods With Applications in Bioinformatics Analysis
100% (1)
Computational Methods With Applications in Bioinformatics Analysis
233 pages
Bioinformatics for Researchers
100% (2)
Bioinformatics for Researchers
21 pages
Machine Learning in Biological Sciences: Shyamasree Ghosh Rathi Dasgupta
100% (1)
Machine Learning in Biological Sciences: Shyamasree Ghosh Rathi Dasgupta
337 pages
Recombinant Gene Expression
100% (1)
Recombinant Gene Expression
643 pages
Ernesto Picardi - RNA Bioinformatics-Humana (2021)
100% (1)
Ernesto Picardi - RNA Bioinformatics-Humana (2021)
576 pages
Application of Clinical Bioinformatics
No ratings yet
Application of Clinical Bioinformatics
395 pages
Free 1000 Ebook Molecular and Bioinformatics
No ratings yet
Free 1000 Ebook Molecular and Bioinformatics
175 pages
Bioinformatics and Functional Genomics, Second Edition. by Jonathan Pevsner
No ratings yet
Bioinformatics and Functional Genomics, Second Edition. by Jonathan Pevsner
9 pages
Introduction to Bioinformatics Basics
100% (2)
Introduction to Bioinformatics Basics
52 pages
Introduction To Genomics Second Edition PDF
100% (12)
Introduction To Genomics Second Edition PDF
420 pages
Lloyd Wai Yee Low, Martti Tapani Tammi - Practical Bioinformatics For Beginners - From Raw Sequence Analysis To Machine Learning Applications-World Scientific (2023)
No ratings yet
Lloyd Wai Yee Low, Martti Tapani Tammi - Practical Bioinformatics For Beginners - From Raw Sequence Analysis To Machine Learning Applications-World Scientific (2023)
268 pages
(Methods in Molecular Biology 2194) Joseph Markowitz - Translational Bioinformatics For Therapeutic Development-Springer US - Humana (2021)
100% (2)
(Methods in Molecular Biology 2194) Joseph Markowitz - Translational Bioinformatics For Therapeutic Development-Springer US - Humana (2021)
323 pages
Methods in Statistical Genomics
No ratings yet
Methods in Statistical Genomics
161 pages
Metagenomic Data Analysis 1071630717 9781071630716 Compress
No ratings yet
Metagenomic Data Analysis 1071630717 9781071630716 Compress
443 pages
Plant Bioinformatics: Methods and Protocols
100% (1)
Plant Bioinformatics: Methods and Protocols
541 pages
Art of Scripting in Bioinformatics
75% (4)
Art of Scripting in Bioinformatics
118 pages
Basics of Bioinformatics Overview
100% (8)
Basics of Bioinformatics Overview
99 pages
Next Generation Sequencing
100% (10)
Next Generation Sequencing
301 pages
Functional Genomics
100% (3)
Functional Genomics
404 pages
Bioinformatics Tools for Omicron Study
No ratings yet
Bioinformatics Tools for Omicron Study
63 pages
Molecular Biology and Biotechnology
96% (28)
Molecular Biology and Biotechnology
145 pages
Bioinformatics Lab Guide
No ratings yet
Bioinformatics Lab Guide
29 pages
Population Genomics PDF
No ratings yet
Population Genomics PDF
336 pages
Bioinformatics for Researchers
100% (1)
Bioinformatics for Researchers
1 page
(Methods in Molecular Biology 1275) Chhandak Basu (Eds.) - PCR Primer
90% (10)
(Methods in Molecular Biology 1275) Chhandak Basu (Eds.) - PCR Primer
221 pages
Methods in Molecular Biology Volume Vol. 857
100% (2)
Methods in Molecular Biology Volume Vol. 857
432 pages
Analysis of Microarray Gene Expression Data - M. Lee (Kluwer
No ratings yet
Analysis of Microarray Gene Expression Data - M. Lee (Kluwer
398 pages
Next Generation Sequencing Technologies in Medical Genetics
100% (5)
Next Generation Sequencing Technologies in Medical Genetics
101 pages
RNA-Seq and Transcriptome Analysis: Jessica Holmes
100% (1)
RNA-Seq and Transcriptome Analysis: Jessica Holmes
98 pages
2020 Book ThePangenome PDF
No ratings yet
2020 Book ThePangenome PDF
311 pages
Bioinformatics Tools Overview
100% (1)
Bioinformatics Tools Overview
17 pages
(Computational Biology Series) Dmitrij Frishman, Manja Marz - Virus Bioinformatic-CRC Press (2021)
No ratings yet
(Computational Biology Series) Dmitrij Frishman, Manja Marz - Virus Bioinformatic-CRC Press (2021)
297 pages
Wolfgang R. Streit, Rolf Daniel Eds. Metagenomics Methods and Protocols
No ratings yet
Wolfgang R. Streit, Rolf Daniel Eds. Metagenomics Methods and Protocols
311 pages
Managing Data Python Newbooks - 1
No ratings yet
Managing Data Python Newbooks - 1
2 pages
ACMbiopy
No ratings yet
ACMbiopy
9 pages
Biopython Guide for Bioinformaticians
No ratings yet
Biopython Guide for Bioinformaticians
79 pages
Biopython Tutorial1
No ratings yet
Biopython Tutorial1
5 pages
An Introduction To Python For Scientific Computing: © 2019 M. Scott Shell Last Modified 9/24/2019
No ratings yet
An Introduction To Python For Scientific Computing: © 2019 M. Scott Shell Last Modified 9/24/2019
62 pages
Bioconda: Bioinformatics Software Hub
No ratings yet
Bioconda: Bioinformatics Software Hub
4 pages
Scipy Lecture Notes PDF
100% (2)
Scipy Lecture Notes PDF
690 pages
RESTful Web API Design With Node - Js - Second Edition - Sample Chapter
0% (1)
RESTful Web API Design With Node - Js - Second Edition - Sample Chapter
17 pages
Mastering Apache Mesos for Big Data
100% (1)
Mastering Apache Mesos for Big Data
36 pages
JIRA 7 Administration Cookbook Second Edition - Sample Chapter
No ratings yet
JIRA 7 Administration Cookbook Second Edition - Sample Chapter
35 pages
Practical Digital Forensics - Sample Chapter
100% (3)
Practical Digital Forensics - Sample Chapter
31 pages
Python Geospatial Development - Third Edition - Sample Chapter
No ratings yet
Python Geospatial Development - Third Edition - Sample Chapter
32 pages
Android UI Design - Sample Chapter
No ratings yet
Android UI Design - Sample Chapter
47 pages
Internet of Things With Python - Sample Chapter
100% (1)
Internet of Things With Python - Sample Chapter
34 pages
Moodle 3.x Teaching Techniques - Third Edition - Sample Chapter
No ratings yet
Moodle 3.x Teaching Techniques - Third Edition - Sample Chapter
23 pages
Unity 5.x Game Development Blueprints - Sample Chapter
No ratings yet
Unity 5.x Game Development Blueprints - Sample Chapter
57 pages
Python Modular Programming Guide
No ratings yet
Python Modular Programming Guide
28 pages
Cardboard VR Projects For Android - Sample Chapter
No ratings yet
Cardboard VR Projects For Android - Sample Chapter
57 pages
Practical Mobile Forensics Guide
No ratings yet
Practical Mobile Forensics Guide
38 pages
QGIS 2 Cookbook Overview
100% (1)
QGIS 2 Cookbook Overview
44 pages
Flux Architecture - Sample Chapter
No ratings yet
Flux Architecture - Sample Chapter
25 pages
Expert Python Programming - Second Edition - Sample Chapter
63% (8)
Expert Python Programming - Second Edition - Sample Chapter
40 pages
Drupal 8 Views for Developers
0% (1)
Drupal 8 Views for Developers
23 pages
Learning Probabilistic Graphical Models in R - Sample Chapter
No ratings yet
Learning Probabilistic Graphical Models in R - Sample Chapter
37 pages
Mastering Hibernate - Sample Chapter
No ratings yet
Mastering Hibernate - Sample Chapter
27 pages
Odoo Development Cookbook - Sample Chapter
100% (1)
Odoo Development Cookbook - Sample Chapter
35 pages
Troubleshooting NetScaler - Sample Chapter
No ratings yet
Troubleshooting NetScaler - Sample Chapter
25 pages
Apache Hive Cookbook - Sample Chapter
100% (1)
Apache Hive Cookbook - Sample Chapter
27 pages
HTML5 Canvas Recipes Guide
No ratings yet
HTML5 Canvas Recipes Guide
34 pages
RStudio Cookbook: Data Analysis Recipes
100% (2)
RStudio Cookbook: Data Analysis Recipes
38 pages
ArcGIS For JavaScript Developers by Example - Sample Chapter
No ratings yet
ArcGIS For JavaScript Developers by Example - Sample Chapter
37 pages
Sitecore Cookbook For Developers - Sample Chapter
No ratings yet
Sitecore Cookbook For Developers - Sample Chapter
34 pages
Practical Linux Security Cookbook - Sample Chapter
100% (1)
Practical Linux Security Cookbook - Sample Chapter
25 pages
3D Printing Designs: Design An SD Card Holder - Sample Chapter
100% (1)
3D Printing Designs: Design An SD Card Holder - Sample Chapter
16 pages
Sass & Compass: Web Dev Recipes
No ratings yet
Sass & Compass: Web Dev Recipes
41 pages
Machine Learning Applications in Java
100% (1)
Machine Learning Applications in Java
26 pages
Alfresco For Administrators - Sample Chapter
No ratings yet
Alfresco For Administrators - Sample Chapter
17 pages
Passenger - Let Her Go - Guitar Intensive
No ratings yet
Passenger - Let Her Go - Guitar Intensive
4 pages
George Bataille - On His Shoulders (And Other Parts of The Body of Knowledge)
No ratings yet
George Bataille - On His Shoulders (And Other Parts of The Body of Knowledge)
13 pages
Skyscraper & Intelligent Buildings
No ratings yet
Skyscraper & Intelligent Buildings
21 pages
Asian Paints Financial Analysis Report
No ratings yet
Asian Paints Financial Analysis Report
15 pages
Theistic Evolution ACPQ Prepublished
No ratings yet
Theistic Evolution ACPQ Prepublished
39 pages
ZF Steering Gear India Company Overview
No ratings yet
ZF Steering Gear India Company Overview
3 pages
Essential English Phrasal Verbs Guide
No ratings yet
Essential English Phrasal Verbs Guide
102 pages
DHF - Dinkes
No ratings yet
DHF - Dinkes
31 pages
Aces - Design Guide FRC Ss 674 2021 - r0
No ratings yet
Aces - Design Guide FRC Ss 674 2021 - r0
104 pages
VF80 200 300 en 845182
No ratings yet
VF80 200 300 en 845182
139 pages
F09LAU1
No ratings yet
F09LAU1
56 pages
Crochet Pattern Miniature Schnauzer: Littleowlshut
100% (6)
Crochet Pattern Miniature Schnauzer: Littleowlshut
24 pages
ICSE Class 10 Project Guidelines 2023-24
No ratings yet
ICSE Class 10 Project Guidelines 2023-24
47 pages
Ugc Net Computer Science
No ratings yet
Ugc Net Computer Science
39 pages
P9. CMA INTER OM&SM (MCQS)
No ratings yet
P9. CMA INTER OM&SM (MCQS)
30 pages
Semi-Centrifugal Casting Guide
No ratings yet
Semi-Centrifugal Casting Guide
10 pages
Mit Doe1
No ratings yet
Mit Doe1
57 pages
CATALOG 2016 Last Booklet
No ratings yet
CATALOG 2016 Last Booklet
114 pages
Australia Zoo: History, Mission & Animals
No ratings yet
Australia Zoo: History, Mission & Animals
10 pages
Muhey PHD F2012
No ratings yet
Muhey PHD F2012
287 pages
Daiseikai Owners Manual
No ratings yet
Daiseikai Owners Manual
31 pages
Hi Qa Paper 2
No ratings yet
Hi Qa Paper 2
10 pages
Gnetum L. in India: Ethnobotany & Uses
No ratings yet
Gnetum L. in India: Ethnobotany & Uses
10 pages
Combined Cycle Power Plant: Ismile Hossain 1403EEE00052 Department of Electrical & Electronic Engineering
No ratings yet
Combined Cycle Power Plant: Ismile Hossain 1403EEE00052 Department of Electrical & Electronic Engineering
13 pages
XC740KXC760K Minicargador
100% (1)
XC740KXC760K Minicargador
4 pages
SAP Project Backlog Management Guide
No ratings yet
SAP Project Backlog Management Guide
30 pages
Đề Cương Nguyễn Cong Chứ - Grade 8 Second Midterm
No ratings yet
Đề Cương Nguyễn Cong Chứ - Grade 8 Second Midterm
6 pages
PM Shri Kendriya Vidyalaya Silchar Practical Detail-1
No ratings yet
PM Shri Kendriya Vidyalaya Silchar Practical Detail-1
21 pages
Curriculum Vitae Douglas P. Fry, PHD: Dfry@Uab - Edu
No ratings yet
Curriculum Vitae Douglas P. Fry, PHD: Dfry@Uab - Edu
23 pages
Dividing Head Indexing
100% (3)
Dividing Head Indexing
4 pages

Bioinformatics With Python Cookbook - Sample Chapter

Uploaded by

Bioinformatics With Python Cookbook - Sample Chapter

Uploaded by

Fr

Bioinformatics with Python Cookbook

What this book will do

fundamental bioinformatics libraries and be

next-generation sequencing datasets using

Inside the Cookbook...

Carefully organized instructions for solving

Perform population genetics analysis on

the problem efficiently

Clear explanations of what you did

genomic features with simuPOP

community experience distilled

Simulate complex demographies and

Bioinformatics with Python Cookbook

Quick answers to common problems

Prices do not include

Visit [Link] for books, eBooks,

In this package, you will find:

The author biography

About the Author

What this book covers

Installing the required software with Anaconda

Installing the required software with Docker

Interfacing with R via rpy2

Performing R magic with IPython

Python and the Surrounding Software Ecology

Installing the required software with

The software developed for this book is available at [Link]

Python and the Surrounding Software Ecology

If you want Python 3 (remember the reduced phylogenetics functionality, but

3. Let's activate the environment, as follows:

4. Also, install the core packages, as follows:

5. We still need pygraphivz, which is not available on conda. Therefore, we need to

This is available on conda:

conda install -c [Link] simuPOP

Python and the Surrounding Software Ecology

This is available via pypi:

This will install Bioconductor (for detailed instructions, refer to http://

The Anaconda ([Link] Python

Installing the required software with Docker

Python and the Surrounding Software Ecology

2. Now, you are ready to run the container, as follows:

3. Replace YOUR_DIRECTORY with a directory on your operating system. This

On Windows, this binary will probably be in C:\Program Files\

5. If you now start your browser pointing at [Link] you should be

Interfacing with R via rpy2

Python and the Surrounding Software Ecology

The first thing that we do after importing is accessing the [Link] R

Python and the Surrounding Software Ecology

The second line is a bit uninteresting, but is an important boilerplate code.

We finalize by closing the R printing device. If you are in an IPython console,

This chart produced is as follows:

Figure 1: The ggplot2-generated histogram of center names responsible for sequencing

Python and the Surrounding Software Ecology

The last two lines convert the POPULATION and ANALYSIS_GROUPS to

11. We are now ready to plot:

We start by importing the necessary conversion module. We then convert the

Python and the Surrounding Software Ecology

For Bioconductor, check the documentation at [Link]

The rpy library documentation is your Python gateway to R at [Link]

The Grammar of Graphics is described in a book aptly named The Grammar of

In terms of data structures, similar functionality to R can be found on the pandas

Performing R magic with IPython

The first thing you need to do is load R magics and ggplot2:

Note that the % starts an IPython-specific directive.

Just as a simple example, you can write on your IPython prompt:

4. Let's put this data frame back in the R namespace, as follows:

Python and the Surrounding Software Ecology

For basic instructions on IPython magics, see this notebook at [Link]

A list of default extensions is available at [Link]

A list of third-party magic extensions can be found at [Link]

Get more information Bioinformatics with Python Cookbook

Where to buy this book

You might also like